Kafka Producer and Consumer example in Python
2019-05-12
A simple guide to understand JavaScript closures
2019-05-23
Show all

Tutorial on Crosstab Operations (pivot_table and crosstab methods) in Pandas

8 mins read

Introduction

Pandas offers several options for grouping and summarizing data but this variety of options can be a blessing and a curse. These approaches are all powerful data analysis tools but it can be confusing to know whether to use a groupby , pivot_table or crosstab to build a summary table. My goal is to have this article be a resource that you can bookmark and refer to when you need to remind yourself what you can do with the crosstab function.

Overview

The pandas crosstab function builds a cross-tabulation table that can show the frequency with which certain groups of data appear. For a quick example, this table shows the number of two or four-door cars manufactured by various car makers:

num_doorsfourtwoTotal
make
honda5813
mazda7916
mitsubishi4913
nissan9918
subaru9312
toyota181432
volkswagen8412
volvo11011
Total7156127

In the table above, you can see that the data set contains 32 Toyota cars of which 18 are four-door and 14 are two-door. This is a relatively simple table to interpret and illustrates why this approach can be a powerful way to summarize large data sets. Pandas makes this process easy and allows us to customize the tables in several different manners. In the rest of the article, I will walk through how to create and customize these tables.

Start the Process

Let’s get started by importing all the modules we need. If you want to follow along on your own there is a notebook on GitHub:

import pandas as pd
import seaborn as sns

Now we’ll read in the automobile data set from the UCI Machine Learning Repository and make some label changes for clarity:

# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df_raw = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                     header=None, names=headers, na_values="?" )

# Define a list of models that we want to review
models = ["toyota","nissan","mazda", "honda", "mitsubishi", "subaru", "volkswagen", "volvo"]

# Create a copy of the data with only the top 8 manufacturers
df = df_raw[df_raw.make.isin(models)].copy()

For this example, I wanted to shorten the table so I only included the 8 models listed above. This is done solely to make the article more compact and hopefully more understandable. For the first example, let’s use pd.crosstab to look at how many different body styles these car makers made in 1985 (the year this dataset contains).

pd.crosstab(df.make, df.body_style)
body_styleconvertiblehardtophatchbacksedanwagon
make
honda00751
mazda001070
mitsubishi00940
nissan01593
subaru00354
toyota1314104
volkswagen10191
volvo00083

The crosstab function can operate on numpy arrays, series, or columns in a dataframe. For this example, I pass in df.make for the crosstab index and df.body_style for the crosstab’s columns. Pandas does that work behind the scenes to count how many occurrences there are of each combination. For example, in this data set, Volvo makes 8 sedans and 3 wagons.

Before we go much further with this example, more experienced readers may wonder why we use the crosstab instead of another pandas option. I will address that briefly by showing two alternative approaches.

First, we could use a groupby followed by an unstack to get the same results:

df.groupby(['make', 'body_style'])['body_style'].count().unstack().fillna(0)

The output for this example looks very similar to the crosstab but it took a couple of extra steps to get it formatted correctly.

It is also possible to do something similar using a pivot_table :

df.pivot_table(index='make', columns='body_style', aggfunc={'body_style':len}, fill_value=0)

Make sure to review my previous article on pivot_tables if you would like to understand how this works. The question still remains, why even use a crosstab function? The short answer is that it provides a couple of handy functions to more easily format and summarize the data. The longer answer is that sometimes it can be tough to remember all the steps to make this happen on your own. The simple crosstab API is the quickest route to the solution and provides some useful shortcuts for certain types of analysis. In my experience, it is important to know about the options and use the one that flows most naturally from the analysis. I have had experiences where I struggled trying to make a pivot_table solution and then quickly got what I wanted by using a crosstab. The great thing about pandas is that once the data is in a dataframe all these manipulations are 1 line of code so you are free to experiment.

Diving Deeper into the Crosstab

Now that we have walked through the basic crosstab process, I will explain some of the other useful changes you can make to the output by altering the parameters. One common need in a crosstab is to include subtotals. We can add them using the margins keyword:

pd.crosstab(df.make, df.num_doors, margins=True, margins_name="Total")
num_doorsfourtwoTotal
make
honda5813
mazda7916
mitsubishi4913
nissan9918
subaru9312
toyota181432
volkswagen8412
volvo11011
Total7156127

The margins keyword instructed pandas to add a total for each row as well as a total at the bottom. I also passed a value to margins_name in the function call because I wanted to label the results “Total” instead of the default “All”.

All of these examples have simply counted the individual occurrences of the data combinations. crosstab allows us to do even more summarization by including values to aggregate. To illustrate this, we can calculate the average curb weight of cars by body style and manufacturer:

pd.crosstab(df.make, df.body_style, values=df.curb_weight, aggfunc='mean').round(0)
body_styleconvertiblehardtophatchbacksedanwagon
make
hondaNaNNaN1970.02289.02024.0
mazdaNaNNaN2254.02361.0NaN
mitsubishiNaNNaN2377.02394.0NaN
nissanNaN2008.02740.02238.02452.0
subaruNaNNaN2137.02314.02454.0
toyota2975.02585.02370.02338.02708.0
volkswagen2254.0NaN2221.02342.02563.0
volvoNaNNaNNaN3023.03078.0

By using aggfunc='mean' and values=df.curb_weight we are telling pandas to apply the mean function to the curb weight of all the combinations of the data. Under the hood, pandas is grouping all the values together by make and body_style, then calculating the average. In those areas where there is no car with those values, it displays NaN . In this example, I am also rounding the results.

We have seen how to count values and determine averages of values. However, there is another common case of data summarization where we want to understand the percentage of time each combination occurs. This can be accomplished using the normalize parameter:

pd.crosstab(df.make, df.body_style, normalize=True)
body_styleconvertiblehardtophatchbacksedanwagon
make
honda0.0000000.0000000.0546880.0390620.007812
mazda0.0000000.0000000.0781250.0546880.000000
mitsubishi0.0000000.0000000.0703120.0312500.000000
nissan0.0000000.0078120.0390620.0703120.023438
subaru0.0000000.0000000.0234380.0390620.031250
toyota0.0078120.0234380.1093750.0781250.031250
volkswagen0.0078120.0000000.0078120.0703120.007812
volvo0.0000000.0000000.0000000.0625000.023438

This table shows us that 2.3% of the total population are Toyota hardtops and 6.25% are Volvo sedans.

The normalize parameter is even smarter because it allows us to perform this summary on just the columns or rows. For example, if we want to see how the body styles are distributed across makes:

pd.crosstab(df.make, df.body_style, normalize='columns')
body_styleconvertiblehardtophatchbacksedanwagon
make
honda0.00.000.1428570.0877190.0625
mazda0.00.000.2040820.1228070.0000
mitsubishi0.00.000.1836730.0701750.0000
nissan0.00.250.1020410.1578950.1875
subaru0.00.000.0612240.0877190.2500
toyota0.50.750.2857140.1754390.2500
volkswagen0.50.000.0204080.1578950.0625
volvo0.00.000.0000000.1403510.1875

Looking at just the convertible column, you can see that 50% of the convertibles are made by Toyota and the other 50% by Volkswagen.

We can do the same thing row-wise:

pd.crosstab(df.make, df.body_style, normalize='index')
body_styleconvertiblehardtophatchbacksedanwagon
make
honda0.0000000.0000000.5384620.3846150.076923
mazda0.0000000.0000000.5882350.4117650.000000
mitsubishi0.0000000.0000000.6923080.3076920.000000
nissan0.0000000.0555560.2777780.5000000.166667
subaru0.0000000.0000000.2500000.4166670.333333
toyota0.0312500.0937500.4375000.3125000.125000
volkswagen0.0833330.0000000.0833330.7500000.083333
volvo0.0000000.0000000.0000000.7272730.272727

This view of the data shows that of the Mitsubishi cars in this dataset, 69.23% are hatchbacks and the remainder (30.77%) are sedans.

I hope you will agree that these shortcuts can be helpful in many kinds of analysis.

Grouping

One of the most useful features of the crosstab is that you can pass in multiple dataframe columns and pandas does all the grouping for you. For instance, if we want to see how the data is distributed by front-wheel drive (fwd) and rear-wheel drive (rwd), we can include the drive_wheels column by including it in the list of valid columns in the second argument to the crosstab .

pd.crosstab(df.make, [df.body_style, df.drive_wheels])
body_styleconvertiblehardtophatchbacksedanwagon
drive_wheelsfwdrwdfwdrwd4wdfwdrwd4wdfwdrwd4wdfwdrwd
make
honda0000070050010
mazda0000064052000
mitsubishi0000090040000
nissan0010023090030
subaru0000120230220
toyota0103086073211
volkswagen1000010090010
volvo0000000008003

We can also do the same thing with the index:

pd.crosstab([df.make, df.num_doors], [df.body_style, df.drive_wheels],
            rownames=['Auto Manufacturer', "Doors"],
            colnames=['Body Style', "Drive Type"],
            dropna=False)
Body Styleconvertiblehardtophatchbacksedanwagon
Drive Type4wdfwdrwd4wdfwdrwd4wdfwdrwd4wdfwdrwd4wdfwdrwd
Auto ManufacturerDoors
hondafour000000000040010
two000000070010000
mazdafour000000010042000
two000000054000000
mitsubishifour000000000040000
two000000090000000
nissanfour000000010050030
two000010013040000
subarufour000000000230220
two000000120000000
toyotafour000000060071211
two001003026002000
volkswagenfour000000000070010
two010000010020000
volvofour000000000008003
two000000000000000

I have introduced a couple of extra parameters to control the way the output is displayed.

First, I included the specific rownames and colnames that I want to include in the output. This is purely for display purposes but can be useful if the column names in the dataframe are not very specific. Next, I used dropna=False at the end of the function call. The reason I included this is that I wanted to make sure to include all the rows and columns even if they had all 0’s. If I did not include it, then the final Volvo, two-door row would have been omitted from the table. I want to make one final note on this table. It does include a lot of information and maybe too difficult to interpret. That’s where the art of data science (or any analysis) comes in and you need to determine the best way to present the data. This leads to the final part of this article.

Visualizing

For the final example, I will bring it all together by showing how the output of the crosstab can be passed to a seaborn heatmap in order to visually summarize the data.

In our last table, we ended up with a table of 240 values. This is too dense to quickly analyze but if we use a heatmap, we can easily interpret the data. Fortunately, seaborn can take the output from the crosstab and visualize it:

sns.heatmap(pd.crosstab([df.make, df.num_doors], [df.body_style, df.drive_wheels]),
            cmap="YlGnBu", annot=True, cbar=False)
crosstab heatmap

One of the really useful aspects of this approach is that seaborn collapses the grouped column and row names so that they can be more easily read.

Cheat Sheet

In order to bring this all together, here is a cheat sheet showing how to use all the various components of the crosstab function. You can download the PDF version here.

Crosstab cheatsheet

Conclusion

The pandas crosstab function is a useful tool for summarizing data. The functionality overlaps with some of the other pandas tools but it occupies a useful place in your data analysis toolbox. After reading this article, you should be able to incorporate it into your own data analysis.

Amir Masoud Sefidian
Amir Masoud Sefidian
Machine Learning Engineer

Comments are closed.