In this post, we will use one of Seaborn’s conveniently available datasets about the Titanic, which I’m sure many readers have seen before. Seaborn has quite a few datasets ready to be loaded into Python to practice with; they are great for practicing data processing, exploration, and basic machine learning techniques.
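import matplotlib.pyplot as plt
import seaborn as sns

# Load the Titanic dataset bundled with Seaborn (downloaded on first use)
titanic = sns.load_dataset('titanic')
titanic.head()
titanic.info()
titanic['class'].unique()  # categories of one of the categorical columns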
This data set is great because it has a decent number of entries (almost 900) while also having an interesting story to dig into. There are lots of questions to ask and relationships between variables to explore, making it a great example data set. Most critical for this article, there is also a good mix of numerical and categorical variables to explore.
We have two different kinds of categorical distribution plots: box plots and violin plots. These kinds of plots allow us to choose a numerical variable, like age, and plot its distribution for each category in a selected categorical variable.
Many of us have probably made quite a few box plots over the years. They are an easy and effective way to visualize groups of numerical data through their quartiles. Seaborn makes creating attractive box plots simple and allows us to easily compare an extra dimension with the hue argument that appears in many Seaborn functions.
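To make "through their quartiles" concrete, here is a minimal sketch of the numbers a box plot summarizes for each group. The `toy` frame and the `box_stats` helper are made up for illustration; they only mimic the `class` and `age` columns used in this post.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Titanic columns used in this post.
toy = pd.DataFrame({
    "class": ["First"] * 5 + ["Third"] * 5,
    "age": [20, 30, 40, 50, 60, 10, 15, 20, 25, 30],
})


def box_stats(df, by="class", value="age"):
    """Return the quartiles and interquartile range for each group."""
    q = df.groupby(by)[value].quantile([0.25, 0.5, 0.75]).unstack()
    q.columns = ["q1", "median", "q3"]
    q["iqr"] = q["q3"] - q["q1"]  # height of the box
    return q


stats = box_stats(toy)
print(stats)
```

The box spans `q1` to `q3`, the line inside it marks the median, and by default the whiskers extend to the most extreme points within 1.5 times the IQR of the box.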
Basic Boxplot
Let’s take a look at the distribution of age by passenger class.
plt.figure(figsize=(8,5))
sns.boxplot(x='class',y='age',data=titanic, palette='rainbow')
plt.title("Age by Passenger Class, Titanic")
We can see that age tends to decrease as you go down in passenger class. That makes sense, since young people tend to travel on a budget. Notice how little code it took to create an aesthetically pleasing plot? Seaborn’s basic plots are very polished.
Adding Hue
Like many other plots available in Seaborn, box plots can take an added hue argument to add another variable for comparison.
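A box plot with a hue could be drawn roughly as follows. This is a sketch rather than the exact call behind the original figure: the choice of `survived` as the hue is inferred from the discussion, the `boxplot_with_hue` helper is hypothetical, and the small `demo` frame is invented so the snippet runs without downloading anything.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def boxplot_with_hue(df):
    """Box plot of age by class, with each class split by survival."""
    fig, ax = plt.subplots(figsize=(8, 5))
    sns.boxplot(x="class", y="age", hue="survived", data=df,
                palette="rainbow", ax=ax)
    ax.set_title("Age by Passenger Class, Separated by Survival")
    return ax


# Invented rows that only mimic the Titanic column names.
rows = []
ages = {0: [25, 34, 40, 51, 63], 1: [4, 19, 24, 30, 45]}
for cls in ["First", "Third"]:
    for survived, values in ages.items():
        rows += [{"class": cls, "survived": survived, "age": a} for a in values]
demo = pd.DataFrame(rows)

ax = boxplot_with_hue(demo)
```

With the real data loaded as in this post, the call would be `boxplot_with_hue(titanic)`.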
Adding the hue shows us that regardless of class the age of passengers that survived was generally lower than those who passed away.
Having the hue for additional comparison allows this box plot to be quite information dense. The more complex the plot gets, the longer it will take viewers to comprehend it, but it is nice to have the option when interesting insights are more easily shown with an added dimension.
Violin plots are not very frequently used but I have found them to be useful on occasion, and they are an interesting change from more popular options. They plot a vertical kernel density plot for each category and a small box plot to summarize important statistics.
plt.figure(figsize=(10,6))
sns.violinplot(x='class',y="age",data=titanic, hue='sex', palette='rainbow')
plt.title("Violin Plot of Age by Class, Separated by Sex")
While I like this plot, I think it is easier to compare the genders with slightly different formatting:
plt.figure(figsize=(10,6))
sns.violinplot(x='class',y="age",data=titanic, hue='sex', split=True, palette='rainbow')
plt.title("Violin Plot of Age by Class, Separated by Sex")
When we split the violin on the hue it is a lot easier to see the differences in each KDE. However, the IQR stats aren’t split by sex anymore; instead, they apply to the entire class. So there are trade-offs to styling your plot in certain ways.
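One possible way to soften that trade-off is the `inner` argument of `violinplot`. As far as I am aware, `inner="quartile"` replaces the mini box plot with quartile lines drawn inside each half of a split violin. The sketch below assumes that behavior and uses an invented `demo` frame and a hypothetical `split_violin_quartiles` helper.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def split_violin_quartiles(df):
    """Split violin with quartile lines instead of the inner mini box plot."""
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.violinplot(x="class", y="age", hue="sex", data=df, split=True,
                   inner="quartile", palette="rainbow", ax=ax)
    ax.set_title("Violin Plot of Age by Class, Separated by Sex")
    return ax


# Invented rows that only mimic the Titanic column names.
rows = []
ages = {"male": [22, 30, 35, 41, 48, 60], "female": [18, 24, 29, 33, 45, 52]}
for cls in ["First", "Third"]:
    for sex, values in ages.items():
        rows += [{"class": cls, "sex": sex, "age": a} for a in values]
demo = pd.DataFrame(rows)

ax = split_violin_quartiles(demo)
```

With the real data, `split_violin_quartiles(titanic)` would give a per-sex view of the quartiles while keeping the split layout.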
The boxen plot, otherwise known as a letter-value plot, is a box plot meant for large data sets (n > 10,000). It is similar to a traditional box plot but essentially just plots more quantiles. With more quantiles, we can see more about the shape of the distribution beyond the central 50% of the data. This extra detail shows up especially in the tails, where box plots tend to give limited information.
plt.figure(figsize=(8,5))
sns.boxenplot(x='class', y='age', data=titanic, palette='rainbow')
plt.title("Distribution of Age by Passenger Class")
Just in case there still isn’t enough going on here for you, we can also add a hue to a boxen plot!
plt.figure(figsize=(8,5))
sns.boxenplot(x='class', y='age', data=titanic, palette='rainbow', hue='survived')
plt.title("Distribution of Age by Passenger Class, Separated by Survival")
We can see that the boxen plot gives us much more information beyond the central 50% of the data. However, keep in mind that boxen plots are meant for larger data sets, with somewhere between 10,000 and 100,000 entries. This data set of under 1,000 entries is definitely not ideal. The paper that introduced them, "Letter-value plots: Boxplots for large data" by Hofmann, Wickham, and Kafadar, explains them very well.
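To illustrate what "plotting more quantiles" means, here is a rough sketch of the sequence of quantiles a boxen plot is built from: the median, then the fourths, eighths, and so on. The `letter_values` helper is hypothetical and approximates each level with `np.quantile`. The paper defines letter values through depths, and Seaborn picks the number of levels automatically, so treat this as an approximation only.

```python
import numpy as np


def letter_values(values, depth=3):
    """Approximate letter values: the median, then fourths, eighths, ..."""
    values = np.asarray(values, dtype=float)
    out = [("median", (float(np.quantile(values, 0.5)),))]
    for k in range(2, depth + 1):
        tail = 0.5 ** k  # 1/4, 1/8, 1/16, ...
        lower, upper = np.quantile(values, [tail, 1 - tail])
        out.append((f"1/{2 ** k}", (float(lower), float(upper))))
    return out


# Evenly spaced toy values, so each level is easy to check by eye.
for label, bounds in letter_values(np.arange(101), depth=4):
    print(label, bounds)
```

Each extra level halves the tail area again, which is why a boxen plot only becomes informative when there are enough observations to estimate those outer quantiles reliably.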
Bar plots are classic. You get an estimate of the central tendency for a numerical variable for each class on the x-axis. Say we were interested in knowing the average fare price of passengers that embarked from different towns:
plt.figure(figsize=(8,5))
sns.barplot(x='embark_town',y='fare',data=titanic, palette='rainbow')
plt.title("Fare of Passenger by Embarked Town")
Seaborn will take the mean as default, but you can use other measures of central tendency as well. There is a noticeable difference between Cherbourg and the other two towns, so let’s separate the bars by class to see who was boarding in each town.
plt.figure(figsize=(8,5))
sns.barplot(x='embark_town',y='fare',data=titanic, palette='rainbow', hue='class')
plt.title("Fare of Passenger by Embarked Town, Divided by Class")
Now we can see that the average fare price in Cherbourg was so high due to some very expensive first-class tickets. The large error bar on the fare price in first-class from Cherbourg is also interesting; that could mean there is a lot of separation between some very high-price outlier tickets and the rest. We’ll explore this further in the combined plots section below!
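Since the mean is sensitive to a few extreme fares, it can help to swap in another estimator. The sketch below passes `np.median` through the `estimator` argument of `barplot`. The `demo` frame is invented for illustration and the `median_barplot` helper is hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Invented fares: one extreme ticket pulls the Cherbourg mean far upward.
demo = pd.DataFrame({
    "embark_town": ["Cherbourg"] * 4 + ["Southampton"] * 4,
    "fare": [10.0, 20.0, 30.0, 500.0, 8.0, 10.0, 12.0, 14.0],
})


def median_barplot(df):
    """Bar plot whose bar heights are medians rather than means."""
    fig, ax = plt.subplots(figsize=(8, 5))
    sns.barplot(x="embark_town", y="fare", data=df,
                estimator=np.median, ax=ax)
    ax.set_title("Median Fare by Embarked Town")
    return ax


ax = median_barplot(demo)
means = demo.groupby("embark_town")["fare"].mean()
medians = demo.groupby("embark_town")["fare"].median()
print(means)
print(medians)
```

In this toy data the Cherbourg mean is 140 while its median is 25, which is the kind of gap a single outlier can create.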
Point plots convey the same information as a bar plot with a different style. They can be good for overlaying with different plots since they have a smaller footprint in the space.
plt.figure(figsize=(8,5))
sns.pointplot(x='embark_town',y='fare',data=titanic)
plt.title("Average Fare Price by Embarked Town")
plt.figure(figsize=(8,5))
sns.pointplot(x='embark_town',y='fare',data=titanic, hue='class')
plt.title("Average Fare Price by Embarked Town, Separated by Class")
Count plots are essentially histograms across a categorical variable. They take most of the same arguments as bar plots in Seaborn, which helps keep things simple.
plt.figure(figsize=(8,5))
sns.countplot(x='embark_town',data=titanic, palette='rainbow')
plt.title("Count of Passengers that Embarked in Each City")
plt.figure(figsize=(8,5))
sns.countplot(x='embark_town',data=titanic, palette='rainbow',hue='sex')
plt.title("Count of Passengers that Embarked in Each City, Separated by Sex")
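The bar heights in a count plot are plain category counts, so they can be cross-checked with pandas. Here is a small sketch on an invented `demo` frame: `value_counts` corresponds to the basic count plot, and `pd.crosstab` to the version with a hue.

```python
import pandas as pd

# Invented rows that only mimic the Titanic column names.
demo = pd.DataFrame({
    "embark_town": ["Southampton", "Cherbourg", "Southampton",
                    "Queenstown", "Southampton", "Cherbourg"],
    "sex": ["male", "female", "female", "male", "male", "male"],
})

# One count per town: the heights in the basic count plot.
counts = demo["embark_town"].value_counts()

# One count per town and sex: the heights once hue='sex' is added.
by_sex = pd.crosstab(demo["embark_town"], demo["sex"])

print(counts)
print(by_sex)
```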
Both strip plots and swarm plots are essentially scatter plots where one variable is categorical. They are useful for quickly visualizing the number of data points in a group, so I like to use them as additions to other kinds of plots, which we’ll discuss below.
plt.figure(figsize=(12,8))
sns.stripplot(x='class', y='age', data=titanic, jitter=True, hue='alive', dodge=True, palette='viridis')
I don’t love the way strip plots look when you have a lot of data points, and a swarm plot might be more useful here. With fewer data points, though, strip plots can look great, and they can convey really interesting attributes of your data since they don’t hide details behind aggregation.
Swarm plots are fantastic because they offer an easy way to show the individual data points in a distribution. Instead of a big blob like the strip plot, the swarm plot simply adjusts the points along the x-axis. Although they also don’t scale well with tons of values, they offer more organized insight.
plt.figure(figsize=(10,7))
sns.swarmplot(x='class', y='age', data=titanic, hue='alive', dodge=True, palette='viridis')
plt.title("Age by Passenger Class, Separated by Survival")
Here we can see where the dense age groups are more easily than in the difficult-to-interpret strip plot above.
One of my favorite uses for a swarm plot is to enhance another kind of plot since they convey relative volume very well. As we will see in the violin plot below even though at one point the KDE values may look similarly “large”, the volume of data points in each of the classes may be quite different. We can add a swarm plot on top of our violin plot to show the individual data points that help to give us a more complete picture.
plt.figure(figsize=(12,8))
sns.violinplot(x='class',y="age", data=titanic, hue='survived', split=True, palette='rainbow')
sns.swarmplot(x='class',y="age", data=titanic, hue='survived', dodge=True, color='grey', alpha=.8, s=4)
plt.title("Age by Passenger Class, Separated by Survival")
By adding the swarm plot we can see where the majority of data points actually sit. I have seen violin plots misinterpreted many times: here, a viewer may assume that a similar number of ~25-year-old third-class passengers died and survived, and the swarm plot does a great job clearing that up.
plt.figure(figsize=(12,8))
sns.boxplot(x='class',y='age',hue='survived',data=titanic, palette='rainbow')
sns.swarmplot(x='class',y='age',hue='survived', dodge=True,data=titanic, alpha=.8,color='grey',s=4)
plt.title("Age by Passenger Class, Separated by Survival")
The story is very similar for box plots. Summary statistics of each group are very useful, but adding the swarm plot helps to show a more complete story.
Remember when we were looking at the average ticket prices by the town embarked from, separated by passenger class, earlier?
We saw that the price of Cherbourg tickets was high, which turned out to be due to the very high mean price of first-class tickets in Cherbourg. We also had a large error bar on the mean price of first-class tickets in Cherbourg. Using a strip plot, we can try to get a better understanding of what’s happening there.
plt.figure(figsize=(12,7))
sns.barplot(x='embark_town',y='fare',data=titanic, palette='rainbow', hue='class')
sns.stripplot(x='embark_town',y="fare",data=titanic, hue='class', dodge=True, color='grey', alpha=.8, s=2)
plt.title("Fare of Passenger by Embarked Town, Divided by Class")
Now we can see that there were two very expensive tickets sold in Cherbourg that skewed the mean, which is why our first-class bar plot had a large error bar. While two people paid close to double the next most expensive first-class tickets, there were also people in first class that paid a lower fare than some of those who boarded in second class! We get all kinds of new insights when we combine plots.
catplot() is the figure-level function that can create all of the plots we have discussed above. Figure-level functions return a Seaborn object (a FacetGrid, in the case of catplot()) that manages its own Matplotlib figure, whereas Seaborn’s axes-level functions draw onto a Matplotlib Axes object and return it.
While working with figure-level functions is generally more complex and has less clear documentation, there are some strengths that make them worth using in certain cases. They are particularly good at faceting data into subplots as we can see below.
g = sns.catplot(x='class', y='survived', col='who', data=titanic,
                kind='bar', aspect=.6, palette='Set2')
(g.set_axis_labels("Class", "Survival Rate")
  .set_titles("{col_name}")
  .set(ylim=(0,1)))
plt.tight_layout()
plt.savefig('seaborn_catplot.png', dpi=1000)
Faceting data allows us to see data at different granularities. Faceting is really a fancy word for separating data into classes along one or more dimensions. Here we are separating the data along the “who” variable, which allows us to plot each type of person separately.
Being able to say col='<column_name>' to automatically facet is a powerful option that most figure-level functions have access to. Accomplishing the same thing in Matplotlib requires significantly more time subsetting data and creating multiple subplots manually.
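For comparison, here is a rough sketch of what the manual Matplotlib route might look like: subset the data per facet, aggregate, and draw each subplot by hand. The `manual_facets` helper and the `demo` frame are invented for illustration, and error bars are left out to keep it short.

```python
import matplotlib.pyplot as plt
import pandas as pd


def manual_facets(df, col="who", x="class", y="survived"):
    """Draw one bar chart of mean `y` by `x` for every level of `col`."""
    levels = sorted(df[col].dropna().unique())
    fig, axes = plt.subplots(1, len(levels), figsize=(3 * len(levels), 4),
                             sharey=True, squeeze=False)
    for ax, level in zip(axes[0], levels):
        subset = df[df[col] == level]          # subset the data by hand
        rates = subset.groupby(x)[y].mean()    # aggregate by hand
        ax.bar(rates.index.astype(str), rates.values)
        ax.set_title(str(level))
        ax.set_xlabel("Class")
        ax.set_ylim(0, 1)
    axes[0, 0].set_ylabel("Survival Rate")
    fig.tight_layout()
    return fig, axes


# Invented rows that only mimic the Titanic column names.
demo = pd.DataFrame({
    "class": ["First", "Third"] * 3,
    "who": ["man", "man", "woman", "woman", "child", "child"],
    "survived": [0, 0, 1, 1, 1, 0],
})

fig, axes = manual_facets(demo)
```

Everything inside the loop is what `col='who'` handles in a single argument.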
Don’t forget that we could still add a hue argument to add even more information to this plot! Faceting data with Seaborn’s figure-level functionality can be an excellent way to make more complex plots.
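As a sketch of that idea, the faceted bar plot can take a hue on top of `col`. The choice of `embark_town` as the hue is only an example, the `faceted_with_hue` helper is hypothetical, and the `demo` frame is invented so the snippet runs on its own.

```python
import pandas as pd
import seaborn as sns


def faceted_with_hue(df):
    """Survival rate by class, faceted by `who` and colored by town."""
    g = sns.catplot(x="class", y="survived", hue="embark_town", col="who",
                    data=df, kind="bar", aspect=.6, palette="Set2")
    (g.set_axis_labels("Class", "Survival Rate")
      .set_titles("{col_name}")
      .set(ylim=(0, 1)))
    return g


# Invented rows that only mimic the Titanic column names.
demo = pd.DataFrame({
    "class": ["First", "First", "Third", "Third"] * 6,
    "who": (["man"] * 4 + ["woman"] * 4 + ["child"] * 4) * 2,
    "embark_town": ["Cherbourg", "Southampton"] * 12,
    "survived": [0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1,
                 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1],
})

g = faceted_with_hue(demo)
```

With the real data this would be `faceted_with_hue(titanic)`. Each facet then compares towns within a class, at the cost of a busier figure.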
You will notice that Seaborn figures require different functions for formatting. However, saving the plot can still be done via plt.savefig(), since the final Seaborn figure interfaces with the Matplotlib API.
We’ve gone through a lot of different plots in this post. I hope that you have seen how easily Seaborn can make an aesthetically pleasing plot that conveys a lot of useful information to the viewer. Once I got used to using it, Seaborn saved me a massive amount of time writing fewer lines of code to produce pleasing visualizations.