
Categorical data type in Pandas


Your dataset may contain categorical data. A categorical variable takes one of a limited, usually fixed, set of values (its categories), such as a blood type or a country name. Converting such columns to the Pandas category type typically reduces memory use and makes some analyses easier.

I’ll talk about the following topics in this post.

  • Converting to categorical data
  • Working with categorical data
  • The performance of the category types
  • The categorical methods
  • Creating a dummy variable

Converting to Categorical Data

You can convert a variable to a categorical variable. To show this, first, let’s import the Pandas and NumPy libraries.

import pandas as pd 
import numpy as np

If a column contains many repeated values, functions such as unique and value_counts help you inspect them. To show this, let’s create a series.

data=pd.Series(["Tim","Tom","Sam","Sam"]*3)
data
0     Tim
1     Tom
2     Sam
3     Sam
4     Tim
5     Tom
6     Sam
7     Sam
8     Tim
9     Tom
10    Sam
11    Sam
dtype: object

Let’s take a look at the unique values in the data.

pd.unique(data)

Let’s count how many times each value occurs in the data using the value_counts method.

data.value_counts()
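As a quick check of what these two calls return for the series above, here is a minimal sketch. The variable names uniques and counts are introduced only for illustration.

```python
import pandas as pd

data = pd.Series(["Tim", "Tom", "Sam", "Sam"] * 3)

# Distinct values, in order of first appearance
uniques = pd.unique(data)
print(list(uniques))   # ['Tim', 'Tom', 'Sam']

# Number of occurrences of each distinct value
counts = data.value_counts()
print(counts)          # Sam occurs 6 times, Tim and Tom 3 times each
```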

You can also store repeated values as integer codes that point into a small table of distinct values. To show this, let me create a values variable that holds the codes.

values=pd.Series([0,1,0,0]*3)

Now, let’s use the take method to map these codes to the names they stand for.

names=pd.Series(["Tim","Sam"])
names.take(values)
0    Tim
1    Sam
0    Tim
0    Tim
0    Tim
1    Sam
0    Tim
0    Tim
0    Tim
1    Sam
0    Tim
0    Tim
dtype: object
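Storing integer codes next to a small table of distinct values is the representation that categorical data relies on. As a hedged sketch, pd.factorize (not used in the original post) goes in the opposite direction: it derives the codes and the table of distinct values from the raw data, and take then rebuilds the original values. The name restored is illustrative.

```python
import pandas as pd

data = pd.Series(["Tim", "Tom", "Sam", "Sam"] * 3)

# factorize returns one integer code per element plus the distinct values
codes, uniques = pd.factorize(data)
print(codes[:4].tolist())   # [0, 1, 2, 2]
print(list(uniques))        # ['Tim', 'Tom', 'Sam']

# take maps the codes back to the values they stand for
restored = pd.Series(uniques.take(codes))
print(list(restored) == list(data))   # True
```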

Categorical Type in Pandas

Pandas has a dedicated categorical type for this kind of data. To show it, let’s go back to the data variable and assign its length to variable N.

N=len(data)

Let’s create a dataframe that uses data as its name column.

df=pd.DataFrame(
    {"name":data,
     "num":np.arange(N),
     "score":np.random.randint(40,100,
                               size=N),
     "weight":np.random.uniform(50,70,
                                size=N)},
    columns=["num","name","score","weight"])

Let’s take a look at the dataset, select the name column, and check the structure of that column.
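These inspection steps can be sketched as follows. This is a minimal, self-contained version that rebuilds the dataframe; the seeded generator rng and the name name_col are assumptions added so the sketch is reproducible, not part of the original post.

```python
import numpy as np
import pandas as pd

data = pd.Series(["Tim", "Tom", "Sam", "Sam"] * 3)
N = len(data)

# Seeded generator (an assumption) so the random columns are reproducible
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"name": data,
     "num": np.arange(N),
     "score": rng.integers(40, 100, size=N),
     "weight": rng.uniform(50, 70, size=N)},
    columns=["num", "name", "score", "weight"])

print(df.head())          # first five rows of the dataset
name_col = df["name"]     # select the name column
print(type(name_col))     # a pandas Series
print(name_col.dtype)     # a plain string/object dtype, not yet category
```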

This column is a Series. Let’s convert this series to the category type.

name_cat=df["name"].astype("category")
name_cat
0     Tim
1     Tom
2     Sam
3     Sam
4     Tim
5     Tom
6     Sam
7     Sam
8     Tim
9     Tom
10    Sam
11    Sam
Name: name, dtype: category
Categories (3, object): ['Sam', 'Tim', 'Tom']

Now the values in this name_cat are categorical. To check this, let’s assign the values in name_cat to x.

x=name_cat.values

Let’s take a look at the categories of these values.

x.categories

Let’s see the codes.

x.codes
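To make the relationship between categories and codes concrete, here is a small sketch. It builds the categorical directly from the name data, and the name rebuilt is illustrative.

```python
import pandas as pd

data = pd.Series(["Tim", "Tom", "Sam", "Sam"] * 3)
x = data.astype("category").values

print(x.categories.tolist())   # ['Sam', 'Tim', 'Tom']
print(x.codes.tolist()[:4])    # [1, 2, 0, 0]

# Each code is a position in categories, so take rebuilds the labels
rebuilt = x.categories.take(x.codes)
print(rebuilt.tolist()[:4])    # ['Tim', 'Tom', 'Sam', 'Sam']
```

The categories are stored once, in sorted order, and every element of the series is represented by a small integer that indexes into them.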

You can also convert the column in the data frame to a category.

df["name"]=df["name"].astype("category")
df.name
0     Tim
1     Tom
2     Sam
3     Sam
4     Tim
5     Tom
6     Sam
7     Sam
8     Tim
9     Tom
10    Sam
11    Sam
Name: name, dtype: category
Categories (3, object): ['Sam', 'Tim', 'Tom']

You can directly create a categorical variable.

data_cat=pd.Categorical(list("abcde"))
data_cat
['a', 'b', 'c', 'd', 'e']
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

You can also pass a list of labels to the Categorical constructor, which infers the categories and sorts them.

pd.Categorical(["banana", "apple", 
                "kiwi", "banana", "apple"])
['banana', 'apple', 'kiwi', 'banana', 'apple']
Categories (3, object): ['apple', 'banana', 'kiwi']

If your data is already encoded as integer codes, you can build a categorical from it with the from_codes method. To show this, let’s create people and codes variables and then map these two variables.

people=["baby", "child", "young", "old"]
codes=[0,1,2,3,1,0,0]
people_cat=pd.Categorical.from_codes(
    codes,people)

print(people_cat)
['baby', 'child', 'young', 'old', 'child', 'baby', 'baby']
Categories (4, object): ['baby', 'child', 'young', 'old']

Notice that these categories have no specific order by default. You can create an ordered categorical with ordered=True.

people_cat=pd.Categorical.from_codes(
    codes,people,ordered=True)
print(people_cat)
['baby', 'child', 'young', 'old', 'child', 'baby', 'baby']
Categories (4, object): ['baby' < 'child' < 'young' < 'old']

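You can also turn an unordered categorical into an ordered one with the as_ordered method, which returns an ordered copy. Since people_cat is already ordered at this point, the call below leaves its order unchanged.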

people_cat.as_ordered()
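What ordering buys you can be sketched as follows. The names unordered, ordered, and older_than_child are illustrative and do not appear in the original post.

```python
import pandas as pd

people = ["baby", "child", "young", "old"]
codes = [0, 1, 2, 3, 1, 0, 0]

unordered = pd.Categorical.from_codes(codes, people)
print(unordered.ordered)             # False

# as_ordered returns an ordered copy; the original is left unchanged
ordered = unordered.as_ordered()
print(ordered.ordered)               # True

# Ordered categoricals support min, max, and comparisons
print(ordered.min(), ordered.max())  # baby old
older_than_child = ordered > "child"
print(older_than_child.tolist())     # [False, False, True, True, False, False, False]
```

The order follows the position of each label in the categories list, not alphabetical order.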

Working with Categorical Data

You can easily work with functions like groupby if you categorize the data. To show this, let’s draw 1,000 values from a standard normal distribution and divide them into four quantile-based intervals.

data=np.random.randn(1000)
interval=pd.qcut(data,4)
interval

[(-2.9739999999999998, -0.668], (0.735, 3.402], (0.735, 3.402], (0.00973, 0.735], (-2.9739999999999998, -0.668], ..., (-0.668, 0.00973], (0.735, 3.402], (-0.668, 0.00973], (0.735, 3.402], (-2.9739999999999998, -0.668]]
Length: 1000
Categories (4, interval[float64]): [(-2.9739999999999998, -0.668] < (-0.668, 0.00973] < (0.00973, 0.735] < (0.735, 3.402]]

Let’s check the type of this interval variable.
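One way to do this check is sketched below. The seeded generator rng is an assumption added for reproducibility, so the bin edges will differ from the output shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded generator, an assumption
data = rng.standard_normal(1000)
interval = pd.qcut(data, 4)

print(type(interval))            # pandas Categorical
print(interval.categories)       # the four interval bins
print(interval.ordered)          # True
```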

This interval variable is a categorical type. You can assign labels to these ranges.

interval=pd.qcut(data,4,labels=["Q1","Q2",
                                "Q3","Q4"])
print(interval)

['Q1', 'Q4', 'Q4', 'Q3', 'Q1', ..., 'Q2', 'Q4', 'Q2', 'Q4', 'Q1']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

You can calculate some summary statistics using groupby. First, let’s convert the labeled intervals to a series.

interval=pd.Series(interval, name="quarter")

Now, let’s find the count, minimum, and maximum of the values in each interval.

(pd.Series(data)
   .groupby(interval)
   .agg(["count", "min", "max"])
   .reset_index())

The Performance of Categorical Types

When working with large datasets, converting repetitive columns to the category type can improve performance. The categorical version of a DataFrame column usually takes up significantly less memory. For example, let’s create data with ten million elements.

N=10000000
num=pd.Series(np.random.randn(N))

Let’s assign labels to these values.

label=pd.Series(["a","b","c","d"]*(N//4))

Let’s convert this data into categorical data.

cat=label.astype("category")

Now, let’s take a look at the memory usage of categorical and non-categorical data.
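One way to measure this is the memory_usage method, sketched below. The sketch uses a smaller N than above so it runs quickly, and the names label_bytes and cat_bytes are illustrative. The exact numbers depend on your Pandas version and platform.

```python
import pandas as pd

N = 1_000_000   # smaller than the ten million above, to keep the sketch quick
label = pd.Series(["a", "b", "c", "d"] * (N // 4))
cat = label.astype("category")

# deep=True also counts the memory held by the string objects themselves
label_bytes = label.memory_usage(deep=True)
cat_bytes = cat.memory_usage(deep=True)
print(label_bytes, cat_bytes)
print(label_bytes / cat_bytes)   # how many times smaller the categorical is
```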

Categorical data uses less memory than non-categorical data because each value is stored as a small integer code instead of a full string.

The Categorical Methods

Pandas provides some special methods for categorical series. To show these methods, let’s create a series.

s=pd.Series(["a","b","c","d"]*2)

Let’s convert this data into categorical data.

s_ct=s.astype("category")

print(s_ct)

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

The cat accessor gives us access to categorical attributes and methods. For example, let’s use the codes attribute to see the codes of the values in the data.

To use any categorical method on a series, write .cat first and then the name of the method.
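A minimal sketch of the cat accessor on the series above:

```python
import pandas as pd

s_ct = pd.Series(["a", "b", "c", "d"] * 2).astype("category")

# codes and categories are attributes reached through the cat accessor
print(s_ct.cat.codes.tolist())        # [0, 1, 2, 3, 0, 1, 2, 3]
print(s_ct.cat.categories.tolist())   # ['a', 'b', 'c', 'd']
```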

You can use the set_categories method to extend the set of categories.

new_ct=["a","b","c","d","e"]
s_ct.cat.set_categories(new_ct)
0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']
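The new category "e" exists even though no value uses it yet. One way to see this, as a hedged sketch, is value_counts, which reports every category of a categorical series. The names extended and counts are illustrative.

```python
import pandas as pd

s_ct = pd.Series(["a", "b", "c", "d"] * 2).astype("category")
extended = s_ct.cat.set_categories(["a", "b", "c", "d", "e"])

# value_counts reports every category, including the unused "e"
counts = extended.value_counts()
print(counts["a"], counts["e"])   # 2 0
```

Note that set_categories returns a new series, so s_ct itself still has four categories.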

You can use the remove_unused_categories method to remove unused categories. To show this, let’s select the values a and b in the data.

s2_ct=s_ct[s_ct.isin(["a","b"])]
s2_ct             
0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

Now, let’s remove the categories that are not used.

s2_ct.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): ['a', 'b']

Creating a Dummy Variable

Many machine learning models require numeric input, so you often need to convert categorical data into dummy variables before building a model. To show this, let me use the s_ct data again.

You can use the get_dummies function to convert categorical data into dummy variables.
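A minimal sketch of this conversion on the s_ct data follows. The prefix value "category" and the names dummies and prefixed are chosen here for illustration.

```python
import pandas as pd

s_ct = pd.Series(["a", "b", "c", "d"] * 2).astype("category")

# One indicator column per category; each row has exactly one "on" value
dummies = pd.get_dummies(s_ct)
print(dummies.columns.tolist())   # ['a', 'b', 'c', 'd']
print(dummies.shape)              # (8, 4)

# prefix adds a common label to the generated column names
prefixed = pd.get_dummies(s_ct, prefix="category")
print(prefixed.columns.tolist())  # ['category_a', 'category_b', 'category_c', 'category_d']
```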

That’s it. This post covered how to work with categorical data in Pandas.

Source:

https://medium.com/swlh/categorical-data-in-pandas-9eaaff71e6f3

Amir Masoud Sefidian
Machine Learning Engineer
