Your dataset may contain categorical data. A categorical variable takes one of a limited number of possible values, called categories. Converting such variables to the Pandas category type lets you use less memory and makes analysis easier.
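In this post, I’ll talk about converting a variable to a categorical type, creating categorical data directly, using categories with groupby, memory usage, the categorical methods, and dummy variables.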
You can convert a variable to a categorical variable. To show this, first, let’s import the Pandas and NumPy libraries.
import pandas as pd
import numpy as np
If your data contains repeated values, you can inspect them with the unique and value_counts methods. To show this, let’s create a Series.
data=pd.Series(["Tim","Tom","Sam","Sam"]*3)
data
0 Tim
1 Tom
2 Sam
3 Sam
4 Tim
5 Tom
6 Sam
7 Sam
8 Tim
9 Tom
10 Sam
11 Sam
dtype: object
Let’s take a look at the unique values in the data.
pd.unique(data)
Let’s see how many times each value occurs in the data using the value_counts method.
data.value_counts()
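As a quick reference, here is a minimal sketch that runs both calls on the same Series and prints the results. It uses the method forms data.unique() and data.value_counts(); the names unique_names and counts are only illustrative.

```python
import pandas as pd

data = pd.Series(["Tim", "Tom", "Sam", "Sam"] * 3)

# unique() returns the distinct values in order of first appearance
unique_names = data.unique()
print(unique_names)

# value_counts() returns how often each value occurs, most frequent first
counts = data.value_counts()
print(counts)
```

Sam appears twice in every group of four names, so it has the highest count.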
You can represent these values with integer codes. To show this, let me create a values variable that holds the codes.
values=pd.Series([0,1,0,0]*3)
Now, let’s map these codes to names with the take method.
names=pd.Series(["Tim","Sam"])
names.take(values)
0 Tim
1 Sam
0 Tim
0 Tim
0 Tim
1 Sam
0 Tim
0 Tim
0 Tim
1 Sam
0 Tim
0 Tim
dtype: object
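The codes above were written by hand. As a sketch of how such codes could be derived from existing data, the pd.factorize function returns both the integer codes and the distinct values they point to. This is only an illustration of the idea behind categorical storage; the name restored is hypothetical.

```python
import pandas as pd

data = pd.Series(["Tim", "Tom", "Sam", "Sam"] * 3)

# factorize() returns integer codes plus the distinct values they point to
codes, uniques = pd.factorize(data)
print(codes)
print(uniques)

# take() maps the codes back to the original values
restored = pd.Series(uniques.take(codes))
print(restored)
```

Storing each distinct value once plus one small integer per row is the same idea the categorical type builds on.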
Pandas has a special categorical type for this kind of data. To show this, let’s use the data variable again.
Let’s assign the length of this variable to variable N.
N=len(data)
Let’s create a dataframe that uses this data as its name column.
df=pd.DataFrame(
{"name":data,
"num":np.arange(N),
"score":np.random.randint(40,100,
size=N),
"weight":np.random.uniform(50,70,
size=N)},
columns=["num","name","score","weight"])
Let’s take a look at the dataset.
Let’s select the name column.
Let’s take a look at the structure of the name column.
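The three steps above can be sketched as follows. The dataframe is rebuilt here so the example runs on its own; the score and weight values are random, so your numbers will differ.

```python
import numpy as np
import pandas as pd

data = pd.Series(["Tim", "Tom", "Sam", "Sam"] * 3)
N = len(data)
df = pd.DataFrame(
    {"name": data,
     "num": np.arange(N),
     "score": np.random.randint(40, 100, size=N),
     "weight": np.random.uniform(50, 70, size=N)},
    columns=["num", "name", "score", "weight"])

print(df.head())         # first five rows of the dataset
print(df["name"])        # the name column on its own
print(type(df["name"]))  # the data structure of the column
print(df["name"].dtype)  # not yet a category dtype
```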
This column is a Series. Let’s convert this Series to the category type.
name_cat=df["name"].astype("category")
name_cat
0 Tim
1 Tom
2 Sam
3 Sam
4 Tim
5 Tom
6 Sam
7 Sam
8 Tim
9 Tom
10 Sam
11 Sam
Name: name, dtype: category
Categories (3, object): ['Sam', 'Tim', 'Tom']
Now the values in name_cat are categorical. To check this, let’s assign the values in name_cat to x.
x=name_cat.values
Let’s take a look at the categories of these values.
x.categories
Let’s see the codes.
x.codes
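To make the relationship between categories and codes concrete, here is a small self-contained sketch. It rebuilds name_cat from the same names instead of the dataframe, which is an assumption made only to keep the example short.

```python
import pandas as pd

name_cat = pd.Series(["Tim", "Tom", "Sam", "Sam"] * 3,
                     name="name").astype("category")
x = name_cat.values  # a pandas Categorical

print(type(x))
print(x.categories)  # the distinct labels, sorted
print(x.codes)       # one integer per row, indexing into x.categories
```

Each code is the position of the row’s label in x.categories, so Sam is stored as 0, Tim as 1, and Tom as 2.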
You can also convert the column in the dataframe itself to a category.
df["name"]=df["name"].astype("category")
df.name
0 Tim
1 Tom
2 Sam
3 Sam
4 Tim
5 Tom
6 Sam
7 Sam
8 Tim
9 Tom
10 Sam
11 Sam
Name: name, dtype: category
Categories (3, object): ['Sam', 'Tim', 'Tom']
You can directly create a categorical variable.
data_cat=pd.Categorical(list("abcde"))
data_cat
['a', 'b', 'c', 'd', 'e']
Categories (5, object): ['a', 'b', 'c', 'd', 'e']
You can directly categorize data with the Categorical constructor. Notice that the categories are inferred from the data and sorted.
pd.Categorical(["banana", "apple",
"kiwi", "banana", "apple"])
['banana', 'apple', 'kiwi', 'banana', 'apple']
Categories (3, object): ['apple', 'banana', 'kiwi']
If your data is already stored as integer codes, you can build a categorical from it with the from_codes method. To show this, let’s create people and codes variables and then map these two variables.
people=["baby", "child", "young", "old"]
codes=[0,1,2,3,1,0,0]
people_cat=pd.Categorical.from_codes(
codes,people)
print(people_cat)
['baby', 'child', 'young', 'old', 'child', 'baby', 'baby']
Categories (4, object): ['baby', 'child', 'young', 'old']
Notice that the categories have no specific order by default. You can make the categorical ordered by passing ordered=True.
people_cat=pd.Categorical.from_codes(
codes,people,ordered=True)
print(people_cat)
['baby', 'child', 'young', 'old', 'child', 'baby', 'baby']
Categories (4, object): ['baby' < 'child' < 'young' < 'old']
The as_ordered method returns an ordered version of a categorical. Here people_cat is already ordered, so the result is the same.
people_cat.as_ordered()
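The effect of as_ordered is easier to see on an unordered categorical. The sketch below starts from one created without ordered=True; the names unordered and ordered are only illustrative.

```python
import pandas as pd

people = ["baby", "child", "young", "old"]
codes = [0, 1, 2, 3, 1, 0, 0]

unordered = pd.Categorical.from_codes(codes, people)
ordered = unordered.as_ordered()  # returns a new, ordered Categorical

print(unordered.ordered)
print(ordered.ordered)

# Ordered categoricals support min and max based on the category order
print(ordered.min(), ordered.max())
```

The order follows the position in the categories list, not alphabetical order, so baby is the smallest value and old the largest.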
You can easily work with functions like groupby if you categorize the data. To show this, let’s draw 1000 values from a standard normal distribution.
Let me divide this data into four quantile-based intervals with the qcut function.
data=np.random.randn(1000)
interval=pd.qcut(data,4)
interval
[(-2.9739999999999998, -0.668], (0.735, 3.402], (0.735, 3.402], (0.00973, 0.735], (-2.9739999999999998, -0.668], ..., (-0.668, 0.00973], (0.735, 3.402], (-0.668, 0.00973], (0.735, 3.402], (-2.9739999999999998, -0.668]]
Length: 1000
Categories (4, interval[float64]): [(-2.9739999999999998, -0.668] < (-0.668, 0.00973] < (0.00973, 0.735] < (0.735, 3.402]]
Let’s check the type of this interval variable.
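A minimal sketch of that check is below. The data is random, so the interval edges will differ from the output above.

```python
import numpy as np
import pandas as pd

data = np.random.randn(1000)
interval = pd.qcut(data, 4)

print(type(interval))    # the Python type returned by qcut
print(interval.dtype)    # category dtype with interval categories
print(interval.ordered)  # whether the bins are ordered

# About a quarter of the values fall in each bin
bin_counts = interval.value_counts()
print(bin_counts)
```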
This interval variable is of the Categorical type. You can assign labels to these ranges.
interval=pd.qcut(data,4,labels=["Q1","Q2",
"Q3","Q4"])
print(interval)
['Q1', 'Q4', 'Q4', 'Q3', 'Q1', ..., 'Q2', 'Q4', 'Q2', 'Q4', 'Q1']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
You can calculate summary statistics using groupby. First, let’s convert the labeled ranges to a Series.
interval=pd.Series(interval, name="quarter")
Now, let’s find the count, minimum, and maximum of the values in each interval.
pd.Series(data).groupby(interval).agg(
    ["count", "min", "max"]).reset_index()
When working with big data, converting columns with few distinct values to categorical variables can improve performance. The categorical version of such a column takes up significantly less memory, because each distinct value is stored once and the rows hold only small integer codes. For example, let’s create data with ten million elements.
N=10000000
num=pd.Series(np.random.randn(N))
Let’s assign labels to these values.
label=pd.Series(["a","b","c","d"]*(N//4))
Let’s convert this data into categorical data.
cat=label.astype("category")
Now, let’s take a look at the memory usage of categorical and non-categorical data.
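One way to compare the two is the memory_usage method. The sketch below uses one million elements instead of ten million, an assumption made only so it runs quickly; the names label_bytes and cat_bytes are illustrative.

```python
import pandas as pd

N = 1_000_000  # smaller than the ten million above so the sketch runs quickly
label = pd.Series(["a", "b", "c", "d"] * (N // 4))
cat = label.astype("category")

# deep=True asks pandas to include the memory held by the string values
label_bytes = label.memory_usage(deep=True)
cat_bytes = cat.memory_usage(deep=True)

print(label_bytes)
print(cat_bytes)
print(label_bytes / cat_bytes)  # how many times smaller the categorical is
```

The exact numbers depend on your Pandas version and platform, so no output is shown here.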
As you can see, categorical data uses less memory than non-categorical data.
A categorical Series comes with some special methods. To show these methods, let’s create a Series.
s=pd.Series(["a","b","c","d"]*2)
Let’s convert this data into categorical data.
s_ct=s.astype("category")
print(s_ct)
0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
The cat attribute gives us access to the categorical methods and attributes of a Series. For example, you can use the codes attribute to see the integer codes of the values in the data.
To use any categorical method, you first write the cat accessor and then the method name.
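Here is a short sketch of the cat accessor in use. It rebuilds s_ct so that it runs on its own.

```python
import pandas as pd

s_ct = pd.Series(["a", "b", "c", "d"] * 2).astype("category")

print(s_ct.cat.codes)       # a Series with one integer code per row
print(s_ct.cat.categories)  # the distinct labels
print(s_ct.cat.ordered)     # whether the categories are ordered
```

The codes keep the index of the original Series, so they line up with the values row by row.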
You can use the set_categories method to extend the categories. It returns a new Series and leaves s_ct unchanged.
new_ct=["a","b","c","d","e"]
s_ct.cat.set_categories(new_ct)
0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']
You can use the remove_unused_categories method to remove unused categories. To show this, let’s select the values a and b in the data.
s2_ct=s_ct[s_ct.isin(["a","b"])]
s2_ct
0 a
1 b
4 a
5 b
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
Now, let’s remove the categories that are not used.
s2_ct.cat.remove_unused_categories()
0 a
1 b
4 a
5 b
dtype: category
Categories (2, object): ['a', 'b']
Many machine learning models need numeric input, so before building a model you often need to convert categorical data into dummy variables. To show this, let me use the s_ct data again.
You can use the get_dummies function to convert categorical data into dummy variables.
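A minimal sketch of get_dummies on the s_ct data follows. The prefix "category" is an arbitrary choice for the column names, not something the function requires.

```python
import pandas as pd

s_ct = pd.Series(["a", "b", "c", "d"] * 2).astype("category")

# One indicator column per category; each row marks exactly one of them
dummies = pd.get_dummies(s_ct, prefix="category")
print(dummies)
```

Depending on your Pandas version, the indicator columns hold either booleans or 0/1 integers.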
That’s it. This post covered how to work with categorical data in Pandas.
Source:
https://medium.com/swlh/categorical-data-in-pandas-9eaaff71e6f3