
Basic feature engineering tasks for numeric and categorical data with Python code


Machine learning pipelines

Any intelligent system basically consists of an end-to-end pipeline: it ingests raw data and leverages data processing techniques to wrangle, process, and engineer meaningful features and attributes from this data. We then usually leverage techniques like statistical or machine learning models to model these features, and deploy the model if necessary for future usage, based on the problem to be solved at hand. A typical standard machine learning pipeline based on the CRISP-DM industry-standard process model is depicted below.

A standard machine learning pipeline (source: Practical Machine Learning with Python, Apress/Springer)

Ingesting raw data and building models directly on top of it would be foolhardy, since we wouldn’t get the desired results or performance, and algorithms are not intelligent enough to automatically extract meaningful features from raw data (there are automated feature extraction techniques enabled nowadays with deep learning methodologies to some extent, but more on that later!).

Our main area of focus falls under the data preparation aspect as pointed out in the figure above, where we deal with various methodologies to extract meaningful attributes or features from the raw data after it has gone through necessary wrangling and pre-processing.

Understanding Features

A feature is typically a specific representation on top of raw data: an individual, measurable attribute, typically depicted by a column in a dataset. Considering a generic two-dimensional dataset, each observation is depicted by a row and each feature by a column, which will have a specific value for an observation.

A generic dataset snapshot

Thus, as in the example in the figure above, each row typically indicates a feature vector, and the entire set of features across all the observations forms a two-dimensional feature matrix, also known as a feature-set. This is akin to data frames or spreadsheets representing two-dimensional data. Typically, machine learning algorithms work with these numeric matrices or tensors, and hence most feature engineering techniques deal with converting raw data into some numeric representation which can be easily understood by these algorithms.

Features can be of two major types based on the dataset. Inherent raw features are obtained directly from the dataset with no extra data manipulation or engineering. Derived features are usually obtained from feature engineering, where we extract features from existing data attributes. A simple example would be creating a new feature “Age” from an employee dataset containing “Birthdate” by just subtracting their birth date from the current date.
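As a quick sketch of such a derived feature, consider the snippet below. The employee data frame and column names here are hypothetical, purely for illustration of the idea.

import pandas as pd

# hypothetical employee data with a 'Birthdate' attribute
employees = pd.DataFrame({'Name': ['Alice', 'Bob'],
                          'Birthdate': ['1985-04-12', '1992-11-03']})
employees['Birthdate'] = pd.to_datetime(employees['Birthdate'])

# derived feature: approximate age in years, computed from the birth date and today's date
employees['Age'] = (pd.Timestamp.today() - employees['Birthdate']).dt.days // 365
employees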

There are diverse types and formats of data, including structured and unstructured data. In this article, we will discuss various feature engineering strategies for dealing with structured continuous numeric data as well as discrete categorical data. You can access the relevant datasets and code used in this article on GitHub.

Feature Engineering on Numeric Data

Numeric data typically represents data in the form of scalar values depicting observations, recordings, or measurements. Here, by numeric data, we mean continuous data and not discrete data, which is typically represented as categorical data. Numeric data can also be represented as a vector of values where each value or entity in the vector can represent a specific feature. Integers and floats are the most common and widely used numeric data types for continuous numeric data. Even though numeric data can be directly fed into machine learning models, you would still need to engineer features that are relevant to the scenario, problem, and domain before building a model. Hence the need for feature engineering still remains. Let’s leverage Python and look at some strategies for feature engineering on numeric data. We load up the following necessary dependencies first (typically in a Jupyter notebook).

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as spstats

%matplotlib inline

Raw Measures

As we mentioned earlier, raw numeric data can often be fed directly to machine learning models based on the context and data format. Raw measures are typically indicated using numeric variables directly as features without any form of transformation or engineering. Typically these features can indicate values or counts. Let’s load up one of our datasets, the Pokémon dataset also available on Kaggle.

poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
poke_df.head()
Snapshot of our Pokemon dataset

Pokémon is a huge media franchise surrounding fictional characters called Pokémon which stands for pocket monsters. In short, you can think of them as fictional animals with superpowers! This dataset consists of these characters with various statistics for each character.


If you closely observe the data frame snapshot in the above figure, you can see that several attributes represent numeric raw values that can be used directly. The following snippet depicts some of these features with more emphasis.

poke_df[['HP', 'Attack', 'Defense']].head()
Features with (continuous) numeric data

Thus, you can directly use these attributes as features that are depicted in the above data frame. These include each Pokémon’s HP (Hit Points), Attack and Defense stats. In fact, we can also compute some basic statistical measures in these fields.

poke_df[['HP', 'Attack', 'Defense']].describe()
Basic descriptive statistics on numeric features

With this, you can get a good idea about statistical measures in these features like count, average, standard deviation, and quartiles.


Another form of raw measures includes features that represent frequencies, counts, or occurrences of specific attributes. Let’s look at a sample of data from the Million Song dataset which depicts counts or frequencies of songs that have been heard by various users.

popsong_df = pd.read_csv('datasets/song_views.csv', 
                         encoding='utf-8')
popsong_df.head(10)
Song listen counts as a numeric feature

It is quite evident from the above snapshot that the listen_count field can be used directly as a frequency\count-based numeric feature.


Often raw frequencies or counts may not be relevant for building a model based on the problem which is being solved. For instance, if I’m building a recommendation system for song recommendations, I would just want to know if a person is interested or has listened to a particular song. This doesn’t require the number of times a song has been listened to since I am more concerned about the various songs he\she has listened to. In this case, a binary feature is preferred as opposed to a count-based feature. We can binarize our listen_count field as follows.

watched = np.array(popsong_df['listen_count']) 
watched[watched >= 1] = 1
popsong_df['watched'] = watched

You can also use scikit-learn's Binarizer class here from its preprocessing module to perform the same task instead of numpy arrays.

from sklearn.preprocessing import Binarizer

bn = Binarizer(threshold=0.9)
pd_watched = bn.transform([popsong_df['listen_count']])[0]
popsong_df['pd_watched'] = pd_watched
popsong_df.head(11)
Binarizing song counts

You can clearly see from the above snapshot that both the methods have produced the same result. Thus we get a binarized feature indicating if the song was listened to or not by each user which can be then further used in a relevant model.


Often when dealing with continuous numeric attributes like proportions or percentages, we may not need the raw values to have a high amount of precision. Hence it often makes sense to round off these high precision percentages into numeric integers. These integers can then be directly used as raw values or even as categorical (discrete-class based) features. Let’s try applying this concept in a dummy dataset depicting store items and their popularity percentages.

items_popularity = pd.read_csv('datasets/item_popularity.csv',  
                               encoding='utf-8')
items_popularity['popularity_scale_10'] = np.array(
                   np.round((items_popularity['pop_percent'] * 10)),  
                   dtype='int')
items_popularity['popularity_scale_100'] = np.array(
                  np.round((items_popularity['pop_percent'] * 100)),    
                  dtype='int')
items_popularity
Rounding popularity to different scales

Based on the above outputs, you can see that we tried two forms of rounding. The features now depict the item popularities both on a scale of 1–10 and on a scale of 1–100. You can use these values either as numerical or categorical features depending on the scenario and problem.


Supervised machine learning models usually try to model the output responses (discrete classes or continuous values) as a function of the input feature variables. For example, a simple linear regression equation can be depicted as

y = β₁x₁ + β₂x₂ + … + βₙxₙ

where the input features are depicted by the variables x₁, x₂, …, xₙ having weights or coefficients β₁, β₂, …, βₙ respectively, and the goal is to predict the response y.

In this case, this simple linear model depicts the relationship between the output and inputs, purely based on the individual, separate input features.

However, often in several real-world scenarios, it makes sense to also try and capture the interactions between these feature variables as a part of the input feature set. A simple depiction of the extension of the above linear regression formulation with interaction features (for two input features, up to degree 2) would be

y = β₁x₁ + β₂x₂ + β₃x₁x₂ + β₄x₁² + β₅x₂²

where terms like x₁x₂ and the squared terms denote the interaction features. Let’s try engineering some interaction features on our Pokémon dataset now.

atk_def = poke_df[['Attack', 'Defense']]
atk_def.head()

From the output data frame, we can see that we have two numeric (continuous) features, Attack and Defense. We will now build features up to the 2nd degree by leveraging scikit-learn.

from sklearn.preprocessing import PolynomialFeatures

pf = PolynomialFeatures(degree=2, interaction_only=False,  
                        include_bias=False)
res = pf.fit_transform(atk_def)
res

array([[    49.,     49.,   2401.,   2401.,   2401.],
       [    62.,     63.,   3844.,   3906.,   3969.],
       [    82.,     83.,   6724.,   6806.,   6889.],
       [   110.,     60.,  12100.,   6600.,   3600.],
       [   160.,     60.,  25600.,   9600.,   3600.],
       [   110.,    120.,  12100.,  13200.,  14400.]])

The above feature matrix depicts a total of five features including the new interaction features. We can see the degree of each feature in the above matrix as follows.

pd.DataFrame(pf.powers_, columns=['Attack_degree',  
                                  'Defense_degree'])

Looking at this output, we now know what each feature actually represents from the degrees depicted here. Armed with this knowledge, we can assign a name to each feature as follows. This is purely for ease of understanding; in practice, you should name your features with better, simpler names.

intr_features = pd.DataFrame(res, columns=['Attack', 'Defense',  
                                           'Attack^2', 
                                           'Attack x Defense',  
                                           'Defense^2'])
intr_features.head(5)
Numeric features with their interactions

Thus the above data frame represents our original features along with their interaction features.


The problem with working with raw, continuous numeric features is that often the distribution of values in these features will be skewed. This signifies that some values will occur quite frequently while some will be quite rare. Besides this, there is also another problem of the varying range of values in any of these features. For instance view counts of specific music videos could be abnormally large (Despacito we’re looking at you!) and some could be really small. Directly using these features can cause a lot of issues and adversely affect the model. Hence there are strategies to deal with this, which include binning and transformations.

Binning, also known as quantization, is used for transforming continuous numeric features into discrete ones (categories). These discrete values or numbers can be thought of as categories or bins into which the raw, continuous numeric values are binned or grouped. Each bin represents a specific degree of intensity and hence a specific range of continuous numeric values falls into it. Specific strategies for binning data include fixed-width and adaptive binning. Let’s use a subset of data from a dataset extracted from the 2016 FreeCodeCamp Developer\Coder survey which talks about various attributes pertaining to coders and software developers.

fcc_survey_df = pd.read_csv('datasets/fcc_2016_coder_survey_subset.csv', 
                            encoding='utf-8')
fcc_survey_df[['ID.x', 'EmploymentField', 'Age', 'Income']].head()
Sample attributes from the FCC coder survey dataset

The ID.x variable is basically a unique identifier for each coder\developer who took the survey and the other fields are pretty self-explanatory.

Fixed-Width Binning

Just like the name indicates, in fixed-width binning, we have specific fixed widths for each of the bins which are usually pre-defined by the user analyzing the data. Each bin has a pre-fixed range of values that should be assigned to that bin on the basis of some domain knowledge, rules, or constraints. Binning based on rounding is one of the ways, where you can use the rounding operation which we discussed earlier to bin raw values.

Let’s now consider the Age feature from the coder survey dataset and look at its distribution.

fig, ax = plt.subplots()
fcc_survey_df['Age'].hist(color='#A9C5D3', edgecolor='black',  
                          grid=False)
ax.set_title('Developer Age Histogram', fontsize=12)
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
Histogram depicting developer age distribution

The above histogram depicting developer ages is slightly right-skewed as expected (fewer older developers). We will now assign these raw age values to specific bins based on the following scheme

Age Range: Bin
 0 -  9  : 0
10 - 19  : 1
20 - 29  : 2
30 - 39  : 3
40 - 49  : 4
50 - 59  : 5
60 - 69  : 6
  ... and so on

We can easily do this using what we learned in the Rounding section earlier where we round off these raw age values by taking the floor value after dividing it by 10.

fcc_survey_df['Age_bin_round'] = np.array(np.floor(
                              np.array(fcc_survey_df['Age']) / 10.))
fcc_survey_df[['ID.x', 'Age', 'Age_bin_round']].iloc[1071:1076]
Binning by rounding

You can see the corresponding bins for each age have been assigned based on rounding. But what if we need more flexibility? What if we want to decide and fix the bin widths based on our own rules\logic? Binning based on custom ranges will help us achieve this. Let’s define some custom age ranges for binning developer ages using the following scheme.

Age Range : Bin
 0 -  15  : 1
16 -  30  : 2
31 -  45  : 3
46 -  60  : 4
61 -  75  : 5
75 - 100  : 6

Based on this custom binning scheme, we will now label the bins for each developer’s age value and we will store both the bin range as well as the corresponding label.

bin_ranges = [0, 15, 30, 45, 60, 75, 100]
bin_names = [1, 2, 3, 4, 5, 6]
fcc_survey_df['Age_bin_custom_range'] = pd.cut(
    np.array(fcc_survey_df['Age']), bins=bin_ranges)
fcc_survey_df['Age_bin_custom_label'] = pd.cut(
    np.array(fcc_survey_df['Age']), bins=bin_ranges, labels=bin_names)
# view the binned features 
fcc_survey_df[['ID.x', 'Age', 'Age_bin_round', 
               'Age_bin_custom_range', 'Age_bin_custom_label']].iloc[1071:1076]
Custom binning scheme for developer ages

Adaptive Binning

The drawback of fixed-width binning is that, since we decide the bin ranges manually, we can end up with irregular bins which are not uniform in terms of the number of data points or values which fall into each bin. Some of the bins might be densely populated and some of them might be sparsely populated or even empty! Adaptive binning is a safer strategy in these scenarios where we let the data speak for itself! That’s right, we use the data distribution itself to decide our bin ranges.

Quantile-based binning is a good strategy to use for adaptive binning. Quantiles are specific values or cut-points which help in partitioning the continuous-valued distribution of a specific numeric field into discrete contiguous bins or intervals. Thus, q-Quantiles help in partitioning a numeric attribute into q equal partitions. Popular examples of quantiles include the 2-Quantile known as the median which divides the data distribution into two equal bins, 4-Quantiles known as the quartiles which divide the data into 4 equal bins, and 10-Quantiles also known as the deciles which divide the data into 10 equal partitions. Let’s now look at the data distribution for the developer Income field.

fig, ax = plt.subplots()
fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3', 
                             edgecolor='black', grid=False)
ax.set_title('Developer Income Histogram', fontsize=12)
ax.set_xlabel('Developer Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
Histogram depicting developer income distribution

The above distribution depicts a right skew in the income, with fewer developers earning higher amounts and vice versa. Let’s take a 4-Quantile or quartile-based adaptive binning scheme. We can obtain the quartiles easily as follows.

quantile_list = [0, .25, .5, .75, 1.]
quantiles = fcc_survey_df['Income'].quantile(quantile_list)
quantiles
0.00      6000.0
0.25     20000.0
0.50     37000.0
0.75     60000.0
1.00    200000.0
Name: Income, dtype: float64

Let’s now visualize these quantiles in the original distribution histogram!

fig, ax = plt.subplots()
fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3', 
                             edgecolor='black', grid=False)

for quantile in quantiles:
    qvl = plt.axvline(quantile, color='r')
ax.legend([qvl], ['Quantiles'], fontsize=10)

ax.set_title('Developer Income Histogram with Quantiles', 
             fontsize=12)
ax.set_xlabel('Developer Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
Histogram depicting developer income distribution with quartile values

The red lines in the distribution above depict the quartile values and our potential bins. Let’s now leverage this knowledge to build our quartile-based binning scheme.

quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
fcc_survey_df['Income_quantile_range'] = pd.qcut(
    fcc_survey_df['Income'], q=quantile_list)
fcc_survey_df['Income_quantile_label'] = pd.qcut(
    fcc_survey_df['Income'], q=quantile_list, labels=quantile_labels)

fcc_survey_df[['ID.x', 'Age', 'Income', 'Income_quantile_range', 
               'Income_quantile_label']].iloc[4:9]
Quantile-based bin ranges and labels for developer incomes

This should give you a good idea of how quantile-based adaptive binning works. An important point to remember here is that the resultant outcome of binning leads to discrete-valued categorical features, and you might need an additional step of feature engineering on the categorical data before using it in any model. We will cover feature engineering strategies for categorical data shortly in the next section!
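As a quick sketch of that extra step (assuming the Income_quantile_label feature created above with pd.qcut), the quantile labels could be one-hot encoded with pandas before being fed to a model.

# a minimal sketch: turn the quantile bin labels into dummy features
# (assumes the Income_quantile_label column created with pd.qcut above)
income_bin_dummies = pd.get_dummies(fcc_survey_df['Income_quantile_label'], 
                                    prefix='Income')
fcc_survey_df[['ID.x', 'Income']].join(income_bin_dummies).iloc[4:9]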

Statistical Transformations

We talked about the adverse effects of skewed data distributions briefly earlier. Let’s look at a different strategy of feature engineering now by making use of statistical or mathematical transformations. We will look at the Log transform as well as the Box-Cox transform. Both of these transform functions belong to the Power Transform family of functions, typically used to create monotonic data transformations. Their main significance is that they help in stabilizing variance, adhering closely to the normal distribution, and making the data independent of the mean based on its distribution.

Log Transform

The log transform belongs to the power transform family of functions. This function can be mathematically represented as

y = log_b(x)

which reads as the log of x to the base b is equal to y. This can then be translated into

b^y = x

which indicates to what power the base b must be raised in order to get x. The natural logarithm uses b = e, where e = 2.71828… is popularly known as Euler’s number. You can also use base b = 10, used popularly in the decimal system.

Log transforms are useful when applied to skewed distributions as they tend to expand the values which fall in the range of lower magnitudes and tend to compress or reduce the values which fall in the range of higher magnitudes. This tends to make the skewed distribution as normal-like as possible. Let’s use log transform on our developer Income feature which we used earlier.

fcc_survey_df['Income_log'] = np.log((1+ fcc_survey_df['Income']))
fcc_survey_df[['ID.x', 'Age', 'Income', 'Income_log']].iloc[4:9]
Log transform on developer income

The Income_log field depicts the transformed feature after log transformation. Let’s look at the data distribution in this transformed field now.

income_log_mean = np.round(np.mean(fcc_survey_df['Income_log']), 2)

fig, ax = plt.subplots()
fcc_survey_df['Income_log'].hist(bins=30, color='#A9C5D3', 
                                 edgecolor='black', grid=False)
plt.axvline(income_log_mean, color='r')
ax.set_title('Developer Income Histogram after Log Transform', 
             fontsize=12)
ax.set_xlabel('Developer Income (log scale)', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.text(11.5, 450, r'$\mu$='+str(income_log_mean), fontsize=10)
Histogram depicting developer income distribution after log transform

Based on the above plot, we can clearly see that the distribution is more normal-like or gaussian as compared to the skewed distribution on the original data.

Box-Cox Transform

The Box-Cox transform is another popular function belonging to the power transform family of functions. This function has a prerequisite that the numeric values to be transformed must be positive (similar to what the log transform expects). In case they are negative, shifting by a constant value helps. Mathematically, the Box-Cox transform function can be denoted as follows.

y(λ) = (x^λ − 1) / λ,   for λ ≠ 0
y(λ) = log_e(x),        for λ = 0

Thus, the resultant transformed output is a function of the input x and the transformation parameter λ, such that when λ = 0, the resultant transform is the natural log transform which we discussed earlier. The optimal value of λ is usually determined using maximum likelihood or log-likelihood estimation. Let’s now apply the Box-Cox transform to our developer income feature. First, we get the optimal lambda value from the data distribution by removing the non-null values as follows.

income = np.array(fcc_survey_df['Income'])
income_clean = income[~np.isnan(income)]
l, opt_lambda = spstats.boxcox(income_clean)
print('Optimal lambda value:', opt_lambda)
Optimal lambda value: 0.117991239456

Now that we have obtained the optimal λ value, let us use the Box-Cox transform for two values of λ such that λ = 0 and λ = λ(optimal) and transform the developer Income feature.

fcc_survey_df['Income_boxcox_lambda_0'] = spstats.boxcox(
    (1 + fcc_survey_df['Income']), lmbda=0)
fcc_survey_df['Income_boxcox_lambda_opt'] = spstats.boxcox(
    fcc_survey_df['Income'], lmbda=opt_lambda)

fcc_survey_df[['ID.x', 'Age', 'Income', 'Income_log', 
               'Income_boxcox_lambda_0', 'Income_boxcox_lambda_opt']].iloc[4:9]
Developer income distribution after Box-Cox transform

The transformed features are depicted in the above data frame. Just like we expected, Income_log and Income_boxcox_lambda_0 have the same values. Let’s look at the distribution of the transformed Income feature after transforming with the optimal λ.

income_boxcox_mean = np.round(
    np.mean(fcc_survey_df['Income_boxcox_lambda_opt']), 2)

fig, ax = plt.subplots()
fcc_survey_df['Income_boxcox_lambda_opt'].hist(bins=30, 
                     color='#A9C5D3', edgecolor='black', grid=False)
plt.axvline(income_boxcox_mean, color='r')
ax.set_title('Developer Income Histogram after Box–Cox Transform', 
             fontsize=12)
ax.set_xlabel('Developer Income (Box–Cox transform)', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.text(24, 450, r'$\mu$='+str(income_boxcox_mean), fontsize=10)
Histogram depicting developer income distribution after Box-Cox transform

The distribution looks more normal-like similar to what we obtained after the log transform.

Feature Engineering on Categorical Data

In this section, we will look at another type of structured data, which is discrete in nature and is popularly termed as categorical data. Dealing with numeric data is often easier than categorical data given that we do not have to deal with additional complexities of the semantics pertaining to each category value in any data attribute which is of a categorical type. We will use a hands-on approach to discuss several encoding schemes for dealing with categorical data and also a couple of popular techniques for dealing with large-scale feature explosion, often known as the “curse of dimensionality”.

Understanding Categorical Data

Let’s get an idea about categorical data representations before diving into feature engineering strategies. Typically, any data attribute which is categorical in nature represents discrete values that belong to a specific finite set of categories or classes. These are also often known as classes or labels in the context of attributes or variables which are to be predicted by a model (popularly known as response variables). These discrete values can be text or numeric in nature (or even unstructured data like images!). There are two major classes of categorical data, nominal and ordinal.

In any nominal categorical data attribute, there is no concept of ordering amongst the values of that attribute. Consider a simple example of weather categories, as depicted in the following figure. We can see that we have six major classes or categories in this particular scenario without any concept or notion of order (windy doesn’t always occur before sunny nor is it smaller or bigger than sunny).

Weather as a categorical attribute

Similarly, movie, music, and video game genres, country names, food, and cuisine types are other examples of nominal categorical attributes. Ordinal categorical attributes have some sense or notion of order amongst their values. For instance, look at the following figure for shirt sizes. It is quite evident that order, or in this case ‘size’, matters when thinking about shirts (S is smaller than M, which is smaller than L, and so on).

Shirt size as an ordinal categorical attribute

Shoe sizes, education level, and employment roles are some other examples of ordinal categorical attributes. Having a decent idea about categorical data, let’s now look at some feature engineering strategies.

A lot of advancements have been made in various machine learning frameworks to accept complex categorical data types like text labels. Typically any standard workflow in feature engineering involves some form of transformation of these categorical values into numeric labels and then applying some encoding scheme to these values. We load up the necessary essentials before getting started.

import pandas as pd
import numpy as np

Transforming Nominal Attributes

Nominal attributes consist of discrete categorical values with no notion or sense of order amongst them. The idea here is to transform these attributes into a more representative numerical format that can be easily understood by downstream code and pipelines. Let’s look at a new dataset pertaining to video game sales. This dataset is also available on Kaggle as well as this GitHub repository.

vg_df = pd.read_csv('datasets/vgsales.csv', encoding='utf-8')
vg_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher']].iloc[1:7]
Dataset for video game sales

Let’s focus on the video game Genre attribute as depicted in the above data frame. It is quite evident that this is a nominal categorical attribute just like Publisher and Platform. We can easily get the list of unique video game genres as follows.

genres = np.unique(vg_df['Genre'])
genres

array(['Action', 'Adventure', 'Fighting', 'Misc', 'Platform',  
       'Puzzle', 'Racing', 'Role-Playing', 'Shooter', 'Simulation',  
       'Sports', 'Strategy'], dtype=object)

This tells us that we have 12 distinct video game genres. We can now generate a label encoding scheme for mapping each category to a numeric value by leveraging scikit-learn.

from sklearn.preprocessing import LabelEncoder

gle = LabelEncoder()
genre_labels = gle.fit_transform(vg_df['Genre'])
genre_mappings = {index: label for index, label in 
                  enumerate(gle.classes_)}
genre_mappings
{0: 'Action', 1: 'Adventure', 2: 'Fighting', 3: 'Misc',
 4: 'Platform', 5: 'Puzzle', 6: 'Racing', 7: 'Role-Playing',
 8: 'Shooter', 9: 'Simulation', 10: 'Sports', 11: 'Strategy'}

Thus a mapping scheme has been generated where each genre value is mapped to a number with the help of the LabelEncoder object gle. The transformed labels are stored in the genre_labels variable, which we can write back to our data frame.

vg_df['GenreLabel'] = genre_labels
vg_df[['Name', 'Platform', 'Year', 'Genre', 'GenreLabel']].iloc[1:7]
Video game genres with their encoded labels

These labels can often be used directly, especially with frameworks like scikit-learn, if you plan to use them as response variables for prediction; however, as discussed earlier, we will need an additional step of encoding on these before we can use them as features.

Transforming Ordinal Attributes

Ordinal attributes are categorical attributes with a sense of order amongst the values. Let’s consider our Pokémon dataset. Let’s focus more specifically on the Generation attribute.

poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
poke_df = poke_df.sample(random_state=1, 
                         frac=1).reset_index(drop=True)
np.unique(poke_df['Generation'])

array(['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6'], 
      dtype=object)

Based on the above output, we can see there are a total of 6 generations, and each Pokémon typically belongs to a specific generation based on the video games (when they were released); the television series follows a similar timeline. This attribute is typically ordinal (domain knowledge is necessary here) because most Pokémon belonging to Generation 1 were introduced earlier in the video games and the television shows than those of Generation 2, and so on. Fans can check out the following figure to remember some of the popular Pokémon of each generation (views may differ among fans!).

Popular Pokémon based on generation and type (source: https://www.reddit.com/r/pokemon/comments/2s2upx/heres_my_favorite_pokemon_by_type_and_gen_chart)

Hence they have a sense of order amongst them. In general, there is no generic module or function to map and transform these features into numeric representations based on order automatically. Hence we can use a custom encoding\mapping scheme.

gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3, 
               'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}
poke_df['GenerationLabel'] = poke_df['Generation'].map(gen_ord_map)
poke_df[['Name', 'Generation', 'GenerationLabel']].iloc[4:10]
Pokémon generation encoding

It is quite evident from the above code that the map(…) function from pandas is quite helpful in transforming this ordinal feature.

Encoding Categorical Attributes

If you remember what we mentioned earlier, typically feature engineering on categorical data involves a transformation process which we depicted in the previous section, and a compulsory encoding process where we apply specific encoding schemes to create dummy variables or features for each category\value in a specific categorical attribute.

You might be wondering: we just converted categories to numerical labels in the previous section, so why on earth do we need this now? The reason is quite simple. Considering video game genres, if we directly fed the GenreLabel attribute as a feature in a machine learning model, it would consider it to be a continuous numeric feature, thinking the value 10 (Sports) is greater than 6 (Racing), but that is meaningless because the Sports genre is certainly not bigger or smaller than Racing; these are essentially different values or categories which cannot be compared directly. Hence we need an additional layer of encoding schemes where dummy features are created for each unique value or category out of all the distinct categories per attribute.

One-hot Encoding Scheme

Considering we have the numeric representation of any categorical attribute with m labels (after transformation), the one-hot encoding scheme encodes or transforms the attribute into m binary features, which can only contain a value of 1 or 0. Each observation in the categorical feature is thus converted into a vector of size m with only one of the values as 1 (indicating it as active). Let’s take a subset of our Pokémon dataset depicting two attributes of interest.

poke_df[['Name', 'Generation', 'Legendary']].iloc[4:10]
A subset of our Pokémon dataset

The attributes of interest are Pokémon Generation and their Legendary status. The first step is to transform these attributes into numeric representations based on what we learned earlier.

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# transform and map pokemon generations
gen_le = LabelEncoder()
gen_labels = gen_le.fit_transform(poke_df['Generation'])
poke_df['Gen_Label'] = gen_labels
# transform and map pokemon legendary status
leg_le = LabelEncoder()
leg_labels = leg_le.fit_transform(poke_df['Legendary'])
poke_df['Lgnd_Label'] = leg_labels
poke_df_sub = poke_df[['Name', 'Generation', 'Gen_Label',  
                       'Legendary', 'Lgnd_Label']]
poke_df_sub.iloc[4:10]
Attributes with transformed (numeric) labels

The features Gen_Label and Lgnd_Label now depict the numeric representations of our categorical features. Let’s now apply the one-hot encoding scheme to these features.

# encode generation labels using one-hot encoding scheme
gen_ohe = OneHotEncoder()
gen_feature_arr = gen_ohe.fit_transform(
                              poke_df[['Gen_Label']]).toarray()
gen_feature_labels = list(gen_le.classes_)
gen_features = pd.DataFrame(gen_feature_arr, 
                            columns=gen_feature_labels)

# encode legendary status labels using one-hot encoding scheme
leg_ohe = OneHotEncoder()
leg_feature_arr = leg_ohe.fit_transform(
                              poke_df[['Lgnd_Label']]).toarray()
leg_feature_labels = ['Legendary_'+str(cls_label) 
                           for cls_label in leg_le.classes_]
leg_features = pd.DataFrame(leg_feature_arr, 
                            columns=leg_feature_labels)

In general, you can always encode both the features together using the fit_transform(…) function by passing it a two-dimensional array of the two features together (Check out the documentation!). But we encode each feature separately, to make things easier to understand. Besides this, we can also create separate data frames and label them accordingly. Let’s now concatenate these feature frames and see the final result.

poke_df_ohe = pd.concat([poke_df_sub, gen_features, leg_features], axis=1)
columns = sum([['Name', 'Generation', 'Gen_Label'],   
               gen_feature_labels, ['Legendary', 'Lgnd_Label'], 
               leg_feature_labels], [])
poke_df_ohe[columns].iloc[4:10]
One-hot encoded features for Pokémon generation and legendary status

Thus you can see that 6 dummy variables or binary features have been created for Generation and 2 for Legendary, since those are the total number of distinct categories in each of these attributes respectively. The active state of a category is indicated by the value 1 in one of these dummy variables, which is quite evident from the above data frame.

Suppose you built this encoding scheme on your training data, trained a model, and now you have some new data that has to be engineered into features before predictions, as follows.

new_poke_df = pd.DataFrame([['PikaZoom', 'Gen 3', True], 
                           ['CharMyToast', 'Gen 4', False]],
                       columns=['Name', 'Generation', 'Legendary'])
Sample new data

You can leverage scikit-learn’s excellent API here by calling the transform(…) function of the previously built LabelEncoder and OneHotEncoder objects on the new data. Remember our workflow: first, we do the transformation.

new_gen_labels = gen_le.transform(new_poke_df['Generation'])
new_poke_df['Gen_Label'] = new_gen_labels
new_leg_labels = leg_le.transform(new_poke_df['Legendary'])
new_poke_df['Lgnd_Label'] = new_leg_labels
new_poke_df[['Name', 'Generation', 'Gen_Label', 'Legendary', 
             'Lgnd_Label']]
Categorical attributes after transformation

Once we have numerical labels, let’s apply the encoding scheme now!

new_gen_feature_arr = gen_ohe.transform(new_poke_df[['Gen_Label']]).toarray()
new_gen_features = pd.DataFrame(new_gen_feature_arr, 
                                columns=gen_feature_labels)
new_leg_feature_arr = leg_ohe.transform(new_poke_df[['Lgnd_Label']]).toarray()
new_leg_features = pd.DataFrame(new_leg_feature_arr, 
                                columns=leg_feature_labels)

new_poke_ohe = pd.concat([new_poke_df, new_gen_features, new_leg_features], axis=1)
columns = sum([['Name', 'Generation', 'Gen_Label'], 
               gen_feature_labels,
               ['Legendary', 'Lgnd_Label'], leg_feature_labels], [])
new_poke_ohe[columns]
Categorical attributes after one-hot encoding

Thus you can see it’s quite easy to apply this scheme to new data by leveraging scikit-learn’s powerful API.

You can also apply the one-hot encoding scheme easily by leveraging the get_dummies(…) function from pandas.

gen_onehot_features = pd.get_dummies(poke_df['Generation'])
pd.concat([poke_df[['Name', 'Generation']], gen_onehot_features], 
          axis=1).iloc[4:10]
One-hot encoded features by leveraging pandas

The above data frame depicts the one-hot encoding scheme applied to the Generation attribute and the results are the same as compared to the earlier results as expected.

Dummy Coding Scheme

The dummy coding scheme is similar to the one-hot encoding scheme, except that in the dummy coding scheme, when applied to a categorical feature with m distinct labels, we get m – 1 binary features. Thus each value of the categorical variable gets converted into a vector of size m – 1. The extra feature is completely disregarded and thus, if the category values range over {0, 1, …, m – 1}, the 0th or the (m – 1)th feature column is dropped and the corresponding category value is usually represented by a vector of all zeros (0). Let’s try applying a dummy coding scheme to Pokémon Generation by dropping the first level binary encoded feature (Gen 1).

gen_dummy_features = pd.get_dummies(poke_df['Generation'], 
                                    drop_first=True)
pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features], 
          axis=1).iloc[4:10]
Dummy coded features for Pokémon generation

If you want, you can also choose to drop the last level binary encoded feature (Gen 6) as follows.

gen_onehot_features = pd.get_dummies(poke_df['Generation'])
gen_dummy_features = gen_onehot_features.iloc[:,:-1]
pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features],  
          axis=1).iloc[4:10]
Dummy coded features for Pokémon generation

Based on the above depictions, it is quite clear that categories belonging to the dropped feature are represented as a vector of zeros (0) like we discussed earlier.

Effect Coding Scheme

The effect coding scheme is actually very similar to the dummy coding scheme, except that during the encoding process, the encoded feature vector for the category value which is represented by all 0s in the dummy coding scheme is replaced by -1s in the effect coding scheme. This will become clearer with the following example.

gen_onehot_features = pd.get_dummies(poke_df['Generation'])
gen_effect_features = gen_onehot_features.iloc[:,:-1]
gen_effect_features.loc[np.all(gen_effect_features == 0, 
                               axis=1)] = -1.
pd.concat([poke_df[['Name', 'Generation']], gen_effect_features], 
          axis=1).iloc[4:10]
Effect coded features for Pokémon generation

The above output clearly shows that the Pokémon belonging to Generation 6 are now represented by a vector of -1 values as compared to zeros in dummy coding.

Bin-counting Scheme

The encoding schemes we discussed so far work quite well on categorical data in general, but they start causing problems when the number of distinct categories in any feature becomes very large. Essentially, for any categorical feature with m distinct labels, you get m separate features. This can easily increase the size of the feature set, causing problems like storage issues and model training problems with regard to time, space, and memory. Besides this, we also have to deal with what is popularly known as the ‘curse of dimensionality’, where, with an enormous number of features and not enough representative samples, model performance starts getting affected, often leading to overfitting.

Hence we need to look towards other categorical data feature engineering schemes for features having a large number of possible categories (like IP addresses). The bin-counting scheme is a useful scheme for dealing with categorical variables having many categories. In this scheme, instead of using the actual label values for encoding, we use probability-based statistical information about the value and the actual target or response value which we aim to predict in our modeling efforts. A simple example would be based on past historical data for IP addresses and the ones which were used in DDOS attacks; we can build probability values for a DDOS attack being caused by any of the IP addresses. Using this information, we can encode an input feature that depicts, if the same IP address comes up in the future, the probability of a DDOS attack being caused. This scheme needs historical data as a prerequisite and is an elaborate one; a minimal sketch of the core idea follows below, and there are several more detailed resources online that you can refer to.
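The example below is a hypothetical illustration (the data frame, IP addresses, and attack labels are made up purely for illustration): we estimate the probability of a DDOS attack per IP address from historical counts and use that probability to encode new data.

# hypothetical historical data: IP addresses and whether each request
# was part of a DDOS attack (1) or not (0)
history = pd.DataFrame({
    'ip_address': ['10.0.0.1', '10.0.0.1', '10.0.0.2',
                   '10.0.0.2', '10.0.0.2', '10.0.0.3'],
    'ddos_attack': [1, 0, 1, 1, 0, 0]})

# bin-counting style encoding: estimated probability of attack per IP address
attack_prob = history.groupby('ip_address')['ddos_attack'].mean()

# encode new incoming data with the historically estimated probabilities
new_requests = pd.DataFrame({'ip_address': ['10.0.0.2', '10.0.0.3']})
new_requests['ip_attack_prob'] = new_requests['ip_address'].map(attack_prob)
new_requests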

Feature Hashing Scheme

The feature hashing scheme is another useful feature engineering scheme for dealing with large-scale categorical features. In this scheme, a hash function is typically used with the number of encoded features pre-set (as a vector of pre-defined length) such that the hashed values of the features are used as indices in this pre-defined vector and the values are updated accordingly. Since a hash function maps a large number of values into a small finite set of values, multiple different values might produce the same hash, which is termed a collision. Typically, a signed hash function is used so that the sign of the value obtained from the hash is used as the sign of the value stored in the final feature vector at the appropriate index. This should ensure fewer collisions and less accumulation of error due to collisions.

Hashing schemes work on strings, numbers, and other structures like vectors. You can think of hashed outputs as a finite set of b bins such that, when the hash function is applied on the same values\categories, they get assigned to the same bin (or subset of bins) out of the b bins based on the hash value. We can pre-define the value of b, which becomes the final size of the encoded feature vector for each categorical attribute that we encode using the feature hashing scheme.
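To make the mechanics concrete, here is a minimal sketch of a signed hashing trick. It uses Python’s hashlib for illustration rather than the MurmurHash3 function scikit-learn uses internally, so the actual bin assignments and signs will differ from FeatureHasher’s output.

import hashlib
import numpy as np

def hash_category(value, b=6):
    # map a single categorical value into a signed vector of length b
    vec = np.zeros(b)
    # derive a stable integer from the category value
    digest = int(hashlib.md5(value.encode('utf-8')).hexdigest(), 16)
    index = digest % b                            # which of the b bins the value falls into
    sign = 1 if (digest // b) % 2 == 0 else -1    # signed hashing reduces bias from collisions
    vec[index] += sign
    return vec

print(hash_category('Action'))   # same input always yields the same vector
print(hash_category('Sports'))   # a different category (usually) lands in a different bin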

Thus even if we have over 1000 distinct categories in a feature and we set b=10 as the final feature vector size, the output feature set will still have only 10 features as compared to 1000 binary features if we used a one-hot encoding scheme. Let’s consider the Genre attribute in our video game dataset.

unique_genres = np.unique(vg_df[['Genre']])
print("Total game genres:", len(unique_genres))
Total game genres: 12
['Action' 'Adventure' 'Fighting' 'Misc' 'Platform' 'Puzzle' 'Racing'
 'Role-Playing' 'Shooter' 'Simulation' 'Sports' 'Strategy']

We can see that there are a total of 12 genres of video games. If we used a one-hot encoding scheme on the Genre feature, we would end up having 12 binary features. Instead, we will now use a feature hashing scheme by leveraging scikit-learn’s FeatureHasher class, which uses a signed 32-bit version of the Murmurhash3 hash function. We will pre-define the final feature vector size to be 6 in this case.

from sklearn.feature_extraction import FeatureHasher
fh = FeatureHasher(n_features=6, input_type='string')
hashed_features = fh.fit_transform(vg_df['Genre'])
hashed_features = hashed_features.toarray()
pd.concat([vg_df[['Name', 'Genre']], pd.DataFrame(hashed_features)], 
          axis=1).iloc[1:7]
Feature Hashing on the Genre attribute

Based on the above output, the Genre categorical attribute has been encoded into 6 features using the hashing scheme instead of 12. We can also see that rows denoting the same genre of games (Platform) have rightly been encoded into the same feature vector.


Feature engineering is a very important aspect of machine learning and data science and should never be ignored. While we have automated feature engineering methodologies like deep learning, as well as automated machine learning frameworks like AutoML (which still stresses that it requires good features to work well!), feature engineering is here to stay, and even some of these automated methodologies often require specific engineered features based on the data type, domain, and the problem to be solved.

We looked at popular strategies for feature engineering on continuous numeric data and discrete categorical data in this post.


