Data analysis is an essential part of any research or business endeavor, and one of the most fundamental techniques is measuring the correlation between variables. While there are various methods for measuring correlation, in this post we will focus on two statistical tests in Python: ANOVA, for measuring the correlation between a numerical and a categorical variable, and Chi-Square, for measuring the correlation between two categorical variables. By the end of this post, you will have a deeper understanding of how to use these tests to analyze your data and make more informed decisions based on your findings.
This scenario often arises when we are doing regression or classification in machine learning: we have a numeric variable on one side and a categorical variable on the other, and we want to know whether they are related.
In both cases, the ANOVA test can be used to measure the strength of the relationship. ANOVA stands for Analysis Of Variance; it determines whether there are significant differences between the means of the numeric variable across the categories. A box plot is a useful visual companion to this test.
Null hypothesis (H0) of the ANOVA test: the variables are not correlated with each other.
In the example below, we measure whether there is any correlation between FuelType and CarPrice. Here FuelType is a categorical predictor and CarPrice is the numeric target variable.
# Generating sample data
import pandas as pd
ColumnNames=['FuelType','CarPrice']
DataValues= [
[ 'Petrol', 2000],
[ 'Petrol', 2100],
[ 'Petrol', 1900],
[ 'Petrol', 2150],
[ 'Petrol', 2100],
[ 'Petrol', 2200],
[ 'Petrol', 1950],
[ 'Diesel', 2500],
[ 'Diesel', 2700],
[ 'Diesel', 2900],
[ 'Diesel', 2850],
[ 'Diesel', 2600],
[ 'Diesel', 2500],
[ 'Diesel', 2700],
[ 'CNG', 1500],
[ 'CNG', 1400],
[ 'CNG', 1600],
[ 'CNG', 1650],
[ 'CNG', 1600],
[ 'CNG', 1500],
[ 'CNG', 1500]
]
#Create the Data Frame
CarData=pd.DataFrame(data=DataValues,columns=ColumnNames)
print(CarData.head())
########################################################
# f_oneway() function takes the group data as input and
# returns F-statistic and P-value
from scipy.stats import f_oneway
# Running the one-way anova test between CarPrice and FuelTypes
# Assumption(H0) is that FuelType and CarPrices are NOT correlated
# Finds out the Prices data for each FuelType as a list
CategoryGroupLists=CarData.groupby('FuelType')['CarPrice'].apply(list)
# Performing the ANOVA test
# We accept the Assumption(H0) only when P-Value > 0.05
AnovaResults = f_oneway(*CategoryGroupLists)
print('P-Value for Anova is: ', AnovaResults[1])
Sample Output:
Since the p-value is almost zero, we reject H0. This means the variables are indeed correlated with each other.
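As mentioned above, a box plot is a handy visual companion to the ANOVA test. Here is a minimal sketch (assuming seaborn and matplotlib are available) that plots CarPrice for each FuelType using the CarData frame created above:
import seaborn as sns
import matplotlib.pyplot as plt
# Box plot of CarPrice for each FuelType; clearly separated boxes hint at the
# group mean differences that the ANOVA test quantifies
sns.boxplot(x='FuelType', y='CarPrice', data=CarData)
plt.show()
If the boxes barely overlap, as in this data, we should expect a very small p-value from the ANOVA test.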
In classification machine learning, it is common to encounter a scenario where the target variable is categorical and the predictors can be either continuous or categorical. When both the target variable and predictors are categorical, a Chi-square test can be employed to gauge the strength of their relationship.
The Chi-square test produces a p-value that we use to decide whether to reject the null hypothesis (H0).
It helps us understand whether two categorical variables are correlated with each other or not. In the scenario below, we measure the correlation between GENDER and APPROVE_LOAN.
# Creating a sample data frame
import pandas as pd
ColumnNames=['CIBIL','AGE','GENDER' ,'SALARY', 'APPROVE_LOAN']
DataValues=[ [480, 28, 'M', 610000, 'Yes'],
[480, 42, 'M',140000, 'No'],
[480, 29, 'F',420000, 'No'],
[490, 30, 'M',420000, 'No'],
[500, 27, 'M',420000, 'No'],
[510, 34, 'F',190000, 'No'],
[550, 24, 'M',330000, 'Yes'],
[560, 34, 'M',160000, 'Yes'],
[560, 25, 'F',300000, 'Yes'],
[570, 34, 'M',450000, 'Yes'],
[590, 30, 'F',140000, 'Yes'],
[600, 33, 'M',600000, 'Yes'],
[600, 22, 'M',400000, 'Yes'],
[600, 25, 'F',490000, 'Yes'],
[610, 32, 'M',120000, 'Yes'],
[630, 29, 'F',360000, 'Yes'],
[630, 30, 'M',480000, 'Yes'],
[660, 29, 'F',460000, 'Yes'],
[700, 32, 'M',470000, 'Yes'],
[740, 28, 'M',400000, 'Yes']]
#Create the Data Frame
LoanData=pd.DataFrame(data=DataValues,columns=ColumnNames)
print(LoanData.head())
#########################################################
# Cross tabulation between GENDER and APPROVE_LOAN
CrosstabResult=pd.crosstab(index=LoanData['GENDER'],columns=LoanData['APPROVE_LOAN'])
print(CrosstabResult)
# importing the required function
from scipy.stats import chi2_contingency
# Performing Chi-sq test
ChiSqResult = chi2_contingency(CrosstabResult)
# Under H0, the p-value is the chance of seeing a result at least this extreme
# If P-Value>0.05 then only we Accept the assumption(H0)
print('The P-Value of the ChiSq Test is:', ChiSqResult[1])
Sample Output:
H0 for the Chi-square test: the variables are not correlated with each other.
In the above example, the p-value came out higher than 0.05, so we fail to reject H0: the variables are not correlated with each other. Conversely, if two variables are strongly correlated, the p-value will come out very close to zero.
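For completeness, the chi2_contingency() function actually returns four values: the chi-square statistic, the p-value, the degrees of freedom, and the table of expected frequencies. A quick sketch of unpacking them, reusing the CrosstabResult from above:
chi2_stat, p_value, dof, expected = chi2_contingency(CrosstabResult)
print('Chi-square statistic :', chi2_stat)
print('Degrees of freedom   :', dof)
print('Expected frequencies :')
print(expected)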
In this section, I will explain how we can test two categorical columns in a dataset to determine whether they are dependent on each other (i.e. correlated). We will use a statistical test known as chi-square (commonly written as χ2). Before we start, here is a quick recap of which test applies to which variable types: the chi-square test is used when both variables are categorical, while ANOVA (discussed above and revisited later) is used when one variable is categorical and the other numerical.
The chi-square (χ2) statistic is a way to check the relationship between two categorical nominal variables. Nominal variables contain values that have no intrinsic ordering; examples are sex, race, eye color, and skin color. Ordinal variables, on the other hand, contain values that are ordered; examples are grade, education level, and economic status.
The key idea behind the chi-square test is to compare the observed values in the data to the expected values and see if they are related or not. In particular, it is a useful way to check if two categorical nominal variables are correlated. This is particularly important in machine learning where we only want features that are correlated to the target to be used for training.
There are two types of chi-square tests: the chi-square goodness of fit test and the chi-square test of independence.
Check out https://www.jmp.com/en_us/statistics-knowledge-portal/chi-square-test.html for a more detailed discussion of the above two chi-square tests. When comparing to see if two categorical variables are correlated, we will use the Chi-Square Test of Independence.
To use the chi-square test, we need to perform the following steps:
1. Define the null hypothesis (H0) and the alternate hypothesis (H1).
2. Decide on the α value. This is the risk we are willing to take in drawing the wrong conclusion. For example, if we set α=0.05 when testing for independence, we are taking a 5% risk of concluding that the two variables are dependent when in reality they are independent.
3. Calculate the chi-square score from the two categorical variables and use it, together with the degrees of freedom, to calculate the p-value. A low p-value means the two categorical variables are highly correlated (they are dependent on each other); the p-value tells us whether the test result is significant or not.
In a chi-square analysis, the p-value is the probability of obtaining a chi-square as large or larger than that in the current experiment and yet the data will still support the hypothesis. It is the probability of deviations from what was expected being due to mere chance. In general a p-value of 0.05 or greater is considered critical, anything less means the deviations are significant and the hypothesis being tested must be rejected.
Source: https://passel2.unl.edu/view/lesson/9beaa382bf7e/8
To calculate the p-value, we need two pieces of information: the chi-square score and the degrees of freedom.
If the p-value obtained is less than 0.05, we reject H0 and conclude that the two variables are dependent; if it is 0.05 or greater, we fail to reject H0 and treat them as independent.
In the case of feature selection for machine learning, we would want the feature that is being compared to the target to have a low p-value (less than 0.05), as this means that the feature is dependent on (correlated with) the target.
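As a sketch of how this decision rule could be wrapped up for feature selection (the helper name below is my own, not from the original post):
from scipy.stats import chi2_contingency
import pandas as pd

def is_feature_dependent(data, feature, target, alpha=0.05):
    # Build the contingency table and run the chi-square test of independence
    contingency = pd.crosstab(data[feature], data[target])
    _, p_value, _, _ = chi2_contingency(contingency)
    # Keep the feature only if the p-value suggests dependence on the target
    return p_value < alpha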
With the chi-square score, we can also refer to a chi-square table to see whether the score falls within the rejection region or the acceptance region. In the next section, I will use the Titanic dataset, apply the chi-square test to a few of the features, and see how they are correlated to the target.
A good way to understand a new topic is to go through the concepts using an example. For this, I am going to use the classic Titanic dataset (https://www.kaggle.com/tedllh/titanic-train).
The Titanic dataset is often used in machine learning to demonstrate how to build a model and use it to make predictions. In particular, the dataset contains several features (Pclass, Sex, Age, Embarked, etc.) and one target (Survived). Several of these features, such as Pclass, Sex, and Embarked, are categorical variables.
Because this section explores the relationships between categorical features and targets, we are only interested in those columns that contain categorical values.
Let’s load the dataset in a Pandas DataFrame:
import pandas as pd
import numpy as np
df = pd.read_csv('titanic_train.csv')
df.sample(5)
There are some columns that are not really useful and hence we will proceed to drop them. Also, there are some missing values so let’s drop all those rows with empty values:
df.drop(columns=['PassengerId','Name', 'Ticket','Fare','Cabin'],
        inplace=True)
df.dropna(inplace=True)
df
We will also add one more column named Alone, based on the Parch (Parent or children) and Sibsp (Siblings or spouse) columns. The idea we want to explore is if being alone affects the survivability of the passenger. So Alone is 1 if both Parch and Sibsp are 0, else it is 0:
df['Alone'] = (df['Parch'] + df['SibSp']).apply(
lambda x: 1 if x == 0 else 0)
df
Now that the data is cleaned, let’s try to visualize how the sex of passengers is related to their survival in the accident:
import seaborn as sns
sns.barplot(x='Sex', y='Survived', data=df, ci=None)
The Sex column contains nominal data (i.e. ranking is not important).
From the figure above, we can see that more than 70% of the female passengers survived, while only about 20% of the male passengers did. There seems to be a very strong relationship between the Sex and Survived features; we will confirm this with the chi-square test later on.
How about Pclass and Survived? Are they related?
sns.barplot(x='Pclass', y='Survived', data=df, ci=None)
Perhaps unsurprisingly, it shows that the higher the passenger's class (Pclass 1 being the highest class), the higher the survival rate. The next feature of interest is whether the place of embarkation determines who survives and who doesn't:
sns.barplot(x='Embarked', y='Survived', data=df, ci=None)
From the chart, it seems like more people who embarked from C (Cherbourg) survived.
C = Cherbourg; Q = Queenstown; S = Southampton
We also want to know if being alone on the trip makes one more survivable:
ax = sns.barplot(x='Alone', y='Survived', data=df, ci=None)
ax.set_xticklabels(['Not Alone','Alone'])
We can see that passengers travelling with family had a higher chance of survival than those travelling alone.
Now that we have visualized the relationships between the categorical features against the target (Survived), we want to now visualize the relationships between each feature. Before we can do that, we need to convert the label values in the Sex and Embarked columns to numeric. To do that, we can make use of the LabelEncoder class in sklearn:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df['Sex'])
df['Sex'] = le.transform(df['Sex'])
sex_labels = dict(zip(le.classes_, le.transform(le.classes_)))
print(sex_labels)
le.fit(df['Embarked'])
df['Embarked'] = le.transform(df['Embarked'])
embarked_labels = dict(zip(le.classes_,
le.transform(le.classes_)))
print(embarked_labels)
The above code snippet label-encodes the Sex and Embarked columns. The output shows the mappings of the values for each column, which is very useful later when performing predictions:
{'female': 0, 'male': 1}
{'C': 0, 'Q': 1, 'S': 2}
The following statements show the relationship between Embarked and Sex:
ax = sns.barplot(x='Embarked', y='Sex', data=df, ci=None)
ax.set_xticklabels(embarked_labels.keys())
It seems that proportionally more males boarded at Southampton (S) than at Queenstown (Q) or Cherbourg (C).
How about Embarked and Alone?
ax = sns.barplot(x='Embarked', y='Alone', data=df, ci=None)
ax.set_xticklabels(embarked_labels.keys())
Seems like a large proportion of those who embarked from Queenstown are alone.
And finally, let’s see the relationship between Sex and Alone:
ax = sns.barplot(x='Sex', y='Alone', data=df, ci=None)
ax.set_xticklabels(sex_labels.keys())
As we can see, a higher proportion of the male passengers than the female passengers were travelling alone on the trip.
We now define the null hypothesis and alternate hypothesis. As explained earlier, they are:
H0: the two categorical variables are independent (not correlated with each other).
H1: the two categorical variables are dependent (correlated with each other).
And we draw conclusions based on the following p-value conditions: if p < 0.05, we reject H0 and conclude that the variables are dependent; if p >= 0.05, we fail to reject H0 and treat them as independent.
Let’s manually go through the steps in calculating the χ2 values. The first step is to create a contingency table. Using the Sex and Survived columns as an example, we first create a contingency table:
The contingency table above displays the frequency distribution of the two categorical columns — Sex and Survived.
The degrees of freedom are calculated as (number of rows - 1) * (number of columns - 1). In this example, the degrees of freedom are (2 - 1) * (2 - 1) = 1.
Once the contingency table is created, sum up all the rows and columns, like this:
The table above contains the Observed (O) values.
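Pandas can produce this table of observed counts, including the row and column totals, in one call (a sketch using the label-encoded df from above; margins=True appends the totals as an 'All' row and column):
observed = pd.crosstab(df['Sex'], df['Survived'], margins=True)
print(observed)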
Next, we are going to calculate the Expected (E) values. The expected count for each cell is its row total multiplied by its column total, divided by the grand total: E = (row total × column total) / grand total.
The following figure shows how the first value is calculated:
The next figure shows how the second value is calculated:
Here is the result for the Expected values:
Then, calculate the chi-square value for each cell using the χ2 formula: (O − E)² / E, where O is the observed count and E is the expected count for that cell.
Applying this formula to the Observed and Expected values, we get the chi-square values:
The chi-square score is the grand total of the chi-square values:
We can use an online chi-square calculator, or scipy as shown in the sketch below, to verify that the numbers are correct:
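Here is a quick check with scipy's chi2_contingency() function. Note that it applies Yates' continuity correction by default for 2x2 tables, so we pass correction=False to match the by-hand calculation:
from scipy.stats import chi2_contingency

observed = pd.crosstab(df['Sex'], df['Survived'])
chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2_stat, dof, p_value)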
The Python implementation for the above steps is contained within the following chi2_by_hand() function:
def chi2_by_hand(df, col1, col2):
    #---create the contingency table---
    df_cont = pd.crosstab(index = df[col1], columns = df[col2])
    display(df_cont)

    #---calculate degree of freedom---
    degree_f = (df_cont.shape[0]-1) * (df_cont.shape[1]-1)

    #---sum up the totals for row and columns---
    df_cont.loc[:,'Total'] = df_cont.sum(axis=1)
    df_cont.loc['Total'] = df_cont.sum()
    print('---Observed (O)---')
    display(df_cont)

    #---create the expected value dataframe---
    df_exp = df_cont.copy()
    df_exp.iloc[:,:] = np.multiply.outer(
        df_cont.sum(1).values, df_cont.sum().values) / df_cont.sum().sum()
    print('---Expected (E)---')
    display(df_exp)

    # calculate chi-square values
    df_chi2 = ((df_cont - df_exp)**2) / df_exp
    df_chi2.loc[:,'Total'] = df_chi2.sum(axis=1)
    df_chi2.loc['Total'] = df_chi2.sum()
    print('---Chi-Square---')
    display(df_chi2)

    #---get chi-square score---
    chi_square_score = df_chi2.iloc[:-1,:-1].sum().sum()

    return chi_square_score, degree_f
The chi2_by_hand() function takes in three arguments: the dataframe containing all the columns, followed by two strings with the names of the two columns we are comparing. It returns a tuple containing the chi-square score and the degrees of freedom.
Let’s now test the above function using the Titanic dataset. First, let’s compare the Sex and the Survived columns:
chi_score, degree_f = chi2_by_hand(df,'Sex','Survived')
print(f'Chi2_score: {chi_score}, Degrees of freedom: {degree_f}')
Chi2_score: 205.1364846934008, Degrees of freedom: 1
Using the chi-square score, we can now decide if we will accept or reject the null hypothesis using the chi-square distribution curve:
The x-axis represents the χ2 score. The area that is to the right of the critical chi-square region is known as the rejection region. The area to the left of it is known as the acceptance region. If the chi-square score that we have obtained falls in the acceptance region, the null hypothesis is accepted; else the alternate hypothesis is accepted.
So how do we obtain the critical chi-square region? For this, we have to check the chi-square table:
We can check out the Chi-Square Table at https://www.mathsisfun.com/data/chi-square-table.html
This is how we use the chi-square table. With α set to 0.05 and 1 degree of freedom, the critical chi-square value is 3.84 (refer to the chart above). Since our chi-square score of about 205.14 is far greater than 3.84, it falls in the rejection region, and we therefore reject the null hypothesis.
This means that the Sex and Survived columns are dependent on each other. We can apply the chi2_by_hand() function to the other features in the same way.
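Rather than looking up the chi-square table, we can also obtain the critical value programmatically (a sketch using scipy's chi-square distribution and the degree_f returned above):
from scipy import stats

alpha = 0.05
critical_value = stats.chi2.ppf(1 - alpha, df=degree_f)
print(critical_value)   # approximately 3.84 for 1 degree of freedom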
The previous section shows how we can accept or reject the null hypothesis by examining the chi-square score and comparing it with the chi-square distribution curve. An alternative way to accept or reject the null hypothesis is by using the p-value. Remember, the p-value can be calculated using the chi-square score and the degrees of freedom. For simplicity, we shall not go into the details of how to calculate the p-value by hand.
In Python, we can calculate the p-value using the stats module’s sf() function:
def chi2_by_hand(df, col1, col2):
    #---create the contingency table---
    df_cont = pd.crosstab(index = df[col1], columns = df[col2])
    display(df_cont)

    ...

    chi_square_score = df_chi2.iloc[:-1,:-1].sum().sum()

    #---calculate the p-value---
    from scipy import stats
    p = stats.distributions.chi2.sf(chi_square_score, degree_f)

    return chi_square_score, degree_f, p
We can now call the chi2_by_hand() function and obtain the chi-square score, the degrees of freedom, and the p-value:
chi_score, degree_f, p = chi2_by_hand(df,'Sex','Survived')
print(f'Chi2_score: {chi_score}, Degrees of freedom: {degree_f}, p-value: {p}')
The above code results in the following p-value:
Chi2_score: 205.1364846934008, Degrees of freedom: 1, p-value: 1.581266384342472e-46
As a quick recap, we accept or reject the hypotheses based on the following p-value conditions: if p < 0.05, we reject H0 (the variables are dependent); if p >= 0.05, we fail to reject H0 (the variables are independent).
Since p < 0.05 here, the two categorical variables are correlated.
Let’s try out the remaining categorical columns:
chi_score, degree_f, p = chi2_by_hand(df,'Embarked','Survived')
print(f'Chi2_score: {chi_score}, Degrees of freedom: {degree_f}, p-value: {p}')
# Chi2_score: 27.918691003688615, Degrees of freedom: 2,
# p-value: 8.660306799267924e-07
chi_score, degree_f, p = chi2_by_hand(df,'Alone','Survived')
print(f'Chi2_score: {chi_score}, Degrees of freedom: {degree_f}, p-value: {p}')
# Chi2_score: 28.406341862069905, Degrees of freedom: 1,
# p-value: 9.834262807301776e-08
Since the p-values for both Embarked and Alone are < 0.05, we can conclude that both the Embarked and Alone features are correlated to the Survived target, and should be included for training in our model.
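Putting it together, we could loop over the candidate categorical features and keep those whose p-value against Survived is below 0.05 (a sketch reusing the chi2_by_hand() function defined above):
candidate_features = ['Sex', 'Pclass', 'Embarked', 'Alone']
selected_features = []
for feature in candidate_features:
    # chi2_by_hand() also prints the intermediate tables for each feature
    _, _, p = chi2_by_hand(df, feature, 'Survived')
    if p < 0.05:
        selected_features.append(feature)
print('Features correlated with Survived:', selected_features)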
A few notes of caution would be useful here:
The chi-square test is used when both the independent and dependent variables are all categorical variables. However, what if the independent variable is categorical and the dependent variable is numerical? In this case, we have to use another statistic test known as ANOVA — Analysis of Variance.
And so in this section, our discussion will revolve around ANOVA and how we use it in machine learning for feature selection. Before we get started, it is useful to recall the methods discussed so far: the chi-square test for two categorical variables, and ANOVA for a categorical variable against a numerical variable.
ANOVA is used for testing two variables, where one is a categorical variable and the other is a numerical variable.
ANOVA is used when the categorical variable has at least three groups (i.e. three different unique values); if we want to compare just two groups, we can use the t-test instead (see the sketch below). ANOVA tells us whether the numerical variable changes according to the level of the categorical variable. It uses F-tests to statistically test the equality of means; F-tests are named after their test statistic, F, which was named in honor of Sir Ronald Fisher.
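For the two-group case mentioned above, here is a minimal sketch with scipy's independent-samples t-test (the two lists below are hypothetical numeric observations for two categories, not from this post's datasets):
from scipy import stats

# Hypothetical numeric observations for two categories
group_a = [14, 25, 23, 27, 28, 21, 26]
group_b = [25, 26, 27, 29, 25, 23, 22]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)   # p < 0.05 would suggest the two group means differ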
Here are some examples that make it easier to understand when we can use ANOVA.
We have a dataset containing information about a group of people pertaining to their social media usage and the number of hours they sleep:
We want to find out if the amount of social media usage (categorical variable) has a direct impact on the number of hours of sleep (numerical variable).
We have a dataset containing three different brands of medication and the number of days for the medication to take effect:
We want to find out if there is a direct relationship between a specific brand and its effectiveness.
ANOVA checks whether the group means of the numerical response are equal across the categories of the feature. If the group means are statistically indistinguishable, the feature has no impact on the response, and hence it (the categorical variable) should not be considered for model training.
The best way to understand ANOVA is to use an example. In the following example, I use a fictitious dataset where I recorded the reaction time of a group of people when they are given a specific type of drink.
I have a sample dataset named drinks.csv containing the following content:
team,drink_type,reaction_time
1,water,14
2,water,25
3,water,23
4,water,27
5,water,28
6,water,21
7,water,26
8,water,30
9,water,31
10,water,34
1,coke,25
2,coke,26
3,coke,27
4,coke,29
5,coke,25
6,coke,23
7,coke,22
8,coke,27
9,coke,29
10,coke,21
1,coffee,8
2,coffee,20
3,coffee,26
4,coffee,36
5,coffee,39
6,coffee,23
7,coffee,25
8,coffee,28
9,coffee,27
10,coffee,25
There are 10 teams in all, each comprising three people. Each person in a team is given a different drink: water, coke, or coffee. After consuming the drink, they were asked to perform some activities, and their reaction time was recorded. The aim of this experiment is to determine whether the drinks have any effect on a person's reaction time.
Let’s first load the dataset into a Pandas DataFrame:
import pandas as pd
df = pd.read_csv('drinks.csv')
Record the observation size, which we will make use of later:
observation_size = df.shape[0] # number of observations
It is useful to visualize the distribution of the data using a Boxplot:
_ = df.boxplot('reaction_time', by='drink_type')
We can see that the three types of drinks have about the same median reaction time.
To facilitate the calculation for ANOVA, we need to pivot the dataframe:
df = df.pivot(columns='drink_type', index='team')
display(df)
The columns represent the three different types of drinks and the rows represent the 10 teams. We will also use this chance to record the number of items in each group, as well as the number of groups, which we will make use of later:
n = df.shape[0] # 10; number of items in each group
k = df.shape[1] # 3; number of groups
We now define the null hypothesis and alternate hypothesis, just as we did for the chi-square test. They are:
H0: the group means are all equal (drink_type has no effect on reaction_time).
H1: at least one group mean differs from the others.
We are now ready to begin our calculations for ANOVA. First, let’s find the mean for each group:
df.loc['Group Means'] = df.mean()
df
From here, we can now calculate the overall mean:
overall_mean = df.iloc[-1].mean()
overall_mean # 25.666666666666668
Now that we have calculated the overall mean, we can proceed to calculate the following: the total sum of squares (SS_total), the sum of squares within groups (SS_within), and the sum of squares between groups (SS_between).
The total sum of squares is calculated by subtracting the overall mean from each observation and then summing the squares of all the differences: SS_total = Σ (observation − overall mean)².
Programmatically, SS_total is computed as:
SS_total = (((df.iloc[:-1] - overall_mean)**2).sum()).sum()
SS_total # 1002.6666666666667
The sum of squares within is the sum of squared deviations of the scores around their own group's mean: SS_within = Σ (observation − group mean)².
Programmatically, SS_within is computed as:
SS_within = (((df.iloc[:-1] - df.iloc[-1])**2).sum()).sum()
SS_within # 1001.4
Next, we calculate the sum of squares between, based on how far each group mean is from the overall mean (weighted by the group size n): SS_between = Σ n × (group mean − overall mean)².
Programmatically, SS_between is computed as:
SS_between = (n * (df.iloc[-1] - overall_mean)**2).sum()
SS_between # 1.266666666666667
We can verify that:
SS_total = SS_between + SS_within
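Programmatically, a quick sanity check of this identity (using numpy's isclose to allow for floating-point rounding):
import numpy as np

# SS_total should equal SS_between + SS_within (up to rounding error)
print(np.isclose(SS_total, SS_between + SS_within))   # True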
With all the values computed, we can now complete the ANOVA table. Recall that we have the following variables: observation_size (30, the total number of observations), n (10, the number of items in each group), and k (3, the number of groups).
We can compute the various degrees of freedoms as follows:
df_total = observation_size - 1 # 29
df_within = observation_size - k # 27
df_between = k - 1 # 2
From the above, compute the various mean squared values:
mean_sq_between = SS_between / (k - 1) # 0.6333333333333335
mean_sq_within = \
SS_within / (observation_size - k) # 37.08888888888889
Finally, we can calculate the F-value, which is the ratio of two variances:
F = mean_sq_between / mean_sq_within # 0.017076093469143204
Recall earlier that I mentioned ANOVA uses the f-tests to statistically test the equality of means.
Once the F-value is obtained, we now have to refer to the f-distribution table (see http://www.socr.ucla.edu/Applets.dir/F_Table.html for one example) to obtain the f-critical value. The f-distribution table is organized based on the α value (usually 0.05). So we need to first locate the table based on α=0.05:
Next, observe that the columns of the f-distribution table are based on df1 while the rows are based on df2. We can get df1 and df2 from the previous variables that we have created:
df1 = df_between # 2
df2 = df_within # 27
Using the values of df1 and df2, we can now locate the f-critical value by locating the df1 column and df2 row:
From the above figure, we can see that the f-critical value is 3.3541. Using this value, we can now decide if we will accept or reject the null hypothesis using the F-distribution curve:
Since the F-value (0.0171, as calculated above) is less than the F-critical value from the F-distribution table, we accept the null hypothesis: there is no significant difference between the group means. For machine learning, this feature (drink_type) should not be included for training, as the different types of drinks appear to have no effect on the reaction time. We would include a feature for training only if we rejected the null hypothesis, as that would mean the drink type does affect the reaction time.
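Instead of reading the F-distribution table, we can also obtain the F-critical value and the p-value programmatically (a sketch using scipy's F distribution and the df1, df2, and F variables computed above):
from scipy import stats

alpha = 0.05
f_critical = stats.f.ppf(1 - alpha, dfn=df1, dfd=df2)
print(f_critical)   # approximately 3.35

# p-value of the observed F statistic
p_value = stats.f.sf(F, dfn=df1, dfd=df2)
print(p_value)      # approximately 0.98, consistent with accepting H0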
In the previous section, we manually calculated the f-value for our dataset. Actually, there is an easier way — use the stats module’s f_oneway() function to calculate the f-value and p-value:
import scipy.stats as stats
fvalue, pvalue = stats.f_oneway(
df.iloc[:-1,0],
df.iloc[:-1,1],
df.iloc[:-1,2])
print(fvalue, pvalue) # 0.0170760934691432 0.9830794846682348
The f_oneway() function takes the groups as input and returns the ANOVA F and p-value:
In the above, the f-value is 0.0170760934691432 (identical to the one we calculated manually) and the p-value is 0.9830794846682348.
Observe that the f_oneway() function takes in a variable number of arguments, one per group. If we have many groups, it would be quite tedious to pass in all the groups one by one, so there is an easier way:
fvalue, pvalue = stats.f_oneway(
*df.iloc[:-1,0:3].T.values
)
Another way to calculate the F-value is to use the statsmodels module. We first build the model using the ols() function, then call fit() on the model instance. Finally, we call the anova_lm() function on the fitted model and specify the type of ANOVA test to perform. There are three types of ANOVA tests, but their discussion is beyond the scope of this article.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.read_csv('drinks.csv')
model = ols('reaction_time ~ drink_type', data=df).fit()
sm.stats.anova_lm(model, typ=2)
The above code snippet produces the following result, which is the same as the f-value that we calculated earlier (0.017076):
The anova_lm() function also returns the p-value (0.983079). We can use the following rule to determine whether the categorical variable has any influence on the numerical variable: if the p-value is less than 0.05, the categorical variable has a significant influence on the numerical variable; otherwise it does not.
Since the p-value here is 0.983079 (greater than 0.05), drink_type has no significant influence on reaction_time.
ANOVA helps to determine whether a categorical variable has an influence on a numerical variable. The test we have discussed so far is known as the one-way ANOVA test. There are a few variations of ANOVA, including two-way ANOVA and multi-way (factorial) ANOVA.
Using a two-way ANOVA or multi-way ANOVA, we can investigate the combined impact of two (or more) independent categorical variables on one dependent numerical variable.
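As a sketch of what a two-way ANOVA looks like with statsmodels (the column names factor1, factor2, and response below are hypothetical placeholders, not from the drinks dataset):
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical dataset with two categorical factors and one numeric response
data = pd.DataFrame({
    'factor1':  ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'],
    'factor2':  ['x', 'y', 'x', 'y', 'x', 'y', 'y', 'x'],
    'response': [10.1, 12.3, 9.8, 11.5, 10.4, 11.9, 12.0, 9.5],
})

# Main effects of both factors plus their interaction
model = ols('response ~ C(factor1) + C(factor2) + C(factor1):C(factor2)',
            data=data).fit()
print(sm.stats.anova_lm(model, typ=2))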
Resources:
https://thinkingneuron.com/how-to-measure-the-correlation-between-a-numeric-and-a-categorical-variable-in-python/
https://towardsdatascience.com/statistics-in-python-using-anova-for-feature-selection-b4dc876ef4f0