Association between two categorical variables: Chi-Square

When analyzing the relationship between two categorical variables, we cannot use the correlation approach that works for continuous variables; instead, we apply the chi-square test for association. Researchers use the chi-square test to determine how likely it is that an observed distribution is the result of chance. The test measures how closely the observed distribution of the data matches the distribution that would be expected if the variables were independent.

Important Information

Let's review a few key aspects before moving on to an example. The chi-square test is a statistical analysis that compares observed and expected data to assess how likely it is that the observed data happened by chance. It contrasts the observed counts with the counts the model predicts if the variables were independent, which is the null hypothesis. The further the observed data departs from the expected values, the stronger the evidence against the null hypothesis and the more likely it is that the variables are dependent.

The chi-square test only indicates whether there is a relationship between the two variables. It does not specify what kind of relationship there is. The data analyst must determine the nature of the link between the two variables.

Using the Cars Dataset as an Example

Let's examine the association between fuel type and aspiration, two categorical variables, using the automobiles dataset.
The vehicle's fuel type has only two options, gas or diesel, and the aspiration is either standard or turbo. To run the test, we first find the observed number of cars in each group. This can be done by creating a crosstab with the pandas package. A crosstab is a table displaying the relationship between two or more variables. When it shows only the association between two categorical variables, it is also called a contingency table. In our case, the contingency table displays the counts for each group: standard cars running on diesel, standard cars running on gas, turbocharged cars running on diesel, and turbocharged cars running on gas.

How to Calculate Expected Values

The chi-square formula is: the sum, over all groups, of the squared difference between the observed value (the count in each group) and the expected value, divided by the expected value:

\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}

Expected values are based on the row and column totals; in other words, what would we predict for each individual cell if we knew only the totals and not the observed values? The expected value for a standard car with diesel is found by
multiplying the row total of 20 by the column total of 168 and dividing the result by the grand total of 205, which gives 16.39.

Applying the same calculation to turbocharged cars running on gas gives 33.39: we multiply the row total of 185 by the column total of 37 and divide by the grand total of 205. The remaining expected values are obtained the same way. If we add up the expected values, the row totals, column totals, and grand total are the same as those of the observed values.

Results of the Chi-Square Test Interpreted

Returning to the calculation, summing over all cells the squared difference between the observed and expected values, divided by the expected value, gives a chi-square value of 29.6. A table with two rows and two columns has (2 - 1) x (2 - 1) = 1 degree of freedom, so in the chi-square table we go to the row for one degree of freedom and look for the value closest to 29.6. The value 29.6 lies beyond the table entry for a p-value of 0.05, so we can state that the p-value is less than 0.05.
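The expected values and the table lookup described above can be checked numerically. The following is a minimal sketch that uses only the totals quoted in the text; the variable names are illustrative, and the use of NumPy and `scipy.stats.chi2` in place of a printed chi-square table is an assumption of this example rather than part of the original walkthrough.

```python
import numpy as np
from scipy.stats import chi2

# Totals quoted in the text: rows are fuel type, columns are aspiration.
row_totals = np.array([20, 185])   # diesel, gas
col_totals = np.array([168, 37])   # standard, turbo
grand_total = 205

# Expected count per cell = row total * column total / grand total.
expected = np.outer(row_totals, col_totals) / grand_total
print(expected.round(2))

# Degrees of freedom for a contingency table: (rows - 1) * (columns - 1).
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# Numerical stand-in for the table lookup: the critical value at the
# 0.05 level, and the tail probability of the reported statistic 29.6.
critical_value = chi2.ppf(0.95, dof)
p_value = chi2.sf(29.6, dof)
print(dof, round(critical_value, 2), p_value)
```

Because 29.6 is far larger than the critical value for one degree of freedom, the p-value falls well below 0.05, which matches the conclusion drawn from the table.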
From the chi-square test conducted above, we can infer that there is an association between fuel type and aspiration. This is because the p-value we found was less than 0.05, which means we reject the hypothesis that the two variables are independent.

The chi2_contingency function from the SciPy stats package can be used to run this test in Python. This function returns the chi-square test statistic (29.6), the p-value (extremely close to 0), and the degrees of freedom (1). Unlike the chi-square table, which only provides a range of p-values, Python provides the precise p-value. The function also returns the expected values, which we previously calculated by hand.

Because the p-value is so close to zero, we can reject the null hypothesis and conclude that there is evidence of an association between fuel type and aspiration.
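The whole workflow, from crosstab to test result, might look like the sketch below. The text gives only the totals, so the four cell counts here are an assumption: they are chosen to be consistent with the quoted totals (20, 185, 168, 37, 205) and with the reported statistic of 29.6. The column names `fuel_type` and `aspiration` are also illustrative and may differ from those in the actual dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# ASSUMPTION: cell counts are not given in the text; these values are
# consistent with the quoted row/column totals and the reported 29.6.
assumed_counts = {
    ("diesel", "std"): 7,
    ("diesel", "turbo"): 13,
    ("gas", "std"): 161,
    ("gas", "turbo"): 24,
}

# Rebuild one row per car so that pd.crosstab has raw data to count.
cars = pd.DataFrame(
    [
        {"fuel_type": fuel, "aspiration": asp}
        for (fuel, asp), n in assumed_counts.items()
        for _ in range(n)
    ]
)

# Contingency table of observed counts (rows: fuel type, columns: aspiration).
observed = pd.crosstab(cars["fuel_type"], cars["aspiration"])
print(observed)

# Returns the statistic, the p-value, the degrees of freedom,
# and the array of expected counts.
chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(chi2_stat, p_value, dof)
print(expected.round(2))

# For 2 x 2 tables SciPy applies Yates' continuity correction by default;
# correction=False gives the plain sum of (O - E)^2 / E instead.
uncorrected_stat = chi2_contingency(observed, correction=False)[0]
print(uncorrected_stat)
```

With these assumed counts, the default call reproduces a statistic of about 29.6 with one degree of freedom, while the uncorrected statistic comes out somewhat larger. Either way the p-value is far below 0.05.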