Turning categorical variables into quantitative variables in Python

University
IBM Skills Network
Course
Data Analyst
Academic year

2023
Author

GregoryB

How to Encode Categorical Variables with Pandas

When working with data, we frequently encounter categorical variables that cannot be used directly as inputs for statistical models. For instance, the fuel type feature in the car data set is a string-formatted categorical variable with two possible values, gas or diesel. We need to convert this variable into a numeric format so that we can use it for further analysis or model training. One-hot encoding is a popular technique for accomplishing this.

In one-hot encoding, we encode a categorical variable by generating a new feature for each distinct value of the original feature. For the fuel type example, we create two new features, gas and diesel. When a value appears in the original feature, we set the corresponding new feature to one and leave the remaining new features at zero. The new features are commonly called dummy variables or indicator variables.
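The encoding described above can be sketched by hand before reaching for any helper function. This is only a minimal illustration: the small data frame, the column name "fuel-type", and its values are invented for the example and are not taken from the car data set.

```python
import pandas as pd

# Hypothetical sample data; the column name and values are assumptions.
cars = pd.DataFrame({"fuel-type": ["gas", "diesel", "gas", "gas"]})

# Build one indicator column per distinct category:
# 1 where the row holds that category, 0 everywhere else.
for category in cars["fuel-type"].unique():
    cars[category] = (cars["fuel-type"] == category).astype(int)

print(cars)
```

Each row ends up with exactly one of the new columns set to 1, which is where the name "one-hot" comes from.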

The get_dummies function in Pandas makes it simple to perform one-hot encoding. It takes a column of categorical values and returns a new data frame with one binary column for each distinct category. In our example, we can call pd.get_dummies to generate a new data frame containing columns for gas and diesel, where each row represents a car and each value is either 0 or 1 (recent Pandas versions return True and False by default; passing dtype=int gives 0 and 1). We refer to the resulting data frame as "dummy variable 1". We can now feed this data frame of binary variables to our statistical models.

One thing to bear in mind is that, to avoid the "dummy variable trap," we must remove one of the dummy columns from the data frame. If we include all the columns in the model, each column can be predicted exactly from the others, which leads to multicollinearity problems. To prevent this, we remove one of the columns, and the category it represented becomes the reference category for the remaining columns.

Several encoding techniques for categorical variables exist in addition to one-hot encoding, such as label encoding and ordinal encoding. Label encoding substitutes an integer from 0 to n-1 for each category, where n is the total number of categories. Ordinal encoding assigns numeric values according to the order or rank of the categories.

One-hot encoding is frequently used because it does not impose any ordinal relationship between the categories, so a model can capture non-linear relationships between the categorical variable and the response variable.

To summarize, one-hot encoding is a helpful method for transforming categorical variables into a numeric format that can be used as input for statistical models. Using Pandas' get_dummies function, we can quickly carry out this encoding and produce a new data frame with a binary variable for each distinct category.
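The pd.get_dummies call, the column removal that avoids the dummy variable trap, and the label and ordinal alternatives discussed above could look roughly like the following. This is a sketch on invented data: the "fuel-type" and "size" columns, their values, and the small < medium < large ordering are all assumptions made for illustration. dtype=int is passed so the output holds 0 and 1 rather than the True/False values that recent Pandas versions return by default.

```python
import pandas as pd

# Hypothetical sample data; column names and values are assumptions.
cars = pd.DataFrame({
    "fuel-type": ["gas", "diesel", "gas", "gas"],
    "size": ["small", "large", "medium", "small"],
})

# One-hot encoding: one binary column per distinct category.
dummies = pd.get_dummies(cars["fuel-type"], dtype=int)

# Avoiding the dummy variable trap: drop_first=True removes the first
# category's column ("diesel" here), which becomes the reference category.
reduced = pd.get_dummies(cars["fuel-type"], drop_first=True, dtype=int)

# Label encoding: an integer from 0 to n-1 per category. The numbers follow
# the sorted category names and carry no meaningful order.
cars["fuel-code"] = cars["fuel-type"].astype("category").cat.codes

# Ordinal encoding: the integers follow an order we state explicitly.
size_order = ["small", "medium", "large"]
cars["size-rank"] = pd.Categorical(
    cars["size"], categories=size_order, ordered=True
).codes

print(dummies)
print(reduced)
print(cars)
```

With two categories, the single remaining "gas" column in the reduced frame carries the same information as the full pair: a 0 there simply means diesel.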