Lecture Note
An Essential Data Pre-Processing Method to Understand

Data pre-processing is an important component of data analysis, and normalization is an important pre-processing approach. Normalizing numerical data puts features on a common range, which makes it possible to compare them fairly.

It's Important to Normalize

When evaluating data, it is usual to encounter data sets with several features, each of which may have a different range of values. Consider a data set of used vehicles, for instance, where the feature length varies from 150 to 250, and the features width and height vary from 50 to 100. Normalization is required to guarantee that these variables have a consistent range of values.
Normalization simplifies later statistical analyses by ensuring that the effects of the various features are comparable. It also matters for computation, since features with widely varying ranges can lead to numerical instability in some methods.

Let's look at another example to illustrate the significance of normalization. Consider a data set with two features, age and income. The age feature has a range of 0 to 100, while income values run to 20,000 and above; in this example they range from 20,000 to 500,000, roughly 1,000 times larger than the age values. These two features have quite distinct ranges, and because the income feature has larger values, it will have a bigger impact on the outcome of later analysis, such as linear regression. This does not necessarily imply that it is a better predictor, though. The linear regression model unfairly favors income over age simply because of the scale of the data.

This can be avoided by normalizing the two variables to values between 0 and 1. After normalization, these variables have a comparable impact on the models we will create later.

Various Normalization Methods

Of the many available methods, we will describe three of the most popular for normalizing data.
Simple Feature Scaling

Simple feature scaling is the first method. This approach divides each value by the feature's maximum, producing new values that fall between 0 and 1 (provided the original values are non-negative). It is straightforward to use and appropriate when the minimum and maximum values are known.

Take the length feature from the used automobile data set as an illustration. To normalize the length feature with simple feature scaling, we divide each value by the maximum value in the feature. Using the pandas method max, this can be accomplished in a single line of code.

Min-Max Method

The second technique is the min-max method. It subtracts the feature's minimum from each value and then divides the result by the range of that feature (max minus min). The resulting values also fall between 0 and 1.

Take the length feature from the used automobile data set as an illustration. To normalize the length feature with the min-max method, we subtract the column's minimum from each value and divide the result by the column's range.
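As a minimal sketch, the two methods described above could be written in pandas as follows. The data frame df, the column names, and the four length values are made up for illustration; the lecture's actual data set is not shown here.

```python
import pandas as pd

# Hypothetical sample values standing in for the used automobile data set.
df = pd.DataFrame({"length": [150.0, 175.0, 200.0, 250.0]})

# Simple feature scaling: divide each value by the column maximum.
df["length_simple"] = df["length"] / df["length"].max()

# Min-max: subtract the column minimum, then divide by the range (max - min).
length_min = df["length"].min()
length_max = df["length"].max()
df["length_minmax"] = (df["length"] - length_min) / (length_max - length_min)

print(df)
```

With these sample values, simple feature scaling maps the smallest length to 0.6 rather than 0, while min-max maps it to exactly 0. That is the practical difference between the two methods.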
Z-Score or Standard Score

The z-score, often known as the standard score, is the third method. In this approach, we subtract the average of the feature, mu, from each value and then divide by the standard deviation, sigma. The resulting values are centered on zero and typically range from negative three to positive three, though they can also be higher or lower.

Take the length feature from the used automobile data set as an illustration. To normalize it with the z-score approach, we apply the mean and std methods to the length feature. The mean method returns the average value of the feature in the data set, while the std method returns the standard deviation.
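A short sketch of the z-score calculation in pandas, again using made-up length values and a hypothetical data frame df. Note that the pandas std method computes the sample standard deviation (ddof=1) by default; whether the lecture intends the sample or the population version is not stated, so that choice is an assumption here.

```python
import pandas as pd

# Hypothetical sample values standing in for the used automobile data set.
df = pd.DataFrame({"length": [150.0, 175.0, 200.0, 250.0]})

mu = df["length"].mean()    # average value of the feature
sigma = df["length"].std()  # sample standard deviation (pandas default, ddof=1)

# Z-score: subtract the mean, then divide by the standard deviation.
df["length_z"] = (df["length"] - mu) / sigma

print(df)
```

After this transformation the new column has a mean of zero and a standard deviation of one, so values show how many standard deviations each length lies from the average.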
Data Normalization in Python