UNIT 2 DATA PREPROCESSING - PART 1 Prepared by: Trivedi Khushboo
What is data preprocessing? • Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.
Why is Data Preprocessing Important? Data Preprocessing is an important step in the Data Preparation stage of a Data Science development lifecycle that will ensure reliable, robust, and consistent results. The main objective of this step is to ensure and check the quality of data before applying any Machine Learning or Data Mining methods.
Data Preprocessing steps of Data Science • Step 1 : Import the libraries • Step 2 : Import the data-set • Step 3 : Check out the missing values • Step 4 : See the Categorical Values • Step 5 : Splitting the data-set into Training and Test Set • Step 6 : Feature Scaling
Step 3 : Check out the Missing Values • Understanding missing values is essential to managing data successfully. If missing values are not handled properly, the analysis may draw inaccurate inferences from the data.
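As a minimal sketch of this step, the snippet below counts missing entries per column and fills them in. The small Age/Salary matrix is hypothetical, and replacing a missing entry with the column mean is only one common strategy, assumed here for illustration.

```python
import numpy as np

# Hypothetical feature matrix (not from the slides): columns are Age and Salary.
# np.nan marks a missing entry.
X = np.array([
    [40.0, 72000.0],
    [27.0, 48000.0],
    [np.nan, 54000.0],
    [38.0, np.nan],
])

# Count the missing entries in each column.
missing_per_column = np.isnan(X).sum(axis=0)
print("Missing per column:", missing_per_column)

# Assumed strategy: replace each missing entry with the mean of the
# observed (non-missing) values in the same column.
column_means = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), column_means, X)
print(X_filled)
```

Whether to fill with the mean, use another statistic, or drop the affected rows depends on the data set and on how many values are missing.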
Step 4 : See the Categorical Values
In our data set, for example, the Country column holds text labels, which most machine learning algorithms cannot use directly, so we convert it into numerical values. To convert a categorical variable into numerical data we can use the LabelEncoder class from the sklearn.preprocessing module.
• Use the LabelEncoder class to convert categorical data into numerical data. • label_encoder is the LabelEncoder object that performs the conversion. Fitting it to the first column of our matrix X (the Country column) and transforming that column returns the column encoded as integers.
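The encoding step described above might look like the following sketch. The rows of X are hypothetical sample data; only the names label_encoder and X and the use of LabelEncoder on the first column come from the slides.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical matrix of features: Country, Age, Salary.
X = np.array([
    ["France", 44, 72000],
    ["Spain", 27, 48000],
    ["Germany", 30, 54000],
    ["Spain", 38, 61000],
], dtype=object)

# Fit the encoder on the Country column and replace that column
# with its integer codes.
label_encoder = LabelEncoder()
X[:, 0] = label_encoder.fit_transform(X[:, 0])

# classes_ lists the original labels; each label's position is its code.
print(label_encoder.classes_)
print(X)
```

Note that the integer codes follow the sorted order of the labels, so they imply an ordering between countries that does not really exist. For input features, one-hot encoding is often considered for that reason.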
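Step 5 : Splitting the data-set into Training and Test Set • Training Set: the part of the data the model learns from. • Test Set: the held-out part used to check the model's predictions on data it has not seen. • Why do we need splitting? • Your machine learning model learns from your data in order to make predictions, so it should be evaluated on data it did not learn from. Generally we split the data-set in a 70:30 or 80:20 ratio. A 70:30 split means 70 percent of the data goes to the training set and 30 percent to the test set.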
Splitting the Data-set into two sets: Train and Test Set
• X_train is the training part of the matrix of features. • X_test is the test part of the matrix of features. • y_train is the training part of the dependent variable, associated with X_train. • y_test is the test part of the dependent variable, associated with X_test.
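One way to produce these four arrays is scikit-learn's train_test_split, sketched below on hypothetical data. The 80:20 ratio is one of the ratios mentioned earlier, and random_state=0 is an arbitrary choice that only makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and one target per sample.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 gives an 80:20 split; rows of X and entries of y
# are shuffled together, so each sample keeps its own target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

With 10 samples this leaves 8 rows for training and 2 for testing.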
Step 6 : Feature Scaling • Feature scaling is a method for bringing variables onto a comparable range so that they can be compared on common ground.
• You can easily notice that the Salary and Age variables don't have the same scale, and this will cause issues in your machine learning model. • Let's say we take two values from the Age and Salary columns • Age: 40 and 27 • Salary: 72000 and 48000
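To see why the differing scales matter, the sketch below compares the squared differences that a distance-based model would use for these two observations, then applies min-max scaling. The Age and Salary values are the ones above. The column ranges (AGE_RANGE, SALARY_RANGE) and the helper min_max_scale are hypothetical, and min-max scaling is just one common scaling method, assumed here for illustration.

```python
# Values taken from the example above.
age_a, age_b = 40, 27
salary_a, salary_b = 72000, 48000

# Squared differences, as used in a Euclidean distance.
age_sq_diff = (age_a - age_b) ** 2           # 13 ** 2
salary_sq_diff = (salary_a - salary_b) ** 2  # 24000 ** 2

# Fraction of the squared distance that comes from Salary alone.
salary_share = salary_sq_diff / (age_sq_diff + salary_sq_diff)
print(age_sq_diff, salary_sq_diff, salary_share)


def min_max_scale(value, lo, hi):
    """Map value from the range [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo)


# Hypothetical column ranges, assumed only for this illustration.
AGE_RANGE = (20, 60)
SALARY_RANGE = (30000, 90000)

scaled_age_diff = abs(
    min_max_scale(age_a, *AGE_RANGE) - min_max_scale(age_b, *AGE_RANGE)
)
scaled_salary_diff = abs(
    min_max_scale(salary_a, *SALARY_RANGE) - min_max_scale(salary_b, *SALARY_RANGE)
)
print(scaled_age_diff, scaled_salary_diff)
```

Before scaling, the Salary term makes up almost all of the squared distance, so Age barely influences the result. After scaling, the two differences are of similar size.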