Live Code, Data Cleaning

University
University of California San Diego
Course
DSC 207R | Python for Data Science
Academic year

2023
Author

anon

Our live coding workshop on data cleaning begins here. Before starting any analysis, a data analyst needs a clean dataset. Data cleaning, which means locating and removing erroneous or unnecessary data, is the first step of any data analysis process. In this article, we focus on cleaning a movie dataset, using Python's Pandas library to prepare the data for analysis. The article covers the following steps:

1. Checking the shape of the dataset
2. Identifying missing values
3. Dropping missing values
4. Simple visualization of the dataset

Checking the dataset's shape comes first. The shape attribute in Pandas gives the dataset's number of rows and columns. The movies.csv file was loaded into a data frame named movies, so reading shape on this data frame tells us how many rows and columns it has.

movies.shape

The output shows about 27,000 rows (one record per movie) and three columns.

Next, we need to identify missing values in the dataset. Missing values can cause problems in data analysis and lead to incorrect results. We will use the isnull method in Pandas to identify missing values in our data frames.

movies.isnull()

Running isnull on the movies data frame returns a boolean data frame with one series per column, for user ID, movie ID, and rating. We need to check whether any of those values are True. They are all False, which is good news: the movies data frame has no null values. We repeat the procedure with the next data frame, ratings, to see whether any of its values are null. Fortunately, none are, and every column is fully populated.

However, when we go through the same steps for the tags data frame, a True value appears. Instead of three False results, we see that the tag column has some missing (NaN) values.
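The shape and missing-value checks described above can be sketched on a small stand-in frame. The data below is made up for illustration, and the column names `userId`, `movieId`, and `tag` are assumptions rather than names taken from the actual files. Chaining `.any()` after `isnull()` is one way to collapse the boolean frame into a single True/False per column; the original session may have done this differently.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tags data; values and column names are illustrative only.
tags_sample = pd.DataFrame({
    "userId": [1, 2, 3, 4],
    "movieId": [10, 20, 30, 40],
    "tag": ["funny", np.nan, "dark", "classic"],
})

# shape is an attribute (no parentheses): (number of rows, number of columns)
n_rows, n_cols = tags_sample.shape

# isnull() returns a boolean frame of the same shape;
# .any() reduces it to one True/False flag per column
has_missing = tags_sample.isnull().any()

# .sum() counts the missing entries in each column
missing_counts = tags_sample.isnull().sum()

print(n_rows, n_cols)
print(has_missing)
print(missing_counts)
```

Reducing with `.any()` or `.sum()` avoids scanning a full boolean frame by eye, which matters once the data has hundreds of thousands of rows.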
This is a relatively clean dataset, but the tag column has some missing values. Let's drop those using the dropna method in Pandas. We will use the default option, which works along axis 0 and eliminates records (rows) that contain missing, or NaN, values.
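A minimal sketch of that default behavior, again on a made-up frame rather than the real tags file (the column names are assumptions): `dropna()` with no arguments works along axis 0 and removes every row that contains at least one NaN, so comparing the row count before and after shows how many rows were dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical data; two of the five rows have a missing tag.
tags_demo = pd.DataFrame({
    "userId": [1, 2, 3, 4, 5],
    "movieId": [10, 20, 30, 40, 50],
    "tag": ["funny", np.nan, "dark", np.nan, "classic"],
})

rows_before = tags_demo.shape[0]

# Default dropna(): axis=0, how='any' -> drop rows with at least one missing value
cleaned = tags_demo.dropna()

rows_after = cleaned.shape[0]
rows_dropped = rows_before - rows_after

print(rows_before, rows_after, rows_dropped)
```

Assigning the result to a new name keeps the original frame available for comparison; reassigning to the same name, as the walkthrough does, is equally valid when the uncleaned data is no longer needed.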

tags = tags.dropna()

Next, we run the same isnull check on tags. All three columns now return False, indicating that we have successfully removed the missing (NaN) values from tags. Recall that tags had 465,564 rows when we checked its shape earlier. Check the shape again to verify that some rows were actually removed.

tags.shape

The output shows that there are now 465,548 rows, so we removed 16 rows that contained missing values.

Now that the data has been cleaned, we can explore it with simple Pandas visualizations. Data visualization is a crucial tool for data analysts because it helps us understand the dataset more fully. A good place to start is the distribution of ratings in our dataset. We'll plot a histogram of the ratings using Pandas' hist method.

movies.hist(column='rating')

The output shows a histogram of the ratings, indicating that most movies have a rating between 3 and 4. We can also create a scatter plot of the ratings and the number of reviews to see whether there is a correlation between them.

movies.plot(kind='scatter', x='rating', y