Lecture Note
University: University of California San Diego
Course: DSC 207R | Python for Data Science
Academic year: 2023
Live Code, Data Ingestion

How to Load and Analyze a MovieLens Data Set Using Pandas

Are you looking for ways to load and analyze a MovieLens data set using Pandas? Well, you have come to the right place! In this article, we will show you how to load a data set, which includes movies, ratings, tags, and genome tags, into Pandas DataFrames. You will also learn how to view the contents of the data files and check the data formatting.

Let's begin by checking the contents of the MovieLens directory. If you are following along, please make sure that you have downloaded the data set and placed it in the correct directory. To view the contents of the directory, we will use the ls command and pass it the MovieLens directory as ./movielens.

    ls ./movielens

Once we have verified the contents of the directory, we can start to load the data into a Pandas DataFrame. We will start by loading the movies.csv file. We will use the read_csv function and specify a comma as the separator for the data. This will create a new DataFrame object that we will call movies.

    import pandas as pd

    movies = pd.read_csv('./movielens/movies.csv', sep=',')

Now that we have loaded the data into a DataFrame, let's take a look at the contents of the file. We can use the head function, which by default displays the first five rows of the DataFrame.

    print(movies.head())

This will display the first five rows of the movies DataFrame. The trailing backslash after title means the output was wrapped, so the genres column is printed underneath.

       movieId                               title  \
    0        1                    Toy Story (1995)
    1        2                      Jumanji (1995)
    2        3             Grumpier Old Men (1995)
    3        4            Waiting to Exhale (1995)
    4        5  Father of the Bride Part II (1995)

                                            genres
    0  Adventure|Animation|Children|Comedy|Fantasy
    1                   Adventure|Children|Fantasy
    2                               Comedy|Romance
    3                         Comedy|Drama|Romance
    4                                       Comedy
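The directory check done with ls can also be sketched in Python, which is handy when the loading code runs as a script rather than in a notebook. The helper name missing_files and the EXPECTED_FILES list are illustrative assumptions, not part of the lecture; the file names are the three that this walkthrough loads, and the directory is assumed to be ./movielens.

```python
from pathlib import Path

# File names used in this walkthrough; the list itself is an assumption
# made for illustration (the data set may contain additional files).
EXPECTED_FILES = ["movies.csv", "tags.csv", "ratings.csv"]


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in expected if not (base / name).is_file()]


# Report what is missing instead of failing later inside read_csv.
missing = missing_files("./movielens")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All expected files found.")
```

If the directory itself does not exist, every expected file is reported as missing, so the same check covers a wrong working directory as well.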
As you can see, there are three columns in the movies DataFrame: movieId, title, and genres. The column labels are taken from the first row (the header row) of the data file. If a file has no header row, read_csv can be called with header=None, in which case the columns are labeled numerically (0, 1, 2, and so on).

Next, we will load the tags.csv file into a new DataFrame object that we will call tags.

    tags = pd.read_csv('./movielens/tags.csv', sep=',')

We can view the contents of the tags DataFrame using the head function.

    print(tags.head())

This will display the first five rows of the tags DataFrame.

       userId  movieId             tag   timestamp
    0      15      339  sandra 'boring  1138537770
    1      15     1955       dentist '  1193435061
    2      15     7478      Cambodia '  1170560997
    3      15    32892       Russian '  1170626366
    4      15    34162   forgettable '  1141391765

The tags DataFrame contains four columns: userId, movieId, tag, and timestamp.

We will also load the ratings.csv file into a new DataFrame object that we will call ratings.

    ratings = pd.read_csv('./movielens/ratings.csv', sep=',')
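The way read_csv picks up column labels can be checked without the data set on disk. The sketch below is an illustration under stated assumptions: it feeds read_csv a small in-memory CSV built from two of the movies.csv rows shown earlier, and the variable names (SAMPLE, labeled, unlabeled, relabeled) are made up for this example.

```python
import io

import pandas as pd

# Header line plus two rows copied from the movies.csv preview above.
SAMPLE = (
    "movieId,title,genres\n"
    "3,Grumpier Old Men (1995),Comedy|Romance\n"
    "5,Father of the Bride Part II (1995),Comedy\n"
)

# Default behavior: the first row of the file supplies the column labels.
labeled = pd.read_csv(io.StringIO(SAMPLE), sep=",")
print(list(labeled.columns))  # ['movieId', 'title', 'genres']

# Same data with the header line removed: header=None tells pandas not to
# treat the first row as labels, so the columns are numbered instead.
no_header = SAMPLE.split("\n", 1)[1]
unlabeled = pd.read_csv(io.StringIO(no_header), sep=",", header=None)
print(list(unlabeled.columns))  # [0, 1, 2]

# names= supplies the labels explicitly when the file does not carry them.
relabeled = pd.read_csv(
    io.StringIO(no_header),
    sep=",",
    header=None,
    names=["movieId", "title", "genres"],
)
print(relabeled.head())
```

Forgetting header=None on a file without a header row would silently turn the first data row into column labels, which is why it is worth looking at head() right after every load.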