Lecture Note
University: University of California San Diego
Course: DSC 207R | Python for Data Science
Academic year: 2023
Useful UNIX Commands for Data Science

In the demonstration of pipes and filters, we saw how Unix commands can be linked together to enable complex data manipulations. In general, these piped filter procedures give data analysts a rapid and effective way to examine and transform data. In this article we will go into a couple more of these commands and give further examples. After finishing this reading, you will be able to sort, clean, trim, and explore text data using Unix commands.

When working with text files or the outputs of Unix commands, the primary goal is manipulating and searching text data. Several useful commands assist with these objectives, such as grep, cat, word count (wc), sort, uniq, head, tail, cut, sed, and find. We discussed the first five commands in previous coding sessions; we have yet to review the latter five.

The head command lists the first n lines of a text file or input stream, and the tail command lists its last n lines. With the help of the powerful cut command, we can remove a portion of each line of a file. The stream editor, known as SED or sed, can apply basic text transformations to an input stream, such as a file or input from a pipeline. The find command lets us quickly search our file system or directory hierarchy. The best way to understand these commands is through examples, so let's look at three problems we can address in Unix using them.

The first challenge asks us to build a plot of the top 15 terms across all of Shakespeare's works. The second focuses on the top three user IDs running the most processes on our Unix-based system. Third, we can transform one of our files, say fruits.txt, to capitalize every word. Our forthcoming live coding session will pay special attention to these three problems.

To solve the first problem, we will use a pipe-and-filter statement. In the first command, we redirect standard input to come from Shakespeare.txt.
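Before tackling the three problems, here is a minimal sketch of the five new commands in action. The file name scores.csv and its contents are made up purely for illustration:

```shell
# Create a small sample file (hypothetical data, for illustration only)
printf 'alice,85\nbob,92\ncarol,78\ndave,90\neve,88\n' > scores.csv

head -n 2 scores.csv          # first 2 lines of the file
tail -n 2 scores.csv          # last 2 lines of the file
cut -d',' -f1 scores.csv      # keep only the first comma-separated field (the names)
sed 's/,/: /' scores.csv      # stream-edit: replace the first comma on each line
find . -name 'scores.csv'     # search the current directory tree for the file
```

Each command reads from a file or from standard input and writes to standard output, which is exactly what makes them composable with pipes.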
Since this file has multiple words on a single line, we first use the sed command to convert each space between words into a newline. After this command, we have a standard output stream of one word per line, plus some blank lines that existed before the command ran. We then sort this output and remove the remaining blank lines. The order of these last two commands does not matter, since together they simply prepare the output stream for the upcoming uniq command. Next, we need to count the occurrences of each unique word in the file. The -c option of the uniq command gives us the counts together with the individual words they belong to. After obtaining the count for each word, we perform a reverse numerical sort using the -nr option of the sort command. The output of the sort command is passed to the head command, which retrieves the top 15 lines, and the result is written to a file called count_vs_words. Without having to create a big application, this wordy but useful one-line command provides quick data exploration in the shell.

Note that on Unix-based systems other than macOS, the sed regular expression may be simpler than the one used in this example: it merely replaces every \s character (a space) with a \n character (a newline).
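The full pipeline described above can be sketched as follows. The input file Shakespeare.txt and the output file name count_vs_words are taken from the description; the sed syntax shown is the GNU (Linux) form, where \n in the replacement is interpreted as a newline, which is the macOS caveat noted above:

```shell
# Split words onto separate lines, sort, drop blank lines, count duplicates,
# sort counts in reverse numerical order, and keep the top 15.
sed 's/ /\n/g' < Shakespeare.txt \
  | sort \
  | sed '/^$/d' \
  | uniq -c \
  | sort -nr \
  | head -n 15 > count_vs_words

# The second problem (top three users by process count) follows the same
# count-and-rank pattern; ps -eo user= prints one user name per process:
ps -eo user= | sort | uniq -c | sort -nr | head -n 3
```

The key idea is that sort must run before uniq -c, because uniq only collapses adjacent duplicate lines.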
Next, we will change the text in fruits.txt to all capital letters using the tr command. After that, we will visualize the results to finish our exploratory investigation. Gnuplot is a straightforward plotting tool under Unix. Although in this course we will use more sophisticated visualization programs, such as matplotlib in Python, we want to give one gnuplot example to highlight what you can achieve using the shell.
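A minimal sketch of the tr step (the contents of fruits.txt are hypothetical):

```shell
# Create a sample file, then map every lowercase letter to uppercase.
printf 'apple\nbanana\ncherry\n' > fruits.txt
tr 'a-z' 'A-Z' < fruits.txt     # prints APPLE, BANANA, CHERRY on separate lines
```

Note that tr reads only from standard input and does not accept a file name argument, which is why the input is redirected with <. The resulting count_vs_words file from the first problem could then be handed to gnuplot for a quick bar chart; we leave the exact plotting script to the live coding session.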