Lecture Note
University: University of California San Diego
Course: DSC 207R | Python for Data Science
Academic year: 2023
Live Code, Useful UNIX Commands for Data Science

Now that we are back at the shell, we can start using UNIX commands for data science. We can find out where we are, in our initial working directory, by using the "pwd" command. Let's use the "ls" command to review this directory's contents. As you may recall, in our previous session we looked at the shakespeare.txt and fruits.txt files. Let's continue our exploration using pipelines, filters, and fundamental commands like head and tail.

First, let's use the head command to view the first five lines of the fruits.txt file by typing "head -5 fruits.txt". As we remember, the fruits.txt file contains a list of fruit names. Next, let's view the first five lines of all the fruit files by using the asterisk wildcard operator: "head -5 fruits*". We'll see all the fruit files displayed, and the apples are nicely sorted.

Let's use the uniq command to make sure that the fruit files don't have any duplicate names. To do this, type "sort fruits* | uniq". The uniq command only removes adjacent duplicate lines and reads a single input file (a second file name is treated as the output file), so we sort the combined files first rather than passing the wildcard to uniq directly. The result is the list of all the unique fruit names.

We can pipe the output to the cat command to inspect the contents of these files. As an illustration, the command "head -5 fruits* | cat" shows the first five lines of each fruit file. After clearing our screen, let's use the tail command to display the final three lines of each fruit file by entering "tail -3 fruits* | cat". The final three lines from each file are printed one after another, each under a header naming its file, for comparison. To examine the first two and last two lines of a specific file, such as fruits-sorted.txt, use a compound command that combines head and tail: "(head -2 fruits-sorted.txt ; tail -2 fruits-sorted.txt)".

Let's turn to our first question: what are the 15 most frequently used words in Shakespeare's writings? By entering "cat shakespeare.txt", we can see what is in the shakespeare.txt file.
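Before moving on to the larger file, the fruit-file commands above can be tried safely on throwaway data. The sketch below is only an illustration under stated assumptions: the file contents are invented (the real fruit files are not reproduced in these notes), everything happens in a temporary directory, and the line count is written with the "-n" option, which is the equivalent long-standing spelling of the "head -5" style used above.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data; the real fruit files are assumed to look similar.
printf '%s\n' banana apple cherry apple date banana > fruits.txt
sort fruits.txt > fruits-sorted.txt

# First two lines of a single file.
first_two=$(head -n 2 fruits.txt)
echo "$first_two"

# First two lines of every file matching the wildcard;
# head prints a "==> name <==" header above each file's lines.
head -n 2 fruits*

# uniq only drops adjacent duplicates, so sort the combined input first.
unique_names=$(sort fruits* | uniq)
echo "$unique_names"

# First two and last two lines of one file, captured together.
ends=$(head -n 2 fruits-sorted.txt; tail -n 2 fruits-sorted.txt)
echo "$ends"
```

Sorting before uniq is the design choice that matters here: without it, the two "apple" lines in the unsorted file would both survive because they are not next to each other.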
However, since shakespeare.txt is a lengthy file with 124,000 lines, it can help to check its size first with the word count command, "wc shakespeare.txt". The more command can also be used to view the file page by page. For instance, typing "cat shakespeare.txt | more" lets us browse the file, pressing Q to quit.

Let's use a sed regular expression to put each word on its own line. In this input, each line contains multiple words separated by spaces, so we need a sed expression that replaces spaces with newlines. We begin with "s/" and end with "/g", so that every space on each line is replaced, not just the first one. The sed expression for Mac is somewhat complex, so we'll switch back to avoid making mistakes. Now, let's examine the sed expression for Mac. I'm using "sed -E", starting with "s/", ending with "/g", and with two slash characters in between. The part between the first and second slash is just a space, since spaces are what we need to replace. The characters between the second and third slash stand for the newline and account for some special characters in the Mac Unix shell. In another Unix environment, we could simply use "\n" as the replacement, which would work just as well.

Let's use sed to replace those characters by feeding it the text of shakespeare.txt. As one might anticipate, each word of our input file is now on its own line. Now that the words are all on separate lines, we can pipe them into sort. We're going through each word twice, so depending on your machine, it might take a little while. After sorting, we see all the words properly ordered. If we go back to the beginning of the output, all of the blank lines are visible, so we need to remove them. We add another sed command to the pipe to delete the blank lines. The output stream from the pipe is now ready for the uniq command. After getting rid of the blanks and sorting, we run uniq with the "-c" option, whose output shows every word together with its count.

Because the words are already sorted alphabetically, what we want now is to order the lines numerically by count. If we add "sort -nr", the same output is sorted numerically in reverse, so the words with the highest counts come first. To view the output stream more easily, we next pipe it into "more". After sorting, we can see the counts in decreasing order.

Our question asked for the top 15 counts, so instead of more we use "head -15". We then redirect this output to a file that we will call "count versus words". Later, we'll use that file to plot those top 15 counts with gnuplot. What we have left is the output of the entire pipe. The top 15 counts are visible, from "the", "I", and "and" down to "with". We can redirect that into "count versus words", sending everything to that file.
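The whole pipeline described above can be sketched end to end on a tiny input. This is a minimal sketch under stated assumptions, not the exact commands typed in the lecture: the sample text is invented, the output file name count_vs_words.txt is a guess at how "count versus words" was spelled, and the newline is written as a backslash followed by a real line break inside the sed expression, a form that should work with both the Mac and GNU versions of sed.

```shell
# Work in a scratch directory with a small invented stand-in for shakespeare.txt.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'the quick fox and the lazy dog' \
              'the fox and  the hen' \
              'the fox' > sample.txt

# 1. sed: replace every space with a newline (one word per line).
# 2. sed: delete the blank lines produced by repeated spaces.
# 3. sort: bring identical words together so uniq can count them.
# 4. uniq -c: collapse each run of identical words into "count word".
# 5. sort -nr: order numerically by count, largest first.
# 6. head: keep the top 3 (the lecture keeps 15) and save them to a file.
cat sample.txt \
  | sed 's/ /\
/g' \
  | sed '/^$/d' \
  | sort \
  | uniq -c \
  | sort -nr \
  | head -n 3 > count_vs_words.txt

cat count_vs_words.txt
```

On the real data, sample.txt would be replaced by shakespeare.txt and "head -n 3" by "head -n 15". The first sort is needed because uniq only counts runs of identical adjacent lines.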
Learning Python, which enables scalable execution and libraries, is necessary because executing this in a Unix shell alone will be slow as the data size increases.