Tokenization in Natural Language Processing

Now let's talk about tokenizing words and text. Tokenization is an essential part of Natural Language Processing (NLP): the process of breaking a text down into its individual words, or tokens. In this article, we will explore the concept of tokenization and learn how to use the nltk word tokenizer to tokenize a text.

Corner Cases in Tokenization

Dividing text into words is often the first stage in NLP. Despite the procedure's apparent simplicity, handling all possible corner cases is exceedingly arduous. Corner cases include erratic punctuation, contractions, and abbreviated word forms; they can also be hyphenated words or multi-word names, like the "New York" example in this sample. Tokenization is a critical step in text analysis, as it prepares text for further processing, so in this article we will discuss how to handle these corner cases and ensure that we are tokenizing our text effectively.

Nltk Tokenization

Nltk offers libraries to remedy these challenges. When we switch to a notebook, we will first use a simple whitespace-based tokenizer; then we will learn how to do it better and more easily using nltk. Let's use Romeo's brief text as our tokenization example. As may be seen, this sample contains some punctuation. We can get a list of these terms by using Python's string split function in the following line, and after line 12, or code cell 12, we'll use this split Romeo text. A minimal sketch of this whitespace approach appears below.
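Here is a minimal sketch of the whitespace approach; the sample line is an assumption standing in for the exact Romeo text used in the lecture notebook:

```python
# Whitespace-based tokenization with Python's built-in str.split().
# The sample line below is an assumption, not the lecture's exact text.
romeo_text = "O Romeo, Romeo! wherefore art thou Romeo? Deny thy father, my love!"

# split() with no arguments breaks the string on any run of whitespace.
words = romeo_text.split()
print(words)
# Note the corner case: punctuation stays glued to the words,
# so 'Romeo!' and 'love!' each come out as a single token.
```

This is exactly the weakness the next section addresses: "love!" should really be two tokens.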
Using Nltk Word Tokenizer

To tokenize our text effectively, we need to use the nltk word tokenizer. This tokenizer is an advanced tool that handles corner cases such as punctuation and contractions effectively. To use it, we first need to download an already-trained English tokenizer called punkt, which has these punctuation rules already defined; we then use word_tokenize from it to come up with a list of words. Once we have downloaded the punkt tokenizer, we can use it to tokenize our text. If we display the resulting list of words, we indeed see that the "love!" we had above is separated nicely, as in the sketch below.
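A minimal sketch of the nltk approach, using the same assumed sample line as above:

```python
import nltk
from nltk.tokenize import word_tokenize

# One-time download of the pre-trained Punkt English tokenizer data.
nltk.download("punkt")

# Same assumed sample line as in the whitespace sketch above.
romeo_text = "O Romeo, Romeo! wherefore art thou Romeo? Deny thy father, my love!"

# word_tokenize splits punctuation and contractions into their own tokens,
# so 'love!' becomes two tokens: 'love' and '!'.
tokens = word_tokenize(romeo_text)
print(tokens)
```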
Tokenization in Movie Database

The good news is that all corpora in nltk already provide a way to generate tokenized words for each data file. So for our movie database, we can access the words of the first positive file using the code block here: we have movie_reviews.words and the positive file IDs, where zero is our first record. The result is a list of words from that first positive record. A sketch of this corpus access follows below.
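A sketch of that corpus access; the slice at the end is just for display and is an assumption:

```python
import nltk
from nltk.corpus import movie_reviews

# Download the movie reviews corpus if it is not already installed.
nltk.download("movie_reviews")

# File IDs of the positive reviews; index 0 is our first record.
pos_ids = movie_reviews.fileids("pos")

# The corpus reader returns already-tokenized words for a given file.
first_pos_words = movie_reviews.words(pos_ids[0])
print(first_pos_words[:20])  # first twenty tokens of the first positive review
```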
Building a Bag-of-Words Model

Now that we have our words, let's look at how we can utilize them to create a straightforward bag-of-words model in the following series of videos.

Tokenization is a critical NLP step that is necessary for getting text ready for additional analysis. Tokenizing text effectively can be difficult, particularly when dealing with special cases like contractions and punctuation. But by using the nltk word tokenizer and downloading the punkt tokenizer, we can quickly tokenize our text and make sure we are getting accurate results.