[1] "a" "a's" "able" "about" "above"
[6] "according" "accordingly" "across" "actually" "after"
[11] "afterwards" "again" "against" "ain't" "all"
[16] "allow" "allows" "almost" "alone" "along"
[21] "already" "also" "although" "always" "am"
52 Stop words
52.1 Stop words
The term stop words is loaded: it is often under-explained and therefore misunderstood by practitioners. The original definition is a set of words that contain no information and are safe to remove. That view is too simplistic. Instead, we can think of each word as carrying an amount of signal related to the task we are trying to accomplish. Removing low-signal words can be fruitful, if we are able to identify them. The term was first coined in 1960 (Luhn 1960).
Under this definition, we need a way to quantify how important each word is. One can think of the scope as global, domain, or document. Global stop words are words that are almost always low-information. These are words like "and" and "or". They are highly unlikely to provide a signal to the modeling task at hand, especially when using a counting method. This list is inherently going to be quite small. Domain-level stop words are more interesting but can be harder to define precisely. Suppose you are looking at hotel listings. The words "bed" and "room" are likely stop words, as they don't provide much value to any task you can imagine. Hotel listings are all but guaranteed to be talking about the style of bed and the room on offer. The word "beach" would generally be a word of interest when talking about hotel listings as a whole, but could be a stop word when we are scoped to beach hotel listings. There is a lot of grey area here, and domain knowledge plays a large role in identifying this type of stop word. Lastly, we have document-level stop words. These are words in an individual document that don't contain information; they are hard to identify effectively because they must be determined on a document-by-document basis. Such stop words technically exist, but they are unlikely to be picked up by the methods we will be using.
The first application of stop words that most people encounter is premade stop word lists. The list of stop word lists is very long, and we won't bother looking over all of them. Some of the well-known and popular ones are: the stop word list from the SMART (System for the Mechanical Analysis and Retrieval of Text) information retrieval system, developed at Cornell University in the 1960s (Lewis et al. 2004), and the English Snowball stop word list (Porter 2001).
The SMART stop word list contains 571 words, and the Snowball list contains 175 words. The first 25 stop words, alphabetically, from the SMART list are:
[1] "a" "a's" "able" "about" "above"
[6] "according" "accordingly" "across" "actually" "after"
[11] "afterwards" "again" "against" "ain't" "all"
[16] "allow" "allows" "almost" "alone" "along"
[21] "already" "also" "although" "always" "am"
and the first 25 words from the Snowball list are:
[1] "a" "about" "above" "after" "again" "against" "all"
[8] "am" "an" "and" "any" "are" "aren't" "as"
[15] "at" "be" "because" "been" "before" "being" "below"
[22] "between" "both" "but" "by"
We notice that the SMART list contains many more words, some of which feel like they are on the verge of being informative in certain cases. And there may be a reason for that. The Snowball list is meticulously constructed by looking at words by their classes, as seen here. On the other hand, it is not known how the SMART list was constructed. But there are hints; we can look at the words that are included as well as those that aren't. By digging around a little, we notice that the word "he's" is in the list but "she's" isn't. One explanation is that the list is partly frequency-based, which would not be surprising given some of the other words in the list, like "wish" and "thanx". But there is also evidence of manual curation, as all the letters of the alphabet are in there, which is quite unlikely to have happened through frequency alone.
All of this is to say that it is important to thoroughly investigate the stop word list you are using, as it can produce unexpected results. Nothman, Qin, and Yurchak (2018) explore a selection of 52 stop word lists with alarming results. Among the more serious issues were misspellings ("fify" instead of "fifty"), the inclusion of clearly informative words such as "computer" and "cry", and internal inconsistencies, such as including the word "has" but not the word "does". Some of these mistakes have crept into popular libraries and gone unnoticed for several years.
We can create homemade lists in a few different ways. One approach is to take the training data set, count the tokens, and sort the results. Then we can go through them and manually select which ones to put on the list. You don't have to pick the top 50; you can leave some in and some out, since some frequent words can be informative. We will talk about TF-IDF in Chapter 55, but it can also be helpful in finding noninformative words: calculating the IDF of every word and sorting lets us find tokens with low IDF, which are good candidates for removal.
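A rough sketch of both ideas, assuming a hypothetical hotel_listings data frame with a text column and using the tidytext and dplyr packages:

library(dplyr)
library(tidytext)

# Count tokens in the training data and look at the most frequent ones,
# then hand-pick which of them belong on the stop word list
hotel_listings |>
  unnest_tokens(word, text) |>
  count(word, sort = TRUE) |>
  slice_head(n = 50)

# Alternatively, compute the IDF of every word; low-IDF words appear in
# most documents and are good candidates for removal
hotel_listings |>
  mutate(doc_id = row_number()) |>
  unnest_tokens(word, text) |>
  count(doc_id, word) |>
  bind_tf_idf(word, doc_id, n) |>
  distinct(word, idf) |>
  arrange(idf)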
The final piece of advice is to combine the two approaches above: pick a premade list and modify it based on your domain expertise and the data set you are working with.
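For example, starting from the Snowball list and adjusting it with the hotel-listing vocabulary used earlier might look something like this; the specific words added and removed are purely illustrative:

library(stopwords)

custom_stopwords <- stopwords("en", source = "snowball") |>
  # drop negations from the premade list if they carry signal for the task
  setdiff(c("no", "not")) |>
  # add domain-level stop words for hotel listings
  union(c("bed", "room"))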
52.2 Pros and Cons
52.2.1 Pros
- is a way to cut down on computation time and can improve performance
52.2.2 Cons
- can be quite time-consuming to get right
- off-the-shelf lists and default options are often not ideal