54 Term Frequency
54.1 Term Frequency
The main reason why we do these text feature engineering steps is to produce numeric variables. Once we have a selection of tokens we are happy with, it is time to turn them into these numeric variables, and term frequency (TF) is the first and simplest method we will showcase to accomplish this.
In essence, it works by taking the tokens we have generated, and counting how many times they appear for each observation. If there are 500 unique tokens, then 500 columns are generated. If there are 50,000 unique tokens, then 50,000 columns are created.
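To make the mechanics concrete, here is a minimal Python sketch that builds a vocabulary and produces one count column per unique token; the tiny example observations are made up for illustration, and the tokens are assumed to already be generated.

```python
from collections import Counter

# A minimal sketch of term-frequency counting, assuming each observation
# has already been tokenized into a list of strings.
observations = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
]

# One column per unique token seen across all observations.
vocabulary = sorted({token for tokens in observations for token in tokens})

# One row per observation, one count per vocabulary token.
counts = [
    [Counter(tokens)[token] for token in vocabulary]
    for tokens in observations
]

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(counts)      # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```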
add a diagram of this happening
Counting how many times each token appears is the traditional approach, but it is not the only one. We could instead create an indicator of whether the token appears at all, which amounts to encoding x > 0 rather than the count itself. This is useful if we think that the number of times a token appears doesn't matter much, only that it appears at all.

If you have a mix of long and short text fields, you might be interested in calculating the frequency instead of the raw counts. Here we take the number of times a token appears, divided by the number of tokens in the observation. This is similar to, but not exactly the same as, Chapter 8: we are making the values across each row sum to 1, not making each column take values between 0 and 1.

Since we are working with counts, there are also times when log normalization is done, which means that we take the logarithm of the columns produced here, as per Chapter 2, typically with an offset because this method tends to produce a lot of zeroes.

Lastly, we also see people perform double normalization. Here the raw frequency of each token is divided by the raw frequency of the most occurring token in the observation; this ratio is then multiplied by one minus an offset, and the offset is added to the result. This is again done to prevent a bias towards longer documents. The offset can be anything, with 0.5 being a common choice as it limits how much influence any single count can have.
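The sketch below shows what these variants look like for one row of raw counts; the numbers are made up, and the offsets used (1 for the log and 0.5 for double normalization) are just one possible choice.

```python
import math

# One row of raw token counts for a single observation (made up).
raw_counts = [1, 0, 1, 1, 1, 2]

# Binary indicator: encode x > 0 rather than the count itself.
binary = [1 if x > 0 else 0 for x in raw_counts]

# Relative frequency: counts divided by the number of tokens in the
# observation, so the values in the row sum to 1.
n_tokens = sum(raw_counts)
frequency = [x / n_tokens for x in raw_counts]

# Log normalization: logarithm of the counts, with an offset of 1 so the
# many zeroes stay at zero.
log_normalized = [math.log(1 + x) for x in raw_counts]

# Double normalization: raw count divided by the largest count in the
# observation, scaled by (1 - K) and shifted by the offset K.
K = 0.5
max_count = max(raw_counts)
double_normalized = [K + (1 - K) * x / max_count for x in raw_counts]
```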
A consequence of generating one column per unique token is that the resulting data is extremely sparse. Each observation contains only a small fraction of the tokens in the vocabulary, so the vast majority of the values are zero, and the number of columns keeps growing as more unique tokens are seen. In practice this means the counts are usually stored in a sparse format rather than as a dense matrix.

Because most tokens rarely co-occur in the same observation, the resulting columns also end up close to orthogonal to each other, each carrying information about a narrow slice of the observations. Raw counts on their own are therefore rarely going to be a good final representation of the data; they treat every token as equally important and ignore any relationship between tokens.

Finally, we need to think about unseen tokens. The columns are fixed by the tokens observed in the training data, so when new observations arrive at prediction time, any token that was not seen during training has no column to be counted in and is typically ignored.
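As one concrete illustration of both points, scikit-learn's CountVectorizer stores the counts as a sparse matrix and silently drops tokens it did not see during fitting; the example texts below are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["the cat sat on the mat", "the dog sat"]
new_texts = ["the bird sat"]

vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_texts)

# The result is stored as a sparse matrix, since most entries are zero.
print(type(train_counts))      # a scipy.sparse matrix
print(train_counts.toarray())  # [[1 0 1 1 1 2], [0 1 0 0 1 1]]

# "bird" was never seen during fitting, so it has no column in the
# vocabulary and is dropped at transform time.
print(vectorizer.transform(new_texts).toarray())  # [[0 0 0 0 1 1]]
```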