21  Frequency Encoding

Frequency encoding takes a categorical variable and replaces each level with its frequency in the training data set. This results in a single numeric variable, with values between 0 and 1. This is a trained method since we need to keep a record of the frequencies from the training data set.
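
As a minimal sketch of the idea, the snippet below computes frequencies on a toy training set with pandas and then uses them as a lookup table. The data and the column name `zoning` are hypothetical and only serve as illustration; this is not the implementation used by the packages shown later in the chapter.

```python
import pandas as pd

# Toy training data (hypothetical values, for illustration only).
train = pd.DataFrame({"zoning": ["low", "low", "low", "high", "medium"]})

# "Training" step: record the share of rows that each level accounts for.
frequencies = train["zoning"].value_counts(normalize=True)

# Encoding step: replace each level with its recorded frequency.
encoded = train["zoning"].map(frequencies)
print(encoded.tolist())
```

The `frequencies` lookup is the part that has to be stored, since the same values must be reused when encoding new data.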

This method isn’t a silver bullet; it is only useful when the frequency or rarity of a category level is related to our outcome. Imagine we have data about wines and their producers: some big producers make many wines, while small producers make only a couple. This information could be useful and is easily captured by frequency encoding. The method is, however, not able to distinguish between two levels that have the same frequency.
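
The limitation with tied frequencies can be illustrated with a small sketch. The producer names and counts below are made up for the example.

```python
import pandas as pd

# Hypothetical wine data: one row per wine, labelled with its producer.
wines = pd.DataFrame(
    {"producer": ["big_a"] * 4 + ["big_b"] * 4 + ["small_c", "small_d"]}
)

frequencies = wines["producer"].value_counts(normalize=True)
wines["producer_freq"] = wines["producer"].map(frequencies)

# Four distinct producers, but only two distinct encoded values:
# the two big producers collapse together, and so do the two small ones.
print(wines.drop_duplicates())
```

After encoding, a model can tell big producers from small ones, but it can no longer tell `big_a` from `big_b`.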

Unseen levels are handled automatically: since they do not appear in the training data set, they are given a value of 0, and no extra treatment is necessary. Taking the logarithm can be useful when the number of occurrences differs greatly between levels.
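
A rough sketch of both points, using made-up data: unseen levels fall out of the lookup as missing values, which we fill with 0. For the logarithm, note that the log of 0 is undefined, so this sketch assumes one possible workaround, namely applying `log1p` to the raw counts. That choice is an assumption made for the example, not something the method prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical, heavily imbalanced training data.
train = pd.Series(["a"] * 90 + ["b"] * 9 + ["c"])
new = pd.Series(["a", "c", "d"])  # "d" never appears in train

frequencies = train.value_counts(normalize=True)

# map() gives NaN for levels missing from the lookup; fill those with 0.
encoded = new.map(frequencies).fillna(0.0)

# Assumed workaround for the log transform: log(1 + count), so that an
# unseen level (count 0) maps to 0 instead of -inf.
counts = train.value_counts()
log_encoded = np.log1p(new.map(counts).fillna(0.0))

print(encoded.tolist())
print(log_encoded.tolist())
```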

Note

This is similar to count encoding in the sense that both encodings calculate the same quantity; the only difference is what you put in the denominator. Since frequency encoding divides the counts by a constant value (the number of observations in the training data set), these will be treated as identical methods.

21.2 Pros and Cons

21.2.1 Pros

  • Powerful and simple when used correctly
  • High interpretability

21.2.2 Cons

  • Is not able to distinguish between two levels that have the same frequency
  • May not add predictive power

21.3 R Examples

We will be using the ames data set for these examples. The step_encoding_frequency() function from the extrasteps package allows us to perform frequency encoding.

library(recipes)
library(extrasteps)
library(modeldata)
data("ames")

ames |>
  select(Sale_Price, MS_SubClass, MS_Zoning)
# A tibble: 2,930 × 3
   Sale_Price MS_SubClass                         MS_Zoning               
        <int> <fct>                               <fct>                   
 1     215000 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 
 2     105000 One_Story_1946_and_Newer_All_Styles Residential_High_Density
 3     172000 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 
 4     244000 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 
 5     189900 Two_Story_1946_and_Newer            Residential_Low_Density 
 6     195500 Two_Story_1946_and_Newer            Residential_Low_Density 
 7     213500 One_Story_PUD_1946_and_Newer        Residential_Low_Density 
 8     191500 One_Story_PUD_1946_and_Newer        Residential_Low_Density 
 9     236500 One_Story_PUD_1946_and_Newer        Residential_Low_Density 
10     189000 Two_Story_1946_and_Newer            Residential_Low_Density 
# ℹ 2,920 more rows

We can take a quick look at the possible values MS_SubClass takes.

ames |>
  count(MS_SubClass, sort = TRUE)
# A tibble: 16 × 2
   MS_SubClass                                   n
   <fct>                                     <int>
 1 One_Story_1946_and_Newer_All_Styles        1079
 2 Two_Story_1946_and_Newer                    575
 3 One_and_Half_Story_Finished_All_Ages        287
 4 One_Story_PUD_1946_and_Newer                192
 5 One_Story_1945_and_Older                    139
 6 Two_Story_PUD_1946_and_Newer                129
 7 Two_Story_1945_and_Older                    128
 8 Split_or_Multilevel                         118
 9 Duplex_All_Styles_and_Ages                  109
10 Two_Family_conversion_All_Styles_and_Ages    61
11 Split_Foyer                                  48
12 Two_and_Half_Story_All_Ages                  23
13 One_and_Half_Story_Unfinished_All_Ages       18
14 PUD_Multilevel_Split_Level_Foyer             17
15 One_Story_with_Finished_Attic_All_Ages        6
16 One_and_Half_Story_PUD_All_Ages               1

We can then apply frequency encoding using step_encoding_frequency(). Notice how we only get 1 numeric variable for each categorical variable.

dummy_rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_encoding_frequency(all_nominal_predictors()) |>
  prep()

dummy_rec |>
  bake(new_data = NULL, starts_with("MS_SubClass"), starts_with("MS_Zoning")) |>
  glimpse()
Rows: 2,930
Columns: 2
$ MS_SubClass <dbl> 0.36825939, 0.36825939, 0.36825939, 0.36825939, 0.19624573…
$ MS_Zoning   <dbl> 0.775767918, 0.009215017, 0.775767918, 0.775767918, 0.7757…

We can pull the frequencies for each level of each variable by using tidy().

dummy_rec |>
  tidy(1)
# A tibble: 283 × 4
   terms       level                                  frequency id              
   <chr>       <chr>                                      <dbl> <chr>           
 1 MS_SubClass One_Story_1946_and_Newer_All_Styles      0.368   encoding_freque…
 2 MS_SubClass One_Story_1945_and_Older                 0.0474  encoding_freque…
 3 MS_SubClass One_Story_with_Finished_Attic_All_Ages   0.00205 encoding_freque…
 4 MS_SubClass One_and_Half_Story_Unfinished_All_Ages   0.00614 encoding_freque…
 5 MS_SubClass One_and_Half_Story_Finished_All_Ages     0.0980  encoding_freque…
 6 MS_SubClass Two_Story_1946_and_Newer                 0.196   encoding_freque…
 7 MS_SubClass Two_Story_1945_and_Older                 0.0437  encoding_freque…
 8 MS_SubClass Two_and_Half_Story_All_Ages              0.00785 encoding_freque…
 9 MS_SubClass Split_or_Multilevel                      0.0403  encoding_freque…
10 MS_SubClass Split_Foyer                              0.0164  encoding_freque…
# ℹ 273 more rows

21.4 Python Examples

We are using the ames data set for these examples. {category_encoders} provides the CountEncoder() method we can use. It performs count encoding, which we know is functionally equivalent to frequency encoding.

from feazdata import ames
from sklearn.compose import ColumnTransformer
from category_encoders.count import CountEncoder

ct = ColumnTransformer(
    [('count', CountEncoder(), ['MS_Zoning'])], 
    remainder="passthrough")

ct.fit(ames)
ColumnTransformer(remainder='passthrough',
                  transformers=[('count',
                                 CountEncoder(combine_min_nan_groups=True),
                                 ['MS_Zoning'])])
ct.transform(ames).filter(regex="count.*")
      count__MS_Zoning
0                 2273
1                   27
2                 2273
3                 2273
4                 2273
...                ...
2925              2273
2926              2273
2927              2273
2928              2273
2929              2273

[2930 rows x 1 columns]
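
To connect these counts back to the frequencies from the R example, we can divide by the number of rows. The short sketch below uses only numbers that appear in the outputs above (2,930 rows, and counts of 2273 and 27 for the first two rows); the variable names are ours.

```python
# Number of rows in ames, as reported in the output above.
n_rows = 2930

# Counts for the MS_Zoning levels of the first two rows.
zoning_counts = {
    "Residential_Low_Density": 2273,
    "Residential_High_Density": 27,
}

# Dividing each count by the constant n_rows gives the frequency.
zoning_freqs = {level: count / n_rows for level, count in zoning_counts.items()}

for level, freq in zoning_freqs.items():
    print(f"{level}: {freq:.9f}")
```

These values should agree with the MS_Zoning column in the glimpse() output of the R example, which is what we would expect if the two encodings differ only by a constant factor.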