20  Binary Encoding

Binary encoding encodes each category as its binary representation. You assign an integer value to each level of the categorical variable, in the same way as in Chapter 18. That integer is then converted to its binary representation, and the binary digits are returned as separate columns.

Suppose we have the following variable, where the levels have been assigned the integer values (cat = 11, dog = 3, horse = 20). We are using a small subset to better see what is happening.

[1] "dog"   "cat"   "horse" "dog"   "cat"  

The first thing we need to do is calculate the binary representation of these numbers. We use 5 digits since that is the most we need in this hypothetical example (the largest value, 20, requires 5 binary digits): 11 = 01011, 3 = 00011, 20 = 10100. We can then encode this in the following matrix.

     16 8 4 2 1
[1,]  0 0 0 1 1
[2,]  0 1 0 1 1
[3,]  1 0 1 0 0
[4,]  0 0 0 1 1
[5,]  0 1 0 1 1

This way we would be able to uniquely encode 2^5 = 32 different values with just 5 columns, compared to the 32 columns it would take if you used dummy encoding from Chapter 17. In general, you will be able to encode n levels in ceiling(log2(n)) columns.
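To make the conversion concrete, here is a minimal R sketch that reproduces the matrix above by hand. The int_to_bits() helper is written just for this illustration and is not part of any package; the integer codes are the hypothetical ones from the example.

# hypothetical integer codes assigned to each level
codes <- c(dog = 3, cat = 11, horse = 20)
x <- c("dog", "cat", "horse", "dog", "cat")

# extract the 5 binary digits of an integer, most significant first
int_to_bits <- function(value, n_bits = 5) {
  (value %/% 2^((n_bits - 1):0)) %% 2
}

# one row per observation, one column per binary digit
t(sapply(codes[x], int_to_bits))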

Note

This style of encoding generalizes to other bases. Binary encoding is a base-2 encoder; you could just as well have a base-3 or base-10 encoding. We will not cover these methods further than this mention, as they are similar in function to binary encoding.
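Purely as an illustration of that generalization, and not something we will use later, here is a tiny sketch of the same idea in base 3. The int_to_digits() helper is again a throwaway written only for this example.

# digits of an integer in an arbitrary base, most significant digit first
int_to_digits <- function(value, base = 3, n_digits = 3) {
  (value %/% base^((n_digits - 1):0)) %% base
}

int_to_digits(11)  # 11 = 1*9 + 0*3 + 2*1, so the digits are 1 0 2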

This method isn’t widely used. It does a good job of illustrating the midpoint between dummy encoding and label encoding in terms of how sparsely we want to store our data. Its limitation is how interpretable the final model ends up being. Further, if you want to encode your data more compactly than dummy encoding, you will likely have better luck with some of the methods described later.

TODO

link to actual methods

TODO

talk about Gray encoding

20.2 Pros and Cons

20.2.1 Pros

  • Uses fewer columns than dummy encoding to store the same information

20.2.2 Cons

  • Less interpretable than dummy variables

20.3 R Examples

We will be using the ames data set for these examples. The step_encoding_binary() function from the extrasteps package allows us to perform binary encoding.

library(recipes)
library(extrasteps)
library(modeldata)
data("ames")

ames |>
  select(Sale_Price, MS_SubClass, MS_Zoning)
# A tibble: 2,930 × 3
   Sale_Price MS_SubClass                         MS_Zoning               
        <int> <fct>                               <fct>                   
 1     215000 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 
 2     105000 One_Story_1946_and_Newer_All_Styles Residential_High_Density
 3     172000 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 
 4     244000 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 
 5     189900 Two_Story_1946_and_Newer            Residential_Low_Density 
 6     195500 Two_Story_1946_and_Newer            Residential_Low_Density 
 7     213500 One_Story_PUD_1946_and_Newer        Residential_Low_Density 
 8     191500 One_Story_PUD_1946_and_Newer        Residential_Low_Density 
 9     236500 One_Story_PUD_1946_and_Newer        Residential_Low_Density 
10     189000 Two_Story_1946_and_Newer            Residential_Low_Density 
# ℹ 2,920 more rows

We can take a quick look at the possible values that MS_SubClass takes.

ames |>
  count(MS_SubClass, sort = TRUE)
# A tibble: 16 × 2
   MS_SubClass                                   n
   <fct>                                     <int>
 1 One_Story_1946_and_Newer_All_Styles        1079
 2 Two_Story_1946_and_Newer                    575
 3 One_and_Half_Story_Finished_All_Ages        287
 4 One_Story_PUD_1946_and_Newer                192
 5 One_Story_1945_and_Older                    139
 6 Two_Story_PUD_1946_and_Newer                129
 7 Two_Story_1945_and_Older                    128
 8 Split_or_Multilevel                         118
 9 Duplex_All_Styles_and_Ages                  109
10 Two_Family_conversion_All_Styles_and_Ages    61
11 Split_Foyer                                  48
12 Two_and_Half_Story_All_Ages                  23
13 One_and_Half_Story_Unfinished_All_Ages       18
14 PUD_Multilevel_Split_Level_Foyer             17
15 One_Story_with_Finished_Attic_All_Ages        6
16 One_and_Half_Story_PUD_All_Ages               1

We can then apply binary encoding using step_encoding_binary(). Notice how each categorical variable is turned into a handful of binary columns rather than one column per level.

dummy_rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_encoding_binary(all_nominal_predictors()) |>
  prep()

dummy_rec |>
  bake(new_data = NULL, starts_with("MS_SubClass"), starts_with("MS_Zoning")) |>
  glimpse()
Rows: 2,930
Columns: 9
$ MS_SubClass_1  <int> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1…
$ MS_SubClass_2  <int> 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0…
$ MS_SubClass_4  <int> 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0…
$ MS_SubClass_8  <int> 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ MS_SubClass_16 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ MS_Zoning_1    <int> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ MS_Zoning_2    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ MS_Zoning_4    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ MS_Zoning_8    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

We can pull the number of distinct levels of each variable by using tidy().

dummy_rec |>
  tidy(1)
# A tibble: 40 × 3
   terms        value id                   
   <chr>        <int> <chr>                
 1 MS_SubClass     16 encoding_binary_Bp5vK
 2 MS_Zoning        7 encoding_binary_Bp5vK
 3 Street           2 encoding_binary_Bp5vK
 4 Alley            3 encoding_binary_Bp5vK
 5 Lot_Shape        4 encoding_binary_Bp5vK
 6 Land_Contour     4 encoding_binary_Bp5vK
 7 Utilities        3 encoding_binary_Bp5vK
 8 Lot_Config       5 encoding_binary_Bp5vK
 9 Land_Slope       3 encoding_binary_Bp5vK
10 Neighborhood    29 encoding_binary_Bp5vK
# ℹ 30 more rows

20.4 Python Examples

We are using the ames data set for these examples. The {category_encoders} package provides the BinaryEncoder() method we can use.

from feazdata import ames
from sklearn.compose import ColumnTransformer
from category_encoders.binary import BinaryEncoder

ct = ColumnTransformer(
    [('binary', BinaryEncoder(), ['MS_Zoning'])], 
    remainder="passthrough")

ct.fit(ames)
ColumnTransformer(remainder='passthrough',
                  transformers=[('binary', BinaryEncoder(), ['MS_Zoning'])])
ct.transform(ames).filter(regex="binary.*")
      binary__MS_Zoning_0  binary__MS_Zoning_1  binary__MS_Zoning_2
0                       0                    0                    1
1                       0                    1                    0
2                       0                    0                    1
3                       0                    0                    1
4                       0                    0                    1
...                   ...                  ...                  ...
2925                    0                    0                    1
2926                    0                    0                    1
2927                    0                    0                    1
2928                    0                    0                    1
2929                    0                    0                    1

[2930 rows x 3 columns]