[1] "dog" "cat" "horse" "dog" "cat"
20 Binary Encoding
20.1 Binary Encoding
Binary encoding represents each category by its binary representation. You first assign an integer value to each level of the categorical variable, in the same way as in Chapter 18. That integer is then converted to its binary representation, and the resulting binary digits are what is returned.
Suppose we have the following variable, where the levels have been assigned the integer values (cat = 11, dog = 3, horse = 20). We are using a small subset to gain a better understanding of what is happening.
[1] "dog" "cat" "horse" "dog" "cat"
The first thing we need to do is calculate the binary representation of these numbers. We should use 5 digits, since that is what it takes to represent the largest value (20) in this hypothetical example: 11 = 01011, 3 = 00011, 20 = 10100. We can then encode the variable in the following matrix:
     16 8 4 2 1
[1,]  0 0 0 1 1
[2,]  0 1 0 1 1
[3,]  1 0 1 0 0
[4,]  0 0 0 1 1
[5,]  0 1 0 1 1
With this encoding we would be able to uniquely represent 2^5 = 32 different values with just 5 columns, compared to the 32 columns it would take if you used the dummy encoding from Chapter 17. In general, you will be able to encode n levels in ceiling(log2(n)) columns.
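To make the arithmetic concrete, here is a minimal sketch of how such a binary matrix could be computed by hand in R. The to_bits() helper and the values vector are made up for this illustration and are not part of any package.

# Integer labels assigned to the toy levels, as above
values <- c(dog = 3, cat = 11, horse = 20)

# Hypothetical helper: the n_bits binary digits of x, most significant first
to_bits <- function(x, n_bits) {
  as.integer(intToBits(x))[n_bits:1]
}

n_bits <- ceiling(log2(max(values) + 1))  # 5 bits are enough to represent 20

animals <- c("dog", "cat", "horse", "dog", "cat")
bits <- t(sapply(values[animals], to_bits, n_bits = n_bits))
colnames(bits) <- 2^((n_bits - 1):0)  # 16 8 4 2 1, as in the matrix above
bits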
This style of encoding generalizes to other bases. Binary encoding is a base-2 encoder; you could just as well have a base-3 or a base-10 encoding. We will not cover these methods further than this mention, as they are similar in function to binary encoding.
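As a rough sketch of what a higher-base encoder would do, the small function below (made up for this illustration) expands an integer into its digits in an arbitrary base; base 2 recovers binary encoding, while base 3 or base 10 would give the variants mentioned above.

# Hypothetical helper: the n_digits digits of x in the given base,
# most significant digit first
to_base_digits <- function(x, base, n_digits) {
  digits <- integer(n_digits)
  for (i in n_digits:1) {
    digits[i] <- x %% base
    x <- x %/% base
  }
  digits
}

to_base_digits(20, base = 2, n_digits = 5)  # 1 0 1 0 0, matching the binary encoding
to_base_digits(20, base = 3, n_digits = 3)  # 2 0 2, a base-3 encoding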
This method isn't widely used. It does a good job of illustrating the middle ground between dummy encoding and label encoding in terms of how compactly we store our data. Its main limitation is how hard the final model becomes to interpret. Further, if you want to encode your data more compactly than dummy encoding, you will likely have better luck with some of the methods described later.
TODO: link to actual methods
TODO: talk about Gray encoding
20.2 Pros and Cons
20.2.1 Pros
- Uses fewer variables than dummy encoding to store the same information
20.2.2 Cons
- Less interpretable than dummy variables
20.3 R Examples
We will be using the ames
data set for these examples. The step_encoding_binary()
function from the extrasteps package allows us to perform binary encoding.
library(recipes)
library(extrasteps)
library(modeldata)
data("ames")
ames |>
  select(Sale_Price, MS_SubClass, MS_Zoning)
# A tibble: 2,930 × 3
Sale_Price MS_SubClass MS_Zoning
<int> <fct> <fct>
1 215000 One_Story_1946_and_Newer_All_Styles Residential_Low_Density
2 105000 One_Story_1946_and_Newer_All_Styles Residential_High_Density
3 172000 One_Story_1946_and_Newer_All_Styles Residential_Low_Density
4 244000 One_Story_1946_and_Newer_All_Styles Residential_Low_Density
5 189900 Two_Story_1946_and_Newer Residential_Low_Density
6 195500 Two_Story_1946_and_Newer Residential_Low_Density
7 213500 One_Story_PUD_1946_and_Newer Residential_Low_Density
8 191500 One_Story_PUD_1946_and_Newer Residential_Low_Density
9 236500 One_Story_PUD_1946_and_Newer Residential_Low_Density
10 189000 Two_Story_1946_and_Newer Residential_Low_Density
# ℹ 2,920 more rows
We can take a quick look at the possible values MS_SubClass takes.
ames |>
  count(MS_SubClass, sort = TRUE)
# A tibble: 16 × 2
MS_SubClass n
<fct> <int>
1 One_Story_1946_and_Newer_All_Styles 1079
2 Two_Story_1946_and_Newer 575
3 One_and_Half_Story_Finished_All_Ages 287
4 One_Story_PUD_1946_and_Newer 192
5 One_Story_1945_and_Older 139
6 Two_Story_PUD_1946_and_Newer 129
7 Two_Story_1945_and_Older 128
8 Split_or_Multilevel 118
9 Duplex_All_Styles_and_Ages 109
10 Two_Family_conversion_All_Styles_and_Ages 61
11 Split_Foyer 48
12 Two_and_Half_Story_All_Ages 23
13 One_and_Half_Story_Unfinished_All_Ages 18
14 PUD_Multilevel_Split_Level_Foyer 17
15 One_Story_with_Finished_Attic_All_Ages 6
16 One_and_Half_Story_PUD_All_Ages 1
We can then apply binary encoding using step_encoding_binary(). Notice how each categorical variable is turned into only a handful of binary columns, far fewer than dummy encoding would produce.
dummy_rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_encoding_binary(all_nominal_predictors()) |>
  prep()

dummy_rec |>
  bake(new_data = NULL, starts_with("MS_SubClass"), starts_with("MS_Zoning")) |>
  glimpse()
Rows: 2,930
Columns: 9
$ MS_SubClass_1  <int> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1…
$ MS_SubClass_2  <int> 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0…
$ MS_SubClass_4  <int> 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0…
$ MS_SubClass_8  <int> 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ MS_SubClass_16 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ MS_Zoning_1    <int> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ MS_Zoning_2    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ MS_Zoning_4    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ MS_Zoning_8    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
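As a quick sanity check (not part of the original output), we could count the distinct bit patterns in the MS_SubClass_* columns; since every level gets its own code, we would expect one pattern per level, 16 in total. This sketch assumes dplyr is available for distinct().

dummy_rec |>
  bake(new_data = NULL, starts_with("MS_SubClass")) |>
  dplyr::distinct() |>
  nrow()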
We can pull the number of distinct levels of each variable by using tidy().
dummy_rec |>
  tidy(1)
# A tibble: 40 × 3
terms value id
<chr> <int> <chr>
1 MS_SubClass 16 encoding_binary_Bp5vK
2 MS_Zoning 7 encoding_binary_Bp5vK
3 Street 2 encoding_binary_Bp5vK
4 Alley 3 encoding_binary_Bp5vK
5 Lot_Shape 4 encoding_binary_Bp5vK
6 Land_Contour 4 encoding_binary_Bp5vK
7 Utilities 3 encoding_binary_Bp5vK
8 Lot_Config 5 encoding_binary_Bp5vK
9 Land_Slope 3 encoding_binary_Bp5vK
10 Neighborhood 29 encoding_binary_Bp5vK
# ℹ 30 more rows
20.4 Python Examples
We are using the ames data set for these examples. The {category_encoders} package provides the BinaryEncoder() method that we can use.
from feazdata import ames
from sklearn.compose import ColumnTransformer
from category_encoders.binary import BinaryEncoder
ct = ColumnTransformer(
    [('binary', BinaryEncoder(), ['MS_Zoning'])],
    remainder="passthrough")
ct.fit(ames)
ColumnTransformer(remainder='passthrough', transformers=[('binary', BinaryEncoder(), ['MS_Zoning'])])
ct.transform(ames).filter(regex="binary.*")
binary__MS_Zoning_0 binary__MS_Zoning_1 binary__MS_Zoning_2
0 0 0 1
1 0 1 0
2 0 0 1
3 0 0 1
4 0 0 1
... ... ... ...
2925 0 0 1
2926 0 0 1
2927 0 0 1
2928 0 0 1
2929 0 0 1
[2930 rows x 3 columns]