library(recipes)
data("ames", package = "modeldata")
zv_rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_zv(all_predictors()) |>
  prep()
66 Zero Variance Filter
66.1 Zero Variance Filter
A zero-variance predictor is a fancy way of saying that a predictor only takes one value. Another name for this is a constant predictor. A zero-variance predictor by definition contains no information, as there can't be a relationship between the outcome and a predictor that never changes. These predictors appear in many different data sets, and they are sometimes created in the course of the feature engineering process, such as when we create dummy variables from categorical predictors with known possible levels in Chapter 17.
The reason why this chapter exists is two-fold. Firstly, since these predictors have zero information in them, they are safe to remove, which leads to simpler and faster models. Secondly, many model implementations will error if zero-variance predictors are present in the data. Even some methods in this book don't handle zero-variance predictors gracefully. Take the normalization methods in Chapter 7: some of these require division by the standard deviation, which is zero for a constant predictor, resulting in division by 0. Other methods, like PCA in Chapter 67, can get into trouble because zero-variance predictors can yield non-invertible matrices that they can't normally handle.
The solution to this problem is very simple. For each variable in the data set, count the number of unique values. If the number is 1, then mark the variable for removal.
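The counting procedure described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the implementation used by any particular package; the function name `find_zero_variance` and the toy data are hypothetical, and missing values are not handled.

```python
def find_zero_variance(columns):
    """Return the names of columns that take exactly one unique value.

    `columns` maps a column name to a list of its values
    (a hypothetical, simplified stand-in for a data frame).
    """
    to_remove = []
    for name, values in columns.items():
        # Count the unique values; exactly one means the column is constant.
        if len(set(values)) == 1:
            to_remove.append(name)
    return to_remove


# Toy data: two of the three columns are constant.
data = {
    "rooms": [3, 4, 2, 3],
    "country": ["US", "US", "US", "US"],
    "has_roof": [1, 1, 1, 1],
}
constant = find_zero_variance(data)
print(constant)  # ['country', 'has_roof']
```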
Zero variance only matters on the training data set. You could be in a situation where the testing data set contains other values for a predictor that is constant in the training data set. This doesn't matter, as zero-variance predictors only affect the fitting of the model, which is done on the training data set.
There are a couple of variants of this problem. Some models require multiple values for predictors across groups, and we need to handle that accordingly. Another, more complicated, problem is working with predictors that have almost zero variance, but not quite. Say a predictor has 999 instances of 10 and 1 instance of 15. According to the above definition, it doesn't have zero variance, but it feels very close to it. Such predictors might be considered so low in information that they would be worth removing as well.
More care has to be taken here, as these predictors could have information in them, but with little evidence to support it. The way we flag near-zero variance predictors isn't going to be as straightforward as how we did it above. We can't just look at the number of unique values, as having 2 unique values by itself isn't bad: a 50/50 split of a variable is far from constant. We need to find a way to indicate that the variable is dominated by very few values. One metric could be the percentage of observations that take the most common value; if this is high, the variable is a prime candidate for near-zero variance. One could also calculate the variance and pick a threshold, but this is harder to do since the calculated variance depends on the scale of the variable. Finally, we could look at the ratio of the frequency of the most common value to the frequency of the second most common value. If this ratio is large, then we have another contender for near-zero variance.
These different characteristics can be combined in different ways to suit the needs of your data. You will likely need to tune the threshold values as well.
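To illustrate that the decision is made on the training data set alone, here is a minimal sketch of a filter that learns which columns to drop during fitting and then applies that same decision to new data. The class `ZeroVarianceFilter` and the toy data are hypothetical and only meant to mirror the fit-then-apply pattern; they are not taken from any library.

```python
class ZeroVarianceFilter:
    """Hypothetical filter: learn constant columns on training data only."""

    def fit(self, columns):
        # Record columns with a single unique value in the training data.
        self.removed_ = [
            name for name, values in columns.items() if len(set(values)) == 1
        ]
        return self

    def transform(self, columns):
        # Drop the columns chosen during fit, whatever values they hold now.
        return {
            name: values
            for name, values in columns.items()
            if name not in self.removed_
        }


train = {"area": [10, 12, 9], "pool": [0, 0, 0]}
test = {"area": [11, 8], "pool": [0, 1]}  # "pool" varies, but only in test

zv = ZeroVarianceFilter().fit(train)
test_filtered = zv.transform(test)
print(zv.removed_)          # ['pool']
print(list(test_filtered))  # ['area']
```

The `pool` column is removed from the testing data even though it takes two values there, because the model would have been fit without it.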
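As a rough sketch of how two of these characteristics could be computed and combined, the example below calculates the share of the most common value and the ratio between the two most common values. The function names and the threshold values (`max_top_share`, `max_freq_ratio`) are assumptions chosen for illustration, not recommended defaults.

```python
from collections import Counter


def nzv_metrics(values):
    """Return (share of most common value, ratio of top two frequencies)."""
    counts = Counter(values).most_common()
    top_count = counts[0][1]
    second_count = counts[1][1] if len(counts) > 1 else 0
    top_share = top_count / len(values)
    # A constant column has no second value; treat the ratio as infinite.
    freq_ratio = top_count / second_count if second_count else float("inf")
    return top_share, freq_ratio


def is_near_zero_variance(values, max_top_share=0.95, max_freq_ratio=19.0):
    """Flag a column only when both (hypothetical) thresholds are exceeded."""
    top_share, freq_ratio = nzv_metrics(values)
    return top_share > max_top_share and freq_ratio > max_freq_ratio


skewed = [10] * 999 + [15]  # the example from the text
balanced = [0, 1] * 500     # a 50/50 split

print(is_near_zero_variance(skewed))    # True
print(is_near_zero_variance(balanced))  # False
```

Requiring both conditions is only one way to combine the metrics; using either one alone, or adding a check on the number of unique values, would be equally reasonable depending on the data.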
66.2 Pros and Cons
66.2.1 Pros
- Removing zero variance predictors should provide no downside
- Faster and smaller models
- Easy to explain and execute
66.2.2 Cons
- Removal of near-zero predictors requires care to avoid removing useful predictors
66.3 R Examples
We will use the step_zv() and step_nzv() steps, which are used to remove zero-variance and near-zero variance predictors respectively.
The zv_rec recipe, created by the code at the start of this chapter, uses the step_zv() function to remove zero-variance predictors.
We can use the tidy() method to find out which variables were removed.
zv_rec |>
  tidy(1)
# A tibble: 0 × 2
# ℹ 2 variables: terms <chr>, id <chr>
The result is empty, so none of the predictors in ames have zero variance. We can remove near-zero variance predictors in the same manner using step_nzv().
nzv_rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_nzv(all_predictors()) |>
  prep()

nzv_rec |>
  tidy(1)
# A tibble: 21 × 2
   terms          id
   <chr>          <chr>
 1 Street         nzv_RUieL
 2 Alley          nzv_RUieL
 3 Land_Contour   nzv_RUieL
 4 Utilities      nzv_RUieL
 5 Land_Slope     nzv_RUieL
 6 Condition_2    nzv_RUieL
 7 Roof_Matl      nzv_RUieL
 8 Bsmt_Cond      nzv_RUieL
 9 BsmtFin_Type_2 nzv_RUieL
10 BsmtFin_SF_2   nzv_RUieL
# ℹ 11 more rows
66.4 Python Examples
We are using the ames data set for examples. {sklearn} provides the VarianceThreshold() transformer, which we can use. Its threshold argument controls when a column is removed. The default, 0, removes zero-variance columns.
from feazdata import ames
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.compose import make_column_selector
import numpy as np
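
# Apply the filter to the numeric columns only and pass the rest through.
# The step label 'onehot' is kept because it appears in the output column names.
ct = ColumnTransformer(
    [('onehot', VarianceThreshold(threshold=0), make_column_selector(dtype_include=np.number))],
    remainder="passthrough")
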
ct.fit(ames)
ColumnTransformer(remainder='passthrough', transformers=[('onehot', VarianceThreshold(threshold=0), <sklearn.compose._column_transformer.make_column_selector object at 0x29764fe60>)])
ct.transform(ames)
      onehot__Lot_Frontage  ... remainder__Sale_Condition
0                      141  ...                    Normal
1                       80  ...                    Normal
2                       81  ...                    Normal
3                       93  ...                    Normal
4                       74  ...                    Normal
...                    ...  ...                       ...
2925                    37  ...                    Normal
2926                     0  ...                    Normal
2927                    62  ...                    Normal
2928                    77  ...                    Normal
2929                    74  ...                    Normal

[2930 rows x 74 columns]
We can also raise the threshold to remove near-zero variance columns.
# Same transformer as before, but with a non-zero variance threshold.
ct = ColumnTransformer(
    [('onehot', VarianceThreshold(threshold=0.2), make_column_selector(dtype_include=np.number))],
    remainder="passthrough")

ct.fit(ames)
ColumnTransformer(remainder='passthrough', transformers=[('onehot', VarianceThreshold(threshold=0.2), <sklearn.compose._column_transformer.make_column_selector object at 0x297c17fe0>)])
ct.transform(ames)
      onehot__Lot_Frontage  ... remainder__Sale_Condition
0                      141  ...                    Normal
1                       80  ...                    Normal
2                       81  ...                    Normal
3                       93  ...                    Normal
4                       74  ...                    Normal
...                    ...  ...                       ...
2925                    37  ...                    Normal
2926                     0  ...                    Normal
2927                    62  ...                    Normal
2928                    77  ...                    Normal
2929                    74  ...                    Normal

[2930 rows x 70 columns]
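A threshold on the raw variance, such as the 0.2 used here, is sensitive to the units of each column, which is the scale dependence mentioned earlier in the chapter. The small sketch below, using only the standard library and made-up numbers, shows the same measurements passing or failing a fixed threshold depending on whether they are recorded in meters or kilometers.

```python
from statistics import pvariance

# Hypothetical measurements of the same quantity in two units.
lengths_m = [1200.0, 1500.0, 900.0, 1100.0]
lengths_km = [x / 1000 for x in lengths_m]

var_m = pvariance(lengths_m)    # population variance in square meters
var_km = pvariance(lengths_km)  # population variance in square kilometers

threshold = 0.2
print(var_m > threshold)   # True: the column would be kept
print(var_km > threshold)  # False: the same data would be removed
```

For this reason, a variance threshold is easiest to reason about when the numeric columns are on comparable scales.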