44 Missing Values Indicators
44.1 Missing Values Indicators
Imputation can be useful, as we saw in Chapter 42 and Chapter 43, but by itself it isn't always enough to extract all the information. As was described in Chapter 41, missing values can come in different variants, and depending on the variant, imputation might not give enough information. Suppose you are working with non-MCAR data (data that is not Missing Completely At Random). Then there is some mechanism that determines when missing values occur. This mechanism might be known or unknown. From a predictive standpoint, whether or not it is known doesn't matter as much; what matters is whether the mechanism is related to the outcome.
This is where missing value indicators come in. Used in combination with imputation, missing value indicators will try to capture that signal. For each chosen variable, create another Boolean variable that is 1 when a missing value is seen, and 0 otherwise.
The following sample data set
# A tibble: 5 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1     1     6     3
2     4    NA     3
3     0    NA    NA
4    NA     3     5
5     5    NA     3
will look like the data set below, once missing value indicators have been added.
# A tibble: 5 × 6
      a     b     c  a_na  b_na  c_na
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     6     3     0     0     0
2     4    NA     3     0     1     0
3     0    NA    NA     0     1     1
4    NA     3     5     1     0     0
5     5    NA     3     0     1     0
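The transformation between the two tables can be sketched in a few lines of pandas. This is a minimal illustration rather than the implementation used by any particular package; the `_na` suffix simply mirrors the column names shown in the table.

```python
import numpy as np
import pandas as pd

# The sample data set from the text.
df = pd.DataFrame({
    "a": [1, 4, 0, np.nan, 5],
    "b": [6, np.nan, np.nan, 3, np.nan],
    "c": [3, 3, np.nan, 5, 3],
})

# One indicator per column: 1 where the value is missing, 0 otherwise.
indicators = df.isna().astype(int).add_suffix("_na")

# Keep the original columns and append the indicators.
with_indicators = pd.concat([df, indicators], axis=1)
print(with_indicators)
```

The original columns still contain their missing values here, so imputation would be applied to `a`, `b`, and `c` afterwards while the indicator columns are left untouched.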
With these indicators you are potentially adding information. If the missingness is unrelated to the outcome, you are instead adding a lot of noise. That noise can be filtered by other methods seen in this book. If indicators are created for variables with no missing data, then we create zero variance predictors, which we can deal with as seen in Chapter 66.
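As a small illustration of the zero variance case, the sketch below builds indicators for a made-up data set in which column `a` is fully observed, then drops any indicator that takes a single value. The data and the `nunique()` check are assumptions chosen for illustration; this is not the method described in Chapter 66.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" has no missing values, "b" does.
df = pd.DataFrame({
    "a": [1.0, 4.0, 0.0, 2.0, 5.0],
    "b": [6.0, np.nan, np.nan, 3.0, np.nan],
})

indicators = df.isna().astype(int).add_suffix("_na")

# An indicator that only ever takes one value carries no information.
constant = [col for col in indicators.columns if indicators[col].nunique() == 1]
kept = indicators.drop(columns=constant)

print(constant)            # ['a_na']
print(list(kept.columns))  # ['b_na']
```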
44.2 Pros and Cons
44.2.1 Pros
- No performance harm when added to variables with no missing data
- Simple and interpretable
44.2.2 Cons
- Will produce zero variance columns when used on data with no missing values
- Can create a sizable increase in data set size
44.3 R Examples
From the recipes package, we can use the step_indicate_na() function to create indicator variables based on missing data. The mtcars data set used here has no missing values, so every indicator in the output is 0.
library(recipes)

na_ind_rec <- recipe(mpg ~ disp + vs + am, data = mtcars) |>
  step_indicate_na(all_predictors()) |>
  prep()

na_ind_rec |>
  bake(new_data = mtcars)
# A tibble: 32 × 7
    disp    vs    am   mpg na_ind_disp na_ind_vs na_ind_am
   <dbl> <dbl> <dbl> <dbl>       <int>     <int>     <int>
 1  160      0     1  21             0         0         0
 2  160      0     1  21             0         0         0
 3  108      1     1  22.8           0         0         0
 4  258      1     0  21.4           0         0         0
 5  360      0     0  18.7           0         0         0
 6  225      1     0  18.1           0         0         0
 7  360      0     0  14.3           0         0         0
 8  147.     1     0  24.4           0         0         0
 9  141.     1     0  22.8           0         0         0
10  168.     1     0  19.2           0         0         0
# ℹ 22 more rows
44.4 Python Examples
We are using the ames data set for examples. {sklearn} provides the MissingIndicator() transformer, which we can use.
from feazdata import ames
from sklearn.compose import ColumnTransformer
from sklearn.impute import MissingIndicator

ct = ColumnTransformer(
    [('na_indicator', MissingIndicator(), ['Sale_Price', 'Lot_Area', 'Wood_Deck_SF', 'Mas_Vnr_Area'])],
    remainder="passthrough")

ct.fit(ames)
ColumnTransformer(remainder='passthrough', transformers=[('na_indicator', MissingIndicator(), ['Sale_Price', 'Lot_Area', 'Wood_Deck_SF', 'Mas_Vnr_Area'])])
ct.transform(ames)
remainder__MS_SubClass ... remainder__Latitude
0 One_Story_1946_and_Newer_All_Styles ... 42.054
1 One_Story_1946_and_Newer_All_Styles ... 42.053
2 One_Story_1946_and_Newer_All_Styles ... 42.053
3 One_Story_1946_and_Newer_All_Styles ... 42.051
4 Two_Story_1946_and_Newer ... 42.061
... ... ... ...
2925 Split_or_Multilevel ... 41.989
2926 One_Story_1946_and_Newer_All_Styles ... 41.988
2927 Split_Foyer ... 41.987
2928 One_Story_1946_and_Newer_All_Styles ... 41.991
2929 Two_Story_1946_and_Newer ... 41.989
[2930 rows x 70 columns]
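The output above contains only the 70 `remainder__` columns and no indicator columns. A likely explanation, assuming the four selected columns have no missing values in this copy of ames, is that MissingIndicator() defaults to `features="missing-only"`, which creates indicators only for columns that contained missing values when the transformer was fitted. The sketch below shows the difference on a small made-up array; setting `features="all"` requests one indicator per input column.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Hypothetical data: the first column is fully observed, the second is not.
X = np.array([
    [1.0, 6.0],
    [4.0, np.nan],
    [0.0, np.nan],
    [2.0, 3.0],
])

# Default: only columns with missing values during fit get an indicator.
default_ind = MissingIndicator().fit(X)
default_out = default_ind.transform(X)
print(default_out.shape)  # (4, 1)

# features="all": one indicator per input column, even if nothing is missing.
all_ind = MissingIndicator(features="all").fit(X)
all_out = all_ind.transform(X)
print(all_out.shape)      # (4, 2)
```

With `features="all"`, the indicator for the fully observed column is constant, which is the zero variance situation listed under Cons.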