5 Yeo-Johnson
5.1 Yeo-Johnson
You have likely heard a lot of talk about having normally distributed predictors. This isn't that common of an assumption, and having a fairly non-skewed, symmetric predictor is often enough. Linear Discriminant Analysis assumes Gaussian data, and that is about it (TODO add a reference here). Still, it is worthwhile to have more symmetric predictors, and this is where the Yeo-Johnson transformation comes into play.
This method is very similar to the Box-Cox method in Chapter 4, except it doesn't have the restriction that the variable \(x\) needs to be positive.
It works by using maximum likelihood estimation to estimate a transformation parameter \(\lambda\) in the following equation, chosen to optimize the normality of \(x^*\):
\[ x^* = \left\{ \begin{array}{ll} \dfrac{(x + 1) ^ \lambda - 1}{\lambda} & \lambda \neq 0, x \geq 0 \\ \log(x + 1) & \lambda = 0, x \geq 0 \\ - \dfrac{(-x + 1) ^ {2 - \lambda} - 1}{2 - \lambda} & \lambda \neq 2, x < 0 \\ - \log(-x + 1) & \lambda = 2, x < 0 \end{array} \right. \]
It is worth noting again that what we are optimizing over is the value of \(\lambda\). This is also a case of a trained preprocessing method when used on the predictors. We need to estimate the parameter \(\lambda\) on the training data set, then use that estimate to apply the transformation to both the training and test data sets, to avoid data leakage.
If the values of \(x\) are strictly positive, the Yeo-Johnson transformation is the same as the Box-Cox transformation of \(x + 1\); if the values of \(x\) are strictly negative, it is the Box-Cox transformation of \(-x + 1\) with power \(2 - \lambda\). The interpretation of \(\lambda\) isn't as easy as for the Box-Cox method.
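To make the piecewise definition concrete, here is a minimal R sketch of the transformation for a fixed \(\lambda\). The yeo_johnson() helper is made up for illustration; in practice the estimation of \(\lambda\) is handled by the packages shown later in this chapter.

yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  # Positive branch: Box-Cox of (x + 1)
  if (lambda != 0) {
    out[pos] <- ((x[pos] + 1) ^ lambda - 1) / lambda
  } else {
    out[pos] <- log(x[pos] + 1)
  }
  # Negative branch: Box-Cox of (-x + 1) with power 2 - lambda, negated
  if (lambda != 2) {
    out[!pos] <- -((-x[!pos] + 1) ^ (2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!pos] <- -log(-x[!pos] + 1)
  }
  out
}

# Handles negative, zero, and positive values alike
yeo_johnson(c(-5, -1, 0, 1, 5), lambda = 0.5)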
Let us see some examples of Yeo-Johnson at work. Below are three different simulated distributions, before and after they have been transformed by Yeo-Johnson.
We have the original distributions, which have some left or right skewness. The transformed columns look better, in the sense that they are less skewed and fairly symmetric around the center. Are they perfectly normal? No! But these transformations might still be beneficial. We also notice how the method works even when there are negative values.
The Yeo-Johnson method isn't magic: it will only give you something more normally distributed if the distribution can actually be made more normal by applying the above formula.
The first distribution here is uniformly random. The resulting transformation ends up more skewed, even if only a little, than the original distribution, because this method is not intended for this type of data. We see similar results with the bi-modal distributions.
5.2 Pros and Cons
5.2.1 Pros
- More flexible than individually chosen power transformations such as logarithms and square roots
- Can handle negative values
5.2.2 Cons
- Isn't a universal fix
5.3 R Examples
We will be using the ames data set for these examples.
library(recipes)
library(modeldata)
data("ames")
ames |>
  select(Lot_Area, Wood_Deck_SF, Sale_Price)
# A tibble: 2,930 × 3
   Lot_Area Wood_Deck_SF Sale_Price
      <int>        <int>      <int>
 1    31770          210     215000
 2    11622          140     105000
 3    14267          393     172000
 4    11160            0     244000
 5    13830          212     189900
 6     9978          360     195500
 7     4920            0     213500
 8     5005            0     191500
 9     5389          237     236500
10     7500          140     189000
# ℹ 2,920 more rows
{recipes} provides a step to perform Yeo-Johnson transformations, step_YeoJohnson(), which estimates \(\lambda\) on the training data when the recipe is prepped and then applies the transformation.
yeojohnson_rec <- recipe(Sale_Price ~ Lot_Area, data = ames) |>
  step_YeoJohnson(Lot_Area) |>
  prep()
yeojohnson_rec |>
  bake(new_data = NULL)
# A tibble: 2,930 × 2
   Lot_Area Sale_Price
      <dbl>      <int>
 1     21.8     215000
 2     18.2     105000
 3     18.9     172000
 4     18.1     244000
 5     18.8     189900
 6     17.7     195500
 7     15.5     213500
 8     15.5     191500
 9     15.8     236500
10     16.8     189000
# ℹ 2,920 more rows
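Because the step is trained, the \(\lambda\) estimated during prep() is reused whenever the recipe is applied to new data. A small sketch, using the first rows of ames as a stand-in for a genuine test set:

ames_new <- ames[1:5, ]  # stand-in for a real test set
yeojohnson_rec |>
  bake(new_data = ames_new)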
We can also pull out the value of the estimated \(\lambda\) by using the tidy() method on the recipe step.
yeojohnson_rec |>
  tidy(1)
# A tibble: 1 × 3
  terms    value id
  <chr>    <dbl> <chr>
1 Lot_Area 0.129 YeoJohnson_3gJXR
5.4 Python Examples
We are using the ames data set for these examples. {feature_engine} provides YeoJohnsonTransformer(), which we can use.
from feazdata import ames
from sklearn.compose import ColumnTransformer
from feature_engine.transformation import YeoJohnsonTransformer
ct = ColumnTransformer(
    [('yeojohnson', YeoJohnsonTransformer(), ['Lot_Area'])],
    remainder="passthrough")
ct.fit(ames)
ColumnTransformer(remainder='passthrough',
                  transformers=[('yeojohnson', YeoJohnsonTransformer(),
                                 ['Lot_Area'])])
ct.transform(ames)
      yeojohnson__Lot_Area  ...  remainder__Latitude
0                   21.823  ...               42.054
1                   18.218  ...               42.053
2                   18.915  ...               42.053
3                   18.082  ...               42.051
4                   18.808  ...               42.061
...                    ...  ...                  ...
2925                16.969  ...               41.989
2926                17.332  ...               41.988
2927                17.861  ...               41.987
2928                17.721  ...               41.991
2929                17.593  ...               41.989

[2930 rows x 74 columns]
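Mirroring the tidy() call in the R example, we can inspect the estimated \(\lambda\) here too. This sketch assumes the fitted feature_engine transformer exposes the estimated values in its lambda_dict_ attribute:

# Pull the fitted transformer back out of the ColumnTransformer
# and read off the estimated lambda per transformed column.
ct.named_transformers_['yeojohnson'].lambda_dict_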