Data Preparation for Predictive Modeling: Resolving Skewness

Skewness is a measure of asymmetry of distribution. Many model building techniques have the assumption that predictor values are distributed normally and have a symmetrical shape. In this post we discuss data preparation methods for resolving skewness.

Data preparation is a critical first step for building high performance predictive models. In a series of posts we discuss important data preparation techniques that improve the performance of predictive models. Below are techniques we are planning to cover in Data Preparation For Predictive Modeling Series:

In this post we discuss techniques for resolving skewness.

Skewness
The fundamental assumption in many predictive models is that the predictors have normal distributions. Normal distribution is un-skewed. An un-skewed distribution is the one which is roughly symmetric. It means the probability of falling in the right side of mean is equal to probability of falling on left side of mean. The statistics for sample skewness is being calculated using below formula:
$latex Skewness = \frac{\sum(x_{i}-\overline{x})^{3}}{(n-1)\nu^{3/2}}&s=3$
$latex where\quad \nu = \frac{\sum(x_{i}-\overline{x})^{2}}{(n-1)}&s=3$
Below section shows the methods for calculating Skewness in Python and R. We use the skew from spicy.stats to calculate skewness in Python. For R code we use the skewness from e1071 package:
#Airline Data Example
#Calculating Skewness Statistic For Flight Time Using Python
#You can download dataset from http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

import pandas as pd
import matplotlib.pylab as plt
from sklearn import preprocessing
from scipy.stats import skew
#First we import the data 
data = pd.read_csv('On_Time_On_Time_Performance_2015_1.csv') 

#Replace Missing Values with zero
data['AirTime'].fillna(0,inplace=True)

#The next line uses scale method from scikit-learn to transform the distribution 
#This will not impact Skewness Statistic calculation
#We have included this for sake of completion
AirTime = preprocessing.scale(data['AirTime']) 

#Next We calculate Skewness using skew in spicy.stats
skness = skew(AirTime)
#We draw the histograms 
figure = plt.figure()
figure.add_subplot(121)   
plt.hist(AirTime,facecolor='blue',alpha=0.75) 
plt.xlabel("AirTime - Transformed") 
plt.title("Transformed AirTime Histogram") 
plt.text(2,100000,"Skewness: {0:.2f}".format(skness)) 

figure.add_subplot(122) 
plt.boxplot(AirTime)
plt.title("Large Skewness shows Right Skewed Distribution")
plt.show()

Python Script Output:

Skewness greater than zero shows large skewed distribution

#Airline Data Example
#Centering & Scaling Flight Time using R
# You can download the data from http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

#We use functions in Caret package for transformations
library("caret");
library("e1071");

#Load Dataset
data <- read.csv('On_Time_On_Time_Performance_2015_1.csv',header = TRUE);

#Replace NA values with zero
data$AirTime[is.na(data$AirTime)] <- 0


#We use preProcess function from caret package to apply center and scale transformations
#Notice that preProcess works on Data Frames or Matrix
#If you need to apply the transformation on a single column use as.data.frame

centerScaleModel <- preProcess(as.data.frame(data$AirTime),method = c("center","scale"));
AirTime <- predict(centerScaleModel,as.data.frame(data$AirTime));

#Calculating the Skewness
skness <- apply(AirTime,2,skewness);

#Here we plot the Actual and Transformed values for AirTime column, side by side
#par function is being used to create subplots (an alternative is layout())

par(mfrow=c(1,2));

hist(AirTime[,1],main = "Transformed Flight Time Histogram",xlab = "AirTime - Transformed",col = "dark blue",xlim = c(-2,5),ylim=c(0,150000));
text(5.0, 100000, paste("Skewness = ", round(skness, 2), sep = ''), pos = 2)

boxplot(AirTime,main = "Large Skewness shows Right Skewed Distribution")

R Script Output:

Large positive Skewness Statistic shows Right Skewed Distribution

Large positive Skewness Statistic shows Right Skewed Distribution

How to deal with Skewness?
Replacing the data with the log, square root, or inverse may help to remove the skew.  In the following code we show that how using the square root of flight time instead of actual flight time can reduce the Right Skewness.
#Airline Data Example
#Calculating Skewness Statistic For Flight Time Using Python
#You can download dataset from http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

import pandas as pd
import matplotlib.pylab as plt
from sklearn import preprocessing
from scipy.stats import skew
import numpy as np
#First we import the data 
data = pd.read_csv('On_Time_On_Time_Performance_2015_1.csv') 

#Replace Missing Values with zero
data['AirTime'].fillna(0,inplace=True)

#The next line uses scale method from scikit-learn to transform the distribution 
#This will not impact Skewness Statistic calculation
#We have included this for sake of completion
#Note that we changed the following line to process the square roots instead of actuals
AirTime = preprocessing.scale(np.sqrt(data['AirTime'])) 
AirTimeOrig = preprocessing.scale(data['AirTime'])

#Next We calculate Skewness using skew in spicy.stats
skness = skew(AirTime)
sknessOrig = skew(AirTimeOrig)
#We draw the histograms 
figure = plt.figure()
figure.add_subplot(131)   
plt.hist(AirTime,facecolor='red',alpha=0.75) 
plt.xlabel("AirTime - Transformed(Using Sqrt)") 
plt.title("Transformed AirTime Histogram") 
plt.text(2,100000,"Skewness: {0:.2f}".format(skness)) 

figure.add_subplot(132) 
plt.hist(AirTimeOrig,facecolor='blue',alpha=0.75) 
plt.xlabel("AirTime - Based on Original Flight Times") 
plt.title("AirTime Histogram - Right Skewed") 
plt.text(2,100000,"Skewness: {0:.2f}".format(sknessOrig))
figure.add_subplot(133) 
plt.boxplot(AirTime)
plt.title("Un-Skewed Distribution")
plt.show()

Python Script Output:

Using the Square Root of Flight Time reduces the skewness

Using the Square Root of Flight Time reduces the skewness

#Airline Data Example
#Centering & Scaling Flight Time using R
# You can download the data from http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

#We use functions in Caret package for transformations
library("caret");
library("e1071");

#Load Dataset
data <- read.csv('On_Time_On_Time_Performance_2015_1.csv',header = TRUE);

#Replace NA values with zero
data$AirTime[is.na(data$AirTime)] <- 0

#Calculate Square root of flight time
data$sqrtAirTime <- sqrt(data$AirTime)

#We use preProcess function from caret package to apply center and scale transformations
#Notice that preProcess works on Data Frames or Matrix
#If you need to apply the transformation on a single column use as.data.frame

centerScaleModel <- preProcess(as.data.frame(data$sqrtAirTime),method = c("center","scale"));
AirTime <- predict(centerScaleModel,as.data.frame(data$sqrtAirTime));

centerScaleModelOrig <- preProcess(as.data.frame(data$AirTime),method = c("center","scale"));
AirTimeOrig <- predict(centerScaleModelOrig,as.data.frame(data$AirTime));

#Calculating the Skewness

skness <- apply(AirTime,2,skewness);
sknessOrig <- apply(AirTimeOrig,2,skewness);

#Here we plot the Actual and Transformed values for AirTime column, side by side
# par function is being used to create subplots (an alternative is layout())

par(mfrow=c(1,3));

hist(AirTime[,1],main = "Transformed Flight Time Histogram",xlab = "AirTime - Transformed(Sqrt)",col = "dark red",xlim = c(-2,5),ylim=c(0,150000));
text(5.0, 100000, paste("Skewness = ", round(skness, 2), sep = ''), pos = 2)

hist(AirTimeOrig[,1],main = "Flight Time Histogram Right Skewed",xlab = "AirTime",col = "dark blue",xlim = c(-2,5),ylim=c(0,150000));
text(5.0, 100000, paste("Skewness = ", round(sknessOrig, 2), sep = ''), pos = 2)

boxplot(AirTime,main = "Un-Skewed Distribution")

R Script Output:

Using Square Root of flight time instead of actual reduces skewness

Using Square Root of flight time instead of actual reduces skewness

Boxcox Transformation

Finding the right transformation to resolve Skewness can be tedious. Box and Cox in their 1964 paper proposed a statistical method to find the right transformation. They suggested using below family of transformations and finding the λ:

$latex x^{*} = \begin{cases}\frac{x^{\lambda}-1}{\lambda} & \lambda \neq 0\\log(x) & \lambda = 0\end{cases}&s=3$

Notice that because of the log term, this transformation requires x values to be positive. So, if there are zero and negative values, all values need to be shifted before applying this method.

In the following section we are going to add Boxcox transformation to our code in previous section:

#Airline Data Example
#Calculating Skewness Statistic For Flight Time Using Python
#You can download dataset from http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

import pandas as pd
import matplotlib.pylab as plt
from sklearn import preprocessing
from scipy.stats import skew
import numpy as np
from scipy.stats import boxcox
#First we import the data 
data = pd.read_csv('On_Time_On_Time_Performance_2015_1.csv') 

#Replace Missing Values with zero
data['AirTime'].fillna(0,inplace=True)

#The next line uses scale method from scikit-learn to transform the distribution 
#This will not impact Skewness Statistic calculation
#We have included this for sake of completion
#Note that we changed the following line to process the square roots instead of actuals
AirTime = preprocessing.scale(np.sqrt(data['AirTime'])) 

#Note that we shift the values by 1 to get rid of zeros
AirTimeBoxCox = preprocessing.scale(boxcox(data['AirTime']+1)[0])
AirTimeOrig = preprocessing.scale(data['AirTime'])

#Next We calculate Skewness using skew in spicy.stats
skness = skew(AirTime)
sknessBoxCox = skew(AirTimeBoxCox)
sknessOrig = skew(AirTimeOrig)
#We draw the histograms 
figure = plt.figure()
figure.add_subplot(131)   
plt.hist(AirTime,facecolor='red',alpha=0.75) 
plt.xlabel("AirTime - Transformed(Using Sqrt)") 
plt.title("Transformed AirTime Histogram") 
plt.text(2,100000,"Skewness: {0:.2f}".format(skness)) 

figure.add_subplot(132) 
plt.hist(AirTimeBoxCox,facecolor='blue',alpha=0.75) 
plt.xlabel("AirTime - Using BoxCox Transformation") 
plt.title("AirTime Histogram - Un-Skewed(BoxCox)") 
plt.text(2,100000,"Skewness: {0:.2f}".format(sknessBoxCox))

figure.add_subplot(133) 
plt.hist(AirTimeOrig,facecolor='green',alpha=0.75) 
plt.xlabel("AirTime - Based on Original Flight Times") 
plt.title("AirTime Histogram - Right Skewed") 
plt.text(2,100000,"Skewness: {0:.2f}".format(sknessOrig))

plt.show()

Python Script Output:

Comparison of Skewness statistic for BoxCox transformation with original Flight Times

Comparison of Skewness statistic for BoxCox transformation with original Flight Times

#Airline Data Example
#Centering & Scaling Flight Time using R
# You can download the data from http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

#We use functions in Caret package for transformations
library("caret");
library("e1071");

#Load Dataset
data <- read.csv('On_Time_On_Time_Performance_2015_1.csv',header = TRUE);

#Replace NA values with zero
data$AirTime[is.na(data$AirTime)] <- 0

#Calculate BoxCox Transformation
pr = BoxCoxTrans(y =(data$AirTime+1) )
data$AirTimeBC <- predict(pr,data$AirTime)

#Calculate Square root of flight time
data$sqrtAirTime <- sqrt(data$AirTime)

#We use preProcess function from caret package to apply center and scale transformations
#Notice that preProcess works on Data Frames or Matrix
#If you need to apply the transformation on a single column use as.data.frame

centerScaleModel <- preProcess(as.data.frame(data$sqrtAirTime),method = c("center","scale"));
AirTime <- predict(centerScaleModel,as.data.frame(data$sqrtAirTime));

centerScaleModelBC <- preProcess(as.data.frame(data$AirTimeBC),method = c("center","scale"));
AirTimeBoxCox <- predict(centerScaleModelBC,as.data.frame(data$AirTimeBC));

centerScaleModelOrig <- preProcess(as.data.frame(data$AirTime),method = c("center","scale"));
AirTimeOrig <- predict(centerScaleModelOrig,as.data.frame(data$AirTime));

#Calculating the Skewness

skness <- apply(AirTime,2,skewness);
sknessBoxCox <- apply(AirTimeBoxCox,2,skewness);
sknessOrig <- apply(AirTimeOrig,2,skewness);

#Here we plot the Actual and Transformed values for AirTime column, side by side
# par function is being used to create subplots (an alternative is layout())

par(mfrow=c(1,3));

hist(AirTime[,1],main = "Transformed Flight Time Histogram",xlab = "AirTime - Transformed(Sqrt)",col = "dark red",xlim = c(-2,5),ylim=c(0,150000));
text(5.0, 100000, paste("Skewness = ", round(skness, 2), sep = ''), pos = 2)

hist(AirTimeBoxCox[,1],main = "Transformed Flight Time Histogram(BoxCox)",xlab = "AirTime - Transformed(BoxCox)",col = "dark blue",xlim = c(-2,5),ylim=c(0,150000));
text(5.0, 100000, paste("Skewness = ", round(sknessBoxCox, 2), sep = ''), pos = 2)

hist(AirTimeOrig[,1],main = "Flight Time Histogram Right Skewed",xlab = "AirTime",col = "dark green",xlim = c(-2,5),ylim=c(0,150000));
text(5.0, 100000, paste("Skewness = ", round(sknessOrig, 2), sep = ''), pos = 2)

R Script Output:

Comparison of Skewness Statistic for Flight Time with transformed Flight Time with Box Cox method

Comparison of Skewness Statistic for Flight Time with transformed Flight Time with Box Cox method

Interesting Finding: Poor performance of BoxCox transformation in Scipy.Stats

Python Script comes up with λ = ~0.46 and reduces the Skewness statistics for this data to 0.08. This is not as good as λ = 0.5 (or basically Square Root transformation) that BoxCox transformation method in Caret package of R provided. I’ll need to research why this happens?

Further Readings:

7 responses

  1. Regarding your last comment: The implementation in Python / scipy uses the brent optimizer to maximize the log-likelihood for the boxcox-transformation. The optimizer uses an iterative process which ends when a) the maximum number of iterations is reached (defaults to 500) or b) the improvement in log-likelihood between iterations becomes very small. It’s basically a trade-off between accuracy and speed / performance. This might explain your sub-optimal result of lambda = 0.46.

    Hope this helps!

Leave a Reply

Your email address will not be published. Required fields are marked *