Data Preparation for Predictive Modeling: Centering & Scaling

Data preparation is a critical first step in building high-performance predictive models. In this series of posts, Data Preparation for Predictive Modeling, we discuss important data preparation techniques that improve the performance of predictive models.

In this first part, we discuss the Centering & Scaling transformation.

Centering & Scaling

The centering transformation subtracts the mean of a predictor from every observation of that predictor, so the transformed values have a mean of zero. The scaling transformation divides each observation of the predictor by the predictor's standard deviation, so the transformed values have a standard deviation of one.
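As a minimal sketch of these two steps, the transformation can be written with the Python standard library alone. The `air_time` values and the `standardize` helper below are made up for illustration and are not taken from the dataset used later in this post.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center and scale a list of numbers to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (divides by n)
    if sigma == 0:
        raise ValueError("cannot scale a constant predictor")
    return [(v - mu) / sigma for v in values]


# Hypothetical flight times in minutes, for illustration only.
air_time = [45.0, 60.0, 120.0, 180.0, 315.0]
z = standardize(air_time)

print(round(mean(z), 6), round(pstdev(z), 6))
```

Whether the standard deviation divides by n or by n - 1 varies between tools: scikit-learn's `preprocessing.scale` uses the population form, while pandas `Series.std()` and R's `sd()` default to the sample form, so a reported standard deviation can differ very slightly from exactly one.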

When & Why we need Centering & Scaling (Standardization):

  • Standardization is recommended when building regression models. When predictors have different units and ranges, the fitted model ends up with very small coefficients for some predictors, which makes the model difficult to interpret.
  • Centering & Scaling improves the numerical stability of some models (e.g., PLS).
  • Many predictive modeling techniques (e.g., PLS) use predictor variance as an important factor when assigning importance to each predictor. Since predictors measured on larger numeric scales usually have higher variance than predictors measured on smaller scales, these models favor the larger-scale predictors. The Centering & Scaling transformation ensures that unit differences don’t impact predictor selection and the final model.
  • For penalized regression models (e.g., Lasso, Ridge Regression) the penalty is calculated from the estimated coefficient of each predictor. Centering and scaling is critical for those models because otherwise a predictor measured on a larger numeric scale needs a smaller coefficient, receives a lower penalty, and distorts the model.
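The variance and penalty points above can be illustrated with a small sketch. It assumes a single centered predictor and the closed-form ridge slope sum(x*y) / (sum(x*x) + lam); the simulated flight times, the penalty value `lam`, and the helper names are all hypothetical choices made for this demonstration.

```python
import random
from statistics import mean, pstdev, pvariance

random.seed(0)

# Hypothetical flight times: the same predictor expressed in minutes and in hours.
minutes = [random.gauss(120, 70) for _ in range(200)]
hours = [m / 60 for m in minutes]


def center(values):
    mu = mean(values)
    return [v - mu for v in values]


def standardize(values):
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def ridge_shrinkage(x, lam):
    """Ratio of the ridge slope to the least-squares slope for one centered predictor.

    The ridge slope is sum(x*y) / (sum(x*x) + lam) and the least-squares slope
    is sum(x*y) / sum(x*x), so the ratio is sum(x*x) / (sum(x*x) + lam).
    """
    sxx = sum(v * v for v in x)
    return sxx / (sxx + lam)


lam = 100.0  # arbitrary penalty strength chosen for the demonstration

# Variance depends on the unit: minutes carry 60**2 times the variance of hours.
variance_ratio = pvariance(minutes) / pvariance(hours)

# Without scaling, the same penalty shrinks the two versions very differently.
raw_minutes = ridge_shrinkage(center(minutes), lam)
raw_hours = ridge_shrinkage(center(hours), lam)

# After standardization both versions are shrunk by the same amount.
std_minutes = ridge_shrinkage(standardize(minutes), lam)
std_hours = ridge_shrinkage(standardize(hours), lam)

print(variance_ratio, raw_minutes, raw_hours, std_minutes, std_hours)
```

A ratio close to one means the coefficient is barely penalized. The predictor carries exactly the same information in both units, yet only the standardized versions are penalized equally.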

To illustrate the impact of the Centering & Scaling transformations, we use the Airlines On-Time Performance dataset. This dataset has a field named AirTime, which holds the flight time in minutes. We show the histogram, standard deviation, and mean of this field before and after the Centering & Scaling transformations:

#Airline Data Example
#Centering & Scaling Flight Time Using Python
#You can download dataset from http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing

#First we import the data 
data = pd.read_csv('On_Time_On_Time_Performance_2015_1.csv') 

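#Replace missing values with zero
#(assign the result back; inplace=True on a column selection may not update the DataFrame in recent pandas versions)
data['AirTime'] = data['AirTime'].fillna(0)

#The next line uses the scale function from scikit-learn to center and scale the values
AirTime = preprocessing.scale(data['AirTime'])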

#We draw the histograms side by side
figure = plt.figure()
ax1 = figure.add_subplot(121)
ax1.hist(data['AirTime'],facecolor='red',alpha=0.75)
ax1.set_xlabel("AirTime(Minutes)")
ax1.set_ylabel("Frequency")
ax1.set_title("Original Flight Time Histogram")
ax1.text(300,100000,"Mean: {0:.2f} \n Std: {1:.2f}".format(data['AirTime'].mean(),data['AirTime'].std()))

ax2 = figure.add_subplot(122)
ax2.hist(AirTime,facecolor='blue',alpha=0.75)
ax2.set_xlabel("AirTime - Transformed")
ax2.set_title("Transformed AirTime Histogram")
ax2.text(2,100000,"Mean: {0:.2f} \n Std: {1:.2f}".format(AirTime.mean(),AirTime.std()))
plt.show()

Python Script Output:

[Figure: Centering Scaling Airline Flight Time - Python]

Centering & Scaling does not change the shape of the distribution; only the location and spread of the values change.
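One way to check the claim about the shape numerically is to compare a shape statistic such as skewness before and after the transformation. The sketch below uses made-up flight times and hypothetical helper names; because centering and scaling is a linear transformation with a positive scale factor, the skewness should be the same up to floating-point error.

```python
from statistics import mean, pstdev


def standardize(values):
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def skewness(values):
    """Population skewness: the mean of the cubed standardized values."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])


# Hypothetical right-skewed flight times in minutes, for illustration only.
air_time = [45.0, 60.0, 75.0, 90.0, 120.0, 180.0, 315.0]

skew_before = skewness(air_time)
skew_after = skewness(standardize(air_time))

print(skew_before, skew_after)
```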

#Airline Data Example
#Centering & Scaling Flight Time using R
# You can download the data from http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

#We use functions in Caret package for transformations
library("caret");

#Load Dataset
data <- read.csv('On_Time_On_Time_Performance_2015_1.csv',header = TRUE);

#Replace NA values with zero
data$AirTime[is.na(data$AirTime)] <- 0


#We use preProcess function from caret package to apply center and scale transformations
#Notice that preProcess works on Data Frames or Matrix
#If you need to apply the transformation on a single column use as.data.frame

centerScaleModel <- preProcess(as.data.frame(data$AirTime),method = c("center","scale"));
AirTime <- predict(centerScaleModel,as.data.frame(data$AirTime));

#Here we plot the Actual and Transformed values for AirTime column, side by side
#par function is being used to create subplots (an alternative is layout())

par(mfrow=c(1,2));

hist(data$AirTime,main = "Original Flight Time Histogram",xlab = "AirTime(Minutes)",col = "dark red",ylim=c(0,150000));

MySd <- sd(data$AirTime)
MyMean <- mean(data$AirTime)

text(600, 100000, paste("Mean = ", round(MyMean, 1), "\nStd.Dev = ", round(MySd, 1), sep = ''), pos = 2)
hist(AirTime[,1],main = "Transformed Flight Time Histogram",xlab = "AirTime - Transformed",col = "dark blue",xlim = c(-2,5),ylim=c(0,150000));

MySd <- sd(AirTime[,1])
MyMean <- mean(AirTime[,1])

text(5.0, 100000, paste("Mean = ", round(MyMean, 1), "\nStd.Dev = ", round(MySd, 1), sep = ''), pos = 2)

R Script Output:

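Centering & Scaling are simple transformations, but they are critical steps before building some predictive models. Note that the same transformation needs to be applied to all data points (train and test): the mean and standard deviation are estimated from the training data and then reused to transform the test data.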
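A common way to keep train and test data consistent is to estimate the mean and standard deviation once and reuse them, which is the same pattern as `preProcess` followed by `predict` in the R script above. The sketch below assumes the parameters are estimated on the training split only; the data values and the `fit_standardizer` and `apply_standardizer` helpers are hypothetical.

```python
from statistics import mean, pstdev


def fit_standardizer(train_values):
    """Estimate the centering and scaling parameters from the training data."""
    return mean(train_values), pstdev(train_values)


def apply_standardizer(values, mu, sigma):
    """Transform any data set with previously estimated parameters."""
    return [(v - mu) / sigma for v in values]


# Hypothetical flight times in minutes, split into train and test.
train = [60.0, 120.0, 180.0]
test = [120.0, 240.0]

mu, sigma = fit_standardizer(train)
z_train = apply_standardizer(train, mu, sigma)
z_test = apply_standardizer(test, mu, sigma)  # reuses the training mean and std

print(z_train, z_test)
```

With this approach the transformed test values do not necessarily have a mean of exactly zero or a standard deviation of exactly one, which is expected: they are expressed on the scale learned from the training data.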
