- Centering & Scaling
- Resolving Skewness
- Resolving Outliers
- Data Reduction & Feature Extraction
- Imputation & Dealing With Missing Values
- Removing Predictors
- Adding Predictors

In this first part, we discuss the Centering & Scaling transformations.

Centering subtracts the mean of a predictor from every observation, so the transformed values have a mean of *Zero*. Scaling divides each observation's value by the predictor's standard deviation, so the transformed values have a standard deviation of *One*.
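The two steps can be sketched by hand in a few lines of Python. This is only an illustration: the flight times below are made-up values, not taken from the dataset, and `center_and_scale` is a hypothetical helper name. The sketch uses the population standard deviation (`pstdev`), which is what scikit-learn's `preprocessing.scale` uses; R's `sd()` uses the sample (n-1) version, so results from the two tools differ very slightly.

```python
from statistics import mean, pstdev

def center_and_scale(values):
    """Return (v - mean) / std for every value (hypothetical helper)."""
    mu = mean(values)        # centering: the value subtracted from each observation
    sigma = pstdev(values)   # scaling: population standard deviation (ddof = 0)
    return [(v - mu) / sigma for v in values]

# Made-up flight times in minutes, for illustration only
air_time = [45.0, 60.0, 95.0, 120.0, 180.0]
z = center_and_scale(air_time)

# After the transformation the mean is (numerically) zero and the std is one
print("mean:", round(mean(z), 6), "std:", round(pstdev(z), 6))
```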

When & why we need Centering & Scaling (Standardization):

- Standardization is recommended when building regression models. When predictors have different units and ranges, the fitted coefficients end up on very different scales (some of them very small), which makes the model difficult to interpret.
- Centering & Scaling improve the numerical stability of some models (e.g., PLS).
- Many predictive modeling techniques (PLS,…) use predictor variance to decide how much importance each predictor receives. Because predictors recorded on a larger numeric scale usually have higher variance than predictors recorded on a smaller scale, these models favor them. Centering & Scaling ensure that unit differences don’t affect predictor selection or the final model.
- For penalized regression models (Lasso, Ridge Regression,…), the penalty is calculated from the estimated coefficient of each predictor. Centering and scaling are critical for these models: without them, a predictor recorded on a larger numeric scale needs only a small coefficient and therefore incurs a smaller penalty, so the fitted model depends on the choice of units.
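The last point can be made concrete with a small sketch. For ridge regression with a single centered predictor `x`, the ridge slope equals the least-squares slope multiplied by `sum(x^2) / (sum(x^2) + lambda)`. The code below computes that factor for the same made-up flight times expressed in minutes and in hours; the data, the penalty value `lam`, and the helper names are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def center(values):
    m = mean(values)
    return [v - m for v in values]

def standardize(values):
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def ridge_shrinkage(x, lam):
    """Fraction of the least-squares slope kept by one-predictor ridge.

    Assumes x is already centered: ridge slope = OLS slope * sxx / (sxx + lam).
    """
    sxx = sum(v * v for v in x)
    return sxx / (sxx + lam)

# Made-up flight times, expressed in two different units
air_time_min = [45.0, 60.0, 95.0, 120.0, 180.0, 240.0]
air_time_hr = [v / 60.0 for v in air_time_min]
lam = 10.0  # arbitrary penalty strength

# Centered only: the amount of shrinkage depends on the unit
raw_min = ridge_shrinkage(center(air_time_min), lam)
raw_hr = ridge_shrinkage(center(air_time_hr), lam)

# Centered and scaled: the unit no longer matters
std_min = ridge_shrinkage(standardize(air_time_min), lam)
std_hr = ridge_shrinkage(standardize(air_time_hr), lam)

print("centered only :", round(raw_min, 4), round(raw_hr, 4))
print("standardized  :", round(std_min, 4), round(std_hr, 4))
```

With the same penalty, the predictor in minutes is barely shrunk while the same predictor in hours is shrunk heavily; once both are standardized, the shrinkage is identical.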

In order to illustrate the impact of Centering & Scaling transformations we will use the Airlines On-Time performance Dataset. There is a field in this dataset named **AirTime** which shows flight time in Minutes. We show the histograms, standard deviation and mean of this field before and after Centering & Scaling transformations:

    # Airline Data Example
    # Centering & Scaling Flight Time Using Python
    # You can download dataset from http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn import preprocessing

    # First we import the data
    data = pd.read_csv('On_Time_On_Time_Performance_2015_1.csv')

    # Replace missing values with zero
    # (assignment is used instead of inplace=True on a column selection)
    data['AirTime'] = data['AirTime'].fillna(0)

    # The next line uses the scale function from scikit-learn to center and scale the values
    AirTime = preprocessing.scale(data['AirTime'])

    # We draw the histograms side by side
    figure = plt.figure()

    ax1 = figure.add_subplot(121)
    plt.hist(data['AirTime'], facecolor='red', alpha=0.75)
    plt.xlabel("AirTime(Minutes)")
    plt.ylabel("Frequency")
    plt.title("Original Flight Time Histogram")
    ax1.text(300, 100000, "Mean: {0:.2f} \n Std: {1:.2f}".format(data['AirTime'].mean(), data['AirTime'].std()))

    ax2 = figure.add_subplot(122)
    plt.hist(AirTime, facecolor='blue', alpha=0.75)
    plt.xlabel("AirTime - Transformed")
    plt.title("Transformed AirTime Histogram")
    ax2.text(2, 100000, "Mean: {0:.2f} \n Std: {1:.2f}".format(AirTime.mean(), AirTime.std()))

    plt.show()

**Python Script Output:**

    # Airline Data Example
    # Centering & Scaling Flight Time using R
    # You can download the data from http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

    # We use functions in the caret package for transformations
    library("caret")

    # Load dataset
    data <- read.csv('On_Time_On_Time_Performance_2015_1.csv', header = TRUE)

    # Replace NA values with zero
    data$AirTime[is.na(data$AirTime)] <- 0

    # We use the preProcess function from the caret package to apply center and scale transformations
    # Notice that preProcess works on data frames or matrices
    # If you need to apply the transformation on a single column use as.data.frame
    centerScaleModel <- preProcess(as.data.frame(data$AirTime), method = c("center", "scale"))
    AirTime <- predict(centerScaleModel, as.data.frame(data$AirTime))

    # Here we plot the actual and transformed values for the AirTime column, side by side
    # The par function is used to create subplots (an alternative is layout())
    par(mfrow = c(1, 2))

    hist(data$AirTime, main = "Original Flight Time Histogram", xlab = "AirTime(Minutes)", col = "dark red", ylim = c(0, 150000))
    MySd <- sd(data$AirTime)
    MyMean <- mean(data$AirTime)
    text(600, 100000, paste("Mean = ", round(MyMean, 1), "\nStd.Dev = ", round(MySd, 1), sep = ''), pos = 2)

    hist(AirTime[,1], main = "Transformed Flight Time Histogram", xlab = "AirTime - Transformed", col = "dark blue", xlim = c(-2, 5), ylim = c(0, 150000))
    MySd <- sd(AirTime[,1])
    MyMean <- mean(AirTime[,1])
    text(5.0, 100000, paste("Mean = ", round(MyMean, 1), "\nStd.Dev = ", round(MySd, 1), sep = ''), pos = 2)

**R Script Output:**

Centering & Scaling are simple transformations, but they are critical steps before building some predictive models. Note that the same transformation must be applied to all data points, both Train and Test.
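One common way to keep the Train and Test transformations consistent is to estimate the mean and standard deviation on the training data only and reuse those two numbers for the test data. This is the pattern behind `preProcess` followed by `predict` in caret, and behind `StandardScaler.fit` followed by `transform` in scikit-learn. The sketch below shows the idea in plain Python; the data values and the helper names `fit_center_scale` and `apply_center_scale` are hypothetical.

```python
from statistics import mean, stdev

def fit_center_scale(train_values):
    """Estimate centering and scaling parameters from the training data only."""
    return mean(train_values), stdev(train_values)

def apply_center_scale(values, params):
    """Apply previously estimated parameters to any set of values."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

# Made-up flight times in minutes, for illustration only
train_air_time = [45.0, 60.0, 95.0, 120.0, 180.0]
test_air_time = [50.0, 150.0, 300.0]

params = fit_center_scale(train_air_time)                    # learned from Train
train_scaled = apply_center_scale(train_air_time, params)
test_scaled = apply_center_scale(test_air_time, params)      # reused on Test

print("train:", [round(v, 2) for v in train_scaled])
print("test :", [round(v, 2) for v in test_scaled])
```

The transformed test values will generally not have a mean of exactly zero or a standard deviation of exactly one; that is expected, because the parameters come from the training data.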
