Data Preparation for Predictive Modeling: Resolving Outliers

Outliers or anomalies can impact the accuracy of predictive models. Detecting outliers and dealing with them is a critical step in data preparation for predictive modeling. In this post we discuss the PCA technique for detecting outliers in multivariate datasets; other methods will be covered in future posts. The example below shows how an outlier can impact the overall fit of a linear regression model:

[Figure: impact of an outlier on the fit of a predictive model]
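To make the effect concrete, here is a minimal sketch with simulated data (my own illustration, not the post's original figure): fitting a simple linear regression with and without a single extreme point shows how much the fitted coefficients can shift.

#Illustrative only: simulate a linear relationship and inject one extreme point
set.seed(42)
x <- 1:20
y <- 2 * x + rnorm(20, sd = 1)

y_outlier <- y
y_outlier[20] <- 100          #replace the last response with an extreme value

fit_clean <- lm(y ~ x)
fit_outlier <- lm(y_outlier ~ x)

coef(fit_clean)               #slope close to the true value of 2
coef(fit_outlier)             #slope and intercept pulled toward the outlier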

This post is a continuation of my series on data preparation for predictive modeling. In this post we discuss techniques for resolving outliers.

Outliers: What are they?

We generally define outliers as samples that are exceptionally far from the mainstream of the data. There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise. There are various methods of outlier detection: some are graphical, such as normal probability plots, while others are model-based.

There are two main types of outliers. Chambers (1986) refers to the two types as representative and nonrepresentative. A representative outlier is one that is a correct or valid observation that “cannot be regarded as unique”. While this type of outlier is considered an extreme value, it should be retained, with special treatment during the analysis stages. A nonrepresentative outlier is one that is an “incorrect observation” (i.e., due to an error in data entry, coding, or measurement) or is considered unique because there are no other values like it in the population. Nonrepresentative outliers should be corrected or excluded from the analysis (Chambers, 1986).

Some predictive modeling techniques, such as linear regression, are particularly sensitive to the presence of outliers.

How to detect outliers?

There are several approaches for detecting outliers. Charu Aggarwal, in his book Outlier Analysis, classifies outlier detection models into the following groups:

  • Extreme Value Analysis: This is the most basic form of outlier detection and is only suitable for one-dimensional data. These analyses assume that values which are too large or too small are outliers. The Z-test and Student’s t-test are examples of such statistical methods. They are good heuristics for an initial look at the data but have little value in multivariate settings; they can, however, be used as a final step for interpreting the outputs of other outlier detection methods (see the short sketch after this list).
  • Probabilistic and Statistical Models: These models assume a specific distribution for the data and estimate its parameters, typically using expectation-maximization (EM) methods. They then calculate the probability of membership of each data point in the fitted distribution; points with a low probability of membership are marked as outliers.
  • Linear Models: These methods project the data onto a lower-dimensional subspace using linear correlations. The distance of each data point from the plane that fits the subspace is then calculated and used to find outliers. PCA (Principal Component Analysis) is an example of a linear model for anomaly detection.
  • Proximity-based Models: The idea behind these methods is to model outliers as points that are isolated from the rest of the observations. Cluster analysis, density-based analysis, and nearest-neighbor methods are the main approaches of this kind.
  • Information Theoretic Models: These methods rely on the fact that outliers increase the minimum code length required to describe a data set.
  • High-Dimensional Outlier Detection: Specific methods designed to handle high-dimensional, sparse data.
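As a quick illustration of extreme value analysis on one-dimensional data, here is a small sketch (with simulated values, my own addition): points whose z-score exceeds a chosen cutoff, commonly 3, are flagged as outliers.

#Illustrative sketch: flag one-dimensional outliers using z-scores
set.seed(1)
v <- c(rnorm(100, mean = 50, sd = 5), 120)   #one injected extreme value

z <- (v - mean(v)) / sd(v)
which(abs(z) > 3)    #indices of points more than 3 standard deviations from the mean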

In this post we discuss one of the linear methods, PCA (Principal Component Analysis), for outlier detection.

Outlier Detection Using Principal Component Analysis

Principal component analysis (PCA) is a statistical procedure that uses a transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set.

Using PCA we can map our n-dimensional dataset (of possibly correlated variables) to a k-dimensional subspace of k uncorrelated components (k <= n).
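Before applying this to the diamonds data, a quick sketch on the built-in mtcars data (my own example, not from the original post) shows what prcomp returns: the rotation matrix holds the component directions (the eigenvectors used in the score formula below) and the squared standard deviations are the component variances.

#Quick look at what prcomp produces, using the built-in mtcars data
pc <- prcomp(mtcars, center = TRUE, scale. = TRUE)

head(pc$rotation)    #eigenvectors (loadings) defining each principal component
pc$sdev^2            #variance of each component
summary(pc)          #proportion of variance explained per component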

Below are steps for detecting anomalies using PCA:

  • First, we map the data set from its original n-dimensional space to a k-dimensional subspace using PCA
  • Next, we calculate the centroid of the data points (μ)
  • Next, we calculate the variance of each component (λ)
  • Then, we calculate the score of each data point using the formula below:

$latex Score(X) = \sum_{j} \frac{|(X-\mu)\cdot e_{j}|^{2}}{\lambda_{j}} $

  •  Finally, we use extreme value analysis methods to find data points with extreme scores

In the following section we apply PCA to the diamonds dataset (from the ggplot2 package) to find outliers.

#Required packages: psych for dummy coding, caret for preProcess, ggplot2 for the diamonds data
library(psych)
library(caret)
library(ggplot2)

#First we load the diamonds dataset
data(diamonds)

#Next we convert the categorical variables (cut, color, clarity) to dummy variables
#We use dummy.code from the psych package
codedData <- cbind(diamonds,dummy.code(diamonds$cut),dummy.code(diamonds$color),dummy.code(diamonds$clarity))

#Drop the original factor columns and one dummy level from each factor
codedData <- codedData[,-c(2,3,4,15,22,30)]

# Next we use the prcomp method to compute the principal components
pr <- prcomp(codedData,center = TRUE, scale. = TRUE)

#Next we center and scale the data (to match prcomp) and calculate its centroid, mu
transformedData <- predict(preProcess(codedData,method = c("center","scale")),codedData)
mu <- colMeans(transformedData)

#Next we calculate distances from the centroid
distFromMu <- sweep(transformedData,2,mu,'-')

#We work with the absolute values of the distances
distFromMu <- abs(distFromMu)

#Then we calculate the variance of each principal component
lam <- apply(pr$x,2,var)

# Next we multiply the distances by the eigenvectors and square the result
nominator <- (as.matrix(distFromMu)%*%as.matrix(pr$rotation))^2

# Then we divide the result by the variances
Res <- sweep(nominator,2,lam,'/')

#Calculate the sum of each row to get the outlier score of each observation
scores <- rowSums(Res)

par(mfrow = c(1,2))

#Create a histogram of the scores
hist(scores,breaks = 10000,xlim=c(0,600))

#Draw a boxplot of the scores
boxplot(scores,horizontal = TRUE)

# Find the observation with the maximum score
maxScore <- which.max(scores)
diamonds[maxScore,]

R Script Output:

[Figure: histogram and boxplot of the PCA outlier scores for the diamonds dataset]

Most data points have scores between 100 and 200, but there are some outliers with much higher scores.
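To identify them programmatically rather than by eyeballing the histogram, one simple extreme value rule (my own addition, using the standard boxplot whisker cutoff) is to flag scores above Q3 + 1.5 * IQR:

#Flag scores beyond the upper whisker of the boxplot (Q3 + 1.5 * IQR)
cutoff <- quantile(scores, 0.75) + 1.5 * IQR(scores)
outlierIdx <- which(scores > cutoff)
length(outlierIdx)    #number of flagged observations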

The table below shows observation 24068 in the diamonds dataset, which has the maximum outlier score:

[Table: diamonds[24068, ], the observation with the maximum outlier score]

As you can see, something is clearly incorrect with the y value; it looks like the depth value has been incorrectly used for y. You can check other records with extreme outlier scores.
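A simple way to do that (a small sketch of mine, continuing from the script above) is to sort the scores and inspect the highest-scoring rows:

#Inspect the ten observations with the largest outlier scores
topOutliers <- order(scores, decreasing = TRUE)[1:10]
diamonds[topOutliers,]
scores[topOutliers]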

In the next post we will implement one of the proximity-based models, the Local Outlier Factor (LOF).

Further Readings:

Charu Aggarwal’s Website

Survey of Outlier Analysis Methods
