Detecting Outliers In High Dimensional Data Sets

Many real-world data sets are very high dimensional; in many applications they contain hundreds or thousands of features. In those scenarios, because of the well-known curse of dimensionality, traditional outlier detection approaches such as PCA and LOF, which we discussed in previous posts, are not effective. In this post we discuss High Contrast Subspaces for Density-Based Outlier Ranking (HiCS), the method introduced by Keller, Müller, and Böhm in this paper, as an effective way to find outliers in high dimensional data sets.

Introduction

One of the challenges in data analysis in general, and predictive modeling in particular, is dealing with outliers. Many modeling techniques are resistant to outliers or reduce their impact, but detecting outliers and understanding them can still lead to interesting findings. We generally define outliers as samples that are exceptionally far from the mainstream of the data. There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.

There are several approaches for detecting outliers. Charu Aggarwal, in his book Outlier Analysis, classifies outlier detection models into the following groups:

  • Extreme Value Analysis: This is the most basic form of outlier detection and is only suitable for one-dimensional data. These analyses assume that values which are too large or too small are outliers. The z-test and Student's t-test are examples of such statistical methods. They are good heuristics for an initial look at the data but have little value in multivariate settings, although they can be used as a final step for interpreting the output of other outlier detection methods (see the short sketch after this list).
  • Probabilistic and Statistical Models: These models assume a specific distribution for the data, estimate the model parameters with expectation-maximization (EM), and then compute the probability of membership of each data point under the fitted distribution. Points with a low probability of membership are marked as outliers.
  • Linear Models: These methods project the data onto a lower-dimensional subspace using linear correlations, then compute the distance of each data point to the plane that fits the subspace. This distance is used to find outliers. PCA (Principal Component Analysis) is an example of a linear model for anomaly detection.
  • Proximity-based Models: The idea here is to model outliers as points which are isolated from the rest of the observations. Cluster analysis, density-based analysis and nearest-neighbor methods are the main approaches of this kind.
  • Information Theoretic Models: These methods exploit the fact that outliers increase the minimum code length required to describe a data set.
  • High-Dimensional Outlier Detection: Specific methods that handle high dimensional, sparse data.
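As a minimal illustration of the first category, the sketch below flags values whose absolute z-score exceeds 3 as univariate outliers. The synthetic data and the threshold of 3 are my own choices for illustration, not part of any specific method discussed above.

import numpy as np
from scipy import stats

# Synthetic 1-D sample: mostly standard normal values plus a few extreme points
values = np.concatenate([np.random.normal(0, 1, 1000), [8.5, -9.0, 12.0]])

# Flag points whose absolute z-score exceeds an (arbitrary) threshold of 3
z = np.abs(stats.zscore(values))
print(values[z > 3])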

In previous posts we discussed one linear method (PCA) and one proximity-based method (LOF). In this post we discuss why those methods do not work well in high dimensional data sets and introduce HiCS, a method that does.

The Curse of Dimensionality

As shown in a previous post, the Local Outlier Factor (LOF) method, like other proximity-based methods, uses all the features in the data set to compute each point's nearest neighborhood, the local density, and finally an outlier score for each data point.

There is a detailed proof in this paper showing that, as dimensionality increases, the distance to the nearest neighbor approaches the distance to the farthest neighbor. In other words, the contrast in distances to different data points vanishes. This means that applying nearest-neighborhood methods such as LOF to high dimensional data sets leads to outlier scores that are all close to each other:

$latex \lim_{|A|\rightarrow\infty} \left( \max_{z \in DB} dist_A(z,x) - \min_{z \in DB} dist_A(z,x) \right) = 0 $

$latex Score(x) \approx Score(y) \quad \forall \; x, y \in DB $
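A quick simulation (my own sketch, not from the paper) makes this concrete: it draws uniform random points in increasing dimensions and prints the relative contrast between the farthest and nearest neighbor of a query point, which shrinks toward zero as the dimensionality grows.

import numpy as np

np.random.seed(0)
for d in [2, 10, 100, 1000]:
    # 1000 uniform random points and one query point in d dimensions
    data = np.random.rand(1000, d)
    query = np.random.rand(d)
    dists = np.sqrt(((data - query) ** 2).sum(axis=1))
    # Relative contrast (max - min) / min approaches 0 as d grows
    print("d=%4d relative contrast=%.3f" % (d, (dists.max() - dists.min()) / dists.min()))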

As a result, LOF degenerates to an essentially random ranking for any data set with a sufficiently high number of dimensions. To address this, many outlier detection algorithms work with subspaces of the original space and compute outliers in lower-dimensional subspaces instead of the full space. The idea is to project the points onto lower-dimensional subspaces, compute outlier scores within those subspaces, and derive the final outlier score for each point by aggregating the scores from the different subspaces.

The main challenge with these approaches is selecting appropriate subspaces. Examining all subspaces is computationally expensive and can distort the results, while randomly selecting subspaces can also give incorrect results because the quality of the chosen subspaces is not guaranteed. In the following section we discuss the High Contrast Subspaces for Density-Based Outlier Ranking method, which was proposed in this paper and solves this problem effectively.

HiCS: High Contrast Subspaces for Density-Based Outlier Ranking

The proposed approach deals with the problem through the following steps (a skeletal sketch follows the list):

  1. First, we generate all 2-dimensional subspaces of the original space
  2. We calculate the contrast of the subspaces generated in the previous step and only keep subspaces whose contrast score is higher than a specific cut_off point
  3. We add one more dimension to each remaining subspace, recalculate the contrast scores, and again keep only the subspaces with scores above the cut_off point
  4. We repeat the previous step until no candidate subspaces remain
  5. Next, we calculate the outlier score of each point in every remaining subspace
  6. Finally, we average the outlier scores of each data point
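Here is a skeletal sketch of that pipeline. The helper names contrast_score, grow_by_one_dimension and lof_scores are hypothetical placeholders for the routines developed in the rest of this post, so this is an outline of the idea rather than a runnable implementation.

import itertools
import numpy as np

def hics(df, features, cut_off):
    # Step 1: start from all 2-dimensional subspaces
    candidates = [list(pair) for pair in itertools.combinations(features, 2)]
    selected = []
    # Steps 2-4: keep high-contrast subspaces and try growing them by one dimension
    while candidates:
        subspace = candidates.pop(0)
        if contrast_score(df, subspace) > cut_off:                        # hypothetical helper
            selected.append(subspace)
            candidates += grow_by_one_dimension(subspace, features)      # hypothetical helper
    # Steps 5-6: score every point in each kept subspace and average the scores
    scores = [lof_scores(df[subspace]) for subspace in selected]         # hypothetical helper
    return np.mean(scores, axis=0)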

One strength of this approach is that it separates the subspace selection and scoring steps. Next, we discuss the proposed approach for calculating the contrast of each subspace.

Calculating Contrast

This method considers a subspace to be a high contrast subspace if the marginal pdf of a variable is very different from the conditional pdf of that variable (conditioned on the other variables of the subspace). We use the Ann Thyroid data set from the UCI Machine Learning repository to illustrate this concept. The data set has 6 features, var_0000 through var_0005. The first figure below shows an example of a high contrast subspace and the second figure shows an example of a low contrast subspace.

Conditional pdf clearly differs from the marginal pdf (high contrast subspace: var_0002 & var_0003)

Conditional pdf and marginal pdf are almost the same (low contrast subspace: var_0000 & var_0003)

Here is the code used to generate the charts above:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from scipy.io import arff
import pandas as pd
from scipy import stats
import itertools
import random
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics

#We use arff from scipy.io to read the file
#Download the data from here http://ipd.kit.edu/~muellere/HiCS/

data, meta = arff.loadarff('ann_thyroid.arff')
df = pd.DataFrame(data)

plt.figure(0)
ax1 = plt.subplot2grid((3,3), (0,0), colspan=2)
ax2 = plt.subplot2grid((3,3), (1,0), colspan=2, rowspan=2,sharex=ax1)
ax3 = plt.subplot2grid((3,3), (1, 2), rowspan=2,sharey=ax2)


plt.suptitle("High Contrast Subspace: var_0002 & var_0003")

ax2.scatter(df['var_0003'],df['var_0002'])
df['var_0003'].hist(color='k', alpha=0.5, bins=30,ax=ax1,normed = True)
df['var_0002'].hist(color='k', alpha=0.5, bins=30,orientation="horizontal",ax=ax3)

var0002_index = df['var_0002'].rank()/max(df['var_0002'].rank())
df[(var0002_index>0.65)]['var_0003'].hist(color='r', alpha=0.2, bins=20,ax=ax1,normed = True,linestyle='dotted')
ax1.text(0.8,5,'p-value:<0.0001***',va="center", ha="center")
plt.show()

#The lines below calculate Welch's t-test and Kolmogorov-Smirnov statistics

t=stats.ttest_ind(df['var_0003'].values, df[(var0002_index>0.65)]['var_0003'].values, equal_var = False)
print "t-test"
print t.pvalue

k = stats.ks_2samp(df['var_0003'].values, df[(var0002_index>0.65)]['var_0003'].values)
print "k-test"
print k.pvalue

#Now we show a low contrast subspace
plt.figure(1)
ax1 = plt.subplot2grid((3,3), (0,0), colspan=2)
ax2 = plt.subplot2grid((3,3), (1,0), colspan=2, rowspan=2,sharex=ax1)
ax3 = plt.subplot2grid((3,3), (1, 2), rowspan=2,sharey=ax2)


plt.suptitle("Low Contrast Subspace: var_0000 & var_0003")

ax2.scatter(df['var_0000'],df['var_0003'])
df['var_0000'].hist(color='k', alpha=0.5, bins=30,ax=ax1,normed = True)
df['var_0003'].hist(color='k', alpha=0.5, bins=30,orientation="horizontal",ax=ax3)

var0003_index = df['var_0003'].rank()/max(df['var_0003'].rank())
df[(var0003_index>0.65)]['var_0000'].hist(color='r', alpha=0.2, bins=30,ax=ax1,normed = True,linestyle='dotted')
ax1.text(0.8,5,'p-value:<0.35',va="center", ha="center")
plt.show()

t=stats.ttest_ind(df['var_0000'].values, df[(var0003_index>0.65)]['var_0000'].values, equal_var = False)
print "t-test"
print t.pvalue

k=stats.ks_2samp(df['var_0000'].values, df[(var0003_index>0.65)]['var_0000'].values)
print "k-test"
print k.pvalue

As you can see, in the first case the p-value (from Welch's t-test or the Kolmogorov-Smirnov test) is very small (less than 0.0001), which means it is very unlikely that the marginal and conditional samples come from the same distribution. In the second case the p-value is large (roughly 0.8), so we cannot reject the hypothesis that the difference between the marginal and conditional pdfs is due to chance.
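As a quick sanity check on this reading of the p-value, independent of the thyroid data, the snippet below compares two samples drawn from the same normal distribution and two samples drawn from shifted distributions; the sample sizes and the shift are arbitrary choices of mine.

import numpy as np
from scipy import stats

np.random.seed(1)
a = np.random.normal(0, 1, 2000)
b = np.random.normal(0, 1, 2000)    # same distribution as a
c = np.random.normal(1.5, 1, 2000)  # shifted distribution

print(stats.ks_2samp(a, b).pvalue)  # large: cannot reject that a and b match
print(stats.ks_2samp(a, c).pvalue)  # essentially 0: a and c clearly differ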

We now generalize this way of calculating contrast to higher dimensional subspaces:

  1. We pick a random dimension of the subspace
  2. We calculate the distribution of the selected dimension over the whole subspace
  3. We select a random slice of the subspace by applying lower and upper bounds to the remaining variables of the subspace
  4. We calculate the distribution of the picked dimension using only the points that satisfy the conditions of the previous step
  5. We use a test such as Welch's t-test or the Kolmogorov-Smirnov test to calculate 1 - p-value
  6. We repeat the previous steps a pre-defined number of times and average the 1 - p-value scores
  7. The resulting average is the contrast score of the subspace
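As a minimal sketch, those steps can be wrapped in a standalone function like the one below. It mirrors the logic of the implementation in the next section, but the function and parameter names (contrast, index_df, draws) are my own; index_df is assumed to hold each column's normalized rank, as computed later in the full script.

import random
import numpy as np
from scipy import stats

def contrast(df, index_df, subspace, draws=50):
    alpha = pow(0.2, 1.0/len(subspace))          # width of the random slice
    total, cnt = 0.0, 0
    for _ in range(draws):
        v = random.choice(subspace)              # step 1: random dimension to compare
        rest = [c for c in subspace if c != v]
        lband = random.random()                  # step 3: random lower/upper bounds
        uband = lband + alpha
        mask = ((index_df[rest] > lband) & (index_df[rest] < uband)).all(axis=1)
        if mask.sum() == 0:
            continue
        # steps 2, 4 and 5: KS test between the marginal and conditional samples
        k = stats.ks_2samp(df[v].values, df[mask][v].values)
        if not np.isnan(k.pvalue):
            total += 1.0 - k.pvalue
            cnt += 1
    # steps 6 and 7: the average of 1 - p-value is the contrast score
    return total/cnt if cnt > 0 else 0.0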

Implementation

The following code applies the HiCS algorithm and plain LOF to the Ann Thyroid data set from the UCI Machine Learning repository. We calculated the AUC for both methods and obtained approximately 0.96 for HiCS versus approximately 0.9 for LOF. The reader can use the code to check the scores on higher dimensional data sets to see the improvement.

import matplotlib.pyplot as plt
plt.style.use('ggplot')
from scipy.io import arff
import pandas as pd
from scipy import stats
import itertools
import random
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics

#We use arff from scipy.io to read the file
#Download the data from here http://ipd.kit.edu/~muellere/HiCS/

data, meta = arff.loadarff('ann_thyroid.arff')
df = pd.DataFrame(data)

#The functions below are used to calculate LOF
def knn(df,k):
    nbrs = NearestNeighbors(n_neighbors=k)
    nbrs.fit(df)
    distances, indices = nbrs.kneighbors(df)
    return distances, indices

def reachDist(df,MinPts,knnDist):
    nbrs = NearestNeighbors(n_neighbors=MinPts)
    nbrs.fit(df)
    distancesMinPts, indicesMinPts = nbrs.kneighbors(df)
    distancesMinPts[:,0] = np.amax(distancesMinPts,axis=1)
    distancesMinPts[:,1] = np.amax(distancesMinPts,axis=1)
    distancesMinPts[:,2] = np.amax(distancesMinPts,axis=1)
    return distancesMinPts, indicesMinPts

def ird(MinPts,knnDistMinPts):
    return (MinPts/np.sum(knnDistMinPts,axis=1))

def lof(Ird,MinPts,dsts):
    lof=[]
    for item in dsts:
        tempIrd = np.divide(Ird[item[1:]],Ird[item[0]])
        lof.append(tempIrd.sum()/MinPts)
    return lof

#calculate the index, we use this for selecting random sections in subspace
index_df = (df.rank()/df.rank().max()).iloc[:,:-1]

#Below function is used to create possible subspaces in each step
def comboGenerator(startPoint,space,n):
    combosFinal=[]
    for item in itertools.combinations(list(set(space)-set(startPoint)),(n-len(startPoint))):
        combosFinal.append(sorted(startPoint+list(item)))
    return combosFinal

#We start with 2-D subspaces
listOfCombos = comboGenerator([],df.columns[:-1],2)
testedCombos=[]
selection=[]

#We calculate the contrast score for each subspace
#For each subspace that satisfies the cut_off point criteria
#We add additional dimensions

while(len(listOfCombos)>0):
    if listOfCombos[0] not in testedCombos:
        alpha1 = pow(0.2,(float(1)/float(len(listOfCombos[0]))))
        pvalue_Total = 0
        pvalue_cnt = 0
        avg_pvalue = 0
        for i in range(0,50):
            lband = random.random()
            uband = lband+alpha1
            v = random.randint(0,(len(listOfCombos[0])-1))
            rest = list(set(listOfCombos[0])-set([listOfCombos[0][v]]))
            k = stats.ks_2samp(df[listOfCombos[0][v]].values, df[((index_df[rest]<uband) & (index_df[rest]>lband)).all(axis=1)][listOfCombos[0][v]].values)
            if not(np.isnan(k.pvalue)):
                pvalue_Total = pvalue_Total+k.pvalue
                pvalue_cnt = pvalue_cnt+1
        if pvalue_cnt>0:
            avg_pvalue = pvalue_Total/pvalue_cnt
        if (1.0-avg_pvalue)>0.6:
            selection.append(listOfCombos[0])
            listOfCombos = listOfCombos + comboGenerator(listOfCombos[0],df.columns[:-1],(len(listOfCombos[0])+1))
        testedCombos.append(listOfCombos[0])
        listOfCombos.pop(0)
        listOfCombos = [list(t) for t in set(map(tuple,listOfCombos))]
    else:
        listOfCombos.pop(0)

#For each selected subspace we run LOF (with MinPts m = 50)
#and collect the outlier scores of every data point

scoresList=[]
for item in selection:
    m=50
    knndist, knnindices = knn(df[item],3)
    reachdist, reachindices = reachDist(df[item],m,knndist)
    irdMatrix = ird(m,reachdist)
    lofScores = lof(irdMatrix,m,reachindices)
    scoresList.append(lofScores)

#We calculate average LOF score for each data point from each subspace

avgs = np.nanmean(np.ma.masked_invalid(np.array(scoresList)),axis=0)

#We scale the results to 0,1 range
scaled_avgs = MinMaxScaler().fit_transform(avgs.reshape(-1,1))

#Here is the AUC score from HiCS
print "HCiS AUC Score"
print metrics.roc_auc_score(pd.to_numeric(df['class'].values),scaled_avgs)

#We calculate AUC score from simple LOF
m=50
knndist, knnindices = knn(df.iloc[:,:-1],3)
reachdist, reachindices = reachDist(df.iloc[:,:-1],m,knndist)
irdMatrix = ird(m,reachdist)
lofScores = lof(irdMatrix,m,reachindices)
ss=MinMaxScaler().fit_transform(np.array(lofScores).reshape(-1,1))
print "LOF AUC Score"
print metrics.roc_auc_score(pd.to_numeric(df['class'].values),ss)

Please note that the optimal value of the cut_off point needs to be determined through experimentation.
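One way to run that experiment is to refactor the selection and scoring code above into functions that take the cut-off as a parameter and compare the resulting AUC values. The helpers select_subspaces and score_subspaces below are hypothetical wrappers around the while-loop and the LOF-averaging block from the script above; this is a sketch of the workflow rather than additional tested code.

# select_subspaces: the while-loop above with the 0.6 threshold replaced by cut_off
# score_subspaces: the LOF scoring/averaging block above, returning one scaled score per point
labels = pd.to_numeric(df['class'].values)
for cut_off in [0.5, 0.6, 0.7, 0.8]:
    selection = select_subspaces(df, index_df, cut_off)
    scaled_avgs = score_subspaces(df, selection)
    print("cut_off=%.1f AUC=%.3f" % (cut_off, metrics.roc_auc_score(labels, scaled_avgs)))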

References & Further reading

Comments

  1. Hi,
    thanks for the code, however I do not understand why you use the asymptotic p value. The paper says the supremum of the difference in cumulative marginal and conditional distribution. I am implementing in matlab so I use the test statistic k and not the p.

    Can you please explain why you use p and subtract it from 1??

    Thanks
    Arvind

    • Hi Arvind,

      sorry for delay… It has been a busy week. You are right… The paper recommends using probability for t-test and supremum for KS test. However, the general idea is finding the probability that the distributions are the same and the differences are because of randomness and one of the strengths of method proposed in paper is that you can use alternative methods for quality criterion. I think 1-p should be a good proxy to check if distributions are the same.

      Thanks,
      ~Shahram
