My first attempt with Local Outlier Factor (LOF): Identifying Density-Based Local Outliers

One of the challenges in data analysis in general, and predictive modeling in particular, is dealing with outliers. There are many modeling techniques that are resistant to outliers or reduce their impact, but detecting outliers and understanding them can still lead to interesting findings. In a previous post I used one of the linear methods to find outliers in the Diamonds dataset. In this post I will implement the Local Outlier Factor (LOF) method, which was presented in this paper. Note that at this stage we are not comparing these different outlier detection methods; we are just exploring different approaches and trying to learn them through implementation.

In the previous post we presented the framework that Charu Aggarwal outlines in his book Outlier Analysis for categorizing outlier detection methods:

  • Extreme Value Analysis: This is the most basic form of outlier detection and is only suitable for one-dimensional data. In this type of analysis, values which are too large or too small are assumed to be outliers. The Z-test and Student's t-test are examples of these statistical methods. They are good heuristics for an initial analysis of the data, but they do not have much value in multivariate settings. They can be used as a final step for interpreting the outputs of other outlier detection methods.
  • Probabilistic and Statistical Models: These models assume a specific distribution for the data. Then, using expectation-maximization (EM) methods, they estimate the parameters of the model. Finally, they calculate the probability of membership of each data point in the fitted distribution. Points with a low probability of membership are marked as outliers.
  • Linear Models: These methods project the data onto lower-dimensional subspaces using linear correlations. The distance of each data point to the plane that fits the subspace is then calculated and used to find outliers. PCA (Principal Component Analysis) is an example of a linear model for anomaly detection.
  • Proximity-based Models: The idea behind these methods is to model outliers as points which are isolated from the rest of the observations. Cluster analysis, density-based analysis and nearest-neighbor analysis are the main approaches of this kind.
  • Information Theoretic Models: These methods rely on the fact that outliers increase the minimum code length needed to describe a data set.
  • High-Dimensional Outlier Detection: Specific methods to handle high-dimensional, sparse data.

The PCA method, which we discussed in the previous post, is one of the linear-model methods for anomaly detection. The LOF method, which we implement in this post, is one of the proximity-based methods.

Proximity-based methods can be classified into three categories: 1) cluster-based methods, 2) distance-based methods, and 3) density-based methods.

Cluster-based methods assign the data to clusters and count points which are not members of any of the known clusters as outliers. Distance-based methods, on the other hand, are more granular and use the distances between individual points to find outliers.

The Local Outlier Factor method discussed in this post is one of the density-based methods. Consider the figure below:

[Figure: local outlier factor example, with a sparse cluster C1, a dense cluster C2, and an outlying point O2 near C2]

Distance-based approaches will have trouble finding an outlier like point O2, because the points in cluster C1 are less dense compared to cluster C2: a distance threshold loose enough to tolerate the spacing within C1 will miss O2, while a threshold tight enough to capture O2 will count many of the points in C1 as outliers.

Cluster-based approaches have a similar problem, because they only consider the distance between a point and the centroid of its cluster when calculating the outlier score. Density-based approaches, and especially the LOF approach discussed here, are sensitive to local densities, which makes them more appropriate for finding local outliers.
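
To see this concretely, below is a minimal sketch using scikit-learn's LocalOutlierFactor, which implements the same idea we build by hand later in this post. The cluster locations, sizes and the n_neighbors value are made up purely for illustration:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
C1 = rng.normal(loc=0.0, scale=5.0, size=(40, 2))    #sparse cluster
C2 = rng.normal(loc=10.0, scale=0.1, size=(100, 2))  #dense cluster
O2 = np.array([[10.7, 10.7]])                        #sits just outside the dense cluster
X = np.vstack([C1, C2, O2])

lofModel = LocalOutlierFactor(n_neighbors=20)
labels = lofModel.fit_predict(X)             #-1 marks outliers, 1 marks inliers
scores = -lofModel.negative_outlier_factor_  #higher score = more outlying

#O2 is far relative to the tight spacing inside C2, so it gets a high LOF score,
#even though its absolute distance to C2 is comparable to typical point-to-point
#distances inside the sparse cluster C1
print(labels[-1], scores[-1])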

Below are the main steps for calculating outlier scores using LOF:

  1. First we find the K nearest neighbors of each point in the dataset. Selecting the right K is discussed in the paper.
  2. We call the maximum of the distances to the K nearest points found in the previous step the k-distance. For example, if for the first point we used K=3 and its 3 nearest neighbors are at distances 1.2, 2.5 and 6.4, the k-distance for this point is 6.4.
  3. Next, using a neighborhood of MinPts points, we calculate the reach-distance of each point p with respect to each of its neighbors o:

$latex \text{reach-dist}_{k}(p, o) = \max \{ k\text{-distance}(o),\ d(p, o) \} $

  4. Then we calculate the local reachability density (lrd) of each point using the formula below:

$latex lrd_{MinPts}(p) = 1 \Big/ \left( \frac{\sum_{o \in N_{MinPts}(p)} \text{reach-dist}_{MinPts}(p,o)}{|N_{MinPts}(p)|} \right) $

  5. Finally, we calculate the LOF score of each point using the formula below (a small worked sketch of all three formulas follows):

$latex LOF_{MinPts}(p) = \frac{\sum_{o \in N_{MinPts}(p)} \frac{lrd_{MinPts}(o)}{lrd_{MinPts}(p)}}{|N_{MinPts}(p)|} $
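
To make these formulas concrete, here is a compact NumPy sketch that follows them literally. The toy array X and the choice MinPts = 3 are purely illustrative; the full implementation on the Diamonds data follows in the next section:

import numpy as np

#Tiny made-up dataset: a tight cluster of four points plus one far-away point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0]])
MinPts = 3
n = len(X)

#Pairwise Euclidean distance matrix
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

#Indices of the MinPts nearest neighbors of each point (column 0 of the sort is the point itself)
nn = np.argsort(D, axis=1)[:, 1:MinPts + 1]
rows = np.arange(n)[:, None]

#k-distance of each point = distance to its MinPts-th nearest neighbor
k_dist = D[np.arange(n), nn[:, -1]]

#reach-dist(p, o) = max(k-distance(o), d(p, o)) for each neighbor o of p
reach = np.maximum(k_dist[nn], D[rows, nn])

#lrd(p) = 1 / (average reach-distance from p to its MinPts neighbors)
lrd = MinPts / reach.sum(axis=1)

#LOF(p) = average of lrd(o) / lrd(p) over the neighbors o of p
lof_scores = (lrd[nn].sum(axis=1) / MinPts) / lrd

print(np.round(lof_scores, 2))  #the isolated last point scores well above 1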

In the following section we implement the above steps in Python. We will be using the scikit-learn package to find the nearest neighbors:
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import matplotlib.patches as mPatch
from matplotlib.legend_handler import HandlerLine2D

#knn takes the dataset and calculates the k nearest neighbors and their distances
def knn(df,k):
    nbrs = NearestNeighbors(n_neighbors=k)
    nbrs.fit(df)
    distances, indices = nbrs.kneighbors(df)
    return distances, indices

#reachDist approximates the reach-distance of each point with respect to its
#MinPts nearest neighbors. The three smallest distances in each row (the point
#itself and its two closest neighbors) are replaced by the row maximum, i.e.
#the point's MinPts-distance, a simplification of
#reach-dist(p,o) = max(k-distance(o), d(p,o))
def reachDist(df,MinPts,knnDist):
    nbrs = NearestNeighbors(n_neighbors=MinPts)
    nbrs.fit(df)
    distancesMinPts, indicesMinPts = nbrs.kneighbors(df)
    distancesMinPts[:,0] = np.amax(distancesMinPts,axis=1)
    distancesMinPts[:,1] = np.amax(distancesMinPts,axis=1)
    distancesMinPts[:,2] = np.amax(distancesMinPts,axis=1)
    return distancesMinPts, indicesMinPts

#lrd calculates the Local Reachability Density: the inverse of the average
#reach-distance to the MinPts nearest neighbors
def lrd(MinPts,knnDistMinPts):
    return (MinPts/np.sum(knnDistMinPts,axis=1))

#Finally, lof calculates the LOF outlier scores: for each point, the average
#ratio of its neighbors' lrd to its own lrd
def lof(Ird,MinPts,dsts):
    lof=[]
    for item in dsts:
       tempIrd = np.divide(Ird[item[1:]],Ird[item[0]])
       lof.append(tempIrd.sum()/MinPts)
    return lof

#We flag anything with an outlier score greater than 1.2 as an outlier
#This is just for charting purposes
def returnFlag(x):
    if x['Score']>1.2:
       return 1
    else:
       return 0

#Read the file to data frame
data = pd.read_csv('codedDiamonds.csv')

#You can change the value below to try different MinPts
m=15

knndist, knnindices = knn(data,3)
reachdist, reachindices = reachDist(data,m,knndist)
irdMatrix = lrd(m,reachdist)
lofScores = lof(irdMatrix,m,reachindices) 
scores= pd.DataFrame(lofScores,columns=['Score'])
mergedData=pd.merge(data,scores,left_index=True,right_index=True)
mergedData['flag'] = mergedData.apply(returnFlag,axis=1)
Outliers = mergedData[(mergedData['flag']==1)]
Normals = mergedData[(mergedData['flag']==0)]

#Below section creates the charts

line1, = plt.plot([1], marker='o', label='Regular',linestyle='None',color='blue')
line2, = plt.plot([1], marker='*', label='Outlier',linestyle='None',color='red')

fig=plt.figure(dpi=80, facecolor='w', edgecolor='k')
fig.legend((line2,line1),('Outliers','Regular'),loc=1,numpoints=1,ncol=2)

#First we draw Carat vs Table vs Price
#We show outliers with *

ax1 = plt.subplot2grid((1,1), (0,0),projection='3d')
ax1.scatter(Outliers['carat'],Outliers['table'],Outliers['price']/1000,c='r',marker='*')
ax1.scatter(Normals['carat'],Normals['table'],Normals['price']/1000,c='b',marker='o')
ax1.set_xlabel('Carat')
ax1.set_ylabel('Table')
ax1.set_zlabel('Price(K)')
ax1.set_title('Outliers Vs. Rest\nCarat, Table, Price View')

plt.tight_layout()
plt.show()

#Next we draw X vs Y vs Z
#We show outliers with *
fig=plt.figure(dpi=80, facecolor='w', edgecolor='k')
fig.legend((line2,line1),('Outliers','Regular'),loc=1,numpoints=1,ncol=2)


ax2 = plt.subplot2grid((1,1), (0,0),projection='3d')
ax2.scatter(Outliers['x'],Outliers['y'],Outliers['z'],c='r',marker='*')
ax2.scatter(Normals['x'],Normals['y'],Normals['z'],c='b',marker='o')
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_zlabel('z')
ax2.set_title('Outliers Vs. Rest\nX,Y, Z View')

plt.tight_layout()
plt.show()

#Finally we draw the histogram of scores

fig=plt.figure(dpi=80, facecolor='w', edgecolor='k')


ax3 = plt.subplot2grid((1,1), (0,0))
ax3.hist(mergedData['Score'],bins=100,facecolor='cornflowerblue')
ax3.set_xlabel('LOF Score')
ax3.set_ylabel('Frequency')
ax3.set_title('Outlier Scores')



plt.tight_layout()
plt.show()

Python Script Output:

LOF Scores: Carat, Table, Price View

LOF Scores: X, Y, Z View

LOF Outlier Scores Histogram

Normal observations have LOF outlier scores close to one, while a handful of observations receive much higher scores.
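
To inspect those rows, the highest-scoring observations can be listed directly from the merged data frame built by the script above; showing the top 5 is just an illustrative choice:

#Show the observations with the highest LOF scores
print(mergedData.nlargest(5,'Score'))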

There is a detailed section in the paper that discusses the selection of MinPts and the interpretation of outlier scores. We will not cover that in this post; you can refer to the original paper for the details.
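
One practical idea in that spirit is to compute LOF over a range of MinPts values and keep the maximum score per point. Below is a rough sketch of that, using scikit-learn's LocalOutlierFactor on the same data frame; the range of 10 to 50 in steps of 10 is an arbitrary choice for illustration:

from sklearn.neighbors import LocalOutlierFactor

maxScores = None
for minPts in range(10, 51, 10):
    model = LocalOutlierFactor(n_neighbors=minPts)
    model.fit(data)
    scoresForMinPts = -model.negative_outlier_factor_
    maxScores = scoresForMinPts if maxScores is None else np.maximum(maxScores, scoresForMinPts)
#maxScores now holds, for each row of data, its largest LOF score across the range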

References:

  • Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. (2000). LOF: Identifying Density-Based Local Outliers. Proceedings of the ACM SIGMOD International Conference on Management of Data.
  • Aggarwal, C. C. Outlier Analysis. Springer.
