In previous post we presented the framework that Charu Aggarwal in his book Outlier Analysis outlines for outlier detection methods:

**Extreme Value Analysis:**This is the most basic form of outlier detection and only good for 1-dimension data. In these types of analysis, it is assumed that values which are too large or too small are outliers. Z-test and Student’s t-test are examples of these statistical methods. These are good heuristics for initial analysis of data but they don’t have much value in multivariate settings. They can be used as final steps for interpreting outputs of other outlier detection methods.**Probabilistic and Statistical Models**: These models assume specific distributions for data. Then using the expectation-maximization(EM) methods they estimate the parameters of the model. Finally, they calculate probability of membership of each data point to calculated distribution. The points with low probability of membership are marked as outliers.**Linear Models:**These methods model the data into a lower dimensional sub-spaces with the use of linear correlations. Then the distance of each data point to plane that fits the sub-space is being calculated. This distance is used to find outliers. PCA(Principal Component Analysis) is an example of linear models for anomaly detection.**Proximity-based Models:**The idea with these methods is to model outliers as points which are isolated from rest of observations. Cluster analysis, density based analysis and nearest neighborhood are main approaches of this kind.**Information Theoretic Models**: The idea of these methods is the fact that outliers increase the minimum code length to describe a data set.**High-Dimensional Outlier Detection:**Specifc methods to handle high dimensional sparse data

PCA method which we discussed in previous post is one of the Linear Model methods for anomaly detection. The LOF method which we implement in this post is one of the proximity based methods.

Proximity based methods can be classified in 3 categories: **1) Cluster based methods 2)Distance based methods 3) Density based methods**

Cluster based methods classify data to different clusters and count points which are not members of any of known clusters as outliers. Distance based methods in the other hand are more granular and use the distance between individual points to find outliers.

Local Outlier Factor method discussed in this post is one of density based methods. Consider below figure:

Distance based approaches will have problem finding an outlier like point O2. Because the points in cluster C1 are less dense compare to cluster C2. If we chose a large threshold to capture an outlier like O2, many of the points in C1 will be counted as outliers.

Cluster based approaches have similar problems. Because they only consider the distance between point and centroid of cluster to calculate outlier score. The density based approaches and specially LOF approach discussed here are sensitive to densities and those approaches are more appropriate for calculating local outliers.

Below are main steps for calculating outlier score using LOF:

- First we find the K-nearest neighbors of each point in dataset. Selecting the right K has been discussed in the paper
- We call the max distance to K-nearest points that we found in previous step K-distance. For example, for the first point if used K=3 and found the 3 nearest neighbors have distances of 1.2, 2.5 and 6.4 the k-distance for this point will be 6.4.
- Next, for certain number of points (MinPts) we calculate the reach-distance:

$latex reach-dist_{k}(p, o) = max \{ k-distance(o), d(p, o) \} $

4. Then we calculate the local reachability density of each point using below formula:

$latex lrd_{MinPts}(p) = 1/(\frac{\sum_{o \in N_{MinPts}(p)}reach-dist_{MinPts}(p,o)}{N_{MinPts}(p)}) $

$latex LOF_{MinPts}(p)=\frac{\sum_{o \in N_{MinPts}(p)}\frac{lrd_{MinPts}(p)}{lrd_{MinPts}(o)}}{N_{MinPts}(p)} $

import numpy as np import pandas as pd from sklearn.neighbors import NearestNeighbors from mpl_toolkits.mplot3d import Axes3D import matplotlib.pyplot as plt import matplotlib.patches as mPatch from matplotlib.legend_handler import HandlerLine2D#knn function gets the dataset and calculates K-Nearest neighbors and distancesdef knn(df,k): nbrs = NearestNeighbors(n_neighbors=3) nbrs.fit(df) distances, indices = nbrs.kneighbors(df) return distances, indices#reachDist calculates the reach distance of each point to MinPts around itdef reachDist(df,MinPts,knnDist): nbrs = NearestNeighbors(n_neighbors=MinPts) nbrs.fit(df) distancesMinPts, indicesMinPts = nbrs.kneighbors(df) distancesMinPts[:,0] = np.amax(distancesMinPts,axis=1) distancesMinPts[:,1] = np.amax(distancesMinPts,axis=1) distancesMinPts[:,2] = np.amax(distancesMinPts,axis=1) return distancesMinPts, indicesMinPts#lrd calculates the Local Reachability Densitydef lrd(MinPts,knnDistMinPts): return (MinPts/np.sum(knnDistMinPts,axis=1))#Finally lof calculates lot outlier scoresdef lof(Ird,MinPts,dsts): lof=[] for item in dsts: tempIrd = np.divide(Ird[item[1:]],Ird[item[0]]) lof.append(tempIrd.sum()/MinPts) return lof#We flag anything with outlier score greater than 1.2 as outlier#This is just for charting purposesdef returnFlag(x): if x['Score']>1.2: return 1 else: return 0#Read the file to data framedata = pd.read_csv('codedDiamonds.csv')#You can change below value for different MinPtsm=15 knndist, knnindices = knn(data,3) reachdist, reachindices = reachDist(data,m,knndist) irdMatrix = lrd(m,reachdist) lofScores = lof(irdMatrix,m,reachindices) scores= pd.DataFrame(lofScores,columns=['Score']) mergedData=pd.merge(data,scores,left_index=True,right_index=True) mergedData['flag'] = mergedData.apply(returnFlag,axis=1) Outliers = mergedData[(mergedData['flag']==1)] Normals = mergedData[(mergedData['flag']==0)]#Below section creates the chartsline1, = plt.plot([1], marker='o', label='Regular',linestyle='None',color='blue') line2, = plt.plot([1], marker='*', label='Outlier',linestyle='None',color='red') fig=plt.figure(dpi=80, facecolor='w', edgecolor='k') fig.legend((line2,line1),('Outliers','Regular'),loc=1,numpoints=1,ncol=2)#First we draw Carat vs Table vs Price#We show outliers with *ax1 = plt.subplot2grid((1,1), (0,0),projection='3d') ax1.scatter(Outliers['carat'],Outliers['table'],Outliers['price']/1000,c='r',marker='*') ax1.scatter(Normals['carat'],Normals['table'],Normals['price']/1000,c='b',marker='o') ax1.set_xlabel('Carat') ax1.set_ylabel('Table') ax1.set_zlabel('Price(K)') ax1.set_title('Outliers Vs. Rest\nCarat, Table, Price View') plt.tight_layout() plt.show()#Next we draw X vs Y vs Z#We show outliers with *fig=plt.figure(dpi=80, facecolor='w', edgecolor='k') fig.legend((line2,line1),('Outliers','Regular'),loc=1,numpoints=1,ncol=2) ax2 = plt.subplot2grid((1,1), (0,0),projection='3d') ax2.scatter(Outliers['x'],Outliers['y'],Outliers['z'],c='r',marker='*') ax2.scatter(Normals['x'],Normals['y'],Normals['z'],c='b',marker='o') ax2.set_xlabel('x') ax2.set_ylabel('y') ax2.set_zlabel('z') ax2.set_title('Outliers Vs. Rest\nX,Y, Z View') plt.tight_layout() plt.show()#Finally we draw the histogram of scoresfig=plt.figure(dpi=80, facecolor='w', edgecolor='k') ax3 = plt.subplot2grid((1,1), (0,0)) ax3.hist(mergedData['Score'],bins=100,facecolor='cornflowerblue') ax3.set_xlabel('LOF Score') ax3.set_ylabel('Frequency') ax3.set_title('Outlier Scores') plt.tight_layout() plt.show()

**Python Script Output:**

Normal observations have LOF outlier scores close to one. Table below shows couple of observations with highest outlier scores:

Observation | carat | depth | table | price | x | y | z | Type | Grade | Grade 2 | Score |

24068 | 2 | 58.9 | 57 | 12210 | 8.09 | 58.9 | 8.06 | Premium | H | SI2 | 4.748578106 |

48411 | 0.51 | 61.8 | 54.7 | 1970 | 5.12 | 5.15 | 31.8 | Very Good | E | VS1 | 4.022188399 |

There is a detailed section in the paper that discusses selection of MinPts and interpretation of outlier scores. We will not be covering them in this post. You can refer to original paper for the details.

References:

## 3 responses