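Anomaly detection aims to find the values in a dataset that are "out of the ordinary": values that differ substantially from the majority of the data, which we call "outliers". In practice these outliers are not labeled, so we need an algorithm that identifies them automatically. We characterize outliers as follows:

Outliers make up a small share of the data; the probability of producing one is very low.
The features of an outlier differ markedly from those of normal values.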

Data

The data in this post is sales data from a supermarket abroad; you can download it here.

%matplotlib inline
import pandas as pd
import numpy as np
from numpy import percentile
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib

from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from pyod.models.abod import ABOD
from pyod.models.cblof import CBLOF
from pyod.models.feature_bagging import FeatureBagging
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

df = pd.read_excel("./data/Superstore.xls")
df.head(3)

Distribution of the Sales variable

print(df.Sales.describe())
df['Sales'].hist()

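The distribution shows that Sales is heavily right-skewed with a long right tail, and the anomalous region of Sales should lie roughly within the red circle in the figure above. Next we look at the skewness and kurtosis of Sales. Skewness measures how asymmetric a distribution is (left-skewed, right-skewed, long-tailed, and so on), while kurtosis describes how peaked and heavy-tailed its shape is. For a detailed explanation, see this blog post.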

print("Skewness: %f" % df['Sales'].skew())
print("Kurtosis: %f" % df['Sales'].kurt())
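To make the two statistics more concrete, here is a minimal sketch that computes moment-based skewness and excess kurtosis with NumPy on synthetic data. The helper names `moment_skewness` and `excess_kurtosis` and the two samples are illustrative assumptions, not part of the original analysis. pandas' `skew()` and `kurt()` apply a sample-size bias correction, so their values will differ slightly from these plain moment estimates.

```python
import numpy as np

def moment_skewness(x):
    """Third standardized moment: about 0 for symmetric data, > 0 for a long right tail."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

def excess_kurtosis(x):
    """Fourth standardized moment minus 3, so a normal distribution gives about 0."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return (d ** 4).mean() / (d ** 2).mean() ** 2 - 3.0

rng = np.random.default_rng(0)
symmetric = rng.normal(size=10_000)          # symmetric, light tails
right_skewed = rng.exponential(size=10_000)  # long right tail, like Sales

print("symmetric   :", moment_skewness(symmetric), excess_kurtosis(symmetric))
print("right-skewed:", moment_skewness(right_skewed), excess_kurtosis(right_skewed))
```

A clearly positive skewness together with a large excess kurtosis is the pattern to expect from a variable such as Sales that has a long right tail.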

Distribution of Profit

print(df.Profit.describe())
sns.distplot(df['Profit'])
plt.title("Distribution of Profit")
sns.despine()

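From the figure above, the Profit distribution appears roughly symmetric and unimodal, with long tails on both sides of the mean. The right tail is longer than the left (the right-hand maximum is about 8,000 and the left-hand minimum is about -6,000), so overall Profit is slightly right-skewed. Data is least likely to fall in the two tails, so the anomalous regions should lie within the red circles on the left and right. Next we look at the skewness and kurtosis of Profit.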

print("Skewness: %f" % df['Profit'].skew())
print("Kurtosis: %f" % df['Profit'].kurt())

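Isolation Forest (IsolationForest)

Isolation Forest is a simple and effective algorithm for detecting outliers. It locates the anomalous regions within the data's distribution and scores every data point: values that fall in an anomalous region receive lower scores, and values outside it receive higher scores. You can refer to this article, in which the author randomly generates two normal distributions with means -2 and 2 and standard deviation 0.5, uses Isolation Forest to find the anomalous regions of the combined sample, and plots a score curve. Data inside the anomalous regions gets a low score and data outside them gets a high score: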

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# Two normal samples centered at -2 and 2, each with standard deviation 0.5
x = np.concatenate((np.random.normal(loc=-2, scale=.5,size=500), np.random.normal(loc=2, scale=.5, size=500)))

isolation_forest = IsolationForest(n_estimators=100)
isolation_forest.fit(x.reshape(-1, 1))
# Score an evenly spaced grid of x values
xx = np.linspace(-6, 6, 100).reshape(-1,1)
anomaly_score = isolation_forest.decision_function(xx)
outlier = isolation_forest.predict(xx)

plt.figure(figsize=(10,8))
plt.subplot(2,1,1)
# density=True replaces the normed argument, which newer matplotlib versions no longer accept
plt.hist(x, density=True)
plt.xlim([-5, 5])

plt.subplot(2,1,2)
plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score), where=outlier==-1, color='r', alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('x')
plt.xlim([-5, 5])
plt.show()
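The shaded region in the plot comes from `predict`, while the curve comes from `decision_function`. The short sketch below, which assumes scikit-learn's `IsolationForest` with default settings and a fixed `random_state` added here only for reproducibility, checks how the two relate: the points labeled -1 are the ones whose score is negative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Same shape of data as above: two normal samples centered at -2 and 2
sample = np.concatenate((rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)))

forest = IsolationForest(n_estimators=100, random_state=0)
forest.fit(sample.reshape(-1, 1))

grid = np.linspace(-6, 6, 100).reshape(-1, 1)
scores = forest.decision_function(grid)  # lower means more anomalous
labels = forest.predict(grid)            # -1 for outliers, 1 for inliers

# The outlier label is a thresholded view of the score at zero
same = np.array_equal(labels == -1, scores < 0)
print("labels agree with the sign of the score:", same)
```

This is why the red band in the lower panel starts and ends where the score curve crosses zero.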

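Detecting the anomalous region of Sales with Isolation Forest

Isolation Forest is an outlier detection algorithm that returns an anomaly score for each sample. It is based on the idea that anomalies are data points that are few in number and different in their features. Isolation Forest is a tree-based model: in each tree, partitions are created by first randomly selecting a feature and then selecting a random split value between the minimum and maximum of that feature. Below we use Isolation Forest to detect the anomalous region of Sales and plot the score curve: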

#Define the isolation forest
isolation_forest = IsolationForest(n_estimators=100)
#Fit it on the Sales data
isolation_forest.fit(df['Sales'].values.reshape(-1, 1))
#Build an evenly spaced grid between the minimum and maximum of Sales
xx = np.linspace(df['Sales'].min(), df['Sales'].max(), len(df)).reshape(-1,1)
#Compute the anomaly score for every grid point
anomaly_score = isolation_forest.decision_function(xx)
#Predict outliers (-1) and inliers (1)
outlier = isolation_forest.predict(xx)

plt.figure(figsize=(10,8))
plt.subplot(2,1,1)
sns.distplot(df['Sales'])
sns.despine()

plt.subplot(2,1,2)
plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score), 
                 where=outlier==-1, color='r', 
                 alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('Sales')
plt.show();

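In the figure above, Isolation Forest readily identifies the anomalous region of the Sales distribution (the pink rectangle) and produces a score curve. Data falling inside the pink rectangle gets a lower score, and data outside it gets a higher score. Next we find all sales records whose Sales value is an outlier.

print('Smallest anomalous Sales value:',df[df.Sales>=xx[outlier==-1].min()].Sales.min())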
df[df.Sales>=xx[outlier==-1].min()]
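The random-partition idea behind Isolation Forest can be illustrated with a small one-dimensional sketch. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the helper names `isolation_depth` and `mean_depth` are hypothetical, there is no subsampling, and the raw number of splits is used instead of the normalized score. It shows that a value far from the bulk of the data tends to be isolated after far fewer random splits.

```python
import numpy as np

def isolation_depth(values, target, rng, max_depth=50):
    """Count the random splits needed until `target` is alone in its partition."""
    current = values
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = current.min(), current.max()
        if lo == hi:  # only identical values left, nothing to split
            break
        split = rng.uniform(lo, hi)  # random split between min and max
        # Keep only the side of the split that contains the target
        if target < split:
            current = current[current < split]
        else:
            current = current[current >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
values = np.append(rng.normal(0, 1, 500), 12.0)  # one obvious outlier at 12

def mean_depth(target, n_trees=200):
    """Average the depth over many random trees."""
    return np.mean([isolation_depth(values, target, rng) for _ in range(n_trees)])

typical_value = values[np.argmin(np.abs(values))]  # the value closest to 0
outlier_depth = mean_depth(12.0)
inlier_depth = mean_depth(typical_value)
print("average depth for the outlier:", outlier_depth)
print("average depth for a typical value:", inlier_depth)
```

A short average path means a point is easy to isolate, which Isolation Forest turns into a more anomalous (lower) decision score.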

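Detecting the anomalous regions of Profit with Isolation Forest

Below we use Isolation Forest to detect the anomalous regions of Profit and plot the score curve:

#Define the isolation forest
isolation_forest = IsolationForest(n_estimators=100)
#Fit it on the Profit data
isolation_forest.fit(df['Profit'].values.reshape(-1, 1))
xx = np.linspace(df['Profit'].min(), df['Profit'].max(), len(df)).reshape(-1,1)
#Compute the anomaly score for every grid point
anomaly_score = isolation_forest.decision_function(xx)
#Predict outliers (-1) and inliers (1)
outlier = isolation_forest.predict(xx)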

plt.figure(figsize=(10,8))
plt.subplot(2,1,1)
sns.distplot(df['Profit'])
sns.despine()

plt.subplot(2,1,2)
plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score), 
                 where=outlier==-1, color='r', 
                 alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('Profit')
plt.show();

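In the figure above, Isolation Forest readily detects the anomalous regions on both the left and right sides of the Profit distribution and produces a score curve. Data falling inside the pink rectangles on either side gets a lower score, and data outside them gets a higher score. Next we look at the smallest outlier on the right side and the largest outlier on the left side.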

x1=xx[outlier==-1]
right_min=x1[x1>0].min()
left_max = x1[x1<0].max()
print('Smallest outlier on the right:',df[df.Profit>=right_min].Profit.min())
df[df.Profit>=right_min].head(10)

print('Largest outlier on the left:',df[df.Profit<=left_max].Profit.max())
df[df.Profit<=left_max].head(10)

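The two visualizations above show the anomaly scores and highlight the regions where outliers lie. The anomaly score reflects the shape of the underlying distribution, and the anomalous regions correspond to low-probability regions. So far, however, we have only analyzed Sales and Profit separately, one variable at a time. On closer inspection, some of the outliers identified by our model are outliers only in a statistical sense and may not be anomalies in our business context. For example, some very profitable orders may simply involve products with high margins: they are outliers in the statistical distribution but should not count as anomalies in practice. Below we look at the scatter plot of Sales against Profit and fit a line to the two variables. Points that deviate strongly from the fitted line can be regarded as outliers, and judging outliers this way is closer to the actual business context.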

sns.regplot(x="Sales", y="Profit", data=df)
sns.despine();
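One simple way to turn "deviates strongly from the fitted line" into a rule is to standardize the residuals of the linear fit. The sketch below is a minimal illustration on synthetic data; the generated `sales` and `profit` arrays, the three planted anomalies, and the cutoff of 3 standard deviations are all assumptions chosen for the example, not values from the Superstore data.

```python
import numpy as np

rng = np.random.default_rng(42)
sales = rng.uniform(0, 1000, 300)
profit = 0.2 * sales + rng.normal(0, 10, 300)  # profit roughly proportional to sales
profit[:3] = [-400.0, 500.0, -350.0]           # three planted anomalies

# Least-squares line profit = slope * sales + intercept
slope, intercept = np.polyfit(sales, profit, deg=1)
residuals = profit - (slope * sales + intercept)

# Standardize the residuals and flag those more than 3 standard deviations out
z = (residuals - residuals.mean()) / residuals.std()
flagged = np.where(np.abs(z) > 3)[0]
print("indices flagged as outliers:", flagged)
```

Unlike the univariate analysis, a high-profit order is flagged here only when its profit is out of line with its sales amount.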

CBLOF(Cluster-based Local Outlier Factor)

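CBLOF computes outlier scores based on a cluster-based local outlier factor.
CBLOF takes as input the dataset and the cluster model produced by a clustering algorithm. It uses the parameters alpha and beta to divide the clusters into small clusters and large clusters. The anomaly score of a point is then computed from the size of the cluster the point belongs to and its distance to the nearest large cluster. We use the PyOD library to implement CBLOF.

Normalize Sales and Profit, scaling them to the range 0 to 1.
Based on experience, set the outlier fraction to 1%.
Fit the CBLOF model to the data and predict the results.
Use a threshold to decide whether a data point is an inlier or an outlier.
Use the decision function to compute the outlier score of each point.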
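The thresholding step in the list above can be sketched with NumPy alone. The example assumes a score where higher means more anomalous and uses made-up exponential scores; PyOD derives its labels from the training scores in a broadly similar way, and the code in this post applies the same idea to negated scores, which is why it takes the lower percentile instead.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.exponential(scale=1.0, size=1000)  # hypothetical scores, higher = more anomalous
contamination = 0.01                            # assumed outlier fraction of 1%

# The threshold is the score below which (1 - contamination) of the points fall
threshold = np.percentile(scores, 100 * (1 - contamination))
labels = (scores > threshold).astype(int)       # 1 = outlier, 0 = inlier

print("threshold:", threshold)
print("outliers flagged:", labels.sum(), "of", len(labels))
```

With 1,000 points and a 1% fraction, the rule flags the 10 highest-scoring points regardless of how extreme they actually are, so the fraction is a modeling choice rather than something learned from the data.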

cols = ['Sales', 'Profit']
minmax = MinMaxScaler(feature_range=(0, 1))
print(df[cols].head())
print('---------------------')
df[['Sales','Profit']] = minmax.fit_transform(df[['Sales','Profit']])
print(df[['Sales','Profit']].head())

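The code below draws on two articles: an example comparing all implemented outlier detection models, and a tutorial on learning anomaly detection in Python with the PyOD library.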

#Combine Sales and Profit into a two-column numpy array
X1 = df['Sales'].values.reshape(-1,1)
X2 = df['Profit'].values.reshape(-1,1)
X = np.concatenate((X1,X2),axis=1)

#Set the outlier fraction
outliers_fraction = 0.01
xx , yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))

#Define the CBLOF model
clf = CBLOF(contamination=outliers_fraction,check_estimator=False, random_state=0)

#Fit the model
clf.fit(X)
# Compute outlier scores, negated so that lower means more anomalous
scores_pred = clf.decision_function(X) * -1

# Predict whether each point is an inlier (0) or an outlier (1)
y_pred = clf.predict(X)
n_inliers = len(y_pred) - np.count_nonzero(y_pred)
n_outliers = np.count_nonzero(y_pred == 1)

plt.figure(figsize=(8, 8))

df1 = df
df1['outlier'] = y_pred.tolist()
    
#Select the inlier values of Sales and Profit
inliers_sales = np.array(df1['Sales'][df1['outlier'] == 0]).reshape(-1,1)
inliers_profit = np.array(df1['Profit'][df1['outlier'] == 0]).reshape(-1,1)

#Select the outlier values of Sales and Profit
outliers_sales = df1['Sales'][df1['outlier'] == 1].values.reshape(-1,1)
outliers_profit = df1['Profit'][df1['outlier'] == 1].values.reshape(-1,1)

print('Number of outliers:',n_outliers,'Number of inliers:',n_inliers)

# Set the threshold that separates inliers from outliers
threshold = np.percentile(scores_pred, 100 * outliers_fraction)

#Compute the (negated) outlier score for every point of the grid
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
Z = Z.reshape(xx.shape)
#Shade the region from the minimum score up to the threshold in graded blues
plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),cmap=plt.cm.Blues_r)

#Draw a red contour where the score equals the threshold
a = plt.contour(xx, yy, Z, levels=[threshold],linewidths=2, colors='red')

#Fill the region from the threshold to the maximum score in orange
plt.contourf(xx, yy, Z, levels=[threshold, Z.max()],colors='orange')
b = plt.scatter(inliers_sales, inliers_profit, c='white',s=20, edgecolor='k')

c = plt.scatter(outliers_sales, outliers_profit, c='black',s=20, edgecolor='k')

plt.axis('tight')
#legend_elements() is used because newer matplotlib versions no longer provide a.collections
plt.legend([a.legend_elements()[0][0], b,c], ['decision function', 'inliers','outliers'],
           prop=matplotlib.font_manager.FontProperties(size=20),loc='lower right')
      
plt.xlim((0, 1))
plt.ylim((0, 1))
plt.title('CBLOF(Cluster-based Local Outlier Factor)')
plt.show();
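To make the CBLOF scoring rule described earlier more concrete, here is a deliberately simplified sketch built on scikit-learn's `KMeans`. It is not PyOD's implementation: the function name `simplified_cblof_scores` is hypothetical, only the alpha criterion is used to pick the large clusters (beta is ignored), and scores are not weighted by cluster size. Points in large clusters are scored by the distance to their own centroid, and points in small clusters by the distance to the nearest large-cluster centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

def simplified_cblof_scores(X, n_clusters=3, alpha=0.9, random_state=0):
    """Simplified CBLOF-style scores: higher means more anomalous."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    labels, centers = km.labels_, km.cluster_centers_

    # Sort clusters by size; the largest ones covering a share alpha of the data are "large"
    sizes = np.bincount(labels, minlength=n_clusters)
    order = np.argsort(-sizes)
    n_large = np.searchsorted(np.cumsum(sizes[order]), alpha * len(X)) + 1
    large = order[:n_large]

    # Distance to the own centroid and to the nearest large-cluster centroid
    d_own = np.linalg.norm(X - centers[labels], axis=1)
    d_large = np.linalg.norm(X[:, None, :] - centers[large][None, :, :], axis=2).min(axis=1)
    return np.where(np.isin(labels, large), d_own, d_large)

rng = np.random.default_rng(0)
blob_a = rng.normal([0, 0], 1.0, size=(200, 2))
blob_b = rng.normal([5, 5], 1.0, size=(200, 2))
far_points = rng.normal([10, -10], 0.3, size=(3, 2))  # a tiny, distant group
X_demo = np.vstack([blob_a, blob_b, far_points])

demo_scores = simplified_cblof_scores(X_demo)
top3 = np.argsort(demo_scores)[-3:]
print("indices with the highest scores:", sorted(top3.tolist()))
```

The three distant points form a small cluster of their own, so they are scored by their distance to a large cluster and end up with the highest scores.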

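Histogram-based Outlier Detection (HBOS)

Histogram-based outlier detection (HBOS) is an efficient unsupervised method. It assumes the features are independent and measures the degree of anomaly by building histograms. In multivariate anomaly detection, a histogram is computed for each individual feature, each feature is scored separately, and the scores are combined at the end. With the PyOD library, the code is similar to that for CBLOF.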

outliers_fraction = 0.01
xx , yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
clf = HBOS(contamination=outliers_fraction)
clf.fit(X)

scores_pred = clf.decision_function(X) * -1
        

y_pred = clf.predict(X)
n_inliers = len(y_pred) - np.count_nonzero(y_pred)
n_outliers = np.count_nonzero(y_pred == 1)
plt.figure(figsize=(8, 8))

df1 = df
df1['outlier'] = y_pred.tolist()
    
inliers_sales = np.array(df1['Sales'][df1['outlier'] == 0]).reshape(-1,1)
inliers_profit = np.array(df1['Profit'][df1['outlier'] == 0]).reshape(-1,1)
    
outliers_sales = df1['Sales'][df1['outlier'] == 1].values.reshape(-1,1)
outliers_profit = df1['Profit'][df1['outlier'] == 1].values.reshape(-1,1)
         
print('Number of outliers:',n_outliers,'Number of inliers:',n_inliers)

threshold = np.percentile(scores_pred, 100 * outliers_fraction)
        
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),cmap=plt.cm.Blues_r)

a = plt.contour(xx, yy, Z, levels=[threshold],linewidths=2, colors='red')

plt.contourf(xx, yy, Z, levels=[threshold, Z.max()],colors='orange')
b = plt.scatter(inliers_sales, inliers_profit, c='white',s=20, edgecolor='k')
    
c = plt.scatter(outliers_sales, outliers_profit, c='black',s=20, edgecolor='k')
       
plt.axis('tight')  
     
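#legend_elements() is used because newer matplotlib versions no longer provide a.collections
plt.legend([a.legend_elements()[0][0], b,c], ['decision function', 'inliers','outliers'],
           prop=matplotlib.font_manager.FontProperties(size=20),loc='lower right')

plt.xlim((0, 1))
plt.ylim((0, 1))
plt.title('Histogram-based Outlier Detection (HBOS)')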
plt.show();
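The histogram idea behind HBOS can be sketched in a few lines of NumPy. This is a simplified illustration with fixed-width bins, not PyOD's implementation; the function name `simplified_hbos_scores`, the bin count, and the synthetic data are assumptions. Each feature gets its own histogram, and a point's score is the sum over features of the negative log density of the bin it falls into.

```python
import numpy as np

def simplified_hbos_scores(X, n_bins=10, eps=1e-12):
    """Sum of per-feature negative log histogram densities: higher = more anomalous."""
    n_samples, n_features = X.shape
    scores = np.zeros(n_samples)
    for j in range(n_features):
        density, edges = np.histogram(X[:, j], bins=n_bins, density=True)
        # Map each value to its bin using the interior edges only,
        # so the maximum value lands in the last bin
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
        scores += -np.log(density[idx] + eps)
    return scores

rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, size=(500, 2)), [[8.0, 8.0]]])  # last row is far away

demo_scores = simplified_hbos_scores(X_demo)
print("score of the distant point:", demo_scores[-1])
print("median score:", np.median(demo_scores))
```

Because the features are scored independently, a point is penalized for being rare in each feature on its own; unusual combinations of individually common values are not detected by this approach.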

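Isolation Forest

Isolation Forest works on a principle similar to random forests in that it is built on decision trees. It isolates observations by randomly selecting a feature and then randomly selecting a split value between that feature's minimum and maximum. PyOD's Isolation Forest module is a wrapper around scikit-learn's Isolation Forest with additional functionality. The code is very similar to the CBLOF code above.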

outliers_fraction = 0.01
xx , yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
clf = IForest(contamination=outliers_fraction,random_state=0)
clf.fit(X)

scores_pred = clf.decision_function(X) * -1
        
y_pred = clf.predict(X)
n_inliers = len(y_pred) - np.count_nonzero(y_pred)
n_outliers = np.count_nonzero(y_pred == 1)
plt.figure(figsize=(8, 8))

df1 = df
df1['outlier'] = y_pred.tolist()

inliers_sales = np.array(df1['Sales'][df1['outlier'] == 0]).reshape(-1,1)
inliers_profit = np.array(df1['Profit'][df1['outlier'] == 0]).reshape(-1,1)

outliers_sales = df1['Sales'][df1['outlier'] == 1].values.reshape(-1,1)
outliers_profit = df1['Profit'][df1['outlier'] == 1].values.reshape(-1,1)
         
print('Number of outliers: ',n_outliers,'Number of inliers: ',n_inliers)
        
threshold = np.percentile(scores_pred, 100 * outliers_fraction)

Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),cmap=plt.cm.Blues_r)
        
a = plt.contour(xx, yy, Z, levels=[threshold],linewidths=2, colors='red')
        
plt.contourf(xx, yy, Z, levels=[threshold, Z.max()],colors='orange')
b = plt.scatter(inliers_sales, inliers_profit, c='white',s=20, edgecolor='k')
    
c = plt.scatter(outliers_sales, outliers_profit, c='black',s=20, edgecolor='k')
       
plt.axis('tight')
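#legend_elements() is used because newer matplotlib versions no longer provide a.collections
plt.legend([a.legend_elements()[0][0], b,c], ['decision function', 'inliers','outliers'],
           prop=matplotlib.font_manager.FontProperties(size=20),loc='lower right')

plt.xlim((0, 1))
plt.ylim((0, 1))
plt.title('Isolation Forest')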
plt.show();

KNN(K - Nearest Neighbors)

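pyod.models.knn.KNN is used for outlier detection. For a data point, its distance to its k-th nearest neighbor can be used as its outlier score, which can be seen as a way of measuring density. The code is very similar to the CBLOF code above.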

outliers_fraction = 0.01
xx , yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
clf = KNN(contamination=outliers_fraction)
clf.fit(X)

scores_pred = clf.decision_function(X) * -1
        
y_pred = clf.predict(X)
n_inliers = len(y_pred) - np.count_nonzero(y_pred)
n_outliers = np.count_nonzero(y_pred == 1)
plt.figure(figsize=(8, 8))

df1 = df
df1['outlier'] = y_pred.tolist()
    
inliers_sales = np.array(df1['Sales'][df1['outlier'] == 0]).reshape(-1,1)
inliers_profit = np.array(df1['Profit'][df1['outlier'] == 0]).reshape(-1,1)
    
outliers_sales = df1['Sales'][df1['outlier'] == 1].values.reshape(-1,1)
outliers_profit = df1['Profit'][df1['outlier'] == 1].values.reshape(-1,1)
         
print('Number of outliers: ',n_outliers,'Number of inliers: ',n_inliers)
        
threshold = np.percentile(scores_pred, 100 * outliers_fraction)
        
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),cmap=plt.cm.Blues_r)
        
a = plt.contour(xx, yy, Z, levels=[threshold],linewidths=2, colors='red')
        
plt.contourf(xx, yy, Z, levels=[threshold, Z.max()],colors='orange')
b = plt.scatter(inliers_sales, inliers_profit, c='white',s=20, edgecolor='k')
    
c = plt.scatter(outliers_sales, outliers_profit, c='black',s=20, edgecolor='k')
       
plt.axis('tight')  
   
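#legend_elements() is used because newer matplotlib versions no longer provide a.collections
plt.legend([a.legend_elements()[0][0], b,c], ['decision function', 'inliers','outliers'],
           prop=matplotlib.font_manager.FontProperties(size=20),loc='lower right')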
      
plt.xlim((0, 1))
plt.ylim((0, 1))
plt.title('K Nearest Neighbors (KNN)')
plt.show();
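The k-th nearest neighbor score is easy to reproduce by brute force, which may help in reading the plot above. The sketch below is a minimal NumPy illustration rather than PyOD's implementation; the function name `kth_neighbor_distance`, the choice of k = 5, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def kth_neighbor_distance(X, k=5):
    """Distance from each point to its k-th nearest neighbor (brute force)."""
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    # After sorting each row, column 0 is the distance to the point itself (0),
    # so column k is the distance to the k-th nearest other point
    return np.sort(dist, axis=1)[:, k]

rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, size=(200, 2)), [[10.0, 10.0]]])  # last row is isolated

demo_scores = kth_neighbor_distance(X_demo, k=5)
print("index with the largest score:", int(np.argmax(demo_scores)))
print("its distance to the 5th neighbor:", demo_scores.max())
```

Points in dense areas have close neighbors and small scores, while an isolated point has to reach far to find its k-th neighbor. The brute-force distance matrix grows quadratically with the number of rows, so this form is only suitable for small datasets.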