Outlier detection on a real data set

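This example illustrates the need for robust covariance estimation on a real data set. Robust estimation is useful both for outlier detection and for a better understanding of the data structure.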

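We selected two pairs of variables from the Boston housing data set to illustrate what kind of analysis can be done with several outlier detection tools. For the purpose of visualization we work with two-dimensional examples, but one should be aware that things are not so trivial in high dimension, as pointed out in the examples.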

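In both examples below, the main result is that the empirical covariance estimate, being non-robust, is strongly influenced by the heterogeneous structure of the observations. The robust covariance estimate is able to focus on the main mode of the data distribution, but it sticks to the assumption that the data are Gaussian distributed, which yields a somewhat biased estimate of the data structure that is nevertheless accurate to some extent. The One-Class SVM does not assume any parametric form for the data distribution and can therefore model the complex shape of the data much better.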
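
The contrast between the two covariance estimators can be tried on a small synthetic data set. The sketch below is only an illustration under stated assumptions: the data are made up (a Gaussian main cluster plus a smaller distant cluster), not taken from the housing data, and the sizes and positions of the clusters are arbitrary choices. It compares `EmpiricalCovariance` with `MinCovDet`, the Minimum Covariance Determinant estimator that `EllipticEnvelope` relies on.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

# Synthetic data (an assumption for illustration, not the housing data):
# 200 Gaussian inliers around the origin plus 40 points in a distant cluster.
rng = np.random.RandomState(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.normal(loc=8.0, scale=0.5, size=(40, 2))
X = np.vstack([inliers, outliers])

empirical = EmpiricalCovariance().fit(X)
robust = MinCovDet(random_state=0).fit(X)

# Distance of each estimated location from the true inlier centre (0, 0).
empirical_error = np.linalg.norm(empirical.location_)
robust_error = np.linalg.norm(robust.location_)

print("empirical location:", empirical.location_)
print("robust (MCD) location:", robust.location_)
print("empirical covariance trace:", np.trace(empirical.covariance_))
print("robust covariance trace:", np.trace(robust.covariance_))
```

On data like this, the empirical mean is pulled toward the second cluster and the empirical covariance is inflated along the line joining the two clusters, while the MCD estimate stays close to the main mode.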

First example

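The first example illustrates how robust covariance estimation can help to concentrate on a relevant cluster when another one exists. Here, many observations are confounded into a single point and break down the empirical covariance estimate. Of course, some screening tools would have pointed out the presence of two clusters (Support Vector Machines, Gaussian Mixture Models, univariate outlier detection, ...). But had it been a high-dimensional example, none of these could be applied that easily.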
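
As a rough illustration of such a screening tool, the sketch below fits Gaussian mixture models with one and two components and compares their BIC scores. It is not part of the original example: the data are synthetic, and the cluster centres and sizes are assumptions chosen only to resemble a two-cluster layout.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic two-cluster data (assumed values, not the housing data).
rng = np.random.RandomState(0)
main_cluster = rng.normal(loc=[5.0, 18.0], scale=1.0, size=(300, 2))
second_cluster = rng.normal(loc=[24.0, 20.0], scale=0.3, size=(100, 2))
X = np.vstack([main_cluster, second_cluster])

# Fit mixtures with 1 and 2 components; a lower BIC indicates a better
# trade-off between goodness of fit and model complexity.
bic = {}
for n_components in (1, 2):
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(X)
    bic[n_components] = gmm.bic(X)

for n_components, value in bic.items():
    print(f"{n_components} component(s): BIC = {value:.1f}")
```

In two dimensions this kind of check is cheap and easy to read. The number of parameters of a full-covariance mixture grows quadratically with the dimension, which is one reason such screening becomes harder to apply in high dimension.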

Second example

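The second example shows the ability of the Minimum Covariance Determinant robust estimator of covariance to concentrate on the main mode of the data distribution: the location seems to be well estimated, although the covariance is hard to estimate because of the banana-shaped distribution. In any case, we can get rid of some outlying observations. The One-Class SVM is able to capture the real data structure, but the difficulty is to adjust its kernel bandwidth parameter so as to obtain a good compromise between the shape of the data scatter matrix and the risk of over-fitting the data.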
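
The effect of the kernel bandwidth can be explored with a small experiment. The sketch below is an assumption-laden illustration rather than the original analysis: the banana-shaped data are synthetic (a noisy parabola), and the `gamma` values are arbitrary. It fits `OneClassSVM` with several bandwidths and reports the number of support vectors and the fraction of training points flagged as outliers.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic banana-shaped data (a noisy parabola, assumed for illustration).
rng = np.random.RandomState(0)
t = rng.uniform(-1.0, 1.0, size=300)
X = np.column_stack([t, t ** 2]) + rng.normal(scale=0.05, size=(300, 2))

nu = 0.25  # upper bound on the fraction of training errors
results = {}
for gamma in (0.05, 5.0, 500.0):
    clf = OneClassSVM(nu=nu, gamma=gamma).fit(X)
    results[gamma] = {
        "n_support": len(clf.support_),
        "train_outlier_fraction": float(np.mean(clf.predict(X) == -1)),
    }
    print(f"gamma={gamma}: {results[gamma]}")
```

A small `gamma` gives a smooth frontier that may ignore the curved shape, while a large `gamma` typically produces a frontier that hugs individual points and uses many more support vectors, which is a sign of over-fitting.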

../../_images/sphx_glr_plot_outlier_detection_housing_001.png

../../_images/sphx_glr_plot_outlier_detection_housing_002.png

print(__doc__)

# Author: Virgile Fritsch <[email protected]>
# License: BSD 3 clause

import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM
import matplotlib.pyplot as plt
import matplotlib.font_manager
from sklearn.datasets import load_boston

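# Get data
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this script requires an older scikit-learn release.
boston_data = load_boston()['data']
X1 = boston_data[:, [8, 10]]  # two clusters (columns RAD, PTRATIO)
X2 = boston_data[:, [5, 12]]  # "banana"-shaped (columns RM, LSTAT)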

# Define "classifiers" to be used
classifiers = {
    "Empirical Covariance": EllipticEnvelope(support_fraction=1.,
                                             contamination=0.261),
    "Robust Covariance (Minimum Covariance Determinant)":
    EllipticEnvelope(contamination=0.261),
    "OCSVM": OneClassSVM(nu=0.261, gamma=0.05)}
colors = ['m', 'g', 'b']
legend1 = {}
legend2 = {}

# Learn a frontier for outlier detection with several classifiers
xx1, yy1 = np.meshgrid(np.linspace(-8, 28, 500), np.linspace(3, 40, 500))
xx2, yy2 = np.meshgrid(np.linspace(3, 10, 500), np.linspace(-5, 45, 500))
for i, (clf_name, clf) in enumerate(classifiers.items()):
    plt.figure(1)
    clf.fit(X1)
    Z1 = clf.decision_function(np.c_[xx1.ravel(), yy1.ravel()])
    Z1 = Z1.reshape(xx1.shape)
    legend1[clf_name] = plt.contour(
        xx1, yy1, Z1, levels=[0], linewidths=2, colors=colors[i])
    plt.figure(2)
    clf.fit(X2)
    Z2 = clf.decision_function(np.c_[xx2.ravel(), yy2.ravel()])
    Z2 = Z2.reshape(xx2.shape)
    legend2[clf_name] = plt.contour(
        xx2, yy2, Z2, levels=[0], linewidths=2, colors=colors[i])

legend1_values_list = list(legend1.values())
legend1_keys_list = list(legend1.keys())

# Plot the results (= shape of the data points cloud)
plt.figure(1)  # two clusters
plt.title("Outlier detection on a real data set (boston housing)")
plt.scatter(X1[:, 0], X1[:, 1], color='black')
bbox_args = dict(boxstyle="round", fc="0.8")
arrow_args = dict(arrowstyle="->")
plt.annotate("several confounded points", xy=(24, 19),
             xycoords="data", textcoords="data",
             xytext=(13, 10), bbox=bbox_args, arrowprops=arrow_args)
plt.xlim((xx1.min(), xx1.max()))
plt.ylim((yy1.min(), yy1.max()))
plt.legend((legend1_values_list[0].collections[0],
            legend1_values_list[1].collections[0],
            legend1_values_list[2].collections[0]),
           (legend1_keys_list[0], legend1_keys_list[1], legend1_keys_list[2]),
           loc="upper center",
           prop=matplotlib.font_manager.FontProperties(size=12))
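# X1[:, 0] is column 8 (RAD) and X1[:, 1] is column 10 (PTRATIO),
# so RAD belongs on the x axis and PTRATIO on the y axis.
plt.xlabel("accessibility to radial highways")
plt.ylabel("pupil-teacher ratio by town")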

legend2_values_list = list(legend2.values())
legend2_keys_list = list(legend2.keys())

plt.figure(2)  # "banana" shape
plt.title("Outlier detection on a real data set (boston housing)")
plt.scatter(X2[:, 0], X2[:, 1], color='black')
plt.xlim((xx2.min(), xx2.max()))
plt.ylim((yy2.min(), yy2.max()))
plt.legend((legend2_values_list[0].collections[0],
            legend2_values_list[1].collections[0],
            legend2_values_list[2].collections[0]),
           (legend2_keys_list[0], legend2_keys_list[1], legend2_keys_list[2]),
           loc="upper center",
           prop=matplotlib.font_manager.FontProperties(size=12))
plt.ylabel("% lower status of the population")
plt.xlabel("average number of rooms per dwelling")

plt.show()
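
The script above draws each frontier as the zero level of `decision_function`. The sketch below, on synthetic Gaussian data with an assumed `contamination` of 0.1, illustrates what that level means for `EllipticEnvelope`: points with a non-negative score are predicted as inliers, and roughly a `contamination` fraction of the training points falls outside the frontier.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Synthetic Gaussian data (assumed for illustration, not the housing data).
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))

clf = EllipticEnvelope(contamination=0.1, random_state=0).fit(X)
scores = clf.decision_function(X)  # signed score; the frontier is at 0
labels = clf.predict(X)            # +1 for inliers, -1 for outliers

# Fraction of training points that end up outside the frontier.
flagged_fraction = float(np.mean(labels == -1))
print("fraction flagged as outliers:", flagged_fraction)
```

This is why the script passes `levels=[0]` to `plt.contour`: the contour line separates the region predicted as inliers from the region predicted as outliers.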
