22. Unsupervised Learning: Anomaly Detection

小牛编辑

2023-12-01

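Anomaly detection is a machine learning task that consists of spotting so-called outliers.

"An outlier is an observation in a data set which appears to be inconsistent with the remainder of that set of data." -- Johnson 1992

"An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism." -- Outlier/Anomaly Hawkins 1980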

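Types of anomaly detection setups

Supervised AD
- Labels are available for both normal data and anomalies
- Similar to rare class mining / imbalanced classification
Semi-supervised AD (novelty detection)
- Only normal data is available for training
- The algorithm learns from normal data only
Unsupervised AD (outlier detection)
- No labels; training set = normal + abnormal data
- Assumption: anomalies are very rare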
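
The practical difference between the semi-supervised and unsupervised setups shows up in what the estimator is fitted on. The sketch below is only an illustration: the toy data, the variable names (`X_normal`, `X_anomalies`, `X_new`, `X_mixed`) and the choice of `OneClassSVM` for novelty detection and `IsolationForest` for outlier detection are assumptions made for this example, not something the setups prescribe.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# Hypothetical toy data: one "normal" cluster and a few far-away anomalies
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_anomalies = rng.uniform(low=6.0, high=8.0, size=(10, 2))

# Semi-supervised AD (novelty detection): fit on normal data only,
# then label previously unseen points
novelty_detector = OneClassSVM(kernel='rbf', gamma=0.1, nu=0.05).fit(X_normal)
X_new = np.array([[0.0, 0.0], [7.0, 7.0]])
novelty_labels = novelty_detector.predict(X_new)  # +1 = normal, -1 = anomaly

# Unsupervised AD (outlier detection): fit directly on the contaminated set
X_mixed = np.vstack([X_normal, X_anomalies])
outlier_detector = IsolationForest(contamination=0.05, random_state=0)
outlier_labels = outlier_detector.fit_predict(X_mixed)

print(novelty_labels)
print((outlier_labels == -1).sum(), "points flagged out of", len(X_mixed))
```

In the unsupervised case the `contamination` value encodes the assumption that anomalies are rare; it has to be guessed, since no labels are available to estimate it.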

%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

import numpy as np
import matplotlib
import matplotlib.pyplot as plt

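Let's first get familiar with different unsupervised anomaly detection approaches and algorithms. To visualise the output of the different algorithms, we consider a toy data set consisting of a two-dimensional Gaussian mixture.

Generating the data set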

from sklearn.datasets import make_blobs

X, y = make_blobs(n_features=2, centers=3, n_samples=500,
                  random_state=42)

X.shape

plt.figure()
plt.scatter(X[:, 0], X[:, 1])
plt.show()

Anomaly detection with density estimation

from sklearn.neighbors import KernelDensity

# Estimate the density with a Gaussian kernel density estimator
kde = KernelDensity(kernel='gaussian')
kde = kde.fit(X)
kde

kde_X = kde.score_samples(X)
print(kde_X.shape)  # log-likelihood of each sample; the smaller it is, the rarer the sample

from scipy.stats.mstats import mquantiles
alpha_set = 0.95
tau_kde = mquantiles(kde_X, 1. - alpha_set)

n_samples, n_features = X.shape
X_range = np.zeros((n_features, 2))
X_range[:, 0] = np.min(X, axis=0) - 1.
X_range[:, 1] = np.max(X, axis=0) + 1.

h = 0.1  # step size of the mesh
x_min, x_max = X_range[0]
y_min, y_max = X_range[1]
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

grid = np.c_[xx.ravel(), yy.ravel()]

Z_kde = kde.score_samples(grid)
Z_kde = Z_kde.reshape(xx.shape)

plt.figure()
c_0 = plt.contour(xx, yy, Z_kde, levels=tau_kde, colors='red', linewidths=3)
plt.clabel(c_0, inline=1, fontsize=15, fmt={tau_kde[0]: str(alpha_set)})
plt.scatter(X[:, 0], X[:, 1])
plt.show()
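
The contour above only draws the threshold. One way to turn it into an explicit list of outliers is to compare each sample's log-likelihood with that threshold. The following is a minimal sketch that regenerates the same toy data so it runs on its own; it uses `np.quantile` instead of `mquantiles` as a simplifying assumption (both estimate an empirical quantile, but their interpolation rules can differ slightly), and the names `log_density`, `tau` and `X_kde_outliers` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KernelDensity

X, _ = make_blobs(n_features=2, centers=3, n_samples=500, random_state=42)

kde = KernelDensity(kernel='gaussian').fit(X)
log_density = kde.score_samples(X)  # log-likelihood of each training sample

alpha_set = 0.95
# Log-likelihood value below which roughly 5% of the samples fall
tau = np.quantile(log_density, 1. - alpha_set)

# Samples whose estimated density is below the threshold are flagged
is_outlier = log_density < tau
X_kde_outliers = X[is_outlier]
print(X_kde_outliers.shape[0], "of", X.shape[0], "samples flagged")
```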

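One-class SVM

The problem with density-based estimation is that it tends to become inefficient as the dimensionality of the data increases. This is the so-called curse of dimensionality, which particularly affects density estimation algorithms. The one-class SVM algorithm can be used in such cases.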
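
One common way to build intuition for the curse of dimensionality is to look at how distances behave as the number of features grows: the nearest and the farthest neighbour of a query point end up at almost the same distance, so a kernel centred on that point has little contrast to work with. The sketch below is only an illustration of that symptom on standard normal data; `distance_contrast` is a hypothetical helper written for this example.

```python
import numpy as np

def distance_contrast(n_features, n_samples=500, seed=0):
    """Ratio of the nearest to the farthest distance from one query point.

    A ratio close to 1 means near and far samples are hard to tell apart.
    """
    rng = np.random.RandomState(seed)
    X = rng.normal(size=(n_samples, n_features))
    query = rng.normal(size=n_features)
    distances = np.linalg.norm(X - query, axis=1)
    return distances.min() / distances.max()

# The ratio grows towards 1 as the dimension increases
for d in (2, 10, 100, 1000):
    print(d, round(distance_contrast(d), 3))
```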

from sklearn.svm import OneClassSVM

nu = 0.05  # theory says it should be an upper bound of the fraction of outliers
ocsvm = OneClassSVM(kernel='rbf', gamma=0.05, nu=nu)
ocsvm.fit(X)

X_outliers = X[ocsvm.predict(X) == -1]

Z_ocsvm = ocsvm.decision_function(grid)
Z_ocsvm = Z_ocsvm.reshape(xx.shape)

plt.figure()
c_0 = plt.contour(xx, yy, Z_ocsvm, levels=[0], colors='red', linewidths=3)
plt.clabel(c_0, inline=1, fontsize=15, fmt={0: str(alpha_set)})
plt.scatter(X[:, 0], X[:, 1])
plt.scatter(X_outliers[:, 0], X_outliers[:, 1], color='red')
plt.show()

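Support vectors - Outliers

The outliers of the one-class SVM are found among its so-called support vectors: in theory, nu is an upper bound on the fraction of outliers and a lower bound on the fraction of support vectors.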

X_SV = X[ocsvm.support_]
n_SV = len(X_SV)
n_outliers = len(X_outliers)

print('{0:.2f} <= {1:.2f} <= {2:.2f}?'.format(1./n_samples*n_outliers, nu, 1./n_samples*n_SV))
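
To see how the two fractions move with `nu`, the same check can be repeated for several values. This is a minimal sketch on the same toy data, regenerated here so the block runs on its own; the list of `nu` values is arbitrary. The lower bound on the fraction of support vectors follows from the constraints of the optimisation problem, whereas the measured fraction of outliers can end up slightly above `nu` because of the solver's numerical tolerance.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import OneClassSVM

X, _ = make_blobs(n_features=2, centers=3, n_samples=500, random_state=42)
n_samples = X.shape[0]

results = []
for nu in (0.01, 0.05, 0.1, 0.2):
    ocsvm = OneClassSVM(kernel='rbf', gamma=0.05, nu=nu).fit(X)
    # Fraction of training samples predicted as outliers
    frac_outliers = np.mean(ocsvm.predict(X) == -1)
    # Fraction of training samples kept as support vectors
    frac_sv = len(ocsvm.support_) / n_samples
    results.append((nu, frac_outliers, frac_sv))
    print('nu={0:.2f}  outliers={1:.3f}  support vectors={2:.3f}'.format(
        nu, frac_outliers, frac_sv))
```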

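Only the support vectors are involved in the decision function of the one-class SVM.

Plot the level sets of the one-class SVM decision function, as we did for the density estimate.
Highlight the support vectors.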

plt.figure()
plt.contourf(xx, yy, Z_ocsvm, 10, cmap=plt.cm.Blues_r)
plt.scatter(X[:, 0], X[:, 1], s=1.)
plt.scatter(X_SV[:, 0], X_SV[:, 1], color='orange')
plt.show()

Exercise
Change the `gamma` parameter and see its influence on the smoothness of the decision function.

# %load solutions/22_A-anomaly_ocsvm_gamma.py

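Isolation Forest

Isolation Forest is a tree-based anomaly detection algorithm. It builds a number of random trees, and the rationale is that an isolated sample should end up alone in a leaf after very few random splits. Isolation Forest builds an abnormality score from the depth at which samples end up in the trees.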
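
The idea of "few random splits" can be sketched without scikit-learn. The code below is a simplified illustration, not the library's implementation: `isolation_depth` is a hypothetical helper that follows a single sample through random axis-aligned splits and counts how many are needed before it is alone. The real algorithm also subsamples the data for each tree and normalises the path lengths, both of which are left out here.

```python
import numpy as np

def isolation_depth(X, x, rng, max_depth=50):
    """Number of random splits needed to isolate x, which must be a row of X."""
    depth = 0
    while len(X) > 1 and depth < max_depth:
        feature = rng.integers(X.shape[1])
        lo, hi = X[:, feature].min(), X[:, feature].max()
        if lo == hi:
            break  # no split possible along this feature; stop early in this sketch
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains x
        if x[feature] < split:
            X = X[X[:, feature] < split]
        else:
            X = X[X[:, feature] >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
inliers = rng.normal(size=(200, 2))
outlier = np.array([8.0, 8.0])
X_toy = np.vstack([inliers, outlier])

# Average over many random "trees"
n_trees = 200
depth_outlier = np.mean([isolation_depth(X_toy, outlier, rng) for _ in range(n_trees)])
depth_inlier = np.mean([isolation_depth(X_toy, inliers[0], rng) for _ in range(n_trees)])
print('mean depth, isolated point:', depth_outlier)
print('mean depth, point inside the cluster:', depth_inlier)
```

The isolated point is typically separated after a handful of splits, while a point inside the cluster needs many more; this difference in depth is what the abnormality score is built on.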

from sklearn.ensemble import IsolationForest

iforest = IsolationForest(n_estimators=300, contamination=0.10)
iforest = iforest.fit(X)

Z_iforest = iforest.decision_function(grid)
Z_iforest = Z_iforest.reshape(xx.shape)

plt.figure()
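# decision_function is shifted so that 0 separates inliers from outliers;
# the threshold_ attribute is no longer available in recent scikit-learn releases
c_0 = plt.contour(xx, yy, Z_iforest,
                  levels=[0],
                  colors='red', linewidths=3)
# label the contour with the expected fraction of inliers (1 - contamination)
plt.clabel(c_0, inline=1, fontsize=15,
           fmt={0: str(1 - iforest.contamination)})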
plt.scatter(X[:, 0], X[:, 1], s=1.)
plt.show()

Exercise
Illustrate graphically the influence of the number of trees on the smoothness of the decision function.

# %load solutions/22_B-anomaly_iforest_n_trees.py

Illustration on the digits data set

We will now apply the IsolationForest algorithm to spot digits written in an unconventional way.

from sklearn.datasets import load_digits
digits = load_digits()

The digits data set consists of 8 x 8 images of digits.

images = digits.images
labels = digits.target
images.shape

i = 102

plt.figure(figsize=(2, 2))
plt.title('{0}'.format(labels[i]))
plt.axis('off')
plt.imshow(images[i], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

To use the images as a training set, we need to flatten them.

n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

data.shape

X = data
y = digits.target

X.shape

Let's focus on the digit 5.

X_5 = X[y == 5]

X_5.shape

fig, axes = plt.subplots(1, 5, figsize=(10, 4))
for ax, x in zip(axes, X_5[:5]):
    img = x.reshape(8, 8)
    ax.imshow(img, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.axis('off')

Let's use IsolationForest to find the top 5% most abnormal images.
Let's plot them!

from sklearn.ensemble import IsolationForest
iforest = IsolationForest(contamination=0.05)
iforest = iforest.fit(X_5)

Compute the level of "abnormality" with iforest.decision_function. The lower the score, the more abnormal the sample.

iforest_X = iforest.decision_function(X_5)
plt.hist(iforest_X);
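
It can help to see how the values in this histogram relate to the labels returned by `predict`. The sketch below reloads the digit 5 images so that it runs on its own, and fixes `random_state=0` (an assumption added here for reproducibility). It uses the documented `score_samples` method and `offset_` attribute of `IsolationForest`: `decision_function` is the raw score shifted by the fitted offset, and a sample is predicted as an outlier when the shifted score is negative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import IsolationForest

digits = load_digits()
X_5 = digits.data[digits.target == 5]  # flattened 8 x 8 images of the digit 5

iforest = IsolationForest(contamination=0.05, random_state=0).fit(X_5)

raw_scores = iforest.score_samples(X_5)   # lower = more abnormal
shifted = iforest.decision_function(X_5)  # raw score minus the fitted offset
labels = iforest.predict(X_5)             # -1 = outlier, +1 = inlier

# Check the relation between the three outputs
same_shift = np.allclose(shifted, raw_scores - iforest.offset_)
same_labels = np.array_equal(labels == -1, shifted < 0)
frac_flagged = np.mean(labels == -1)
print(same_shift, same_labels, frac_flagged)
```

With `contamination=0.05`, the offset is chosen so that about 5% of the training samples fall below zero.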

Let's plot the strongest inliers.

X_strong_inliers = X_5[np.argsort(iforest_X)[-10:]]

fig, axes = plt.subplots(2, 5, figsize=(10, 5))

for i, ax in zip(range(len(X_strong_inliers)), axes.ravel()):
    ax.imshow(X_strong_inliers[i].reshape((8, 8)),
               cmap=plt.cm.gray_r, interpolation='nearest')
    ax.axis('off')

Let's plot the strongest outliers.

fig, axes = plt.subplots(2, 5, figsize=(10, 5))

X_outliers = X_5[iforest.predict(X_5) == -1]

for i, ax in zip(range(len(X_outliers)), axes.ravel()):
    ax.imshow(X_outliers[i].reshape((8, 8)),
               cmap=plt.cm.gray_r, interpolation='nearest')
    ax.axis('off')
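
The selection above keeps every image predicted as an outlier, in data set order. If the goal is to rank the images from most to least abnormal, mirroring how the strongest inliers were chosen, one option is to sort by decision score. A minimal sketch, reloading the data so it runs on its own; `random_state=0` and the names `strongest_idx` and `X_strong_outliers` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import IsolationForest

digits = load_digits()
X_5 = digits.data[digits.target == 5]

iforest = IsolationForest(contamination=0.05, random_state=0).fit(X_5)
iforest_X = iforest.decision_function(X_5)

# Indices of the 10 lowest scores, most abnormal first
strongest_idx = np.argsort(iforest_X)[:10]
X_strong_outliers = X_5[strongest_idx]

print(strongest_idx)
print(iforest_X[strongest_idx])
```

`X_strong_outliers` can then be passed to the same plotting loop in place of `X_outliers`.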

Exercise
Rerun the same analysis with all the other digits.

# %load solutions/22_C-anomaly_digits.py