scikit-learn Clustering Algorithms: K-Means

Copyright notice: this is an original post by the author and may not be reproduced without permission. https://blog.csdn.net/xiaoleiniu1314/article/details/79986408

K-Means

Algorithm steps:
1. Given the number of clusters k, choose k points from the dataset X as the initial centroids;
repeat steps 2 and 3 until the centroids move less than a preset threshold between updates:
2. Assign each point in the dataset X to its nearest centroid;
3. Recompute each centroid from all of the points currently assigned to it.
The algorithm above raises a few practical issues (a minimal code sketch of the iteration follows this list):
1. Input: the raw text or images must first be vectorized, which is a feature-selection problem;
2. Choice of initial centroids: the k-means++ initialization [1] can be used, which spreads the initial centroids as far apart from one another as possible;
3. Distance metric: Euclidean distance is used.
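As a concrete illustration of the three steps, here is a minimal NumPy sketch of the update loop. The function name and the convergence threshold tol are my own illustrative choices, and unlike scikit-learn it uses plain random initialization rather than k-means++.

import numpy as np

def kmeans(X, k, tol=1e-4, max_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k points of X as the initial centroids (plain random init, not k-means++).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of the points assigned to it
        # (empty clusters are not handled here, for brevity).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids move less than the threshold tol.
        if np.linalg.norm(new_centroids - centroids) < tol:
            return new_centroids, labels
        centroids = new_centroids
    return centroids, labels

# Example: centroids, labels = kmeans(np.random.rand(200, 2), k=3)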

Mini Batch K-Means

To reduce the computation time, the Mini Batch K-Means method was proposed [2]:
1. Given the number of clusters k, choose k points from the dataset X as the initial centroids;
repeat steps 2-4 until the centroids move less than a preset threshold between updates:
2. Randomly draw b points from the dataset X;
3. Assign each of these b points to its nearest centroid;
4. Recompute each centroid from all of the points that belong to it (these b points together with every point assigned in previous iterations).

Mini Batch K-Means is more efficient than K-Means, but its clustering quality is not as good.
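A quick way to see this trade-off is to time both estimators on the same synthetic data; the sketch below uses sklearn's make_blobs, and the sample size and batch_size are illustrative values of my own.

from time import time

from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Toy data: 30,000 points around 8 blob centers.
X, _ = make_blobs(n_samples=30000, centers=8, random_state=42)

t0 = time()
km = KMeans(n_clusters=8, random_state=42).fit(X)
print("KMeans:          %.2fs, inertia %.0f" % (time() - t0, km.inertia_))

t0 = time()
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1000, random_state=42).fit(X)
print("MiniBatchKMeans: %.2fs, inertia %.0f" % (time() - t0, mbk.inertia_))
# MiniBatchKMeans is typically several times faster; its inertia (the K-Means
# objective) is usually only slightly higher, i.e. the clustering is slightly worse.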

Pros and Cons

Pros: fast convergence.
Cons: because it is built around centroids, the clusters K-Means produces are convex regions of the space; it performs poorly when clusters are elongated or irregularly shaped.
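The convexity limitation is easy to demonstrate on the two interleaving half-moons from sklearn.datasets (an illustrative check, not from the original post): K-Means splits the moons roughly down the middle instead of recovering them.

from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two crescent-shaped (non-convex) clusters.
X, y = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
print("Adjusted Rand Index: %.3f" % adjusted_rand_score(y, labels))  # far below 1.0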

Parameters in sklearn

[class sklearn.cluster.KMeans]
n_clusters=8: the number of clusters, i.e. the number of centroids;
init='k-means++': {'k-means++', 'random' or an ndarray}
                              'k-means++': choose the initial centroids with the k-means++ scheme, which speeds up convergence;
                              'random': choose the initial centroids at random;
                              ndarray: user-supplied initial centroids, with shape (n_clusters, n_features);
n_init=10: number of runs with different initial centroids;
max_iter=300: maximum number of iterations within a single run;
tol=0.0001: convergence tolerance; the iteration stops once the change falls below tol;
precompute_distances='auto': {'auto', True, False}; trades memory for speed; 'auto': do not precompute the distance matrix when n_samples * n_clusters > 12 million (roughly 100 MB in double precision); True: always precompute the distance matrix; False: never precompute it;
verbose=0: int, verbosity level;
random_state=None: int, RandomState instance or None; the seed of the random number generator used for centroid initialization;
copy_x=True: as in many scikit-learn APIs, whether to copy the input data so that the user's data is not modified;
n_jobs=1: int, number of parallel jobs;
                   -1: use all CPUs;
                    1: no parallelism;
                   -2: when n_jobs < 0, (n_cpus + 1 + n_jobs) CPUs are used, so n_jobs=-2 leaves exactly one CPU unused;
algorithm='auto': "auto", "full" or "elkan";
                              "full": the classical EM-style algorithm;
                              "elkan": a variant that exploits the triangle inequality; it is more efficient but does not support sparse data;
                              "auto": use "elkan" for dense data and "full" for sparse data;
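Putting a few of these parameters together on toy data (the data and the parameter values below are purely illustrative; parameters not shown keep the defaults listed above):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 4)   # toy data: 100 samples, 4 features

km = KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300,
            tol=1e-4, random_state=0)
km.fit(X)
print(km.cluster_centers_.shape)   # (8, 4): one centroid per cluster
print(km.labels_[:10])             # cluster index assigned to the first 10 samples
print(km.inertia_)                 # sum of squared distances to the closest centroid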

[class sklearn.cluster.MiniBatchKMeans]
batch_size=100: size of the mini-batches;
compute_labels=True: once the mini-batch optimization has converged, compute the labels and the inertia for the complete dataset;
max_no_improvement=10: similar to the patience in early stopping; stop updating when the objective (the sum of distances from every point to its own centroid) has not improved over max_no_improvement consecutive mini-batches;
init_size=None: number of samples randomly drawn to speed up the initialization; it must be larger than n_clusters, and defaults to 3 * batch_size when None;
reassignment_ratio=0.01: float, controls the fraction of points that may be reassigned to a centroid; the higher this ratio, the more easily centroids that currently own few points are assigned new points, which makes convergence slower but usually yields a better clustering.
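And the mini-batch counterpart with the parameters above set explicitly (again, the toy data and values are illustrative; note that init_size must exceed n_clusters):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.RandomState(0).rand(10000, 4)   # toy data

mbk = MiniBatchKMeans(n_clusters=8, batch_size=100, max_no_improvement=10,
                      init_size=300, reassignment_ratio=0.01, random_state=0)
mbk.fit(X)
print(mbk.inertia_)   # the objective: sum of squared distances to the closest centroid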

Example Code

Clustering text documents using K-means

# Author: Peter Prettenhofer <[email protected]>
#         Lars Buitinck
# License: BSD 3 clause

from __future__ import print_function

from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics

from sklearn.cluster import KMeans, MiniBatchKMeans

import logging
from optparse import OptionParser
import sys
from time import time

import numpy as np


# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

# parse commandline arguments
op = OptionParser()
op.add_option("--lsa",
              dest="n_components", type="int",
              help="Preprocess documents with latent semantic analysis.")
op.add_option("--no-minibatch",
              action="store_false", dest="minibatch", default=True,
              help="Use ordinary k-means algorithm (in batch mode).")
op.add_option("--no-idf",
              action="store_false", dest="use_idf", default=True,
              help="Disable Inverse Document Frequency feature weighting.")
op.add_option("--use-hashing",
              action="store_true", default=False,
              help="Use a hashing feature vectorizer")
op.add_option("--n-features", type=int, default=10000,
              help="Maximum number of features (dimensions)"
                   " to extract from text.")
op.add_option("--verbose",
              action="store_true", dest="verbose", default=False,
              help="Print progress reports inside k-means algorithm.")

print(__doc__)
op.print_help()


def is_interactive():
    return not hasattr(sys.modules['__main__'], '__file__')

# work-around for Jupyter notebook and IPython console
argv = [] if is_interactive() else sys.argv[1:]
(opts, args) = op.parse_args(argv)
if len(args) > 0:
    op.error("this script takes no arguments.")
    sys.exit(1)


# #############################################################################
# Load some categories from the training set
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
# Uncomment the following to do the analysis on all the categories
# categories = None

print("Loading 20 newsgroups dataset for categories:")
print(categories)

dataset = fetch_20newsgroups(subset='all', categories=categories,
                             shuffle=True, random_state=42)

print("%d documents" % len(dataset.data))
print("%d categories" % len(dataset.target_names))
print()

labels = dataset.target
true_k = np.unique(labels).shape[0]

print("Extracting features from the training dataset using a sparse vectorizer")
t0 = time()
if opts.use_hashing:
    if opts.use_idf:
        # Perform an IDF normalization on the output of HashingVectorizer
        hasher = HashingVectorizer(n_features=opts.n_features,
                                   stop_words='english', alternate_sign=False,
                                   norm=None, binary=False)
        vectorizer = make_pipeline(hasher, TfidfTransformer())
    else:
        vectorizer = HashingVectorizer(n_features=opts.n_features,
                                       stop_words='english',
                                       alternate_sign=False, norm='l2',
                                       binary=False)
else:
    vectorizer = TfidfVectorizer(max_df=0.5, max_features=opts.n_features,
                                 min_df=2, stop_words='english',
                                 use_idf=opts.use_idf)
X = vectorizer.fit_transform(dataset.data)

print("done in %fs" % (time() - t0))
print("n_samples: %d, n_features: %d" % X.shape)
print()

if opts.n_components:
    print("Performing dimensionality reduction using LSA")
    t0 = time()
    # Vectorizer results are normalized, which makes KMeans behave as
    # spherical k-means for better results. Since LSA/SVD results are
    # not normalized, we have to redo the normalization.
    svd = TruncatedSVD(opts.n_components)
    normalizer = Normalizer(copy=False)
    lsa = make_pipeline(svd, normalizer)

    X = lsa.fit_transform(X)

    print("done in %fs" % (time() - t0))

    explained_variance = svd.explained_variance_ratio_.sum()
    print("Explained variance of the SVD step: {}%".format(
        int(explained_variance * 100)))

    print()


# #############################################################################
# Do the actual clustering

if opts.minibatch:
    km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
                         init_size=1000, batch_size=1000, verbose=opts.verbose)
else:
    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1,
                verbose=opts.verbose)

print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X)
print("done in %0.3fs" % (time() - t0))
print()

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, km.labels_, sample_size=1000))

print()


if not opts.use_hashing:
    print("Top terms per cluster:")

    if opts.n_components:
        original_space_centroids = svd.inverse_transform(km.cluster_centers_)
        order_centroids = original_space_centroids.argsort()[:, ::-1]
    else:
        order_centroids = km.cluster_centers_.argsort()[:, ::-1]

    terms = vectorizer.get_feature_names()
    for i in range(true_k):
        print("Cluster %d:" % i, end='')
        for ind in order_centroids[i, :10]:
            print(' %s' % terms[ind], end='')
        print()

References

[1] Arthur, David, and Sergei Vassilvitskii. "k-means++: The Advantages of Careful Seeding." Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, 2007.
[2] Sculley, D. "Web-Scale K-Means Clustering." Proceedings of the 19th International Conference on World Wide Web, 2010.
