Basic Problems of Clustering and Two Common Algorithms

I. Definition of clustering and its two basic problems

Data clustering is the task of partitioning a set of objects into groups such that the similarity of objects within each group is higher than that of objects across groups.

To cluster data, we need:

A distance measure (to quantify how similar or dissimilar two objects are)

An algorithm for clustering the data based on the distance measure


1. Distance measure

Point-to-point distance

Point-to-cluster distance: equivalent to the distance between the point and the cluster's center point

Cluster-to-cluster distance: equivalent to the distance between the two clusters' center points
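The three distances above can be sketched in Python as follows. This is a minimal sketch under two assumptions that the notes do not state: points are 2-D `(x, y)` tuples compared by Euclidean distance, and a cluster's center is the coordinate-wise mean of its points. The function names are illustrative only.

```python
import math

def point_distance(p, q):
    """Euclidean distance between two 2-D points (assumed metric)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def center(cluster):
    """Center of a non-empty cluster: the coordinate-wise mean of its points."""
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)

def point_cluster_distance(p, cluster):
    """Distance from a point to a cluster, measured to the cluster's center."""
    return point_distance(p, center(cluster))

def cluster_distance(c1, c2):
    """Distance between two clusters, measured between their centers."""
    return point_distance(center(c1), center(c2))
```

Reducing every cluster to its center keeps each distance computation cheap, at the cost of ignoring the cluster's shape and spread.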

2. The Closest-Pair Problem

Find the two points in the point set P that are closest to each other:

[Figure: the closest-pair problem]

(1) Brute-force algorithm: time complexity O(n^2)

[Figure: SlowClosestPair]
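A brute-force search in the spirit of SlowClosestPair can be sketched as below. It is not the pseudocode from the original figure; the function name, the Euclidean metric, and the `(distance, i, j)` return shape are assumptions made for illustration. Every one of the n(n-1)/2 pairs is examined once, which is where the quadratic running time comes from.

```python
import math
from itertools import combinations

def slow_closest_pair(points):
    """Return (distance, i, j) for the closest pair, checking every pair of points."""
    best = (math.inf, -1, -1)
    # combinations() yields each unordered pair of (index, point) exactly once.
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        d = math.dist(p, q)  # Euclidean distance (Python 3.8+)
        if d < best[0]:
            best = (d, i, j)
    return best
```

With fewer than two points there is no pair to compare, so the sketch returns `(inf, -1, -1)`.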

(2) Divide-and-conquer algorithm: time complexity O(n (log n)^2)

Recurrence for FastClosestPair:

T(n) = 2T(n/2) + f(n), where f(n) = O(n log n) is the running time of ClosestPairStrip

T(2) = O(1)
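Unrolling the recurrence shows where the extra log factor comes from. The derivation below is a sketch that assumes n is a power of two and writes f(n) = c n log2 n for some constant c (both introduced here for illustration). Level i of the recursion has 2^i subproblems of size n/2^i, and the recursion stops at subproblems of size 2.

```latex
\begin{aligned}
T(n) &= \sum_{i=0}^{\log_2 n - 2} 2^i \cdot c\,\frac{n}{2^i}\,\log_2\frac{n}{2^i} \;+\; \frac{n}{2}\,T(2) \\
     &= c\,n \sum_{i=0}^{\log_2 n - 2} \left(\log_2 n - i\right) \;+\; O(n) \\
     &= c\,n \left(\frac{\log_2 n\,(\log_2 n + 1)}{2} - 1\right) \;+\; O(n) \\
     &= O\!\left(n\,(\log n)^2\right)
\end{aligned}
```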

[Figure: FastClosestPair]

Time complexity of ClosestPairStrip: O(n log n)

[Figure: ClosestPairStrip]
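The divide-and-conquer scheme can be sketched as follows. This is not a transcription of the original figures: the function names, the `(distance, p, q)` return shape, the brute-force base case for three or fewer points, and the early exit on the y-gap in the strip scan are all assumptions of this sketch. The input list is assumed to be sorted by x-coordinate already.

```python
import math

def closest_pair_strip(points, mid, w):
    """Best pair closer than w among points within w of the line x = mid.

    Returns (inf, None, None) when no such pair exists.
    """
    # Sorting the strip by y dominates the cost: O(n log n).
    strip = sorted((p for p in points if abs(p[0] - mid) < w), key=lambda p: p[1])
    best = (math.inf, None, None)
    for u in range(len(strip)):
        for v in range(u + 1, len(strip)):
            # A y-gap at least as large as the current bound cannot improve on it.
            if strip[v][1] - strip[u][1] >= min(best[0], w):
                break
            d = math.dist(strip[u], strip[v])
            if d < best[0]:
                best = (d, strip[u], strip[v])
    return best

def fast_closest_pair(points):
    """Closest pair in a list of points sorted by x. Returns (distance, p, q)."""
    n = len(points)
    if n <= 3:
        # Base case (assumed): brute force over at most three points.
        best = (math.inf, None, None)
        for i in range(n):
            for j in range(i + 1, n):
                d = math.dist(points[i], points[j])
                if d < best[0]:
                    best = (d, points[i], points[j])
        return best
    m = n // 2
    left = fast_closest_pair(points[:m])
    right = fast_closest_pair(points[m:])
    best = min(left, right, key=lambda t: t[0])
    # Vertical dividing line halfway between the two middle points.
    mid = (points[m - 1][0] + points[m][0]) / 2
    return min(best, closest_pair_strip(points, mid, best[0]), key=lambda t: t[0])
```

A typical call sorts first, for example `fast_closest_pair(sorted(points))`, since tuples sort by x and then by y.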


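II. Two common clustering algorithms

1. Hierarchical Clustering

Idea: given the data and a target number of clusters k

step1: start by treating every point as its own cluster

step2: find the two closest clusters and merge them into one cluster

step3: repeat step2 until only k clusters remain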

[Figure: hierarchical clustering]
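The three steps can be sketched in Python as below. This is a minimal illustration, not the pseudocode from the original figure. It assumes 2-D points, Euclidean distance between cluster centers, and k of at least 1. For brevity it finds the closest pair of clusters by brute force rather than with a faster closest-pair routine.

```python
import math

def center(cluster):
    """Coordinate-wise mean of the points in a non-empty cluster."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def hierarchical_clustering(points, k):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    # step1: every point starts as its own cluster.
    clusters = [[p] for p in points]
    # step3: repeat the merge until k clusters are left.
    while len(clusters) > k:
        centers = [center(c) for c in clusters]
        # step2: closest pair of clusters, measured between their centers.
        _, i, j = min(
            (math.dist(centers[a], centers[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        # j > i, so removing cluster j leaves index i valid.
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Each merge recomputes all centers and scans all pairs, so this sketch favors readability over speed.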

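2. K-means Clustering

Idea: given the data, a target number of clusters k, and a number of iterations q

step1: initialize k centers (how should they be initialized?)

step2: assign each point to its nearest center

step3: the points assigned to the same center form a cluster

step4: recompute the center of each cluster

step5: repeat step2-4 q times

Time complexity: O(qkn)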

[Figure: K-means clustering]
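The five steps can be sketched as follows. The notes leave the initialization open, so this sketch simply takes the first k points as the initial centers; that choice, the function name, and the decision to keep an old center when its cluster ends up empty are all assumptions for illustration. Each iteration compares every point with every center, which matches the O(qkn) bound.

```python
import math

def kmeans_clustering(points, k, q):
    """Run q iterations of k-means on 2-D points. Returns (clusters, centers)."""
    # step1 (assumed initialization): use the first k points as centers.
    centers = list(points[:k])
    clusters = []
    # step5: repeat the assignment and update steps q times.
    for _ in range(q):
        # step2 and step3: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # step4: move each center to the mean of its cluster.
        for i, c in enumerate(clusters):
            if c:  # an empty cluster keeps its previous center
                centers[i] = (
                    sum(x for x, _ in c) / len(c),
                    sum(y for _, y in c) / len(c),
                )
    return clusters, centers
```

The result depends on the initial centers, so a different initialization can produce a different clustering of the same data.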

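3. How to choose a suitable k?

In general we do not know in advance how many clusters the data should form, so we try different values of k and compare the quality of the resulting clusters. Cluster quality is measured by the error of a cluster: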

[Figure: clustering error]
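Since the formula in the original figure is not reproduced here, the sketch below assumes one common definition: the error of a cluster is the sum of squared Euclidean distances from its points to its center, and the error of a clustering is the sum over its clusters. The original definition may differ, for instance by weighting each point. The function names are illustrative.

```python
def cluster_error(cluster):
    """Sum of squared distances from each point to the cluster center (assumed definition)."""
    n = len(cluster)
    cx = sum(x for x, _ in cluster) / n
    cy = sum(y for _, y in cluster) / n
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in cluster)

def clustering_error(clusters):
    """Total error of a clustering: the sum of the errors of its non-empty clusters."""
    return sum(cluster_error(c) for c in clusters if c)
```

Under this definition, a lower total error for the same k indicates tighter clusters. The error generally shrinks as k grows, so it is most useful for comparing algorithms at the same k or for spotting where extra clusters stop helping much.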



Reference: Coursera, Algorithmic Thinking, Rice University.

Reposted from blog.csdn.net/weixin_33895657/article/details/87094010