K-means聚类算法-Python

K-Means 聚类算法

一般的, 我们用\(C_i\) 既聚类的中心来代表一个聚类

Euclidean distance(\(in R^2\)): \(dist(a, b)=\sqrt{(a_x-b_x)^2+(a_y-b_y)^2}\)

蔟内变差(用于衡量一个蔟\(C_i\)的质量):\(E=\sum_{i=1}^k\sum_{p\in{C_i}}dist(p,c_i)^2\)


Question:既然蔟内变差可以恒量一个蔟的质量,那么枚举完所有的可能,然后再采用最佳的划分不就好了吗?
Answer:不可以,这是个NP完全问题,即使固定蔟的个数#cluster和空间维度#dimension,依然开销巨大,所以我们要采用Greedy Algorithm(贪心方法)

算法: k-mean 用于划分k-mean算法,其中每个蔟的中心都用蔟中所有的对象均值来表示
Input:

  • k: number of cluster
  • D: Data set

Output:k cluster

Approach:

  1. Select k object as cluster center randomly from D;
  2. repeat
  3. Based on the distance between a obj(or points) with centroids, partition them into the most similar cluster;
  4. Update the mean of very cluster as the new centroids;
  5. until it doesn't change

代码部分


import numpy as np
import pandas as pd

data = pd.read_csv('../trip.csv')
data = data.iloc[:, 2:]

class KMeans():

def __init__(self, data, k, r):
    self.data = data
    self.k = k
    self.label = np.empty(data.shape[0])
    self.E = np.empty(data.shape[0])
    self.centroids = self.__init(data, k, r)
    self.__convergence = False

def __distance(self, x, y):
    '''
    Return Euclid distance of x and y.
    '''
    return sum((x-y)**2)**(1/2)

def __init(self, data, k_cluster, random_state):
    '''
    Select k objs as centroids from data set.
    '''
    rs = np.random.RandomState(random_state)
    num, dim = data.shape
    rand_id = rs.randint(num, size=k_cluster)
    centroids = np.empty((k_cluster, dim))
    for i, v in enumerate(rand_id):
        centroids[i, :] = data.iloc[v, :]
    return centroids

def __converge(self, pre):
    return True if False not in (pre==self.centroids) else False

def __update(self):
    for i in range(self.k):
        centroids[i] = data.loc[label==i].mean(axis=0) 

def fit(self):
    while not self.__convergence:
        pre_centroids = self.centroids.copy()
        num, dim = data.shape
        for i in range(num):
            min_dist, neighbour = np.inf, -1
            for j in range(self.k):
                dist = self.__distance(self.data.iloc[i,:], self.centroids[j])
                if dist < min_dist:
                    min_dist, neighbour = dist, j
                    self.label[i] = j
            self.E[i] = self.__distance(self.data.iloc[i,:], self.centroids[j])**2
        self.__convergence = self.__converge(pre_centroids)

猜你喜欢

转载自www.cnblogs.com/Frankhff/p/12466429.html