Matlab implements Kmeans clustering algorithm

1. Introduction to Kmeans clustering algorithm

The kmeans clustering algorithm is a clustering analysis algorithm for iterative solution. Its implementation steps are as follows:

(1) Randomly select K objects as the initial cluster centers

(2) Calculate the distance between each object and each seed cluster center, and assign each object to the nearest cluster center.

(3) The cluster centers and the objects assigned to them represent a cluster. Each time a sample is assigned, the cluster center of the cluster is recalculated based on the existing objects in the cluster.

(4) Repeat steps (2), (3) until a termination condition is met. The termination condition can be the change of the cluster center or the local minimum of the sum of squared errors.

2. Code implementation of Kmeans clustering algorithm

(1) First, load the data set that needs to be classified.

data(:,1)=[90,35,52,83,64,24,49,92,99,45,19,38,1,71,56,97,63,...
    32,3,34,33,55,75,84,53,15,88,66,41,51,39,78,67,65,25,40,77,...
    13,69,29,14,54,87,47,44,58,8,68,81,31];
data(:,2)=[33,71,62,34,49,48,46,69,56,59,28,14,55,41,39,...
    78,23,99,68,30,87,85,43,88,2,47,50,77,22,76,94,11,80,...
    51,6,7,72,36,90,96,44,61,70,60,75,74,63,40,81,4];
figure(1)
scatter(data(:,1),data(:,2),'LineWidth',2)
title("原始数据散点图")

The scatter plot of the original data is as follows:

(2) Set the number of categories and call the kmeans clustering function written by yourself

cluster_num=4;
[index_cluster,cluster] = kmeans_func(data,cluster_num);
function [index_cluster,cluster] = kmeans_func(data,cluster_num)
%% 原理推导Kmeans聚类算法
[m,n]=size(data);
cluster=data(randperm(m,cluster_num),:);%从m个点中随机选择cluster_num个点作为初始聚类中心点
epoch_max=1000;%最大次数
therad_lim=0.001;%中心变化阈值
epoch_num=0;
while(epoch_num<epoch_max)
    epoch_num=epoch_num+1;
    % distance1存储每个点到各聚类中心的欧氏距离
    for i=1:cluster_num
        distance=(data-repmat(cluster(i,:),m,1)).^2;
        distance1(:,i)=sqrt(sum(distance'));
    end
    [~,index_cluster]=min(distance1');%index_cluster取值范围1~cluster_num
    % cluster_new存储新的聚类中心
    for j=1:cluster_num
        cluster_new(j,:)=mean(data(find(index_cluster==j),:));
    end
    %如果新的聚类中心和上一轮的聚类中心距离和大于therad_lim,更新聚类中心,否则算法结束
    if (sqrt(sum((cluster_new-cluster).^2))>therad_lim)
        cluster=cluster_new;
    else
        break;
    end
end
end

(3) Visual display of classification results and final cluster centers

%% 画出聚类效果
figure(2)
% subplot(2,1,1)
a=unique(index_cluster); %找出分类出的个数
C=cell(1,length(a));
for i=1:length(a)
   C(1,i)={find(index_cluster==a(i))};
end
for j=1:cluster_num
    data_get=data(C{1,j},:);
    scatter(data_get(:,1),data_get(:,2),100,'filled','MarkerFaceAlpha',.6,'MarkerEdgeAlpha',.9);
    hold on
end
%绘制聚类中心
plot(cluster(:,1),cluster(:,2),'ks','LineWidth',2);
hold on
sc_t=mean(silhouette(data,index_cluster'));
title_str=['原理推导K均值聚类','  聚类数为:',num2str(cluster_num),'  SC轮廓系数:',num2str(sc_t)];
title(title_str)

3. Fully implement the code

clc;clear;close all;
data(:,1)=[90,35,52,83,64,24,49,92,99,45,19,38,1,71,56,97,63,...
    32,3,34,33,55,75,84,53,15,88,66,41,51,39,78,67,65,25,40,77,...
    13,69,29,14,54,87,47,44,58,8,68,81,31];
data(:,2)=[33,71,62,34,49,48,46,69,56,59,28,14,55,41,39,...
    78,23,99,68,30,87,85,43,88,2,47,50,77,22,76,94,11,80,...
    51,6,7,72,36,90,96,44,61,70,60,75,74,63,40,81,4];
figure(1)
scatter(data(:,1),data(:,2),'LineWidth',2)
title("原始数据散点图")
cluster_num=4;
[index_cluster,cluster] = kmeans_func(data,cluster_num);
%% 画出聚类效果
figure(2)
% subplot(2,1,1)
a=unique(index_cluster); %找出分类出的个数
C=cell(1,length(a));
for i=1:length(a)
   C(1,i)={find(index_cluster==a(i))};
end
for j=1:cluster_num
    data_get=data(C{1,j},:);
    scatter(data_get(:,1),data_get(:,2),100,'filled','MarkerFaceAlpha',.6,'MarkerEdgeAlpha',.9);
    hold on
end
%绘制聚类中心
plot(cluster(:,1),cluster(:,2),'ks','LineWidth',2);
hold on
sc_t=mean(silhouette(data,index_cluster'));
title_str=['原理推导K均值聚类','  聚类数为:',num2str(cluster_num),'  SC轮廓系数:',num2str(sc_t)];
title(title_str)

4. Summary

The above is the entire code of the kmeans clustering algorithm implemented by matlab, and the data set can be replaced on the basis of the above code and applied to other scenarios. If there are friends who don’t understand, welcome to leave a comment or send a private message, and you can also send a private message to the blogger for code customization.

Guess you like

Origin blog.csdn.net/qq_37904531/article/details/128839657