Chinese Whispers Face Clustering Algorithm Implementation


Contents

  1. Introduction to Chinese Whispers
  2. Algorithm
  3. Strengths and Weaknesses
  4. Calling from the Dlib Library
  5. Clustering Time Statistics
  6. Python Implementation

Introduction to Chinese Whispers

Chinese Whispers is a clustering method used in network science named after the famous whispering game.[1] Clustering methods are basically used to identify communities of nodes or links in a given network. This algorithm was designed by Chris Biemann and Sven Teresniak in 2005.[1] The name comes from the fact that the process can be modeled as a separation of communities where the nodes send the same type of information to each other.[1]

Chinese Whispers is a hard-partitioning, randomized, flat clustering (no hierarchical relations between clusters) method.[1] The random property means that running the process on the same network several times can lead to different results, while because of hard partitioning a node can only belong to one cluster at a given moment. The original algorithm is applicable to undirected graphs, both weighted and unweighted. Chinese Whispers runs in linear time, which means that it is extremely fast even if the number of nodes and links in the network is very high.[1]

Algorithm flowchart
Chinese whisper.png

Algorithm

The algorithm works in the following way in an undirected unweighted graph:

  1. Every node is assigned a distinct initial class, so the number of initial classes equals the number of nodes.
  2. Then all of the network's nodes are selected one by one in random order. Each node moves to the class with which it shares the most links. In the case of a tie, the class is chosen randomly from the equally linked ones.
  3. Step two repeats until a predetermined number of iterations is reached or until the process converges. In the end the emerging classes represent the clusters of the network.
  4. A predetermined limit on the number of iterations is needed because the process may not converge. On the other hand, in a network with approximately 10,000 nodes the clusters do not change significantly after 40-50 iterations even if there is no convergence.
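The steps above can be sketched in Python for an undirected, unweighted graph. This is a minimal illustration, not the dlib implementation; the adjacency-dict representation and the function name are my own, and ties are broken by whichever label is encountered first among a node's neighbours:

```python
import random
from collections import Counter

def chinese_whispers(adj, iterations=20, seed=0):
    """adj: dict mapping node -> list of neighbour nodes (undirected, unweighted)."""
    rng = random.Random(seed)
    # Step 1: every node starts in its own class.
    labels = {node: node for node in adj}
    for _ in range(iterations):
        nodes = list(adj)
        rng.shuffle(nodes)  # Step 2: visit the nodes in random order.
        for node in nodes:
            if not adj[node]:
                continue  # isolated node keeps its own class
            # Adopt the label occurring most often among the neighbours;
            # Counter.most_common breaks ties by first encounter.
            counts = Counter(labels[nb] for nb in adj[node])
            labels[node] = counts.most_common(1)[0][0]
    return labels

# Two disconnected triangles: each should collapse into one cluster.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1],
       3: [4, 5], 4: [3, 5], 5: [3, 4]}
labels = chinese_whispers(adj)
```

With this toy graph the two triangles end up with two distinct labels, one per component.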

The core of the algorithm:

  1. Build an undirected graph, with each face as a node and the similarity between two faces as the weight of the edge between their nodes. If the similarity between two faces is below the threshold set above, there is no edge between the corresponding nodes.
  2. At the start of the iteration, each face is assigned an id that serves as its class; in other words, at initialization every face is its own class.
  3. Start the first iteration: pick a node at random and process all of its neighbours as follows:
    1. At initialization, since every node has its own class, the node adopts the class of the neighbour with the largest edge weight, completing the update of this node's class.
    2. From the second iteration on, a node may have two neighbours belonging to the same class. In that case the weights of the neighbours in the same class are summed, and the node adopts the class with the largest accumulated weight among all of its neighbours' classes.
  4. Once every node has been processed, one iteration is complete. Repeat step 3 until the number of iterations is reached.

This method clusters on a graph: each node corresponds to a face, and each edge corresponds to the similarity between two nodes, i.e. between two faces. A node's class is found by accumulating the similarity weights of its neighbours' classes at each iteration. Clustering the ms-celeb dataset with features from a facenet embedding shows that, when the model and threshold are chosen well, good results are reached after only 10 iterations. The quality of the result depends mainly on the model and the choice of threshold; during iteration the similarity is used as the edge weight.
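The graph-construction step described above can be sketched as follows. The toy embeddings, the 0.6 default threshold, and the similarity-style weight `1 - dist/threshold` are illustrative assumptions; in practice a face-embedding model such as facenet would supply the vectors:

```python
import numpy as np

def build_edges(embeddings, threshold=0.6):
    """Connect faces i, j when their embedding distance is below the threshold."""
    edges = []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            dist = np.linalg.norm(embeddings[i] - embeddings[j])
            if dist < threshold:
                # Similarity-style weight: closer faces contribute more
                # when neighbour classes are scored during iteration.
                edges.append((i, j, 1.0 - dist / threshold))
    return edges

# Toy embeddings: two tight groups far apart, so only
# (0,1) and (2,3) fall under the distance threshold.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
edges = build_edges(emb)
```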

Strengths and Weaknesses

The main strength of Chinese Whispers lies in its linear time complexity. Because the processing time increases linearly with the number of nodes, the algorithm can identify communities in a network very quickly. For this reason Chinese Whispers is a good tool for analyzing community structures in graphs with a very high number of nodes. The effectiveness of the method increases further if the network has the small-world property.[1]

On the other hand, because the algorithm is not deterministic, with a small number of nodes the resulting clusters often differ significantly from run to run. The reason is that in a small network it matters more from which node the iteration process starts, while in large networks the relevance of the starting point disappears.[1] For this reason, other clustering methods are recommended for small graphs.

Calling from the Dlib Library

The Chinese Whispers clustering algorithm is used when you do not know how many classes there are. Its basic steps are:

  1. Assign every node v an initial class: class(vi) = i.
  2. Pick a node vt at random, find all of vt's adjacent nodes, and score the classes those neighbours belong to. For example, if node 1's neighbours are 2, 3, 4, 5, belonging to classes a, b, c, b, and edges 1-2, 1-3, 1-4, 1-5 all have weight 1, then class a scores 1, class b scores 2, and class c scores 1.
  3. Assign the highest-scoring class to vt.
  4. Return to step 2.
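The scoring in step 2 can be checked against the worked example (node 1's neighbours 2, 3, 4, 5 in classes a, b, c, b, every edge weight 1). A small sketch, with the neighbour table written out by hand:

```python
from collections import defaultdict

# Neighbour -> (class, edge weight) for node 1, as in the example above.
neighbours = {2: ('a', 1.0), 3: ('b', 1.0), 4: ('c', 1.0), 5: ('b', 1.0)}

# Accumulate the edge weights per class.
scores = defaultdict(float)
for cls, weight in neighbours.values():
    scores[cls] += weight

# Class 'b' wins with an accumulated score of 2.
best = max(scores, key=scores.get)
```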

The dlib calling code is shown below:

    std::vector<sample_pair> edges;
    for (size_t i = 0; i < face_descriptors.size(); ++i)
    {
        for (size_t j = i+1; j < face_descriptors.size(); ++j)
        {
            // Faces are connected in the graph if they are close enough.  Here we check if
            // the distance between two face descriptors is less than randis.  The dlib
            // example uses 0.6, the decision threshold the network was trained to use,
            // although you can certainly use any other threshold you find useful.
            if (length(face_descriptors[i]-face_descriptors[j]) < randis)
                edges.push_back(sample_pair(i,j));
        }
    }
    std::vector<unsigned long> labels;
    const auto num_clusters = chinese_whispers(edges, labels);

     face_descriptors: all of the face feature vectors
     randis: the distance threshold
     edges: the algorithm's input, a connection-graph structure
     labels: the return value, indicating which cluster each sample belongs to

     
     // Function implementation
     inline unsigned long chinese_whispers (  
        const std::vector<ordered_sample_pair>& edges,  
        std::vector<unsigned long>& labels,  
        const unsigned long num_iterations,  
        dlib::rand& rnd  
    )  
    {  
        // make sure requires clause is not broken; the edge list passed in must be sorted
        DLIB_ASSERT(is_ordered_by_index(edges),  
                    "\t unsigned long chinese_whispers()"  
                    << "\n\t Invalid inputs were given to this function"  
        );  
  
        labels.clear();  
        if (edges.size() == 0)  
            return 0;  
  
        std::vector<std::pair<unsigned long, unsigned long> > neighbors;  
        find_neighbor_ranges(edges, neighbors);  
  
        // Initialize the labels, each node gets a different label.  
          
        labels.resize(neighbors.size());  
        for (unsigned long i = 0; i < labels.size(); ++i)  
            labels[i] = i;  
  
  
        for (unsigned long iter = 0; iter < neighbors.size()*num_iterations; ++iter)  
        {  
            // Pick a random node.
            const unsigned long idx = rnd.get_random_64bit_number()%neighbors.size();  
  
            // Count how many times each label happens amongst our neighbors,
            // scoring each neighbouring class by its accumulated edge weight.
            std::map<unsigned long, double> labels_to_counts;  
            const unsigned long end = neighbors[idx].second;  
            for (unsigned long i = neighbors[idx].first; i != end; ++i)  
            {  
                labels_to_counts[labels[edges[i].index2()]] += edges[i].distance();  
            }  
  
            // Find the most common label, i.e. the highest-scoring class,
            // and assign this node to it.
            std::map<unsigned long, double>::iterator i;  
            double best_score = -std::numeric_limits<double>::infinity();  
            unsigned long best_label = labels[idx];  
            for (i = labels_to_counts.begin(); i != labels_to_counts.end(); ++i)  
            {  
                if (i->second > best_score)  
                {  
                    best_score = i->second;  
                    best_label = i->first;  
                }  
            }  
  
            labels[idx] = best_label;  
        }  
  
  
        // Remap the labels into a contiguous range.  First we find the mapping;
        // the labels found above may not be contiguous (0,1,2,3...), so they
        // are remapped to consecutive ids.
        std::map<unsigned long,unsigned long> label_remap;  
        for (unsigned long i = 0; i < labels.size(); ++i)  
        {  
            const unsigned long next_id = label_remap.size();  
            if (label_remap.count(labels[i]) == 0)  
                label_remap[labels[i]] = next_id;  
        }  
        // Now apply the mapping to all the labels, assigning every node its final class.
        for (unsigned long i = 0; i < labels.size(); ++i)  
        {  
            labels[i] = label_remap[labels[i]];  
        }  
  
        return label_remap.size();  
    }

Clustering Time Statistics

Dlib face cluster.jpg

Python Implementation

import networkx as nx
from random import shuffle

# build node and edge lists
nodes = [
    (1, {'attr1': 1}),
    (2, {'attr1': 1}),
    # ... more (id, attributes) pairs
]
edges = [
    (1, 2, {'weight': 0.732}),
    # ... more (id, id, weight) triples
]
# initialize the graph
G = nx.Graph()
# Add nodes
G.add_nodes_from(nodes)
# CW needs an arbitrary, unique class for each node before initialisation
# Here I use the ID of the node since I know it's unique
# You could use a random number or a counter or anything really
for n, _ in nodes:
    G.nodes[n]['class'] = n
# add edges
G.add_edges_from(edges)
# run Chinese Whispers
# I default to 10 iterations. This number is usually low.
# After a certain number (individual to the data set) no further clustering occurs
iterations = 10
for z in range(iterations):
    gn = list(G.nodes())
    # I randomize the nodes to give me an arbitrary start point
    shuffle(gn)
    for node in gn:
        neighs = G[node]
        classes = {}
        # do an inventory of the given node's neighbours and edge weights
        for ne in neighs:
            if isinstance(ne, int):
                if G.nodes[ne]['class'] in classes:
                    classes[G.nodes[ne]['class']] += G[node][ne]['weight']
                else:
                    classes[G.nodes[ne]['class']] = G[node][ne]['weight']
        # find the class with the highest edge weight sum
        best_score = 0
        maxclass = G.nodes[node]['class']
        for c in classes:
            if classes[c] > best_score:
                best_score = classes[c]
                maxclass = c
        # set the class of target node to the winning local class
        G.nodes[node]['class'] = maxclass


Reposted from blog.csdn.net/u011808673/article/details/78644485