"Graph Learning Column": High Density Subgraph Mining

This week we introduce another "clustering"-style algorithm on graphs: dense subgraph mining. It differs from community detection in two ways. First, the data it processes are heterogeneous graphs (there is more than one node type, and there are generally no edges between nodes of the same type). Second, community detection partitions all nodes of the graph into communities, whereas dense subgraph mining only cares about the most tightly connected community. What the two share is the same pattern of thinking: define a measure of density, then heuristically optimize that value.

The Lockstep Behavior Pattern

In practice, high-density subgraphs are mostly used for detecting fraud gangs. The densest community found by this type of algorithm usually indicates a synchronized, massively connected behavior pattern, called lockstep behavior, which matches the essential characteristics of fraud gangs in real scenarios. For example, when gangs of spam accounts post malicious ratings or fake likes, they exhibit exactly this pattern.

As the figure above shows, finding the lockstep pattern amounts to finding a tightly connected bipartite structure, which corresponds to a dense block in the adjacency matrix. In "EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs", the authors explore how this pattern shows up in spectral space:

The figure above shows the result of an SVD of the adjacency matrix, visualizing the first two columns of the U and V singular vectors. The authors' basic conclusion is that this kind of pattern forms clear clusters in spectral space, and each cluster corresponds to a dense block in the adjacency matrix. This phenomenon makes it very convenient to spot high-density subgraph structure in a graph by inspection.
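As an illustration of this kind of spectral inspection (not the EigenSpokes implementation itself), here is a minimal Python sketch on a toy adjacency matrix with one injected dense block; the matrix sizes, contents, and plotting choices are all assumptions for demonstration.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds
import matplotlib.pyplot as plt

# Toy user-by-item adjacency matrix: sparse random noise
# plus one injected dense block (a lockstep group).
A = sparse_random(1000, 800, density=0.002, random_state=0, format="lil")
A[:40, :30] = 1.0            # dense block: 40 users all hitting the same 30 items
A = A.tocsr()

# Truncated SVD: keep only two singular vectors.
U, s, Vt = svds(A, k=2)

# Scatter the two left/right singular vectors; the lockstep users and items
# separate from the bulk near the origin, forming the "spoke" clusters.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(U[:, 0], U[:, 1], s=5)
ax1.set_title("users in spectral space")
ax2.scatter(Vt[0, :], Vt[1, :], s=5)
ax2.set_title("items in spectral space")
plt.show()
```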

 

But how can this discovery process be automated, while avoiding the high time complexity of a full SVD? The FRAUDAR algorithm explained below achieves this goal well.

 

FRAUDAR Algorithm

The FRAUDAR algorithm automatically mines high-density subgraphs in a bipartite graph and is highly resistant to camouflage by fraudsters. Its linear time complexity also makes it well suited to industrial-scale business data, and the paper won the Best Paper Award at KDD 2016.

 

Let's take a look at how the paper finds high-density subgraphs:

Problem Definition

Consider the scenario where users rate products: m users and n products form a bipartite graph between users and products, and we need to find the suspected fraudulent users in it. The remaining notation is introduced along with the metric below.
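To make the setup concrete, here is a tiny sketch (with purely illustrative data) of the user-by-product adjacency matrix that the later snippets operate on.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical rating records as (user_index, product_index) pairs.
# m users are the rows and n products are the columns of a 0/1
# bipartite adjacency matrix A.
m, n = 5, 4
edges = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1), (3, 2), (4, 3)]
rows, cols = zip(*edges)
A = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(m, n))

print(A.toarray())  # each row is a user, each column is a product
```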

Metric

To measure how suspicious a found community is, the authors define a very simple metric. For a community S of user and product nodes,

f(S) = Σ_{i ∈ S} a_i + Σ_{i,j ∈ S, (i,j) ∈ E} c_ij,    g(S) = f(S) / |S|

where a_i denotes the suspiciousness of node i in the community S, and c_ij denotes the suspiciousness of the edge between nodes i and j in the community.

In general, the node suspiciousness a_i can be supplied from prior information; in what follows it is set to 0. The edge suspiciousness c_ij can be taken as the edge weight; in the most basic case it is simply 1, so that the whole suspiciousness measure degenerates into the (average) density of edges in the community.
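Below is a minimal sketch of evaluating this score for a candidate community, under the simplifying assumptions just mentioned (a_i = 0, constant edge weight c_ij); the data and function name are illustrative.

```python
import numpy as np

def suspiciousness(A, user_set, product_set, c=1.0):
    """Average suspiciousness g(S) = f(S) / |S| of a candidate community S,
    assuming node priors a_i = 0 and a constant edge suspiciousness c_ij = c.
    A is a dense 0/1 user-by-product adjacency matrix."""
    users = sorted(user_set)
    products = sorted(product_set)
    edge_mass = c * A[np.ix_(users, products)].sum()  # f(S): edge suspiciousness inside S
    return edge_mass / (len(users) + len(products))   # divide by community size |S|

# Example: a 2x2 block of ratings plus one stray rating.
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
print(suspiciousness(A, {0, 1}, {0, 1}))        # 4 edges / 4 nodes = 1.0
print(suspiciousness(A, {0, 1, 2}, {0, 1, 2}))  # 5 edges / 6 nodes ~ 0.83
```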

 

To justify this metric, the authors prove that it satisfies the following four properties:

1. Node suspiciousness: the more suspicious the nodes, the more suspicious the community.

2. Edge suspiciousness: the more suspicious the edges, the more suspicious the community.

3. Size: with edge density held equal, the larger community is the more suspicious one.

4. Concentration: with the total suspiciousness of nodes and edges held equal, the smaller community is the more suspicious one.
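For intuition, take the unweighted form above (a_i = 0, c_ij = 1), so that g(S) is just the number of edges divided by the number of nodes. A fully connected 2-by-2 block scores 4/4 = 1, while a fully connected 4-by-4 block scores 16/8 = 2: same edge density, but the larger community scores higher (size). Conversely, those same 16 edges spread across a community of 20 nodes score only 16/20 = 0.8: same total suspiciousness, but the less concentrated community scores lower (concentration).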

Algorithm

Like many heuristic algorithms, once this metric is defined it can be optimized greedily. Specifically, starting from the whole graph, nodes are deleted one at a time, each time removing the node whose deletion keeps the suspiciousness of the remaining graph as high as possible. A key acceleration is that deleting a node i only changes the state of the nodes connected to i. Since the metric used here ultimately reduces to the edge density of the community, it suffices to delete the node with the smallest (weighted) degree at each step and then update the degrees of its neighbors. Repeatedly retrieving the current minimum-degree node can be done quickly with a data structure such as a priority tree.

 

The algorithm steps are as follows: nodes are deleted one by one so that the remaining nodes always form the most suspicious possible community; the round with the highest community suspiciousness over the entire deletion process is recorded, and the subgraph formed by the nodes remaining in that round is returned as the most suspicious community. A minimal sketch of this peeling loop is given below.
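The following is a minimal sketch of the greedy peeling loop in the unweighted case (a_i = 0, c_ij = 1), using Python's heapq with lazy deletion as the priority structure; it illustrates the idea rather than reproducing the paper's exact implementation.

```python
import heapq

def fraudar_unweighted(user_adj, n_products):
    """Greedy peeling for the unweighted metric g(S) = |edges inside S| / |S|.

    user_adj: list over users; user_adj[i] is the set of product indices user i rated.
    Returns (best_users, best_products): the community with the highest score
    seen at any point during the peeling process.
    """
    n_users = len(user_adj)
    # Build the reverse direction of the bipartite adjacency.
    prod_adj = [set() for _ in range(n_products)]
    for u, prods in enumerate(user_adj):
        for p in prods:
            prod_adj[p].add(u)

    # Node ids: 0..n_users-1 are users, n_users..n_users+n_products-1 are products.
    def neighbors(v):
        if v < n_users:
            return {n_users + p for p in user_adj[v]}
        return set(prod_adj[v - n_users])

    alive = set(range(n_users + n_products))
    degree = {v: len(neighbors(v)) for v in alive}
    edges = sum(len(prods) for prods in user_adj)
    heap = [(d, v) for v, d in degree.items()]
    heapq.heapify(heap)

    order = []                                   # nodes in removal order
    best_score, best_step = edges / len(alive), 0
    step = 0
    while len(alive) > 1:
        d, v = heapq.heappop(heap)
        if v not in alive or d != degree[v]:     # stale heap entry: skip it
            continue
        alive.remove(v)
        order.append(v)
        step += 1
        edges -= degree[v]                       # v's remaining edges disappear with it
        for w in neighbors(v):
            if w in alive:
                degree[w] -= 1
                heapq.heappush(heap, (degree[w], w))
        score = edges / len(alive)               # suspiciousness of the remaining graph
        if score > best_score:
            best_score, best_step = score, step

    # Undo the deletions past the best step to recover the best community.
    removed = set(order[:best_step])
    best_users = {v for v in range(n_users) if v not in removed}
    best_products = {v - n_users for v in range(n_users, n_users + n_products)
                     if v not in removed}
    return best_users, best_products

# Toy example: users 0-2 all rate products 0 and 1 (a lockstep block),
# while users 3 and 4 each rate one unrelated product.
adj = [{0, 1}, {0, 1}, {0, 1}, {2}, {3}]
print(fraudar_unweighted(adj, 4))   # -> ({0, 1, 2}, {0, 1})
```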

Resisting Camouflage

So far we have seen how a heuristic algorithm can quickly find high-density subgraphs in a graph, but two problems remain:

  1. Real users and popular products naturally form tightly connected dense blocks. How does the algorithm distinguish this kind of community from one formed by fraudulent behavior?
  2. Fraudsters can disguise themselves through interactions with popular products, for example by randomly rating some other products or popular products to camouflage their own footprint, as shown in the figure below:

How are these two problems solved?

 

The core observation is that no matter how fraudsters camouflage themselves, for the products whose ratings are being bought, most of the rating weight comes from fraudulent users, while for other products most of the rating weight comes from real users. With this in mind, we can down-weight the edge suspiciousness c_ij according to the degree of the column (product) node, i.e. its column weight sum: the higher the degree of product j, the smaller c_ij should be. A high-degree product is not necessarily suspicious (it may simply be popular), and no matter how the camouflage edges are distributed, they will hardly reduce the suspiciousness of the fraudulent community. Comparing the two camouflage strategies in the figure above, and how the community suspiciousness changes before and after camouflage, makes this clear.

 

Specifically, the down-weighting function used in the paper is c_ij = 1 / log(x_j + c), where x_j is the column sum (degree) of product node j and c is a constant, set to 5 in the experiments. With this weighting, the algorithm achieves the best results across a series of experiments on industrial datasets later in the paper.
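Here is a minimal sketch of the effect of this down-weighting, comparing a natural popular-product block with a small fraud block under the unweighted and the 1/log(x_j + 5) weighted score; the sizes and data are illustrative assumptions.

```python
import numpy as np

def block_score(A, users, products, weighted=True, const=5.0):
    """Average suspiciousness of the community (users, products), with
    c_ij = 1 / log(x_j + const) when weighted (x_j = column sum of product j),
    or c_ij = 1 in the unweighted case."""
    col_degree = np.asarray(A.sum(axis=0)).ravel()
    col_weight = (1.0 / np.log(col_degree + const) if weighted
                  else np.ones_like(col_degree, dtype=float))
    block = A[np.ix_(sorted(users), sorted(products))]
    edge_mass = (block * col_weight[sorted(products)]).sum()
    return edge_mass / (len(users) + len(products))

# Toy matrix: 1000 honest users all rate 10 popular products (a natural dense
# block), and 10 fraud users all rate 10 low-degree products they are boosting.
A = np.zeros((1010, 20))
A[:1000, :10] = 1          # honest users x popular products
A[1000:, 10:] = 1          # fraud users x boosted products

honest = (set(range(1000)), set(range(10)))
fraud = (set(range(1000, 1010)), set(range(10, 20)))
for name, (u, p) in [("honest/popular block", honest), ("fraud block", fraud)]:
    print(name,
          "unweighted:", round(block_score(A, u, p, weighted=False), 2),
          "weighted:", round(block_score(A, u, p, weighted=True), 2))
# Unweighted, the natural popular block out-scores the fraud block;
# with the 1/log(x_j + 5) down-weighting the ranking flips.
```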

Dense Subtensor

The FRAUDAR algorithm above only mines high-density subgraphs in a bipartite graph. When the data has more dimensions, for example in a post-forwarding scenario where each record carries the 4-dimensional information of account, IP, post ID, and time, the data forms a tensor of rank 4, and the task becomes finding dense subtensors in this high-dimensional tensor. Although the dimensionality is higher, the approach remains the same as FRAUDAR's: the two representative dense-subtensor mining algorithms, M-Zoom and D-Cube, first define a density measure and then optimize it with heuristic iteration. Interested readers can consult the related papers.
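As a simplified illustration of the kind of density measure such methods optimize (not the full M-Zoom/D-Cube procedure), here is a sketch that scores a candidate subtensor by its mass divided by its average side length; the event data and block are hypothetical.

```python
# Hypothetical 4-dimensional events: (account, ip, post_id, hour).
events = [
    ("acct1", "ip1", "post9", 13), ("acct1", "ip1", "post9", 13),
    ("acct2", "ip1", "post9", 13), ("acct3", "ip2", "post9", 13),
    ("acct4", "ip3", "post7", 10), ("acct5", "ip4", "post8", 11),
]

def subtensor_density(events, block):
    """Density of a candidate subtensor: mass (number of events whose attribute
    values all fall inside the block) divided by the block's average side length
    (average number of attribute values kept per dimension)."""
    mass = sum(1 for e in events if all(v in block[d] for d, v in enumerate(e)))
    avg_side = sum(len(vals) for vals in block) / len(block)
    return mass / avg_side

# One set of retained attribute values per dimension.
block = ({"acct1", "acct2"}, {"ip1"}, {"post9"}, {13})
print(subtensor_density(events, block))   # 3 events / average side 1.25 = 2.4
```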

 

Generally speaking, if a dense region can be found in high-dimensional data, its suspiciousness is much higher, because forming such a region requires agreement across more dimensions that corroborate one another, whereas normal user behavior usually clusters in at most two dimensions. The comparison between the two figures below illustrates this:


References:

EigenSpokes: http://people.cs.vt.edu/badityap/papers/eigenspokes-pakdd10.pdf

FRAUDAR: https://www.cs.cmu.edu/~neilshah/research/papers/FRAUDAR.KDD.16.pdf

M-Zoom: https://www.cs.cmu.edu/~kijungs/papers/mzoomPKDD2016.pdf

D-Cube: http://www.cs.cmu.edu/~kijungs/papers/dcubeWSDM2017.pdf
