A detailed explanation of the FP Tree algorithm for generating frequent itemsets - machine learning

Foreword

In (Summary of Apriori Algorithm Principles in Machine Learning (22)), we summarized the principle of the Apriori algorithm. As an algorithm for mining frequent itemsets, Apriori has to scan the data many times, and I/O is a major bottleneck. To solve this problem, the FP Tree algorithm (also known as FP-Growth) uses a few clever tricks: no matter how large the data set is, it only needs to be scanned twice, which greatly improves efficiency. Below we summarize the FP Tree algorithm.

FP Tree data structure

To reduce the number of I/O passes, the FP Tree algorithm introduces a data structure to hold the data temporarily. This data structure consists of three parts, as shown in the following figure:


[Figure 1: the item header table, the FP tree, and the node linked lists]

The first part is the item header table, which records the support count of every frequent 1-itemset, arranged in descending order of count. For example, in the figure above, B appears in 8 of the 10 transactions, so it ranks first. This part is easy to understand. The second part is the FP Tree itself, which maps the original data set onto a tree in memory. This tree is harder to understand; how it is built is discussed later. The third part is the node linked list. Each frequent 1-itemset in the item header table is the head of a node linked list, which in turn points to every position where that item appears in the FP tree. Its main purpose is to make it easy to move back and forth between the item header table and the FP Tree, and it is easy to understand.
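To make the three parts concrete, here is a minimal Python sketch of one way they might be represented. The class and field names (FPNode, node_link, header_table) are illustrative assumptions for this article, not part of any library:

```python
class FPNode:
    def __init__(self, item, count, parent):
        self.item = item          # the item this node represents (None for the root)
        self.count = count        # how many transactions share this prefix
        self.parent = parent      # parent node, used later to climb prefix paths
        self.children = {}        # item -> child FPNode
        self.node_link = None     # next node holding the same item (the node linked list)

# Item header table: item -> [support count, head of that item's node linked list],
# kept in descending order of support, e.g. {'B': [8, <node>], 'A': [8, <node>], ...}
header_table = {}
```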


Item header table creation

Building the FP tree requires building the item header table first, so let's look at how the item header table is created.


Scan the data once to obtain the counts of all 1-itemsets. Delete the items whose support is below the threshold, put the frequent 1-itemsets into the item header table, and arrange them in descending order of support. Then scan the data a second time, remove the non-frequent items from each transaction read, and arrange the remaining items in descending order of support.


The paragraph above is rather abstract, so let's explain it in detail with the following example.

As shown in Figure 2, there are 10 transactions. First, scan the data once and count the 1-itemsets. O, I, L, J, P, M, and N each appear only once, so their support falls below the 20% threshold and they do not appear in the item header table below. The remaining items A, C, E, G, B, D, and F are arranged in descending order of support to form our item header table.


[Figure 2: the example data set and the resulting item header table]


Then we scan the data a second time, remove the non-frequent items from each transaction, and arrange the rest in descending order of support. For example, in the transaction ABCEFO, O is a non-frequent 1-itemset and is removed, leaving ABCEF. Sorted by support, it becomes ACEBF. The other transactions are handled the same way. Why sort the frequent 1-itemsets within each transaction? So that transactions share ancestor nodes as much as possible when we build the FP tree later.
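The two scans can be sketched in a few lines of Python. The function name build_header_and_sort and the threshold expressed as a raw count are hypothetical choices for illustration:

```python
from collections import Counter

def build_header_and_sort(transactions, min_count):
    """First scan: count 1-itemsets and keep the frequent ones.
    Second scan: drop infrequent items from each transaction and
    reorder the rest by descending support (e.g. ABCEFO -> ACEBF)."""
    # First scan over the data: count every item.
    counts = Counter(item for tx in transactions for item in tx)
    frequent = {item: c for item, c in counts.items() if c >= min_count}
    # Header-table order: descending support, ties broken by item name.
    rank = {item: i for i, (item, _) in
            enumerate(sorted(frequent.items(), key=lambda kv: (-kv[1], kv[0])))}
    # Second scan over the data: filter and reorder each transaction.
    sorted_txs = []
    for tx in transactions:
        kept = sorted((item for item in tx if item in frequent), key=rank.get)
        if kept:
            sorted_txs.append(kept)
    return frequent, sorted_txs

# With the 10-row example and a 20% threshold (a count of 2):
# frequent, sorted_txs = build_header_and_sort(transactions, min_count=2)
```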


After two scans, the item header table has been established, and the sorted data set has been obtained. Let's take a look at how to build an FP tree.


FP Tree construction

With the item header table and the sorted data set in hand, we can begin building the FP tree. At the start, the tree is empty. We read the sorted transactions one by one and insert them into the FP tree in sorted order: the item that comes first in the order is an ancestor node, and the items after it are descendant nodes. If a transaction shares a common prefix with the tree, the counts of the corresponding common ancestor nodes are incremented by 1. Whenever a new node is created during insertion, it is linked from the item header table through the node linked list. Once all transactions have been inserted, the FP tree is complete.


This still sounds abstract, so let's describe it with the example from the previous section.

First, insert the first transaction ACEBF, as shown in Figure 3. At this point the FP tree has no nodes, so ACEBF forms a single independent path, all nodes have a count of 1, and the item header table is linked to each corresponding new node through the node linked list.


[Figure 3: the FP tree after inserting the first transaction ACEBF]



Next, insert the transaction ACG, as shown in Figure 4. ACG shares the common ancestor sequence AC with the existing FP tree, so we only need to add a new node G with a count of 1, while the counts of A and C are incremented to 2. Of course, the node linked list for G must also be updated.


[Figure 4: the FP tree after inserting ACG]



The remaining 8 transactions are inserted in the same way, as shown in the following 8 figures. Since the principle is the same, you can try to insert them yourself and compare. If you can insert all 10 transactions independently, the process of building the FP tree will hold no difficulty. A code sketch of the insertion routine is given after the figures.


[Figures 5-12: the FP tree after inserting each of the remaining 8 transactions]
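Here is a minimal sketch of the insertion routine, assuming the FPNode class and header table layout from the earlier snippet (the function name insert_tree is hypothetical):

```python
def insert_tree(items, node, header_table):
    """Insert one sorted transaction into the FP tree rooted at `node`,
    incrementing counts along shared prefixes and threading any new node
    onto its item's node linked list."""
    if not items:
        return
    first, rest = items[0], items[1:]
    if first in node.children:
        node.children[first].count += 1            # shared ancestor: just bump the count
    else:
        child = FPNode(first, 1, node)             # new node with count 1
        node.children[first] = child
        if header_table[first][1] is None:         # link it from the item header table
            header_table[first][1] = child
        else:
            cur = header_table[first][1]
            while cur.node_link is not None:
                cur = cur.node_link
            cur.node_link = child
    insert_tree(rest, node.children[first], header_table)

# Building the whole tree from the sorted transactions:
# header_table = {item: [count, None] for item, count in frequent.items()}
# root = FPNode(None, 0, None)
# for tx in sorted_txs:
#     insert_tree(tx, root, header_table)
```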


Mining the FP Tree

Once the FP tree is built, how do we mine frequent itemsets from it? At this point we have the FP tree, the item header table, and the node linked lists. We mine the items in the header table one by one, starting from the bottom. For each item, we first find its conditional pattern base. The conditional pattern base is the FP subtree formed by the prefix paths that end at the nodes of the item being mined, treated as leaf nodes. We take this subtree, set the count of every node on each path to the count of its leaf node, and delete nodes whose count falls below the support threshold. From this conditional pattern base we can then recursively mine frequent itemsets.
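Finding a conditional pattern base amounts to following the item's node linked list and climbing from each occurrence to the root. A sketch, again assuming the FPNode and header table layout used above:

```python
def prefix_paths(item, header_table):
    """Collect the conditional pattern base of `item`: follow its node linked
    list, and for every occurrence climb to the root, recording the prefix
    path weighted by that leaf's count (the leaf itself is excluded)."""
    paths = []
    node = header_table[item][1]            # head of this item's node linked list
    while node is not None:
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        path.reverse()                      # root-to-leaf order
        if path:
            paths.append((path, node.count))
        node = node.node_link
    return paths
```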


Let's again use the example above to explain.

Start with F, the bottom entry of the header table, and find its conditional pattern base. Since F appears at only one node in the FP tree, there is only one candidate path, shown on the left of the figure below, corresponding to {A:8, C:8, E:6, B:2, F:2}. We then set the counts of all ancestor nodes to the count of the leaf node, so the FP subtree becomes {A:2, C:2, E:2, B:2, F:2}. Usually the conditional pattern base does not include the leaf node itself, so the final conditional pattern base of F is as shown on the right of the figure below.


[Figure: the prefix path of F (left) and the conditional pattern base of F (right)]
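If the tree and header table from the earlier snippets had been built from these 10 transactions, a hypothetical call to the prefix_paths sketch would return exactly this conditional pattern base:

```python
base_F = prefix_paths('F', header_table)
print(base_F)   # expected: [(['A', 'C', 'E', 'B'], 2)], i.e. {A:2, C:2, E:2, B:2}
```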


From it, it is easy to obtain the frequent 2-itemsets containing F: {A:2,F:2}, {C:2,F:2}, {E:2,F:2}, {B:2,F:2}. Recursively merging them gives frequent 3-itemsets such as {A:2,C:2,F:2}, {A:2,E:2,F:2}, and so on; the remaining 3-itemsets are not listed here. Continuing the recursion, the largest frequent itemset containing F is the frequent 5-itemset {A:2,C:2,E:2,B:2,F:2}.
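The recursion just described can be sketched as follows, reusing the earlier hypothetical helpers (FPNode, build_header_and_sort, insert_tree, prefix_paths). It is an illustration of the idea rather than an optimized implementation:

```python
def build_tree(transactions, min_count):
    """Build an FP tree and its item header table from plain transactions,
    composing the two-scan and insertion sketches shown earlier."""
    frequent, sorted_txs = build_header_and_sort(transactions, min_count)
    header = {item: [count, None] for item, count in frequent.items()}
    root = FPNode(None, 0, None)
    for tx in sorted_txs:
        insert_tree(tx, root, header)
    return header, root

def fp_growth(header_table, suffix, min_count, results):
    """For each item, bottom of the header table first, record {item} | suffix as
    frequent, build the item's conditional pattern base, turn it into a conditional
    FP tree, and recurse while that tree is non-empty."""
    for item in sorted(header_table, key=lambda i: header_table[i][0]):   # ascending support
        new_suffix = suffix | {item}
        results[frozenset(new_suffix)] = header_table[item][0]
        base = prefix_paths(item, header_table)                 # conditional pattern base
        cond_txs = [list(path) for path, count in base for _ in range(count)]
        cond_header, _ = build_tree(cond_txs, min_count)
        if cond_header:
            fp_growth(cond_header, new_suffix, min_count, results)
```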


After F has been mined, we move on to the D node. D is a bit more complicated than F because it has two leaf nodes, so the FP subtree obtained first is as shown on the left of the figure below. We then set the counts of all ancestor nodes to the counts of their leaf nodes, giving {A:2, C:2, E:1, G:1, D:1, D:1}. The E and G nodes now have support below the threshold within the conditional pattern base, so we delete them. After removing the low-support nodes and excluding the leaf nodes, the conditional pattern base of D is {A:2, C:2}. From it we easily obtain the frequent 2-itemsets containing D: {A:2,D:2} and {C:2,D:2}. Recursively merging them gives the frequent 3-itemset {A:2,C:2,D:2}. The largest frequent itemset containing D is therefore this frequent 3-itemset.


[Figure: the FP subtree of D (left) and the conditional pattern base of D (right)]


In the same way, the conditional pattern base of B is obtained, shown on the right of the figure below. Recursively mining it, the largest frequent itemset containing B is the frequent 4-itemset {A:2, C:2, E:2, B:2}.


[Figure: the conditional pattern base of B]


Continuing with G, its conditional pattern base is shown on the right of the figure below. Recursively mining it, the largest frequent itemset containing G is the frequent 4-itemset {A:5, C:5, E:4, G:4}.


[Figure: the conditional pattern base of G]


The conditional pattern base of E is shown on the right of the figure below. Recursively mining it, the largest frequent itemset containing E is the frequent 3-itemset {A:6, C:6, E:6}.


[Figure: the conditional pattern base of E]


The conditional pattern base of C is shown on the right of the figure below. Recursively mining it, the largest frequent itemset containing C is the frequent 2-itemset {A:8, C:8}.


[Figure: the conditional pattern base of C]


As for A, its conditional pattern base is empty, so there is nothing left to mine.


At this point we have obtained all frequent itemsets. If we only want the largest frequent K-itemsets, the analysis above shows that the largest frequent itemset is a 5-itemset, namely {A:2, C:2, E:2, B:2, F:2}.


FP Tree algorithm flow

Let's summarize the FP Tree algorithm flow. It consists of the following steps:

1) Scan the data once to obtain the counts of all 1-itemsets. Delete the items whose support is below the threshold, put the frequent 1-itemsets into the item header table, and arrange them in descending order of support.

2) Scan the data a second time, remove the non-frequent items from each transaction read, and arrange the remaining items in descending order of support.

3) Read the sorted transactions one by one and insert them into the FP tree in sorted order: the item that comes first in the order is an ancestor node, and the items after it are descendants. If a transaction shares a common ancestor with the tree, the corresponding ancestor node counts are incremented by 1. Whenever a new node is created, it is linked from the item header table through the node linked list. Once all transactions are inserted, the FP tree is complete.

4) Starting from the bottom item of the item header table, find the conditional pattern base of each header-table item in turn. Recursively mine each conditional pattern base to obtain the frequent itemsets for that item.

5) If the number of items in a frequent itemset is not limited, return all the frequent itemsets from step 4; otherwise return only the frequent itemsets that satisfy the item-count requirement.
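Putting the five steps together, an end-to-end sketch using the hypothetical helpers from the earlier snippets might look like this:

```python
def fp_tree_frequent_itemsets(transactions, min_count, max_size=None):
    """Steps 1-3: build the header table and FP tree; step 4: mine recursively;
    step 5: optionally keep only itemsets up to a requested number of items."""
    header, _root = build_tree(transactions, min_count)       # steps 1-3
    results = {}
    fp_growth(header, set(), min_count, results)              # step 4
    if max_size is not None:                                  # step 5
        results = {s: c for s, c in results.items() if len(s) <= max_size}
    return results

# Hypothetical usage on the 10-row example (support threshold 20% -> a count of 2):
# itemsets = fp_tree_frequent_itemsets(transactions, min_count=2)
```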


FP Tree summary

The FP Tree algorithm removes the I/O bottleneck of the Apriori algorithm by making clever use of a tree structure, which is reminiscent of BIRCH clustering; BIRCH also uses a tree structure cleverly to speed up the algorithm. Using in-memory data structures to trade space for time is a common way to relieve an algorithm's runtime bottleneck. In practice, the FP Tree algorithm is an association rule algorithm that can be used in a production environment, while the Apriori algorithm served as a pioneer, a guiding light for association algorithms. Besides FP Tree, there are also algorithms such as GSP and CBA, which belong to the Apriori family.

