PageRank Algorithm Summary

For this semester's data mining course, the final assignment is a report on link analysis algorithms. This post is a summary of the PageRank (PR) algorithm.

Algorithm

The PR algorithm is built on the idea of authority: it considers not only how many links point to a page, but also how important the pages containing those links are.
PageRank is a static ranking method: the PR value of every page is computed offline, independently of any query. The resulting PR values can then serve as a basis for ranking web pages.
From the standpoint of authority, a PR value should reflect the following two points:

  1. A link from one page to another is an implicit endorsement of the target page's authority; that is, the more links point to a page, the higher its PR value;
  2. The pages that point to a page have PR values of their own. In real life, being recognized by someone with high authority carries more credibility than being recognized by someone with low authority. Therefore, if a page is linked to by pages with high PR values, its own PR value should also be higher.

Based on these ideas, we can derive the formula for the PR value. Treat the link structure among pages as a directed graph \(G = (V, E)\), where V is the set of all nodes (i.e., web pages) and E is the set of all directed edges (i.e., hyperlinks). Let \(|V| = n\). The PR value is then defined as:
\[P(i) = \sum_{(j,i) \in E} \frac{P(j)}{O_j}\]
where \(O_j\) is the number of out-links of page j. Using linear algebra, the equation above can be written in matrix form.
Let P be the column vector of PR values, and let A be the adjacency matrix of graph G, with:
\[A_{ij} = \left\{ \begin{array}{ll} \frac{1}{O_i}, & (i, j) \in E \\ 0, & \text{otherwise} \end{array} \right.\]
The equation can then be written as:
\[P = A^T P\]
Clearly, P is an eigenvector of \(A^T\) corresponding to the eigenvalue 1. However, for the Web graph the equation above does not necessarily hold. To fix this, we re-derive the equation using a Markov chain.
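As a quick sanity check on the matrix form, the sketch below builds A from an edge list and confirms that a hand-picked vector satisfies \(P = A^T P\). The toy graph, the vector, and the helper names `transition_matrix` and `apply_transpose` are assumptions made for illustration; the graph is chosen so that every page has at least one out-link and no edge is listed twice.

```python
def transition_matrix(n, edges):
    """Build A with A[i][j] = 1/O_i if (i, j) is an edge, else 0."""
    out_degree = [0] * n
    for i, _ in edges:
        out_degree[i] += 1
    A = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1.0 / out_degree[i]
    return A


def apply_transpose(A, p):
    """Return A^T p, i.e. result[i] = sum over j of A[j][i] * p[j]."""
    n = len(p)
    return [sum(A[j][i] * p[j] for j in range(n)) for i in range(n)]


# Hypothetical toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
A = transition_matrix(3, edges)

# Candidate PR vector for this graph (entries sum to 1)
P = [0.4, 0.2, 0.4]
result = apply_transpose(A, P)  # should reproduce P
```

Page 2 receives links from both other pages and page 0 receives all of page 2's weight, which is why they end up with the largest values here.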
In the Markov chain model, each page is treated as a state, and each state has some probability of transitioning to another state, so web browsing is modeled as a random process. A user randomly browsing the web is viewed as a Markov chain moving from one state to another, with transition matrix
\[A = \left[ \begin{matrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n1} & A_{n2} & \cdots & A_{nn} \end{matrix} \right]\]
where \(A_{ij}\) is the probability that a user browsing page i moves on to page j. This is the same matrix A defined earlier.
If every row of A sums to 1, A is the stochastic transition matrix of a Markov chain. In many cases this condition does not hold: many pages have no out-links, so the corresponding rows of A are all zeros. For now, assume that A is a stochastic transition matrix; then the probability distribution after the k-th state transition is:
\[p_k = A^T p_{k-1}\]
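The recurrence can be traced by hand on a small example. The sketch below applies \(p_k = A^T p_{k-1}\) three times, starting with the user on page 0; the matrix, the starting distribution, and the helper name `step` are assumptions chosen for illustration.

```python
def step(A, p):
    """One state transition: returns A^T p."""
    n = len(p)
    return [sum(A[j][i] * p[j] for j in range(n)) for i in range(n)]


# Hypothetical row-stochastic matrix (every row sums to 1)
A = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]

p = [1.0, 0.0, 0.0]  # the user starts on page 0
history = [p]
for _ in range(3):
    p = step(A, p)
    history.append(p)

# history[1] is [0.0, 0.5, 0.5]: from page 0 the user moves to
# page 1 or page 2 with equal probability.
```

Because every row of A sums to 1, each vector in `history` is still a probability distribution.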
From the theory of stochastic processes, if the matrix A is irreducible and aperiodic, the Markov chain converges to a unique stationary probability distribution. That is,
\[\lim_{k \to \infty} p_k = \pi\]
\(\pi\) reflects the long-run probability that a user is browsing each page; the higher the probability, the higher the page's authority. So in the PageRank algorithm we can take \(\pi\) as the PR vector P, which gives
\[P = A^T P\]
In reality, however, the Web does not satisfy the conditions above.
First, because many pages have no out-links, A is often not a stochastic transition matrix. To solve this, we can replace every entry of each all-zero row of A with \(1/n\), which amounts to giving a page without out-links a link to every page. The converted matrix is denoted \(\overline{A}\), and \(\overline{A}\) is a stochastic transition matrix.
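A minimal sketch of this row replacement, assuming A is stored as a list of rows; the function name `fix_dangling_rows` and the example matrix are hypothetical.

```python
def fix_dangling_rows(A):
    """Return a copy of A in which every all-zero row is replaced by 1/n."""
    n = len(A)
    return [
        [1.0 / n] * n if all(x == 0 for x in row) else list(row)
        for row in A
    ]


# Hypothetical matrix in which page 1 has no out-links
A = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
A_bar = fix_dangling_rows(A)
row_sums = [sum(row) for row in A_bar]  # each should be 1 (up to rounding)
```

Rows that already contain out-links are copied unchanged, so only the pages without out-links are affected.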
Second, \(\overline{A}\) is often not irreducible, which means the graph G is not strongly connected. Indeed, nothing guarantees that any two pages on the Web can reach each other through links.
Finally, \(\overline{A}\) is not necessarily aperiodic. Pages frequently lead back to the original page after a few links, and if every such loop has a length that is a multiple of some k > 1, the chain is periodic: starting from state i, it can return to state i only after a multiple of k transitions. Such a chain need not converge, which is clearly not what we want.
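The periodicity problem is easy to reproduce on the smallest possible case. The sketch below uses a hypothetical two-page graph in which each page links only to the other; the helper name `step` is an assumption, and the distribution simply flips back and forth instead of settling.

```python
def step(A, p):
    """One state transition: returns A^T p."""
    n = len(p)
    return [sum(A[j][i] * p[j] for j in range(n)) for i in range(n)]


# Hypothetical two-page cycle: 0 -> 1 and 1 -> 0 (period 2)
A = [
    [0.0, 1.0],
    [1.0, 0.0],
]

p = [1.0, 0.0]  # the user starts on page 0
trajectory = [p]
for _ in range(4):
    p = step(A, p)
    trajectory.append(p)

# trajectory alternates between [1.0, 0.0] and [0.0, 1.0],
# so p_k has no limit as k grows.
```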
To solve these last two problems, we can add to every page a link to all pages, followed with total transition probability \(1-d\). That is, when a user is on a page, with probability d they follow a randomly chosen out-link on that page, and with probability \(1-d\) they jump to a random page without clicking any link. As a result, the Web graph becomes strongly connected. There are also now paths of many different lengths from state i back to state i, so the Markov chain becomes aperiodic.
The PR value is then computed as
\[P = \left( (1-d) \frac{E}{n} + d \overline{A}^T \right) P\]
where E is the \(n \times n\) matrix whose entries are all 1, and d is called the damping factor, with a value between 0 and 1.
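It can help to expand this equation for a single page. Assuming P is normalized so that its entries sum to 1 (an assumption not stated above, but natural since P is a probability distribution), every entry of the product of E and P equals 1, and the i-th component becomes:
\[P(i) = \frac{1-d}{n} + d \sum_{j=1}^{n} \overline{A}_{ji} P(j)\]
If no page is without out-links, \(\overline{A}_{ji} = 1/O_j\) for \((j, i) \in E\) and 0 otherwise, so this reduces to:
\[P(i) = \frac{1-d}{n} + d \sum_{(j,i) \in E} \frac{P(j)}{O_j}\]
This is the original definition of the PR value scaled by d, plus a uniform share \((1-d)/n\) from the random jump.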
The real Web contains an enormous number of pages, so computing PR values to full convergence can be expensive. In practice we care only about the relative ordering of pages, so it is enough to iterate until the values are acceptably stable.
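One way to "iterate to an acceptable extent" is to stop once the total change between two successive vectors drops below a tolerance. The following is a minimal sketch under stated assumptions, not a production implementation: the function name `pagerank`, the parameters `tol` and `max_iter`, the value d = 0.85, and the toy graph are all chosen for illustration. It works per page rather than with a full matrix and spreads the weight of pages without out-links uniformly, which matches the rows of \(1/n\) in \(\overline{A}\).

```python
def pagerank(n, edges, d=0.85, tol=1e-6, max_iter=100):
    """Iterate P <- ((1-d) E/n + d A_bar^T) P until the L1 change is below tol.

    Returns the PR vector and the number of iterations performed.
    """
    out_links = [[] for _ in range(n)]
    for i, j in edges:
        out_links[i].append(j)

    p = [1.0 / n] * n  # start from the uniform distribution
    for iteration in range(1, max_iter + 1):
        # Weight held by pages with no out-links is shared by all n pages.
        dangling = sum(p[i] for i in range(n) if not out_links[i])
        new_p = [(1.0 - d) / n + d * dangling / n] * n
        for i in range(n):
            for j in out_links[i]:
                new_p[j] += d * p[i] / len(out_links[i])
        change = sum(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if change < tol:
            break
    return p, iteration


# Hypothetical toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
ranks, iterations = pagerank(4, edges)
order = sorted(range(4), key=lambda i: ranks[i], reverse=True)
```

On this toy graph the resulting order is page 2, then 0, then 1, then 3. Page 3 has no in-links, so its value stays at \((1-d)/n\). Loosening `tol` trades accuracy of the values for fewer iterations, which is often acceptable when only the ordering matters.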

Advantages and disadvantages

A major advantage of the PageRank algorithm is that PR values are computed and stored offline rather than recomputed when a user searches, which improves query efficiency.
Another advantage is that it has some resistance to cheating: it is hard for a site owner to get other important pages to link to their own pages.
But cheaters adapt too, and there are still ways to inflate PR values. One is to post the addresses of "zombie" pages in the comment sections of important pages; these "zombie" pages link to the target page, which raises the target page's PR value.
Another disadvantage is that the algorithm ignores time. The longer a page has existed, the more links tend to point to it, so the computed PR values favor old pages, and new high-quality sites may fail to rank highly in search results.
Because PageRank does not take relevance to the query into account, the results may drift away from the query topic.
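The effect of this trick can be simulated on a toy graph. In the sketch below, which is only an illustration, page 0 plays the important page, page 3 is the target, and pages 4 and 5 are the "zombie" pages; the graph, the function name `pagerank`, d = 0.85, and the fixed iteration count are all assumptions. Every page in both graphs has at least one out-link, so pages without out-links need no special handling here.

```python
def pagerank(n, edges, d=0.85, iterations=100):
    """Fixed-iteration PageRank; assumes every page has an out-link."""
    out_degree = [0] * n
    for i, _ in edges:
        out_degree[i] += 1
    p = [1.0 / n] * n
    for _ in range(iterations):
        new_p = [(1.0 - d) / n] * n
        for i, j in edges:
            new_p[j] += d * p[i] / out_degree[i]
        p = new_p
    return p


# Before: a cycle 0 -> 1 -> 2 -> 0, and the target page 3 links to page 0.
# Nothing links to page 3.
base_edges = [(0, 1), (1, 2), (2, 0), (3, 0)]
before = pagerank(4, base_edges)[3]

# After: comments on page 0 link to zombie pages 4 and 5,
# and both zombie pages link to the target page 3.
spam_edges = base_edges + [(0, 4), (0, 5), (4, 3), (5, 3)]
after = pagerank(6, spam_edges)[3]

# after is larger than before: the target now inherits part of page 0's weight.
```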

Origin www.cnblogs.com/beeblog72/p/11969288.html