Google PageRank - Basic Principles and Algebraic/Stochastic Interpretation
Laboratory of Intelligent Networks (LINK), Youn-Hee Han
Background
- History
  - Proposed by Sergey Brin and Lawrence Page (Google’s founders) in 1998 at Stanford.
  - Algorithm of the first generation of the Google search engine.
  - Paper: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”.
- Target
  - Measure the importance of a Web page based on the link structure alone.
  - Assign each node a numerical score between 0 and 1: its PageRank.
  - Rank Web pages by their PageRank values.
- Good Reference
  - http://en.wikipedia.org/wiki/PageRank
  - http://www.emh.co.kr/xhtml/google_pagerank_citation_ranking.html (Korean)
Background: Sergey Brin and Lawrence Page
- Sergey Brin received his B.S. degree in mathematics and computer science from the University of Maryland at College Park in 1993. Currently, he is a Ph.D. candidate in computer science at Stanford University, where he received his M.S. in 1995. He is a recipient of a National Science Foundation Graduate Fellowship. His research interests include search engines, information extraction from unstructured sources, and data mining of large text collections and scientific data.
- Lawrence Page was born in East Lansing, Michigan, and received a B.S.E. in Computer Engineering at the University of Michigan Ann Arbor in 1995. He is currently a Ph.D. candidate in Computer Science at Stanford University. Some of his research interests include the link structure of the web, human-computer interaction, search engines, scalability of information access interfaces, and personal data mining.
- Google Inc. was founded in 09/98 (google.com was registered in 09/97).
Background: Stanford WebBase project (1996 - 1999)
- http://dbpubs.stanford.edu:8091/~testbed/doc2/WebBase/
- http://dbpubs.stanford.edu:8091/diglib/
- "The PageRank Citation Ranking: Bringing Order to the Web" is a technical report (working paper): Stanford Digital Libraries SIDL-WP-1999-0120.
- From the paper: web size = 150M web pages.
- 2005: Google claims to index more than 8B pages.
  - http://blog.searchenginewatch.com/blog/041111-084221
  - http://www.cs.uiowa.edu/~asignori/web-size
  - The latter estimates the size of the indexable Web at no less than 11.5 billion pages as of the end of January 2005.
Background: The Philosophy of PageRank
- PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value.
- In essence, Google interprets a link from page A to page B as a vote, by page A, for page B.
Background: Scenario and Idea
- Scenario
  - A random surfer begins at a Web page A.
  - The surfer executes a random walk, moving from the current page to a randomly chosen page that it hyperlinks to.
  - Some nodes are visited more often. Intuitively, these are nodes with many links coming in from other frequently visited nodes.
- Idea
  - Pages visited more often in this walk are more important.
  - “The rank of a page can be interpreted as the probability that a surfer will be at the page after following a large number of forward links.”
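The random-surfer scenario can be sketched as a short simulation. This is only an illustration under stated assumptions: the three-page web in `links` and the helper name `simulate_surfer` are hypothetical and do not come from the slides, and every page is assumed to have at least one outlink.

```python
import random

def simulate_surfer(links, start, steps, seed=0):
    """Follow uniformly random outlinks and return each page's visit frequency."""
    rng = random.Random(seed)
    visits = {page: 0 for page in links}
    page = start
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(links[page])  # assumes every page has an outlink
    return {p: count / steps for p, count in visits.items()}

# Hypothetical web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
freq = simulate_surfer(links, start="A", steps=100_000)
```

Pages A and C receive all of the traffic of a frequently visited page, so they end up visited more often than B, which receives only half of A's clicks.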
Basics
- Based on the link structure of the web: pages = nodes, links = edges.
- Forward links = outlinks; backlinks = inlinks.
- [Figure: A and B are backlinks of C.]
Basic Principles: Basic Principles about PageRanks
1) A link from page A to page B is a vote from A to B.
2) Pages with lots of backlinks are important.
   - www.stanford.edu has 23,400 inlinks.
   - www.joe-schmoe.com has 1 inlink.
3) Backlinks coming from important pages convey more importance to a page.
- Combining PageRank (PR) with text-matching techniques results in highly relevant search results.
Basic Principles: Basic Principles about PageRanks (continued)
3) Backlinks coming from important pages convey more importance to a page.
- [Figure: Taher’s Home Page and Sep’s Home Page, with backlinks from DB Pub Server, CS361, Yahoo!, and CNN; one page is linked by 2 unimportant pages, the other by 2 important pages.]
Basic Principles: Design of Equation to get Page Importance
- $I_i = \sum_{j \in B_i} \frac{I_j}{N_j}$
- $I_i$: importance of page i
- $I_j$: importance of page j
- $N_j$: number of outlinks from page j
- $B_i$: set of pages j that link to page i
Basic Principles: Design of Equation to get Page Importance (example)
- [Figure: worked example with pages Taher, Sep, DB Pub Server, and CNN; values shown are 0.25, 0.05, and 0.1, with link weights 1/2 and 1.]
Basic Principles: Exact Equation of PageRank
- u, v: web pages
- $B_u$: set of pages pointing to u (the backlinks of u)
- $N_v$: number of pages that v points to (the forward links of v)
- d: damping factor, the probability (between 0 and 1) that a user keeps clicking links on the current page.
  - d = 0: the user always types a URL and visits that page.
  - d = 1: the user only ever clicks links while surfing.
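The equation itself did not survive on this slide. One form that is consistent with the symbols above and with the Google matrix used later in these slides is sketched below. The names $R(u)$ for the rank of page u and $N$ for the total number of pages are assumptions made here. Note that the original Brin and Page paper writes the first term as $(1-d)$, without dividing by $N$.

```latex
R(u) = \frac{1-d}{N} + d \sum_{v \in B_u} \frac{R(v)}{N_v}
```

With d = 1 this reduces to the undamped sum over backlinks; with d = 0 every page receives the same rank 1/N.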
Basic Principles: Exact Equation of PageRank (Example)
Basic Principles: Iteration
- [Figures omitted. PageRank figures from: http://www.iprcom.com/papers/pagerank/ and http://en.wikipedia.org/wiki/Pagerank]
Basic Principles: Iteration (another example)
- Initialize all nodes to the same rank: 0.333, 0.333, 0.333 (three nodes).
- Propagate ranks across links (multiplying by link weights). [Figure values: 0.5, 0.167, 0.333, 0.333]
- Propagate ranks again across links (multiplying by link weights). [Figure values: 0.333, 0.167, 0.5, 0.5, 0.167, 0.167]
- After a while: 0.4, 0.4, 0.2.
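Basic Principles: Algorithm
- Initialize: $I_i = 1/N$ for every page i (N = number of pages).
- Repeat until convergence: $I_i \leftarrow \sum_{j \in B_i} \frac{I_j}{N_j}$
  - $I_i$: importance of page i
  - $B_i$: pages j that link to page i
  - $N_j$: number of outlinks from page j
  - $I_j$: importance of page j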
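The algorithm can be sketched in a few lines of Python. The link structure below is an assumption: the figure for the three-node example is not recoverable, but this hypothetical graph (Y links to Y and A, A links to Y and M, M links to A) reproduces the first propagated values 0.333, 0.5, 0.167 and the final values 0.4, 0.4, 0.2. The function name `pagerank_iterate` is illustrative.

```python
def pagerank_iterate(outlinks, iterations):
    """Repeat I_i <- sum over backlinks j of I_j / N_j, starting from 1/N."""
    pages = list(outlinks)
    rank = {p: 1.0 / len(pages) for p in pages}
    history = [rank]
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for j, targets in outlinks.items():
            share = rank[j] / len(targets)  # page j splits its rank evenly
            for i in targets:
                new_rank[i] += share
        rank = new_rank
        history.append(rank)
    return history

# Assumed link structure (hypothetical): Y -> Y, A; A -> Y, M; M -> A.
outlinks = {"Y": ["Y", "A"], "A": ["Y", "M"], "M": ["A"]}
history = pagerank_iterate(outlinks, iterations=200)
first, final = history[1], history[-1]
```

A fixed iteration count is used for simplicity; a convergence test on the change between successive vectors would be the usual stopping rule.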
Algebraic Interpretation
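Algebraic Interpretation: Hyperlink Matrix
- Source: How Google Finds Your Needle in the Web's Haystack, http://www.ams.org/samplings/feature-column/fcarc-pagerank
- Suppose that page $P_j$ has $N_j$ links.
- If one of those links is to page $P_i$, then $P_j$ will pass on $1/N_j$ of its importance to $P_i$.
- The importance ranking $I_i$ of $P_i$ is the sum of the contributions from all pages linking to it: $I_i = \sum_{P_j \in B_i} \frac{I_j}{N_j}$, where $B_i$ is the set of pages linking to $P_i$.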
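Algebraic Interpretation: Hyperlink Matrix
- Hyperlink matrix $H = [H_{ij}]$, in which the entry in the ith row and jth column is $H_{ij} = 1/N_j$ if $P_j$ links to $P_i$, and $H_{ij} = 0$ otherwise.
- Matrix H is stochastic, provided every page has at least one link:
  - The entries of H are all nonnegative.
  - The sum of the entries in a column is one (unless the page for that column has no links).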
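As a small illustration of this definition, the sketch below builds H from a link list. The three-page web and the helper name `hyperlink_matrix` are hypothetical, and the code assumes no page lists the same target twice.

```python
def hyperlink_matrix(outlinks, pages):
    """Return H as a list of rows with H[i][j] = 1/N_j if page j links to page i."""
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    H = [[0.0] * n for _ in range(n)]
    for j, targets in outlinks.items():
        for i in targets:
            H[index[i]][index[j]] = 1.0 / len(targets)
    return H

# Hypothetical web: P1 -> P2, P3; P2 -> P3; P3 -> P1.
pages = ["P1", "P2", "P3"]
outlinks = {"P1": ["P2", "P3"], "P2": ["P3"], "P3": ["P1"]}
H = hyperlink_matrix(outlinks, pages)
column_sums = [sum(H[i][j] for i in range(len(pages))) for j in range(len(pages))]
```

Every page in this example has at least one link, so each column of H sums to one.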
Algebraic Interpretation: Stationary Vector I
- We also form a vector $I = [I_i]$ whose components are the PageRanks.
- An important condition: $I = HI$. That is, the vector I is an eigenvector of the matrix H with eigenvalue 1.
- We also call I a stationary vector of H.
- We require that the sum of the entries in the vector I be one.
Algebraic Interpretation: Stationary Vector I
- 25 billion web pages means H has about N = 25 billion columns and rows.
- However, most of the entries in H are zero; in fact, studies show that web pages have an average of about 10 links, meaning that, on average, all but 10 entries in every column are zero.
- We will choose a method known as the power method for finding the stationary vector I of the matrix H.
- We begin by choosing a vector $I^0$, then produce a sequence of vectors $I^k$ by $I^{k+1} = H I^k$.
- General principle: the sequence $I^k$ will converge to the stationary vector I.
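A minimal sketch of the power method on a dense matrix follows. The 3x3 matrix, the tolerance `tol`, and the function name `power_method` are assumptions chosen for illustration; at web scale the product would be computed from a sparse representation rather than nested lists.

```python
def power_method(H, tol=1e-12, max_iter=1000):
    """Iterate I^{k+1} = H I^k from the uniform vector until the change is below tol."""
    n = len(H)
    I = [1.0 / n] * n
    for k in range(1, max_iter + 1):
        nxt = [sum(H[i][j] * I[j] for j in range(n)) for i in range(n)]
        change = sum(abs(a - b) for a, b in zip(nxt, I))
        I = nxt
        if change < tol:
            return I, k
    return I, max_iter

# Hypothetical 3-page web: P1 -> P2, P3; P2 -> P3; P3 -> P1 (columns sum to one).
H = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
I, iterations_used = power_method(H)
```

For this particular matrix the iteration settles near (0.4, 0.2, 0.4); as the following slides point out, convergence is not guaranteed for an arbitrary H.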
Algebraic Interpretation: Stationary Vector I (example)
Algebraic Interpretation: Three Important Questions
- Does the sequence $I^k$ always converge?
- Is the vector to which it converges independent of the initial vector $I^0$?
- Do the importance rankings contain the information that we want?
- With the method as described so far, the answer to all three questions is "No!"
- However, we'll see how to modify our method so that we can answer "yes" to all three.
Algebraic Interpretation: Problem 1: Dangling Node
- Consider the following small web consisting of two web pages. [Figure omitted.]
- The importance rating of both pages is zero, which tells us nothing about the relative importance of these pages.
- The problem is that $P_2$ has no links. Pages with no links are called dangling nodes, and there are, of course, many of them in the real web.
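The effect can be checked numerically. The sketch below assumes the two-page web is "P1 links to P2, P2 has no links" (the figure itself is not recoverable) and starts from the hypothetical initial vector (1, 0).

```python
# Assumed two-page web: P1 -> P2, P2 has no links (a dangling node).
H = [[0.0, 0.0],
     [1.0, 0.0]]

I = [1.0, 0.0]            # assumed initial vector I^0
trace = [I]
for _ in range(2):
    # One power-method step: I^{k+1} = H I^k
    I = [sum(H[i][j] * I[j] for j in range(2)) for i in range(2)]
    trace.append(I)
```

The second column of H is all zeros, so the importance that reaches P2 is never passed on, and after two steps the whole vector has drained to zero.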
Algebraic Interpretation: Problem 1: Dangling Node
- To solve it, we pretend that a dangling node has a link to every page in the web; equivalently, a surfer at a dangling node jumps to a page chosen uniformly at random.
- This has the effect of modifying the hyperlink matrix H by replacing the column of zeroes corresponding to a dangling node with a column in which each entry is 1/N.
- If A is the matrix whose entries are all zero except for the columns corresponding to dangling nodes, in which each entry is 1/N, then $Q = H + A$.
- Q is now stochastic: its entries are nonnegative and every column sums to one.
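A minimal sketch of this correction, using a hypothetical helper `fix_dangling` and the assumed two-page web in which P1 links to P2 and P2 has no links:

```python
def fix_dangling(H):
    """Return Q = H + A: every all-zero column of H is replaced by entries 1/N."""
    n = len(H)
    Q = [row[:] for row in H]
    for j in range(n):
        if all(H[i][j] == 0.0 for i in range(n)):  # column j is a dangling node
            for i in range(n):
                Q[i][j] = 1.0 / n
    return Q

# Assumed two-page web: P1 -> P2, P2 dangling.
H = [[0.0, 0.0],
     [1.0, 0.0]]
Q = fix_dangling(H)
column_sums = [sum(Q[i][j] for i in range(2)) for j in range(2)]
```

Solving $I = QI$ by hand for this Q gives I = (1/3, 2/3), so P2 now ranks above P1 instead of both ratings being zero.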
Algebraic Interpretation: Problem 2: Smaller Sub-web
- Consider the following web. [Figure omitted.]
- Then Q and I are as follows. [Matrix and vector omitted.]
- The PageRanks assigned to the first four web pages are zero.
Algebraic Interpretation: Problem 2: Smaller Sub-web
- The problem: the web contains a smaller web within it (shown in a blue box in the omitted figure). Links lead into this sub-web, but none lead back out.
- The matrix Q is reducible if Q can be written in block form as $Q = \begin{pmatrix} * & 0 \\ * & * \end{pmatrix}$.
- If the matrix Q is irreducible, we can guarantee that there is a stationary vector I with all positive entries.
Algebraic Interpretation: Problem 2: Smaller Sub-web
- A web is called strongly connected if, given any two pages, there is a way to follow links from the first page to the second.
- Only strongly connected webs provide irreducible matrices Q.
- Clearly, the example is not strongly connected.
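Strong connectivity can be tested with two graph searches: every page must be reachable from some start page, and the start page must be reachable from every page (checked by searching the reversed links). The sketch below is one way to do this; the function names and the two small example webs are hypothetical.

```python
def reachable(links, start):
    """Return the set of pages reachable from start by following links."""
    seen = {start}
    stack = [start]
    while stack:
        page = stack.pop()
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def is_strongly_connected(links):
    """True if every page can reach every other page."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    if not pages:
        return True
    start = next(iter(pages))
    reverse = {p: [] for p in pages}
    for src, targets in links.items():
        for t in targets:
            reverse[t].append(src)
    return reachable(links, start) == pages and reachable(reverse, start) == pages

cycle = {"A": ["B"], "B": ["C"], "C": ["A"]}
with_trap = {"A": ["B"], "B": ["A"], "C": ["A"]}  # nothing links back to C
```

Here `is_strongly_connected(cycle)` holds, while `with_trap` fails the test because no path leads from A or B to C.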
Algebraic Interpretation: Three Important Questions (Revisited)
- Does the sequence $I^k$ always converge?
- Is the vector to which it converges independent of the initial vector $I^0$?
- Do the importance rankings contain the information that we want?
- In order to answer "yes" to the three questions, the matrix Q should be
  1) Stochastic
     - All entries are nonnegative.
     - The sum of the entries in a column is one.
  2) Primitive
  3) Irreducible (the web is strongly connected)
Algebraic Interpretation: Final Modification
- Two ways to surf the web:
  1) Follow (click) links: random surfing.
     - The movement of the random surfer is determined by Q.
  2) Type a URL in the browser: randomly choose any other page.
     - All pages have an equal chance of being visited by typing.
     - A new matrix $\mathbf{1}$ (the N×N matrix whose entries are all one) is used.
- Google Matrix G: $G = dQ + \frac{1-d}{N}\mathbf{1}$
  - G is stochastic since it is a convex combination of stochastic matrices.
  - For d < 1, G is both primitive and irreducible because all the entries of G are positive.
  - Therefore, G has a unique stationary vector I.
Algebraic Interpretation: Final Modification
- Google Matrix: $G = dQ + \frac{1-d}{N}\mathbf{1}$, with $Q = H + A$.
- The meaning of the parameter d:
  - d = 1 ($G = H + A$): we are only working with the original hyperlink structure of the web.
  - d = 0 ($G = \frac{1}{N}\mathbf{1}$): we just type a URL and visit a page; the link structure is ignored.
  - We would like to take d close to 1 so that the hyperlink structure of the web is weighted heavily in the computation.
  - Sergey Brin and Larry Page, the creators of PageRank, chose d = 0.85.
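The construction can be sketched as follows, assuming the Google matrix has the form $G = dQ + \frac{1-d}{N}\mathbf{1}$ implied by the two limiting cases above. The 2x2 matrix Q (a two-page web whose dangling node has already been corrected) and the function names are hypothetical.

```python
def google_matrix(Q, d=0.85):
    """Return G = d*Q + (1-d)/N * 1 for a column-stochastic Q."""
    n = len(Q)
    return [[d * Q[i][j] + (1.0 - d) / n for j in range(n)] for i in range(n)]

def stationary_vector(G, iterations=200):
    """Power method: repeatedly apply G to the uniform starting vector."""
    n = len(G)
    I = [1.0 / n] * n
    for _ in range(iterations):
        I = [sum(G[i][j] * I[j] for j in range(n)) for i in range(n)]
    return I

# Hypothetical Q: P1 -> P2, and the dangling column for P2 replaced by 1/N.
Q = [[0.0, 0.5],
     [1.0, 0.5]]
G = google_matrix(Q)
I = stationary_vector(G)
column_sums = [sum(G[i][j] for i in range(2)) for j in range(2)]
```

Every entry of G is positive for d < 1, which is the property that guarantees a unique stationary vector regardless of the starting vector.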
Algebraic Interpretation: From Wikipedia…
Stochastic Interpretation
Stochastic Interpretation: PageRank as a Random Walk over the Web
- If a user starts at a random web page and surfs by clicking links and randomly entering new URLs, what is the probability that s/he will arrive at a given page?
- A Markov chain is a discrete-time stochastic process over N states; here, each Web page corresponds to a state.
- A Markov chain is characterized by an N×N transition probability matrix P.
Stochastic Interpretation
- Assume a stochastic process $\{X_n,\ n = 0, 1, 2, \ldots\}$ with values in a set E, called the state space; its elements are called the states of the process.
- Assume the set E is finite or countable.
Stochastic Interpretation: Definitions
Stochastic Interpretation: Definitions
- If state i is recurrent, then it is said to be positive recurrent if, starting in state i, the expected time until the process returns to state i is finite.
- It can be shown that in a finite-state Markov chain, all recurrent states are positive recurrent.
- Positive recurrent, aperiodic states are called ergodic.
Stochastic Interpretation: Limiting Probability (Ross Book, p. 205)
- It can be shown that $\pi_j$, the limiting probability that the process will be in state j at time n, also equals the long-run proportion of time that the process will be in state j.
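The formulas behind this statement did not survive extraction. In conventional notation (the symbols $\pi_j$ and $P_{ij}$ are assumed here, with $P_{ij}$ the one-step transition probability from state i to state j), the standard result for an irreducible ergodic Markov chain reads:

```latex
\pi_j = \lim_{n \to \infty} P_{ij}^{(n)} \quad \text{(independent of the starting state } i\text{)},
\qquad
\pi_j = \sum_{i} \pi_i P_{ij}, \qquad \sum_{j} \pi_j = 1 .
```

The second and third relations determine $\pi$ uniquely; this is the row-vector counterpart of the stationary-vector condition in the algebraic interpretation.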
Stochastic Interpretation: Limiting Probability (Ross Book, p. 206)
Stochastic Interpretation: Google Matrix G
- Since the matrix Q can be reducible or periodic, the Google matrix G must be used instead to ensure that the steady-state probability exists and is unique.
Stochastic Interpretation: P, the Importance Vector of Web Pages
- The initial importance is chosen according to some probability distribution $P_0 = [p_i]$.
  - $p_i$: the probability that the Markov chain is in state i at the initial time.
- $P_k$: a vector whose i-th component is the probability that the Markov chain is in state i at time k.
- The stationary distribution P satisfies $P^T = P^T G$ (steady-state behavior).
- The power method:
  - $(P_{k+1})^T = (P_k)^T G$
  - $(P_k)^T = (P_0)^T G^k$
  - $P \approx P_k$ for large enough k.
- Brin and Page report that 50 - 100 iterations are required to obtain a sufficiently good approximation to P.
- The calculation is reported to take a few days to complete.
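The row-vector form of the iteration can be sketched as below. Here G is taken to be row-stochastic (each row sums to one), as the relation $P^T = P^T G$ requires. The 2x2 matrix, the initial distribution, the tolerance, and the function names are all assumptions made for illustration.

```python
def step(p, G):
    """One update (P_{k+1})^T = (P_k)^T G for a row-stochastic G."""
    n = len(p)
    return [sum(p[i] * G[i][j] for i in range(n)) for j in range(n)]

def stationary_distribution(G, p0, tol=1e-10, max_iter=1000):
    """Iterate from p0 until successive distributions differ by less than tol."""
    p = p0
    for k in range(1, max_iter + 1):
        nxt = step(p, G)
        change = max(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if change < tol:
            return p, k
    return p, max_iter

# Hypothetical row-stochastic Google matrix for a two-page web (d = 0.85).
G = [[0.075, 0.925],
     [0.5,   0.5]]
p, iterations_used = stationary_distribution(G, p0=[1.0, 0.0])
```

Because every entry of G is positive, the result does not depend on the choice of `p0`; starting from `[0.0, 1.0]` leads to the same distribution.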