Laboratory of Intelligent Networks (LINK) Youn-Hee Han

Google PageRank - Basic Principles and Algebraic/Stochastic Interpretation

Background History, Target, Good Reference
History: PageRank was proposed by Sergey Brin and Lawrence Page (Google's founders) in 1998 at Stanford, in the paper “The Anatomy of a Large-Scale Hypertextual Web Search Engine”. It was the algorithm behind the first generation of the Google search engine. Target: measure the importance of a Web page based on the link structure alone. Each node is assigned a numerical score between 0 and 1, its PageRank, and Web pages are ranked by their PageRank values. Good Reference (Korean)

Background Sergey Brin and Lawrence Page
Sergey Brin received his B.S. degree in mathematics and computer science from the University of Maryland at College Park. Currently, he is a Ph.D. candidate in computer science at Stanford University, where he received his M.S. He is a recipient of a National Science Foundation Graduate Fellowship. His research interests include search engines, information extraction from unstructured sources, and data mining of large text collections and scientific data. Lawrence Page was born in East Lansing, Michigan, and received a B.S.E. in Computer Engineering at the University of Michigan Ann Arbor. He is currently a Ph.D. candidate in Computer Science at Stanford University. Some of his research interests include the link structure of the web, human computer interaction, search engines, scalability of information access interfaces, and personal data mining. Google Inc. was founded in 09/98 (google.com was registered in 09/97).

Background Stanford WebBase project (1996 - 1999)
“The PageRank Citation Ranking: Bringing Order to the Web” is a technical report (working paper) of the Stanford Digital Libraries project, SIDL-WP. From the paper: web size = 150M web pages. In 2005, Google claimed to index more than 8B pages, and a separate estimate put the size of the indexable Web at 11.5 billion pages or more as of the end of January 2005.

Background The Philosophy of PageRank
PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B.

Background Scenario: Idea
Scenario: a random surfer begins at a Web page A and executes a random walk, moving from A to a randomly chosen Web page that A hyperlinks to, then repeating the step from the new page. Some nodes are visited more often. Intuitively, these are nodes with many links coming in from other frequently visited nodes. Idea: pages visited more often in this walk are more important. “The rank of a page can be interpreted as the probability that a surfer will be at the page after following a large number of forward links.”

Basics based on link structure of the web
Pages are nodes and links are edges. Forward links are outlinks; backlinks are inlinks. In the slide's figure, A and B are backlinks of C.

Basic Principles Basic Principles about PageRanks
1) A link from page A to page B is a vote from A to B. 2) Pages with lots of backlinks are important (the slide contrasts a page that has 23,400 inlinks with a page that has 1 inlink). 3) Backlinks coming from important pages convey more importance to a page. Combining PR with text-matching techniques results in highly relevant search results.

Basic Principles Basic Principles about PageRanks
3) Backlinks coming from important pages convey more importance to a page. [Figure: one page linked by 2 unimportant pages compared with one page linked by 2 important pages; node labels: Taher's Home Page, Sep's Home Page, DB Pub Server, CS361, Yahoo!, CNN]

Basic Principles Design of Equation to get Page Importance
$\text{importance of page } i = \sum_{\text{pages } j \text{ that link to page } i} \frac{\text{importance of page } j}{\text{number of outlinks from page } j}$

Basic Principles Design of Equation to get Page Importance
[Figure: worked example of the importance equation; node labels: Taher, Sep, DB Pub Server, CNN; values shown: 0.25, 0.05, 0.1; link weights shown: 1/2 and 1]

Basic Principles Exact Equation of PageRank
u, v: web pages. $B_u$: the set of pages pointing (by a back link) to u. $N_v$: the number of pages that v points (by a forward link) to. d: the damping factor, between 0 and 1, which is the probability that a user keeps clicking links in web pages. With d = 0, a user always types a URL and visits the page at that URL. With d = 1, a user only ever clicks links on pages while surfing.
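
The definitions above can be combined in a short sketch. It assumes that the damped update has the form PR(u) = (1 - d)/N + d * (sum over v in B_u of PR(v)/N_v), where N is the total number of pages; the (1 - d)/N normalization is an assumption chosen to agree with the Google matrix later in these slides. The function name `pagerank_of`, the dictionaries, and the three-page web are hypothetical.

```python
def pagerank_of(u, ranks, backlinks, outdegree, d=0.85):
    """Evaluate the assumed damped PageRank formula once for page u."""
    n = len(ranks)  # N: total number of pages
    # Each page v in B_u passes on PR(v) / N_v to u.
    link_share = sum(ranks[v] / outdegree[v] for v in backlinks[u])
    return (1 - d) / n + d * link_share

# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A
backlinks = {"A": ["C"], "B": ["A"], "C": ["A", "B"]}  # B_u
outdegree = {"A": 2, "B": 1, "C": 1}                    # N_v
ranks = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

new_c = pagerank_of("C", ranks, backlinks, outdegree)                # d = 0.85
links_only = pagerank_of("C", ranks, backlinks, outdegree, d=1.0)    # clicks only
typing_only = pagerank_of("C", ranks, backlinks, outdegree, d=0.0)   # typed URLs only
```

With d = 0 the link structure drops out and every page receives 1/N; with d = 1 only the link contributions remain.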

Basic Principles Exact Equation of PageRank Example PageRank

Basic Principles Iteration
[Figures: iteration example]

Basic Principles Iteration (another example)
Initialize all nodes to rank 0.333. [Figure: three nodes, each labeled 0.333]

Basic Principles Iteration (another example)
Propagate ranks across links (multiplying by link weights). [Figure values: 0.5, 0.167, 0.333, 0.333]

Basic Principles Iteration (another example)
Propagate ranks again across links (multiplying by link weights). [Figure values: 0.333, 0.167, 0.5, 0.5, 0.167, 0.167]

Basic Principles Iteration (another example)
After a while, the ranks settle at 0.4, 0.4, and 0.2.

Basic Principles Algorithm
Initialize: assign every page the same importance. Repeat until convergence: $\text{importance of page } i \leftarrow \sum_{\text{pages } j \text{ that link to page } i} \frac{\text{importance of page } j}{\text{number of outlinks from page } j}$
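
The loop above can be sketched as follows. The three-page web (A links to B and C, B links to C, C links to A) is a hypothetical graph, chosen because it produces the same numbers as the iteration example (0.333 everywhere at the start, 0.4, 0.2, 0.4 at the end); the slides' actual figure is not available. The function name and the stopping tolerance are also assumptions.

```python
def iterate_ranks(outlinks, tol=1e-10, max_iter=1000):
    """Repeat the importance update until the ranks stop changing."""
    pages = list(outlinks)
    rank = {p: 1.0 / len(pages) for p in pages}  # same importance everywhere
    for _ in range(max_iter):
        new_rank = {p: 0.0 for p in pages}
        for j, targets in outlinks.items():
            for i in targets:
                # page j passes importance(j) / outlinks(j) to page i
                new_rank[i] += rank[j] / len(targets)
        if max(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank
    return rank

# Hypothetical web: A -> B, A -> C, B -> C, C -> A
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
result = iterate_ranks(web)
```

This undamped version only converges when the graph is well behaved; the problems discussed in the algebraic part (dangling nodes, sub-webs) are not handled here.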

Algebraic Interpretation

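Hyperlink Matrix (Source: How Google Finds Your Needle in the Web's Haystack)
Suppose that page $P_j$ has $N_j$ links. If one of those links is to page $P_i$, then $P_j$ will pass on $1/N_j$ of its importance to $P_i$. The importance ranking of $P_i$ is the sum of the contributions from all pages linking to it: $I(P_i) = \sum_{P_j \in B_i} \frac{I(P_j)}{N_j}$, where $B_i$ is the set of pages linking to $P_i$.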

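Hyperlink Matrix
The hyperlink matrix $H = [H_{ij}]$ has, in the $i$th row and $j$th column, the entry $H_{ij} = 1/N_j$ if $P_j$ links to $P_i$ and $H_{ij} = 0$ otherwise. The entries of H are all nonnegative, and the entries in a column sum to one unless the corresponding page has no links. A matrix with nonnegative entries whose columns all sum to one is called stochastic.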
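
As a minimal sketch of this definition, the following builds H as a list of rows from a dictionary of outlinks. The function name `hyperlink_matrix` and the three-page web are hypothetical; column j holds 1/N_j in the rows of the pages that P_j links to.

```python
def hyperlink_matrix(outlinks, pages):
    """Build H with H[i][j] = 1/N_j if page j links to page i, else 0."""
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    H = [[0.0] * n for _ in range(n)]
    for pj, targets in outlinks.items():
        for pi in targets:
            H[index[pi]][index[pj]] = 1.0 / len(targets)
    return H

# Hypothetical web: P1 -> P2, P1 -> P3, P2 -> P3, P3 -> P1
pages = ["P1", "P2", "P3"]
outlinks = {"P1": ["P2", "P3"], "P2": ["P3"], "P3": ["P1"]}
H = hyperlink_matrix(outlinks, pages)

# Every page here has at least one link, so each column sums to one.
column_sums = [sum(H[i][j] for i in range(len(pages))) for j in range(len(pages))]
```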

Stationary Vector I
We also form a vector $I = [I(P_i)]$ whose components are the PageRanks. The importance equation then becomes the condition $I = HI$: the vector I is an eigenvector of the matrix H with eigenvalue 1. We also call I a stationary vector of H. We normalize I so that the sum of its entries is one.

Stationary Vector I
With 25 billion web pages, H has about N = 25 billion columns and rows. However, most of the entries in H are zero; in fact, studies show that web pages have an average of about 10 links, meaning that, on average, all but 10 entries in every column are zero. We choose a method known as the power method for finding the stationary vector I of the matrix H. We begin by choosing a vector $I^0$, then produce a sequence of vectors $I^k$ by $I^{k+1} = H I^k$. General principle: the sequence $I^k$ will converge to the stationary vector I.
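
The power method can be sketched in a few lines. The matrix below is a hypothetical 3 x 3 hyperlink matrix (P1 links to P2 and P3, P2 links to P3, P3 links to P1), not one taken from the slides, and the helper names are assumptions. A real implementation would exploit the sparsity of H rather than store full rows.

```python
def mat_vec(M, v):
    """Multiply matrix M (list of rows) by column vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def power_method(M, start, steps):
    """Apply I^{k+1} = M I^k for a fixed number of steps."""
    vec = list(start)
    for _ in range(steps):
        vec = mat_vec(M, vec)
    return vec

# Hypothetical hyperlink matrix: column j lists where page j sends its importance.
H = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]

I0 = [1.0, 0.0, 0.0]                 # initial vector I^0
I1 = power_method(H, I0, 1)          # one step: I^1 = H I^0
I_many = power_method(H, I0, 200)    # close to the stationary vector
```

For this particular matrix the iterates approach (0.4, 0.2, 0.4); convergence is not guaranteed for every H.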

Stationary Vector I PageRank

Three Important Questions
1) Does the sequence $I^k$ always converge? 2) Is the vector to which it converges independent of the initial vector $I^0$? 3) Do the importance rankings contain the information that we want? The answer to all three questions is "No!" However, we'll see how to modify our method so that we can answer "yes" to all three.

Problem 1: Dangling Node
Consider a small web consisting of two web pages, $P_1$ and $P_2$. The computed importance rating of both pages is zero, which tells us nothing about the relative importance of these pages. The problem is that $P_2$ has no links. Pages with no links are called dangling nodes, and there are, of course, many of them in the real web.

Problem 1: Dangling Node
To solve it, we pretend that a dangling node has a link to every page. This modifies the hyperlink matrix H by replacing the column of zeroes corresponding to a dangling node with a column in which each entry is 1/N. If A is the matrix whose entries are all zero except for the columns corresponding to dangling nodes, in which each entry is 1/N, then Q = H + A. Every column of Q now sums to one, so Q is stochastic (the slides go on to require that Q also be primitive).
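
A small sketch of the dangling-node fix follows. It assumes that in the two-page web $P_1$ links to $P_2$ and $P_2$ has no links, which is consistent with the text but not confirmed by a surviving figure; the function names are hypothetical.

```python
def mat_vec(M, v):
    """Multiply matrix M (list of rows) by column vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def fix_dangling(H):
    """Return Q = H + A: every all-zero column of H becomes a column of 1/N."""
    n = len(H)
    Q = [row[:] for row in H]
    for j in range(n):
        if all(H[i][j] == 0 for i in range(n)):  # page j is a dangling node
            for i in range(n):
                Q[i][j] = 1.0 / n
    return Q

# Assumed two-page web: P1 -> P2, P2 has no links (column 2 is all zero).
H = [[0.0, 0.0],
     [1.0, 0.0]]

# With H, the importance leaks away after two steps.
leaked = mat_vec(H, mat_vec(H, [1.0, 0.0]))

Q = fix_dangling(H)
vec = [1.0, 0.0]
for _ in range(100):
    vec = mat_vec(Q, vec)  # power method on Q
```

Under these assumptions the iterates on Q approach (1/3, 2/3), so $P_2$ ends up ranked above $P_1$ instead of both being zero.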

Problem 2: Smaller Sub-web
Consider the following web. [Figure: example web, with its matrix Q and stationary vector I] The PageRanks assigned to the first four web pages are zero.

Problem 2: Smaller Sub-web
The problem is that the web contains a smaller web within it (shown in the blue box in the figure): links lead into the smaller web, but none lead back out. The matrix Q is reducible if Q can be written in block form as $Q = \begin{bmatrix} * & 0 \\ * & * \end{bmatrix}$. If the matrix Q is irreducible, we can guarantee that there is a stationary vector I with all positive entries.

Problem 2: Smaller Sub-web A web is called strongly connected if, given any two pages, there is a way to follow links from the first page to the second. Only strongly connected webs provide irreducible matrices Q. Clearly, the example is not strongly connected. PageRank
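
Strong connectivity can be checked directly from the definition by testing whether every page reaches every other page. The sketch below uses a plain depth-first search; the function names and both example webs are hypothetical and are not the web from the slides.

```python
def reachable_from(start, outlinks):
    """Return the set of pages reachable from start by following links."""
    seen = {start}
    stack = [start]
    while stack:
        page = stack.pop()
        for nxt in outlinks.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def is_strongly_connected(outlinks):
    """True if every page can reach every other page."""
    pages = set(outlinks)
    for targets in outlinks.values():
        pages.update(targets)
    return all(reachable_from(p, outlinks) == pages for p in pages)

# Hypothetical web with a cycle through every page.
cycle_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

# Hypothetical web containing a smaller sub-web {C, D}: once the surfer
# enters it, no link leads back to A or B.
trapped_web = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["C"]}

connected = is_strongly_connected(cycle_web)
trapped = is_strongly_connected(trapped_web)
```

Running one search per page is quadratic in the worst case; it is meant only to mirror the definition.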

(Revisited) Three Important Questions
1) Does the sequence $I^k$ always converge? 2) Is the vector to which it converges independent of the initial vector $I^0$? 3) Do the importance rankings contain the information that we want? To answer yes to all three questions, the matrix Q should be: 1) stochastic (all entries are nonnegative and the entries in each column sum to one), 2) primitive, and 3) irreducible (the web is strongly connected).

Final Modification
There are two ways to surf the web. 1) Follow (click) links: this is the random surf, whose movement is determined by Q. 2) Type a URL in the browser: the surfer jumps to a randomly chosen page, and all pages have an equal chance of being visited by typing. A new matrix $\mathbf{1}$ (the N*N matrix whose entries are all one) is used for this. Google Matrix G: $G = dQ + \frac{1-d}{N}\mathbf{1}$, with damping factor d. G is stochastic since it is a convex combination of stochastic matrices. For d < 1, G is both primitive and irreducible because all the entries of G are positive. Therefore, G has a unique stationary vector I.

Final Modification Google Matrix
The meaning of the parameter d: with d = 1 (G = H + A), we work only with the original hyperlink structure of the web. With d = 0 (G = (1/N) $\mathbf{1}$), we just type a URL and visit a page, ignoring the link structure. We would like to take d close to 1 so that the hyperlink structure of the web is weighted heavily in the computation. Sergey Brin and Larry Page, the creators of PageRank, chose d = 0.85.
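
To illustrate the effect of the modification, the sketch below forms G = dQ + ((1 - d)/N) * 1 for a hypothetical three-page Q that is both reducible and periodic (P1 links to P2, and P2 and P3 link only to each other), then runs the power method from two different starting vectors. The matrix and the function names are assumptions for illustration.

```python
def mat_vec(M, v):
    """Multiply matrix M (list of rows) by column vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def google_matrix(Q, d=0.85):
    """G = d*Q + (1-d)/N * (matrix of all ones)."""
    n = len(Q)
    return [[d * Q[i][j] + (1 - d) / n for j in range(n)] for i in range(n)]

def power_method(M, start, steps=300):
    vec = list(start)
    for _ in range(steps):
        vec = mat_vec(M, vec)
    return vec

# Hypothetical Q: P1 -> P2, P2 -> P3, P3 -> P2 (columns sum to one).
Q = [[0.0, 0.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]

G = google_matrix(Q, d=0.85)
I_a = power_method(G, [1.0, 0.0, 0.0])
I_b = power_method(G, [0.0, 0.0, 1.0])  # different start, same limit
```

In this example P1 has no inlinks, so its rank is exactly the typed-URL share (1 - d)/N = 0.05, while the power method on Q alone would oscillate between P2 and P3.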

From wikipedia… PageRank

Stochastic Interpretation

PageRank – Random Walk over the Web
If a user starts at a random web page and surfs by clicking links and randomly entering new URLs, what is the probability that s/he will arrive at a given page? A Markov chain is a discrete-time stochastic process consisting of N states; each Web page corresponds to a state. A Markov chain is characterized by an N*N transition probability matrix P.

Assume a stochastic process with values in a set E, called the state space; its elements are called the states of the process. Assume that the set E is finite or countable.

Definitions PageRank

Definitions If state i is recurrent, then it is said to be positive recurrent if, starting in state i, the expected time until the process returns to state i is finite. It can be shown that in a finite-state Markov chain, all recurrent states are positive recurrent. Positive recurrent, aperiodic states are called ergodic. PageRank

Limiting Probability (Ross Book – p. 205)
It can be shown that $\pi_j$, the limiting probability that the process will be in state j at time n, also equals the long-run proportion of time that the process will be in state j.

Limiting Probability (Ross Book – pp. 206) PageRank

Google Matrix G
Since the matrix Q can be reducible or periodic, the Google matrix G (defined in the Final Modification slides) must be considered instead, to ensure that the steady-state probability exists and is unique.

P: Importance Vector of Web Pages
The initial importance is chosen according to some probability distribution $P^0 = [p_i]$, where $p_i$ is the probability that the Markov chain is in state i at the initial time. $P^k$ is a vector whose i-th component is the probability that the Markov chain is in state i at time k. The power method: $(P^{k+1})^T = (P^k)^T \cdot G$, hence $(P^k)^T = (P^0)^T \cdot G^k$, and $P \approx P^k$ for sufficiently large k. The stationary distribution P satisfies $P^T = P^T \cdot G$ (steady-state behavior). Brin and Page report that [...] iterations are required to obtain a sufficiently good approximation to P. The calculation is reported to take a few days to complete.
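
The row-vector form of the power method can be sketched as below. Note one assumption: for $(P^{k+1})^T = (P^k)^T \cdot G$ to describe a Markov chain, the matrix must be read with each row summing to one (the transpose of the column convention used in the algebraic part). The two-state transition matrix, the function names, and the tolerance are hypothetical.

```python
def vec_mat(p, G):
    """Compute the row vector p^T G."""
    n = len(G)
    return [sum(p[i] * G[i][j] for i in range(n)) for j in range(n)]

def stationary_distribution(G, p0, tol=1e-12, max_iter=10000):
    """Iterate (P^{k+1})^T = (P^k)^T G until successive vectors agree."""
    p = list(p0)
    for k in range(1, max_iter + 1):
        nxt = vec_mat(p, G)
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt, k
        p = nxt
    return p, max_iter

# Hypothetical two-state chain; each row sums to one.
G = [[0.9, 0.1],
     [0.5, 0.5]]

P, steps = stationary_distribution(G, [1.0, 0.0])  # P^0 puts all mass on state 0
```

For this chain the limit is (5/6, 1/6), which solves $P^T = P^T \cdot G$ and also gives the long-run proportion of time spent in each state.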
