Published by Sucianty Gunardi. Modified over 6 years ago.
1
Google PageRank: Basic Principles and Algebraic/Stochastic Interpretation
Youn-Hee Han, Laboratory of Intelligent Networks (LINK)
2
Background: History, Target, Good Reference
History
- Proposed by Sergey Brin and Lawrence Page (Google's co-founders) in 1998 at Stanford.
- Algorithm behind the first generation of the Google search engine.
- Paper: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”.
Target
- Measure the importance of a Web page based on the link structure alone.
- Assign each node a numerical score between 0 and 1: its PageRank.
- Rank Web pages by their PageRank values.
Good Reference (Korean)
3
Background: Sergey Brin and Lawrence Page
Sergey Brin received his B.S. degree in mathematics and computer science from the University of Maryland at College Park. Currently, he is a Ph.D. candidate in computer science at Stanford University, where he received his M.S. He is a recipient of a National Science Foundation Graduate Fellowship. His research interests include search engines, information extraction from unstructured sources, and data mining of large text collections and scientific data.
Lawrence Page was born in East Lansing, Michigan, and received a B.S.E. in Computer Engineering at the University of Michigan, Ann Arbor. He is currently a Ph.D. candidate in Computer Science at Stanford University. Some of his research interests include the link structure of the web, human-computer interaction, search engines, scalability of information access interfaces, and personal data mining.
Google Inc. was founded in 09/98 (google.com was registered in 09/97).
4
Background: Stanford WebBase project (1996 - 1999)
- "The PageRank Citation Ranking: Bringing Order to the Web" is a technical report (working paper), Stanford Digital Libraries SIDL-WP.
- From the paper: web size = 150M web pages.
- 2005: Google claims to index more than 8B pages.
- The indexable Web was estimated at no fewer than 11.5 billion pages as of the end of January 2005.
5
Background: The Philosophy of PageRank
PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B.
6
Background: Scenario and Idea
Scenario
- A random surfer begins at a Web page A.
- At each step, the surfer moves from the current page to a randomly chosen page that it hyperlinks to (a random walk).
- Some nodes are visited more often. Intuitively, these are nodes with many links coming in from other frequently visited nodes.
Idea
- Pages visited more often in this walk are more important.
- “The rank of a page can be interpreted as the probability that a surfer will be at the page after following a large number of forward links.”
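The random-surfer scenario can be sketched as a short simulation. This is only an illustration: the three-page web in `LINKS`, the function name `random_surf`, and the fixed seed are assumptions, not part of the slides. Visit frequencies stand in for the "probability that a surfer will be at the page".

```python
import random
from collections import Counter

# Hypothetical three-page web (an assumption): page -> pages it links to.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def random_surf(links, start, steps, seed=0):
    """Follow uniformly random outlinks and return the fraction of visits per page."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    visits = Counter()
    page = start
    for _ in range(steps):
        page = rng.choice(links[page])   # click one outlink at random
        visits[page] += 1
    return {p: visits[p] / steps for p in links}

freq = random_surf(LINKS, start="A", steps=100_000)
```

For this particular graph the frequencies should settle near 0.4, 0.2, and 0.4 for A, B, and C, because C collects links from both other pages and passes everything it receives on to A.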
7
Basics: based on the link structure of the web
- pages = nodes, links = edges
- forward links = outlinks
- backlinks = inlinks
- Example: A and B are backlinks of C
8
Basic Principles: Basic Principles about PageRank
1) A link from page A to page B is a vote from A for B.
2) Pages with lots of backlinks are important (in the slide's example, one page has 23,400 inlinks and another has 1 inlink).
3) Backlinks coming from important pages convey more importance to a page.
Combining PageRank (PR) with text-matching techniques yields highly relevant search results.
9
Basic Principles: Basic Principles about PageRank (continued)
3) Backlinks coming from important pages convey more importance to a page.
[Figure: Taher's Home Page and Sep's Home Page, with linking pages DB Pub Server, CS361, Yahoo!, and CNN; one home page is linked by 2 unimportant pages, the other by 2 important pages.]
10
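Basic Principles: Design of Equation to get Page Importance
$$r_i = \sum_{j \in B_i} \frac{r_j}{N_j}$$
where $r_i$ is the importance of page i, $r_j$ is the importance of page j, $N_j$ is the number of outlinks from page j, and $B_i$ is the set of pages j that link to page i.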
11
Basic Principles: Design of Equation to get Page Importance (example)
[Figure: example graph with pages Taher, Sep, DB Pub Server, and CNN; link weights 1/2 and 1; importance values 0.25, 0.05, and 0.1.]
12
Basic Principles: Exact Equation of PageRank
- u, v: web pages
- Bu: set of pages pointing to u (backlinks of u)
- Nv: number of pages that v points to (forward links of v)
- d: damping factor, between 0 and 1: the probability that a user keeps clicking links on web pages
  - d = 0: the user always types a URL and visits that page.
  - d = 1: the user only ever clicks links while surfing.
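A minimal sketch of how these quantities could combine into a damped update, assuming the form PR(u) = (1 - d)/N + d * (sum over v in Bu of PR(v)/Nv). The total page count N, the function name `pagerank_of`, and the three-page web are assumptions for illustration; Brin and Page's paper writes the first term as (1 - d), without dividing by N.

```python
def pagerank_of(u, ranks, backlinks, outdegree, d=0.85):
    """Apply the assumed damped update once for a single page u."""
    n = len(ranks)  # N: total number of pages (an assumption, not on the slide)
    link_share = sum(ranks[v] / outdegree[v] for v in backlinks[u])
    return (1 - d) / n + d * link_share

# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A.
backlinks = {"A": ["C"], "B": ["A"], "C": ["A", "B"]}   # Bu for each page u
outdegree = {"A": 2, "B": 1, "C": 1}                    # Nv for each page v
ranks = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

new_ranks = {u: pagerank_of(u, ranks, backlinks, outdegree) for u in ranks}
```

With d = 0 every page would receive 1/N regardless of links, and with d = 1 only the link structure would matter, which matches the two extremes described for the damping factor.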
13
Basic Principles: Exact Equation of PageRank (Example)
14
Basic Principles: Iteration
15
Basic Principles: Iteration (another example)
Initialize all nodes to rank 1/3 (0.333, 0.333, 0.333).
16
Basic Principles: Iteration (another example)
Propagate ranks across links (multiplying by link weights).
[Figure values: 0.5, 0.167, 0.333, 0.333]
17
Basic Principles: Iteration (another example)
Propagate ranks again across links (multiplying by link weights).
[Figure values: 0.333, 0.167, 0.5, 0.5, 0.167, 0.167]
18
Basic Principles: Iteration (another example)
After a while, the ranks settle at 0.4, 0.4, and 0.2.
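The iteration in this example can be sketched in a few lines. The graph below is an assumption: the slide's figure is not in the transcript, so a three-page web (A -> B, A -> C, B -> C, C -> A) was chosen because it is consistent with the numbers shown. The names `LINKS` and `propagate` are illustrative.

```python
# Hypothetical graph (an assumption): A -> B, A -> C, B -> C, C -> A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def propagate(ranks, links):
    """Send each page's rank along its outlinks, split evenly among them."""
    new_ranks = {page: 0.0 for page in links}
    for page, targets in links.items():
        share = ranks[page] / len(targets)   # link weight = 1 / number of outlinks
        for target in targets:
            new_ranks[target] += share
    return new_ranks

ranks = {page: 1 / len(LINKS) for page in LINKS}   # every node starts at 1/3
history = [ranks]
for _ in range(50):
    ranks = propagate(ranks, LINKS)
    history.append(ranks)
```

Under this assumed graph, the first propagation gives (A, B, C) = (0.333, 0.167, 0.5), the second gives (0.5, 0.167, 0.333), and the ranks approach (0.4, 0.2, 0.4).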
19
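Basic Principles: Algorithm
Initialize: $r_i^{(0)} = 1/N$ for every page i, where N is the number of pages.
Repeat until convergence:
$$r_i^{(k+1)} = \sum_{j \in B_i} \frac{r_j^{(k)}}{N_j}$$
where $r_i$ is the importance of page i, $B_i$ is the set of pages j that link to page i, $N_j$ is the number of outlinks from page j, and $r_j$ is the importance of page j.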
20
Algebraic Interpretation
21
Algebraic Interpretation
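Hyperlink Matrix (source: "How Google Finds Your Needle in the Web's Haystack")
- Suppose that page Pj has Nj links.
- If one of those links is to page Pi, then Pj will pass on 1/Nj of its importance to Pi.
- The importance ranking of Pi is the sum of the contributions made by all pages linking to it: $I(P_i) = \sum_{P_j \in B_i} \frac{I(P_j)}{N_j}$, where $B_i$ is the set of pages linking to $P_i$.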
22
Algebraic Interpretation
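Hyperlink Matrix
- Hyperlink matrix H = [Hij], in which the entry in the ith row and jth column is $H_{ij} = 1/N_j$ if $P_j \in B_i$, and $H_{ij} = 0$ otherwise.
- Matrix H is stochastic: its entries are all nonnegative, and the sum of the entries in a column is one (unless the page corresponding to that column has no links).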
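As a small illustration of this definition, the sketch below builds H for a hypothetical three-page web and checks the column sums. The page names, the `links` dictionary, and the helper `hyperlink_matrix` are assumptions; the matrix is stored as a plain list of rows.

```python
def hyperlink_matrix(links, pages):
    """Return H as a list of rows, with H[i][j] = 1/N_j when page j links to page i."""
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    H = [[0.0] * n for _ in range(n)]
    for source, targets in links.items():
        j = index[source]                       # column of the linking page
        for target in targets:
            H[index[target]][j] = 1.0 / len(targets)
    return H

# Hypothetical web (an assumption): P1 -> P2, P1 -> P3, P2 -> P3, P3 -> P1.
pages = ["P1", "P2", "P3"]
links = {"P1": ["P2", "P3"], "P2": ["P3"], "P3": ["P1"]}
H = hyperlink_matrix(links, pages)

# Each column sums to one because every page here has at least one outlink.
column_sums = [sum(H[i][j] for i in range(len(pages))) for j in range(len(pages))]
```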
23
Algebraic Interpretation
Stationary Vector I
- We also form a vector $I = [I(P_i)]$ whose components are the PageRanks.
- An important condition: $I = HI$, that is, the vector I is an eigenvector of the matrix H with eigenvalue 1. We also call I a stationary vector of H.
- We require that the sum of the entries in the vector I be one.
24
Algebraic Interpretation
Stationary Vector I
- 25 billion web pages means that H has about N = 25 billion columns and rows.
- However, most of the entries in H are zero; in fact, studies show that web pages have an average of about 10 links, meaning that, on average, all but 10 entries in every column are zero.
- We will use a method known as the power method to find the stationary vector I of the matrix H.
- We begin by choosing a vector $I^0$ and then produce a sequence of vectors $I^k$ by $I^{k+1} = H I^k$.
- General principle: the sequence $I^k$ will converge to the stationary vector I.
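A minimal sketch of the power method on a small dense matrix, assuming a hypothetical three-page web and an illustrative stopping rule (stop when successive vectors differ by less than `tol` in the 1-norm). A real web-scale computation would exploit the sparsity of H rather than store every entry.

```python
def mat_vec(M, v):
    """Multiply a square matrix (list of rows) by a column vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def power_method(H, I0, tol=1e-10, max_iter=1000):
    """Iterate I^{k+1} = H I^k until successive vectors differ by less than tol."""
    current = list(I0)
    for k in range(max_iter):
        following = mat_vec(H, current)
        if sum(abs(a - b) for a, b in zip(following, current)) < tol:
            return following, k + 1
        current = following
    return current, max_iter

# Hypothetical web (an assumption): P1 -> P2, P1 -> P3, P2 -> P3, P3 -> P1.
H = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
I, iterations = power_method(H, [1.0, 0.0, 0.0])
```

For this matrix the iteration approaches the stationary vector (0.4, 0.2, 0.4), whose entries sum to one.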
25
Algebraic Interpretation
Stationary Vector I
26
Algebraic Interpretation
Three Important Questions
- Does the sequence $I^k$ always converge?
- Is the vector to which it converges independent of the initial vector $I^0$?
- Do the importance rankings contain the information that we want?
For the method as described so far, the answer to all three questions is "No!" However, we'll see how to modify our method so that we can answer "yes" to all three.
27
Algebraic Interpretation
Problem 1: Dangling Node
- Consider the following small web consisting of two web pages, in which P1 links to P2 and P2 has no links.
- The importance rating of both pages is zero, which tells us nothing about the relative importance of these pages.
- The problem is that P2 has no links. Pages with no links are called dangling nodes and there are, of course, many of them in the real web.
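The loss of importance at a dangling node can be traced by hand or with a short sketch. The two-page matrix below assumes that P1 links to P2 and P2 links nowhere; the helper name `step` and the starting vector are illustrative.

```python
# Two-page web (assumed layout): P1 links to P2, and P2 has no links.
H = [[0.0, 0.0],   # nothing links to P1
     [1.0, 0.0]]   # P1 passes all of its importance to P2

def step(H, I):
    """One power-method step: I^{k+1} = H I^k."""
    return [sum(h * x for h, x in zip(row, I)) for row in H]

I0 = [1.0, 0.0]
I1 = step(H, I0)   # importance moves from P1 to P2
I2 = step(H, I1)   # P2 has no outlinks, so the importance disappears
```

After two steps the vector is all zeros, which is why both pages end up with an importance rating of zero.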
28
Algebraic Interpretation
Problem 1: Dangling Node
- To solve it, we pretend that a dangling node has a link to every page, so a surfer who reaches it jumps to a page chosen uniformly at random.
- This modifies the hyperlink matrix H by replacing the column of zeroes corresponding to a dangling node with a column in which each entry is 1/N.
- If A is the matrix whose entries are all zero except for the columns corresponding to dangling nodes, in which each entry is 1/N, then Q = H + A.
- Q is then stochastic: every column sums to one.
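The construction Q = H + A can be sketched as follows. The helper name `fix_dangling` and the two-page example (P1 links to P2, P2 dangling) are assumptions for illustration.

```python
def fix_dangling(H):
    """Return Q = H + A: every all-zero column of H gets the value 1/N in each row."""
    n = len(H)
    dangling = [j for j in range(n) if all(H[i][j] == 0 for i in range(n))]
    return [[1.0 / n if j in dangling else H[i][j] for j in range(n)]
            for i in range(n)]

# Two-page web (assumed layout): P1 -> P2, and P2 is a dangling node.
H = [[0.0, 0.0],
     [1.0, 0.0]]
Q = fix_dangling(H)   # second column becomes [1/2, 1/2]
```

For this Q the vector (1/3, 2/3) is stationary, so P2 is now rated as more important than P1 instead of both being zero.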
29
Algebraic Interpretation
Problem 2: Smaller Sub-web
- Consider the web shown in the figure, together with its matrix Q and stationary vector I.
- The PageRanks assigned to the first four web pages are zero.
30
Algebraic Interpretation
Problem 2: Smaller Sub-web
- The problem: the web contains a smaller web within it (shown in a blue box in the figure). Links lead into this smaller web, but none lead back out.
- The matrix Q is reducible if Q can be written in block form as $Q = \begin{bmatrix} * & 0 \\ * & * \end{bmatrix}$.
- If the matrix Q is irreducible, we can guarantee that there is a stationary vector I with all positive entries.
31
Algebraic Interpretation
Problem 2: Smaller Sub-web
- A web is called strongly connected if, given any two pages, there is a way to follow links from the first page to the second.
- Only strongly connected webs provide irreducible matrices Q.
- Clearly, the example is not strongly connected.
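Strong connectivity can be tested directly on the link structure. The sketch below is one possible check, not the slides' method: it verifies that a chosen page can reach every page and that every page can reach it (by searching the reversed links). The function names and the two small webs are assumptions.

```python
from collections import deque

def reachable(links, start):
    """Return the set of pages reachable from start by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

def is_strongly_connected(links, pages):
    """True if every page can reach every other page by following links."""
    reverse = {page: [] for page in pages}
    for source, targets in links.items():
        for target in targets:
            reverse[target].append(source)
    first, everything = pages[0], set(pages)
    return (reachable(links, first) == everything
            and reachable(reverse, first) == everything)

# Hypothetical webs (assumptions, not the web pictured on the slide).
pages = ["P1", "P2", "P3"]
cycle = {"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]}
trap = {"P1": ["P2"], "P2": ["P3"], "P3": ["P2"]}   # no link leads back to P1
```

Here `is_strongly_connected(cycle, pages)` holds, while `trap` fails the check because P2 and P3 form a smaller web that a surfer can enter but never leave.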
32
Algebraic Interpretation
(Revisited) Three Important Questions
- Does the sequence $I^k$ always converge?
- Is the vector to which it converges independent of the initial vector $I^0$?
- Do the importance rankings contain the information that we want?
To answer "yes" to all three questions, the matrix Q should be:
1) Stochastic: all entries are nonnegative, and the sum of the entries in every column is one.
2) Primitive: some power of Q has all positive entries.
3) Irreducible: the web is strongly connected.
33
Algebraic Interpretation
Final Modification
Two ways to surf the web:
1) Follow (click) links: random surfing. The movement of the random surfer is determined by Q.
2) Type a URL in the browser: randomly choose any page. All pages have an equal chance of being visited by typing. A new matrix 1 (the N*N matrix whose entries are all one) is used.
Google Matrix G: $G = dQ + \frac{1-d}{N}\mathbf{1}$
- G is stochastic since it is a combination of stochastic matrices.
- G is both primitive and irreducible because all the entries of G are positive (for d < 1).
- Therefore, G has a unique stationary vector I.
34
Algebraic Interpretation
Final Modification
Google Matrix: $G = dQ + \frac{1-d}{N}\mathbf{1}$
The meaning of parameter d:
- d = 1 (G = H + A = Q): we are working only with the original hyperlink structure of the web.
- d = 0 (G = (1/N) 1): we only type URLs and visit pages at random, ignoring links.
- We would like to take d close to 1 so that the hyperlink structure of the web is weighted heavily in the computation.
- Sergey Brin and Larry Page, the creators of PageRank, chose d = 0.85.
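A minimal sketch of the Google matrix and its stationary vector, assuming the form G = d*Q + (1 - d)/N * 1 with d = 0.85. The two-page Q (P1 links to P2, with P2's dangling column already replaced by 1/N entries), the function names, and the fixed iteration count are assumptions for illustration.

```python
def google_matrix(Q, d=0.85):
    """Return G = d*Q + (1 - d)/N * 1, where 1 is the all-ones N*N matrix."""
    n = len(Q)
    return [[d * Q[i][j] + (1 - d) / n for j in range(n)] for i in range(n)]

def stationary_vector(G, iterations=200):
    """Power method on G, starting from the uniform vector."""
    n = len(G)
    I = [1.0 / n] * n
    for _ in range(iterations):
        I = [sum(G[i][j] * I[j] for j in range(n)) for i in range(n)]
    return I

# Hypothetical two-page web after the dangling-node fix: Q = H + A.
Q = [[0.0, 0.5],
     [1.0, 0.5]]
G = google_matrix(Q)       # every entry is strictly positive when d < 1
I = stationary_vector(G)   # approximately (0.351, 0.649)
```

Because every entry of G is positive, the iteration converges to the same vector from any starting distribution, which is the point of the final modification.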
35
Algebraic Interpretation
From Wikipedia…
36
Stochastic Interpretation
37
Stochastic Interpretation
PageRank: Random Walk over the Web
- If a user starts at a random web page and surfs by clicking links and randomly entering new URLs, what is the probability that s/he will arrive at a given page?
- A Markov chain is a discrete-time stochastic process over N states; here, each Web page corresponds to a state.
- A Markov chain is characterized by an N*N transition probability matrix P.
38
Stochastic Interpretation
Consider a stochastic process with values in a set E, called the state space; its elements are called the states of the process. Assume that the set E is finite or countable.
39
Stochastic Interpretation
Definitions
40
Stochastic Interpretation
Definitions
- If state i is recurrent, then it is said to be positive recurrent if, starting in state i, the expected time until the process returns to state i is finite.
- It can be shown that in a finite-state Markov chain, all recurrent states are positive recurrent.
- Positive recurrent, aperiodic states are called ergodic.
41
Stochastic Interpretation
Limiting Probability (Ross Book, p. 205)
It can be shown that $\pi_j$, the limiting probability that the process will be in state j at time n, also equals the long-run proportion of time that the process will be in state j.
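The idea of a limiting probability can be illustrated numerically. In the sketch below, the two-state chain, the helper `mat_mul`, and the number of multiplications are assumptions; P[i][j] is the probability of moving from state i to state j, so each row sums to one. For an irreducible, ergodic chain the rows of P^n become identical as n grows, and that common row is the limiting distribution.

```python
def mat_mul(A, B):
    """Multiply two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical two-state chain (an assumption): rows are "from", columns are "to".
P = [[0.5, 0.5],
     [0.25, 0.75]]

Pn = P
for _ in range(60):
    Pn = mat_mul(Pn, P)   # raise P to one higher power on each pass

limiting = Pn[0]          # both rows of Pn are now (almost) the same
```

For this chain both rows approach (1/3, 2/3), which also solves the balance equations pi_0 = 0.5*pi_0 + 0.25*pi_1 and pi_0 + pi_1 = 1.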
42
Stochastic Interpretation
Limiting Probability (Ross Book, p. 206)
43
Stochastic Interpretation
Google Matrix G
Since the matrix Q can be reducible or periodic, the following Google matrix G must be considered to ensure that the steady-state probability exists and is unique:
$G = dQ + \frac{1-d}{N}\mathbf{1}$
44
Stochastic Interpretation
P: Importance Vector of Web Pages
- The initial importance is chosen according to some probability distribution P0 = [pi], where pi is the probability that the Markov chain is in state i at the initial time.
- Pk is a vector whose i-th component is the probability that the Markov chain is in state i at time k.
- The stationary distribution P satisfies $P^T = P^T G$ (steady-state behavior). In this row-vector form, G_ij is the probability of moving from page i to page j, so each row of G sums to one (the transpose of the column convention used in the algebraic interpretation).
- The power method: $(P^{k+1})^T = (P^k)^T G$, so $(P^k)^T = (P^0)^T G^k$, and $P \approx P^k$ for sufficiently large k.
- Brin and Page report that iterations are required to obtain a sufficiently good approximation to P. The calculation is reported to take a few days to complete.