Presented by Zheng Zhao Originally designed by Soumya Sanyal

1 The PageRank Citation Ranking: Bringing Order to the Web
Page L., Brin S., Motwani R., Winograd T. Stanford Digital Library Technologies Project. Presented by Zheng Zhao. Originally designed by Soumya Sanyal.

2 Outline
Paper Citations and the Web: Motivation
PageRank: Why it should be considered?
More PageRank: Nuts and bolts
PageRank Unleashed: Looking under the hood
Convergence and Random Walks: Why does it work?
Implementation: Getting your hands dirty
Personalized PageRank: The invisible source
Applications: What wasn’t apparent already
Conclusions

3 Paper Citations and the Web : Motivation
Academic citations link to other well-known papers, and those papers are peer reviewed and subject to quality control. Academic documents are fairly homogeneous in their quality, usage, citation and length. Web pages link to other web pages as well, but the quality of a web page is subjective and depends on the user. The importance of a page is therefore not a quantity that can be captured intuitively.

4 Contd. A user wants to see what is most applicable to her needs first. The job of the retrieval system is to present the more relevant documents up front. This makes the notion of quality, or relative importance, of a web page all the more significant. The average quality of the pages a user actually sees is higher than the quality of the average web page.
Notation used:
Backlinks (inedges): links that point to a certain page.
Forward links (outedges): links that emanate from that page.

5 PageRank : Why it should be considered?
Think of a color palette. Colors are formed by mixing one or more colors. The amount and intensity of each color you mix ultimately governs the color of the final mixture, not the number of colors!
Now think of a web page. A number of back links (inedges) point to this page. Say a certain back link came from Yahoo! and another came from an obscure home page. Think of the importance of the Yahoo! page as opposed to the importance of the ‘home page’. Now say the importance of the Yahoo! page was mapped to the amount (intensity) of one color and that of the ‘home page’ to another color. What matters is the importance of the back links rather than their number.

6 More PageRank: Nuts and bolts
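Say for any web page u, F_u is the set of pages that u points to (forward links), B_u is the set of pages that point to u (back links), and N_u = |F_u| is the number of forward links from u.
$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$
R(u) = rank of page u; c = normalization constant. Note: c < 1 to cover for pages with no outgoing links.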
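To make the rank formula concrete, here is a minimal sketch that applies one update of R(u) = c * sum of R(v)/N_v over the back links of u. The three-page graph and the names `forward_links`, `backlinks` and `rank_update` are hypothetical, chosen only for illustration.

```python
# Hypothetical 3-page web: page -> pages it links to (its forward links F_u).
forward_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}


def backlinks(forward_links):
    """Invert the forward-link map to obtain B_u for every page u."""
    back = {u: [] for u in forward_links}
    for v, targets in forward_links.items():
        for u in targets:
            back[u].append(v)
    return back


def rank_update(rank, forward_links, c=1.0):
    """Apply R(u) = c * sum over v in B_u of R(v) / N_v once, for every page."""
    back = backlinks(forward_links)
    return {
        u: c * sum(rank[v] / len(forward_links[v]) for v in back[u])
        for u in forward_links
    }


# Start from equal ranks and apply the formula once.
rank = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
new_rank = rank_update(rank, forward_links)
```

Page c ends up with the largest share because it receives rank from both a and b, while b only receives half of a's rank.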

7 Contd.. So what does the overall picture look like?
A is a square matrix whose rows and columns are indexed by the web pages u and v.

8 Contd.. (Matrices Revisited)
Eigenvectors and eigenvalues. Given that A is a matrix and R is a vector over all the web pages, the dominant eigenvector is the one associated with the maximal eigenvalue. It can be found by iterating the previous equation until the iteration converges. The eigenvectors associated with a given eigenvalue (together with the zero vector) form what is called its eigenspace.
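The idea of iterating until convergence is usually called power iteration. The sketch below is a generic illustration on a small nonnegative matrix, not the matrix from the slides; the function names, the tolerance and the 2x2 example are assumptions.

```python
def mat_vec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in m]


def power_iteration(m, tol=1e-12, max_iter=1000):
    """Estimate the dominant eigenvalue and eigenvector of a nonnegative matrix.

    The vector is kept normalized to L1 norm 1, so the L1 norm of m @ x
    serves as the eigenvalue estimate once x has converged.
    """
    n = len(m)
    x = [1.0] + [0.0] * (n - 1)  # arbitrary nonnegative starting vector
    eigenvalue = 0.0
    for _ in range(max_iter):
        y = mat_vec(m, x)
        eigenvalue = sum(abs(v) for v in y)
        y = [v / eigenvalue for v in y]
        if sum(abs(a - b) for a, b in zip(x, y)) < tol:
            return eigenvalue, y
        x = y
    return eigenvalue, x


# Illustrative matrix with eigenvalues 3 and 1; the dominant eigenvector is (1, 1).
example = [[2.0, 1.0],
           [1.0, 2.0]]
dominant_value, dominant_vector = power_iteration(example)
```

The speed of convergence depends on the ratio between the two largest eigenvalues, here 1/3 per step.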

9 Contd.. (A Walk Through Example)
Let's take an example: A^T = (matrix shown on the slide)

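10 Contd.. Matrix notation: R = c A R = M R, where M = c A.
Compare this with the eigenvalue equation A x = λ x, i.e. (A - λI) x = 0, which has a nonzero solution x only when |A - λI| = 0. R is an eigenvector of A; since A R = (1/c) R, the corresponding eigenvalue is λ = 1/c. For the example, A, R and the normalized R were shown as matrices on the slide.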

11 Contd.. (Markov Chains) Random surfer model
Description of a random walk through the Web graph. M is interpreted as a transition matrix, and the rank of a page as the asymptotic probability that a surfer is currently browsing that page. This notion is fundamental to any Markovian system. In discrete time, the following is assumed: R_t = M R_{t-1}, where M is the (stochastic) transition matrix of a first-order Markov chain. The question is: does it converge to some sensible solution (as t → ∞) regardless of the initial ranks?
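A small experiment can illustrate the convergence question. The sketch below assumes a hypothetical three-page graph (a links to b and c, b links to c, c links to a) and a column-stochastic M, where M[u][v] is the probability of moving from page v to page u; it runs R_t = M R_{t-1} from two different starting vectors.

```python
def step(M, R):
    """One step of R_t = M R_{t-1}."""
    return [sum(m * r for m, r in zip(row, R)) for row in M]


def iterate(M, R, steps):
    """Apply the transition matrix a fixed number of times."""
    for _ in range(steps):
        R = step(M, R)
    return R


# Column-stochastic transition matrix, pages ordered (a, b, c).
M = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]

# Two surfers starting on different pages end up with the same distribution.
from_a = iterate(M, [1.0, 0.0, 0.0], 200)
from_c = iterate(M, [0.0, 0.0, 1.0], 200)
```

For this graph both runs settle on the ranks (0.4, 0.2, 0.4). The graph is strongly connected and has cycles of length 2 and 3, which is why the starting point does not matter here.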

12 Contd..(Issues..) The above equation would converge were it not for a little problem. This problem is called the ‘Rank Sink’ problem: a set of pages that link to each other but to no page outside the set. The sink accumulates rank, but never distributes it!
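The rank sink effect can be reproduced on a toy graph. In this sketch the four-page graph is hypothetical: a and b link to each other, a also links to c, and c and d link only to each other, so {c, d} forms a sink.

```python
def step(M, R):
    """One step of R_t = M R_{t-1}."""
    return [sum(m * r for m, r in zip(row, R)) for row in M]


# Column-stochastic matrix, pages ordered (a, b, c, d):
# a -> b, a -> c, b -> a, c -> d, d -> c.
M = [[0.0, 1.0, 0.0, 0.0],
     [0.5, 0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]

R = [0.25, 0.25, 0.25, 0.25]
for _ in range(100):
    R = step(M, R)

outside_sink = R[0] + R[1]  # rank left on pages a and b
inside_sink = R[2] + R[3]   # rank captured by the loop c <-> d
```

Every visit to a leaks half of its rank into the loop, and nothing ever flows back, so practically all rank ends up inside the sink.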

13 Contd.. In general, many web pages have either no backlinks or no forward links. This results in dangling edges in the graph.
No parent → rank 0: M^T converges to a matrix whose last column is all zero.
No children → no solution: M^T converges to the zero matrix.
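One way to deal with pages that have no children is to drop them before ranking, which is what the implementation slide hints at with "No dangling links are kept initially". The following is only a sketch of such a pruning pass; the function name and the toy graph are assumptions, and it assumes every link target also appears as a key.

```python
def prune_dangling(forward_links):
    """Repeatedly remove pages without forward links.

    Removing a page can leave another page without forward links,
    so the pass is repeated until nothing changes.
    """
    graph = {u: set(targets) for u, targets in forward_links.items()}
    while True:
        dangling = {u for u, targets in graph.items() if not targets}
        if not dangling:
            return graph
        graph = {
            u: targets - dangling
            for u, targets in graph.items()
            if u not in dangling
        }


# Hypothetical graph: d has no children; once d is gone, c has none either.
toy = {"a": ["b", "c"], "b": ["a"], "c": ["d"], "d": []}
pruned = prune_dangling(toy)
```

Only the mutually linked pages a and b survive in this example.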

14 Contd..(More Random Surfer)
How do we escape from this? A: We actually ‘escape’ from it. Say a surfer is randomly clicking and hopping from one page to the other. If this surfer keeps going back to the ‘same’ set of pages, she will get bored (in reality too) and try to ‘escape’ from this set of pages. Hence, we associate an ‘escape’ factor E to account for this ‘boredom’. How do we model this escape probability? We define E to be a vector over all the web pages that accounts for each page’s escape probability.

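15 Contd.. Given this escape vector, how do we associate it with the original model?
$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c E(u)$
In matrix notation, $R' = c(AR' + E)$, where $\|R'\|_1 = 1$ and c is maximized. It can be rewritten as $R' = c(A + E \times \mathbf{1})R'$, where $\mathbf{1}$ is the row vector of all ones. Hence $R'$ is an eigenvector of $(A + E \times \mathbf{1})$.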

16 PageRank Unleashed: Looking under the hood
The main algorithm: What can we say about d and ε? d1 is called the eigengap and it controls the rate of convergence. ε is the convergence threshold.
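The iteration described in the PageRank paper can be sketched as follows: multiply by A, measure the rank d that was lost, add it back through E, and stop once the change falls below ε. This is a hedged sketch rather than the slide's exact listing. The starting vector S = E, the damping factor 0.85 applied to A, the uniform E that sums to 1, and the toy link matrix are all assumptions.

```python
def l1(x):
    """L1 norm of a vector."""
    return sum(abs(v) for v in x)


def mat_vec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def pagerank(A, E, eps=1e-10, max_iter=1000):
    R = list(E)  # R_0 <- S; using S = E is an assumption of this sketch
    for _ in range(max_iter):
        R_next = mat_vec(A, R)
        d = l1(R) - l1(R_next)  # rank lost in this step
        R_next = [r + d * e for r, e in zip(R_next, E)]  # redistribute it via E
        delta = l1([a - b for a, b in zip(R_next, R)])
        R = R_next
        if delta <= eps:  # eps is the convergence threshold
            break
    return R


# Toy graph with a sink {c, d}: a -> b, a -> c, b -> a, c -> d, d -> c.
links = [[0.0, 1.0, 0.0, 0.0],
         [0.5, 0.0, 0.0, 0.0],
         [0.5, 0.0, 0.0, 1.0],
         [0.0, 0.0, 1.0, 0.0]]
damping = 0.85  # assumed value; scales A so that some rank is lost each step
A = [[damping * x for x in row] for row in links]
E = [0.25, 0.25, 0.25, 0.25]
R = pagerank(A, E)
```

With the escape term in place the sink still collects most of the rank, but pages a and b keep a nonzero share and the total stays at 1.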

17 Convergence and Random Walks : Why does it work?
Irreducible, aperiodic Markov chains with a primitive transition probability matrix. What is the issue all about? We need a transition matrix model that is guaranteed to converge, and that converges to a unique stationary distribution vector.

18 Contd.. Adding the escape vector E makes the original matrix A both primitive and stochastic. This guarantees convergence. What about the addition of new links? Are link analysis algorithms based on eigenvectors stable, in the sense that the results don’t change significantly? If the connectivity of a portion of the graph is changed arbitrarily, how will it affect the results of the algorithms? See Ng et al. (2001), IJCAI, and Bianchini et al. (2002), WWW’02. It is possible to perturb a symmetric matrix by a quantity that grows as d1 that produces a constant perturbation of the dominant eigenvector.

19 Contd.. Convergence Experiment(s)
Expander graphs and d1 (a graph is an expander if every subset S has a neighborhood larger than some factor α times |S|). Rapidly mixing random walk: convergence is guaranteed in time logarithmic in the size of the graph.

20 Implementation: Getting your hands dirty
24 million web pages.
Crawler builds an index of links.
To do this in 5 days, about 50 web pages/second need to be crawled.
11 is the average outdegree, so 550 links/second.
75 million unique URLs to be compared against.
URLs are hashed to unique integer IDs.
No dangling links are kept initially.
Vector E will help in convergence issues also.
Weights were kept for 75 million URLs at 4 bytes/weight (300 MB).
Access to the link database is linear since it is sorted.
`99 – 800 million pages; ` billion; `01 – 4 billion
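The figures on this slide can be checked with a few lines of arithmetic. The variable names are illustrative; the inputs are the numbers quoted above.

```python
pages = 24_000_000
crawl_days = 5
seconds = crawl_days * 24 * 60 * 60

# Exact crawl rate needed; the slide rounds this to about 50 pages/second.
pages_per_second = pages / seconds

# Links to process per second at the rounded crawl rate and outdegree 11.
links_per_second = 50 * 11

# Memory for one 4-byte weight per unique URL.
unique_urls = 75_000_000
weight_bytes = unique_urls * 4
weight_megabytes = weight_bytes / 1_000_000
```

The exact crawl rate comes out slightly above 55 pages/second, and the weight vector needs 300 MB, matching the slide.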

21 Personalized PageRank: The invisible source
Web pages are valued because they exist! Web pages with many related links receive an overly high ranking. The other extreme: E for just one web page. Examples: the Netscape home page and John McCarthy’s home page.
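The effect of concentrating E on a single page can be seen on a toy graph. This sketch uses the common damped form R = 0.85 M R + 0.15 E; the damping value, the three-page graph (a links to b and c, b links to c, c links to a) and the function name are assumptions for illustration.

```python
def personalized_pagerank(M, E, damping=0.85, steps=200):
    """Iterate R = damping * M R + (1 - damping) * E, with E summing to 1."""
    R = list(E)
    for _ in range(steps):
        R = [
            damping * sum(m * r for m, r in zip(row, R)) + (1 - damping) * e
            for row, e in zip(M, E)
        ]
    return R


# Column-stochastic matrix, pages ordered (a, b, c).
M = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]

uniform = personalized_pagerank(M, [1 / 3, 1 / 3, 1 / 3])
only_b = personalized_pagerank(M, [0.0, 1.0, 0.0])
```

Putting all of E on page b raises b's rank from roughly 0.21 to roughly 0.29, while the link structure is unchanged.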

22 Applications: What wasn’t apparent already
Estimating Web Traffic
How does PageRank correspond to actual usage? An Internet proxy cache from NLANR was compared to PageRank; 2.6 million pages intersect with PageRank’s indexed 75 million. Web-based access is one plausible reason for this disparity: people look at certain pages but never link to them.
Backlink Predictor
PageRank is a better predictor of future citation counts than citation counts themselves. The experiment starts out with one URL and no other information. The goal is to crawl the Web pages in the order of their importance, where importance is an evaluation function based on citation counts (number of backlinks). PageRank escapes local minima; citation counts get stuck in these.

23 Conclusions In essence, the importance of one page being dependent on the importance of its predecessors is like a ‘peer’ review. NASDAQ – 17th February, $ : Need I say More?
