Presented By: Wang Hao, March 8th, 2011. The PageRank Citation Ranking: Bringing Order to the Web. Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd.


Presented By: Wang Hao, March 8th, 2011
The PageRank Citation Ranking: Bringing Order to the Web. Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. January 29th, 1998, Stanford InfoLab
Adaptive methods for the computation of PageRank. Sepandar Kamvar, Taher Haveliwala, Gene Golub. 2004

Agenda
- Technology Overview
- Introduction & Motivation
- Link Structure of the Web
- Simplified PageRank
- PageRank Definition
- How we can get PageRank
- Dangling Links
- PageRank Implementation
- Adaptive Methods for computation of PageRank
- Searching with PageRank
- Personalized PageRank
- Application
- Conclusion
- References

Technology Overview
- Google recognized the need for a new kind of server setup: linked PCs that quickly find each query's answers
- This resulted in faster response time, greater scalability, and lower costs
- Google uses more than 200 signals (including the PageRank algorithm) to determine which pages are important
- Google then performs hypertext matching
Source: Google Corporate Information

Life of a Google Query

Introduction & Motivation
- The WWW is very large and heterogeneous
- Web pages are extremely diverse in content, quality, and structure
- This makes information retrieval on the WWW challenging
- Academic citations also link to other well-known papers, but academic papers are peer reviewed and have quality control
- The web of academic documents is homogeneous in quality, usage, citation, and length

Introduction & Motivation (cont'd)
- Most web pages link to other web pages as well
- The quality of a web page is subjective and depends on the user, though
- The importance of a page is not a quantity that can be captured intuitively
- Problem: how can the most relevant pages be ranked at the top?
- Answer: take advantage of the link structure of the Web to produce a ranking of every web page, known as PageRank

Link Structure of the Web
- A and B are backlinks of C
- Every page has some number of forward links (outedges) and backlinks (inedges)
- We can never know all the backlinks of a page, but we know all of its forward links
- Generally, highly linked pages are more "important"

PageRank Definition
- PageRank: a method for computing a ranking for every web page based on the graph of the web
- A page has high rank if the sum of the ranks of its backlinks is high. This covers both a page with many backlinks and a page with a few highly ranked backlinks
- PageRank is a link analysis algorithm that assigns a numerical weight representing how important a page is on the web
- The web is democratic, i.e., pages vote for pages
- Google interprets a link from page A to page B as a vote, by page A, for page B. It also analyses the page that casts the vote
- A page is important if important pages refer to it

Simplified PageRank Calculation (cont'd)
Simple ranking function: $R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$
- u: a web page
- B_u: the set of pages that link to u (its backlinks)
- F_u: the set of pages u links to; N_u = |F_u| is the number of links from u
- c: factor used for normalization
In principle, the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks is one.
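The voting rule on this slide (each page splits its rank evenly among its forward links, and the totals are renormalized) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three-page graph, the function name `simplified_pagerank`, and the fixed iteration count are assumptions, and the sketch assumes every page has at least one forward link.

```python
def simplified_pagerank(links, iterations=100):
    """Iterate R(u) = c * sum over backlinks v of R(v) / N_v.

    `links` maps each page to the list of pages it links to (assumed non-empty).
    """
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in links.items():
            share = rank[v] / len(targets)   # v's rank split over its N_v links
            for u in targets:
                new_rank[u] += share
        total = sum(new_rank.values())       # normalization (the factor c)
        rank = {p: r / total for p, r in new_rank.items()}
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simplified_pagerank(links)
print(ranks)
```

For this graph the ranks settle near 0.4 for A, 0.2 for B, and 0.4 for C: C collects votes from both A and B, and passes all of its rank on to A.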

Computing PageRank given a Directed Graph
- The transition matrix A = (matrix not preserved in the transcript)
- We get the eigenvalue λ = 1
- Calculating the eigenvector

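Computing PageRank given a Directed Graph (cont'd)
- On substituting we get (equation not preserved in the transcript), so the vector u is of the form (not preserved)
- Choose v to be the unique eigenvector with the sum of all entries equal to 1; this is the PageRank vector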

How we can get PageRank
- The PageRank computation can be viewed as a Markov chain
- Set the probability distribution at time 0: X_0
- Set the one-step transition probability matrix: A
- What we want is the unique stationary distribution of the Markov chain, obtained by repeatedly applying A to the current distribution until it converges
- This stationary distribution is the principal eigenvector of the matrix A, which is exactly the PageRank vector
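A minimal sketch of this iteration, assuming a column-stochastic matrix `A` (entry `A[i][j]` is the probability of moving from page j to page i). The four-page matrix, the tolerance, and the function names are illustrative assumptions, not values from the slides.

```python
def mat_vec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def stationary(A, x0, tol=1e-12, max_iter=10_000):
    """Apply A repeatedly until the L1 change between steps drops below tol."""
    x = list(x0)
    for step in range(1, max_iter + 1):
        nxt = mat_vec(A, x)
        if sum(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt, step
        x = nxt
    raise RuntimeError("power iteration did not converge")


# Hypothetical 4-page web: 1 -> 2,3,4; 2 -> 3,4; 3 -> 1; 4 -> 1,3.
A = [
    [0.0, 0.0, 1.0, 0.5],
    [1 / 3, 0.0, 0.0, 0.0],
    [1 / 3, 0.5, 0.0, 0.5],
    [1 / 3, 0.5, 0.0, 0.0],
]
x, steps = stationary(A, [0.25, 0.25, 0.25, 0.25])
print(x, steps)
```

The result satisfies x = A x up to the tolerance, which is the eigenvector condition for eigenvalue 1; for this matrix it is approximately (12/31, 4/31, 9/31, 6/31).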

Problem 1: Dangling Links
- Dangling links are links that point to a page with no outgoing links, or to a page that has not been downloaded yet
- Problem: it is unclear how their weight should be distributed
- Solution 1: remove them from the system until all the PageRanks are calculated. Afterwards, they are added back in without affecting things significantly

Problem 1: Dangling Links (cont'd)
- Solution 2 (presented in the second paper): let v be a vector representing a uniform distribution over all nodes, and let D be the matrix that adds v to the (all-zero) transition row of every dangling page
- In terms of the random walk, the effect of D is to modify the transition probabilities so that a surfer visiting a dangling page randomly jumps to another page in the next time step, using the distribution given by v
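A minimal sketch of this fix, assuming a row-oriented transition matrix `P` in which row i holds the probabilities of leaving page i, so a dangling page shows up as an all-zero row. The function name `fix_dangling` and the three-page example are hypothetical.

```python
def fix_dangling(P, v=None):
    """Return a copy of P in which every all-zero row is replaced by v.

    v defaults to the uniform distribution over all pages.
    """
    n = len(P)
    if v is None:
        v = [1.0 / n] * n
    fixed = []
    for row in P:
        if sum(row) == 0:          # dangling page: no outgoing links
            fixed.append(list(v))  # the surfer jumps according to v
        else:
            fixed.append(list(row))
    return fixed


# Hypothetical example: page 1 has no outgoing links.
P = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
P_fixed = fix_dangling(P)
print(P_fixed)
```

After the fix every row sums to one, so the matrix again describes a proper random walk.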

Problem 2: Rank Sink
- Problem: some pages form a loop that accumulates rank but never distributes it to the rest of the web (a rank sink)
- Solution: the Random Surfer Model. The surfer occasionally jumps to a random page chosen according to some distribution E (the rank source)

Convergence and Random Walks: Why does it work?
- The relevant theory is that of irreducible, aperiodic Markov chains, whose transition probability matrix is primitive
- What is at stake: we need a transition matrix model that is guaranteed to converge, and that converges to a unique stationary distribution vector
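One way to check the primitivity condition on a small graph is sketched below. It relies on a standard result that is not stated on the slides (Wielandt's bound: an n×n non-negative matrix is primitive exactly when its power (n−1)²+1 is strictly positive), and the function names are hypothetical. Only the zero/non-zero pattern of the matrix matters, so the sketch works on booleans.

```python
def bool_matmul(X, Y):
    """Boolean matrix product: entry (i, j) is True if some path i -> k -> j exists."""
    n = len(X)
    return [[any(X[i][k] and Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def is_primitive(matrix):
    """True if some power up to (n-1)^2 + 1 has no zero entry."""
    n = len(matrix)
    base = [[bool(x) for x in row] for row in matrix]
    power = base
    for _ in range((n - 1) ** 2 + 1):
        if all(all(row) for row in power):
            return True
        power = bool_matmul(power, base)
    return False


two_cycle = [[0, 1], [1, 0]]             # periodic: the walk alternates forever
triangle = [[0.0, 0.0, 1.0],
            [0.5, 0.0, 0.0],
            [0.5, 1.0, 0.0]]             # irreducible and aperiodic
print(is_primitive(two_cycle), is_primitive(triangle))
```

The two-page cycle fails the test because it is periodic, which is the kind of chain for which repeated multiplication does not settle on a single distribution from every starting point.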

PageRank Expression
Let E(u) be some vector over the Web pages that corresponds to a source of rank. Then the PageRank of a set of Web pages is an assignment, R', to the Web pages which satisfies
$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)$
such that c is maximized and $\|R'\|_1 = 1$ ($\|R'\|_1$ denotes the L1 norm of R').
- R'(u): PageRank of document u
- R'(v): PageRank of a document v that links to u
- N_v: number of outlinks from document v
- c: normalization factor
- E: vector over the web pages that the surfer randomly jumps to

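Computing PageRank
S: any vector over the web pages
- $R_0 \leftarrow S$
- Loop:
  - Calculate the $R_{i+1}$ vector using $R_i$: $R_{i+1} \leftarrow A R_i$
  - Calculate the normalizing factor: $d \leftarrow \|R_i\|_1 - \|R_{i+1}\|_1$
  - Find the vector $R_{i+1}$ using d: $R_{i+1} \leftarrow R_{i+1} + dE$
  - Find the norm of the difference of the two vectors: $\delta \leftarrow \|R_{i+1} - R_i\|_1$
- Loop while $\delta > \epsilon$, i.e. until convergence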
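The loop on this slide can be sketched as below, under stated assumptions: `links` maps each page to its forward links, `E` is a rank-source vector whose entries sum to one, the start vector S is uniform, and the graph and threshold `epsilon` are hypothetical. Rank that flows into a page without forward links is lost in the multiplication step; d measures that loss and E puts it back.

```python
def pagerank(links, E, epsilon=1e-10, max_iter=1000):
    """Iterate R <- A R + d E until the L1 change is at most epsilon."""
    pages = sorted(links)
    R = {p: 1.0 / len(pages) for p in pages}          # R_0 <- S (uniform here)
    for _ in range(max_iter):
        R_next = {p: 0.0 for p in pages}
        for v, targets in links.items():              # R_{i+1} <- A R_i
            for u in targets:
                R_next[u] += R[v] / len(targets)
        d = sum(R.values()) - sum(R_next.values())    # rank lost in this step
        for p in pages:
            R_next[p] += d * E[p]                     # R_{i+1} <- R_{i+1} + d E
        delta = sum(abs(R_next[p] - R[p]) for p in pages)
        R = R_next
        if delta <= epsilon:                          # converged
            break
    return R


# Hypothetical graph; page D has no forward links.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A", "D"], "D": []}
E = {p: 0.25 for p in links}
ranks = pagerank(links, E)
print(ranks)
```

In this example C ends up with the highest rank, and the ranks still sum to one because the rank absorbed by D is redistributed through E on every pass.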

PageRank Implementation
1. Convert each URL into a unique integer ID
2. Sort the link structure by ID
3. Remove the dangling links
4. Make an initial assignment of ranks
5. Iteratively compute PageRank until convergence
6. Add the dangling links back
7. Recompute the rankings
NOTE: After adding the dangling links back, we need to iterate as many times as was required to remove the dangling links
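The first three steps can be sketched as follows. The URLs, the edge-list representation, and the function names are hypothetical. The sketch also shows why the NOTE talks about a number of iterations: removing one dangling link can turn its source page into a new dangling page, so removal takes several passes.

```python
def assign_ids(edges):
    """Map each URL to a unique integer ID, in order of first appearance."""
    ids = {}
    for src, dst in edges:
        for url in (src, dst):
            if url not in ids:
                ids[url] = len(ids)
    return ids


def remove_dangling(edges):
    """Repeatedly drop links whose target has no outgoing links.

    Returns the surviving edges and the number of removal passes needed.
    """
    edges = list(edges)
    passes = 0
    while True:
        sources = {src for src, _ in edges}
        kept = [(s, d) for s, d in edges if d in sources]
        if len(kept) == len(edges):
            return kept, passes
        edges = kept
        passes += 1


# Hypothetical crawl: d.example was never downloaded, so c -> d is dangling.
edges = [
    ("a.example", "b.example"),
    ("b.example", "a.example"),
    ("b.example", "c.example"),
    ("c.example", "d.example"),
]
ids = assign_ids(edges)
id_edges = sorted((ids[s], ids[d]) for s, d in edges)   # link structure sorted by ID
kept, passes = remove_dangling(edges)
print(ids, id_edges, kept, passes)
```

Here two passes are needed: dropping the link into d.example leaves c.example without forward links, so the link into c.example becomes dangling in turn.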

The mechanism
- Web Crawler: finds and retrieves pages on the web
- Repository: web pages are compressed and stored here
- Indexer: each index entry has a list of documents in which the term appears and the location within the text where it occurs

Convergence
- PR (322 million links): 52 iterations
- PR (161 million links): 45 iterations
- The scaling factor is roughly linear in log n

Adaptive Methods for the computation of PageRank
This paper presents two contributions. First, it shows that most pages in the web converge to their true PageRank quickly, while relatively few pages take much longer to converge. It further shows that the slow-converging pages generally have high PageRank, and the pages that converge quickly generally have low PageRank. Experimental results support the findings:

Experimental results

Adaptive Algorithms
Second, the authors develop two algorithms, called Adaptive PageRank and Modified Adaptive PageRank, that exploit this observation to speed up the computation of PageRank by 18% and 28%, respectively. All the proposed algorithms share the same main idea: reduce the cost of each iteration by not recomputing the PageRank of pages that have already converged.
Notation:
- A: one-step transition probability matrix
- x^(k): probability distribution vector at time k
- N: pages that have not yet converged; C: pages that have converged
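A minimal sketch of the shared idea, not the authors' algorithm: pages whose value changes by less than a relative threshold are moved from N to C and are no longer recomputed. The threshold `page_tol`, the two-page matrix, and the stopping rule (stop once every page is in C) are assumptions made for illustration, and the sketch makes no attempt to detect pages that were frozen too early.

```python
def adaptive_pagerank(A, x0, page_tol=1e-3, max_iter=1000):
    """Power iteration that skips rows of pages already marked converged.

    A is column-stochastic: A[i][j] is the probability of moving from j to i.
    Returns the final vector, the iteration count, and the rows computed.
    """
    n = len(x0)
    x = list(x0)
    converged = [False] * n                 # the set C; everything else is N
    rows_computed = 0
    for iteration in range(1, max_iter + 1):
        nxt = list(x)                       # pages in C keep their old value
        for i in range(n):
            if converged[i]:
                continue
            nxt[i] = sum(A[i][j] * x[j] for j in range(n))
            rows_computed += 1
        for i in range(n):
            if not converged[i] and abs(nxt[i] - x[i]) < page_tol * abs(x[i]):
                converged[i] = True
        x = nxt
        if all(converged):
            break
    return x, iteration, rows_computed


# Hypothetical two-page chain whose exact stationary vector is (1/3, 2/3).
A = [[0.5, 0.25],
     [0.5, 0.75]]
x, iterations, rows_computed = adaptive_pagerank(A, [0.5, 0.5])
print(x, iterations, rows_computed)
```

The result is an approximation whose accuracy depends on `page_tol`; the saving shows up as `rows_computed` being smaller than the number of pages times the number of iterations.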

Adaptive PageRank

Filter-based Adaptive PageRank
- Reordering the matrix A at each iteration is expensive
- The filter-based variant reduces this cost by zeroing out entries of A (making it sparser) rather than reordering it

Filter-based Modified Adaptive PageRank
- Reduces redundant computation by not recomputing the components of the PageRanks of the pages in N that are due to links from pages in C
- This splits A even further

Performance comparison

Searching with PageRank
- Two search engines: a title-based search engine and a full-text search engine
- Title-based search engine
  - Searches only the titles
  - Finds all the web pages whose titles contain all the query words
  - Sorts the results by PageRank
  - Very simple and cheap to implement
  - Title match ensures high precision, and PageRank ensures high quality
- Full-text search engine
  - Called Google
  - Examines all the words in every stored document and also uses PageRank (rank merging)
  - More precise but more complicated
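The title-based engine described here amounts to a filter followed by a sort, which can be sketched as below. The URLs, titles, and rank values are made up for illustration, and whole-word, case-insensitive matching is an assumption.

```python
def title_search(query, titles, pagerank):
    """Return URLs whose title contains every query word, best PageRank first."""
    words = query.lower().split()
    hits = [
        url for url, title in titles.items()
        if all(word in title.lower().split() for word in words)
    ]
    return sorted(hits, key=lambda url: pagerank[url], reverse=True)


# Hypothetical index and precomputed ranks.
titles = {
    "u1.example": "Example State University",
    "u2.example": "University of Example",
    "shop.example": "Example Book Shop",
}
pagerank = {"u1.example": 0.2, "u2.example": 0.5, "shop.example": 0.3}
results = title_search("university", titles, pagerank)
print(results)
```

The shop page has a higher rank than one of the universities but is filtered out by the title match, which is the precision half of the slide's claim; the sort supplies the quality half.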

Title-based search for University

Personalized PageRank
- An important component of the PageRank calculation is E, a vector over the web pages that is used as a source of rank
- E is a powerful parameter for adjusting the page ranks
- The E vector corresponds to the distribution of web pages that a random surfer periodically jumps to
- General search over the internet: an E vector that is uniform over all the web pages results in some web pages with many related links receiving an overly high rank, e.g. copyright pages or forums
- In Personalized PageRank, E instead consists of a single web page
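The effect of concentrating E on one page can be sketched as below. The sketch iterates R ← c(A R + E), with c chosen so that the ranks sum to one; the graph, the total weight 0.15 given to E, and the fixed iteration count are illustrative assumptions, and every page is assumed to have at least one forward link.

```python
def pagerank_with_source(links, E, iterations=200):
    """Iterate R <- c * (A R + E), normalizing so the ranks sum to one."""
    pages = sorted(links)
    R = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        nxt = dict(E)                              # start from the rank source
        for v, targets in links.items():
            for u in targets:
                nxt[u] += R[v] / len(targets)      # add the votes from backlinks
        total = sum(nxt.values())
        R = {p: r / total for p, r in nxt.items()}
    return R


links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

uniform_E = {"A": 0.05, "B": 0.05, "C": 0.05}      # general search
personal_E = {"A": 0.0, "B": 0.15, "C": 0.0}       # all rank source on page B

uniform = pagerank_with_source(links, uniform_E)
personal = pagerank_with_source(links, personal_E)
print(uniform["B"], personal["B"])
```

Page B receives a higher rank under the personalized source than under the uniform one, and the pages B links to benefit in turn.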

Applications
- Estimating Web Traffic
  - On analyzing the usage statistics, it was found that some sites have very high usage but low PageRank, e.g. links to pirated software
- PageRank as Backlink Predictor
  - The goal is to crawl the pages in as close to the optimal order as possible, i.e., in the order of their rank according to an evaluation function
  - PageRank is a better predictor than citation counting
- User Navigation: The PageRank Proxy
  - The user receives some information about a link before clicking on it
  - This proxy can help users decide which links are more likely to be interesting

Conclusion
- PageRank is a global ranking of all web pages based on their locations in the web graph structure
- PageRank uses information which is external to the web pages themselves: backlinks
- Backlinks from important pages are more significant than backlinks from average pages
- The structure of the web graph is very useful for information retrieval tasks

References
- L. Page, S. Brin, R. Motwani, T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web, 1998.
- S. Kamvar, T. Haveliwala, G. Golub. Adaptive methods for the computation of PageRank. Linear Algebra and its Applications 386, 2004, pp. 51–65. Elsevier Inc.
- S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine, 1998.
- K. Bryan and T. Leise. The $25,000,000,000 Eigenvector: The Linear Algebra behind Google.
- Google Corporate Information: lecture3.html

Thank You! Q&A