The math behind PageRank A detailed analysis of the mathematical aspects of PageRank Computational Mathematics class presentation Ravi S Sinha LIT lab,

Presentation transcript:

The math behind PageRank
A detailed analysis of the mathematical aspects of PageRank
Computational Mathematics class presentation
Ravi S Sinha, LIT lab, UNT

Partial citations of references
- The Anatomy of a Large-Scale Hypertextual Web Search Engine: Sergey Brin and Lawrence Page
- Inside PageRank: Monica Bianchini, Marco Gori, and Franco Scarselli
- Deeper Inside PageRank: Amy Langville and Carl Meyer
- Efficient Computation of PageRank: Taher Haveliwala
- Topic Sensitive PageRank: Taher Haveliwala

Overview of the talk
- Why PageRank
- What is PageRank
- How PageRank is used
- Math
- More math
- Remaining math

Why PageRank
- Need to build a better automatic search engine. Why?
  - Human-maintained lists are subjective and expensive to build (non-automatic)
  - Automatic engines based on keyword matching do a horrible job (page content alone is not enough; cleverly placed words in a page can mislead search engines)
  - Advertisers sometimes mislead search engines
- Solution: Google [modern day: much more than PageRank; getting smarter]
  - Exact technology: not in the public domain
  - Core technology: PageRank (uses the link structure)
- Other uses
  - Any problem that can be cast as a graph problem where the centrality of the vertices needs to be computed (NLP, etc.)

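What is PageRank
- A way to find the most ‘important’ vertices in a graph
- PR(A) = (1-d) + d [ PR(T1) / C(T1) + … + PR(Tn) / C(Tn) ]
  - T1 … Tn: the pages that link to A; C(T): number of outgoing links of page T; d: damping factor
- Forms a probability distribution over the vertices [sum = 1] when the (1-d) term is divided by the number of pages N; as written, the scores sum to N (absent dangling pages)
- How does this relate to Web search?
  - Vertices = pages
  - Incoming edges = hyperlinks from other pages
  - Outgoing edges = hyperlinks to other pages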
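The formula can be iterated directly until the scores settle. The following is a minimal sketch, not the production algorithm: the function name `pagerank`, the fixed iteration count, and the three-page graph `toy_links` are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over all pages T linking to A.

    `links` maps each page to the list of pages it links to (no duplicates).
    """
    pages = list(links)
    pr = {page: 1.0 for page in pages}  # arbitrary starting scores
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each page T that links here passes on PR(T) split over its C(T) outgoing links.
            incoming = sum(pr[src] / len(links[src]) for src in pages if page in links[src])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical toy graph: A links to B and C, B links to C, C links to A.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(toy_links)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

In this toy graph every page has outgoing links, so the scores produced by this form of the formula add up to the number of pages (3); dividing each score by 3 gives a distribution that sums to 1.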

Simple visualization: the simplest variant of PageRank in use [user behavior]
- Random surfer
- Damping factor
- Only one incoming link, yet high PageRank
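The random-surfer picture can be tried out with a small simulation. This is a sketch under assumptions: the function `simulate_surfer`, the four-page graph `web`, the step count, and the fixed seed are all hypothetical choices. With probability d the surfer follows a random outgoing link; otherwise the surfer jumps to a page chosen uniformly at random.

```python
import random


def simulate_surfer(links, d=0.85, steps=200_000, seed=0):
    """Return the fraction of time a random surfer spends on each page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = pages[0]
    for _ in range(steps):
        visits[current] += 1
        if links[current] and rng.random() < d:
            current = rng.choice(links[current])  # follow a random outgoing link
        else:
            current = rng.choice(pages)  # bored (or stuck on a dangling page): jump anywhere
    return {page: visits[page] / steps for page in pages}


# Hypothetical graph: D has a single incoming link, but it comes from the well-linked page C.
web = {"A": ["C"], "B": ["C"], "C": ["D"], "D": ["A", "B"]}
frequencies = simulate_surfer(web)
for page in sorted(frequencies, key=frequencies.get, reverse=True):
    print(page, round(frequencies[page], 3))
```

Page D illustrates the last bullet: its only incoming link comes from C, which passes on all of its own score, so D is visited far more often than A or B.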

Lexical Substitution: A crash course
- Trivial for humans, not for machines
- Math, statistics, linguistics wrapped within computer programs and algorithms
- Information retrieval, machine translation, question answering, information security [information hiding in text]

PageRank in use: Lexical Substitution
- Weights: word similarity
- Directed/undirected: whole other realm

And now, the cool stuff

The math behind PageRank
- Intuitive correctness
- Mathematical foundation
- Stability
- Complexity of computational scheme
- Critical role of the parameters involved
- The distribution of the page score
- Role of dangling pages
- How to promote certain vertices (Web pages)

Intuitive correctness
- Concept of ‘voting’
  - Related to citation in scientific literature
  - More citations indicate a great/important piece of work
- Random surfer / random walk
- A page with many links to it must be important
- A very important page must point to something equally important

Mathematical foundation
- Most researchers: Markov chains
  - Caveat: only applicable in the absence of dangling nodes
- Basic idea: authority of a Web page is unrelated to its contents [comes from the link structure]
- Simple representation
- Vector representation
  - I_N = [1, 1, 1, …, 1]^T
  - Transition matrix: ∑(each column) = 1 or 0
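The "1 or 0" column sums are easy to see on a small example. The sketch below assumes a column convention in which entry W[i][j] is the share of page j's score passed to page i; the helper name `transition_matrix` and the three-page graph are hypothetical.

```python
def transition_matrix(links, pages):
    """Build W with W[i][j] = 1 / (out-degree of page j) if page j links to page i, else 0."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    W = [[0.0] * n for _ in range(n)]
    for src, targets in links.items():
        for dst in targets:  # assumes each target is listed once
            W[index[dst]][index[src]] = 1.0 / len(targets)
    return W


# Hypothetical graph in which C is a dangling page (no outgoing links).
pages = ["A", "B", "C"]
links = {"A": ["B", "C"], "B": ["C"], "C": []}
W = transition_matrix(links, pages)
column_sums = [sum(W[i][j] for i in range(len(pages))) for j in range(len(pages))]
print(column_sums)
```

Columns for pages with outgoing links sum to 1, while the column of the dangling page C sums to 0, which is exactly why the plain Markov chain interpretation needs the no-dangling-nodes caveat.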

Mathematical foundation (2)
- Google’s iterative version: converges to a stationary solution
- Jacobi algorithm
- Alternative computation
  - ||x(t)||_1 = 1; normalized
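One way to keep the score vector normalized at every step is to hand back, uniformly, whatever mass the link-following step does not carry along. The sketch below assumes that particular redistribution rule; the function name `normalized_step` and the example matrix are hypothetical, and the exact scheme on the original slide may differ.

```python
def normalized_step(W, x, d=0.85):
    """One iteration that keeps sum(x) == 1: follow links, then spread the lost mass uniformly."""
    n = len(x)
    followed = [d * sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
    lost = sum(x) - sum(followed)  # mass dropped by damping and by dangling pages
    return [value + lost / n for value in followed]


# Hypothetical column-oriented transition matrix; the last column is a dangling page.
W = [
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
x = [1.0 / 3] * 3  # start from the uniform distribution
for _ in range(200):
    x = normalized_step(W, x)
print([round(value, 4) for value in x])
```

Because the lost mass is always added back, the 1-norm of x stays at 1 throughout, so the result can be read directly as a probability distribution even when dangling pages are present.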

Web communities: Energy balance [measure of authority]

More on energy
- Migration of scores across the graph
- Lessons: to maximize energy
  - Maximize E(in): references from others
  - Minimize E(out): external links
  - Minimize E(dp): dangling pages
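The balance behind these lessons can be written out explicitly. The block below is a reconstruction, not a quotation from the slides: it assumes the update rule x = d W x + (1 - d) I_N, defines the energy of a community I as the sum of the scores of its pages, and introduces the notation in(I), out(I), dp(I) and f_i for illustration.

```latex
% Energy of a community I (a set of pages), assuming x = d W x + (1 - d) \mathbb{1}_N
E_I = \sum_{p \in I} x_p = |I| + E_I^{in} - E_I^{out} - E_I^{dp}

% in(I):  pages outside I that link into I
% out(I): pages inside I that link outside I
% dp(I):  dangling pages inside I
% f_i:    fraction of page i's hyperlinks that point into I
E_I^{in}  = \frac{d}{1-d} \sum_{i \in in(I)}  f_i \, x_i
E_I^{out} = \frac{d}{1-d} \sum_{i \in out(I)} (1 - f_i) \, x_i
E_I^{dp}  = \frac{d}{1-d} \sum_{i \in dp(I)}  x_i
```

Summing the update rule over the pages of I and collecting terms gives this identity: score flowing in from outside raises the community's energy, while score sent to external pages or stranded on dangling pages lowers it.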

Even more on energy [community promotion]
1. Split same content into smaller vertices
2. Avoid dangling pages
3. Avoid many outgoing links

Page promotion
- Treat certain pages as communities
- Bias certain pages by using a non-uniform distribution in the vector I_N
- Tinker with the connectivity [PageRank is proved to be affected by the regularity of the connection pattern]
- Original I_N = [1, 1, 1, …, 1]^T
- Biased I_N = [1, 1.5, 1.25, …, 1]^T
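The effect of a biased vector can be checked on a toy graph. This sketch assumes the bias simply replaces the constant 1 in the (1 - d) term, page by page; the function name `biased_pagerank`, the graph, and the bias values are hypothetical.

```python
def biased_pagerank(links, bias, d=0.85, iterations=200):
    """Iterate PR(p) = (1 - d) * bias[p] + d * sum(PR(q) / C(q)) over pages q linking to p."""
    pr = dict(bias)
    for _ in range(iterations):
        pr = {
            page: (1 - d) * bias[page]
            + d * sum(pr[src] / len(out) for src, out in links.items() if page in out)
            for page in links
        }
    return pr


# Hypothetical graph with no dangling pages.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
uniform = biased_pagerank(links, {"A": 1.0, "B": 1.0, "C": 1.0})
boosted = biased_pagerank(links, {"A": 1.0, "B": 1.5, "C": 1.0})
for page in links:
    print(page, round(uniform[page], 4), round(boosted[page], 4))
```

Raising the entry for B lifts B's own score directly, and the extra score then flows along B's outgoing links, so the pages B points to benefit as well.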

Computation of PageRank
- PageRank can be computed on a graph changing over time
  - Practical interest [the Web is alive]
- An optimal algorithm exists for computing PageRank
  - Practical applications: search engines, PageRank on billions of pages: efficiency!
  - O(|H| log(1/ε)), with |H| the number of hyperlinks and ε the target accuracy
  - NOT dependent on the connectivity or other dimensions
  - Ideal computation: stops when the ranking of the vertices no longer changes between two successive iterations [convergence]
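The shape of that cost can be seen in a sparse implementation: each iteration touches every hyperlink once, and the change between iterations shrinks by at least a factor d each time, so the number of iterations grows like log(1/ε). The sketch below is an illustration under assumptions (the function name, the graph, and the L1 stopping test are hypothetical choices), not the algorithm from the cited paper.

```python
def pagerank_until_converged(links, d=0.85, eps=1e-8, max_iter=1000):
    """Iterate over the link lists until the L1 change between iterations drops below eps."""
    pages = list(links)
    x = {page: 1.0 for page in pages}
    for iteration in range(1, max_iter + 1):
        new_x = {page: 1 - d for page in pages}
        for src, targets in links.items():  # one pass over every hyperlink
            for dst in targets:
                new_x[dst] += d * x[src] / len(targets)
        delta = sum(abs(new_x[page] - x[page]) for page in pages)
        x = new_x
        if delta < eps:
            return x, iteration
    return x, max_iter


# Hypothetical four-page graph.
web = {"A": ["C"], "B": ["C"], "C": ["D"], "D": ["A", "B"]}
ranks, iterations = pagerank_until_converged(web)
ranking = sorted(ranks, key=ranks.get, reverse=True)
print(ranking, iterations)
```

A ranking-based stopping rule, as described on the slide, would compare the sorted order of pages between iterations instead of the numeric change; the numeric test is used here only because it is simpler to state.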

The Markov model from the Web
- The PageRank vector is only guaranteed to exist (and be unique) if the Markov chain is irreducible
- By nature, the Web is non-bipartite, sparse, and produces a reducible Markov chain
- The Web hyperlink matrix is forced to be
  - Stochastic [non-negative entries, all columns sum to 1]
    - Remove dangling nodes, or replace the relevant rows/columns with a small value, usually (1/n) e^T
    - Introduce a personalization vector
  - Primitive
    - Non-negative
    - At least one positive element on the main diagonal
    - Irreducible

More on the Markov structure
- A convex combination of the original stochastic matrix and a stochastic perturbation matrix
  - Produces a stochastic, irreducible matrix
  - The PageRank vector is guaranteed to exist for this matrix
- Every node is directly connected to every other node, all probabilities non-zero
  - Irreducible Markov chain, will converge
- [Example matrices: entries garbled in the transcript]
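The two fixes (making the matrix stochastic, then mixing in a perturbation) can be sketched in a few lines. This assumes the column convention, a uniform replacement for dangling columns, and a uniform perturbation matrix with every entry 1/n; the name `google_matrix` and the weight `alpha` are hypothetical labels for illustration.

```python
def google_matrix(W, alpha=0.85):
    """Return G = alpha * S + (1 - alpha) * E, where S is W with dangling columns
    replaced by the uniform column 1/n, and E has every entry equal to 1/n."""
    n = len(W)
    G = [[0.0] * n for _ in range(n)]
    for j in range(n):
        column_sum = sum(W[i][j] for i in range(n))
        for i in range(n):
            s_ij = W[i][j] if column_sum > 0 else 1.0 / n  # fix dangling columns
            G[i][j] = alpha * s_ij + (1 - alpha) / n  # convex combination with E
    return G


# Hypothetical column-oriented hyperlink matrix; the last column is a dangling page.
W = [
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
G = google_matrix(W)
for row in G:
    print([round(entry, 4) for entry in row])
```

Every entry of G is strictly positive and every column sums to 1, so each page can reach every other page in one step, which is what makes the chain irreducible.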

There’s more to PageRank
- Computation
  - Power method
    - Notoriously slow in general, yet the method of choice here
    - Requires no computation of intermediate matrices
    - Converges quickly in practice
  - Linear systems method
- The damping factor [usually 0.85]
  - Greater value: more iterations required
  - ‘Truer’ PageRanks
- Dangling pages
- Storage issues
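The two computational routes can be compared on a toy problem. The sketch below assumes the fixed-point form x = d W x + (1 - d) I_N, so the linear systems route solves (I - d W) x = (1 - d) I_N; the helper `solve_linear_system` is a plain Gaussian elimination written for illustration, and the matrix is hypothetical.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small dense systems only)."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


d = 0.85
# Hypothetical column-oriented matrix for A -> B, A -> C, B -> C, C -> A.
W = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
n = len(W)

# Linear systems method: (I - d W) x = (1 - d) * ones.
A = [[(1.0 if i == j else 0.0) - d * W[i][j] for j in range(n)] for i in range(n)]
direct = solve_linear_system(A, [1 - d] * n)

# Power-style iteration on the same fixed-point equation.
iterated = [1.0] * n
for _ in range(200):
    iterated = [(1 - d) + d * sum(W[i][j] * iterated[j] for j in range(n)) for i in range(n)]

print([round(value, 4) for value in direct])
print([round(value, 4) for value in iterated])
```

Both routes agree on this example. The iteration only ever multiplies by W, which is why it scales to sparse Web-sized matrices, whereas dense elimination as written here is practical only for small graphs.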

The end [for today] Thanks for listening!