1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 March 23, 2005


2 Ranking Algorithms

3 Outline
- The ranking problem
- PageRank
- HITS (Hubs & Authorities)
- Markov Chains and Random Walks
- PageRank and HITS computation

4 The Ranking Problem
Input:
- D: document collection
- Q: query space
Goal: Find a ranking function rank: D × Q → R such that
- rank and q induce a ranking (partial order) ≤_q on D
- Same as the “relevance scoring function” from the previous lecture

5 Text-based Ranking
Classical ranking functions:
- Keyword-based boolean ranking
- Cosine similarity + TF-IDF scores
Limitations in the context of web search:
- The “abundance problem”
  Recall is not important
- Short queries
- Web pages are poor in text
- Synonymy (cars vs. autos)
- Polysemy (java, “Michael Jordan”)
- Spam

6 Link-based Ranking
Hyperlinks carry important semantics:
- Recommendation
- Critique
- Navigation
Hypertext IR Principle #1: If p → q, then q is “relevant” to p.
Hypertext IR Principle #2: If p → q, then p confers “authority” to q.

7 Static Ranking
Static ranking: rank: D → R, where rank(d) > rank(d’) implies d is more “authoritative” than d’.
- Use links to come up with a static ranking of all web pages.
- Given a query q, use text-based ranking to identify a set S of candidate relevant pages.
- Order S by static rank.
Advantage: static ranking can be computed in a preprocessing step.
Disadvantage: no use of Hypertext IR Principle #1.

8 Query-Dependent Ranking
- Given a query q, use text-based ranking to identify a set S of candidate relevant pages.
- Use links within S to come up with a ranking rank: S → R, where rank(d) > rank(d’) implies d is more authoritative than d’ with respect to q.
Advantage: both Hypertext IR principles are exploited.
Disadvantage: less efficient.

9 The Web as a Graph
V = a set of pages
- In static ranking, V = the whole web
- In query-dependent ranking, V = S
The Web Graph: G = (V,E), where (p,q) is an edge iff p has a hyperlink to q
A = adjacency matrix of G
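As a small illustration of the definition above, the sketch below builds the adjacency matrix A from a list of hyperlinks. The page names, the `links` list, and the helper name `adjacency_matrix` are made up for this example and do not come from the lecture.

```python
def adjacency_matrix(pages, links):
    """Return A with A[i][j] = 1 iff page i has a hyperlink to page j."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    A = [[0] * n for _ in range(n)]
    for src, dst in links:
        A[index[src]][index[dst]] = 1
    return A


# Hypothetical three-page web: a links to b and c, b links to c.
pages = ["a", "b", "c"]
links = [("a", "b"), ("a", "c"), ("b", "c")]
A = adjacency_matrix(pages, links)
```

Row i of A lists the outlinks of page i, and column j lists its inlinks.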

10 Popularity Ranking
rank(p) = in-degree(p)
Advantages:
- Most important pages extracted from millions of matches
- No need for text-rich documents
- Efficiently computable
Disadvantages:
- Bias towards popular pages, irrespective of query
- Easily spammable
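A minimal sketch of popularity ranking, assuming the graph is given as a 0/1 adjacency matrix with A[p][q] = 1 iff p links to q. The example matrix and the helper name `in_degree_scores` are illustrative only.

```python
def in_degree_scores(A):
    """rank(q) = in-degree(q) = sum of column q of the adjacency matrix."""
    n = len(A)
    return [sum(A[p][q] for p in range(n)) for q in range(n)]


# Hypothetical graph: page 0 links to 1 and 2, page 1 links to 2.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
scores = in_degree_scores(A)
# Pages ordered from most to least popular.
order = sorted(range(len(A)), key=lambda q: -scores[q])
```

Because every inlink counts equally, adding links from throwaway pages raises a score directly, which is the spam weakness listed above.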

11 PageRank [Page, Brin, Motwani, Winograd 1998]
Motivating principles:
- Rank of p should be proportional to the rank of the pages that point to p
  Recommendations from Bill Gates & Steve Jobs vs. from Moishale and Ahuva
- Rank of p should depend on the number of pages “co-cited” with p
  Compare: Bill Gates recommends only me vs. Bill Gates recommends everyone on earth

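12 PageRank, Attempt #1
r = rank vector
B = normalized adjacency matrix: B_{p,q} = 1/out-degree(p) if p → q, and 0 otherwise
r(p) = Σ_{q → p} r(q) / out-degree(q), i.e., r = rB
Then:
- r is a left eigenvector of B with eigenvalue 1
- B must have 1 as an eigenvalue
- Since some rows of B are 0, 1 is not necessarily an eigenvalue
- Rank is “lost” in sinks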
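The following sketch illustrates how rank is lost in sinks. It assumes B is the row-normalized adjacency matrix (rows of pages with no outlinks stay all zero) and repeatedly applies r ← rB on a made-up three-page graph; the helper names are hypothetical.

```python
def normalize_rows(A):
    """Divide each row by its out-degree; rows of sinks stay all zero."""
    B = []
    for row in A:
        degree = sum(row)
        B.append([x / degree if degree else 0.0 for x in row])
    return B


def left_multiply(r, B):
    """Return the row vector r B."""
    n = len(B)
    return [sum(r[p] * B[p][q] for p in range(n)) for q in range(n)]


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2; page 2 is a sink.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
B = normalize_rows(A)

r = [1 / 3, 1 / 3, 1 / 3]
totals = []
for _ in range(3):
    r = left_multiply(r, B)
    totals.append(sum(r))  # total rank remaining after each step
```

On this graph the total rank shrinks at every step and reaches zero, because whatever arrives at the sink is never passed on.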

13 PageRank, Attempt #2
r(p) = α · Σ_{q → p} r(q) / out-degree(q), i.e., r = α rB, where α is a normalization constant
Then:
- r is a left eigenvector of B with eigenvalue 1/α
- Any left eigenvector will do.
- Usually will use normalized principal eigenvector.
- Rank accumulates at sinks and sink communities.

14 PageRank, Attempt #2: Example
[Figure: three worked examples labeled I, II, and III; the graphs and most rank values were lost in extraction.]

15 PageRank, Final Definition
E(p) = “rank source” function
- Standard setting: E(p) = β/|V| for some β < 1
- pagerank is normalized to unit L1 norm
e = rank source vector, r = pagerank vector, 1 = the all 1’s vector
r(p) = α · (Σ_{q → p} r(q) / out-degree(q) + E(p)), i.e., r = α (rB + e^T) = α r (B + 1e^T), using r1 = 1
Then:
- r is a left eigenvector of (B + 1e^T) with eigenvalue 1/α
- Use normalized principal eigenvector.

16 The Random Surfer Model
When visiting a page p, a “random surfer”:
- With probability 1 - ε, selects a random outlink p → q and goes to visit q. (“focused browsing”)
- With probability ε, jumps to a random web page q. (“loss of interest”)
- If p has no outlinks, assume it has a self loop.
P: the probability transition matrix of this walk.
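One way to write down the transition matrix of this walk is sketched below. It assumes that the random jump is uniform over all pages and that outlinks are chosen uniformly; the jump probability is called `eps` here, and the function name `random_surfer_matrix` is hypothetical.

```python
def random_surfer_matrix(A, eps):
    """Build P for the random surfer on a 0/1 adjacency matrix A.

    With probability 1 - eps follow a uniformly chosen outlink,
    with probability eps jump to a uniformly chosen page.
    Pages without outlinks are given a self loop first.
    """
    n = len(A)
    P = []
    for p in range(n):
        row = list(A[p])
        if sum(row) == 0:
            row[p] = 1  # sink: add the self loop
        degree = sum(row)
        P.append([(1 - eps) * row[q] / degree + eps / n for q in range(n)])
    return P


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2; page 2 is a sink.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
P = random_surfer_matrix(A, eps=0.15)
```

Every row of P sums to 1 and every entry is at least eps/n, so each page can be reached from every other page in one step.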

17 PageRank & Random Surfer Model
Suppose: [equation missing from transcript]
Then: [equation missing from transcript]
Therefore, r is a left eigenvector of (B + 1e^T) with eigenvalue 1/(1 - ε) iff it is a left eigenvector of P with eigenvalue 1.
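The equivalence can be checked by hand under some simplifying assumptions, which may differ from the conditions on the original slide: every page has at least one outlink (so B is row-stochastic, B1 = 1), the rank source is uniform with e = (β/|V|)·1, the jump is uniform, and r is non-negative with r1 = 1. The symbols α (normalization constant), β (rank source strength), and ε (jump probability) are labels chosen for this sketch.

```latex
\begin{align*}
r &= \alpha\, r\,(B + \mathbf{1}e^{T}) = \alpha\, rB + \alpha\, e^{T}
  && \text{since } r\mathbf{1} = 1 \\
1 &= \alpha\,(rB\mathbf{1} + e^{T}\mathbf{1}) = \alpha\,(1 + \beta)
  && \text{multiply on the right by } \mathbf{1} \\
\alpha &= \frac{1}{1+\beta}, \qquad
  \varepsilon := \frac{\beta}{1+\beta} = 1 - \alpha \\
r &= (1-\varepsilon)\, rB + \frac{\varepsilon}{|V|}\,\mathbf{1}^{T}
   = r\Big[(1-\varepsilon)\,B + \frac{\varepsilon}{|V|}\,\mathbf{1}\mathbf{1}^{T}\Big] = rP
\end{align*}
```

Under these assumptions the eigenvalue 1/α equals 1/(1 - ε), and the bracketed matrix is the random surfer transition matrix P, so the two eigenvector conditions coincide.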

18 Markov Chain Primer
V = state space
P = probability transition matrix
- Non-negative.
- Sum of each row is 1.
q_0 = initial distribution on V
q_t = q_0 P^t: distribution on V after t steps
P is ergodic if it is:
- Irreducible (underlying graph is strongly connected)
- Aperiodic (for all states u,v, the gcd of the lengths of paths from u to v is 1)
Theorem: If P is ergodic, then it has a “stationary distribution” π. Furthermore, for all q_0, q_t → π as t tends to infinity.
- πP = π.
- π is a left eigenvector of P with eigenvalue 1.
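The theorem can be illustrated numerically. The sketch below uses a made-up two-state ergodic chain whose stationary distribution is (5/6, 1/6) and checks that q_t = q_0 P^t approaches it from two different starting distributions. The chain and the helper name `distribution_after` are illustrative assumptions.

```python
def distribution_after(q0, P, t):
    """Return q_t = q_0 P^t by applying t single steps q <- q P."""
    n = len(P)
    q = list(q0)
    for _ in range(t):
        q = [sum(q[u] * P[u][v] for u in range(n)) for v in range(n)]
    return q


# Hypothetical ergodic chain on two states.
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = [5 / 6, 1 / 6]  # solves pi P = pi with entries summing to 1

from_state_0 = distribution_after([1.0, 0.0], P, 200)
from_state_1 = distribution_after([0.0, 1.0], P, 200)
```

Both runs end at (numerically) the same vector, which matches the statement that the limit does not depend on q_0.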

19 PageRank & Markov Chains
Conclusion: The pagerank vector r is the stationary distribution of the random surfer Markov Chain.
pagerank(p) = r_p = probability that the random surfer visits p in the limit.
Note: the “random jump” guarantees that the Markov Chain is irreducible and aperiodic.

20 PageRank Computation
In practice: about 50 iterations suffice.
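The slide's algorithm did not survive in the transcript. A common way to compute pagerank, shown here as a sketch rather than as the lecture's exact procedure, is to start from the uniform distribution and repeatedly apply r ← rP for the random surfer matrix P. The uniform jump, the value `eps=0.15`, and the function name `pagerank` are assumptions of this example.

```python
def pagerank(A, eps=0.15, iterations=50):
    """Approximate the stationary distribution of the random surfer walk."""
    n = len(A)
    # Transition matrix: self loops for sinks, uniform random jump.
    P = []
    for p in range(n):
        row = list(A[p])
        if sum(row) == 0:
            row[p] = 1
        degree = sum(row)
        P.append([(1 - eps) * row[q] / degree + eps / n for q in range(n)])
    # Power iteration on the left: r <- r P.
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [sum(r[p] * P[p][q] for p in range(n)) for q in range(n)]
    return r, P


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
r, P = pagerank(A)
# One more step should barely change r if it has converged.
r_next = [sum(r[p] * P[p][q] for p in range(3)) for q in range(3)]
```

In this example page 1, which has a single inlink carrying half of page 0's rank, ends up with the lowest score.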

21 HITS: Hubs and Authorities [Kleinberg, 1997]
HITS: Hyperlink Induced Topic Search
Main principle: every page p is associated with two scores:
- Authority score: how “authoritative” a page is about the query’s topic
  Ex: query: “IR”; authorities: scientific IR papers
  Ex: query: “automobile manufacturers”; authorities: Mazda, Toyota, and GM web sites
- Hub score: how good the page is as a “resource list” about the query’s topic
  Ex: query: “IR”; hubs: surveys and books about IR
  Ex: query: “automobile manufacturers”; hubs: KBB, car link lists

22 Mutual Reinforcement
HITS principles:
- p is a good authority if it is linked to by many good hubs.
- p is a good hub if it points to many good authorities.

23 HITS: Algebraic Form
a: authority vector
h: hub vector
A: adjacency matrix
Then: a = A^T h and h = A a (up to normalization)
Therefore: a ∝ A^T A a and h ∝ A A^T h, so
- a is principal eigenvector of A^T A
- h is principal eigenvector of AA^T

24 Co-Citation and Bibliographic Coupling
A^T A: co-citation matrix
- (A^T A)_{p,q} = # of pages that link both to p and to q.
- Thus: authority scores propagate through co-citation.
AA^T: bibliographic coupling matrix
- (AA^T)_{p,q} = # of pages that both p and q link to.
- Thus: hub scores propagate through bibliographic coupling.
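The two counting interpretations can be verified on a tiny graph. The sketch below computes A^T A and AA^T with plain nested lists; the example graph and the helper names `co_citation` and `bibliographic_coupling` are made up for illustration.

```python
def co_citation(A):
    """(A^T A)[p][q] = number of pages k that link to both p and q."""
    n = len(A)
    return [[sum(A[k][p] * A[k][q] for k in range(n)) for q in range(n)]
            for p in range(n)]


def bibliographic_coupling(A):
    """(A A^T)[p][q] = number of pages k that both p and q link to."""
    n = len(A)
    return [[sum(A[p][k] * A[q][k] for k in range(n)) for q in range(n)]
            for p in range(n)]


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
C = co_citation(A)
K = bibliographic_coupling(A)
```

Here pages 1 and 2 are co-cited once (by page 0), and pages 0 and 1 are coupled once (both link to page 2). The diagonals hold in-degrees and out-degrees respectively.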

25 HITS Computation
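The body of this slide is missing from the transcript. The sketch below shows the standard iterative scheme implied by the mutual reinforcement principle (update authorities from hubs, update hubs from authorities, then normalize); it is an assumption that the lecture used this form. The function name `hits`, the L2 normalization, and the example graph are illustrative choices.

```python
import math


def hits(A, iterations=50):
    """Iterate a <- A^T h, h <- A a with L2 normalization after each round."""
    n = len(A)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        # A page's authority is the sum of hub scores of pages linking to it.
        a = [sum(A[q][p] * h[q] for q in range(n)) for p in range(n)]
        # A page's hub score is the sum of authority scores it points to.
        h = [sum(A[p][q] * a[q] for q in range(n)) for p in range(n)]
        norm_a = math.sqrt(sum(x * x for x in a)) or 1.0
        norm_h = math.sqrt(sum(x * x for x in h)) or 1.0
        a = [x / norm_a for x in a]
        h = [x / norm_h for x in h]
    return a, h


# Hypothetical graph: pages 0 and 1 both link to pages 2 and 3.
A = [[0, 0, 1, 1],
     [0, 0, 1, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
a, h = hits(A)
```

In this example pages 2 and 3 come out as equal authorities and pages 0 and 1 as equal hubs, while the remaining scores are zero.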

26 Principal Eigenvector Computation
E: n by n matrix
|λ_1| > |λ_2| >= |λ_3| >= … >= |λ_n|: eigenvalues of E
v_1,…,v_n: corresponding eigenvectors
Eigenvectors are linearly independent
Input:
- The matrix E
- The principal eigenvalue λ_1
- A unit vector u, which is not orthogonal to v_1
Goal: compute v_1

27 The Power Method
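The slide's pseudocode is not in the transcript. The sketch below is a common normalized variant of the power method: it repeatedly multiplies by E and rescales to unit length, so it does not need λ_1 as an input, which may differ from the version presented in the lecture. It computes a right eigenvector (E v = λ v); for left eigenvectors, as used for pagerank, it can be applied to the transpose. The function name and the example matrix are hypothetical.

```python
import math


def power_method(E, u, iterations=100):
    """Approximate the principal (right) eigenvector of E starting from u."""
    n = len(E)
    w = list(u)
    for _ in range(iterations):
        # Multiply by E, then rescale to unit L2 norm.
        w = [sum(E[i][j] * w[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    return w


# Hypothetical matrix with eigenvalues 2 and 1; principal eigenvector is (1, 0).
E = [[2.0, 0.0],
     [0.0, 1.0]]
u = [1 / math.sqrt(2), 1 / math.sqrt(2)]
v = power_method(E, u)
```

If λ_1 is negative, the iterate flips sign at every step, so only the direction of the result is meaningful in that case.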

28 Why Does It Work?
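The slide's argument is missing from the transcript. The standard reasoning, given here as a sketch that may differ in detail from the lecture, expands u in the eigenvector basis, which exists because the eigenvectors are linearly independent. It assumes the coefficient c_1 of v_1 in that expansion is nonzero, which is how "u is not orthogonal to v_1" is read here.

```latex
\begin{align*}
u &= \sum_{i=1}^{n} c_i v_i, \qquad c_1 \neq 0 \\
E^{t} u &= \sum_{i=1}^{n} c_i \lambda_i^{t} v_i
         = \lambda_1^{t} \Big( c_1 v_1
           + \sum_{i=2}^{n} c_i \Big(\frac{\lambda_i}{\lambda_1}\Big)^{t} v_i \Big) \\
\frac{E^{t} u}{\lambda_1^{t}} &\longrightarrow c_1 v_1
  \quad \text{as } t \to \infty,
  \text{ since } \Big|\frac{\lambda_i}{\lambda_1}\Big| < 1 \text{ for } i \ge 2
\end{align*}
```

The error terms shrink like |λ_2/λ_1|^t, so the gap between the two largest eigenvalues governs how many iterations are needed.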

29 End of Lecture 3