Slide 1: Algorithms for Large Data Sets
Ziv Bar-Yossef
Lecture 3, March 23, 2005
http://www.ee.technion.ac.il/courses/049011
Slide 2: Ranking Algorithms
Slide 3: Outline
- The ranking problem
- PageRank
- HITS (Hubs & Authorities)
- Markov chains and random walks
- PageRank and HITS computation
Slide 4: The Ranking Problem
Input:
- D: document collection
- Q: query space
Goal: find a ranking function rank: D × Q → ℝ such that rank and q induce a ranking (partial order) ≥_q on D.
This is the same as the "relevance scoring function" from the previous lecture.
Slide 5: Text-based Ranking
Classical ranking functions:
- Keyword-based boolean ranking
- Cosine similarity with TF-IDF scores
Limitations in the context of web search:
- The "abundance problem"
- Recall is not important
- Short queries
- Web pages are poor in text
- Synonymy (cars vs. autos)
- Polysemy (java, "Michael Jordan")
- Spam
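As a concrete illustration (not from the lecture), here is a minimal sketch of cosine-similarity ranking with TF-IDF weights. The corpus, query, tokenization, and weighting details are all invented for the example.

```python
# A minimal sketch of cosine-similarity ranking with TF-IDF weights,
# the "classical" text-based scoring the slide refers to. Corpus and
# query are made up; tokenization is naive whitespace splitting.
import math
from collections import Counter

docs = ["cars and autos for sale", "java programming tutorial",
        "java island travel guide"]          # toy corpus
query = "java tutorial"

tokenized = [d.split() for d in docs]
df = Counter(t for toks in tokenized for t in set(toks))  # document frequency

def tf_idf_vector(tokens, df, n_docs):
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if t in df}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

q_vec = tf_idf_vector([t for t in query.split() if t in df], df, len(docs))
scores = [cosine(q_vec, tf_idf_vector(toks, df, len(docs))) for toks in tokenized]
print(scores)  # both "java" docs score > 0: cosine cannot tell
               # java-the-language from java-the-island (polysemy)
```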
Slide 6: Link-based Ranking
Hyperlinks carry important semantics:
- Recommendation
- Critique
- Navigation
Hypertext IR Principle #1: if p → q, then q is "relevant" to p.
Hypertext IR Principle #2: if p → q, then p confers "authority" to q.
Slide 7: Static Ranking
Static ranking: rank: D → ℝ, where rank(d) > rank(d') implies d is more "authoritative" than d'.
- Use links to come up with a static ranking of all web pages.
- Given a query q, use text-based ranking to identify a set S of candidate relevant pages.
- Order S by static rank.
Advantage: the static ranking can be computed in a preprocessing step.
Disadvantage: makes no use of Hypertext IR Principle #1.
Slide 8: Query-Dependent Ranking
- Given a query q, use text-based ranking to identify a set S of candidate relevant pages.
- Use the links within S to compute a ranking rank: S → ℝ, where rank(d) > rank(d') implies d is more authoritative than d' with respect to q.
Advantage: both Hypertext IR principles are exploited.
Disadvantage: less efficient, since the link-based ranking must be computed at query time.
Slide 9: The Web as a Graph
- V = a set of pages: in static ranking, V = the whole web; in query-dependent ranking, V = S.
- The web graph: G = (V, E), where (p, q) is an edge iff p has a hyperlink to q.
- A = the adjacency matrix of G: A_pq = 1 if (p, q) ∈ E, and 0 otherwise.
Slide 10: Popularity Ranking
rank(p) = in-degree(p)
Advantages:
- The most important pages are extracted from millions of matches
- No need for text-rich documents
- Efficiently computable
Disadvantages:
- Bias towards popular pages, irrespective of the query
- Easily spammable
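A short sketch of popularity ranking on a hypothetical four-page graph; the pages and links are invented for illustration.

```python
# Popularity ranking on a toy web graph: rank(p) = in-degree(p).
import numpy as np

pages = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c"), ("c", "d")]

idx = {p: i for i, p in enumerate(pages)}
A = np.zeros((len(pages), len(pages)))       # A[p, q] = 1 iff p links to q
for p, q in edges:
    A[idx[p], idx[q]] = 1

in_degree = A.sum(axis=0)                    # column sums = in-degrees
ranking = sorted(pages, key=lambda p: -in_degree[idx[p]])
print(ranking)  # ['c', 'b', 'd', 'a']: 'c' wins with 3 inlinks
```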
Slide 11: PageRank [Page, Brin, Motwani, Winograd 1998]
Motivating principles:
- The rank of p should be proportional to the rank of the pages that point to p.
  Compare: recommendations from Bill Gates & Steve Jobs vs. from Moishale and Ahuva.
- The rank of p should depend on the number of pages "co-cited" with p.
  Compare: Bill Gates recommends only me vs. Bill Gates recommends everyone on earth.
Slide 12: PageRank, Attempt #1
- r = the rank vector
- B = the normalized adjacency matrix: B_pq = 1/out-degree(p) if p links to q, and 0 otherwise
- Define r by r = rB, i.e., rank(q) = Σ_{p → q} rank(p)/out-degree(p)
Then r is a left eigenvector of B with eigenvalue 1, so B must have 1 as an eigenvalue. But since the rows of B corresponding to sinks are all zero, 1 is not necessarily an eigenvalue: rank is "lost" in sinks.
Slide 13: PageRank, Attempt #2
Define r by r = λ · rB, where λ = 1/||rB||_1 is a normalizing constant.
Then rB = (1/λ) · r, i.e., r is a left eigenvector of B with eigenvalue 1/λ.
- Any left eigenvector will do; usually the normalized principal eigenvector is used.
- Problem: rank accumulates at sinks and sink communities.
Slide 14: PageRank, Attempt #2: Example
[Figure: a three-node example. The rank vector starts at (0.5, 0.3, 0.2) in iteration I; after one renormalized step (iteration II) it becomes (0, 0.25/0.8 = 0.31, 0.65/0.8 = 0.69); in iteration III it is (0, 0, 1). All the rank drains into the sink.]
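The drain-to-the-sink behavior is easy to reproduce. The sketch below uses a hypothetical three-node graph, not necessarily the slide's exact example, with a self-loop on the sink as in slide 16's convention.

```python
# Attempt #2 on a made-up 3-node graph: 1 -> 2, 1 -> 3, 2 -> 3, and a
# self-loop on the sink 3. The renormalized iteration r <- rB / ||rB||_1
# drains all rank into the sink.
import numpy as np

B = np.array([[0.0, 0.5, 0.5],   # page 1 splits its rank between 2 and 3
              [0.0, 0.0, 1.0],   # page 2 sends everything to 3
              [0.0, 0.0, 1.0]])  # sink 3 keeps its rank (self-loop)

r = np.full(3, 1 / 3)
for t in range(20):
    r = r @ B
    r /= r.sum()                 # renormalize to L1 unit norm
print(np.round(r, 3))            # -> [0. 0. 1.]: rank accumulates at the sink
```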
Slide 15: PageRank, Final Definition
- E(p) = a "rank source" function; e = the rank source vector; r = the pagerank vector; 1 = the all-ones vector.
- Standard setting: E(p) = ε/|V| for some ε < 1.
- Define r by r = λ · r(B + 1e^T), with pagerank normalized to unit L1 norm.
Then r is a left eigenvector of (B + 1e^T) with eigenvalue 1/λ. Use the normalized principal eigenvector.
Slide 16: The Random Surfer Model
When visiting a page p, a "random surfer":
- With probability 1 - ε, selects a random outlink p → q and goes to visit q ("focused browsing").
- With probability ε, jumps to a random web page q ("loss of interest").
If p has no outlinks, assume it has a self-loop.
P = the probability transition matrix: P_pq = (1 - ε) · B_pq + ε/|V|.
Slide 17: PageRank & Random Surfer Model
Suppose the rank source vector is e = (ε / ((1 - ε)|V|)) · 1.
Then P = (1 - ε)B + (ε/|V|) · 1·1^T = (1 - ε)(B + 1e^T).
Therefore, r is a left eigenvector of (B + 1e^T) with eigenvalue 1/(1 - ε) iff it is a left eigenvector of P with eigenvalue 1.
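A quick numeric check of this equivalence, with an illustrative 3-page matrix B and ε = 0.15, both invented for the example:

```python
# With rank source e = (eps / ((1 - eps) * n)) * 1, the surfer matrix P
# equals (1 - eps) * (B + 1 e^T), so both share the left eigenvector r.
import numpy as np

eps, n = 0.15, 3
B = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])   # made-up normalized adjacency matrix

ones = np.ones((n, 1))
e = np.full((n, 1), eps / ((1 - eps) * n))
P = (1 - eps) * B + eps / n                      # random surfer matrix
assert np.allclose(P, (1 - eps) * (B + ones @ e.T))

# Left eigenvector of P with eigenvalue 1 = stationary distribution:
vals, vecs = np.linalg.eig(P.T)
r = np.real(vecs[:, np.argmax(np.real(vals))])
r /= r.sum()
assert np.allclose(r @ P, r)                     # rP = r
assert np.allclose(r @ (B + ones @ e.T), r / (1 - eps))  # eigenvalue 1/(1-eps)
```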
Slide 18: Markov Chain Primer
- V = the state space.
- P = the probability transition matrix: non-negative, and each row sums to 1.
- q_0 = an initial distribution on V; q_t = q_0 · P^t = the distribution on V after t steps.
P is ergodic if it is:
- Irreducible: the underlying graph is strongly connected.
- Aperiodic: for all states u, v, the gcd of the lengths of the paths from u to v is 1.
Theorem: if P is ergodic, then it has a "stationary distribution" π; furthermore, q_t → π as t → ∞, for every initial distribution q_0. πP = π, i.e., π is a left eigenvector of P with eigenvalue 1.
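For instance, a two-state chain with illustrative transition probabilities converges to its stationary distribution from any starting distribution:

```python
# A 2-state ergodic chain: from any initial q0, q_t = q0 P^t converges
# to the stationary distribution pi with pi P = pi.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])       # irreducible and aperiodic, hence ergodic

q = np.array([1.0, 0.0])         # start deterministically in state 0
for t in range(100):
    q = q @ P
print(np.round(q, 4))            # -> [0.8 0.2]

pi = np.array([0.8, 0.2])        # solves pi P = pi
assert np.allclose(pi @ P, pi)
```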
Slide 19: PageRank & Markov Chains
Conclusion: the pagerank vector r is the stationary distribution of the random surfer Markov chain:
pagerank(p) = r_p = the probability that the random surfer visits p, in the limit.
Note: the "random jump" guarantees that the Markov chain is irreducible and aperiodic, hence ergodic.
Slide 20: PageRank Computation
Power iteration: start from some initial distribution r^(0) and repeatedly set r^(t+1) = r^(t) · P; by the theorem above, r^(t) converges to the pagerank vector r. In practice, about 50 iterations suffice.
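A minimal power-iteration sketch, assuming a hypothetical 3-page graph; the values of ε, the tolerance, and the iteration cap are illustrative choices, not the lecture's:

```python
# Power-iteration PageRank: build the surfer matrix P from an adjacency
# matrix A and iterate r <- rP until the change is small.
import numpy as np

def pagerank(A, eps=0.15, tol=1e-10, max_iter=100):
    n = A.shape[0]
    sinks = A.sum(axis=1) == 0
    B = A.copy()
    B[sinks] = np.eye(n)[sinks]            # slide 16: sinks get a self-loop
    B = B / B.sum(axis=1, keepdims=True)   # row-normalize the adjacency matrix
    P = (1 - eps) * B + eps / n            # random surfer matrix
    r = np.full(n, 1 / n)
    for t in range(max_iter):
        r_next = r @ P
        if np.abs(r_next - r).sum() < tol:
            return r_next, t + 1
        r = r_next
    return r, max_iter

A = np.array([[0, 1, 1],                   # hypothetical 3-page web
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
r, iters = pagerank(A)
print(np.round(r, 4), iters)               # converges in a few dozen iterations
```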
Slide 21: HITS: Hubs and Authorities [Kleinberg, 1997]
HITS = Hyperlink Induced Topic Search.
Main principle: every page p is associated with two scores:
- Authority score: how "authoritative" the page is about the query's topic.
  Ex: query "IR"; authorities: scientific IR papers.
  Ex: query "automobile manufacturers"; authorities: the Mazda, Toyota, and GM web sites.
- Hub score: how good the page is as a "resource list" about the query's topic.
  Ex: query "IR"; hubs: surveys and books about IR.
  Ex: query "automobile manufacturers"; hubs: KBB, lists of car links.
Slide 22: Mutual Reinforcement
HITS principles:
- p is a good authority if it is linked to by many good hubs.
- p is a good hub if it points to many good authorities.
Slide 23: HITS: Algebraic Form
- a = the authority vector; h = the hub vector; A = the adjacency matrix.
- Then a ∝ A^T h and h ∝ A a.
- Therefore a ∝ A^T A a and h ∝ A A^T h:
  a is the principal eigenvector of A^T A, and h is the principal eigenvector of A A^T.
Slide 24: Co-Citation and Bibliographic Coupling
- A^T A = the co-citation matrix: (A^T A)_pq = the number of pages that link to both p and q. Thus authority scores propagate through co-citation.
- A A^T = the bibliographic coupling matrix: (A A^T)_pq = the number of pages that both p and q link to. Thus hub scores propagate through bibliographic coupling.
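A quick check of the two interpretations on a made-up four-page graph:

```python
# (A^T A)[p, q] counts common in-neighbors (co-citation);
# (A A^T)[p, q] counts common out-neighbors (bibliographic coupling).
import numpy as np

A = np.array([[0, 1, 1, 0],     # page 0 links to 1 and 2
              [0, 0, 1, 1],     # page 1 links to 2 and 3
              [0, 0, 0, 1],     # page 2 links to 3
              [0, 0, 0, 0]])

cocite = A.T @ A
print(cocite[1, 2])   # 1: only page 0 links to both 1 and 2
coupling = A @ A.T
print(coupling[0, 1]) # 1: pages 0 and 1 both link to page 2
```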
Slide 25: HITS Computation
Iterate the mutual-reinforcement rules: initialize a and h to all ones, and repeatedly set h ← A a and a ← A^T h, normalizing both after each step. The iterates converge to the principal eigenvectors of A A^T and A^T A, respectively (see the power method below).
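A sketch of this iteration, reusing the made-up four-page graph from the previous example; the iteration count is an arbitrary illustrative choice:

```python
# HITS iteration: alternate h <- Aa and a <- A^T h with normalization;
# a and h converge to the principal eigenvectors of A^T A and A A^T.
import numpy as np

def hits(A, iters=50):
    n = A.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iters):
        h = A @ a                # good hubs point to good authorities
        a = A.T @ h              # good authorities are pointed to by good hubs
        h /= np.linalg.norm(h)   # normalize to keep values bounded
        a /= np.linalg.norm(a)
    return a, h

A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)
a, h = hits(A)
print(np.round(a, 3))  # authority scores: pages 2 and 3 score highest
print(np.round(h, 3))  # hub scores: pages 1 and 0 score highest
```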
Slide 26: Principal Eigenvector Computation
- E: an n × n matrix.
- |λ_1| > |λ_2| ≥ |λ_3| ≥ … ≥ |λ_n|: the eigenvalues of E.
- v_1, …, v_n: the corresponding eigenvectors; the eigenvectors are linearly independent.
Input:
- The matrix E
- The principal eigenvalue λ_1
- A unit vector u which is not orthogonal to v_1
Goal: compute v_1.
Slide 27: The Power Method
Set w_0 = u, and iterate w_t = E · w_{t-1} / λ_1 (equivalently, renormalize E · w_{t-1} at each step). Then w_t converges to c · v_1 for some constant c ≠ 0.
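A sketch of the power method, using the per-step renormalization variant (which avoids needing λ_1 up front) on an illustrative 2 × 2 matrix:

```python
# Power method: repeatedly apply E and normalize; the iterate converges
# to (a scalar multiple of) the principal eigenvector v1, provided the
# start vector u is not orthogonal to v1.
import numpy as np

def power_method(E, u, iters=1000):
    w = u / np.linalg.norm(u)
    for _ in range(iters):
        w = E @ w
        w /= np.linalg.norm(w)   # renormalize to avoid over/underflow
    return w

E = np.array([[2.0, 1.0],        # illustrative symmetric matrix,
              [1.0, 2.0]])       # eigenvalues 3 and 1
u = np.array([1.0, 0.0])         # not orthogonal to v1 = (1, 1)/sqrt(2)
w = power_method(E, u)
print(np.round(w, 6))            # -> [0.707107 0.707107], i.e. v1
```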
Slide 28: Why Does It Work?
Write u in the eigenvector basis: u = c_1 v_1 + c_2 v_2 + … + c_n v_n, with c_1 ≠ 0 since u is not orthogonal to v_1. Then
E^t u = c_1 λ_1^t v_1 + … + c_n λ_n^t v_n, so
w_t = E^t u / λ_1^t = c_1 v_1 + Σ_{i ≥ 2} c_i (λ_i/λ_1)^t v_i.
Since |λ_i/λ_1| < 1 for all i ≥ 2, the sum vanishes as t → ∞, and w_t → c_1 v_1.
Slide 29: End of Lecture 3