2 Link Analysis. Adapted from lectures by Prabhakar Raghavan (Yahoo!, Stanford), Christopher Manning (Stanford), and Raymond Mooney (UT, Austin). (Prasad, Lecture 21: Link Analysis)

3 Evolution of Search Engines. 1st generation: retrieved documents that matched keyword-based queries using the Boolean model. 2nd generation: incorporated content-specific relevance ranking based on the vector space model (TF-IDF), to deal with high recall. 3rd generation: incorporated content-independent source ranking using hyperlink structure, to overcome spamming and to exploit "collective web wisdom".

4 HTML Structure & Feature Weighting. Weight tokens under particular HTML tags more heavily: <title> tokens (Google seems to like title matches), <h1>, <h2>, ... header tokens, <meta> keyword tokens. Parse a page into conceptual sections (e.g., navigation links vs. page content) and weight tokens differently based on section.

5 (cont'd) 3rd generation: tried to glean the relative semantic emphasis of words from syntactic features such as fonts and the span of query-term hits, to enhance the efficacy of the VSM. Future search engines will (i) incorporate trusted sources, (ii) use background knowledge, user profiles, and search context from past query history to personalize search, and (iii) apply additional reasoning and heuristics to better satisfy the information need.

6 Meta-Search Engines. A search engine that passes the query to several other search engines and integrates the results: submit queries to host sites; parse the resulting HTML pages to extract search results; integrate the multiple rankings (rank aggregation) into a "consensus" ranking; present the integrated results to the user. Examples: MetaCrawler, SavvySearch, Dogpile.

7 Bibliometrics: Citation Analysis. Many standard documents include bibliographies (or references): explicit citations to other previously published documents. Using citations as links, standard corpora can be viewed as a graph. The structure of this graph, independent of the content, can provide interesting information about the similarity of documents (for clustering) and about their significance (for ranking).

8 Bibliographic Coupling: Similarity Measure. A measure of document similarity introduced by Kessler in 1963. The bibliographic coupling of two documents A and B is the number of documents cited by both A and B, i.e., the size of the intersection of their bibliographies. One may want to normalize by the sizes of the bibliographies.

9 Co-Citation: Similarity Measure. An alternative citation-based measure of similarity, introduced by Small in 1973: the number of documents that cite both A and B. One may want to normalize by the total number of documents citing either A or B.
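The two similarity measures above can be sketched in a few lines of Python. This is an illustrative toy example, not from the slides; the documents and their citation sets are hypothetical.

```python
# Toy citation graph: "cites" maps each (hypothetical) document to the
# set of documents it cites.
cites = {
    "A": {"X", "Y", "Z"},
    "B": {"Y", "Z", "W"},
    "C": {"A", "B"},
    "D": {"A", "B", "X"},
}

def bibliographic_coupling(a, b):
    """Number of documents cited by both a and b (Kessler, 1963)."""
    return len(cites.get(a, set()) & cites.get(b, set()))

def co_citation(a, b):
    """Number of documents that cite both a and b (Small, 1973)."""
    return sum(1 for refs in cites.values() if a in refs and b in refs)

print(bibliographic_coupling("A", "B"))  # A and B both cite {Y, Z} -> 2
print(co_citation("A", "B"))             # C and D each cite both A and B -> 2
```

Normalized variants would divide by the bibliography sizes (for coupling) or by the number of documents citing either A or B (for co-citation), as the slides suggest.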

10 Impact Factor (of a journal). Developed by Garfield in 1972 to measure the importance (quality, influence) of scientific journals. A measure of how often papers in the journal are cited by other scientists; computed and published annually by the Institute for Scientific Information (ISI). The impact factor of a journal J in year Y is the average number of citations (from indexed documents published in year Y) to a paper published in J in year Y-1 or Y-2. It does not account for the quality of the citing article.

11 Authorities. Authorities are pages recognized as providing significant, trustworthy, and useful information on a topic. In-degree (the number of pointers to a page) is one simple measure of authority. However, in-degree treats all links as equal: should links from pages that are themselves authoritative count for more?

12 Hubs. Hubs are index pages that provide many useful links to relevant content pages (topic authorities). Hub pages for IR are included in the course home page: http://www.cs.utexas.edu/users/mooney/ir-course

13 Hubs and Authorities. Together they tend to form a bipartite graph: hubs on one side, authorities on the other.

14 Introduction to Information Retrieval. Simple iterative logic: the Good, the Bad, and the Unknown. Good nodes won't point to Bad nodes; all other combinations are plausible. (Figure: a small graph of Good, Bad, and unknown "?" nodes.)

15-17 Simple iterative logic (cont'd). Good nodes won't point to Bad nodes. If you point to a Bad node, you're Bad; if a Good node points to you, you're Good. Iterating these rules labels more of the unknown nodes at each step (the figures show successive iterations). Sometimes probabilistic analogs are needed, e.g., for mail spam.

18 Today's lecture: Basics of 3rd-Generation Search Engines. The role of anchor text. Link analysis for ranking: PageRank and variants (Page and Brin); HITS (Kleinberg). Bonus: illustrates how to solve an important problem by adapting the topology/statistics of a large dataset for mathematically sound analysis and a practical, scalable algorithm.

19 The Web as a Directed Graph. Assumption 1: a hyperlink between pages denotes author-perceived relevance (a quality signal). Assumption 2: the anchor text of the hyperlink describes the target page (textual context). (Figure: Page A links to Page B; the anchor text sits on the hyperlink.)

20 Anchor Text. WWW Worm (McBryan [Mcbr94]). For the query ibm, how to distinguish between: IBM's home page (mostly graphical), IBM's copyright page (high term frequency for 'ibm'), and a rival's spam page (arbitrarily high term frequency)? (Figure: links labeled "ibm", "ibm.com", "IBM home page" all pointing to www.ibm.com.) A million pieces of anchor text containing "ibm" send a strong signal.

21 (Figure: anchor text containing "IBM" pointing to www.ibm.com.)

22 Indexing anchor text. When indexing a document D, include anchor text from links pointing to D. (Figure: www.ibm.com is pointed to by anchors from pages such as "Armonk, NY-based computer giant IBM announced today...", "Joe's computer hardware links: Compaq, HP, IBM", and "Big Blue today announced record profits for the quarter".)
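The idea on this slide can be sketched concretely: fold the terms of each in-link's anchor text into the *target* document's postings. This is an illustrative toy sketch; the page names and crawler output format are hypothetical.

```python
from collections import defaultdict

# Hypothetical crawler output: page bodies and extracted
# (linking_page, target_page, anchor_text) triples.
pages = {
    "www.ibm.com": "graphical splash page",
    "joes.example": "joe's computer hardware links compaq hp",
}
links = [("joes.example", "www.ibm.com", "IBM Big Blue")]

index = defaultdict(set)  # term -> set of documents containing it

# Index each page's own text.
for url, text in pages.items():
    for term in text.lower().split():
        index[term].add(url)

# Add anchor-text terms to the postings of the TARGET document.
for _src, target, anchor in links:
    for term in anchor.lower().split():
        index[term].add(target)

print(sorted(index["ibm"]))  # ['www.ibm.com']
```

The mostly graphical IBM home page contains no usable text of its own, yet it is now retrievable for the query "ibm" via the anchor text pointing at it.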

23 Google Ranking. Controversy surrounding "Attention: Good vs. Bad":
http://www.nytimes.com/2010/11/28/business/28borker.html?pagewanted=all&_r=0
http://www.pcworld.com/article/227804/Google_PageRank_Bully_Pleads_Guilty_to_Harassment_Fraud.html
http://www.nytimes.com/2012/09/07/business/vitaly-borker-owner-of-decormyeyes-sentenced-for-threats-to-customers.html
http://en.wikipedia.org/wiki/DecorMyEyes

24 Google bombs. A Google bomb is a search with "bad" results due to maliciously manipulated anchor text. Google introduced a new weighting function in January 2007 that defused many Google bombs. Some remnants remain: [dangerous cult] on Google, Bing, Yahoo (coordinated link creation by those who dislike the Church of Scientology). Defused Google bombs: [dumb motherf...], [who is a failure?], [evil empire].

25 Anchor Text: Other Applications. Weighting/filtering links in the graph: HITS [Chak98], Hilltop [Bhar01]. Generating page descriptions from anchor text [Amit98, Amit00].

26 Citation Analysis. Citation frequency. Co-citation coupling frequency: co-citations with a given author measure "impact"; co-citation analysis [Mcca90]. Bibliographic coupling frequency: articles that cite the same articles are related. Citation indexing: who is an author cited by? (Garfield [Garf72]). PageRank preview: Pinski and Narin.

27 Query-independent ordering. First generation: using link counts as simple measures of popularity. Two basic suggestions: undirected popularity: each page gets a score = the number of in-links plus the number of out-links (e.g., 3 + 2 = 5); directed popularity: score of a page = the number of its in-links (e.g., 3).
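The two first-generation scores can be written down directly. A minimal sketch on a toy link list (the pages here are hypothetical; "p" has 3 in-links and 2 out-links, matching the counts in the slide):

```python
# (source, target) link pairs for a toy web graph.
links = [("a", "p"), ("b", "p"), ("c", "p"), ("p", "x"), ("p", "y")]

def undirected_popularity(page):
    """in-links + out-links"""
    return sum(1 for s, t in links if page in (s, t))

def directed_popularity(page):
    """in-links only"""
    return sum(1 for _s, t in links if t == page)

print(undirected_popularity("p"))  # 3 + 2 = 5
print(directed_popularity("p"))    # 3
```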

28 Query processing. First retrieve all pages meeting the text query (say, venture capital). Then order these by their link popularity (either variant from the previous slide).

29 Motivation and Introduction to PageRank. Why is page-importance rating important? New challenges for information retrieval on the World Wide Web: the number of web pages grew from about 150 million in 1998 to about 1,000 billion in 2008; in 2014 Google indexed roughly 50 billion of more than 1 trillion pages. Diversity of web pages: different topics, quality, etc. What is PageRank? A method for rating the importance of web pages objectively and mechanically, using the link structure of the web.

30 Initial PageRank Idea. PageRank can be viewed as a process of rank "flowing" from pages to the pages they cite. (Figure: example ranks such as 0.1, 0.09, 0.05, 0.03, 0.08, 0.03 flowing along links.)

31 Sample Stable Fixpoint. (Figure: a small three-page graph whose ranks stabilize at 0.4, 0.2, 0.4.)

32 PageRank scoring: more details. Imagine a browser doing a random walk on web pages: start at a random page; at each step, go out of the current page along one of its links, chosen with equal probability (e.g., 1/3 each for a page with three out-links). "In the steady state," each page has a long-term visit rate; use this as the page's score.

33 Not quite enough. The web is full of dead ends. A random walk can get stuck in dead ends, and then it makes no sense to talk about long-term visit rates.

34 Teleporting. At a dead end, jump to a random web page. At any non-dead end, with probability 10%, jump to a random web page; with the remaining probability (90%), go out on a random link. The 10% is a parameter.
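The teleporting rules translate directly into a transition matrix: dead-end rows become uniform, and every other row mixes link-following mass with a small uniform jump. A minimal sketch on a hypothetical three-page graph:

```python
n = 3
out_links = {0: [1], 1: [0, 2], 2: []}  # page 2 is a dead end (illustrative)
alpha = 0.1                             # the 10% teleport parameter

P = [[0.0] * n for _ in range(n)]
for i in range(n):
    if not out_links[i]:
        # Dead end: jump to a random page uniformly.
        for j in range(n):
            P[i][j] = 1.0 / n
    else:
        # Teleport mass alpha/n everywhere, plus link-following mass.
        for j in range(n):
            P[i][j] = alpha / n
        for j in out_links[i]:
            P[i][j] += (1 - alpha) / len(out_links[i])

for row in P:
    assert abs(sum(row) - 1.0) < 1e-9   # every row is a probability distribution
print(P[0])  # row 0: small teleport mass everywhere, most mass on page 1
```

This is exactly the construction that turns the raw link matrix into the Markov chain used in the following slides.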

35 Result of teleporting. Now the walk cannot get stuck locally, and there is a long-term rate at which any page is visited (not obvious; we will show this). How do we compute this visit rate?

36 Markov chains. A Markov chain consists of n states, plus an n x n transition probability matrix P. At each step, we are in exactly one of these states. For 1 <= i, j <= n, the entry Pij tells us the probability of j being the next state, given that we are currently in state i. (Self-loops are allowed: Pii > 0 is OK.)

37 Markov chains (cont'd). Clearly, for all i, the row sums satisfy Σj Pij = 1: each row of P is a probability distribution. Markov chains are abstractions of random walks. Exercise: represent the teleporting random walk (eliminating dead ends) as a Markov chain.

38 Ergodic Markov chains. A Markov chain is ergodic if there is a path from any state to any other and, for any start state, after a finite transient time T0, the probability of being in any state at a fixed time T > T0 is nonzero. (Counterexample: a chain that strictly alternates between "even" and "odd" states is not ergodic.)

39 Ergodic Markov chains (cont'd). For any ergodic Markov chain, there is a unique long-term visit rate for each state: the steady-state probability distribution. Over a long time period, we visit each state in proportion to this rate, and it doesn't matter where we start.

40 Probability vectors. A probability (row) vector x = (x1, ..., xn) tells us where the walk is at any point. E.g., (0 0 0 ... 1 ... 0 0 0), with the 1 in position i, means we're in state i. More generally, the vector x = (x1, ..., xn) means the walk is in state i with probability xi.

41 Change in probability vector. If the probability vector is x = (x1, ..., xn) at this step, what is it at the next step? Recall that row i of the transition probability matrix P tells us where we go next from state i. So from x, our next state is distributed as xP.

42 Steady state example. The steady state looks like a vector of probabilities a = (a1, ..., an), where ai is the probability of being in state i. Example: a two-state chain with P11 = 1/4, P12 = 3/4, P21 = 1/4, P22 = 3/4. For this example, a1 = 1/4 and a2 = 3/4.

43 How do we compute this vector? Let a = (a1, ..., an) denote the row vector of steady-state probabilities. If the current state is described by a, then the next step is distributed as aP. But a is the steady state, so a = aP. Thus a is a (left) eigenvector of P, corresponding to the "principal" eigenvector with the largest eigenvalue. Transition probability matrices always have largest eigenvalue 1 (all others are smaller). The principal eigenvector can be computed iteratively from any initial vector.

44 One way of computing a. Recall that, regardless of where we start, we eventually reach the steady state a. Start with any distribution, say x = (1 0 ... 0). After one step we're at xP; after two steps at xP^2, then xP^3, and so on. "Eventually" means that for large k, xP^k ≈ a. Algorithm: multiply x by increasing powers of P until the product looks stable.
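This recipe is a few lines of Python. A sketch using the two-state chain from the steady-state example above (P11 = 1/4, P12 = 3/4, P21 = 1/4, P22 = 3/4):

```python
def next_dist(x, P):
    """One step of the walk: the row vector x becomes xP."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

P = [[0.25, 0.75],
     [0.25, 0.75]]
x = [1.0, 0.0]          # start anywhere, e.g. in state 1
for _ in range(50):     # multiply by P until the product looks stable
    x = next_dist(x, P)

print(x)  # converges to the steady state a = (0.25, 0.75)
```

Here stability is reached almost immediately; in general one iterates until successive vectors differ by less than a tolerance.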

45-47 An example of Simplified PageRank (transposed version): first iteration, second iteration, and convergence after some iterations. In this formulation Mij is the weight of the link from j to i (e.g., 1/2 when page j has two out-links), and the rank vector is repeatedly multiplied by M until it stabilizes. (Figures: the matrix-vector products for each iteration.)

48 Power method: Example. What is the PageRank / steady state in this example?

49-61 Computing PageRank: Power Example. Transition probabilities: P11 = 0.1, P12 = 0.9, P21 = 0.3, P22 = 0.7. Update rule:
Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21
Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Starting from x = (0, 1):

        Pt(d1)    Pt(d2)
t0      0         1         = x
t1      0.3       0.7       = xP
t2      0.24      0.76      = xP^2
t3      0.252     0.748     = xP^3
t4      0.2496    0.7504    = xP^4
...
t∞      0.25      0.75      = xP^∞

PageRank vector = π = (π1, π2) = (0.25, 0.75)

62 Power method: Example. What is the PageRank / steady state in this example? The steady-state distribution (= the PageRanks) is 0.25 for d1 and 0.75 for d2.

63 Exercise: Compute PageRank using the power method. (Figure: a two-page web graph.)

65-76 Solution. Transition probabilities: P11 = 0.7, P12 = 0.3, P21 = 0.2, P22 = 0.8. Update rule:
Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21
Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Starting from x = (0, 1):

        Pt(d1)    Pt(d2)
t0      0         1
t1      0.2       0.8
t2      0.3       0.7
t3      0.35      0.65
t4      0.375     0.625
...
t∞      0.4       0.6

PageRank vector = π = (π1, π2) = (0.4, 0.6)
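A quick Python check of this solution (a sketch; the matrix entries are the ones given in the exercise):

```python
P = [[0.7, 0.3],
     [0.2, 0.8]]
x = [0.0, 1.0]          # start in state d2, as in the worked table
for _ in range(200):    # iterate x -> xP until stable
    x = [x[0] * P[0][0] + x[1] * P[1][0],
         x[0] * P[0][1] + x[1] * P[1][1]]

print([round(v, 1) for v in x])  # [0.4, 0.6]
```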

77 Example web graph. (Figure: seven pages d0 through d6 and their links.)

78 Transition (probability) matrix

      d0    d1    d2    d3    d4    d5    d6
d0   0.00  0.00  1.00  0.00  0.00  0.00  0.00
d1   0.00  0.50  0.50  0.00  0.00  0.00  0.00
d2   0.33  0.00  0.33  0.33  0.00  0.00  0.00
d3   0.00  0.00  0.00  0.50  0.50  0.00  0.00
d4   0.00  0.00  0.00  0.00  0.00  0.00  1.00
d5   0.00  0.00  0.00  0.00  0.00  0.50  0.50
d6   0.00  0.00  0.00  0.33  0.33  0.00  0.33

79 Transition matrix with teleporting (teleport probability 0.14, i.e. 0.14/7 = 0.02 added to every entry)

      d0    d1    d2    d3    d4    d5    d6
d0   0.02  0.02  0.88  0.02  0.02  0.02  0.02
d1   0.02  0.45  0.45  0.02  0.02  0.02  0.02
d2   0.31  0.02  0.31  0.31  0.02  0.02  0.02
d3   0.02  0.02  0.02  0.45  0.45  0.02  0.02
d4   0.02  0.02  0.02  0.02  0.02  0.02  0.88
d5   0.02  0.02  0.02  0.02  0.02  0.45  0.45
d6   0.02  0.02  0.02  0.31  0.31  0.02  0.31

80 Power method vectors xP^k. Starting from the uniform vector x = (0.14, ..., 0.14), repeated multiplication by the teleporting matrix P converges by about k = 13:
d0: 0.14, 0.06, 0.09, 0.07, ..., 0.05
d1: 0.14, 0.08, 0.06, 0.04, ..., 0.04
d2: 0.14, 0.25, 0.18, 0.17, 0.15, 0.14, 0.13, 0.12, ..., 0.11
d3: 0.14, 0.16, 0.23, 0.24, ..., 0.25
d4: 0.14, 0.12, 0.16, 0.19, 0.20, ..., 0.21
d5: 0.14, 0.08, 0.06, 0.04, ..., 0.04
d6: 0.14, 0.25, 0.23, 0.25, 0.27, 0.28, 0.29, 0.30, ..., 0.31

81 Example web graph: PageRank. d0: 0.05, d1: 0.04, d2: 0.11, d3: 0.25, d4: 0.21, d5: 0.04, d6: 0.31.
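The whole seven-page computation can be reproduced in a short script. This is a sketch: the out-link lists are read off the transition matrix above, and the teleport probability is taken to be 0.14 to match that matrix.

```python
n = 7
# Out-links inferred from the transition matrix rows (d5 has a self-loop, etc.).
out = {0: [2], 1: [1, 2], 2: [0, 2, 3], 3: [3, 4],
       4: [6], 5: [5, 6], 6: [3, 4, 6]}
alpha = 0.14  # teleport probability assumed from the 0.02-per-entry matrix

# Build the teleporting transition matrix.
P = [[alpha / n for _ in range(n)] for _ in range(n)]
for i, targets in out.items():
    for j in targets:
        P[i][j] += (1 - alpha) / len(targets)

# Power method from the uniform vector.
x = [1.0 / n] * n
for _ in range(100):
    x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

print([round(v, 2) for v in x])  # should approximate the PageRank column above
```

d6 ends up with the largest PageRank, as in the slide, because it receives all of d4's rank plus contributions from d5 and its own self-loop.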

82 Link Structure of the Web. 150 million web pages → 1.7 billion links. Backlinks and forward links: A and B are C's backlinks; C is a forward link of A and of B. Intuitively, a web page is important if it has many backlinks. But what about a web page that has only one link to it, from www.yahoo.com?

83 A Simple Version of PageRank: R(u) = c Σ_{v ∈ Bu} R(v) / Nv, where u is a web page, Bu is the set of u's backlinks, Nv is the number of forward links of page v, and c is the normalization factor that makes ||R||_L1 = 1 (||R||_L1 = |R1 + ... + Rn|).
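The formula transcribes almost literally into code. A sketch on a hypothetical three-page graph (which happens to reproduce the 0.4 / 0.2 / 0.4 stable fixpoint shown on an earlier slide):

```python
# Toy graph: a -> b, a -> c; b -> c; c -> a.
out = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pages = list(out)
# B_u: the backlinks of each page u.
back = {u: [v for v in pages if u in out[v]] for u in pages}

R = {u: 1.0 / len(pages) for u in pages}
for _ in range(100):
    # R(u) = c * sum over v in B_u of R(v) / N_v
    new = {u: sum(R[v] / len(out[v]) for v in back[u]) for u in pages}
    c = 1.0 / sum(new.values())      # normalize so ||R||_L1 = 1
    R = {u: c * r for u, r in new.items()}

print({u: round(r, 2) for u, r in sorted(R.items())})
# -> {'a': 0.4, 'b': 0.2, 'c': 0.4}
```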

84 A Problem with Simplified PageRank. A loop: during each iteration, the loop accumulates rank but never distributes rank to other pages!

85-87 An example of the problem. (Figures: successive iterations on a graph containing such a loop; rank drains into the loop and accumulates there. The 1/2 values are link weights.)

88 Random Walks in Graphs. The random surfer model. The simplified model: the standing probability distribution of a random walk on the graph of the web; the user simply keeps clicking successive links at random. The modified model: the "random surfer" keeps clicking successive links at random, but periodically "gets bored" and jumps to a random page based on the distribution E.

89 Modified Version of PageRank: R(u) = c Σ_{v ∈ Bu} R(v) / Nv + c E(u), where E(u) is a distribution of rank over web pages that "users" jump to when they "get bored" after following successive links at random.

90 An example of Modified PageRank. (Figure: the loop example again, now with the E term; 1/2 is a link weight.)

91 PageRank summary. Preprocessing: given the graph of links, build the matrix P; from it, compute a. The entry ai is a number between 0 and 1: the PageRank of page i. Query processing: retrieve the pages meeting the query and rank them by their PageRank; the order is query-independent.

92 The reality. PageRank is used in Google, but together with many other clever heuristics.

93 PageRank: Issues and Variants. How realistic is the random surfer model? What if we modeled the back button? [Fagi00] Surfer behavior is sharply skewed toward short paths [Hube98]. Search engines, bookmarks, and directories make jumps non-random. Biased surfer models: weight link-traversal probabilities based on match with topic/query (skewed selection); bias jumps to pages on topic (e.g., based on personal bookmarks and categories of interest).

94 Topic-Specific PageRank [Have02]. Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule: select a category (say, one of the 16 top-level ODP categories) based on a query- and user-specific distribution over the categories, then teleport to a page uniformly at random within the chosen category. Sounds hard to implement: we can't compute PageRank at query time!

95 Topic-Specific PageRank [Have02]: Implementation. Offline: compute PageRank distributions w.r.t. individual categories (query-independent model, as before); each page has multiple PageRank scores, one per ODP category, with teleportation only to that category. Online: a distribution of weights over categories is computed by query-context classification; generate a dynamic PageRank score for each page as the weighted sum of its category-specific PageRanks.

96 Influencing PageRank ("Personalization"). Input: web graph W and influence vector v (page → degree of influence). Output: rank vector r (page → page importance w.r.t. v), i.e., r = PR(W, v).

97 Non-uniform Teleportation. Teleport with 10% probability to a Sports page. (Figure.)

98 Interpretation of Composite Score. For a set of personalization vectors {vj}: Σj [wj · PR(W, vj)] = PR(W, Σj [wj · vj]). A weighted sum of rank vectors is itself a valid rank vector, because PR() is linear w.r.t. vj.
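This linearity can be checked numerically. A sketch on a toy graph with a simple personalized power iteration (the `pagerank` helper, the graph, and the two teleport vectors are all illustrative, not from the slides):

```python
def pagerank(out, v, alpha=0.1, iters=200):
    """Personalized PageRank sketch: teleport to distribution v with prob. alpha."""
    n = len(v)
    x = v[:]
    for _ in range(iters):
        nxt = [alpha * v[j] for j in range(n)]
        for i, targets in out.items():
            for j in targets:
                nxt[j] += (1 - alpha) * x[i] / len(targets)
        x = nxt
    return x

out = {0: [1], 1: [0, 2], 2: [0]}  # toy three-page graph, no dead ends
v1 = [1.0, 0.0, 0.0]               # hypothetical "sports" teleport vector
v2 = [0.0, 0.0, 1.0]               # hypothetical "health" teleport vector
w1, w2 = 0.9, 0.1

lhs = [w1 * a + w2 * b for a, b in zip(pagerank(out, v1), pagerank(out, v2))]
rhs = pagerank(out, [w1 * a + w2 * b for a, b in zip(v1, v2)])
print(all(abs(a - b) < 1e-9 for a, b in zip(lhs, rhs)))  # True
```

This is exactly why 0.9 PRsports + 0.1 PRhealth behaves like 9% sports teleportation plus 1% health teleportation, as the following slides illustrate.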

99 Interpretation. (Figure: 10% Sports teleportation.)

100 Interpretation. (Figure: 10% Health teleportation.)

101 Interpretation. pr = (0.9 PRsports + 0.1 PRhealth) gives you 9% sports teleportation and 1% health teleportation.

102 Hubs and Authorities: an alternative to PageRank.

103 Hyperlink-Induced Topic Search (HITS) [Klei98]. In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages: hub pages are good lists of links on a subject (e.g., "Bob's list of cancer-related links"); authority pages occur recurrently on good hubs for the subject. Best suited for "broad topic" queries rather than for page-finding queries; gets at a broader slice of common opinion.

104 Hubs and Authorities. Thus, a good hub page for a topic points to many authoritative pages for that topic, and a good authority page for a topic is pointed to by many good hubs for that topic. This is a circular definition; we will turn it into an iterative computation.

105 The hope. (Figure: hubs and authorities for the topic "long-distance telephone companies".)

106 Example for hubs and authorities. (Figure.)

107 High-level scheme. Extract from the web a base set of pages that could be good hubs or authorities. From these, identify a small set of top hub and authority pages via an iterative algorithm.

108 Base set. Given a text query (say browser), use a text index to get all pages containing browser; call this the root set of pages. Add in any page that either points to a page in the root set, or is pointed to by a page in the root set; call this the base set.

109 Visualization. (Figure: the root set contained inside the larger base set.)

110 Assembling the base set [Klei98]. The root set typically has 200-1000 nodes; the base set may have up to 5000 nodes. How do you find the base-set nodes? Follow out-links by parsing the root-set pages; get in-links (and out-links) from a connectivity server. (Actually, it suffices to text-index strings of the form href="URL" to get in-links to URL.)

111 Distilling hubs and authorities. Compute, for each page x in the base set, a hub score h(x) and an authority score a(x). Initialize: for all x, h(x) ← 1 and a(x) ← 1. Iteratively update all h(x), a(x). After the iterations, output the pages with the highest h() scores as top hubs and those with the highest a() scores as top authorities.

112 Illustrated Update Rules. a4 = h1 + h2 + h3: the authority score of page 4 is the sum of the hub scores of the pages 1, 2, 3 that point to it. h4 = a5 + a6 + a7: the hub score of page 4 is the sum of the authority scores of the pages 5, 6, 7 that it points to.

113 Iterative update. Repeat the following updates, for all x: h(x) ← Σ a(y) over all pages y that x points to; a(x) ← Σ h(y) over all pages y that point to x.
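The update rules above can be sketched directly. A toy base set is used here (the link list is hypothetical): pages 0 and 1 are hubs pointing at pages 2, 3, 4.

```python
# (source, target) pairs: 0 -> 2,3,4 and 1 -> 2,3.
links = [(0, 2), (0, 3), (0, 4), (1, 2), (1, 3)]
n = 5

h = [1.0] * n  # hub scores, initialized to 1
a = [1.0] * n  # authority scores, initialized to 1
for _ in range(50):
    new_a = [0.0] * n
    for x, y in links:
        new_a[y] += h[x]                 # a(y) sums h over y's in-links
    new_h = [0.0] * n
    for x, y in links:
        new_h[x] += new_a[y]             # h(x) sums a over x's out-links
    a = [v / sum(new_a) for v in new_a]  # normalize (scale is irrelevant)
    h = [v / sum(new_h) for v in new_h]

print(max(range(n), key=lambda i: a[i]))  # 2: a top authority
print(max(range(n), key=lambda i: h[i]))  # 0: the top hub (it points to one extra authority)
```

The normalization step anticipates the "Scaling" slide that follows: only the relative values of the scores matter.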

114 Scaling. To prevent the h() and a() values from getting too big, normalize after each iteration. The scaling factor doesn't really matter: we only care about the relative values of the scores.

115 Old Results. Authorities for query "Java": java.sun.com, comp.lang.java FAQ. Authorities for query "search engine": Yahoo.com, Excite.com, Lycos.com, Altavista.com. Authorities for query "Gates": Microsoft.com, roadahead.com.

116 Similar-Page Results from Hubs. Given "honda.com": toyota.com, ford.com, bmwusa.com, saturncars.com, nissanmotors.com, audi.com, volvocars.com.

117 Finding Similar Pages Using Link Structure. Given a page P, let R (the root set) be t (e.g., 200) pages that point to P. Grow a base set S from R. Run HITS on S. Return the best authorities in S as the best similar pages for P. This finds authorities in the "link neighborhood" of P.

118 HITS for Clustering. An ambiguous query can result in the principal eigenvector covering only one of the possible meanings; non-principal eigenvectors may contain hubs and authorities for the other meanings. Example, "jaguar": Atari video game (principal eigenvector), NFL football team (2nd non-principal eigenvector), automobile (3rd non-principal eigenvector).

119 Japan Elementary Schools. Hubs and authorities for a query on Japanese elementary schools; many of the page titles are in Japanese (rendered as garbled text in this transcript). Recognizable entries include: The American School in Japan, Kids' Space, KEIMEI GAKUEN Home Page (Japanese), fuzoku-es.fukui-u.ac.jp, welcome to Miasa E&J school, fukui haruyama-es HomePage, Torisu primary school, Yakumo Elementary (Hokkaido, Japan), FUZOKU Home Page, Kamishibun Elementary School, 100 Schools Home Pages (English), K-12 from Japan, TOYODA HOMEPAGE, and various schools' link pages.

120 Authorities for query [Chicago Bulls].
0.85  www.nba.com/bulls
0.25  www.essex1.com/people/jmiller/bulls.htm  "da Bulls"
0.20  www.nando.net/SportServer/basketball/nba/chi.html  "The Chicago Bulls"
0.15  Users.aol.com/rynocub/bulls.htm  "The Chicago Bulls Home Page"
0.13  www.geocities.com/Colosseum/6095  "Chicago Bulls"
(Ben Shaul et al., WWW8)

121 Hubs for query [Chicago Bulls]
1.62  www.geocities.com/Colosseum/1778  “Unbelieveabulls!!!!!”
1.24  www.webring.org/cgi-bin/webring?ring=chbulls  “Chicago Bulls”
0.74  www.geocities.com/Hollywood/Lot/3330/Bulls.html  “Chicago Bulls”
0.52  www.nobull.net/web_position/kw-search-15-M2.html  “Excite Search Results: bulls”
0.52  www.halcyon.com/wordltd/bball/bulls.html  “Chicago Bulls Links”
(Ben-Shaul et al., WWW8)

122 Things to note Pulled together good pages regardless of the language of the page content. Only link analysis is used once the base set is assembled; the iterative scoring itself is query-independent. The iterative computation after text-index retrieval adds significant overhead.

123 Proof of convergence n × n adjacency matrix A: each of the n pages in the base set has a row and a column in the matrix. Entry A_ij = 1 if page i links to page j, else 0. Example for a 3-page graph with links 1→2, 2→1, 2→2, 2→3, 3→1:

      1  2  3
  1   0  1  0
  2   1  1  1
  3   1  0  0
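Building this matrix from a link list is mechanical; a small Python sketch of the A_ij convention, using the slide's 3-page example graph:

```python
import numpy as np

# Links of the 3-page example: page 1 links to 2; page 2 links to
# 1, itself, and 3; page 3 links to 1.
edges = [(1, 2), (2, 1), (2, 2), (2, 3), (3, 1)]
n = 3
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i - 1, j - 1] = 1   # A_ij = 1 iff page i links to page j

# A == [[0, 1, 0],
#       [1, 1, 1],
#       [1, 0, 0]]
```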

124 Hub/authority vectors View the hub scores h(·) and the authority scores a(·) as vectors with n components. Recall the iterative updates: h(x) ← Σ_{y : x→y} a(y) and a(x) ← Σ_{y : y→x} h(y).

125 Rewrite in matrix form h = Aa and a = A^T h, where A^T is the transpose of A. Substituting, h = AA^T h and a = A^T A a. Thus h is an eigenvector of AA^T and a is an eigenvector of A^T A. Moreover, the iterative algorithm is exactly the power iteration method for computing these eigenvectors, so it is guaranteed to converge.
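A minimal numpy sketch of this power iteration (not from the lecture; L1 normalization is one common choice), applied to the 3-page example matrix from the convergence slide:

```python
import numpy as np

def hits(A, iters=100):
    """Power iteration for HITS.

    h converges to the principal eigenvector of A A^T and
    a to the principal eigenvector of A^T A (up to normalization;
    L1 normalization is used here)."""
    n = A.shape[0]
    h = np.full(n, 1.0 / n)   # uniform start
    for _ in range(iters):
        a = A.T @ h           # a = A^T h
        a /= a.sum()
        h = A @ a             # h = A a
        h /= h.sum()
    return h, a

# The 3-page example matrix (1->2; 2->1, 2->2, 2->3; 3->1).
A = np.array([[0, 1, 0],
              [1, 1, 1],
              [1, 0, 0]], dtype=float)
h, a = hits(A)
# Page 2 (index 1), which links to every page, gets the top hub
# score; pages 1 and 2, each with two in-links, tie as authorities.
```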

126 Example web graph

127 Raw matrix A for HITS

      d0  d1  d2  d3  d4  d5  d6
  d0   0   0   1   0   0   0   0
  d1   0   1   1   0   0   0   0
  d2   1   0   1   2   0   0   0
  d3   0   0   0   1   1   0   0
  d4   0   0   0   0   0   0   1
  d5   0   0   0   0   0   1   1
  d6   0   0   0   2   1   0   1

128 Hub vectors h0, hi = A·ai, i ≥ 1 (L1-normalized; repeated values were merged cells on the slide)

      h0    h1    h2    h3    h4    h5
  d0  0.14  0.06  0.04  0.04  0.03  0.03
  d1  0.14  0.08  0.05  0.04  0.04  0.04
  d2  0.14  0.28  0.32  0.33  0.33  0.33
  d3  0.14  0.14  0.17  0.18  0.18  0.18
  d4  0.14  0.06  0.04  0.04  0.04  0.04
  d5  0.14  0.08  0.05  0.04  0.04  0.04
  d6  0.14  0.30  0.33  0.34  0.35  0.35

129 Authority vectors ai = A^T·h(i−1), i ≥ 1 (L1-normalized; repeated values were merged cells on the slide)

      a1    a2    a3    a4    a5    a6    a7
  d0  0.06  0.09  0.10  0.10  0.10  0.10  0.10
  d1  0.06  0.03  0.01  0.01  0.01  0.01  0.01
  d2  0.19  0.14  0.13  0.12  0.12  0.12  0.12
  d3  0.31  0.43  0.46  0.46  0.47  0.47  0.47
  d4  0.13  0.14  0.16  0.16  0.16  0.16  0.16
  d5  0.06  0.03  0.02  0.01  0.01  0.01  0.01
  d6  0.19  0.14  0.13  0.13  0.13  0.13  0.13

130 Example web graph: final scores

       a     h
  d0  0.10  0.03
  d1  0.01  0.04
  d2  0.12  0.33
  d3  0.47  0.18
  d4  0.16  0.04
  d5  0.01  0.04
  d6  0.13  0.35
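These final scores can be reproduced with a few lines of power iteration on the raw matrix above (a numpy sketch; L1 normalization, matching the lecture's tables):

```python
import numpy as np

# Adjacency matrix of the example web graph (entries of 2 are double
# links: d2 and d6 each link twice to d3, as on the raw-matrix slide).
A = np.array([
    [0, 0, 1, 0, 0, 0, 0],  # d0
    [0, 1, 1, 0, 0, 0, 0],  # d1
    [1, 0, 1, 2, 0, 0, 0],  # d2
    [0, 0, 0, 1, 1, 0, 0],  # d3
    [0, 0, 0, 0, 0, 0, 1],  # d4
    [0, 0, 0, 0, 0, 1, 1],  # d5
    [0, 0, 0, 2, 1, 0, 1],  # d6
], dtype=float)

n = A.shape[0]
h = np.full(n, 1.0 / n)     # h0: uniform start
for _ in range(100):
    a = A.T @ h             # authority update: a_i = A^T h_{i-1}
    a /= a.sum()            # L1 normalization, as in the tables
    h = A @ a               # hub update: h_i = A a_i
    h /= h.sum()

# Converges to a ≈ (0.10, 0.01, 0.12, 0.47, 0.16, 0.01, 0.13) and
# h ≈ (0.03, 0.04, 0.33, 0.18, 0.04, 0.04, 0.35): d3 is the top
# authority, d6 the top hub (with d2 a close second).
```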

131 Top-ranked pages
 Pages with highest in-degree: d2, d3, d6
 Pages with highest out-degree: d2, d6
 Page with highest PageRank: d6
 Page with highest hub score: d6 (close: d2)
 Page with highest authority score: d3

132 PageRank vs. HITS: Discussion
 PageRank can be precomputed; HITS has to be computed at query time. HITS is too expensive in most application scenarios.
 PageRank and HITS make two different design choices concerning (i) the eigenproblem formalization and (ii) the set of pages to which the formalization is applied. These two choices are orthogonal: we could also apply HITS to the entire web and PageRank to a small base set.
 Claim: on the web, a good hub is almost always also a good authority. The actual difference between PageRank ranking and HITS ranking is therefore not as large as one might expect.

133 Issues
Topic drift: off-topic pages can cause off-topic “authorities” to be returned, e.g. when the neighborhood graph is about a “super topic”.
Mutually reinforcing affiliates: affiliated pages/sites can boost each other's scores, but linkage between affiliated pages is not a useful signal.

