Published by Alexandra Daniel. Modified over 9 years ago.
Introduction to Information Retrieval
Hinrich Schütze and Christina Lioma
Lecture 21: Link Analysis
The web as a directed graph
Assumption 1: A hyperlink is a quality signal.
The hyperlink d1 → d2 indicates that d1’s author deems d2 high-quality and relevant.
Assumption 2: The anchor text describes the content of d2.
We use “anchor text” somewhat loosely here to mean the text surrounding the hyperlink.
Example: “You can find cheap cars <a href=http://…>here</a>.”
Anchor text: “You can find cheap cars here”
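The loose notion of anchor text above (the link text plus the words around it) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `AnchorContextParser`, the function `anchor_contexts`, the `window` size of 5 words, and the `example.com` URL are all hypothetical and not part of the lecture.

```python
from html.parser import HTMLParser


class AnchorContextParser(HTMLParser):
    """Record every text token and where each <a href=...> link starts and ends."""

    def __init__(self):
        super().__init__()
        self.words = []    # all whitespace-separated tokens, in document order
        self.links = []    # (href, start_index, end_index) into self.words
        self._open = None  # (href, start_index) of the link currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, len(self.words))

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.links.append((href, start, len(self.words)))
            self._open = None

    def handle_data(self, data):
        self.words.extend(data.split())


def anchor_contexts(html, window=5):
    """Return (href, text) pairs: link text plus `window` tokens on each side."""
    parser = AnchorContextParser()
    parser.feed(html)
    parser.close()
    contexts = []
    for href, start, end in parser.links:
        lo = max(0, start - window)
        hi = min(len(parser.words), end + window)
        contexts.append((href, " ".join(parser.words[lo:hi])))
    return contexts


# Hypothetical page; the URL is a placeholder, not the one elided on the slide.
page = 'You can find cheap cars <a href="http://example.com/cars">here</a>.'
print(anchor_contexts(page))
```

A fixed token window is just one way to define "surrounding text"; a real indexer might instead use the enclosing sentence or block element.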
[text of d2] only vs. [text of d2] + [anchor text → d2]
Searching on [text of d2] + [anchor text → d2] is often more effective than searching on [text of d2] only.
Example: Query IBM
Matches IBM’s copyright page
Matches many spam pages
Matches IBM Wikipedia article
May not match IBM home page!
… if IBM home page is mostly graphics
Searching on [anchor text → d2] is better for the query IBM.
In this representation, the page with the most occurrences of IBM is www.ibm.com.
Anchor text containing IBM pointing to www.ibm.com
Indexing anchor text
Thus: Anchor text is often a better description of a page’s content than the page itself.
Anchor text can be weighted more highly than document text (based on Assumptions 1 and 2).
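As a rough sketch of what "weighting anchor text more highly" could look like, the Python snippet below scores a page for a query term by adding its body term frequency to a boosted anchor-text term frequency. The function name `weighted_tf`, the boost of 2.0, and the toy page texts are assumptions made for illustration; the lecture does not specify a weighting scheme.

```python
from collections import Counter


def weighted_tf(term, body_text, anchor_texts, anchor_weight=2.0):
    """Term frequency in the body plus a boosted term frequency over incoming anchor texts.

    anchor_weight is a hypothetical tuning parameter, not a value from the lecture.
    """
    term = term.lower()
    body_tf = Counter(body_text.lower().split())[term]
    anchor_tf = sum(Counter(a.lower().split())[term] for a in anchor_texts)
    return body_tf + anchor_weight * anchor_tf


# Toy data: a graphics-heavy home page with many incoming anchors,
# versus a text-heavy page that nobody links to with the term.
home_score = weighted_tf("IBM", "welcome", ["IBM", "IBM home page", "visit IBM"])
copyright_score = weighted_tf("IBM", "copyright IBM IBM IBM", [])
print(home_score, copyright_score)
```

With these toy inputs the home page outranks the copyright page even though its own text never mentions the term, which is the effect described in the IBM example.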
Google bombs
A Google bomb is a search with “bad” results due to maliciously manipulated anchor text.
Google introduced a new weighting function in January 2007 that fixed many Google bombs.
Still some remnants: [dangerous cult] on Google, Bing, Yahoo
Coordinated link creation by those who dislike the Church of Scientology
Defused Google bombs: [dumb motherf…], [who is a failure?], [evil empire]
Origins of PageRank: Citation analysis (1)
Citation analysis: analysis of citations in the scientific literature.
Example citation: “Miller (2001) has shown that physical activity alters the metabolism of estrogens.”
We can view “Miller (2001)” as a hyperlink linking two scientific articles.
One application of these “hyperlinks” in the scientific literature: Measure the similarity of two articles by the overlap of other articles citing them.
This is called cocitation similarity.
Cocitation similarity on the web: Google’s “find pages like this” or “Similar” feature.
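Cocitation similarity can be sketched as a set-overlap computation. The snippet below uses the Jaccard coefficient as the overlap measure, which is an assumption on my part (the slides only say "overlap"); the function name and the toy citation data are likewise hypothetical.

```python
def cocitation_similarity(citations, a, b):
    """Jaccard overlap of the sets of articles that cite a and that cite b.

    citations maps each citing article to the set of articles it cites.
    """
    citing_a = {src for src, targets in citations.items() if a in targets}
    citing_b = {src for src, targets in citations.items() if b in targets}
    union = citing_a | citing_b
    if not union:
        return 0.0  # neither article is cited at all
    return len(citing_a & citing_b) / len(union)


# Toy citation graph: p1 and p2 cite both A and B, p3 cites only A, p4 cites only C.
citations = {
    "p1": {"A", "B"},
    "p2": {"A", "B"},
    "p3": {"A"},
    "p4": {"C"},
}
sim_ab = cocitation_similarity(citations, "A", "B")
sim_ac = cocitation_similarity(citations, "A", "C")
print(sim_ab, sim_ac)
```

Here A and B share two of their three citing articles, while A and C share none, so A is judged much more similar to B than to C.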
Origins of PageRank: Citation analysis (2)
Another application: Citation frequency can be used to measure the impact of an article.
Simplest measure: each article gets one vote. This is not very accurate.
On the web: citation frequency = inlink count
A high inlink count does not necessarily mean high quality...
... mainly because of link spam.
Better measure: weighted citation frequency or citation rank
An article’s vote is weighted according to its citation impact.
Circular? No: can be formalized in a well-defined way.
Origins of PageRank: Citation analysis (3)
Better measure: weighted citation frequency or citation rank.
This is basically PageRank.
PageRank was invented in the context of citation analysis by Pinski and Narin in the 1970s.
Citation analysis is a big deal: The budget and salary of this lecturer are / will be determined by the impact of his publications!
Origins of PageRank: Summary
We can use the same formal representation for citations in the scientific literature and for hyperlinks on the web.
Appropriately weighted citation frequency is an excellent measure of quality...
... both for web pages and for scientific publications.
Next: PageRank algorithm for computing weighted citation frequency on the web.
Model behind PageRank: Random walk
Imagine a web surfer doing a random walk on the web.
Start at a random page.
At each step, go out of the current page along one of the links on that page, equiprobably.
In the steady state, each page has a long-term visit rate.
This long-term visit rate is the page’s PageRank.
PageRank = long-term visit rate = steady-state probability.
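The random-surfer model can be simulated directly. The following Python sketch walks a tiny hypothetical three-page graph and counts how often each page is visited; the graph, the function name, the step count, and the seed are all assumptions chosen for illustration. The graph has no dead ends, so the walk never gets stuck.

```python
import random


def simulate_visit_rates(links, steps=100_000, seed=0):
    """Estimate long-term visit rates by simulating a random walk over `links`.

    links maps each page to the list of pages it links to (no dead ends assumed).
    """
    rng = random.Random(seed)
    pages = sorted(links)
    current = rng.choice(pages)            # start at a random page
    visits = dict.fromkeys(pages, 0)
    for _ in range(steps):
        visits[current] += 1
        current = rng.choice(links[current])  # follow one outlink, equiprobably
    return {page: visits[page] / steps for page in pages}


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rates = simulate_visit_rates(links)
print(rates)
```

For this particular graph the exact steady-state probabilities work out to 0.4, 0.2 and 0.4 for a, b and c, so the simulated rates should land close to those values.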
Formalization of random walk: Markov chains
A Markov chain consists of $N$ states, plus an $N \times N$ transition probability matrix $P$.
state = page
At each step, we are on exactly one of the pages.
For $1 \le i, j \le N$, the matrix entry $P_{ij}$ tells us the probability of $j$ being the next page, given we are currently on page $i$.
Clearly, for all $i$, $\sum_{j=1}^{N} P_{ij} = 1$.
Example web graph
Link matrix for example
      d0  d1  d2  d3  d4  d5  d6
d0     0   0   1   0   0   0   0
d1     0   1   1   0   0   0   0
d2     1   0   1   1   0   0   0
d3     0   0   0   1   1   0   0
d4     0   0   0   0   0   0   1
d5     0   0   0   0   0   1   1
d6     0   0   0   1   1   0   1
Transition probability matrix P for example
      d0    d1    d2    d3    d4    d5    d6
d0    0.00  0.00  1.00  0.00  0.00  0.00  0.00
d1    0.00  0.50  0.50  0.00  0.00  0.00  0.00
d2    0.33  0.00  0.33  0.33  0.00  0.00  0.00
d3    0.00  0.00  0.00  0.50  0.50  0.00  0.00
d4    0.00  0.00  0.00  0.00  0.00  0.00  1.00
d5    0.00  0.00  0.00  0.00  0.00  0.50  0.50
d6    0.00  0.00  0.00  0.33  0.33  0.00  0.33
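The transition probability matrix for the example is the link matrix with each row divided by its number of outlinks. A minimal Python sketch of that normalization follows; the function name is hypothetical, and the code assumes no page is a dead end (teleporting has not been introduced at this point).

```python
def transition_matrix(link_matrix):
    """Row-normalize a 0/1 link matrix so that every row sums to 1."""
    P = []
    for row in link_matrix:
        outlinks = sum(row)
        if outlinks == 0:
            raise ValueError("dead end: page has no outgoing links")
        P.append([value / outlinks for value in row])
    return P


# Link matrix of the example web graph (rows and columns d0..d6).
L = [
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 0, 1],
]
P = transition_matrix(L)
for row in P:
    print(" ".join(f"{value:.2f}" for value in row))
```

Each printed row sums to 1 (up to rounding in the two-decimal display), which is the row-sum condition a transition probability matrix must satisfy.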
Long-term visit rate
Recall: PageRank = long-term visit rate.
Long-term visit rate of page d is the probability that a web surfer is at page d at a given point in time.
Next: what properties must hold of the web graph for the long-term visit rate to be well defined?
The web graph must correspond to an ergodic Markov chain.
First a special case: The web graph must not contain dead ends.
Dead ends
The web is full of dead ends.
Random walk can get stuck in dead ends.
If there are dead ends, long-term visit rates are not well-defined (or nonsensical).
Teleporting: to get us out of dead ends
At a dead end, jump to a random web page with prob. 1/N.
At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N).
With remaining probability (90%), go out on a random hyperlink.
For example, if the page has 4 outgoing links: randomly choose one with probability (1-0.10)/4 = 0.225
10% is a parameter, the teleportation rate.
Note: “jumping” from a dead end is independent of the teleportation rate.
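The teleporting rules translate into a transition matrix as in the Python sketch below. The function name and the five-page toy graph are assumptions for illustration; the 10% rate is the value used above. Note that the total probability of moving to a linked page is the link-following share plus that page's share of the teleport probability.

```python
def teleport_matrix(link_matrix, rate=0.10):
    """Build transition probabilities with teleporting from a 0/1 link matrix."""
    n = len(link_matrix)
    P = []
    for row in link_matrix:
        outlinks = sum(row)
        if outlinks == 0:
            # Dead end: jump to any page uniformly, regardless of the rate.
            P.append([1.0 / n] * n)
        else:
            # Teleport with probability `rate`, otherwise follow a random outlink.
            P.append([rate / n + (1 - rate) * value / outlinks for value in row])
    return P


# Hypothetical 5-page graph: page 0 has 4 outlinks, page 1 is a dead end.
links = [
    [0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0],
]
T = teleport_matrix(links)
print(T[0])  # each linked page: 0.225 from the link plus 0.1/5 from teleporting
print(T[1])  # dead end: uniform over all pages
```

Every entry of the resulting matrix is strictly positive, which is why teleporting removes dead ends and makes every page reachable from every other page.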
Result of teleporting
With teleporting, we cannot get stuck in a dead end.
But even without dead ends, a graph may not have well-defined long-term visit rates.
More generally, we require that the Markov chain be ergodic.
Ergodic Markov chains
A Markov chain is ergodic if it is irreducible and aperiodic.
Irreducibility. Roughly: there is a path from any page to any other page.
Aperiodicity. Roughly: The pages cannot be partitioned such that the random walker visits the partitions cyclically.
A non-ergodic Markov chain: [figure not reproduced]
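To see why periodicity is a problem, consider a two-state chain in which each state moves to the other with probability 1. This particular chain is my own illustrative assumption (the figure from the slide is not available here). The Python sketch below shows that the distribution keeps flipping and never settles.

```python
def step(x, P):
    """One step of the chain: returns the row vector x multiplied by P."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical periodic chain: state 0 always goes to 1, state 1 always goes to 0.
P_periodic = [[0.0, 1.0],
              [1.0, 0.0]]

x = [1.0, 0.0]  # start in state 0 with certainty
history = []
for _ in range(4):
    history.append(x)
    x = step(x, P_periodic)
print(history)
```

Because the distribution alternates between the two states forever, the probability of being in a given state at time t depends on where the walk started, so there is no limiting distribution to use as a visit rate.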
Ergodic Markov chains
Theorem: For any ergodic Markov chain, there is a unique long-term visit rate for each state.
This is the steady-state probability distribution.
Over a long time period, we visit each state in proportion to this rate.
It doesn’t matter where we start.
Teleporting makes the web graph ergodic.
Web-graph+teleporting has a steady-state probability distribution.
Each page in the web-graph+teleporting has a PageRank.
Formalization of “visit”: Probability vector
A probability (row) vector $x = (x_1, \ldots, x_N)$ tells us where the random walk is at any point.
Example: $x = (0, 0, 0, \ldots, 1, \ldots, 0, 0, 0)$, with entries indexed $1, 2, 3, \ldots, i, \ldots, N-2, N-1, N$ and the single 1 at position $i$.
More generally: the random walk is on page $i$ with probability $x_i$.
Example: $x = (0.05, 0.01, 0.0, \ldots, 0.2, \ldots, 0.01, 0.05, 0.03)$, with the same indexing and the entry 0.2 at position $i$.
$\sum_{i} x_i = 1$
Change in probability vector
If the probability vector is $x = (x_1, \ldots, x_N)$ at this step, what is it at the next step?
Recall that row $i$ of the transition probability matrix $P$ tells us where we go next from state $i$.
So from $x$, our next state is distributed as $xP$.
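The update from x to xP is an ordinary row-vector times matrix product. The Python sketch below spells it out for a small hypothetical three-state matrix; the function name and the matrix are assumptions for illustration. When x puts all its mass on state i, the result is exactly row i of P.

```python
def next_distribution(x, P):
    """Return xP: entry j is the sum over i of x[i] * P[i][j]."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical 3-state transition matrix (each row sums to 1).
P3 = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]

certain = next_distribution([1.0, 0.0, 0.0], P3)  # walk is on state 0 for sure
mixed = next_distribution([0.5, 0.5, 0.0], P3)    # walk is on state 0 or 1
print(certain, mixed)
```

The second call shows the general case: the next distribution is a mixture of the rows of P, weighted by how likely the walk is to be in each state now.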
Steady state in vector notation
The steady state in vector notation is simply a vector $\pi = (\pi_1, \pi_2, \ldots, \pi_N)$ of probabilities.
(We use $\pi$ to distinguish it from the notation for the probability vector $x$.)
$\pi_i$ is the long-term visit rate (or PageRank) of page $i$.
So we can think of PageRank as a very long vector, with one entry per page.
124
Steady-state distribution: Example
What is the PageRank / steady state in this example?
Transition probabilities: $P_{11} = 0.25$, $P_{12} = 0.75$, $P_{21} = 0.25$, $P_{22} = 0.75$

        x1 = Pt(d1)   x2 = Pt(d2)   next step xP
t0      0.25          0.75          (0.25, 0.75)
t1      0.25          0.75          (convergence)

PageRank vector = $\pi = (\pi_1, \pi_2) = (0.25, 0.75)$
$P_t(d_1) = P_{t-1}(d_1) \cdot P_{11} + P_{t-1}(d_2) \cdot P_{21}$
$P_t(d_2) = P_{t-1}(d_1) \cdot P_{12} + P_{t-1}(d_2) \cdot P_{22}$
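As a quick check, using only the transition probabilities stated on this slide, multiplying the claimed steady state by P should return it unchanged. The worked product below is a sketch of that arithmetic:

```latex
\pi P = (0.25,\ 0.75)
\begin{pmatrix} 0.25 & 0.75 \\ 0.25 & 0.75 \end{pmatrix}
= (0.25 \cdot 0.25 + 0.75 \cdot 0.25,\ \ 0.25 \cdot 0.75 + 0.75 \cdot 0.75)
= (0.25,\ 0.75) = \pi
```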
131
How do we compute the steady state vector?
In other words: how do we compute PageRank?
Recall: $\pi = (\pi_1, \pi_2, \ldots, \pi_N)$ is the PageRank vector, the vector of steady-state probabilities ...
… and if the distribution in this step is $x$, then the distribution in the next step is $xP$.
But $\pi$ is the steady state!
So: $\pi = \pi P$
Solving this matrix equation gives us $\pi$.
$\pi$ is the principal left eigenvector for $P$ …
… that is, $\pi$ is the left eigenvector with the largest eigenvalue.
All transition probability matrices have largest eigenvalue 1.
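For a chain with only two pages, the equation π = πP can be solved by hand: combining π_1 · P_12 = π_2 · P_21 with π_1 + π_2 = 1 gives π_1 = P_21 / (P_12 + P_21). The following is a minimal sketch of that special case in Python; the function names and the sample matrix are illustrative assumptions, not part of the lecture.

```python
def two_state_steady_state(P):
    """Solve pi = pi P for a 2x2 row-stochastic matrix P.

    Uses pi_1 * P12 = pi_2 * P21 together with pi_1 + pi_2 = 1.
    Assumes P12 + P21 > 0 (otherwise the steady state is not unique).
    """
    p12, p21 = P[0][1], P[1][0]
    if p12 + p21 == 0:
        raise ValueError("steady state is not unique when P12 + P21 = 0")
    pi1 = p21 / (p12 + p21)
    return (pi1, 1.0 - pi1)


def is_steady_state(pi, P, tol=1e-12):
    """Check that pi P equals pi in every component, up to tol."""
    n = len(pi)
    pi_next = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return all(abs(a - b) <= tol for a, b in zip(pi, pi_next))


# Illustrative 2x2 transition matrix (rows sum to 1).
P = [[0.1, 0.9],
     [0.3, 0.7]]
pi = two_state_steady_state(P)
print(pi)
print(is_steady_state(pi, P))
```

The closed form only covers two states; for larger matrices an iterative or general eigenvector method is needed.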
141
One way of computing the PageRank
Start with any distribution $x$, e.g., uniform distribution
After one step, we’re at $xP$.
After two steps, we’re at $xP^2$.
After $k$ steps, we’re at $xP^k$.
Algorithm: multiply $x$ by increasing powers of $P$ until convergence.
This is called the power method.
Recall: regardless of where we start, we eventually reach the steady state $\pi$.
Thus: we will eventually (in asymptotia) reach the steady state.
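The power method described above can be sketched in a few lines of Python. This is one possible reading, not the lecture's own code: the stopping rule (largest per-entry change below `tol`), the parameter names, and the sample matrix are all assumptions made for illustration.

```python
def step(x, P):
    """One step of the chain: return the row vector x P."""
    n = len(x)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]


def power_method(P, x=None, tol=1e-10, max_iter=10_000):
    """Repeat x <- x P until the largest change per entry drops below tol.

    tol and max_iter are illustrative stopping parameters, not from the slides.
    Returns (x, number_of_steps_taken).
    """
    n = len(P)
    if x is None:
        x = [1.0 / n] * n  # uniform start
    for k in range(1, max_iter + 1):
        x_next = step(x, P)
        if max(abs(a - b) for a, b in zip(x, x_next)) < tol:
            return x_next, k
        x = x_next
    raise RuntimeError("power method did not converge within max_iter steps")


# Hypothetical 2x2 transition matrix; its steady state is (1/3, 2/3).
P = [[0.50, 0.50],
     [0.25, 0.75]]
pi, steps = power_method(P)
print(pi, steps)
```

Multiplying the current vector by P once per step avoids ever forming the matrix powers P^k explicitly, which is what makes the method practical for large graphs.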
150
Power method: Example
What is the PageRank / steady state in this example?
152
Computing PageRank: Power Example
Transition probabilities: $P_{11} = 0.1$, $P_{12} = 0.9$, $P_{21} = 0.3$, $P_{22} = 0.7$

        x1 = Pt(d1)   x2 = Pt(d2)   next step
t0      0             1             (0.3, 0.7)        = xP
t1      0.3           0.7           (0.24, 0.76)      = xP^2
t2      0.24          0.76          (0.252, 0.748)    = xP^3
t3      0.252         0.748         (0.2496, 0.7504)  = xP^4
...
t∞      0.25          0.75          (0.25, 0.75)      = xP^∞

PageRank vector = $\pi = (\pi_1, \pi_2) = (0.25, 0.75)$
$P_t(d_1) = P_{t-1}(d_1) \cdot P_{11} + P_{t-1}(d_2) \cdot P_{21}$
$P_t(d_2) = P_{t-1}(d_1) \cdot P_{12} + P_{t-1}(d_2) \cdot P_{22}$
165
Power method: Example
What is the PageRank / steady state in this example?
The steady-state distribution (= the PageRanks) in this example is 0.25 for $d_1$ and 0.75 for $d_2$.
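To retrace the example step by step, the sketch below lists the successive vectors x, xP, xP^2, ... for the transition probabilities given on the slides (P11 = 0.1, P12 = 0.9, P21 = 0.3, P22 = 0.7), starting from x = (0, 1). The helper name `power_iterates` is an assumption for illustration.

```python
def power_iterates(P, x, steps):
    """Return [x, xP, xP^2, ..., xP^steps] as a list of row vectors."""
    n = len(x)
    rows = [list(x)]
    for _ in range(steps):
        x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        rows.append(x)
    return rows


P = [[0.1, 0.9],
     [0.3, 0.7]]
rows = power_iterates(P, [0.0, 1.0], 4)
for t, (x1, x2) in enumerate(rows):
    # Last column: signed distance of x1 from the steady-state value 0.25.
    print(f"t{t}: x1={x1:.4f} x2={x2:.4f} x1-0.25={x1 - 0.25:+.4f}")
```

In this particular example the distance from 0.25 shrinks by a factor of five at every step and alternates in sign, which is why four steps already agree with the steady state to three decimal places.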
166
Exercise: Compute PageRank using power method
168
Solution
Transition probabilities: $P_{11} = 0.7$, $P_{12} = 0.3$, $P_{21} = 0.2$, $P_{22} = 0.8$

        x1 = Pt(d1)   x2 = Pt(d2)   next step xP
t0      0             1             (0.2, 0.8)
t1      0.2           0.8           (0.3, 0.7)
t2      0.3           0.7           (0.35, 0.65)
t3      0.35          0.65          (0.375, 0.625)
...
t∞      0.4           0.6           (0.4, 0.6)

PageRank vector = $\pi = (\pi_1, \pi_2) = (0.4, 0.6)$
$P_t(d_1) = P_{t-1}(d_1) \cdot P_{11} + P_{t-1}(d_2) \cdot P_{21}$
$P_t(d_2) = P_{t-1}(d_1) \cdot P_{12} + P_{t-1}(d_2) \cdot P_{22}$
180
PageRank summary
Preprocessing
  Given graph of links, build matrix $P$
  Apply teleportation
  From modified matrix, compute $\pi$
  $\pi_i$ is the PageRank of page $i$.
Query processing
  Retrieve pages satisfying the query
  Rank them by their PageRank
  Return reranked list to the user
190
PageRank issues
Real surfers are not random surfers.
  Examples of nonrandom surfing: back button, short vs. long paths, bookmarks, directories, and search!
  → Markov model is not a good model of surfing.
  But it’s good enough as a model for our purposes.
Simple PageRank ranking (as described on previous slide) produces bad results for many pages.
  Consider the query [video service].
  The Yahoo home page (i) has a very high PageRank and (ii) contains both video and service.
  If we rank all Boolean hits according to PageRank, then the Yahoo home page would be top-ranked.
  Clearly not desirable.
200
PageRank issues
In practice: rank according to weighted combination of raw text match, anchor text match, PageRank & other factors.
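A weighted combination of signals might look like the sketch below. Everything in it is hypothetical: the page names, the signal values, and the weights are made up for illustration, and real systems tune such weights rather than fixing them by hand.

```python
def combined_score(features, weights):
    """Weighted sum of per-page ranking signals (all names are hypothetical)."""
    return sum(weights[name] * features[name] for name in weights)


# Made-up signal values in [0, 1] for the query [video service].
pages = {
    "portal-home": {"text_match": 0.2, "anchor_match": 0.1, "pagerank": 1.0},
    "video-service-site": {"text_match": 0.9, "anchor_match": 0.8, "pagerank": 0.3},
}
weights = {"text_match": 0.4, "anchor_match": 0.4, "pagerank": 0.2}

# Ranking by PageRank alone versus ranking by the weighted combination.
by_pagerank = sorted(pages, key=lambda p: pages[p]["pagerank"], reverse=True)
by_combined = sorted(pages, key=lambda p: combined_score(pages[p], weights), reverse=True)
print(by_pagerank)
print(by_combined)
```

With these made-up numbers, PageRank alone puts the high-PageRank portal first, while the combined score puts the page that actually matches the query first.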
202
Example web graph
203
Transition (probability) matrix

      d0    d1    d2    d3    d4    d5    d6
d0    0.00  0.00  1.00  0.00  0.00  0.00  0.00
d1    0.00  0.50  0.50  0.00  0.00  0.00  0.00
d2    0.33  0.00  0.33  0.33  0.00  0.00  0.00
d3    0.00  0.00  0.00  0.50  0.50  0.00  0.00
d4    0.00  0.00  0.00  0.00  0.00  0.00  1.00
d5    0.00  0.00  0.00  0.00  0.00  0.50  0.50
d6    0.00  0.00  0.00  0.33  0.33  0.00  0.33
205
Transition matrix with teleporting

      d0    d1    d2    d3    d4    d5    d6
d0    0.02  0.02  0.88  0.02  0.02  0.02  0.02
d1    0.02  0.45  0.45  0.02  0.02  0.02  0.02
d2    0.31  0.02  0.31  0.31  0.02  0.02  0.02
d3    0.02  0.02  0.02  0.45  0.45  0.02  0.02
d4    0.02  0.02  0.02  0.02  0.02  0.02  0.88
d5    0.02  0.02  0.02  0.02  0.02  0.45  0.45
d6    0.02  0.02  0.02  0.31  0.31  0.02  0.31
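The step from a link graph to a teleporting transition matrix can be sketched as follows. Two things here are assumptions rather than statements from the slides: the link lists are read off the nonzero entries of the transition matrix for d0 to d6, and the teleportation probability 0.14 is inferred from the matrix entries (0.02 = 0.14 / 7 and 0.88 = 0.86 + 0.02).

```python
def transition_matrix(links, n, alpha):
    """Row-stochastic matrix for a random surfer with teleportation.

    links maps each page index to the list of pages it links to.
    With probability alpha the surfer jumps to a uniformly random page;
    otherwise it follows a uniformly random outlink.
    A page with no outlinks always teleports.
    """
    P = []
    for i in range(n):
        out = links.get(i, [])
        if not out:
            P.append([1.0 / n] * n)
            continue
        row = [alpha / n] * n
        for j in out:
            row[j] += (1.0 - alpha) / len(out)
        P.append(row)
    return P


# Assumed link structure for pages d0..d6 (inferred, see the note above).
LINKS = {0: [2], 1: [1, 2], 2: [0, 2, 3], 3: [3, 4], 4: [6], 5: [5, 6], 6: [3, 4, 6]}
P = transition_matrix(LINKS, 7, alpha=0.14)
for row in P:
    print(" ".join(f"{v:.2f}" for v in row))
```

Every row still sums to 1 after teleportation is applied, so the result remains a valid transition probability matrix.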
207
Power method vectors xP^k
Columns of the table: x, xP^1, xP^2, ..., xP^13. Consecutive repeated values are collapsed in this transcript, so each row lists its distinct values in order, from the uniform start (0.14) to the final value.
d0: 0.14, 0.06, 0.09, 0.07, 0.06, 0.05
d1: 0.14, 0.08, 0.06, 0.04
d2: 0.14, 0.25, 0.18, 0.17, 0.15, 0.14, 0.13, 0.12, 0.11
d3: 0.14, 0.16, 0.23, 0.24, 0.25
d4: 0.14, 0.12, 0.16, 0.19, 0.20, 0.21
d5: 0.14, 0.08, 0.06, 0.04
d6: 0.14, 0.25, 0.23, 0.25, 0.27, 0.28, 0.29, 0.30, 0.31
209
Example web graph
PageRank
d0    0.05
d1    0.04
d2    0.11
d3    0.25
d4    0.21
d5    0.04
d6    0.31
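For larger graphs the dense matrix is usually never built. The sketch below runs the same power iteration directly over adjacency lists. As before, the link lists for d0 to d6 and the teleportation probability 0.14 are assumptions inferred from the matrices, and the stopping parameters are illustrative.

```python
def pagerank(links, n, alpha=0.14, tol=1e-12, max_iter=1000):
    """Power iteration over adjacency lists, without building the dense matrix.

    Assumes every page has at least one outlink (true for the graph below).
    alpha is the teleportation probability.
    """
    x = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [alpha / n] * n  # teleportation share (entries of x sum to 1)
        for i, out in links.items():
            share = (1.0 - alpha) * x[i] / len(out)
            for j in out:
                nxt[j] += share
        if max(abs(a - b) for a, b in zip(x, nxt)) < tol:
            return nxt
        x = nxt
    raise RuntimeError("no convergence within max_iter iterations")


# Assumed link structure for pages d0..d6 (inferred from the transition matrix).
LINKS = {0: [2], 1: [1, 2], 2: [0, 2, 3], 3: [3, 4], 4: [6], 5: [5, 6], 6: [3, 4, 6]}
ranks = pagerank(LINKS, 7)
for i, r in enumerate(ranks):
    print(f"d{i} {r:.2f}")
```

Under these assumptions the rounded scores should agree with the PageRank table for the example graph, with d6 ranked highest and d1 and d5 tied for lowest.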
210
How important is PageRank?
Frequent claim: PageRank is the most important component of web ranking.
The reality:
  There are several components that are at least as important: e.g., anchor text, phrases, proximity, tiered indexes ...
  Rumor has it that PageRank in its original form (as presented here) now has a negligible impact on ranking!
  However, variants of a page’s PageRank are still an essential part of ranking.
  Addressing link spam is difficult and crucial.
217
HITS: Hyperlink-Induced Topic Search
Premise: there are two different types of relevance on the web.
Relevance type 1: Hubs. A hub page is a good list of links to pages answering the information need.
  E.g., for query [chicago bulls]: Bob’s list of recommended resources on the Chicago Bulls sports team
Relevance type 2: Authorities. An authority page is a direct answer to the information need.
  The home page of the Chicago Bulls sports team
  By definition: Links to authority pages occur repeatedly on hub pages.
Most approaches to search (including PageRank ranking) don’t make the distinction between these two very different types of relevance.
225
Hubs and authorities: Definition
A good hub page for a topic links to many authority pages for that topic.
A good authority page for a topic is linked to by many hub pages for that topic.
Circular definition: we will turn this into an iterative computation.
229
Example for hubs and authorities
231
How to compute hub and authority scores
Do a regular web search first
Call the search result the root set
Find all pages that are linked to or link to pages in the root set
Call this larger set the base set
Finally, compute hubs and authorities for the base set (which we’ll view as a small web graph)
237
Root set and base set (1)
Figure builds, in order: the root set; nodes that root set nodes link to; nodes that link to root set nodes; the base set (the root set together with both groups of neighbors).
241
Root set and base set (2)
Root set typically has 200-1000 nodes.
Base set may have up to 5000 nodes.
Computation of base set, as shown on previous slide:
  Follow outlinks by parsing the pages in the root set
  Find d’s inlinks by searching for all pages containing a link to d
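The construction of the base set can be sketched as below. This is a toy version under stated assumptions: the whole link structure is available as an in-memory dictionary, and scanning that dictionary stands in for the inlink search a real engine would run against its index. The function name and the sample graph are hypothetical.

```python
def build_base_set(root_set, outlinks):
    """Root set plus every page that links to, or is linked from, a root page.

    outlinks maps each known page to the pages it links to. Scanning it for
    inlinks stands in for "searching for all pages containing a link to d".
    """
    root = set(root_set)
    base = set(root)
    for page in root:
        base.update(outlinks.get(page, ()))   # pages the root set links to
    for page, targets in outlinks.items():
        if root.intersection(targets):        # page links into the root set
            base.add(page)
    return base


# Hypothetical toy graph: "e" is two hops away from the root and stays out.
outlinks = {"a": ["b"], "b": ["c"], "c": [], "d": ["b"], "e": ["a"]}
print(sorted(build_base_set({"b"}, outlinks)))
```

Only direct neighbors of the root set are added, which is what keeps the base set small enough to treat as its own web graph.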
247
Introduction to Information Retrieval 247 Hub and authority scores
248
Introduction to Information Retrieval 248 Hub and authority scores Compute for each page d in the base set a hub score h(d) and an authority score a(d)
249
Introduction to Information Retrieval 249 Hub and authority scores Compute for each page d in the base set a hub score h(d) and an authority score a(d) Initialization: for all d: h(d) = 1, a(d) = 1
250
Introduction to Information Retrieval 250 Hub and authority scores Compute for each page d in the base set a hub score h(d) and an authority score a(d) Initialization: for all d: h(d) = 1, a(d) = 1 Iteratively update all h(d), a(d)
251
Introduction to Information Retrieval 251 Hub and authority scores Compute for each page d in the base set a hub score h(d) and an authority score a(d) Initialization: for all d: h(d) = 1, a(d) = 1 Iteratively update all h(d), a(d) After convergence:
252
Introduction to Information Retrieval 252 Hub and authority scores Compute for each page d in the base set a hub score h(d) and an authority score a(d) Initialization: for all d: h(d) = 1, a(d) = 1 Iteratively update all h(d), a(d) After convergence: Output pages with highest h scores as top hubs
253
Introduction to Information Retrieval 253 Hub and authority scores Compute for each page d in the base set a hub score h(d) and an authority score a(d) Initialization: for all d: h(d) = 1, a(d) = 1 Iteratively update all h(d), a(d) After convergence: Output pages with highest h scores as top hubs Output pages with highest a scores as top authrities
254
Introduction to Information Retrieval 254 Hub and authority scores
Compute for each page d in the base set a hub score h(d) and an authority score a(d)
Initialization: for all d: h(d) = 1, a(d) = 1
Iteratively update all h(d), a(d)
After convergence:
  Output pages with highest h scores as top hubs
  Output pages with highest a scores as top authorities
So we output two ranked lists
258
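Introduction to Information Retrieval 258 Iterative update
For all d: $h(d) = \sum_{y:\, d \to y} a(y)$ (sum the authority scores of the pages that d links to)
For all d: $a(d) = \sum_{y:\, y \to d} h(y)$ (sum the hub scores of the pages that link to d)
Iterate these two steps until convergence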
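A minimal Python sketch of the two update steps, assuming the usual HITS rule that a hub score sums the authority scores of the pages it links to and an authority score sums the hub scores of the pages linking to it. The function name `hits`, the fixed iteration count (used here in place of a convergence test) and the toy link list are assumptions made for illustration.

```python
def hits(links, iterations=20):
    """links: list of (source, target) pairs; returns (hub, authority) dicts."""
    pages = sorted({p for edge in links for p in edge})
    h = {d: 1.0 for d in pages}  # initialization: h(d) = 1
    a = {d: 1.0 for d in pages}  # initialization: a(d) = 1
    for _ in range(iterations):
        # h(d) = sum of a(y) over all pages y that d links to
        new_h = {d: 0.0 for d in pages}
        for src, dst in links:
            new_h[src] += a[dst]
        # a(d) = sum of h(y) over all pages y that link to d
        new_a = {d: 0.0 for d in pages}
        for src, dst in links:
            new_a[dst] += new_h[src]
        # scale so that each score vector sums to 1 (keeps values bounded)
        h_total, a_total = sum(new_h.values()), sum(new_a.values())
        h = {d: v / h_total for d, v in new_h.items()}
        a = {d: v / a_total for d, v in new_a.items()}
    return h, a


# Hypothetical toy graph: p3 is linked to by three pages, p1 links to two.
links = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p4", "p3")]
hub, auth = hits(links)
print("top hub:", max(hub, key=hub.get))
print("top authority:", max(auth, key=auth.get))
```

Sorting the two dictionaries by value gives the two ranked lists, one for hubs and one for authorities.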
264
Introduction to Information Retrieval 264 Details
Scaling
  To prevent the a() and h() values from getting too big, we can scale them down after each iteration.
  The scaling factor doesn’t really matter.
  We care about the relative (as opposed to absolute) values of the scores.
In most cases, the algorithm converges after a few iterations.
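One way to see why the scaling factor is immaterial: the update sums are linear. In the short argument below, $c > 0$ is an arbitrary constant introduced here for illustration, and $d \to y$ means that page d links to page y. Multiplying every authority score by $c$ multiplies every hub score by the same $c$:

```latex
h'(d) \;=\; \sum_{y:\, d \to y} c\,a(y) \;=\; c \sum_{y:\, d \to y} a(y) \;=\; c\,h(d)
```

The ordering of the pages by hub score is therefore unchanged, and the same argument applies to the authority update.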
266
Introduction to Information Retrieval 266 Authorities for query [Chicago Bulls]
0.85  www.nba.com/bulls
0.25  www.essex1.com/people/jmiller/bulls.htm  “da Bulls”
0.20  www.nando.net/SportServer/basketball/nba/chi.html  “The Chicago Bulls”
0.15  Users.aol.com/rynocub/bulls.htm  “The Chicago Bulls Home Page”
0.13  www.geocities.com/Colosseum/6095  “Chicago Bulls”
(Ben Shaul et al, WWW8)
267
Introduction to Information Retrieval 267 The authority page for [Chicago Bulls]
270
Introduction to Information Retrieval 270 Hubs for query [Chicago Bulls]
1.62  www.geocities.com/Colosseum/1778  “Unbelieveabulls!!!!!”
1.24  www.webring.org/cgi-bin/webring?ring=chbulls  “Chicago Bulls”
0.74  www.geocities.com/Hollywood/Lot/3330/Bulls.html  “Chicago Bulls”
0.52  www.nobull.net/web_position/kw-search-15-M2.html  “Excite Search Results: bulls”
0.52  www.halcyon.com/wordltd/bball/bulls.html  “Chicago Bulls Links”
(Ben Shaul et al, WWW8)
271
Introduction to Information Retrieval 271 A hub page for [Chicago Bulls]
279
Introduction to Information Retrieval 279 Hub & Authorities: Comments
HITS can pull together good pages regardless of page content.
Once the base set is assembled, we only do link analysis, no text matching.
Pages in the base set often do not contain any of the query words.
In theory, an English query can retrieve Japanese-language pages!
  If supported by the link structures between English and Japanese pages!
Danger: topic drift – the pages found by following links may not be related to the original query.
284
Introduction to Information Retrieval 284 Proof of convergence
We define an N×N adjacency matrix A. (We called this the link matrix earlier.)
For 1 ≤ i, j ≤ N, the matrix entry A_ij tells us whether there is a link from page i to page j (A_ij = 1) or not (A_ij = 0).
Example:
      d1  d2  d3
  d1   0   1   0
  d2   1   1   1
  d3   1   0   0
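As a small illustration, the example matrix can be built from a list of links in Python. The edge list below is read off the example matrix (row = source page, column = target page), with d1, d2, d3 mapped to indices 0, 1, 2; the function name `adjacency_matrix` is this sketch's own.

```python
def adjacency_matrix(n, links):
    """Build an n x n 0/1 matrix with A[i][j] = 1 iff page i links to page j."""
    A = [[0] * n for _ in range(n)]
    for i, j in links:
        A[i][j] = 1
    return A


# d1 -> d2; d2 -> d1, d2, d3; d3 -> d1 (indices shifted down by one)
links = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 0)]
A = adjacency_matrix(3, links)
for row in A:
    print(row)
```

Row i lists the outlinks of page i and column j lists the inlinks of page j, which is why the hub update uses A and the authority update uses its transpose.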
293
Introduction to Information Retrieval 293 Write update rules as matrix operations
Define the hub vector h = (h_1, ..., h_N) as the vector of hub scores. h_i is the hub score of page d_i.
Similarly for a, the vector of authority scores.
Now we can write $h(d) = \sum_{y:\, d \to y} a(y)$ as a matrix operation: $h = A a$ ...
... and we can write $a(d) = \sum_{y:\, y \to d} h(y)$ as $a = A^{T} h$
HITS algorithm in matrix notation:
  Compute $h = A a$
  Compute $a = A^{T} h$
  Iterate until convergence
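To make the matrix form concrete, here is one update step in plain Python on the three-page example matrix from the adjacency-matrix slide. The helper names `matvec` and `transpose` are this sketch's own, and no scaling is applied so that the raw products are visible.

```python
def matvec(M, v):
    """Matrix-vector product M v for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    """Transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*M)]


A = [[0, 1, 0],
     [1, 1, 1],
     [1, 0, 0]]

a = [1.0, 1.0, 1.0]          # initial authority scores
h = matvec(A, a)             # h = A a: each hub sums the authorities it links to
a = matvec(transpose(A), h)  # a = A^T h: each authority sums the hubs linking to it
print(h)
print(a)
```

After this first step the hub scores equal the out-degrees of the pages, because all authority scores start at 1.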
298
Introduction to Information Retrieval 298 HITS as eigenvector problem
HITS algorithm in matrix notation. Iterate:
  Compute $h = A a$
  Compute $a = A^{T} h$
By substitution we get: $h = A A^{T} h$ and $a = A^{T} A a$ (up to the scaling factor applied in each iteration)
Thus, h is an eigenvector of $A A^{T}$ and a is an eigenvector of $A^{T} A$.
So the HITS algorithm is actually a special case of the power method, and the hub and authority scores are the entries of these eigenvectors.
HITS and PageRank both formalize link analysis as eigenvector problems.
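The substitution step can be written out with the scaling made explicit. This is a sketch under stated assumptions: the constants $c_h$ and $c_a$ (the scaling factors applied after the hub and authority updates) and the symbol $\lambda$ are notation introduced here, not on the slides. At convergence h and a no longer change between iterations, so:

```latex
\begin{aligned}
h &= c_h\,A a, \qquad a = c_a\,A^{T} h \\
\Rightarrow\quad h &= c_h c_a\,A A^{T} h, \qquad a = c_a c_h\,A^{T} A\,a \\
\Rightarrow\quad A A^{T} h &= \lambda h, \qquad A^{T} A\,a = \lambda a, \qquad \lambda = \frac{1}{c_h c_a}
\end{aligned}
```

Both vectors belong to the same eigenvalue $\lambda$, and repeatedly applying the matrix and rescaling is exactly one step of the power method.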
299
Introduction to Information Retrieval Example web graph
300
Introduction to Information Retrieval Raw matrix A for HITS
      d0  d1  d2  d3  d4  d5  d6
  d0   0   0   1   0   0   0   0
  d1   0   1   1   0   0   0   0
  d2   1   0   1   2   0   0   0
  d3   0   0   0   1   1   0   0
  d4   0   0   0   0   0   0   1
  d5   0   0   0   0   0   1   1
  d6   0   0   0   2   1   0   1
301
Introduction to Information Retrieval Hub vectors $h_0$, $h_i = A a_i$, i ≥ 1
       h0    h1    h2    h3    h4    h5
  d0  0.14  0.06  0.04  0.04  0.03  0.03
  d1  0.14  0.08  0.05  0.04  0.04  0.04
  d2  0.14  0.28  0.32  0.33  0.33  0.33
  d3  0.14  0.14  0.17  0.18  0.18  0.18
  d4  0.14  0.06  0.04  0.04  0.04  0.04
  d5  0.14  0.08  0.05  0.04  0.04  0.04
  d6  0.14  0.30  0.33  0.34  0.35  0.35
302
Introduction to Information Retrieval Authority vectors $a_i = A^{T} h_{i-1}$, i ≥ 1
       a1    a2    a3    a4    a5    a6    a7
  d0  0.06  0.09  0.10  0.10  0.10  0.10  0.10
  d1  0.06  0.03  0.01  0.01  0.01  0.01  0.01
  d2  0.19  0.14  0.13  0.12  0.12  0.12  0.12
  d3  0.31  0.43  0.46  0.46  0.46  0.47  0.47
  d4  0.13  0.14  0.16  0.16  0.16  0.16  0.16
  d5  0.06  0.03  0.02  0.01  0.01  0.01  0.01
  d6  0.19  0.14  0.13  0.13  0.13  0.13  0.13
303
Introduction to Information Retrieval 303 Example web graph
        a     h
  d0  0.10  0.03
  d1  0.01  0.04
  d2  0.12  0.33
  d3  0.47  0.18
  d4  0.16  0.04
  d5  0.01  0.04
  d6  0.13  0.35
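The example can be recomputed with a short Python sketch that applies the matrix updates to the raw matrix A. Two assumptions are made here: each vector is scaled to sum to 1 after every update (consistent with the uniform start value 0.14 ≈ 1/7), and the names `normalize`, `hits_matrix` and `steps` are this sketch's own.

```python
A = [
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 2, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 2, 1, 0, 1],
]


def normalize(v):
    """Scale a vector so that its entries sum to 1."""
    total = sum(v)
    return [x / total for x in v]


def hits_matrix(A, steps):
    """Run `steps` rounds of a_i = A^T h_{i-1}, h_i = A a_i, starting from uniform h_0."""
    n = len(A)
    h = normalize([1.0] * n)
    a = h
    for _ in range(steps):
        a = normalize([sum(A[i][j] * h[i] for i in range(n)) for j in range(n)])
        h = normalize([sum(A[i][j] * a[j] for j in range(n)) for i in range(n)])
    return h, a


h, a = hits_matrix(A, steps=5)
print("h:", [round(x, 2) for x in h])
print("a:", [round(x, 2) for x in a])
```

With `steps=5` the rounded hub scores agree with the h column above. The authority score of d3 is still just below the rounding boundary at this point and reaches 0.47 only after further iterations.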
309
Introduction to Information Retrieval Top-ranked pages
Pages with highest in-degree: d2, d3, d6
Pages with highest out-degree: d2, d6
Pages with highest PageRank: d6
Pages with highest hub score: d6 (close: d2)
Pages with highest authority score: d3
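The degree-based rankings can be checked directly against the raw matrix A. The sketch below assumes that a degree counts distinct linked pages, so an entry of 2 counts once and self-links are included; under that assumption it reproduces the in-degree and out-degree lists above. The helper name `top` is this sketch's own.

```python
A = [
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 2, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 2, 1, 0, 1],
]
n = len(A)

# Count nonzero entries: rows give outlinks, columns give inlinks.
out_degree = [sum(1 for x in row if x > 0) for row in A]
in_degree = [sum(1 for i in range(n) if A[i][j] > 0) for j in range(n)]


def top(scores):
    """Return the names of all pages that share the maximum score."""
    best = max(scores)
    return [f"d{i}" for i, s in enumerate(scores) if s == best]


print("highest in-degree:", top(in_degree))
print("highest out-degree:", top(out_degree))
```

Degree counts treat every link equally, whereas the hub and authority scores weight each link by the score of the page at its other end, which is why d3 alone comes out as the top authority.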
317
Introduction to Information Retrieval 317 PageRank vs. HITS: Discussion
PageRank can be precomputed; HITS has to be computed at query time.
  HITS is too expensive in most application scenarios.
PageRank and HITS make two different design choices concerning (i) the eigenproblem formalization and (ii) the set of pages to apply the formalization to.
  These two are orthogonal.
  We could also apply HITS to the entire web and PageRank to a small base set.
Claim: On the web, a good hub almost always is also a good authority.
The actual difference between PageRank ranking and HITS ranking is therefore not as large as one might expect.