Download presentation
Presentation is loading. Please wait.
Published byHarvey Richardson Modified over 9 years ago
2
1 Page Link Analysis and Anchor Text for Web Search Lecture 9 Many slides based on lectures by Chen Li (UCI) an Raymond Mooney (UTexas)
3
2 PageRank Intuition: –The importance of each page should be decided by what other pages “say” about this page –One naïve implementation: count the # of pages pointing to each page (i.e., # of inlinks) Problem: –We can easily fool this technique by generating many dummy pages that point to our class page
4
3 Initial PageRank Idea Just measuring in-degree (citation count) doesn’t account for the authority of the source of a link. Initial page rank equation for page p: –N q is the total number of out-links from page q. –A page, q, “gives” an equal fraction of its authority to all the pages it points to (e.g. p). –c is a normalizing constant set so that the rank of all pages always sums to 1.
5
4 Initial PageRank Idea (cont.) Can view it as a process of PageRank “flowing” from pages to the pages they cite..1.09.05.03.08.03
6
5 Initial Algorithm Iterate rank-flowing process until convergence: Let S be the total set of pages. Initialize p S: R(p) = 1/|S| Until ranks do not change (much) (convergence) For each p S: For each p S: R(p) = cR´(p) (normalize)
7
6 Sample Stable Fixpoint 0.4 0.2 0.4
8
7 Linear Algebra Version Treat R as a vector over web pages. Let A be a 2-d matrix over pages where –A vu = 1/N u if u v else A vu = 0 Then R=cAR R converges to the principal eigenvector of A.
9
8 Example: MiniWeb Our “MiniWeb” has only three web sites: Netscape, Amazon, and Microsoft. Their weights are represented as a vector Ne Am MS For instance, in each iteration, half of the weight of AM goes to NE, and half goes to MS. Materials by courtesy of Jeff Ullman
10
9 Iterative computation Ne Am MS Final result: Netscape and Amazon have the same importance, and twice the importance of Microsoft. Does it capture the intuition? Yes.
11
10 Observations We cannot get absolute weights: –We can only know (and we are only interested in) those relative weights of the pages The matrix is stochastic (sum of each column is 1). So the iterations converge, and compute the principal eigenvector of the following matrix equation:
12
11 Problem 1 of algorithm: dead ends! Ne Am MS MS does not point to anybody Result: weights of the Web “leak out”
13
12 Problem 2 of algorithm: spider traps Ne Am MS MS only points to itself Result: all weights go to MS!
14
13 Google’s solution: “tax each page” Like people paying taxes, each page pays some weight into a public pool, which will be distributed to all pages. Example: assume 20% tax rate in the “spider trap” example.
15
14 The War of Search Engines More companies are realizing the importance of search engines More competitors in the market: Ask.com, Microsoft, Yahoo!, etc.
16
15 HITS Algorithm developed by Kleinberg in 1998. Attempts to computationally determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web. Based on mutually recursive facts: –Hubs point to lots of authorities. –Authorities are pointed to by lots of hubs.
17
16 Hubs and Authorities Motivation: find web pages to a topic –E.g.: “find all web sites about automobiles” “Authority”: a page that offers info about a topic –E.g.: DBLP is a page about papers –E.g.: google.com, aj.com, teoma.com, lycos.com “Hub”: a page that doesn’t provide much info, but tell us where to find pages about a topic –E.g.: www.searchenginewatch.com is a hub of search engineswww.searchenginewatch.com –http://www.ics.uci.edu/~ics214a/ points to many biblio- search engineshttp://www.ics.uci.edu/~ics214a/
18
17 The hope Long distance telephone companies Hubs Authorities
19
18 Base set Given text query (say browser), use a text index to get all pages containing browser. –Call this the root set of pages. Add in any page that either –points to a page in the root set, or –is pointed to by a page in the root set. Call this the base set.
20
19 Visualization Root set Base set
21
20 Assembling the base set Root set typically 200-1000 nodes. Base set may have up to 5000 nodes. How do you find the base set nodes? –Follow out-links by parsing root set pages. –Get in-links (and out-links) from a connectivity server. –(Actually, suffices to text-index strings of the form href=“URL” to get in-links to URL.)
22
21 Distilling hubs and authorities Compute, for each page x in the base set, a hub score h(x) and an authority score a(x). Initialize: for all x, h(x) 1; a(x) 1; Iteratively update all h(x), a(x); After iterations –output pages with highest h() scores as top hubs – highest a() scores as top authorities. Key
23
22 Iterative update Repeat the following updates, for all x: x x
24
23 Scaling To prevent the h() and a() values from getting too big, can scale down after each iteration. Scaling factor doesn’t really matter: –we only care about the relative values of the scores.
25
24 How many iterations? Claim: relative values of scores will converge after a few iterations: –in fact, suitably scaled, h() and a() scores settle into a steady state! –proof of this comes later. We only require the relative orders of the h() and a() scores - not their absolute values. In practice, ~5 iterations get you close to stability.
26
25 Iterative Algorithm Use an iterative algorithm to slowly converge on a mutually reinforcing set of hubs and authorities. Maintain for each page p S: –Authority score: a p (vector a) –Hub score: h p (vector h) Initialize all a p = h p = 1 Maintain normalized scores:
27
26 HITS Update Rules Authorities are pointed to by lots of good hubs: Hubs point to lots of good authorities:
28
27 Illustrated Update Rules 2 3 a 4 = h 1 + h 2 + h 3 1 5 7 6 4 4 h 4 = a 5 + a 6 + a 7
29
28 HITS Iterative Algorithm Initialize for all p S: a p = h p = 1 For i = 1 to k: For all p S: (update auth. scores) For all p S: (update hub scores) For all p S: a p = a p /c c: For all p S: h p = h p /c c: (normalize a) (normalize h)
30
29 Example: MiniWeb Ne Am MS Normalization! Therefore:
31
30 Example: MiniWeb Ne Am MS
32
31 Convergence Algorithm converges to a fix-point if iterated indefinitely. Define A to be the adjacency matrix for the subgraph defined by S. –A ij = 1 for i S, j S iff i j Authority vector, a, converges to the principal eigenvector of A T A Hub vector, h, converges to the principal eigenvector of AA T In practice, 20 iterations produces fairly stable results.
33
32 Results Authorities for query: “Java” –java.sun.com –comp.lang.java FAQ Authorities for query “search engine” –Yahoo.com –Excite.com –Lycos.com –Altavista.com Authorities for query “Gates” –Microsoft.com –roadahead.com
34
33 Result Comments In most cases, the final authorities were not in the initial root set generated using Altavista. Authorities were brought in from linked and reverse-linked pages and then HITS computed their high authority score.
35
34 Comparison Pagerank Pros –Hard to spam –Computes quality signal for all pages Cons –Non-trivial to compute –Not query specific –Doesn’t work on small graphs Proven to be effective for general purpose ranking HITS & Variants Pros –Easy to compute, real-time execution is hard [Bhar98b, Stat00] –Query specific – Works on small graphs Cons –Local graph structure can be manufactured (spam!) –Provides a signal only when there’s direct connectivity (e.g., home pages) Well suited for supervised directory construction
36
35 Tag/position heuristics Increase weights of terms –in titles –in tags –near the beginning of the doc, its chapters and sections
37
36 Anchor text ( first used WWW Worm - McBryan [Mcbr94] ) Here is a great picture of a tiger Tiger image Cool tiger webpage The text in the vicinity of a hyperlink is descriptive of the page it points to.
38
37 Two uses of anchor text When indexing a page, also index the anchor text of links pointing to it. –Retrieve a page when query matches its anchor text. To weight links in the hubs/authorities algorithm. Anchor text usually taken to be a window of 6-8 words around a link anchor.
39
38 Indexing anchor text When indexing a document D, include anchor text from links pointing to D. www.ibm.com Armonk, NY-based computer giant IBM announced today Joe’s computer hardware links Compaq HP IBM Big Blue today announced record profits for the quarter
40
39 Indexing anchor text Can sometimes have unexpected side effects - e.g., evil empire. Can index anchor text with less weight.
41
40 Weighting links In hub/authority link analysis, can match anchor text to query, then weight link.
42
41 Weighting links What is w(x,y)? Should increase with the number of query terms in anchor text. –E.g.: 1+ number of query terms. www.ibm.com Armonk, NY-based computer giant IBM announced today xy Weight of this link for query computer is 2.
43
42 Weighted hub/authority computation Recall basic algorithm: –Iteratively update all h(x), a(x); –After iteration, output pages with highest h() scores as top hubs highest a() scores as top authorities. Now use weights in iteration. Raises scores of pages with “heavy” links. Do we still have convergence of scores? To what?
44
43 Anchor Text Other applications –Weighting/filtering links in the graph HITS [Chak98], Hilltop [Bhar01] –Generating page descriptions from anchor text [Amit98, Amit00]
45
Behavior-based ranking
46
For each query Q, keep track of which docs in the results are clicked on On subsequent requests for Q, re-order docs in results based on click-throughs First due to DirectHit AskJeeves Relevance assessment based on –Behavior/usage –vs. content
47
Query-doc popularity matrix B Queries Docs q j B qj = number of times doc j clicked-through on query q When query q issued again, order docs by B qj values.
48
Issues to consider Weighing/combining text- and click-based scores. What identifies a query? –Ferrari Mondial –Ferrari mondial –ferrari mondial –“Ferrari Mondial” Can use heuristics, but search parsing slowed.
49
Vector space implementation Maintain a term-doc popularity matrix C –as opposed to query-doc popularity –initialized to all zeros Each column represents a doc j –If doc j clicked on for query q, update C j C j + q (here q is viewed as a vector). On a query q’, compute its cosine proximity to C j for all j. Combine this with the regular text score.
50
Issues Normalization of C j after updating Assumption of query compositionality –“white house” document popularity derived from “white” and “house” Updating - live or batch?
51
Basic Assumption Relevance can be directly measured by number of click throughs Valid?
52
Validity of Basic Assumption Click through to docs that turn out to be non-relevant: what does a click mean? Self-perpetuating ranking Spam All votes count the same
53
Variants Time spent viewing page –Difficult session management –Inconclusive modeling so far Does user back out of page? Does user stop searching? Does user transact?
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.