Link Analysis. Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney (UT, Austin) http://www.ams.org/featurecolumn/archive/pagerank.html http://pagerank.suchmaschinen-doktor.de/index/examples.html http://pr.efactory.de/e-pagerank-algorithm.shtml Notes: 25 billion web documents – 95% of docs made out of 10K words. PageRank: the importance of a page is judged by the number of pages linking to it as well as their importance. Prasad L21LinkAnalysis
Evolution of Search Engines 1st Generation : Retrieved documents that matched keyword-based queries based on boolean model. 2nd Generation : Incorporated content-specific relevance ranking based on vector space model (TF-IDF), to deal with high recall. 3rd Generation: Incorporated content-independent source ranking, to overcome spamming, and to exploit “collective web wisdom”. Prasad L21LinkAnalysis
(cont’d) 3rd Generation: Tried to glean relative semantic emphasis of various words based on syntactic features such as fonts, span of query term hits, etc. to enhance the efficacy of VSM. Future search engines will incorporate context, profile, and past query history associated with a user, to personalize search, and apply additional reasoning and heuristics to improve satisfaction of information need. Prasad L21LinkAnalysis
Meta-Search Engines Search engine that passes query to several other search engines and integrates results. Submit queries to host sites. Parse resulting HTML pages to extract search results. Integrate multiple rankings into a “consensus” ranking. Present integrated results to user. Examples: Metacrawler, SavvySearch ,Dogpile Lot of work on rank aggregation
HTML Structure & Feature Weighting Weight tokens under particular HTML tags more heavily: <TITLE> tokens (Google seems to like title matches) <H1>,<H2>… tokens <META> keyword tokens Parse page into conceptual sections (e.g. navigation links vs. page content) and weight tokens differently based on section. BASIS: Indicative of semantic relevance
Bibliometrics: Citation Analysis Many standard documents include bibliographies (or references), explicit citations to other previously published documents. Using citations as links, standard corpora can be viewed as a graph. The structure of this graph, independent of the content, can provide interesting information about the similarity of documents and the structure of information. Relevance to clustering more than ranking
Bibliographic Coupling: Similarity Measure Measure of similarity of documents introduced by Kessler in 1963. The bibliographic coupling of two documents A and B is the number of documents cited by both A and B, i.e., the size of the intersection of their bibliographies. Maybe want to normalize by size of bibliographies?
Co-Citation: Similarity Measure An alternate citation-based measure of similarity introduced by Small in 1973. The co-citation count of A and B is the number of documents that cite both A and B. Maybe want to normalize by total number of documents citing either A or B?
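The two similarity measures above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the citation data, the function names, and the choice of a Jaccard-style normalization (shared items divided by the union) are all assumptions made for the example.

```python
# Hypothetical citation data: each document maps to the set of documents it cites.
cites = {
    "A": {"P1", "P2", "P3"},
    "B": {"P2", "P3", "P4"},
    "C": {"A", "B"},
    "D": {"A", "B", "P1"},
    "E": {"A"},
}

def bibliographic_coupling(a, b, cites):
    """Number of documents cited by both a and b."""
    return len(cites.get(a, set()) & cites.get(b, set()))

def co_citation(a, b, cites):
    """Number of documents that cite both a and b."""
    return sum(1 for refs in cites.values() if a in refs and b in refs)

def normalized_coupling(a, b, cites):
    """Shared references divided by all references of a or b (assumed normalization)."""
    union = cites.get(a, set()) | cites.get(b, set())
    return bibliographic_coupling(a, b, cites) / len(union) if union else 0.0

def normalized_co_citation(a, b, cites):
    """Co-citing documents divided by documents citing either a or b."""
    either = sum(1 for refs in cites.values() if a in refs or b in refs)
    return co_citation(a, b, cites) / either if either else 0.0

coupling_ab = bibliographic_coupling("A", "B", cites)   # A and B share P2 and P3
cocitation_ab = co_citation("A", "B", cites)            # C and D cite both
```

Here A and B share two references and are cited together by two documents; E cites only A, so it lowers the normalized co-citation score without changing the raw count.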
Impact Factor (of a journal) Developed by Garfield in 1972 to measure the importance (quality, influence) of scientific journals. Measure of how often papers in the journal are cited by other scientists. Computed and published annually by the Institute for Scientific Information (ISI). The impact factor of a journal J in year Y is the average number of citations (from indexed documents published in year Y) to a paper published in J in year Y-1 or Y-2. Does not account for the quality of the citing article. Self-reference of sorts? h-index of an author, by Hirsch: the largest h such that h of the author's articles are each cited h or more times.
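Both measures reduce to simple arithmetic. The sketch below assumes the counts are already available; the function names and sample numbers are hypothetical.

```python
def impact_factor(citations_in_year, papers_published):
    """Average citations received in year Y per paper published in Y-1 or Y-2.

    citations_in_year: citations made in year Y to papers from Y-1 and Y-2.
    papers_published:  number of papers the journal published in Y-1 and Y-2.
    """
    return citations_in_year / papers_published

def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

journal_if = impact_factor(300, 120)        # hypothetical journal
author_h = h_index([10, 8, 5, 4, 3])        # hypothetical author
```

In the author example, four papers have at least four citations each, but there are not five papers with at least five.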
Authorities Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic. In-degree (number of pointers to a page) is one simple measure of authority. However in-degree treats all links as equal. Should links from pages that are themselves authoritative count for more?
Hubs Hubs are index pages that provide lots of useful links to relevant content pages (topic authorities). Hub pages for IR are included in the course home page: http://www.cs.utexas.edu/users/mooney/ir-course
Hubs and Authorities Together they tend to form a bipartite graph:
Today’s lecture: Basics of 3rd Generation Search Engines. Role of anchor text. Link analysis for ranking. Pagerank and variants (Page and Brin). BONUS: Illustrates how to solve an important problem by adapting topology/statistics of a large dataset for mathematically sound analysis and a practical scalable algorithm. HITS (Kleinberg). Notes: Anchor text enables keyword-based search of URLs. Pagerank can be computed statically (that is, it is query independent) while HITS requires query-dependent computation.
The Web as a Directed Graph [Figure: Page A points to Page B through a hyperlink with anchor text.] Assumption 1: A hyperlink between pages denotes author-perceived relevance (quality signal). Assumption 2: The anchor text of the hyperlink describes the target page (textual context). Focus away from content, towards the topology of the network. Note: Making assumptions explicit captures the scope of analysis and rationalizes potential failures! A1: endorsement. A2: summary. (A1 can fail if the passage around the link actually says the target is fake or controversial, or expresses negative sentiment; cf. attention vs popularity vs positive sentiment.) (A2 can fail if the anchor text belies the purpose of the page -> spamming.)
Anchor Text WWW Worm - McBryan [Mcbr94]. For the query ibm, how to distinguish between: IBM’s home page (mostly graphical), IBM’s copyright page (high term freq. for ‘ibm’), and a rival’s spam page (arbitrarily high term freq.)? Anchor texts such as “IBM home page”, “ibm.com”, and “ibm” point to www.ibm.com. A million pieces of anchor text with “ibm” send a strong signal; this overcomes spamming/gaming of the VSM. Notes: WWWW was a 1994 search engine with 110K docs in its index (Worm = Crawler). IR -> ranking based on content vs WR -> ranking based on links. Motivation for link/anchor-text analysis and counts (for ranking): http://www.ibm.com/ vs http://www.ibm.com/legal/copytrade.shtml
Indexing anchor text When indexing a document D, include anchor text from links pointing to D. Examples of anchor contexts pointing to www.ibm.com: “Armonk, NY-based computer giant IBM announced today”; “Big Blue today announced record profits for the quarter”; “Joe’s computer hardware links: Compaq, HP, IBM”. Anchor text describes content: it enables keyword-based search of URLs, in the Semantic Web too (our ESWC-2007 paper) (use anchor text as content in web-context). Can sometimes have unexpected side effects, e.g., evil empire. Can index anchor text with higher weight.
Anchor Text Other applications Weighting/filtering links in the graph HITS [Chak98], Hilltop [Bhar01] Generating page descriptions from anchor text [Amit98, Amit00] ftp://ftp.cs.toronto.edu/pub/reports/csri/405/hilltop.html Hilltop algorithm only considers "expert" sources - pages that have been created with the specific purpose of directing people towards resources. In response to a query, we first compute a list of the most relevant experts on the query topic. Then, we identify relevant links (plus anchor text) within the selected set of experts, and follow them to identify target web pages. The targets are then ranked according to the number and relevance of non-affiliated experts that point to them. Thus, the score of a target page reflects the collective opinion of the best independent experts on the query topic. When such a pool of experts is not available, Hilltop provides no results. Thus, Hilltop is tuned for result accuracy and not query coverage. http://www2006.org/programme/files/xhtml/3101/p3101-Richardson.html There has recently been work showing that PageRank may not perform any better than other simple measures on certain tasks. We show there are a number of simple url- or page- based features that significantly outperform PageRank (for the purposes of statically ranking Web pages) despite ignoring the structure of the Web. Popularity: Actual popularity of a Web page, measured as the number of times that it has been visited by users over some period of time. Anchor: It includes features such as the total amount of text in links pointing to the page ("anchor text"), the number of unique words in that text, etc. Page: This category consists of features which may be determined by looking at the page (and its URL) alone. We used only eight, simple features such as the number of words in the body, the frequency of the most common term, etc. Domain: This category contains features that are computed as averages across all pages in the domain. 
For example, the average number of outlinks on any page and the average PageRank.
Citation Analysis. Citation frequency. Co-citation coupling frequency: co-citations with a given author measure “impact”; co-citation analysis [Mcca90]; co-cited articles are related. Bibliographic coupling frequency: articles that cite the same articles are related. Citation indexing: who is an author cited by? (Garfield [Garf72]). Pagerank preview: Pinsker and Narin, ’60s. Notes: Citation analysis was developed in information science as a tool to identify core sets of articles, authors, or journals of particular fields of study. Co-citation analysis has been used to map the topical relatedness of clusters of authors, journals or articles. Co-citation is a popular similarity measure used to establish a subject similarity between two items. The last two are precursors to pagerank. Garfield: ISI (now Thomson Scientific). Impact factor: the more citations in reputable journals, the higher the impact factor. Narin and Pinsker: Weighted Citation Frequency / Citation Rank.
Query-independent ordering First generation: using link counts as simple measures of popularity. Two basic suggestions: Undirected popularity: Each page gets a score = the number of in-links plus the number of out-links (3+2=5). Directed popularity: Score of a page = number of its in-links (3). Prasad
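The two link-count heuristics can be sketched as follows. The edge list is a hypothetical graph chosen so that page "p" has three in-links and two out-links, matching the 3+2=5 example on the slide.

```python
# Hypothetical link graph as (source, target) pairs.
links = [("a", "p"), ("b", "p"), ("c", "p"), ("p", "x"), ("p", "y")]

def directed_popularity(page, links):
    """Score = number of in-links."""
    return sum(1 for _, target in links if target == page)

def undirected_popularity(page, links):
    """Score = number of in-links plus number of out-links."""
    out_degree = sum(1 for source, _ in links if source == page)
    return directed_popularity(page, links) + out_degree

directed_p = directed_popularity("p", links)
undirected_p = undirected_popularity("p", links)
```

Both scores depend only on links that anyone can create, which is what the spamming exercise on the next slide asks about.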
Query processing First retrieve all pages meeting the text query (say venture capital). Order these by their link popularity (either variant on the previous page). Spamming simple popularity Exercise: How do you spam each of the previous two heuristics so your page gets a high score? Prasad L21LinkAnalysis
Motivation and Introduction to PageRank Why is Page Importance Rating important? New challenges for information retrieval on the World Wide Web. Huge number of web pages: 150 million by 1998, 1000 billion by 2008. Diversity of web pages: different topics, different quality, etc. What is PageRank? A method for rating the importance of web pages objectively and mechanically using the link structure of the web. (1998 figures: 150 million web pages, 1.7 billion links.)
Initial PageRank Idea Can view it as a process of PageRank “flowing” from pages to the pages they cite. [Figure: example graph with rank values flowing along links.] Problem: A group of pages that only point to themselves but are pointed to by other pages act as a “rank sink” and absorb all the rank in the system.
Sample Stable Fixpoint [Figure: small example graph whose rank values (0.2 and 0.4) are unchanged by a further iteration.]
Pagerank scoring: More details Imagine a browser doing a random walk on web pages: Start at a random page. At each step, go out of the current page along one of the links on that page, with equal probability (e.g., 1/3 each for a page with three out-links). “In the steady state” each page has a long-term visit rate; use this as the page’s score. No content involved, only topology (node linking). The probability matrix corresponding to an arbitrary graph is STOCHASTIC => no negative entries, and the entries of each row add to 1 (each column, in the transposed formulation).
Not quite enough The web is full of dead-ends. Random walk can get stuck in dead-ends. Then it makes no sense to talk about long-term visit rates.
Teleporting At a dead end, jump to a random web page. At any non-dead end, with probability 10%, jump to a random web page. With remaining probability (90%), go out on a random link. 10% is a parameter. Notes: Teleporting changes the matrix to be IRREDUCIBLE, so that all states are reachable over the long haul; the eigenvector corresponding to the largest eigenvalue, 1, is the desired pagerank. These are realistic assumptions that make the problem mathematically tractable. Damping factor alpha = 0.9 effects quick convergence.
Result of teleporting Now cannot get stuck locally. There is a long-term rate at which any page is visited (not obvious, will show this). How do we compute this visit rate? Prasad L21LinkAnalysis
Markov chains A Markov chain consists of n states, plus an $n \times n$ transition probability matrix P. At each step, we are in exactly one of these states. For $1 \le i, j \le n$, the matrix entry $P_{ij}$ tells us the probability of j being the next state, given we are currently in state i. $P_{ii} > 0$ is OK. Markov property (memory-less system): the transition out of a state depends on the current state, and not on the path to the current state.
Markov chains Clearly, for all i, $\sum_{j=1}^{n} P_{ij} = 1$. Markov chains are abstractions of random walks. Exercise: represent the teleporting random walk from 3 slides ago as a Markov chain. Notes, for this case (teleport probability 0.1, dead-ends replaced by a uniform jump), the weights on each transition are: FIRST NODE: each out-link transition: 0.9/2 + 0.1/3; probability of looping back: 0.1/3. SECOND NODE: ditto. THIRD NODE (dead-end): all three out-transitions, incl. the loop: 1/3 (about 0.33).
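The exercise can be worked through in code. The sketch below assumes a three-page graph consistent with the notes (pages 0 and 1 each have two out-links, page 2 is a dead end); the exact graph from the slide figure is not available, and the function name `teleport_matrix` is made up for this example.

```python
def teleport_matrix(out_links, n, teleport=0.1):
    """Row-stochastic transition matrix of the teleporting random walk.

    out_links: dict mapping page index -> list of linked page indices.
    Dead ends jump uniformly; other pages teleport with probability
    `teleport` and otherwise follow one of their out-links uniformly.
    """
    P = []
    for i in range(n):
        targets = out_links.get(i, [])
        if not targets:
            row = [1.0 / n] * n                      # dead end: uniform jump
        else:
            row = [teleport / n] * n                 # teleport share for every page
            for j in targets:
                row[j] += (1.0 - teleport) / len(targets)
        P.append(row)
    return P

# Assumed graph: 0 -> {1, 2}, 1 -> {0, 2}, 2 is a dead end.
P = teleport_matrix({0: [1, 2], 1: [0, 2], 2: []}, n=3)
row_sums = [sum(row) for row in P]
```

Each row sums to 1, as required of a transition probability matrix, and the entries for the first page agree with the notes: 0.1/3 for looping back and 0.9/2 + 0.1/3 for each out-link.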
Ergodic Markov chains A Markov chain is ergodic if you have a path from any state to any other and, for any start state, after a finite transient time T0, the probability of being in any state at any fixed time T>T0 is nonzero. Stochastic (rows add to 1; columns in the transposed formulation) and irreducible (no isolated island) are necessary! Ergodicity is achieved by providing jumps out of dead-states and arbitrary jumps out of any state. [Figure: a chain that alternates between states on even and odd steps is not ergodic.]
Ergodic Markov chains For any ergodic Markov chain, there is a unique long-term visit rate for each state. Steady-state probability distribution. Over a long time-period, we visit each state in proportion to this rate. It doesn’t matter where we start. Ergodicity achieved by providing jumps out of dead-states and arbitrary jumps out of any state. Prasad L21LinkAnalysis
Probability vectors A probability (row) vector $x = (x_1, \dots, x_n)$ tells us where the walk is, at any point. E.g., (000…1…000), with the 1 in position i ($1 \le i \le n$), means we’re in state i. More generally, the vector $x = (x_1, \dots, x_n)$ means the walk is in state i with probability $x_i$.
Change in probability vector If the probability vector is $x = (x_1, \dots, x_n)$ at this step, what is it at the next step? Recall that row i of the transition prob. matrix P tells us where we go next from state i. So from x, our next state is distributed as xP. $xP = [\, x_1 P_{11} + \dots + x_n P_{n1},\ x_1 P_{12} + \dots + x_n P_{n2},\ \dots,\ x_1 P_{1n} + \dots + x_n P_{nn} \,]$. Notes: since $\sum_i x_i = 1$, summing the elements of xP gives $x_1 (P_{11} + P_{12} + \dots + P_{1n}) + \dots + x_n (P_{n1} + \dots + P_{nn}) = x_1 + \dots + x_n = 1$, so xP is again a probability vector. Formulation using transpose: I = H I (where H is the hyperlink matrix and I is the importance vector).
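One step of the walk is a single vector-matrix product. A minimal sketch, using a hypothetical three-state matrix chosen only so that the arithmetic is easy to follow:

```python
def step(x, P):
    """Return xP: the distribution after one step from distribution x."""
    n = len(x)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical row-stochastic matrix (each row sums to 1).
P = [
    [0.5, 0.5, 0.0],
    [0.25, 0.25, 0.5],
    [0.0, 1.0, 0.0],
]

x0 = [1.0, 0.0, 0.0]      # start in state 1 with certainty
x1 = step(x0, P)          # row 1 of P
x2 = step(x1, P)          # distribution after two steps
```

Starting from a point mass on state 1, the first step simply reads off row 1 of P; every later vector still sums to 1.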
Steady state example The steady state looks like a vector of probabilities $a = (a_1, \dots, a_n)$: $a_i$ is the probability of being in state i. [Figure: two states; from either state the walk goes to state 1 with probability 1/4 and to state 2 with probability 3/4.] With $P_{ij}$ = transition probability from i to j: $[0.25\ \ 0.75] \begin{bmatrix} 0.25 & 0.75 \\ 0.25 & 0.75 \end{bmatrix} = [0.25\ \ 0.75]$. For this example, $a_1 = 1/4$ and $a_2 = 3/4$.
How do we compute this vector? Let $a = (a_1, \dots, a_n)$ denote the row vector of steady-state probabilities. If current state is described by a, then the next step is distributed as aP. But a is the steady state, so a=aP. So a is a (left) eigenvector of P: the “principal” eigenvector, corresponding to the largest eigenvalue. Transition probability matrices always have largest eigenvalue 1 (and, for an ergodic chain, the others are smaller in magnitude). The principal eigenvector can be computed iteratively from any initial vector. Notes: Does the sequence $a_k$ always converge? It converges because the principal eigenvalue is 1 and the others are smaller in magnitude. If the second eigenvalue also has magnitude 1, the claim does not hold; if the other eigenvalues are not small, convergence is slow. Is the vector to which it converges independent of the initial vector $a_0$? Express any vector in terms of the eigenvectors and iterate to see convergence: $I^0 = c_1 v_1 + \dots + c_n v_n$, $I^k = c_1 v_1 + c_2 \lambda_2^k v_2 + \dots + c_n \lambda_n^k v_n$, so $I^{\mathrm{limit}} = c_1 v_1$ (the eigenvector with eigenvalue 1). Do the importance rankings contain the information that we want? Note that normally there are multiple eigenvectors, so it is necessary to show that the iterative method converges irrespective of the initial vector to yield the appropriate one.
One way of computing a Recall, regardless of where we start, we eventually reach the steady state a. Start with any distribution (say x=(10…0)). After one step, we’re at $xP$; after two steps at $xP^2$, then $xP^3$ and so on. “Eventually” means for “large” k, $xP^k = a$. Algorithm: multiply x by increasing powers of P until the product looks stable. Note that normally there are multiple eigenvectors, so it is necessary to show that the iterative method converges irrespective of the initial vector to yield the appropriate one. For a justification of why it converges to the eigenvector corresponding to the largest eigenvalue 1, see: http://www.ams.org/featurecolumn/archive/pagerank.html
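The algorithm on this slide is power iteration. The sketch below is one possible rendering; the stopping rule (largest per-state change below a tolerance) and the parameter names are assumptions, and the matrix is the two-state example in which both rows are (1/4, 3/4).

```python
def power_iteration(P, x, tol=1e-12, max_steps=10_000):
    """Multiply x by P repeatedly until the distribution stops changing."""
    n = len(P)
    for _ in range(max_steps):
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x  # not converged within max_steps; return the last iterate

# Two-state example: from either state, go to state 1 w.p. 1/4, state 2 w.p. 3/4.
P = [[0.25, 0.75],
     [0.25, 0.75]]

from_state_1 = power_iteration(P, [1.0, 0.0])
from_state_2 = power_iteration(P, [0.0, 1.0])
```

Both starting points lead to the same vector (1/4, 3/4), illustrating that the steady state does not depend on where the walk starts for this chain.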
Link Structure of the Web 150 million web pages 1.7 billion links Backlinks and Forward links: A and B are C’s backlinks C is A and B’s forward link Intuitively, a webpage is important if it has a lot of backlinks. What if a webpage has only one link off www.yahoo.com?
A Simple Version of PageRank $R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$, where u: a web page; $B_u$: the set of u’s backlinks; $N_v$: the number of forward links of page v; c: the normalization factor to make $\|R\|_{L1} = 1$ ($\|R\|_{L1} = |R_1 + \dots + R_n|$)
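A sketch of the simplified computation: each page v splits its current rank equally among its N_v forward links, each page u sums what arrives from its backlinks, and the result is rescaled to sum to 1 (the role of c). The three-page graph and the function name are hypothetical.

```python
def simplified_pagerank(out_links, iterations=100):
    """Iterate R(u) = c * sum over backlinks v of R(v) / N_v.

    out_links: dict mapping each page to the list of pages it links to.
    Every linked page must also appear as a key.
    """
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in out_links.items():
            for u in targets:
                new_rank[u] += rank[v] / len(targets)   # v passes R(v)/N_v to u
        total = sum(new_rank.values())
        rank = {p: r / total for p, r in new_rank.items()}  # normalization factor c
    return rank

# Hypothetical graph: A -> B, A -> C, B -> C, C -> A.
rank = simplified_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

For this graph the ranks settle at A = 0.4, B = 0.2, C = 0.4: B receives half of A's rank, and C receives the other half plus all of B's.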
An example of Simplified PageRank (Transposed version) (Steady) State probability column vector entry xi corresponds to probability in state i Transition matrix entry m(i,j) corresponds to the probability of transition from state i to state j PageRank Calculation: first iteration
An example of Simplified PageRank (Transposed version) PageRank Calculation: second iteration
An example of Simplified PageRank (Transposed version) Convergence after some iterations
A Problem with Simplified PageRank A loop: During each iteration, the loop accumulates rank but never distributes rank to other pages!
An example of the Problem
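The rank-sink effect is easy to reproduce. In the hypothetical graph below, pages C and D link only to each other, while B links into the loop; the function name is made up, and no normalization is needed because every page has at least one out-link, so total rank is conserved.

```python
def flow_rank(out_links, iterations=200):
    """Simplified PageRank flow: each page splits its rank among its out-links."""
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in out_links.items():
            for u in targets:
                new_rank[u] += rank[v] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph: A <-> B, B -> C, and the loop C <-> D.
out_links = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["C"]}
rank = flow_rank(out_links)
outside_loop = rank["A"] + rank["B"]
inside_loop = rank["C"] + rank["D"]
```

Each round B hands half of its rank to the loop and nothing ever comes back, so A and B drain towards zero while C and D end up holding essentially all of the rank.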
Random Walks in Graphs The Random Surfer Model. The simplified model: the standing probability distribution of a random walk on the graph of the web; the surfer simply keeps clicking successive links at random. The Modified Model: the “random surfer” simply keeps clicking successive links at random, but periodically “gets bored” and jumps to a random page based on the distribution of E.
Modified Version of PageRank E(u): a distribution of ranks of web pages that “users” jump to when they “get bored” after following successive links at random.
An example of Modified PageRank
Pagerank summary Preprocessing: Given graph of links, build matrix P. From it compute a. The entry $a_i$ is a number between 0 and 1: the pagerank of page i. Query processing: Retrieve pages meeting query. Rank them by their pagerank. Order is query-independent. Notes: RANDOM WALK => equal outlink probabilities; dealing with dead-ends; damping factor / random jump. http://web.dcs.bbk.ac.uk/~dell/teaching/ir/pr_example.pdf http://www.ianrogers.net/google-page-rank/
The reality Pagerank is used in Google, but so are many other clever heuristics.
Pagerank: Issues and Variants How realistic is the random surfer model? What if we modeled the back button? [Fagi00] Surfer behavior sharply skewed towards short paths [Hube98] Search engines, bookmarks & directories make jumps non-random. Biased Surfer Models Weight link traversal probabilities based on match with topic/query (skewed selection) Bias jumps to pages on topic (e.g., based on personal bookmarks & categories of interest) Prasad
Topic Specific Pagerank [Have02] Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule: Select a category (say, one of the 16 top level ODP categories) based on a query & user -specific distribution over the categories Teleport to a page uniformly at random within the chosen category Sounds hard to implement: can’t compute PageRank at query time! Open Directory Project: largest, most comprehensive human-edited directory of the Web Prasad
Topic Specific Pagerank [Have02] Implementation. Offline: Compute pagerank distributions w.r.t. individual categories. Query-independent model as before. Each page has multiple pagerank scores, one for each ODP category, with teleportation only to that category. Online: Distribution of weights over categories computed by query context classification. Generate a dynamic pagerank score for each page: the weighted sum of category-specific pageranks.
Influencing PageRank (“Personalization”) Input: Web graph W; influence vector v, v: (page → degree of influence). Output: Rank vector r: (page → page importance wrt v). r = PR(W, v). The basis is the web graph!!
Non-uniform Teleportation Sports Teleport with 10% probability to a Sports page Prasad L21LinkAnalysis
Interpretation of Composite Score For a set of personalization vectors $\{v_j\}$: $\sum_j [\, w_j \cdot PR(W, v_j) \,] = PR(W, \sum_j [\, w_j \cdot v_j \,])$. A weighted sum of rank vectors itself forms a valid rank vector, because PR() is linear wrt $v_j$.
Interpretation Sports 10% Sports teleportation Prasad L21LinkAnalysis
Interpretation Health 10% Health teleportation Prasad L21LinkAnalysis
Interpretation Health Sports pr = (0.9 PRsports + 0.1 PRhealth) gives you: 9% sports teleportation, 1% health teleportation Prasad
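The linearity claim can be checked numerically. The sketch below is an illustration under stated assumptions: the four-page graph, the two teleport vectors standing in for "sports" and "health", and the function name are all hypothetical, and the graph has no dead ends so that the update stays linear in v.

```python
def personalized_pagerank(out_links, v, teleport=0.1, iterations=500):
    """PageRank where teleport jumps land according to distribution v."""
    n = len(v)
    r = list(v)
    for _ in range(iterations):
        nxt = [teleport * v[j] for j in range(n)]          # teleport share
        for i, targets in out_links.items():
            share = (1.0 - teleport) * r[i] / len(targets)  # link-following share
            for j in targets:
                nxt[j] += share
        r = nxt
    return r

# Hypothetical graph without dead ends.
out_links = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}
v_sports = [0.5, 0.5, 0.0, 0.0]   # assumed "sports" pages: 0 and 1
v_health = [0.0, 0.0, 0.5, 0.5]   # assumed "health" pages: 2 and 3

pr_sports = personalized_pagerank(out_links, v_sports)
pr_health = personalized_pagerank(out_links, v_health)

# Left side: weighted sum of the two rank vectors.
combined = [0.9 * s + 0.1 * h for s, h in zip(pr_sports, pr_health)]
# Right side: one PageRank run with the mixed teleport vector.
v_mixed = [0.9 * s + 0.1 * h for s, h in zip(v_sports, v_health)]
pr_mixed = personalized_pagerank(out_links, v_mixed)
```

The two sides agree to within rounding, which is why category-specific vectors can be computed offline and mixed at query time.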
Hubs and Authorities: Alternative to Pagerank
Hyperlink-Induced Topic Search (HITS) - Klei98 In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages: Hub pages are good lists of links on a subject. e.g., “Bob’s list of cancer-related links.” Authority pages occur recurrently on good hubs for the subject. Best suited for “broad topic” queries rather than for page-finding queries. Gets at a broader slice of common opinion. Broad topic queries => many related documents sought compared to the “static” pagerank, computation of hubs and authorities rank is dynamic -------------------------------------------------------------------------------------------------------------- Pagerank does not attempt to capture the distinction between hubs and authorities. Ranks pages just by authority. Applied to the entire web rather than a local neighborhood of pages surrounding the results of a query. Prasad L21LinkAnalysis
Hubs and Authorities Thus, a good hub page for a topic points to many authoritative pages for that topic. A good authority page for a topic is pointed to by many good hubs for that topic. Circular definition - will turn this into an iterative computation. Prasad L21LinkAnalysis
The hope [Figure: hubs on one side pointing to authorities on the other; example topic: long distance telephone companies.] Usually authorities may not point to other authorities, due to competition (contrasts with bibliographic citations).
High-level scheme Extract from the web a base set of pages that could be good hubs or authorities. From these, identify a small set of top hub and authority pages; iterative algorithm. Bootstrap from standard search engines and improve upon it for broad-topic searches Prasad L21LinkAnalysis
Base set Given text query (say browser), use a text index to get all pages containing browser. Call this the root set of pages. Add in any page that either points to a page in the root set, or is pointed to by a page in the root set. Call this the base set. Pull in potential authorities and hubs if missing from the root set by this one-level expansion in either direction Prasad L21LinkAnalysis
Visualization Root set Base set Prasad L21LinkAnalysis
Assembling the base set [Klei98] Root set typically 200-1000 nodes. Base set may have up to 5000 nodes. How do you find the base set nodes? Follow out-links by parsing root set pages. Get in-links (and out-links) from a connectivity server. (Actually, suffices to text-index strings of the form href=“URL” to get in-links to URL.) Prasad L21LinkAnalysis
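The one-level expansion from root set to base set can be sketched as below. The link tables are hypothetical stand-ins for what parsing the root pages and querying a connectivity server would return, and the cap on base-set size mentioned on the slide is omitted.

```python
def build_base_set(root_set, out_links, in_links):
    """Root set plus every page it points to and every page pointing into it."""
    base = set(root_set)
    for page in root_set:
        base.update(out_links.get(page, ()))   # pages the root page links to
        base.update(in_links.get(page, ()))    # pages linking to the root page
    return base

# Hypothetical link tables.
out_links = {"r1": ["x", "r2"], "r2": ["y"], "q": ["z"]}
in_links = {"r1": ["z"], "r2": ["r1", "w"]}

root_set = {"r1", "r2"}
base_set = build_base_set(root_set, out_links, in_links)
```

Page "q" stays out of the base set because it neither links to nor is linked from a root page, even though it links to "z", which is in the base set: the expansion is only one level deep.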
Distilling hubs and authorities Compute, for each page x in the base set, a hub score h(x) and an authority score a(x). Initialize: for all x, $h(x) \leftarrow 1$; $a(x) \leftarrow 1$. Iteratively update all h(x), a(x) (the key step). After the iterations, output pages with highest h() scores as top hubs, and pages with highest a() scores as top authorities.
Illustrated Update Rules [Figure: pages 1, 2, 3 point to page 4, so a4 = h1 + h2 + h3; page 4 points to pages 5, 6, 7, so h4 = a5 + a6 + a7.]
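Iterative update Repeat the following updates, for all x: $h(x) \leftarrow \sum_{x \to y} a(y)$ (sum over the pages y that x points to); $a(x) \leftarrow \sum_{y \to x} h(y)$ (sum over the pages y that point to x).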
Scaling To prevent the h() and a() values from getting too big, can normalize after each iteration. Scaling factor doesn’t really matter: we only care about the relative values of the scores.
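Putting initialization, the two update rules, and scaling together gives a compact HITS loop. This is a sketch under assumptions: the edge list is hypothetical, scores are scaled to sum to 1 (any scaling would do), and the authority update uses the hub scores just computed in the same round.

```python
def normalize(scores):
    """Scale scores to sum to 1; only relative values matter."""
    total = sum(scores.values())
    return {p: s / total for p, s in scores.items()} if total else scores

def hits(links, iterations=50):
    """links: list of (source, target) pairs within the base set."""
    pages = sorted({p for edge in links for p in edge})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_hub = {p: 0.0 for p in pages}
        for src, dst in links:
            new_hub[src] += auth[dst]        # h(x) <- sum of a(y) over x -> y
        new_auth = {p: 0.0 for p in pages}
        for src, dst in links:
            new_auth[dst] += new_hub[src]    # a(x) <- sum of h(y) over y -> x
        hub, auth = normalize(new_hub), normalize(new_auth)
    return hub, auth

# Hypothetical base set: three hub-like pages pointing at two candidate authorities.
links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
hub, auth = hits(links)
top_authority = max(auth, key=auth.get)
```

Page a1 comes out as the top authority because all three hubs point to it, and h1 and h2 tie as the best hubs because each points to both authorities.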
Old Results Authorities for query: “Java” java.sun.com comp.lang.java FAQ Authorities for query “search engine” Yahoo.com Excite.com Lycos.com Altavista.com Authorities for query “Gates” Microsoft.com roadahead.com In most cases, the final authorities were not in the initial root set generated using Altavista. Authorities were brought in from linked and reverse-linked pages and then HITS computed their high authority score.
Similar Page Results from Hubs Given “honda.com” toyota.com ford.com bmwusa.com saturncars.com nissanmotors.com audi.com volvocars.com
Finding Similar Pages Using Link Structure Given a page, P, let R (the root set) be t (e.g. 200) pages that point to P. Grow a base set S from R. Run HITS on S. Return the best authorities in S as the best similar-pages for P. Finds authorities in the “link neighbor-hood” of P.
HITS for Clustering An ambiguous query can result in the principal eigenvector only covering one of the possible meanings. Non-principal eigenvectors may contain hubs & authorities for other meanings. Example: “jaguar”: Atari video game (principal eigenvector) NFL Football team (2nd non-princ. eigenvector) Automobile (3rd non-princ. eigenvector)
Japan Elementary Schools [Figure: two columns of result page titles, Hubs and Authorities; the column assignment and the Japanese-language titles did not survive extraction.] Legible entries: schools; LINK Page-13; 100 Schools Home Pages (English); K-12 from Japan 10/...rnet and Education ); http://www...iglobe.ne.jp/~IKESAN; Koulutus ja oppilaitokset; TOYODA HOMEPAGE; Education; Cay's Homepage(Japanese); UNIVERSITY; DRAGON97-TOP; The American School in Japan; The Link Page; Kids' Space; KEIMEI GAKUEN Home Page ( Japanese ); Shiranuma Home Page; fuzoku-es.fukui-u.ac.jp; welcome to Miasa E&J school; http://www...p/~m_maru/index.html; fukui haruyama-es HomePage; Torisu primary school; goo; Yakumo Elementary,Hokkaido,Japan; FUZOKU Home Page; Kamishibun Elementary School...
Things to note Pulled together good pages regardless of language of page content. Use link analysis only after the base set is assembled: iterative scoring is query-independent. Iterative computation after text index retrieval: significant overhead.
Proof of convergence $n \times n$ adjacency matrix A: each of the n pages in the base set has a row and column in the matrix. Entry $A_{ij} = 1$ if page i links to page j, else = 0. Example with pages 1, 2, 3 (rows and columns in that order): $A = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 0 & 0 \end{bmatrix}$
Hub/authority vectors View the hub scores h() and the authority scores a() as vectors with n components. Recall the iterative updates $h(x) \leftarrow \sum_{x \to y} a(y)$ and $a(x) \leftarrow \sum_{y \to x} h(y)$. We only require the relative orders of the h() and a() scores, not their absolute values.
Rewrite in matrix form Substituting, h=AAth and a=AtAa. Recall At is the transpose of A. h=Aa. a=Ath. Substituting, h=AAth and a=AtAa. Thus, h is an eigenvector of AAt and a is an eigenvector of AtA. What are the properties of AAt and AtA that guarantee convergence to principal eignevector from any initial value? Further, our algorithm is known for computing eigenvectors: the power iteration method. Guaranteed to converge. Prasad
Issues Topic Drift: Off-topic pages can cause off-topic “authorities” to be returned. E.g., the neighborhood graph can be about a “super topic”. Mutually Reinforcing Affiliates: Affiliated pages/sites can boost each others’ scores. Linkage between affiliated pages is not a useful signal. http://www2004.org/proceedings/docs/1p309.pdf http://www2004.org/proceedings/docs/1p595.pdf http://www2003.org/cdrom/papers/refereed/p270/kamvar-270-xhtml/index.html http://www2003.org/cdrom/papers/refereed/p641/xhtml/p641-mccurley.html