Information Retrieval Link-based Ranking. Ranking is crucial… “.. From our experimental data, we could observe that the top 20% of the pages with the.

Slides:



Advertisements
Similar presentations
Lecture 18: Link analysis
Advertisements

Matrices, Digraphs, Markov Chains & Their Use by Google Leslie Hogben Iowa State University and American Institute of Mathematics Leslie Hogben Iowa State.
Information Networks Link Analysis Ranking Lecture 8.
Graphs, Node importance, Link Analysis Ranking, Random walks
CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.
Link Analysis: PageRank
Link Analysis David Kauchak cs160 Fall 2009 adapted from:
More on Rankings. Query-independent LAR Have an a-priori ordering of the web pages Q: Set of pages that contain the keywords in the query q Present the.
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 March 23, 2005
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
Link Analysis Ranking. How do search engines decide how to rank your query results? Guess why Google ranks the query results the way it does How would.
Introduction to PageRank Algorithm and Programming Assignment 1 CSC4170 Web Intelligence and Social Computing Tutorial 4 Tutor: Tom Chao Zhou
Information Retrieval IR10 Today’s lecture Anchor text Link analysis for ranking Pagerank and variants HITS.
Multimedia Databases SVD II. Optimality of SVD Def: The Frobenius norm of a n x m matrix M is (reminder) The rank of a matrix M is the number of independent.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
ICS 278: Data Mining Lecture 15: Mining Web Link Structure
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 April 2, 2006
Multimedia Databases SVD II. SVD - Detailed outline Motivation Definition - properties Interpretation Complexity Case studies SVD properties More case.
Link Analysis, PageRank and Search Engines on the Web
CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.
INF 2914 Web Search Lecture 4: Link Analysis Today’s lecture Anchor text Link analysis for ranking Pagerank and variants HITS.
CS347 Lecture 6 April 25, 2001 ©Prabhakar Raghavan.
Singular Value Decomposition and Data Management
Link Analysis. 2 HITS - Kleinberg’s Algorithm HITS – Hypertext Induced Topic Selection For each vertex v Є V in a subgraph of interest: A site is very.
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
Motivation When searching for information on the WWW, user perform a query to a search engine. The engine return, as the query’s result, a list of Web.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
Stochastic Approach for Link Structure Analysis (SALSA) Presented by Adam Simkins.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
Presented By: - Chandrika B N
1 Page Link Analysis and Anchor Text for Web Search Lecture 9 Many slides based on lectures by Chen Li (UCI) an Raymond Mooney (UTexas)
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis.
Information retrieval Lecture 9 Recap and today’s topics Last lecture web search overview pagerank Today more sophisticated link analysis using links.
PrasadL17LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford) and Christopher Manning (Stanford)
ITCS 6265 Lecture 17 Link Analysis This lecture Anchor text Link analysis for ranking Pagerank and variants HITS.
Lecture 14: Link Analysis
CS276 Lecture 18 Link Analysis Today’s lecture Anchor text Link analysis for ranking Pagerank and variants HITS.
CS276 Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 18: Link analysis.
Hyperlink Analysis for the Web. Information Retrieval Input: Document collection Goal: Retrieve documents or text with information content that is relevant.
Hyperlink Analysis for the Web. Information Retrieval Input: Document collection Goal: Retrieve documents or text with information content that is relevant.
Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 21: Link analysis.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
Web Search and Tex Mining Lecture 9 Link Analysis.
The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd Presented by Anca Leuca, Antonis Makropoulos.
How Does a Search Engine Work? Part 2 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-
Overview of Web Ranking Algorithms: HITS and PageRank
PageRank. s1s1 p 12 p 21 s2s2 s3s3 p 31 s4s4 p 41 p 34 p 42 p 13 x 1 = p 21 p 34 p 41 + p 34 p 42 p 21 + p 21 p 31 p 41 + p 31 p 42 p 21 / Σ x 2 = p 31.
CS349 – Link Analysis 1. Anchor text 2. Link analysis for ranking 2.1 Pagerank 2.2 Pagerank variants 2.3 HITS.
Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 
Ranking Link-based Ranking (2° generation) Reading 21.
COMP4210 Information Retrieval and Search Engines Lecture 9: Link Analysis.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Web-Mining Agents Community Analysis Tanya Braun Universität zu Lübeck Institut für Informationssysteme.
Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris.
Link Analysis Algorithms Page Rank Slides from Stanford CS345, slightly modified.
CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.
GRAPH AND LINK MINING 1. Graphs - Basics 2 Undirected Graphs Undirected Graph: The edges are undirected pairs – they can be traversed in any direction.
Extrapolation to Speed-up Query- dependent Link Analysis Ranking Algorithms Muhammad Ali Norozi Department of Computer Science Norwegian University of.
Web Mining Link Analysis Algorithms Page Rank. Ranking web pages  Web pages are not equally “important” v  Inlinks.
Modified by Dongwon Lee from slides by
The PageRank Citation Ranking: Bringing Order to the Web
Quality of a search engine
Search Engines and Link Analysis on the Web
Link-Based Ranking Seminar Social Media Mining University UC3M
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Junghoo “John” Cho UCLA
Presentation transcript:

Information Retrieval Link-based Ranking

Ranking is crucial… “.. From our experimental data, we could observe that the top 20% of the pages with the highest number of incoming links obtained 70% of the new links after 7 months, while the bottom 60% of the pages obtained virtually no new incoming links during that period…” [ Cho et al., 04 ] Cho et al., 04

Query processing First retrieve all pages meeting the text query Order these by their link popularity +… other scores: TF-IDF, Title, Anchor,

Query-independent ordering First generation: using link counts as simple measures of popularity. Two basic suggestions: Undirected popularity: Each page gets a score given by the number of in-links plus the number of out-links (es. 3+2=5). Directed popularity: Score of a page = number of its in-links (es. 3). SPAM!!!!

Second generation: PageRank Each link has its own importance!! PageRank is independent of the query many interpretations…

Basic Intuition… What about nodes with no out-links?

Google’s Pagerank B(i) : set of pages linking to i. #out(j) : number of outgoing links from j. e : vector of components 1/sqrt{N}. Random jump

Pagerank: use in Search Engines Preprocessing: Given graph of links, build matrix P Compute its principal eigenvector The i-th entry is the pagerank of page i We are interested in the relative order Query processing: Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Three different interpretations Graph (intuitive interpretation) What is the random jump? What happens to the “real” web graph? Matrix (easy for computation) Turns out to be an eigenvector computation Or a linear system solution Markov Chain (useful to show convergence) a sort of Usage Simulation

Pagerank as Usage Simulation Imagine a user doing a random walk on web: Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably “In the steady state” each page has a long-term visit rate - use this as the page’s score. 1/3

Not quite enough The web is full of dead-ends. Random walk can get stuck in dead-ends. Cannot talk about long-term visit rates.

Teleporting At each step, with probability 10%, jump to a random page. With remaining probability (90%), go out on a out-link taken at random. Any node 0.9 Neighbors 0.1

Result of teleporting Now we CANNOT get stuck locally. There is a long-term rate at which any page is visited Not obvious  Markov chains

Markov chains A Markov chain consists of n states, plus an n  n transition probability matrix P. At each step, we are in exactly one of the states. For 1  i,j  n, the matrix entry P ij tells us the probability of going from state i to state j. ij P ij P ii >0 is OK.

Markov chains Clearly, for all i, Markov chains abstracts random walks A Markov chain is ergodic if you have a path from any state to any other you can be in any state at every time step, with non-zero probability. Not ergodic Not ergodic

Ergodic Markov chains For any ergodic Markov chain, there is a unique long-term visit rate for each state. Steady-state distribution. Over a long time-period, we visit each state in proportion to this rate. It doesn’t matter where we start.

Computing steady-state probabilities ? Let x = (x 1, …, x n ) denote the row vector of steady-state probabilities. If our current position is described by x, then the next step is distributed as xP. But x is the steady state, so x=xP. Solving this matrix equation gives us x. So x is the (left) eigenvector for P. (Corresponds to the “principal” eigenvector of P with the largest eigenvalue.)

Computing x by Power Method Recall, regardless of where we start, we eventually reach the steady state a. Start with any distribution (say x=(10…0)). After one step, we’re at xP; after two steps at xP 2, then xP 3 and so on. “Eventually” means for “large” k, a = xP k Algorithm: multiply x by increasing powers of P until the product looks stable.

Why we need a fast ranking? “…The link structure of the Web is significantly more dynamic than the contents on the Web. Every week, about 25% new links are created. After a year, about 80% of the links on the Web are replaced with new ones. This result indicates that search engines need to update link-based ranking metrics very often…” [ Cho et al., 04 ] Cho et al., 04

Accelerating PageRank Web Graph Compression to fit in internal memory Efficient external-memory implementation Mathematical approaches Combination of the above strategies

Personalized PageRank a b

Influencing PageRank (“Personalization”) Input: Web graph W influence vector v v : page  degree of influence Output: Rank vector r: page  importance wrt v r = PR(W, v)

Personalized PageRank α is the dumping factor v is a personalization vector

HITS: Hubs and Authorities A good hub page for a topic points to many authoritative pages for that topic. A good authority page for a topic is pointed to by many good hubs for that topic. Circular definition - will turn this into an iterative computation.

HITS: Hypertext Induced Topic Search It is query-dependent Produces two scores per page: Authority score Hub score

Hubs & Authorities Calculation Root Set Base Set Query

Assembling the base set Root set typically nodes. Base set may have up to 5000 nodes. How do you find the base set nodes? Follow out-links by parsing root set pages. Get in-links (and out-links) from a connectivity server. (Actually, suffices to text-index strings of the form href=“URL” to get in-links to URL.)

Authority and Hub scores a(1) = h(2) + h(3) + h(4) h(1) = a(5) + a(6) + a(7)

Iterative update Repeat the following updates, for all x: x x

HITS: Link Analysis Computation Where a: Vector of Authorities’ scores h: Vector of Hubs’ scores. A: Adjacency matrix in which a i,j = 1 if i points to j. Thus, h is an eigenvector of AA t a is an eigenvector of A t A scaling SVD of A

Math Behind the Algorithm Theorem (Kleinberg, 1998). The vectors a(p) and h(p) converge to the principal eigenvectors of A T A and AA T, where A is the adjacency matrix of the (directed) Web subgraph.

How many iterations? Claim: relative values of scores will converge after a few iterations: We only require the relative orders of the h() and a() scores - not their absolute values. In practice ~5 iterations get you close to stability. Scores are sensitive to small variations of the graph. Bad !!

HITS Example Results Authority Hubness Authority and hubness weights

Weighting links In hub/authority link analysis, can match text to query, then weight link.

Weighting links Should increase with the number of query terms in anchor text. E.g.: 1+ number of query terms. Armonk, NY-based computer giant IBM announced today xy Weight of this link for query computer is 2.

Practical HITS Used by ASK.com How to compute it efficiently??????

Comparison Pagerank Pros Hard to spam Quality signal for all pages Cons Non-trivial to compute Not query specific Doesn’t work on small graphs Proven to be effective for general purpose ranking HITS & Variants Pros Query specific Works on small graphs Cons Local graph structure can be manufactured (spam!) Real-time computation is hard. Well suited for supervised directory construction

SALSA Proposed by Lempel and Moran Probabilistic extension of the HITS algorithm Random walk is carried out by following hyperlinks both in the forward and in the backward direction Two separate random walks Hub walk Authority walk

Forming a Bipartite Graph in SALSA

SALSA: Random Walks Hub walk Follow a forward-link from an hub u h to an authority w a Traverse a back-link from w a to some page v h ’ Authority Walk Follow a back-link from an authority w a to an hub u h Traverse a forward-link from an hub u h to an authority z a Lempel and Moran (2001) proved that SALSA weights are more robust that HITS weights in some cases. Authority-score converges to Indegree, Hub-score converges to outdegree

Information Retrieval Spamming

Spam Farm (SF), rules of thumb Use all available own pages in the SF Accumulate the maximum number of inlinks to SF NO links pointing outside the SF Avoid dangling nodes within the SF Spamming PageRank

Spamming HITS Easy to spam Create a new page p pointing to many authority pages (e.g., Yahoo, Google, etc.)  p becomes a good hub page … On p, add a link to your home page

Many applications Ranking Web News Financial Institutions Peers Human Trustness Viruses Social networks You propose!

Information Retrieval Other tools

snippet

Joint with A. Gullì WWW 2005, Tokio (JP)

SnakeT ’s personalization Many interesting features Full-adaptivity to the variegate user needs/behaviors Scalable to many users and their unbounded profiles Privacy protection: no login or profiled information is required/used Black-box: no change in the underlying (unpersonalized) search engine The user first gets a “glimpse” on all themes of results without scanning all of them, and then selects (filters) the results interesting for him/her.