More on Rankings

Query-independent LAR Have an a-priori ordering π of the web pages. Q: the set of pages that contain the keywords of query q. Present the pages in Q ordered according to π. What are the advantages of such an approach?

InDegree algorithm Rank pages according to in-degree, w_i = |B(i)|, the number of pages that point to page i. Example ranking from the slide's figure: 1. Red Page, 2. Yellow Page, 3. Blue Page, 4. Purple Page, 5. Green Page.
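As a minimal sketch, the InDegree ranking can be computed as below; the four-page link graph is hypothetical, not the one from the slide's figure:

```python
# InDegree ranking on a hypothetical link graph.
# links[u] lists the pages that u points to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["B", "C"],
}

def indegree_rank(links):
    """Score each page by w_i = |B(i)|, its number of in-links."""
    w = {page: 0 for page in links}
    for u, outs in links.items():
        for v in outs:
            w[v] = w.get(v, 0) + 1
    ranking = sorted(w, key=lambda p: w[p], reverse=True)
    return ranking, w

ranking, w = indegree_rank(links)
```

Here C is ranked first because three pages point to it, while D, with no in-links, comes last.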

PageRank algorithm [BP98] Good authorities are pointed to by good authorities. Random walk on the web graph: pick a page at random; with probability 1-α jump to a random page; with probability α follow a random outgoing link. Rank according to the stationary distribution. Example ranking from the slide's figure: 1. Red Page, 2. Purple Page, 3. Yellow Page, 4. Blue Page, 5. Green Page.

Markov chains A Markov chain describes a discrete-time stochastic process over a set of states S = {s_1, s_2, ..., s_n} according to a transition probability matrix P = {P_ij}, where P_ij is the probability of moving to state j when at state i, and ∑_j P_ij = 1 (stochastic matrix). Memorylessness property: the next state of the chain depends only on the current state and not on the past of the process (first-order MC); higher-order MCs are also possible.

Random walks Random walks on graphs correspond to Markov chains: the set of states S is the set of nodes of the graph G, and the transition probability matrix gives the probability of following an edge from one node to another.

An example [Figure: a directed graph on nodes v_1, ..., v_5.]

State probability vector The vector q^t = (q^t_1, q^t_2, ..., q^t_n) stores the probability of being at each state i at time t; q^0_i is the probability of starting from state i. Update rule: q^t = q^{t-1} P.

An example [Figure: the same graph on nodes v_1, ..., v_5.]
q^{t+1}_1 = (1/3) q^t_4 + (1/2) q^t_5
q^{t+1}_2 = (1/2) q^t_1 + q^t_3 + (1/3) q^t_4
q^{t+1}_3 = (1/2) q^t_1 + (1/3) q^t_4
q^{t+1}_4 = (1/2) q^t_5
q^{t+1}_5 = q^t_2
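The update equations above can be checked in code. The edge list below is reconstructed from those equations (e.g. the 1/3 coefficients imply v_4 has three out-links, to v_1, v_2, v_3), so it is an inference, not part of the slide:

```python
# Graph inferred from the update equations:
# v1 -> {v2, v3}, v2 -> {v5}, v3 -> {v2},
# v4 -> {v1, v2, v3}, v5 -> {v1, v4}.
out_edges = {1: [2, 3], 2: [5], 3: [2], 4: [1, 2, 3], 5: [1, 4]}
n = 5

# Transition matrix: P[i][j] = 1/outdeg(i) if i -> j, else 0.
P = [[0.0] * n for _ in range(n)]
for i, outs in out_edges.items():
    for j in outs:
        P[i - 1][j - 1] = 1.0 / len(outs)

def step(q):
    """One update q^{t+1} = q^t P."""
    return [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]

q0 = [1.0 / n] * n    # start from the uniform distribution
q1 = step(q0)
```

Starting from the uniform q^0, the first component of q^1 is (1/3)(1/5) + (1/2)(1/5), matching the first equation.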

Stationary distribution A stationary distribution for a MC with transition matrix P is a probability distribution π such that π = πP. A MC has a unique stationary distribution if it is irreducible (the underlying graph is strongly connected) and aperiodic (for random walks: the underlying graph is not bipartite). The probability π_i is the fraction of time the walk spends at state i as t → ∞. The stationary distribution is an eigenvector of matrix P: the principal left eigenvector of P (stochastic matrices have maximum eigenvalue 1).

Computing the stationary distribution The Power Method: initialize to some distribution q^0; iteratively compute q^t = q^{t-1} P; after enough iterations, q^t ≈ π. It is called the power method because it computes q^t = q^0 P^t. The rate of convergence is determined by the second eigenvalue λ_2.
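A power-method sketch on the same five-node example (the edge list is inferred from the earlier update equations, not given explicitly in the slides):

```python
# Power method for the stationary distribution of the example chain.
out_edges = {1: [2, 3], 2: [5], 3: [2], 4: [1, 2, 3], 5: [1, 4]}
n = 5
P = [[0.0] * n for _ in range(n)]
for i, outs in out_edges.items():
    for j in outs:
        P[i - 1][j - 1] = 1.0 / len(outs)

def power_method(P, eps=1e-12, max_iter=100000):
    """Iterate q^t = q^{t-1} P until the L1 change drops below eps."""
    q = [1.0 / n] * n
    for _ in range(max_iter):
        q_next = [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(q_next, q)) < eps:
            return q_next
        q = q_next
    return q

pi = power_method(P)
```

This chain is strongly connected and aperiodic (it has cycles of length 3 and 4), so the iteration converges to the unique π with π = πP.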

The PageRank random walk Vanilla random walk: make the adjacency matrix stochastic (divide each row by the node's out-degree) and run a random walk.

The PageRank random walk What about sink nodes? What happens when the random walk moves to a node without any outgoing links?

The PageRank random walk Replace the rows of sink nodes with a vector v, typically the uniform vector: P' = P + d v^T, where d_i = 1 if node i is a sink and 0 otherwise.
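A sketch of this sink-node patch; the 4-node transition matrix is a made-up example whose last node has no out-links:

```python
# Patch sink rows: any all-zero row of P is replaced by v
# (P' = P + d v^T, with d the sink indicator vector).
n = 4
P = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],   # sink node: no outgoing links
]
v = [1.0 / n] * n                                   # typically uniform
d = [1.0 if sum(row) == 0 else 0.0 for row in P]    # sink indicator
P_prime = [[P[i][j] + d[i] * v[j] for j in range(n)]
           for i in range(n)]
```

After the patch every row of P' sums to 1, so P' is stochastic.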

The PageRank random walk How do we guarantee irreducibility? Add a random jump to vector v with probability 1-α, typically to a uniform vector: P'' = αP' + (1-α) u v^T, where u is the vector of all 1s.

Effects of random jump Guarantees irreducibility. Motivated by the concept of the random surfer. Offers additional flexibility: personalization, anti-spam. Controls the rate of convergence: the second eigenvalue of matrix P'' is α.

A PageRank algorithm Performing the vanilla power method is now too expensive, since the matrix P'' is not sparse. q^0 = v; t = 0; repeat: t = t + 1; q^t = (P'')^T q^{t-1}; δ = ||q^t - q^{t-1}||; until δ < ε. The key step is the efficient computation of y = (P'')^T x.
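One way to sketch this efficient iteration: the dense P'' is never materialized; each step applies the sparse link structure, then folds in the dangling-node mass and the random-jump mass analytically. The graph and α below are made up, and α follows these slides' convention (probability of following a link):

```python
# Sparse computation of y = (P'')^T x for the PageRank power method.
alpha = 0.85
out_edges = {0: [1, 2], 1: [2], 2: [0], 3: []}   # node 3 is a sink
n = 4
v = [1.0 / n] * n                                # jump vector

def pagerank_step(x):
    y = [0.0] * n
    for i, outs in out_edges.items():            # sparse part: x^T P
        for j in outs:
            y[j] += x[i] / len(outs)
    dangling = sum(x[i] for i, outs in out_edges.items() if not outs)
    total = sum(x)
    # alpha * (P^T x + v * d^T x) + (1 - alpha) * v * u^T x
    return [alpha * (y[j] + dangling * v[j]) + (1 - alpha) * total * v[j]
            for j in range(n)]

def pagerank(eps=1e-12):
    q = v[:]
    while True:
        q_next = pagerank_step(q)
        if sum(abs(a - b) for a, b in zip(q_next, q)) < eps:
            return q_next
        q = q_next

pr = pagerank()
```

Each step preserves total probability mass, so the result is a distribution with every entry strictly positive (thanks to the jump term).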

Random walks on undirected graphs In the stationary distribution of a random walk on an undirected graph, the probability of being at node i is proportional to the (weighted) degree of the vertex Random walks on undirected graphs are not “interesting”
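The claim above is easy to verify numerically: on a small made-up undirected graph, π with π_i = deg(i)/2|E| satisfies π = πP.

```python
# Verify: stationary probability of node i is deg(i) / (2|E|).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4
deg = [0] * n
adj = [[] for _ in range(n)]
for u, w in edges:
    adj[u].append(w)
    adj[w].append(u)
    deg[u] += 1
    deg[w] += 1

pi = [deg[i] / (2 * len(edges)) for i in range(n)]

# One step of the random walk (move to a uniformly random neighbor):
# pi_next = pi P should reproduce pi exactly.
pi_next = [sum(pi[u] / deg[u] for u in adj[j]) for j in range(n)]
```

The check works because each neighbor u contributes pi_u / deg(u) = 1/(2|E|) per incident edge, so node j collects exactly deg(j)/(2|E|).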

Research on PageRank Specialized PageRank: personalization [BP98] (instead of picking a node uniformly at random, favor specific nodes that are related to the user); topic-sensitive PageRank [H02] (compute many PageRank vectors, one for each topic; estimate the relevance of the query with each topic; produce the final PageRank as a weighted combination). Updating PageRank [Chien et al 2002]. Fast computation of PageRank: numerical analysis tricks, node aggregation techniques, dealing with the "Web frontier".

Topic-sensitive PageRank HITS-based scores are very inefficient to compute. PageRank scores are independent of the query. Can we bias PageRank rankings to take the query keywords into account? Topic-sensitive PageRank.

Conventional PageRank computation: r^{(t+1)}(v) = Σ_{u∈N(v)} r^{(t)}(u)/d(u), where N(v) is the set of pages linking to v and d(u) is the out-degree of u. In matrix form, r = M'r with M' = (1-α)P + α[1/n]_{n×n}, i.e. r = (1-α)Pr + α[1/n]_{n×1} = (1-α)Pr + αp, where p = [1/n]_{n×1}. (Note that from here on α denotes the random-jump probability.)

Topic-sensitive PageRank r = (1-α)Pr + αp. In conventional PageRank, p is the uniform vector with values 1/n. Topic-sensitive PageRank uses a non-uniform personalization vector p. This is not simply a post-processing step of the PageRank computation: the personalization vector p introduces bias in all iterations of the iterative computation of the PageRank vector.

Personalization vector In the random-walk model, the personalization vector represents the addition of a set of artificial transition edges, where the probability of an artificial edge (u,v) is αp_v. Given a graph, the result of the PageRank computation depends only on α and p: PR(α, p).

Topic-sensitive PageRank: Overall approach Preprocessing: fix a set of k topics; for each topic c_j, compute the PageRank scores of page u with respect to the j-th topic, r(u,j). Query-time processing: for query q, compute the total score of page u with respect to q as score(u,q) = Σ_{j=1..k} Pr(c_j|q) · r(u,j).
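A sketch of the query-time combination; the topic-specific scores and query-topic probabilities below are illustrative numbers, not from the slides:

```python
# Combine k precomputed topic-specific PageRank scores r(u, j)
# with query-topic probabilities Pr(c_j | q).
k = 3
r = [                               # r[j][u]: score of page u for topic j
    {"pageA": 0.4, "pageB": 0.1},
    {"pageA": 0.2, "pageB": 0.5},
    {"pageA": 0.1, "pageB": 0.2},
]
pr_topic_given_q = [0.7, 0.2, 0.1]  # Pr(c_j | q), assumed already estimated

def score(u):
    """score(u, q) = sum_j Pr(c_j | q) * r(u, j)."""
    return sum(pr_topic_given_q[j] * r[j][u] for j in range(k))
```

With these numbers, pageA wins because the query is dominated by topic 1, where pageA scores highest.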

Topic-sensitive PageRank: Preprocessing Create k different biased PageRank vectors using some pre-defined set of k categories (c_1, ..., c_k). T_j: set of URLs in the j-th category. Use a non-uniform personalization vector p = w_j such that w_j(v) = 1/|T_j| if v ∈ T_j, and 0 otherwise.
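A sketch of building w_j for one category; the page list and T_j are hypothetical:

```python
# Biased personalization vector for topic j: uniform over the
# category's URLs T_j, zero elsewhere.
pages = ["a", "b", "c", "d"]
T_j = {"b", "d"}            # URLs filed under category c_j

w_j = [1.0 / len(T_j) if p in T_j else 0.0 for p in pages]
```

The vector still sums to 1, so it is a valid jump distribution; the random jump now only lands on pages of the category.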

Topic-sensitive PageRank: Query-time processing D_j: class term vectors consisting of all the terms appearing in the k pre-selected categories. Pr(c_j|q) is estimated over the query terms q_i via Bayes' rule: Pr(c_j|q) ∝ Pr(c_j) · Π_i Pr(q_i|c_j). How can we compute Pr(c_j)? How can we compute Pr(q_i|c_j)?

Comparing results of Link Analysis Ranking algorithms Comparing and aggregating rankings

Comparing LAR vectors How close are the LAR vectors w_1, w_2? w_1 = [ ] w_2 = [ ] (the numeric values appear in the slide's figure).

Distance between LAR vectors Geometric distance: how close are the numerical weights of the vectors w_1, w_2? d_1(w_1, w_2) = Σ_i |w_1(i) - w_2(i)|; in the slide's example, d_1(w_1, w_2) = 1.6.
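A sketch of the d_1 distance; the weight vectors below are hypothetical stand-ins, since the slide's actual vectors were lost with the figure:

```python
# Geometric (L1) distance d_1 between two LAR weight vectors.
def d1(w1, w2):
    return sum(abs(a - b) for a, b in zip(w1, w2))

w1 = [1.0, 0.8, 0.5, 0.3, 0.0]
w2 = [0.6, 0.4, 0.5, 0.7, 0.4]
dist = d1(w1, w2)   # 0.4 + 0.4 + 0.0 + 0.4 + 0.4, i.e. 1.6
```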

Distance between LAR vectors Rank distance: how close are the ordinal rankings induced by the vectors w_1, w_2? Kendall's τ distance: the number of pairs of items that the two rankings order differently.
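A quadratic-time sketch of Kendall's τ distance between the rankings induced by two weight vectors (ties are ignored for simplicity):

```python
# Kendall's tau distance: count pairs (i, j) that the two weight
# vectors rank in opposite order.
def kendall_tau_distance(w1, w2):
    n = len(w1)
    disagreements = 0
    for i in range(n):
        for j in range(i + 1, n):
            # The pair disagrees if the sign of the difference flips.
            if (w1[i] - w1[j]) * (w2[i] - w2[j]) < 0:
                disagreements += 1
    return disagreements

d_rev = kendall_tau_distance([3, 2, 1], [1, 2, 3])   # fully reversed
```

For two fully reversed rankings of three items, all three pairs disagree, so the distance is 3; identical rankings give 0.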