Link Structure Analysis

Slides:



Advertisements
Similar presentations
Information Networks Link Analysis Ranking Lecture 8.
Advertisements

Link Analysis. 2 Objectives To review common approaches to link analysis To calculate the popularity of a site based on link analysis To model human judgments.
Graphs, Node importance, Link Analysis Ranking, Random walks
Link Analysis: PageRank
CS345 Data Mining Page Rank Variants. Review Page Rank  Web graph encoded by matrix M N £ N matrix (N = number of web pages) M ij = 1/|O(j)| iff there.
More on Rankings. Query-independent LAR Have an a-priori ordering of the web pages Q: Set of pages that contain the keywords in the query q Present the.
Markov Chains Lecture #5
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine Link Analysis.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 March 23, 2005
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
Link Analysis Ranking. How do search engines decide how to rank your query results? Guess why Google ranks the query results the way it does How would.
Introduction to PageRank Algorithm and Programming Assignment 1 CSC4170 Web Intelligence and Social Computing Tutorial 4 Tutor: Tom Chao Zhou
1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.
. Hidden Markov Model Lecture #6 Background Readings: Chapters 3.1, 3.2 in the text book, Biological Sequence Analysis, Durbin et al., 2001.
Multimedia Databases SVD II. Optimality of SVD Def: The Frobenius norm of a n x m matrix M is (reminder) The rank of a matrix M is the number of independent.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
. Hidden Markov Model Lecture #6 Background Readings: Chapters 3.1, 3.2 in the text book, Biological Sequence Analysis, Durbin et al., 2001.
Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, Slides for Chapter 1:
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 April 2, 2006
15-853Page :Algorithms in the Real World Indexing and Searching III (well actually II) – Link Analysis – Near duplicate removal.
Link Analysis, PageRank and Search Engines on the Web
Network Structure and Web Search Networked Life CIS 112 Spring 2010 Prof. Michael Kearns.
Link Analysis. 2 HITS - Kleinberg’s Algorithm HITS – Hypertext Induced Topic Selection For each vertex v Є V in a subgraph of interest: A site is very.
1 COMP4332 Web Data Thanks for Raymond Wong’s slides.
Information Retrieval Link-based Ranking. Ranking is crucial… “.. From our experimental data, we could observe that the top 20% of the pages with the.
Motivation When searching for information on the WWW, user perform a query to a search engine. The engine return, as the query’s result, a list of Web.
Stochastic Approach for Link Structure Analysis (SALSA) Presented by Adam Simkins.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
Piyush Kumar (Lecture 2: PageRank) Welcome to COT5405.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
Overview of Web Ranking Algorithms: HITS and PageRank
Query Suggestion Naama Kraus Slides are based on the papers: Baeza-Yates, Hurtado, Mendoza, Improving search engines by query clustering Boldi, Bonchi,
PageRank. s1s1 p 12 p 21 s2s2 s3s3 p 31 s4s4 p 41 p 34 p 42 p 13 x 1 = p 21 p 34 p 41 + p 34 p 42 p 21 + p 21 p 31 p 41 + p 31 p 42 p 21 / Σ x 2 = p 31.
Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 
Ranking Link-based Ranking (2° generation) Reading 21.
COMP4210 Information Retrieval and Search Engines Lecture 9: Link Analysis.
Understanding Google’s PageRank™ 1. Review: The Search Engine 2.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris.
PageRank Algorithm -- Bringing Order to the Web (Hu Bin)
KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.
1 Random Walks on the Click Graph Nick Craswell and Martin Szummer Microsoft Research Cambridge SIGIR 2007.
Nadav Eiron, Kevin S.McCurley, JohA.Tomlin IBM Almaden Research Center WWW’04 CSE 450 Web Mining Presented by Zaihan Yang.
CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.
Extrapolation to Speed-up Query- dependent Link Analysis Ranking Algorithms Muhammad Ali Norozi Department of Computer Science Norwegian University of.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
WEB SPAM.
Quality of a search engine
15-499:Algorithms and Applications
HITS Hypertext-Induced Topic Selection
Methods and Apparatus for Ranking Web Page Search Results
Search Engines and Link Analysis on the Web
Markov Chains Mixing Times Lecture 5
Link Analysis 2 Page Rank Variants
7CCSMWAL Algorithmic Issues in the WWW
Link-Based Ranking Seminar Social Media Mining University UC3M
CSE 454 Advanced Internet Systems University of Washington
CSE 454 Advanced Internet Systems University of Washington
Lecture 22 SVD, Eigenvector, and Web Search
CSE 454 Advanced Internet Systems University of Washington
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
CSE 454 Advanced Internet Systems University of Washington
Piyush Kumar (Lecture 2: PageRank)
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Lecture 22 SVD, Eigenvector, and Web Search
Lecture 22 SVD, Eigenvector, and Web Search
Presented by Nick Janus
Presentation transcript:

Link Structure Analysis Kira Radinsky All of the following slides are courtesy of Ronny Lempel (Yahoo!)

236620 Search Engine Technology Link Analysis In the Lecture HITS: topic-specific algorithm Assigns each page two scores – a hub score and an authority score – with respect to a topic PageRank: query independent algorithm Assigns each page a single, global importance score Both algorithms reduced to the computation of principal eigenvectors of certain matrices Today’s Tutorial: Graph modifications in link analysis algorithms SALSA – HITS with a random-walk twist Topic-Sensitive PageRank 9 December 2018 236620 Search Engine Technology

Graph Modifications in Link-Analysis Algorithms Delete irrelevant elements (pages, links) from the collection. Non-informative links Pages that are deemed irrelevant (mostly by similarity of content to the query), and their incident links [Bharat and Henzinger, 1998] Assign varying (positive) link weights to the non-deleted links. Similarity of anchor text to the query [CLEVER] Links incident to pre-defined relevant pages [CLEVER] Multiple links from pages of site A to pages of site B [Bharat and Henzinger, 1998] Note that some of the above modifications are only applicable to topic distillation algorithms Several works suggested performing modifications, that intuitively should improve precision, to the link structure of the analyzed collections: Pages that are deemed irrelevant - for HITS, pruning the initial HITS graph by removing irrelevant pages. “Links incident to pre-defined relevant pages “ - If you know some authorities/hubs and want to find additional ones, you can assign high link weights to links incident to your known good pages. These weights will propagate to the immediate vicinity and will increase the scores of the neighborhood. In general, links from/to known trusted sources can get a weight bonus.   “Multiple links from pages of site A to pages of site B “ – if site A links (from many of its pages) 100 times to site B, it is unreasonable to count all those links with a weight of 1, as they are all under the control of a single entity/site administrator. Therefore, what that paper did was allow an overall weight of 1 to all links between any two sites (in the earlier example this would mean decreasing the weight of each link from A to B to the value of 0.01) CLEVER – IBM project implementing HITS and various variations thereof, started during Kleinberg’s sabbatical at IBM Almaden. Many publications came out the CLEVER project, seehttp://www.almaden.ibm.com/projects/clever.shtml 9 December 2018 236620 Search Engine Technology

SALSA – Stochastic Approach to Link Structure Analysis SALSA, like HITS, is a topic-distillation algorithm that aims to assign pages both hub and authority scores SALSA analyzes the same topic-centric graph as HITS, but splits each node into two – a “hub personality” without in-links and an “authority personality” without out-links Examines the resulting bipartite graph Innovation: incorporate stochastic analysis with the authority-hub paradigm Examine two separate random walk Markov chains: an authority chain A, and a hub chain H. A single step in each chain is composed of two link traversals on the Web - one link forward, and one link backwards. The principal community of each type: the most frequently visited pages in the corresponding Markov Chain 9 December 2018 236620 Search Engine Technology

Forming bi-pirate graph in Salsa Hub walk Follow a Web link from a page uh to a page wa (a forward link) and then Immediately traverse a backlink going from wa to vh, where (u,w) Є E and (v,w) Є E Authority Walk Follow a Web link from a page w(a) to a page u(h) (a backward link) and then Immediately traverse a forward link going back from vh to wa where (u,w) Є E and (v,w) Є E

SALSA – Authority Chain Example Pr (23) = 2/5*1/3 Formally, The transition probability matrix: Hub weight computed from the sum of the product of the inverse degree of the in-links and the out-links A_uv = sum_(w,u),(w,v) [1/ deg(v_a) * 1/deg(w_h)] H_uv = sum_(u,w),(v,w) [1/ deg(u_h) * 1/deg(w_a)] [PA]i,j =  {k| ki, kj} (iin)-1(kout)-1 9 December 2018 236620 Search Engine Technology

236620 Search Engine Technology SALSA: Analysis The transition probabilities induce a probability distribution on the authorities (hubs) in the authority (hub) Markov chain The principal community of authorities (hubs) is defined as the k most probable pages in the authority (hub) chain While one can compute the scores by calculating the principal eigenvector of the stochastic transition matrices, a more efficient way exists 9 December 2018 236620 Search Engine Technology

SALSA: Analysis (cont.) Mathematical Analysis of SALSA leads to the following theorem: SALSA’s authority weights reflect The normalized in-degree of each page, multiplied by the relative size of the page’s component in the authority side of the graph x 3 4 a(x) = ----- x ----- = 0.25 3 +5 4 +2 9 December 2018 236620 Search Engine Technology

SALSA: Proof for Irreducible Authority Chains The proof assumes a weighted graph, in which the link kj has weight w(kj) The examples shown so far assumed that all links have a weight of 1 Define W as the sum of all links weights Define a distribution vector π by πj = din(j)/W, where din(j) is the sum of weights of j’s incoming links Similarly, dout(k) is the sum of weights of k’s outgoing links It is enough to prove that πPA=π, since PA has a single stationary eigenvector (Ergodic Theorem) Recall that PA is the transition matrix of the authority chain PA is always aperiodic 9 December 2018 236620 Search Engine Technology

SALSA: Proof for Irreducible Authority Chains 9 December 2018 236620 Search Engine Technology

Topic Sensitive PageRank [T. Haveliwala, 2002] A topic T is defined by a set of on-topic pages ST. A T-biased PageRank is PageRank where the random jumps (teleportations) land u.a.r. on ST rather than on any arbitrary Web page Recall the alternative interpretation of PageRank, as walking random paths of geometrically distributed lengths between resets Here, a reset returns to some on-topic page If we assume that pages tend to link to pages with topical affinity, short paths starting at ST will not stray too far away from on-topic pages, hence the PageRanks will be T-biased Note that pages unreachable from ST will receive a T-biased PageRank of 0 Where would be a good place to find sets ST for certain topics? The pages classified under the 16 top-level topics of the Open Directory Project (see next slide) 9 December 2018 236620 Search Engine Technology

236620 Search Engine Technology 9 December 2018 236620 Search Engine Technology

Topic-Sensitive PageRank (cont.) 16 PageRank vectors are computed, PR1,…,PR16 Given a query q, its affinity to the 16 topics T1,…,T16 is computed Based on the probability of generating the query by the language model induced by the set of pages ST A distribution vector [α1,…,α16] is computed, where αj ~ Prob(q | language model of STj) The PageRank vector that will be used to serve q is PRq = αjPRj The idea of biasing PageRank’s random jump destinations is also used for personalized PageRank flavors [e.g. Jeh and Widom 2003] 9 December 2018 236620 Search Engine Technology

Link Analysis Algorithms - Summary Many variants and refinements of both HITS and PageRank have been proposed. Other approaches include: Max-flow techniques [Flake et al., SIGKDD 2000] Machine learning and Bayesian techniques Examples of applications: Ranking pages (topic specific/global importance/ personalized rankings) Categorization, clustering, finding related pages Identifying virtual communities Computational issues: Distributed computations of eigenvectors of massive, sparse matrices Convergence acceleration, approximations A wealth of literature 9 December 2018 236620 Search Engine Technology