

Prestige (Seeley, 1949; Brin & Page, 1997; Kleinberg, 1997): Use edge-weighted, directed graphs to model social networks. Status/prestige: in-degree is a good first-order indicator.

Notations: Document citation graph with node adjacency matrix E, where E[i,j] = 1 iff document i cites document j, and zero otherwise. A prestige p[v] is associated with every node v. The prestige vector over all nodes is p.

Fixpoint Prestige Vector: Confer on every node v the sum total of the prestige of all nodes u that link to v. This gives a new prestige score $p' = E^{T} p$. Repeating the assignment $p \leftarrow E^{T} p$ (with p rescaled after each step) leads to a fixpoint of the prestige vector. Fixpoint = principal eigenvector of $E^{T}$. Variants: attenuation factor.
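The fixpoint iteration can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the function name `prestige`, the L1 rescaling after every step, the fixed iteration count, and the three-document example matrix are all assumptions made for the sketch.

```python
def prestige(E, iterations=100):
    """Iterate p <- E^T p, rescaling p to sum to 1 after each step.

    E[i][j] = 1 iff document i cites document j (as in the notation slide).
    The rescaling and the fixed iteration count are assumptions of this sketch.
    """
    n = len(E)
    p = [1.0 / n] * n
    for _ in range(iterations):
        # New prestige of v = total prestige of all u that cite v.
        new_p = [sum(E[u][v] * p[u] for u in range(n)) for v in range(n)]
        total = sum(new_p)
        if total == 0:
            break  # no citations at all; keep the previous vector
        p = [x / total for x in new_p]
    return p


# Hypothetical citation graph: 0 cites 1 and 2, 1 cites 2, 2 cites 0 and 1.
E = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
]
p = prestige(E)
```

In this example documents 1 and 2 each have two citers and end up with equal prestige, while document 0, cited only by document 2, ends up lower.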

Centrality: Graph-based notions of centrality. Distance d(u,v): the number of links on a shortest path between u and v. The radius of node u is $r(u) = \max_{v} d(u,v)$. The center of the graph is $\arg\min_{u} r(u)$. Example: find influential papers in an area of research by looking for papers u with small r(u). No single measure is suited for all applications.
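A small sketch of radius and center, assuming the usual definitions (the radius of u is the largest shortest-path distance from u, and the center is a node of minimum radius) and assuming links are treated as undirected. The helper names and the five-node path graph are hypothetical.

```python
from collections import deque


def distances_from(graph, source):
    """Breadth-first search: shortest link counts from source to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def radius(graph, u):
    """Largest distance from u to any node; infinite if some node is unreachable."""
    dist = distances_from(graph, u)
    if len(dist) < len(graph):
        return float("inf")
    return max(dist.values())


def center(graph):
    """A node with the smallest radius (ties broken by iteration order)."""
    return min(graph, key=lambda u: radius(graph, u))


# Hypothetical undirected path a - b - c - d - e, stored as adjacency lists.
graph = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
```

Here `radius(graph, "c")` is 2 and `radius(graph, "a")` is 4, so `center(graph)` picks `"c"`.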

Co-citation: If document u cites documents v and w, then v and w are said to be co-cited by u. E[i,j]: document citation matrix => $E^{T}E$: co-citation index matrix, whose entry (v,w) counts the documents that cite both v and w. It is an indicator of relatedness between v and w. Clustering: use this pair-wise relatedness measure in a clustering algorithm.
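As a concrete illustration, the co-citation index can be computed directly from the citation matrix. The function name `cocitation` and the four-document example are assumptions made for this sketch.

```python
def cocitation(E):
    """Return C = E^T E, where C[v][w] counts the documents citing both v and w."""
    n = len(E)
    return [
        [sum(E[u][v] * E[u][w] for u in range(n)) for w in range(n)]
        for v in range(n)
    ]


# Hypothetical citations: 0 and 1 both cite 2 and 3; 2 cites 3; 3 cites nothing.
E = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
C = cocitation(E)
```

Documents 2 and 3 are co-cited by documents 0 and 1, so `C[2][3]` is 2. The diagonal entry `C[v][v]` is simply the in-degree of v.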

MDS Map of WWW Co-citations: Social structure of Web communities concerning geophysics, climate, remote sensing, and ecology. The cluster labels are generated manually. [Courtesy Larson]

The surfing model: There is a correspondence between the "surfer model" and the notion of prestige. Page v has high prestige if its visit rate is high. This happens if there are many neighbors u with high visit rates leading to v. Deficiencies: the Web graph is not strongly connected (only a fourth of the graph is!), and the Web graph is not aperiodic. Rank sinks: pages without out-links, and directed cyclic paths.

Surfing Model: Simple fix. Two-way choice at each node: with probability d (0.1 < d < 0.2), the surfer jumps to a random page on the Web; with probability 1 - d, the surfer chooses, uniformly at random, an out-neighbor. [Modified Equation 7.9] Direct solution of the eigen-system is not feasible.

Solution: Power Iterations
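A minimal sketch of power iteration for the modified surfer model. It assumes the common form of the update, $p \leftarrow (1-d)\,L^{T} p + \frac{d}{N}\mathbf{1}$, where L is the row-normalized link matrix and N the number of pages; the slide's own equation is not reproduced in the transcript, so this form is an assumption. The function name, the tolerance, the handling of pages without out-links (their weight is spread uniformly), and the four-page example are also assumptions.

```python
def pagerank(out_links, d=0.15, tol=1e-10, max_iter=1000):
    """Power iteration for the random-surfer model with jump probability d.

    out_links maps each page to the list of pages it links to.
    Pages without out-links spread their weight uniformly (an assumption).
    """
    nodes = list(out_links)
    n = len(nodes)
    p = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Every page receives the random-jump share d / n.
        new_p = {u: d / n for u in nodes}
        for u in nodes:
            targets = out_links[u]
            if targets:
                share = (1 - d) * p[u] / len(targets)
                for v in targets:
                    new_p[v] += share
            else:
                share = (1 - d) * p[u] / n
                for v in nodes:
                    new_p[v] += share
        change = sum(abs(new_p[u] - p[u]) for u in nodes)
        p = new_p
        if change < tol:
            break
    return p


# Hypothetical four-page web.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
```

Page `c` collects links from every other page and ranks highest. Page `d` has no in-links, so its rank is only the random-jump share d / N.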

PageRank Architecture at Google: The ranking of pages is more important than the exact values of p_i. Page ranks converged in 52 iterations for a crawl with 322 million links. The PageRank of each page is pre-computed and stored. PageRank is independent of any query or textual content.

The ranking scheme combines PageRank with textual match. The scheme is unpublished and involves many empirical parameters, human effort, and regression testing. Criticism: ad hoc coupling and decoupling between relevance and prestige.

HITS: Hyperlink Induced Topic Search. Relies on query-time processing for two tasks. First, to select a base set Vq for query q, constructed by selecting a sub-graph R of the Web (the root set) relevant to the query, and then adding any node u that neighbors some r ∈ R via an inbound or outbound edge (the expanded set). Second, to deduce the hubs and authorities that exist in this sub-graph of the Web. Every page u has two distinct measures of merit: its hub score h[u] and its authority score a[u]. The hub and authority scores are defined recursively in terms of each other.

Use a text-based search engine to create a root set of matching documents. Expand the root set to form the base set: a context graph of depth 1, plus additional heuristics.
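The depth-1 expansion can be sketched as follows. The function name `base_set` and the tiny link table are hypothetical, and the "additional heuristics" mentioned on the slide are not modeled.

```python
def base_set(root_set, out_links):
    """Expand a root set by one link in either direction.

    out_links maps each page to the pages it links to.
    """
    root = set(root_set)
    base = set(root)
    for u, targets in out_links.items():
        if u in root:
            base.update(targets)  # pages a root page points to (OUT)
        elif any(v in root for v in targets):
            base.add(u)  # pages that point to a root page (IN)
    return base


# Hypothetical links: y points into the root set, x is pointed to by it,
# and z touches no root page.
links = {
    "r1": ["x"],
    "r2": ["r1"],
    "y": ["r2"],
    "z": ["x"],
    "x": [],
}
expanded = base_set({"r1", "r2"}, links)
```

Page `z` links only to `x`, which is not in the root set, so it stays outside the base set.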

[Figure: query-dependent input. The root set, the pages linking into it (IN), and the pages it links to (OUT) together form the base set.]

Associate two numerical scores with each document in a hyperlinked collection: an authority score and a hub score. Authorities: the most definitive information sources on a specific topic, like conference papers (new ideas). Hubs: the most useful compilations of links to authoritative documents, like journal papers or books (which consolidate or survey significant research).

Basic presumptions: The creation of a link indicates a judgment: conferred authority, endorsement. Authority is not conferred directly from page to page but is mediated through hub nodes: authorities may not be linked directly, but they are linked through co-citation. Example: the pages of major car manufacturers will not point to each other, but there may be hub pages that compile links to such pages. Reference: J. Kleinberg. Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.

Hub & Authority Scores: "Hubs and authorities exhibit what could be called a mutually reinforcing relationship: a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs" [Kleinberg 1999]

Directed graph. Authority score of page i: a[i]. Hub score of page i: h[i].

The HITS algorithm. $\|h\|$ and $\|a\|$ are L1 vector norms.

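Iterative Score Computation (1): Translate the mutual relationship into iterative update equations: $a^{(t)}[i] = \sum_{j} E[j,i]\, h^{(t-1)}[j]$ and $h^{(t)}[i] = \sum_{j} E[i,j]\, a^{(t-1)}[j]$, where E[i,j] = 1 iff page i links to page j.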

Iterative Score Computation (2): Matrix notation. With adjacency matrix E and score vectors a and h, the updates read $a^{(t)} = E^{T} h^{(t-1)}$ and $h^{(t)} = E\, a^{(t-1)}$.

Iterative Score Computation (3): Condense into a single update equation, e.g. $a^{(t)} = E^{T} h^{(t-1)} = E^{T} E\, a^{(t-2)}$. Question of convergence (ignoring absolute scale): existence? uniqueness? Notice the resemblance to eigenvector equations.

Example: [Figure: a simple example graph, its hub and authority matrices, and the resulting authority and hub weights.]

HITS: Topic Distillation Process
1. Send the query to a text-based IR system and obtain the root set.
2. Expand the root set by radius one to obtain an expanded graph.
3. Run power iterations on the hub and authority scores together.
4. Report the top-ranking authorities and hubs.
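Step 3 can be sketched as below. This is an illustration under assumptions rather than Kleinberg's reference code: both scores are updated from the previous iteration's values (updating h from the freshly computed a is an equally common variant), scores are rescaled to unit L1 norm, and the function names and the four-page graph are hypothetical.

```python
def l1_normalize(x):
    """Scale a non-negative vector so its entries sum to 1 (no-op for all zeros)."""
    total = sum(x)
    return [v / total for v in x] if total else x


def hits(E, iterations=50):
    """Power iterations on hub and authority scores.

    E[i][j] = 1 iff page i links to page j within the expanded graph.
    """
    n = len(E)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        # Authority of v: sum of hub scores of the pages pointing to v.
        new_a = [sum(E[u][v] * h[u] for u in range(n)) for v in range(n)]
        # Hub score of u: sum of authority scores of the pages u points to.
        new_h = [sum(E[u][v] * a[v] for v in range(n)) for u in range(n)]
        a, h = l1_normalize(new_a), l1_normalize(new_h)
    return a, h


# Hypothetical expanded graph: page 0 links to 2 and 3, page 1 links to 2.
E = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
a, h = hits(E)
top_authority = max(range(len(a)), key=lambda i: a[i])
top_hub = max(range(len(h)), key=lambda i: h[i])
```

Page 2 is pointed to by both hubs and becomes the top authority, and page 0, which points to both authorities, becomes the top hub.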

HITS: Applications. Clever model. Fine-grained ranking [Soumen WWW10]. Query-sensitive retrieval [Krishna Bharat SIGIR '98].

PageRank vs. HITS
PageRank advantages over HITS: query-time cost is low (HITS computes an eigenvector for every query); less susceptible to localized link spam.
HITS advantages over PageRank: HITS ranking is sensitive to the query; HITS has the notion of hubs and authorities.
Topic-sensitive PageRank [Haveliwala WWW11]: an attempt to make PageRank query sensitive.

HITS: Discussion
Pros: derives topic-specific authority scores; returns a list of hubs in addition to authorities; computationally tractable (due to the focused sub-graph).
Cons: sensitive to Web spam (artificially increased hub and authority weights); query dependence requires an expensive context-graph building step; topic drift: the dominant topic in the base set may not be the intended one.

Relation between HITS, PageRank and LSI HITS algorithm = running SVD on the hyperlink relation (source,target) LSI algorithm = running SVD on the relation (term,document). PageRank on root set R gives same ranking as the ranking of hubs as given by HITS
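The first statement can be made concrete with a short derivation. It assumes the hyperlink relation is represented by the adjacency matrix E of the expanded graph. The symbols U, Σ, V, u_1, and v_1 are introduced here for illustration.

```latex
\begin{aligned}
E &= U \Sigma V^{T} \\
E^{T} E &= V \Sigma^{2} V^{T}, \qquad E E^{T} = U \Sigma^{2} U^{T} \\
a &\propto v_{1}, \qquad h \propto u_{1}
\end{aligned}
```

Here v_1 and u_1 are the right and left singular vectors for the largest singular value. Power iterations on the authority scores multiply by E^T E and power iterations on the hub scores multiply by E E^T, so, when the largest singular value is unique, they converge to v_1 and u_1 respectively.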