Authoritative Sources in a Hyperlinked environment Jon M. Kleinberg

Authoritative Sources in a Hyperlinked environment Jon M. Kleinberg Presented By: Lekhendro

Outline Introduction Constructing focused Subgraph Computing Hubs and Authorities Conclusion

Introduction How can the quality of search on the WWW be improved? Judging the quality of a search requires human evaluation, because notions such as relevance are inherently subjective. The quality of search results and the storage of pages are orthogonal concerns. What kind of problem can be solved by analysis of link structure?

Queries and Authoritative Sources Types of queries: Specific queries, e.g. “Does Netscape support the JDK 1.1 code-signing API?” Broad-topic queries, e.g. “Find information about the Java programming language.” Handling specific queries is difficult because of the scarcity problem: few pages contain the required information, and it is difficult to determine the identity of those pages. For broad-topic queries there are sometimes thousands of relevant pages. This is the abundance problem: the number of pages that could reasonably be returned as relevant is far too large for a human user to digest. One needs a way to filter a small set of the most authoritative or definitive pages from a huge collection of relevant pages.

Limitations of text-based analysis A text-based ranking function is not enough. E.g., for the query “harvard”, www.harvard.edu is the natural authoritative page, but many other web pages may contain “harvard” more often. The most popular pages are often not sufficiently self-descriptive. The term “search engine” usually does not appear on the home pages of search engines such as Yahoo, AltaVista, and Excite. The Honda and Toyota home pages hardly contain the term “automobile manufacturer”.

Analysis of link structure Hyperlinks encode latent human judgment, which can be used to formulate a notion of authority. The creation of a link is a concrete indication of the following type of judgment: the creator of page p, by including a link to page q, has in some measure conferred authority on q. This gives the user an opportunity to find potential authorities purely through the pages that point to them. The paper proposes a link-based model for the conferral of authority and shows that the method consistently identifies relevant, authoritative web pages for broad search topics. However, the concept has pitfalls. Many links are created for navigational purposes only. It is difficult to strike an appropriate balance between relevance and popularity.

Authorities and Hubs Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic. Hubs are index pages that provide lots of useful links to relevant content pages (topic authorities). In-degree: the number of links pointing to a page; it is one simple measure of authority. Out-degree: the number of links from a page to other pages.
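In-degree and out-degree can be read directly off a list of links. The following is a minimal sketch; the function name `degree_counts` and the toy link list are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for (source, target) links."""
    in_degree = Counter()
    out_degree = Counter()
    for source, target in set(edges):  # count each distinct link once
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree

# Hypothetical toy link list: (p, q) means page p links to page q.
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d"), ("b", "d"), ("a", "b")]
in_degree, out_degree = degree_counts(edges)

# Page "d" has the most incoming links, so it ranks first by in-degree alone.
top_by_in_degree = in_degree.most_common(1)[0][0]
```

In-degree by itself rewards raw popularity, which is why the slides go on to weight each incoming link by the quality of the page it comes from.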

Overview Discover authoritative WWW sources globally. Determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web. Given a keyword query, assign a hub value and an authority value to each page. Pages with high authority values are the results of the query.

Hubs & Authorities Mutually reinforcing relationship: hubs point to lots of authorities; authorities are pointed to by lots of hubs. Good hub: a page that points to many good authorities. Good authority: a page pointed to by many good hubs.

Constructing a focused subgraph of WWW Terms: A collection of hyperlinked pages can be viewed as a directed graph G = (V, E); nodes correspond to pages, and a directed edge (p, q) ∈ E indicates the presence of a link from p to q. Given a query string, determine a subgraph of the WWW on which to run the analysis. The subgraph could include all the pages containing the query string, but this approach has the following drawbacks: the set may contain millions of pages, and the best authorities may not belong to it. Instead, focus on a set S of pages with the following properties: (1) S is very small. (2) S is rich in relevant pages. (3) S contains most of the strongest authorities.

Hubs and Authorities Together they tend to form a bipartite graph. [Figure: bipartite graph with hubs on one side and authorities on the other.]

Root Set and Base Set Collect a root set R of top-ranked pages for the query using a text-based search engine (AltaVista). R satisfies the first two properties (it is small and rich in relevant pages) but may not satisfy the third (containing most of the strongest authorities). Every page in R contains the query string, so R is a subset of Q, the set of all pages containing the query. A strong authority for the query topic, although it may not be in the root set, is quite likely to be pointed to by at least one page in the root set. The number of authorities can therefore be increased by expanding the root set along the links that enter and leave it. [Figure: root set.]

Root Set and Base Set (Cont’d) Expand the root set into the base set by including (up to a designated size cut-off): all pages linked to by pages in the root set; all pages that link to a page in the root set. A typical base set contains roughly 1000-5000 pages. [Figure: the root set inside the larger base set.]

Subgraph construction algorithm
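The algorithm body on this slide did not survive in the transcript. The following is a minimal sketch of the expansion described on the two preceding slides, assuming the link structure is already available as two dictionaries. The names `build_base_set`, `out_links`, and `in_links`, and the default cut-off `d=50`, are illustrative assumptions.

```python
import random

def build_base_set(root_set, out_links, in_links, d=50, seed=0):
    """Expand a root set into a base set.

    root_set  : iterable of page ids returned by a text search engine
    out_links : dict mapping page -> set of pages it links to
    in_links  : dict mapping page -> set of pages that link to it
    d         : cap on how many in-linking pages are added per root page
    """
    rng = random.Random(seed)
    root_set = list(root_set)
    base = set(root_set)
    for page in root_set:
        # Add every page the root page points to.
        base |= out_links.get(page, set())
        # Add pages pointing to the root page, at most d of them.
        parents = sorted(in_links.get(page, set()))
        if len(parents) > d:
            parents = rng.sample(parents, d)  # arbitrary subset of size d
        base |= set(parents)
    return base

# Hypothetical toy data.
root = ["r1", "r2"]
out_links = {"r1": {"a"}, "r2": {"a", "b"}}
in_links = {"r1": {"h1", "h2", "h3"}, "r2": {"h1"}}
base = build_base_set(root, out_links, in_links, d=2)
```

Capping the in-links per page keeps one heavily linked root page from flooding the base set, which is the purpose of the size cut-off mentioned above.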

Heuristic Two types of links. Transverse: a link between pages with different domain names. Intrinsic: a link between pages with the same domain name. Delete all intrinsic links: most of them exist for navigation purposes and are less informative or repeat information. Alternatively, keep up to m (4 to 8) pages of the same domain.
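A minimal sketch of the first heuristic, deleting intrinsic links. It assumes that "domain name" means the host part of the URL, which is a simplification; the function name `drop_intrinsic_links` and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

def drop_intrinsic_links(edges):
    """Keep only transverse links: endpoints with different host names."""
    kept = []
    for source, target in edges:
        source_host = urlparse(source).netloc.lower()
        target_host = urlparse(target).netloc.lower()
        if source_host != target_host:
            kept.append((source, target))
    return kept

edges = [
    ("http://a.example/index.html", "http://a.example/about.html"),  # intrinsic
    ("http://a.example/links.html", "http://b.example/"),            # transverse
]
transverse = drop_intrinsic_links(edges)
```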

Iterative Algorithm For each page p ∈ S maintain: Authority score: $a_p$ (vector a). Hub score: $h_p$ (vector h). Initialize all $a_p = h_p = 1$. Maintain normalized scores: $\sum_{p \in S} a_p^2 = 1$ and $\sum_{p \in S} h_p^2 = 1$.

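Computing Hubs and Authorities [Figure: pages v1, v2, v3 with hub scores h(v1), h(v2), h(v3) pointing to a page p; a page p pointing to pages v1, v2, v3 with authority scores a(v1), a(v2), a(v3).]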

Hubs and authorities computation (contd) Authorities are pointed to by lots of good hubs: $a_p = \sum_{q:\,(q,p) \in E} h_q$. Hubs point to lots of good authorities: $h_p = \sum_{q:\,(p,q) \in E} a_q$.

Iterative Algorithm
Initialize for all p ∈ S: $a_p = h_p = 1$
For i = 1 to k:
  For all p ∈ S: $a_p = \sum_{q:\,(q,p) \in E} h_q$ (update auth. scores)
  For all p ∈ S: $h_p = \sum_{q:\,(p,q) \in E} a_q$ (update hub scores)
  For all p ∈ S: $a_p = a_p / c$, with c chosen so that $\sum_{p \in S} (a_p / c)^2 = 1$ (normalize a)
  For all p ∈ S: $h_p = h_p / c$, with c chosen so that $\sum_{p \in S} (h_p / c)^2 = 1$ (normalize h)
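The iteration can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the presenter's code: the function name `hits`, the default `k=20`, and the four-page graph are hypothetical, and the hub update uses the freshly computed authority scores.

```python
import math

def hits(pages, edges, k=20):
    """Run k rounds of the hub/authority updates on a small link graph.

    pages : iterable of page ids (the base set S)
    edges : iterable of (p, q) pairs meaning "p links to q"
    """
    pages = list(pages)
    edges = list(edges)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(k):
        # Authority update: sum the hub scores of pages pointing to q.
        new_auth = {p: 0.0 for p in pages}
        for p, q in edges:
            new_auth[q] += hub[p]
        # Hub update: sum the new authority scores of pages p points to.
        new_hub = {p: 0.0 for p in pages}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        # Normalize so that the squared scores sum to 1.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub

# Hypothetical graph: h1 and h2 both link to a1; only h1 links to a2.
pages = ["h1", "h2", "a1", "a2"]
edges = [("h1", "a1"), ("h2", "a1"), ("h1", "a2")]
auth, hub = hits(pages, edges)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this toy graph a1 is pointed to by both hubs and so gets the highest authority score, while h1 points to both authorities and gets the highest hub score.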

Example: Mini Web Pages X, Y, Z with adjacency matrix M. In matrix form, $A = M^{T} H$ and $H = M A$; iterating, $A_i = M^{T} H_{i-1}$ and $H_i = M A_{i-1}$. [Figure: link graph on X, Y, Z and the entries of M.]

Example [Table: hub and authority scores of X, Y, Z at iterations 0, 1, 2, 3, …, ∞.] X is the best hub. Z is the most authoritative.

Results Authorities for query “Java”: java.sun.com, comp.lang.java FAQ. Authorities for query “search engine”: Yahoo.com, Excite.com, Lycos.com, Altavista.com. Authorities for query “Gates”: Microsoft.com, roadahead.com.

Conclusions A technique for locating high-quality information related to a broad search topic, based on link analysis. It is performed on the set of web pages retrieved for each query and computes authorities and hubs. No indexing is needed; only an interface to existing search engines is required. IBM expanded HITS into CLEVER, but it is not seen as a viable search engine (the computation is hard to execute in real time).

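Basic knowledge of Matrix M: a symmetric n×n matrix; $\omega$: a vector; $\lambda$: a number. If for some vector $\omega$, $M\omega = \lambda\omega$, we say $\lambda$ is an eigenvalue of M and $\omega$ is an eigenvector. The set of all such $\omega$ is a subspace of $R^n$, the eigenspace associated with $\lambda$. The $\lambda_1(M), \lambda_2(M), \ldots$ are eigenvalues, while $\omega_1(M), \omega_2(M), \ldots$ are eigenvectors; $\omega_i(M)$ belongs to the eigenspace of $\lambda_i(M)$. If we assume $|\lambda_1(M)| > |\lambda_2(M)|$, we refer to $\omega_1(M)$ as the principal eigenvector, and to all other $\omega_i(M)$ as non-principal eigenvectors.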

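Convergence Proof of Iterate Procedure Theorem 1. The sequences $x_1, x_2, x_3, \ldots$ and $y_1, y_2, y_3, \ldots$ converge to $x^*$ and $y^*$ respectively. Proof: Let $G = (V, E)$ with $V = \{p_1, p_2, \ldots, p_n\}$, and let A be the adjacency matrix of graph G: $A_{ij} = 1$ if $(p_i, p_j)$ is an edge of G, and 0 otherwise. The I and O operations can be written as $x \leftarrow A^{T} y$ and $y \leftarrow A x$, where x holds the authority weights and y the hub weights. Starting from the all-ones vector z, $x_1 = A^{T} z$, and each further loop gives $x_k \leftarrow A^{T} A\, x_{k-1}$. After k loops, $x_k$ is the unit vector in the direction of $(A^{T} A)^{k-1} A^{T} z$, and $y_k$ is the unit vector in the direction of $(A A^{T})^{k} z$. Convergence then follows from the fact: “if $\nu$ is a vector not orthogonal to the principal eigenvector $\omega_1(M)$, the unit vector in the direction of $M^{k}\nu$ converges to $\omega_1(M)$ as k increases without bound.”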

Convergence Proof of Iterate Procedure (cont.) A is called an orthogonal matrix if $A A^{T} = A^{T} A = E$, where E is the identity matrix. Theorem 2: $x^*$ is the principal eigenvector of $A^{T} A$, and $y^*$ is the principal eigenvector of $A A^{T}$. Experiments find that k = 20 is sufficient for the vectors to converge.
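Theorem 2 can be checked numerically on a small graph. The sketch below uses plain Python lists; the 3-page adjacency matrix and the helper names are hypothetical and chosen only for illustration. It iterates the authority update and then verifies that the result satisfies the eigenvector equation for $A^{T} A$.

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def mat_mul(a, b):
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical adjacency matrix: A[i][j] = 1 if page i links to page j.
A = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
AtA = mat_mul(transpose(A), A)

# x_1 is the unit vector along A^T z, with z the all-ones vector.
x = unit(mat_vec(transpose(A), [1.0, 1.0, 1.0]))
for _ in range(50):
    x = unit(mat_vec(AtA, x))  # x_k is the unit vector along A^T A x_{k-1}

# If x is an eigenvector of A^T A, then (A^T A) x is a scalar multiple of x.
Mx = mat_vec(AtA, x)
eigenvalue = sum(a * b for a, b in zip(Mx, x))  # Rayleigh quotient, x has unit length
residual = max(abs(a - eigenvalue * b) for a, b in zip(Mx, x))
```

For this matrix, $A^{T} A$ has eigenvalues $(3 \pm \sqrt{5})/2$ and 1, so the iteration settles on the eigenvector for $(3 + \sqrt{5})/2$ and the residual shrinks to rounding error.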
