Authoritative Sources in a Hyperlinked Environment. Presented by Lokesh Chikkakempanna.

Agenda Introduction. Central issue. Queries. Constructing a focused subgraph. Computing hubs and authorities. Extracting authorities and hubs. Similar-page queries. Conclusion.

Introduction The process of discovering pages that are relevant to a particular query. A hyperlinked environment can be a rich source of information, so we analyze the link structure of the WWW. The WWW is a hypertext corpus of enormous complexity, and it continues to expand at a very fast rate. High-level structure can only emerge through an analysis of the whole WWW environment.

Central Issue Distillation of broad search topics through the discovery of "authoritative" information sources: link analysis for discovering authoritative pages. Improving the quality of search methods on the WWW is a rich and interesting problem, because a solution must be efficient in both computation and storage. What does a typical search tool compute in the extra time it takes to produce results that are of greater value to the user? There is no concretely defined objective function that corresponds to human notions of quality.

Queries Types of queries: Specific queries lead to a scarcity problem. Broad-topic queries lead to an abundance problem: from a huge set of relevant pages, we must filter out and provide a small set of the most "authoritative" or "definitive" ones.

Problems in identifying authorities Example: "harvard". There are over a million pages on the web that use the term "harvard". Remember "TF", term frequency: the most authoritative page, www.harvard.edu, is not the one that uses the term most often. How do we circumvent this problem?

Link analysis Human judgement is needed to formulate the notion of authority. If a person includes a link to page q in page p, he has conferred authority on q in some measure. What are the problems with this?

Links may be created for various reasons. Examples: for navigational purposes, or as paid advertisements; a hacker may even create a bot that keeps adding links to pages. Solution?

Link-based model for the conferral of authority Identifies relevant, authoritative WWW pages for broad search topics. Based on the relationship between authorities and hubs. Exploits the equilibrium between authorities and hubs to develop an algorithm that identifies both types of pages simultaneously.

The algorithm operates on a focused subgraph produced with the help of a text-based search engine (for example, AltaVista). It produces a small collection of pages likely to contain the most authoritative pages for a given topic.

Constructing a focused subgraph of the WWW We can view any collection V of hyperlinked pages as a directed graph G = (V, E): the nodes correspond to the pages, and an edge (p, q) indicates the presence of a link from p to q. We construct a subgraph of the WWW on which the algorithm operates.

The goal is to focus the computational effort on relevant pages. We want a collection S_σ of pages with three properties: (i) S_σ is relatively small; (ii) S_σ is rich in relevant pages; (iii) S_σ contains most (or many) of the strongest authorities. How do we find such a collection of pages?

Take the t highest-ranked pages for the query σ from a text-based search engine. These t pages are referred to as the root set R_σ. The root set satisfies conditions (i) and (ii), but it is far from satisfying (iii). Why?

There are often extremely few links between pages in R_σ, rendering it essentially structureless. Example: the root set for the query "java" contained only 15 links between pages in different domains, out of 200 × 199 possible links (t = 200).

We can use the root set R_σ to produce a set S_σ that satisfies all three conditions. A strong authority may not be in R_σ, but it is likely to be pointed to by at least one page in R_σ. Subgraph(σ, E, t, d): σ is the query string, E is a text-based search engine, and t and d are natural numbers.

S_σ is obtained by growing R_σ to include any page pointed to by a page in R_σ and any page that points to a page in R_σ. A single page in R_σ brings at most d pages pointing to it into S_σ. Does this S_σ contain authorities?
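
As a rough illustration, here is a minimal Python sketch of this Subgraph procedure. The helpers search_engine(sigma, t), out_links(p), and in_links(p, d) are hypothetical stand-ins: the first returns the t highest-ranked URLs for the query, and the other two would need a crawler or a reverse-link index in practice.

```python
def subgraph(sigma, search_engine, out_links, in_links, t=200, d=50):
    """Sketch of Subgraph(sigma, E, t, d): grow the root set R_sigma
    into the base set S_sigma."""
    root = set(search_engine(sigma, t))   # root set R_sigma
    s = set(root)
    for p in root:
        s.update(out_links(p))            # every page that p points to
        s.update(in_links(p, d))          # at most d pages pointing to p
    return s
```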

Heuristics to reduce S_σ Two types of links: transverse links, between pages with different domain names, and intrinsic links, between pages with the same domain name. Remove all the intrinsic links to get a graph G_σ.

Sometimes a large number of pages from a single domain all point to a page p, often because of mass-produced links such as advertisements. Allow only m ≈ 4-8 pages from a single domain to point to any given page p. G_σ now contains many relevant pages and strong authorities.
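
A sketch of these two heuristics, again with assumed inputs: pages is the base set S_σ and links is an iterable of (source, target) URL pairs; full URLs are assumed so that urlparse can extract the domain.

```python
from collections import defaultdict
from urllib.parse import urlparse

def domain(url):
    # Domain name of a URL, e.g. "www.cornell.edu"
    return urlparse(url).netloc

def focused_graph(pages, links, m=6):
    """Build the edge set of G_sigma from S_sigma: keep only transverse
    links, and allow at most m links to a given page from one domain."""
    per_domain = defaultdict(int)   # (source domain, target page) -> count
    edges = []
    for p, q in links:              # each (p, q) is a link from page p to q
        if p not in pages or q not in pages:
            continue                # keep only links inside S_sigma
        if domain(p) == domain(q):
            continue                # drop intrinsic links
        key = (domain(p), q)
        if per_domain[key] >= m:
            continue                # cap mass-produced links from one domain
        per_domain[key] += 1
        edges.append((p, q))
    return edges
```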

Computing hubs and authorities Extracting authorities based on maximum in-degree alone does not work. Example: for the query "java", the pages with the largest in-degree included www.gamelan.com and java.sun.com, together with advertising pages and the home page of Amazon. While the first two are good answers, the others are not relevant.

Authoritative pages relevant to the initial query should not only have large in-degree; since they are all authorities on a common topic, there should also be considerable overlap in the sets of pages that point to them. Thus, in addition to authorities, we should find what are called hub pages: pages that have links to multiple relevant authoritative pages.

Hub pages allow us to throw away unrelated pages with high in-degree. Mutually reinforcing relationship: a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs. We need to break this circularity to identify hubs and authorities. How?

An iterative algorithm Maintains and updates numerical weights for each page: each page p is associated with a non-negative authority weight x^p and a non-negative hub weight y^p. Each type of weight is normalized so that the squares sum to 1. Pages with larger x and y values are considered better authorities and hubs, respectively. Two operations update the weights.

The first operation, I, updates the authority weights: x^p ← Σ_{q:(q,p)∈E} y^q, so a page's authority weight is the sum of the hub weights of the pages pointing to it. The second operation, O, updates the hub weights: y^p ← Σ_{q:(p,q)∈E} x^q, so a page's hub weight is the sum of the authority weights of the pages it points to.

The set of authority weights is represented as a vector x with a coordinate for each page in G_σ; similarly, the set of hub weights is represented as a vector y.

Iterate(G, k)
G: a collection of n linked pages
k: a natural number
Let z denote the vector (1, 1, 1, ..., 1) ∈ R^n.
Set x_0 := z. Set y_0 := z.
For i = 1, 2, ..., k:
  Apply the I operation to (x_{i-1}, y_{i-1}), obtaining new x-weights x_i'.
  Apply the O operation to (x_i', y_{i-1}), obtaining new y-weights y_i'.
  Normalize x_i', obtaining x_i.
  Normalize y_i', obtaining y_i.
End
Return (x_k, y_k).
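
The pseudocode translates almost line for line into Python. This is an illustrative sketch rather than the paper's implementation; pages is assumed to be a collection of page identifiers and edges a list of (p, q) link pairs from the focused graph.

```python
import math

def iterate(pages, edges, k):
    """HITS Iterate(G, k): run k rounds of the I and O operations with
    normalization; returns (authority weights x, hub weights y)."""
    x = {p: 1.0 for p in pages}   # x_0 := z
    y = {p: 1.0 for p in pages}   # y_0 := z
    for _ in range(k):
        # I operation: x[q] <- sum of hub weights of pages pointing to q
        x_new = {p: 0.0 for p in pages}
        for p, q in edges:
            x_new[q] += y[p]
        # O operation: y[p] <- sum of new authority weights of pages p points to
        y_new = {p: 0.0 for p in pages}
        for p, q in edges:
            y_new[p] += x_new[q]
        # Normalize so the squares of each weight type sum to 1
        for w in (x_new, y_new):
            norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
            for p in w:
                w[p] /= norm
        x, y = x_new, y_new
    return x, y
```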

Filter out the top c authorities and top c hubs.
Filter(G, k, c)
G: a collection of n linked pages
k, c: natural numbers
(x_k, y_k) := Iterate(G, k).
Report the pages with the c largest coordinates in x_k as authorities.
Report the pages with the c largest coordinates in y_k as hubs.
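
Given the iterate sketch above, Filter reduces to two sorts (again illustrative, with the same assumed inputs):

```python
def filter_top(pages, edges, k=20, c=5):
    """Filter(G, k, c): the c pages with the largest authority weights
    and the c pages with the largest hub weights after k iterations."""
    x, y = iterate(pages, edges, k)
    authorities = sorted(pages, key=lambda p: x[p], reverse=True)[:c]
    hubs = sorted(pages, key=lambda p: y[p], reverse=True)[:c]
    return authorities, hubs
```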

Filter is applied with G set equal to G_σ and c ≈ 5-10. As k grows arbitrarily large, the sequences of vectors {x_k} and {y_k} converge to fixed points x* and y*. What is happening in R^n in the Iterate algorithm? Linear algebra: λ is an eigenvalue of an n × n matrix M if Mω = λω for some nonzero vector ω, and the eigenspace associated with λ is the set of such vectors. The fixed points x* and y* are principal eigenvectors of AᵀA and AAᵀ, where A is the adjacency matrix of G_σ.
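
The convergence claim can be checked numerically. In matrix form the I and O operations are x ← Aᵀy and y ← Ax, so x follows a power iteration on AᵀA and should converge to its principal eigenvector. A toy check on a small made-up graph:

```python
import numpy as np

# A[i, j] = 1 if page i links to page j (a made-up 4-page graph)
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 1, 0]], dtype=float)

x, y = np.ones(4), np.ones(4)
for _ in range(100):
    x = A.T @ y                 # I operation: authority weights from hub weights
    x /= np.linalg.norm(x)
    y = A @ x                   # O operation: hub weights from authority weights
    y /= np.linalg.norm(y)

_, eigvecs = np.linalg.eigh(A.T @ A)   # A^T A is symmetric; eigenvalues ascend
principal = np.abs(eigvecs[:, -1])     # eigenvector of the largest eigenvalue
print(np.allclose(x, principal))       # expect True: x* is the principal eigenvector
```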

Similar-Page Queries The algorithm discussed can be applied to another type of problem: using the link structure to infer a notion of "similarity" among pages. We begin with a page p and pose the request "Find t pages pointing to p"; these pages form the root set, and the rest of the algorithm proceeds as before.
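
A sketch of this variant, reusing the hypothetical in_links/out_links helpers and the focused_graph and filter_top sketches above; the only change from the broad-topic case is how the root set is assembled.

```python
def similar_pages(p, in_links, out_links, links, t=200, d=50, k=20, c=10):
    """Sketch: pages 'similar' to p are the strongest authorities in the
    subgraph grown around the t pages that point to p."""
    root = set(in_links(p, t))          # root set: "Find t pages pointing to p"
    base = set(root) | {p}
    for q in root:
        base.update(out_links(q))       # grow the base set exactly as before
        base.update(in_links(q, d))
    edges = focused_graph(base, links)  # heuristics from the earlier sketch
    authorities, _ = filter_top(base, edges, k, c)
    return authorities
```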

Conclusion The approach developed here might be integrated into a study of traffic patterns on the WWW. Future work could extend the method beyond broad-topic queries. It would also be interesting to understand the eigenvector-based heuristics completely in the context of the algorithms presented here.

Thank You!