Published by Farida Jayadi. Modified over 5 years ago.
1
Authoritative Sources in a Hyperlinked Environment
Jon M. Kleinberg
Presented By: Lekhendro
2
Outline
Introduction
Constructing a Focused Subgraph
Computing Hubs and Authorities
Conclusion
3
Introduction
How can the quality of search on the WWW be improved?
Judging the quality of search requires human evaluation, because notions such as relevance are inherently subjective.
The quality of search results and storage are orthogonal issues.
What kind of problem can be solved by analysis of link structure?
4
Queries and Authoritative Sources
Types of queries:
Specific queries, e.g. “Does Netscape support the JDK 1.1 code-signing API?”
Broad-topic queries, e.g. “Find information about the Java programming language.”
Handling specific queries is difficult because of the scarcity problem: few pages contain the required information, and it is difficult to determine the identity of those pages.
For broad-topic queries there are sometimes thousands of relevant pages. This is the abundance problem: the number of pages that could reasonably be returned as relevant is far too large for a human user to digest.
One needs a way to filter a small set of authoritative or definitive pages out of a huge collection of relevant pages.
What kind of problem can be solved by analysis of link structure?
5
Limitations of text based analysis
Text-based ranking functions fall short in two ways.
For the query “harvard”, there is a proper authoritative page, but many other web pages may contain the term “harvard” more often.
The most popular pages are often not sufficiently self-descriptive.
The term “search engine” usually does not appear on the home pages of search engines such as Yahoo, AltaVista, and Excite.
The Honda and Toyota home pages hardly contain the term “automobile manufacturer”.
6
Analysis of link structure
Hyperlinks encode latent human judgment, which can be used to formulate a notion of authority.
The creation of a link is a concrete indication of the following type of judgment: the creator of page p, by including a link to page q, has in some measure conferred authority on q.
This gives the user an opportunity to find potential authorities purely through the pages that point to them.
The paper proposes a link-based model for the conferral of authority and shows that the method consistently identifies relevant, authoritative web pages for broad search topics.
However, the concept has pitfalls:
Many links are created purely for navigational purposes.
It is difficult to strike the right balance between relevance and popularity.
7
Authorities and Hubs
Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic.
Hubs are index pages that provide many useful links to relevant content pages (topic authorities).
In-degree: the number of links pointing to a page; it is one simple measure of authority.
Out-degree: the number of links from a page to other pages.
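The two degree measures can be illustrated with a short sketch. The link list and the page names (`p1`, `q`, and so on) are hypothetical and not taken from the slides.

```python
from collections import Counter

def degrees(links):
    """Return (in_degree, out_degree) counters for a list of (source, target) links."""
    in_degree = Counter()
    out_degree = Counter()
    for source, target in links:
        out_degree[source] += 1  # one more link leaving the source page
        in_degree[target] += 1   # one more link entering the target page
    return in_degree, out_degree

# Hypothetical toy link list: three pages point to q, and q points back to p1.
links = [("p1", "q"), ("p2", "q"), ("p3", "q"), ("q", "p1")]
in_degree, out_degree = degrees(links)
print("in-degree of q:", in_degree["q"])
print("out-degree of q:", out_degree["q"])
```

In-degree alone rewards any heavily linked page, which is why the slides go on to pair it with hub scores.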
8
Overview
Discover authoritative WWW sources globally.
Determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web.
Given a keyword query, assign a hub value and an authority value to each page.
Pages with high authority values are the results of the query.
9
Hubs & Authorities
Mutually reinforcing relationship:
Hubs point to lots of authorities.
Authorities are pointed to by lots of hubs.
Good hub: a page that points to many good authorities.
Good authority: a page pointed to by many good hubs.
10
Constructing a focused subgraph of WWW
Terms: A collection of hyperlinked pages can be viewed as a directed graph G = (V, E); nodes correspond to pages, and a directed edge (p, q) ∈ E indicates the presence of a link from p to q.
Given a query string, determine the subgraph of the WWW on which to run the analysis.
One candidate is the set of all pages containing the query string. This approach has the following drawbacks:
The set may contain millions of pages.
The best authorities may not belong to this set.
Focus instead on a set S of pages with the following properties:
1. S is very small.
2. S is rich in relevant pages.
3. S contains most of the strongest authorities.
11
Hubs and Authorities
Together they tend to form a bipartite graph, with hubs on one side pointing to authorities on the other.
12
Root Set and Base Set
Collect a root set R of the top-ranked pages for the query, using a text-based search engine (AltaVista).
R satisfies properties 1 and 2 but may not satisfy property 3.
Every page in R contains the query string, so R is a subset of Q, the set of all pages containing the query string.
A strong authority on the query topic may not be in the root set, but it is quite likely to be pointed to by at least one page in the root set.
The number of authorities can therefore be increased by expanding the root set along the links that enter and leave it.
13
Root Set and Base Set (Cont’d)…
Expand the root set into the base set by including (up to a designated size cut-off):
all pages linked to by pages in the root set
all pages that link to a page in the root set
Typical base set contains roughly … pages.
14
Subgraph construction algorithm
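Only the slide title remains here, so the following is a minimal sketch of the expansion described on the previous slides, not the slide's own algorithm. It assumes the link structure is available as two dictionaries; the function name `build_base_set`, the parameter `d` (a per-page cap on in-linking pages, one reading of the "designated size cut-off"), and the toy data are all assumptions.

```python
def build_base_set(root_set, out_links, in_links, d):
    """Expand a root set into a base set.

    root_set  : pages returned by a text-based search engine
    out_links : dict mapping a page to the set of pages it links to
    in_links  : dict mapping a page to the set of pages that link to it
    d         : assumed cap on how many in-linking pages are added per root page
    """
    root = list(root_set)
    base = set(root)
    for page in root:
        # Add every page that this root page points to.
        base |= out_links.get(page, set())
        # Add at most d pages that point to this root page.
        # Sorting only makes the choice deterministic; any d pages would do.
        pointing = sorted(in_links.get(page, set()))
        base |= set(pointing[:d])
    return base

# Hypothetical toy data.
root = ["r1", "r2"]
out_links = {"r1": {"a"}, "r2": {"a", "b"}}
in_links = {"r1": {"h1", "h2", "h3"}, "r2": {"h1"}}
base = build_base_set(root, out_links, in_links, d=2)
print(sorted(base))
```

The cap keeps a single very popular root page from flooding the base set with thousands of in-linking pages.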
15
Heuristic
Two types of links:
Transverse: a link between pages with different domain names.
Intrinsic: a link between pages with the same domain name.
Delete all intrinsic links:
Most of them exist for navigation purposes.
They are less informative or repeat information.
Alternatively, allow only up to m (4 to 8) pages from the same domain to point to any given page.
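A minimal sketch of the intrinsic-link filter follows. It assumes that "domain name" means the full hostname of the URL, which is one possible reading; the helper names and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

def is_intrinsic(source_url, target_url):
    """A link is intrinsic when both pages share the same hostname (assumed meaning of 'domain name')."""
    return urlparse(source_url).hostname == urlparse(target_url).hostname

def transverse_links(links):
    """Keep only links between pages on different hostnames."""
    return [(s, t) for s, t in links if not is_intrinsic(s, t)]

# Hypothetical example URLs.
links = [
    ("http://example.com/index.html", "http://example.com/about.html"),  # intrinsic
    ("http://example.com/index.html", "http://example.org/java.html"),   # transverse
]
kept = transverse_links(links)
print(kept)
```

Comparing full hostnames treats `www.example.com` and `docs.example.com` as different domains; a stricter filter would compare registered domains instead.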
16
Iterative Algorithm
For each page p ∈ S maintain:
Authority score: $a_p$ (vector a)
Hub score: $h_p$ (vector h)
Initialize all $a_p = h_p = 1$.
Maintain normalized scores: $\sum_{p \in S} a_p^2 = 1$ and $\sum_{p \in S} h_p^2 = 1$.
17
Computing Hubs and authorities
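[Figure: a page p with incoming links from v1, v2, v3, whose hub scores h(v1), h(v2), h(v3) feed the authority score of p, and outgoing links to v1, v2, v3, whose authority scores a(v1), a(v2), a(v3) feed the hub score of p.]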
18
Hubs and authorities computation (contd) …
Authorities are pointed to by lots of good hubs: $a_p = \sum_{q : q \to p} h_q$
Hubs point to lots of good authorities: $h_p = \sum_{q : p \to q} a_q$
19
Iterative Algorithm
Initialize for all p ∈ S: $a_p = h_p = 1$
For i = 1 to k:
  For all p ∈ S: $a_p = \sum_{q : q \to p} h_q$ (update auth. scores)
  For all p ∈ S: $h_p = \sum_{q : p \to q} a_q$ (update hub scores)
  For all p ∈ S: $a_p = a_p / c$, where $c = \sqrt{\sum_{p \in S} a_p^2}$ (normalize a)
  For all p ∈ S: $h_p = h_p / c$, where $c = \sqrt{\sum_{p \in S} h_p^2}$ (normalize h)
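The iterative algorithm can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions: the base set is given as a list of (source, target) links, and the function names `hits` and `normalize` and the toy pages are hypothetical.

```python
import math

def normalize(scores):
    """Scale the scores so that their squares sum to 1."""
    c = math.sqrt(sum(v * v for v in scores.values()))
    if c == 0:
        return dict(scores)  # nothing to scale in a graph without links
    return {page: v / c for page, v in scores.items()}

def hits(links, k=20):
    """Run k rounds of hub/authority updates over a list of (source, target) links."""
    links = list(links)
    pages = {page for link in links for page in link}
    auth = {page: 1.0 for page in pages}
    hub = {page: 1.0 for page in pages}
    for _ in range(k):
        # Authority update: sum the hub scores of the pages pointing to p.
        new_auth = {page: 0.0 for page in pages}
        for source, target in links:
            new_auth[target] += hub[source]
        # Hub update: sum the (new) authority scores of the pages p points to.
        new_hub = {page: 0.0 for page in pages}
        for source, target in links:
            new_hub[source] += new_auth[target]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub

# Hypothetical toy graph: h1 points to a1 and a2, h2 points to a1.
auth, hub = hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1")])
print("top authority:", max(auth, key=auth.get))
print("top hub:", max(hub, key=hub.get))
```

Normalizing after both updates, as in the slide, changes only the length of the vectors, not their direction, so the ranking is unaffected by where the scaling happens.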
20
Example: Mini Web
[Figure: link graph on three pages X, Y, Z and its link matrix M, with rows and columns ordered X, Y, Z.]
$A_i = M^T H_{i-1}$
$H_i = M A_i$
21
Example Iteration …
X is the best hub.
Z is the most authoritative.
[Figure: link graph on X, Y, Z.]
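The matrix form of the iteration can be tried on a small graph. The sketch below uses a hypothetical three-page link matrix (an assumption, not the matrix from the slide), chosen so that X links to Y and Z, and Y links to Z; all helper names are illustrative.

```python
import math

PAGES = ["X", "Y", "Z"]

# Hypothetical link matrix: row = source page, column = target page.
M = [
    [0, 1, 1],  # X links to Y and Z
    [0, 0, 1],  # Y links to Z
    [0, 0, 0],  # Z links to nothing
]

def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def unit(vector):
    """Scale a nonzero vector to length 1."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

def iterate(matrix, k=20):
    hubs = [1.0] * len(matrix)
    auths = [1.0] * len(matrix)
    m_t = transpose(matrix)
    for _ in range(k):
        auths = unit(mat_vec(m_t, hubs))     # A_i = M^T H_{i-1}
        hubs = unit(mat_vec(matrix, auths))  # H_i = M A_i
    return auths, hubs

auths, hubs = iterate(M)
best_authority = PAGES[auths.index(max(auths))]
best_hub = PAGES[hubs.index(max(hubs))]
print("best hub:", best_hub, "most authoritative:", best_authority)
```

On this assumed graph the ranking agrees with the slide's conclusion: X comes out as the best hub and Z as the most authoritative page.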
22
Results
Authorities for query “Java”: java.sun.com, comp.lang.java FAQ
Authorities for query “search engine”: Yahoo.com, Excite.com, Lycos.com, Altavista.com
Authorities for query “Gates”: Microsoft.com, roadahead.com
23
Conclusions
A technique for locating high-quality information related to a broad search topic, based on link analysis.
It is performed on the set of web pages retrieved for each query.
It computes authorities and hubs.
No indexing is needed; only an interface to different search engines is required.
IBM expanded HITS into CLEVER, but it was not seen as a viable search engine, because computing the scores in real time is hard.
24
Basic knowledge of Matrix
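M: a symmetric n×n matrix; $\omega$: a vector; $\lambda$: a number.
If for some vector $\omega$, $M\omega = \lambda\omega$, we say that $\lambda$ is an eigenvalue of M.
The set of all such $\omega$ is a subspace of $R^n$, the eigenspace associated with $\lambda$.
$\lambda_1(M), \lambda_2(M), \ldots$ are the eigenvalues, while $\omega_1(M), \omega_2(M), \ldots$ are the eigenvectors; $\omega_i(M)$ belongs to the eigenspace of $\lambda_i(M)$.
If we assume $|\lambda_1(M)| > |\lambda_2(M)|$, we refer to $\omega_1(M)$ as the principal eigenvector, and to all other $\omega_i(M)$ as non-principal eigenvectors.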
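As an illustration of these definitions, the sketch below approximates the principal eigenvector of a small symmetric matrix by repeated multiplication and rescaling (power iteration). The 2×2 matrix and the function name are hypothetical choices for the example.

```python
import math

def power_iteration(matrix, steps=50):
    """Approximate the principal eigenvalue and unit eigenvector of a symmetric matrix."""
    vector = [1.0] * len(matrix)
    for _ in range(steps):
        product = [sum(m * v for m, v in zip(row, vector)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    # Rayleigh quotient v^T M v gives the eigenvalue for a unit vector v.
    product = [sum(m * v for m, v in zip(row, vector)) for row in matrix]
    eigenvalue = sum(p * v for p, v in zip(product, vector))
    return eigenvalue, vector

# Hypothetical symmetric matrix with eigenvalues (5 + sqrt(5))/2 and (5 - sqrt(5))/2.
M = [[2.0, 1.0], [1.0, 3.0]]
eigenvalue, omega = power_iteration(M)
print(eigenvalue, omega)
```

The gap between the two eigenvalues controls how quickly the direction settles, which is why the assumption $|\lambda_1(M)| > |\lambda_2(M)|$ matters.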
25
Convergence Proof of Iterate Procedure
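Theorem 1. The sequences $x_1, x_2, x_3, \ldots$ and $y_1, y_2, y_3, \ldots$ converge to $x^*$ and $y^*$ respectively.
Proof: Let G = (V, E) with V = {p1, p2, …, pn}, and let A be the adjacency matrix of G, so that $A_{ij} = 1$ if $(p_i, p_j)$ is an edge of G.
The I and O operations can be written as $x \leftarrow A^T y$ and $y \leftarrow A x$.
Starting from $y^{(0)} = z$, the all-ones vector, the first loop gives $x^{(1)} \propto A^T z$, and each further loop gives $x^{(k)} \propto A^T A\, x^{(k-1)}$ and $y^{(k)} \propto A A^T y^{(k-1)}$.
After k loops, $x^{(k)}$ is the unit vector in the direction of $(A^T A)^{k-1} A^T z$, and $y^{(k)}$ is the unit vector in the direction of $(A A^T)^k z$.
The limits $x^*$ and $y^*$ exist by the following fact: “if $\nu$ is a vector not orthogonal to the principal eigenvector $\omega_1(M)$, the unit vector in the direction of $M^k \nu$ converges to $\omega_1(M)$ as k increases without bound”.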
26
Convergence Proof of Iterate Procedure(cont.)
A is called an orthogonal matrix if $A A^T = A^T A = E$, where E is the identity matrix.
Theorem 2: $x^*$ is the principal eigenvector of $A^T A$, and $y^*$ is the principal eigenvector of $A A^T$.
Experiments find that k = 20 is sufficient for the vectors to converge.
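The eigenvector claim and the k = 20 observation can be checked numerically on a small graph. The sketch below is an illustration under stated assumptions: the four-page adjacency matrix is hypothetical, and the helper names are not from the slides.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def authority_trace(adjacency, k=20):
    """Return the authority vector x after each of k loops, starting from y = all ones."""
    n = len(adjacency)
    y = [1.0] * n
    trace = []
    for _ in range(k):
        # x <- A^T y
        x = normalize([sum(adjacency[i][j] * y[i] for i in range(n)) for j in range(n)])
        # y <- A x
        y = normalize([sum(adjacency[i][j] * x[j] for j in range(n)) for i in range(n)])
        trace.append(x)
    return trace

def ata_times(adjacency, x):
    """Compute (A^T A) x without forming A^T A explicitly."""
    n = len(adjacency)
    ax = [sum(adjacency[i][j] * x[j] for j in range(n)) for i in range(n)]
    return [sum(adjacency[i][j] * ax[i] for i in range(n)) for j in range(n)]

# Hypothetical graph on pages (h1, h2, a1, a2): h1 -> a1, h1 -> a2, h2 -> a1.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
trace = authority_trace(A, k=20)
x_star = trace[-1]
last_change = max(abs(a - b) for a, b in zip(trace[-1], trace[-2]))
image = ata_times(A, x_star)
eigenvalue = sum(p * q for p, q in zip(image, x_star))
residual = max(abs(p - eigenvalue * q) for p, q in zip(image, x_star))
print("change in last loop:", last_change)
print("eigenvector residual:", residual)
```

A negligible residual means that $A^T A\, x^*$ is a multiple of $x^*$, which is what Theorem 2 asserts; how many loops a real base set needs depends on the gap between its two largest eigenvalues.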