Outline Search on WWW – Problem in general Overview of the authoritative approach proposed by this paper Constructing a focused Subgraph Computing Hubs and Authorities Similar page Queries Multiple Sets of Hubs and Authorities Diffusion and generalization Evaluation Conclusion
General Problem How can the quality of search on the WWW be improved? Judging search quality requires human evaluation because notions such as relevance are inherently subjective. The WWW is a hypertext corpus of enormous size and complexity. This paper aims to create a link-based model that consistently identifies relevant, authoritative WWW pages for broad search topics.
Understand Query Types There is more than one type of query and the handling of each may require different techniques. Type of queries: Specific queries E.g. “Does Netscape support the JDK 1.1 code-signing API?” Broad-topic queries E.g. “Find information about the Java programming language.” Similar page queries Example: Find pages ‘similar ’ to
Difficulty in Handling Queries Specific queries: Scarcity problem - Few pages contain the required information, and it is difficult to determine the identity of those pages. Broad-topic queries: Abundance problem - The number of pages that could reasonably be returned as relevant is far too large for a human user to digest. Goal: select a small set of the most “authoritative” or “definitive” pages from the huge collection of relevant pages.
Authoritative Pages Given a particular page, how do we tell whether it is authoritative? The problem is related to the limitations of text-based analysis. Text-based ranking functions: e.g. for the query “harvard”, the Harvard home page is the proper authoritative page, but many other web pages contain “harvard” more often. Most popular pages are not sufficiently self-descriptive: the term “search engine” usually does not appear on the home pages of search engines such as Yahoo, AltaVista, Excite, etc., and the Honda or Toyota home pages hardly contain the term “automobile manufacturer”.
Analysis of link structure Hyperlinks encode latent human judgment, which can be used to formulate a notion of authority. The creation of a link is a concrete indication of the following type of judgment: the creator of page p, by including a link to page q, has in some measure conferred authority on q. This gives the user an opportunity to find potential authorities purely through the pages that point to them. Potential pitfalls of this concept: Many links are created for navigational or commercial purposes (e.g., main menus, paid advertisements). It is difficult to balance relevance against popularity (e.g., Yahoo).
Authorities and Hubs Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic. Hubs are index pages that provide lots of useful links to relevant content pages (topic authorities). In-degree - Number of pointers to a page and is one simple measure of authority. Out-degree - Number of pointers from a page to other pages.
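The two degree measures above can be sketched in a few lines of Python. This is only an illustration: the edge list and page names below are made up, and the `degrees` helper is a hypothetical name, not something from the paper.

```python
from collections import Counter

def degrees(edges):
    """Count in-degree and out-degree for every page in a list of (p, q) links.

    Each pair (p, q) means "page p links to page q".
    """
    in_deg = Counter()
    out_deg = Counter()
    for p, q in edges:
        out_deg[p] += 1  # p has one more outgoing pointer
        in_deg[q] += 1   # q has one more incoming pointer
    return in_deg, out_deg

# Toy link structure (assumed for illustration only).
toy_edges = [("a", "c"), ("b", "c"), ("c", "d")]
in_deg, out_deg = degrees(toy_edges)

# Rank pages by in-degree, the simplest measure of authority.
by_in_degree = sorted(in_deg, key=in_deg.get, reverse=True)
```

Here page `c` has the highest in-degree, so it would be ranked first by this simple measure.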
Can we operate over the entire WWW? Local approaches - deal with intranets, where the amount of data is much smaller than in the WWW as a whole. Clustering approach - dissects a heterogeneous population into subpopulations that are in some way more cohesive, but the underlying problem of filtering a vast number of pages remains the same. Authoritative approach - global in nature: perform a search on a text-based WWW search engine, then distill the broad topic from these pages via the discovery of authorities.
Constructing Subgraph The collection V of hyperlinked pages can be viewed as a directed graph G = (V, E): nodes correspond to pages, and a directed edge (p, q) ∈ E indicates the presence of a link from p to q. Construct a focused subgraph S of the WWW with the following properties: (1) S is relatively small (so that computation is affordable). (2) S is rich in relevant pages (so that it is easier to find good authorities). (3) S contains most (or many) of the strongest authorities.
How to find S Set Q: the set of all pages containing the query string. Root set R: the t highest-ranked pages for the query, obtained from a text-based search engine. R satisfies properties 1 and 2. Problems with R: R is a subset of Q, and Q does not satisfy property 3. There are extremely few links between pages in R, rendering it essentially “structureless”. However, a strong authority for the query is quite likely to be pointed to by at least one page in R. Construct the base set S by extending the root set R to include: all pages linked to by pages in R; pages that link to a page in R, at most d of them for each page in R.
Subgraph algorithm
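The expansion from root set R to base set S described on the previous slide can be sketched as follows. This is a minimal sketch under stated assumptions: the link tables `out_links` and `in_links` are toy dictionaries standing in for whatever a search engine or crawler would return, and `build_base_set` is a hypothetical helper name. Where a page in R has more than d in-pointing pages, the sketch simply takes the first d in sorted order; any choice of d pages would do.

```python
def build_base_set(root, out_links, in_links, d):
    """Grow the root set R into the base set S.

    root:      iterable of pages in the root set R
    out_links: dict mapping a page to the pages it links to
    in_links:  dict mapping a page to the pages that link to it
    d:         maximum number of in-pointing pages added per page in R
    """
    base = set(root)
    for p in root:
        # Add every page that p points to.
        base.update(out_links.get(p, ()))
        # Add at most d pages that point to p (sorted only for determinism).
        pointing = sorted(in_links.get(p, ()))
        base.update(pointing[:d])
    return base

# Toy data (assumed for illustration only).
root_set = ["r1", "r2"]
out_links = {"r1": ["a"], "r2": ["a", "b"]}
in_links = {"r1": ["h1", "h2", "h3"], "r2": ["h1"]}

base_set = build_base_set(root_set, out_links, in_links, d=2)
```

With d = 2, page `h3` is left out because `r1` already contributed two in-pointing pages.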
Observation & Heuristics Heuristic 1: Delete all intrinsic links and keep all transverse links. Intrinsic links: links between pages with the same domain name. These generally serve navigation purposes, so they are less informative and often repetitive. Transverse links: links between pages with different domain names. Heuristic 2: Limit collusion. When a large number of pages from a single domain all point to a single page p (generally mass endorsement, advertisement, etc.), allow only a small number of them, about 4 to 8, to point to p.
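Both heuristics can be sketched as a single filtering pass over the links. The sketch below makes two assumptions that are not in the slides: the "domain name" of a page is taken to be the host part of its URL (via `urllib.parse.urlparse`), and when a domain exceeds the limit m, the first m links encountered are the ones kept. The URLs and the `filter_links` name are made up for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse

def domain(url):
    """Host part of a URL, used here as a stand-in for the domain name."""
    return urlparse(url).netloc.lower()

def filter_links(edges, m=4):
    """Drop intrinsic links and cap links per (source domain, target page) at m."""
    kept = []
    per_domain = defaultdict(int)  # (source domain, target page) -> links kept
    for p, q in edges:
        if domain(p) == domain(q):
            continue  # Heuristic 1: intrinsic link, delete it
        key = (domain(p), q)
        if per_domain[key] >= m:
            continue  # Heuristic 2: this domain already points to q m times
        per_domain[key] += 1
        kept.append((p, q))
    return kept

# Toy links (assumed for illustration only).
links = [
    ("http://a.com/1", "http://a.com/home"),  # intrinsic
    ("http://a.com/1", "http://b.org/x"),
    ("http://a.com/2", "http://b.org/x"),
    ("http://a.com/3", "http://b.org/x"),     # third link from a.com to the same page
    ("http://c.net/1", "http://b.org/x"),
]
kept_links = filter_links(links, m=2)
```

With m = 2, the intrinsic link and the third link from `a.com` are removed, while the link from `c.net` survives.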
Computing Hubs & Authorities Simplest approach would be to order pages by in-degree Problem: Nodes with highest in-degree in base set:- might not necessarily be authorities & lack any thematic unity. might simply be universally popular pages like yahoo, google, etc.
Computing Hubs & Authorities Observation: Good sources of content (authorities) Good sources of links (hubs) True authority pages are pointed to by a number of good hubs. Mutually reinforcing relationship: Hubs point to lots of authorities. Authorities are pointed to by lots of hubs. We will use an iterative algorithm to break this circularity. Terms: Good hub: page that points to many good authorities. Good authority: page pointed to by many good hubs.
Overview of Algorithm
Iterative Algorithm With each page p, we associate a non-negative authority weight x<p> and a non-negative hub weight y<p>. The weights of each type are normalized so that their squares sum to 1. Pages with larger x-values and y-values are viewed as “better” authorities and hubs, respectively.
Iterative Algorithm If p points to many pages with large x-values, then it should receive a large y-value. If p is pointed to by many pages with large y-values, then it should receive a large x-value. Inlinks, Operation I: x<p> ← Σ y<q>, summed over all q with (q, p) ∈ E. Outlinks, Operation O: y<p> ← Σ x<q>, summed over all q with (p, q) ∈ E.
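The I and O operations, followed by normalization, can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: the function names, the all-ones starting weights, and the toy graph are assumptions made for illustration.

```python
import math

def normalize(weights):
    """Scale weights so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0:
        return dict(weights)
    return {p: v / norm for p, v in weights.items()}

def iterate(pages, edges, k=20):
    """Apply the I and O operations k times, normalizing after each round."""
    x = {p: 1.0 for p in pages}  # authority weights
    y = {p: 1.0 for p in pages}  # hub weights
    for _ in range(k):
        # Operation I: x<p> becomes the sum of y<q> over pages q pointing to p.
        new_x = {p: 0.0 for p in pages}
        for q, p in edges:
            new_x[p] += y[q]
        # Operation O: y<p> becomes the sum of x<q> over pages q that p points to.
        new_y = {p: 0.0 for p in pages}
        for p, q in edges:
            new_y[p] += new_x[q]
        x, y = normalize(new_x), normalize(new_y)
    return x, y

# Toy graph (assumed for illustration only): h1 and h2 act as hubs.
toy_pages = ["h1", "h2", "a", "b"]
toy_links = [("h1", "a"), ("h2", "a"), ("h1", "b")]
auth, hub = iterate(toy_pages, toy_links)

best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this toy graph `a` ends up as the strongest authority because both hubs point to it, and `h1` as the strongest hub because it points to both authorities.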
Matrices Basics
Observations Let G = (V, E), with V = {p1, p2, …, pn}, and let A denote the adjacency matrix of the graph G: the (i, j)th entry of A is 1 if (pi, pj) is an edge of G, and 0 otherwise. As one applies Iterate with arbitrarily large k, the vectors x and y converge to fixed points x* and y*. x* is the principal eigenvector of A^T A, and y* is the principal eigenvector of A A^T. The convergence of Iterate is quite rapid (k = 20 is sufficient).
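The eigenvector claim can be checked numerically on a small graph. The sketch below builds A^T A from an adjacency matrix and runs power iteration in plain Python; the four-page graph, the helper names, and the choice of 50 iterations are assumptions made for illustration. It then verifies that the resulting vector satisfies (A^T A) x ≈ λ x.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def matvec(M, v):
    return [sum(m * c for m, c in zip(row, v)) for row in M]

def principal_eigenvector(M, k=50):
    """Power iteration: repeatedly multiply by M and rescale to unit length."""
    v = [1.0] * len(M)
    for _ in range(k):
        w = matvec(M, v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

# Toy graph (assumed): p1 -> p3, p2 -> p3, p1 -> p4.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
AtA = matmul(transpose(A), A)
x_star = principal_eigenvector(AtA)

# Rayleigh quotient gives the eigenvalue; the residual measures how close
# x_star is to satisfying (A^T A) x = lambda x.
Mx = matvec(AtA, x_star)
eigenvalue = sum(a * b for a, b in zip(Mx, x_star))
residual = max(abs(m - eigenvalue * c) for m, c in zip(Mx, x_star))
```

The authority weight concentrates on p3 and p4, the only pages with in-links, and p3 receives the larger weight.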
Observations Any eigenvector algorithm can be used to compute the fixed points x* and y*. The exposition in terms of Iterate emphasizes the underlying motivation of the approach: the mutually reinforcing I and O operations. It is not necessary to iterate I and O to convergence: one can start from initial vectors x0 and y0 and compute using a fixed, bounded number of I and O operations.
Example: Mini Web
Example: Mini Web (Cont..)
Basic Results
Observations This is a “pure” analysis of link structure: the text is ignored when searching for authoritative pages, i.e., text-based search only supplies the initial set. The resulting pages can legitimately be considered authoritative in the context of the WWW without access to a large-scale index of the WWW, i.e., global analysis of the full WWW link structure can be replaced by a local method over a small focused subgraph. This approach can replace the local approaches used in intranets.
Similar page queries Example: Find pages ‘similar ’ to Using link structure to infer a notion of “similarity” among pages We have found a page p that is of interest and it’s an authoritative page on a topic. Can this help in finding similar pages? What do users of the WWW consider to be related to p when they create pages and links ?
Similar page queries Previously our request to the search engine was: “Find t pages containing the query string.” Now our request to the search engine is: “Find t pages pointing to p.” Rp: root set. Sp: base set. Gp: focused subgraph. The strongest authorities in the local region of the link structure near p are a potential broad-topic summary of the pages related to p.
Results- Similar page queries
Multiple Sets of Hubs & Authorities There may be several densely linked collections of hubs and authorities within the same set. Examples: “jaguar” – has several different meanings. “randomized algorithms” – arises in multiple technical communities. “abortion” – involves groups that may not be linked to each other. Clustering in the presence of the Abundance problem is needed.
Multiple Sets of Hubs & Authorities The non-principal eigenvectors provide a way to extract additional densely linked collections of hubs and authorities. Non-principal eigenvectors have both positive and negative entries. Often, the highly positive entries correspond to one cluster of pages and the highly negative entries to a different cluster. Typically the two clusters are not tightly intertwined.
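The sign pattern of a non-principal eigenvector can be illustrated on a toy graph with two loosely connected communities. The sketch below is an illustration under stated assumptions, not the paper's procedure: it uses power iteration with deflation (subtracting the principal component from A^T A) to reach the second eigenvector, and the graph, the helper names, and the non-uniform starting vector are all made up for this example.

```python
import math

def authority_matrix(pages, edges):
    """Build A^T A for the given pages and (p, q) links."""
    idx = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    A = [[0.0] * n for _ in range(n)]
    for p, q in edges:
        A[idx[p]][idx[q]] = 1.0
    return [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def power_iteration(M, k=200):
    """Return (eigenvalue, unit eigenvector) for the dominant eigenpair of M."""
    n = len(M)
    # Non-uniform start so the vector is not orthogonal to the target by symmetry.
    v = [float(i + 1) for i in range(n)]
    for _ in range(k):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    lam = sum(v[i] * sum(M[i][j] * v[j] for j in range(n)) for i in range(n))
    return lam, v

def deflate(M, lam, v):
    """Remove the eigenpair (lam, v) from the symmetric matrix M."""
    n = len(M)
    return [[M[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]

# Toy graph (assumed): two communities joined only by the bridging hub h5.
pages = ["h1", "h2", "h3", "h4", "h5", "a1", "a2", "b1", "b2"]
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"),
         ("h3", "b1"), ("h3", "b2"), ("h4", "b1"), ("h4", "b2"),
         ("h5", "a1"), ("h5", "a2"), ("h5", "b1"), ("h5", "b2")]

M = authority_matrix(pages, edges)
lam1, v1 = power_iteration(M)                      # principal eigenvector
lam2, v2 = power_iteration(deflate(M, lam1, v1))   # first non-principal eigenvector
second = dict(zip(pages, v2))
```

In this toy graph the principal eigenvector weights all four authorities equally, while in the second eigenvector `a1` and `a2` share one sign and `b1` and `b2` the other, separating the two communities. Which community lands on the positive end is arbitrary.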
Jaguar Example Authority principal eigenvector is primarily about the Atari product. In the positive end of the 2nd non-principal eigenvector, the pages are primarily about the Jacksonville Jaguars. In the positive end of the 3rd non-principal eigenvector, the pages are primarily about the car.
Randomized Algorithms Example The positive end of the first non-principal eigenvector returns home pages of theoretical computer scientists. The negative end of the first non-principal eigenvector returns compendia of mathematical software. In the negative end of the fourth non-principal eigenvector, the pages are primarily about wavelets.
Diffusion and Generalization The query may not be sufficiently “broad.” In this case there will not be enough highly relevant pages in the base set to extract a sufficiently dense subgraph of relevant hubs and authorities. When this occurs, the collection will often represent a broader topic, and the results will reflect a diffused version of the initial query. Example: “WWW conferences” -> WWW resource pages.
In studies conducted in 1998 over 26 queries and 37 volunteers, Clever reported better authorities than Yahoo!, which in turn was better than Alta Vista.
Conclusion There is a need for a way to distill a broad topic for which there may be millions of relevant pages. The approach provides high-quality results in the context of what is available on the WWW globally. It operates without maintaining an index of the WWW or its link structure. It identifies complex patterns of social organization on the WWW.
