Authoritative Sources in a Hyperlinked Environment. Jon M. Kleinberg, ACM-SIAM Symposium, 1998. Krishna Venkateswaran
Basic Idea: The root set R is grown into a base set S so that S contains many authoritative pages. Any page that is pointed to by a page in R is added to S. R: root set, contains t results. S: base set generated by the algorithm. S is used to determine the hubs and authorities.
Get a set of results for a query string from a text-based search engine. Take the top t results and put them in a set R. Start S as a copy of R. For every page p in R: ◦ Add all the pages that p points to into S. ◦ Add a maximum of d pages that point to p into S. Result returned: base set S, out of which we compute the top authorities and hubs.
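The expansion from R to S can be sketched in a few lines. This is a minimal illustration, not the paper's code: `out_links` and `in_links` are hypothetical lookup tables standing in for a real link index, and picking the first d in-linking pages in sorted order is an assumption, since the slides do not say which d pages to choose.

```python
def build_base_set(root, out_links, in_links, d):
    """Grow the root set R into the base set S.

    root:      iterable of pages (the top t search results)
    out_links: dict, page -> pages it points to (hypothetical index)
    in_links:  dict, page -> pages that point to it (hypothetical index)
    d:         maximum number of in-linking pages added per root page
    """
    base = set(root)
    for page in root:
        # Add every page that this root page points to.
        base.update(out_links.get(page, ()))
        # Add at most d pages that point to this root page.
        # Assumption: take the first d in sorted order, for determinism.
        pointing = sorted(in_links.get(page, ()))
        base.update(pointing[:d])
    return base
```

For example, with `root = ["a", "b"]`, where "a" has three in-linking pages and `d = 2`, only two of those three pages end up in S.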
Heuristics: Used to refine the set S and the links among its pages. Heuristic 1: Avoiding navigational links. ◦ Transverse links: links between pages with different domain names. ◦ Intrinsic links (navigational links): links between pages within the same domain. ◦ Delete all intrinsic links. Heuristic 2: Avoiding mass endorsements. ◦ Mass endorsement: a large number of pages in one domain pointing to a single page. ◦ Example: “This site is designed by …” and a link. ◦ Eliminate this by setting a parameter m and allowing only m pages from a single domain to point to any given page.
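Both heuristics amount to filtering a list of links. The sketch below assumes links are given as `(source_url, target_url)` pairs and treats the URL host name as the "domain name"; the function names are hypothetical, and keeping the first m links encountered per domain is an arbitrary choice made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain(url):
    # Assumption: the host part of the URL stands in for the domain name.
    return urlparse(url).netloc.lower()


def filter_links(links, m):
    """Drop intrinsic links and cap endorsements per (domain, target) at m."""
    kept = []
    endorsements = defaultdict(int)  # (source domain, target page) -> links kept
    for src, dst in links:
        if domain(src) == domain(dst):
            continue  # Heuristic 1: intrinsic (navigational) link, delete it
        key = (domain(src), dst)
        if endorsements[key] >= m:
            continue  # Heuristic 2: this domain already endorses dst m times
        endorsements[key] += 1
        kept.append((src, dst))
    return kept
```

Only transverse links survive, and no single domain contributes more than m links to the same target page.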
Extracting authorities from the overall collection of pages through an analysis of the link structure of G, the link graph on the pages of S. A good hub points to many good authorities, and a good authority is pointed to by many good hubs. [Figure: hubs, authorities, and an unrelated page of large in-degree.]
Basic Idea: Each page p has a nonnegative authority weight and a nonnegative hub weight. If p points to pages with large authority weights, then p has a large hub weight. If p is pointed to by pages with large hub weights, then p has a large authority weight. Pages with higher weights are better authorities and hubs.
I operation: ◦ Authority weight of a page = sum of the hub weights of all pages pointing to it. O operation: ◦ Hub weight of a page = sum of the authority weights of all pages it points to. I and O reinforce each other. Normalization: the authority weights and the hub weights are each divided by a constant so that the sum of their squares equals 1.
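The two operations and the normalization can be written compactly. In the notation below, x^{(p)} is the authority weight of page p, y^{(p)} is its hub weight, and E is the set of links of the graph, with (q, p) meaning that q points to p. The symbol E and the superscript notation are assumptions chosen for this sketch.

```latex
% I operation: authority weight from in-linking hubs
x^{(p)} \leftarrow \sum_{q \,:\, (q,p) \in E} y^{(q)}

% O operation: hub weight from out-linked authorities
y^{(p)} \leftarrow \sum_{q \,:\, (p,q) \in E} x^{(q)}

% Normalization: each weight vector has unit sum of squares
\sum_{p} \bigl(x^{(p)}\bigr)^{2} = 1, \qquad \sum_{p} \bigl(y^{(p)}\bigr)^{2} = 1
```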
Contd... Operation I: x[p] = sum of y[q] over all pages q that point to page p. Operation O: y[p] = sum of x[q] over all pages q that page p points to. Here x is the authority weight and y is the hub weight. [Figure: pages q1, q2, q3 linked with page p under each operation.] Deciding when to stop the reinforcing process: 1) Apply the I and O operations alternately until a fixed point is reached. 2) Choose a parameter k and iterate the process only k times.
Given the set of pages in the form of a graph, set an integer value for the parameter k, the number of iterations. Repeat the following k times: ◦ Apply the I operation to every page, updating its authority weight. ◦ Apply the O operation to every page, updating its hub weight. ◦ Normalize both the authority weights and the hub weights. Return the graph with the new authority weight and hub weight for each page.
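The steps above can be sketched as a short pure-Python routine. It is an illustration under stated assumptions rather than a reference implementation: the graph is a hypothetical dict mapping each page to the pages it points to, all weights start at 1 (the slides do not state the initial values), and the O operation uses the authority weights just computed by the I operation.

```python
import math


def normalize(weights):
    """Scale weights so that the sum of their squares is 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)  # nothing to scale in an edgeless graph
    return {page: w / norm for page, w in weights.items()}


def iterate(out_links, k):
    """Run k rounds of the I and O operations.

    out_links: dict, page -> iterable of pages it points to (assumed format)
    Returns (authority weights x, hub weights y) as dicts.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    x = {p: 1.0 for p in pages}  # authority weights (assumed start value)
    y = {p: 1.0 for p in pages}  # hub weights (assumed start value)
    for _ in range(k):
        # I operation: authority of p = sum of hub weights of pages pointing to p.
        new_x = {p: 0.0 for p in pages}
        for q, targets in out_links.items():
            for p in targets:
                new_x[p] += y[q]
        # O operation: hub weight of p = sum of authority weights p points to.
        new_y = {p: 0.0 for p in pages}
        for p, targets in out_links.items():
            for q in targets:
                new_y[p] += new_x[q]
        x, y = normalize(new_x), normalize(new_y)
    return x, y
```

On a toy graph such as `{"h1": ["a1", "a2"], "h2": ["a1"]}`, page a1 ends up with the largest authority weight and h1 with the largest hub weight, since h1 points to both authorities.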
Observations: The top authorities and hubs are the pages with the c largest values of x and y, respectively, in the graph returned by the Iterate algorithm. The Iterate procedure converges to fixed points x* and y* as k increases arbitrarily. ◦ Proved using principal eigenvectors: x* is the principal eigenvector of AᵀA and y* of AAᵀ, where A is the adjacency matrix of G. The Iterate algorithm yields a densely linked collection of pages that is rich in relevant pages. ◦ The most relevant collection of pages is the most densely linked one.
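Selecting the top c pages from a weight vector is a small step on its own. The sketch below assumes the weights are held in a dict from page to value; the function name `top_c` is hypothetical, and the same call works for either the authority weights x or the hub weights y.

```python
import heapq


def top_c(weights, c):
    """Return the c pages with the largest weights, best first."""
    return heapq.nlargest(c, weights, key=weights.get)
```

Ties are resolved by the order in which pages appear in the dict, which is an artifact of this sketch rather than anything the slides specify.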
Results. (java) Authorities: Gamelan; JavaSoft Home Page; The Java Developer: HowDoI; The Java Book. ("search engines") Authorities: Yahoo!; Excite; Lycos Home Page; AltaVista: Main Page. (Gates) Authorities: Bill Gates: The Road Ahead; Welcome to Microsoft. It was observed that the was the only site that was present in R initially. This supports the algorithm because many of the pages don’t contain the search query in them.