- Murtuza Shareef Authoritative Sources in a Hyperlinked Environment More specifically “Link Analysis” using HITS Algorithm.


1 - Murtuza Shareef Authoritative Sources in a Hyperlinked Environment More specifically “Link Analysis” using HITS Algorithm

2 Plan Broad picture of the talk; Foundations (terminology); The problem to be solved; Motivation behind HITS (why link analysis); Construction of the subgraph; Some basics of matrices; Design of the algorithm (the meat of this paper); Applications of HITS; Conclusion

3 Broad Picture of the talk The goal of a search engine is to provide quality search results, that is, relevance. One way to achieve this goal is to use the link structure of the web. The algorithm ranks pages based on the relationship between hubs and authorities. What are hubs and authorities? Covered later.

4 You need relevance – Start filtering [Flow diagram; labels: User Query; Text based search engine; Pages Containing Query String; Heuristics; Highest Ranked Pages – Root Set (set of ‘t’ pages); Base Set; Sub graph; HITS Algorithm; Filter; Hubs; Authorities]

5 Terminology Authority: a valuable and informative webpage, usually pointed to by a large number of hyperlinks. Hub: a webpage that points to many authority pages; it is itself a resource. Authorities and hubs reinforce one another: a good authority is pointed to by many good hubs, and a good hub points to many good authorities.

6 Problem to be solved Relevant terms may not appear on the pages of authoritative websites. Many prominent pages are not self-descriptive. Car manufacturers may not use the term “automobile manufacturers” on their home pages. The term “search engine” is not used by any of the natural authorities such as Yahoo, Google, or AltaVista.

7 Link based Analysis Limitations of text-based analysis: Text-based ranking functions: e.g., could www.harvard.edu be recognized as one of the most authoritative pages for “harvard” when many other webpages contain the term more often? Pages are not sufficiently self-descriptive: usually the term “search engine” does not appear on search engine web pages.

8 Motivation behind HITS The creator of page p, by including a link to page q, has in some measure conferred authority on q. Links afford us the opportunity to find potential authorities purely through the pages that point to them. What is the problem here? Some links are just navigational (“Click here to return to the main menu”). Some links are advertisements. It is difficult to find a balance between relevance and popularity. Solution: rely on the relationship between the authorities for a topic and the pages that link to many related authorities, the hubs.

9 HITS Algorithm developed by Kleinberg in 1998. Attempts to computationally determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web. Based on mutually recursive facts: Hubs point to lots of authorities. Authorities are pointed to by lots of hubs.

10 HITS Algorithm Computes hubs and authorities for a particular topic specified by a normal query. First determines a set of relevant pages for the query called the base set S. Analyze the link structure of the web subgraph defined by S to find authority and hub pages in this set.

11 Construction of focused subgraph We start from a set of pages produced by a text-based search engine. Why do we need a subset? The full set may contain too many pages and entail a considerable computational cost, and most of the best authorities may not belong to it. Subset properties: relatively small; rich in relevant pages; contains most (or many) of the strongest authorities.

12 Subset Construction
Subgraph(σ, E, t, d)
  σ: a query string.
  E: a text-based search engine.
  t, d: natural numbers.
  Let R_σ denote the top t results of E on σ.
  Set S_σ := R_σ
  For each page p ∈ R_σ
    Let Γ+(p) denote the set of all pages p points to.
    Let Γ−(p) denote the set of all pages pointing to p.
    Add all pages in Γ+(p) to S_σ.
    If |Γ−(p)| ≤ d then
      Add all pages in Γ−(p) to S_σ.
    Else
      Add an arbitrary set of d pages from Γ−(p) to S_σ.
  End
  Return S_σ
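The Subgraph procedure above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the root set is assumed to be already available (the search engine call is left out), and the names `build_base_set`, `out_links`, and `in_links` as well as the toy page names are hypothetical.

```python
import random

def build_base_set(root_set, out_links, in_links, d, seed=0):
    """Expand a root set R into a base set S.

    out_links[p]: pages that p points to (the set written as Gamma+(p)).
    in_links[p]:  pages that point to p (the set written as Gamma-(p)).
    d: maximum number of in-linking pages added per root page.
    """
    rng = random.Random(seed)  # fixed seed so the "arbitrary" choice is repeatable
    base = set(root_set)
    for p in root_set:
        # Every page that p points to joins the base set.
        base.update(out_links.get(p, ()))
        incoming = sorted(in_links.get(p, ()))
        if len(incoming) <= d:
            base.update(incoming)
        else:
            # Too many in-links: keep an arbitrary subset of size d.
            base.update(rng.sample(incoming, d))
    return base

# Toy link data (hypothetical page names).
out_links = {"r1": ["a", "b"], "r2": ["b"]}
in_links = {"r1": ["h1", "h2", "h3"], "r2": ["h1"]}
base = build_base_set(["r1", "r2"], out_links, in_links, d=2)
```

Here `r1` has three in-linking pages but d = 2, so only two of them are added, while the single page pointing to `r2` is always kept.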

13 For a specific query Q, let the set of documents returned by a standard search engine be called the root set R. Initialize S to R. Add to S all pages pointed to by any page in R. Add to S all pages that point to any page in R.

14 Subgraph reduction Offset the effect of links that serve a purely navigational function. Remove all intrinsic edges from the graph, keeping only the edges corresponding to transverse links. In addition, allow at most m pages from a single domain to point to any given page (m ≈ 4-8), which limits mass endorsements such as advertisements.

15 Handling “spam” links Should all links be equally treated? Two considerations: Some links may be more meaningful/important than other links. Web site creators may trick the system to make their pages more authoritative by adding dummy pages pointing to their cover pages (spamming).

16 Handling “spam” links (contd) Transverse link: links between pages with different domain names. Domain name: the first level of the URL of a page. Intrinsic link: links between pages with the same domain name. Transverse links are more important than intrinsic links. Two ways to incorporate this: 1. Use only transverse links and discard intrinsic links. 2. Give lower weights to intrinsic links.

17 Handling “spam” links (contd) How to give lower weights to intrinsic links? In adjacency matrix A, entry (p, q) should be assigned as follows: If p has a transverse link to q, the entry is 1. If p has an intrinsic link to q, the entry is c, where 0 < c < 1. If p has no link to q, the entry is 0.
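A minimal sketch of this weighting scheme, assuming that "domain name" means the host part of the URL (so `www.x.com` and `x.com` would count as different domains). The helper names `link_weight` and `weighted_adjacency`, the value c = 0.5, and the example URLs are illustrative assumptions, not part of the original slides.

```python
from urllib.parse import urlparse

def link_weight(src_url, dst_url, c=0.5):
    """Return 1.0 for a transverse link and c (0 < c < 1) for an intrinsic link."""
    same_domain = urlparse(src_url).hostname == urlparse(dst_url).hostname
    return c if same_domain else 1.0

def weighted_adjacency(pages, links, c=0.5):
    """Build matrix A where A[i][j] is the weight of the link pages[i] -> pages[j]."""
    index = {url: i for i, url in enumerate(pages)}
    A = [[0.0] * len(pages) for _ in pages]  # no link: entry stays 0
    for src, dst in links:
        A[index[src]][index[dst]] = link_weight(src, dst, c)
    return A

# Hypothetical pages: two on a.example, one on b.example.
pages = [
    "http://a.example/index.html",
    "http://a.example/about.html",
    "http://b.example/",
]
links = [(pages[0], pages[1]), (pages[0], pages[2]), (pages[2], pages[0])]
A = weighted_adjacency(pages, links, c=0.5)
```

Setting c close to 0 approaches the first option (discarding intrinsic links entirely), while c close to 1 treats all links almost equally.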

18 Basics of matrices The adjacency matrix of a directed graph $G$ is the matrix $A$ such that $A_{ij} = 1$ if $(i, j) \in E(G)$ and $A_{ij} = 0$ if $(i, j) \notin E(G)$. An eigenvalue of $A$ is a scalar $\lambda$ with the property that there exists a non-zero vector $x$ such that $Ax = \lambda x$. The vector $x$ is called an eigenvector of $A$. The normalized eigenvector corresponding to the largest eigenvalue is called the principal eigenvector. If $M$ is a symmetric $n \times n$ matrix and $v$ is a vector not orthogonal to the principal eigenvector $\omega_1(M)$, then the unit vector in the direction of $M^k v$ converges to $\omega_1(M)$ as $k$ increases.

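19 Iterative Algorithm Each page p is assigned two non-negative weights, an authority weight $x^{\langle p \rangle}$ and a hub weight $y^{\langle p \rangle}$. The weights are updated with two operations. Authority weight, I operation: $x^{\langle p \rangle} \leftarrow \sum_{q : (q, p) \in E} y^{\langle q \rangle}$. Hub weight, O operation: $y^{\langle p \rangle} \leftarrow \sum_{q : (p, q) \in E} x^{\langle q \rangle}$. The I operation adds the hub weights of the pages pointing to p into the authority weight of p, and the O operation adds the authority weights of the pages p points to into the hub weight of p. Alternating these two operations, with normalization, eventually results in an equilibrium weight for each page.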

20 Iterative Algorithm The algorithm states: For each iteration, apply the I and O operations and normalize the authority and hub scores.
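The iteration can be sketched in plain Python as below. This is an illustrative sketch rather than the paper's exact pseudocode: starting from all-ones weights, normalizing to unit Euclidean length, and the default of 20 rounds are assumptions made for the example, and the function names are hypothetical.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean length (leave an all-zero vector unchanged)."""
    norm = math.sqrt(sum(w * w for w in v))
    return [w / norm for w in v] if norm > 0 else list(v)

def hits(A, k=20):
    """Run k rounds of the I and O operations on adjacency matrix A.

    A[p][q] is nonzero when page p links to page q.
    Returns (authority weights x, hub weights y).
    """
    n = len(A)
    x = [1.0] * n
    y = [1.0] * n
    for _ in range(k):
        # I operation: authority of p = sum of hub weights of pages pointing to p.
        x = [sum(A[q][p] * y[q] for q in range(n)) for p in range(n)]
        # O operation: hub of p = sum of the new authority weights of pages p points to.
        y = [sum(A[p][q] * x[q] for q in range(n)) for p in range(n)]
        x, y = normalize(x), normalize(y)
    return x, y

# Toy graph: pages 0 and 1 act as hubs; both link to page 2, and page 0 also links to page 3.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
authority, hub = hits(A)
```

In this toy graph page 2 ends up with the largest authority weight because both hubs point to it, and page 0 ends up as the stronger hub because it points to both authorities.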

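21 The top c authorities and top c hubs may be found with a simple filtering procedure: run the iteration, then report the c pages with the largest authority weights as authorities and the c pages with the largest hub weights as hubs.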
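Selecting the top c pages from the final weights can be sketched as follows. The weight values below are made up for illustration, and `top_c` is a hypothetical helper name.

```python
def top_c(weights, c):
    """Return the indices of the c largest weights, best first."""
    ranked = sorted(range(len(weights)), key=lambda p: weights[p], reverse=True)
    return ranked[:c]

# Hypothetical converged weights for five pages.
authority = [0.05, 0.70, 0.10, 0.65, 0.00]
hub = [0.80, 0.00, 0.55, 0.05, 0.20]

top_authorities = top_c(authority, 2)  # pages 1 and 3
top_hubs = top_c(hub, 2)               # pages 0 and 2
```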

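22 Convergence The iterative algorithm converges as the number of iterations k increases; that is, the weight vectors converge. Let $G = (V, E)$ with $V = \{p_1, p_2, \ldots, p_n\}$, and let $A$ be the adjacency matrix of $G$. Let $x_i$ be the vector of authority scores after $i$ iterations and $y_i$ the vector of hub scores after $i$ iterations. The I and O operations can then be written in matrix form. Operation I: $x \leftarrow A^T y$. Operation O: $y \leftarrow A x$. Assuming the weights start from the all-ones vector $z$, it follows that, up to normalization, $x_i \propto (A^T A)^{i-1} A^T z$ and $y_i \propto (A A^T)^{i} z$.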

23 From the basics of matrices, the vectors $x_i$ and $y_i$ converge to $x^*$ and $y^*$ respectively, where $x^*$ and $y^*$ are the principal eigenvectors of $A^T A$ and $A A^T$. Kleinberg reports that about 20 iterations are sufficient in practice for the top-ranked weights to stabilize. The principal eigenvector represents the densest cluster in the focused subgraph. The non-principal eigenvectors represent less dense areas in the subgraph.
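The eigenvector claim can be checked numerically on a small example. The sketch below, which assumes a toy four-page graph and uses hypothetical helper names, computes the principal eigenvectors of $A^T A$ and $A A^T$ by power iteration and compares them with the result of alternating the I and O operations in matrix form.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in v))
    return [w / norm for w in v]

def matvec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * w for m, w in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(M, N):
    """Matrix-matrix product M N."""
    Nt = transpose(N)
    return [[sum(a * b for a, b in zip(row, col)) for col in Nt] for row in M]

def power_iteration(M, k=50):
    """Approximate the principal eigenvector of M by repeated multiplication."""
    v = normalize([1.0] * len(M))
    for _ in range(k):
        v = normalize(matvec(M, v))
    return v

# Toy graph: pages 0 and 1 link to page 2, and page 0 also links to page 3.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
At = transpose(A)

# Principal eigenvectors of A^T A (authorities) and A A^T (hubs).
x_star = power_iteration(matmul(At, A))
y_star = power_iteration(matmul(A, At))

# The same vectors via alternating I (x = A^T y) and O (y = A x) steps.
x = y = [1.0] * len(A)
for _ in range(50):
    x = normalize(matvec(At, y))
    y = normalize(matvec(A, x))

max_gap = max(abs(a - b) for a, b in zip(x + y, x_star + y_star))
```

For this graph `max_gap` is numerically negligible, which is consistent with the authority and hub weights converging to the principal eigenvectors of $A^T A$ and $A A^T$.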

24 Application - Finding Similar Pages Using Link Structure Given a page P, let R (the root set) be t (e.g. 200) pages that point to P. Grow a base set S from R. Run HITS on S. Return the best authorities in S as the best similar pages for P. This finds authorities in the “link neighborhood” of P as its similar pages.

25 Application - HITS for Clustering An ambiguous query can result in the principal eigenvector covering only one of the possible meanings. Non-principal eigenvectors may contain hubs and authorities for other meanings. Example, “jaguar”: Atari video game (principal eigenvector); NFL football team (2nd non-principal eigenvector); automobile (3rd non-principal eigenvector). This is clustering!

26 Multiple sets of Hubs and Authorities Why? The query string may have several very different meanings, e.g. “java”. The string may arise as a term in the context of multiple technical communities, e.g. “randomized algorithms”. The string may refer to a highly polarized issue, involving groups that are not likely to link to one another, e.g. “abortion”. Idea: the non-principal eigenvectors of $A^T A$ and $A A^T$ provide us with a natural way to extract additional densely linked collections of hubs and authorities from the base set S.

27 Multiple sets of Hubs and Authorities Experimental result 1 For the query “jaguar”, the strongest collections of authoritative sources concerned the Atari Jaguar product, the NFL football team from Jacksonville, and the automobile.

28 Multiple sets of Hubs and Authorities Experimental result 2 For the query “randomized algorithms”, none of the strongest collections of hubs and authorities are precisely on the query topic. They include home pages of theoretical computer scientists, compendia of mathematical software and pages on wavelets.

29 Conclusion A technique for locating high-quality information related to a broad search topic on the WWW, based on a structural analysis of the link topology surrounding “authoritative” pages on the topic. Related work: standing and influence in social networks; scientific citations; hypertext and WWW rankings.

