HITS Hypertext Induced Topic Selection


1 HITS Hypertext Induced Topic Selection
Gyozo Gidofalvi, Uppsala Database Laboratory, 2/4/2019

2 Idea
Given a set of web pages that are all concerned with the same topic, we want to find, by examining the internal link structure in the set, the most interesting pages and the pages that are most likely to guide us to interesting pages.

3 Foundation
Identify hubs and authorities. The definition is mutually recursive: a good hub points to good authorities, and a good authority is pointed to by good hubs. The hub value of a site is the sum of the authority values of the sites that it points to. The authority value of a site is the sum of the hub values of the sites that point to it.
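The two sums can be written compactly as follows. The symbols are notation assumed here, not taken from the slides: h(p) and a(p) stand for the hub and authority values of site p, and p → q means that p links to q.

```latex
h(p) = \sum_{q \,:\, p \to q} a(q),
\qquad
a(p) = \sum_{q \,:\, q \to p} h(q)
```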

4 Pseudo-code
1. Find a set of pages about a given subject. You may use an existing search engine (such as Google). In the assignment, you are provided a bunch of pages with links.
2. Preprocess the link structure.
3. Initialize the hub and authority vectors.
4. Normalize the vectors to length 1.
5. Calculate the new authority vector based on the link structure and the hub vector.
6. Calculate the new hub vector based on the link structure and the authority vector.
7. If the new values of the hub and authority vectors are similar enough to the old ones, we are done; otherwise repeat from 4.
8. Sort the vectors and find the top authorities and hubs.
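The iteration described in the pseudo-code could be sketched in Python roughly as below. This is a minimal illustration, not the assignment's reference solution. It makes several assumptions: the link structure is a dict from page to the set of pages it links to, both vectors start at all ones, the hub update uses the freshly computed authority vector, and "similar enough" means the largest per-page change is below a hypothetical tolerance `tol`. The four-page structure `demo_links` is invented for the demonstration.

```python
import math

def normalize(vec):
    """Scale a dict of scores to Euclidean length 1 (unchanged if all zeros)."""
    length = math.sqrt(sum(v * v for v in vec.values()))
    if length == 0:
        return dict(vec)
    return {page: v / length for page, v in vec.items()}

def hits(links, tol=1e-9, max_iter=1000):
    """Return (hub, auth) dicts for a link structure {page: set of targets}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    # Assumed initialization: every page starts with the same value.
    hub = normalize({p: 1.0 for p in pages})
    auth = normalize({p: 1.0 for p in pages})
    for _ in range(max_iter):
        # Authority of a page: sum of the hub values of the pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        new_auth = normalize(new_auth)
        # Hub of a page: sum of the authority values of the pages it links to.
        new_hub = normalize(
            {p: sum(new_auth[dst] for dst in links.get(p, ())) for p in pages}
        )
        # Stop when no value moved by more than tol (one possible criterion).
        change = max(
            max(abs(new_auth[p] - auth[p]), abs(new_hub[p] - hub[p]))
            for p in pages
        )
        hub, auth = new_hub, new_auth
        if change < tol:
            break
    return hub, auth

# Hypothetical link structure, invented for this sketch.
demo_links = {"p1": {"p3", "p4"}, "p2": {"p3"}, "p3": set(), "p4": set()}
hub, auth = hits(demo_links)
ranked_auth = sorted(auth, key=auth.get, reverse=True)
ranked_hubs = sorted(hub, key=hub.get, reverse=True)
print("top authority:", ranked_auth[0])
print("top hub:", ranked_hubs[0])
```

In this invented structure, p3 is linked to by both other linking pages and p1 links to both targets, so they come out as the top authority and top hub respectively.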

5 Calculating the hub and authority vectors
First, we initialize the hub and authority vectors to some value. What initial values are appropriate? Does it matter what we initialize to? Next, we calculate the new hub and authority vectors using the hub and authority formulas. Does it matter in which order these calculations happen? Do we need to normalize the vectors in each iteration? How do we know when to stop?
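One way to explore the normalization question is a small experiment. The sketch below runs five rounds on an invented three-page structure (page "h" links to "x" and "y"), once without rescaling and once with rescaling to length 1 after every round. The structure, the all-ones start, and the helper names `step` and `length` are assumptions made for illustration only.

```python
import math

# Hypothetical structure: "h" links to "x" and "y"; nothing else links anywhere.
links = {"h": ["x", "y"], "x": [], "y": []}
pages = list(links)

def step(hub):
    """One round: authorities from the hubs, then hubs from the new authorities."""
    auth = {p: 0.0 for p in pages}
    for src, targets in links.items():
        for dst in targets:
            auth[dst] += hub[src]
    new_hub = {p: sum(auth[d] for d in links[p]) for p in pages}
    return new_hub, auth

def length(vec):
    """Euclidean length of a score vector stored as a dict."""
    return math.sqrt(sum(v * v for v in vec.values()))

# Run 1: never rescale the hub vector.
raw = {p: 1.0 for p in pages}
raw_lengths = []
for _ in range(5):
    raw, _auth = step(raw)
    raw_lengths.append(length(raw))

# Run 2: rescale the hub vector to length 1 after every round.
scaled = {p: 1.0 for p in pages}
scaled_lengths = []
for _ in range(5):
    scaled, _auth = step(scaled)
    n = length(scaled)
    scaled = {p: v / n for p, v in scaled.items()}
    scaled_lengths.append(length(scaled))

print("without normalization:", raw_lengths)
print("with normalization:   ", scaled_lengths)
```

In this structure the unnormalized hub vector doubles in length every round, while the rescaled one stays at length 1. Rescaling multiplies every entry by the same positive factor, so the ordering of the pages is the same in both runs.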

6 Preprocessing Preprocessing will improve the accuracy o
Several links may point to the same page. Remove site-internal links, as these can make a site seem more important than it really is. Remove links to sites for which we do not know the link structure.
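A possible shape for these preprocessing steps is sketched below. The input format is an assumption: raw links as (source URL, target URL) pairs plus a set of sites whose link structure is known. The actual assignment files may look different. Treating the URL host as the "site" and collapsing repeated links into a single one are also assumptions of this sketch, and `site_of` and `preprocess` are hypothetical helper names.

```python
from urllib.parse import urlparse

def site_of(url):
    """Host part of a URL, used here as a stand-in for 'site' (an assumption)."""
    return urlparse(url).netloc.lower()

def preprocess(raw_links, known_sites):
    """Build {site: set of linked sites} from (source_url, target_url) pairs."""
    graph = {site: set() for site in known_sites}
    for src_url, dst_url in raw_links:
        src, dst = site_of(src_url), site_of(dst_url)
        if src == dst:
            continue  # site-internal link
        if src not in known_sites or dst not in known_sites:
            continue  # link structure of one endpoint is unknown
        graph[src].add(dst)  # a set keeps one copy of repeated links
    return graph

# Invented example data.
raw = [
    ("http://a.example/1", "http://b.example/"),
    ("http://a.example/2", "http://b.example/news"),   # repeated site-to-site link
    ("http://a.example/1", "http://a.example/2"),      # site-internal
    ("http://b.example/", "http://unknown.example/"),  # unknown target site
]
graph = preprocess(raw, {"a.example", "b.example"})
print(graph)
```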

7 The assignment
You will mine four different link structures for four different queries. We have done the web crawling and some of the preprocessing for you! Input files are on the lab course web page. However, you must:
- Do some preprocessing yourselves. Directions for preprocessing are on the lab course web page.
- Validate your implementation. Think of how to verify your solution. Your validation does not have to be fancy, or even automated. At the least, implement the test case on the following slide and see what output it gives you. Make sure that the test case output is reasonable.

8 Example (test case)
Rank the pages according to hub and authority value in this link structure (figure with pages a, b, c and d).
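The arrows of the link structure are not reproduced in this transcript, so the sketch below does not encode the test case itself. Instead it illustrates one way to judge whether an output is "reasonable" for any link structure: both vectors should have length 1, a page that links to nothing should have hub value 0, a page that nothing links to should have authority value 0, and at convergence recomputing the authorities from the hubs should reproduce the authority vector. The function name `check_hits_result`, the tolerance, and the three-page example are assumptions for illustration.

```python
import math

def check_hits_result(links, hub, auth, tol=1e-6):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    linked_to = {dst for targets in links.values() for dst in targets}
    pages = set(links) | linked_to | set(hub) | set(auth)
    # Both vectors should have Euclidean length 1.
    for name, vec in (("hub", hub), ("authority", auth)):
        length = math.sqrt(sum(v * v for v in vec.values()))
        if abs(length - 1.0) > tol:
            problems.append(f"{name} vector has length {length:.6f}, expected 1")
    for p in sorted(pages):
        # No outgoing links means an empty sum for the hub value.
        if not links.get(p) and abs(hub.get(p, 0.0)) > tol:
            problems.append(f"{p} links to nothing but has a hub value")
        # No incoming links means an empty sum for the authority value.
        if p not in linked_to and abs(auth.get(p, 0.0)) > tol:
            problems.append(f"{p} has no incoming links but has an authority value")
    # At convergence, authorities recomputed from the hubs should match auth.
    recomputed = {p: 0.0 for p in pages}
    for src, targets in links.items():
        for dst in targets:
            recomputed[dst] += hub.get(src, 0.0)
    norm = math.sqrt(sum(v * v for v in recomputed.values()))
    if norm > 0:
        for p in sorted(pages):
            if abs(recomputed[p] / norm - auth.get(p, 0.0)) > tol:
                problems.append(f"authority of {p} is not a fixed point")
    return problems

# Invented structure with a result that can be worked out by hand.
links = {"h": ["x", "y"], "x": [], "y": []}
r = 1 / math.sqrt(2)
good = check_hits_result(links, {"h": 1.0, "x": 0.0, "y": 0.0},
                         {"h": 0.0, "x": r, "y": r})
bad = check_hits_result(links, {"h": 1.0, "x": 0.0, "y": 0.0},
                        {"h": 1.0, "x": 0.0, "y": 0.0})
print("good result problems:", good)
print("bad result problems:", bad)
```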

