HITS: Hypertext Induced Topic Selection
Gyozo Gidofalvi, Uppsala Database Laboratory
2019-02-04
Idea
Given a set of web pages that are all concerned with the same topic, we want to find, by examining the internal link structure of the set:
- the most interesting pages
- the pages that are most likely to guide us to interesting pages
Foundation
Identify Hubs and Authorities
The definition is mutually recursive:
- A good hub points to good authorities
- A good authority is pointed to by good hubs
The hub value of a site is the sum of the authority values of the sites that it points to.
The authority value of a site is the sum of the hub values of the sites that point to it.
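The two definitions can also be written in symbols. The notation below is an assumption introduced for illustration, not taken from the slides: a(p) and h(p) are the authority and hub values of site p, q → p means that q links to p, and A is the adjacency matrix with A_{pq} = 1 if p links to q and 0 otherwise.

```latex
a(p) = \sum_{q \,:\, q \to p} h(q), \qquad
h(p) = \sum_{q \,:\, p \to q} a(q)

% Equivalent matrix form, with \mathbf{a} and \mathbf{h} as column vectors:
\mathbf{a} = A^{\top} \mathbf{h}, \qquad
\mathbf{h} = A\, \mathbf{a}
```

In this form one round of updates is two matrix-vector products, which is why the procedure is usually run as a repeated iteration.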
Pseudo-code
1. Find a set of pages about a given subject
   - You may use an existing search engine (such as Google)
   - In the assignment, you are provided with a set of pages with links
2. Preprocess the link structure
3. Initialize the hub and authority vectors
4. Normalize the vectors to length 1
5. Calculate the new authority vector based on the link structure and the hub vector
6. Calculate the new hub vector based on the link structure and the authority vector
7. If the new values of the hub and authority vectors are similar enough to the old ones, we are done; otherwise repeat from step 4
8. Sort the vectors and find the top authorities and hubs
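The initialization, normalization, update, convergence and sorting steps of the pseudo-code could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the assignment's reference solution: the function names, the tolerance `tol`, the iteration cap `max_iter`, the choice to compute the new hub vector from the new authority vector, and the four-page graph `example_links` are all hypothetical.

```python
import math


def normalize(vec):
    """Scale a score dict to Euclidean length 1 (all-zero vectors are returned unchanged)."""
    length = math.sqrt(sum(v * v for v in vec.values()))
    if length == 0:
        return dict(vec)
    return {page: v / length for page, v in vec.items()}


def hits(links, tol=1e-9, max_iter=1000):
    """links maps each page to the collection of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    # Initialize both vectors to the same value and normalize them.
    hub = normalize({p: 1.0 for p in pages})
    auth = normalize({p: 1.0 for p in pages})
    for _ in range(max_iter):
        # New authority value: sum of the hub values of the pages pointing to the page.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        new_auth = normalize(new_auth)
        # New hub value: sum of the authority values of the pages the page points to.
        # Assumption: the freshly computed authority vector is used here.
        new_hub = normalize(
            {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
        )
        # Stop when no entry of either vector changed by more than tol.
        change = max(
            max(abs(new_auth[p] - auth[p]), abs(new_hub[p] - hub[p])) for p in pages
        )
        hub, auth = new_hub, new_auth
        if change < tol:
            break
    return hub, auth


# Hypothetical graph, NOT the test case from the slides.
example_links = {"p1": ["p3", "p4"], "p2": ["p3"]}
hub, auth = hits(example_links)
top_authorities = sorted(auth, key=auth.get, reverse=True)
top_hubs = sorted(hub, key=hub.get, reverse=True)
print("top authority:", top_authorities[0])
print("top hub:", top_hubs[0])
```

Here normalization happens right after each update rather than at the top of the loop; within the loop the two placements are equivalent.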
Calculating the hub and authority vectors
First, we initialize the hub and authority vectors to some value.
- What initial values are appropriate?
- Does it matter what we initialize to?
Next, we calculate the new hub and authority vectors using the formulas given by the definitions under Foundation.
- Does it matter in which order these calculations happen?
- Do we need to normalize the vectors in each iteration?
- How do we know when to stop?
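One way to explore the normalization question is to run the updates without normalizing and watch the vector length. The sketch below does that on a small hypothetical graph (the graph, the all-ones start vector and the 20 iterations are assumptions chosen for illustration). It only shows what happens to the numbers; drawing the conclusion is left to the reader.

```python
import math

# Hypothetical graph, NOT the test case from the slides.
links = {"p1": ["p3", "p4"], "p2": ["p3"], "p3": [], "p4": []}


def step(hub):
    """One round of updates WITHOUT normalization: authorities first, then hubs."""
    auth = {p: 0.0 for p in links}
    for p, targets in links.items():
        for q in targets:
            auth[q] += hub[p]
    new_hub = {p: sum(auth[q] for q in links[p]) for p in links}
    return new_hub, auth


def length(vec):
    """Euclidean length of a score dict."""
    return math.sqrt(sum(v * v for v in vec.values()))


hub = {p: 1.0 for p in links}
lengths = []
for _ in range(20):
    hub, auth = step(hub)
    lengths.append(length(hub))

# The length keeps growing by a roughly constant factor per round
# (about 2.618 for this particular graph), while the order of the pages
# by hub value stays the same.
print("growth factor in the last round:", lengths[-1] / lengths[-2])
print("pages by hub value:", sorted(hub, key=hub.get, reverse=True))
```

On larger graphs or longer runs, unnormalized values of this kind eventually overflow floating-point range, which is one practical argument to weigh when answering the question above.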
Preprocessing
Preprocessing will improve the accuracy.
- Several links may point to the same page, and should be treated as one page:
  http://www.it.uu.se
  http://www.it.uu.se/index.html
  www.it.uu.se
- Remove site-internal links, as these can make a site seem more important than it really is.
- Remove links to sites for which we do not know the link structure.
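The three preprocessing rules might be sketched as below. This is an illustration under assumptions, not the directions from the lab course web page: the helper names `canonical` and `preprocess` are hypothetical, a "site" is taken to mean the host name, bare host names are assumed to be http, and query strings and fragments are simply dropped.

```python
from urllib.parse import urlsplit


def canonical(url):
    """Map URL variants of the same page to one key (hypothetical rules)."""
    if "://" not in url:
        url = "http://" + url  # assumption: a bare host name means http
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]  # treat /index.html like /
    return host + path.rstrip("/")


def site_of(key):
    """Assumption: the site is the host part of the canonical key."""
    return key.split("/", 1)[0]


def preprocess(links):
    """links maps a page URL to the URLs it points to; returns a cleaned copy."""
    known = {canonical(page) for page in links}
    cleaned = {}
    for page, targets in links.items():
        src = canonical(page)
        out = cleaned.setdefault(src, set())
        for target in targets:
            dst = canonical(target)
            if site_of(dst) == site_of(src):
                continue  # site-internal link
            if dst not in known:
                continue  # link structure of the target is unknown
            out.add(dst)
    return cleaned


# Hypothetical input using the URL variants from the slide.
raw = {
    "http://www.it.uu.se/index.html": [
        "www.it.uu.se/edu",
        "http://example.org/",
        "http://unknown.example.net/",
    ],
    "http://example.org": ["http://www.it.uu.se"],
}
print(preprocess(raw))
```

Using the host name as the site is the simplest choice; a stricter notion of site (for example, grouping subdomains) would remove more links.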
The assignment
You will mine four different link structures for four different queries.
We have done the web crawling and some of the preprocessing for you!
- Input files are on the lab course web page
However, you must
- Do some preprocessing yourselves
  - Directions for pre-processing are on the lab course web page
- Validate your implementation
  - Think of how to verify your solution
  - Your validation does not have to be fancy, or even automated
  - At least implement the test case on the following slide and see what output it gives you
  - Make sure that the test case output is reasonable
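As one possible starting point for validation, a few simple sanity checks can be automated. The sketch below is an assumption about what "reasonable" output might mean, not a required part of the assignment: it assumes the vectors were normalized to length 1, started from non-negative values, and went through at least one update. The function name `check_scores` and the tolerance `eps` are hypothetical.

```python
import math


def check_scores(links, hub, auth, eps=1e-6):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    pointed_to = {q for targets in links.values() for q in targets}
    for name, vec in (("hub", hub), ("authority", auth)):
        if abs(math.sqrt(sum(v * v for v in vec.values())) - 1.0) > eps:
            problems.append(f"{name} vector does not have length 1")
        if any(v < -eps for v in vec.values()):
            problems.append(f"{name} vector has a negative entry")
    for page in hub:
        # A page nobody points to receives an empty sum as authority value.
        if page not in pointed_to and abs(auth[page]) > eps:
            problems.append(f"{page} has no incoming links but non-zero authority")
        # A page that points to nothing receives an empty sum as hub value.
        if not links.get(page) and abs(hub[page]) > eps:
            problems.append(f"{page} has no outgoing links but non-zero hub value")
    return problems


# Hand-made two-page example: p1 links to p2.
links = {"p1": ["p2"], "p2": []}
print(check_scores(links, {"p1": 1.0, "p2": 0.0}, {"p1": 0.0, "p2": 1.0}))
```

Checks like these catch swapped hub and authority updates or a forgotten normalization, but they do not prove that the ranking itself is correct.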
Example (test case)
Rank the pages according to hub and authority value in this link structure:
[Figure: link structure over the pages a, b, c, d]