7CCSMWAL Algorithmic Issues in the WWW Lecture HITS
HITS
- HITS is an important (non-Google) method of ranking web pages.
- The method has many attractions, but so far it has not had as successful an implementation as PageRank.
- HITS classifies pages in two ways, based on in-degree and out-degree.
- Who uses, or used, a similar algorithm? Teoma (Ask Jeeves), Twitter.
- As of 2010, Ask.com referred to the Teoma algorithm as the ExpertRank algorithm.
Some documentation
- https://en.wikipedia.org/wiki/HITS_algorithm
- The next one is very ‘famous’: http://web.eecs.umich.edu/~michjc/eecs584/notes/lecture19-kleinberg.pdf
- What search engines actually do is mostly kept secret. Why would that be?
HITS (Hypertext Induced Topic Search)
- Similar to PageRank, but uses both in-links and out-links to create two popularity scores for each page.
- HITS defines hubs and authorities:
  - A hub is a page with many out-links.
  - An authority is a page with many in-links.
- A page can be both a hub and an authority.
Hubs and Authorities
- “Good authorities are pointed to by good hubs and good hubs point to good authorities.”
- Measures of goodness for each page $P_i$:
  - Authority score (or weight) $x_i$
  - Hub score (or weight) $y_i$
- The definitions, where $E$ is the set of all hyperlinks:
  $$x_i = \sum_{j:\,(j,i) \in E} y_j, \qquad y_i = \sum_{j:\,(i,j) \in E} x_j$$
- Hub $i$: $(i, j)$ refers to a hyperlink from $P_i$ to $P_j$.
- Auth $i$: $(j, i)$ refers to a hyperlink from $P_j$ to $P_i$.
Example
- Suppose the hub and authority scores of P1, P2, P3, P4 are:

  Pages                    P1    P2    P3    P4
  Authority score $x_i$    3.5   1.2   4.2   1.0
  Hub score $y_i$          2.1   0.3   5.2   1.1

- The authority score of P5 is 2.1 + 0.3 + 5.2 = 7.6, because hubs P1, P2 and P3 point to P5.
- The hub score of P5 is 3.5 + 1.0 = 4.5, because P5 points to authorities P1 and P4.
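The arithmetic in this example can be checked with a few lines of R. This is only a sketch: the vectors `auth` and `hub` hold the scores from the table, and the hypothetical names `into_P5` and `out_of_P5` record which pages the example says link to and from P5.

```r
# Scores of P1..P4 from the table in the example
auth <- c(P1 = 3.5, P2 = 1.2, P3 = 4.2, P4 = 1.0)
hub  <- c(P1 = 2.1, P2 = 0.3, P3 = 5.2, P4 = 1.1)

# Links involving P5, as stated in the example
into_P5   <- c("P1", "P2", "P3")  # pages that point to P5
out_of_P5 <- c("P1", "P4")        # pages that P5 points to

auth_P5 <- sum(hub[into_P5])      # authority = sum of hub scores of in-neighbours
hub_P5  <- sum(auth[out_of_P5])   # hub = sum of authority scores of out-neighbours
print(c(authority = auth_P5, hub = hub_P5))
```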
Iterative Approach
- Similar to the PageRank computation, the authority and hub scores are computed iteratively, followed by normalization (so that the scores add to 1).
- Let $x_i^{(k)}$ and $y_i^{(k)}$ be the authority and hub scores, respectively, of page $P_i$ at iteration $k$.
- For instance, in each iteration, we start by updating the authority scores and then the hub scores:
  $$x_i^{(k)} = \sum_{j:\,(j,i) \in E} y_j^{(k-1)}, \qquad y_i^{(k)} = \sum_{j:\,(i,j) \in E} x_j^{(k)}$$
Score Computation in Matrix Form
- Adjacency matrix $L$: $L_{ij} = 1$ if there exists a link from $P_i$ to $P_j$, and $L_{ij} = 0$ otherwise.
- Example: a graph on pages 1, 2, 3, 4 (figure and matrix not reproduced).
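As an illustration of the definition, an adjacency matrix can be built from a list of links. The four-page edge list below is hypothetical (it is not the graph from the slide, which did not survive in the text).

```r
# Hypothetical edge list for a 4-page graph: each row is a link (from, to)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 1), c(4, 3))
n <- 4

L <- matrix(0, nrow = n, ncol = n)
L[edges] <- 1        # a two-column index matrix sets L[i, j] for every row (i, j)
print(L)

out_degree <- rowSums(L)   # number of out-links of each page
in_degree  <- colSums(L)   # number of in-links of each page
```

Row sums and column sums of $L$ give the out-degree and in-degree that HITS builds on.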
Score Computation in Matrix Form
- Let $x^{(k)}$ be the column vector of the authority scores.
- Let $y^{(k)}$ be the column vector of the hub scores.
- Then we have $y^{(k)} = L x^{(k)}$, because for each $i$
  $$y_i^{(k)} = \sum_{j=1}^{n} L_{ij}\, x_j^{(k)} = \sum_{j:\,(i,j) \in E} x_j^{(k)}$$
Score Computation in Matrix Form
- The transpose of an $n \times m$ matrix $M$ is the $m \times n$ matrix $M^T$ where $M^T_{ij} = M_{ji}$.
- Example (matrices not reproduced).
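Score Computation in Matrix Form
- We have $x^{(k)} = L^T y^{(k-1)}$, because for each $i$
  $$x_i^{(k)} = \sum_{j=1}^{n} L^T_{ij}\, y_j^{(k-1)} = \sum_{j:\,(j,i) \in E} y_j^{(k-1)}$$
- Note: $x$ is the authority score of a page, $y$ is the hub score of a page.
- Note: if there is an edge $(j, i)$ then $L^T_{ij} = L_{ji} = 1$.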
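A small check that the matrix products agree with the sums over edges. This is a sketch on a hypothetical four-page graph; the names `x_mat`, `x_sum`, `y_mat` and `y_sum` are introduced here only for illustration.

```r
# Hypothetical 4-page graph: each row of `edges` is a link (from, to)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 1), c(4, 3))
n <- 4
L <- matrix(0, n, n)
L[edges] <- 1

y_prev <- rep(1, n)                 # hub scores from the previous iteration

# Matrix form
x_mat <- drop(t(L) %*% y_prev)      # x = L^T y
y_mat <- drop(L %*% x_mat)          # y = L x

# Edge-sum form: walk over the hyperlinks one by one
x_sum <- numeric(n)
for (e in seq_len(nrow(edges))) {
  x_sum[edges[e, 2]] <- x_sum[edges[e, 2]] + y_prev[edges[e, 1]]
}
y_sum <- numeric(n)
for (e in seq_len(nrow(edges))) {
  y_sum[edges[e, 1]] <- y_sum[edges[e, 1]] + x_sum[edges[e, 2]]
}

print(rbind(x_mat, x_sum, y_mat, y_sum))
```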
HITS Score Computation in Matrix Form
- Let $x^{(k)}$ be the column vector of the authority scores.
- Let $y^{(k)}$ be the column vector of the hub scores.
- Let $L$ be the adjacency matrix of the network.

HITS Algorithm
1. Initialize: $y^{(0)} = e$ (column vector of all ones), $k = 1$.
2. Until convergence, do
   - $x^{(k)} = L^T y^{(k-1)}$
   - $y^{(k)} = L x^{(k)}$
   - Normalize* $x^{(k)}$ and $y^{(k)}$
   - $k = k + 1$

* To avoid the values getting too large; the normalized scores must add to 1.
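The algorithm can be sketched in base R as follows. The function name `hits_sketch`, the stopping rule (`tol`, `max_iter`) and the three-page test graph are assumptions made for illustration; the slide only says "until convergence". The sketch also assumes the graph has at least one link, so that the sums used for normalization are non-zero.

```r
hits_sketch <- function(L, tol = 1e-10, max_iter = 1000) {
  n <- nrow(L)
  y <- rep(1, n)                        # y(0) = e
  x <- rep(0, n)
  for (k in seq_len(max_iter)) {
    x_new <- drop(t(L) %*% y)           # x(k) = L^T y(k-1)
    y_new <- drop(L %*% x_new)          # y(k) = L x(k)
    x_new <- x_new / sum(x_new)         # normalize so scores add to 1
    y_new <- y_new / sum(y_new)
    done <- max(abs(x_new - x), abs(y_new - y)) < tol
    x <- x_new
    y <- y_new
    if (done) break                     # assumed convergence test
  }
  list(authority = x, hub = y, iterations = k)
}

# Hypothetical 3-page graph: 1 -> 2, 1 -> 3, 2 -> 3
L <- matrix(c(0, 1, 1,
              0, 0, 1,
              0, 0, 0), nrow = 3, byrow = TRUE)
res <- hits_sketch(L)
print(lapply(res, round, 4))
```

In this toy graph page 3 receives the most links and should come out as the top authority, while page 1, which links to both other pages, should come out as the top hub.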
Example: comparing the measures on an 8-vertex graph (graph figure not reproduced)

Results

  vertex  hub    authority  Pagerank99  PageRank85  betweenness
  1       0.000  0.090      0.170       0.180       0.180
  2       0.470  0.000      0.220       0.200       0.300
  3       0.160  0.170      0.060       0.060       0.010
  4       0.270  0.220      0.080       0.090       0.040
  5       0.000  0.260      0.100       0.100       0.130
  6       0.050  0.260      0.100       0.100       0.030
  7       0.050  0.000      0.050       0.060       0.010
  8       0.000  0.000      0.220       0.210       0.300
R program

    library(igraph)

    # read in the graph as an edge list
    mygraph <- read.table("compare-the-measures.csv", sep = ",")
    mygraph  # see the data

    # turn the data into a directed graph
    mygraph <- graph.data.frame(mygraph, directed = TRUE)
    plot(mygraph, vertex.color = "white")  # draw the graph

    # HITS (Kleinberg scores)
    # (dotted names as on the slide; newer igraph releases also provide
    #  hub_score(), authority_score(), page_rank(), graph_from_data_frame())
    H <- hub.score(mygraph)$vector
    A <- authority.score(mygraph)$vector

    # normalize HITS so that the scores sum to 1
    # (alternative: divide by the maximum, H / max(H) and A / max(A))
    H <- H / sum(H)
    A <- A / sum(A)

    # PageRank with 'no damping' and 'Google damping'
    P1 <- page.rank(mygraph, damping = 0.99999)$vector
    P2 <- page.rank(mygraph, damping = 0.85)$vector
    P1 <- P1 / sum(P1)
    P2 <- P2 / sum(P2)

    # betweenness centrality
    B <- betweenness(mygraph)
    B <- B / sum(B)

    answer <- data.frame(hub = H, authority = A, Pagerank99 = P1,
                         PageRank85 = P2, betweenness = B)
    format(round(answer, 2), nsmall = 3)
Score Computation in Matrix Form
- In step 2 of the algorithm, the two equations
  $$x^{(k)} = L^T y^{(k-1)}, \qquad y^{(k)} = L x^{(k)}$$
  can be simplified by substitution to
  $$x^{(k)} = L^T L\, x^{(k-1)}, \qquad y^{(k)} = L L^T y^{(k-1)}$$
- The equations become very similar to that of the PageRank computation: $\pi^{(k)} = \pi^{(k-1)} H$.
- In HITS, $L^T L$ is called the authority matrix and $L L^T$ is called the hub matrix.
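The substitution can be checked numerically. The sketch below uses a hypothetical three-page graph and ignores normalization; it also illustrates that the stable authority vector is the dominant eigenvector of the authority matrix, which is what repeated multiplication by $L^T L$ converges to here.

```r
# Hypothetical 3-page graph: 1 -> 2, 1 -> 3, 2 -> 3
L <- matrix(c(0, 1, 1,
              0, 0, 1,
              0, 0, 0), nrow = 3, byrow = TRUE)

A   <- crossprod(L)     # authority matrix L^T L
Hub <- tcrossprod(L)    # hub matrix L L^T

x_prev <- c(1, 2, 3)                 # any authority vector x(k-1)
y_prev <- drop(L %*% x_prev)         # y(k-1) = L x(k-1)
x_two  <- drop(t(L) %*% y_prev)      # two-equation form: x(k) = L^T y(k-1)
x_one  <- drop(A %*% x_prev)         # substituted form:  x(k) = L^T L x(k-1)
y_two  <- drop(L %*% x_two)          # y(k) = L x(k)
y_one  <- drop(Hub %*% y_prev)       # y(k) = L L^T y(k-1)

# Dominant eigenvector of the symmetric authority matrix, scaled to sum to 1
v    <- abs(eigen(A, symmetric = TRUE)$vectors[, 1])
auth <- v / sum(v)
print(round(auth, 4))
```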
HITS Implementation
- Two main steps:
  1. A neighbourhood graph N related to the query terms is built (query dependent).
  2. The authority and hub scores for each page in N are computed.
- The result is two ranked lists: the most authoritative pages and the most “hubby” pages.
- We focus on the first step, as the second step has already been explained.
Neighbourhood Graph N Formation
- N is initialized with all pages containing references to the query terms.
  - This makes use of the content index (inverted file).
- Expand the graph N by adding vertices (from the Web graph) that link either to or from vertices in N.
  - Relevant pages without the query terms can be added this way. E.g., with query term “car”, pages containing “automobile” may be added.
  - It is a somewhat arbitrary process.
- N can become very large if a page containing the query terms has a huge in-degree or out-degree.
  - In practice, a limit, say 100, is applied to the expansion from the in-links or out-links of a page with the query terms.
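The expansion step might look roughly like the sketch below. Everything in it is hypothetical: the toy Web graph `web`, the function name `build_neighbourhood`, and the choice to apply the same cap `max_links` to both the in-links and the out-links of each root page. The root set is assumed to have been found already through the content index.

```r
# Hypothetical Web graph as an edge list (from -> to); page numbers are made up
web <- data.frame(from = c(1, 1, 2, 3, 6, 6, 10, 4),
                  to   = c(3, 6, 1, 6, 3, 5, 6,  7))

build_neighbourhood <- function(web, root, max_links = 100) {
  # For every root page keep at most max_links out-neighbours and in-neighbours
  added <- unlist(lapply(root, function(p) {
    c(head(web$to[web$from == p], max_links),
      head(web$from[web$to == p], max_links))
  }))
  sort(unique(c(root, added)))
}

root <- c(1, 6)                 # pages assumed to contain the query terms
N <- build_neighbourhood(web, root)
print(N)

# Links of the neighbourhood graph: both endpoints must lie in N
sub_edges <- web[web$from %in% N & web$to %in% N, ]

# A tighter cap keeps N smaller
N_small <- build_neighbourhood(web, root, max_links = 1)
```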
Score Computation with N
- Once N is built, the adjacency matrix $L$ corresponding to N is formed.
- The number of pages in N (and hence the size of $L$) is much smaller than the total number of pages on the Web.
  - Computing the authority and hub scores therefore costs much less than the PageRank method.
- In fact, we do not need to iterate both equations: once a stable authority vector $x$ is obtained, we can apply $y = Lx$ to get the stable hub vector $y$.
Normalization (Total = 1)
- Purpose: to limit the values of the authority and hub scores, and to ensure convergence.
- Normalization step: $x^{(k)} \leftarrow x^{(k)} / m(x^{(k)})$
  - where $m(x^{(k)})$ can be the sum of the individual values in $x^{(k)}$, i.e., $x_1^{(k)} + x_2^{(k)} + \dots + x_n^{(k)}$
  - $n$ is the number of pages in N (not the whole Web graph)
- Similarly for $y$: $y^{(k)} \leftarrow y^{(k)} / m(y^{(k)})$
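A minimal sketch of the normalization step. The helper names `normalize_sum` and `normalize_max` are made up; the max-based variant corresponds to the alternative that is commented out in the R program earlier in the lecture. The input vector is $x^{(1)} = (1, 0, 4, 2, 4, 0)$ from the lecture's worked example.

```r
normalize_sum <- function(v) v / sum(v)   # m(v) = sum of the values, scores add to 1
normalize_max <- function(v) v / max(v)   # m(v) = largest value, top score becomes 1

x1 <- c(1, 0, 4, 2, 4, 0)
x1_sum <- normalize_sum(x1)
x1_max <- normalize_max(x1)
print(round(x1_sum, 4))
print(x1_max)
```

Either choice of $m$ leaves the ranking of the pages unchanged, since every score is divided by the same positive number.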
Example
- Suppose P1 and P6 contain the query terms, and the neighbourhood graph around P1 and P6 includes P2, P3, P5 and P10.
- The adjacency matrix $L$ has rows and columns indexed by pages 1, 2, 3, 5, 6, 10 (graph figure and matrix not reproduced).
Example (cont)
- The respective authority matrix $L^T L$ and hub matrix $L L^T$ (matrices and graph figure not reproduced).
Example (cont)
- $x^{(0)} = (1, 1, 1, 1, 1, 1)^T$
- $x^{(1)} = L^T L\, x^{(0)} = (1, 0, 4, 2, 4, 0)^T$
- Normalize $x^{(1)}$: each element is divided by the sum of $x^{(1)}$, which is 11, giving $(1/11, 0/11, 4/11, 2/11, 4/11, 0/11)^T = (0.0909, 0, 0.3636, 0.1818, 0.3636, 0)^T$.
- $x^{(2)} = L^T L\, x^{(1)}$, and so on for the first 20 iterations (values not reproduced).
Example (cont)
- By the HITS algorithm, the stable authority scores $x$ and hub scores $y$ are
  - $x = (0, 0, 0.3660, 0.1340, 0.5, 0)^T$
  - $y = (0.3660, 0, 0.2113, 0, 0.2113, 0.2113)^T$
  - Labels = (1, 2, 3, 5, 6, 10)
- Ties may occur and can be broken by any tie-breaking strategy, e.g., by page number.
  - Authority ranking: P6, P3, P5, P1, P2, P10
  - Hub ranking: P1, P3, P6, P10, P2, P5
- P6 is the most authoritative page and P1 is the best hub.
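The worked example can be replayed in R, with one important caveat: the adjacency matrix itself did not survive in the text, so the matrix `L` below is an assumption. It was chosen because it gives $x^{(1)} = (1, 0, 4, 2, 4, 0)$ as quoted in the example; the actual graph on the slide may differ. Ties are broken by page number after rounding to four decimals, which is also an assumption.

```r
pages <- c(1, 2, 3, 5, 6, 10)
# ASSUMED adjacency matrix (rows/columns in the order of `pages`):
# links 1->3, 1->6, 2->1, 3->6, 6->3, 6->5, 10->6
L <- matrix(c(0, 0, 1, 0, 1, 0,
              1, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 1, 0,
              0, 0, 0, 0, 0, 0,
              0, 0, 1, 1, 0, 0,
              0, 0, 0, 0, 1, 0), nrow = 6, byrow = TRUE)

A  <- crossprod(L)                    # authority matrix L^T L
x  <- rep(1, 6)                       # x(0)
x1 <- drop(A %*% x)                   # x(1) before normalization

for (k in 1:20) {                     # first 20 iterations
  x <- drop(A %*% x)
  x <- x / sum(x)
}
y <- drop(L %*% x)                    # stable hubs from stable authorities
y <- y / sum(y)

print(round(x, 4))
print(round(y, 4))

# Rankings, ties broken by page number
auth_rank <- pages[order(-round(x, 4), pages)]
hub_rank  <- pages[order(-round(y, 4), pages)]
```

Only the authority equation is iterated here; the hub vector is obtained from $y = Lx$ at the end.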
Modification
- Similar to the PageRank method, we can also incorporate teleporting by modifying the authority and hub matrices:
  - Authority matrix: $\xi\, L^T L + \frac{1 - \xi}{n}\, e e^T$
  - Hub matrix: $\xi\, L L^T + \frac{1 - \xi}{n}\, e e^T$
  - where $\xi$ is a number between 0 and 1 (similar to the damping factor in PageRank) and $e$ is the column vector of all ones, so $e e^T$ is the $n \times n$ all-ones matrix.
- For the example with $\xi = 0.95$, the modification gives the authority and hub scores
  - $x = (0.0032, 0.0023, 0.3634, 0.1351, 0.4936, 0.0023)^T$
  - $y = (0.3628, 0.0032, 0.2106, 0.0023, 0.2106, 0.2106)^T$
- Note that the rankings remain the same in this example.
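A sketch of the modified computation. Two assumptions are made and should be read as such: the adjacency matrix `L` is a reconstruction of the example graph (the original matrix is not in the text), and the teleporting term is taken to be the all-ones matrix scaled by $(1 - \xi)/n$, since adding a multiple of the identity matrix would leave the eigenvectors, and hence the scores, unchanged. The names `xi`, `power_iterate` and the iteration count are illustrative.

```r
# ASSUMED adjacency matrix for pages (1, 2, 3, 5, 6, 10)
L <- matrix(c(0, 0, 1, 0, 1, 0,
              1, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 1, 0,
              0, 0, 0, 0, 0, 0,
              0, 0, 1, 1, 0, 0,
              0, 0, 0, 0, 1, 0), nrow = 6, byrow = TRUE)
n  <- nrow(L)
xi <- 0.95
E  <- matrix(1, n, n)                             # e e^T, the all-ones matrix

A_mod <- xi * crossprod(L)  + (1 - xi) / n * E    # modified authority matrix
H_mod <- xi * tcrossprod(L) + (1 - xi) / n * E    # modified hub matrix

power_iterate <- function(M, iters = 200) {
  v <- rep(1 / nrow(M), nrow(M))
  for (k in seq_len(iters)) {
    v <- drop(M %*% v)
    v <- v / sum(v)                               # keep the scores summing to 1
  }
  v
}

x <- power_iterate(A_mod)
y <- power_iterate(H_mod)
print(round(x, 4))
print(round(y, 4))
```

Because both modified matrices have strictly positive entries, every page should end up with a small positive score, and under these assumptions the output should land close to the scores quoted in the slide.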
Strength of HITS
- Provides two ranked lists:
  - The most authoritative pages: for more in-depth information.
  - The most “hubby” pages: a portal to related information in a broad search.
- The size of the problem is much smaller than that of the PageRank method.
- Intuitively, hubs with a high score carry a lot of traffic, which makes them a good place for advertising etc.
Weakness of HITS
- Query-dependent
  - For each query, a neighbourhood graph must be built at query time.
  - It is easy to make HITS query-independent by computing the authority and hub vectors, $x$ and $y$, using the adjacency matrix of the entire Web graph.
  - However, this leads to the same size problem that the PageRank method faces.
Weakness of HITS
- Susceptibility to spamming
  - Adding more out-links to a page can easily increase its hub score.
  - Authority scores and hub scores are interdependent: an authority score will increase as a hub score increases.
- Bharat & Henzinger (1998) proposed to normalize links in two situations:
  1. If there are $k$ links from pages on Host 1 to the same page $P_i$ on Host 2, each link has a weight of $1/k$ in computing the authority score of $P_i$. I.e., each page $P_j$ on Host 1 that points to $P_i$ contributes only $y_j / k$ to $x_i$.
  2. If there are $h$ links from the same page $P_i$ on Host 1 to pages on Host 2, each link has a weight of $1/h$ in computing the hub score of $P_i$. I.e., each page $P_j$ on Host 2 that is pointed to from $P_i$ contributes only $x_j / h$ to $y_i$.
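The two link weights can be sketched as follows. The page names, the host assignment and the column names `auth_w` and `hub_w` are all hypothetical; the sketch only illustrates the $1/k$ and $1/h$ rules described above, with every score set to 1 so that the effect of the weights is easy to see.

```r
# Hypothetical links between pages on hosts A, B and C
edges <- data.frame(from = c("a1", "a2", "a3", "b1", "a1"),
                    to   = c("c1", "c1", "c1", "c1", "c2"),
                    stringsAsFactors = FALSE)
host <- c(a1 = "A", a2 = "A", a3 = "A", b1 = "B", c1 = "C", c2 = "C")

# k: number of links from the same source host to the same target page
key_k <- paste(host[edges$from], edges$to)
k <- as.vector(table(key_k)[key_k])

# h: number of links from the same source page to the same target host
key_h <- paste(edges$from, host[edges$to])
h <- as.vector(table(key_h)[key_h])

edges$auth_w <- 1 / k   # weight of the link when updating authority scores
edges$hub_w  <- 1 / h   # weight of the link when updating hub scores
print(edges)

# With all hub and authority scores equal to 1:
x_c1 <- sum(edges$auth_w[edges$to == "c1"])    # weighted authority of c1
y_a1 <- sum(edges$hub_w[edges$from == "a1"])   # weighted hub score of a1
```

Page c1 has four in-links, but the three from host A together count as one, so its weighted authority score is 2 rather than 4.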
Weakness of HITS
- Topic drift
  - A very authoritative yet off-topic page can be included in the neighbourhood graph if it is linked to a page containing the query terms.
  - E.g., a query for “Jaguar” (the wild cat) may return homepages of car manufacturers or lists of car manufacturers.
  - The neighbourhood graph N that is built may not relate properly to the query.
Weakness of HITS
- Bharat & Henzinger suggest a solution that weights the authority and hub scores of the pages in N by a measure of relevancy to the query.
- Let $S(Q, P_i)$ be the “relevancy score” (a number between 0 and 1) of page $P_i$ to the query $Q$.
  - It can be computed using traditional information retrieval techniques (see later).
- For any hyperlink $(P_i, P_j)$:
  - $P_i$ contributes $S(Q, P_i) \cdot y_i$ to the authority score of $P_j$, where $y_i$ is the hub score of $P_i$.
  - $P_j$ contributes $S(Q, P_j) \cdot x_j$ to the hub score of $P_i$, where $x_j$ is the authority score of $P_j$.
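In matrix form, the relevance weighting amounts to scaling each score by the relevancy of its page before it is passed along a link. A minimal sketch of one unnormalized update round, with a hypothetical three-page graph and made-up relevancy scores `S`:

```r
# Hypothetical 3-page graph: 1 -> 2, 1 -> 3, 2 -> 3
L <- matrix(c(0, 1, 1,
              0, 0, 1,
              0, 0, 0), nrow = 3, byrow = TRUE)

S <- c(1.0, 0.5, 0.2)     # made-up relevancy scores S(Q, P_i), each in [0, 1]
y <- rep(1, 3)            # current hub scores

# Authority update: P_i passes S_i * y_i along each link (P_i, P_j)
x <- drop(t(L) %*% (S * y))

# Hub update: P_j passes S_j * x_j back along each link (P_i, P_j)
y_new <- drop(L %*% (S * x))

print(rbind(authority = x, hub = y_new))
```

A page with low relevancy to the query, such as page 3 here, passes on little of its score, which limits the influence of off-topic pages.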