Improved Algorithms for Topic Distillation in a Hyperlinked Environment (ACM SIGIR ‘98) Ruey-Lung, Hsiao Nov 23, 2000
Topic Distillation on the WWW
Definition: Given a typical user query, find quality documents related to the query topic.
Characteristics:
  More general than finding a precise query match.
  Not as ambitious as trying to exactly satisfy the user's information need.
  When the query is ambiguous, it should return relevant documents for (some of) the main query topics.
Related Research
  [1] HITS: Authoritative Sources in a Hyperlinked Environment ('97)
  [2] Topic Distillation: Improved Algorithms for Topic Distillation in a Hyperlinked Environment ('98)
  [3] Related Page: Finding Related Pages in the World Wide Web ('99)
  [4] Web Community: Inferring Web Communities from Link Topology ('98)
  [5] Reputation: What Is This Page Known For? Computing Web Page Reputations ('00)
HITS (Hyperlink Induced Topic Search) Algorithm
Start with a root set S_σ:
  S_σ is relatively small (typically up to 200 pages).
  S_σ is rich in relevant pages.
  S_σ contains most (or many) of the strongest authorities.
Recursively compute the degree of authority a(p) and hub h(p) for each element p:
  $a(p) = \sum_{q \to p} h(q)$
  $h(p) = \sum_{p \to q} a(q)$
[Figure: set S and set T]
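The two update rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `hits` and `normalize`, the fixed count of 20 iterations, and the Euclidean normalization after each round are assumptions added here so that the scores stay bounded.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit Euclidean length (an assumption, to keep values bounded)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {n: v / norm for n, v in scores.items()}


def hits(edges, iterations=20):
    """Return (authority, hub) score dicts for a list of (source, target) links."""
    nodes = {n for edge in edges for n in edge}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a(p) = sum of h(q) over all links q -> p
        new_auth = {n: 0.0 for n in nodes}
        for q, p in edges:
            new_auth[p] += hub[q]
        # h(p) = sum of a(q) over all links p -> q, using the fresh authority scores
        new_hub = {n: 0.0 for n in nodes}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub


# Hypothetical toy graph: two hub-like pages pointing at two authority-like pages.
auth, hub = hits([("h1", "a1"), ("h2", "a1"), ("h1", "a2")])
```

In this toy graph a1 is linked from both hubs and ends up with the higher authority score, while h1 links to both authorities and ends up with the higher hub score.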
HITS (Hyperlink Induced Topic Search)
Premises:
  The implicit annotation provided by human page creators contains sufficient information to infer authority.
  Sufficiently broad topics contain embedded communities of hyperlinked pages.
Problems:
  Mutually reinforcing relationships: certain arrangements of documents “conspire” to dominate the computation.
  Automatically generated links: no human opinion is expressed by the link.
  Non-relevant documents: the graph contains documents not relevant to the query topic.
Improved Algorithm
Improved Connectivity Analysis
  Mutually reinforcing relationships should have the same influence on a single document.
  $a(p) = \sum_{q \to p} h(q) \times \mathrm{auth\_wt}(q,p)$
  $h(p) = \sum_{p \to q} a(q) \times \mathrm{hub\_wt}(p,q)$
Pruning Nodes from Neighborhood Graph
  $\mathrm{Similarity}(Q, D_j) = \frac{\sum_{i=1}^{t} w_{iq} \times w_{ij}}{\sqrt{\sum_{i=1}^{t} w_{iq}^2 \times \sum_{i=1}^{t} w_{ij}^2}}$
  Relevance threshold: Median Weight, Start Set Median Weight, or Fixed Fraction of Maximum Weight.
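A rough Python sketch of the weighted updates and the similarity score on this slide follows. The slide does not say how auth_wt and hub_wt are chosen; the sketch assumes the host-based scheme of the SIGIR '98 paper, where k links from one host to a single page each get auth_wt 1/k and l links from a single page to one host each get hub_wt 1/l. The names `edge_weights`, `weighted_step`, `similarity`, and the `host_of` mapping are hypothetical, and links are assumed to be unique (source, target) pairs with same-host links already removed.

```python
from collections import Counter
from math import sqrt


def edge_weights(edges, host_of):
    """Assumed scheme: split one unit of influence among links sharing a host."""
    into_page = Counter((host_of[q], p) for q, p in edges)    # k per (source host, page)
    out_to_host = Counter((q, host_of[p]) for q, p in edges)  # l per (page, target host)
    auth_wt = {(q, p): 1.0 / into_page[(host_of[q], p)] for q, p in edges}
    hub_wt = {(q, p): 1.0 / out_to_host[(q, host_of[p])] for q, p in edges}
    return auth_wt, hub_wt


def weighted_step(edges, auth, hub, auth_wt, hub_wt):
    """One round of a(p) = sum h(q)*auth_wt(q,p) and h(p) = sum a(q)*hub_wt(p,q)."""
    new_auth = dict.fromkeys(auth, 0.0)
    for q, p in edges:
        new_auth[p] += hub[q] * auth_wt[(q, p)]
    new_hub = dict.fromkeys(hub, 0.0)
    for p, q in edges:
        new_hub[p] += new_auth[q] * hub_wt[(p, q)]
    return new_auth, new_hub


def similarity(query_weights, doc_weights):
    """Cosine similarity between two term -> weight dicts (Q and D_j)."""
    dot = sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())
    norm = sqrt(sum(w * w for w in query_weights.values())
                * sum(w * w for w in doc_weights.values()))
    return dot / norm if norm else 0.0
```

With two pages on host X and one page on host Y all linking to the same target, the two X links each carry weight 1/2, so host X as a whole contributes as much authority as host Y.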
Partial Content Analysis
Selectively analyze, and prune if needed, the nodes that are most influential in the outcome.
Query Q formation (uses 30 documents)
  Heuristic: in_degree + 2*num_query_matches + has_out_links
Pruning
  Degree Based Pruning
    Use 4*in_degree + out_degree as a measure of influence.
    Fetch the top 100 nodes, score them against Q, and prune them if needed.
  Iterative Pruning
    Use the connectivity analysis itself (imp) to select nodes to prune.
    Pruning happens over a sequence of rounds; each round runs imp for 10 iterations to get a ranked list.
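Degree based pruning can be sketched roughly as follows. The influence measure 4*in_degree + out_degree and the cutoff of 100 nodes come from the slide; everything else is an assumption for illustration: `relevance` is a hypothetical callable standing in for "fetch the page and score it against Q", `threshold` is whichever relevance threshold is in use, and ties in influence are broken by node name.

```python
def degree_based_prune(edges, relevance, threshold, top_n=100):
    """Score only the top_n most influential nodes and drop the irrelevant ones.

    edges     -- list of unique (source, target) links
    relevance -- hypothetical callable: node -> similarity of its content to Q
    threshold -- relevance below this value gets the node pruned
    """
    nodes = {n for edge in edges for n in edge}
    in_deg = dict.fromkeys(nodes, 0)
    out_deg = dict.fromkeys(nodes, 0)
    for q, p in edges:
        out_deg[q] += 1
        in_deg[p] += 1

    def influence(n):
        # Influence measure from the slide.
        return 4 * in_deg[n] + out_deg[n]

    # Only these nodes are fetched and scored; the rest are left untouched.
    candidates = sorted(nodes, key=lambda n: (-influence(n), n))[:top_n]
    pruned = {n for n in candidates if relevance(n) < threshold}
    kept = [(q, p) for q, p in edges if q not in pruned and p not in pruned]
    return kept, pruned
```

The point of the design is cost: content is analyzed for a bounded number of nodes, chosen because they are the ones most likely to affect the final ranking.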
Evaluation
[Table: Average Precision at Top 5 and 10 ranked authority documents. Query sets: All, Rare, Popular. Cutoffs: At 5, At 10. Algorithms: base, imp, med, start, max, pca0, pca1, grouped as Without Regulation, With Regulation, and Partial. Annotations: 26%, 36%.]
[Table: Average Precision at Top 5 and 10 ranked hub documents. Query sets: All, Rare, Popular. Cutoffs: At 5, At 10. Algorithms: base, imp, med, start, max, pca0, pca1, grouped as Without Regulation, With Regulation, and Partial. Annotations: 23%, 33%.]
Finding Related Pages in the WWW
Appeared at the 8th WWW Conference.
Definition: A related web page is one that addresses the same topic as the original page. For example, www.washingtonpost.com is a page related to www.nytimes.com.
Algorithms:
  Companion algorithm: derived from HITS.
  Cocitation algorithm: finds pages that are frequently cocited with the input URL u.
Evaluation: The two proposed algorithms are 73% and 51% better than Netscape’s “What’s Related”.
Companion Algorithm
Takes as input a URL u and consists of four steps:
  1. Build a vicinity graph for u.
  2. Contract duplicates and near-duplicates in this graph.
  3. Compute edge weights based on host-to-host connections.
  4. Compute hub and authority scores.
Cocitation Algorithm
Degree of co-citation: the number of common parents of two nodes.
[Figure: Sibling Set, u]
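The degree of co-citation with u can be computed directly from the link list, as in the sketch below. The function name `cocitation_degrees` and the toy graph are hypothetical, and links are assumed to be unique (parent, child) pairs; the slide gives no details on how many parents or siblings are considered, so none are assumed here.

```python
from collections import Counter


def cocitation_degrees(edges, u):
    """Map each sibling of u to the number of parents it shares with u."""
    parents = {q for q, p in edges if p == u}
    degree = Counter()
    for q, p in edges:
        # Every other child of a parent of u is a sibling of u.
        if q in parents and p != u:
            degree[p] += 1
    return degree


# Hypothetical toy graph: p1 and p2 link to u; p3 does not.
links = [("p1", "u"), ("p1", "s1"), ("p2", "u"),
         ("p2", "s1"), ("p2", "s2"), ("p3", "s2")]
ranked = cocitation_degrees(links, "u").most_common()
```

Here s1 shares two parents with u and s2 shares one, so s1 ranks first; the link from p3 is ignored because p3 is not a parent of u.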