Relevance Propagation for Web Search
Dr. Tie-Yan Liu, Web Search and Mining Group, Microsoft Research Asia
Joint work with Tao Qin, Tsinghua University
DCWC
Outline
- Introduction
- Generic framework for relevance propagation
- Evaluations
  - Effectiveness analysis
  - Complexity analysis
- Conclusions
Introduction
- Web search ≠ information retrieval
  - Besides content relevance, various kinds of structure information also play an important role in Web search:
    - Hyperlink graph
    - Local sitemap
    - Webpage layout
Introduction
- Three ways of utilizing structure information for Web search
  - Linear combination of content relevance and importance scores computed from the hyperlink graph
    - β∙Relevance + (1-β)∙PageRank
  - Enhancing link analysis with the help of content relevance
    - Query-dependent link graph in HITS
    - Topic-sensitive PageRank
  - Propagating content relevance along the Web structure
    - The use of anchor text in search engines
    - Hyperlink-based relevance score propagation (TREC 2003)
    - Sitemap-based feature propagation (TREC 2004)
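The linear combination β∙Relevance + (1-β)∙PageRank can be illustrated with a minimal sketch. The function names, the sample pages, and the assumption that both scores are already normalized to comparable ranges are hypothetical and not from the talk.

```python
def combine_scores(relevance, pagerank, beta):
    """Return beta * relevance + (1 - beta) * pagerank."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * relevance + (1.0 - beta) * pagerank


def rank_pages(pages, beta):
    """Sort (page_id, relevance, pagerank) triples by combined score, best first."""
    scored = [(pid, combine_scores(rel, pr, beta)) for pid, rel, pr in pages]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical pages: (id, content relevance, PageRank), both assumed to be in [0, 1].
pages = [("a", 0.9, 0.1), ("b", 0.5, 0.8), ("c", 0.2, 0.3)]
ranking = rank_pages(pages, beta=0.5)
```

With β = 1 the ranking depends on content relevance alone; lowering β gives more weight to the link-based importance score.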
Hyperlink-based Relevance Score Propagation (Zhai et al., TREC 2003)
- Assumption
  - Hyperlinked pages have correlated content
- [Figure: a page with its in-links and out-links]
Hyperlink-based Relevance Score Propagation (Zhai et al., TREC 2003)
- Assumption
  - Hyperlinked pages have correlated content
- Propagation model
  - Weighted in-link model
  - Weighted out-link model
  - Uniform out-link model
- Terms of the propagation formula: original relevance score, propagation from the in-links, propagation from the out-links
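As a rough illustration of the weighted in-link case, the sketch below iterates an assumed update of the form h(p) ← α∙s(p) + (1-α)∙Σ w(q, p)∙h(q) over the in-links q → p. The exact formula shown on the slide is not available here, so this update rule, the parameter α, the link weights, and all names are assumptions for illustration only.

```python
def propagate_inlink_scores(scores, inlinks, weights, alpha, iterations):
    """Iterate h(p) <- alpha * s(p) + (1 - alpha) * sum_q w(q, p) * h(q).

    scores:  original relevance score s(p) for each page in the working set
    inlinks: page -> list of pages linking to it
    weights: (source, target) -> link weight (hypothetical weighting scheme)
    """
    current = dict(scores)
    for _ in range(iterations):
        updated = {}
        for page, original in scores.items():
            incoming = sum(
                weights.get((src, page), 0.0) * current[src]
                for src in inlinks.get(page, [])
                if src in current  # stay inside the working set
            )
            updated[page] = alpha * original + (1.0 - alpha) * incoming
        current = updated
    return current


# Hypothetical two-page working set: "a" links to "b".
result = propagate_inlink_scores(
    scores={"a": 1.0, "b": 0.0},
    inlinks={"b": ["a"]},
    weights={("a", "b"): 1.0},
    alpha=0.5,
    iterations=1,
)
```

Under this assumed rule, page "b" has no content relevance of its own but receives part of the score of "a" through the link. Each iteration visits every in-link in the working set once.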
Sitemap-based Feature Propagation (Liu and Qin, TREC 2004)
- Assumption
  - Child pages are extensions of their parent page
  - The contribution of the child pages should be considered when computing the relevance of the parent page to a query
- Propagation model
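The idea that child pages contribute to their parent can be sketched at the feature level as follows. The combination rule used here (the parent's term frequency plus α times the average term frequency of its children) is an assumption chosen for illustration, as the slide's formula is not available; the function name and sample data are likewise hypothetical.

```python
def propagate_features_to_parent(parent_tf, children_tf, alpha):
    """Blend a parent's per-term frequencies with the average of its children's.

    Assumed rule (not from the talk): f'(p) = f(p) + alpha * mean over children of f(c).
    """
    if not children_tf:
        return dict(parent_tf)
    terms = set(parent_tf)
    for child in children_tf:
        terms.update(child)
    result = {}
    for term in terms:
        child_mean = sum(child.get(term, 0.0) for child in children_tf) / len(children_tf)
        result[term] = parent_tf.get(term, 0.0) + alpha * child_mean
    return result


# Hypothetical term frequencies for a parent page and its two children.
parent = {"search": 2.0}
children = [{"search": 4.0, "web": 2.0}, {"web": 2.0}]
propagated = propagate_features_to_parent(parent, children, alpha=0.5)
```

The propagated features would then be fed to the base ranking function in place of the parent's original features. In a query-time setting only the query terms would need to be processed.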
Generic Relevance Propagation Framework
- Modification of the sitemap-based feature propagation model
- Reminder of the hyperlink-based propagation model
- A generic framework to cover both hyperlink-based and sitemap-based propagation
More Derived Propagation Models
          | Score level                             | Feature level
Hyperlink | Hyperlink-based score propagation model | Hyperlink-based feature propagation model (weighted in-link, weighted out-link, and uniform out-link models)
Sitemap   | Sitemap-based score propagation model   | Sitemap-based feature propagation model
Summary: All Models Covered by the Generic Framework
Algorithm | Abbreviation
Weighted in-link case of the hyperlink-based score propagation model | HS-WI
Weighted out-link case of the hyperlink-based score propagation model | HS-WO
Uniform out-link case of the hyperlink-based score propagation model | HS-UO
Weighted in-link case of the hyperlink-based feature propagation model | HF-WI
Weighted out-link case of the hyperlink-based feature propagation model | HF-WO
Uniform out-link case of the hyperlink-based feature propagation model | HF-UO
Sitemap-based score propagation model | SS
Sitemap-based feature propagation model | SF
Benchmark Datasets
- Corpora
  - .GOV: 1M pages
    - Queries: TD 2003, 2004
  - MSN: 2M pages
    - Queries: the 100 most popular queries from the MSN query log
- Base ranking function
  - BM2500
Experimental Results (1) TREC 2003
Experimental Results (2) TREC 2004
Experimental Results (3) MSN
Conclusions on Effectiveness
- In general, relevance propagation can boost search performance with proper parameter settings.
- The sitemap-based models are more effective than the hyperlink-based models.
  - Hyperlinks ≠ content correlation, whereas pages in the same sub-site usually discuss correlated topics.
- Detailed comparisons
  - The two sitemap-based models have similar performance.
  - Among the hyperlink-based models, the HF-WI model performs best.
Online Complexity
- Notation: w is the size of the working set, q is the number of query terms, l is the average number of in-links/out-links, and t is the number of iterations.
- For the SS model, the complexity is O(w).
  - The SS model needs to propagate the relevance score of a page to its parent only once if the propagation proceeds bottom-up from the leaf nodes.
- For the SF model, the complexity is O(qw).
- For the HS models, the complexity is O(twl).
  - In each of the t iterations, the relevance score of a page is propagated along its in-links or out-links within the sub-graph of the working set.
- For the HF models, the complexity is O(tqwl).
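The O(w) bound for the SS model follows from visiting the sitemap tree bottom-up, so that each page passes its score to its parent exactly once. The sketch below counts those propagation steps. The combination rule (a page's own score plus α times the mean of its children's propagated scores) and all names are assumptions for illustration, not the formula from the talk.

```python
def propagate_scores_bottom_up(root, children, scores, alpha):
    """Return (propagated scores, number of child-to-parent propagations)."""
    # Pre-order traversal; reversing it guarantees children are handled before parents.
    order, stack = [], [root]
    while stack:
        page = stack.pop()
        order.append(page)
        stack.extend(children.get(page, []))

    propagated, steps = {}, 0
    for page in reversed(order):
        kids = children.get(page, [])
        total = 0.0
        for kid in kids:
            total += propagated[kid]
            steps += 1  # each page is propagated to its parent exactly once
        mean = total / len(kids) if kids else 0.0
        propagated[page] = scores[page] + alpha * mean  # assumed combination rule
    return propagated, steps


# Hypothetical sitemap with w = 4 pages.
sitemap = {"root": ["a", "b"], "a": ["a1"]}
base = {"root": 1.0, "a": 0.5, "b": 0.5, "a1": 1.0}
final_scores, steps = propagate_scores_bottom_up("root", sitemap, base, alpha=0.5)
```

For a tree of w pages the step count is w - 1, independent of the tree's depth, which matches the linear bound. The hyperlink-based models have no such ordering, so they repeat the propagation over all links for t iterations.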
Online Complexity
Algorithm | Complexity
HS-WI | O(twl)
HS-WO | O(twl)
HS-UO | O(twl)
HF-WI | O(tqwl)
HF-WO | O(tqwl)
HF-UO | O(tqwl)
SS | O(w)
SF | O(qw)
Measured columns on the slide (values not preserved): average w, average l, average t, average q, CPU time.
- The sitemap-based models are more efficient than the hyperlink-based models.
- The score-level propagation models are faster than the feature-level models.
Offline Complexity
- Score-level propagation is very difficult to implement offline.
  - The score can only be computed online w.r.t. the query.
- For feature-level propagation:
  - The time complexity of the SF model for offline implementation is acceptable: 62.2 hours, or 2.6 days, to re-index 8 billion pages.
  - The time complexity of the HF model is out of tolerance: 45 days to re-index 8 billion pages.
  - The SF model is easy to parallelize, while a parallel implementation of the HF model is non-trivial.
Conclusions of this Study
- Generally speaking, relevance propagation can boost the performance of Web information retrieval.
- Sitemap-based propagation models outperform hyperlink-based propagation models in both effectiveness and efficiency.
  - Notably, sitemap-based propagation can be implemented in parallel.
- Score-level propagation and feature-level propagation have similar effectiveness.
  - Although the former is more efficient online, it is not practical for real-world search engines because it cannot be implemented offline.
- Overall, the sitemap-based feature propagation model is the best choice for real search engines.
Thanks!