ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines
Gleb Skobeltsyn, Flavio Junqueira, Vassilis Plachouras, Ricardo Baeza-Yates
The 31st Annual International ACM SIGIR Conference, Singapore, 21 July 2008
Motivation
- Caching is crucial for a Web search engine (WSE) to save resources.
- Results caching:
  - is efficient with real queries,
  - but its hit rate is limited by singletons (queries that occur only once).
- How to increase the hit rate further? Index pruning.
Contents
- ResIn architecture
- Original query stream vs. query stream after the results cache (misses)
- Static pruned index:
  - Term pruning
  - Document pruning
  - A combination of both
- Conclusion
ResIn architecture
- We study results caching and index pruning together, to reduce latency and load on back-end servers.
- Query processing, case 1: the query is answered from the main index.
[Diagram: Front, Broker, Back end, Term cache, Main Index; arrows labeled "query", "result", and "Top results".]
ResIn architecture
- Query processing, case 2: the query is answered from the results cache (hit), or forwarded on a miss.
[Diagram: Front, Results cache, Broker, Back end, Term cache, Main Index; arrows labeled "query", "result", "hit", and "miss".]
ResIn architecture
- Query processing, case 3: a query that misses the results cache goes to the pruned index, which answers it (hit) or forwards it (miss).
[Diagram: Front, Results cache, Pruned index, Broker, Back end, Term cache, Main Index; arrows labeled "query", "result", "hit", and "miss".]
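The three query-processing cases on the architecture slides can be sketched as a lookup chain. This is a minimal illustration, not the authors' implementation: `process_query`, `pruned_lookup`, and `main_lookup` are hypothetical names, the results cache is modeled as a plain unbounded dictionary, and storing every answer back into the cache is an assumption of this sketch.

```python
from typing import Callable, Dict, List, Optional, Tuple

Results = List[str]


def process_query(
    query: str,
    results_cache: Dict[str, Results],
    pruned_lookup: Callable[[str], Optional[Results]],
    main_lookup: Callable[[str], Results],
) -> Tuple[Results, str]:
    """Return (results, source), where source names the component that answered."""
    # Case 2 on the slides: the results cache answers repeated queries.
    cached = results_cache.get(query)
    if cached is not None:
        return cached, "results_cache"

    # Case 3: the pruned index answers only when it can return the same
    # top-k as the main index; None stands for a pruned-index miss.
    pruned = pruned_lookup(query)
    if pruned is not None:
        results_cache[query] = pruned  # assumption: answers are cached
        return pruned, "pruned_index"

    # Case 1: fall back to the main index on the back end.
    full = main_lookup(query)
    results_cache[query] = full  # assumption: answers are cached
    return full, "main_index"
```

Each later stage is reached only when the earlier one misses, which is why the pruning slides evaluate the pruned index on the miss stream rather than on the original query stream.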
Original query stream (all queries) vs. query stream after the results cache (misses)
All queries vs. Misses: Experimental setup
- A real query log is used to test the results cache (LRU) and to generate a "miss-log".
- Original query log (all queries, 185M queries):
  Q1: britney spears (miss)
  Q2: sigir 2007 (miss)
  Q3: britney spears (hit)
  Q4: sigir 2008 (miss)
  …
  Q185'000'000: last query
- "Miss-log" (misses):
  Q1: britney spears
  Q2: sigir 2007
  Q4: sigir 2008
  …
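The setup above can be sketched as replaying a query log through an LRU cache and recording every miss. This is an illustrative sketch only: `generate_miss_log` and the `capacity` parameter are hypothetical, the cache is keyed on the exact query string, and the slides do not state the cache capacity used in the experiments.

```python
from collections import OrderedDict
from typing import Iterable, List


def generate_miss_log(queries: Iterable[str], capacity: int) -> List[str]:
    """Replay queries through an LRU results cache and return the misses in order."""
    cache: "OrderedDict[str, None]" = OrderedDict()
    misses: List[str] = []
    for query in queries:
        if query in cache:
            # Hit: mark the entry as most recently used.
            cache.move_to_end(query)
            continue
        # Miss: record it, then admit the query into the cache.
        misses.append(query)
        cache[query] = None
        if len(cache) > capacity:
            # Evict the least recently used entry.
            cache.popitem(last=False)
    return misses


log = ["britney spears", "sigir 2007", "britney spears", "sigir 2008"]
miss_log = generate_miss_log(log, capacity=10)
# miss_log == ["britney spears", "sigir 2007", "sigir 2008"]
```

With the four example queries from the slide, the repeated "britney spears" is the only hit, so it is the only query missing from the miss-log.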
All queries vs. Misses: Number of terms in a query
- Average number of terms: 2.4 for all queries, 3.2 for misses.
- Most single-term queries are hits in the results cache.
- Queries with many terms are unlikely to be hits.
All queries vs. Misses: Query result size distribution
- 2000 queries were randomly selected from all queries and from misses.
- The average result size for misses is ~100 times smaller than for all queries.
- Approximately half of the misses return fewer than 5000 results, which is small.
- Similar results hold with a "small" UK document collection (78M).
All queries vs. Misses: Term popularity distribution
- Each point is the average popularity of 1000 consecutive terms.
- The order of terms for misses is the same as for all queries.
- Terms that were popular before the results cache remain popular after it.
- Log sizes: 185M (all queries), 41M (misses).
Static index pruning
Static pruned index
- A smaller version of the main index that returns either the same top-k response as the main index, or a miss.
- Assumes Boolean query processing.
- Types of pruning:
  - Term pruning: full posting lists for selected terms
  - Document pruning: truncated posting lists
  - Term+Document (T+D) pruning: a combination of both
[Figure: posting lists of terms t1 to t4 under four variants: Full index, Term pruning, Document pruning, T+D pruning.]
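The three pruning types listed above can be illustrated on a toy inverted index. This is a sketch under stated assumptions: the index is a dictionary from term to a posting list of document ids already sorted best-first, and the function names, `kept_terms`, and `max_len` are hypothetical. How the terms and the cut length are chosen is not part of this sketch.

```python
from typing import Dict, List, Set

# term -> posting list of doc ids, assumed sorted best-scored first
Index = Dict[str, List[str]]


def term_prune(index: Index, kept_terms: Set[str]) -> Index:
    """Keep full posting lists, but only for the selected terms."""
    return {t: list(pl) for t, pl in index.items() if t in kept_terms}


def document_prune(index: Index, max_len: int) -> Index:
    """Keep every term, but truncate each posting list to its top entries."""
    return {t: pl[:max_len] for t, pl in index.items()}


def term_document_prune(index: Index, kept_terms: Set[str], max_len: int) -> Index:
    """Combine both: selected terms only, each with a truncated posting list."""
    return document_prune(term_prune(index, kept_terms), max_len)


full_index = {
    "t1": ["D1", "D5", "D3", "D2", "D4"],
    "t2": ["D2", "D1", "D5"],
    "t3": ["D4", "D1", "D2", "D3"],
}
td_index = term_document_prune(full_index, {"t1", "t3"}, max_len=2)
# td_index == {"t1": ["D1", "D5"], "t3": ["D4", "D1"]}
```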
Term Pruning: Performance
- Term pruning is based on profit(t) = popularity(t) / df(t).
- The pruned index answers a query if all query terms are in the pruned index.
- It performs well for all queries.
- It performs well for misses too: e.g., it can process almost 50% of the queries with 25% of the index.
[Figure: results for the UK document collection, 78M documents.]
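The profit-based selection and the "all terms present" test can be sketched as follows. The slide gives only the profit formula and the answering condition; the greedy fill of a size budget measured in postings is an assumption of this sketch, and `select_terms`, `can_answer`, and `budget` are hypothetical names. Every term is assumed to have df greater than zero.

```python
from typing import Dict, Iterable, Set


def select_terms(popularity: Dict[str, int], df: Dict[str, int], budget: int) -> Set[str]:
    """Greedily keep terms by profit(t) = popularity(t) / df(t) within a postings budget."""
    ranked = sorted(popularity, key=lambda t: popularity[t] / df[t], reverse=True)
    kept: Set[str] = set()
    used = 0
    for term in ranked:
        # A term costs df(term) postings because its full list is kept.
        if budget - used >= df[term]:
            kept.add(term)
            used += df[term]
    return kept


def can_answer(query_terms: Iterable[str], kept: Set[str]) -> bool:
    """The term-pruned index answers a query only if it holds every query term."""
    return all(term in kept for term in query_terms)


# Hypothetical query-log popularity and document frequencies.
popularity = {"a": 10, "b": 5, "c": 1}
df = {"a": 100, "b": 10, "c": 50}
kept = select_terms(popularity, df, budget=110)
# kept == {"a", "b"}; can_answer(["a", "b"], kept) is True,
# can_answer(["a", "c"], kept) is False
```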
Result Caching + Term Pruning
- Results caching performance is independent of the collection size.
- Results cache capacity is up to 10% of the full index size.
Term pruning: Frequent terms in misses
- MinDF (df of the least frequent query term) correlates with the result size.
- MaxDF (df of the most frequent query term) is high for most of the misses.
- Many misses contain at least one frequent term, so the term-pruned index has to include large posting lists.
[Figure: posting lists of different lengths for the example query terms Gleb, Flavio, Vassilis, Ricardo, with MinDF and MaxDF marked.]
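MinDF and MaxDF as defined above are simple to compute per query. The sketch below uses a hypothetical helper `min_max_df` and made-up document frequencies. Under Boolean AND processing, the result set cannot be larger than MinDF, while a term-pruned index must store the full posting list of the MaxDF term to answer the query.

```python
from typing import Dict, Iterable, Tuple


def min_max_df(query_terms: Iterable[str], df: Dict[str, int]) -> Tuple[int, int]:
    """Return (MinDF, MaxDF): df of the least and of the most frequent query term."""
    dfs = [df[term] for term in query_terms]
    return min(dfs), max(dfs)


# Hypothetical document frequencies for two of the example terms.
example_df = {"gleb": 1_200, "ricardo": 900_000}
min_df, max_df = min_max_df(["gleb", "ricardo"], example_df)
# min_df == 1200, max_df == 900000
```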
Document pruning
- Based on Fagin's top-k intersection algorithm.
- Keeps only postings with high scores, which is sufficient to compute the top-k results for some queries.
- Determining the correctness of the result requires computing a scoring threshold, which adds latency.
- Example (posting lists sorted by score):
  t1: D1 D5 D3 D2 D4 …
  t2: D2 D1 D5 …
  t3: D4 D1 D2 D3 …
  Top-2 results: D1, D2
  Score threshold: s(D2,t1) + s(D1,t2) + s(D2,t3)
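The correctness check can be sketched for additive scores as below. The slide states only the threshold (the sum of the last kept score in each list); this sketch makes a further, more conservative assumption of its own: it also bounds documents that appear in some pruned lists but not in others, since their hidden postings could score as high as the cutoff. All names and scores are hypothetical, posting lists are assumed non-empty and sorted by descending score, and k is assumed to be at least 1.

```python
from typing import Dict, List, Optional, Tuple

Posting = Tuple[str, float]  # (doc_id, per-term score)


def top_k_from_pruned(
    lists: List[List[Posting]], truncated: List[bool], k: int
) -> Optional[List[str]]:
    """Return the top-k doc ids if provably correct, else None (a miss)."""
    maps = [dict(pl) for pl in lists]
    cutoffs = [pl[-1][1] for pl in lists]  # score of the last kept posting
    complete: Dict[str, float] = {}
    hidden_bound = float("-inf")  # best possible score of any unverified doc
    for doc in set().union(*maps):
        bound, in_all, possible = 0.0, True, True
        for m, cutoff, was_cut in zip(maps, cutoffs, truncated):
            if doc in m:
                bound += m[doc]
            elif was_cut:
                bound += cutoff  # a hidden posting scores at most the cutoff
                in_all = False
            else:
                possible = False  # the list is complete, so doc lacks the term
                break
        if not possible:
            continue
        if in_all:
            complete[doc] = bound
        else:
            hidden_bound = max(hidden_bound, bound)
    if lists and all(truncated):
        # Docs seen in no pruned list: bounded by the slide's score threshold.
        hidden_bound = max(hidden_bound, sum(cutoffs))
    ranked = sorted(complete.items(), key=lambda item: (-item[1], item[0]))[:k]
    if hidden_bound == float("-inf"):
        return [doc for doc, _ in ranked]  # nothing can be hidden
    if len(ranked) == k and ranked[-1][1] >= hidden_bound:
        return [doc for doc, _ in ranked]
    return None


# Pruned lists cut at the postings named in the slide's threshold; scores are made up.
t1 = [("D1", 9.0), ("D5", 4.5), ("D3", 4.25), ("D2", 4.0)]
t2 = [("D2", 9.0), ("D1", 8.0)]
t3 = [("D4", 5.0), ("D1", 4.75), ("D2", 4.5)]
top2 = top_k_from_pruned([t1, t2, t3], [True, True, True], k=2)
# top2 == ["D1", "D2"]
```

With these made-up scores the top-2 is confirmed, while asking for the top-3 returns a miss because only two documents appear in all three pruned lists.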
Document pruning: Experimental setup
- Scoring function:
  - pr(d): query-independent score of the document d (pagerank)
  - ω, k: normalization constants; ω = [0, 10, 20], k = 1
- We try different values of PLLmax (maximum posting list length) and choose the one that maximizes the hit rate.
- We only look at the upper bound for the hit rate: are the original top-10 results found in the top portions of all posting lists (PLs)?
Document pruning: performance
- Document pruning needs high pagerank weights.
- It performs better for all queries than for misses.
Term+Document pruning: performance
- T+D pruning is the best, but expensive (high latency).
- profit2 is better than profit1.
- The improvement is marginal for misses unless the pagerank weight is very high.
Conclusions
- Results caching: delivers good hit rates with a constant capacity, but the hit rate is limited because of singletons.
- Index pruning: has no limit on hit rate, but the pruned index size grows with the document collection, so it is more expensive.
- Static index pruning: an addition to results caching, not a replacement.
  - Term pruning performs well for misses too, so it is "compatible" with the results cache.
  - Document pruning: OK for all queries; for misses, only with high pagerank weights.
  - Term+Document pruning slightly improves over document pruning.
- Lesson learned: it is important to consider the interaction between the components.
Thank you
Questions?