May 30, 2016
Department of Computer Sciences, UT Austin

Using Bloom Filters to Refine Web Search Results

Navendu Jain, Mike Dahlin (University of Texas at Austin)
Renu Tewari (IBM Almaden Research Center)

Motivation
- Google, Yahoo, MSN: a significant fraction of the top search results are near-duplicates
- Google “emacs manual” query: 7 of 20 results redundant
  - 3 identical pairs
  - 4 similar to one document
- Similar results for the Yahoo, MSN, and A9 search engines
  chapter/emacs_toc.html
- 29.2% of data common across 150 million pages (Fetterly’03, Broder’97)

Problem Statement
- Goal: filter near-duplicates in web search results
- Given a query's search results, identify pages that are either
  - highly similar in content (and link structure), or
  - contained in another page (inclusions with small changes)
- Key constraints
  - Low space overhead: use only a small amount of information per document
  - Low time overhead (latency unnoticeable to the end user): perform fast comparison and matching of documents

Our Contributions
- A novel similarity detection technique that uses content-defined chunking and Bloom filters to refine web search results
- Satisfies key requirements
  - Compact representation: incurs only about 0.4% extra bytes per document
  - Quick matching: 66 ms for the top-80 search results; document similarity is computed with a bit-wise AND of the documents' feature representations
  - Easy deployment: attached as a filter over any search engine’s result set

Talk Outline
- Motivation
- Our approach
  - System Overview
  - Bloom Filters for Similarity Testing
- Experimental Evaluation
- Related Work and Conclusions

System Overview
Applying similarity detection to search engines
- Crawl time: the web crawler
  - Step 1: fetches a page and indexes it
  - Step 2: computes and stores per-page features
- Search time: the search engine (or the end user’s browser)
  - Step 1: retrieves the top results’ meta-data for a given query
  - Step 2: runs similarity testing to filter highly similar results
[Figure: Documents -> Features (feature set: small space, low complexity) -> Similarity Testing (fast approximate comparison) -> "> similarity threshold"]

Feature Extraction and Similarity Testing (1): Content-defined Chunking
- Divide a file into variable-sized blocks (called chunks)
- Use Rabin fingerprints to compute block boundaries
- Use the SHA-1 hash of each chunk as its feature representation
[Figure: an original document and a modified document with data inserted, each split into chunks at markers; pipeline Documents -> Features (feature set: small space, low complexity) -> fast approximate comparison -> "> similarity threshold"]

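The chunking step on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: a simple polynomial rolling hash stands in for a true Rabin fingerprint, and the window size, boundary mask, and the function names chunk_boundaries and chunk_features are all assumptions made for the example.

```python
import hashlib

WINDOW = 16           # bytes in the rolling window (assumed value)
MASK = (1 << 6) - 1   # cut when the low 6 bits are zero, ~64-byte chunks on average (assumed)
BASE = 257            # polynomial base (assumed); stands in for a Rabin fingerprint
MOD = (1 << 61) - 1   # large prime modulus (assumed)


def chunk_boundaries(data: bytes) -> list[int]:
    """Return chunk end offsets chosen from content, not from fixed positions."""
    if not data:
        return []
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    cuts = []
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Slide the window: drop the oldest byte's contribution.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        # A boundary depends only on the last WINDOW bytes.
        if i + 1 >= WINDOW and (h & MASK) == 0:
            cuts.append(i + 1)
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))  # the final chunk always ends at end of data
    return cuts


def chunk_features(data: bytes) -> set[str]:
    """SHA-1 digest of every chunk, used as the document's feature set."""
    features = set()
    start = 0
    for end in chunk_boundaries(data):
        features.add(hashlib.sha1(data[start:end]).hexdigest())
        start = end
    return features
```

Because each boundary depends only on the bytes inside the window, inserting data in the middle of a document changes the chunks around the insertion point while chunks further away keep the same SHA-1 features.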
Feature Extraction and Similarity Testing (2): Bloom filter generation
- A Bloom filter is an approximate set representation
  - An array of m bits (initially 0)
  - k independent hash functions
  - Supports Insert(x, S) and Member(y, S)
[Figure: Document -> content-defined chunking at markers -> SHA of each chunk (x, y, …) -> Insert(x, S), Insert(y, S) -> Bloom filter; Documents -> Features (feature set: small space, low complexity) -> fast approximate comparison -> "> similarity threshold"]

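A minimal sketch of the Bloom filter described on this slide, assuming the m-bit array is stored in a single Python integer and the k hash functions are derived by salting SHA-1 with an index. The class name BloomFilter and that hashing scheme are illustrative choices, not taken from the talk.

```python
import hashlib


class BloomFilter:
    """Approximate set: an m-bit array plus k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = 0  # Python int used as the m-bit array, initially all 0

    def _positions(self, item: bytes):
        # Assumption: k hash functions obtained by prefixing SHA-1 input with an index.
        for i in range(self.k):
            digest = hashlib.sha1(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: bytes) -> None:
        """Insert(x, S): set the k bits selected by the hash functions."""
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def member(self, item: bytes) -> bool:
        """Member(y, S): True if all k bits are set (false positives are possible)."""
        return all((self.bits >> pos) & 1 for pos in self._positions(item))
```

In the setting of the talk, the items inserted would be the SHA-1 chunk hashes of one document, giving one filter per document.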
Feature Extraction and Similarity Testing (3): Similarity Testing
- Take the bit-wise AND of the Bloom filters of documents A and B: A /\ B
- Similarity of A to B: % of A’s set bits matched
[Figure: Document -> content-defined chunking at markers -> SHA -> Bloom filter generation -> bit-wise AND of A and B -> similarity testing; Documents -> Features (feature set: small space, low complexity) -> fast approximate comparison -> "> similarity threshold"]

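The similarity test on this slide can be sketched as below, assuming each Bloom filter is held as a Python integer bit vector. The greedy filter_near_duplicates pass is a hypothetical way to apply the threshold to a ranked result list; the talk does not spell out its matching and clustering procedure.

```python
def popcount(bits: int) -> int:
    """Number of set bits in the bit vector."""
    return bin(bits).count("1")


def similarity(a: int, b: int) -> float:
    """Fraction of A's set bits that are also set in B (not symmetric)."""
    set_in_a = popcount(a)
    if set_in_a == 0:
        return 0.0
    return popcount(a & b) / set_in_a


def filter_near_duplicates(filters: list[int], threshold: float) -> list[int]:
    """Hypothetical greedy pass: keep a result only if its similarity to every
    higher-ranked result already kept does not exceed the threshold."""
    kept: list[int] = []
    for i, candidate in enumerate(filters):
        if all(similarity(candidate, filters[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


a = 0b10110110   # 5 set bits
b = 0b10010011   # 4 set bits
print(similarity(a, b))  # 3 of A's 5 set bits match -> 0.6
print(similarity(b, a))  # 3 of B's 4 set bits match -> 0.75
```

The measure is asymmetric by construction, which is what lets it flag a page that is contained in a larger one: most of the smaller page's bits match even when the larger page has many additional bits set.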
Proof-of-concept examples: Differentiate between multiple similar documents
- IBM site dataset: 20 MB (590 documents)
  - /investor/corpgoverance/index.phtml compared with all pages
  - Similar pages (same base URL): cgcoi.phtml (53%), cgblaws.phtml (69%)
- CVS repository dataset
  - Technical documentation file (17 KB)
  - Extracted 20 consecutive versions from the CVS
  - foo: original document; foo.1: first modified version; foo.19: last version

Evaluation (1): Effect of degree of similarity
- “emacs manual” query on Google: 493 results retrieved using GoogleAPI
- Fraction of duplicates: 88% (at 50% similarity), 31% (at 90% similarity)
- Larger aliasing of higher-ranked documents: the initial result set is repeated more frequently in later results
- Similar results observed for other queries
[Figure: percentage of duplicate documents vs. number of top search results retrieved; legend: % similar, 50% similar, 60% similar, 80% similar, 90% similar]

Evaluation (2): Effect of Search Query Popularity
- Queries: Google Zeitgeist (Nov. 2004)
- High correlation between the occurrence of near-duplicates and search query popularity
  - The number of near-duplicates increases with query popularity
  - Most-popular: 24%-44%; Medium-popular: 16%-28%; Random: 18%
[Figure: percentage of duplicate documents vs. number of top search results retrieved; panels (a) Most-popular, (b) Medium-popular, (c) Random; queries shown: "republican national convention", "national hurricane center", "indian larry", "jon stewart crossfire", "electoral college", "day of the dead", "Olympics 2004 doping", "hawking black hole bet", "x prize spaceship"]

Evaluation (3): Analyzing Response Times
Top-80 search results for the “emacs manual” query
- Offline computation time (pre-computed and stored)
  - CDC chunks: 80 * 0.3 ms
  - Bloom filter generation: 80 * 14 ms
- Online matching time
  - Bit-wise AND of two Bloom filters: 4 µs
  - Matching and clustering time: 66 ms
- Total (offline + online): 80 * 0.3 ms + 80 * 14 ms + 66 ms = 1210 ms
- Online time: 66 ms

Selected Related Work
- Most prior work is based on shingling (many variants)
- Basic idea (Broder’97)
  - Divide the document into k-shingles: all sequences of k consecutive words/tokens
  - Represent the document by its shingle-set
  - Large shingle-set intersection => near-duplicate documents
  - Reduces the similarity detection problem to set intersection
- How shingling differs from our technique
  - Document similarity is based on feature-set intersection
  - Higher feature-set computational overhead
  - Feature-set size depends on sampling (Min s, Mod m, etc.)

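For comparison, the shingling idea summarized above can be sketched as follows. Whitespace tokenization and the function names shingles and resemblance are assumptions for illustration; the score is the standard set-overlap ratio |A ∩ B| / |A ∪ B|, and the sampling variants (Min s, Mod m) are not shown.

```python
def shingles(text: str, k: int) -> set[tuple[str, ...]]:
    """All runs of k consecutive whitespace-separated tokens."""
    tokens = text.split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def resemblance(a: set, b: set) -> float:
    """Size of the shingle-set intersection relative to the union."""
    union = a | b
    if not union:
        return 0.0  # no shingles on either side; treated as not similar here
    return len(a & b) / len(union)


doc_a = shingles("the quick brown fox jumps", 3)
doc_b = shingles("the quick brown fox sleeps", 3)
print(resemblance(doc_a, doc_b))  # 2 shared shingles out of 4 distinct -> 0.5
```

Unlike the bit-wise AND of two fixed-size Bloom filters, this comparison works on the shingle sets themselves, so its cost grows with the number of shingles kept per document.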
Conclusions
- Problem: highly similar matches in search results
  - Popular search engines (Google, Yahoo, MSN) return a significant fraction of near-duplicates in their top results
  - Adversely affects query search performance
- Our solution: a similarity detection technique using CDC and Bloom filters
  - Incurs small meta-data overhead: about 0.4% extra bytes per document
  - Performs fast similarity detection: bit-wise AND operations, on the order of ms
  - Easily deployed as a filter over any search engine’s results

For more information: