Published by Tracy Dickerson. Modified over 8 years ago.
1
May 30, 2016. Department of Computer Sciences, UT Austin
Using Bloom Filters to Refine Web Search Results
Navendu Jain, Mike Dahlin (University of Texas at Austin)
Renu Tewari (IBM Almaden Research Center)
2
Motivation
- Google, Yahoo, MSN: significant fraction of near-duplicates in top search results
- Google “emacs manual” query: 7 of 20 results redundant
  - 3 identical pairs
  - 4 similar to one document
- Similar results for Yahoo, MSN, A9 search engines
  - www.delorie.com/gnu/docs/emacs/emacs_toc.html
  - www.cs.utah.edu/dept/old/texinfo/emacs19/emacs_toc.html
  - www.dc.turkuamk.fi/docs/gnu/emacs/emacs_toc.html
  - www.linuxselfhelp.com/gnu/emacs/html_chapter/emacs_toc.html
- 29.2% of data common across 150 million pages (Fetterly’03, Broder’97)
3
Problem Statement
Goal: Filter near-duplicates in web search results
- Given the search results for a query, identify pages that are either
  - Highly similar in content (and link structure)
  - Contained in another page (inclusions with small changes)
Key Constraints
- Low space overhead: use only a small amount of information per document
- Low time overhead (latency unnoticeable to the end user): perform fast comparison and matching of documents
4
Our Contributions
- A novel similarity detection technique using content-defined chunking and Bloom filters to refine web search results
- Satisfies key requirements
  - Compact representation: incurs only about 0.4% extra bytes per document
  - Quick matching: 66 ms for top-80 search results; document similarity using bit-wise AND of their feature representations
  - Easy deployment: attached as a filter over any search engine’s result set
5
Talk Outline
- Motivation
- Our approach
- System Overview
- Bloom Filters for Similarity Testing
- Experimental Evaluation
- Related Work and Conclusions
6
System Overview
Applying similarity detection to search engines
- Crawl time: the web crawler
  - Step 1: fetches a page and indexes it
  - Step 2: computes and stores per-page features
- Search time: the search engine (or end user’s browser)
  - Step 1: retrieves the top results’ meta-data for a given query
  - Step 2: runs similarity testing to filter highly similar results
[Figure: documents are reduced to features (feature set: small space, low complexity); similarity testing is a fast approximate comparison of features against a similarity threshold]
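The search-time filtering step can be sketched as follows. This is only an illustration under assumptions: the greedy keep-or-drop policy, the `filter_results` and `overlap` names, the use of plain Python sets as features, and the 0.7 threshold are all hypothetical and are not taken from the talk.

```python
def overlap(a, b):
    """Fraction of a's features that also appear in b (assumed similarity measure)."""
    return len(a & b) / len(a) if a else 0.0


def filter_results(results, similarity=overlap, threshold=0.7):
    """Keep a result only if it is not too similar to any higher-ranked kept result.

    results: list of (url, feature_set) pairs in rank order.
    """
    kept = []
    for url, feats in results:
        if all(similarity(feats, kept_feats) <= threshold for _, kept_feats in kept):
            kept.append((url, feats))
    return [url for url, _ in kept]


ranked = [
    ("a.example/manual", {1, 2, 3, 4}),
    ("b.example/manual", {1, 2, 3, 5}),  # shares 3 of 4 features with the first result
    ("c.example/other", {7, 8}),
]
print(filter_results(ranked))  # ['a.example/manual', 'c.example/other']
```

Because only stored per-page features are compared, a filter of this shape can run either at the search engine or in the end user's browser.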
7
Feature Extraction and Similarity Testing (1)
Content-defined Chunking
- Divide a file into variable-sized blocks (called chunks)
- Use Rabin fingerprint to compute block boundaries
- SHA-1 hash of each chunk as its feature representation
[Figure: markers split the original document into chunks labeled 1 to 6; after data is inserted, the modified document shows chunks labeled 2', 3, 4, 5, 6]
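A minimal sketch of content-defined chunking in Python, under stated assumptions: a simple polynomial rolling hash stands in for a true Rabin fingerprint, and the window size, boundary mask, and the names `chunks` and `features` are hypothetical choices, not values from the talk.

```python
import hashlib

WINDOW = 16          # bytes in the rolling window (assumed)
MASK = (1 << 6) - 1  # boundary when the low 6 hash bits are zero (assumed)
BASE = 257
MOD = (1 << 31) - 1


def chunks(data: bytes) -> list:
    """Split data at positions where the hash of the last WINDOW bytes matches."""
    out, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + b) % MOD                    # add the newest byte
        if i + 1 >= WINDOW and (h & MASK) == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])  # trailing chunk without a marker
    return out


def features(data: bytes) -> set:
    """SHA-1 digest of each chunk, used as the document's feature set."""
    return {hashlib.sha1(c).hexdigest() for c in chunks(data)}
```

Since each boundary depends only on the bytes inside the window, inserting data near the start of a document changes the chunks around the insertion point and leaves later chunks, and therefore their SHA-1 features, unchanged.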
8
Feature Extraction and Similarity Testing (2)
Bloom filter generation (follows content-defined chunking and SHA-1 hashing)
- A Bloom filter is an approximate set representation
  - An array of m bits (initially 0)
  - k independent hash functions
  - Supports Insert(x, S) and Member(y, S)
[Figure: the bit array starts as 0000000000, reads 0100010001 after Insert(x, S), and reads 1100010101 after Insert(y, S)]
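The Insert and Member operations can be sketched as below. This is an illustrative sketch, not the authors' implementation: the `BloomFilter` class, the default `m` and `k`, and the trick of deriving k hash functions by salting SHA-1 with an index are all assumptions.

```python
import hashlib


class BloomFilter:
    def __init__(self, m: int = 64, k: int = 3):
        self.m = m      # number of bits in the array
        self.k = k      # number of hash functions
        self.bits = 0   # bit array stored as an integer, initially all zero

    def _positions(self, item: bytes):
        # Assumed scheme: simulate k hash functions by salting SHA-1 with i.
        for i in range(self.k):
            digest = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def member(self, item: bytes) -> bool:
        # May return a false positive, never a false negative.
        return all((self.bits >> p) & 1 for p in self._positions(item))


s = BloomFilter()
s.insert(b"chunk-hash-x")
print(s.member(b"chunk-hash-x"))  # True
```

Storing the bit array as a single integer keeps the later bit-wise AND comparison between two filters a one-line operation.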
9
Feature Extraction and Similarity Testing (3)
Similarity Testing (follows content-defined chunking, SHA-1 hashing, and Bloom filter generation)
- Bit-wise AND of the Bloom filters of documents A and B
  - A       = 0100010101
  - B       = 0101010100
  - A AND B = 0100010100
- 75% of A’s set bits matched
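The bit-wise AND comparison can be reproduced with the bit patterns shown on the slide. The function name `similarity` and the choice to normalize by the number of set bits in A are assumptions made for this sketch.

```python
def similarity(a: int, b: int) -> float:
    """Fraction of a's set bits that are also set in b."""
    set_in_a = bin(a).count("1")
    if set_in_a == 0:
        return 0.0
    return bin(a & b).count("1") / set_in_a


A = int("0100010101", 2)  # Bloom filter of document A (from the slide)
B = int("0101010100", 2)  # Bloom filter of document B (from the slide)

print(format(A & B, "010b"))  # 0100010100
print(similarity(A, B))       # 0.75
```

A has four set bits and three of them survive the AND, which gives the 75% figure.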
10
Proof-of-concept examples: Differentiate between multiple similar documents
IBM site (http://www.ibm.com)
- Dataset: 20 MB (590 documents)
- /investor/corpgoverance/index.phtml compared with all pages
- Similar pages (same base URL): cgcoi.phtml (53%), cgblaws.phtml (69%)
CVS Repository Dataset
- Technical documentation file (17 KB)
- Extracted 20 consecutive versions from CVS
  - foo: original document
  - foo.1: first modified version
  - foo.19: last version
11
Talk Outline
- Motivation
- Our approach
- System Overview
- Bloom Filters for Similarity Testing
- Experimental Evaluation
- Related Work and Conclusions
12
Evaluation (1): Effect of degree of similarity
- “emacs manual” query on Google: 493 results retrieved using GoogleAPI
- Fraction of duplicates: 88% (50% similarity), 31% (90% similarity)
- Larger aliasing of higher-ranked documents: the initial result set is repeated more frequently in later results
- Similar results observed for other queries
[Figure: percentage of duplicate documents vs. number of top search results retrieved, with one curve each for 50%, 60%, 70%, 80%, and 90% similarity]
13
Evaluation (2): Effect of Search Query Popularity
- Queries: Google Zeitgeist (Nov. 2004)
- High correlation between occurrence of near-duplicates and search query popularity
- Number of near-duplicates increases with query popularity
  - Most-popular: 24%-44%; medium-popular: 16%-28%; random: 18%
[Figure: three panels, (a) Most-popular, (b) Medium-popular, (c) Random, plotting percentage of duplicate documents vs. number of top search results retrieved. Queries shown: "republican national convention", "national hurricane center", "indian larry", "jon stewart crossfire", "electoral college", "day of the dead", "Olympics 2004 doping", "hawking black hole bet", "x prize spaceship"]
14
Evaluation (3): Analyzing Response Times
Top-80 search results for the “emacs manual” query
- Offline computation time (pre-computed and stored)
  - CDC chunks: 80 * 0.3 ms
  - Bloom filter generation: 80 * 14 ms
- Online matching time
  - Bit-wise AND of two Bloom filters: 4 µs
  - Matching and clustering time: 66 ms
- Total (offline + online): 1210 ms
- Online time: 66 ms
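As a quick check, the total follows from the per-document costs quoted on the slide. The variable names are illustrative, and times are kept in integer microseconds only to avoid floating-point rounding.

```python
RESULTS = 80          # top-80 search results
CHUNK_US = 300        # CDC chunking: 0.3 ms per document
BLOOM_US = 14_000     # Bloom filter generation: 14 ms per document
ONLINE_US = 66_000    # matching and clustering: 66 ms for the whole result set

offline_us = RESULTS * (CHUNK_US + BLOOM_US)  # 80 * 0.3 ms + 80 * 14 ms
total_us = offline_us + ONLINE_US

print(offline_us / 1000, total_us / 1000)  # 1144.0 1210.0
```

The offline share (1144 ms) is paid once when the features are computed and stored, so only the 66 ms online part is added to each query.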
15
Selected Related Work
- Most prior work is based on shingling (many variants)
- Basic idea (Broder’97)
  - Divide the document into k-shingles: all sequences of k consecutive words/tokens
  - Represent the document by its shingle-set
  - A large shingle-set intersection indicates near-duplicate documents
  - Reduces the similarity detection problem to set intersection
- Differences with our technique
  - Document similarity based on feature set intersection
  - Higher feature-set computational overhead
  - Feature set size dependent on sampling (Min s, Mod m, etc.)
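For comparison, word-level shingling can be sketched as follows. The function names, whitespace tokenization, and the default `k` are assumptions; the score is the Jaccard ratio of the two shingle-sets, and the sampling variants mentioned on the slide (Min s, Mod m) are not implemented.

```python
def shingles(text: str, k: int = 4) -> set:
    """All runs of k consecutive whitespace-separated tokens."""
    tokens = text.split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def resemblance(a: str, b: str, k: int = 4) -> float:
    """Size of the shingle-set intersection divided by the size of the union."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


print(len(shingles("a rose is a rose is a rose")))  # 3
print(resemblance("the quick brown fox jumps", "the quick brown fox jumps"))  # 1.0
```

The eight-token example yields five shingles, of which only three are distinct, which is why the set has size 3.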
16
Conclusions
- Problem: highly similar matches in search results
  - Popular search engines (Google, Yahoo, MSN): significant fraction of near-duplicates in top results
  - Adversely affects query search performance
- Our solution: a similarity detection technique using CDC and Bloom filters
  - Incurs small meta-data overhead: 0.4% bytes per document
  - Performs fast similarity detection: bit-wise AND operations, on the order of ms
  - Easily deployed as a filter over any search engine’s results
17
For more information: http://www.cs.utexas.edu/users/nav/