1
SIGMOD 2006, University of Alberta
Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters
Presented by Fan Deng
Joint work with Davood Rafiei
2
Outline
Motivation
The problem
- Approximate duplicate detection
Existing solutions
- Caching
- Bloom filters
Our approach
- Stable Bloom filters
- Results
Related work
Conclusions
3
The Motivating Application
Duplicate URL detection in Web crawling [Broder et al. WWW03]
- Web search engines fetch web pages continuously
- Extract URLs within each downloaded page
- Check each URL (duplicate detection): if never seen before, download it; else skip it
Problem
- Huge number of distinct URLs
- Memory is usually not large enough, and disks are too slow
4
The Motivating Application
Errors are usually acceptable
- A false positive (false alarm)
-- A distinct URL is wrongly reported as a duplicate
-- This URL will not be crawled
- A false negative (miss)
-- A duplicate URL is wrongly reported as distinct
-- This URL will be crawled redundantly or searched for on disk
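The two error types can be made concrete by comparing a detector's reports against exact ground truth. The sketch below is only an illustration: the function name `count_errors` and the convention that `True` means "reported as duplicate" are assumptions, not part of the talk.

```python
def count_errors(stream, reports):
    """Count false positives and false negatives of a duplicate detector.

    stream  -- the elements in arrival order
    reports -- one boolean per element; True means "reported as duplicate"
    """
    seen = set()  # exact ground truth (needs memory for all distinct elements)
    false_positives = 0
    false_negatives = 0
    for element, reported_duplicate in zip(stream, reports):
        is_duplicate = element in seen
        if reported_duplicate and not is_duplicate:
            false_positives += 1  # distinct element wrongly reported as duplicate
        elif is_duplicate and not reported_duplicate:
            false_negatives += 1  # duplicate wrongly reported as distinct
        seen.add(element)
    return false_positives, false_negatives


errors = count_errors(["a", "b", "a", "c"], [False, True, False, False])
```

Here the second report is a false positive ("b" is new) and the third is a false negative ("a" was seen before).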
5
The Problem: Approximate Duplicate Detection
A sequence of elements with order
- e.g. …d g a f b e a d c b a
Storage space M (not large enough to store all distinct elements)
Continual membership query
- Appeared before? Yes or No
Our goal
- Minimize the # of errors
- Fast
6
Existing Solutions – Caching
Store as many distinct elements as possible in a buffer
Duplicate detection process
- Seeing an element, search the buffer
- If found, report “duplicate”; else report “distinct”
Update the buffer using some replacement policy
- LRU, FIFO, Random, …
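The caching approach can be sketched as follows, using LRU as the replacement policy. This is a minimal illustration, not the implementation evaluated in the talk; the class name `LRUDuplicateDetector` and its `capacity` parameter are hypothetical.

```python
from collections import OrderedDict


class LRUDuplicateDetector:
    """Buffer-based duplicate detector with LRU replacement (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Keys are kept in order from least to most recently used.
        self.buffer = OrderedDict()

    def seen(self, element):
        """Return True ("duplicate") if the element is buffered, else False."""
        if element in self.buffer:
            self.buffer.move_to_end(element)  # refresh recency
            return True
        self.buffer[element] = None
        if len(self.buffer) > self.capacity:
            self.buffer.popitem(last=False)  # evict the least recently used
        return False


detector = LRUDuplicateDetector(capacity=2)
results = [detector.seen(x) for x in ["a", "b", "a", "c", "b"]]
```

With a capacity of 2, the final "b" has already been evicted when it reappears, so it is reported as distinct: a false negative.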
7
Existing Solutions – Caching
False negatives
- lead to redundant crawling or searching disks
Need extra space
- to speed up the searching
- to maintain the replacement policy (e.g. LRU)
- space amount proportional to the buffer size
8
Existing Solutions – Bloom Filters
A bitmap, originally all “0”
Duplicate detection process
- Hash each incoming element into some bits
- If any bit is “0” then report “distinct” else “duplicate”
Update process
- Set the corresponding bits to “1”
Example with a 6-bit bitmap (positions 1 to 6, position 1 shown rightmost)
x   h1(x)  h2(x)  bitmap after update
a   1      2      000011
b   1      3      000111
c   2      4      001111
a   1      2      001111
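The detection and update steps of a regular Bloom filter can be sketched as below. The way hash positions are derived (SHA-256 salted with the hash index) and the names `BloomFilterDetector`, `num_bits`, and `num_hashes` are assumptions made for the illustration; the talk does not specify a hash family.

```python
import hashlib


class BloomFilterDetector:
    """Bloom-filter duplicate detector (illustrative sketch)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # bitmap, originally all 0

    def _positions(self, element):
        # Assumption: salt SHA-256 with the hash index to get several hashes.
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{element}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def seen(self, element):
        """Report "duplicate" (True) only if every hashed bit is already 1."""
        positions = self._positions(element)
        duplicate = all(self.bits[p] for p in positions)
        for p in positions:
            self.bits[p] = 1  # update: set the corresponding bits
        return duplicate


bf = BloomFilterDetector(num_bits=1024, num_hashes=3)
first = bf.seen("http://example.com/a")   # filter is empty: distinct
second = bf.seen("http://example.com/a")  # same bits now set: duplicate
```

Because bits are never cleared, a repeated element is always reported as a duplicate; the errors are false positives only.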
9
Existing Solutions – Bloom Filters
False positives (false alarms)
Bloom filters will be “full”
- All distinct URLs will be reported as duplicates, and thus skipped!
- Example of a full bitmap: 111111
10
Our approach – Stable Bloom Filters (SBF)
Kick “elements” out of the Bloom filter
Change bits to “cells” (“cellmap”)
- Example: bitmap 110101, cellmap 031203
11
Stable Bloom Filters (SBF)
A “cellmap”, originally all “0”
Duplicate detection
- Hash each element into some cells, check those cells
- If any cell is “0”, report “distinct” else “duplicate”
Kick “elements” out
- Randomly choose some cells and decrement them by 1
Update the “cellmap”
- Set the hashed cells to a predefined value, Max > 0
- Use the same hash functions as in the detection stage
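The three SBF steps (detect, kick out, update) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the hash construction (salted SHA-256), the choice of picking each decremented cell independently at random, and all names (`StableBloomFilter`, `num_cells`, `num_hashes`, `max_value`, `num_decrements`) are hypothetical stand-ins for the four parameters named in the talk (SBF size, number of hash functions, max cell value, kick-out rate).

```python
import hashlib
import random


class StableBloomFilter:
    """Stable Bloom filter duplicate detector (illustrative sketch)."""

    def __init__(self, num_cells, num_hashes, max_value, num_decrements, seed=None):
        self.cells = [0] * num_cells  # cellmap, originally all 0
        self.num_hashes = num_hashes
        self.max_value = max_value            # "Max" in the talk
        self.num_decrements = num_decrements  # cells kicked per element
        self.rng = random.Random(seed)

    def _positions(self, element):
        # Assumption: salt SHA-256 with the hash index to get several hashes.
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{element}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % len(self.cells))
        return positions

    def seen(self, element):
        positions = self._positions(element)
        # Detection: "distinct" if any hashed cell is 0, else "duplicate".
        duplicate = all(self.cells[p] > 0 for p in positions)
        # Kick out: decrement randomly chosen cells by 1, never below 0.
        for _ in range(self.num_decrements):
            j = self.rng.randrange(len(self.cells))
            if self.cells[j] > 0:
                self.cells[j] -= 1
        # Update: set the hashed cells to Max, using the same hash functions.
        for p in positions:
            self.cells[p] = self.max_value
        return duplicate


sbf = StableBloomFilter(num_cells=1000, num_hashes=3, max_value=3,
                        num_decrements=10, seed=1)
first = sbf.seen("http://example.com/a")   # empty cellmap: distinct
second = sbf.seen("http://example.com/a")  # cells were just set to Max: duplicate
```

Unlike the regular Bloom filter, old elements fade as their cells are decremented to 0, so the structure can report false negatives as well as false positives.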
12
Analytical results
SBF will be stable
- the expected # of “0”s becomes a constant after a number of updates
- converges at an exponential rate
- monotonic
- a lower bound of the expected # of “0”s (a function of the SBF size, # of hash functions, max cell value, and kick-out rate)
13
Analytical results
Two-sided errors
- false positive rates become constant
- an upper bound of false positive rates (a function of 4 parameters)
- given a false positive rate and SBF size, find the optimal parameters minimizing the # of false negatives (combining empirical results on setting max cell values)
14
Experiments
Experimental comparison between SBF and the Caching/Buffering method (LRU)
- URL fingerprint data set, originally obtained from the Internet Archive (~700M URLs)
- Synthetic data simulating network traffic using Poisson and B-model
- For a fair comparison, we introduce FPBuffering, which lets Caching generate some false positives: if an element is not found in the buffer, report “duplicate” with a certain probability
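The FPBuffering baseline described above can be sketched as an LRU buffer that, on a miss, reports "duplicate" with some probability. The function name `fp_buffering`, the parameter `fp_probability`, and the decision to still insert the element into the buffer after a miss are assumptions for this illustration; the talk does not give those details.

```python
import random
from collections import OrderedDict


def fp_buffering(stream, capacity, fp_probability, seed=None):
    """LRU buffering that reports misses as "duplicate" with a given probability."""
    rng = random.Random(seed)
    buffer = OrderedDict()  # keys ordered from least to most recently used
    reports = []
    for element in stream:
        if element in buffer:
            buffer.move_to_end(element)  # refresh recency
            reports.append(True)
        else:
            # Not found: report "duplicate" anyway with probability fp_probability.
            reports.append(rng.random() < fp_probability)
            # Assumption: the element is buffered regardless of what was reported.
            buffer[element] = None
            if len(buffer) > capacity:
                buffer.popitem(last=False)  # evict the least recently used
    return reports


plain_lru = fp_buffering(["a", "b", "a", "c", "b"], capacity=2, fp_probability=0.0)
always_dup = fp_buffering(["a", "b", "c"], capacity=2, fp_probability=1.0)
```

Setting `fp_probability` to 0 recovers plain LRU caching, while raising it trades false negatives for false positives, which is what allows a comparison with SBF at a matched false positive count.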
15
Experimental Results
SBF generates 3-13% fewer false negatives than FPBuffering, while having exactly the same # of false positives (<10%)
16
Experimental Results
17
Experimental Results
MIN [Broder et al. WWW03], theoretically optimal
- assumes “the entire sequence of requests is known in advance”
- beats LRU caching by <5% in most cases
The more false positives are allowed, the more SBF gains
18
Related work
Duplicate detection in click streams [Metwally et al. WWW05]
URL caching [Broder et al. WWW03]
Other variations of Bloom filters
- Counting Bloom filters [Fan et al. SIGCOMM98]
- Spectral Bloom filters [Cohen&Matias SIGMOD03]
- …
Fuzzy duplicate detection [Ananthakrishna et al. VLDB02], [Chaudhuri et al. ICDE05], [Weis et al. SIGMOD05]
19
Conclusions
SBF provides a trade-off between false positives and false negatives when space is limited
SBF is fast and simple
The higher the allowed false positive rate, the more SBF gains
20
Questions/Comments?
Thanks!