Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters
SIGMOD 2006, University of Alberta
Presented by Fan Deng


Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters
  SIGMOD 2006, University of Alberta
  Presented by Fan Deng
  Joint work with Davood Rafiei

Outline
  Motivation
  The problem
    - Approximate duplicate detection
  Existing solutions
    - Caching
    - Bloom filters
  Our approach
    - Stable Bloom filters
    - Results
  Related work
  Conclusions

The Motivating Application
  Duplicate URL detection in Web crawling [Broder et al. WWW03]
    - Web search engines fetch web pages continuously
    - Extract URLs within each downloaded page
    - Check each URL (duplicate detection): if never seen before, download it; else skip it
  Problem
    - Huge number of distinct URLs
    - Memory is usually not large enough, and disks are too slow

The Motivating Application
  Errors are usually acceptable
    - A false positive (false alarm)
      -- A distinct URL is wrongly reported as a duplicate
      -- This URL will not be crawled
    - A false negative (miss)
      -- A duplicate URL is wrongly reported as distinct
      -- This URL will be crawled redundantly or searched for on disk

The Problem: Approximate Duplicate Detection
  A sequence of elements with order: … d g a f b e a d c b a
  Storage space M (not large enough to store all distinct elements)
  Continual membership query: appeared before? Yes or No
  Our goal
    - Minimize the # of errors
    - Fast
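
To make "# of errors" concrete, the sketch below counts false positives and false negatives of any detector against an exact, unbounded-memory ground truth. The function name `count_errors` and the detector interface (a callable returning True for "duplicate") are assumptions made for illustration; they do not come from the slides. The example stream is the one shown on the slide, and its error counts are the same whichever direction it is read in.

```python
from typing import Callable, Iterable, Tuple


def count_errors(stream: Iterable[str],
                 detector: Callable[[str], bool]) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) for a detector.

    `detector(x)` is a hypothetical interface: it answers True when it
    believes x has appeared before. Ground truth is kept in an exact
    set, which is precisely what the real setting cannot afford.
    """
    seen = set()
    false_positives = 0
    false_negatives = 0
    for element in stream:
        truly_duplicate = element in seen
        reported_duplicate = detector(element)
        if reported_duplicate and not truly_duplicate:
            false_positives += 1      # distinct element reported as duplicate
        elif truly_duplicate and not reported_duplicate:
            false_negatives += 1      # duplicate element reported as distinct
        seen.add(element)
    return false_positives, false_negatives


# Stream from the slide, read right to left (the reading order is an assumption).
stream = list("abcdaebfagd")

# Two trivial detectors bracket the trade-off.
always_distinct = lambda x: False    # never errs on new elements, misses every repeat
always_duplicate = lambda x: True    # never misses a repeat, rejects every new element
print(count_errors(stream, always_distinct))
print(count_errors(stream, always_duplicate))
```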

Existing Solutions – Caching
  Store as many distinct elements as possible in a buffer
  Duplicate detection process
    - On seeing an element, search the buffer
    - If found, report “duplicate”; else report “distinct”
  Update the buffer using some replacement policy
    - LRU, FIFO, Random, …

Existing Solutions – Caching
  False negatives
    - lead to redundant crawling or disk searches
  Need extra space
    - to speed up searching
    - to maintain the replacement policy (e.g. LRU)
    - the amount of extra space is proportional to the buffer size
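
The caching approach can be sketched with an LRU buffer as follows. This is a minimal illustration, not the implementation evaluated in the talk: the class name `LRUDuplicateDetector` and the choice to insert every missed element into the buffer are assumptions. `OrderedDict` stands in for the hash index plus recency list, which is the "extra space" the slide refers to.

```python
from collections import OrderedDict


class LRUDuplicateDetector:
    """Duplicate detection with a fixed-size LRU buffer (illustrative sketch)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # Keys are buffered elements; order runs from least to most recently used.
        self._buffer: "OrderedDict[str, None]" = OrderedDict()

    def seen_before(self, element: str) -> bool:
        if element in self._buffer:
            # Hit: refresh recency and report "duplicate".
            self._buffer.move_to_end(element)
            return True
        # Miss: report "distinct" and remember the element,
        # evicting the least recently used one if the buffer is full.
        self._buffer[element] = None
        if len(self._buffer) > self.capacity:
            self._buffer.popitem(last=False)
        return False


detector = LRUDuplicateDetector(capacity=2)
answers = [detector.seen_before(x) for x in ["a", "b", "a", "c", "b", "a"]]
print(answers)  # the last two answers are false negatives: b and a were evicted
```

A cache never reports a false positive; every error it makes is a false negative caused by eviction.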

Existing Solutions – Bloom Filters
  A bitmap, originally all “0”
  Duplicate detection process
    - Hash each incoming element into some bits
    - If any bit is “0”, report “distinct”; else report “duplicate”
  Update process
    - Set the corresponding bits to “1”

    x   h1(x)   h2(x)
    a   1       2
    b   1       3
    c   2       4
    a
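
The detection and update steps of a Bloom filter can be sketched as below. The hash construction (SHA-256 of an index-prefixed key, reduced modulo the bitmap size) is an assumption chosen only to make the example deterministic; the slides do not say which hash functions are used. The class and method names are hypothetical.

```python
import hashlib
from typing import List


class BloomFilter:
    """Bloom filter used as a streaming duplicate detector (illustrative sketch)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        if num_bits <= 0 or num_hashes <= 0:
            raise ValueError("num_bits and num_hashes must be positive")
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, element: str) -> List[int]:
        # Assumed hash family: SHA-256 over "<i>:<element>", one digest per hash function.
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{element}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def seen_before(self, element: str) -> bool:
        positions = self._positions(element)
        # Detection: any "0" bit proves the element is new.
        duplicate = all(self.bits[p] for p in positions)
        # Update: set the element's bits to "1".
        for p in positions:
            self.bits[p] = 1
        return duplicate


bf = BloomFilter(num_bits=1024, num_hashes=2)
print(bf.seen_before("a"), bf.seen_before("a"))
```

Because bits are only ever set and never cleared, a repeated element is always reported as a duplicate; the only possible error is a false positive.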

Existing Solutions – Bloom Filters
  False positives (false alarms)
  Bloom filters will become “full”
    - All distinct URLs will be reported as duplicates, and thus skipped!
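
Why the filter becomes "full" can be seen from the standard textbook approximation of the false positive rate after n distinct insertions into m bits with k hash functions, (1 - e^(-kn/m))^k. The formula is a well-known approximation and is not taken from the slides; the function name and the parameter values below are illustrative assumptions.

```python
import math


def bloom_false_positive_rate(num_bits: int, num_hashes: int, num_inserted: int) -> float:
    """Standard approximation (1 - exp(-k*n/m))**k of the Bloom filter
    false positive rate after inserting n distinct elements."""
    fraction_set = 1.0 - math.exp(-num_hashes * num_inserted / num_bits)
    return fraction_set ** num_hashes


# On an unbounded stream, n keeps growing while m stays fixed,
# so the rate climbs towards 1.
m, k = 1_000_000, 4
for n in (100_000, 1_000_000, 10_000_000, 100_000_000):
    print(n, bloom_false_positive_rate(m, k, n))
```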

Our approach – Stable Bloom Filters (SBF)
  Kick “elements” out of the Bloom filter
  Change bits to “cells” (“cellmap”)

Stable Bloom Filters (SBF)
  A “cellmap”, originally all “0”
  Duplicate detection
    - Hash each element into some cells and check those cells
    - If any cell is “0”, report “distinct”; else report “duplicate”
  Kick “elements” out
    - Randomly choose some cells and decrement each by 1
  Update the “cellmap”
    - Set the hashed cells to a predefined value, Max > 0
    - Use the same hash functions as in the detection stage
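
The three steps on this slide (detect, kick out, update) can be sketched as follows. Several details are assumptions rather than statements from the slides: the SHA-256 based hash family, choosing the decremented cells by sampling without replacement, leaving cells that are already 0 untouched, and all class, method, and parameter names.

```python
import hashlib
import random
from typing import List, Optional


class StableBloomFilter:
    """Stable Bloom filter sketch: cells in [0, max_value] instead of bits."""

    def __init__(self, num_cells: int, num_hashes: int, max_value: int,
                 num_decrements: int, seed: Optional[int] = None) -> None:
        if num_cells <= 0 or num_hashes <= 0 or max_value <= 0:
            raise ValueError("num_cells, num_hashes and max_value must be positive")
        if not 0 <= num_decrements <= num_cells:
            raise ValueError("num_decrements must be between 0 and num_cells")
        self.num_cells = num_cells
        self.num_hashes = num_hashes
        self.max_value = max_value            # "Max" on the slide
        self.num_decrements = num_decrements  # cells kicked per arrival (assumed name)
        self.cells = [0] * num_cells          # the "cellmap", originally all 0
        self._rng = random.Random(seed)

    def _positions(self, element: str) -> List[int]:
        # Assumed hash family; the same functions serve detection and update.
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{element}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.num_cells)
        return positions

    def seen_before(self, element: str) -> bool:
        positions = self._positions(element)
        # 1. Detection: any zero cell means "distinct".
        duplicate = all(self.cells[p] > 0 for p in positions)
        # 2. Kick out: decrement randomly chosen cells (assumed to stop at 0).
        for p in self._rng.sample(range(self.num_cells), self.num_decrements):
            if self.cells[p] > 0:
                self.cells[p] -= 1
        # 3. Update: set the element's cells to Max.
        for p in positions:
            self.cells[p] = self.max_value
        return duplicate


sbf = StableBloomFilter(num_cells=1000, num_hashes=3, max_value=3,
                        num_decrements=10, seed=1)
print(sbf.seen_before("a"), sbf.seen_before("a"))
```

Decrementing lets old elements fade out, which is what introduces false negatives in exchange for keeping the filter from filling up.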

Analytical results
  SBF will be stable
    - The expected # of “0”s becomes a constant after a number of updates
    - Converges at an exponential rate
    - Monotonic
    - A lower bound on the expected # of “0”s (a function of the SBF size, # of hash functions, max cell value, and kick-out rate)
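
The stability claim can be checked empirically with a small simulation. The sketch below feeds an SBF-like cellmap a stream of distinct elements, modelled as uniformly random cell positions, and records the fraction of zero cells after each update. The comparison value in `estimated_zero_fraction` is a back-of-envelope estimate that treats each cell independently (set with probability about K/m, otherwise decremented with probability P/m per update); it is an illustrative approximation, not the bound stated in the paper. All names and parameter values are assumptions.

```python
import random
from typing import List


def simulate_zero_fraction(num_cells: int, num_hashes: int, max_value: int,
                           num_decrements: int, num_updates: int,
                           seed: int = 0) -> List[float]:
    """Fraction of zero cells after each update, for a stream of distinct elements."""
    rng = random.Random(seed)
    cells = [0] * num_cells
    zeros = num_cells
    history = []
    for _ in range(num_updates):
        # Kick out: decrement randomly chosen cells, stopping at 0.
        for p in rng.sample(range(num_cells), num_decrements):
            if cells[p] > 0:
                cells[p] -= 1
                if cells[p] == 0:
                    zeros += 1
        # Update: a fresh element hashes to uniformly random cells.
        for _ in range(num_hashes):
            p = rng.randrange(num_cells)
            if cells[p] == 0:
                zeros -= 1
            cells[p] = max_value
        history.append(zeros / num_cells)
    return history


def estimated_zero_fraction(num_cells: int, num_hashes: int, max_value: int,
                            num_decrements: int) -> float:
    """Rough estimate: a cell is 0 when its last `max_value` events were all decrements."""
    set_prob = num_hashes / num_cells
    dec_prob = (num_decrements / num_cells) * (1.0 - set_prob)
    return (dec_prob / (dec_prob + set_prob)) ** max_value


history = simulate_zero_fraction(1000, 3, 3, 10, 20000)
print(history[0], sum(history[-5000:]) / 5000,
      estimated_zero_fraction(1000, 3, 3, 10))
```

The zero fraction starts near 1, falls, and then hovers around a constant instead of drifting to 0 as it would in a regular Bloom filter.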

Analytical results
  Two-sided errors
    - False positive rates become constant
    - An upper bound on the false positive rate (a function of 4 parameters)
    - Given a false positive rate and SBF size, find the optimal parameters minimizing the # of false negatives (combined with empirical results on setting max cell values)

Experiments
  Experimental comparison between SBF and the caching/buffering method (LRU)
    - URL fingerprint data set, originally obtained from the Internet Archive (~700M URLs)
    - Synthetic data simulating network traffic using Poisson and B-model
    - For a fair comparison, we introduce FPBuffering: let caching generate some false positives, i.e. if an element is not found in the buffer, report “duplicate” with a certain probability
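
FPBuffering as described on this slide can be sketched by adding a coin flip to an LRU buffer. The slide only states the reporting rule; that a missed element is still inserted into the buffer, and the names `FPBufferingDetector` and `fp_probability`, are assumptions made for this illustration.

```python
import random
from collections import OrderedDict
from typing import Optional


class FPBufferingDetector:
    """LRU buffer that reports "duplicate" on a miss with a fixed probability."""

    def __init__(self, capacity: int, fp_probability: float,
                 seed: Optional[int] = None) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        if not 0.0 <= fp_probability <= 1.0:
            raise ValueError("fp_probability must be in [0, 1]")
        self.capacity = capacity
        self.fp_probability = fp_probability
        self._buffer: "OrderedDict[str, None]" = OrderedDict()
        self._rng = random.Random(seed)

    def seen_before(self, element: str) -> bool:
        if element in self._buffer:
            self._buffer.move_to_end(element)   # hit: refresh recency
            return True
        # Miss: buffer the element (assumed), evicting the LRU entry if full.
        self._buffer[element] = None
        if len(self._buffer) > self.capacity:
            self._buffer.popitem(last=False)
        # Report "duplicate" anyway with the configured probability. This can turn
        # a would-be false negative into a correct answer, at the price of
        # false positives on truly new elements.
        return self._rng.random() < self.fp_probability


detector = FPBufferingDetector(capacity=2, fp_probability=0.1, seed=42)
print([detector.seen_before(x) for x in ["a", "b", "a", "c", "b", "a"]])
```

Tuning `fp_probability` lets the buffer be matched to the false positive count of an SBF, so the two methods can be compared on false negatives alone.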

Experimental Results
  SBF generates 3-13% fewer false negatives than FPBuffering, while having exactly the same # of false positives (<10%)

Experimental Results

Experimental Results
  MIN [Broder et al. WWW03], theoretically optimal
    - assumes “the entire sequence of requests is known in advance”
    - beats LRU caching by <5% in most cases
  The more false positives are allowed, the more SBF gains

Related work
  Duplicate detection in click streams [Metwally et al. WWW05]
  URL caching [Broder et al. WWW03]
  Other variations of Bloom filters
    - Counting Bloom filters [Fan et al. SIGCOMM98]
    - Spectral Bloom filters [Cohen&Matias SIGMOD03]
    - …
  Fuzzy duplicate detection [Ananthakrishna et al. VLDB02], [Chaudhuri et al. ICDE05], [Weis et al. SIGMOD05]

Conclusions
  SBF provides a trade-off between false positives and false negatives when space is limited
  SBF is fast and simple
  The higher the allowed false positive rate, the more SBF gains

Questions/Comments?
  Thanks!