Bloom Filters An Introduction and Really Most Of It CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook.


Agenda Discuss what a set data structure is using math terms Discuss the concept of a Bloom filter Explore the mathematical magic behind Bloom filters

Set! A set is an unsorted data structure containing unique values. Most common uses are: error-free set membership tests; storing unique members of data (removing duplicates); iterating through data in no particular order; other fun operations like unions, intersections, subsets, et cetera! Other set variants support sorting and duplicate values, but we aren’t here to talk about those.

Set Insertion peter lois chris peter stewie chris stewie lois chris peter insert

Set Membership Test is_member(peter) against the set {stewie, lois, chris, peter}

Set Membership Test is_member(adam) against the set {stewie, lois, chris, peter}

Use Case! I’ve got a bunch of interesting keywords, A. I’ve got a data set, B. I want to check if a record in B contains a word in A, and make a new data set C for some cool data science.
    for each record x in B
        for each word w in x
            if w in A
                emit x
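The loop above can be sketched in Python with an exact, in-memory set. The function name `matching_records`, the sample data, and whitespace tokenization are assumptions made for illustration. Unlike the slide's pseudocode, this sketch emits a record at most once, even if several of its words are in A.

```python
def matching_records(records, keywords):
    """Yield each record that contains at least one keyword (exact set lookup)."""
    keyword_set = set(keywords)  # A, held entirely in memory
    for record in records:       # each record x in B
        # Whitespace splitting is an assumption; real data may need a tokenizer.
        if any(word in keyword_set for word in record.split()):
            yield record         # emit x once, on the first matching word


# Hypothetical sample data, not from the slides.
A = ["hadoop", "bloom"]
B = ["intro to hadoop", "cooking with gas", "bloom filters rock"]
C = list(matching_records(B, A))
print(C)
```

This works well as long as A fits in memory, which is the assumption the following slides question.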

Use Case, Solved! Stuff all the data in A into a set Get an A+ on your computer science project Impress the boss But what if A is stupid big? credit to mr. squarepants

Memory Footprint A contains 1 billion unique strings, average of 32 characters in length 8 bits per character 32 characters per string 1 billion of them 8 bits * 32 * 1,000,000,000 … Roughly 29.8 GB of raw storage required to hold these elements + overhead + even more if you are using Java For the sake of argument, let’s all agree that A doesn’t fit comfortably on a computer…
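The arithmetic on this slide can be checked with a few lines of Python. The variable names are illustrative, and the sketch assumes the slide's "GB" means binary gigabytes (2^30 bytes), which is what makes the figure come out to 29.8.

```python
# Raw storage for 1 billion strings of 32 one-byte characters each.
n_strings = 1_000_000_000
chars_per_string = 32
bits_per_char = 8

total_bits = bits_per_char * chars_per_string * n_strings
total_bytes = total_bits // 8
gib = total_bytes / 2**30  # assumption: "GB" on the slide is 2**30 bytes

print(f"{total_bits:,} bits = {total_bytes:,} bytes = {gib:.1f} GiB")
```

This counts only the characters themselves; per-object overhead in a real set implementation comes on top, as the slide notes.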

credit to xkcd and paint

Making a Set Smaller What two ‘features’ of a set can we relax to meet our requirements and have a reasonable memory footprint? Functionality Only want set membership operations Accuracy Don’t really need to be 100% accurate

Use Case, Revised! I’ve got a bunch of interesting keywords, A. I’ve got a data set, B. I want to check if a record in B contains a word in A, and make a new data set C for some cool data science. I don’t really care if some stuff in C doesn’t contain words from A.
    for each record x in B
        for each word w in x
            if w is likely in A, with false positive rate p
                emit x

Let me paint you a story… We travel back to 1970… Burton Howard Bloom was investigating means to eliminate unnecessary disk accesses for particular algorithms. He came up with a probabilistic data structure for set membership. It is useful for programs with expensive operations where the operation is often unnecessary. A structure only 15% of the size of the original can eliminate 85% of unnecessary disk accesses.

Bloom Filter A space-efficient means to test if an element is a member of a set Elements can be added, but cannot be removed Storage cost for a single element is independent of the element size The members are not stored, so they cannot be retrieved There are no false negatives, but false positives are possible

How It’s Made – Training a Bloom Filter Given: an array of bits of size m, initialized to 0; k hash functions; n elements.
    foreach element n_i in n
        foreach function k_j in k
            m[k_j(n_i) % m] = 1
Training a Bloom filter is O(n) for a fixed k, since each element costs k hash evaluations.
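A minimal Python sketch of the training loop follows. It is not the implementation from the course: the helper names `bit_positions` and `train` are hypothetical, the k hash functions are simulated by prefixing a seed to the input of SHA-256, and the bit array is a `bytearray` with one byte per bit for readability rather than space efficiency.

```python
import hashlib


def bit_positions(item, k, m):
    """Return k array indexes for item, one per simulated hash function."""
    positions = []
    for seed in range(k):
        # Assumption: seeding SHA-256 with the function number stands in
        # for k independent hash functions.
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % m)
    return positions


def train(elements, k, m):
    """Build an m-slot bit array with the bits for every element set to 1."""
    bits = bytearray(m)  # all zeros; one byte per bit, for clarity only
    for element in elements:
        for index in bit_positions(element, k, m):
            bits[index] = 1
    return bits


bits = train(["peter", "lois", "chris"], k=3, m=32)
print("".join(str(b) for b in bits))
```

Note that the elements themselves are never stored; only their bit positions are.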

How It’s Made – Training a Bloom Filter peter, lois, chris

How It’s Made – Membership Testing Given: a trained Bloom filter of size m; the same k hash functions; an element x.
    foreach function k_j in k
        if m[k_j(x) % m] is 0
            return false
    return true
Testing a Bloom filter is O(1) for a fixed k: it costs at most k hash evaluations, regardless of how many elements were added.
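The membership test can be sketched as a small Python class. The class name `BloomFilter` and the method names `add` and `might_contain` are assumptions for illustration, as is the use of seeded SHA-256 in place of k independent hash functions.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: m bits and k seeded SHA-256 hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for clarity only

    def _positions(self, item):
        # Assumption: a seed prefix simulates k different hash functions.
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for index in self._positions(item):
            self.bits[index] = 1

    def might_contain(self, item):
        # False means "definitely not added"; True means "probably added".
        return all(self.bits[index] for index in self._positions(item))


bf = BloomFilter(m=64, k=3)
for name in ["peter", "lois", "chris", "stewie"]:
    bf.add(name)

print(bf.might_contain("peter"))  # an added element always tests True
print(bf.might_contain("adam"))   # usually False, but True is possible
```

The test stops at the first zero bit, so a negative answer is often cheaper than a positive one.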

How It’s Made – Membership Testing peter

How It’s Made – Membership Testing adam

I know what you’re thinking

The Catch What we gain in space, we give up in accuracy… I give you… the false positive! cleveland
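To see false positives in practice, the sketch below fills a small filter and probes it with strings that were never added. The parameters (n = 100, m = 1000, k = 7), the helper names, and the seeded SHA-256 hashing are all assumptions chosen for illustration. The comparison value uses the standard approximation p ≈ (1 - e^(-kn/m))^k.

```python
import hashlib
import math


def positions(item, k, m):
    """k bit indexes for item, using seeded SHA-256 as stand-in hash functions."""
    return [
        int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big") % m
        for seed in range(k)
    ]


def measure_false_positive_rate(n=100, m=1000, k=7, probes=10_000):
    """Train on n members, then count how many non-members test positive."""
    bits = bytearray(m)
    for i in range(n):
        for index in positions(f"member-{i}", k, m):
            bits[index] = 1
    false_positives = sum(
        all(bits[index] for index in positions(f"outsider-{i}", k, m))
        for i in range(probes)
    )
    return false_positives / probes


measured = measure_false_positive_rate()
expected = (1 - math.exp(-7 * 100 / 1000)) ** 7  # theoretical approximation
print(f"measured {measured:.4f}, approximation {expected:.4f}")
```

The two numbers should be close but not identical, since the formula is an approximation and the measurement is a sample.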

credit to xkcd and paint

Controlling the False Positive Rate Given: the approximate number of elements in A, n; a willingness to tolerate a fraction p of false positives; and k, the optimal number of hash functions. We can approximate m. If you want the full details, read the paper or Wikipedia.
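For reference, the approximation usually quoted (for example on Wikipedia) is written out below. It is the standard textbook result rather than a formula recovered from this deck, and it assumes the k hash functions behave independently and uniformly.

```latex
% False positive rate after inserting n elements into m bits with k hashes:
p \approx \left(1 - e^{-kn/m}\right)^{k}

% Choosing the k that minimizes p:
k = \frac{m}{n}\,\ln 2

% Substituting that k and solving for m:
m \approx -\frac{n \ln p}{(\ln 2)^{2}}
```

In words: the required number of bits grows linearly with n and only logarithmically as the tolerated false positive rate p shrinks.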

Back to our use case… n = 1,000,000,000, p = 0.1. After dusting off the calculators… m ≈ 4.79 x 10^9 bits, or 0.558 GB. An improvement by a factor of 29.8/0.558 = 53.4!

And now that we have m… We can use n and m to calculate k = (m/n) * ln(2) ≈ 3.32. But I haven’t heard of 3.32 hash functions, so let’s call it 4.
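The sizing for the use case can be reproduced in Python. The function name `bloom_parameters` is hypothetical, and the formula for m is the standard approximation m ≈ -n ln(p) / (ln 2)^2, assumed here rather than taken from the slides. "GB" is again read as 2^30 bytes.

```python
import math


def bloom_parameters(n, p):
    """Return (m, k): bits needed and the ideal, non-integer hash count."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = (m / n) * math.log(2)
    return m, k


m, k = bloom_parameters(n=1_000_000_000, p=0.1)
gib = m / 8 / 2**30  # assumption: "GB" means 2**30 bytes

print(f"m = {m:.3e} bits ({gib:.3f} GiB), k = {k:.2f}")
```

Since k must be a whole number, an implementation has to round it; the slides round up to 4.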

References Wikipedia