Bloom Filters An Introduction and Really Most Of It CMSC 491


Bloom Filters An Introduction and Really Most Of It CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook

Agenda
  Discuss what a set data structure is using math terms
  Discuss the concept of a Bloom filter
  Explore the mathematical magic behind Bloom filters

Set!
  A set is an unsorted data structure containing unique values
  Most common uses are:
    Error-free set membership tests
    Storing unique members of data (removing duplicates)
    Iterating through data in no particular order
    Other fun operations like unions, intersections, subsets, et cetera!
  Other kinds of sets support sorting and duplicate values
    But we aren’t here to talk about those

Set Insertion
  [Diagram: insert peter, lois, chris, and stewie into the set; the resulting set holds stewie, lois, chris, and peter]

Set Membership Test
  [Diagram: is_member(peter) against the set holding stewie, lois, chris, and peter]

Set Membership Test
  [Diagram: is_member(adam) against the set holding stewie, lois, chris, and peter]

Use Case!
  I’ve got a bunch of interesting keywords, A
  I’ve got a data set B
  I want to check if a record in B contains a word in A
  Make a new data set C for some cool data science
    for each record x in B
      for each word w in x
        if w in A
          emit x

Use Case, Solved!
  Stuff all the data in A into a set
  Get an A+ on your computer science project
  Impress the boss
  But what if A is stupid big?
  credit to mr. squarepants
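A minimal sketch of the "stuff A into a set" solution, assuming records are plain strings and that splitting on whitespace is good enough tokenization. The name filter_records and the sample data are made up for illustration; they are not from the slides.

```python
def filter_records(records, keywords):
    """Return the records that contain at least one keyword (exact test)."""
    keyword_set = set(keywords)  # exact set: no false positives, but holds every keyword
    matches = []
    for record in records:
        # split() on whitespace is a stand-in for real tokenization
        if any(word in keyword_set for word in record.split()):
            matches.append(record)
    return matches


A = ["hadoop", "bloom"]
B = ["bloom filters are neat", "nothing to see here", "hadoop runs jobs"]
C = filter_records(B, A)
print(C)
```

This works as long as keyword_set fits in memory, which is exactly the assumption the next slides call into question.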

Memory Footprint
  A contains 1 billion unique strings, average of 32 characters in length
    8 bits per character
    32 characters per string
    1 billion of them
  8 bits * 32 * 1,000,000,000 …
  Roughly 29.8 GB of raw storage required to hold these elements
    + overhead
    + even more if you are using Java
  For the sake of argument, let’s all agree that A doesn’t fit comfortably on a computer…
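The arithmetic on this slide can be checked in a few lines. The sketch below assumes "GB" on the slide means binary gigabytes (2^30 bytes), which is what makes the 29.8 figure come out; the variable names are illustrative.

```python
n_strings = 1_000_000_000
chars_per_string = 32
bits_per_char = 8

total_bits = bits_per_char * chars_per_string * n_strings
total_bytes = total_bits // 8
total_gib = total_bytes / 2**30  # binary gigabytes

print(f"{total_bits:,} bits = {total_bytes:,} bytes = {total_gib:.1f} GiB")
```

This counts only the raw characters; per-object overhead in a real set implementation would push the total higher.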

credit to xkcd and paint

Making a Set Smaller
  What two ‘features’ of a set can we relax to meet our requirements and have a reasonable memory footprint?
  Functionality
    Only want set membership operations
  Accuracy
    Don’t really need to be 100% accurate

Use Case, Revised!
  I’ve got a bunch of interesting keywords, A
  I’ve got a data set B
  I want to check if a record in B contains a word in A
  Make a new data set C for some cool data science
  I don’t really care if some stuff in C doesn’t contain words from A
    for each record x in B
      for each word w in x
        if w is likely in A (with false positive rate p)
          emit x

Let me paint you a story…
  We travel back to 1970…
  Burton Howard Bloom was investigating means to eliminate unnecessary disk accesses for particular algorithms
  Came up with a probabilistic data structure for set membership
  Useful for programs with expensive operations where the operation is often unnecessary
  A structure only 15% of the size of the original can eliminate 85% of unnecessary disk accesses
  Hyphenation algorithm – for a dictionary of 500,000 words, 90% of which follow simple hyphenation rules, 10% require expensive disk accesses to retrieve specific patterns
    With enough memory, you can use a set to eliminate unnecessary accesses
    With limited memory, you can eliminate most unnecessary accesses

Bloom Filter
  A space-efficient means to test if an element is a member of a set
  Elements can be added, but cannot be removed
  Storage cost for a single element is independent of the element size
  The members are not stored, so they cannot be retrieved
  There are no false negatives, but false positives are possible

How It’s Made – Training a Bloom Filter
  Given
    An array of m bits, initialized to 0
    k hash functions
    n elements
  foreach element x in the n elements
    foreach hash function h in the k hash functions
      bits[h(x) % m] = 1
  Training a Bloom filter is O(n) for a fixed k
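The training loop can be sketched in Python as follows. The slides do not say which hash functions to use, so this sketch makes one up: bit_positions simulates k hash functions by salting SHA-256 with the function index. That helper, the name train, and the tiny parameters are all assumptions for illustration.

```python
import hashlib


def bit_positions(element, k, m):
    """Derive k bit positions in [0, m) for a string element.

    Assumption: the k hash functions are simulated by salting SHA-256
    with the function index; the slides do not name specific hashes.
    """
    positions = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{element}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % m)
    return positions


def train(elements, k, m):
    """Build an m-bit array with the bits for every element set to 1."""
    bits = [0] * m
    for element in elements:
        for pos in bit_positions(element, k, m):
            bits[pos] = 1
    return bits


bits = train(["peter", "lois", "chris"], k=3, m=16)
print(bits)
```

A Python list of ints is used for readability only; a real filter would pack the bits (for example into a bytearray) to get the space savings the slides are after.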

How It’s Made – Training a Bloom Filter
  [Diagram: peter, lois, and chris are hashed into the bit array, setting the selected bits to 1]

How It’s Made – Membership Testing
  Given
    A trained Bloom filter of m bits
    The same k hash functions
    An element x
  foreach hash function h in the k hash functions
    if bits[h(x) % m] is 0
      return false
  return true
  Testing a Bloom filter is O(1) for a fixed k
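A self-contained sketch of the membership test. As before, the hash functions are an assumption (salted SHA-256 via a hypothetical bit_positions helper), and the filter is trained inline on a few names so that there is something to test against.

```python
import hashlib


def bit_positions(element, k, m):
    """Hypothetical stand-in for the k hash functions (salted SHA-256)."""
    positions = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{element}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % m)
    return positions


def is_member(bits, element, k):
    """Return False if any probed bit is 0, otherwise True (possibly a false positive)."""
    m = len(bits)
    for pos in bit_positions(element, k, m):
        if bits[pos] == 0:
            return False  # definitely not in the set
    return True  # probably in the set


# Train a small filter inline.
k, m = 3, 64
members = ["peter", "lois", "chris", "stewie"]
bits = [0] * m
for name in members:
    for pos in bit_positions(name, k, m):
        bits[pos] = 1

print(is_member(bits, "peter", k))  # always True for a trained element
print(is_member(bits, "adam", k))   # usually False, but True is possible
```

The early return on the first 0 bit is why there are no false negatives: every trained element has all k of its bits set, so it can never hit that branch.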

How It’s Made – Membership Testing
  [Diagram: testing peter against the trained bit array]

How It’s Made – Membership Testing
  [Diagram: testing adam against the trained bit array]

I know what you’re thinking BEST THING SINCE

The Catch
  What we gain in space, we give up in accuracy…
  I give you… the false positive!
  [Diagram: testing cleveland against the trained bit array]

credit to xkcd and paint

Controlling the False Positive Rate
  Given
    Approximate number of elements in A, n
    A willingness to tolerate a false positive rate p
  Assuming k is the optimal number of hash functions, we can approximate m
    m = -n * ln(p) / (ln 2)^2
  If you want the full details, read the paper or Wikipedia

Back to our use case…
  n = 1,000,000,000
  p = 0.1
  After dusting off the calculators…
  m = 4.792 x 10^9 bits, or 0.558 GB
  A reduction by a factor of 29.8 / 0.558 = 53.4!
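The numbers on this slide match the standard sizing formula m = -n ln(p) / (ln 2)^2, which assumes the optimal number of hash functions. A short sketch that reproduces them; optimal_bits is an illustrative name, and "GB" is again taken to mean 2^30 bytes.

```python
import math


def optimal_bits(n, p):
    """Approximate bit count for n elements at false positive rate p,
    assuming the optimal number of hash functions is used."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


n = 1_000_000_000
p = 0.1
m = optimal_bits(n, p)
m_gib = m / 8 / 2**30

print(f"m = {m:,} bits, about {m_gib:.3f} GiB")
print(f"reduction factor vs. 29.8 GiB of raw strings: {29.8 / m_gib:.1f}")
```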

And now that we have m…
  We can use n and m to calculate k = (m/n) * ln(2), which comes to about 3.32
  But I haven’t heard of 3.32 hash functions, so let’s call it 4
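A small sketch that computes k for the use case and then checks what rounding does to the false positive rate. The rate formula used here, (1 - e^(-kn/m))^k, is the standard approximation and is not taken from the slides; the function names are illustrative.

```python
import math


def optimal_hash_count(m, n):
    """Real-valued optimal number of hash functions for m bits and n elements."""
    return (m / n) * math.log(2)


def false_positive_rate(m, n, k):
    """Standard approximation of the false positive rate (not from the slides)."""
    return (1 - math.exp(-k * n / m)) ** k


n = 1_000_000_000
m = 4_792_000_000  # bits, from the use case

k_exact = optimal_hash_count(m, n)
print(f"k = {k_exact:.2f}")

# Compare the two nearest whole numbers of hash functions.
for k in (3, 4):
    print(k, round(false_positive_rate(m, n, k), 4))
```

Under this approximation both whole-number choices land close to the 10% target, so rounding k costs little accuracy; a smaller k also means fewer hash evaluations per operation.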

References Wikipedia