Bloom Filters


Bloom Filters
Very fast set membership: is x in S? The answer is either No or Maybe. A false positive is a response of Maybe that should have been No; the goal is to minimize the false positive rate. A Bloom filter may be used in any large application where you want to test set membership and can tolerate false positives. It extends to counting Bloom filters. It is an application of hashing that is often used in practice to improve the average (though not necessarily worst-case) performance of search applications. We motivate it with 2 examples: differential files and packet forwarding.

Differential Files
Simple large database: a collection/file of records residing on disk, with a single key and an index to the records. Operations: Retrieve. Update (insert a new record, make changes to an existing record, delete a record). The index is a dictionary of keys. With each key we store the address of the corresponding record on disk.

Naïve Mode Of Operation
[Figure: Key -> Index -> File -> Ans.]
Problems. Index and File change with time. Sooner or later, the system will crash. Recovery => copy Master File (MF) from backup, copy Master Index (MI) from backup, and process all transactions since the last backup. Recovery time depends on MF & MI size + #transactions since last backup. When MF is large, it cannot be backed up too frequently, so the time to process the transactions since the last backup is large. So recovery time is large.

Differential File
Make no changes to the master file. Alter the index and write the updated record to a new file called the differential file (DF).

Differential File Operation
[Figure: Key -> Index -> File or DF -> Ans.]
Advantage. DF is smaller than File and so may be backed up more frequently. Index needs to be backed up whenever DF is, so the index should be small as well. Recovery time is reduced: restore the backups of the DF and index, then process a small number of transactions to recover the current state.

Differential File Operation
[Figure: Key -> Index -> File or DF -> Ans.]
Disadvantage. Eventually DF becomes large and can no longer be backed up with the desired frequency. File and DF must then be integrated. Following integration, DF is empty.

Differential File Operation
[Figure: Key -> Index -> File or DF -> Ans.]
Large Index. The index cannot be backed up as frequently as desired, and the time to recover the current state of index & DF is excessive. Use a differential index (DI): make no changes to Index. DI is an index to all deleted records and to the records in DF. So it is really the index size that determines backup frequency now.

Differential File & Index Operation
[Figure: Key -> DI; Y -> DF, N -> Index -> File; -> Ans.]
Performance hit. Most queries search both DI and Index, which increases the # of disk accesses/query. DI may be small enough to fit in memory => no performance hit. Otherwise, use a filter to decide whether or not DI should be searched. This is set membership: is the search key in the set of keys that are in the DI?

Ideal Filter
[Figure: Key -> Filter; Y -> DI -> DF, N -> Index -> File; -> Ans.]
Y => this key is in the DI. N => this key is not in the DI. The filter must be memory resident. But the functionality of an ideal filter is the same as that of the DI, so it would be no smaller than the DI itself. So a filter that eliminates the performance hit of the DI doesn't exist.

Bloom Filter (BF)
[Figure: Key -> BF; M -> DI (Y -> DF, N -> Index), N -> Index -> File; -> Ans.]
N => this key is not in the DI. M (maybe) => this key may be in the DI. Filter error: BF says Maybe, DI says No.

Bloom Filter (BF)
BF resides in memory. A filter error occurs when BF says Maybe but DI says No. The performance hit (a wasted DI search) is paid only when there is a filter error.
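The query path on the slides above can be sketched as follows. This is a minimal illustration, not the deck's implementation: the names `lookup`, `filter_says_maybe` and `DELETED`, and the use of plain dictionaries for the Index and DI, are assumptions made for the example. The second return value reports whether the DI had to be searched.

```python
# Sketch of a lookup through filter -> DI -> Index (all names hypothetical).
DELETED = object()  # assumed marker for a DI entry that records a deletion


def lookup(key, filter_says_maybe, di, index):
    """Return (location, di_searched).

    filter_says_maybe: callable standing in for the memory-resident BF.
    di:    dict key -> DF address, or DELETED (the differential index).
    index: dict key -> master-file address (the unchanged master index).
    """
    if not filter_says_maybe(key):
        # Filter says No: the key is certainly not in the DI.
        addr = index.get(key)
        return (("MF", addr) if addr is not None else None), False

    # Filter says Maybe: we must pay for a DI search.
    if key in di:
        entry = di[key]
        return (None if entry is DELETED else ("DF", entry)), True

    # Filter error: BF said Maybe but DI says No. Fall back to the Index.
    addr = index.get(key)
    return (("MF", addr) if addr is not None else None), True


# Stand-in filter with one false positive (key 1 is not in the DI).
index = {1: 100, 2: 200, 3: 300, 4: 400}
di = {2: 7, 3: DELETED}
maybe = lambda k: k in {1, 2, 3}

print(lookup(4, maybe, di, index))  # filter says No, DI is skipped
print(lookup(1, maybe, di, index))  # filter error, DI searched in vain
```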

Longest Matching Prefix
Suppose the router prefixes have W different lengths. Create W Bloom filters, one for each length; the i-th Bloom filter is for prefixes of length i. Keep W hash tables; the i-th hash table has the length-i prefixes together with next-hop information. Query the Bloom filters to get the list of hash tables that may have a matching prefix. Query those hash tables in decreasing order of length (or in parallel) to find the longest matching prefix.
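A small sketch of this scheme, under stated assumptions: prefixes and addresses are bit strings, `TinyBloom` is a hypothetical toy filter built on SHA-256 slices, and ordinary dictionaries stand in for the hash tables. A real router would keep the filters in on-chip memory and query them in parallel.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter (hypothetical helper, one byte per bit for clarity)."""

    def __init__(self, m=1024, h=3):
        self.m, self.h = m, h
        self.bits = bytearray(m)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        # Carve h independent-looking 32-bit values out of one digest.
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.h)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def maybe(self, key):
        return all(self.bits[p] for p in self._positions(key))


def build(routes):
    """One filter and one hash table per prefix length."""
    filters, tables = {}, {}
    for prefix, next_hop in routes.items():
        length = len(prefix)
        filters.setdefault(length, TinyBloom()).add(prefix)
        tables.setdefault(length, {})[prefix] = next_hop
    return filters, tables


def longest_match(addr_bits, filters, tables):
    """Return (next_hop, number_of_hash_tables_probed)."""
    # Step 1: ask every filter whether its length may have a match.
    candidates = [n for n, f in filters.items()
                  if n <= len(addr_bits) and f.maybe(addr_bits[:n])]
    # Step 2: probe only the candidate tables, longest prefix first.
    probes = 0
    for n in sorted(candidates, reverse=True):
        probes += 1
        hop = tables[n].get(addr_bits[:n])
        if hop is not None:
            return hop, probes
    return None, probes


filters, tables = build({"0": "C", "10": "A", "1011": "B"})
print(longest_match("10110000", filters, tables)[0])  # longest match wins
```

Without filter errors, the first table probed already holds the longest match, which is why only about one hash-table access is expected per lookup.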

Longest Matching Prefix
[Figure: on chip, Bloom filters B1, B2, B3, …, BW; off chip, hash tables H1, H2, H3, …, HW.]
The Bi are on-chip Bloom filters; Bi is the Bloom filter for length-i prefixes. The Hi are off-chip hash tables of (prefix, action) pairs. First get a list of hash tables to search by querying the Bloom filters in parallel. Next search those hash tables in decreasing order of length. If there is no filter error, then only 1 hash table is searched => expect only about 1 access to off-chip memory.
May also be used for intrusion detection systems. A signature string is decomposed into fixed-length chunks that are stored in an off-chip data structure. An on-chip Bloom filter is used to minimize off-chip searches. Since most packet payloads don't match a signature, few off-chip searches are done (assuming the false-positive rate of the filter is small). This is equivalent to asking whether a given string is one of the strings in a dictionary: keep a Bloom filter for the given set of dictionary strings. Good when there are few successful matches.
For dynamic tables, keep a count of how many prefixes set a particular bit to 1. Reset the bit to 0 when its count becomes 0. Counters are kept in off-chip memory.
May extend to approximate classification of flows; each Bloom filter stores the rules for 1 flow (i.e., packets that go to the same output port). Search the Bloom filters in on-chip memory and use the output port corresponding to the successful filter. If there is no successful filter, search the main memory structure. In case of a filter error, the packet is sent to the wrong place, but error recovery is possible.
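The per-bit counters mentioned for dynamic tables can be sketched as a counting Bloom filter. This is an illustrative sketch only: the class name `CountingBloom`, the SHA-256-based hash positions, and keeping counters and bits in one Python list are assumptions, whereas the slide keeps the counters off chip and only the bits on chip.

```python
import hashlib


class CountingBloom:
    """Counting Bloom filter sketch (names and hashing are assumptions)."""

    def __init__(self, m=64, h=2):
        self.m, self.h = m, h
        self.counts = [0] * m  # counts[i] = number of keys that set bit i

    def _positions(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.h)]

    def add(self, key):
        for p in self._positions(key):
            self.counts[p] += 1

    def remove(self, key):
        # Only valid for keys that were previously added; removing a key
        # that was never inserted would corrupt the counters.
        for p in self._positions(key):
            self.counts[p] -= 1

    def maybe(self, key):
        # Bit i is treated as 1 exactly when its counter is non-zero.
        return all(self.counts[p] > 0 for p in self._positions(key))


cbf = CountingBloom()
cbf.add("1011")
cbf.add("10")
cbf.remove("1011")        # bits shared with "10" keep a non-zero count
print(cbf.maybe("10"))    # still reported as Maybe
```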

Bloom Filter Design
Use m bits of memory for the BF. Larger m => fewer filter errors. Initially, all m bits = 0. Use h > 0 hash functions: f1(), f2(), …, fh(). When key k is inserted into the DI, set bits f1(k), f2(k), …, and fh(k) to 1. f1(k), f2(k), …, fh(k) is the signature of key k.
Independent hash functions: it should be possible for two keys to have the same value for one hash function but differ on the remaining hash functions. This is not true for k mod 11 and 2k mod 11, as k = 11i+j has f1 = j and f2 = (2j) mod 11. So all keys with the same f1 agree on their f2 value. But f1 = k mod 11 and f2 = (2k/3) mod 11 results in keys with the same f1 having possibly different f2s.
In practice, one may use a single hash function that generates many bits. E.g., with a Bloom filter of size 2^(32) and a hash function that gives a 64-bit output, we can use 32 bits as the output of f1 and the remaining 32 bits as f2.
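The single-hash trick can be sketched as below, with h = 2. Assumptions for illustration: the class name `BloomFilter`, the choice of BLAKE2b as the 64-bit hash, and a default filter of 2^16 bits (instead of 2^32) so that the example stays small; each 32-bit half is reduced mod m to fit the smaller filter.

```python
import hashlib


class BloomFilter:
    """Bloom filter with h = 2, both hashes taken from one 64-bit digest."""

    def __init__(self, log2_m=16):
        assert 3 <= log2_m <= 32
        self.m = 1 << log2_m                 # filter size in bits
        self.bits = bytearray(self.m // 8)   # initially all m bits = 0

    def signature(self, key: bytes):
        h64 = int.from_bytes(
            hashlib.blake2b(key, digest_size=8).digest(), "big")
        f1 = (h64 >> 32) % self.m            # upper 32 bits -> f1(k)
        f2 = (h64 & 0xFFFFFFFF) % self.m     # lower 32 bits -> f2(k)
        return f1, f2

    def insert(self, key: bytes):
        for pos in self.signature(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def query(self, key: bytes) -> bool:
        """False => key certainly not inserted; True => maybe inserted."""
        return all((self.bits[pos >> 3] >> (pos & 7)) & 1
                   for pos in self.signature(key))


bf = BloomFilter()
bf.insert(b"alpha")
print(bf.query(b"alpha"))  # an inserted key is never reported as No
```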

Example
m = 11 (normally, m would be much, much larger). h = 2 (2 hash functions). f1(k) = k mod m. f2(k) = (2k) mod m. Inserting k = 15 sets bits 4 and 8. Inserting k = 17 sets bits 6 and 1.
[Figure: bit array with positions 0 through 10; bits 1, 4, 6 and 8 are 1.]

Example
DI has k = 15 and k = 17. Search for k. Bit f1(k) = 0 or bit f2(k) = 0 => k not in DI. Bit f1(k) = 1 and bit f2(k) = 1 => k may be in DI. k = 6 => filter error: f1(6) = 6 and f2(6) = 1, and both of those bits were set by the other keys.
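The worked example can be checked mechanically. The sketch below uses the slide's m = 11, f1 and f2; the names `bits` and `maybe` are introduced only for the illustration.

```python
M = 11  # filter size from the example


def f1(k):
    return k % M


def f2(k):
    return (2 * k) % M


bits = [0] * M
for k in (15, 17):          # the two keys in the DI
    bits[f1(k)] = 1
    bits[f2(k)] = 1


def maybe(k):
    """True => k may be in the DI; False => k is not in the DI."""
    return bits[f1(k)] == 1 and bits[f2(k)] == 1


print(bits)      # [0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0]
print(maybe(6))  # True: a filter error, since 6 was never inserted
print(maybe(5))  # False: bit f1(5) = 5 is 0
```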

Bloom Filter Design
Choose m (filter size in bits): use as much memory as is available. Pick h (number of hash functions): h too small => the probability of different keys having the same signature is high; h too large => the filter becomes filled with ones too soon. Select the h hash functions: they should be relatively independent.

Optimal Choice Of h
The probability of a filter error depends on: filter size … m; # of hash functions … h; # of updates before the filter is reset to 0 … u, where an update is an insert, delete, or change. Assume that m and u are constant. # of master file records = n >> u. With n >> u, we may approximately say that the number of records doesn't change even when inserts are allowed.

Probability Of Filter Error
[Figure: Key -> BF; M -> DI (Y -> DF, N -> Index), N -> Index -> File; -> Ans.]
p(u) = probability of a filter error after u updates = A * B, where A = p(request for an unmodified record after u updates) and B = p(filter bits are all 1 for this request for an unmodified record). Minimize filter error probability for successful search/update. Modified record => changed, inserted, or deleted record.

A = p(request for unmodified record)
p(update j is for record i) = 1/n. p(record i not modified by update j) = 1 - 1/n. p(record i not modified by any of the u updates) = (1 - 1/n)^u = A. Assume all updates are for different records, so the DI and DF sizes equal #updates.

B = p(filter bits are all 1 for this request)
Consider an update with key K. p(fj(K) != i) = 1 - 1/m. p(fj(K) != i for all j) = (1 - 1/m)^h. p(bit i = 0 after one update) = (1 - 1/m)^h. p(bit i = 0 after u updates) = (1 - 1/m)^(uh). p(bit i = 1 after u updates) = 1 - (1 - 1/m)^(uh). p(all h signature bits of K are 1 after u updates) = [1 - (1 - 1/m)^(uh)]^h = B. Assume that all hash functions map to different addresses.

Probability Of Filter Error
p(u) = A * B = (1 - 1/n)^u * [1 - (1 - 1/m)^(uh)]^h. Since (1 - 1/x)^q ~ e^(-q/x) when x is large, p(u) ~ e^(-u/n) * (1 - e^(-uh/m))^h. d p(u)/dh = 0 => h = (ln 2) m/u ~ 0.693 m/u.
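The slide states the optimum without the intermediate steps. One standard way to obtain it, given here as a sketch, substitutes q = e^(-uh/m); the symbol q is introduced only for this derivation, and the factor e^(-u/n) is dropped because it does not depend on h.

```latex
% Minimize g(h) = (1 - e^{-uh/m})^h over h > 0.
\begin{align*}
  q &= e^{-uh/m}
     \quad\Longrightarrow\quad h = -\frac{m}{u}\,\ln q, \\
  \ln g(h) &= h \,\ln(1 - q)
            = -\frac{m}{u}\,\ln q \,\ln(1 - q). \\
  \intertext{The product $\ln q \,\ln(1-q)$ is symmetric in $q$ and $1-q$
  and is maximized on $(0,1)$ at $q = 1/2$, so $\ln g$ is minimized there:}
  e^{-uh/m} &= \tfrac12
     \quad\Longrightarrow\quad
     h_{\mathrm{opt}} = \frac{m}{u}\,\ln 2 \approx 0.693\,\frac{m}{u}, \\
  g(h_{\mathrm{opt}}) &= \left(\tfrac12\right)^{h_{\mathrm{opt}}}
     = 2^{-(m/u)\ln 2} \approx 0.6185^{\,m/u}.
\end{align*}
```

At the optimum each filter bit is 1 with probability about 1/2, which matches the intuition that too few hash functions waste the filter and too many fill it with ones.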

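Optimal h
[Figure: plot of p(u) against h, with the minimum at h = hopt.]
p(u) ~ e^(-u/n) * (1 - e^(-uh/m))^h is minimized at h ~ 0.693 m/u. Since h must be an integer, use one of the two integers nearest to this value; e.g., when it falls between 1 and 2, use h = 1 or h = 2. Example: m = 2*10^6, u = 10^6/2 => h ~ 2.772. Use h = 2 or h = 3. Determine p(u) for the above cases.
Minimize cache misses: pattern-blocked Bloom filter. Divide the filter into blocks that fit in a cache line (say); first select a block given the search key, then map the search key to bits in that block. Extension to multiple blocks.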
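The "determine p(u)" exercise can be done numerically. The sketch below uses the slide's m = 2*10^6 and u = 10^6/2; the value n = 10^8 is an assumption chosen only to satisfy n >> u (the slides give no value for n), and the function names are hypothetical.

```python
import math


def filter_error_probability(u, n, m, h):
    """Approximate p(u) = e^(-u/n) * (1 - e^(-uh/m))^h from the slides."""
    a = math.exp(-u / n)                      # request hits unmodified record
    b = (1.0 - math.exp(-u * h / m)) ** h     # all h signature bits are 1
    return a * b


def optimal_h(m, u):
    """Real-valued minimizer h = (ln 2) m / u."""
    return math.log(2) * m / u


m, u = 2 * 10**6, 10**6 // 2
n = 10**8  # ASSUMPTION: any n >> u; not given in the source

print(f"h_opt ~ {optimal_h(m, u):.3f}")
for h in (2, 3):  # the two integers nearest h_opt
    print(f"h = {h}: p(u) ~ {filter_error_probability(u, n, m, h):.4f}")
```

For these parameters h = 3 gives a slightly lower error probability than h = 2, so rounding h_opt up is the better choice here.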