How to Approximate a Set Without Knowing Its Size in Advance? Rasmus Pagh (IT University of Copenhagen), Gil Segev (Stanford), Udi Wieder (Microsoft Research).


Set Membership

Approximate Set Membership: |S| = n

ASM in a picture: S ⊆ [100]×[100], |S| = 188. [Figure: four approximations A(S) of the same set, with |A(S)| = 5213, 2699, 1580, and 918.]

Applications
Many, and very common in practice: databases, networking, and more.
The structure serves as a filter in front of slow or bandwidth-bounded data.
Example (web proxy cache): requests arrive first at the filter, which determines which requests reside in the proxy's cache and which should be fetched from the network. The cost of a false positive is a cache miss.
[Figure labels: Request; Filter (approximation of the cache); Web Proxy Cache; External Web]
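The proxy-cache flow described on this slide can be sketched as follows. This is only an illustration: the `BloomFilter` class, its parameters `m` and `k`, the salted SHA-256 hashing, and the `fetch` helper are all assumptions made for the example, not something taken from the talk.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; m (bits) and k (hash functions) are illustrative."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item: str):
        # Assumption: k bit positions derived from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def fetch(url, cache_filter, proxy_cache, stats):
    """Consult the filter first; only probe the (slow) cache on a 'yes'."""
    if url in cache_filter:
        if url in proxy_cache:
            stats["hits"] += 1
            return proxy_cache[url]
        # The filter said yes but the cache does not hold the item:
        # a false positive, which costs one wasted cache probe.
        stats["false_positives"] += 1
    stats["external"] += 1
    return f"<fetched {url} from the external web>"
```

A filter built from the cache's keys never answers "no" for a cached URL, so the only error it can make is the wasted probe counted under `false_positives`.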

Lower Bounds for Static Case: [CFGMW78]

Upper Bounds – Bloom Filters
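The formulas on this slide did not survive extraction. As a reminder, the standard textbook analysis (stated here as background, not recovered from the slides) for a Bloom filter with m bits, k hash functions, and n inserted items gives, under the usual independence approximation:

```latex
\varepsilon \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\mathrm{opt}} = \frac{m}{n}\ln 2
\;\Longrightarrow\;
m = \log_2 e \cdot n \log_2\frac{1}{\varepsilon}
  \approx 1.44\, n \log_2\frac{1}{\varepsilon}.
```

That is a factor of about 1.44 above the n log₂(1/ε) bits required by the static lower bound.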

Dictionary Based Upper Bounds

Separation of Static and Dynamic

But in practice…
The size of the set is not known in advance!
This leads to over-provisioning of space up front, which is wasted as long as the set is small.
The data structure typically lies in prime real estate, and the whole idea is to save space.
The problem has been raised and handled in 'practical' papers, but typically in a way that is naïve from a 'theoretical' point of view.
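To get a feel for how much space over-provisioning wastes, here is a small back-of-the-envelope calculation. It assumes a Bloom filter sized with the standard rule m = n · log₂(e) · log₂(1/ε); the function names and the numbers chosen (1% false positives, provisioning for one million items) are hypothetical and only serve the illustration.

```python
import math


def bloom_bits(n: int, eps: float) -> int:
    """Bits for a Bloom filter holding n items at false-positive rate eps,
    using the standard sizing m = n * log2(e) * log2(1/eps)."""
    return math.ceil(n * math.log2(math.e) * math.log2(1 / eps))


def wasted_fraction(current_n: int, provisioned_n: int, eps: float) -> float:
    """Fraction of the allocated bits that a filter sized for current_n
    would not have needed."""
    needed = bloom_bits(current_n, eps)
    allocated = bloom_bits(provisioned_n, eps)
    return 1 - needed / allocated


if __name__ == "__main__":
    eps = 0.01                 # hypothetical target false-positive rate
    provisioned = 1_000_000    # hypothetical up-front size estimate
    for current in (1_000, 100_000, 1_000_000):
        waste = wasted_fraction(current, provisioned, eps)
        print(f"{current:>9} items stored: {waste:.1%} of the space is unused")
```

Under these assumptions, nearly all of the allocated space sits idle while the set holds only a small fraction of the items it was provisioned for.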

Main Results (approximate) Super linear bound!

Lower Bound

Lower Bound – proof sketch...

Lower Bound: the encoding

Upper Bound – Construction 1

Getting Constant Query Time

Analysis

Extensions and standard tricks
Extra space is required while rebuilding: both the old and the new dictionary must be stored until the rebuild is complete. This can be mitigated by bucketing items into many smaller dictionaries and rebuilding the smaller dictionaries one at a time.
De-amortization of Insert: each time an item is inserted, perform O(1) operations on the next dictionary. This is not compatible with the bucketing technique and requires a small increase in space.
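The de-amortization idea can be sketched generically as follows. This is not the construction from the talk: Python's built-in `set` stands in for the dictionary of fingerprints, and the class name and the `capacity` and `moves_per_insert` parameters are hypothetical. The point is only that each insert does a constant amount of copying into the next dictionary, and that both dictionaries coexist until the rebuild finishes.

```python
class IncrementalRebuildSet:
    """Set of fingerprints whose rebuild into a larger table is spread over inserts."""

    def __init__(self, capacity: int = 4, moves_per_insert: int = 2):
        self.capacity = capacity              # nominal size of the current table
        self.moves_per_insert = moves_per_insert
        self.current = set()                  # stand-in for the current dictionary
        self.next = None                      # dictionary under construction, if any
        self.pending = []                     # fingerprints not yet copied over

    def insert(self, fp: int) -> None:
        if self.next is None:
            self.current.add(fp)
            if len(self.current) >= self.capacity:
                # Start a rebuild; both tables are kept until it completes.
                self.next = set()
                self.pending = list(self.current)
        else:
            # During a rebuild, new items go straight to the next table,
            # and a constant number of old items is copied along.
            self.next.add(fp)
            self._migrate_step()

    def _migrate_step(self) -> None:
        for _ in range(self.moves_per_insert):
            if self.pending:
                self.next.add(self.pending.pop())
        if not self.pending:
            # Rebuild complete: the old table can now be discarded.
            self.current, self.next = self.next, None
            self.capacity *= 2

    def contains(self, fp: int) -> bool:
        # While rebuilding, an item may live in either table.
        return fp in self.current or (self.next is not None and fp in self.next)
```

With `moves_per_insert = 2`, a table of size c is fully copied after c/2 further inserts, so the next table holds about 1.5c items when it takes over, below its nominal capacity of 2c.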

Supporting Deletions
Necessary assumption: only items that are in the set are ever deleted. The removal of a 'false positive' item may introduce false negatives. The assumption makes sense in many applications, for example when the data structure filters a cache.
The standard approach of storing multisets is problematic: an item generates many signatures, and one cannot tell which one to remove.
Upon insertion, if the fingerprint already appears, put it in a secondary structure. Upon removal, check the secondary structure first. This requires the assumption that each item is inserted only once, and some extra bookkeeping.
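The secondary-structure idea can be sketched in a simplified setting. The sketch assumes a single fingerprint per item (the talk's structure generates several signatures per item), uses a `set` for the primary structure and a `Counter` for the secondary one, and relies on the slide's assumptions that only present items are deleted and each item is inserted once. The class name and `fingerprint_bits` parameter are hypothetical.

```python
import hashlib
from collections import Counter


class DeletableFingerprintFilter:
    """Fingerprint filter with a secondary structure for colliding fingerprints."""

    def __init__(self, fingerprint_bits: int = 16):
        self.fingerprint_bits = fingerprint_bits
        self.primary = set()        # one copy of each fingerprint present
        self.secondary = Counter()  # extra copies of fingerprints already present

    def _fingerprint(self, item: str) -> int:
        # Assumption: top bits of a SHA-256 digest serve as the fingerprint.
        digest = hashlib.sha256(item.encode()).digest()
        return int.from_bytes(digest[:8], "big") >> (64 - self.fingerprint_bits)

    def insert(self, item: str) -> None:
        fp = self._fingerprint(item)
        if fp in self.primary:
            self.secondary[fp] += 1   # fingerprint already appears
        else:
            self.primary.add(fp)

    def delete(self, item: str) -> None:
        # Assumes the item is in the set and was inserted exactly once.
        fp = self._fingerprint(item)
        if self.secondary[fp] > 0:
            self.secondary[fp] -= 1   # check the secondary structure first
        else:
            self.primary.discard(fp)

    def __contains__(self, item: str) -> bool:
        return self._fingerprint(item) in self.primary
```

Because a fingerprint leaves the primary structure only after all of its extra copies are gone, deleting one item never produces a false negative for another item that shares its fingerprint.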

Open Problems
Bridge the theory-practice gap. Practitioners seem content with the solution of multiple Bloom filters. But then, practitioners seem content with Bloom filters…
Determine the leading constant in front of log log n.
THANK YOU