Pay for a Sliding Bloom Filter and Get Counting, Distinct Elements, and Entropy for Free
By: Ran Ben Basat, Technion, Israel. Joint work with Eran Assaf, Gil Einziger, and Roy Friedman. IEEE INFOCOM 2018.

Motivation: Computing network statistics (load balancing, fairness, anomaly detection). Monitoring a large number of flows. Allowing real-time queries.

Sliding Bloom Filter: Did $x$ appear among the last $W$ items (the window)? Recent data is often the most important! No false negatives: $\Pr[\text{yes} \mid x \in W] = 1$. Few false positives: $\Pr[\text{yes} \mid x \notin W] \le \epsilon$. Traditionally, the filter must fit in SRAM (SilkRoad, SIGCOMM 2017):

Year        2012    2014    2016
SRAM (MB)   10-20   30-60   50-100

Lower Bounds for Sliding Bloom Filters: Any sliding Bloom filter must use $\mathfrak{B} = W\log(W/\epsilon)$ bits (Naor and Yogev, ISAAC 2013). For convenience we assume that $\epsilon = W^{-o(1)}$; stated alternatively, $\log(W/\epsilon) = \log W \cdot (1+o(1))$. An algorithm is called succinct if it uses $\mathfrak{B} \cdot (1+o(1))$ bits of space.
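To see why the assumption gives that simplification, expand the logarithm (a one-line derivation, not from the slides):

$$\log(W/\epsilon) = \log W + \log(1/\epsilon) = \log W\left(1 + \frac{\log(1/\epsilon)}{\log W}\right),$$

so $\epsilon = W^{-o(1)}$, i.e. $\log(1/\epsilon) = o(\log W)$, is exactly the condition under which the second factor is $1+o(1)$.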

Sliding Window Bloom Filter (SWBF) (Liu et al., INFOCOM 2013): Use a cuckoo hash table whose entries store a fingerprint (FP) and a timestamp. A query ("has $x$ appeared in the last $W$ packets?") checks whether either of $x$'s two cuckoo slots holds its fingerprint with a timestamp inside the window. Thm: if the load factor is $\le 0.5$ then, with high probability, all operations take constant time. Space: $2W\log W \cdot (1+o(1)) = \mathfrak{B} \cdot (2+o(1))$ bits.
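A minimal Python sketch of this design, assuming a two-table cuckoo layout with one slot per bucket; the class name `SWBF`, the fingerprint scheme, and the constants are illustrative, not from the paper:

```python
class SWBF:
    """Sliding-window Bloom filter sketch backed by a 2-table cuckoo hash.

    Each slot holds (fingerprint, timestamp); an item counts as present
    if its fingerprint is found with a timestamp in the last `window` updates.
    """

    def __init__(self, slots_per_table, window, fp_bits=16, max_kicks=32):
        self.tables = [[None] * slots_per_table, [None] * slots_per_table]
        self.window = window
        self.fp_mask = (1 << fp_bits) - 1
        self.max_kicks = max_kicks
        self.time = 0

    def _fp(self, x):
        return hash(("fp", x)) & self.fp_mask

    def _slot(self, fp, t):
        return hash((t, fp)) % len(self.tables[t])

    def add(self, x):
        self.time += 1
        entry, t = (self._fp(x), self.time), 0
        for _ in range(self.max_kicks):
            i = self._slot(entry[0], t)
            old, self.tables[t][i] = self.tables[t][i], entry
            # Empty or expired slots absorb the entry; otherwise kick the
            # displaced entry to its slot in the other table and retry.
            if old is None or self.time - old[1] >= self.window:
                return
            entry, t = old, 1 - t
        # A full implementation would resize/rehash after too many kicks.

    def query(self, x):
        fp = self._fp(x)
        for t in range(2):
            e = self.tables[t][self._slot(fp, t)]
            if e is not None and e[0] == fp and self.time - e[1] < self.window:
                return True
        return False
```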

Per-flow frequency estimation: How many times does a flow $x$ appear in the window? This is a generalization of a Sliding Bloom Filter. A $W\epsilon$-additive approximation is known, using $O(\epsilon^{-1}\log W)$ bits and with constant-time operations (Ben Basat et al., INFOCOM 2016).

Sliding Window Approximate Measurement Protocol (SWAMP): a Cyclic Fingerprint Buffer (CFB) holding the fingerprints of the last $W$ items, and a Current Item Pointer (curr) into it.

Multiset representations: Consider representing a multiset of $m$ items from an $n$-sized universe, supporting operations such as multiplicity($x$) and replace($x$, $y$). There exist succinct representations (using $\mathfrak{B}(m,n) \cdot (1+o(1))$ bits) with $O(1)$-time operations (Einziger and Friedman, ICDCN 2016; Pandey et al., SIGMOD 2017).
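For intuition, here is a plain (deliberately non-succinct) Python stand-in for that interface; real succinct structures pack the same information into roughly $\mathfrak{B}(m,n)$ bits while keeping $O(1)$ operations:

```python
from collections import Counter

class Multiset:
    """Dictionary-backed multiset exposing the operations SWAMP relies on."""

    def __init__(self):
        self.counts = Counter()

    def multiplicity(self, x):
        return self.counts[x]

    def insert(self, x):
        self.counts[x] += 1

    def remove(self, x):
        self.counts[x] -= 1
        if self.counts[x] == 0:
            del self.counts[x]

    def replace(self, old, new):
        # One eviction plus one insertion: the per-packet work in SWAMP.
        self.remove(old)
        self.insert(new)
```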

SWAMP's update: advance curr around the CFB, replace the oldest fingerprint with the new item's fingerprint, and maintain an Aggregates Table mapping each fingerprint to its frequency in the window: +1 for the arriving fingerprint, -1 for the evicted one.
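Putting the pieces together, a compact Python sketch of SWAMP's update and query path (the class and the fingerprinting are ours; a real implementation would use a succinct multiset for the Aggregates Table):

```python
class SWAMP:
    """Sketch: cyclic fingerprint buffer plus fingerprint-frequency table."""

    def __init__(self, window, fp_bits=16):
        self.window = window
        self.fp_mask = (1 << fp_bits) - 1
        self.cfb = [None] * window   # Cyclic Fingerprint Buffer
        self.curr = 0                # Current Item Pointer
        self.freq = {}               # Aggregates Table: fingerprint -> count
        self.distinct = 0            # Z: distinct fingerprints in the window

    def _fp(self, x):
        return hash(("fp", x)) & self.fp_mask

    def update(self, x):
        new, old = self._fp(x), self.cfb[self.curr]
        self.cfb[self.curr] = new
        self.curr = (self.curr + 1) % self.window
        if old is not None:          # evict the window's oldest fingerprint
            self.freq[old] -= 1
            if self.freq[old] == 0:
                del self.freq[old]
                self.distinct -= 1
        if new not in self.freq:
            self.freq[new] = 0
            self.distinct += 1
        self.freq[new] += 1

    def query(self, x):
        """Sliding Bloom filter answer: did x appear in the last W items?"""
        return self._fp(x) in self.freq

    def frequency(self, x):
        """Windowed frequency estimate; collisions can only overcount."""
        return self.freq.get(self._fp(x), 0)
```

Because distinct flows can only collide into the same fingerprint, `query` has no false negatives and `frequency` never undercounts, matching the no-false-negative guarantee of a sliding Bloom filter.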

The results:

Algorithm   Space                      Update Time    Counts
TBF         O(W log W log(1/ε))        O(log(1/ε))    No
SWBF        (2+o(1)) · W log₂W         O(1)           No
SWAMP       (1+o(1)) · W log₂W         O(1)           Yes

Is SWAMP a good counting algorithm? We compared it to the state-of-the-art WCSS algorithm (Ben Basat et al., INFOCOM 2016).

Counting distinct elements over sliding windows: How many distinct flows appear in the window? A $(1+\epsilon)$-multiplicative approximation is known, using $O(\epsilon^{-2}\log W \log\log W)$ bits and with constant update time (Fusy and Giroire, ANALCO 2007; Chabchoub and Hebrail, ICDM 2010).

Counting distinct elements with SWAMP: alongside the CFB and the Aggregates Table, maintain $Z$, the number of distinct fingerprints currently in the window (+1 when a new fingerprint enters the table, -1 when an evicted fingerprint's count drops to zero). This requires just $\log W$ extra bits!

Counting distinct elements with SWAMP. Guarantees: $\Pr[D \ge Z] = 1$ and $\Pr[D - Z \ge \epsilon D \log \delta^{-1}] \le \delta$ ($Z$ never overestimates the number of distinct flows $D$, and is unlikely to underestimate it by much). An (approximate) Maximum Likelihood Estimate: return $\ln(1 - Z/2^L) \,/\, \ln(1 - 1/2^L)$, where $L$ is the fingerprint length in bits.
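A direct Python transcription of that estimator, with $Z$ observed distinct fingerprints of $L$ bits (the function name is ours):

```python
import math

def mle_distinct(z, fp_bits):
    """Invert the expected fingerprint occupancy to estimate the number
    of distinct items D from z distinct fp_bits-bit fingerprints."""
    space = 2 ** fp_bits            # 2**L possible fingerprints
    return math.log(1 - z / space) / math.log(1 - 1 / space)

# Example: 6 distinct 16-bit fingerprints yield an estimate of ~6.0002,
# since collisions are rare while z << 2**L.
print(mle_distinct(6, 16))
```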

Counting distinct elements with SWAMP: Instead of paying $\Omega(\epsilon^{-2}\log W)$ bits as in the existing algorithms, SWAMP requires $O(W\log(W/\epsilon))$ bits, which is more efficient when $\epsilon$ is small.

Takeaways: A succinct sliding Bloom filter that can also count. Beats the state of the art for: Sliding Bloom Filters, Per-flow Frequency Estimation, Counting Distinct Elements, and Computing Entropy (in the paper).

Any Questions?

Distribution Entropy over Sliding Windows: What is the distribution entropy of the window? A $(1+\epsilon)$-multiplicative approximation is known, using $O(\epsilon^{-2}\log W)$ bits and with $O(\epsilon^{-2})$ update time (Braverman et al., PODS 2009).

Computing Entropy with SWAMP: We can track $\hat{H}$, the entropy of the fingerprint distribution. Guarantees: $\Pr[H \ge \hat{H}] = 1$ and $\Pr[H - \hat{H} \ge \epsilon] \le \delta$.
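A sketch of reading the fingerprint-distribution entropy off SWAMP's Aggregates Table, reusing the `SWAMP` class above (an efficient implementation would maintain the sum incrementally under the ±1 updates; this assumes the window has already filled):

```python
import math

def fingerprint_entropy(swamp):
    """Empirical Shannon entropy of the window's fingerprint distribution.

    Collisions merge flows, which can only lower entropy, so this never
    overestimates the true flow-distribution entropy H.
    """
    w = swamp.window
    return -sum((c / w) * math.log2(c / w) for c in swamp.freq.values())
```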

Computing Entropy with SWAMP: Instead of paying $\Omega(\epsilon^{-2}\log W)$ bits as in the existing algorithms, SWAMP requires $O(W\log(W/\epsilon))$ bits, which is more efficient when $\epsilon$ is small.

Set Membership (Bloom Filter): Did a given item appear in the stream? We can't allocate a bit for each potential flow! Traditionally, the structure must fit in SRAM (SilkRoad, SIGCOMM 2017):

Year        2012    2014    2016
SRAM (MB)   10-20   30-60   50-100

The Bloom Filter (Bloom, 1970): Use a bit-array of size $m$ and $k$ hash functions $h_i: U \to \{1,\dots,m\}$; an insertion sets the item's $k$ bits, and a query answers yes iff all $k$ bits are set. No false negatives! Few false positives.
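For reference, a textbook Bloom filter in a few lines of Python (salting the built-in hash stands in for the $k$ hash functions):

```python
class BloomFilter:
    """Classic Bloom filter: an m-bit array and k hash functions."""

    def __init__(self, m, k):
        self.bits = [False] * m
        self.m, self.k = m, k

    def _positions(self, x):
        return (hash((i, x)) % self.m for i in range(self.k))

    def add(self, x):
        for pos in self._positions(x):
            self.bits[pos] = True

    def query(self, x):
        # Always True for added items (no false negatives); occasionally
        # True for others when all k bits happen to be set (false positive).
        return all(self.bits[pos] for pos in self._positions(x))
```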

The Timing Bloom Filter (Zhang and Guan, ICDCS 2008): Use a timestamp-array of size $m$ and $k$ hash functions $h_i: U \to \{1,\dots,m\}$; a query ("has $x$ appeared in the last $W$ packets?") checks that all $k$ of $x$'s cells hold timestamps within the window. Space: $O(W\log W\log\epsilon^{-1})$. Update/Query: $O(\log\epsilon^{-1})$.
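The same idea with timestamps instead of bits, in a short sketch (a real Timing Bloom Filter stores wraparound-safe $O(\log W)$-bit timestamps; here we keep full integers for simplicity):

```python
class TimingBloomFilter:
    """Bloom filter whose cells hold timestamps, enabling window queries."""

    def __init__(self, m, k, window):
        self.stamps = [None] * m
        self.m, self.k, self.window = m, k, window
        self.time = 0

    def _positions(self, x):
        return (hash((i, x)) % self.m for i in range(self.k))

    def add(self, x):
        self.time += 1
        for pos in self._positions(x):
            self.stamps[pos] = self.time

    def query(self, x):
        """Has x appeared among the last `window` additions?"""
        return all(
            self.stamps[pos] is not None
            and self.time - self.stamps[pos] < self.window
            for pos in self._positions(x)
        )
```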

[Backup slide: SWAMP figure showing the Current Item Pointer (curr), the Cyclic Fingerprint Buffer (CFB), and the hash $h(x_n)$ of the arriving item.]

[Evaluation plots: Recall, Precision, and Mean Square Error as functions of the number of packets/videos (×100K) and of the window size ($2^5$ to $2^{11}$).]