1
Pyramid Sketch: a Sketch Framework for Frequency Estimation of Data Streams
Tong Yang, Yang Zhou, Hao Jin, Peking University
Shigang Chen, University of Florida, USA
Xiaoming Li, Peking University, China

Good afternoon, everyone. My name is Tong Yang, from Peking University, China. Today, my topic is the Pyramid sketch.
2
Outline
1. Background
   Problem to address
   Prior art
2. Pyramid Techniques
   Counter-pair sharing
   Word acceleration: word constraint, word sharing, one hashing
   Ostrich policy
3. Evaluation
   Experiment setup
   Effects of techniques
   Accuracy
   Speed
4. Conclusion

Here is the outline. We first introduce the background.
4
Background
Problem:

[Figure: a high-speed stream of hot items and cold items updates a data structure, which also answers frequency queries]

• Hash tables: memory inefficient and slow

A data stream is composed of hot items and cold items, and each item can appear more than once. In practice, most items are cold items with low frequencies, while a few are hot items with high frequencies. Given an item, the question is: how many times has it appeared? One straightforward solution is to use a hash table. However, a hash table is not memory efficient, and its update speed is slow and not reasonably bounded. Nowadays, data streams often arrive at very high speed, and it is often impractical and unnecessary to record all item information exactly.
5
Background
Typical sketches:
• CM sketch: Journal of Algorithms 2005, cited 976 times
• CU sketch: SIGCOMM 2002, cited 949 times
• Count sketch: Automata, Languages and Programming, 2002, cited 715 times
• Augmented sketch: SIGMOD 2016
• Slim-Fat sketch: ICDE 2017

To address this problem, the sketch, a probabilistic data structure, has become popular. There are various sketches; typical ones include those listed above.
6
Background
Prior art --- CM Sketch
• Insertion: when a new item e comes
• Query: query for the frequency of the item e
• Deletion: delete item e

[Figure: item e is hashed to counters holding 5, 7, and 10. Insertion adds +1 to each of them and deletion adds -1 to each of them. Reported value: 5]

The most well-known sketches are the CM and CU sketches.
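The three CM operations on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the class name `CMSketch`, the parameters `d` and `w`, and the salted SHA-256 hashing are all assumptions made for the example.

```python
import hashlib


class CMSketch:
    """Minimal Count-Min sketch: d counter arrays of w counters each."""

    def __init__(self, d=3, w=1024):
        self.d, self.w = d, w
        self.rows = [[0] * w for _ in range(d)]

    def _index(self, item, row):
        # Hypothetical hash choice: SHA-256 salted with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def insert(self, item):
        # Insertion: add 1 to the mapped counter in every array.
        for r in range(self.d):
            self.rows[r][self._index(item, r)] += 1

    def delete(self, item):
        # Deletion: subtract 1 from the mapped counter in every array.
        for r in range(self.d):
            self.rows[r][self._index(item, r)] -= 1

    def query(self, item):
        # Query: report the smallest of the mapped counters.
        return min(self.rows[r][self._index(item, r)] for r in range(self.d))


cm = CMSketch()
for _ in range(5):
    cm.insert("e")
cm.insert("f")
cm.delete("f")
print(cm.query("e"))  # 5
```

Reporting the minimum is what turns mapped counters of 5, 7, and 10 into a reported value of 5: collisions can only inflate a counter, so the smallest one is the closest to the true frequency.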
7
Background
Prior art --- CU Sketch
• Insertion: when a new item e comes
• Query: query for the frequency of the item e

[Figure: item e is hashed to counters holding 5, 7, and 10, and a single +1 is applied. Reported value: 5]

The CU sketch achieves higher accuracy than the CM sketch, because an insertion increments only the smallest of the mapped counters.
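A minimal sketch of the CU insertion rule, under the same assumptions as any toy sketch: the class name `CUSketch`, the parameters `d` and `w`, and the salted SHA-256 hashing are illustrative choices, not taken from the slides.

```python
import hashlib


class CUSketch:
    """Minimal CU (conservative update) sketch: d arrays of w counters."""

    def __init__(self, d=3, w=1024):
        self.d, self.w = d, w
        self.rows = [[0] * w for _ in range(d)]

    def _index(self, item, row):
        # Hypothetical hash choice: SHA-256 salted with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def insert(self, item):
        idx = [self._index(item, r) for r in range(self.d)]
        smallest = min(self.rows[r][idx[r]] for r in range(self.d))
        # Only the counters that hold the minimum are incremented.
        for r in range(self.d):
            if self.rows[r][idx[r]] == smallest:
                self.rows[r][idx[r]] += 1

    def query(self, item):
        return min(self.rows[r][self._index(item, r)] for r in range(self.d))


cu = CUSketch(d=3, w=8)
for i in range(30):
    cu.insert(f"other{i}")
for _ in range(5):
    cu.insert("e")
print(cu.query("e") >= 5)  # True: CU never underestimates
```

With counters 5, 7, and 10, CM would raise all three while CU raises only the 5, so the larger counters absorb less error from other items.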
8
Background
• Design goal:
  High memory efficiency
  High update speed
  High accuracy

[Figure: a hot item and a cold item]

Hot items need large counters. To meet the needs of hot items, existing sketches use large counters everywhere, although most items are cold and never fill them.
9
Outline

Next, we show how our Pyramid sketch achieves high accuracy and high speed. The Pyramid sketch includes the following techniques: counter-pair sharing, word acceleration (word constraint, word sharing, and one hashing), and the Ostrich policy.
10
Techniques I
1 Counter-pair Sharing

[Figure: item e is mapped to pure counters at the first layer; the layers above hold hybrid counters]

The first technique is called counter-pair sharing. Our framework has multiple layers, and each layer is a counter array. All counters have the same size, for example 4 bits. Because a counter has only four bits, it can overflow during insertions. When a counter overflows, we use its parent counter to record the number of overflows. Note that every two adjacent counters share one parent counter at the next higher layer, so the number of counters is halved layer by layer. The counters at the first layer are pure counters: each one is used only to record frequencies. The counters at the other layers are hybrid counters.
11
Techniques I
1 Counter-pair Sharing

[Figure: a parent hybrid counter, made up of a left flag, a right flag, and a counting part, above its left child and right child]

Let us show the data structure of hybrid counters.
12
Techniques I
1 Counter-pair Sharing
Insertion example: the counter size is set to 4 bits.

[Figure: at layer L1 the left child holds 10 and the right child holds 15, both under one parent at layer L2. An item e comes in and the right child counter is supposed to be incremented. It would reach 16, which does not fit in 4 bits, so a carry operation is performed.]
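The carry operation in this example can be sketched as below. The layout is an assumption made for illustration: layer 1 is a list of plain integers, each hybrid counter is a list `[left_flag, right_flag, count]`, and the counting part is taken to be 2 bits (the 4-bit counter minus two flag bits). The names `make_layers` and `insert` are hypothetical.

```python
DELTA = 4                             # counter size in bits, as in the example
PURE_MAX = (1 << DELTA) - 1           # 15: largest value of a pure counter
COUNT_MAX = (1 << (DELTA - 2)) - 1    # 3: assumes two of the four bits are flags


def make_layers(width, depth):
    """Layer 1 holds ints; higher layers hold [left_flag, right_flag, count]."""
    layers = [[0] * width]
    for _ in range(depth - 1):
        width //= 2                   # every two children share one parent
        layers.append([[0, 0, 0] for _ in range(width)])
    return layers


def insert(layers, j):
    """Increment layer-1 counter j, carrying into parents on overflow."""
    if layers[0][j] < PURE_MAX:
        layers[0][j] += 1
        return
    layers[0][j] = 0                  # overflow: reset and carry upward
    for level in range(1, len(layers)):
        parent = layers[level][j // 2]
        parent[j % 2] = 1             # set the flag of the child that carried
        j //= 2
        if parent[2] < COUNT_MAX:
            parent[2] += 1
            return
        parent[2] = 0                 # counting part overflowed: carry again
    raise OverflowError("the top layer overflowed")


layers = make_layers(width=4, depth=3)
layers[0][0], layers[0][1] = 10, 15   # left child 10, right child 15
insert(layers, 1)                     # item e maps to the right child
print(layers[0][:2], layers[1][0])    # [10, 0] [0, 1, 1]
```

After the carry, the right child is back at 0, the parent's right flag is set, and its counting part records one overflow.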
13
Techniques I
1 Counter-pair Sharing
Query example: the counter size is set to 4 bits.

[Figure: layers L1, L2, and L3, with a parent above a left child and a right child]

We want to query the item e. The value is read upward, starting from the right child:

0*1 + 1*1*16 + 0*2*64 = 16
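The arithmetic above can be reproduced with a short sketch. It assumes that each hybrid counter is stored as `[left_flag, right_flag, count]`, that the counting part is 2 bits wide (which gives the weights 1, 16, and 64), and that the walk stops at the first flag that is off. The function name `query` and the exact flag values placed at L3 are assumptions chosen to match the slide's numbers.

```python
DELTA = 4  # counter size in bits, as in the example


def query(layers, j):
    """Read layer-1 counter j plus the weighted counts of its flagged ancestors."""
    value = layers[0][j]
    weight = 1 << DELTA                     # one layer-2 count is worth 16
    for level in range(1, len(layers)):
        left_flag, right_flag, count = layers[level][j // 2]
        flag = right_flag if j % 2 else left_flag
        if not flag:                        # this child never carried: stop
            break
        value += count * weight
        weight <<= DELTA - 2                # the next layer is worth 4 times more
        j //= 2
    return value


layers = [
    [10, 0, 0, 0],                          # L1: left child 10, right child 0
    [[1, 1, 1], [0, 0, 0]],                 # L2: both flags on, counting part 1
    [[0, 1, 2]],                            # L3: flag for this branch off, count 2
]
print(query(layers, 1))                     # 0*1 + 1*1*16 + 0*2*64 = 16
```

The L3 counter contributes nothing here because its flag for this branch is off, which is the `0*2*64` term in the slide.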
14
Techniques I
1 Counter-pair Sharing
• Memory efficiency:
  1) Counter size is kept small.
  2) It automatically assigns an appropriate number of small counters to store the frequency of each item.
15
Techniques II 2 Word acceleration
16
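Techniques II
2.1 Word constraint
Assume we hash an item e to k counters.

[Figure: item e is mapped to counters inside one machine word at L1, with L2 above]

Without word constraint, each insertion needs k memory accesses and k hash computations at layer 1.
With word constraint, each insertion needs 1 memory access and k+1 hash computations at layer 1: one to locate the machine word and k to locate the counters inside it.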
17
Techniques II
2.2 Word sharing

[Figure: two views of layers L1, L2, and L3 for an item e mapped into a machine word]

Using this method, we can alleviate the problem of hash collisions.
18
Techniques II
2.3 One hashing

[Figure: item e is mapped into a machine word at L1, with L2 above]

Use one hash function to compute a 32-bit hash value:
• The first 16 bits locate a word (64 bits).
• The remaining 4*4 bits locate 4 counters in the word.
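Splitting one 32-bit hash value this way might look like the sketch below. It is only an illustration: `zlib.crc32` stands in for whatever hash function is actually used, "first 16 bits" is read as the high 16 bits, and the function name `locate` and the parameter `num_words` are hypothetical.

```python
import zlib

WORD_BITS = 64
COUNTER_BITS = 4
COUNTERS_PER_WORD = WORD_BITS // COUNTER_BITS    # 16 counters, so 4 bits pick one


def locate(item, num_words, k=4):
    """Return (word index, k counter offsets in that word) from one 32-bit hash."""
    if k > 4:
        raise ValueError("16 bits only provide four 4-bit offsets")
    h = zlib.crc32(item.encode()) & 0xFFFFFFFF   # stand-in 32-bit hash (assumption)
    word = (h >> 16) % num_words                 # high 16 bits locate the word
    low = h & 0xFFFF                             # low 16 bits = 4 fields of 4 bits
    offsets = [(low >> (COUNTER_BITS * i)) & 0xF for i in range(k)]
    return word, offsets


word, offsets = locate("e", num_words=1024)
print(word, offsets)
```

Each 4-bit field indexes one of the 16 four-bit counters in the 64-bit word, so a single hash computation replaces the k+1 computations of word constraint. Two fields can land on the same counter; this sketch does not try to avoid that.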
19
Techniques III
3 Ostrich Policy
The Ostrich policy can be applied only to the CU sketch with Pyramid: PCU.
Without the Ostrich policy, the strict insertion strategy of PCU will be slow.

[Figure: an item e comes in]

Just like an ostrich, we pretend that there are no parent counters.
20
Techniques III
3 Ostrich Policy
Using the Ostrich policy, PCU will insert e as …

[Figure: an item e comes in and is mapped to three colored counters]

We merely query the three colored counters to get three values.
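One way to read this slide is sketched below. The slide states only that the parent counters are ignored and that the three mapped counters are read; the rest is an assumption: the CU rule is applied to those layer-1 values alone, and a counter that overflows still carries into its parent. The storage layout (`[left_flag, right_flag, count]` with a 2-bit counting part) and the names `carry` and `ostrich_insert` are hypothetical.

```python
DELTA = 4                             # counter size in bits
PURE_MAX = (1 << DELTA) - 1           # 15
COUNT_MAX = (1 << (DELTA - 2)) - 1    # 3: assumes a 2-bit counting part


def carry(layers, j):
    """Propagate an overflow of layer-1 counter j into its ancestors."""
    for level in range(1, len(layers)):
        parent = layers[level][j // 2]
        parent[j % 2] = 1             # flag the child that overflowed
        j //= 2
        if parent[2] < COUNT_MAX:
            parent[2] += 1
            return
        parent[2] = 0                 # counting part overflowed: keep carrying


def ostrich_insert(layers, positions):
    """positions: the layer-1 indexes that item e is hashed to."""
    # Read layer 1 only, as if there were no parent counters.
    smallest = min(layers[0][j] for j in positions)
    for j in set(positions):
        if layers[0][j] != smallest:
            continue                  # CU rule (assumed): touch only the minimum
        if layers[0][j] < PURE_MAX:
            layers[0][j] += 1
        else:
            layers[0][j] = 0
            carry(layers, j)


layers = [[3, 7, 3, 9], [[0, 0, 0], [0, 0, 0]]]
ostrich_insert(layers, [0, 1, 3])     # the three mapped counters hold 3, 7, 9
print(layers[0])                      # [4, 7, 3, 9]
```

Because the decision uses layer-1 values only, an insertion that does not overflow touches a single layer, which is consistent with the "around one memory access" figure when the mapped counters sit in one machine word.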
21
Techniques III
3 Ostrich Policy
Using the Ostrich policy, PCU achieves:
1) Speed acceleration: around one memory access for each insertion.
2) Surprisingly, an accuracy improvement as well.
22
Outline

Here is the outline: Background, Pyramid Techniques, Evaluation, and Conclusion.
23
Evaluation
Experiment setup
Datasets: We use three kinds of datasets, as follows.
1) Real IP-Trace Streams
2) Real-Life Transaction Dataset
3) Synthetic Datasets
Implementation: We applied Pyramid to 4 typical sketches.
Computation platform: A machine with 12-core CPUs and 62 GB DRAM. The CPU has three levels of cache memory: two 32KB L1 caches for each core, one 256KB L2 cache for each core, and one 15MB L3 cache shared by all cores.
24
Evaluation
Accuracy

We apply our framework to four typical sketches (CM, CU, Count, and Augmented) and find that the error rate is significantly reduced. We also find that applying Pyramid to the CU sketch gives the best accuracy, so we compare P_CU with the other sketches in the following experiments.
25
Evaluation
Effects of techniques

We have proposed five techniques: counter-pair sharing (T1), word constraint (T2), word sharing (T3), one hashing (T4), and the Ostrich policy (T5). These figures show that with all five techniques, both accuracy and speed are optimized.
26
Evaluation
Accuracy

Here we vary the skewness and dataset ID, and find that the P_CU sketch achieves much higher accuracy than the four typical sketches.
27
Evaluation
Speed

We apply our framework to four typical sketches (CM, CU, Count, and Augmented) and find that both insertion speed and query speed are improved.
28
Evaluation
Speed

Similarly, with different skewness and dataset IDs, P_CU needs far fewer memory accesses than the four typical sketches.
29
Evaluation
Speed

Here we vary the skewness and dataset ID, and find that the P_CU sketch achieves much higher insertion speed and query speed than the four typical sketches.
30
Outline

Here is the outline: Background, Pyramid Techniques, Evaluation, and Conclusion.
31
Conclusion

Sketches have been applied in various fields. In this paper, we propose a sketch framework, the Pyramid sketch, that significantly improves update speed and accuracy. We applied our framework to four typical sketches: CM, CU, Count, and Augmented. Experimental results show that our framework significantly improves both accuracy and speed. We believe our framework can be applied to many more sketches.
32
Thanks!
Pyramid Sketch: a Sketch Framework for Frequency Estimation of Data Streams
Source codes:
18 November 2018
IWQoS 2015