Published by Emily Melanie Cain. Modified over 8 years ago.
1
2006/3/21 Multiple Aggregations over Data Stream
Rui Zhang, Nick Koudas, Beng Chin Ooi, Divesh Srivastava
SIGMOD 2005
2
Outline
- Introduction to Giga-Scope DSMS
- Multiple Aggregations Problem
- The proposed approach
  - choice of phantoms
  - space allocation problem
- Conclusion
3
Giga-Scope
A DSMS built to monitor high-speed IP traffic data.
- LFTA (Network Interface Card): simple low-level queries over the high-speed data stream, which serve to reduce data volumes.
- HFTA (Main Memory): processes the low-speed data stream fed by the LFTAs.
4
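Single Aggregation in Giga-Scope
Query: Select A, count(*) From R Group by A;
[Figure: values of A in stream R (2, 24, 2, 2, 3, 17, 4) are hashed into an LFTA hash table with buckets 0 to 9 that holds (group, count) entries: (2,1), (24,1), (3,1), (17,1), (2,2), (2,3), (4,1). Entries leaving the LFTA are passed to the HFTA.]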
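The single-aggregation figure can be replayed with a small simulation. This is only a sketch: the function name `run_lfta`, the bucket rule `group % num_buckets`, and the one-entry-per-bucket layout are assumptions chosen to fit the values on the slide, not details stated by the authors.

```python
def run_lfta(stream, num_buckets=10):
    """Simulate one LFTA hash table of (group, count) entries.

    Assumption: bucket = group % num_buckets, one entry per bucket.
    Returns (number of probes, entries sent to the HFTA).
    """
    table = [None] * num_buckets   # each slot holds [group, count] or None
    to_hfta = []                   # entries transferred to the HFTA
    probes = 0
    for group in stream:
        probes += 1
        slot = group % num_buckets
        entry = table[slot]
        if entry is None:
            table[slot] = [group, 1]
        elif entry[0] == group:
            entry[1] += 1
        else:
            # Collision: evict the resident entry, then start a new one.
            to_hfta.append(tuple(entry))
            table[slot] = [group, 1]
    # End of epoch: flush everything that is still in the table.
    to_hfta.extend(tuple(e) for e in table if e is not None)
    return probes, to_hfta


probes, to_hfta = run_lfta([2, 24, 2, 2, 3, 17, 4])
print(probes, to_hfta)
```

Under these assumptions the only collision is group 4 displacing group 24 in bucket 4. Every other entry reaches the HFTA at the end of the epoch.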
5
Cost of Processing a Single Aggregation
- probe (c1): the cost of looking up the hash table in the LFTAs, plus a possible update in case of a collision
- eviction (c2): the cost of transferring an entry from the LFTAs to the HFTAs
6
Processing Multiple Aggregations Naively
Queries:
Select A, count(*) From R Group by A;
Select B, count(*) From R Group by B;
Select C, count(*) From R Group by C;
Stream R(A, B, C): (2, 3, 4), (24, 4, 3), (2, 3, 4), (4, 2, 3)
[Figure: the LFTA keeps one hash table per query (Hash Table A, Hash Table B, Hash Table C) of (group, count) entries. At the end of the epoch all remaining entries are sent to the HFTA.]
Cost: 15c1 + 1c2 + 7c2 (the 7c2 is paid at the end of the epoch)
7
Processing Multiple Aggregations by Maintaining Phantoms
Stream R(A, B, C): (2, 3, 4), (24, 4, 3), (2, 3, 4), (4, 2, 3)
[Figure: the LFTA keeps a phantom Hash Table ABC with buckets 0 to 9 holding (A, B, C, count) entries such as (2, 3, 4, 3), (24, 4, 3, 1) and (4, 2, 3, 1). Its entries feed Hash Table A, Hash Table B and Hash Table C, which are sent to the HFTA at the end of the epoch.]
Cost: 14c1 + 8c2
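The two cost totals (15c1 + 8c2 naive, 14c1 + 8c2 with phantom ABC) can be reproduced with a short simulation. Several things here are assumptions for illustration: the names `HashTable` and `run`, the bucket functions, and the third copy of (2, 3, 4) in the stream, which is added because the counts of 3 and the 15 probes on the slides imply five records.

```python
class HashTable:
    """LFTA-style table: one (key, count) entry per bucket."""

    def __init__(self, num_buckets, bucket_of):
        self.slots = [None] * num_buckets
        self.bucket_of = bucket_of   # maps a key to an integer
        self.probes = 0

    def add(self, key, count=1):
        """Probe with (key, count); return the evicted entry or None."""
        self.probes += 1
        i = self.bucket_of(key) % len(self.slots)
        entry = self.slots[i]
        if entry is not None and entry[0] == key:
            self.slots[i] = (key, entry[1] + count)
            return None
        self.slots[i] = (key, count)
        return entry                 # None if the bucket was empty

    def flush(self):
        """End of epoch: return all entries and empty the table."""
        entries = [e for e in self.slots if e is not None]
        self.slots = [None] * len(self.slots)
        return entries


def run(stream, use_phantom, num_buckets=10):
    """Return (probes, evictions to the HFTA) for queries on A, B and C."""
    queries = [HashTable(num_buckets, lambda k: k) for _ in range(3)]
    # Assumed bucket function for (A, B, C) keys; any spreading hash works.
    phantom = HashTable(num_buckets, lambda k: k[0] + 2 * k[1] + 3 * k[2])
    to_hfta = 0

    def feed_queries(record, count):
        nonlocal to_hfta
        for table, value in zip(queries, record):
            if table.add(value, count) is not None:
                to_hfta += 1         # a collision eviction costs c2

    for record in stream:
        if use_phantom:
            evicted = phantom.add(record)
            if evicted is not None:  # phantom evictions go to A, B and C
                feed_queries(*evicted)
        else:
            feed_queries(record, 1)
    for key, count in phantom.flush():
        feed_queries(key, count)
    to_hfta += sum(len(t.flush()) for t in queries)
    probes = phantom.probes + sum(t.probes for t in queries)
    return probes, to_hfta


stream = [(2, 3, 4), (24, 4, 3), (2, 3, 4), (4, 2, 3), (2, 3, 4)]
print(run(stream, use_phantom=False))  # naive: probes 15, evictions 8
print(run(stream, use_phantom=True))   # phantom ABC: probes 14, evictions 8
```

With the phantom, each record costs one probe into ABC, and only the three distinct ABC entries are pushed into A, B and C at the end of the epoch. That gives 5 + 9 = 14 probes instead of 15.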
8
The problem
Consider a set of aggregation queries over a data stream that differ only in their grouping attributes. Determine an optimal sharing configuration for the queries under limited memory.
[Figure: lattice with the queries Q1 to Q4 (AB, BC, BD, CD) at the bottom and the candidate phantoms ABC, ABD, BCD and ABCD above them.]
Given the queries, decide:
- choice of phantoms
- space allocation
9
Idea of maintaining phantoms
- $x_1$: the collision rate without phantoms
- $x_1'$: the collision rate with phantoms
- $x_2$: the collision rate of phantom ABC
The total cost for n records:
- Without the phantom: $E_1 = 3nc_1 + 3x_1 nc_2$
- With the phantom: $E_2 = nc_1 + 3x_2 nc_1 + 3x_1' x_2 nc_2$
10
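Example
[Figure: without the phantom, hash tables A, B and C each get M/3 of the memory and have collision rate $x_1$. With phantom ABC, each of the four tables gets M/4, ABC has collision rate $x_2$, and A, B and C have collision rate $x_1'$. Probes cost c1 and evictions cost c2.]
To be fair, the total space used for the hash tables should be the same with or without the phantoms.
Cost per record:
$E_1 = 3c_1 + 3x_1 c_2$
$E_2 = c_1 + 3x_2 c_1 + 3x_1' x_2 c_2$
$E_1 - E_2 = (2 - 3x_2)c_1 + 3(x_1 - x_1' x_2)c_2 = F(x_1, x_2, x_1')$
When $x_2 \approx 0$, the phantom reduces the cost.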
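The per-record cost comparison can be checked numerically. The sketch below codes the three formulas from the slide directly. The function names and the sample values of c1, c2 and the collision rates are hypothetical, chosen only to show a case where the phantom pays off.

```python
def cost_without_phantom(x1, c1, c2):
    """E1: three probes per record, plus evictions at collision rate x1."""
    return 3 * c1 + 3 * x1 * c2


def cost_with_phantom(x1p, x2, c1, c2):
    """E2: one probe into ABC; a fraction x2 is evicted into A, B and C."""
    return c1 + 3 * x2 * c1 + 3 * x1p * x2 * c2


def benefit(x1, x1p, x2, c1, c2):
    """E1 - E2 in the factored form given on the slide."""
    return (2 - 3 * x2) * c1 + 3 * (x1 - x1p * x2) * c2


# Hypothetical values: evictions far more expensive than probes.
c1, c2 = 1.0, 50.0
x1, x1p, x2 = 0.2, 0.3, 0.1
e1 = cost_without_phantom(x1, c1, c2)
e2 = cost_with_phantom(x1p, x2, c1, c2)
print(e1, e2, benefit(x1, x1p, x2, c1, c2))
```

A positive benefit means the phantom lowers the cost. It turns negative when $x_2$ is large enough, because then most records still reach A, B and C after paying the extra probe into ABC.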
11
The collision rate estimation
- g: number of groups of a relation
- b: number of buckets in the hash table
- Example: g=3000, b=1000
- The probability of k groups out of g being hashed to a bucket
- B_k: the number of buckets having k groups
- n_rg: the expected number of records for each group
- (1-1/k): the collision rate in a bucket that holds k groups
- Key point: collisions happen within a bucket
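The quantities listed on this slide can be combined into a collision-rate estimate. The sketch below is one way to do that and rests on stated assumptions: groups are hashed uniformly at random (so the number of groups in a bucket is binomial), every group has the same expected number of records n_rg, and a record landing in a bucket with k groups collides with probability (1 - 1/k). The function names and the closed form used as a cross-check are not taken from the slide.

```python
import math


def expected_buckets_with_k_groups(g, b, k):
    """B_k: expected number of buckets holding exactly k of the g groups."""
    # Binomial probability computed in log space to avoid overflow.
    log_p = (math.lgamma(g + 1) - math.lgamma(k + 1) - math.lgamma(g - k + 1)
             + k * math.log(1 / b) + (g - k) * math.log(1 - 1 / b))
    return b * math.exp(log_p)


def collision_rate(g, b):
    """Expected fraction of records that collide (requires b >= 2)."""
    # A bucket with k groups receives k * n_rg records and a fraction
    # (1 - 1/k) of them collide; n_rg cancels against the g * n_rg total.
    return sum(expected_buckets_with_k_groups(g, b, k) * k * (1 - 1 / k)
               for k in range(2, g + 1)) / g


def collision_rate_closed_form(g, b):
    """Algebraically equivalent form of the sum above."""
    return 1 - (b / g) * (1 - (1 - 1 / b) ** g)


print(collision_rate(3000, 1000), collision_rate_closed_form(3000, 1000))
```

For the slide's example of g=3000 and b=1000, both forms give a collision rate of roughly 0.68 under these assumptions.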
12
Algorithmic strategies for choosing the phantoms
Benefit = the difference between the maintenance costs without and with the phantom.
Greedy by Increasing Collision Rate
- Initially, the configuration I includes only the queries.
- We calculate the maintenance cost if a phantom R is added to I.
- By comparing it with the maintenance cost when R is not in I, we get the benefit of R.
- After we add this phantom to I, we iterate with the other phantoms.
- As more phantoms are added to I, the overall collision rate goes up and the benefit decreases.
- Stop when the benefit becomes negative.
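The greedy loop described above can be sketched as follows. The cost model is passed in as a function, since the slide does not spell it out. The name `greedy_choose_phantoms` and the toy cost table are hypothetical, and the loop stops as soon as no phantom lowers the cost.

```python
def greedy_choose_phantoms(queries, candidates, cost):
    """Add phantoms one at a time while the best benefit stays positive.

    cost(configuration) is a caller-supplied (hypothetical) cost model that
    returns the maintenance cost of a frozenset of relations.
    """
    config = frozenset(queries)
    remaining = set(candidates)
    chosen = []
    while remaining:
        base = cost(config)
        # Benefit = cost without the phantom minus cost with it.
        best = max(sorted(remaining),
                   key=lambda r: base - cost(config | {r}))
        if base - cost(config | {best}) <= 0:
            break                    # no phantom lowers the cost any more
        config = config | {best}
        remaining.remove(best)
        chosen.append(best)
    return chosen


# Toy cost table (made-up numbers) for queries AB and BC.
toy_cost = {
    frozenset({"AB", "BC"}): 10.0,
    frozenset({"AB", "BC", "ABC"}): 7.0,
    frozenset({"AB", "BC", "ABCD"}): 8.0,
    frozenset({"AB", "BC", "ABC", "ABCD"}): 9.0,
}
print(greedy_choose_phantoms(["AB", "BC"], ["ABC", "ABCD"], toy_cost.__getitem__))
```

In the toy table, ABC is added first with a benefit of 3. Adding ABCD afterwards would raise the cost from 7 to 9, so the loop stops.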
13
Algorithmic strategies for choosing the phantoms
Greedy by Increasing Collision Rate (Linear Proportional Allocation)
[Figure: lattice of the queries Q1 to Q4 (AB, BC, BD, CD) and the phantoms ABC, ABD, BCD, ABCD, annotated with group counts: ABCD g=2837; three-attribute phantoms g=2117, 2387, 2249; AB g=1846; BC, BD, CD g=1946, 1899, 1999.]
Available memory = 12000
Queries only:
- Allocate AB = (1846/7690)*12000
- Allocate BC …
- Allocate BD …
- Allocate CD …
Try ABCD:
- Allocate ABCD = (2837/10527)*12000
- Allocate AB = (1846/10527)*12000
- Allocate BC …
- Allocate BD …
- Allocate CD …
The allocation b_ABCD determines the collision rate x_ABCD, which determines the benefit $E_1 - E_2 = F(x_1, x_2, x_1')$.
The process ends when the benefit becomes negative.
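The allocation arithmetic on this slide is a plain proportional split, sketched below. The function name is illustrative. Mapping the counts 1946, 1899 and 1999 to BC, BD and CD in that order is an assumption based on how they appear on the slide; only AB=1846 and ABCD=2837 are confirmed by the slide's own formulas.

```python
def linear_proportional_allocation(groups, memory):
    """Give each relation a share of memory proportional to its group count."""
    total = sum(groups.values())
    return {name: memory * g / total for name, g in groups.items()}


# Group counts from the slide (BC/BD/CD assignment is an assumption).
queries = {"AB": 1846, "BC": 1946, "BD": 1899, "CD": 1999}

only_queries = linear_proportional_allocation(queries, 12000)
with_abcd = linear_proportional_allocation({**queries, "ABCD": 2837}, 12000)
print(only_queries["AB"])    # (1846/7690)*12000
print(with_abcd["ABCD"])     # (2837/10527)*12000
print(with_abcd["AB"])       # (1846/10527)*12000
```

Adding ABCD shrinks every query's share, which is why the collision rates go up as more phantoms enter the configuration.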
14
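Space Allocation
[Figure: two-level graph in which phantom AB feeds the queries A and B, labeled with the collision rates x0, x1 and x2.]
- Set the partial derivatives of the cost e to 0 to find the allocation at which e is minimal.
- Thereby, the space allocated is proportional to the square root of the number of groups.
- This is the optimal solution for the two-level graph.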
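A square-root proportional split can be compared with a linear one on a simple proxy. The sketch below assumes, purely for illustration, that a table's collision rate grows roughly in proportion to g/b, so the quantity to minimize is the sum of g/b over all tables under a fixed total space. Under that assumption, setting the partial derivatives to zero (with a Lagrange multiplier for the space constraint) gives b proportional to the square root of g. The function names and group counts are hypothetical.

```python
import math


def sqrt_proportional_allocation(groups, memory):
    """Give each relation space proportional to sqrt(number of groups)."""
    total = sum(math.sqrt(g) for g in groups.values())
    return {name: memory * math.sqrt(g) / total for name, g in groups.items()}


def linear_proportional_allocation(groups, memory):
    """Give each relation space proportional to its number of groups."""
    total = sum(groups.values())
    return {name: memory * g / total for name, g in groups.items()}


def collision_proxy(groups, alloc):
    """Assumed proxy for total collisions: sum of g / b over all tables."""
    return sum(g / alloc[name] for name, g in groups.items())


groups = {"A": 100, "B": 400, "C": 2500}   # hypothetical group counts
by_sqrt = sqrt_proportional_allocation(groups, 8000)
by_linear = linear_proportional_allocation(groups, 8000)
print(by_sqrt, collision_proxy(groups, by_sqrt))
print(by_linear, collision_proxy(groups, by_linear))
```

On this proxy the square-root split never does worse than the linear split. The actual cost model in the talk also weighs probes and evictions, so the comparison shows the shape of the argument only.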
15
Algorithmic strategies for choosing the phantoms
- One way of allocating hash table space to a relation is in proportion to the number of groups in the relation.
- For a relation with g groups, we can allocate space equal to g times a constant, and we set that constant large.
16
Algorithmic strategies for choosing the phantoms
Greedy by Increasing Space
- We calculate the benefit of each phantom according to the cost model.
- We calculate the benefit per unit space for each phantom R: its benefit divided by the space allocated to R.
- We choose the phantom with the largest benefit per unit space as the first phantom to instantiate.
- The process ends when the benefit per unit space becomes negative.
17
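Algorithmic strategies for choosing the phantoms
Greedy by Increasing Space
[Figure: lattice of the queries Q1 to Q4 (AB, BC, BD, CD) and the phantoms ABC, ABD, BCD, ABCD, annotated with group counts: ABCD g=2837; three-attribute phantoms g=2117, 2387, 2249; AB g=1846; BC, BD, CD g=1946, 1899, 1999. Candidate phantoms are labeled Benefit=2, Benefit=1, Benefit=-1.]
Metric: benefit/space, with benefit $E_1 - E_2 = (2 - 3x_2)c_1 + 3(x_1 - x_1' x_2)c_2$
Available memory = 12000
- Space left after the queries: 12000-7690 = 4310
- Try ABCD: 4310-2837 = 1473
The process ends when
1. the benefit becomes negative, or
2. the space is exhausted.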
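The space bookkeeping in this example can be sketched as a greedy loop over benefit per unit space. Everything beyond the slide's arithmetic is an assumption: the function name, the `factor` parameter (space = factor × g, set to 1 to match 12000-7690), the mapping of group counts to BC, BD, CD and to the three-attribute phantoms, and the benefit values, which are made up because the slide does not say which phantom each benefit label belongs to.

```python
def greedy_by_space(query_groups, phantom_groups, memory, benefit, factor=1.0):
    """Pick phantoms by benefit per unit space until none helps or fits.

    Each relation with g groups gets factor * g units of space.
    benefit(chosen, name) is a caller-supplied (hypothetical) cost model.
    """
    free = memory - factor * sum(query_groups.values())
    chosen = []
    candidates = dict(phantom_groups)
    while candidates:
        fitting = {n: factor * g for n, g in candidates.items()
                   if factor * g <= free}
        if not fitting:
            break                      # the space is exhausted
        best = max(sorted(fitting),
                   key=lambda n: benefit(chosen, n) / fitting[n])
        if benefit(chosen, best) < 0:
            break                      # the benefit became negative
        chosen.append(best)
        free -= fitting[best]
        del candidates[best]
    return chosen, free


queries = {"AB": 1846, "BC": 1946, "BD": 1899, "CD": 1999}
phantoms = {"ABCD": 2837, "ABC": 2117, "ABD": 2387, "BCD": 2249}
made_up_benefit = {"ABCD": 2.0, "ABC": 1.0, "ABD": -1.0, "BCD": -1.0}

chosen, free = greedy_by_space(
    queries, phantoms, 12000, lambda done, name: made_up_benefit[name])
print(chosen, free)
```

With these inputs ABCD is instantiated first, leaving 1473 units. No remaining phantom fits in that space, so the loop ends on the second stopping condition.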
18
Drawback
The proportionality constant used for space allocation needs to be tuned to find the best performance.
19
Space Allocation
- According to Abel's impossibility theorem, general polynomial equations of degree higher than 4 cannot be solved algebraically; we call such cases unsolvable.
- More general multi-level configurations generate equations of even higher order, which are unsolvable.
- We use heuristics, based on the available analysis, to decide the space allocation for these unsolvable cases.
20
Space Allocation
- Super-node with Linear Combination
- Super-node with Square Root Combination
- Linear Proportional Allocation
- Square Root Proportional Allocation
21
Conclusion
- We address the problem of efficiently computing multiple aggregations over high-speed data streams.
- In a real DSMS, the value of "g" is unknown.