Published by Joseph Wilkerson. Modified over 8 years ago.

February 14, 2006, CS6392 - DB Exploration
Congressional Samples for Approximate Answering of Group-By Queries
Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala
Presented by: Daniel Kuang

Outline
Problems with Group-By queries
Congressional sampling
Rewriting
Performance
Conclusion

Problems with Group-By Queries
Decision support queries routinely segment the data into groups. For example, a group-by query on the U.S. census database could be used to determine the per capita income per state.
However, group sizes can differ hugely; for example, California has nearly 70 times the population of Wyoming.
As a result, a uniform random sample of the relation contains disproportionately few tuples from the smaller groups. Answers for those groups are then inaccurate, because accuracy depends heavily on the number of sample tuples that belong to the group.
For a uniform sample, the standard error is inversely proportional to √n, where n is the number of tuples in the uniform random sample.
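To make the accuracy argument concrete, the following minimal sketch computes the expected number of sample tuples per group under uniform sampling, together with the corresponding 1/√n factor. The group sizes and the sampling rate are hypothetical, chosen only to mirror the roughly 70:1 ratio mentioned above; they are not census figures, and the function names are illustrative.

```python
import math

def expected_sample_counts(group_sizes, sampling_rate):
    """Expected number of sample tuples per group under uniform random sampling."""
    return {g: size * sampling_rate for g, size in group_sizes.items()}

def relative_error_factor(n):
    """Standard error scales as 1/sqrt(n) for n sample tuples."""
    return 1.0 / math.sqrt(n)

# Hypothetical group sizes with a 70:1 ratio (illustrative, not real data).
group_sizes = {"large_group": 7_000_000, "small_group": 100_000}

counts = expected_sample_counts(group_sizes, sampling_rate=0.001)
factors = {g: relative_error_factor(n) for g, n in counts.items()}

# How much larger the small group's error factor is than the large group's.
error_ratio = factors["small_group"] / factors["large_group"]
print(counts)
print(round(error_ratio, 2))
```

Under these assumptions the small group receives 70 times fewer sample tuples, so its error factor is larger by √70, roughly 8.4 times.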

Solution (Congressional Sampling)
Consider the U.S. Congress, which is a hybrid of the House and the Senate.
The House has representatives from each state in proportion to its population.
The Senate has an equal number of representatives from each state.
Apply the same House and Senate scheme to represent the different groups:
House sample: a uniform random sample of the relation, so each group is represented in proportion to its size.
Senate sample: an equal number of tuples sampled from each group.
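The two allocations can be sketched as follows. This is an illustration only: the function names and the three group sizes are hypothetical, and the House allocation is modeled by its expected value (sample space proportional to group size).

```python
def house_allocation(group_sizes, sample_size):
    """House: expected sample space per group is proportional to group size."""
    total = sum(group_sizes.values())
    return {g: sample_size * size / total for g, size in group_sizes.items()}

def senate_allocation(group_sizes, sample_size):
    """Senate: every group receives the same share of the sample space."""
    share = sample_size / len(group_sizes)
    return {g: share for g in group_sizes}

# Hypothetical group sizes, chosen only to show the contrast.
group_sizes = {"g1": 6000, "g2": 3000, "g3": 1000}

house = house_allocation(group_sizes, 100)
senate = senate_allocation(group_sizes, 100)
print(house)
print(senate)
```

With these sizes the House gives the smallest group 10 of the 100 sample tuples, while the Senate gives every group about 33.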

Solution (Congressional Sampling)
Consider a relation R with two grouping attributes, A and B.
Number of tuples per group: (a1, b1) = 3000, (a1, b2) = 3000, (a1, b3) = 1500, (a2, b3) = 2500.
Basic Congress (sample size = 100):

A    B    House Sg,∅   Senate Sg,AB   Basic Congress before scaling   Basic Congress
a1   b1   30           25             30                              27.3
a1   b2   30           25             30                              27.3
a1   b3   15           25             25                              22.7
a2   b3   25           25             25                              22.7
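The Basic Congress column can be reproduced with a short sketch. The rule used here (take the larger of the House and Senate allocations for each group, then scale everything down so the total equals the sample size) is an assumption inferred from the figures in the table, not something stated on the slide; the function name is illustrative.

```python
def basic_congress(group_sizes, sample_size):
    """Per group: max(House, Senate), then scale so the total is sample_size."""
    total = sum(group_sizes.values())
    house = {g: sample_size * n / total for g, n in group_sizes.items()}
    senate = {g: sample_size / len(group_sizes) for g in group_sizes}
    # Assumed combination rule: keep the larger of the two allocations.
    before_scaling = {g: max(house[g], senate[g]) for g in group_sizes}
    scale = sample_size / sum(before_scaling.values())
    return {g: before_scaling[g] * scale for g in group_sizes}

# Group sizes from the example relation R.
group_sizes = {
    ("a1", "b1"): 3000,
    ("a1", "b2"): 3000,
    ("a1", "b3"): 1500,
    ("a2", "b3"): 2500,
}

allocation = basic_congress(group_sizes, 100)
rounded = {g: round(x, 1) for g, x in allocation.items()}
print(rounded)
```

Before scaling, the four maxima are 30, 30, 25 and 25, which sum to 110; multiplying by 100/110 gives the 27.3 and 22.7 values in the last column.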

Solution (Congressional Sampling)

A    B    House Sg,∅   Senate Sg,AB   Basic Congress before scaling   Basic Congress
a1   b1   30           25             30                              27.3
a1   b2   30           25             30                              27.3
a1   b3   15           25             25                              22.7
a2   b3   25           25             25                              22.7

A    B    Sg,A         Sg,B                Congress before scaling   Congress
a1   b1   20 (of 50)   33.3                33.3                      23.5
a1   b2   20 (of 50)   33.3                33.3                      23.5
a1   b3   10 (of 50)   12.5 (of 33.3)      25                        17.7
a2   b3   50           20.8 (of 33.3)      50                        35.3
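The Congress column can be approximated with the sketch below. It assumes the following rule, inferred from the numbers in the table rather than stated on the slide: for every subset of the grouping attributes, split the sample space equally among the groups that subset induces, divide each share among the finer (A, B) groups in proportion to their sizes, keep the per-group maximum over all subsets, and scale to the sample size. The function name is illustrative.

```python
from itertools import combinations

def congress(group_sizes, sample_size):
    """Max over all attribute subsets of the senate-style allocation, then scale."""
    num_attrs = len(next(iter(group_sizes)))
    before_scaling = {g: 0.0 for g in group_sizes}
    for r in range(num_attrs + 1):
        for subset in combinations(range(num_attrs), r):
            # Sizes of the coarser groups induced by this attribute subset.
            coarse = {}
            for g, n in group_sizes.items():
                key = tuple(g[i] for i in subset)
                coarse[key] = coarse.get(key, 0) + n
            per_coarse = sample_size / len(coarse)
            for g, n in group_sizes.items():
                key = tuple(g[i] for i in subset)
                # Share of the coarse group's space, proportional to size.
                share = per_coarse * n / coarse[key]
                before_scaling[g] = max(before_scaling[g], share)
    scale = sample_size / sum(before_scaling.values())
    return {g: v * scale for g, v in before_scaling.items()}

# Group sizes from the example relation R.
group_sizes = {
    ("a1", "b1"): 3000,
    ("a1", "b2"): 3000,
    ("a1", "b3"): 1500,
    ("a2", "b3"): 2500,
}

allocation = congress(group_sizes, 100)
print({g: round(x, 2) for g, x in allocation.items()})
```

The empty subset reproduces the House column and the full subset {A, B} reproduces the Senate column. For (a1, b3) this sketch yields about 17.65 rather than the 17.7 in the table; the small gap appears consistent with the table rounding 33.3 before scaling.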

Congressional Sampling
Basic Congress: sample size allocated to each group
Congress: sample size allocated to each group

Rewriting
Query rewriting involves two key steps:
a) scaling up the aggregate expressions
b) deriving error bounds on the estimate
Let the ScaleFactor (SF) of a tuple be the inverse of the sampling rate for its stratum.
Two ways to associate each tuple with its ScaleFactor:
a) store the SF with each tuple in the sample relation
b) use a separate table to store the SFs for the groups

Relation Rel with two example tuples:

Key   Grouping columns   Aggregate column
K     A    B    C        Q
k1    a1   b1   c1       q1
k2    a1   b1   c2       q2

Example query:
    SELECT A, B, SUM(Q)
    FROM Rel
    GROUP BY A, B
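The two storage options can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the table names SampRel, SampData and AuxRel, the Q values and the ScaleFactors are all hypothetical, and the strata are assumed to be the groups on (A, B, C). It shows only the scale-up step, not the error bounds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Option (a): the ScaleFactor is stored with every sample tuple.
cur.execute("CREATE TABLE SampRel (K TEXT, A TEXT, B TEXT, C TEXT, Q REAL, SF REAL)")
cur.executemany(
    "INSERT INTO SampRel VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("k1", "a1", "b1", "c1", 10.0, 4.0),  # hypothetical values
        ("k2", "a1", "b1", "c2", 20.0, 2.0),
        ("k3", "a2", "b1", "c1", 5.0, 5.0),
    ],
)
integrated = cur.execute(
    "SELECT A, B, SUM(Q * SF) FROM SampRel GROUP BY A, B ORDER BY A, B"
).fetchall()

# Option (b): ScaleFactors live in a separate table, one row per stratum.
cur.execute("CREATE TABLE SampData (K TEXT, A TEXT, B TEXT, C TEXT, Q REAL)")
cur.execute("INSERT INTO SampData SELECT K, A, B, C, Q FROM SampRel")
cur.execute("CREATE TABLE AuxRel (A TEXT, B TEXT, C TEXT, SF REAL)")
cur.execute("INSERT INTO AuxRel SELECT DISTINCT A, B, C, SF FROM SampRel")
normalized = cur.execute(
    "SELECT s.A, s.B, SUM(s.Q * x.SF) "
    "FROM SampData s JOIN AuxRel x "
    "ON s.A = x.A AND s.B = x.B AND s.C = x.C "
    "GROUP BY s.A, s.B ORDER BY s.A, s.B"
).fetchall()

print(integrated)
print(normalized)
conn.close()
```

Both queries return the same scaled-up sums. Option (a) avoids a join at query time, while option (b) stores each ScaleFactor once per stratum instead of once per tuple.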

Rewriting (Integrated Rewriting)

Normalized Rewriting

Key-normalized Rewriting

Nested-integrated Rewriting

Performance
Three queries
Grouping on returnflag, linestatus, shipdate
Skewed group sizes, z = 1.5
Sample percentage at 7%

Performance
Times taken for different sample percentages
Actual query time = 40 sec

Conclusions
Congressional samples are effective for group-by queries with arbitrary group-bys (including none).