February 14, 2006
CS DB Exploration
Congressional Samples for Approximate Answering of Group-By Queries
Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala
Presented by: Daniel Kuang
Outline
Problems with Group-By queries
Congressional sampling
Rewriting
Performance
Conclusion
Problems with Group-By Queries
Decision support queries routinely segment the data into groups. For example, a group-by query on the U.S. census database could be used to determine the per capita income per state.
However, group sizes can differ enormously: the state of California has nearly 70 times the population of Wyoming. As a result, a uniform random sample of the relation contains disproportionately few tuples from the smaller groups. Answers for those groups are then inaccurate, because the accuracy of a group's estimate depends heavily on the number of sample tuples that belong to that group.
For a uniform random sample, the standard error is inversely proportional to √n, where n is the sample size.
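To make the effect concrete, the sketch below computes how many sample tuples two groups can expect from a uniform sample, and the resulting ratio of their standard errors under the 1/√n rule. The 70:1 size ratio is the one from the example; the absolute group sizes, the 1% sampling rate, and the function names are illustrative assumptions.

```python
import math

def expected_group_sample_sizes(group_sizes, sampling_rate):
    """Expected number of sample tuples per group under uniform sampling."""
    return {g: n * sampling_rate for g, n in group_sizes.items()}

def relative_standard_error(n_sample):
    """Standard error up to a constant factor: proportional to 1/sqrt(n)."""
    return 1.0 / math.sqrt(n_sample)

# Hypothetical sizes that keep the 70:1 ratio from the example.
group_sizes = {"large": 70_000, "small": 1_000}
expected = expected_group_sample_sizes(group_sizes, sampling_rate=0.01)

# How much larger the small group's error is compared with the large group's.
error_ratio = (relative_standard_error(expected["small"])
               / relative_standard_error(expected["large"]))

print(expected)
print(f"small-group error is about {error_ratio:.1f}x the large-group error")
```

Under these assumptions the error ratio equals √70, so the smaller group's estimate is several times less accurate even though both groups are sampled at the same rate.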
Solution (Congressional Sampling)
Consider the U.S. Congress, which is a hybrid of the House and the Senate. The House has representatives from each state in proportion to its population. The Senate has an equal number of representatives from each state.
Apply the same House and Senate scheme to represent the different groups:
House sample: a uniform random sample of the relation, so each group is represented in proportion to its size.
Senate sample: an equal number of tuples sampled from each group.
Solution (Congressional Sampling)
Consider a relation R with two grouping attributes, A and B.
Number of tuples for the groups: (a1, b1) – 3000, (a1, b2) – 3000, (a1, b3) – 1500, (a2, b3)
Basic Congress (sample size = 100)
[Table: columns A | B | House Sg,∅ | Senate Sg,AB | Basic Congress before scaling | Basic Congress; rows (a1, b1), (a1, b2), (a1, b3), (a2, b3); cell values missing]
Solution (Congressional Sampling)
[Table: columns A | B | House Sg,∅ | Senate Sg,AB | Basic Congress before scaling | Basic Congress | Sg,A | Sg,B | Congress before scaling | Congress; rows (a1, b1), (a1, b2), (a1, b3), (a2, b3). Surviving entries: 20 (of 50), 12.5 (of 33.3), 50, 35.3]
Congressional Sampling
Basic Congress: sample size allocated to each group
Congress: sample size allocated to each group
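One way to write the two allocations, consistent with the House and Senate description on the earlier slides, is given below. The notation is an assumption, not taken from the slide: X is the sample size, R the relation, n_g the number of tuples in group g, m the number of groups, G the set of grouping attributes, m_T the number of groups when grouping only on the subset T, and h(g,T) the group under T that contains g.

```latex
% Basic Congress: larger of the House and Senate shares, scaled to fit X
\mathrm{SampleSize}_{\text{basic}}(g)
  = X \cdot \frac{\max\left(\frac{n_g}{|R|},\ \frac{1}{m}\right)}
                 {\sum_{j} \max\left(\frac{n_j}{|R|},\ \frac{1}{m}\right)}

% Share of g when the sample is split equally over the groups of T
s_{g,T} = \frac{X}{m_T} \cdot \frac{n_g}{n_{h(g,T)}}

% Congress: maximum share over all subsets T of G, scaled to fit X
\mathrm{SampleSize}(g)
  = X \cdot \frac{\max_{T \subseteq G} s_{g,T}}
                 {\sum_{j} \max_{T \subseteq G} s_{j,T}}
```

With T = ∅ the share s_{g,T} reduces to the House allocation, and with T = G it reduces to the Senate allocation.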
Rewriting
Query rewriting involves two key steps: a) scaling up the aggregate expressions and b) deriving error bounds on the estimate.
Let the ScaleFactor of a tuple be the inverse of the sampling rate for its stratum.
How to associate each tuple with its ScaleFactor:
a) store the ScaleFactor (SF) with each tuple in the sample relation
b) use a separate table to store the ScaleFactors for the groups
Relation Rel with two example tuples:
Key   Grouping columns   Aggregate column
K     A    B    C        Q
k1    a1   b1   c1       q1
k2    a1   b1   c2       q2
Query: Select A, B, sum(Q) From Rel Group by A, B
Rewriting (Integrated Rewriting)
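A minimal sketch of what integrated rewriting could look like, assuming the ScaleFactor is stored as an extra column SF on every sample tuple, so the aggregate is scaled inline. The table name SampRel, the helper name, and the data values are hypothetical; SQLite is used only to make the example runnable.

```python
import sqlite3

def integrated_sum(rows):
    """rows: (A, B, Q, SF) sample tuples. Returns {(A, B): estimated SUM(Q)}."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE SampRel (A TEXT, B TEXT, Q REAL, SF REAL)")
    con.executemany("INSERT INTO SampRel VALUES (?, ?, ?, ?)", rows)
    # Original query: SELECT A, B, SUM(Q) FROM Rel GROUP BY A, B
    # Rewritten: each sampled value is scaled by its own ScaleFactor.
    cur = con.execute("SELECT A, B, SUM(Q * SF) FROM SampRel GROUP BY A, B")
    result = {(a, b): s for a, b, s in cur}
    con.close()
    return result

# Two tuples of group (a1, b1) come from different strata, hence different SFs.
sample = [("a1", "b1", 10.0, 4.0), ("a1", "b1", 20.0, 2.0), ("a1", "b2", 5.0, 10.0)]
estimates = integrated_sum(sample)
print(estimates)
```

The query needs no join, at the cost of storing SF with every tuple and doing one multiplication per sample tuple.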
Normalized Rewriting
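A minimal sketch of what normalized rewriting could look like, assuming the ScaleFactors live in a separate table with one row per stratum, joined to the sample on the grouping columns. The table names SampRel and AuxRel, the helper name, and the data values are hypothetical.

```python
import sqlite3

def normalized_sum(sample_rows, scale_rows):
    """sample_rows: (A, B, C, Q); scale_rows: (A, B, C, SF), one per stratum."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE SampRel (A TEXT, B TEXT, C TEXT, Q REAL)")
    con.execute("CREATE TABLE AuxRel (A TEXT, B TEXT, C TEXT, SF REAL)")
    con.executemany("INSERT INTO SampRel VALUES (?, ?, ?, ?)", sample_rows)
    con.executemany("INSERT INTO AuxRel VALUES (?, ?, ?, ?)", scale_rows)
    # Look up each tuple's ScaleFactor by joining on all grouping columns.
    cur = con.execute(
        "SELECT s.A, s.B, SUM(s.Q * x.SF) "
        "FROM SampRel s JOIN AuxRel x "
        "ON s.A = x.A AND s.B = x.B AND s.C = x.C "
        "GROUP BY s.A, s.B"
    )
    result = {(a, b): total for a, b, total in cur}
    con.close()
    return result

sample_rows = [("a1", "b1", "c1", 10.0), ("a1", "b1", "c2", 20.0)]
scale_rows = [("a1", "b1", "c1", 4.0), ("a1", "b1", "c2", 2.0)]
estimates = normalized_sum(sample_rows, scale_rows)
print(estimates)
```

Keeping the ScaleFactors in one place makes them cheap to update when the sample changes, but every query pays for a join.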
Key-normalized Rewriting
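One plausible reading of key-normalized rewriting is that each stratum gets a short identifier, stored both with the sample tuples and in the ScaleFactor table, so the join uses a single key instead of all grouping columns. The sketch below follows that reading; the column name GID, the table names, the helper name, and the data values are assumptions.

```python
import sqlite3

def key_normalized_sum(sample_rows, scale_rows):
    """sample_rows: (A, B, Q, GID); scale_rows: (GID, SF), one per stratum."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE SampRel (A TEXT, B TEXT, Q REAL, GID INTEGER)")
    con.execute("CREATE TABLE AuxRel (GID INTEGER PRIMARY KEY, SF REAL)")
    con.executemany("INSERT INTO SampRel VALUES (?, ?, ?, ?)", sample_rows)
    con.executemany("INSERT INTO AuxRel VALUES (?, ?)", scale_rows)
    # The join condition is a single integer key rather than every grouping column.
    cur = con.execute(
        "SELECT s.A, s.B, SUM(s.Q * x.SF) "
        "FROM SampRel s JOIN AuxRel x ON s.GID = x.GID "
        "GROUP BY s.A, s.B"
    )
    result = {(a, b): total for a, b, total in cur}
    con.close()
    return result

sample_rows = [("a1", "b1", 10.0, 1), ("a1", "b1", 20.0, 2), ("a1", "b2", 5.0, 3)]
scale_rows = [(1, 4.0), (2, 2.0), (3, 10.0)]
estimates = key_normalized_sum(sample_rows, scale_rows)
print(estimates)
```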
Nested-integrated Rewriting
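One plausible reading of nested-integrated rewriting is that SF is still stored with every tuple, but an inner query first sums Q per (group, SF) pair and the outer query scales each partial sum once. The sketch below follows that reading; the table name SampRel, the alias SQ, the helper name, and the data values are assumptions.

```python
import sqlite3

def nested_integrated_sum(rows):
    """rows: (A, B, Q, SF) sample tuples. Returns {(A, B): estimated SUM(Q)}."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE SampRel (A TEXT, B TEXT, Q REAL, SF REAL)")
    con.executemany("INSERT INTO SampRel VALUES (?, ?, ?, ?)", rows)
    # Inner query: unscaled sum per (A, B, SF).
    # Outer query: one multiplication per partial sum instead of one per tuple.
    cur = con.execute(
        "SELECT A, B, SUM(SQ * SF) FROM ("
        "  SELECT A, B, SF, SUM(Q) AS SQ FROM SampRel GROUP BY A, B, SF"
        ") AS t GROUP BY A, B"
    )
    result = {(a, b): total for a, b, total in cur}
    con.close()
    return result

sample = [
    ("a1", "b1", 10.0, 4.0),
    ("a1", "b1", 30.0, 4.0),  # same stratum as the row above
    ("a1", "b1", 20.0, 2.0),
]
estimates = nested_integrated_sum(sample)
print(estimates)
```

The result matches the integrated form, since (10 + 30) * 4 + 20 * 2 equals 10 * 4 + 30 * 4 + 20 * 2.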
Performance
Three queries
Grouping on returnflag, linestatus, shipdate
Skewed group sizes, z = 1.5
Sample percentage at 7%
Performance
Times taken for different sample percentages
Actual query time = 40 sec
Conclusions
Congressional samples are effective for group-by queries with arbitrary group-bys (including none).