Published by April Kelley. Modified over 8 years ago.
1
Sameer Agarwal, Aurojit Panda, Barzan Mozafari, Samuel Madden, Ion Stoica
2
Objective
  Offer bounded response times and bounded error on queries
  Use precomputed samples
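To make "bounded error" concrete, here is a minimal sketch of answering an average query from a uniform random sample with an error bar. It assumes a normal-approximation confidence interval; the function names `approx_mean` and `sample_size_for_error` and the synthetic data are illustrative and do not come from the slides.

```python
import math
import random
import statistics

def approx_mean(sample, z=1.96):
    """Return (estimate, half_width) for the mean of a uniform random sample.

    z=1.96 gives a roughly 95% normal-approximation confidence interval.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean, from the sample standard deviation.
    half_width = z * statistics.stdev(sample) / math.sqrt(n)
    return mean, half_width

def sample_size_for_error(pilot, target_half_width, z=1.96):
    """Sample size expected to reach the target half-width, based on a pilot sample."""
    s = statistics.stdev(pilot)
    return math.ceil((z * s / target_half_width) ** 2)

# Synthetic data, for illustration only.
random.seed(0)
population = [random.gauss(100.0, 15.0) for _ in range(100_000)]
sample = random.sample(population, 1_000)

estimate, err = approx_mean(sample)
print(f"mean ~ {estimate:.2f} +/- {err:.2f} (from {len(sample)} of {len(population)} rows)")
print("rows needed for +/- 0.5:", sample_size_for_error(sample, 0.5))
```

The error bar shrinks with the square root of the sample size, which is why a query can trade accuracy for response time by choosing how many sampled rows to read.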
3
And it works
4
Obvious questions
  How to accurately represent the data with samples?
    Data is generally not uniform
    Queries care about small fractions of the data, e.g., count the number of Republicans in San Francisco
  How to tolerate unseen queries?
    Caching results for queries that have already occurred does little for interactive exploration
  How to tolerate changing data?
    New data is continuously being added; how is that handled?
5
Stratified samples (deal with non-uniform data)
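A minimal sketch of the idea, assuming a per-stratum cap: every distinct value of the stratification column keeps at most `cap` rows, so rare groups survive intact while large groups are down-sampled and scaled back up at query time. The names `build_stratified_sample` and `approx_count`, the cap, and the toy rows are hypothetical and are not taken from the slides.

```python
import random
from collections import defaultdict

def build_stratified_sample(rows, key, cap, rng):
    """Keep at most `cap` rows per distinct value of row[key].

    Returns {stratum: (kept_rows, scale)}, where scale = stratum_size / kept,
    i.e. how many original rows each kept row stands for.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = {}
    for stratum, members in groups.items():
        kept = members if len(members) <= cap else rng.sample(members, cap)
        sample[stratum] = (kept, len(members) / len(kept))
    return sample

def approx_count(sample, predicate):
    """Estimate COUNT(*) WHERE predicate by scaling up the matching sampled rows."""
    return sum(scale * sum(1 for r in kept if predicate(r))
               for kept, scale in sample.values())

# Made-up data: one very large group and one tiny group.
rng = random.Random(0)
rows = ([{"city": "New York", "party": rng.choice("DR")} for _ in range(50_000)]
        + [{"city": "Galena", "party": rng.choice("DR")} for _ in range(40)])

sample = build_stratified_sample(rows, "city", cap=1_000, rng=rng)
galena_r = approx_count(sample, lambda r: r["city"] == "Galena" and r["party"] == "R")
print("Republicans in Galena:", galena_r)
```

A 2% uniform sample of the same table would be expected to contain fewer than one row from the small group, so the count for it would be badly off or missing; the capped stratified sample answers it exactly because the whole group fits under the cap.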
6
Optimization
  Can’t build stratified samples for everything; the set of samples grows too fast
  Optimize based on:
    How poorly uniform samples would perform (data skew)
    How likely the samples are to be used, based on query templates
    Storage costs for the samples
  By working at the granularity of the columns used in WHERE / GROUP BY clauses instead of whole queries, you increase tolerance of unseen queries
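The three factors above can be combined in many ways. The sketch below uses a simple greedy benefit-per-cost heuristic under a storage budget. This is a stand-in chosen for illustration, not necessarily the optimization the authors use, and the `Candidate` fields, scores, and budget are all made-up assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    columns: tuple   # column set appearing in WHERE / GROUP BY clauses
    skew: float      # how poorly a uniform sample would do (higher = worse)
    usage: float     # fraction of past query templates using these columns
    storage: float   # cost of storing a stratified sample on these columns

def choose_samples(candidates, budget):
    """Greedily pick candidates by (skew * usage) per unit of storage."""
    ranked = sorted(candidates,
                    key=lambda c: c.skew * c.usage / c.storage,
                    reverse=True)
    chosen, used = [], 0.0
    for c in ranked:
        # Skip any candidate that would exceed the remaining budget.
        if used + c.storage <= budget:
            chosen.append(c)
            used += c.storage
    return chosen

# Hypothetical candidates and scores.
candidates = [
    Candidate(("city",), skew=0.9, usage=0.5, storage=10.0),
    Candidate(("os",), skew=0.1, usage=0.6, storage=5.0),
    Candidate(("city", "os"), skew=0.95, usage=0.2, storage=40.0),
    Candidate(("date",), skew=0.4, usage=0.3, storage=8.0),
]
chosen = choose_samples(candidates, budget=20.0)
print([c.columns for c in chosen])
```

Because candidates are column sets rather than whole queries, a new query that filters or groups on `city` can reuse the chosen sample even if that exact query has never been run before.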
7
Changing data
  How to maintain guarantees with fast-changing data?
    Sampling is offline
  Consider a database of request latencies for a large system
    For locating errors, values abnormally larger than average are interesting
    They are poorly represented by uniform samples
    The interesting data might be the most recent data (samples take 5-30 min to generate)
  Could we run cheap queries at insert time to deduce whether the inserted data changes the distribution?
  Can we merge existing stratified samples and new data with predictable error?
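One possible direction for the last two questions, offered as a sketch and not as something the slides claim: maintain each stratum with reservoir sampling (Algorithm R), so every insert updates the sample in constant time and each stratum stays a uniform sample of everything seen so far. The class name `StratifiedReservoir`, the stratum labels, and the cap are hypothetical.

```python
import random
from collections import defaultdict

class StratifiedReservoir:
    """Keeps a uniform sample of at most `cap` rows per stratum, updated on insert."""

    def __init__(self, cap, seed=None):
        self.cap = cap
        self.rng = random.Random(seed)
        self.seen = defaultdict(int)    # rows observed so far, per stratum
        self.kept = defaultdict(list)   # sampled rows, per stratum

    def insert(self, stratum, row):
        self.seen[stratum] += 1
        bucket = self.kept[stratum]
        if len(bucket) < self.cap:
            bucket.append(row)
        else:
            # Algorithm R: the new row replaces a random slot
            # with probability cap / seen.
            j = self.rng.randrange(self.seen[stratum])
            if j < self.cap:
                bucket[j] = row

    def scale(self, stratum):
        """How many observed rows each kept row represents."""
        return self.seen[stratum] / len(self.kept[stratum])

# Toy stream: many ordinary requests, a few abnormally slow ones.
res = StratifiedReservoir(cap=100, seed=0)
for i in range(10_000):
    res.insert("fast", i)
for i in range(30):
    res.insert("slow", i)

print(len(res.kept["fast"]), res.scale("fast"))
print(len(res.kept["slow"]), res.scale("slow"))
```

In this toy stream the rare "slow" stratum is kept in full, while the common "fast" stratum is capped and carries a scale factor, so recent outliers are queryable immediately instead of waiting for an offline resampling run.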
8
Questions
  Where else can stratified samples help?
  Is this applicable to workloads on online data?
  Can the stratified samples be maintained online?
  Can we use a similar technique to obtain results from a degraded cluster where some of the data is unreachable?