Published by Cori Joseph. Modified over 9 years ago.
1
Page 1 Online Aggregation for Large MapReduce Jobs. Niketan Pansare, Vinayak Borkar, Chris Jermaine, Tyson Condie. VLDB 2011. IDS Fall Seminar (2011-11-11). Presented by Yang Byoung Ju.
2
Page 2 Online Aggregation (OLA) ▶ select avg(stock_price) from nasdaq_db where company = 'xyz'; ▶ Conventional DB: ▶ With OLA Extension: [0, 2000] with 95% probability After 1 second
3
Page 3 Online Aggregation (OLA) ▶ select avg(stock_price) from nasdaq_db where company = 'xyz'; ▶ Conventional DB: ▶ With OLA Extension: [900, 1100] with 95% probability After 2 minutes
4
Page 4 Online Aggregation (OLA) ▶ select avg(stock_price) from nasdaq_db where company = 'xyz'; ▶ Conventional DB: ▶ With OLA Extension: [995, 1005] with 95% probability After 10 minutes
5
Page 5 Online Aggregation (OLA) ▶ select avg(stock_price) from nasdaq_db where company = 'xyz'; ▶ Conventional DB: 1000 ▶ With OLA Extension: 1000 After 2 hours
6
Page 6 Online Aggregation (OLA)
▶ The user gets running estimates of an aggregate query
▶ At all times during query processing, the database system gives the user a statistically valid estimate of the final answer (e.g., output range estimate: [990, 1010] with 95% probability)
▶ Advantages
-Can get a reasonable answer very quickly (depends on the application)
-Can save time and computing resources
▶ Disadvantages
-Implementation requires changes to the database kernel
-In a self-managed system, decreased resource cost may not benefit the user directly
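The shrinking intervals on the previous pages can be illustrated with a minimal sketch. It assumes tuples arrive in random order, so the first n tuples form a simple random sample, and it uses a normal-approximation interval. The function name `running_estimates` and the synthetic prices are illustrative assumptions, not part of the slides or the paper.

```python
import math
import random


def running_estimates(values, checkpoints, z=1.96):
    """Yield (n, mean, low, high) once n tuples have been seen, for each n in checkpoints.

    Assumes `values` arrive in random order, so the first n tuples are a
    simple random sample; the interval is a normal-approximation (CLT) one.
    """
    total = 0.0
    total_sq = 0.0
    wanted = set(checkpoints)
    for n, v in enumerate(values, start=1):
        total += v
        total_sq += v * v
        if n in wanted and n > 1:
            mean = total / n
            # Unbiased sample variance from the running sums.
            var = max((total_sq - n * mean * mean) / (n - 1), 0.0)
            half = z * math.sqrt(var / n)
            yield n, mean, mean - half, mean + half


rng = random.Random(0)
# Hypothetical stock prices centered near 1000 (synthetic, not from the slides).
prices = [rng.gauss(1000.0, 200.0) for _ in range(100_000)]
estimates = list(running_estimates(prices, checkpoints=[100, 10_000, 100_000]))
for n, mean, low, high in estimates:
    print(f"after {n:>6} tuples: {mean:8.2f} in [{low:8.2f}, {high:8.2f}]")
```

The interval width shrinks roughly as 1/sqrt(n), which is why an early estimate can already be useful long before the query finishes.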
7
Page 7 Why ‘Online Aggregation’?
▶ OLA was proposed in 1997, but its commercial impact has been limited or even non-existent, for two reasons
-OLA requires extensive changes to the database kernel
-Saving resources has never been compelling
▶ Why OLA now?
-People are implementing all sorts of new databases these days
-Given the current move into the cloud, as a query runs, dollars flow from the end-user’s pocket to the cloud provider
8
Page 8 OLA in a distributed environment
▶ Classic OLA
-The set of data (tuples) processed at any point in the computation is a random subset of the data in the system
-Easy to estimate the final answer using standard statistical methods
▶ OLA at large scale
-The basic unit of data that is processed is a block (e.g., 64 MB)
-There is a lot of variation in the time taken to process each block
-This variation in processing time is tremendously important if it is correlated with the aggregate value of the block
9
Page 9 OLA in a distributed environment
▶ OLA at large scale (Cont.)
-Blocks with a lot of data may have greater aggregate values and take longer to process
-So the set of blocks completed at any particular point is more likely to have small values, leading to biased estimates -> “Inspection Paradox”
-This paper solves the ‘inspection paradox’ problem, making OLA possible in a distributed environment
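The bias described above can be reproduced with a small simulation. This is a sketch under strong, synthetic assumptions: each block's aggregate value x is exponential with mean 5, its processing time in seconds equals x, and 200 workers pull blocks from one shared queue. The function `snapshot_means` and all parameter values are hypothetical and do not come from the paper.

```python
import heapq
import random


def snapshot_means(num_blocks=5000, num_workers=200, snapshot_time=20.0, seed=1):
    """Return (true mean of x, mean of x over blocks completed by the snapshot, count)."""
    rng = random.Random(seed)
    # Synthetic assumption: value x ~ Exp(mean 5) and processing time == x seconds.
    values = [rng.expovariate(1 / 5.0) for _ in range(num_blocks)]
    rng.shuffle(values)  # blocks are handed out in random order

    # Min-heap of the times at which each worker becomes free.
    free_at = [0.0] * num_workers
    heapq.heapify(free_at)
    completed = []
    for x in values:
        start = heapq.heappop(free_at)
        if start >= snapshot_time:
            break  # no further block can start before the snapshot
        finish = start + x
        if finish <= snapshot_time:
            completed.append(x)  # only finished blocks are visible at the snapshot
        heapq.heappush(free_at, finish)
    true_mean = sum(values) / len(values)
    naive_mean = sum(completed) / len(completed)
    return true_mean, naive_mean, len(completed)


true_mean, naive_mean, n_done = snapshot_means()
print(f"true mean x = {true_mean:.2f}")
print(f"mean x over {n_done} completed blocks = {naive_mean:.2f}")
```

Even with a random block order, averaging only the completed blocks underestimates the true mean in this setup, because the slow, high-value blocks are the ones still running when the snapshot is taken.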
10
Page 10 Inspection Paradox ▶ In a renewal process, if we wait some predetermined time t and then observe how large the renewal interval containing t is, we should expect it to be typically larger than a renewal interval of average size.
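A standard renewal-theory identity makes this statement quantitative. The notation is an assumption added here, not taken from the slides: X is the length of a renewal interval, with intervals independent and identically distributed and a finite second moment, and X_insp is the interval that covers a randomly chosen inspection time in the long run.

```latex
\[
\mathbb{E}[X_{\mathrm{insp}}]
  = \frac{\mathbb{E}[X^{2}]}{\mathbb{E}[X]}
  = \mathbb{E}[X] + \frac{\operatorname{Var}(X)}{\mathbb{E}[X]}
  \;\ge\; \mathbb{E}[X]
\]
```

Equality holds only when Var(X) = 0, that is, when every interval has the same length.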
11
Page 11 Inspection Paradox ▶ Explanation #1 If we shoot arrows at random at targets of different sizes, more arrows will land on the larger targets
12
Page 12 Inspection Paradox
▶ Explanation #2
-Buses arrive at an average interval of 10 minutes. If you get to the bus stop at a random time, how long do you wait?
-5 minutes? Yes, if a bus arrives exactly every 10 minutes
-What if the arrival intervals are not uniform (random)? E.g., 5 min, 15 min, 5 min, 15 min (average 10 min)
-A random arrival falls in a 5-minute gap 1/4 of the time and in a 15-minute gap 3/4 of the time
-Waiting time: 1/4 × 2.5 min + 3/4 × 7.5 min = 6.25 min
[Figure: two bus timelines. Regular arrivals at 10, 20, 30, 40 min; irregular arrivals at 5, 20, 25, 40 min]
13
Page 13 Inspection Paradox
▶ Explanation #2 (Cont.)
-Waiting time: the area of the triangle is the waiting time
-The waiting times differ even if the average interval is the same
-In the latter case, if the inspector sits at the bus stop all day and averages the intervals between all buses, he gets 10 minutes
-But if the inspector gets to the bus stop at a particular point and estimates the average interval from his waiting time (2 × 6.25 min), he gets 12.5 minutes -> “Inspection Paradox”
[Figure: waiting-time triangles for the regular timeline (10 min, 20 min) and the irregular timeline (5 min, 20 min)]
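The bus arithmetic can be checked with exact fractions. This is a minimal sketch assuming the rider arrives uniformly at random over one repeating cycle of intervals; the helper name `rider_view` is hypothetical.

```python
from fractions import Fraction


def rider_view(intervals):
    """Return (mean interval, expected wait, length-biased mean interval)
    for a rider arriving uniformly at random over one cycle of `intervals`."""
    total = sum(intervals)
    mean_interval = Fraction(total, len(intervals))
    # The chance of landing in a gap is proportional to its length,
    # and the average wait inside a gap is half of that gap.
    expected_wait = sum(Fraction(g, total) * Fraction(g, 2) for g in intervals)
    # Average length of the gap the rider lands in.
    biased_mean = sum(Fraction(g, total) * g for g in intervals)
    return mean_interval, expected_wait, biased_mean


regular = rider_view([10, 10, 10, 10])
irregular = rider_view([5, 15, 5, 15])
print("regular:  ", [float(v) for v in regular])
print("irregular:", [float(v) for v in irregular])
```

For the irregular schedule the expected wait is 6.25 minutes and the length-biased mean interval is 12.5 minutes, matching the figures on the slide, while the schedule's own average interval stays at 10 minutes.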
14
Page 14 Inspection Paradox
▶ If someone samples data with randomly varying intervals at a particular point in time, he is more likely to land in a larger interval and consequently gets a biased (wrong) estimate
▶ Explanation #3
-On a machine in the distributed system, block processing time differs depending on the block’s data, even if every block is the same size
-If we take a snapshot at a particular point to get an estimate, it is likely to fall while a larger (slower) block is being processed
-That means we only have the information from the smaller blocks, which contain less information, and cannot include the larger block in the estimate
[Figure: Blocks 1 to 4 on a timeline, marked completed, processing, or waiting, with the snapshot point]
15
Page 15 Inspection Paradox
▶ Let’s make the ‘inspection paradox’ go away
-Take 3 parameters of each block for estimation
  -x: aggregate value of the block
  -t_sch: time the block waited to be scheduled
  -t_proc: processing time of the block
-t_sch and t_proc allow us to make predictions about the x values that we have not yet seen
-For example, if a particular block took 5 seconds to be scheduled and has been processing for 125 seconds (not completed yet), we can correctly view its x as a random sample from the distribution f(x | t_sch = 5, t_proc >= 125)
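The effect of conditioning on a lower bound for the processing time can be sketched empirically. This is not the paper's Bayesian model: it only filters synthetic (x, t_proc) pairs, ignores t_sch for brevity, and assumes that processing time grows with x plus noise. All distributions and constants below are made up for illustration.

```python
import random

rng = random.Random(42)
# Synthetic assumption: x ~ Exp(mean 50) and t_proc = 2 * x + noise.
# Neither distribution comes from the slides or the paper.
blocks = []
for _ in range(50_000):
    x = rng.expovariate(1 / 50.0)
    t_proc = 2.0 * x + rng.uniform(0.0, 20.0)
    blocks.append((x, t_proc))


def mean_x(pairs):
    """Average aggregate value over a list of (x, t_proc) pairs."""
    return sum(x for x, _ in pairs) / len(pairs)


# Blocks consistent with "has already been processing for at least 125 seconds".
still_running = [(x, t) for x, t in blocks if t >= 125.0]
prior_mean = mean_x(blocks)
conditional_mean = mean_x(still_running)
print(f"E[x]                 ~ {prior_mean:.1f}")
print(f"E[x | t_proc >= 125] ~ {conditional_mean:.1f}")
```

When x and t_proc are positively correlated, knowing that a block is still running after 125 seconds shifts the plausible values of its x upward, which is the information a naive average over completed blocks throws away.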
16
Page 16 Implementation
▶ Implemented an OLA mode in Hyracks
▶ Hyracks
-Open-source project that supports Map and Reduce operations
-Also supports relational operations such as selection, projection, and join
-Architecture is similar to Hadoop
▶ Modifications to Hyracks
-A logical block queue that makes the block order statistically random
-An estimator in the reduce task during the shuffle phase
  -Results of completed map tasks are gathered in the shuffle phase
  -The estimator receives the aggregate value (x) and metadata (t_sch and t_proc)
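The two modifications can be pictured with a small sketch. It does not use or imitate the Hyracks API: `BlockReport` and `randomized_queue` are hypothetical names, and the block count of 3960 is borrowed from the experiment slide only to have a concrete number.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class BlockReport:
    """Hypothetical record a finished map task would send to the estimator."""
    block_id: int
    x: float       # aggregate value of the block
    t_sch: float   # seconds the block waited to be scheduled
    t_proc: float  # seconds the block took to process


def randomized_queue(block_ids, seed=None):
    """Return the block ids as a queue in uniformly random order."""
    order = list(block_ids)
    random.Random(seed).shuffle(order)
    return deque(order)


queue = randomized_queue(range(3960), seed=7)
first = queue.popleft()
# Made-up numbers standing in for what a map task would measure.
report = BlockReport(block_id=first, x=1234.0, t_sch=5.0, t_proc=42.0)
print(report)
```

Shuffling the logical order makes the sequence in which blocks are handed out independent of their contents, and the per-block timing metadata gives the estimator what it needs to correct for blocks that are still running.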
17
Page 17 Estimation
▶ A Bayesian approach is applied for estimation
-Z is randomly sampled from the blocks
-Z produces observed data, X, and hidden data, Y
-Θ includes any data that is unobserved
-The process below is repeated to get an estimate
18
Page 18 Experiments
▶ 6 months of Wikipedia page traffic data
-Counting the number of pages per language
-220 GB, 3960 blocks
-On 11 nodes (1 master, 10 slaves)
-80 mappers and 10 reducers
-Took 46 minutes to run to completion
▶ Experimented on 3 different versions
-Random block order, accounting for correlation (inspection paradox)
-No random block order, accounting for correlation (inspection paradox)
-Random block order, ignoring correlation (inspection paradox)
19
Page 19 Experiments
(a) Posterior query result distribution for the number of English-language pages at various times, using both randomized and arbitrary block ordering (actual result: black vertical line)
(b) Posterior query result distribution for the number of English-language pages at various times, both taking into account and ignoring the correlation between aggregate value and processing time
20
Page 20 Conclusion ▶ The authors proposed a system model that is appropriate for OLA over MapReduce in a large-scale, distributed environment ▶ The model accounts for biases that can arise when estimating aggregates in a cluster environment (deals with ‘inspection paradox’) ▶ This model allows us to export “early returns” of query aggregates that are statistically robust
21
Page 21 Q & A Thank you