Page 1 Online Aggregation for Large MapReduce Jobs Niketan Pansare, Vinayak Borkar, Chris Jermaine, Tyson Condie VLDB 2011 IDS Fall Seminar 2011. 11. 11.


1 Page 1 Online Aggregation for Large MapReduce Jobs Niketan Pansare, Vinayak Borkar, Chris Jermaine, Tyson Condie VLDB 2011 IDS Fall Seminar 2011. 11. 11. Presented by Yang Byoung Ju

2 Page 2 Online Aggregation (OLA) ▶ select avg(stock_price) from nasdaq_db where company = 'xyz'; ▶ Conventional DB: ▶ With OLA Extension: [0, 2000] with 95% probability After 1 second

3 Page 3 Online Aggregation (OLA) ▶ select avg(stock_price) from nasdaq_db where company = 'xyz'; ▶ Conventional DB: ▶ With OLA Extension: [900, 1100] with 95% probability After 2 minutes

4 Page 4 Online Aggregation (OLA) ▶ select avg(stock_price) from nasdaq_db where company = 'xyz'; ▶ Conventional DB: ▶ With OLA Extension: [995, 1005] with 95% probability After 10 minutes

5 Page 5 Online Aggregation (OLA) ▶ select avg(stock_price) from nasdaq_db where company = 'xyz'; ▶ Conventional DB: 1000 ▶ With OLA Extension: 1000 After 2 hours

6 Page 6 Online Aggregation (OLA) ▶ The user gets estimates of an aggregate query ▶ At all times during query processing, the database system gives the user a statistically valid estimate of the final answer (e.g., output range estimate: [990, 1010] with 95% probability) ▶ Advantages • Can get a reasonable answer very quickly (depending on the application) • Can save time and computing resources ▶ Disadvantages • Implementation requires changes to the database kernel • In a self-managed system, decreased resource cost may not benefit the user directly
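The shrinking intervals on the previous slides can be illustrated with a minimal sketch. It is not the paper's estimator: it assumes the rows arrive in random order and uses a plain normal-approximation interval with a finite population correction. The `prices` data and the `running_estimates` helper are hypothetical.

```python
import math
import random


def running_estimates(values, checkpoints, z=1.96):
    """Return (n, mean, low, high) after scanning the first n values.

    Assumes `values` is already in random order, so every prefix is a
    simple random sample without replacement from the whole table.
    """
    total = len(values)
    wanted = set(checkpoints)
    results = []
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for n, v in enumerate(values, start=1):
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
        if n in wanted and n > 1:
            variance = m2 / (n - 1)
            # Finite population correction: the interval shrinks to a
            # single point once every row has been read.
            std_err = math.sqrt(variance / n * (1 - n / total))
            results.append((n, mean, mean - z * std_err, mean + z * std_err))
    return results


rng = random.Random(0)
# Hypothetical stand-in for the stock_price column of company 'xyz'.
prices = [rng.gauss(1000, 200) for _ in range(100_000)]
rng.shuffle(prices)

estimates = running_estimates(prices, [100, 10_000, 100_000])
for n, mean, low, high in estimates:
    print(f"after {n:>7} rows: {mean:8.2f} in [{low:8.2f}, {high:8.2f}]")
```

The interval narrows as more rows are read and collapses to the exact average once the scan is complete, which mirrors the progression from "After 1 second" to "After 2 hours".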

7 Page 7 Why ‘Online Aggregation’? ▶ OLA was proposed in 1997, but its commercial impact has been limited or even non-existent, for two reasons • OLA requires extensive changes to the database kernel • Saving resources has never been compelling ▶ Why OLA now? • People are implementing all sorts of new databases these days • Given the current move into the cloud, as a query runs, dollars flow from the end-user’s pocket to the cloud

8 Page 8 OLA in a distributed environment ▶ Classic OLA • The set of data (tuples) processed at any point in the computation is a random subset of the data in the system • Easy to estimate the final answer using statistical methods ▶ OLA at large scale • The basic unit of data that is processed is a block (e.g., 64 MB) • There is a lot of variation in the time taken to process each block • This variation in processing time is tremendously important if it is correlated with the aggregate value of the block

9 Page 9 OLA in a distributed environment ▶ OLA at large scale (Cont.) • Blocks with a lot of data may have a greater aggregate value and take longer to process • So the set of blocks completed at any particular point is more likely to have small values, leading to biased estimates -> “Inspection Paradox” ▶ This paper addresses the ‘inspection paradox’ problem, making OLA possible in a distributed environment
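The bias described above can be reproduced with a small simulation. This is a sketch under strong assumptions, not the paper's experiment: processing time is set equal to the block's aggregate value, block values are drawn from an exponential distribution, and the worker count, block count, and snapshot time are arbitrary. The `completed_by` helper is hypothetical.

```python
import heapq
import random


def completed_by(values, workers, snapshot_time):
    """Return the values of blocks that finish by snapshot_time.

    Blocks are handed out in list order to whichever worker frees up
    first. ASSUMPTION: processing time equals the block's aggregate
    value, the strongest possible correlation between the two.
    """
    free_at = [0.0] * workers  # min-heap of times at which workers go idle
    done = []
    for x in values:
        start = heapq.heappop(free_at)
        if start >= snapshot_time:
            break  # no later block can even start before the snapshot
        finish = start + x
        heapq.heappush(free_at, finish)
        if finish <= snapshot_time:
            done.append(x)
    return done


rng = random.Random(42)
values = [rng.expovariate(1 / 10) for _ in range(4000)]  # mean value ~10

done = completed_by(values, workers=80, snapshot_time=30.0)
true_mean = sum(values) / len(values)
early_mean = sum(done) / len(done)
print(f"true mean per block:       {true_mean:.2f}")
print(f"mean over completed blocks: {early_mean:.2f}")
```

Even though the blocks are queued in random order, the blocks that happen to be finished at the snapshot are disproportionately the small ones, so their average underestimates the true per-block mean.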

10 Page 10 Inspection Paradox ▶ In a renewal process, if we wait some predetermined time t and then observe how large the renewal interval containing t is, we should expect it to be typically larger than a renewal interval of average size.

11 Page 11 Inspection Paradox ▶ Explanation #1 • If we shoot arrows at random at targets of different sizes, more arrows land on the larger targets

12 Page 12 Inspection Paradox ▶ Explanation #2 • Buses arrive with an average interval of 10 minutes. How long do you wait if you get to the bus stop at a random time? • 5 minutes? Yes, if a bus arrives exactly every 10 minutes • What if the arrival intervals are not uniform? E.g., 5 min, 15 min, 5 min, 15 min (average 10 min) • Waiting time: 1/4 × 2.5 min + 3/4 × 7.5 min = 6.25 min (Figure: two timelines, uniform arrivals at 10, 20, 30, 40 min and non-uniform arrivals at 5, 20, 25, 40 min)

13 Page 13 Inspection Paradox ▶ Explanation #2 (Cont.) • Waiting time: the area of each triangle is the waiting time, which differs between the two schedules even though their average interval is the same • In the latter case, if the inspector sits at the bus stop all day and averages the intervals of all buses, he gets 10 minutes • But if the inspector arrives at the bus stop at one particular point in time and estimates the average interval from his waiting time (6.25 min), he gets 12.5 minutes -> “Inspection Paradox” (Figure: waiting-time triangles for the two schedules, with axis ticks at 10 min, 20 min and at 5 min, 20 min)
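The bus-stop arithmetic can be checked with a few lines. The sketch assumes the passenger arrives uniformly at random over one full cycle of the schedule, so the chance of landing in an interval is proportional to its length and the mean wait inside it is half its length. The `expected_wait` helper is hypothetical.

```python
from fractions import Fraction


def expected_wait(intervals):
    """Expected wait for a passenger arriving uniformly at random.

    P(landing in an interval) = length / total, and the mean wait
    inside an interval is length / 2.
    """
    total = sum(intervals)
    return sum(Fraction(length, total) * Fraction(length, 2)
               for length in intervals)


uniform = expected_wait([10, 10, 10, 10])
uneven = expected_wait([5, 15, 5, 15])

print(float(uniform))      # expected wait with a bus exactly every 10 min
print(float(uneven))       # expected wait with 5/15/5/15 min intervals
print(float(2 * uneven))   # interval the inspector would infer from his wait
```

Both schedules have the same 10-minute average interval, yet the uneven one yields a 6.25-minute wait and an inferred interval of 12.5 minutes.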

14 Page 14 Inspection Paradox ▶ If someone samples data with random intervals at one particular point in time, they are more likely to land in a larger interval and consequently get a biased (wrong) estimate ▶ Explanation #3 • On a machine in the distributed system, block processing time differs depending on the data, even if every block has the same size • If we take a snapshot at a particular point to get an estimate, it is likely to fall while a larger block is being processed. • This means we only get the information from the smaller, already completed blocks, which contain less information, and cannot include the larger block in the estimate. (Figure: Block 1 to Block 4 on a timeline, labeled completed, processing, and waiting, with the snapshot marked)

15 Page 15 Inspection Paradox ▶ Let’s make the ‘inspection paradox’ go away • Take 3 parameters of each block for estimation - x: aggregate value of the block - t_sch: time the block waits to be scheduled - t_proc: processing time of the block • t_sch and t_proc allow us to make predictions about the x values that we have not seen yet. • For example, if a particular block has been processing for 125 seconds (and has not completed yet) after taking 5 seconds to be scheduled, we can correctly view its x as a random sample from the distribution f(x | t_sch = 5, t_proc >= 125)
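Why conditioning on processing time helps can be seen in a toy example. This is only an illustration, not the paper's Bayesian model: the linear relation between x and t_proc, the noise level, and the uniform range of processing times are all made-up assumptions, and t_sch is left out for brevity.

```python
import random

rng = random.Random(1)

# Hypothetical population of blocks in which the aggregate value x
# grows with the processing time t_proc (assumed linear plus noise).
blocks = []
for _ in range(20_000):
    t_proc = rng.uniform(50, 200)
    x = 10 * t_proc + rng.gauss(0, 100)
    blocks.append((x, t_proc))


def mean(xs):
    return sum(xs) / len(xs)


# Guess for an unfinished block if we ignore how long it has been running.
naive = mean([x for x, _ in blocks])

# Guess once we use the fact that it has already run for 125 seconds,
# i.e. an empirical stand-in for E[x | t_proc >= 125].
conditional = mean([x for x, t in blocks if t >= 125])

print(f"E[x]                 ~ {naive:.0f}")
print(f"E[x | t_proc >= 125] ~ {conditional:.0f}")
```

Under these assumptions the conditional expectation is noticeably larger than the unconditional one, which is the information a snapshot loses if it ignores blocks that are still running.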

16 Page 16 Implementation ▶ Implemented an OLA mode in Hyracks ▶ Hyracks • Open source project that supports Map and Reduce operations • Also supports relational operations such as selection, projection, and join • Architecture is similar to Hadoop ▶ Modifications to Hyracks • A logical block queue that makes the block order statistically random • An estimator in the reduce task during the shuffle phase - Completed map tasks are gathered in the shuffle phase - The estimator receives the aggregate value (x) and the metadata (t_sch and t_proc)

17 Page 17 Estimation ▶ A Bayesian approach is applied for estimation • Z is randomly sampled from the blocks • Z produces observed data, X, and hidden data, Y • Θ includes any data that is unobserved • The process shown on the slide is repeated to obtain an estimate

18 Page 18 Experiments ▶ 6 months of Wikipedia page traffic data • Counting the number of pages per language • 220 GB, 3960 blocks • On 11 nodes (1 master, 10 slaves) • 80 mappers and 10 reducers • Took 46 minutes to run to completion ▶ Experimented on 3 different versions • w/ random block order, w/ correlation (inspection paradox) • w/o random block order, w/ correlation (inspection paradox) • w/ random block order, w/o correlation (inspection paradox)

19 Page 19 Experiments (a) Posterior query result distribution for the number of English-language pages at various times, using both randomized and arbitrary block ordering (actual result: black vertical line) (b) Posterior query result distribution for the number of English-language pages at various times, taking into account and ignoring the correlation between aggregate value and processing time

20 Page 20 Conclusion ▶ The authors proposed a system model that is appropriate for OLA over MapReduce in a large-scale, distributed environment ▶ The model accounts for biases that can arise when estimating aggregates in a cluster environment (deals with ‘inspection paradox’) ▶ This model allows us to export “early returns” of query aggregates that are statistically robust

21 Page 21 Q & A Thank you

