Page 1 Online Aggregation for Large MapReduce Jobs Niketan Pansare, Vinayak Borkar, Chris Jermaine, Tyson Condie VLDB 2011 IDS Fall Seminar 2011. 11. 11.


Page 1 Online Aggregation for Large MapReduce Jobs Niketan Pansare, Vinayak Borkar, Chris Jermaine, Tyson Condie VLDB 2011 IDS Fall Seminar Presented by Yang Byoung Ju

Page 2 Online Aggregation (OLA) ▶ select avg(stock_price) from nasdaq_db where company = 'xyz'; ▶ After 1 second ▶ Conventional DB: (still running, no answer yet) ▶ With OLA Extension: [0, 2000] with 95% probability

Page 3 Online Aggregation (OLA) ▶ select avg(stock_price) from nasdaq_db where company = 'xyz'; ▶ After 2 minutes ▶ Conventional DB: (still running, no answer yet) ▶ With OLA Extension: [900, 1100] with 95% probability

Page 4 Online Aggregation (OLA) ▶ select avg(stock_price) from nasdaq_db where company = 'xyz'; ▶ After 10 minutes ▶ Conventional DB: (still running, no answer yet) ▶ With OLA Extension: [995, 1005] with 95% probability

Page 5 Online Aggregation (OLA) ▶ select avg(stock_price) from nasdaq_db where company = 'xyz'; ▶ After 2 hours ▶ Conventional DB: 1000 ▶ With OLA Extension: 1000

Page 6 Online Aggregation (OLA) ▶ The user gets estimates of an aggregate query ▶ At all times during query processing, the database system gives the user a statistically valid estimate of the final answer (e.g., a range estimate of [990, 1010] with 95% probability) ▶ Advantages  Can get a reasonable answer very quickly (depending on the application)  Can save time and computing resources ▶ Disadvantages  Implementation requires changes to the database kernel  In a self-managed system, decreased resource cost may not benefit the user directly
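The shrinking intervals on the previous slides can be sketched with a simple running estimator. This is a hypothetical illustration, not the paper's estimator: it assumes tuples are consumed in random order and uses a normal approximation with the 1.96 constant for a 95% interval; the population and seed are made up.

```python
import math
import random

def ola_estimate(sample):
    """Return (mean, half_width) of an approximate 95% confidence
    interval for the population mean, from the tuples seen so far."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    half_width = 1.96 * math.sqrt(var / n)  # normal approximation
    return mean, half_width

random.seed(42)
# Made-up "stock prices" centered at 1000; classic OLA assumes the
# tuples arrive in statistically random order, hence the shuffle.
population = [random.gauss(1000, 200) for _ in range(100_000)]
random.shuffle(population)

for n in (100, 1_000, 10_000, 100_000):
    mean, half = ola_estimate(population[:n])
    print(f"after {n:>7} tuples: [{mean - half:7.1f}, {mean + half:7.1f}]")
```

Each pass prints a tighter interval around the true mean, mirroring the progression from [0, 2000] down to the exact answer.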

Page 7 Why ‘Online Aggregation’? ▶ OLA was proposed in 1997, but its commercial impact has been limited or even non-existent, for two reasons  OLA requires extensive changes to the database kernel  Saving resources has never been compelling ▶ Why OLA now?  People are implementing all sorts of new databases these days  Given the current move into the cloud, as a query runs, dollars flow from the end user’s pocket to the cloud

Page 8 OLA in a distributed environment ▶ Classic OLA  The set of data (tuples) seen at any point in the computation is a random subset of the data in the system  Easy to estimate the final answer using statistical methods ▶ OLA at large scale  The basic unit of data that is processed is a block (e.g., 64MB)  There is a lot of variation in the time taken to process each block  This variation in processing time is tremendously important if it is correlated with the aggregate value of the block

Page 9 OLA in a distributed environment ▶ OLA at large scale (cont.)  Blocks with a lot of data may have a greater aggregate value and take longer to process  So the set of blocks completed at any particular point is more likely to have small values, leading to biased estimates -> the “inspection paradox” This paper solves the inspection-paradox problem, making OLA possible in a distributed environment

Page 10 Inspection Paradox ▶ In a renewal process, if we wait some predetermined time t and then observe how large the renewal interval containing t is, we should expect it to be typically larger than a renewal interval of average size.

Page 11 Inspection Paradox ▶ Explanation #1  If we shot arrows at the targets below at random, more arrows would land on the larger target

Page 12 Inspection Paradox ▶ Explanation #2  Suppose buses arrive with an average interval of 10 minutes. How long do you wait if you get to the bus stop at a random time?  5 minutes? Yes, if a bus arrives exactly every 10 minutes  What if the arrival intervals are not uniform? E.g., 5 min, 15 min, 5 min, 15 min (average 10 min)  Expected waiting time: 1/4 × 2.5 min + 3/4 × 7.5 min = 6.25 min (Figure: two bus timelines, arrivals at 10, 20, 30, 40 min vs. 5, 20, 25, 40 min)

Page 13 Inspection Paradox ▶ Explanation #2 (cont.)  Waiting time: the area of each triangle in the figure is the accumulated waiting time, which differs between the two schedules even though their average interval is the same  If the inspector sits at the bus stop all day and averages the intervals of all buses, he gets 10 minutes  But if the inspector arrives at the bus stop at a random point and estimates the average interval from his expected waiting time (6.25 min), he gets 2 × 6.25 = 12.5 minutes: the “inspection paradox”
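The bus arithmetic above can be checked directly. This is a small illustrative computation, not from the paper: intervals alternate 5 and 15 minutes, a passenger arrives at a uniformly random time, and doubling the expected wait recovers the biased 12.5-minute estimate.

```python
# Intervals alternate 5 and 15 minutes (true average: 10 minutes).
intervals = [5, 15]
avg_interval = sum(intervals) / len(intervals)

# A passenger arriving at a uniformly random time lands in an interval
# with probability proportional to its length (length-biased sampling),
# and then waits half of that interval on average.
total = sum(intervals)
expected_wait = sum((L / total) * (L / 2) for L in intervals)

# Estimating the average interval as twice the expected wait
# overstates it: this is the inspection paradox.
biased_estimate = 2 * expected_wait

print(avg_interval)     # 10.0
print(expected_wait)    # 6.25
print(biased_estimate)  # 12.5
```

The length-biased weights 5/20 and 15/20 are exactly the 1/4 and 3/4 used on the slide.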

Page 14 Inspection Paradox ▶ If someone tries to estimate from randomly-spaced intervals observed at a particular point in time, he will tend to land in a larger interval and consequently get a biased (wrong) estimate ▶ Explanation #3  On a machine in the distributed system, block processing times will differ depending on the data, even if every block is the same size  If we take a snapshot at a particular point to get an estimate, it will likely be taken while a larger block is being processed  This means the estimate includes only the smaller, already-completed blocks, which carry less information, while the information in the larger, still-running block is excluded (Figure: at the snapshot, Blocks 1 and 2 are completed, Block 3 is processing, Block 4 is waiting)

Page 15 Inspection Paradox ▶ Let’s make the inspection paradox go away  Take 3 parameters of each block for estimation: x, the aggregate value of the block; t_sch, the time the block waits to be scheduled; and t_proc, the processing time of the block  t_sch and t_proc allow us to make predictions about x values we have not yet seen  For example, if a particular block has been processing for 125 seconds (not yet completed) and took 5 seconds to be scheduled, we can correctly view its x as a random sample from the distribution f(x | t_sch = 5, t_proc >= 125)

Page 16 Implementation ▶ Implemented an OLA mode in Hyracks ▶ Hyracks  An open-source project that supports Map and Reduce operations  Also supports relational operations such as selection, projection, and join  Its architecture is similar to Hadoop’s ▶ Modifications to Hyracks  A logical block queue that makes the block order statistically random  An estimator in the reduce task during the shuffle phase: completed map tasks are gathered in the shuffle phase, and the estimator receives the aggregate value (x) and the meta-data (t_sch and t_proc)

Page 17 Estimation ▶ A Bayesian approach is used for estimation  Z is randomly sampled from the blocks  Z produces the observed data X and the hidden data Y  Θ includes any data that is unobserved  This sampling process is repeated to obtain an estimate
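As a rough illustration of repeatedly sampling the unobserved data to estimate the final answer, the sketch below uses a simple resampling scheme: the values of unseen blocks are filled in by resampling from the observed ones, and many repetitions give an approximate distribution over the final total. This is not the paper's model: it ignores the hidden data Y and the t_sch/t_proc conditioning entirely, and all numbers are made up.

```python
import random

random.seed(7)

# Made-up data: aggregate values of 300 completed blocks; 700 blocks
# are still running or waiting.
observed = [random.gauss(100, 15) for _ in range(300)]
n_unseen = 700

# Each iteration fills in the unseen block values by resampling from
# the observed ones, producing one draw of the final total.
draws = []
for _ in range(2_000):
    filled = [random.choice(observed) for _ in range(n_unseen)]
    draws.append(sum(observed) + sum(filled))

draws.sort()
lo = draws[int(0.025 * len(draws))]
hi = draws[int(0.975 * len(draws))]
print(f"approximate 95% interval for the total: [{lo:.0f}, {hi:.0f}]")
```

The paper's estimator is more careful: because completed blocks are a biased sample (the inspection paradox), the unseen values must be drawn from f(x | t_sch, t_proc) rather than from the raw observed values as done here.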

Page 18 Experiments ▶ 6 months of Wikipedia page traffic data  Counting the number of pages per language  220GB, 3960 blocks  On 11 nodes (1 master, 10 slaves)  80 mappers and 10 reducers  Took 46 minutes to run to completion ▶ Experimented with 3 different versions  w/ random block order, w/ correlation modeling  w/o random block order, w/ correlation modeling  w/ random block order, w/o correlation modeling (“correlation modeling” means accounting for the correlation behind the inspection paradox)

Page 19 Experiments (a) Posterior query result distribution for the number of English-language pages at various times, using both randomized and arbitrary block ordering (actual result: black vertical line) (b) Posterior query result distribution for the number of English-language pages at various times, taking into account vs. ignoring the correlation between aggregate value and processing time

Page 20 Conclusion ▶ The authors propose a system model appropriate for OLA over MapReduce in a large-scale, distributed environment ▶ The model accounts for biases that can arise when estimating aggregates in a cluster environment (it deals with the inspection paradox) ▶ The model allows the system to export “early returns” of query aggregates that are statistically robust

Page 21 Q & A Thank you