1 ApproxHadoop Bringing Approximations to MapReduce Frameworks
Íñigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D. Nguyen

2

3 Approximate computing
We're producing more data than we can analyze:
- Many applications do not require precise outputs
- Being precise is expensive
Approximate computation trades time and/or energy against accuracy.
[Chart: data warehouse growth (growth rate = 173%) outpacing technology scaling, in TB; from IEEE Design 2014.]

4 Data analytics using MapReduce
Example: process web access logs to extract the top pages.
MapReduce is a popular framework:
- User provides code (map and reduce)
- Framework manages data access and parallel execution
- Higher-level languages on top: Pig, Hive, ...
Hadoop is deployed widely at large scale:
- Facebook: 30PB Hadoop clusters
- Yahoo: 16 Hadoop clusters with >42,000 nodes

5 Our contributions
Approximations in MapReduce:
- Approximation mechanisms
- Error bounds based on statistical theories
ApproxHadoop, an implementation for Hadoop:
- Approximates common applications
- Achieves target error bounds online
Large execution time and energy savings with high accuracy

6 Approximations in MapReduce
Why can we approximate with MapReduce?
- Lines within a block have similarities
- Blocks have similarities
Example application: what is the average length of the lines of each color?
[Diagram: blocks 1-4 feed map tasks 1-4, which feed reduce tasks 1-2 producing outputs 1-2.]

7 Mechanisms and error bounds
Similarities allow for accurate approximations.
Approximation mechanisms for MapReduce:
- Drop map tasks
- Sample input data
- User-defined approximations (technical report)
Bound the approximation errors using:
- Multistage sampling for aggregation applications (e.g., sum, average, ratio)
- Extreme value theory for extreme value computations (e.g., min, max)

8 Multistage sampling and MapReduce
Combines inter- and intra-cluster sampling techniques:
- Simple random sampling inside a block → data sampling
- Cluster sampling between blocks → task dropping
Given the sampling/dropping ratios and the variances, compute error bounds at a chosen confidence level.
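The two sampling stages can be sketched as a toy Python simulation (our own naming, not ApproxHadoop's API): stage 1 keeps a random subset of blocks, which corresponds to dropping the map tasks of the other blocks, and stage 2 keeps a random subset of lines inside each surviving block.

```python
import random

def two_stage_sample(blocks, block_keep_ratio, line_keep_ratio, seed=0):
    """Toy two-stage (cluster + simple random) sampling over input blocks."""
    rng = random.Random(seed)
    # Stage 1: cluster sampling -- keep a random subset of blocks
    # (equivalent to dropping the map tasks of the discarded blocks).
    kept_blocks = [b for b in blocks if rng.random() < block_keep_ratio]
    # Stage 2: simple random sampling inside each kept block
    # (equivalent to a map task reading only some of its input lines).
    return [[line for line in b if rng.random() < line_keep_ratio]
            for b in kept_blocks]

blocks = [[f"b{i}l{j}" for j in range(100)] for i in range(10)]
sample = two_stage_sample(blocks, 0.5, 0.1)
```

With both ratios at 1.0 the "sample" is the whole input, so the precise computation is the degenerate case of the approximate one.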

9 Mapping multistage sampling to MapReduce
Block → cluster; track the sampling ratios.
- Intra-cluster sampling (data sampling): sample lines within a block, e.g., m_1 = 3 of the M_1 = 5 lines of block 1
- Inter-cluster sampling (task dropping): run only some map tasks, e.g., n = 2 of the N = 4 blocks
Using the inter- and intra-cluster variances for each line color yields an approximation with error bounds (Y±X%).
Estimated total:
\hat{\tau} = \frac{N}{n} \sum_{i=1}^{n} \frac{M_i}{m_i} \sum_{j=1}^{m_i} v_{ij}
Error bound at confidence level 1-\alpha:
\hat{\tau} \pm t_{n-1,\,1-\alpha/2} \sqrt{\widehat{Var}(\hat{\tau})}, \quad
\widehat{Var}(\hat{\tau}) = N(N-n)\frac{s_u^2}{n} + \frac{N}{n} \sum_{i=1}^{n} M_i (M_i - m_i) \frac{s_i^2}{m_i}
where s_u^2 is the variance of the estimated per-cluster totals and s_i^2 is the sample variance within cluster i.
Example application: what is the approximate average length of the lines of each color?
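The estimator and its variance can be written out directly. This is a sketch of the standard two-stage cluster-sampling formulas, not ApproxHadoop's code; `samples[i]` holds the m_i sampled values of cluster i, `M[i]` is the cluster size, and `N` the total number of clusters.

```python
from statistics import variance

def estimate_total(samples, M, N):
    # tau_hat = (N/n) * sum_i (M_i/m_i) * sum_j v_ij
    n = len(samples)
    return (N / n) * sum((M[i] / len(s)) * sum(s)
                         for i, s in enumerate(samples))

def estimate_variance(samples, M, N):
    # Var(tau_hat) = N(N-n) s_u^2/n + (N/n) sum_i M_i (M_i - m_i) s_i^2/m_i
    n = len(samples)
    # s_u^2: variance of the estimated per-cluster totals (between clusters)
    totals = [(M[i] / len(s)) * sum(s) for i, s in enumerate(samples)]
    s_u2 = variance(totals) if n > 1 else 0.0
    # s_i^2: sample variance of the values within cluster i
    within = sum(M[i] * (M[i] - len(s)) *
                 (variance(s) if len(s) > 1 else 0.0) / len(s)
                 for i, s in enumerate(samples))
    return N * (N - n) * s_u2 / n + (N / n) * within

# The bound is then tau_hat +/- t_{n-1, 1-alpha/2} * sqrt(Var(tau_hat)),
# with t taken from a Student's t table for the chosen confidence level.
```

As a sanity check: when every cluster is fully sampled (n = N and m_i = M_i), the estimate equals the true total and both variance terms collapse to zero.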

10 Our contributions
Approximations in MapReduce:
- Approximation mechanisms
- Error bounds based on statistical theories
ApproxHadoop, an implementation for Hadoop:
- Approximates common applications
- Achieves target error bounds online
Large execution time and energy savings with high accuracy

11 Example: Using ApproxHadoop
Precise WordCount:

    class WordCount:
      class WCMapper extends Mapper:
        void map(String key, String value):
          foreach word w in value:
            context.write(w, 1);
      class WCReducer extends Reducer:
        void reduce(String key, Iterator values):
          int result = 0;
          foreach int v in values:
            result += v;
          context.write(key, result);
      void main():
        setInputFormat(TextInputFormat);
        run();

Approximate version (only the base classes and the input format change):

    class ApproxWordCount:
      class ApproxWCMapper extends MultiStageSamplingMapper:
        void map(String key, String value):
          foreach word w in value:
            context.write(w, 1);
      class ApproxWCReducer extends MultiStageSamplingReducer:
        void reduce(String key, Iterator values):
          int result = 0;
          foreach int v in values:
            result += v;
          context.write(key, result);
      void main():
        setInputFormat(ApproxTextInputFormat);
        run();
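To see what the approximate version computes, here is a toy Python simulation (ours, not Hadoop code): sample a fraction of the input lines, count words in the sample, and scale each count by the inverse sampling ratio to estimate the counts over the full input.

```python
import random
from collections import Counter

def approx_word_count(lines, ratio, seed=0):
    """Estimate per-word counts from a random ~ratio fraction of the lines."""
    rng = random.Random(seed)
    counts = Counter()
    for line in lines:
        if rng.random() < ratio:          # simple random sampling of lines
            counts.update(line.split())
    # Scale each sampled count by 1/ratio to estimate the true count.
    return {w: c / ratio for w, c in counts.items()}
```

With ratio = 1.0 every line is read and the result is exact, matching the precise WordCount.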

12 How to specify approximations?
Two options:
- The user specifies the dropping/sampling ratios, and ApproxHadoop calculates the resulting error bound.
- The user specifies a target error bound, e.g., a maximum error (±1%) at a confidence level (95%). ApproxHadoop then:
  1. Runs the first wave of tasks to produce a pilot sample
  2. Selects the dropping/sampling ratios
  3. Runs the next subset of tasks, monitoring the intermediate outputs
  4. Repeats until the target bound is achieved, then drops the remaining map tasks
  5. Calculates the final error bound
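The target-error loop can be sketched as follows. This is a hypothetical Python sketch; `run_until_target` and the `error_bound` callback are our stand-ins for framework internals, and the numeric "waves" stand in for sets of map tasks.

```python
def run_until_target(waves, error_bound, target):
    # Run the first wave as a pilot sample, then keep running task waves
    # and recomputing the error bound on what has finished so far.
    done = []
    for wave in waves:
        done.append(wave)
        if error_bound(done) <= target:
            break  # target bound achieved: drop the remaining map tasks
    return done

# Toy stand-in: the bound shrinks as more waves complete.
completed = run_until_target([1, 2, 3, 4], lambda d: 1.0 / len(d), 0.4)
```

The key design point is that the bound is recomputed online from the tasks finished so far, so the framework can stop as early as the statistics allow.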

13 Implementation: ApproxHadoop
Extends Hadoop 1.2.1 and implements the approximation mechanisms:
- Extended reducers for bound estimation
- Incremental reducers to tune the sampling ratios
- New data types, e.g., ApproxInteger
[Diagram: blocks feed map tasks, which feed reduce tasks producing outputs Y±X%.]
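A data type like ApproxInteger presumably pairs a value with its error bound so reducers can emit "Y ± X". A minimal sketch; the addition rule shown is our illustrative assumption, not necessarily ApproxHadoop's exact semantics.

```python
class ApproxInteger:
    """A value carrying the half-width of its confidence interval."""
    def __init__(self, value, error=0):
        self.value = value
        self.error = error  # half-width of the confidence interval

    def __add__(self, other):
        # Illustrative assumption: bounds add when values are summed.
        return ApproxInteger(self.value + other.value,
                             self.error + other.error)

    def __repr__(self):
        return f"{self.value}±{self.error}"
```

A precise result is then just the special case with error = 0.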

14 Evaluation methodology
Datasets:
- Wikipedia access logs: 1 week with 4 billion accesses (216.9GB)
- Wikipedia articles: 40GB in XML
- Other applications and datasets in the paper
Metrics:
- Actual % error (approximation vs. precise)
- Approximation with 95% confidence interval (e.g., 10±1%)
- Run time: 20 runs, reporting min, max, and average
Executions on 10- and 60-node clusters

15 Example: Precise and approximate processing
Wikipedia project popularity and Wikipedia article length, both with 1% input sampling:
- The same 1% input sampling introduces different errors in different applications
- The actual values fall within the computed bounds

16 User-specified input sampling ratio
Wikipedia project popularity (no task dropping):
- More than 30% run time reduction with a sampling ratio of less than 0.1%
- Applications exhibit different speedups for the same ratios

17 User-specified dropping/sampling ratios
Wikipedia project popularity (25% task dropping):
- More than 55% run time reduction for less than 1% error
- Task dropping increases errors significantly, but also decreases run time

18 User-specified target error
Wikipedia project popularity:
- ApproxHadoop tunes the sampling/dropping ratios depending on the target error
[Chart compares no sampling, input data sampling, maximum sampling, and task dropping.]

19 Impact of input data size
Wikipedia project popularity, from 1 day (27GB) to 1 year (12.5TB) of compressed logs:
- Larger input data brings larger savings (up to 32x)
[Chart: compressed log size (GB) vs. run time (seconds).]

20 Conclusions
Apply statistical theories to MapReduce:
- Approximation mechanisms such as input data sampling and task dropping
- Applicable to (large) classes of analytics applications
Achieve target error bounds online with ApproxHadoop:
- Trades off execution time against accuracy
- Significant execution time reduction with high accuracy
- Scales well to large datasets

21 ApproxHadoop Bringing Approximations to MapReduce Frameworks
Íñigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D. Nguyen

22 Related work
Approximation has been studied at multiple levels: languages [PLDI10], distributed systems [NSDI10], databases [TODS07], hardware [MICRO13], ...
Others studied approximation in Hadoop:
- Only one sampling level (simple random or stratified) [NSDI14]
- Particular applications; no general framework [CIKM12]
- Pre-built samples [EuroSys10]: creation overhead, and queries require a particular sample

23 Impact on energy consumption
Calculating the request rate for an 11GB Apache log:
- Time and energy are correlated, except for short jobs (e.g., a single wave)
- For those, run time stays the same but energy savings are significant because fewer map tasks run

24 Sampling and error bounds in MapReduce
Multistage sampling bounds the errors when we:
- Process only some lines of each block (input data sampling)
- Process only some blocks (map task dropping)
Example application: what is, approximately, the average length of the lines of each color?
[Diagram: blocks 1-4 feed map tasks, which feed reduce tasks producing outputs Y±X%.]

25 Roadmap
- Motivation
- Approximations in MapReduce: mechanisms; statistical theories and their mappings
- ApproxHadoop: interfaces; implementation
- Experimental results
- Conclusions

26 Mechanisms added to Hadoop
New interfaces and data types: MultiStageSamplingMapper, ApproxInteger, ...
- Input data sampling: read only X% of the input (chosen at random) via new input formats
- Task dropping: a new task state allows missing outputs; tasks execute in random order
- Error estimation: from the current values and the cluster and population sizes
- Incremental reduce tasks: monitor the current error bound, adapt the input data sampling, and decide whether to drop the pending maps
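An input format that samples its data might look like the following stand-in (not Hadoop's InputFormat API; names are ours): wrap a line iterator, yield only a random ~ratio fraction, and track how many lines were seen versus kept so the actual ratios can feed the error-bound formulas.

```python
import random

class SampledLineReader:
    """Yield a random ~ratio fraction of lines, tracking seen/kept counts."""
    def __init__(self, lines, ratio, seed=0):
        self.lines, self.ratio = lines, ratio
        self.rng = random.Random(seed)
        self.seen = self.kept = 0   # cluster size M_i and sample size m_i

    def __iter__(self):
        for line in self.lines:
            self.seen += 1
            if self.rng.random() < self.ratio:
                self.kept += 1
                yield line
```

The seen/kept counters are the per-block M_i and m_i that the multistage-sampling bound needs.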

27 Approximate computing: data processing
Data to process grows faster than processing capacity [IEEE Design 2014].
[Chart: size of the largest data warehouse in the Winter Top Ten Survey (TB), actual vs. projected, CAGR = 173%, compared against technology scaling.]

28 Future work
- Extend to higher-level languages: Pig, Hive, ...
- Workflows of multiple jobs: support approximate inputs; decide where to approximate
- Build the mechanisms into Tez

