Published by Antony Roberts. Modified over 9 years ago.
1
Automatic Optimization of MapReduce Programs
Michael Cafarella, Eaman Jahani, Christopher Ré
August 2011
2
MapReduce is victorious
Google statistics:
| | Aug 04 | Mar 06 | Sept 07 | May 10 |
| Number of jobs | 29K | 171K | 2127K | 4474K |
| Machine years used | 217 | 2002 | 11081 | 39121 |
| Input Data (TB) | 3,288 | 52,254 | 403,152 | 946,460 |
| Output Data (TB) | 193 | 2,970 | 14,018 | 45,720 |
| Average worker machines | 157 | 268 | 394 | 368 |
Hadoop statistics: 7 PB+ Vertica clusters vs. 22 PB+ Cloudera Hadoop clusters [1]
[1] Omer Trajman, Cloudera VP, http://www.dbms2.com/
3
MapReduce in relational land
Designers' original intention: free-form data
  o Web-scale indexing/log processing
But, many relational workloads [1]
  o Complex queries/data analysis
Caveat: MR performance lags RDBMS performance
[1] Karmasphere Corporation: A study of Hadoop developers, http://karmasphere.com, 2010
4
Selection is Slower with MapReduce
Source: Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009
5
Join is Even Slower
Source: Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009
6
MR Lags in Relational Land
Stonebraker, DeWitt: "MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism." [1]
Query processing tasks
  o No metadata, semantics, indices
  o Free-form input is a double-edged sword
[1] MapReduce: a major step backwards, http://databasecolumn.vertica.com/, 2008
7
Manimal
Manimal is a hybrid system that combines the MapReduce programming model with well-known execution techniques.
These techniques are found today only in RDBMSs, but they should be in MapReduce, too.
8
Manimal Approach
Diagram labels: bytecode (*.class), Static Analyzer, optimization opportunities, Optimizer logic, execution path, Execution Framework, MR Engine
Example program:
  void map(Text key, WebPage w) {
    if (w.rank > 10) emit(w.url, w.rank);
  }
Detected opportunity: SELECTION from B+Tree index on w.rank
Challenges:
  o Safely detect query semantic optimization
  o How much performance gain?
9
Manimal Contributions
Our Manimal system:
  o Detects safe relational optimizations in users’ compiled MapReduce programs
Our results:
  o Runs with unmodified MapReduce code
  o Runs up to 11x faster on the same code
  o Provides a framework for more optimizations
10
Outline
Introduction
Execution Framework
Optimization/Analyzer Examples
Experiments
  o Analyzer recall
  o Performance gain
Related Work and Conclusion
11
Execution Framework
  public void map(Text key, WebPage w, OutputCollector out) {
    if (w.rank > 10) emit(w.url, w.rank);
  }
12
Execution Framework
Bytecode: varload ‘value’ invokevirtual astore ‘text’ … ifeq …
Pipeline: Analyzer -> Optimizer -> Execution
13
Execution Framework
Bytecode: varload ‘value’ invokevirtual astore ‘text’ … ifeq …
Pipeline: Analyzer -> Optimizer -> Execution
Analyzer in: user program
Analyzer out:
  o optimization descriptor: (SELECT f, w.rank>10)
  o index-generation program:
      void map(k, w) {
        out.set(indexedOutputFormat);
        emit(w.rank, (k,w))
      }
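The index-generation program above re-keys every record by w.rank so that a later selection can seek into the data instead of scanning all of it. A minimal in-memory sketch of that idea follows. It assumes java.util.TreeMap as a stand-in for the on-disk B+Tree, and the names RankIndexSketch, Page, buildIndex, and selectRankAbove are hypothetical, not part of Manimal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: an ordered in-memory map plays the role of the B+Tree index.
class RankIndexSketch {
    // Simplified stand-in for the slide's WebPage type (assumption).
    static final class Page {
        final String url;
        final int rank;
        Page(String url, int rank) { this.url = url; this.rank = rank; }
    }

    // Index generation: key every page by its rank, mirroring emit(w.rank, (k,w)).
    static TreeMap<Integer, List<Page>> buildIndex(List<Page> pages) {
        TreeMap<Integer, List<Page>> index = new TreeMap<>();
        for (Page p : pages) {
            index.computeIfAbsent(p.rank, r -> new ArrayList<>()).add(p);
        }
        return index;
    }

    // Indexed selection: visit only entries whose key is strictly greater than
    // the threshold, so records the mapper would discard are never read.
    static List<Page> selectRankAbove(TreeMap<Integer, List<Page>> index, int threshold) {
        List<Page> out = new ArrayList<>();
        for (List<Page> bucket : index.tailMap(threshold, false).values()) {
            out.addAll(bucket);
        }
        return out;
    }
}
```

Calling selectRankAbove(index, 10) returns the same pages the original map() would have emitted for, which is the "same output" requirement a safe optimization has to meet.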
14
Execution Framework
Bytecode: varload ‘value’ invokevirtual astore ‘text’ … ifeq …
Pipeline: Analyzer -> Optimizer -> Execution
Optimizer in:
  o optimization descriptor: (SELECT f, w.rank>10)
  o catalog:
      | /logs/log.1 | /logs/log.1.idx | select | src… |
      | /logs/log.2 | /logs/log.2.idx | select | src… |
Optimizer out: execution descriptor: (SELECT, “log.1.idx”, w.rank>10)
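The optimizer step can be read as a catalog lookup: given an optimization descriptor for an input file, find a matching index and emit an execution descriptor that names it. A minimal sketch under that reading follows. OptimizerSketch, register, and plan are hypothetical names, the catalog is reduced to a single input-path-to-index-path map, and the descriptor is a plain string for illustration only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the optimizer's catalog lookup.
class OptimizerSketch {
    // Simplified catalog (assumption): input file path -> selection index path.
    private final Map<String, String> selectIndexes = new HashMap<>();

    void register(String inputPath, String indexPath) {
        selectIndexes.put(inputPath, indexPath);
    }

    // Rewrite (SELECT f, predicate) into (SELECT, "index", predicate) when an
    // index exists for the input; otherwise return empty so the job can run
    // unmodified on the original input.
    Optional<String> plan(String inputPath, String predicate) {
        String indexPath = selectIndexes.get(inputPath);
        if (indexPath == null) {
            return Optional.empty();
        }
        return Optional.of("(SELECT, \"" + indexPath + "\", " + predicate + ")");
    }
}
```

Returning an empty result when no index is registered keeps the fallback explicit: an unoptimized job still runs, just without the speedup. A fuller catalog would also record which field and source program each index was built from.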
15
Execution Framework
Bytecode: varload ‘value’ invokevirtual astore ‘text’ … ifeq …
Pipeline: Analyzer -> Optimizer -> Execution
Execution in:
  o execution descriptor: (SELECT, “log.1.idx”, w.rank>10)
  o user program
Execution out: program output (numwords 19519)
16
Outline
Introduction
Execution Framework
Optimization/Analyzer Examples
Experiments
  o Analyzer recall
  o Performance gain
Related Work and Conclusion
17
An Optimization Example
  // webpage.java: SCHEMA!
  class WebPage { String url; int rank; String content; }
  // mapper.java
  void map(Text key, WebPage w) {
    // compare string contents, not references
    if (w.url.equals("teaparty.fr")) emit(w.url, 1);
  }
Data-centric programming idioms == relational ops
  o PROJECTED view: (url, null, null)
  o DIRECT-OP on compressed WebPage
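The two opportunities named on this slide can be illustrated together: the mapper reads only url, so a projected view can drop rank and content, and an equality test can run on compressed values without decompressing them. The sketch below uses dictionary encoding as one common scheme that permits this. The slide does not say which compression scheme Manimal uses, and DirectOpSketch, add, and countEqual are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a projected, dictionary-encoded url column.
class DirectOpSketch {
    // Dictionary (assumption): each distinct url gets a small integer code.
    private final Map<String, Integer> codes = new HashMap<>();
    // Projected view: only the url field is stored, and only as codes.
    private final List<Integer> urlColumn = new ArrayList<>();

    void add(String url) {
        Integer code = codes.get(url);
        if (code == null) {
            code = codes.size();
            codes.put(url, code);
        }
        urlColumn.add(code);
    }

    // Direct operation: translate the constant once, then compare integer
    // codes instead of decompressing and comparing each stored string.
    int countEqual(String url) {
        Integer target = codes.get(url);
        if (target == null) {
            return 0; // the value never occurs in the data
        }
        int wanted = target;
        int matches = 0;
        for (int code : urlColumn) {
            if (code == wanted) {
                matches++;
            }
        }
        return matches;
    }
}
```

Counting matches here corresponds to the mapper's emit(w.url, 1) followed by a sum; the stored strings themselves are never rebuilt.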
18
Semantic Extraction
Query semantics are obvious to human readers, but not explicit in the code for the framework
EXTRACT IT!
  o Static code analysis
  o Control-flow graph and data-flow graph
  o Find opportunities: selection, projection, direct op
  o Safe optimizations: same output
19
Analyzer: An Example
  // webpage.java
  class WebPage { String url; int rank; String content; }
  // mapper.java
  void map(Text key, WebPage w) {
    if (w.rank > 10) emit(w.url, w.rank);
  }
Control-flow graph nodes produced by the Analyzer: Fn Entry, w.rank > 10, emit(url, rank), Fn Exit
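One way to read this control-flow graph: the predicate w.rank > 10 is a selection if every path from Fn Entry to an emit passes through the predicate's true branch. The sketch below checks that property on a toy graph by removing the true edge and testing whether any emit is still reachable. It is an illustration of the control-flow half only; SelectionDetectorSketch and its methods are hypothetical, and a real analyzer would work on bytecode and would also need data-flow checks, for example that the predicate depends only on the input record.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: detect a guarding predicate on a toy control-flow graph.
class SelectionDetectorSketch {
    // Adjacency list: node name -> successor node names.
    private final Map<String, List<String>> edges = new HashMap<>();
    private final Set<String> emitNodes = new HashSet<>();

    void edge(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    void markEmit(String node) {
        emitNodes.add(node);
    }

    // Returns true when no emit node is reachable from entry once the edge
    // guard -> trueTarget is ignored, i.e. every emit is guarded by the predicate.
    boolean guardsAllEmits(String entry, String guard, String trueTarget) {
        Deque<String> work = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        work.push(entry);
        while (!work.isEmpty()) {
            String node = work.pop();
            if (!seen.add(node)) {
                continue; // already visited
            }
            if (emitNodes.contains(node)) {
                return false; // an emit can happen without the predicate holding
            }
            for (String next : edges.getOrDefault(node, Collections.<String>emptyList())) {
                if (node.equals(guard) && next.equals(trueTarget)) {
                    continue; // skip the predicate's true branch
                }
                work.push(next);
            }
        }
        return true;
    }
}
```

For the mapper on this slide the graph is entry -> guard, guard -> emit (true), guard -> exit (false), emit -> exit, and the check succeeds; adding any path that reaches emit while bypassing the guard makes it fail, which is the conservative outcome a safe optimizer needs.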
20
Current Optimizations
  o B+-Tree for selections
  o Projected views
  o Delta compression on numerics
  o Direct operation on compressed data
Hadoop compression is not semantics-aware
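Delta compression on numerics stores each value as its difference from the previous one, so sorted or slowly changing columns turn into runs of small integers that a variable-length encoding can then pack into fewer bytes. A minimal sketch of the encode and decode steps follows; DeltaCodecSketch is a hypothetical name and this is not Manimal's actual codec.

```java
// Hypothetical sketch of delta encoding for a numeric column.
class DeltaCodecSketch {
    // Replace each value with its difference from the previous value.
    // The first value is stored as a difference from zero, i.e. unchanged.
    static long[] encode(long[] values) {
        long[] deltas = new long[values.length];
        long previous = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - previous;
            previous = values[i];
        }
        return deltas;
    }

    // Rebuild the original values with a running sum over the deltas.
    static long[] decode(long[] deltas) {
        long[] values = new long[deltas.length];
        long running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            values[i] = running;
        }
        return values;
    }
}
```

For example, encoding {100, 103, 104, 110} yields {100, 3, 1, 6}, and decoding that array restores the original values.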
21
Outline
Introduction
Execution Framework
Optimization/Analyzer Examples
Experiments
  o Analyzer recall
  o Performance gain
Related Work and Conclusion
22
Experiments: Analyzer
Test MapReduce programs from Pavlo et al., SIGMOD ’09
Detected 5 out of 8 opportunities:
  o Two misses due to a custom serialization class
  o Another miss requires knowledge of java.util.Hashtable semantics
23
Experiments: Performance
Optimize four Web page handling tasks:
  o Selection (filtering)
  o Projection (aggregation on subfield of page)
  o Join (pages to user visits)
  o User Defined Functions (aggregation)
5 cluster nodes, 123 GB of data
24
Experiments: Performance
| Description | Hadoop |
| Selection | 430 s |
| Projection | 5496 s |
| Join | 6078 s |
25
Experiments: Performance
| Description | Hadoop | Manimal | Speedup |
| Selection | 430 s | 38 s | 11.2 |
| Projection | 5496 s | 1856 s | 2.96 |
| Join | 6078 s | 904 s | 6.73 |
26
Experiments: Performance
| Description | Hadoop | Manimal | Speedup | Space Overhead |
| Selection | 430 s | 38 s | 11.2 | 0.1% |
| Projection | 5496 s | 1856 s | 2.96 | 20% |
| Join | 6078 s | 904 s | 6.73 | 11.7% |
Up to 11x speedup over original Hadoop
Performance comparable to DBMS-X from Pavlo
UDF not detected: running time identical
27
Outline
Introduction
Execution Framework
Optimization/Analyzer Examples
Experiments
  o Analyzer recall
  o Performance gain
Related Work and Conclusion
28
Related Work
Lots of recent MapReduce activity
  o Quincy: Task scheduling (Isard et al., SOSP 2009)
  o HadoopDB (Abouzeid et al., PVLDB 2009)
  o Hadoop++ (Dittrich et al., PVLDB 2010)
  o HaLoop (Bu et al., PVLDB 2010)
  o Twister (Ekanayake et al., HPDC 2010)
  o Starfish (Herodotou et al., CIDR 2011)
Manimal does not introduce new optimizations. It detects and applies existing optimizations to code.
29
Lessons Learned
The Good: We can recognize data processing idioms in real code. Relational operations still exist even in the NoSQL world.
The Ugly: When we started this project in 2009, we underestimated interest in writing in higher-level languages (e.g., Pig Latin).
30
Conclusion
Manimal provides a framework for applying well-known optimization techniques to MapReduce
  o Automatic optimization of user code
  o Up to 11x speed increase
  o Provides a framework for more optimizations