Automatic Optimization of MapReduce Programs
Michael Cafarella, Eaman Jahani, Christopher Ré
August 2011
MapReduce is victorious
Google statistics:
                          Aug 04    Mar 06    Sept 07    May 10
Number of jobs            29K       171K      2127K      4474K
Machine years used
Input Data (TB)           3,288     52,254    403,152    946,460
Output Data (TB)          193       2,970     14,018     45,720
Average worker machines
Hadoop statistics: 7 PB+ Vertica clusters vs. 22 PB+ Cloudera Hadoop clusters [1]
1. Omer Trajman, Cloudera VP,
MapReduce in relational land
Designers' original intention: free-form data
o Web-scale indexing/log processing
But many relational workloads [1]
o Complex queries/data analysis
Caveat: MR performance lags RDBMS performance
1. Karmasphere Corporation: A study of Hadoop developers,
Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009 Selection is Slower with MapReduce
Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009 Join is Even Slower
MR Lags in Relational Land
Stonebraker, DeWitt: "MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism." [1]
Query processing tasks
o No metadata, semantics, indices
o Free-form input is a double-edged sword
1. MapReduce: a major step backwards, http://databasecolumn.vertica.com/
Manimal
Manimal is a hybrid system that combines the MapReduce programming model with well-known execution techniques.
These techniques are found today only in RDBMSs, but they belong in MapReduce, too.
Manimal Approach
[Architecture diagram; labels: bytecode (*.class), MR Engine, Static Analyzer, optimization opportunities, Optimizer logic, execution path, Execution Framework]
void map(Text key, WebPage w) {
  if(w.rank > 10) emit(w.url,w.rank);
}
SELECTION from B+Tree index on W.RANK
Challenges:
o Safely detect query semantic optimization
o How much performance gain?
Manimal Contributions
Our Manimal system:
o Detects safe relational optimizations in users' compiled MapReduce programs
Our results:
o Runs with unmodified MapReduce code
o Runs up to 11x faster on same code
o Provides framework for more optimizations
Outline
Introduction
Execution Framework
Optimization/Analyzer Examples
Experiments
o Analyzer recall
o Performance gain
Related Work and Conclusion
Execution Framework
public void map(Text key, WebPage w, OutputCollector out) {
  if(w.rank > 10) emit(w.url, w.rank);
}
Execution Framework
[Pipeline diagram: Analyzer, Optimizer, Execution; input is the user program's bytecode: varload 'value', invokevirtual, astore 'text', …, ifeq, …]
Execution Framework
Analyzer in: user program
Analyzer out: optimization descriptor, index-generation program
o Optimization descriptor: (SELECT f, w.rank>10)
o Index-generation program:
void map(k, w) {
  out.set(indexedOutputFormat);
  emit(w.rank, (k,w))
}
[Pipeline diagram: Analyzer, Optimizer, Execution]
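To make the index-generation step concrete, here is a minimal sketch in plain Java, with no Hadoop dependency. It assumes a hypothetical in-memory `WebPage` type and uses `java.util.TreeMap` as a stand-in for the B+Tree index; the class and method names are illustrative and are not Manimal's actual code. The index re-keys every record by `rank`, as the generated `map` above does, and the selection `w.rank > 10` then touches only the matching range.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

final class IndexedSelectionSketch {
    /** Hypothetical record type mirroring the WebPage fields used on the slides. */
    static final class WebPage {
        final String url;
        final int rank;
        WebPage(String url, int rank) { this.url = url; this.rank = rank; }
    }

    /** Baseline: the user's map() logic applied to every record (full scan). */
    static List<String> fullScan(List<WebPage> pages) {
        List<String> out = new ArrayList<>();
        for (WebPage w : pages) {
            if (w.rank > 10) out.add(w.url + "\t" + w.rank);
        }
        return out;
    }

    /** Index generation: re-key every record by rank, kept in sorted order. */
    static TreeMap<Integer, List<WebPage>> buildRankIndex(List<WebPage> pages) {
        TreeMap<Integer, List<WebPage>> index = new TreeMap<>();
        for (WebPage w : pages) {
            index.computeIfAbsent(w.rank, r -> new ArrayList<>()).add(w);
        }
        return index;
    }

    /** Optimized path: visit only the index entries whose key is strictly above 10. */
    static List<String> indexScan(TreeMap<Integer, List<WebPage>> index) {
        List<String> out = new ArrayList<>();
        for (List<WebPage> bucket : index.tailMap(10, false).values()) {
            for (WebPage w : bucket) out.add(w.url + "\t" + w.rank);
        }
        return out;
    }

    /** Compares two outputs while ignoring record order. */
    static boolean sameOutput(List<String> a, List<String> b) {
        List<String> x = new ArrayList<>(a);
        List<String> y = new ArrayList<>(b);
        Collections.sort(x);
        Collections.sort(y);
        return x.equals(y);
    }
}
```

The index scan returns records in rank order rather than input order, so the comparison sorts both outputs first. The sketch treats map output as an unordered collection, which is an assumption of this example.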
Execution Framework
Optimizer in: optimization descriptor, catalog
o Optimization descriptor: (SELECT f, w.rank>10)
o Catalog:
/logs/log.1    /logs/log.1.idx    select src…
/logs/log.2    /logs/log.2.idx    select src…
Optimizer out: execution descriptor
o Execution descriptor: (SELECT, "log.1.idx", w.rank>10)
[Pipeline diagram: Analyzer, Optimizer, Execution]
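The optimizer step can be read as a catalog lookup. The sketch below is one way to picture it, under stated assumptions: the three descriptor classes, their field names, and the `SCAN` fallback are hypothetical and chosen for illustration; the slides only show the tuple shapes `(SELECT f, w.rank>10)` and `(SELECT, "log.1.idx", w.rank>10)`.

```java
import java.util.List;

final class OptimizerSketch {
    /** One catalog row (hypothetical layout): which index exists for which input file. */
    static final class CatalogEntry {
        final String inputPath, indexPath, kind, field;
        CatalogEntry(String inputPath, String indexPath, String kind, String field) {
            this.inputPath = inputPath; this.indexPath = indexPath;
            this.kind = kind; this.field = field;
        }
    }

    /** Analyzer output: an optimization that is safe for the user program. */
    static final class OptimizationDescriptor {
        final String kind, field, predicate;
        OptimizationDescriptor(String kind, String field, String predicate) {
            this.kind = kind; this.field = field; this.predicate = predicate;
        }
    }

    /** Optimizer output: what the execution stage should actually read. */
    static final class ExecutionDescriptor {
        final String kind, path, predicate;
        ExecutionDescriptor(String kind, String path, String predicate) {
            this.kind = kind; this.path = path; this.predicate = predicate;
        }
    }

    /** Use a matching index if the catalog has one; otherwise scan the original input. */
    static ExecutionDescriptor plan(String inputPath, OptimizationDescriptor opt,
                                    List<CatalogEntry> catalog) {
        for (CatalogEntry e : catalog) {
            if (e.inputPath.equals(inputPath) && e.kind.equals(opt.kind)
                    && e.field.equals(opt.field)) {
                return new ExecutionDescriptor(opt.kind, e.indexPath, opt.predicate);
            }
        }
        // Assumed fallback: no usable index, so run the unmodified program on the raw file.
        return new ExecutionDescriptor("SCAN", inputPath, null);
    }
}
```

With a catalog row for `/logs/log.1`, `plan` yields a `SELECT` descriptor pointing at `/logs/log.1.idx`; for an input with no catalog row it falls back to a plain scan, so the job runs unoptimized.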
Execution Framework
Execution in: execution descriptor, user program
o Execution descriptor: (SELECT, "log.1.idx", w.rank>10)
Execution out: program output
[Pipeline diagram: Analyzer, Optimizer, Execution; label: numwords]
An Optimization Example
//webpage.java: SCHEMA!
class WebPage { String url; int rank; String content; }
//mapper.java
void map(Text key, WebPage w) {
  if (w.url.equals("teaparty.fr")) emit(w.url, 1);
}
Data-centric programming idioms == relational ops
o PROJECTED view: (url,null,null)
o DIRECT-OP on compressed WebPage
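The projection opportunity can be checked with a small sketch. It assumes a hypothetical in-memory `WebPage` with a nullable `rank`, so that the `(url,null,null)` view is representable, and it compares strings with `equals()` instead of `==`. The mapper reads only `url`, so running it on the projected view should give the same output as running it on the full records.

```java
import java.util.ArrayList;
import java.util.List;

final class ProjectionSketch {
    /** Hypothetical record type; rank is an Integer so it can be nulled out. */
    static final class WebPage {
        final String url;
        final Integer rank;
        final String content;
        WebPage(String url, Integer rank, String content) {
            this.url = url; this.rank = rank; this.content = content;
        }
    }

    /** The mapper logic from the slide: it only ever reads w.url. */
    static List<String> map(List<WebPage> pages) {
        List<String> out = new ArrayList<>();
        for (WebPage w : pages) {
            if ("teaparty.fr".equals(w.url)) out.add(w.url + "\t1");
        }
        return out;
    }

    /** Projected view: keep url and drop the fields the mapper never touches. */
    static List<WebPage> project(List<WebPage> pages) {
        List<WebPage> view = new ArrayList<>();
        for (WebPage w : pages) view.add(new WebPage(w.url, null, null));
        return view;
    }
}
```

The saving in this sketch comes from never materializing `content`, which is usually the largest field of a page; that size claim is an assumption about typical web data, not a measurement from the talk.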
Semantic Extraction
Query semantics are obvious to human readers, but they are not explicit in the code that the framework sees
EXTRACT IT!
o Static code analysis
o Control-flow graph and data-flow graph
o Find opportunities: selection, projection, direct op
o Safe optimizations: same output
Analyzer: An Example
//webpage.java
class WebPage { String url; int rank; String content; }
//mapper.java
map(Text key, WebPage w) {
  if (w.rank > 10) emit(w.url,w.rank);
}
[Analyzer output, control-flow graph nodes: Fn Entry, w.rank > 10, emit(url,rank), Fn Exit]
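A toy version of the selection check can be written over a hand-built control-flow graph. This is a sketch of the general idea only, not Manimal's analyzer: the `CfgNode` type is hypothetical, the graph is assumed to be acyclic, and the sketch does not verify that the predicate is side-effect free or depends only on the input record, both of which a real analysis would need. It collects the branch conditions that guard every path to an `emit`; a condition that guards all of them is a candidate selection predicate.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SelectionAnalyzerSketch {
    /** Hypothetical CFG node: kind is "entry", "branch", "emit", or "exit". */
    static final class CfgNode {
        final String kind;
        final String predicate; // set only for "branch" nodes
        CfgNode next;           // successor, or the true-edge of a branch
        CfgNode onFalse;        // false-edge of a branch
        CfgNode(String kind, String predicate) { this.kind = kind; this.predicate = predicate; }
    }

    /** Returns the predicates that hold on every path reaching an emit. */
    static Set<String> guardsOfEveryEmit(CfgNode entry) {
        List<Set<String>> perEmit = new ArrayList<>();
        walk(entry, new LinkedHashSet<String>(), perEmit);
        if (perEmit.isEmpty()) return new LinkedHashSet<String>();
        Set<String> common = new LinkedHashSet<>(perEmit.get(0));
        for (Set<String> guards : perEmit) common.retainAll(guards);
        return common;
    }

    /** Depth-first walk that tracks which true-edges were taken so far. */
    private static void walk(CfgNode n, Set<String> guards, List<Set<String>> perEmit) {
        if (n == null) return;
        if (n.kind.equals("emit")) perEmit.add(new LinkedHashSet<>(guards));
        if (n.kind.equals("branch")) {
            Set<String> taken = new LinkedHashSet<>(guards);
            taken.add(n.predicate);
            walk(n.next, taken, perEmit);     // true edge: predicate holds
            walk(n.onFalse, guards, perEmit); // false edge: predicate not recorded
        } else {
            walk(n.next, guards, perEmit);
        }
    }

    /** Builds the graph from the slide: Fn Entry, w.rank > 10, emit, Fn Exit. */
    static CfgNode exampleMapper() {
        CfgNode entry = new CfgNode("entry", null);
        CfgNode branch = new CfgNode("branch", "w.rank > 10");
        CfgNode emit = new CfgNode("emit", null);
        CfgNode exit = new CfgNode("exit", null);
        entry.next = branch;
        branch.next = emit;
        branch.onFalse = exit;
        emit.next = exit;
        return entry;
    }
}
```

For the example mapper the only path to `emit` takes the true edge of `w.rank > 10`, so that predicate is reported. A mapper that also emits unconditionally yields an empty set, and no selection is proposed.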
Current Optimizations
o B+-Tree for selections
o Projected views
o Delta compression on numerics
o Direct operation on compressed data
Hadoop compression is not semantics-aware
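Two of the listed techniques are easy to sketch in isolation. The code below is an illustration under assumptions, not the encoding Manimal uses: delta compression is shown as plain differences between consecutive integers, and direct operation is shown with dictionary encoding, a swapped-in technique picked here because equality tests can run on the integer codes without decompressing the strings. All names are hypothetical.

```java
import java.util.Map;

final class CompressionSketch {
    /** Delta-encode: store the first value, then each difference from its predecessor. */
    static int[] deltaEncode(int[] values) {
        int[] deltas = new int[values.length];
        int prev = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - prev;
            prev = values[i];
        }
        return deltas;
    }

    /** Delta-decode: a running sum restores the original values. */
    static int[] deltaDecode(int[] deltas) {
        int[] values = new int[deltas.length];
        int running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            values[i] = running;
        }
        return values;
    }

    /** Dictionary-encode strings: each distinct string gets the next free int code. */
    static int[] dictionaryEncode(String[] urls, Map<String, Integer> dict) {
        int[] codes = new int[urls.length];
        for (int i = 0; i < urls.length; i++) {
            Integer code = dict.get(urls[i]);
            if (code == null) {
                code = dict.size();
                dict.put(urls[i], code);
            }
            codes[i] = code;
        }
        return codes;
    }

    /** Direct operation: count equality matches by comparing codes, never decoding. */
    static int countEquals(int[] codes, Map<String, Integer> dict, String target) {
        Integer code = dict.get(target);
        if (code == null) return 0; // the target never occurs in the data
        int n = 0;
        for (int c : codes) {
            if (c == code) n++;
        }
        return n;
    }
}
```

Deltas of slowly changing numeric columns tend to be small, and a downstream byte-level compressor can store small numbers compactly; the sketch stops at computing the deltas and does not pack them.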
Experiments: Analyzer
Test MapReduce programs from Pavlo, SIGMOD '09:
Detected 5 out of 8 opportunities:
o Two misses due to custom serialization class
o Another miss requires knowledge of java.util.Hashtable semantics
Experiments: Performance
Optimize four Web page handling tasks:
o Selection (filtering)
o Projection (aggregation on subfield of page)
o Join (pages to user visits)
o User Defined Functions (aggregation)
5 cluster nodes, 123 GB of data
Experiments: Performance
Description    Hadoop    Manimal    Speedup
Selection      430 s     38 s       11.2
Projection     5496 s    1856 s     2.96
Join           6078 s    904 s      6.73
Experiments: Performance
o Up to 11x speedup over original Hadoop
o Performance comparable to DBMS-X from Pavlo
o UDF not detected: running time identical
Description    Hadoop    Manimal    Speedup    Space Overhead
Selection      430 s     38 s       11.2       %
Projection     5496 s    1856 s     2.96       %
Join           6078 s    904 s      6.73       %
Related Work
Lots of recent MapReduce activity
o Quincy: Task scheduling (Isard et al., SOSP 2009)
o HadoopDB (Abouzeid et al., PVLDB 2009)
o Hadoop++ (Dittrich et al., PVLDB 2010)
o HaLoop (Bu et al., PVLDB 2010)
o Twister (Ekanayake et al., HPDC 2010)
o Starfish (Herodotou et al., CIDR 2011)
Manimal does not introduce new optimizations. It detects existing optimizations and applies them to code.
Lessons Learned
The Good: We can recognize data processing idioms in real code. Relational operations still exist, even in the NoSQL world.
The Ugly: When we started this project in 2009, we underestimated interest in writing in higher-level languages (e.g., Pig Latin).
Conclusion
Manimal provides a framework for applying well-known optimization techniques to MapReduce
o Automatic optimization of user code
o Up to 11x speed increase
o Provides a framework for more optimizations