Automatic optimization of MapReduce Programs Michael Cafarella, Eaman Jahani, Christopher Re August 2011.


1 Automatic optimization of MapReduce Programs Michael Cafarella, Eaman Jahani, Christopher Re August 2011

2 MapReduce is victorious
  Google statistics:

                            Aug 04    Mar 06    Sept 07    May 10
  Number of jobs            29K       171K      2127K      4474K
  Machine years used        217       2002      11081      39121
  Input Data (TB)           3,288     52,254    403,152    946,460
  Output Data (TB)          193       2,970     14,018     45,720
  Average worker machines   157       268       394        368

  Hadoop statistics: 7 PB+ Vertica clusters vs. 22 PB+ Cloudera Hadoop clusters [1]
  [1] Omer Trajman, Cloudera VP, http://www.dbms2.com/

3 MapReduce in relational land
  Designers' original intention: free-form data
    o web-scale indexing/log processing
  But, many relational workloads [1]
    o Complex queries/data analysis
  Caveat: MR performance lags RDBMS performance
  [1] Karmasphere Corporation: A study of Hadoop developers, http://karmasphere.com, 2010

4 Selection is Slower with MapReduce
  Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009

5 Join is Even Slower
  Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009

6 MR Lags in Relational Land
  Stonebraker, DeWitt: "MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism." [1]
  Query processing tasks
    o No metadata, semantics, indices
    o Free-formed input is a double-edged sword
  [1] MapReduce: a major step backwards, http://databasecolumn.vertica.com/, 2008

7 Manimal
  Manimal is a hybrid system that combines the MapReduce programming model with well-known execution techniques.
  These techniques are found today only in RDBMSs, but should be in MapReduce, too.

8 Manimal Approach
  [Diagram: bytecode (*.class) -> Static Analyzer -> optimization opportunities -> Optimizer logic -> execution path -> Execution Framework (MR Engine)]

  void map(Text key, WebPage w) {
    if (w.rank > 10)
      emit(w.url, w.rank);
  }

  SELECTION from B+Tree index on W.RANK
  Challenges:
    o Safely detect query semantic optimization
    o How much performance gain?

9 Manimal Contributions
  Our Manimal system:
    o Detect safe relational optimizations in users' compiled MapReduce programs
  Our results:
    o Runs with unmodified MapReduce code
    o Runs up to 11x faster on same code
    o Provides framework for more optimizations

10 Outline
  Introduction
  Execution Framework
  Optimization/Analyzer Examples
  Experiments
    o Analyzer recall
    o Performance gain
  Related Work and Conclusion

11 Execution framework

  public void map(Text key, WebPage w, OutputCollector out) {
    if (w.rank > 10)
      emit(w.url, w.rank);
  }

12 Execution Framework
  Bytecode: varload 'value' invokevirtual astore 'text' … ifeq …
  Analyzer -> Optimizer -> Execution

13 Execution Framework
  Analyzer -> Optimizer -> Execution
  Analyzer in:
    o user program (bytecode: varload 'value' invokevirtual astore 'text' … ifeq …)
  Analyzer out:
    o optimization descriptor: (SELECT f, w.rank>10)
    o index-generation program:

  void map(k, w) {
    out.set(indexedOutputFormat);
    emit(w.rank, (k,w));
  }
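The index-generation step on this slide can be illustrated with a minimal sketch. It assumes an in-memory java.util.TreeMap as a stand-in for the on-disk B+-tree index; the class and method names (RankIndexSketch, buildIndex, fullScan, indexScan) are hypothetical and not part of Manimal. The point is only that a scan restricted to index keys above the threshold yields the same records as running the original predicate over every record, though possibly in a different order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a TreeMap stands in for the B+-tree index.
// None of these names come from Manimal itself.
class RankIndexSketch {
    static final class WebPage {
        final String url;
        final int rank;
        WebPage(String url, int rank) { this.url = url; this.rank = rank; }
    }

    // Index generation: key each page by rank, the field the predicate tests.
    static TreeMap<Integer, List<WebPage>> buildIndex(List<WebPage> pages) {
        TreeMap<Integer, List<WebPage>> index = new TreeMap<>();
        for (WebPage w : pages) {
            index.computeIfAbsent(w.rank, r -> new ArrayList<>()).add(w);
        }
        return index;
    }

    // Baseline: apply the original map logic to every record.
    static List<String> fullScan(List<WebPage> pages, int threshold) {
        List<String> out = new ArrayList<>();
        for (WebPage w : pages) {
            if (w.rank > threshold) out.add(w.url + "\t" + w.rank);
        }
        return out;
    }

    // Optimized: visit only index entries whose key is strictly above the threshold.
    static List<String> indexScan(TreeMap<Integer, List<WebPage>> index, int threshold) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<WebPage>> e : index.tailMap(threshold, false).entrySet()) {
            for (WebPage w : e.getValue()) out.add(w.url + "\t" + w.rank);
        }
        return out;
    }

    public static void main(String[] args) {
        List<WebPage> pages = List.of(
            new WebPage("a.example", 3),
            new WebPage("b.example", 42),
            new WebPage("c.example", 11));
        System.out.println(fullScan(pages, 10));
        System.out.println(indexScan(buildIndex(pages), 10));
    }
}
```

The index scan returns records in rank order rather than input order, so any equivalence check should compare the two results as sets.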

14 Execution Framework
  Analyzer -> Optimizer -> Execution
  Optimizer in:
    o optimization descriptor: (SELECT f, w.rank>10)
    o catalog:
        /logs/log.1   /logs/log.1.idx   select src…
        /logs/log.2   /logs/log.2.idx   select src…
  Optimizer out:
    o execution descriptor: (SELECT, "log.1.idx", w.rank>10)
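One way to picture the optimizer step is as a catalog lookup: if an index on the right field exists for the input file, the optimization descriptor is rewritten into an execution descriptor that names that index; otherwise the job runs unoptimized. The sketch below is an assumption about the shape of that logic, not Manimal's implementation. All class, field, and method names (OptimizerSketch, IndexEntry, plan) are hypothetical, and the descriptor is rendered as a plain string purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the optimizer step; names are illustrative only.
class OptimizerSketch {
    // Catalog entry: an index file built for some input file, and the field it covers.
    static final class IndexEntry {
        final String indexPath;
        final String indexedField;
        IndexEntry(String indexPath, String indexedField) {
            this.indexPath = indexPath;
            this.indexedField = indexedField;
        }
    }

    // Optimization descriptor as produced by the analyzer.
    static final class OptimizationDescriptor {
        final String op;        // e.g. "SELECT"
        final String field;     // e.g. "w.rank"
        final String predicate; // e.g. "w.rank>10"
        OptimizationDescriptor(String op, String field, String predicate) {
            this.op = op;
            this.field = field;
            this.predicate = predicate;
        }
    }

    // Returns an execution descriptor if the catalog holds a usable index,
    // or empty to signal a fallback to ordinary execution.
    static Optional<String> plan(Map<String, IndexEntry> catalog,
                                 String inputPath,
                                 OptimizationDescriptor d) {
        IndexEntry e = catalog.get(inputPath);
        if (e == null || !e.indexedField.equals(d.field)) {
            return Optional.empty();
        }
        return Optional.of("(" + d.op + ", \"" + e.indexPath + "\", " + d.predicate + ")");
    }

    public static void main(String[] args) {
        Map<String, IndexEntry> catalog = new HashMap<>();
        catalog.put("/logs/log.1", new IndexEntry("/logs/log.1.idx", "w.rank"));
        OptimizationDescriptor d = new OptimizationDescriptor("SELECT", "w.rank", "w.rank>10");
        System.out.println(plan(catalog, "/logs/log.1", d));
        System.out.println(plan(catalog, "/logs/log.3", d));
    }
}
```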

15 Execution Framework
  Analyzer -> Optimizer -> Execution
  Execution in:
    o execution descriptor: (SELECT, "log.1.idx", w.rank>10)
    o user program (bytecode: varload 'value' invokevirtual astore 'text' … ifeq …)
  Execution out:
    o program output: numwords 19519

16 Outline
  Introduction
  Execution Framework
  Optimization/Analyzer Examples
  Experiments
    o Analyzer recall
    o Performance gain
  Related Work and Conclusion

17 An Optimization Example

  //webpage.java: SCHEMA!
  class WebPage { String url; int rank; String content; }

  //mapper.java
  void map(Text key, WebPage w) {
    // equals() compares string contents; == would compare references
    if (w.url.equals("teaparty.fr"))
      emit(w.url, 1);
  }

  Data-centric programming idioms == relational ops
  PROJECTED view: (url,null,null)
  DIRECT-OP on compressed Webpage
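The "direct operation on compressed data" idea can be sketched as follows. The slides do not say which compression scheme is used, so this example assumes simple dictionary encoding of the url field: the constant in the predicate is encoded once, and each record is then tested by integer comparison without decoding any URL. All names (DirectOpSketch, encode, lookup, countMatches) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: dictionary encoding is an assumed compression scheme,
// chosen only to illustrate evaluating a predicate on encoded values.
class DirectOpSketch {
    private final Map<String, Integer> codes = new HashMap<>();

    // Assign each distinct URL a small integer code on first sight.
    int encode(String url) {
        Integer code = codes.get(url);
        if (code == null) {
            code = codes.size();
            codes.put(url, code);
        }
        return code;
    }

    // Code for a URL already in the dictionary, or -1 if it never occurred.
    int lookup(String url) {
        return codes.getOrDefault(url, -1);
    }

    // Evaluate url == constant by comparing codes; no string is decoded.
    static int countMatches(int[] encodedUrls, int targetCode) {
        int matches = 0;
        for (int code : encodedUrls) {
            if (code == targetCode) matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        DirectOpSketch dict = new DirectOpSketch();
        String[] urls = {"teaparty.fr", "example.org", "teaparty.fr"};
        int[] encoded = new int[urls.length];
        for (int i = 0; i < urls.length; i++) encoded[i] = dict.encode(urls[i]);
        System.out.println(countMatches(encoded, dict.lookup("teaparty.fr")));
    }
}
```

Because the mapper on this slide reads only url, the encoded column also corresponds to the projected view (url,null,null): rank and content never need to be read.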

18 Semantic Extraction
  Query semantics are obvious to human readers, but not explicit in the code for the framework.
  EXTRACT IT!
    o Static code analysis
    o Control-flow graph and data-flow graph
    o Find opportunities: selection, projection, direct op
    o Safe optimizations: same output

19 Analyzer: An Example

  //webpage.java
  class WebPage { String url; int rank; String content; }

  //mapper.java
  map(Text key, WebPage w) {
    if (w.rank > 10)
      emit(w.url, w.rank);
  }

  Analyzer control-flow graph: Fn Entry -> w.rank > 10 -> emit(url,rank) -> Fn Exit
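The control-flow check behind this example can be approximated with a toy graph. This is a simplified sketch, not Manimal's analyzer: it models only node kinds and edges, and it assumes the condition has no side effects, depends only on the input record, and has distinct true and false successors. The test is that once the condition's true edge is removed, no emit node is reachable from the function entry, so the condition guards every emit and can be treated as a selection predicate. All names (SelectionDetectorSketch, Cfg, guardsAllEmits) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy analyzer: checks whether a condition guards every emit.
class SelectionDetectorSketch {
    enum Kind { ENTRY, COND, EMIT, EXIT }

    static final class Cfg {
        final List<Kind> kinds = new ArrayList<>();
        final List<List<Integer>> succ = new ArrayList<>();

        int addNode(Kind kind) {
            kinds.add(kind);
            succ.add(new ArrayList<>());
            return kinds.size() - 1;
        }

        void addEdge(int from, int to) {
            succ.get(from).add(to);
        }
    }

    // True if no EMIT node is reachable from entry once the edge
    // cond -> trueTarget is ignored, i.e. records failing the condition
    // can never produce output.
    static boolean guardsAllEmits(Cfg g, int entry, int cond, int trueTarget) {
        Deque<Integer> work = new ArrayDeque<>();
        Set<Integer> seen = new HashSet<>();
        work.push(entry);
        seen.add(entry);
        while (!work.isEmpty()) {
            int node = work.pop();
            if (g.kinds.get(node) == Kind.EMIT) return false;
            for (int next : g.succ.get(node)) {
                if (node == cond && next == trueTarget) continue; // skip the true branch
                if (seen.add(next)) work.push(next);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Graph from the slide: Fn Entry -> (w.rank > 10) -> emit -> Fn Exit,
        // plus the false branch from the condition straight to Fn Exit.
        Cfg g = new Cfg();
        int entry = g.addNode(Kind.ENTRY);
        int cond = g.addNode(Kind.COND);
        int emit = g.addNode(Kind.EMIT);
        int exit = g.addNode(Kind.EXIT);
        g.addEdge(entry, cond);
        g.addEdge(cond, emit);
        g.addEdge(cond, exit);
        g.addEdge(emit, exit);
        System.out.println(guardsAllEmits(g, entry, cond, emit));
    }
}
```

A mapper with a second, unconditional emit would fail this check, which is the conservative outcome: the selection optimization is applied only when skipping non-matching records cannot change the output.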

20 Current Optimizations
  B+-Tree for selections
  Projected views
  Delta compression on numerics
  Direct operation on compressed data
    o Hadoop compression is not semantic-aware
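Delta compression on numerics, listed above, can be illustrated with a minimal sketch. The slides give no detail on the encoding Manimal uses, so this shows only the generic technique: store the first value, then each value's difference from its predecessor, so that slowly changing numeric fields become small numbers that a later variable-length byte encoding (not shown) could store compactly. The class name DeltaSketch is hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of generic delta encoding; not Manimal's actual format.
class DeltaSketch {
    // First element is stored as-is (difference from 0); the rest are differences
    // from the preceding value.
    static int[] encode(int[] values) {
        int[] deltas = new int[values.length];
        int previous = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - previous;
            previous = values[i];
        }
        return deltas;
    }

    // Rebuild the original values with a running sum over the deltas.
    static int[] decode(int[] deltas) {
        int[] values = new int[deltas.length];
        int running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            values[i] = running;
        }
        return values;
    }

    public static void main(String[] args) {
        int[] ranks = {100, 102, 103, 110};
        int[] deltas = encode(ranks);
        System.out.println(Arrays.toString(deltas));         // prints [100, 2, 1, 7]
        System.out.println(Arrays.toString(decode(deltas))); // prints [100, 102, 103, 110]
    }
}
```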

21 Outline
  Introduction
  Execution Framework
  Optimization/Analyzer Examples
  Experiments
    o Analyzer recall
    o Performance gain
  Related Work and Conclusion

22 Experiments: Analyzer
  Test MapReduce programs from Pavlo, SIGMOD '09
  Detected 5 out of 8 opportunities:
    o Two misses due to custom serialization class
    o Another miss requires knowledge of java.util.Hashtable semantics

23 Experiments: Performance
  Optimize four Web page handling tasks:
    o Selection (filtering)
    o Projection (aggregation on subfield of page)
    o Join (pages to user visits)
    o User Defined Functions (aggregation)
  5 cluster nodes, 123GB of data

24 Experiments: Performance

  Description   Hadoop
  Selection     430 s
  Projection    5496 s
  Join          6078 s

25 Experiments: Performance

  Description   Hadoop    Manimal   Speedup
  Selection     430 s     38 s      11.2
  Projection    5496 s    1856 s    2.96
  Join          6078 s    904 s     6.73

26 Experiments: Performance
  Up to 11x speedup over original Hadoop
  Performance comparable to DBMS-X from Pavlo
  UDF not detected: running time identical

  Description   Hadoop    Manimal   Speedup   Space Overhead
  Selection     430 s     38 s      11.2      0.1%
  Projection    5496 s    1856 s    2.96      20%
  Join          6078 s    904 s     6.73      11.7%

27 Outline
  Introduction
  Execution Framework
  Optimization/Analyzer Examples
  Experiments
    o Analyzer recall
    o Performance gain
  Related Work and Conclusion

28 Related Work
  Lots of recent MapReduce activity
    o Quincy: Task scheduling (Isard et al., SOSP 2009)
    o HadoopDB (Abouzeid et al., PVLDB 2009)
    o Hadoop++ (Dittrich et al., PVLDB 2010)
    o HaLoop (Bu et al., PVLDB 2010)
    o Twister (Ekanayake et al., HPDC 2010)
    o Starfish (Herodotou et al., CIDR 2011)
  Manimal does not introduce new optimizations. It detects and applies existing optimizations to code.

29 Lessons Learned
  The Good: We can recognize data processing idioms in real code. Relational operations still exist even in the NoSQL world.
  The Ugly: When we started this project in 2009, we underestimated interest in writing in higher-level languages (e.g., Pig Latin).

30 Conclusion
  Manimal provides a framework for applying well-known optimization techniques to MapReduce
    o Automatic optimization of user code
    o Up to 11x speed increase
    o Provides framework for more optimizations

