Automatic Optimization of MapReduce Programs
Michael Cafarella, Eaman Jahani, Christopher Ré
August 2011



MapReduce is victorious

Google statistics:

                           Aug 04     Mar 06    Sept 07     May 10
Number of jobs                29K       171K      2127K      4474K
Machine years used
Input Data (TB)             3,288     52,254    403,152    946,460
Output Data (TB)              193      2,970     14,018     45,720
Average worker machines

Hadoop statistics: 7 PB+ Vertica clusters vs. 22 PB+ Cloudera Hadoop clusters [1]

[1] Omer Trajman, Cloudera VP,

MapReduce in relational land
Designers' original intention: free-form data
o Web-scale indexing/log processing
But many relational workloads [1]
o Complex queries/data analysis
Caveat: MR performance lags RDBMS performance

[1] Karmasphere Corporation: A study of Hadoop developers,

Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009 Selection is Slower with MapReduce

Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009 Join is Even Slower

MR Lags in Relational Land
Stonebraker, DeWitt: "MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism." [1]
Query processing tasks
o No metadata, semantics, indices
o Free-form input is a double-edged sword

[1] MapReduce: a major step backwards, http://databasecolumn.vertica.com/

Manimal
Manimal is a hybrid system that combines the MapReduce programming model with well-known execution techniques.
These techniques are found today only in RDBMSs, but they should be in MapReduce, too.

Manimal Approach
[Diagram components: bytecode (*.class), Static Analyzer, Optimizer logic, Execution Framework, MR Engine; arrows labeled "optimization opportunities" and "execution path"]

    void map(Text key, WebPage w) {
      if (w.rank > 10) emit(w.url, w.rank);
    }

SELECTION from B+Tree index on W.RANK
Challenges:
o Safely detect query-semantic optimizations
o How much performance gain?

Manimal Contributions
Our Manimal system:
o Detects safe relational optimizations in users' compiled MapReduce programs
Our results:
o Runs with unmodified MapReduce code
o Runs up to 11x faster on the same code
o Provides a framework for more optimizations

Outline
Introduction
Execution Framework
Optimization/Analyzer Examples
Experiments
o Analyzer recall
o Performance gain
Related Work and Conclusion

Execution Framework

    public void map(Text key, WebPage w, OutputCollector out) {
      if (w.rank > 10) emit(w.url, w.rank);
    }

Execution Framework
Bytecode of the user program:

    varload 'value'
    invokevirtual
    astore 'text'
    …
    ifeq
    …

Pipeline: Analyzer -> Optimizer -> Execution

Execution Framework
Analyzer in:
o user program
Analyzer out:
o optimization descriptor: (SELECT f, w.rank>10)
o index-generation program:

    void map(k, w) {
      out.set(indexedOutputFormat);
      emit(w.rank, (k, w));
    }
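The idea on this slide can be sketched in plain Java, outside Hadoop. This is only an illustration under stated assumptions: `java.util.TreeMap` stands in for the B+Tree index, and the `WebPage` and `IndexedSelection` classes and their method names are hypothetical, not part of Manimal. It shows that re-keying the records by `rank` (as the index-generation program does) lets a later selection visit only the qualifying keys while producing the same records as the full scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

class IndexedSelection {
    // Baseline: the user's map() predicate applied to every record (full scan).
    static List<String> fullScan(List<WebPage> pages, int threshold) {
        List<String> out = new ArrayList<>();
        for (WebPage w : pages) {
            if (w.rank > threshold) {
                out.add(w.url + "\t" + w.rank);
            }
        }
        return out;
    }

    // Index generation: key every record by rank, analogous to emit(w.rank, (k, w)).
    // Assumption: a sorted in-memory map stands in for an on-disk B+Tree.
    static TreeMap<Integer, List<WebPage>> buildIndex(List<WebPage> pages) {
        TreeMap<Integer, List<WebPage>> index = new TreeMap<>();
        for (WebPage w : pages) {
            index.computeIfAbsent(w.rank, r -> new ArrayList<>()).add(w);
        }
        return index;
    }

    // Indexed selection: visit only the keys strictly greater than the threshold.
    static List<String> indexScan(TreeMap<Integer, List<WebPage>> index, int threshold) {
        List<String> out = new ArrayList<>();
        for (List<WebPage> bucket : index.tailMap(threshold, false).values()) {
            for (WebPage w : bucket) {
                out.add(w.url + "\t" + w.rank);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<WebPage> pages = Arrays.asList(
                new WebPage("a.example", 5, ""),
                new WebPage("b.example", 12, ""),
                new WebPage("c.example", 40, ""));
        System.out.println(fullScan(pages, 10));
        System.out.println(indexScan(buildIndex(pages), 10));
    }
}

// Hypothetical record type mirroring the WebPage schema shown on the slides.
class WebPage {
    final String url;
    final int rank;
    final String content;

    WebPage(String url, int rank, String content) {
        this.url = url;
        this.rank = rank;
        this.content = content;
    }
}
```

The two methods return the same records, although the index scan yields them in rank order rather than input order.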

Execution Framework
Optimizer in:
o optimization descriptor: (SELECT f, w.rank>10)
o catalog:

    /logs/log.1    /logs/log.1.idx    select    src…
    /logs/log.2    /logs/log.2.idx    select    src…

Optimizer out:
o execution descriptor: (SELECT, "log.1.idx", w.rank>10)
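A minimal sketch of the optimizer step, assuming the catalog can be modeled as a lookup from an input path and operation to an index file. The `CatalogSketch` class, its method names, and the string form of the descriptor are hypothetical. The slides do not say what happens when no index is registered, so the empty result here is an assumption that the caller would then run the unmodified job.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class CatalogSketch {
    // Hypothetical catalog: "inputPath|operation" -> index file path.
    private final Map<String, String> indexFor = new HashMap<>();

    void register(String inputPath, String operation, String indexPath) {
        indexFor.put(inputPath + "|" + operation, indexPath);
    }

    // Builds an execution descriptor when a matching index is registered;
    // returns empty otherwise so the caller can run the unmodified job.
    Optional<String> plan(String inputPath, String operation, String predicate) {
        String index = indexFor.get(inputPath + "|" + operation);
        if (index == null) {
            return Optional.empty();
        }
        return Optional.of("(" + operation + ", \"" + index + "\", " + predicate + ")");
    }

    public static void main(String[] args) {
        CatalogSketch catalog = new CatalogSketch();
        catalog.register("/logs/log.1", "SELECT", "/logs/log.1.idx");
        System.out.println(catalog.plan("/logs/log.1", "SELECT", "w.rank>10"));
        System.out.println(catalog.plan("/logs/log.3", "SELECT", "w.rank>10"));
    }
}
```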

Execution Framework
[Diagram label: numwords]
Execution in:
o execution descriptor: (SELECT, "log.1.idx", w.rank>10)
o user program
Execution out:
o program output

An Optimization Example

    // webpage.java: SCHEMA!
    class WebPage { String url; int rank; String content; }

    // mapper.java
    void map(Text key, WebPage w) {
      if (w.url.equals("teaparty.fr")) emit(w.url, 1);
    }

Data-centric programming idioms == relational ops
o PROJECTED view: (url, null, null)
o DIRECT-OP on compressed WebPage
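To make the two ideas on this slide concrete, here is a small sketch that stores only the `url` field (the projected view) and evaluates the equality test on compressed values without decompressing them. Dictionary encoding is used purely as an assumed example of a compression scheme that permits direct operation; the slides do not say which scheme Manimal uses. The `DictionaryColumn` class is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DictionaryColumn {
    // Each distinct string is assigned a small integer code.
    private final Map<String, Integer> codeOf = new HashMap<>();
    // The projected column, stored as codes instead of full strings.
    private final List<Integer> codes = new ArrayList<>();

    void add(String value) {
        Integer code = codeOf.get(value);
        if (code == null) {
            code = codeOf.size();
            codeOf.put(value, code);
        }
        codes.add(code);
    }

    int distinctValues() {
        return codeOf.size();
    }

    // Direct operation: count rows equal to the target by comparing integer
    // codes only; no stored value is decoded back into a string.
    int countEquals(String target) {
        Integer code = codeOf.get(target);
        if (code == null) {
            return 0; // the target never occurs in this column
        }
        int wanted = code;
        int matches = 0;
        for (int c : codes) {
            if (c == wanted) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        DictionaryColumn urls = new DictionaryColumn();
        urls.add("teaparty.fr");
        urls.add("example.org");
        urls.add("teaparty.fr");
        System.out.println(urls.countEquals("teaparty.fr"));
    }
}
```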

Semantic Extraction
Query semantics are obvious to human readers, but not explicit in the code for the framework.
EXTRACT IT!
o Static code analysis
o Control-flow graph and data-flow graph
o Find opportunities: selection, projection, direct op
o Safe optimizations: same output

Analyzer: An Example

    // webpage.java
    class WebPage { String url; int rank; String content; }

    // mapper.java
    void map(Text key, WebPage w) {
      if (w.rank > 10) emit(w.url, w.rank);
    }

Control-flow graph nodes examined by the Analyzer: Fn Entry, w.rank > 10, emit(url,rank), Fn Exit
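One way to picture the check behind this example is a toy control-flow graph in which the analyzer asks whether every path from the function entry to the `emit` call passes through the predicate node. The sketch below is a simplification and not Manimal's analyzer: the `TinyCfg` class and the edges used in `main` are assumptions, and a real analysis would also need data-flow facts (for example, that the predicate depends only on the input record).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TinyCfg {
    private final Map<String, List<String>> successors = new HashMap<>();

    void edge(String from, String to) {
        successors.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // True if 'target' is reachable from 'start' without visiting 'blocked'.
    // Pass null for 'blocked' to ask about plain reachability.
    boolean reachableAvoiding(String start, String target, String blocked) {
        Deque<String> work = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        work.push(start);
        while (!work.isEmpty()) {
            String node = work.pop();
            if (node.equals(blocked) || !seen.add(node)) {
                continue;
            }
            if (node.equals(target)) {
                return true;
            }
            List<String> next = successors.getOrDefault(node, Collections.<String>emptyList());
            for (String s : next) {
                work.push(s);
            }
        }
        return false;
    }

    // The guard dominates emit when emit is reachable from the entry,
    // but becomes unreachable once the guard node is removed.
    boolean guards(String entry, String guard, String emit) {
        return reachableAvoiding(entry, emit, null)
                && !reachableAvoiding(entry, emit, guard);
    }

    public static void main(String[] args) {
        TinyCfg cfg = new TinyCfg();
        // Assumed edges for the map() function on the slide.
        cfg.edge("Fn Entry", "w.rank > 10");
        cfg.edge("w.rank > 10", "emit(url,rank)");
        cfg.edge("w.rank > 10", "Fn Exit");
        cfg.edge("emit(url,rank)", "Fn Exit");
        System.out.println(cfg.guards("Fn Entry", "w.rank > 10", "emit(url,rank)"));
    }
}
```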

Current Optimizations
B+-Tree for selections
Projected views
Delta compression on numerics
Direct operation on compressed data
Hadoop compression is not semantics-aware
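As an illustration of delta compression on numerics, the sketch below stores each value as its difference from the previous one, which keeps the stored numbers small when neighboring values are close. The `DeltaCodec` class is hypothetical, and the slides do not describe Manimal's actual encoding. A real codec would follow this step with a variable-length integer encoding to save space.

```java
import java.util.Arrays;

class DeltaCodec {
    // Replace each value with its difference from the previous value.
    // The first value is stored as a difference from zero.
    static int[] encode(int[] values) {
        int[] deltas = new int[values.length];
        int previous = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - previous;
            previous = values[i];
        }
        return deltas;
    }

    // Rebuild the original values with a running sum of the differences.
    static int[] decode(int[] deltas) {
        int[] values = new int[deltas.length];
        int running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            values[i] = running;
        }
        return values;
    }

    public static void main(String[] args) {
        int[] ranks = {3, 5, 5, 9, 12};
        int[] deltas = encode(ranks);
        System.out.println(Arrays.toString(deltas));
        System.out.println(Arrays.toString(decode(deltas)));
    }
}
```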

Experiments: Analyzer
Test MapReduce programs from Pavlo et al., SIGMOD '09
Detected 5 out of 8 opportunities:
o Two misses due to custom serialization class
o Another miss requires knowledge of java.util.Hashtable semantics

Experiments: Performance
Optimize four Web page handling tasks:
o Selection (filtering)
o Projection (aggregation on subfield of page)
o Join (pages to user visits)
o User Defined Functions (aggregation)
5 cluster nodes, 123 GB of data

Experiments: Performance

Description    Hadoop
Selection       430 s
Projection     5496 s
Join           6078 s

Experiments: Performance

Description    Hadoop    Manimal    Speedup
Selection       430 s       38 s       11.2
Projection     5496 s     1856 s       2.96
Join           6078 s      904 s       6.73

Experiments: Performance
Up to 11x speedup over original Hadoop
Performance comparable to DBMS-X from Pavlo
UDF not detected: running time identical

Description    Hadoop    Manimal    Speedup    Space Overhead
Selection       430 s       38 s       11.2                 %
Projection     5496 s     1856 s       2.96                 %
Join           6078 s      904 s       6.73                 %

Related Work
Lots of recent MapReduce activity
o Quincy: Task scheduling (Isard et al., SOSP 2009)
o HadoopDB (Abouzeid et al., PVLDB 2009)
o Hadoop++ (Dittrich et al., PVLDB 2010)
o HaLoop (Bu et al., PVLDB 2010)
o Twister (Ekanayake et al., HPDC 2010)
o Starfish (Herodotou et al., CIDR 2011)
Manimal does not introduce new optimizations. It detects and applies existing optimizations to code.

Lessons Learned
The Good: We can recognize data processing idioms in real code. Relational operations still exist even in the NoSQL world.
The Ugly: When we started this project in 2009, we underestimated interest in writing in higher-level languages (e.g., Pig Latin).

Conclusion
Manimal provides a framework for applying well-known optimization techniques to MapReduce
o Automatic optimization of user code
o Up to 11x speed increase
o Provides a framework for more optimizations