Data-Intensive Computing with MapReduce/Pig Pramod Bhatotia MPI-SWS Distributed Systems – Winter Semester 2014.

Presentation transcript:

Data-Intensive Computing with MapReduce/Pig. Pramod Bhatotia, MPI-SWS. Distributed Systems, Winter Semester 2014. *Parts of the presentation content are taken from the MapReduce/Pig papers and the Web.

Reliance on online services: Collect data (raw data), e.g. web crawls, click-streams, social network graphs → Process data (system), e.g. PageRank, clustering, machine learning algorithms (today's class!) → Provide services (information), e.g. search, recommendations, spam detection.

How much data? >10 PB data, 75B DB calls per day (6/2012); processes 20 PB a day (2008); crawls 20B web pages a day (2012); >100 PB of user data, TB/day (8/2012); S3: 449B objects, peak 290k requests/second (7/2011), 1T objects (6/2012). Distributed Systems!

Data-center: cluster of 100s of thousands of machines.

In today's class: How to easily write parallel applications on distributed computing systems? 1. MapReduce. 2. Pig: a high-level language built on top of MapReduce.

Design challenges: How to parallelize application logic? How to communicate? How to synchronize? How to perform load balancing? How to handle faults? How to schedule jobs? Design, implement, debug, optimize, maintain: for each and every application!

The power of abstraction: the application sits on top of a library (MapReduce) that handles parallelization, communication, fault-tolerance, load balancing, synchronization, and scheduling.

Programming model: The programmer writes two methods: Map and Reduce. The run-time library takes care of everything else!

MapReduce programming model: Inspired by functional programming. Data-parallel application logic. Programmer's interface: Map(key, value) → (key, value); Reduce(key, <list of values>) → (key, value).

MapReduce run-time system: Input (InK1, InV1) (InK2, InV2) (InK3, InV3) (InK4, InV4) → Map tasks M1 to M4 → Map outputs (K1, V1) (K1, V3) (K2, V2) (K2, V4) → Shuffle and sort → (K1, <V1, V3>) (K2, <V2, V4>) → Reduce tasks R1, R2 → Output (OK1, OV1) (OK2, OV2).
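The data flow on this slide (map tasks, shuffle and sort, reduce tasks) can be sketched as a single-process Python function. This is only an illustration under stated assumptions, not how a real MapReduce runtime is implemented: `run_mapreduce`, `partition_for`, `parity_map`, and `sum_reduce` are hypothetical names, everything runs in memory, and a CRC32-based partitioner stands in for whatever partitioning function a real system uses.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Pick a reduce task for a key (assumed: deterministic hash partitioning)."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers


def run_mapreduce(inputs, map_fn, reduce_fn, num_reducers=2):
    """Run map, shuffle-and-sort, and reduce sequentially in one process."""
    # One key -> list-of-values table per reduce task.
    partitions = [defaultdict(list) for _ in range(num_reducers)]

    # Map phase: every input record is processed independently.
    for in_key, in_value in inputs:
        for key, value in map_fn(in_key, in_value):
            # Shuffle: route each intermediate pair to its reduce task.
            partitions[partition_for(key, num_reducers)][key].append(value)

    # Reduce phase: each task sees its keys in sorted order,
    # together with all values emitted for that key.
    output = []
    for partition in partitions:
        for key in sorted(partition):
            output.extend(reduce_fn(key, partition[key]))
    return output


def parity_map(in_key, in_value):
    """Hypothetical map function: group integers by parity."""
    yield ("even" if in_value % 2 == 0 else "odd"), in_value


def sum_reduce(key, values):
    """Hypothetical reduce function: sum all values of one key."""
    yield key, sum(values)


if __name__ == "__main__":
    records = [("InK1", 1), ("InK2", 2), ("InK3", 3), ("InK4", 4)]
    print(sorted(run_mapreduce(records, parity_map, sum_reduce)))
```

The order of the output depends on which reduce task a key lands in, so callers that need a stable order should sort the result, as the usage line does.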

An example: word-count. Input: a corpus of documents, such as Wikipedia. Output: the frequency of each distinct word.

MapReduce for word-count:
map(string key, string value)
    // key: document name
    // value: document contents
    for each word w in value
        EmitIntermediate(w, "1");
reduce(string key, iterator values)
    // key: word
    // values: list of counts
    int result = 0;
    for each v in values
        result += ParseInt(v);
    Emit(key, AsString(result));

Word-count example: Input (Doc1, “the”) (Doc2, “for”) (Doc3, “the”) (Doc4, “for”) → Map tasks M1 to M4 → Map outputs (“the”, “1”) (“for”, “1”) → Shuffle and sort → (“the”, <“1”, “1”>) (“for”, <“1”, “1”>) → Reduce tasks R1, R2 → Output (“the”, “2”) (“for”, “2”).
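The word-count example can be tried out with a small Python sketch that uses the four one-word documents from the slide. The `map_fn` and `reduce_fn` bodies follow the pseudocode's logic; the driver `word_count` is a hypothetical in-memory stand-in for the run-time library, and splitting on whitespace is an assumed tokenization.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """key: document name, value: document contents."""
    for word in contents.split():
        yield word, "1"


def reduce_fn(word, counts):
    """key: word, values: list of counts (as strings)."""
    result = 0
    for count in counts:
        result += int(count)
    yield word, str(result)


def word_count(documents):
    """Hypothetical driver: group map outputs by word, then reduce."""
    grouped = defaultdict(list)
    for doc_name, contents in documents:
        for word, count in map_fn(doc_name, contents):
            grouped[word].append(count)  # shuffle: collect values per key

    output = []
    for word in sorted(grouped):  # sort: reducers see keys in order
        output.extend(reduce_fn(word, grouped[word]))
    return output


if __name__ == "__main__":
    docs = [("Doc1", "the"), ("Doc2", "for"), ("Doc3", "the"), ("Doc4", "for")]
    print(word_count(docs))
```

Counts travel as strings only to mirror the pseudocode's `ParseInt` and `AsString`; an implementation that controls both ends would more naturally emit integers.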

MapReduce software stack: MapReduce library; distributed file system (GFS/HDFS); master; distributed software.

MapReduce software stack: The master runs the Job tracker (MapReduce) and the Namenode (HDFS). Each worker machine runs a Task tracker and a Data node.

Runtime execution

Design challenges revisited: parallelization, communication, synchronization, load balancing, faults & semantics, scheduling.

References: MapReduce [OSDI’04] and YARN [SoCC’13]: original M/R, and the next generation of M/R. Dryad [EuroSys’07]: generalized framework for data-parallel computations. Spark [NSDI’12]: in-memory distributed data-parallel computing.

Limitations of MapReduce: Graph algorithms: Pregel [SIGMOD’10], GraphX [OSDI’14]. Iterative algorithms: HaLoop [VLDB’10], CIEL [NSDI’11]. Stream processing (low latency): D-Streams [SOSP’13], Naiad [SOSP’13], Storm, S4. Low-level abstraction for common data analysis tasks: Pig [SIGMOD’08], Shark [SIGMOD’13], DryadLINQ [OSDI’08].

Motivation for Pig: Programmers are lazy! (They don’t even wish to write Map and Reduce.)

Data analysis tasks: Common operations: filter, join, group-by, sort, etc. MapReduce offers only a low-level primitive, so these operators have to be re-implemented again and again. The power of abstraction: design once and reuse!
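To see why re-implementing these operators by hand gets tedious, here is a rough Python sketch of an equi-join written directly as map and reduce functions. The relations (`users`, `clicks`), the tagging scheme, and all function names are made up for illustration, and the driver is a toy in-memory substitute for the run-time library.

```python
from collections import defaultdict


def map_users(user_id, name):
    """Tag each record with its source relation so the reducer can tell them apart."""
    yield user_id, ("user", name)


def map_clicks(user_id, url):
    yield user_id, ("click", url)


def reduce_join(user_id, tagged_values):
    """Pair every user record with every click record sharing the join key."""
    names = [value for tag, value in tagged_values if tag == "user"]
    urls = [value for tag, value in tagged_values if tag == "click"]
    for name in names:
        for url in urls:
            yield user_id, (name, url)


def join(users, clicks):
    """Hypothetical driver: map both inputs, group by join key, reduce."""
    grouped = defaultdict(list)
    for relation, map_fn in ((users, map_users), (clicks, map_clicks)):
        for in_key, in_value in relation:
            for key, tagged in map_fn(in_key, in_value):
                grouped[key].append(tagged)

    output = []
    for key in sorted(grouped):
        output.extend(reduce_join(key, grouped[key]))
    return output


if __name__ == "__main__":
    users = [(1, "ann"), (2, "bob")]
    clicks = [(1, "a.example"), (1, "b.example"), (3, "c.example")]
    print(join(users, clicks))
```

Every new join needs the same tagging and pairing boilerplate, which is the kind of repetition a higher-level language can hide.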

Pig Latin: distributed dataflow queries. Pig Latin = SQL-like queries + distributed execution.

Pig architecture: Pig Latin script → Pig compiler → MapReduce runtime, which executes a pipeline of MapReduce jobs (first MapReduce job → second MapReduce job).

Overview of the compilation process: Pig compiler: logical plan → physical plan → MapReduce plan.
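One way to picture the last compilation stage is a toy Python function that cuts a linear operator list into MapReduce jobs at the operators that need a shuffle. This is a heavily simplified sketch, not the Pig compiler: the operator names, the `SHUFFLE_OPS` set, and the job representation are assumptions chosen for illustration, and real plans are graphs with many more details.

```python
# Assumed for illustration: operators that force a shuffle (a map/reduce boundary).
SHUFFLE_OPS = {"GROUP", "COGROUP", "JOIN"}


def new_job():
    return {"map": [], "shuffle": None, "reduce": []}


def to_mapreduce_plan(operators):
    """Split a linear list of operator names into a pipeline of MapReduce jobs."""
    jobs = []
    current = new_job()
    for op in operators:
        if op in SHUFFLE_OPS:
            if current["shuffle"] is not None:
                # A job has only one shuffle, so a second one starts a new job.
                jobs.append(current)
                current = new_job()
            current["shuffle"] = op
        elif current["shuffle"] is None:
            current["map"].append(op)      # before the shuffle: map side
        else:
            current["reduce"].append(op)   # after the shuffle: reduce side
    jobs.append(current)
    return jobs


if __name__ == "__main__":
    plan = ["LOAD", "FILTER", "GROUP", "FOREACH", "GROUP", "FOREACH", "STORE"]
    for number, job in enumerate(to_mapreduce_plan(plan), start=1):
        print(number, job)
```

With two grouping operators the sketch yields two jobs, the second consuming the output of the first, which matches the shape of a pipelined first and second MapReduce job.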

An example

Example: contd.

Example: contd.

Advantages of staged compilation: SQL query optimizations; MapReduce-specific optimizations. Refer to the Pig papers for details [SIGMOD’08, VLDB’09].

Related systems: Apache Hive: built on top of MapReduce. DryadLINQ [OSDI’08] or SCOPE [VLDB’08]: built on top of Dryad. Shark [SIGMOD’13]: built on top of Spark.

Summary: Data-intensive computing with MapReduce: data-parallel programming model; runtime library to handle all low-level details. Pig: high-level abstraction for common tasks. Resources: Hadoop, Spark, Dryad.

Thanks!