Google’s MapReduce Connor Poske Florida State University.

Outline

Part I:
– History
– MapReduce architecture and features
– How it works

Part II:
– MapReduce programming model and example

Initial History

There is a demand for large-scale data processing. The folks at Google noticed common themes across their programs that process very large inputs:
– Multiple machines are needed
– There are usually 2 basic operations on the input data: 1) Map 2) Reduce

Map

Similar to the Lisp primitive: apply a single function to multiple inputs.

In the MapReduce model, the map function applies an operation to a list of pairs of the form (input_key, input_value), and produces a set of INTERMEDIATE key/value tuples:

Map(input_key, input_value) -> (output_key, intermediate_value) list

Reduce

Accepts the set of intermediate key/value tuples as input, and applies a reduce operation to all values that share the same key:

Reduce(output_key, intermediate_value list) -> output list

Quick example

Pseudo-code that counts the number of occurrences of each word in a large collection of documents:

Map(String fileName, String fileContents)
  // fileName is the input key, fileContents is the input value
  for each word w in fileContents
    EmitIntermediate(w, "1")

Reduce(String word, Iterator values)
  // word: input key, values: a list of counts
  int count = 0
  for each v in values
    count += 1
  Emit(AsString(count))
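To make the pseudo-code above concrete, here is a minimal single-process sketch of the same word count. The names (Map, Reduce, WordCount) and the in-memory "shuffle" are illustrative only, not part of Google's library:

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Map emits one (word, 1) pair per word, mirroring EmitIntermediate(w, "1").
std::vector<std::pair<std::string, int>> Map(const std::string& contents) {
    std::vector<std::pair<std::string, int>> out;
    std::istringstream in(contents);
    std::string word;
    while (in >> word) out.emplace_back(word, 1);
    return out;
}

// Reduce sums all the counts emitted for a single word.
int Reduce(const std::vector<int>& values) {
    int count = 0;
    for (int v : values) count += v;
    return count;
}

std::map<std::string, int> WordCount(const std::vector<std::string>& docs) {
    // "Shuffle" phase: group intermediate values by key.
    std::map<std::string, std::vector<int>> groups;
    for (const auto& doc : docs)
        for (const auto& kv : Map(doc)) groups[kv.first].push_back(kv.second);
    // Reduce phase: one Reduce call per distinct key.
    std::map<std::string, int> result;
    for (const auto& g : groups) result[g.first] = Reduce(g.second);
    return result;
}
```

The real system runs the Map and Reduce calls on different machines; only the grouping-by-key contract is the same.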

The idea sounds good, but…

We can’t forget about the problems arising from large-scale, multiple-machine data processing:
– How do we parallelize everything?
– How do we balance the input load?
– How do we handle failures?

Enter the MapReduce model…

MapReduce

The MapReduce implementation is an abstraction that hides these complexities from the programmer. The user defines the Map and Reduce functions; the MapReduce implementation automatically distributes the data, then applies the user-defined functions to it. The actual code is slightly more complex than the previous example.

MapReduce Architecture

– User program with Map and Reduce functions
– Cluster of average PCs
– Upon execution, the cluster is divided into:
  – Master worker
  – Map workers
  – Reduce workers

Execution Overview

1) Split up the input data; start up the program on all machines.
2) The master machine assigns M Map and R Reduce tasks to idle worker machines.
3) The Map function is executed and its results are buffered locally.
4) Periodically, data in local memory is written to disk; the on-disk locations of the data are forwarded to the master.
-- Map phase complete --
5) A Reduce worker uses RPCs to read intermediate data from the Map machines. The data is sorted by key.
6) The Reduce worker iterates over the data and passes each unique key, along with its associated values, to the Reduce function.
7) The master wakes up the user program, and the MapReduce call returns.
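A detail hidden inside step 4: each intermediate key must be routed to exactly one of the R Reduce tasks, so every Reduce worker sees a disjoint slice of the key space. A hash-of-the-key-mod-R partitioner (the paper's default) can be sketched as:

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Assigns an intermediate key to one of R reduce tasks. Deterministic:
// every occurrence of a key, from any Map worker, lands on the same
// reduce task, which is what lets Reduce see all values for that key.
std::size_t PartitionFor(const std::string& key, std::size_t R) {
    return std::hash<std::string>{}(key) % R;
}
```

Note that std::hash is a stand-in here; the actual hash function used inside Google's library is not specified on these slides.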

Execution Overview

Master worker

– Stores state information about Map and Reduce workers: idle, in-progress, or completed
– Stores the locations and sizes on disk of the intermediate file regions on the Map machines
  – Pushes this information incrementally to workers with in-progress Reduce tasks
– Displays the status of the entire operation via HTTP
  – Runs an internal HTTP server
  – Shows progress: e.g., bytes of intermediate data, bytes of output, processing rates, etc.

Parallelization

– Map() runs in parallel, creating different intermediate output from different input keys and values
– Reduce() runs in parallel, each worker handling a different key
– All data is processed independently by different worker machines
– The Reduce phase cannot begin until the Map phase is completely finished!
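The barrier between the two phases can be illustrated with threads standing in for worker machines: each thread processes one input split independently, and merging (the stand-in for the Reduce phase) happens only after every map thread has joined. This is a hypothetical single-machine sketch, not how the distributed system is implemented:

```cpp
#include <map>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// One thread per input "split" counts words independently (the map
// phase); partial results are merged only after all threads join,
// mirroring the rule that Reduce cannot start until Map is finished.
std::map<std::string, int> ParallelWordCount(
        const std::vector<std::string>& splits) {
    std::vector<std::map<std::string, int>> partials(splits.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < splits.size(); ++i) {
        workers.emplace_back([&partials, &splits, i] {
            std::istringstream in(splits[i]);
            std::string w;
            while (in >> w) ++partials[i][w];  // each thread owns partials[i]
        });
    }
    for (auto& t : workers) t.join();  // barrier: map phase must complete

    std::map<std::string, int> total;  // merge step stands in for Reduce
    for (const auto& p : partials)
        for (const auto& kv : p) total[kv.first] += kv.second;
    return total;
}
```

There is no locking because each thread writes only to its own slot of `partials`; all sharing happens after the join, which is exactly the independence property the slide describes.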

Load Balancing

The user defines a MapReduce “spec” object:

MapReduceSpecification spec;
spec.set_machines(2000);
spec.set_map_megabytes(100);
spec.set_reduce_megabytes(100);

That’s it! The library automatically takes care of the rest.

Fault Tolerance

The master pings workers periodically:

switch (ping response)
  case (idle):        assign a task if possible
  case (in-progress): do nothing
  case (completed):   reset to idle
  case (no response): reassign the task

Fault Tolerance

– What if a Map task completes, but the machine fails before the intermediate data is retrieved via RPC?
  – Re-execute the Map task on an idle machine
– What if the intermediate data is partially read, but the machine fails before all Reduce operations can complete?
  – Reset the interrupted Reduce task to idle and reschedule it on another machine
– What if the master fails…? PWNED: the whole computation is aborted, and the client must retry the MapReduce call

Fault Tolerance

Skipping bad records:
– An optional parameter changes the mode of execution
– When enabled, the MapReduce library detects records that cause crashes and skips them

Bottom line: MapReduce is very robust in its ability to recover from failure and handle errors.

Part II: Programming Model

The MapReduce library is extremely easy to use. Using it involves setting only a few parameters and defining the map() and reduce() functions:
– Define map() and reduce()
– Define and set parameters for the MapReduceInput object
– Define and set parameters for the MapReduceOutput object
– Write the main program

Map()

class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    // Parse each word and, for each word,
    // Emit(word, "1")
  }
};

REGISTER_MAPPER(WordCounter);

Reduce()

class Adder : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the same key
    // and add the values
  }
};

REGISTER_REDUCER(Adder);

Main()

int main(int argc, char** argv) {
  MapReduceSpecification spec;
  MapReduceInput* input;

  // Store the list of input files into "spec"
  for (int i = 1; i < argc; ++i) {
    input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }

Main()

  // Specify the output files:
  //   /gfs/test/freq-00000-of-00100
  //   /gfs/test/freq-00001-of-00100
  //   ...
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Adder");

Main()

  // Tuning parameters and the actual MapReduce call
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);

  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();

  return 0;
}  // end main

Other possible uses

– Distributed grep
  – Map emits a line if it matches a supplied pattern
  – Reduce simply copies the intermediate data to the output
– Count URL access frequency
  – Map processes logs of web page requests and emits (URL, 1)
  – Reduce adds all values for each URL and emits (URL, count)
– Inverted index
  – Map parses each document and emits a sequence of (word, document ID) pairs
  – Reduce accepts all pairs for a given word, sorts the list by document ID, and emits (word, list(document ID))
– Many more…
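The distributed-grep map step from the list above fits in a few lines. This is a sketch of the Map side only (function name is illustrative); the Reduce side would be the identity function, copying matches straight to the output:

```cpp
#include <string>
#include <vector>

// Map for distributed grep: emit a line only if it contains the pattern.
// In the real system each Map worker would run this over one input split.
std::vector<std::string> GrepMap(const std::vector<std::string>& lines,
                                 const std::string& pattern) {
    std::vector<std::string> matches;
    for (const auto& line : lines)
        if (line.find(pattern) != std::string::npos)
            matches.push_back(line);  // EmitIntermediate(line, "")
    return matches;
}
```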

Conclusion

– MapReduce provides an easy-to-use, clean abstraction for large-scale data processing
– It is very robust in fault tolerance and error handling
– It can be used in many scenarios
– Restricting the programming model to the Map and Reduce paradigms makes it easy to parallelize computations and make them fault-tolerant