MapReduce: Simplified Data Processing on Large Clusters

Süleyman Fatih GİRİŞ

CONTENT
1. Introduction
2. Programming Model
 2.1 Example
 2.2 More Examples
3. Implementation
 3.1 Execution Overview
 3.2 Master Data Structures
 3.3 Fault Tolerance
 3.4 Backup Tasks

CONTENT
4. Refinements
 4.1 Partitioning Function
 4.2 Combiner Function
 4.3 Input and Output Types
 4.4 Side-effects
 4.5 Skipping Bad Records
5. Experience
6. Conclusions

1. Introduction
MapReduce is a programming model and an associated implementation for processing and generating large data sets. It allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

1. Introduction
Inspired by the map and reduce primitives present in Lisp and many other functional languages:
 Map() applies a given function to each element of a list
 Reduce() recombines the results through use of a given combining operation
This enables automatic parallelization and distribution of large-scale computations, with high performance on large clusters.
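As a minimal illustration of these two primitives (shown here with Python's built-ins rather than Lisp; the input list is just an example):

    # map() applies a function to each element of a list;
    # reduce() recombines the results with a combining operation.
    from functools import reduce

    squares = list(map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]
    total = reduce(lambda a, b: a + b, squares)         # 1 + 4 + 9 + 16 = 30
    print(squares, total)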

2. Programming Model
Takes a set of input key/value pairs and produces a set of output key/value pairs. The computation is expressed with two functions: Map and Reduce.
 Map takes an input pair and produces a set of intermediate key/value pairs
 Reduce merges together the intermediate values sent from Map, to form a possibly smaller set of values
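As a hedged sketch of what these two user-defined functions look like for word counting (a Python stand-in for the paper's pseudocode; the yield-based emit style is an assumption of this sketch):

    def map_fn(key, value):
        # key: document name; value: document contents
        for word in value.split():
            yield (word, 1)        # emit an intermediate key/value pair

    def reduce_fn(key, values):
        # key: a word; values: all counts emitted for that word
        yield sum(values)          # emit the merged (smaller) result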

2. Programming Model

2.1 Example
A list of fruits is given, together with the number sold of each:
 Apple 9
 Banana 10
 Apple 2
 Strawberry 5
 Strawberry 9
 Apple 4
 Strawberry 3
 Banana 8

2.1 Example
We split the input into two pieces:
 Split 1: Apple 9, Banana 10, Apple 2, Strawberry 5
 Split 2: Strawberry 9, Apple 4, Strawberry 3, Banana 8

2.1 Example
Each split is processed independently, merging the counts for repeated keys; each entry is a key/value pair (fruit, count):
 Split 1: Apple 11, Banana 10, Strawberry 5
 Split 2: Strawberry 12, Apple 4, Banana 8

2.1 Example
As a result, after the Reduce step merges the two splits, we have:
 Apple 15
 Banana 18
 Strawberry 17
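The whole example can be reproduced in a few lines of Python; this is a single-machine sketch, so the two splits merely stand in for two map tasks:

    from collections import defaultdict

    splits = [
        [("Apple", 9), ("Banana", 10), ("Apple", 2), ("Strawberry", 5)],
        [("Strawberry", 9), ("Apple", 4), ("Strawberry", 3), ("Banana", 8)],
    ]

    # Map phase: each split is processed independently.
    intermediate = defaultdict(list)
    for split in splits:
        for fruit, count in split:
            intermediate[fruit].append(count)

    # Reduce phase: merge all values collected for each key.
    result = {fruit: sum(counts) for fruit, counts in intermediate.items()}
    print(result)  # {'Apple': 15, 'Banana': 18, 'Strawberry': 17}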

2.2 More Examples
 Word Count
 Distributed Grep
 Count of URL Access Frequency
 Reverse Web-Link Graph
 Term-Vector per Host
 Inverted Index
 Distributed Sort

3. Implementation
Many different implementations of the MapReduce interface are possible; the right choice depends on the environment. We focus on the implementation targeted at the computing environment in wide use at Google.

3. Implementation
(1) Machines are typically dual-processor x86 machines running Linux, with 2-4 GB of memory per machine.
(2) Commodity networking hardware is used – typically either 100 megabits/second or 1 gigabit/second at the machine level.

3. Implementation
(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.
(4) Storage is provided by inexpensive IDE disks attached directly to individual machines.
(5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.

3.1 Execution Overview
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R).

3.1 Execution Overview
1. The MapReduce library in the user program first splits the input files into M pieces of typically 16 to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.

3.1 Execution Overview
2. One of the copies of the program is special – the master; the rest are workers. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

3.1 Execution Overview
3. A worker that is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

3.1 Execution Overview
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master.

3.1 Execution Overview
5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers.

3.1 Execution Overview
6. When a reduce worker has read all intermediate data, it sorts the data by the intermediate keys so that all occurrences of the same key are grouped together.

3.1 Execution Overview
7. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's Reduce function.

3.1 Execution Overview
8. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
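Steps 1-8 can be condensed into a hedged, single-process sketch: everything below runs in one Python process, so the "master" is just a loop, "workers" are function calls, and network transfers are list appends (illustrative only, not the distributed implementation):

    from collections import defaultdict

    def run_mapreduce(splits, map_fn, reduce_fn, R=2):
        # Steps 1-4: one map task per input split; intermediate pairs are
        # partitioned into R regions with hash(key) mod R.
        regions = [defaultdict(list) for _ in range(R)]
        for split in splits:
            for key, value in split:
                for ikey, ival in map_fn(key, value):
                    regions[hash(ikey) % R][ikey].append(ival)
        # Steps 5-7: each reduce task groups its region by key (sorted, as a
        # real reduce worker would) and calls the user's Reduce function.
        output = {}
        for region in regions:
            for ikey in sorted(region):
                output[ikey] = list(reduce_fn(ikey, region[ikey]))
        return output  # step 8: the MapReduce call returns to user code

    docs = [[("doc1", "apple banana apple")], [("doc2", "banana apple")]]
    wc_map = lambda k, v: [(w, 1) for w in v.split()]
    wc_red = lambda k, vs: [sum(vs)]
    print(run_mapreduce(docs, wc_map, wc_red))  # apple: [3], banana: [2]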

3.2 Master Data Structures
For each map task and reduce task, the master stores:
 State (idle, in-progress, or completed)
 Identity of the worker machine (for non-idle tasks)
For each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed.
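A hedged sketch of this bookkeeping (the class and field names are illustrative, not taken from Google's code):

    from dataclasses import dataclass, field
    from enum import Enum

    class State(Enum):
        IDLE = "idle"
        IN_PROGRESS = "in-progress"
        COMPLETED = "completed"

    @dataclass
    class TaskInfo:
        state: State = State.IDLE
        worker: str = ""                 # worker machine id, for non-idle tasks
        # For a completed map task: (location, size) of each of the R
        # intermediate file regions it produced.
        regions: list = field(default_factory=list)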

3.3 Fault Tolerance
Worker failure:
If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state. Completed map tasks are re-executed on a failure because their output is stored on the local disks of the failed machine; completed reduce tasks do not need to be re-executed, since their output is stored in a global file system.

3.3 Fault Tolerance
Master failure:
The master writes periodic checkpoints of its data structures. If the master task dies, a new copy can be started from the last checkpointed state. Given that there is only a single master, its failure is unlikely; the current implementation therefore aborts the MapReduce computation if the master fails.
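A hedged sketch of the checkpointing side (pickle and the file name are assumptions of this sketch; the paper only says periodic checkpoints are written):

    import os, pickle, tempfile

    def checkpoint(master_state, path="master.ckpt"):
        # Write to a temp file, then atomically swap it in, so a crash
        # mid-write never leaves a truncated checkpoint behind.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(master_state, f)
        os.replace(tmp, path)

    def recover(path="master.ckpt"):
        # A new master copy starts from the last checkpointed state.
        with open(path, "rb") as f:
            return pickle.load(f)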

3.4 Backup Tasks
One common cause that lengthens the total time taken for a MapReduce operation is a straggler: a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation.
Solution: when a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. The task is marked as completed whenever either the primary or the backup execution completes.
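A hedged, threads-instead-of-machines sketch of that idea (concurrent.futures stands in for the cluster scheduler; the task must be idempotent for duplicate execution to be safe):

    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    def run_with_backup(task, *args):
        # Schedule the primary and a speculative backup of the same task;
        # the task counts as completed when either copy finishes.
        pool = ThreadPoolExecutor(max_workers=2)
        futures = [pool.submit(task, *args), pool.submit(task, *args)]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        pool.shutdown(wait=False)   # do not wait for the straggler
        return done.pop().result()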

4. Refinements
Although the basic functionality provided by simply writing Map and Reduce functions is sufficient for most needs, a few extensions have proved useful.

4.1 Partitioning Function
Users specify the number of reduce tasks/output files that they desire (R). Data gets partitioned across these tasks using a partitioning function on the intermediate key. A default partitioning function is provided that uses hashing (e.g., "hash(key) mod R").
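In Python terms, the default and a user-supplied partitioner might look like this (a hedged sketch; the by-host case mirrors the hash(Hostname(urlkey)) mod R example discussed in the paper):

    from urllib.parse import urlparse

    def default_partition(key, R):
        return hash(key) % R          # default: spread keys evenly over R tasks

    def partition_by_host(url, R):
        # Send all URLs from the same host to the same reduce task /
        # output file (assumes full URLs like "http://host/path").
        return hash(urlparse(url).hostname) % R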

4.2 Combiner Function
For word count, for example, each map task produces many pairs of the form (word, 1); all of these counts would be sent over the network to a single reduce task and then added together by the Reduce function to produce one number. The library therefore allows the user to specify an optional Combiner function that does partial merging of this data before it is sent over the network. The Combiner function is executed on each machine that performs a map task.
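A hedged word-count combiner sketch: partial sums are taken on the map machine, so the network carries one pair per distinct word instead of one pair per occurrence:

    from collections import Counter

    def combine(intermediate_pairs):
        partial = Counter()
        for word, count in intermediate_pairs:
            partial[word] += count    # same merging the Reduce function does
        return list(partial.items())

    # Raw map output for "the quick the lazy the" shrinks from 5 pairs to 3:
    print(combine([("the", 1), ("quick", 1), ("the", 1), ("lazy", 1), ("the", 1)]))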

4.3 Input and Output Types
The MapReduce library provides support for reading input data in several different formats. For example, "text" mode input treats each line as a key/value pair: the key is the offset in the file and the value is the contents of the line. Another common supported format stores a sequence of key/value pairs sorted by key.
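A hedged sketch of a "text" mode reader (the generator interface is an assumption of this sketch):

    def text_input(path):
        offset = 0
        with open(path, "rb") as f:
            for line in f:
                # key: byte offset of the line; value: the line's contents
                yield offset, line.rstrip(b"\r\n").decode()
                offset += len(line)   # next key: offset of the next line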

4.4 Side-effects
In some cases, users of MapReduce have found it convenient to produce auxiliary files as additional outputs from their map and/or reduce operators. Typically the application writes to a temporary file and atomically renames this file once it has been fully generated.
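The temp-file-then-rename idiom in Python (the ".tmp" naming scheme is illustrative): the output file either appears complete or not at all, even if the task dies mid-write:

    import os

    def write_atomically(path, data):
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())      # ensure the bytes are on disk first
        os.replace(tmp, path)         # atomic replace on POSIX filesystems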

4.5 Skipping Bad Records
There are sometimes bugs in user code that cause the Map or Reduce functions to crash deterministically on certain records. Sometimes it is acceptable to ignore a few records (e.g., when doing statistical analysis on a large data set). The MapReduce library therefore provides an optional mode that detects which records cause deterministic crashes and skips these records in order to make forward progress.
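A hedged single-process sketch of the mechanism (in the real system a signal handler reports the crashing record's sequence number to the master; here a module-level set stands in for that global view):

    bad_records = set()   # sequence numbers reported as deterministic crashers

    def process_split(records, map_fn):
        results = []
        for seq, record in enumerate(records):
            if seq in bad_records:
                continue              # skip records that crashed before
            try:
                results.extend(map_fn(record))
            except Exception:
                bad_records.add(seq)  # report, then the task gets re-executed
                raise
        return results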

5. Experience
MapReduce has been used across a wide range of domains within Google, including:
 Large-scale machine learning problems
 Clustering problems for the Google News and Froogle products
 Extraction of data used to produce reports of popular queries (e.g., Google Zeitgeist)
 Extraction of properties of web pages for new experiments and products (e.g., extraction of geographical locations from a large corpus of web pages for localized search)
 Large-scale graph computations

5. Experience

6. Conclusions
The model is easy to use, even for programmers without experience with parallel and distributed systems. A large variety of problems are easily expressible as MapReduce computations. The implementation scales to large clusters comprising thousands of machines.

6. Conclusions
Restricting the programming model makes it easy to parallelize and distribute computations and to make such computations fault-tolerant. Network bandwidth is a limited resource; a number of optimizations in the system therefore reduce the amount of data sent across the network. Redundant execution can be used to reduce the impact of slow machines, and to handle machine failures and data loss.

THANKS FOR LISTENING