Distributed and Parallel Processing Technology Chapter 2. MapReduce


Distributed and Parallel Processing Technology Chapter 2. MapReduce
Sun Jo

Introduction
MapReduce is a programming model for data processing. Hadoop can run MapReduce programs written in various languages. We shall look at the same program expressed in Java, Ruby, Python, and C++.

A Weather Dataset
A program that mines weather data.
- Weather sensors collect data every hour at many locations across the globe, gathering a large volume of log data, which is a good candidate for analysis with MapReduce.
Data Format
- Data from the National Climatic Data Center (NCDC)
- Stored in a line-oriented ASCII format, in which each line is a record

A Weather Dataset
Data Format
- Data files are organized by date and weather station. There is a directory for each year from 1901 to 2001, each containing a gzipped file for each weather station with its readings for that year.
- The whole dataset is made up of a large number of relatively small files, since there are tens of thousands of weather stations.
- The data was preprocessed so that each year's readings were concatenated into a single file.

Analyzing the Data with Unix Tools
What's the highest recorded global temperature for each year in the dataset?
- A Unix shell script with awk, the classic tool for processing line-oriented data
- Beginning of a run: the complete run for the century took 42 minutes in one run on a single EC2 High-CPU Extra Large Instance.
- The script loops through the compressed year files, printing the year and then processing each file using awk. awk extracts the air temperature and the quality code from the data; a temperature value of 9999 signifies a missing value in the NCDC dataset.
- The maximum temperature for 1901 is 31.7℃.
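The slide's baseline is a shell script driving awk, which is not reproduced in this transcript. Purely as an illustration of the same serial scan in the chapter's main language, a rough Java equivalent might look like the sketch below (hypothetical, not from the book; it assumes one gzipped file per year in a local directory and the NCDC fixed-width field offsets used later in the chapter):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

public class SerialMaxTemperature {
  public static void main(String[] args) throws Exception {
    // args[0] is assumed to be a directory holding one gzipped file per year
    File[] yearFiles = new File(args[0]).listFiles();
    if (yearFiles == null) {
      System.err.println("Usage: SerialMaxTemperature <directory of gzipped year files>");
      System.exit(-1);
    }
    for (File yearFile : yearFiles) {
      int max = Integer.MIN_VALUE;
      try (BufferedReader reader = new BufferedReader(new InputStreamReader(
          new GZIPInputStream(new FileInputStream(yearFile))))) {
        String line;
        while ((line = reader.readLine()) != null) {
          // Air temperature is a signed fixed-width field (tenths of a degree Celsius)
          int airTemperature = Integer.parseInt(
              line.charAt(87) == '+' ? line.substring(88, 92) : line.substring(87, 92));
          String quality = line.substring(92, 93);
          if (airTemperature != 9999 && quality.matches("[01459]")) {  // 9999 marks a missing reading
            max = Math.max(max, airTemperature);
          }
        }
      }
      System.out.println(yearFile.getName() + "\t" + max);  // per-year maximum, in tenths of a degree
    }
  }
}
```

Because each year file is processed one after another on a single machine, this is exactly the serial bottleneck that the following slides set out to remove.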

Analyzing the Data with Unix Tools
To speed up the processing, we can run parts of the program in parallel.
Problems with parallel processing:
- Dividing the work into equal-size pieces isn't always easy or obvious: the file size varies from year to year, so the whole run is dominated by the longest file. A better approach is to split the input into fixed-size chunks and assign each chunk to a process.
- Combining the results from independent processes may need further processing.
- We are still limited by the processing capacity of a single machine, and we must handle coordination and reliability ourselves once multiple machines are involved.
It's feasible to parallelize the processing, but in practice it's messy.

Analyzing the Data with Hadoop – Map and Reduce
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase.
- Both phases have key-value pairs as input and output.
- The programmer specifies two functions: the map function and the reduce function.
- The input to the map phase is the raw NCDC data. Here, the key is the offset of the beginning of the line and the value is the line itself.
- The map function pulls out the year and the air temperature from each input value.
- The reduce function takes <year, temperature> pairs as input and produces the maximum temperature for each year as the result.

Analyzing the Data with Hadoop – Map and Reduce
[Figure: the stages of the data]
- Original NCDC format
- Input file for the map function, stored in HDFS
- Output of the map function, running in parallel for each block
- Input to, and output of, the reduce function

Analyzing the Data with Hadoop – Map and Reduce
The whole data flow: input file → map() → shuffling (grouping the values by key) → reduce()
[Figure: example <year, temperature> pairs emitted by map(), grouped by shuffling into <year, [temperature, ...]> lists, and reduced to a single maximum temperature per year]

Analyzing the Data with Hadoop – Java MapReduce
Having run through how the MapReduce program works, the next step is to express it in code. We need three things: a map function, a reduce function, and some code to run the job.
Map function
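The slide's listing is an image; a minimal sketch of the map function, modeled on the book's MaxTemperatureMapper and the old org.apache.hadoop.mapred API (the NCDC field offsets and quality-code check are taken from the book's example), might look like this:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;  // NCDC sentinel for a missing temperature

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    String year = line.substring(15, 19);                 // year field of the NCDC record
    int airTemperature;
    if (line.charAt(87) == '+') {                         // parseInt doesn't accept a leading plus sign
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));  // emit <year, temperature>
    }
  }
}
```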

Analyzing the Data with Hadoop – Java MapReduce
Reduce function
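Correspondingly, a sketch of the reduce function in the style of the book's MaxTemperatureReducer (old API): it simply keeps the running maximum over all temperatures for a year.

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int maxValue = Integer.MIN_VALUE;
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());  // keep the running maximum
    }
    output.collect(key, new IntWritable(maxValue));         // emit <year, max temperature>
  }
}
```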

Analyzing the Data with Hadoop – Java MapReduce
Main function for running the MapReduce job
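A sketch of the driver in the style of the book's MaxTemperature class: it builds a JobConf, wires up the input/output paths and the mapper and reducer classes, and submits the job with JobClient.runJob().

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperature {
  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");

    FileInputFormat.addInputPath(conf, new Path(args[0]));   // input: NCDC files in HDFS
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); // output directory (must not exist yet)

    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);   // submit the job and wait for it to finish
  }
}
```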

Analyzing the Data with Hadoop – Java MapReduce
A test run: the output is written to the output directory, which contains one output file per reducer.

Analyzing the Data with Hadoop – Java MapReduce
The new Java MapReduce API
- The new API, sometimes referred to as "Context Objects", is type-incompatible with the old one, so applications need to be rewritten to take advantage of it.
Notable differences:
- The new API favors abstract classes over interfaces: Mapper and Reducer are now abstract classes.
- The new API lives in the org.apache.hadoop.mapreduce package and subpackages; the old API can still be found in org.apache.hadoop.mapred.
- The new API makes extensive use of context objects that allow user code to communicate with the MapReduce system (e.g., MapContext essentially unifies the roles of the JobConf, the OutputCollector, and the Reporter).
- The new API supports both a "push" and a "pull" style of iteration. Key-value record pairs are still pushed to the mapper, but in addition the new API allows a mapper to pull records from within the map() method; the same goes for the reducer.
- Configuration has been unified. The old API has a JobConf object for job configuration, an extension of Hadoop's vanilla Configuration object; in the new API, job configuration is done through a Configuration.
- Job control is performed through the Job class rather than JobClient.
- Output files are named slightly differently: part-m-nnnnn for map outputs and part-r-nnnnn for reduce outputs (nnnnn is an integer designating the part number, starting from 0).

Analyzing the Data with Hadoop – Java MapReduce
The new Java MapReduce API
Example 2-6 shows the MaxTemperature application rewritten to use the new API; a sketch in that style follows.
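This is not the book's Example 2-6 verbatim, only a minimal sketch in the same spirit: the mapper and reducer extend the new abstract Mapper/Reducer classes and emit output through a Context, and the driver uses Job instead of JobConf/JobClient.

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewMaxTemperature {

  public static class NewMaxTemperatureMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      String year = line.substring(15, 19);
      int airTemperature = Integer.parseInt(
          line.charAt(87) == '+' ? line.substring(88, 92) : line.substring(87, 92));
      String quality = line.substring(92, 93);
      if (airTemperature != MISSING && quality.matches("[01459]")) {
        context.write(new Text(year), new IntWritable(airTemperature));  // output goes through the Context
      }
    }
  }

  public static class NewMaxTemperatureReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) {        // values arrive as an Iterable in the new API
        maxValue = Math.max(maxValue, value.get());
      }
      context.write(key, new IntWritable(maxValue));
    }
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: NewMaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job();                         // configuration through Job, not JobConf
    job.setJarByClass(NewMaxTemperature.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(NewMaxTemperatureMapper.class);
    job.setReducerClass(NewMaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);  // Job controls the run instead of JobClient
  }
}
```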

Scaling Out
To scale out, we need to store the data in a distributed filesystem, HDFS. Hadoop then moves the MapReduce computation to each machine hosting a part of the data.
Data Flow
- A MapReduce job consists of the input data, the MapReduce program, and configuration information.
- Hadoop runs the job by dividing it into two types of tasks: map tasks and reduce tasks.
- There are two types of nodes: one jobtracker and several tasktrackers.
  - Jobtracker: coordinates and schedules tasks to run on tasktrackers.
  - Tasktrackers: run tasks and send progress reports to the jobtracker.
- Hadoop divides the input into fixed-size pieces, called input splits or just splits, and creates one map task for each split, which runs the user-defined map function for each record in the split.
- The quality of the load balancing increases as the splits become more fine-grained. The default split size is one HDFS block, 64 MB.
- Map tasks write their output to the local disk, not to HDFS. If the node running a map task fails, Hadoop will automatically rerun the map task on another node to re-create the map output.

Scaling Out
Data Flow – single reduce task
- Reduce tasks don't have the advantage of data locality: the input to a single reduce task is normally the output from all mappers.
- The map outputs are transferred across the network, merged, and passed to the user-defined reduce function.
- The output of the reduce is normally stored in HDFS.

Scaling Out
Data Flow – multiple reduce tasks
- The number of reduce tasks is specified independently; it is not governed by the size of the input.
- The map tasks partition their output by key, each creating one partition for each reduce task.
- There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. See the partitioner sketch below.
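By default the partitioning is done by a hash of the key (Hadoop's HashPartitioner), so nothing needs to be configured for this example; purely as an illustration, a partitioner mirroring that default logic, written against the old API and assuming Text keys and IntWritable values, might look like:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class YearPartitioner implements Partitioner<Text, IntWritable> {

  @Override
  public void configure(JobConf job) {
    // No configuration needed for hash-based partitioning.
  }

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Every record with the same key hashes to the same reduce partition,
    // so all temperatures for a given year end up at a single reducer.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
```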

Scaling Out
Data Flow – zero reduce tasks
- A job may have no reduce tasks at all when the processing can be carried out entirely in parallel and no shuffle is needed; in that case the map tasks write their output directly to HDFS.

Scaling Out
Combiner Functions
- Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks.
- Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function.
- The contract for the combiner function constrains the type of function that may be used.
[Figure: finding the maximum temperature for 1950 with and without a combiner]
- Without a combiner: the first map emits <1950, 0>, <1950, 20>, <1950, 10> and the second emits <1950, 25>, <1950, 15>; after shuffling, reduce() receives <1950, [0, 20, 10, 25, 15]>.
- With a combiner run on each map's output: the maps contribute <1950, 20> and <1950, 25>, so reduce() receives only <1950, [20, 25]>.

Scaling Out
Combiner Functions
- The function calls on the temperature values can be expressed as follows:
  max(0, 20, 10, 25, 15) = max( max(0, 20, 10), max(25, 15) ) = max(20, 25) = 25
- A job calculating mean temperatures, in contrast, couldn't use the mean as its combiner function:
  mean(0, 20, 10, 25, 15) = 14, but
  mean( mean(0, 20, 10), mean(25, 15) ) = mean(10, 20) = 15
- The combiner function doesn't replace the reduce function, but it can help cut down the amount of data shuffled between the maps and the reduces.

Scaling Out
Combiner Functions
Specifying a combiner function:
- The combiner function is defined using the Reducer interface; it is the same implementation as the reducer in MaxTemperatureReducer.
- The only change is to set the combiner class on the JobConf, as in the sketch below.
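A sketch of that driver change, modeled on the book's MaxTemperatureWithCombiner (old API); MaxTemperatureMapper and MaxTemperatureReducer are the classes sketched earlier.

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperatureWithCombiner {
  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureWithCombiner <input path> <output path>");
      System.exit(-1);
    }
    JobConf conf = new JobConf(MaxTemperatureWithCombiner.class);
    conf.setJobName("Max temperature with combiner");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setCombinerClass(MaxTemperatureReducer.class);  // reuse the reducer as the combiner: max is associative
    conf.setReducerClass(MaxTemperatureReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}
```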

Hadoop Streaming
Hadoop provides an API to MapReduce that lets you write the map and reduce functions in languages other than Java, so virtually any language can be used to write a MapReduce program.
Hadoop Streaming
- Map input data is passed over standard input to your map function, which processes the data line by line and writes lines to standard output.
- A map output key-value pair is written as a single tab-delimited line.
- The reduce function reads lines from standard input (sorted by key) and writes its results to standard output.

Hadoop Streaming – Ruby
- The map function expressed in Ruby
- Simulating the map function in Ruby with a Unix pipeline
- The reduce function for maximum temperature in Ruby

Hadoop Streaming – Ruby
- Simulating the whole MapReduce pipeline with a Unix pipeline
- The Hadoop command to run the whole MapReduce job
- A combiner may also be specified, and it can be coded in any Streaming language.

Hadoop Streaming – Python
- Streaming supports any programming language that can read from standard input and write to standard output.
- The map and reduce scripts in Python
- Test the programs and run the job in the same way as in Ruby.

Hadoop Pipes
- Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce.
- Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function.
- The source code for the map and reduce functions in C++

Hadoop Pipes
The source code for the map and reduce functions in C++ (continued)

Hadoop Pipes
Compiling and Running
- The Makefile for the C++ MapReduce program defines PLATFORM, which specifies the operating system, architecture, and data model (e.g., 32- or 64-bit).
- To run a Pipes job, we need to run the Hadoop daemons in pseudo-distributed mode.
- The next step is to copy the executable (the program) to HDFS.
- Next, the sample data is copied from the local filesystem to HDFS.

Hadoop Pipes
Compiling and Running
- Now we can run the job. For this, we use the hadoop pipes command, passing the URI of the executable in HDFS using the -program argument.