Distributed and Parallel Processing Technology, Chapter 7: MapReduce Types and Formats. NamSoo Kim


MapReduce Types
- The map and reduce functions in Hadoop MapReduce have the following general form:

  map: (K1, V1) -> list(K2, V2)
  reduce: (K2, list(V2)) -> list(K3, V3)

- The map input key and value types (K1 and V1) are different from the map output types (K2 and V2).
- The reduce input must have the same types as the map output, although the reduce output types may be different again (K3 and V3).
- If a combine function is used, it has the same form as the reduce function (and is an implementation of Reducer), except that its output types are the intermediate key and value types (K2 and V2), so that it can feed the reduce function:

  combine: (K2, list(V2)) -> list(K2, V2)

- The partition function operates on the intermediate key and value types (K2 and V2) and returns the partition index. In practice, the partition is determined solely by the key (the value is ignored):

  partition: (K2, V2) -> integer
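A minimal sketch (using the old org.apache.hadoop.mapred API that this chapter is based on) of how the type parameters line up in code. The class names and the tab-separated record layout are hypothetical, chosen only to illustrate the K1/V1 -> K2/V2 -> K3/V3 chain:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper<K1, V1, K2, V2>: consumes (LongWritable, Text), emits (Text, IntWritable).
class MaxValueMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Hypothetical record layout: "name<TAB>value"
    String[] fields = value.toString().split("\t");
    output.collect(new Text(fields[0]), new IntWritable(Integer.parseInt(fields[1])));
  }
}

// Reducer<K2, V2, K3, V3>: its input types must match the mapper's output types.
class MaxValueReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int max = Integer.MIN_VALUE;
    while (values.hasNext()) {
      max = Math.max(max, values.next().get());
    }
    output.collect(key, new IntWritable(max));
  }
}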

MapReduce Types
- Input types are set by the input format. For example, TextInputFormat generates keys of type LongWritable and values of type Text.
- The other types are set explicitly by calling methods on the JobConf. If K2 and K3 are the same, you don't need to call setMapOutputKeyClass(), since it falls back to the type set by calling setOutputKeyClass(). Similarly, if V2 and V3 are the same, you only need to use setOutputValueClass().
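A driver sketch showing this fallback rule in practice, wired to the hypothetical mapper and reducer above:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class MaxValueJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MaxValueJob.class);
    conf.setJobName("max value");

    conf.setInputFormat(TextInputFormat.class);   // sets K1 = LongWritable, V1 = Text
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(MaxValueMapper.class);
    conf.setReducerClass(MaxValueReducer.class);

    // Reduce output types (K3, V3). Because K2 == K3 and V2 == V3 in this job,
    // setMapOutputKeyClass()/setMapOutputValueClass() can be omitted: the
    // intermediate types fall back to the values set below.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}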


MapReduce Types
- The default Streaming job
In Streaming, the default job is similar, but not identical, to the Java equivalent. The minimal form is shown below. Notice that you have to supply a mapper; the Java default is not substituted for you.
The default Streaming job is actually very useful: since the key is just the line offset in the file and the value is the line, which is all most applications are interested in, the overall effect of this job is to perform a sort of the input.
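A sketch of that minimal form, in the style used by the book; the input path is illustrative, and the location of the streaming jar varies between Hadoop versions:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
    -input input/sample.txt \
    -output output \
    -mapper /bin/cat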

- Keys and values in Streaming
A Streaming application can control the separator that is used when a key-value pair is turned into a series of bytes and sent to the map or reduce process over standard input. The default is a tab character, but it is useful to be able to change it when the keys or values themselves contain tab characters.
A key can also be composed of more than the first field: the number of leading fields that make up the key is configurable.
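The relevant configuration properties, as listed in the book; they can be set with the generic -D option on the streaming command line (the annotations are summaries, not the official descriptions):

stream.map.input.field.separator      # separator for input to the map process (default: tab)
stream.map.output.field.separator     # separator for output from the map process (default: tab)
stream.reduce.input.field.separator   # separator for input to the reduce process
stream.reduce.output.field.separator  # separator for output from the reduce process
stream.num.map.output.key.fields      # number of leading fields forming the map output key
stream.num.reduce.output.key.fields   # number of leading fields forming the reduce output key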

Input Formats
- Input Splits and Records
- An input split is a chunk of the input that is processed by a single map.
- Each map processes a single split.
- Each split is divided into records, and the map processes each record (a key-value pair) in turn. In a database context, for example, a split might correspond to a range of rows from a table, and a record to a row in that range.
- Input Splits
Input splits are represented by the Java interface InputSplit (which, like all of the classes mentioned in this section, is in the org.apache.hadoop.mapred package):
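The interface as it appears in the old API:

public interface InputSplit extends Writable {
  long getLength() throws IOException;
  String[] getLocations() throws IOException;
}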

Input Formats
- Input Splits (continued)
- An InputSplit has a length in bytes and a set of storage locations (hostname strings).
- The storage locations are used by the MapReduce system to place map tasks as close to the split's data as possible.
- The size is used to order the splits so that the largest get processed first, in an attempt to minimize the job runtime.
- As a MapReduce application writer, you don't need to deal with InputSplits directly, as they are created by an InputFormat.
- InputFormat
An InputFormat is responsible for creating the input splits and dividing them into records.
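The old-API interface, for reference:

public interface InputFormat<K, V> {
  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
  RecordReader<K, V> getRecordReader(InputSplit split,
                                     JobConf job,
                                     Reporter reporter) throws IOException;
}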

Input Formats
- InputFormat (continued)
- Having calculated the splits, the client sends them to the jobtracker, which uses their storage locations to schedule map tasks to process them on the tasktrackers.
- On a tasktracker, the map task passes the split to the getRecordReader() method on the InputFormat to obtain a RecordReader for that split. A RecordReader is little more than an iterator over records, and the map task uses one to generate record key-value pairs, which it passes to the map function.
- MapRunner
- MapRunner is only one way of running mappers. MultithreadedMapRunner is another implementation of the MapRunnable interface, which runs mappers concurrently in a configurable number of threads.
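A simplified sketch of the idea behind MapRunner: it drives the mapper by iterating over the RecordReader. This is a reconstruction in the spirit of the real class, not the verbatim Hadoop source:

import java.io.IOException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.ReflectionUtils;

class SimpleMapRunner<K1, V1, K2, V2> implements MapRunnable<K1, V1, K2, V2> {
  private Mapper<K1, V1, K2, V2> mapper;

  @SuppressWarnings("unchecked")
  public void configure(JobConf job) {
    mapper = (Mapper<K1, V1, K2, V2>)
        ReflectionUtils.newInstance(job.getMapperClass(), job);
  }

  public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
      Reporter reporter) throws IOException {
    K1 key = input.createKey();
    V1 value = input.createValue();
    try {
      // Pass each record produced by the RecordReader to the map function.
      while (input.next(key, value)) {
        mapper.map(key, value, output, reporter);
      }
    } finally {
      mapper.close();
    }
  }
}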

Input Formats
- FileInputFormat
- FileInputFormat is the base class for all implementations of InputFormat that use files as their data source (see Figure 7-2).
- It provides two things:
1. A place to define which files are included as the input to a job.
2. An implementation for generating splits for the input files.

Input Formats
- FileInputFormat input paths
FileInputFormat offers four static convenience methods for setting a JobConf's input paths:
1. The addInputPath() and addInputPaths() methods add a path or paths to the list of inputs.
2. The setInputPaths() methods set the entire list of paths in one go, replacing any paths set previously.
A path may represent a file, a directory, or, by using a glob, a collection of files and directories.
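Their signatures in the old API:

public static void addInputPath(JobConf conf, Path path)
public static void addInputPaths(JobConf conf, String commaSeparatedPaths)
public static void setInputPaths(JobConf conf, Path... inputPaths)
public static void setInputPaths(JobConf conf, String commaSeparatedPaths)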

Input Formats
- FileInputFormat input splits
Given a set of files, how does FileInputFormat turn them into splits?
1. FileInputFormat splits only large files, where "large" means larger than an HDFS block.
2. The split size is normally the size of an HDFS block, which is appropriate for most applications.

Input Formats
- FileInputFormat input splits (continued)
- The minimum split size is usually 1 byte, although some formats have a lower bound on the split size. Applications may also impose a minimum split size.
- The maximum split size defaults to the maximum value that can be represented by a Java long type. It has an effect only when it is less than the block size, forcing splits to be smaller than a block.
- The split size is calculated by the formula:

  max(minimumSize, min(maximumSize, blockSize))

and by default:

  minimumSize < blockSize < maximumSize

So the split size is blockSize.


Input Formats
- Small files and CombineFileInputFormat
- Hadoop works better with a small number of large files than a large number of small files. If the files are very small ("small" means significantly smaller than an HDFS block) and there are a lot of them, then each map task will process very little input, and there will be a lot of map tasks (one per file), each of which imposes extra bookkeeping overhead.
- The situation is alleviated somewhat by CombineFileInputFormat, which was designed to work well with small files:
1. Where FileInputFormat creates a split per file, CombineFileInputFormat packs many files into each split so that each mapper has more to process.
2. Crucially, CombineFileInputFormat takes node and rack locality into account when deciding which blocks to place in the same split.
3. As a result, CombineFileInputFormat does not compromise the speed at which it can process the input in a typical MapReduce job.

Input Formats
- Preventing splitting
Some applications don't want files to be split, so that a single mapper can process each input file in its entirety. For example, a simple way to check if all the records in a file are sorted is to go through the records in order, checking whether each record is not less than the preceding one; implemented as a map task, this algorithm works only if one map processes the whole file.
There are a couple of ways to ensure that an existing file is not split:
1. The first (quick and dirty) way is to increase the minimum split size to be larger than the largest file in your system.
2. The second is to subclass the concrete subclass of FileInputFormat that you want to use, and override the isSplitable() method to return false (see the sketch below).
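A minimal sketch of the second approach, using the old API's isSplitable(FileSystem, Path) hook; the class name is illustrative:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Identical to TextInputFormat except that files are never split, so each
// file is processed in its entirety by a single mapper.
public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}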

Input Formats
- Processing a whole file as a record
A related requirement that sometimes arises is for mappers to have access to the full contents of a file. The way to do this is to define an InputFormat whose isSplitable() returns false and whose RecordReader delivers the entire file contents as the value of a single record.
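A sketch along the lines of the book's WholeFileInputFormat example, reconstructed here against the old API; treat the details as illustrative rather than verbatim:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.*;

// Reads an entire file as a single record: key is null, value is the file's bytes.
public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false;  // one split (and hence one mapper) per file
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, job);
  }
}

class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
  private final FileSplit fileSplit;
  private final Configuration conf;
  private boolean processed = false;

  WholeFileRecordReader(FileSplit fileSplit, Configuration conf) {
    this.fileSplit = fileSplit;
    this.conf = conf;
  }

  public boolean next(NullWritable key, BytesWritable value) throws IOException {
    if (!processed) {
      byte[] contents = new byte[(int) fileSplit.getLength()];
      Path file = fileSplit.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }
    return false;  // only one record per file
  }

  public NullWritable createKey() { return NullWritable.get(); }
  public BytesWritable createValue() { return new BytesWritable(); }
  public long getPos() throws IOException { return processed ? fileSplit.getLength() : 0; }
  public float getProgress() throws IOException { return processed ? 1.0f : 0.0f; }
  public void close() throws IOException { /* nothing to close */ }
}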

Input Formats
- Text Input
- TextInputFormat
- TextInputFormat is the default InputFormat. Each record is a line of input.
- The key, a LongWritable, is the byte offset within the file of the beginning of the line.
- The value is the contents of the line, excluding any line terminators (newline, carriage return), and is packaged as a Text object.
- A small file is therefore divided into one split of a few records, interpreted as the key-value pairs shown below.
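The example from the book, reproduced here since the slide's figure did not survive. A file containing the following text:

On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.

is divided into one split of four records, interpreted as the following key-value pairs:

(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)

Clearly, the keys are not line numbers; they are the byte offsets of the start of each line.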

Input Formats
- Text Input
- KeyValueTextInputFormat
- It is common for each line in a file to be a key-value pair, separated by a delimiter such as a tab character. For example, this is the output produced by TextOutputFormat, Hadoop's default OutputFormat. To interpret such files correctly, KeyValueTextInputFormat is appropriate.
- You can specify the separator via the key.value.separator.in.input.line property. It is a tab character by default.
- Consider the input file shown below, where → represents a (horizontal) tab character. As in the TextInputFormat case, the input is in a single split comprising four records, although this time the keys are the Text sequences before the tab in each line.
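The book's example input and resulting records, reconstructed here:

line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.

The records are:

(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)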

Input Formats
- Text Input
- NLineInputFormat
- With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input; the number depends on the size of the split and the length of the lines.
- If you want your mappers to receive a fixed number of lines of input, then NLineInputFormat is the InputFormat to use. As with TextInputFormat, the keys are byte offsets within the file and the values are the lines; N is the number of lines of input that each mapper receives, controlled by the mapred.line.input.format.linespermap property.
- XML
- Large XML documents that are composed of a series of "records" (XML document fragments) can be broken into these records using simple string or regular-expression matching to find the start and end tags of records.
- With StreamXmlRecordReader (set via the stream.recordreader.class property), the page elements can be interpreted as records for processing by a mapper.

Input Formats
- Binary Input
- SequenceFileInputFormat
- Hadoop's sequence file format stores sequences of binary key-value pairs. To use data from sequence files as the input to MapReduce, you use SequenceFileInputFormat.
- SequenceFileAsTextInputFormat
- SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat that converts the sequence file's keys and values to Text objects.
- SequenceFileAsBinaryInputFormat
- SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat that retrieves the sequence file's keys and values as opaque binary objects.

Input Formats
- Multiple Inputs
- Although the input to a MapReduce job may consist of multiple input files, all of the input is interpreted by a single InputFormat and a single Mapper. Often, however, the data sources have different formats: one might be tab-separated plain text, the other a binary sequence file. Even if they are in the same format, they may have different representations, and therefore need to be parsed differently.
- These cases are handled elegantly by the MultipleInputs class, which allows you to specify the InputFormat and Mapper to use on a per-path basis.
- For example, if we had weather data from the U.K. Met Office that we wanted to combine with the NCDC data for our maximum temperature analysis, then we might set up the input as shown below.
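A driver-side sketch in the spirit of the book's example, using org.apache.hadoop.mapred.lib.MultipleInputs; the path variables and the Met Office mapper class are illustrative:

MultipleInputs.addInputPath(conf, ncdcInputPath,
    TextInputFormat.class, MaxTemperatureMapper.class);
MultipleInputs.addInputPath(conf, metOfficeInputPath,
    TextInputFormat.class, MetOfficeMaxTemperatureMapper.class);

Each path is tied to its own InputFormat and Mapper, so the two datasets can be parsed differently while feeding the same reduce phase.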

Input Formats
- Database Input (and Output)
- DBInputFormat is an input format for reading data from a relational database, using JDBC. It needs to be used with care, since running many mappers against the same database can overwhelm it, so it is best suited to loading relatively small datasets. The corresponding output format, DBOutputFormat, is useful for dumping job outputs of modest size into a database.

Output Formats
- Text Output
- The default output format, TextOutputFormat, writes records as lines of text. Its keys and values may be of any type, since TextOutputFormat turns them into strings by calling toString() on them.
- Each key-value pair is separated by a tab character, although that may be changed using the mapred.textoutputformat.separator property.
- You can suppress the key or the value (or both, making this output format equivalent to NullOutputFormat, which emits nothing) from the output using a NullWritable type.

Output Formats
- Binary Output
- SequenceFileOutputFormat
- As the name indicates, SequenceFileOutputFormat writes sequence files for its output. This is a good choice of output if it forms the input to a further MapReduce job, since it is compact and readily compressed.
- SequenceFileAsBinaryOutputFormat
- SequenceFileAsBinaryOutputFormat is the counterpart to SequenceFileAsBinaryInputFormat: it writes keys and values in raw binary format into a SequenceFile container.
- MapFileOutputFormat
- MapFileOutputFormat writes MapFiles as output. The keys in a MapFile must be added in order, so you need to ensure that your reducers emit keys in sorted order.

Output Formats
- Multiple Outputs
- FileOutputFormat and its subclasses generate a set of files in the output directory: there is one file per reducer, and files are named by the partition number: part-00000, part-00001, etc.
- There is sometimes a need to have more control over the naming of the files, or to produce multiple files per reducer.
- MapReduce comes with two libraries to help you do this: MultipleOutputFormat and MultipleOutputs.

Output Formats
- MultipleOutputFormat
- MultipleOutputFormat allows you to write data to multiple files whose names are derived from the output keys and values.
- MultipleOutputs
- There's a second library in Hadoop for generating multiple outputs, provided by the MultipleOutputs class. Unlike MultipleOutputFormat, MultipleOutputs can emit different types for each output. On the other hand, there is less control over the naming of outputs.
- What's the difference between MultipleOutputFormat and MultipleOutputs? In summary, MultipleOutputs is more fully featured, but MultipleOutputFormat has more control over the output directory structure and file naming.
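A hedged sketch of MultipleOutputs with the old API; the named output "summary" and the reducer logic are illustrative, and the exact MultipleOutputs methods should be checked against the javadoc for your Hadoop version:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

// Driver side (shown as a comment for context):
//   MultipleOutputs.addNamedOutput(conf, "summary",
//       TextOutputFormat.class, Text.class, IntWritable.class);

public class SummaryReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private MultipleOutputs multipleOutputs;

  @Override
  public void configure(JobConf conf) {
    multipleOutputs = new MultipleOutputs(conf);
  }

  @SuppressWarnings("unchecked")
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    // Emit to the named output "summary" rather than the job's main output.
    OutputCollector collector = multipleOutputs.getCollector("summary", reporter);
    collector.collect(key, new IntWritable(sum));
  }

  @Override
  public void close() throws IOException {
    multipleOutputs.close();  // must be called to flush the named outputs
  }
}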

Output Formats
- Lazy Output
- FileOutputFormat subclasses will create output (part-nnnnn) files, even if they are empty.
- Some applications prefer that empty files not be created, which is where LazyOutputFormat helps: it is a wrapper output format that ensures the output file is created only when the first record is emitted for a given partition.
- Streaming and Pipes support a -lazyOutput option to enable LazyOutputFormat.