MAPREDUCE TYPES, FORMATS AND FEATURES

MAPREDUCE GENERAL FORM The general form of the map and reduce functions in Hadoop MapReduce is:
map: (K1, V1) -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)
The map input key K1 and value V1 types may differ from the map output key K2 and value V2 types. The reduce input types must match the map output types, while the reduce output has its own key K3 and value V3 types. Data flow: records from the input splits pass through the map tasks, are shuffled and sorted by key, and are then handed to the reduce tasks, which write the final output.

BASIC HADOOP API (old org.apache.hadoop.mapred API)
Mapper:
void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
void configure(JobConf job)
void close() throws IOException
Reducer/Combiner:
void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
Partitioner:
int getPartition(K2 key, V2 value, int numPartitions)
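As an illustration, here is a minimal word-count style mapper and reducer written against the old org.apache.hadoop.mapred API shown above; the class names are only placeholders for this sketch.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper: (LongWritable offset, Text line) -> (Text word, IntWritable 1)
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, ONE);        // emit (word, 1) for every token
    }
  }
}

// Reducer/Combiner: (Text word, Iterator<IntWritable> counts) -> (Text word, IntWritable sum)
class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();       // add up the 1s emitted by the mapper
    }
    output.collect(key, new IntWritable(sum));
  }
}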

JAVA IMPLEMENTATION OF MAPREDUCE The Hadoop Java API uses generics, so mappers and reducers can be parameterized with different key and value types. When Hadoop receives a job, it decides whether the job is small enough to run in a single JVM (Java Virtual Machine); otherwise, JVM instances are launched on other nodes so the tasks can run in parallel. Through Hadoop Streaming, tasks can also exchange data with programs outside the Java environment via standard input and output streams. The size of a job is determined mainly by the number of mappers required to process the data; by default a job with fewer than about 10 map tasks is considered small, and this threshold is configurable.
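A minimal driver for the mapper and reducer sketched above might look like the following; it uses the old org.apache.hadoop.mapred API for consistency with those classes, and the input and output paths are placeholders taken from the command line.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("word count");

    // The generic type parameters of the mapper/reducer must agree
    // with the output key/value classes declared here.
    conf.setMapperClass(WordCountMapper.class);
    conf.setCombinerClass(WordCountReducer.class);
    conf.setReducerClass(WordCountReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(conf, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); // output directory (must not exist)

    JobClient.runJob(conf); // submits the job and waits for completion
  }
}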

MAPREDUCE INPUT FORMATS The abstract InputFormat class is responsible for creating input splits and dividing them into records. Application writers normally just choose (or implement) an InputFormat; they do not deal with splits directly. An input split is the chunk of input processed by a single map task, and each map task processes exactly one split. Each split is further divided into records, and the map function processes one record, in the form of a key-value pair, at a time. Splits are represented by the Java class InputSplit, which has two methods: getLength() returns the size of the split in bytes, and getLocations() returns the nodes on which the split's data is stored. The following slides cover file input formats, text input formats, and other (binary, multiple, and database) input formats. A sketch of the InputSplit contract follows.
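For reference, the new-API InputSplit class has roughly this shape (a paraphrased sketch, not the full source):

package org.apache.hadoop.mapreduce;

import java.io.IOException;

// Sketch of the InputSplit contract: a split knows its size (used to process
// the largest splits first) and the hosts holding its data (used by the
// scheduler to place map tasks close to the data).
public abstract class InputSplit {
  public abstract long getLength() throws IOException, InterruptedException;
  public abstract String[] getLocations() throws IOException, InterruptedException;
}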

FILE INPUT FORMATS
FileInputFormat: the base class for all implementations of InputFormat that use files as their data source. It provides a place to define the input paths for a job and an implementation for generating splits from those files.
CombineFileInputFormat: packs many files into each split so that each mapper has more to process; this matters because Hadoop works better with a small number of large files than with a large number of small files.
WholeFileInputFormat: a custom format (used as a worked example in Hadoop: The Definitive Guide) in which the keys are not used and each value is the entire contents of a file; it converts a FileSplit into a single record.
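As a sketch of the small-files case, CombineTextInputFormat (the text-based subclass of CombineFileInputFormat) can be configured in the driver roughly as follows; the 128 MB maximum split size is just an illustrative value.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFilesDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJobName("combine small files");
    // Pack many small files into splits of at most ~128 MB each,
    // so one mapper processes several files instead of a single tiny file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);
    // ... set mapper/reducer, input and output paths, then submit as usual.
  }
}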

TEXT INPUT FORMATS
TextInputFormat: the default InputFormat, in which each record is a line of input. The key is the byte offset of the line within the file and the value is the contents of the line, packaged as a Text object.
KeyValueTextInputFormat: the keys produced by TextInputFormat are not normally useful, as they are just offsets within a file. When each line of a file is itself a key-value pair separated by a delimiter such as a tab character, KeyValueTextInputFormat treats the text before the separator as the key and the rest of the line as the value.
NLineInputFormat: used when each mapper should receive a fixed number of lines of input.
XML: StreamXmlRecordReader (in the org.apache.hadoop.streaming.mapreduce package) can be used to break XML documents into records.
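A rough driver-side sketch of the two most common configuration knobs for these formats; the separator character and the number of lines per split are illustrative choices.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class TextFormatsDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Treat everything before the first ',' on a line as the key, the rest as the value.
    conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
    Job job = Job.getInstance(conf, "key-value input");
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    // Alternatively, give every mapper exactly 1000 input lines:
    // job.setInputFormatClass(NLineInputFormat.class);
    // NLineInputFormat.setNumLinesPerSplit(job, 1000);
  }
}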

OTHER INPUT FORMATS
Binary input:
SequenceFileInputFormat: handles Hadoop's sequence file format, which consists of binary key-value pairs.
SequenceFileAsTextInputFormat: converts the sequence file's keys and values to Text objects, making them suitable for Streaming.
SequenceFileAsBinaryInputFormat: retrieves the sequence file's keys and values as opaque binary objects.
FixedLengthInputFormat: reads fixed-width binary records from a file, where the records are not separated by delimiters.
Multiple inputs: the MultipleInputs class handles several inputs of different formats, optionally with a different mapper for each input.
Database input: DBInputFormat reads data from a relational database over JDBC.
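A sketch of MultipleInputs in the driver: each source gets its own InputFormat and its own mapper, and both mappers emit the same map output types. The driver class, the two nested mappers, and the field layout are illustrative assumptions.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultipleInputsDriver {

  // Hypothetical mapper for plain-text records: keys each line by its first field.
  public static class TextLogMapper extends Mapper<LongWritable, Text, Text, Text> {
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t", 2);
      ctx.write(new Text(fields[0]), value);
    }
  }

  // Hypothetical mapper for records that are already key-value text pairs.
  public static class KeyValueLogMapper extends Mapper<Text, Text, Text, Text> {
    protected void map(Text key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJobName("multiple inputs");
    // Each input path gets its own InputFormat and its own Mapper;
    // both mappers must produce the same map output key/value types.
    MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, TextLogMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
        KeyValueTextInputFormat.class, KeyValueLogMapper.class);
    // ... set the reducer, output types, and output path, then submit.
  }
}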

OUTPUT FORMATS
TextOutputFormat: the default output format; writes records as lines of text, with keys and values separated by a tab.
Binary output:
SequenceFileOutputFormat: writes sequence files as the output.
SequenceFileAsBinaryOutputFormat: writes keys and values in raw binary format into a sequence file container.
MapFileOutputFormat: writes map files as the output.
Multiple outputs: the MultipleOutputs class lets the programmer write to more than one output file, with file names derived from the output keys and values.
LazyOutputFormat: a wrapper output format which ensures that an output file is created only when the first record is emitted for a given partition.
Database output: output formats such as DBOutputFormat write to relational databases and HBase.
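A brief driver-side sketch of LazyOutputFormat wrapping the default text output, so that empty partitions do not leave behind empty part files; the job name and path are placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LazyOutputDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJobName("lazy output");
    // Wrap TextOutputFormat so a part file is only created
    // once the first record is written for that partition.
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    // ... set mapper/reducer and input as usual, then submit.
  }
}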

FEATURES Counters: counters are handy for diagnosing a problem, if there is one, and for gathering statistical information about a job.
Types of counters:
Built-in counters:
Task counters: gather information about each task; the results are aggregated over all the tasks in a job.
Job counters: maintained by the application master; they measure job-level statistics such as the total number of map and reduce tasks launched.
User-defined counters: the application writer can define counters with a Java enum, or create them dynamically by group and counter name through the counters interface. Counters can be retrieved through the Java API while the job is still running, instead of waiting for the job to finish. An example of an enum counter follows.
User-defined Streaming counters: a Streaming program can increment counters by sending a specially formatted line to the standard error stream. The line must have the following format: reporter:counter:group,counter,amount.
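A minimal sketch of an enum-based user-defined counter inside a mapper; the enum name, the field layout, and the notion of a "malformed" record are just illustrative.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordQualityMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  // Each enum constant becomes a counter in the "Quality" counter group.
  enum Quality { GOOD, MALFORMED }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().split("\t").length < 2) {
      // Count records that do not have the expected number of fields.
      context.getCounter(Quality.MALFORMED).increment(1);
      return;
    }
    context.getCounter(Quality.GOOD).increment(1);
    context.write(value, NullWritable.get());
  }
}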

SORTING The ability to sort data is one of the core capabilities of MapReduce. Data can be sorted in the following ways:
Partial sort: the default, in which MapReduce sorts the data by key within each reducer. Each output file is sorted, but the files are not combined into a single globally sorted file.
Total sort: produces globally sorted output files by using a partitioner that respects the total order of the keys; for good load balancing, the partition boundaries must be chosen so the partitions are fairly even in size.
Secondary sort: controls the order in which the values for each key appear at the reducer, typically by adding part of the value to a composite key and supplying a custom partitioner, sort comparator, and grouping comparator. A driver sketch follows.
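A hedged sketch of the driver settings for a secondary sort; the composite key, partitioner, and comparator classes named here are hypothetical helpers the application writer would supply.

import org.apache.hadoop.mapreduce.Job;

public class SecondarySortDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJobName("secondary sort");
    // Hypothetical helper classes supplied by the application writer:
    //   IntPair                 - composite key (natural key + secondary field)
    //   FirstPartitioner        - partitions on the natural (first) part of the key only
    //   FullKeyComparator       - sorts on the whole composite key
    //   FirstGroupingComparator - groups reducer input on the natural key only
    job.setMapOutputKeyClass(IntPair.class);
    job.setPartitionerClass(FirstPartitioner.class);
    job.setSortComparatorClass(FullKeyComparator.class);
    job.setGroupingComparatorClass(FirstGroupingComparator.class);
    // ... mapper, reducer, input and output paths as usual, then submit.
  }
}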

JOINS MapReduce can perform joins to combine large datasets. How a join should be implemented depends on how large the datasets are and how they are partitioned. When processing large datasets, joining records on a common key is often useful, if not essential, since it is by combining datasets that further insights can be gained.

JOINS Map-Side Join When the join is performed on the map side it is called a map-side join. A map-side join between large inputs works by performing the join before the data reaches the map function. For this to work, though, the inputs to each map must be partitioned and sorted in a particular way: each input dataset must be divided into the same number of partitions, each dataset must be sorted by the same key (the join key), and all the records for a particular key must reside in the same partition. To take advantage of a map-side join, the data must meet one of the following criteria: the datasets to be joined are already partitioned and sorted by the same key and have the same number of partitions, or one of the datasets is small enough to fit into memory (in which case it can be replicated to every mapper, as sketched below).
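The second case, often called a replicated or in-memory join, can be sketched roughly as follows: the small dataset is shipped to every task via the distributed cache, loaded into a hash map in setup(), and probed for every record of the large dataset. The class name, link name, and field layouts are illustrative assumptions.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReplicatedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

  // In-memory copy of the small dataset: join key -> record from the small side.
  private final Map<String, String> smallTable = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // Assumes the driver added the small file to the distributed cache with a
    // "#smallside" fragment, so it is localized under this link name.
    try (BufferedReader reader = new BufferedReader(new FileReader("smallside"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] fields = line.split("\t", 2);
        smallTable.put(fields[0], fields.length > 1 ? fields[1] : "");
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t", 2);
    String joined = smallTable.get(fields[0]);   // probe the in-memory table
    if (joined != null) {
      // Emit the join key with the matching fields from both sides.
      context.write(new Text(fields[0]),
          new Text((fields.length > 1 ? fields[1] : "") + "\t" + joined));
    }
  }
}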

JOINS Reduce-Side Join If the join is performed by the reducer it is called a reduce-side join. A reduce-side join is more general than a map-side join, as the input datasets don't have to be structured in any particular way, but it is less efficient because both datasets have to go through the MapReduce shuffle. The basic idea is that each mapper tags every record with its source and uses the join key as the map output key, so that records with the same key are brought together in the reducer. Two techniques make this work:
Multiple inputs: the input sources generally have different formats, so the MultipleInputs class is convenient for keeping the parsing and tagging logic for each source separate.
Secondary sort: the reducer sees the records from both sources that share a key, but they are not guaranteed to arrive in any particular order; a secondary sort can ensure, for example, that all records from one source are seen before those from the other. A compact sketch follows.
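A compact, hedged sketch of a reduce-side join in which records are tagged with a source prefix in the value (rather than using a composite key with a secondary sort), so the reducer simply buffers both sides for a key; the class name, tags, and field layout are illustrative.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Expects mapper output values tagged as "A<TAB>..." for one source and
// "B<TAB>..." for the other, keyed by the join key.
public class TaggedJoinReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> sideA = new ArrayList<>();
    List<String> sideB = new ArrayList<>();
    for (Text value : values) {
      String[] tagged = value.toString().split("\t", 2);
      String payload = tagged.length > 1 ? tagged[1] : "";
      if ("A".equals(tagged[0])) {
        sideA.add(payload);      // buffer records from source A
      } else {
        sideB.add(payload);      // buffer records from source B
      }
    }
    // Emit the cross product of matching records for this join key.
    for (String a : sideA) {
      for (String b : sideB) {
        context.write(key, new Text(a + "\t" + b));
      }
    }
  }
}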

SIDE DATA DISTRIBUTION Side data is extra read-only data needed by a job to process the main dataset. The challenge is making the side data available to all the map or reduce tasks in a convenient and efficient way. Ways to make side data available:
One way is to use the job configuration's setter methods to store small key-value pairs, which the tasks then read back from the configuration.
Another way is to use Hadoop's distributed cache, which provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run. Two types of objects can be placed in the cache: files and archives (which are unarchived on the task node).
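A brief sketch of both mechanisms; the property name, file path, and link name are illustrative placeholders.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SideDataDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // 1) Small side data: stash a key-value pair in the job configuration; a task
    //    can read it back with context.getConfiguration().get("my.job.threshold").
    conf.set("my.job.threshold", "42");

    Job job = Job.getInstance(conf, "side data");
    // 2) Larger side data: ship a read-only file through the distributed cache.
    //    The "#lookup" fragment gives the localized copy a friendly link name
    //    that tasks can open directly.
    job.addCacheFile(new URI("hdfs:///data/lookup-table.txt#lookup"));
  }
}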

MapReduce LIBRARY CLASSES Hadoop ships with library mappers and reducers for commonly used functions, so simple jobs need no custom map or reduce code. Examples include TokenCounterMapper, InverseMapper, RegexMapper, IntSumReducer, and LongSumReducer, as well as ChainMapper/ChainReducer for running several mappers within a single task. A usage sketch follows.
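For example, a word-count job can be assembled entirely from library classes; this is a sketch under the usual assumptions about command-line input and output paths.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class LibraryWordCount {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(LibraryWordCount.class);
    job.setJobName("library word count");

    // TokenCounterMapper emits (word, 1) for every token in the input;
    // IntSumReducer sums the counts and also serves as the combiner.
    job.setMapperClass(TokenCounterMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}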

CONCLUSION MapReduce provides a simple way to scale an application. Important uses include social networking, search engines, PageRank, statistics, and genomics. It scales out to more machines rather than scaling up, and can move from a single machine to thousands of machines with little extra effort. It is fault tolerant and delivers high throughput. If your use case fits its paradigm, scaling is handled by the framework.

THANK YOU