Ch 8 and Ch 9: MapReduce Types, Formats and Features


Ch 8 and Ch 9: MapReduce Types, Formats and Features. Based on Hadoop: The Definitive Guide, Ch 8. Pratik

MapReduce Form Review
General form of the map and reduce functions:
map: (K1, V1) -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)
General form with a combiner function:
combiner: (K2, list(V2)) -> list(K2, V2)
Partition function:
partition: (K2, V2) -> integer
Pratik
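To make the abstract forms concrete, here is a minimal word-count sketch showing how the type variables map onto Hadoop's Java generics: K1 = LongWritable (byte offset), V1 = Text (line), K2 = Text (word), V2 = IntWritable (count), and the reducer emits K3/V3 of the same types. The class names are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountTypes {
  // map: (K1, V1) -> list(K2, V2)
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);  // emit one (K2, V2) pair per token
        }
      }
    }
  }

  // reduce: (K2, list(V2)) -> list(K3, V3)
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));  // emit (K3, V3)
    }
  }
}
```

Note that a combiner with form (K2, list(V2)) -> list(K2, V2) is type-compatible with SumReducer here, which is why the same class can often serve as both reducer and combiner.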

Input Formats - Basics
Input split - a chunk of the input that is processed by a single map. Each map processes a single split, which is divided into records (key-value pairs) that the map processes in turn.
Represented by the Java class InputSplit: a length in bytes and a set of storage locations (hostname strings). It holds a reference to the data, not the actual data.
InputFormat - responsible for creating the input splits and dividing them into records, so you will not usually deal with the InputSplit class directly.
Controlling split size (see the driver sketch below):
Usually the size of an HDFS block
Minimum size: 1 byte; maximum size: the maximum value of a Java long
Split size formula: max(minimumSize, min(maximumSize, blockSize))
By default minimumSize < blockSize < maximumSize, so the split size is the block size
Candace Allison
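As a hedged sketch, the split size bounds can be set from the job driver; the two FileInputFormat helpers below set the standard mapreduce.input.fileinputformat.split.minsize and split.maxsize properties.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "split-size-demo");
    // Sets mapreduce.input.fileinputformat.split.minsize (default: 1 byte)
    FileInputFormat.setMinInputSplitSize(job, 1L);
    // Sets mapreduce.input.fileinputformat.split.maxsize (default: Long.MAX_VALUE)
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // 128 MB
    // The framework then computes:
    //   splitSize = max(minimumSize, min(maximumSize, blockSize))
    // so with min <= blockSize <= max, splits equal the HDFS block size.
  }
}
```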

Input Formats - Basics (continued)
Avoid small files - storing a large number of small files increases the number of seeks needed to run the job.
A sequence file can be used to merge small files into larger files.
Preventing splitting - you might want a single mapper to process each input file as a whole. Two ways:
1. Increase the minimum split size to be larger than the largest file in the system
2. Subclass the concrete subclass of FileInputFormat you want to use and override its isSplitable() method to return false
Reading an entire file as a record: implement a custom RecordReader that delivers the file contents as the value of the record, and override createRecordReader() in an input format such as WholeFileInputFormat to return it (see the sketch below).
Candace Allison
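The following sketch, closely modeled on the book's WholeFileInputFormat example, shows both techniques together: isSplitable() returns false so a file is never split, and a custom RecordReader delivers the entire file contents as a single record.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // never split: one mapper processes one whole file
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();  // framework calls initialize()
  }

  // Delivers the entire file as a single (NullWritable, BytesWritable) record.
  static class WholeFileRecordReader
      extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit fileSplit;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      this.fileSplit = (FileSplit) split;
      this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) return false;  // the one record was already delivered
      byte[] contents = new byte[(int) fileSplit.getLength()];
      Path file = fileSplit.getPath();
      FileSystem fs = file.getFileSystem(conf);
      try (FSDataInputStream in = fs.open(file)) {
        IOUtils.readFully(in, contents, 0, contents.length);
      }
      value.set(contents, 0, contents.length);
      processed = true;
      return true;
    }

    @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
    @Override public BytesWritable getCurrentValue() { return value; }
    @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
    @Override public void close() { }
  }
}
```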

Input Formats - File Input
FileInputFormat - the base class for all implementations of InputFormat that use files as the data source.
Provides a place to define which files are included as input to a job, and an implementation for generating splits for the input files.
Input is often specified as a collection of paths (see the sketch below).
Splits only large files (larger than an HDFS block).
CombineFileInputFormat - designed to work well with many small files: each split packs many small files so that each mapper has more to process. Takes node and rack locality into account when deciding which blocks to place in the same split.
WholeFileInputFormat - defines a format where the keys are not used and the values are the file contents; takes a FileSplit and converts it into a single record.
Abdulla Albuenain
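A brief sketch of specifying job input as a collection of paths; the paths here are hypothetical. A path can name a single file, a directory (all files inside are included), or a glob.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputPathConfig {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "input-paths-demo");
    // A single file:
    FileInputFormat.addInputPath(job, new Path("/data/2014/file.txt"));
    // A directory (every file it contains becomes input):
    FileInputFormat.addInputPath(job, new Path("/data/2015"));
    // A glob matching a collection of files:
    FileInputFormat.addInputPath(job, new Path("/data/2016/part-*"));
  }
}
```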

Input Formats - Text Input
TextInputFormat - the default InputFormat; each record is a line of input.
Key - the byte offset within the file of the beginning of the line. Value - the contents of the line, excluding any line terminators, packaged as a Text object.
mapreduce.input.linerecordreader.line.maxlength - sets a maximum expected line length; safeguards against corrupted files (which often manifest as one very long line).
KeyValueTextInputFormat - interprets each line as a key-value pair separated by a delimiter; useful for reading the output of TextOutputFormat (the default output format, whose keys and values are separated by a delimiter).
mapreduce.input.keyvaluelinerecordreader.key.value.separator - specifies the separator, a tab character by default.
NLineInputFormat - used when each mapper should receive a fixed number of lines of input.
mapreduce.input.lineinputformat.linespermap - controls the number of lines per split (N).
StreamXmlRecordReader - can be used to break XML documents into records.
A configuration sketch follows below.
Abdulla Albuenain
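A hedged sketch of configuring the text input formats above from a driver; the property names are the standard keys quoted on this slide, and the chosen values are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class TextInputConfig {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Guard against corrupt files that look like one enormous line:
    conf.setInt("mapreduce.input.linerecordreader.line.maxlength", 1024 * 1024);
    // Split each line into key and value at the first comma (tab is default):
    conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
    Job job = Job.getInstance(conf, "text-input-demo");
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    // Alternatively, give each mapper exactly 1000 input lines:
    // job.setInputFormatClass(NLineInputFormat.class);
    // NLineInputFormat.setNumLinesPerSplit(job, 1000);
  }
}
```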

Input Formats - Binary Input, Multiple Inputs, and Database I/O
SequenceFileInputFormat - reads sequence files, which store sequences of binary key-value pairs.
SequenceFileAsTextInputFormat - converts the sequence file's keys and values to Text objects.
SequenceFileAsBinaryInputFormat - retrieves the sequence file's keys and values as opaque binary objects.
FixedLengthInputFormat - reads fixed-width binary records from a file where the records are not separated by delimiters.
Multiple inputs: by default, all input is interpreted by a single InputFormat and a single Mapper. MultipleInputs lets the programmer specify which InputFormat and Mapper to use on a per-path basis (see the sketch below).
Database input/output: DBInputFormat - input format for reading data from a relational database. DBOutputFormat - output format for writing data to a relational database.
Ruchee
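A sketch of MultipleInputs: two datasets with different formats are parsed by per-path mappers that emit a common intermediate type. The paths and placeholder mappers are hypothetical.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultipleInputsDemo {
  // Placeholder mappers: each parses its own input schema but emits the
  // same intermediate (Text, Text) type so one reducer can handle both.
  static class TextSourceMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text("source-a"), value);
    }
  }

  static class SeqSourceMapper
      extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text("source-b"), value);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "multiple-inputs-demo");
    // Each path gets its own InputFormat and Mapper:
    MultipleInputs.addInputPath(job, new Path("/input/source-a"),
        TextInputFormat.class, TextSourceMapper.class);
    MultipleInputs.addInputPath(job, new Path("/input/source-b"),
        SequenceFileInputFormat.class, SeqSourceMapper.class);
    // A single reducer class then sees the merged intermediate stream.
  }
}
```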

Output Formats
Text output: TextOutputFormat - the default output format; writes records as lines of text (keys and values are turned into strings). Its counterpart for reading, KeyValueTextInputFormat, breaks lines back into key-value pairs based on a configurable separator.
Binary output: SequenceFileOutputFormat - writes sequence files as output. SequenceFileAsBinaryOutputFormat - writes keys and values in raw binary format into a sequence file container. MapFileOutputFormat - writes map files as output.
Multiple outputs: MultipleOutputs - lets the programmer write data to files whose names are derived from the output keys and values, creating more than one output file per task (see the sketch below).
Lazy output: LazyOutputFormat - a wrapper output format that ensures an output file is created only when the first record is emitted for a given partition.
Ruchee
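A hedged sketch combining MultipleOutputs and LazyOutputFormat: the reducer derives output file names from the record key, and lazy output avoids creating empty part files. The key scheme is illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PartitionByKeyReducer
    extends Reducer<Text, Text, NullWritable, Text> {
  private MultipleOutputs<NullWritable, Text> multipleOutputs;

  @Override
  protected void setup(Context context) {
    multipleOutputs = new MultipleOutputs<>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      // Base output path is derived from the key, e.g. "NY/part-r-00000"
      multipleOutputs.write(NullWritable.get(), value, key.toString());
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    multipleOutputs.close();  // required, or buffered output may be lost
  }

  // In the driver, wrap the real format so files appear only on first write:
  //   LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
}
```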

Counters
Useful for gathering statistics about a job, quality control, and problem diagnosis.
Built-in counter types:
Task counters - gather information about tasks as they execute; results are aggregated over all tasks in a job. Maintained by each task attempt and sent to the application master on a regular basis to be globally aggregated. Counts may go down if a task fails.
Job counters - measure job-level statistics; maintained by the application master, so they do not need to be sent across the network.
User-defined counters: the user can define a set of counters (as a Java enum) to be incremented in a mapper or reducer function. Dynamic counters (not defined by a Java enum) can also be created at runtime (see the sketch below).
Fahad Aldosari
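A minimal sketch of both user-defined counter styles in a mapper; the enum, counter names, and the "malformed record" condition are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  // Each enum constant becomes a counter in a group named after the enum.
  enum RecordQuality { WELL_FORMED, MALFORMED }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().split("\t").length < 2) {
      context.getCounter(RecordQuality.MALFORMED).increment(1);
      return;  // skip the bad record
    }
    context.getCounter(RecordQuality.WELL_FORMED).increment(1);
    // Dynamic counter: group and name given as strings at runtime.
    context.getCounter("RecordLength",
        value.getLength() > 100 ? "long" : "short").increment(1);
    // ... normal map logic would go here ...
  }
}
```

Counter values are visible in the job's web UI while it runs and in the final job statistics once it completes.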

Sorting
Partial sort - sorts records within each output file but does not produce a globally sorted output.
Total sort - produces a globally sorted output: a set of sorted files that can be concatenated to form one globally sorted file. To do this, use a partitioner that respects the total order of the output, with partition sizes that are fairly even (see the driver sketch below).
Secondary sort - sorts the values for each key; MapReduce does not normally guarantee any order among values.
Fahad Aldosari
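A hedged driver sketch of a total sort: InputSampler samples the input keys to write a partition file, and TotalOrderPartitioner uses that file so partition boundaries respect the global key order. The key type, paths, and sampling parameters are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSortDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "total-sort");
    job.setJarByClass(TotalSortDriver.class);
    // Identity map/reduce: sequence file keys pass straight through.
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path("/input/seq"));
    FileOutputFormat.setOutputPath(job, new Path("/output/sorted"));
    job.setNumReduceTasks(4);

    // Partition boundaries come from a sampled partition file:
    job.setPartitionerClass(TotalOrderPartitioner.class);
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
        new Path("/tmp/_partitions"));
    // Sample 10% of keys, up to 10000 samples from at most 10 splits,
    // to pick reasonably even partition boundaries:
    InputSampler.writePartitionFile(job,
        new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Concatenating the sorted part files in partition order (part-r-00000, part-r-00001, ...) then yields a single globally sorted result.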

Joins
MapReduce can perform joins between large datasets, combining records from two inputs on a shared key. [The slide's example figure is not reproduced in the transcript.]
Azzahra Alsaif

Joins - Map-Side vs Reduce-Side
Map-side join:
The inputs must be divided into the same number of partitions and sorted by the same key (the join key).
All records for a particular key must reside in the same partition.
CompositeInputFormat can be used to run a map-side join.
Reduce-side join:
The input datasets do not have to be structured in any particular way.
Records with the same key are brought together in the reducer function, where they are joined (see the sketch below).
Uses MultipleInputs and a secondary sort.
Azzahra Alsaif
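A hedged sketch of the reduce-side join pattern: both mappers emit the join key, values are tagged with their source (here "L\t" or "R\t" prefixes, an illustrative convention), and the reducer combines matching records. A production version would use a composite key and secondary sort so the smaller side arrives first and needs no buffering; that refinement is elided here.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text joinKey, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> left = new ArrayList<>();
    List<String> right = new ArrayList<>();
    // The mappers prefix each value with "L\t" or "R\t" to mark its source.
    for (Text value : values) {
      String v = value.toString();
      if (v.startsWith("L\t")) {
        left.add(v.substring(2));
      } else {
        right.add(v.substring(2));
      }
    }
    // Emit the cross product of matching records (an inner join).
    for (String l : left) {
      for (String r : right) {
        context.write(joinKey, new Text(l + "\t" + r));
      }
    }
  }
}
```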

Side Data Distribution
Side data - extra read-only data needed by a job to process the main dataset. The main challenge is making side data available to all the map or reduce tasks (which are spread across the cluster) in a way that is convenient and efficient.
Using the job configuration: the setter methods on Configuration can be used to put key-value pairs in the job configuration; useful for passing small pieces of metadata to tasks.
Distributed cache: instead of serializing side data in the job configuration, it is preferable to distribute datasets using Hadoop's distributed cache, which provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run. Two types of objects can be placed in the cache: files and archives (see the sketch below).
Kevin
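A brief sketch of the distributed cache pattern: the driver registers a small lookup file, and each task loads it in setup() before any records are processed. The file name, URI, and lookup semantics are illustrative; the "#name" fragment requests a symlink of that name in the task's working directory.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> lookup = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // The cached file is available locally in the task's working directory.
    try (BufferedReader reader =
             new BufferedReader(new FileReader("station-metadata.txt"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\t", 2);
        lookup.put(parts[0], parts[1]);  // e.g. station ID -> station name
      }
    }
  }

  // In the driver, register the file before job submission:
  //   job.addCacheFile(
  //       new URI("/metadata/station-metadata.txt#station-metadata.txt"));
}
```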

MapReduce Library Classes
Hadoop ships with a library of mappers and reducers for commonly used functions, such as ChainMapper/ChainReducer (running several mappers in sequence within one task), FieldSelectionMapper/FieldSelectionReducer (selecting fields from the input), IntSumReducer/LongSumReducer (summing integer values per key), InverseMapper (swapping keys and values), MultithreadedMapper (running a mapper concurrently in multiple threads), TokenCounterMapper (emitting each word in the input with a count of 1), and RegexMapper (finding matches of a regular expression). A word count built entirely from these classes is sketched below.
Kevin
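Using only library classes, word count needs no custom mapper or reducer at all; a driver sketch (paths come from the command line):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class LibraryWordCount {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "library-word-count");
    job.setJarByClass(LibraryWordCount.class);
    job.setMapperClass(TokenCounterMapper.class);   // emits (word, 1)
    job.setCombinerClass(IntSumReducer.class);      // local aggregation
    job.setReducerClass(IntSumReducer.class);       // sums counts per word
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```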

Video – Example MapReduce WordCount