IBM Research ® © 2007 IBM Corporation. INTRODUCTION TO HADOOP & MAP-REDUCE


IBM Research | India Research Lab

Outline
• Map-Reduce Features
  - Combiner / Partitioner / Counter
  - Passing Configuration Parameters
  - Distributed-Cache
• Hadoop I/O
  - Passing Custom Objects as Key-Values
• Input and Output Formats
  - Introduction
  - Input/Output Formats provided by Hadoop
  - Writing Custom Input/Output-Formats
• Miscellaneous
  - Chaining Map-Reduce Jobs
  - Compression
• Hadoop Tuning and Optimization

Combiner
• A local reduce
• Processes the output of each map function
• Same signature as a reduce
• Often reduces the number of intermediate key-value pairs

Word-Count (without a combiner)
• Input words: Hadoop Map Map Reduce Hadoop Map Map Reduce Map Hadoop Key Value
• Map output: (Hadoop, 1) (Map, 1) (Reduce, 1) (Hadoop, 1) (Map, 1) (Reduce, 1) (Map, 1) (Hadoop, 1) (Key, 1) (Value, 1)
• Sort/Shuffle: (Hadoop, [1,1,1]) (Map, [1,1,1,1,1,1,1]) (Key, [1,1]) (Reduce, [1,1]) (Value, [1,1])
• Reducer key ranges: A-I, J-Q, R-Z
• Reduce output: (Hadoop, 3) (Map, 7) (Key, 2) (Reduce, 2) (Value, 2)

Word-Count (with a COMBINER)
• Input words: Hadoop Map Map Reduce Hadoop Map Map Reduce Map Hadoop Key Value
• Map output: (Hadoop, 1) (Map, 1) (Reduce, 1) (Hadoop, 1) (Map, 1) (Reduce, 1) (Map, 1) (Hadoop, 1) (Key, 1) (Value, 1)
• Grouped locally at each map task:
  - (Hadoop, [1,1]) (Map, [1,1,1,1]) (Reduce, [1])
  - (Map, [1,1,1]) (Reduce, 1) (Hadoop, 1)
  - (Key, [1,1]) (Value, [1,1])
• COMBINER output at each map task:
  - (Hadoop, 2) (Map, 4) (Reduce, 1)
  - (Map, 3) (Reduce, 1) (Hadoop, 1)
  - (Key, 2) (Value, 2)
• Sort/Shuffle: (Hadoop, [2,1]) (Map, [4, 3]) (Key, [2]) (Reduce, [1,1]) (Value, [2])
• Reducer key ranges: A-I, J-Q, R-Z
• Reduce output: (Hadoop, 3) (Map, 7) (Key, 2) (Reduce, 2) (Value, 2)
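The effect of a combiner can be imitated in plain Java without a cluster. The sketch below is only an in-memory illustration, not Hadoop code: the class name CombinerSketch and the three split boundaries are assumptions, chosen so that the per-map counts match the combiner figures on the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {
    // Local reduce: sums the (word, 1) pairs emitted by ONE map task.
    static Map<String, Integer> combine(String split) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : split.split(" ")) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    // Global reduce: sums the partial counts arriving from all map tasks.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    // Runs combine on each split, then reduce on the combined outputs.
    static Map<String, Integer> wordCount(String... splits) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String split : splits) {
            partials.add(combine(split));
        }
        return reduce(partials);
    }

    public static void main(String[] args) {
        // Split boundaries are assumed; they reproduce the per-map counts on the slide.
        System.out.println(wordCount(
                "Hadoop Map Map Reduce Hadoop Map Map",
                "Reduce Map Hadoop Map Map",
                "Key Value Key Value"));
        // prints {Hadoop=3, Key=2, Map=7, Reduce=2, Value=2}
    }
}
```

With these splits the three map tasks hand 8 combined pairs to the shuffle instead of 16 raw (word, 1) pairs, while the final counts stay the same.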

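COMBINER

    // Generic parameters, in order: type of input key, type of input value,
    // type of output key, type of output value
    public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // count(values) is a helper, not shown on the slide, that sums the values
            context.write(key, new IntWritable(count(values)));
        }
    }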

Word-Count Runner Class

    public class WordCountRunner {
        public static void main(String[] args) throws Exception {
            Job job = new Job();
            job.setMapperClass(WordCountMap.class);
            job.setCombinerClass(WordCountCombiner.class);
            job.setReducerClass(WordCountReduce.class);
            job.setJarByClass(WordCountRunner.class);
            // inputFilesPath and outputPath are Path objects defined elsewhere
            FileInputFormat.addInputPath(job, inputFilesPath);
            FileOutputFormat.setOutputPath(job, outputPath);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setNumReduceTasks(1);
            job.waitForCompletion(true);
        }
    }

Counters

Built-in Counters
• Report metrics for various aspects of a job
• Task Counters
  - Gather information about tasks over the course of a job
  - Results are aggregated across all tasks
  - MAP_INPUT_RECORDS, REDUCE_INPUT_GROUPS
• FileSystem Counters
  - BYTES_READ, BYTES_WRITTEN
  - Bytes read/written by each file system (HDFS, KFS, Local, S3, etc.)
• FileInputFormat Counters
  - BYTES_READ (bytes read through FileInputFormat)
• FileOutputFormat Counters
  - BYTES_WRITTEN (bytes written through FileOutputFormat)
• Job Counters
  - Maintained by the Job-Tracker
  - TOTAL_LAUNCHED_MAPS, TOTAL_LAUNCHED_REDUCES

User-Defined Counters

    public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
        enum WCCounters { NOUNS, PRONOUNS, ADJECTIVES }

        @Override
        public void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            // Tokenize, isNoun, isProNoun and isAdjective are helpers not shown on the slide
            String[] tokens = Tokenize(line);
            for (int i = 0; i < tokens.length; i++) {
                if (isNoun(tokens[i]))
                    context.getCounter(WCCounters.NOUNS).increment(1);
                else if (isProNoun(tokens[i]))
                    context.getCounter(WCCounters.PRONOUNS).increment(1);
                else if (isAdjective(tokens[i]))
                    context.getCounter(WCCounters.ADJECTIVES).increment(1);
                context.write(new Text(tokens[i]), new IntWritable(1));
            }
        }
    }

Retrieving the values of a Counter

    Counters counters = job.getCounters();
    Counter counter = counters.findCounter(WCCounters.NOUNS);
    long value = counter.getValue();   // getValue() returns a long
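How per-task counter values turn into one job-level value can be shown with a small stand-alone sketch. This is not the Hadoop Counter API: CounterSketch, countTask and aggregate are hypothetical names, and the only idea taken from the slides is that each task increments its own counters and the results are summed across tasks.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class CounterSketch {
    enum WCCounters { NOUNS, PRONOUNS, ADJECTIVES }

    // Counts the increments made by one task; each task keeps its own map.
    static Map<WCCounters, Long> countTask(WCCounters... events) {
        Map<WCCounters, Long> counters = new EnumMap<>(WCCounters.class);
        for (WCCounters event : events) {
            counters.merge(event, 1L, Long::sum);
        }
        return counters;
    }

    // Job-level view: the per-task values are summed counter by counter.
    static Map<WCCounters, Long> aggregate(List<Map<WCCounters, Long>> perTask) {
        Map<WCCounters, Long> totals = new EnumMap<>(WCCounters.class);
        for (Map<WCCounters, Long> task : perTask) {
            task.forEach((counter, value) -> totals.merge(counter, value, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<WCCounters, Long> task1 =
                countTask(WCCounters.NOUNS, WCCounters.NOUNS, WCCounters.ADJECTIVES);
        Map<WCCounters, Long> task2 = countTask(WCCounters.NOUNS, WCCounters.PRONOUNS);
        System.out.println(aggregate(Arrays.asList(task1, task2)));
        // prints {NOUNS=3, PRONOUNS=1, ADJECTIVES=1}
    }
}
```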

Output

    13/10/08 15:36:15 INFO mapred.JobClient WordCountMap.NOUNS=
    13/10/08 15:36:15 INFO mapred.JobClient WordCountMap.PRONOUNS=
    13/10/08 15:36:15 INFO mapred.JobClient WordCountMap.ADJECTIVES=1897

Partitioner
• Maps keys to reducers/partitions
• Determines which reducer receives a certain key
• Identical keys produced by different map functions must map to the same partition/reducer
• If n reducers are used, then n partitions must be filled
• The number of reducers is set by the call "setNumReduceTasks"
• Hadoop uses HashPartitioner as the default partitioner
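The default hash partitioning rule can be sketched in a few lines of plain Java. The formula follows what Hadoop's HashPartitioner does (hash code with the sign bit masked off, modulo the number of reduce tasks); the class name HashPartitionSketch and the use of String keys instead of Writable keys are simplifications made here.

```java
class HashPartitionSketch {
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every occurrence of the same word lands on the same reducer,
        // no matter which map task emitted it.
        for (String word : new String[] {"Hadoop", "Map", "Reduce", "Key", "Value"}) {
            System.out.println(word + " -> reducer " + getPartition(word, 3));
        }
    }
}
```

Because the partition depends only on the key, identical keys from different map tasks always meet at one reducer, which is the property the slide requires.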

Defining a Custom Partitioner
• Implement a class which extends the Partitioner class
• Partitioning impacts the load-balancing aspect of a map-reduce program
• Word-Count: many words start with vowels
  - Words starting with a different character are sent to a different reducer
  - For words starting with vowels, the second character may be taken into account
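One possible reading of the vowel rule above, written as stand-alone Java. Everything here is an assumption made for illustration: the class name VowelAwarePartitionSketch, the way the first two characters are mixed, and the use of plain strings. In a real job the same logic would live in the getPartition method of a class extending Hadoop's Partitioner.

```java
import java.util.Locale;

class VowelAwarePartitionSketch {
    static int getPartition(String word, int numReduceTasks) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.isEmpty()) {
            return 0;
        }
        char first = w.charAt(0);
        int bucket = first;
        // Vowel-initial words are common, so spread them using the second character too.
        if ("aeiou".indexOf(first) >= 0 && w.length() > 1) {
            bucket = first * 31 + w.charAt(1);
        }
        return bucket % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("apple", 26)); // prints 25
        System.out.println(getPartition("ant", 26));   // prints 23
        System.out.println(getPartition("map", 26));   // prints 5
    }
}
```

With 26 reducers, "apple" and "ant" no longer share a reducer even though both start with "a", while all words starting with "m" still go to one place.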

Word-Count Runner Class (with a custom partitioner)

    public class WordCountRunner {
        public static void main(String[] args) throws Exception {
            Job job = new Job();
            job.setMapperClass(WordCountMap.class);
            job.setCombinerClass(WordCountCombiner.class);
            job.setReducerClass(WordCountReduce.class);
            job.setJarByClass(WordCountRunner.class);
            job.setPartitionerClass(WordCountPartitioner.class);
            // inputFilesPath and outputPath are Path objects defined elsewhere
            FileInputFormat.addInputPath(job, inputFilesPath);
            FileOutputFormat.setOutputPath(job, outputPath);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // With a single reduce task every key goes to the same reducer;
            // use more than one to see the partitioner take effect
            job.setNumReduceTasks(1);
            job.waitForCompletion(true);
        }
    }

Passing Configuration Parameters
• Map-Reduce jobs may require certain input parameters
• One may want to avoid counting words starting with certain prefixes
• Prefixes can be set in the configuration

Word-Count Runner Class (passing a parameter)

    public class WordCountRunner {
        public static void main(String[] args) throws Exception {
            Job job = new Job();
            Configuration conf = job.getConfiguration();
            conf.set("PrefixesToAvoid", "abs bts bnm swe");
            // ... remaining job setup as before ...
            job.waitForCompletion(true);
        }
    }

Word-Count Map (reading the parameter)

    public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
        private String[] prefixesToAvoid;

        @Override
        public void setup(Context context) throws InterruptedException {
            // Runs once per map task, before any call to map()
            Configuration conf = context.getConfiguration();
            String prefixes = conf.get("PrefixesToAvoid");
            this.prefixesToAvoid = prefixes.split(" ");
        }

        @Override
        public void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            String[] tokens = Tokenize(line);
            for (int i = 0; i < tokens.length; i++) {
                // The slide emits every token; a check against prefixesToAvoid belongs here
                context.write(new Text(tokens[i]), new IntWritable(1));
            }
        }
    }
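The map function on the slide loads the prefixes but never tests a token against them. A stand-alone sketch of that missing check follows; PrefixFilterSketch and its method names are hypothetical, and the space-separated parameter format is taken from the "abs bts bnm swe" value set in the runner.

```java
import java.util.ArrayList;
import java.util.List;

class PrefixFilterSketch {
    // Splits the space-separated configuration value into individual prefixes.
    static String[] parsePrefixes(String configValue) {
        return configValue.trim().split("\\s+");
    }

    // True if the token starts with any non-empty prefix from the configuration.
    static boolean hasAvoidedPrefix(String token, String[] prefixes) {
        for (String prefix : prefixes) {
            if (!prefix.isEmpty() && token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Returns the tokens a map call would emit as (token, 1) pairs.
    static List<String> tokensToCount(String line, String[] prefixes) {
        List<String> kept = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty() && !hasAvoidedPrefix(token, prefixes)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] prefixes = parsePrefixes("abs bts bnm swe");
        System.out.println(tokensToCount("absolute map swell reduce", prefixes));
        // prints [map, reduce]
    }
}
```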

Distributed Cache
• A file may need to be broadcast to each map node
  - For example, a dictionary in a spell-check
• Such file names can be added to a distributed cache
• Hadoop copies files added to the cache to all map nodes
• Step 1: Put the file into HDFS
    hdfs dfs -put /tmp/file1 /cachefile1
• Step 2: Add the cache file to the job configuration
    Configuration conf = job.getConfiguration();
    DistributedCache.addCacheFile(new URI("/cachefile1"), conf);
• Step 3: Access the cache file locally in each map
    Path[] cacheFiles = context.getLocalCacheFiles();
    FileInputStream finputStream = new FileInputStream(cacheFiles[0].toString());

Hadoop I/O: Reading an HDFS File

    // Get FileSystem object instance
    FileSystem fs = FileSystem.get(conf);

    // Get file stream
    Path infile = new Path(filePath);
    BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(infile)));

    // Read file line by line
    StringBuilder fileContent = new StringBuilder();
    String line = br.readLine();
    while (line != null) {
        fileContent.append(line).append("\n");
        line = br.readLine();
    }

Hadoop I/O: Writing to an HDFS File

    // Get FileSystem object instance
    FileSystem fs = FileSystem.get(conf);

    // Get file stream
    Path path = new Path(filePath);
    FSDataOutputStream outputStream = fs.create(path);

    // Write to file
    byte[] bytes = content.getBytes();
    outputStream.write(bytes, 0, bytes.length);
    outputStream.close();

Hadoop I/O: Getting the File being Processed
• A map-reduce job may need to process multiple files
• The functionality of a map may depend upon which file is being processed

    FileSplit fileSplit = (FileSplit) context.getInputSplit();
    String filename = fileSplit.getPath().getName();

Custom Objects as Key-Values
• Passing keys and values from map functions to reducers
  - IntWritable, DoubleWritable, LongWritable, Text, ArrayWritable
• Passing keys and values of custom classes may be desirable
• Objects that can be passed around must implement certain interfaces
  - Writable for passing as values
  - WritableComparable for passing as keys

Example Use-Case
• Consider weather data
  - Temperature and pressure values at different latitude-longitude-elevation-timestamp quadruples
  - The data is hence 4-dimensional
• Temperature and pressure data are in separate files
  - File format: latitude, longitude, elevation, timestamp, temperature-value (Ex: F F)
  - Similarly for pressure (Ex: kPa)
• We want to read the two data files and combine the data (Ex: F 101kPa)
• Let class STPoint represent the coordinates

    class STPoint {
        double lattitude, longitude, elevation;
        long timestamp;
    }

Map to Reduce Flow (Text keys)
• Input: temperature records (F) and pressure records (kPa)
• MAP output: ( , 99F) and ( , 101kPa), with key type Text and value type DoubleWritable
• REDUCE output: (10, 20, 1, 10, 99F, 101kPa)

Map to Reduce Flow (STPoint keys)
• Input: temperature records (F) and pressure records (kPa)
• MAP output: (STPoint( ), 99F) and (STPoint( ), 101kPa), with key type STPoint and value type DoubleWritable
• REDUCE output: (STPoint(10, 20, 1, 10), 99F, 101kPa)

Map

    // Generic parameters end with: type of map output key (Text), type of output value (DoubleWritable)
    public class MyMap extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        public void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            String[] tokens = line.toString().split(" ");
            double lattitude = Double.parseDouble(tokens[0]);
            double longitude = Double.parseDouble(tokens[1]);
            double elevation = Double.parseDouble(tokens[2]);
            long timestamp = Long.parseLong(tokens[3]);
            double attrVal = Double.parseDouble(tokens[4]);
            String keyString = lattitude + " " + longitude + " " + elevation + " " + timestamp;
            context.write(new Text(keyString), new DoubleWritable(attrVal));
        }
    }

New Map
• More intuitive, human readable, reduces processing at the reduce side

    // Generic parameters end with: type of map output key (STPoint), type of output value (DoubleWritable)
    public class MyMap extends Mapper<LongWritable, Text, STPoint, DoubleWritable> {
        @Override
        public void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            String[] tokens = line.toString().split(" ");
            double lattitude = Double.parseDouble(tokens[0]);
            double longitude = Double.parseDouble(tokens[1]);
            double elevation = Double.parseDouble(tokens[2]);
            long timestamp = Long.parseLong(tokens[3]);
            double attrVal = Double.parseDouble(tokens[4]);
            STPoint stpoint = new STPoint(lattitude, longitude, elevation, timestamp);
            context.write(stpoint, new DoubleWritable(attrVal));
        }
    }

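New Reduce

    // Generic parameters, in order: type of input key, type of input value,
    // type of output key, type of output value.
    // The output types did not survive in the transcript; STPoint and Text are assumed here.
    public class DataReadReduce extends Reducer<STPoint, DoubleWritable, STPoint, Text> {
        @Override
        public void reduce(STPoint key, Iterable<DoubleWritable> values, Context context) {
        }
    }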

Passing Custom Objects as Key-Values
• Key-value pairs are written to local disk by map functions
  - The user must tell how to write a custom object
• Key-value pairs are read by reducers from local disk
  - The user must tell how to read a custom object
• Keys are sorted and compared
  - The user must specify how to compare two keys

WritableComparable Interface
• Three methods
  - public void readFields(DataInput in) {}
  - public void write(DataOutput out) {}
  - public int compareTo(Object other) {}
• Objects that are passed as keys must implement the WritableComparable interface
• Objects that are passed as values must implement the Writable interface
  - The Writable interface does not have a compareTo method
  - Only keys are compared, not values, so compareTo is not required for objects that are passed only as values

Implementing WritableComparable for STPoint

    public void readFields(DataInput in) throws IOException {
        // Fields must be read in the same order in which write() emits them
        this.lattitude = in.readDouble();
        this.longitude = in.readDouble();
        this.elevation = in.readDouble();
        this.timestamp = in.readLong();   // assign the field, not a local variable
    }

    public void write(DataOutput out) throws IOException {
        out.writeDouble(this.lattitude);
        out.writeDouble(this.longitude);
        out.writeDouble(this.elevation);
        out.writeLong(this.timestamp);
    }

    public int compareTo(STPoint other) {
        // Relies on STPoint overriding toString() to print all four fields
        return this.toString().compareTo(other.toString());
    }
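The write/readFields contract can be exercised without Hadoop, because DataInput and DataOutput are plain java.io interfaces. The sketch below is an assumption-laden stand-in, not the deck's STPoint: the class STPointSketch and its serialize/deserialize helpers are hypothetical, and compareTo compares the fields numerically instead of comparing strings, which is a swapped-in choice made here so that 2.0 sorts before 10.0.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class STPointSketch implements Comparable<STPointSketch> {
    double lattitude, longitude, elevation;
    long timestamp;

    STPointSketch() { }

    STPointSketch(double lattitude, double longitude, double elevation, long timestamp) {
        this.lattitude = lattitude;
        this.longitude = longitude;
        this.elevation = elevation;
        this.timestamp = timestamp;
    }

    // Fields are written and read back in the same order.
    void write(DataOutput out) throws IOException {
        out.writeDouble(lattitude);
        out.writeDouble(longitude);
        out.writeDouble(elevation);
        out.writeLong(timestamp);
    }

    void readFields(DataInput in) throws IOException {
        lattitude = in.readDouble();
        longitude = in.readDouble();
        elevation = in.readDouble();
        timestamp = in.readLong();
    }

    // Numeric, field-by-field ordering (an alternative to comparing toString() output).
    @Override
    public int compareTo(STPointSketch other) {
        int c = Double.compare(lattitude, other.lattitude);
        if (c == 0) c = Double.compare(longitude, other.longitude);
        if (c == 0) c = Double.compare(elevation, other.elevation);
        if (c == 0) c = Long.compare(timestamp, other.timestamp);
        return c;
    }

    static byte[] serialize(STPointSketch point) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            point.write(new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static STPointSketch deserialize(byte[] data) {
        try {
            STPointSketch point = new STPointSketch();
            point.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return point;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        STPointSketch original = new STPointSketch(10, 20, 1, 10L);
        byte[] data = serialize(original);
        System.out.println(data.length);                              // prints 32
        System.out.println(deserialize(data).compareTo(original));    // prints 0
    }
}
```

Three doubles and one long take 32 bytes, and a point that survives the round trip compares equal to the original, which is what the framework needs when keys travel from maps to reducers.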

InputFormat and OutputFormat
• InputFormat
  - Defines how to read data from a file and feed it to the map functions
• OutputFormat
  - Defines how to write data to a file
• Hadoop provides various input and output formats
• A user can also implement custom input and output formats
• Defining custom input and output formats is a very useful feature of map-reduce

Input-Format
• Defines how to read data from a file and feed it to the map functions
• How to define splits?
  - getSplits()
• How to define a record?
  - getRecordReader() (createRecordReader() in the newer mapreduce API)
• Hadoop provides various input and output formats
• A user can also implement custom input and output formats
• Defining custom input and output formats is a very useful feature of map-reduce

Split
[Diagrams: a file stored in blocks A, B and C holds a sequence of records (R). The same file is shown divided into four splits (MAP-1 to MAP-4), into two splits (MAP-1, MAP-2), and into three splits (MAP-1 to MAP-3), with one map task per split.]

Record-Reader
[Diagrams: the same five input rows (R) grouped into records in three different ways.]
• All records are fed to the map task one by one
• The rows can be grouped differently, so that there are three records
• All the tuples with identical values in column 1 can be bunched into the same record

TextInputFormat
• Default input format
• Key is the byte offset and value is the line content
• Suitable for reading raw text files
• [Diagram: two input lines ending in F pass through TEXT INPUT FORMAT and reach MAP as (0, " F") and (10, " F"), that is (offset, line as a string)]
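The (byte offset, line) pairs can be reproduced for an in-memory string with a few lines of Java. This is only an illustration of the keying scheme, under the assumptions that lines end with a single "\n" and the text is UTF-8; TextInputSketch and the sample text are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class TextInputSketch {
    // Returns one (byte offset, line) record per line, in file order.
    static Map<Long, String> records(String fileContent) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            records.put(offset, line);
            // The next line starts after this line's bytes plus the newline byte.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(records("first line\nsecond\n"));
        // prints {0=first line, 11=second}
    }
}
```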

KeyValueInputFormat (class KeyValueTextInputFormat)
• Input data is in the form key \tab value
• Anything before \tab is the key
• Anything after \tab is the value
• A line with no \tab does not raise an error: the whole line becomes the key and the value is empty
• [Diagram: the lines " \t 99F" and " \t 98F" pass through KEY VALUE INPUT FORMAT and reach MAP as (" ", "99F") and (" ", "98F"), that is (content before tab, content after tab)]
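The splitting rule is easy to state in code. The sketch below is plain Java rather than the Hadoop record reader; KeyValueLineSketch is a hypothetical name. It splits at the first tab only, and treats a line without a tab as a key with an empty value, which is the behavior documented for Hadoop's KeyValueLineRecordReader.

```java
class KeyValueLineSketch {
    // Returns {key, value}: everything before the first tab, and everything after it.
    static String[] parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] {line, ""};   // no separator: whole line is the key
        }
        return new String[] {line.substring(0, tab), line.substring(tab + 1)};
    }

    public static void main(String[] args) {
        String[] pair = parse("station-1\t99F");   // sample line, made up for this sketch
        System.out.println(pair[0] + " | " + pair[1]);
        // prints station-1 | 99F
    }
}
```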

SequenceInputFormat (class SequenceFileInputFormat)
• Hadoop-specific, high-performance binary input format
• Key is user-defined
• Value is user-defined
• [Diagram: a binary file passes through SEQUENCE INPUT FORMAT and reaches MAP as (" ", "99F") and (" ", "98F"), that is (user-defined key, user-defined value)]

OutputFormats
• TextOutputFormat
  - Default output format
  - Writes data in key \tab value format
  - This output can subsequently be read by KeyValueInputFormat
• SequenceOutputFormat (class SequenceFileOutputFormat)
  - Writes binary files suitable for reading into subsequent MR jobs
  - Keys and values are user-defined

Text Input and Output Format
• TEXT OUTPUT FORMAT writes the pairs (" ", "99F") and (" ", "98F") as the lines " \tab 99F" and " \tab 98F"
• TEXT INPUT FORMAT reads those lines back as (0, " \tab 99F") and (10, " \tab 98F")
• KEY VALUE INPUT FORMAT reads them back as (" ", "99F") and (" ", "98F")

Custom Input Formats
• Allow a user control over how to read data and subsequently feed it to the map functions
• Advisable to implement custom input formats for specific use-cases
• Simplify the process of implementing map-reduce algorithms

CustomInputFormat
• [Diagram: two input lines ending in F pass through MY INPUT FORMAT and reach MAP as (STPoint( ), 99F) and (STPoint( ), 98F)]
• Key is of type STPoint

Map (with the default input format)

    // Generic parameters end with: type of map output key (STPoint), type of output value (DoubleWritable)
    public class MyMap extends Mapper<LongWritable, Text, STPoint, DoubleWritable> {
        @Override
        public void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            String[] tokens = line.toString().split(" ");
            double lattitude = Double.parseDouble(tokens[0]);
            double longitude = Double.parseDouble(tokens[1]);
            double elevation = Double.parseDouble(tokens[2]);
            long timestamp = Long.parseLong(tokens[3]);
            double attrVal = Double.parseDouble(tokens[4]);
            STPoint stpoint = new STPoint(lattitude, longitude, elevation, timestamp);
            context.write(stpoint, new DoubleWritable(attrVal));
        }
    }

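New Map With Custom Input Format
• More intuitive, human readable

    // Generic parameters, in order: map input key, map input value, map output key, map output value
    class MyMap extends Mapper<STPoint, DoubleWritable, STPoint, DoubleWritable> {
        @Override
        public void map(STPoint point, DoubleWritable attrValue, Context context)
                throws IOException, InterruptedException {
            // The input format has already parsed the line, so the map only forwards the pair
            context.write(point, attrValue);
        }
    }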

Specifying Input and Output Format

    public class MyRunner {
        public static void main(String[] args) throws Exception {
            Job job = new Job();
            job.setMapperClass(MyMap.class);
            job.setCombinerClass(MyCombiner.class);
            job.setReducerClass(MyReduce.class);
            job.setJarByClass(MyRunner.class);
            job.setPartitionerClass(MyPartitioner.class);
            job.setInputFormatClass(MyInputFormat.class);
            job.setOutputFormatClass(MyOutputFormat.class);
            // inputFilesPath and outputPath are Path objects defined elsewhere
            FileInputFormat.addInputPath(job, inputFilesPath);
            FileOutputFormat.setOutputPath(job, outputPath);
            job.setNumReduceTasks(1);
            job.waitForCompletion(true);
        }
    }

Implementing a Custom Input Format
• Specify how to split the data
  - Data splitting is handled by the class FileInputFormat
  - A custom input format can extend this class
• RecordReader
  - Reads the data in each split, parses it and passes it to map
  - Iterator over the input data

Custom Input Format

    public class MyInputFormat extends FileInputFormat<STPoint, DoubleWritable> {
        @Override
        public RecordReader<STPoint, DoubleWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            // Splitting is inherited from FileInputFormat; only record parsing is custom
            return new MyRecordReader();
        }
    }

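Custom Record-Reader

    public class MyRecordReader extends RecordReader<STPoint, DoubleWritable> {
        private STPoint point;
        private DoubleWritable attrVal;
        private LineRecordReader lineRecordReader;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            // Delegate line splitting to Hadoop's LineRecordReader
            lineRecordReader = new LineRecordReader();
            lineRecordReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (!lineRecordReader.nextKeyValue()) {
                this.point = null;
                this.attrVal = null;   // no more records in this split
                return false;
            }
            String lineString = lineRecordReader.getCurrentValue().toString();
            // getSTPoint and getAttributeValue are parsing helpers not shown on the slide
            this.point = getSTPoint(lineString);
            this.attrVal = new DoubleWritable(getAttributeValue(lineString));
            return true;
        }

        @Override
        public STPoint getCurrentKey() { return this.point; }

        @Override
        public DoubleWritable getCurrentValue() { return this.attrVal; }

        // RecordReader also requires these two methods; both delegate to the line reader
        @Override
        public float getProgress() throws IOException { return lineRecordReader.getProgress(); }

        @Override
        public void close() throws IOException { lineRecordReader.close(); }
    }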
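The record reader depends on two parsing helpers, getSTPoint and getAttributeValue, whose bodies are not shown on the slides. One way they might look is sketched below in plain Java. The line layout (five whitespace-separated numeric fields: latitude, longitude, elevation, timestamp, value) is an assumption based on the map code earlier in the deck, and RecordParseSketch and its nested Point class are hypothetical stand-ins for the real STPoint.

```java
class RecordParseSketch {
    // Minimal stand-in for STPoint, holding the four coordinates of a reading.
    static final class Point {
        final double lattitude, longitude, elevation;
        final long timestamp;

        Point(double lattitude, double longitude, double elevation, long timestamp) {
            this.lattitude = lattitude;
            this.longitude = longitude;
            this.elevation = elevation;
            this.timestamp = timestamp;
        }
    }

    // Parses the first four fields of a line into the key.
    static Point getSTPoint(String line) {
        String[] t = line.trim().split("\\s+");
        return new Point(Double.parseDouble(t[0]), Double.parseDouble(t[1]),
                Double.parseDouble(t[2]), Long.parseLong(t[3]));
    }

    // Parses the fifth field of a line into the value.
    static double getAttributeValue(String line) {
        String[] t = line.trim().split("\\s+");
        return Double.parseDouble(t[4]);
    }

    public static void main(String[] args) {
        String line = "10 20 1 10 99";   // sample record, made up for this sketch
        Point p = getSTPoint(line);
        System.out.println(p.lattitude + " " + p.timestamp + " " + getAttributeValue(line));
        // prints 10.0 10 99.0
    }
}
```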

Chaining Map-Reduce Jobs
• Simple tasks may be completed by a single map and reduce
• Complex tasks will require multiple map and reduce cycles
• Multiple map and reduce cycles need to be chained together
  - Chaining multiple jobs in a sequence
  - Chaining multiple jobs in complex dependency
  - Chaining multiple maps in a sequence

Chaining Map-Reduce Jobs In Sequence
• Most commonly done
• The output of a reducer is an input to the map functions of the next cycle
• A new job starts only after the prior job has finished
• [Diagram: INPUT, then MAP JOB 1 and REDUCE JOB 1, then MAP JOB 2 and REDUCE JOB 2, then MAP JOB 3 and REDUCE JOB 3, then OUTPUT]

Chaining Multiple Jobs In Sequence

    Job job1 = new Job();
    FileInputFormat.addInputPath(job1, inputPath1);
    FileOutputFormat.setOutputPath(job1, outputPath1);
    // set all other parameters: job1.setMapperClass(...), ...
    job1.waitForCompletion(true);   // blocks until job 1 has finished

    Job job2 = new Job();
    FileInputFormat.addInputPath(job2, outputPath1);   // job 2 reads what job 1 wrote
    FileOutputFormat.setOutputPath(job2, outputPath2);
    // set all other parameters: job2.setMapperClass(...), ...
    job2.waitForCompletion(true);

Chaining Multiple Jobs in Complex Dependency
• Chaining in sequence assumes that each job depends on the one before it, in a chain
• That may not be so
• Example: Job1 may process a data-set of a certain type, Job2 may process a data-set of another type, and Job3 combines the results of Job1 and Job2
• Job1 and Job2 are independent, while Job3 depends on Job1 and Job2
• Running the jobs in the sequence Job1, Job2 and then Job3 may not be ideal, because Job1 and Job2 could run at the same time

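Chaining Multiple Jobs in Complex Dependency
• Use addDependingJob to specify the dependencies (job1, job2 and job3 are ControlledJob objects that wrap the jobs)
    job3.addDependingJob(job1);
    job3.addDependingJob(job2);
• Define a JobControl object
    JobControl jc = new JobControl("jobs");   // the constructor takes a group name; "jobs" is an arbitrary choice
    jc.addJob(job1);
    jc.addJob(job2);
    jc.addJob(job3);
    jc.run();   // usually started in its own thread and polled with jc.allFinished()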
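What a dependency-aware scheduler gains over a fixed sequence can be seen in a small simulation. This sketch does not use JobControl; JobDependencySketch and its methods are hypothetical, and it only computes which jobs could run together given the dependencies from the example (job3 depends on job1 and job2).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class JobDependencySketch {
    // Groups jobs into waves: every job in a wave has all of its dependencies
    // finished in earlier waves, so the jobs of one wave can run concurrently.
    static List<List<String>> waves(Map<String, Set<String>> dependsOn) {
        Set<String> done = new HashSet<>();
        List<List<String>> waves = new ArrayList<>();
        while (done.size() < dependsOn.size()) {
            List<String> ready = new ArrayList<>();
            for (Map.Entry<String, Set<String>> job : dependsOn.entrySet()) {
                if (!done.contains(job.getKey()) && done.containsAll(job.getValue())) {
                    ready.add(job.getKey());
                }
            }
            if (ready.isEmpty()) {
                throw new IllegalStateException("cyclic or unsatisfiable dependency");
            }
            done.addAll(ready);
            waves.add(ready);
        }
        return waves;
    }

    // The three-job example: job1 and job2 are independent, job3 needs both.
    static List<List<String>> slideExample() {
        Map<String, Set<String>> dependsOn = new LinkedHashMap<>();
        dependsOn.put("job1", Collections.<String>emptySet());
        dependsOn.put("job2", Collections.<String>emptySet());
        dependsOn.put("job3", new HashSet<>(Arrays.asList("job1", "job2")));
        return waves(dependsOn);
    }

    public static void main(String[] args) {
        System.out.println(slideExample());
        // prints [[job1, job2], [job3]]
    }
}
```

Two waves instead of three sequential runs is the saving that declaring dependencies makes possible.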

Chaining Multiple Maps In A Sequence
• Multiple map tasks can also be chained in a sequence followed by a reducer
• Avoids development of large map methods
• Avoids multiple MR jobs with additional I/O overheads
• Easier development and code re-use
• Use the ChainMapper API

Chaining Multiple Maps In A Sequence (code outline)

    // Each "info" placeholder stands for the class, its input and output
    // key/value types, and a Configuration
    Job chainJob = new Job(conf);
    ChainMapper.addMapper(chainJob, /* Map-1 info */);
    ChainMapper.addMapper(chainJob, /* Map-2 info */);
    ChainMapper.addMapper(chainJob, /* Map-3 info */);
    ChainReducer.setReducer(chainJob, /* Reducer info */);
    chainJob.waitForCompletion(true);

Compression
• MR jobs produce large output
• The output of an MR job can be compressed
• Saves a lot of space
• The compression algorithm should be one that produces splittable files
  - bzip2 is one such compression algorithm
• If a compression algorithm does not produce splittable files, the output will not be split, and a single map will process the whole data in a subsequent job
  - gzip output is not splittable

Compressing the Output

    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

Hadoop Tuning and Optimization
• A number of parameters may impact the performance of a job
  - Whether to compress output or not
  - Number of reduce tasks
  - Block size (64 MB, 128 MB, 256 MB, etc.)
  - Speculative execution or not
  - Buffer size for sorting
  - Temporary space allocation
  - Many more such parameters
• Tuning these parameters is not an exact science
• Some recommendations have been developed on how to set these parameters

Compression (map output)
• Parameter: mapred.compress.map.output
• Default: false
• Pros
  - Faster disk writes
  - Lower disk space usage
  - Less time spent on data transfer
• Cons
  - Overhead of compression and decompression
• Recommendation
  - For large jobs and large clusters, compress

Speculative Execution
• Parameters: mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution
• Default: true
• Pros
  - Reduces the job time if task progress is slow due to memory unavailability or hardware degradation
• Cons
  - Increases the job time if task progress is slow due to complex and large calculations
• Recommendation
  - Set it to false when the average task completion time is high (> 1 hr) due to complex and large calculations

Block Size
• Parameter: dfs.block.size
• Default: 64 MB
• Recommendations
  - Small cluster and large data-sets: many map tasks will be needed
    Data size 160 GB and block size 64 MB => # splits = 2560
    Data size 160 GB and block size 128 MB => # splits = 1280
    Data size 160 GB and block size 256 MB => # splits = 640
  - In small clusters (6-10 nodes), the map task creation overhead is significant, so the block size should be large, but small enough to utilize all resources
  - The block size should be set according to the size of the cluster, the map task capacity and the average size of input files
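The split counts above follow from a single division. The sketch below reproduces them under two assumptions: one split per block, and binary units (1 GB = 2^30 bytes, 1 MB = 2^20 bytes). SplitCountSketch is a hypothetical helper, not a Hadoop class.

```java
class SplitCountSketch {
    static final long MB = 1L << 20;
    static final long GB = 1L << 30;

    // Number of blocks (and, with one split per block, map tasks) needed
    // to cover the data; the division is rounded up.
    static long numSplits(long dataBytes, long blockBytes) {
        return (dataBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        System.out.println(numSplits(160 * GB, 64 * MB));   // prints 2560
        System.out.println(numSplits(160 * GB, 128 * MB));  // prints 1280
        System.out.println(numSplits(160 * GB, 256 * MB));  // prints 640
    }
}
```

Doubling the block size halves the number of map tasks, which is why larger blocks reduce task creation overhead on small clusters.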
