1 IBM Research ® © 2007 IBM Corporation INTRODUCTION TO HADOOP & MAP-REDUCE

2 IBM Research | India Research Lab
Outline
 Map-Reduce Features
   - Combiner / Partitioner / Counter
   - Passing Configuration Parameters
   - Distributed-Cache
 Hadoop I/O
   - Passing Custom Objects as Key-Values
 Input and Output Formats
   - Introduction
   - Input/Output Formats Provided by Hadoop
   - Writing Custom Input/Output Formats
 Miscellaneous
   - Chaining Map-Reduce Jobs
   - Compression
   - Hadoop Tuning and Optimization

3 IBM Research | India Research Lab
Combiner
 A local reduce
 Processes the output of each map function
 Has the same signature as a reduce
 Often reduces the number of intermediate key-value pairs

4 IBM Research | India Research Lab
Word-Count
[Figure: word-count data flow without a combiner — three maps read the input lines and emit one (word, 1) pair per token, e.g. (Hadoop, 1), (Map, 1), (Reduce, 1), (Key, 1), (Value, 1); sort/shuffle groups the pairs into three partitions (A-I, J-Q, R-Z), e.g. (Hadoop, [1,1,1]); the reducers emit the final counts (Hadoop, 3), (Map, 7), (Key, 2), (Reduce, 2), (Value, 2).]

5 IBM Research | India Research Lab
Word-Count
[Figure: the same word-count flow with a COMBINER — each map's output is locally aggregated before the shuffle, e.g. (Hadoop, [1,1]) becomes (Hadoop, 2) and (Map, [1,1,1,1]) becomes (Map, 4), so far fewer intermediate pairs cross the network; the reducers still emit the same final counts (Hadoop, 3), (Map, 7), (Key, 2), (Reduce, 2), (Value, 2).]

6 IBM Research | India Research Lab
COMBINER
// Generic parameters: <input key, input value, output key, output value>
public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // count(values) is the slide's elided helper that sums the 1s for this word
    context.write(key, new IntWritable(count(values)));
  }
}

7 IBM Research | India Research Lab
Word-Count Runner Class
public class WordCountRunner {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setMapperClass(WordCountMap.class);
    job.setCombinerClass(WordCountCombiner.class);
    job.setReducerClass(WordCountReduce.class);
    job.setJarByClass(WordCountRunner.class);
    FileInputFormat.addInputPath(job, inputFilesPath);
    FileOutputFormat.setOutputPath(job, outputPath);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setNumReduceTasks(1);
    job.waitForCompletion(true);
  }
}

8 IBM Research | India Research Lab Counters

9 IBM Research | India Research Lab
Counters
 Built-in Counters
   - Report metrics for various aspects of a job
 Task Counters
   - Gather information about tasks over the course of a job
   - Results are aggregated across all tasks
   - e.g. MAP_INPUT_RECORDS, REDUCE_INPUT_GROUPS
 FileSystem Counters
   - BYTES_READ, BYTES_WRITTEN
   - Bytes read/written by each file system (HDFS, KFS, Local, S3, etc.)
 FileInputFormat Counters
   - BYTES_READ (bytes read through FileInputFormat)
 FileOutputFormat Counters
   - BYTES_WRITTEN (bytes written through FileOutputFormat)
 Job Counters
   - Maintained by the Job-Tracker
   - e.g. TOTAL_LAUNCHED_MAPS, TOTAL_LAUNCHED_REDUCES

10 IBM Research | India Research Lab
User-Defined Counters
public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
  enum WCCounters {NOUNS, PRONOUNS, ADJECTIVES};
  public void map(LongWritable key, Text line, Context context)
      throws IOException, InterruptedException {
    // Tokenize, isNoun, isProNoun, isAdjective are the slide's elided helpers
    String[] tokens = Tokenize(line);
    for (int i = 0; i < tokens.length; i++) {
      if (isNoun(tokens[i]))
        context.getCounter(WCCounters.NOUNS).increment(1);
      else if (isProNoun(tokens[i]))
        context.getCounter(WCCounters.PRONOUNS).increment(1);
      else if (isAdjective(tokens[i]))
        context.getCounter(WCCounters.ADJECTIVES).increment(1);
      context.write(new Text(tokens[i]), new IntWritable(1));
    }
  }
}

11 IBM Research | India Research Lab
Retrieving the Values of a Counter
Counters counters = job.getCounters();
Counter counter = counters.findCounter(WCCounters.NOUNS);
long value = counter.getValue();

12 IBM Research | India Research Lab
Output
13/10/08 15:36:15 INFO mapred.JobClient WordCountMap.NOUNS=2342
13/10/08 15:36:15 INFO mapred.JobClient WordCountMap.PRONOUNS=2124
13/10/08 15:36:15 INFO mapred.JobClient WordCountMap.ADJECTIVES=1897

13 IBM Research | India Research Lab
Partitioner
 Maps keys to reducers/partitions
 Determines which reducer receives a certain key
 Identical keys produced by different map functions must map to the same partition/reducer
 If n reducers are used, then n partitions must be filled
 The number of reducers is set by the call setNumReduceTasks
 Hadoop uses HashPartitioner as the default partitioner (see the sketch below)
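For reference, the default HashPartitioner simply hashes the key modulo the number of reducers; a minimal sketch of its logic:

// Sketch of the default HashPartitioner's logic
public class HashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask the sign bit so the partition index is never negative
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}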

14 IBM Research | India Research Lab
Defining a Custom Partitioner
 Implement a class that extends the Partitioner class
 Partitioning impacts the load-balancing of a map-reduce program
 Word-Count example: many words start with vowels
   - If words starting with different characters are sent to different reducers, the vowel reducers get overloaded
   - For words starting with vowels, the second character may also be taken into account (see the sketch below)
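A minimal sketch of such a partitioner for word-count (the exact two-character rule for vowel-initial words is an illustrative assumption, not code from the deck):

public class WordCountPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    String word = key.toString().toLowerCase();
    if (word.isEmpty()) return 0;
    if ("aeiou".indexOf(word.charAt(0)) >= 0 && word.length() > 1) {
      // Vowel-initial words: spread them further by their first two characters
      return (word.substring(0, 2).hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
    // Other words: route by first character
    return (word.charAt(0) & Integer.MAX_VALUE) % numReduceTasks;
  }
}

The runner on the next slide registers this class via setPartitionerClass.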

15 IBM Research | India Research Lab
Word-Count Runner Class
public class WordCountRunner {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setMapperClass(WordCountMap.class);
    job.setCombinerClass(WordCountCombiner.class);
    job.setReducerClass(WordCountReduce.class);
    job.setJarByClass(WordCountRunner.class);
    job.setPartitionerClass(WordCountPartitioner.class);
    FileInputFormat.addInputPath(job, inputFilesPath);
    FileOutputFormat.setOutputPath(job, outputPath);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setNumReduceTasks(1);
    job.waitForCompletion(true);
  }
}

16 IBM Research | India Research Lab
Passing Configuration Parameters
 Map-Reduce jobs may require certain input parameters
   - e.g. one may want to avoid counting words starting with certain prefixes
 Such prefixes can be set in the job configuration

17 IBM Research | India Research Lab
Word-Count Runner Class
public class WordCountRunner {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    Configuration conf = job.getConfiguration();
    conf.set("PrefixesToAvoid", "abs bts bnm swe");
    // ... set mapper/reducer classes, paths, etc. as before ...
    job.waitForCompletion(true);
  }
}

18 IBM Research | India Research Lab
Word-Count Map
public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
  private String[] prefixesToAvoid;
  public void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    String prefixes = conf.get("PrefixesToAvoid");
    this.prefixesToAvoid = prefixes.split(" ");
  }
  public void map(LongWritable key, Text line, Context context)
      throws IOException, InterruptedException {
    String[] tokens = Tokenize(line);
    for (int i = 0; i < tokens.length; i++) {
      // Skip words starting with any configured prefix
      // (hasForbiddenPrefix is a hypothetical helper: a startsWith check against each prefix)
      if (hasForbiddenPrefix(tokens[i], prefixesToAvoid)) continue;
      context.write(new Text(tokens[i]), new IntWritable(1));
    }
  }
}

19 IBM Research | India Research Lab
Distributed Cache
 A file may need to be broadcast to each map node
   - For example, a dictionary in a spell-check
 Such file names can be added to the distributed cache
 Hadoop copies files added to the cache to all map nodes
 Step 1: Put the file into HDFS
   hdfs dfs -put /tmp/file1 /cachefile1
 Step 2: Add the cache file to the job configuration
   Configuration conf = job.getConfiguration();
   DistributedCache.addCacheFile(new URI("/cachefile1"), conf);
 Step 3: Access the cache file locally at each map
   Path[] cacheFiles = context.getLocalCacheFiles();
   FileInputStream finputStream = new FileInputStream(cacheFiles[0].toString());
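Putting Step 3 in context, a mapper would typically load the cached file once in setup(); a minimal sketch, assuming the cached file is a dictionary with one word per line (SpellCheckMap and the dictionary field are illustrative names):

public class SpellCheckMap extends Mapper<LongWritable, Text, Text, IntWritable> {
  private Set<String> dictionary = new HashSet<String>();

  protected void setup(Context context) throws IOException, InterruptedException {
    // Cached files appear as local paths on each map node
    Path[] cacheFiles = context.getLocalCacheFiles();
    BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
    String word;
    while ((word = reader.readLine()) != null) {
      dictionary.add(word.trim());
    }
    reader.close();
  }
}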

20 IBM Research | India Research Lab
Hadoop I/O: Reading an HDFS File
// Get a FileSystem instance
FileSystem fs = FileSystem.get(conf);
// Open a stream on the file
Path infile = new Path(filePath);
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(infile)));
// Read the file line by line
StringBuilder fileContent = new StringBuilder();
String line = br.readLine();
while (line != null) {
  fileContent.append(line).append("\n");
  line = br.readLine();
}

21 IBM Research | India Research Lab
Hadoop I/O: Writing to an HDFS File
// Get a FileSystem instance
FileSystem fs = FileSystem.get(conf);
// Create an output stream
Path path = new Path(filePath);
FSDataOutputStream outputStream = fs.create(path);
// Write to the file
byte[] bytes = content.getBytes();
outputStream.write(bytes, 0, bytes.length);
outputStream.close();

22 IBM Research | India Research Lab
Hadoop I/O: Getting the File Being Processed
 A map-reduce job may need to process multiple files
 The functionality of a map may depend upon which file is being processed
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String filename = fileSplit.getPath().getName();

23 IBM Research | India Research Lab
Custom Objects as Key-Values
 Keys and values are passed from map functions to reducers
   - Built-in types: IntWritable, DoubleWritable, LongWritable, Text, ArrayWritable
 Passing keys and values of custom classes may be desirable
 Objects that are passed around must implement certain interfaces
   - Writable for passing as values
   - WritableComparable for passing as keys

24 IBM Research | India Research Lab
Example Use-Case
 Consider weather data
   - Temperature and pressure values at each latitude-longitude-elevation-timestamp quadruple
   - The data is hence 4-dimensional
 Temperature and pressure data arrive in separate files
   - File format: latitude, longitude, elevation, timestamp, value
   - Temperature example: 10 20 10 1 99F / 10 21 10 2 98F
   - Similarly for pressure, e.g. 10 20 10 1 101kPa
 We want to read the two data files and combine the data
   - e.g. 10 20 10 1 99F 101kPa
 Let class STPoint represent the coordinates:
   class STPoint { double latitude, longitude, elevation; long timestamp; }

25 IBM Research | India Research Lab
Map to Reduce Flow
[Figure: the maps read lines such as "10 20 1 10 99F" and "10 20 1 10 101kPa" and emit (Text, DoubleWritable) pairs like ("10 20 1 10", 99F) and ("10 20 1 10", 101kPa); the reduce receives both values for the same coordinate string and emits (10, 20, 1, 10, 99F, 101kPa).]

26 IBM Research | India Research Lab
Map to Reduce Flow
[Figure: the same flow with a custom key — the maps emit (STPoint, DoubleWritable) pairs like (STPoint(10 20 1 10), 99F) and (STPoint(10 20 1 10), 101kPa), and the reduce combines them into (STPoint(10, 20, 1, 10), 99F, 101kPa).]

27 IBM Research | India Research Lab
Map
// Map output key: Text; map output value: DoubleWritable
public class MyMap extends Mapper<LongWritable, Text, Text, DoubleWritable> {
  public void map(LongWritable key, Text line, Context context)
      throws IOException, InterruptedException {
    String[] tokens = line.toString().split(" ");
    double latitude = Double.parseDouble(tokens[0]);
    double longitude = Double.parseDouble(tokens[1]);
    double elevation = Double.parseDouble(tokens[2]);
    long timestamp = Long.parseLong(tokens[3]);
    double attrVal = Double.parseDouble(tokens[4]);
    String keyString = latitude + " " + longitude + " " + elevation + " " + timestamp;
    context.write(new Text(keyString), new DoubleWritable(attrVal));
  }
}

28 IBM Research | India Research Lab
New Map
// Map output key: STPoint; map output value: DoubleWritable
public class MyMap extends Mapper<LongWritable, Text, STPoint, DoubleWritable> {
  public void map(LongWritable key, Text line, Context context)
      throws IOException, InterruptedException {
    String[] tokens = line.toString().split(" ");
    double latitude = Double.parseDouble(tokens[0]);
    double longitude = Double.parseDouble(tokens[1]);
    double elevation = Double.parseDouble(tokens[2]);
    long timestamp = Long.parseLong(tokens[3]);
    double attrVal = Double.parseDouble(tokens[4]);
    STPoint stpoint = new STPoint(latitude, longitude, elevation, timestamp);
    context.write(stpoint, new DoubleWritable(attrVal));
  }
}
More intuitive, human readable, and reduces processing on the reduce side

29 IBM Research | India Research Lab
New Reduce
// Generic parameters: <input key, input value, output key, output value>
// (the Text output types here are illustrative)
public class DataReadReduce extends Reducer<STPoint, DoubleWritable, Text, Text> {
  public void reduce(STPoint key, Iterable<DoubleWritable> values, Context context) {
  }
}

30 IBM Research | India Research Lab
Passing Custom Objects as Key-Values
 Key-value pairs are written to local disk by map functions
   - The user must specify how to write a custom object
 Key-value pairs are read by reducers from local disk
   - The user must specify how to read a custom object
 Keys are sorted and compared
   - The user must specify how to compare two keys

31 IBM Research | India Research Lab
WritableComparable Interface
 Three methods:
   - public void readFields(DataInput in)
   - public void write(DataOutput out)
   - public int compareTo(Object other)
 Objects that are passed as keys must implement the WritableComparable interface
 Objects that are passed as values must implement the Writable interface
   - The Writable interface does not have the compareTo method
   - Only keys are compared, not values, hence compareTo is not required for objects passed only as values

32 IBM Research | India Research Lab
Implementing WritableComparable for STPoint
public void readFields(DataInput in) throws IOException {
  this.latitude = in.readDouble();
  this.longitude = in.readDouble();
  this.elevation = in.readDouble();
  this.timestamp = in.readLong();  // assign to the field, not a local variable
}
public void write(DataOutput out) throws IOException {
  // Fields must be written in the same order they are read
  out.writeDouble(this.latitude);
  out.writeDouble(this.longitude);
  out.writeDouble(this.elevation);
  out.writeLong(this.timestamp);
}
public int compareTo(STPoint other) {
  return this.toString().compareTo(other.toString());
}
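One caveat worth noting: the default HashPartitioner routes keys by hashCode(), so a custom key class should also override hashCode() (and equals()) consistently with compareTo; a minimal sketch for STPoint, assuming its toString() renders the four fields:

// Route identical points to the same reducer
@Override
public int hashCode() {
  return this.toString().hashCode();
}

@Override
public boolean equals(Object o) {
  return (o instanceof STPoint) && this.compareTo((STPoint) o) == 0;
}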

33 IBM Research | India Research Lab
InputFormat and OutputFormat
 InputFormat
   - Defines how to read data from a file and feed it to the map functions
 OutputFormat
   - Defines how to write data to a file
 Hadoop provides various input and output formats
 A user can also implement custom input and output formats
   - Defining custom input and output formats is a very useful feature of map-reduce

34 IBM Research | India Research Lab
Input-Format
 Defines how to read data from a file and feed it to the map functions
 How are splits defined?
   - getSplits()
 How are records defined?
   - getRecordReader() (createRecordReader in the new API)

35 IBM Research | India Research Lab
Split
[Figure: a table of 20 rows (R1-R20, columns A, B, C) stored in a 64 MB block, divided into four splits of five rows each; each split (Split 1-4) is processed by its own map task (MAP-1 to MAP-4).]

36 IBM Research | India Research Lab
Split
[Figure: the same 20-row table in a 64 MB block divided into two splits of ten rows each, processed by two map tasks (MAP-1, MAP-2).]

37 IBM Research | India Research Lab
Split
[Figure: a variant of the previous two-split layout of the same table, again feeding MAP-1 and MAP-2.]

38 IBM Research | India Research Lab
Split
[Figure: the same table divided into three splits, feeding MAP-1, MAP-2, and MAP-3.]

39 IBM Research | India Research Lab
Record-Reader
[Figure: rows R1-R5 of a split; all records are fed to the map task one by one.]

40 IBM Research | India Research Lab
Record-Reader
[Figure: the same five rows grouped by the record reader so that there are three records now.]

41 IBM Research | India Research Lab
Record-Reader
[Figure: rows R1 and R5, rows R2 and R3, and row R4 grouped so that all tuples with identical values in column 1 are bunched into the same record.]

42 IBM Research | India Research Lab
TextInputFormat
 The default input format
 Key is the byte offset of the line; value is the line content
 Suitable for reading raw text files
Example: the lines "10 20 1 10 99F" and "10 21 1 10 98F" are fed to the map as (0, "10 20 1 10 99F") and (15, "10 21 1 10 98F").

43 IBM Research | India Research Lab
KeyValueInputFormat
 Input data is in the form key \tab value
   - Anything before the tab is the key
   - Anything after the tab is the value
 Input not in the correct format will cause an error
Example: the lines "10 20 1 10 \t 99F" and "10 21 1 10 \t 98F" are fed to the map as ("10 20 1 10", "99F") and ("10 21 1 10", "98F").
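Wiring this up in the driver is one line; a minimal sketch (the separator override is optional, and the property name shown is the one used by the new-API KeyValueTextInputFormat — treat it as an assumption to verify against your Hadoop version):

job.setInputFormatClass(KeyValueTextInputFormat.class);
// Optional: use a separator other than the default tab, e.g. a comma
job.getConfiguration().set(
    "mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");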

44 IBM Research | India Research Lab
SequenceFileInputFormat
 A Hadoop-specific, high-performance binary input format
 Key is user-defined
 Value is user-defined
Example: a binary sequence file is fed to the map as user-defined key-value pairs, e.g. ("10 20 1 10", "99F") and ("10 21 1 10", "98F").

45 IBM Research | India Research Lab
OutputFormats
 TextOutputFormat
   - The default output format
   - Writes data in key \tab value format
   - This output can be read subsequently by KeyValueInputFormat
 SequenceFileOutputFormat
   - Writes binary files suitable for reading into subsequent MR jobs
   - Keys and values are user-defined
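As an illustration of how the two sequence-file formats pair up across jobs, a first job can write a binary file that a second job reads back with matching key/value types; a minimal driver-side sketch (job1/job2 and the Text/DoubleWritable types are assumptions):

// Job 1 writes its (Text, DoubleWritable) output as a binary sequence file
job1.setOutputFormatClass(SequenceFileOutputFormat.class);
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(DoubleWritable.class);

// Job 2 reads it back; its mapper then receives (Text, DoubleWritable) pairs directly
job2.setInputFormatClass(SequenceFileInputFormat.class);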

46 IBM Research | India Research Lab
Text Input and Output Format
[Figure: round trip — TextOutputFormat writes the pairs ("10 20 1 10", "99F") and ("10 21 1 10", "98F") as the lines "10 20 1 10 \tab 99F" and "10 21 1 10 \tab 98F"; re-reading with TextInputFormat yields (offset, whole line) pairs such as (0, "10 20 1 10 \tab 99F"), whereas KeyValueInputFormat recovers the original ("10 20 1 10", "99F") pairs.]

47 IBM Research | India Research Lab
Custom Input Formats
 Give the user control over how data is read and subsequently fed to the map functions
 Advisable to implement custom input formats for specific use-cases
 Simplifies the process of implementing map-reduce algorithms

48 IBM Research | India Research Lab
CustomInputFormat
[Figure: MyInputFormat reads the lines "10 20 1 10 99F" and "10 21 1 10 98F" and feeds the map (STPoint(10 20 1 10), 99F) and (STPoint(10 21 1 10), 98F) directly — the key is of type STPoint.]

49 IBM Research | India Research Lab
Map
// Map output key: STPoint; map output value: DoubleWritable
public class MyMap extends Mapper<LongWritable, Text, STPoint, DoubleWritable> {
  public void map(LongWritable key, Text line, Context context)
      throws IOException, InterruptedException {
    String[] tokens = line.toString().split(" ");
    double latitude = Double.parseDouble(tokens[0]);
    double longitude = Double.parseDouble(tokens[1]);
    double elevation = Double.parseDouble(tokens[2]);
    long timestamp = Long.parseLong(tokens[3]);
    double attrVal = Double.parseDouble(tokens[4]);
    STPoint stpoint = new STPoint(latitude, longitude, elevation, timestamp);
    context.write(stpoint, new DoubleWritable(attrVal));
  }
}

50 IBM Research | India Research Lab
New Map With Custom Input Format
// Map input: (STPoint, DoubleWritable); map output: (STPoint, DoubleWritable)
public class MyMap extends Mapper<STPoint, DoubleWritable, STPoint, DoubleWritable> {
  public void map(STPoint point, DoubleWritable attrValue, Context context)
      throws IOException, InterruptedException {
    // All parsing has moved into the input format; the map is now trivial
    context.write(point, attrValue);
  }
}
More intuitive and human readable

51 IBM Research | India Research Lab
Specifying Input and Output Format
public class MyRunner {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setMapperClass(MyMap.class);
    job.setCombinerClass(MyCombiner.class);
    job.setReducerClass(MyReduce.class);
    job.setJarByClass(MyRunner.class);
    job.setPartitionerClass(MyPartitioner.class);
    job.setInputFormatClass(MyInputFormat.class);
    job.setOutputFormatClass(MyOutputFormat.class);
    FileInputFormat.addInputPath(job, inputFilesPath);
    FileOutputFormat.setOutputPath(job, outputPath);
    job.setNumReduceTasks(1);
    job.waitForCompletion(true);
  }
}

52 IBM Research | India Research Lab
Implementing a Custom Input Format
 Specify how to split the data
   - Data splitting is handled by the class FileInputFormat
   - A custom input format can extend this class
 RecordReader
   - Reads the data in each split, parses it, and passes it to the map
   - An iterator over the input data


54 IBM Research | India Research Lab
Custom Input Format
public class MyInputFormat extends FileInputFormat<STPoint, DoubleWritable> {
  public RecordReader<STPoint, DoubleWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new MyRecordReader();
  }
}

55 IBM Research | India Research Lab Custom Record-Reader

56 IBM Research | India Research Lab
Custom Record-Reader
public class MyRecordReader extends RecordReader<STPoint, DoubleWritable> {
  private STPoint point;
  private DoubleWritable attrVal;
  private LineRecordReader lineRecordReader;

  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    // Delegate the raw line reading to Hadoop's LineRecordReader
    lineRecordReader = new LineRecordReader();
    lineRecordReader.initialize(split, context);
  }

  public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!lineRecordReader.nextKeyValue()) {
      this.point = null;
      this.attrVal = null;  // no more records; clear both
      return false;
    }
    // Parse the line into a key and a value
    // (getSTPoint and getAttributeValue are the slide's elided parsing helpers)
    String lineString = lineRecordReader.getCurrentValue().toString();
    this.point = getSTPoint(lineString);
    this.attrVal = getAttributeValue(lineString);
    return true;
  }

  public STPoint getCurrentKey() { return this.point; }
  public DoubleWritable getCurrentValue() { return this.attrVal; }
  public float getProgress() throws IOException { return lineRecordReader.getProgress(); }
  public void close() throws IOException { lineRecordReader.close(); }
}

57 IBM Research | India Research Lab
Chaining Map-Reduce Jobs
 Simple tasks may be completed by a single map and reduce
 Complex tasks require multiple map and reduce cycles, which need to be chained together:
   - Chaining multiple jobs in a sequence
   - Chaining multiple jobs with complex dependencies
   - Chaining multiple maps in a sequence

58 IBM Research | India Research Lab
Chaining Map-Reduce Jobs in Sequence
 The most common approach
 The output of one job's reducer is the input to the map functions of the next cycle
 A new job starts only after the prior job has finished
INPUT -> MAP JOB 1 -> REDUCE JOB 1 -> MAP JOB 2 -> REDUCE JOB 2 -> MAP JOB 3 -> REDUCE JOB 3 -> OUTPUT

59 IBM Research | India Research Lab
Chaining Multiple Jobs in Sequence
Job job1 = new Job();
FileInputFormat.addInputPath(job1, inputPath1);
FileOutputFormat.setOutputPath(job1, outputPath1);
// set all other parameters: job1.setMapperClass(...), etc.
job1.waitForCompletion(true);

Job job2 = new Job();
FileInputFormat.addInputPath(job2, outputPath1);  // job2 reads job1's output
FileOutputFormat.setOutputPath(job2, outputPath2);
// set all other parameters
job2.waitForCompletion(true);

60 IBM Research | India Research Lab
Chaining Multiple Jobs with Complex Dependencies
 Chaining in sequence assumes the jobs depend on each other in a chain fashion
 That may not be so
   - Example: Job1 processes a data-set of one type, Job2 processes a data-set of another type, and Job3 combines the results of Job1 and Job2
   - Job1 and Job2 are independent, while Job3 depends on both Job1 and Job2
 Running the jobs in the sequence Job1, Job2, Job3 may not be ideal

61 IBM Research | India Research Lab
Chaining Multiple Jobs with Complex Dependencies
 Use addDependingJob to specify the dependencies:
   job3.addDependingJob(job1);
   job3.addDependingJob(job2);
 Define a JobControl object and run it:
   JobControl jc = new JobControl();
   jc.addJob(job1);
   jc.addJob(job2);
   jc.addJob(job3);
   jc.run();
 Note: addDependingJob and JobControl come from the jobcontrol package (in the new API, jobs are wrapped as ControlledJob)

62 IBM Research | India Research Lab
Chaining Multiple Maps in a Sequence
 Multiple map tasks can be chained in a sequence, followed by a reducer
 Avoids the development of large map methods
 Avoids multiple MR jobs with their additional IO overheads
 Greater ease of development and code re-use
 Use the ChainMapper API

63 IBM Research | India Research Lab Chaining Multiple Maps In A Sequence

64 IBM Research | India Research Lab
Chaining Multiple Maps in a Sequence
Job chainJob = new Job(conf);
ChainMapper.addMapper(chainJob, /* Map-1 info */);
ChainMapper.addMapper(chainJob, /* Map-2 info */);
ChainMapper.addMapper(chainJob, /* Map-3 info */);
ChainReducer.setReducer(chainJob, /* Reducer info */);
chainJob.waitForCompletion(true);
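For concreteness, a minimal sketch of what each "Map-i info" expands to with the new-API chain classes (the Map1/Map2/MyReducer class names and the intermediate Text/IntWritable types are illustrative assumptions; each stage's output types must match the next stage's input types):

Job chainJob = new Job(conf);
ChainMapper.addMapper(chainJob, Map1.class,
    LongWritable.class, Text.class,   // Map1 input key/value types
    Text.class, Text.class,           // Map1 output key/value types
    new Configuration(false));
ChainMapper.addMapper(chainJob, Map2.class,
    Text.class, Text.class,           // Map2 input = Map1 output
    Text.class, IntWritable.class,    // Map2 output key/value types
    new Configuration(false));
ChainReducer.setReducer(chainJob, MyReducer.class,
    Text.class, IntWritable.class,    // reducer input key/value types
    Text.class, IntWritable.class,    // reducer output key/value types
    new Configuration(false));
chainJob.waitForCompletion(true);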

65 IBM Research | India Research Lab
Compression
 MR jobs produce large output
 The output of an MR job can be compressed
   - Saves a lot of space
 Need to ensure that the compression algorithm used produces splittable files
   - bzip2 is one such compression algorithm
 If a compression algorithm does not produce splittable files, the output will not be split, and a single map will process the whole data in a subsequent job
   - gzip output is not splittable

66 IBM Research | India Research Lab
Compressing the Output
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

67 IBM Research | India Research Lab
Hadoop Tuning and Optimization
 A number of parameters may impact the performance of a job:
   - Whether to compress output or not
   - Number of reduce tasks
   - Block size (64 MB, 128 MB, 256 MB, etc.)
   - Speculative execution or not
   - Buffer size for sorting
   - Temporary space allocation
   - Many more such parameters
 Tuning these parameters is not an exact science
 Some recommendations have been developed for how to set these parameters

68 IBM Research | India Research Lab
Compression
 mapred.compress.map.output
 Default: false
 Pros
   - Faster disk writes
   - Lower disk space usage
   - Less time spent on data transfer
 Cons
   - Overhead in compression and decompression
 Recommendation
   - For large jobs and large clusters, compress (see the snippet below)
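A minimal driver-side sketch of enabling this parameter (the codec line is optional, and BZip2Codec here is just an example choice):

Configuration conf = job.getConfiguration();
conf.setBoolean("mapred.compress.map.output", true);
// Optionally choose the codec for the intermediate map output
conf.setClass("mapred.map.output.compression.codec",
    BZip2Codec.class, CompressionCodec.class);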

69 IBM Research | India Research Lab
Speculative Execution
 mapred.map.tasks.speculative.execution / mapred.reduce.tasks.speculative.execution
 Default: true
 Pros
   - Reduces the job time if task progress is slow due to memory unavailability or hardware degradation
 Cons
   - Increases the job time if task progress is slow due to complex and large calculations
 Recommendation
   - Set it to false when the average task duration is high (> 1 hr) due to complex and large calculations (see the snippet below)
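A minimal sketch of applying that recommendation in the driver, using the parameters named above:

Configuration conf = job.getConfiguration();
conf.setBoolean("mapred.map.tasks.speculative.execution", false);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);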

70 IBM Research | India Research Lab
Block Size
 dfs.block.size
 Default: 64 MB
 Recommendations
   - Small clusters with large data-sets need many map tasks:
     Data size 160 GB, block size 64 MB  => # splits = 2560
     Data size 160 GB, block size 128 MB => # splits = 1280
     Data size 160 GB, block size 256 MB => # splits = 640
   - In small clusters (6-10 nodes), the map-task creation overhead is significant, so the block size should be large, but small enough to utilize all resources
   - Block size should be set according to the size of the cluster, the map task capacity, and the average size of the input files (see the snippet below)
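dfs.block.size is normally set cluster-wide in hdfs-site.xml, but it can also be set per job for the files that job writes; a minimal sketch (value in bytes; treating the per-job override as an assumption to verify for your setup):

Configuration conf = job.getConfiguration();
// Request 128 MB blocks for output files created with this configuration
conf.setLong("dfs.block.size", 128L * 1024 * 1024);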

71 IBM Research | India Research Lab
References
 Hadoop: The Definitive Guide. O'Reilly Press
 Pro Hadoop: Build Scalable, Distributed Applications in the Cloud
 Hadoop Tutorial: http://developer.yahoo.com/hadoop/tutorial/
 www.slideshare.net

