
1 Beyond map/reduce functions: partitioner, combiner, and parameter configuration. Gang Luo, Sept. 9, 2010

2 Partitioner
Determines which reducer/partition a record should go to.
Given the key, the value, and the number of partitions, it returns an integer:
–Partition: (K2, V2, #Partitions) → integer

3 Partitioner
Interface:
  public interface Partitioner<K2, V2> extends JobConfigurable {
    int getPartition(K2 key, V2 value, int numPartitions);
  }
Implementation:
  public class myPartitioner implements Partitioner<K2, V2> {
    int getPartition(K2 key, V2 value, int numPartitions) {
      // your logic!
    }
  }

4 Partitioner
Example:
  public class myPartitioner implements Partitioner<Text, Text> {
    public int getPartition(Text key, Text value, int numPartitions) {
      int hashCode = key.hashCode();
      int partitionIndex = hashCode % numPartitions;
      return partitionIndex;
    }
  }
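A fuller version of the same partitioner, as a sketch (assuming Text keys and values). Two details matter in practice: Partitioner extends JobConfigurable, so the class must also provide configure(), and Java's hashCode() can be negative, so the sign bit should be masked before taking the modulo:

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  public class myPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) { }  // required by JobConfigurable; nothing to set up here

    public int getPartition(Text key, Text value, int numPartitions) {
      // Mask the sign bit: a negative hash would yield a negative
      // partition index and fail the job.
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

The driver then registers it with conf.setPartitionerClass(myPartitioner.class).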

5 Combiner
Reduces the amount of intermediate data before it is sent to the reducers (pre-aggregation).
The interface is exactly the same as the reducer's.

6 Combiner
Example:
  public class myCombiner extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {
      // your logic
    }
  }
The combiner's input key/value types should be the same as the map output key/value types; its output key/value types should be the same as the reducer's input key/value types.
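As a concrete sketch, a combiner for a word-count style job can pre-sum the counts emitted by each mapper. The (Text, IntWritable) types are an assumption here; the slide leaves them open:

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class myCombiner extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();  // pre-aggregate partial counts on the map side
      }
      output.collect(key, new IntWritable(sum));
    }
  }

It is registered with conf.setCombinerClass(myCombiner.class). Note that Hadoop may run the combiner zero, one, or several times per key, so the operation must be associative and commutative (sums and maxima qualify; averages do not).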

7 Parameter
Cluster-level parameters (e.g. HDFS block size)
Job-specific parameters (e.g. number of reducers, map output buffer size)
–Configurable; important for job performance
User-defined parameters
–Used to pass information from the driver to the mapper/reducer
–Help make your mapper/reducer more generic

8 Parameter
  JobConf conf = new JobConf(Driver.class);
  conf.setNumReduceTasks(10);        // set the number of reducers via a built-in setter
  conf.set("io.sort.mb", "200");     // set the map output buffer size by parameter name
  conf.set("delimiter", "\t");       // set a user-defined parameter

  int numReducers = conf.getNumReduceTasks();       // get a value via a built-in getter
  String buffSize = conf.get("io.sort.mb", "200");  // get a value by name, with a default
  String delimiter = conf.get("delimiter", "\t");   // get a user-defined parameter, with a default
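On the task side, these values are read back from the JobConf. Here is a sketch of a mapper that picks up the user-defined "delimiter" parameter set above in its configure() method (the field-splitting logic is illustrative):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class myMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private String delimiter;

    public void configure(JobConf job) {
      // configure() runs once per task before any map() call;
      // the second argument is the default if the key is unset.
      delimiter = job.get("delimiter", "\t");
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
      String[] fields = value.toString().split(delimiter);
      output.collect(new Text(fields[0]), value);  // key on the first field (illustrative)
    }
  }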

9 Parameter
There are some built-in parameters managed by Hadoop. We are not supposed to change them, but we can read them:
–String inputFile = jobConf.get("map.input.file");
–Gets the path of the current input file
–Useful when joining datasets
*For the new API, use:
  FileSplit split = (FileSplit) context.getInputSplit();
  String inputFile = split.getPath().toString();
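For example, in a reduce-side join the mapper can use map.input.file to tag each record with the dataset it came from. A sketch (the "orders"/"C" naming is hypothetical; imports are as in the mapper sketch on the previous slide, plus Mapper and JobConf):

  public class JoinMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private String tag;

    public void configure(JobConf job) {
      String inputFile = job.get("map.input.file");
      // Tag records by source file so the reducer can tell the datasets apart.
      tag = inputFile.contains("orders") ? "O" : "C";
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
      String[] fields = value.toString().split("\t");
      output.collect(new Text(fields[0]), new Text(tag + "\t" + value));
    }
  }

The reducer then receives both tagged record types grouped under the same join key.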

10 More about Hadoop
Identity Mapper/Reducer
–Output == input; no modification
Why would we want map/reduce functions without any logic in them?
–Sorting! (see the sketch below)
–More generally, whenever you only need the basic functionality Hadoop itself provides (e.g. sorting/grouping)
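A minimal sort job along these lines, as a sketch (input/output paths come from the command line). KeyValueTextInputFormat is used so that map input keys are Text rather than byte offsets; the shuffle then sorts each reducer's input by key while the identity functions pass records through untouched:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.KeyValueTextInputFormat;
  import org.apache.hadoop.mapred.lib.IdentityMapper;
  import org.apache.hadoop.mapred.lib.IdentityReducer;

  public class SortDriver {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(SortDriver.class);
      conf.setJobName("sort");
      conf.setInputFormat(KeyValueTextInputFormat.class);  // Text keys, Text values
      conf.setMapperClass(IdentityMapper.class);           // no logic: records pass through
      conf.setReducerClass(IdentityReducer.class);
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(Text.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }

Each reducer's output file is sorted by key; a single, globally sorted result needs either one reducer or a range partitioner such as TotalOrderPartitioner.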

11 More about Hadoop
How is the number of splits determined?
–If a file is large enough and splittable, it is split into multiple pieces (split size = block size)
–If a file is non-splittable, there is only one split
–If a file is small (smaller than a block), there is one split per file, unless...
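For example, assuming a 64 MB block size: a 1 GB splittable text file produces 16 splits (and hence 16 map tasks); the same file compressed with gzip is non-splittable and produces a single split; and 100 files of 1 MB each produce 100 splits, one per file.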

12 More about Hadoop
CombineFileInputFormat
–Merges multiple small files into one split, which is then processed by one mapper
–Saves mapper slots and reduces overhead
Other options for handling small files?
–hadoop fs -getmerge src dest
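For instance (paths are illustrative), -getmerge concatenates the files under an HDFS directory into one local file, which can then be re-uploaded as a single file:

  hadoop fs -getmerge /data/small-files merged.txt
  hadoop fs -put merged.txt /data/merged.txt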

