Advanced Topics on MapReduce with Hadoop Jiaheng Lu Department of Computer Science Renmin University of China
Outline Brief Review Chaining MapReduce Jobs Join in MapReduce Bloom Filter
Brief Review A parallel programming framework: divide and merge [Figure: input data is split (split0, split1, split2), processed by Map tasks (Mappers), shuffled, then merged by Reduce tasks (Reducers) into output files (output0, output1)]
Chaining MapReduce jobs Chaining in a sequence Chaining with complex dependency Chaining preprocessing and postprocessing steps
Chaining in a sequence Simple and straightforward: [MAP | REDUCE]+ or MAP+ | REDUCE | MAP* The output of one job is the input to the next, similar to Unix pipes [Figure: Job1 → Job2 → Job3]
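The pipe-like flow above can be simulated in plain Java: each "job" below is just a function over a list of lines, and the output of one stage feeds the next (the stages and class name are illustrative, not real MapReduce jobs):

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ChainDemo {
    // Each "job" consumes the previous job's output, like piping the
    // output directory of Job1 into the input directory of Job2.
    static List<String> chain(List<String> input) {
        Function<List<String>, List<String>> job1 =
                lines -> lines.stream().map(String::toLowerCase).collect(Collectors.toList());
        Function<List<String>, List<String>> job2 =
                lines -> lines.stream().filter(s -> !s.isEmpty()).collect(Collectors.toList());
        Function<List<String>, List<String>> job3 =
                lines -> lines.stream().sorted().collect(Collectors.toList());
        return job1.andThen(job2).andThen(job3).apply(input);
    }

    public static void main(String[] args) {
        System.out.println(chain(List.of("B", "", "a"))); // [a, b]
    }
}
```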
Configuration conf = getConf();
JobConf job = new JobConf(conf);
job.setJobName("ChainJob");
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);

JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class,
    LongWritable.class, Text.class,   // input key/value types
    Text.class, Text.class,           // output key/value types
    true, map1Conf);                  // byValue = true
Chaining with complex dependency Jobs are not chained in a linear fashion Use the addDependingJob() method to add dependency information: x.addDependingJob(y) means x will not start until y has finished [Figure: Job1 and Job2 both feed into Job3]
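A minimal driver sketch for this dependency graph using the old-API JobControl classes; conf1, conf2, and conf3 are assumed to be already-configured JobConf objects, and job3 depends on both job1 and job2 as in the figure:

```java
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

// conf1, conf2, conf3: pre-built JobConf objects (assumed)
Job job1 = new Job(conf1);
Job job2 = new Job(conf2);
Job job3 = new Job(conf3);
job3.addDependingJob(job1);  // job3 starts only after job1 completes
job3.addDependingJob(job2);  // ... and after job2 completes

JobControl control = new JobControl("dependent-jobs");
control.addJob(job1);
control.addJob(job2);
control.addJob(job3);
new Thread(control).start(); // JobControl implements Runnable
while (!control.allFinished()) {
    Thread.sleep(500);
}
control.stop();
```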
Chaining preprocessing and postprocessing steps Example: removing stop words in IR Approaches: run them as separate jobs (inefficient) or chain the steps into a single job Use ChainMapper.addMapper() and ChainReducer.setReducer() Pattern: MAP+ | REDUCE | MAP*
Join in MapReduce Reduce-side join Broadcast join Map-side filtering with Reduce-side join, filtering by: a given key a key range from a broadcast dataset a Bloom filter
Reduce-side join Map: output key = join key, value = record tagged with its data source Reduce: do a full cross-product of the values from the two sources and output the combined results
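The reduce step can be illustrated with a small self-contained Java simulation; the method name, the x:/y: tag prefixes, and the comma-joined output format are illustrative choices, not Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;

public class ReduceSideJoin {
    // Simulates the reduce step: values arrive tagged with their source
    // table ("x:" or "y:"); emit the cross product of x-values and y-values.
    static List<String> reduce(String key, List<String> taggedValues) {
        List<String> xs = new ArrayList<>(), ys = new ArrayList<>();
        for (String v : taggedValues) {
            if (v.startsWith("x:")) xs.add(v.substring(2));
            else ys.add(v.substring(2));
        }
        List<String> out = new ArrayList<>();
        for (String a : xs)
            for (String b : ys)
                out.add(key + "," + a + "," + b);
        return out;
    }

    public static void main(String[] args) {
        // Join key 1 carries (ab, cd) from table x and (b) from table y.
        System.out.println(reduce("1", List.of("x:ab", "x:cd", "y:b")));
        // [1,ab,b, 1,cd,b]
    }
}
```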
Example table x: (1, ab), (1, cd), (4, ef) table y: (1, b), (2, d), (4, c) map(): emit join key with tagged value, e.g. (1, x:ab), (1, y:b) shuffle(): key 1 → [x:ab, x:cd, y:b]; key 2 → [y:d]; key 4 → [x:ef, y:c] reduce(): cross product per key → output (1, ab, b), (1, cd, b), (4, ef, c)
Broadcast join (replicated join) Broadcast the smaller table and do the join in map() Use the distributed cache: DistributedCache.addCacheFile()
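A self-contained Java sketch of the map-side logic: the smaller table (shipped to every mapper via DistributedCache in real Hadoop) sits in memory as a Map, and each record of the larger table is joined as it streams past; all names and the record layout here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class BroadcastJoin {
    // small: the broadcast table, keyed by join key.
    // big:   records of the larger table as {joinKey, value} pairs.
    static List<String> mapJoin(Map<String, String> small, List<String[]> big) {
        List<String> out = new ArrayList<>();
        for (String[] record : big) {
            String match = small.get(record[0]);
            if (match != null)                       // inner join: drop misses
                out.add(record[0] + "," + record[1] + "," + match);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> y = Map.of("1", "b", "4", "c");
        List<String[]> x = List.of(
                new String[]{"1", "ab"},
                new String[]{"2", "zz"},
                new String[]{"4", "ef"});
        System.out.println(mapJoin(y, x)); // [1,ab,b, 4,ef,c]
    }
}
```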
Map-side filtering and Reduce-side join Join key: student IDs from info Generate an IDs file from info and do a broadcast join What if the IDs file can’t be stored in memory? Use a Bloom Filter
A Bloom Filter Introduction Implementation of bloom filter Use in MapReduce join
Introduction to Bloom Filter A space-efficient, constant-size data structure for testing set membership Operations: add() and contains() No false negatives, but a small probability of false positives
Implementation of bloom filter Use a bit array Add an element: generate k indexes (one per hash function) and set those k bits to 1 Test an element: generate k indexes; if all k bits are 1 >> true, if any bit is 0 >> false
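The scheme above can be sketched in plain Java; this minimal filter backs the bit array with a BitSet and derives its k indexes by double hashing, which is one common implementation choice among several (class and method names are illustrative):

```java
import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int k;   // number of hash functions

    SimpleBloomFilter(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Derive the i-th index from two base hashes (double hashing).
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String item) {
        for (int i = 0; i < k; i++) bits.set(index(item, i));
    }

    // false is always correct; true may be a false positive
    boolean mightContain(String item) {
        for (int i = 0; i < k; i++)
            if (!bits.get(index(item, i))) return false;
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter f = new SimpleBloomFilter(1024, 3);
        f.add("x");
        f.add("y");
        System.out.println(f.mightContain("x")); // true: no false negatives
        System.out.println(f.mightContain("never-added")); // almost certainly false
    }
}
```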
Example Initial state: all bits 0 add x → set bits (0, 2, 6) add y → set bits (0, 3, 9) contains m(1, 3, 9)? bit 1 is 0 → false (correctly rejected) contains n(0, 2, 9)? all three bits are 1 → true, although n was never added: a false positive
Use in MapReduce join Run a separate subjob to create a Bloom Filter over the join keys Broadcast the Bloom Filter and apply it in Map() of the join job: drop records whose join key is not in the filter, then do the join in reduce
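A mapper sketch of this filtering step using Hadoop's bundled org.apache.hadoop.util.bloom.BloomFilter with the old mapred API; the filter parameters, the cached file name bloom.filter, and the comma-separated record layout are assumptions for illustration:

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class FilteringMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    private final BloomFilter filter =
            new BloomFilter(1_000_000, 5, Hash.MURMUR_HASH);

    @Override
    public void configure(JobConf job) {
        // Deserialize the filter written by the earlier subjob and
        // shipped to each task via DistributedCache (file name assumed).
        try (DataInputStream in = new DataInputStream(
                new FileInputStream("bloom.filter"))) {
            filter.readFields(in);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        String joinKey = line.toString().split(",")[0];
        if (filter.membershipTest(new Key(joinKey.getBytes()))) {
            // Tag the source for the reduce-side join; non-members
            // are dropped here, shrinking the shuffle.
            out.collect(new Text(joinKey), new Text("x:" + line));
        }
    }
}
```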
References Chuck Lam, “Hadoop in Action” Jairam Chandar, “Join Algorithms using Map/Reduce”
THANK YOU