Advanced Topics on MapReduce with Hadoop
Jiaheng Lu
Department of Computer Science, Renmin University of China
www.jiahenglu.net

Outline
  Brief Review
  Chaining MapReduce Jobs
  Join in MapReduce
  Bloom Filter

Brief Review
  A parallel programming framework
  Divide and merge
  [Figure: input data (split0, split1, split2) -> map tasks (mappers) -> shuffle -> reduce tasks (reducers) -> output data (output0, output1)]
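The divide-and-merge flow above can be sketched without a cluster. The following is a minimal in-memory illustration in plain Java, not Hadoop code: the class name MiniMapReduce and the word-count task are assumptions chosen for illustration. Each input string plays the role of one split, map() emits (word, 1) pairs, shuffle() groups values by key, and reduce() sums each value list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of map -> shuffle -> reduce (word count).
class MiniMapReduce {
    // Map phase: emit a (word, 1) pair for every word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Shuffle phase: group all emitted values by key, with keys kept sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), key -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the value list of one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run all three phases over the given splits.
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String split : splits) {
            mapped.addAll(map(split));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Three splits, as in the figure; prints {a=3, b=2, c=1}
        System.out.println(run(List.of("a b a", "b c", "a")));
    }
}
```

In a real job the map calls run in parallel on different nodes and the shuffle moves data across the network; the sketch only shows the data flow between the phases.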

Chaining MapReduce jobs
  Chaining in a sequence
  Chaining with complex dependency
  Chaining preprocessing and postprocessing steps

Chaining in a sequence
  Simple and straightforward
  [MAP | REDUCE]+; MAP+ | REDUCE | MAP*
  The output of one job is the input to the next
  Similar to Unix pipes
  [Figure: Job1 -> Job2 -> Job3]

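// Driver fragment (old org.apache.hadoop.mapred API); `in` and `out` are the job's input and output Paths
Configuration conf = getConf();
JobConf job = new JobConf(conf);
job.setJobName("ChainJob");
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
// Add Map1 as the first mapper in the chain
JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class,
    LongWritable.class, Text.class,   // input key and value classes
    Text.class, Text.class,           // output key and value classes
    true, map1Conf);                  // byValue flag and mapper-local configuration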

Chaining with complex dependency
  Jobs are not chained in a linear fashion
  Use the addDependingJob() method to add dependency information
  x.addDependingJob(y): x will not start until y has finished
  [Figure: dependency graph of Job1, Job2, and Job3]
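To make the dependency rule concrete, here is a small plain-Java sketch of a controller that starts a job only when everything it depends on has finished. It does not use the Hadoop API: SketchJob and runAll are hypothetical names, and the dependency graph (Job3 waiting for both Job1 and Job2) is an assumption about the figure. In Hadoop itself, this bookkeeping is handled by the JobControl class.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for dependency-aware job scheduling.
class DependentJobs {
    // Only the name and the dependency list matter for this sketch.
    static class SketchJob {
        final String name;
        final List<SketchJob> dependsOn = new ArrayList<>();

        SketchJob(String name) {
            this.name = name;
        }

        // Mirrors x.addDependingJob(y): x may start only after y has finished.
        void addDependingJob(SketchJob y) {
            dependsOn.add(y);
        }
    }

    // Repeatedly start every job whose dependencies have all completed.
    static List<String> runAll(List<SketchJob> jobs) {
        Set<SketchJob> done = new LinkedHashSet<>();
        List<String> order = new ArrayList<>();
        while (done.size() < jobs.size()) {
            boolean progressed = false;
            for (SketchJob job : jobs) {
                if (!done.contains(job) && done.containsAll(job.dependsOn)) {
                    order.add(job.name); // a real controller would submit the job here
                    done.add(job);
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("cyclic or missing dependency");
            }
        }
        return order;
    }

    // Assumed graph: Job3 depends on Job1 and Job2.
    static List<String> demo() {
        SketchJob job1 = new SketchJob("Job1");
        SketchJob job2 = new SketchJob("Job2");
        SketchJob job3 = new SketchJob("Job3");
        job3.addDependingJob(job1);
        job3.addDependingJob(job2);
        return runAll(List.of(job3, job1, job2));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [Job1, Job2, Job3]
    }
}
```

Job3 is listed first on purpose: the controller still runs it last, because the order is decided by the dependencies and not by the submission order.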

Chaining preprocessing and postprocessing steps
  Example: removing stop words in information retrieval (IR)
  Approaches:
    Run each step as a separate job: inefficient
    Chain the steps into a single job
  Use ChainMapper.addMapper() and ChainReducer.setReducer()
  Map+ | Reduce | Map*

Join in MapReduce
  Reduce-side join
  Broadcast join
  Map-side filtering and reduce-side join, filtering by:
    A given key
    A range from a dataset (broadcast)
    A Bloom filter

Reduce-side join
  Map: the output key is the join key; the output value is the record tagged with its data source
  Reduce: do a full cross-product of the values from the different sources
  Output the combined results

Example
  Table x:
    a  b
    1  ab
    1  cd
    4  ef
  Table y:
    a  c
    1  b
    2  d
    4  c
  map() output (key = join key, value = tag + record):
    1  x:ab
    1  x:cd
    4  x:ef
    1  y:b
    2  y:d
    4  y:c
  shuffle() output (key, value list):
    1  [x:ab, x:cd, y:b]
    2  [y:d]
    4  [x:ef, y:c]
  reduce() output:
    a  b   c
    1  ab  b
    1  cd  b
    4  ef  c
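The example can be replayed with a short plain-Java sketch of a reduce-side join. This is an in-memory illustration under stated assumptions, not Hadoop code: the class ReduceSideJoin, the Tagged helper, and the string tags "x" and "y" are hypothetical names. A TreeMap stands in for the shuffle, grouping tagged values by join key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of a reduce-side join.
class ReduceSideJoin {
    // A map output value: the record payload tagged with its source table.
    static final class Tagged {
        final String tag;
        final String value;

        Tagged(String tag, String value) {
            this.tag = tag;
            this.value = value;
        }
    }

    // Map phase: key each record by the join key and tag it with its source.
    static void map(String tag, String[][] table, Map<Integer, List<Tagged>> shuffled) {
        for (String[] row : table) {
            int joinKey = Integer.parseInt(row[0]);
            shuffled.computeIfAbsent(joinKey, key -> new ArrayList<>())
                    .add(new Tagged(tag, row[1]));
        }
    }

    // Reduce phase: for one join key, cross every x value with every y value.
    static List<String> reduce(int key, List<Tagged> values) {
        List<String> xs = new ArrayList<>();
        List<String> ys = new ArrayList<>();
        for (Tagged t : values) {
            (t.tag.equals("x") ? xs : ys).add(t.value);
        }
        List<String> out = new ArrayList<>();
        for (String x : xs) {
            for (String y : ys) {
                out.add(key + " " + x + " " + y);
            }
        }
        return out;
    }

    static List<String> join(String[][] x, String[][] y) {
        Map<Integer, List<Tagged>> shuffled = new TreeMap<>(); // stands in for shuffle()
        map("x", x, shuffled);
        map("y", y, shuffled);
        List<String> result = new ArrayList<>();
        for (Map.Entry<Integer, List<Tagged>> e : shuffled.entrySet()) {
            result.addAll(reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    // Tables x and y from the example.
    static List<String> demo() {
        String[][] x = {{"1", "ab"}, {"1", "cd"}, {"4", "ef"}};
        String[][] y = {{"1", "b"}, {"2", "d"}, {"4", "c"}};
        return join(x, y);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [1 ab b, 1 cd b, 4 ef c]
    }
}
```

Key 2 appears only in table y, so its cross-product is empty and it produces no output row. Note that the record for key 2 was still shuffled to the reducer, which is the cost that map-side filtering tries to avoid.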

Broadcast join (replicated join)
  Broadcast the smaller table to every mapper
  Do the join in map()
  Uses the distributed cache: DistributedCache.addCacheFile()
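The idea can be sketched in plain Java as follows. This is an in-memory illustration, not Hadoop code: the class BroadcastJoin and its methods are hypothetical, loadSmallTable stands in for reading the distributed-cache copy of the smaller table during mapper setup, and the sketch assumes join keys are unique in the smaller table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of a broadcast (replicated) join.
class BroadcastJoin {
    // Stands in for loading the broadcast copy of the small table into memory.
    // Assumption: each join key occurs at most once in the small table.
    static Map<Integer, String> loadSmallTable(String[][] small) {
        Map<Integer, String> lookup = new HashMap<>();
        for (String[] row : small) {
            lookup.put(Integer.parseInt(row[0]), row[1]);
        }
        return lookup;
    }

    // Map function: join one record of the large table against the in-memory table.
    static void map(String[] record, Map<Integer, String> lookup, List<String> out) {
        String match = lookup.get(Integer.parseInt(record[0]));
        if (match != null) {
            out.add(record[0] + " " + record[1] + " " + match);
        }
    }

    // No shuffle and no reduce phase: the join finishes inside the mappers.
    static List<String> join(String[][] large, String[][] small) {
        Map<Integer, String> lookup = loadSmallTable(small);
        List<String> out = new ArrayList<>();
        for (String[] record : large) {
            map(record, lookup, out);
        }
        return out;
    }

    static List<String> demo() {
        String[][] large = {{"1", "ab"}, {"1", "cd"}, {"4", "ef"}};
        String[][] small = {{"1", "b"}, {"2", "d"}, {"4", "c"}};
        return join(large, small);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [1 ab b, 1 cd b, 4 ef c]
    }
}
```

The approach only works while the smaller table fits in each mapper's memory, which is the limitation that motivates filtering with a more compact structure.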

Map-side filtering and reduce-side join
  Join key: student IDs from info
  Generate an IDs file from info
  Broadcast the IDs file, as in a broadcast join
  What if the IDs file can't be stored in memory?
  Use a Bloom filter

A Bloom Filter
  Introduction
  Implementation of a Bloom filter
  Use in MapReduce join

Introduction to Bloom Filter
  A space-efficient data structure of constant size
  Used to test whether an element is in a set
  Operations: add(), contains()
  No false negatives, and a small probability of false positives

Implementation of a Bloom filter
  Use a bit array
  Add an element:
    generate k indexes
    set the k bits to 1
  Test an element:
    generate k indexes
    all k bits are 1 >> true; not all are 1 >> false

Example
  index:               0 1 2 3 4 5 6 7 8 9
  ① initial state:     0 0 0 0 0 0 0 0 0 0
  ② add x(0,2,6):      1 0 1 0 0 0 1 0 0 0
  ③ add y(0,3,9):      1 0 1 1 0 0 1 0 0 1
  ④ contain m(1,3,9):  bit 1 is 0, so the answer is false (×)
  ⑤ contain n(0,2,9):  bits 0, 2, and 9 are all 1, so the answer is true (√), although n was never added: a false positive
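A minimal Bloom filter along these lines can be sketched in Java with java.util.BitSet. The class name BloomFilterSketch and the way the k indexes are derived (two values taken from String.hashCode(), combined by double hashing) are illustrative assumptions, not the method used in the slides. The addIndexes and testIndexes helpers take explicit indexes so that the 10-bit example can be replayed exactly.

```java
import java.util.BitSet;

// Hypothetical minimal Bloom filter over strings.
class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int k;

    BloomFilterSketch(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Generate k indexes for an element (illustrative double-hashing scheme).
    int[] indexes(String element) {
        int h1 = element.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so the step is never zero
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            result[i] = Math.floorMod(h1 + i * h2, size);
        }
        return result;
    }

    // Set the given bits to 1.
    void addIndexes(int... idx) {
        for (int i : idx) {
            bits.set(i);
        }
    }

    // True only if every given bit is 1.
    boolean testIndexes(int... idx) {
        for (int i : idx) {
            if (!bits.get(i)) {
                return false;
            }
        }
        return true;
    }

    void add(String element) {
        addIndexes(indexes(element));
    }

    boolean contains(String element) {
        return testIndexes(indexes(element));
    }

    public static void main(String[] args) {
        // Replay the example with its explicit indexes.
        BloomFilterSketch f = new BloomFilterSketch(10, 3);
        f.addIndexes(0, 2, 6);                       // add x
        f.addIndexes(0, 3, 9);                       // add y
        System.out.println(f.testIndexes(1, 3, 9));  // m: prints false
        System.out.println(f.testIndexes(0, 2, 9));  // n: prints true (false positive)
    }
}
```

Because add() only ever sets bits and contains() recomputes the same indexes, an element that was added is always reported as present; only the reverse direction can be wrong.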

Use in MapReduce join
  Run a separate subjob to create the Bloom filter
  Broadcast the Bloom filter and use it in map() of the join job
  Drop the records that cannot match, and do the join in reduce
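The three steps can be put together in one plain-Java sketch. Everything here is an illustrative assumption rather than Hadoop code: the class BloomFilteredJoin, the filter parameters SIZE and K, the hashing scheme, and the second dataset (called scores here, keyed by the same student IDs as info). The filter is built from the join keys of info, the map step drops scores records that the filter rules out, and the reduce step performs the actual join.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of map-side Bloom filtering plus reduce-side join.
class BloomFilteredJoin {
    static final int SIZE = 1024; // illustrative bit-array size
    static final int K = 3;       // illustrative number of indexes per key

    static int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;
        return Math.floorMod(h1 + i * h2, SIZE);
    }

    // Subjob: build the filter from the join keys of the smaller dataset.
    static BitSet buildFilter(String[][] info) {
        BitSet bits = new BitSet(SIZE);
        for (String[] row : info) {
            for (int i = 0; i < K; i++) {
                bits.set(index(row[0], i));
            }
        }
        return bits;
    }

    static boolean mightContain(BitSet bits, String key) {
        for (int i = 0; i < K; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    static List<String> join(String[][] info, String[][] scores) {
        BitSet filter = buildFilter(info); // this is what would be broadcast
        Map<String, List<String>> infoByKey = new TreeMap<>();
        Map<String, List<String>> scoresByKey = new TreeMap<>();
        for (String[] row : info) {
            infoByKey.computeIfAbsent(row[0], key -> new ArrayList<>()).add(row[1]);
        }
        for (String[] row : scores) {
            if (mightContain(filter, row[0])) { // map-side filtering
                scoresByKey.computeIfAbsent(row[0], key -> new ArrayList<>()).add(row[1]);
            }
        }
        // Reduce side: false positives that slipped through find no partner here.
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : infoByKey.entrySet()) {
            for (String name : e.getValue()) {
                for (String score : scoresByKey.getOrDefault(e.getKey(), List.of())) {
                    out.add(e.getKey() + " " + name + " " + score);
                }
            }
        }
        return out;
    }

    static List<String> demo() {
        String[][] info = {{"s1", "Ann"}, {"s3", "Bob"}};
        String[][] scores = {{"s1", "90"}, {"s2", "75"}, {"s3", "82"}, {"s1", "60"}};
        return join(info, scores);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [s1 Ann 90, s1 Ann 60, s3 Bob 82]
    }
}
```

The result is correct whether or not the s2 record passes the filter: a false positive only costs some extra shuffle traffic, because the reduce side still requires a real key match.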

References
  Chuck Lam, "Hadoop in Action"
  Jairam Chandar, "Join Algorithms using Map/Reduce"

THANK YOU
