Based on Lin and Dyer’s text: Chapter 3
Figure 2.6
A programmer has no control over:
◦ Where a mapper or reducer runs (i.e., on which node in the cluster).
◦ When a mapper or reducer begins or finishes.
◦ Which input key-value pairs are processed by a specific mapper.
◦ Which intermediate key-value pairs are processed by a specific reducer.
Ability to:
◦ Construct complex data types as keys and values for storage, processing, and communication.
◦ Specify and execute initialization code before a map and/or reduce task, and termination code after it.
◦ Preserve state across multiple keys in the mapper and/or the reducer.
◦ Control the sorting order of intermediate keys.
◦ Control the partitioning of the key space, and thus the set of keys a particular reducer will process.
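The last two controls (sorting order and partitioning) can be illustrated with a small local simulation. This is a minimal sketch, not Hadoop code: the names `default_partition`, `first_letter_partition`, and `shuffle` are hypothetical helpers introduced only for illustration, and the shuffle is modelled as plain in-memory lists.

```python
import zlib


def default_partition(key: str, num_reducers: int) -> int:
    """Hash partitioning: spreads keys roughly evenly across reducers."""
    # crc32 is used because it is deterministic across runs,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def first_letter_partition(key: str, num_reducers: int) -> int:
    """Custom partitioning: keys sharing a first letter reach the same reducer."""
    return ord(key[0].lower()) % num_reducers


def shuffle(pairs, partition, num_reducers, sort_key=None):
    """Assign each intermediate pair to a reducer, then sort each reducer's keys."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    for bucket in buckets:
        # sort_key (assumed optional) lets the caller override the key ordering.
        bucket.sort(key=lambda kv: sort_key(kv[0]) if sort_key else kv[0])
    return buckets


pairs = [("apple", 1), ("avocado", 1), ("banana", 1), ("Apple", 1)]
buckets = shuffle(pairs, first_letter_partition, 2, sort_key=str.lower)
```

With the custom partitioner, every key starting with "a" or "A" lands on the same reducer, and the case-insensitive sort key places "apple" and "Apple" next to each other. Swapping in `default_partition` would give no such guarantee.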
Address the issues without creating a bottleneck for scalability:
◦ The gold standard that MR aims for is linear scalability.
◦ Storing and manipulating state has the potential to hinder scalability.
How to improve performance?
◦ Make the functions efficient.
◦ Make the transfer of intermediate data efficient.
◦ Aggregate intermediate data; this is an important operation for efficiency.
◦ Shrink the intermediate key space.
◦ What else can we do?
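To make the point about aggregating intermediate data concrete, here is a minimal sketch of local aggregation in the style of a combiner, simulated in plain Python. The function names `map_word_count` and `combine` are hypothetical, and whitespace tokenization is an assumption made only for this example.

```python
from collections import Counter


def map_word_count(doc: str):
    """Baseline mapper: emits one (term, 1) pair per token."""
    return [(term, 1) for term in doc.split()]


def combine(pairs):
    """Combiner-style step: sums the values per key within one mapper's output."""
    totals = Counter()
    for term, count in pairs:
        totals[term] += count
    return sorted(totals.items())


raw = map_word_count("a b a c a b")
combined = combine(raw)
```

Here the mapper emits six pairs, but only three leave the node after local aggregation, so less intermediate data has to be transferred to the reducers.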
che/hadoop/mapreduce/Mapper.html
che/hadoop/mapred/package-summary.html
map-reduce-api
class Mapper
    method Map(docid a, doc d)
        H ← new AssociativeArray
        for all term t ∈ doc d do
            H{t} ← H{t} + 1          // Tally counts for entire document
        for all term t ∈ H do
            Emit(term t, count H{t})
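The per-document tally above can be sketched in runnable Python. This is an illustration under stated assumptions, not a Hadoop mapper: `map_document` is a hypothetical name, documents are plain strings, and terms are obtained by splitting on whitespace.

```python
from collections import defaultdict


def map_document(doc_id: str, doc: str):
    """Tally term counts within one document, then emit one pair per distinct term."""
    counts = defaultdict(int)  # plays the role of the associative array H
    for term in doc.split():
        counts[term] += 1
    for term, count in counts.items():
        yield term, count


emitted = sorted(map_document("d1", "to be or not to be"))
```

The document has six tokens but only four distinct terms, so the mapper emits four pairs instead of six. The array is rebuilt on every call, so no state survives from one document to the next.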
class Mapper
    method Initialize
        H ← new AssociativeArray
    method Map(docid a, doc d)
        for all term t ∈ doc d do
            H{t} ← H{t} + 1          // Tally counts across documents
    method Close
        for all term t ∈ H do
            Emit(term t, count H{t})
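The version with Initialize and Close can be sketched in Python as a class that keeps state across calls to its map method. This is a minimal local simulation: the class name `InMapperCombiningMapper` and the driver `run_mapper` are hypothetical, and the driver only imitates the order in which a framework would invoke the three methods.

```python
from collections import defaultdict


class InMapperCombiningMapper:
    """Word-count mapper that preserves state across input documents."""

    def initialize(self):
        # Runs once, before any input is processed.
        self.counts = defaultdict(int)

    def map(self, doc_id: str, doc: str):
        # Emits nothing; only updates the tally shared across documents.
        for term in doc.split():
            self.counts[term] += 1

    def close(self):
        # Runs once, after all input is processed: emit the aggregated counts.
        yield from sorted(self.counts.items())


def run_mapper(mapper, documents):
    """Hypothetical driver: initialize, map every document, then close."""
    mapper.initialize()
    for doc_id, doc in documents:
        mapper.map(doc_id, doc)
    return list(mapper.close())


result = run_mapper(InMapperCombiningMapper(), [("d1", "a b a"), ("d2", "b c")])
```

Two documents with five tokens in total produce three emitted pairs. The trade-off is memory: the tally grows with the number of distinct terms seen by the mapper, which is the kind of stored state that can hinder scalability.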