MAPREDUCE Massive Data Processing (I)
Outline MapReduce Introduction Sample Code Program Prototype Programming using Eclipse
MapReduce Introduction: Overview
What is MapReduce A programming model for expressing distributed computations at a massive scale. A patented software framework introduced by Google, which used it to process more than 20 petabytes of data per day. Popularized by the open-source Hadoop project. Used at Yahoo!, Facebook, Amazon, …
Why MapReduce Scale "out", not "up": symmetric multiprocessing (SMP) and large shared-memory machines hit their limits. Move computing to the data: clusters have limited bandwidth. Hide system-level details from developers: no more race conditions, lock contention, etc. Separate the what from the how: the developer specifies the computation that needs to be performed; the execution framework ("runtime") handles the actual execution.
Locality Don’t move data to workers… move workers to the data! Store data on the local disks of nodes in the cluster Start up the workers on the node that has the data Why? Not enough RAM to hold all the data in memory Disk access is slow, but disk throughput is reasonable A distributed file system is the answer GFS (Google File System) for Google’s MapReduce HDFS (Hadoop Distributed File System) for Hadoop
MapReduce Introduction: Programming Model
Typical Large-Data Problem Iterate over a large number of records. Extract something of interest from each. Shuffle and sort intermediate results. Aggregate intermediate results. Generate final output. Key idea: provide a functional abstraction for two of these operations: Map (extract) and Reduce (aggregate).
How to Abstract
The framework is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms.
Map(...): N → N. Ex: [1, 2, 3, 4] --(*2)--> [2, 4, 6, 8]
Reduce(...): N → 1. Ex: [1, 2, 3, 4] --(sum)--> 10
Programmers specify two functions:
map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v3)
All values with the same key are sent to the same reducer.
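To make the analogy concrete, here is a minimal sketch in plain Java (using java.util.stream rather than Hadoop; the list and functions are illustrative):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapReduceAnalogy {
    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);

        // Map: apply a function to every element: [1,2,3,4] --(*2)--> [2,4,6,8]
        List<Integer> doubled = input.stream()
                .map(x -> x * 2)
                .collect(Collectors.toList());

        // Reduce: fold all elements down to a single value: [1,2,3,4] --(sum)--> 10
        int sum = input.stream().reduce(0, Integer::sum);

        System.out.println(doubled + " / " + sum);  // prints [2, 4, 6, 8] / 10
    }
}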
How to Abstract (cont.) The execution framework (runtime) handles: Scheduling: assigns workers to map and reduce tasks. Data distribution: moves processes to data. Synchronization: gathers, sorts, and shuffles intermediate data. Errors and faults: detects worker failures and restarts. Everything happens on top of a Distributed File System (DFS).
MapReduce Introduction: Implementation
Execution Overview
MapReduce: High Level
Nodes, Trackers, Tasks JobTracker: runs on the master node; accepts job requests from clients. TaskTracker: runs on the slave nodes; forks a separate Java process for each task instance.
Hadoop MapReduce w/ HDFS
Example - Wordcount Input: "Hello Cloud TA cool Hello TA cool". The mappers emit (Hello, 1), (Cloud, 1), (TA, 1), (cool, 1), (Hello, 1), (TA, 1), (cool, 1). Sort/Copy and Merge group the pairs by key: Hello [1 1], TA [1 1], Cloud [1], cool [1 1]. The reducers sum each list, producing the output: Hello 2, TA 2, Cloud 1, cool 2.
Outline MapReduce Introduction Sample Code Program Prototype Programming using Eclipse
Main function

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
  }
  Job job = new Job(conf, "word count");
  job.setJarByClass(wordcount.class);
  job.setMapperClass(mymapper.class);
  job.setCombinerClass(myreducer.class);
  job.setReducerClass(myreducer.class);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Mapper

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class mymapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
Mapper (cont.) Walking one line of the input file /user/hadoop/input/hi through the mapper: the input value is the line "Hi Cloud TA say Hi" (the input key is its offset in the file). value.toString() turns the Text value into a Java String; new StringTokenizer(line) splits the line into the tokens Hi, Cloud, TA, say, Hi; and the while loop emits one (word, 1) pair per token via context.write(word, one).
Reducer

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class myreducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Reducer (cont.) E.g., the key "Hi" arrives with the value list [1, 1], and the reducer emits ("Hi", 2).
Outline MapReduce Introduction Sample Code Program Prototype Programming using Eclipse
MapReduce Terminology Job: a "full program", an execution of a Mapper and Reducer across a data set. Task: an execution of a Mapper or a Reducer on a slice of data. Task Attempt: a particular instance of an attempt to execute a task on a machine.
Main Class

class MR {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "job name");
    job.setJarByClass(thisMainClass.class);
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
Job Identify the classes implementing the Mapper and Reducer interfaces: job.setMapperClass(), job.setReducerClass(). Specify inputs and outputs: FileInputFormat.addInputPath(), FileOutputFormat.setOutputPath(). Optionally, other options too: job.setNumReduceTasks(), job.setOutputFormatClass(), …
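As a sketch of the optional settings (the values and class name here are illustrative; the new mapreduce API sets the output format via setOutputFormatClass):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class JobOptions {
    // Apply the optional settings mentioned above to a Job.
    public static Job configure(Configuration conf) throws Exception {
        Job job = new Job(conf, "word count");            // same constructor as the sample code
        job.setNumReduceTasks(4);                          // run four reduce tasks instead of the default
        job.setOutputFormatClass(TextOutputFormat.class);  // one "key<TAB>value" line per record
        return job;
    }
}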
Class Mapper Maps input key/value pairs to a set of intermediate key/value pairs. Ex:

class MyMapper extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  // global variables
  public void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    // local variables
    ...
    context.write(key', value'); // emit an intermediate (key, value) pair
  }
}

The four generic parameters name the input (key, value) and output (key, value) classes.
Text, IntWritable, LongWritable, … Hadoop defines its own "box" classes: Strings: Text; Integers: IntWritable; Longs: LongWritable. Any (WritableComparable, Writable) can be sent to the reducer. All keys are instances of WritableComparable; all values are instances of Writable.
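A minimal sketch of the box classes in action (plain Java with the Hadoop client library on the classpath; the names are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class BoxClassDemo {
    public static void main(String[] args) {
        Text word = new Text("hello");               // String -> Text
        IntWritable count = new IntWritable(1);      // int    -> IntWritable
        LongWritable offset = new LongWritable(42L); // long   -> LongWritable

        // Unbox with get(), rebox with set()
        count.set(count.get() + 1);

        System.out.println(word + "\t" + count + "\t" + offset); // hello  2  42
    }
}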
Read Data
Mappers Upper-case Mapper Ex: let map(k, v) = emit(k.toUpper(), v.toUpper()) (“foo”, “bar”) → (“FOO”, “BAR”) (“Foo”, “other”) → (“FOO”, “OTHER”) (“key2”, “data”) → (“KEY2”, “DATA”) Explode Mapper let map(k, v) = for each char c in v: emit(k, c) (“A”, “cats”) → (“A”, “c”), (“A”, “a”), (“A”, “t”), (“A”, “s”) (“B”, “hi”) → (“B”, “h”), (“B”, “i”) Filter Mapper let map(k, v) = if (isPrime(v)) then emit(k, v) (“foo”, 7) → (“foo”, 7) (“test”, 10) → (nothing)
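As a sketch, the first of these might look like the following in real Hadoop code (new mapreduce API; the class name and Text/Text types are illustrative assumptions):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Upper-case Mapper: let map(k, v) = emit(k.toUpper(), v.toUpper())
public class UpperCaseMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(key.toString().toUpperCase()),
                      new Text(value.toString().toUpperCase()));
    }
}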
Class Reducer Reduces a set of intermediate values which share a key to a smaller set of values. Ex:

class MyReducer extends Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  // global variables
  public void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
      throws IOException, InterruptedException {
    // local variables
    ...
    context.write(key', value'); // emit a final (key, value) pair
  }
}

The four generic parameters name the input (key, value) and output (key, value) classes.
Reducers
Sum Reducer: let reduce(k, vals) = { sum = 0; foreach int v in vals: sum += v; emit(k, sum) }
("A", [42, 100, 312]) → ("A", 454)
Identity Reducer: let reduce(k, vals) = foreach v in vals: emit(k, v)
("A", [42, 100, 312]) → ("A", 42), ("A", 100), ("A", 312)
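A sketch of the Identity Reducer in the new mapreduce API (the Sum Reducer already appears in full as myreducer in the sample code; the class name and Text types are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Identity Reducer: let reduce(k, vals) = foreach v in vals: emit(k, v)
public class IdentityReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text v : values) {
            context.write(key, v); // pass every value through unchanged
        }
    }
}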
Performance Consideration Ideal scaling characteristics: Twice the data, twice the running time Twice the resources, half the running time Why can’t we achieve this? Synchronization requires communication Communication kills performance Thus… avoid communication! Reduce intermediate data via local aggregation Combiners can help
Partitioner and Combiner Partitioner function: routes the same keys to the same reducer over the network. A default partitioning function based on hashing is provided; in some cases it is useful to partition data by some other function of the key. Combiner function: avoids communication via local aggregation. Synchronization requires communication, and communication kills performance; partial combining significantly speeds up certain classes of MapReduce operations. A sketch of a custom partitioner follows.
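For example, a custom partitioner replacing the default hash partitioner might route keys by their first letter (the class name and policy are illustrative assumptions):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Send all keys that start with the same letter to the same reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        char first = s.isEmpty() ? '\0' : Character.toLowerCase(s.charAt(0));
        return (first & Integer.MAX_VALUE) % numPartitions; // non-negative bucket index
    }
}

It would be registered with job.setPartitionerClass(FirstLetterPartitioner.class); a combiner is registered the same way via job.setCombinerClass(), as the WordCount driver above already does with myreducer.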
Outline MapReduce Introduction Sample Code Program Prototype Programming using Eclipse
MR package, Mapper Class
Reducer Class
MR Driver (Main class)
Run on Hadoop
Run on Hadoop (cont.)
MapReduce Example Example name: Wordcount. Counts the number of occurrences of each word in a file. See orial.html#Example%3A+WordCount+v1.0. This is the most basic example for learning MapReduce.
Example: MapReduce (Mapper)

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

For the complete code, see the MapReduce Tutorial.
Example: MapReduce (Reducer)

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

For the complete code, see the MapReduce Tutorial.
WordCount Practice
1. Change to the home directory: cd ~
2. Create a text file named after yourself with an _input suffix, containing "I like ITRI.": echo "I like ITRI." > name_input
3. Create a directory named after yourself in HDFS: sudo hadoop fs -mkdir /user/name
4. Check that the directory was created: sudo hadoop fs -ls /user
5. Change the owner of the directory: sudo hadoop fs -chown user1:user1 /user/name
WordCount Practice
6. Check that the owner was changed: hadoop fs -ls /user
7. Put the file into HDFS: hadoop fs -put name_input /user/name
8. Check that the file was uploaded: hadoop fs -ls /user/name
9. Change to the working directory to prepare to run the job
10. Run wordcount: sudo hadoop jar hadoop dev-examples.jar wordcount /user/name/name_input /user/name/name_output
Job done.
11. Check that the output was produced: hadoop fs -ls /user/name/name_output
12. View the output file: hadoop fs -cat /user/name/name_output/part-r-*
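For the one-line input file above ("I like ITRI."), the output should contain one tab-separated line per distinct word with its count, along the lines of (assuming the default whitespace tokenization and key ordering):

I	1
ITRI.	1
like	1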