MapReduce: Massive Data Processing (I)
Outline
- MapReduce Introduction
- Sample Code
- Program Prototype
- Programming using Eclipse
MapReduce Introduction: Overview
What is MapReduce?
- A programming model for expressing distributed computations at massive scale
- A patented software framework introduced by Google; it processes 20 petabytes of data per day
- Popularized by the open-source Hadoop project
- Used at Yahoo!, Facebook, Amazon, ...
Why MapReduce?
- Scale "out", not "up": avoid the limits of symmetric multiprocessing (SMP) and large shared-memory machines
- Move computation to the data: clusters have limited bandwidth
- Hide system-level details from developers: no more race conditions, lock contention, etc.
- Separate the what from the how: the developer specifies the computation that needs to be performed; the execution framework ("runtime") handles the actual execution
Locality
- Don't move data to workers... move workers to the data!
- Store data on the local disks of nodes in the cluster, and start up the workers on the node that has the data
- Why? There is not enough RAM to hold all the data in memory, and while disk access is slow, disk throughput is reasonable
- A distributed file system is the answer: GFS (Google File System) for Google's MapReduce, HDFS (Hadoop Distributed File System) for Hadoop
MapReduce Introduction: Programming Model
Typical Large-Data Problem
- Iterate over a large number of records
- Extract something of interest from each
- Shuffle and sort intermediate results
- Aggregate intermediate results
- Generate final output
Key idea: provide a functional abstraction for two of these operations: Map (extract) and Reduce (aggregate).
How to Abstract
The framework is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms:
- Map(...): N → N, e.g. [1, 2, 3, 4] -(*2)-> [2, 4, 6, 8]
- Reduce(...): N → 1, e.g. [1, 2, 3, 4] -(sum)-> 10
Programmers specify two functions:
- map(k1, v1) → list(k2, v2)
- reduce(k2, list(v2)) → list(v3)
All values with the same key are sent to the same reducer.
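To make the functional-programming analogy concrete, here is a minimal plain-Java sketch (java.util.stream, no Hadoop involved) of map doubling every element and reduce folding the list into a sum:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FunctionalExample {
    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);

        // Map: apply (*2) to every element -> [2, 4, 6, 8]
        List<Integer> doubled = input.stream()
                                     .map(x -> x * 2)
                                     .collect(Collectors.toList());

        // Reduce: fold the whole list into a single value -> 10
        int sum = input.stream().reduce(0, Integer::sum);

        System.out.println(doubled + ", sum = " + sum);
    }
}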
How to Abstract (cont.)
The execution framework (runtime) handles:
- Scheduling: assigns workers to map and reduce tasks
- Data distribution: moves processes to data
- Synchronization: gathers, sorts, and shuffles intermediate data
- Errors and faults: detects worker failures and restarts tasks
Everything happens on top of a distributed file system (DFS).
MapReduce Introduction: Implementation
Execution Overview
MapReduce: High Level
Nodes, Trackers, Tasks
- JobTracker: runs on the master node and accepts job requests from clients
- TaskTracker: runs on the slave nodes and forks a separate Java process for each task instance
Hadoop MapReduce with HDFS (architecture figure)
Example: Wordcount
Input: "Hello Cloud TA cool Hello TA cool"
Map output: (Hello, 1), (Cloud, 1), (TA, 1), (cool, 1), (Hello, 1), (TA, 1), (cool, 1)
Sort/Copy/Merge groups the pairs by key: Hello [1 1], TA [1 1], Cloud [1], cool [1 1]
Reducer output: Hello 2, TA 2, Cloud 1, cool 2
Outline
- MapReduce Introduction
- Sample Code
- Program Prototype
- Programming using Eclipse
Main function:

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(wordcount.class);
    job.setMapperClass(mymapper.class);
    job.setCombinerClass(myreducer.class);
    job.setReducerClass(myreducer.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Mapper:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class mymapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
Mapper (cont.)
Suppose the input file /user/hadoop/input/hi contains the line "Hi Cloud TA say Hi". For that line:
- The input key is the position of the line in the file; the input value is the line itself.
- value.toString() yields the string "Hi Cloud TA say Hi".
- new StringTokenizer(line) splits it into the tokens Hi, Cloud, TA, say, Hi.
- The while loop then emits (Hi, 1), (Cloud, 1), (TA, 1), (say, 1), (Hi, 1).
Reducer:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class myreducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
Reducer (cont.)
For the key Hi, the reducer receives the grouped value list [1, 1, 1, 1] and emits (Hi, 4).
Outline
- MapReduce Introduction
- Sample Code
- Program Prototype
- Programming using Eclipse
MapReduce Terminology
- Job: a "full program", i.e. an execution of a Mapper and a Reducer across a data set
- Task: an execution of a Mapper or a Reducer on a slice of data
- Task Attempt: a particular instance of an attempt to execute a task on a machine
Main Class

class MR {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "job name");
        job.setJarByClass(thisMainClass.class);
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
Job
- Identify the classes implementing Mapper and Reducer: job.setMapperClass(), job.setReducerClass()
- Specify inputs and outputs: FileInputFormat.addInputPath(), FileOutputFormat.setOutputPath()
- Optionally, other settings too: job.setNumReduceTasks(), job.setOutputFormatClass(), ... (see the sketch below)
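A hedged illustration of those optional settings, written as additions to the main() shown earlier; the four-reducer count and the explicit output format are arbitrary example choices:

// Requires: import org.apache.hadoop.io.Text;
//           import org.apache.hadoop.io.IntWritable;
//           import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

job.setNumReduceTasks(4);                         // run four reduce tasks instead of the default
job.setOutputFormatClass(TextOutputFormat.class); // write one "key<TAB>value" line per record
job.setMapOutputKeyClass(Text.class);             // set explicitly when the map output types
job.setMapOutputValueClass(IntWritable.class);    // differ from the final job output types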
Class Mapper
Maps input key/value pairs to a set of intermediate key/value pairs. The four type parameters are the input key/value classes followed by the output key/value classes. Ex:

class MyMapper extends Mapper<InKey, InValue, OutKey, OutValue> {
    // global variables
    public void map(InKey key, InValue value, Context context)
            throws IOException, InterruptedException {
        // local variables
        ...
        context.write(key', value');
    }
}
Text, IntWritable, LongWritable, ...
Hadoop defines its own "box" classes:
- String: Text
- Integer: IntWritable
- Long: LongWritable
Any (WritableComparable, Writable) pair can be sent to the reducer: all keys are instances of WritableComparable, and all values are instances of Writable.
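A minimal standalone sketch of working with these box classes (the class name BoxClasses is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class BoxClasses {
    public static void main(String[] args) {
        Text word = new Text("hello");            // boxes a String
        IntWritable count = new IntWritable(3);   // boxes an int
        LongWritable offset = new LongWritable(1024L);

        count.set(count.get() + 1);               // unbox with get(), rebox with set()

        // Text implements WritableComparable, so it can serve as a key;
        // compareTo() is what the shuffle uses to sort keys.
        System.out.println(word.compareTo(new Text("world")) < 0);  // prints true
    }
}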
Read Data (figure)
Mappers
- Upper-case Mapper: let map(k, v) = emit(k.toUpper(), v.toUpper())
  ("foo", "bar") → ("FOO", "BAR")
  ("Foo", "other") → ("FOO", "OTHER")
  ("key2", "data") → ("KEY2", "DATA")
- Explode Mapper: let map(k, v) = for each char c in v: emit(k, c)
  ("A", "cats") → ("A", "c"), ("A", "a"), ("A", "t"), ("A", "s")
  ("B", "hi") → ("B", "h"), ("B", "i")
- Filter Mapper: let map(k, v) = if (isPrime(v)) then emit(k, v)
  ("foo", 7) → ("foo", 7)
  ("test", 10) → (nothing)
A Hadoop sketch of the Explode Mapper follows below.
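For instance, the Explode Mapper could be written as follows in the new Hadoop API; this sketch assumes Text input keys and values (e.g. from KeyValueTextInputFormat), and the class name ExplodeMapper is illustrative:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (key, c) once for every character c in the value.
public class ExplodeMapper extends Mapper<Text, Text, Text, Text> {
    private final Text character = new Text();

    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        String v = value.toString();
        for (int i = 0; i < v.length(); i++) {
            character.set(String.valueOf(v.charAt(i)));
            context.write(key, character);  // ("A", "cats") -> ("A", "c"), ("A", "a"), ...
        }
    }
}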
Class Reducer
Reduces a set of intermediate values which share a key to a smaller set of values. The four type parameters are again the input key/value classes followed by the output key/value classes. Ex:

class MyReducer extends Reducer<InKey, InValue, OutKey, OutValue> {
    // global variables
    public void reduce(InKey key, Iterable<InValue> values, Context context)
            throws IOException, InterruptedException {
        // local variables
        ...
        context.write(key', value');
    }
}
Reducers
- Sum Reducer: ("A", [42, 100, 312]) → ("A", 454)
  let reduce(k, vals) =
    sum = 0
    foreach int v in vals: sum += v
    emit(k, sum)
- Identity Reducer: ("A", [42, 100, 312]) → ("A", 42), ("A", 100), ("A", 312)
  let reduce(k, vals) =
    foreach v in vals: emit(k, v)
A Hadoop sketch of the Identity Reducer follows below.
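The Sum Reducer is exactly the myreducer class from the sample code. As a sketch, the Identity Reducer in the new Hadoop API with the same word-count types (the class name IdentityReducer is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Passes every (key, value) pair through unchanged.
public class IdentityReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        for (IntWritable val : values) {
            context.write(key, val);  // ("A", [42, 100, 312]) -> three separate pairs
        }
    }
}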
Performance Considerations
- Ideal scaling characteristics: twice the data, twice the running time; twice the resources, half the running time
- Why can't we achieve this? Synchronization requires communication, and communication kills performance
- Thus... avoid communication! Reduce intermediate data via local aggregation; combiners can help
Partitioner and Combiner
- Partitioner function: routes the same keys to the same reducer across the network
  - A default partitioning function is provided that uses hashing
  - In some cases, it is useful to partition data by some other function of the key
- Combiner function: avoids communication via local aggregation
  - Synchronization requires communication, and communication kills performance
  - Partial combining significantly speeds up certain classes of MapReduce operations
A sketch of a custom partitioner follows below.
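A minimal sketch of a custom Partitioner for the word-count types; the hash below replicates what the default HashPartitioner already does, and WordPartitioner is an illustrative name (swap in any other function of the key to partition differently). The combiner, by contrast, is wired in with job.setCombinerClass(), as in the sample main().

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reduce task receives each intermediate key.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit so the result is non-negative, then bucket by modulo.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Registered in the driver with: job.setPartitionerClass(WordPartitioner.class);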
Outline
- MapReduce Introduction
- Sample Code
- Program Prototype
- Programming using Eclipse
MR package, Mapper Class
Reducer Class
MR Driver (Main Class)
Run on Hadoop
Run on Hadoop (cont.)
MapReduce Example
Example: Wordcount, which counts how many times each word appears in a file. See http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0. This is the most basic example for learning MapReduce.
Example: MapReduce (Mapper)

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

For the full code, see the MapReduce Tutorial: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0
Example: MapReduce (Reducer)

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

For the full code, see the MapReduce Tutorial: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0
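For reference, the tutorial wires these old-API (org.apache.hadoop.mapred) classes together with a JobConf-based driver roughly like the following sketch:

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);  // old API uses JobConf, not Job
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);  // submits the job and blocks until it finishes
}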
WordCount Practice
1. Change to the home directory:
cd ~
2. Create a text file named after yourself with an _input suffix, containing "I like ITRI.":
echo "I like ITRI." > name_input
3. Create a folder named after yourself in HDFS:
sudo hadoop fs -mkdir /user/name
4. Check that it was created:
sudo hadoop fs -ls /user
5. Change the folder's owner:
sudo hadoop fs -chown user1:user1 /user/name
6. Check that the owner was changed:
hadoop fs -ls /user
7. Put the file into HDFS:
hadoop fs -put name_input /user/name
8. Check that the file was uploaded:
hadoop fs -ls /user/name
9. Change to the working folder to prepare to run the job.
10. Run wordcount:
sudo hadoop jar hadoop-0.20.2-dev-examples.jar wordcount /user/name/name_input /user/name/name_output
Job done.
11. Check that the output was produced:
hadoop fs -ls /user/name/name_output
12. View the output file:
hadoop fs -cat /user/name/name_output/part-r-*