MapReduce: Massive Data Processing (I)
Outline
- MapReduce Introduction
- Sample Code
- Program Prototype
- Programming using Eclipse
MapReduce Introduction: Overview
What is MapReduce?
- A programming model for expressing distributed computations at massive scale
- A patented software framework introduced by Google; it processes 20 petabytes of data per day
- Popularized by the open-source Hadoop project
- Used at Yahoo!, Facebook, Amazon, ...
Why MapReduce?
- Scale "out", not "up": avoid the limits of symmetric multiprocessing (SMP) and large shared-memory machines
- Move computation to the data: clusters have limited bandwidth
- Hide system-level details from developers: no more race conditions, lock contention, etc.
- Separate the what from the how: the developer specifies the computation that needs to be performed; the execution framework ("runtime") handles the actual execution
Locality
- Don't move data to workers... move workers to the data!
- Store data on the local disks of nodes in the cluster, and start up the workers on the node that has the data
- Why? There is not enough RAM to hold all the data in memory, and while disk access is slow, disk throughput is reasonable
- A distributed file system is the answer: GFS (Google File System) for Google's MapReduce, HDFS (Hadoop Distributed File System) for Hadoop
MapReduce Introduction: Programming Model
Typical Large-Data Problem
- Iterate over a large number of records
- Extract something of interest from each
- Shuffle and sort intermediate results
- Aggregate intermediate results
- Generate final output
Key idea: provide a functional abstraction for two of these operations: Map (extract) and Reduce (aggregate).
How to Abstract
The framework is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms:
- Map(...): N → N, e.g. [1, 2, 3, 4] -(*2)-> [2, 4, 6, 8]
- Reduce(...): N → 1, e.g. [1, 2, 3, 4] -(sum)-> 10
Programmers specify two functions:
- map(k1, v1) → list(k2, v2)
- reduce(k2, list(v2)) → list(v3)
All values with the same key are sent to the same reducer.
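To make the functional-programming analogy concrete, here is a minimal plain-Java sketch (java.util.stream, no Hadoop involved) of map doubling every element and reduce folding the list into a sum:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FunctionalExample {
    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);

        // Map: apply (*2) to every element -> [2, 4, 6, 8]
        List<Integer> doubled = input.stream()
                                     .map(x -> x * 2)
                                     .collect(Collectors.toList());

        // Reduce: fold the whole list into a single value -> 10
        int sum = input.stream().reduce(0, Integer::sum);

        System.out.println(doubled + ", sum = " + sum);
    }
}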
How to Abstract (cont.)
The execution framework (runtime) handles:
- Scheduling: assigns workers to map and reduce tasks
- Data distribution: moves processes to data
- Synchronization: gathers, sorts, and shuffles intermediate data
- Errors and faults: detects worker failures and restarts tasks
Everything happens on top of a distributed file system (DFS).
MapReduce Introduction: Implementation
Execution Overview
MapReduce: High Level
Nodes, Trackers, Tasks
- JobTracker: runs on the master node and accepts job requests from clients
- TaskTracker: runs on the slave nodes and forks a separate Java process for each task instance
Hadoop MapReduce with HDFS (architecture figure)
Example: Wordcount
Input: "Hello Cloud TA cool Hello TA cool"
Map output: (Hello, 1), (Cloud, 1), (TA, 1), (cool, 1), (Hello, 1), (TA, 1), (cool, 1)
Sort/Copy/Merge groups the pairs by key: Hello [1 1], TA [1 1], Cloud [1], cool [1 1]
Reducer output: Hello 2, TA 2, Cloud 1, cool 2
Outline
- MapReduce Introduction
- Sample Code
- Program Prototype
- Programming using Eclipse
Main function:

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(wordcount.class);
    job.setMapperClass(mymapper.class);
    job.setCombinerClass(myreducer.class);
    job.setReducerClass(myreducer.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Mapper:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class mymapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
Mapper (cont.)
Suppose the input file /user/hadoop/input/hi contains the line "Hi Cloud TA say Hi". For that line:
- The input key is the position of the line in the file; the input value is the line itself.
- value.toString() yields the string "Hi Cloud TA say Hi".
- new StringTokenizer(line) splits it into the tokens Hi, Cloud, TA, say, Hi.
- The while loop then emits (Hi, 1), (Cloud, 1), (TA, 1), (say, 1), (Hi, 1).
Reducer:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class myreducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
Reducer (cont.)
For the key Hi, the reducer receives the grouped value list [1, 1, 1, 1] and emits (Hi, 4).
Outline
- MapReduce Introduction
- Sample Code
- Program Prototype
- Programming using Eclipse
MapReduce Terminology
- Job: a "full program", i.e. an execution of a Mapper and a Reducer across a data set
- Task: an execution of a Mapper or a Reducer on a slice of data
- Task Attempt: a particular instance of an attempt to execute a task on a machine
Main Class

class MR {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "job name");
        job.setJarByClass(thisMainClass.class);
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
Job
- Identify the classes implementing Mapper and Reducer: job.setMapperClass(), job.setReducerClass()
- Specify inputs and outputs: FileInputFormat.addInputPath(), FileOutputFormat.setOutputPath()
- Optionally, other settings too: job.setNumReduceTasks(), job.setOutputFormatClass(), ... (see the sketch below)
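A hedged illustration of those optional settings, written as additions to the main() shown earlier; the four-reducer count and the explicit output format are arbitrary example choices:

// Requires: import org.apache.hadoop.io.Text;
//           import org.apache.hadoop.io.IntWritable;
//           import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

job.setNumReduceTasks(4);                         // run four reduce tasks instead of the default
job.setOutputFormatClass(TextOutputFormat.class); // write one "key<TAB>value" line per record
job.setMapOutputKeyClass(Text.class);             // set explicitly when the map output types
job.setMapOutputValueClass(IntWritable.class);    // differ from the final job output types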
Class Mapper
Maps input key/value pairs to a set of intermediate key/value pairs. The four type parameters are the input key/value classes followed by the output key/value classes. Ex:

class MyMapper extends Mapper<InKey, InValue, OutKey, OutValue> {
    // global variables
    public void map(InKey key, InValue value, Context context)
            throws IOException, InterruptedException {
        // local variables
        ...
        context.write(key', value');
    }
}
Text, IntWritable, LongWritable, ...
Hadoop defines its own "box" classes:
- String: Text
- Integer: IntWritable
- Long: LongWritable
Any (WritableComparable, Writable) pair can be sent to the reducer: all keys are instances of WritableComparable, and all values are instances of Writable.
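A minimal standalone sketch of working with these box classes (the class name BoxClasses is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class BoxClasses {
    public static void main(String[] args) {
        Text word = new Text("hello");            // boxes a String
        IntWritable count = new IntWritable(3);   // boxes an int
        LongWritable offset = new LongWritable(1024L);

        count.set(count.get() + 1);               // unbox with get(), rebox with set()

        // Text implements WritableComparable, so it can serve as a key;
        // compareTo() is what the shuffle uses to sort keys.
        System.out.println(word.compareTo(new Text("world")) < 0);  // prints true
    }
}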
Read Data (figure)
Mappers
- Upper-case Mapper: let map(k, v) = emit(k.toUpper(), v.toUpper())
  ("foo", "bar") → ("FOO", "BAR")
  ("Foo", "other") → ("FOO", "OTHER")
  ("key2", "data") → ("KEY2", "DATA")
- Explode Mapper: let map(k, v) = for each char c in v: emit(k, c)
  ("A", "cats") → ("A", "c"), ("A", "a"), ("A", "t"), ("A", "s")
  ("B", "hi") → ("B", "h"), ("B", "i")
- Filter Mapper: let map(k, v) = if (isPrime(v)) then emit(k, v)
  ("foo", 7) → ("foo", 7)
  ("test", 10) → (nothing)
A Hadoop sketch of the Explode Mapper follows below.
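For instance, the Explode Mapper could be written as follows in the new Hadoop API; this sketch assumes Text input keys and values (e.g. from KeyValueTextInputFormat), and the class name ExplodeMapper is illustrative:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (key, c) once for every character c in the value.
public class ExplodeMapper extends Mapper<Text, Text, Text, Text> {
    private final Text character = new Text();

    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        String v = value.toString();
        for (int i = 0; i < v.length(); i++) {
            character.set(String.valueOf(v.charAt(i)));
            context.write(key, character);  // ("A", "cats") -> ("A", "c"), ("A", "a"), ...
        }
    }
}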
Class Reducer
Reduces a set of intermediate values which share a key to a smaller set of values. The four type parameters are again the input key/value classes followed by the output key/value classes. Ex:

class MyReducer extends Reducer<InKey, InValue, OutKey, OutValue> {
    // global variables
    public void reduce(InKey key, Iterable<InValue> values, Context context)
            throws IOException, InterruptedException {
        // local variables
        ...
        context.write(key', value');
    }
}
Reducers
- Sum Reducer: ("A", [42, 100, 312]) → ("A", 454)
  let reduce(k, vals) =
    sum = 0
    foreach int v in vals: sum += v
    emit(k, sum)
- Identity Reducer: ("A", [42, 100, 312]) → ("A", 42), ("A", 100), ("A", 312)
  let reduce(k, vals) =
    foreach v in vals: emit(k, v)
A Hadoop sketch of the Identity Reducer follows below.
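The Sum Reducer is exactly the myreducer class from the sample code. As a sketch, the Identity Reducer in the new Hadoop API with the same word-count types (the class name IdentityReducer is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Passes every (key, value) pair through unchanged.
public class IdentityReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        for (IntWritable val : values) {
            context.write(key, val);  // ("A", [42, 100, 312]) -> three separate pairs
        }
    }
}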
Performance Considerations
- Ideal scaling characteristics: twice the data, twice the running time; twice the resources, half the running time
- Why can't we achieve this? Synchronization requires communication, and communication kills performance
- Thus... avoid communication! Reduce intermediate data via local aggregation; combiners can help
Partitioner and Combiner
- Partitioner function: routes the same keys to the same reducer across the network
  - A default partitioning function is provided that uses hashing
  - In some cases, it is useful to partition data by some other function of the key
- Combiner function: avoids communication via local aggregation
  - Synchronization requires communication, and communication kills performance
  - Partial combining significantly speeds up certain classes of MapReduce operations
A sketch of a custom partitioner follows below.
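A minimal sketch of a custom Partitioner for the word-count types; the hash below replicates what the default HashPartitioner already does, and WordPartitioner is an illustrative name (swap in any other function of the key to partition differently). The combiner, by contrast, is wired in with job.setCombinerClass(), as in the sample main().

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reduce task receives each intermediate key.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit so the result is non-negative, then bucket by modulo.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Registered in the driver with: job.setPartitionerClass(WordPartitioner.class);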
Outline
- MapReduce Introduction
- Sample Code
- Program Prototype
- Programming using Eclipse
MR package, Mapper Class
Reducer Class
MR Driver (Main Class)
Run on Hadoop
Run on Hadoop (cont.)
MapReduce Example
Example: Wordcount, which counts how many times each word appears in a file. See http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0. This is the most basic example for learning MapReduce.
Example: MapReduce (Mapper)

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

For the full code, see the MapReduce Tutorial: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0
Example: MapReduce (Reducer)

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

For the full code, see the MapReduce Tutorial: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0
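For reference, the tutorial wires these old-API (org.apache.hadoop.mapred) classes together with a JobConf-based driver roughly like the following sketch:

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);  // old API uses JobConf, not Job
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);  // submits the job and blocks until it finishes
}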
WordCount Practice
1. Change to the home directory:
cd ~
2. Create a text file named after yourself with an _input suffix, containing "I like ITRI.":
echo "I like ITRI." > name_input
3. Create a folder named after yourself in HDFS:
sudo hadoop fs -mkdir /user/name
4. Check that it was created:
sudo hadoop fs -ls /user
5. Change the folder's owner:
sudo hadoop fs -chown user1:user1 /user/name
6. Check that the owner was changed:
hadoop fs -ls /user
7. Put the file into HDFS:
hadoop fs -put name_input /user/name
8. Check that the file was uploaded:
hadoop fs -ls /user/name
9. Change to the working folder to prepare to run the job.
10. Run wordcount:
sudo hadoop jar hadoop-0.20.2-dev-examples.jar wordcount /user/name/name_input /user/name/name_output
Job done.
11. Check that the output was produced:
hadoop fs -ls /user/name/name_output
12. View the output file:
hadoop fs -cat /user/name/name_output/part-r-*