
1 Map Reduce Program September 25th 2017 Kyung Eun Park, D.Sc.

2 Contents
MapReduce Evolution
Features
MapReduce Workflows
Hadoop Cluster
MapReduce Architecture
Hadoop Data Types
MapReduce Functions
MapReduce Programming

3 Big, Large Datasets: Abstraction!!!
Various sources of large datasets
Various types, some with no format at all: unstructured, semi-structured
But the expected analysis operations are simple, repetitive tasks
Working with large datasets means three things:
To design data processing tasks on the large dataset: can we build the processing environment with hundreds or thousands of cheap commodity computers rather than a few almighty ones?
To see interesting patterns and unknown characteristics in the data: which processing approach fits best: batch, interactive, on-demand, or self-acting in real time?
To attack the data on that environment using those approaches: can we make each computer execute a simple analysis operation on its own partition and then aggregate the partial results from each node?
Traditionally, parallel workers may need to communicate, which causes integrity problems on shared data
The general approach is to block concurrent access to the data by synchronizing each worker
Synchronization is not easy to handle in today's multi-core and cluster environments
Hide the how-to-handle and expose only the what-to-do (execution is carried out by the execution framework): abstraction level elevation. Abstraction!!!

4 View the Datacenter as a Computer - by Jimmy Lin
Key Ideas
Scale "out", beyond "up": rather than upgrading the capacity of an existing system, attach additional disk arrays or nodes to scale out horizontally
Move processing to the data: ship tasks to the nodes that hold the data, because the cluster has limited bandwidth
Each node includes storage, computing power, and I/O, so adding new resources (scaling out) means a performance increase
Process data sequentially, avoid random access: seeks are expensive, while disk throughput is reasonable
Adapted from Jimmy Lin's Slide

5 The Data center is the computer.

6 MapReduce: A Big Data Processing Abstraction
MapReduce is the functional abstraction of two main operations: Map and Reduce
Iterate over a large number of records
Map: extract something of interest as (key, value) pairs from each record
Shuffle and sort the intermediate results
Reduce: aggregate the intermediate results and generate the final output

7 Map/Shuffle/Sort/Reduce
[Diagram: input records (ID, year, temperature, code, etc.) pass through map tasks, which emit (year, temperature) pairs such as (2015, 1), (2016, 2), (2017, 3), (2017, 6), (2015, 5), (2016, 7), (2017, 2), (2017, 8); shuffle and sort groups them into (2015, [1, 5]), (2016, [2, 7]), (2017, [2, 3, 6, 8]); reduce tasks then emit one (year, average temperature) pair per year.]
Adapted from Jimmy Lin's Slide

8 MapReduce is the minimally “interesting” dataflow!
[Diagram: input records r1 … rn flow through parallel map tasks and then parallel reduce tasks to produce output records r'1 … r'n.]
Adapted from Jimmy Lin's Slide

9 MapReduce (note we’re abstracting the “data-parallel” part)
Input: List[(K1, V1)]
map f: (K1, V1) ⇒ List[(K2, V2)], where f extracts (year, temperature):
[(2015, 1), (2016, 2)], [(2017, 3), (2017, 6)], [(2015, 5), (2016, 7), (2017, 2), (2017, 8)]
Shuffle/sort groups the pairs by year:
(2015, [1, 5]), (2016, [2, 7]), (2017, [2, 3, 6, 8])
reduce g: (K2, Iterable[V2]) ⇒ List[(K3, V3)], where g computes (year, average temperature)
Output: List[(K3, V3)] = [(2015, 3), (2016, 4.5), (2017, 4.75)]
Adapted from Jimmy Lin's Slide
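To make this typed dataflow concrete, here is a minimal in-memory sketch in plain Java (no Hadoop involved, assuming a recent JDK) that reproduces the same map, shuffle/sort (group-by-key), and reduce steps on the sample (year, temperature) records; the class name and structure are invented purely for illustration.

import java.util.*;
import java.util.stream.*;

// Minimal in-memory illustration of the map -> shuffle/sort -> reduce dataflow (not Hadoop code).
public class YearlyAverageSketch {
    public static void main(String[] args) {
        // "map" output: (year, temperature) pairs extracted from the input records
        List<Map.Entry<Integer, Integer>> mapped = List.of(
            Map.entry(2015, 1), Map.entry(2016, 2),
            Map.entry(2017, 3), Map.entry(2017, 6),
            Map.entry(2015, 5), Map.entry(2016, 7),
            Map.entry(2017, 2), Map.entry(2017, 8));

        // "shuffle/sort": group all values with the same key, e.g. 2017 -> [3, 6, 2, 8]
        Map<Integer, List<Integer>> grouped = mapped.stream()
            .collect(Collectors.groupingBy(Map.Entry::getKey, TreeMap::new,
                     Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // "reduce": aggregate each group into (year, average temperature)
        grouped.forEach((year, temps) -> {
            double avg = temps.stream().mapToInt(Integer::intValue).average().orElse(0);
            System.out.println(year + " -> " + avg);  // 2015 -> 3.0, 2016 -> 4.5, 2017 -> 4.75
        });
    }
}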

10 MapReduce Workflows: What’s wrong?
[Diagram: a chained workflow in which each MapReduce job reads its input from HDFS and writes its output back to HDFS (HDFS → map → reduce → HDFS → map → reduce → HDFS).]
Adapted from Jimmy Lin’s Slide

11 Want MM?
[Diagram: a map-only job chained to another map-only job, each reading from and writing to HDFS.]
Adapted from Jimmy Lin’s Slide

12 Want MRR?
[Diagram: a workflow with one map phase followed by two successive reduce phases.]
Adapted from Jimmy Lin’s Slide

13 Core Concept Behind MapReduce
Mapping the input data set into a collection of key-value pairs, and then
Reducing over all pairs that share the same key
Major benefit of MapReduce: its “shared-nothing” data processing platform
All mappers can work independently
No critical region or data is shared among mappers and reducers
The shared-nothing paradigm
Lets you write the map() and reduce() functions easily
Improves parallelism effectively and effortlessly
MapReduce is a foundation for solving big data problems with the modern and powerful Spark API (a higher-level abstraction), which provides:
Basic MapReduce
Other powerful operations: join(), filter(), cartesian(), and combineByKey()
Hadoop itself supports a limited set of primitives: map(), combine(), and reduce()

14 NameNode and DataNode Interaction in HDFS
Two data files: /user/chuck/data1 (blocks 1, 2, 3) and /user/james/data2 (blocks 4, 5)
The blocks are distributed among the DataNodes
Each block has three replicas, so the data remains available if one node crashes or becomes inaccessible
Each DataNode informs the NameNode of the blocks it is currently storing
Secondary NameNode (SNN): an assistant daemon that monitors the state of the HDFS cluster
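As a quick way to inspect this block and replica layout on a running cluster, the HDFS file system checking utility (the fsck command listed on slide 35) can print the blocks, their replicas, and the DataNodes that hold them; the path below is the example file from this slide and is only illustrative:

$ hadoop fsck /user/chuck/data1 -files -blocks -locations

Similarly, hadoop dfsadmin -report summarizes the DataNodes the NameNode currently knows about.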

15 JobTracker and TaskTracker Interaction
The computing daemons also follow a master/slave architecture:
JobTracker: controls the overall execution of a MapReduce job
TaskTrackers: manage the execution of individual tasks on each slave node
There is one JobTracker daemon per Hadoop cluster; it
partitions the work submitted by a client, and
assigns the tasks to the slave nodes
If a TaskTracker stops communicating, the JobTracker resubmits its tasks to other nodes in the cluster

16 Topology of a Hadoop Cluster
For a small cluster, the SNN can reside on one of the slave nodes.
For large clusters, separate the NameNode and the JobTracker onto two different machines.
Each slave machine hosts a DataNode and a TaskTracker, so tasks run on the same node where their data is stored.

17 Basic MapReduce Algorithm Architecture
Shuffle: the shuffle step is the only point at which nodes communicate with each other.

18 Hadoop Data Types
Hadoop needs to move keys and values across the cluster’s network.
This requires serializing the key/value pairs, so the Hadoop framework needs customized classes: wrapper classes for all the basic data types
Classes implementing the Writable interface can be used for values
Classes implementing the WritableComparable<T> interface can be used for keys or values
WritableComparable<T> is a combination of the Writable and java.lang.Comparable<T> interfaces
Users can create their own custom types as long as they implement the Writable (or WritableComparable<T>) interface:
readFields() // works with a DataInput stream to deserialize the class contents
write() // works with a DataOutput stream to serialize the class contents
compareTo() // required by the Comparable interface, used to order keys
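As a concrete illustration of these three methods, here is a minimal sketch of a custom key type implementing WritableComparable; the class name YearTemperaturePair and its fields are invented for this example and are not part of the slides.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key (year, temperature) usable as a Hadoop key or value.
public class YearTemperaturePair implements WritableComparable<YearTemperaturePair> {
    private int year;
    private int temperature;

    public YearTemperaturePair() { }                 // Hadoop needs a no-arg constructor
    public YearTemperaturePair(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialize the fields
        out.writeInt(year);
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialize the fields
        year = in.readInt();
        temperature = in.readInt();
    }

    @Override
    public int compareTo(YearTemperaturePair other) {         // ordering used when sorting keys
        int cmp = Integer.compare(year, other.year);
        return (cmp != 0) ? cmp : Integer.compare(temperature, other.temperature);
    }
}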

19 Wrapper Classes for Key/Value Pairs
Class: Description
BooleanWritable: wrapper for a standard Boolean variable
ByteWritable: wrapper for a single byte
DoubleWritable: wrapper for a Double
FloatWritable: wrapper for a Float
IntWritable: wrapper for an Integer
LongWritable: wrapper for a Long
Text: wrapper to store text using the UTF-8 format
NullWritable: placeholder when the key or value is not needed
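A tiny usage sketch (the class name is invented) showing how basic values are wrapped and unwrapped with these classes:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Wrapping and unwrapping basic values for use as Hadoop keys/values.
public class WrapperDemo {
    public static void main(String[] args) {
        Text word = new Text("hadoop");           // wraps a UTF-8 string
        IntWritable count = new IntWritable(42);  // wraps an int

        count.set(count.get() + 1);               // unwrap with get(), rewrap with set()
        System.out.println(word + " -> " + count.get());  // prints: hadoop -> 43
    }
}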

20 Best-Fit for MapReduce
Lots of input data
An environment with parallel and distributed computing, data storage, and data locality
Many independent tasks that require no synchronization
Availability of sorting and shuffling mechanisms
Demand for fault tolerance

21 MapReduce is…
NOT a programming language, but rather a framework for distributed applications
NOT a complete replacement for a relational database
NOT for real-time processing; it is designed for batch processing
NOT a solution for all software problems

22 Implementation of MapReduce
Runs on a large cluster of commodity machines and is highly scalable
A MapReduce application processes terabytes or petabytes of data on hundreds or thousands of machines
Easy to use because it hides the details of parallelization, fault tolerance, data distribution, and load balancing
Programmers can focus on writing the two key functions, map() and reduce()

23 MapReduce
Programmers specify two functions:
map (k, v) → [(k', v')]
reduce (k', [v']) → [(k', v')]
All values with the same key are reduced together
The execution framework (the MapReduce runtime) handles everything else…
Not quite… usually, programmers also specify:
partition (k', number of partitions) → partition for k'
Often a simple hash of the key, e.g., hash(k') mod n
Divides up the key space for parallel reduce operations
combine (k', [v']) → [(k', v'')]
Mini-reducers that run in memory after the map phase
Used as an optimization to reduce network traffic
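To make the partition function concrete, here is a minimal sketch of a custom Hadoop Partitioner that implements the hash(k') mod n idea for Text keys; the class name is invented for illustration, and in practice Hadoop's default HashPartitioner already does exactly this.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// hash(key) mod numPartitions: every occurrence of a key is routed to the same reducer.
public class SimpleHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the result is non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Wiring it into a job (inside the driver):
//   job.setPartitionerClass(SimpleHashPartitioner.class);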

24 MapReduce “Runtime”
Handles scheduling: assigns workers to map and reduce tasks
Handles “data distribution”: moves processes to the data
Handles synchronization: gathers, sorts, and shuffles intermediate data
Handles errors and faults: detects worker failures and restarts tasks
Everything happens on top of a distributed file system (later)

25 map() function
The master (name) node takes the input data set, partitions it into smaller data chunks, and distributes them to the worker (data) nodes.
The worker nodes apply the same transformation function to each data chunk, then pass the results back to the master node.
Mapper: map(): (Key1, Value1) → [(Key2, Value2)] // square brackets denote a list

26 reduce() function
The master (name) node shuffles and groups the received results by unique key; these values are then redistributed to the workers/slaves and combined via another transformation function.
Reducer: reduce(): (Key2, [Value2]) → [(Key3, Value3)] // square brackets denote a list

27 Additional MapReduce Functions
combine: applied to the local values for a given key2; acts as a local reducer on each worker node
shuffle: groups all pairs with the same key key2 together
Overall input: a list of (key1, value1) pairs; overall output: a list of (key3, value3) pairs

28 MapReduce Framework without Combiners (Local Reducer)
map(Key1, Value1) → [(Key2, Value2)]
reduce(Key2, [Value2]) → [(Key3, Value3)]

29 MapReduce Framework with Combiners (Local Reducer)
map(Key1, Value1) → [(Key2, Value2)]
combine(Key2, [Value2]) → [(Key2, Value2)]
reduce(Key2, [Value2]) → [(Key3, Value3)]
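In Hadoop, the combiner is just a Reducer class registered on the job. The excerpt below is not a complete program (see the full WordCount driver on slide 36, which does exactly this); reusing a sum-style reducer as the combiner is safe because its operation is associative and commutative and its input and output types match the map output types.

// driver excerpt: the combiner must map (Key2, [Value2]) -> (Key2, Value2)
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);  // local, per-map-task reduce
job.setReducerClass(IntSumReducer.class);   // cluster-wide reduce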

30 Writing your map() and reduce() functions
A MapReduce solution must scale out (by adding more commodity nodes to the system).
The functions will execute on basic commodity servers with at most 32 GB or 64 GB of RAM (though that capacity increases over time).
When should you use MapReduce?
Big data that forms a collection of independent partitions: think MapReduce
Grouping or aggregating a lot of data: MapReduce works well
Graph algorithms? Because of their iterative nature, not MapReduce; use Apache Giraph or Apache Spark GraphX instead
Rule of thumb: use MapReduce for
big, independent partitions with no shared access, beyond a single machine's memory capacity
CPU-bound computation with processor-intensive jobs

31 Hadoop MapReduce Program Components
Driver Program
Mapper Class
Reducer Class

32 Driver Program
Identifies the input and output directories
Plugs in the mapper and reducer by registering the mapper and reducer classes

public class MyMapReduceJobDriver {
    public static void main(String[] args) throws Exception {
        MyMapReduceJobDriver driver = new MyMapReduceJobDriver();
        driver.run(args);
    }

    void run(String[] args) throws Exception {
        // prepare input/output and additional parameters
        String input = args[0];
        String output = args[1];

        // create a Job
        Job job = new Job( … );

        // define the input/output directories for the MapReduce framework
        FileInputFormat.addInputPath(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));

        // plug in your Mapper and Reducer classes
        job.setMapperClass(MyMapperClass.class);
        job.setReducerClass(MyReducerClass.class);

        // submit your MapReduce job
        job.waitForCompletion(true);
    }
}

33 Mapper Class in Hadoop
map() function: transforms individual records into intermediate records
A given input pair → zero or many output pairs
The Mapper class is a generic type with four formal type parameters: the input key, input value, output key, and output value types of the map function (the Reducer class is parameterized similarly)

public class MyMapperClass extends … {
    // called once at the beginning of the map task (optional)
    setup() { … }
    // called once for each key-value pair in the input split
    map(key, value) { … }
    // called once at the end of the map task (optional)
    cleanup() { … }
}
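As a concrete sketch of this lifecycle (the class name, the "filter.year" configuration key, and the record format are invented for illustration), a mapper that reads a year from the job configuration in setup() and emits (year, temperature) pairs only for that year might look like:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: input lines of the form "id,year,temperature,code".
public class YearFilterMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private int wantedYear;

    @Override
    protected void setup(Context context) {
        // read a job parameter once per map task
        wantedYear = context.getConfiguration().getInt("filter.year", 2017);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        int year = Integer.parseInt(fields[1].trim());
        int temperature = Integer.parseInt(fields[2].trim());
        if (year == wantedYear) {
            context.write(new IntWritable(year), new IntWritable(temperature));
        }
    }

    @Override
    protected void cleanup(Context context) {
        // nothing to release in this sketch; a real mapper might close resources here
    }
}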

34 Reducer Class in Hadoop
reduce() function: reduces the set of intermediate values that share a key to a (typically smaller) list of values
(key, {value1, value2, …, valueN})
Before the reduce() function is called, three supporting steps are performed:
Shuffle: copies the sorted output from each Mapper
Sort: merge-sorts the Reducer inputs by key
Secondary Sort: optionally sorts the values of each reducer input, which otherwise arrive unsorted

public class MyReducerClass extends … {
    // called once at the start of the reduce task
    setup() { … }
    // called once for each reduce key
    // input: already sorted and grouped by shuffle and sort
    reduce(key, values) {
        foreach (v : values) {
            process(key, v)
        }
    }
    // called once at the end of the reduce task
    cleanup() { … }
}
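Continuing the (year, temperature) example from slide 9, a minimal reducer that averages the values for each key might look like this (the class name and the choice of FloatWritable for the output are illustrative assumptions):

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

// (year, [temperatures]) -> (year, average temperature)
public class AverageTemperatureReducer
        extends Reducer<IntWritable, IntWritable, IntWritable, FloatWritable> {

    @Override
    protected void reduce(IntWritable year, Iterable<IntWritable> temperatures, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        int count = 0;
        for (IntWritable t : temperatures) {
            sum += t.get();
            count++;
        }
        if (count > 0) {
            // e.g. 2017: (2 + 3 + 6 + 8) / 4 = 4.75
            context.write(year, new FloatWritable((float) sum / count));
        }
    }
}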

35 Hadoop Script: $HADOOP_HOME/bin/hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
namenode -format       format the DFS filesystem
secondarynamenode      run the DFS secondary namenode
namenode               run the DFS namenode
datanode               run a DFS datanode
dfsadmin               run a DFS admin client
fsck                   run a DFS filesystem checking utility
fs                     run a generic filesystem user client
balancer               run a cluster balancing utility
jobtracker             run the MapReduce job tracker node
pipes                  run a Pipes job
tasktracker            run a MapReduce task tracker node
job                    manipulate MapReduce jobs
version                print the version
jar <jar>              run a jar file (a Java Hadoop program): bin/hadoop jar <jar>
distcp <srcurl> <desturl>   copy files or directories recursively
archive -archiveName NAME <src> <dest>   create a Hadoop archive
daemonlog              get/set the log level for each daemon
or CLASSNAME           run the class named CLASSNAME

36 Lab III-1: WordCount.java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

37 Lab III-1: Build and Execute MapReduce
Run the example jar
$ cd $HADOOP_INSTALL/share/hadoop/mapreduce
$ hadoop jar hadoop-mapreduce-examples-*.jar grep ~/input ~/output 'dfs[a-z.]+'
$ hadoop fs -cat ~/output/part-r-00000
Environment variable setup
$ export JAVA_HOME=…
$ export PATH=${JAVA_HOME}/bin:${PATH}
$ export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
Compile WordCount.java and create your jar
$ bin/hadoop com.sun.tools.javac.Main WordCount.java
$ jar cf wc.jar WordCount*.class
Create the input folder
$ mkdir input
$ cp ~/file* ./input

38 Lab III-1: Build and Execute MapReduce
Check the input files
$ hadoop fs -ls ./input
./input/file01
./input/file02
$ hadoop fs -cat ./input/file01
$ hadoop fs -cat ./input/file02
Run the Hadoop MapReduce job
$ hadoop jar wc.jar WordCount ./input ./output
Inspect the output
$ hadoop fs -cat ./output/part-r-00000
Bye 6
Goodbye 4
Hadoop1 3
Hadoop2 2
Hadoop3 3
Hello 10
World 6
World1 3
World2 2
World3 1

39 Spark’s Transformations and Actions

40 Spark Programming: Counting “error” and “exception” Keywords
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class SampleSparkProgram {
    public static void main(String[] args) {
        String logFile = "/home/tiger/log/logs.txt";
        SparkConf conf = new SparkConf().setAppName("count errors");
        JavaSparkContext context = new JavaSparkContext(conf);
        JavaRDD<String> logData = context.textFile(logFile).cache();

        long numOfErrors = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("error"); }
        }).count();

        long numOfExceptions = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("exception"); }
        }).count();

        System.out.println("errors: " + numOfErrors + ", exceptions: " + numOfExceptions);
        context.close();
    }
}
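To run a program like this, the compiled jar is typically submitted to a cluster (or run locally) with spark-submit; the jar file name below is an assumption:

$ spark-submit --class SampleSparkProgram --master local[2] sample-spark-program.jar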

41 Key-Value Pairs in Spark
scala.Tuple2, scala.Tuple3, … (TupleN) objects are the foundation for key-value pairs
Tuple3<String, Integer, Integer>: a composite value with 3 fields
Tuple2<String, String>: a composite value with 2 fields
In Java:
Tuple2<String, String> k2 = new Tuple2<String, String>("s1", "s2");
Tuple3<String, Integer, Integer> v3 = new Tuple3<String, Integer, Integer>("a", 1, 2);
Custom key or value types: the class implements the java.io.Serializable interface
A custom value class lets you use it as a key or value in your Spark programs
public class MyCustomValue implements java.io.Serializable {
    int id;
    String name;
    String address;
    char gender;
    <methods…>
}
Use the MyCustomValue class as a key or value:
JavaSparkContext context = new JavaSparkContext();
JavaRDD<String> lines = context.textFile("./data.txt");
JavaPairRDD<String, MyCustomValue> pairs = lines.mapToPair( … );
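As a sketch of what that elided mapToPair( … ) call could look like (the "name,id" input format and the lambda body are illustrative assumptions, and MyCustomValue is assumed to have a matching constructor):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// Assuming each input line looks like "name,id", key the records by name.
JavaPairRDD<String, MyCustomValue> pairs = lines.mapToPair(
    new PairFunction<String, String, MyCustomValue>() {
        public Tuple2<String, MyCustomValue> call(String line) {
            String[] parts = line.split(",");
            MyCustomValue value = new MyCustomValue(Integer.parseInt(parts[1]), parts[0]);
            return new Tuple2<String, MyCustomValue>(parts[0], value);
        }
    });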

42 Transformations
Return a new, modified RDD based on the original:
foldByKey(), map(), filter(), sample(), union()

43 Actions
Return a value based on some computation performed on an RDD:
countByKey(), reduce(), count(), first(), foreach()
Using transformations and actions, complex DAGs can be created to solve MapReduce problems and beyond, as the sketch below shows.
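For instance, a minimal Spark word count (a sketch assuming Spark 2.x's Java API; the input and output paths are placeholders) chains transformations such as flatMap, mapToPair, and reduceByKey, and only runs when a final action such as saveAsTextFile triggers the DAG:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark word count");
        JavaSparkContext context = new JavaSparkContext(conf);

        // transformations: build up the DAG lazily
        JavaRDD<String> lines = context.textFile("./input");
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator());
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        // action: triggers execution of the whole DAG
        counts.saveAsTextFile("./output");
        context.close();
    }
}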

44 References
Jimmy Lin (University of Waterloo) and Chris Dyer, Data-Intensive Text Processing with MapReduce.
Mahmoud Parsian, Data Algorithms: Recipes for Scaling Up with Hadoop and Spark, O'Reilly.
Mahmoud Parsian, Introduction to MapReduce: algorithms-book/blob/master/src/main/java/org/dataalgorithms/chapB09/charcount/Introduction-to-MapReduce.pdf
Chuck Lam, Hadoop in Action.
Tom White, Hadoop: The Definitive Guide, 4th Ed., O'Reilly, 2015.
Apache Hadoop, MapReduce Tutorial: client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Matthew Rathbone, Apache Spark Java Tutorial with Code Examples, 2015.

