
Lecture 18 (Hadoop: Programming Examples)




1 Lecture 18 (Hadoop: Programming Examples)
CSE 482: Big Data Analysis Lecture 18 (Hadoop: Programming Examples)

2 Distributed Word Count Example
[Diagram: distributed word count data flow]
Mapper input: key-value pairs read from the input data files, e.g. (0, This is a cat!), (14, cat is ok), (24, walk the dog)
Mapper output: key-value pairs produced by the map() function, e.g. (this, 1), (is, 1), (a, 1), (cat, 1)
Sorting, partitioning, shuffling: groups the mapper output by key
Reducer input: (key, list of values) pairs, e.g. (a, [1, 1, 1, 1, 1, 1]), (and, [1, 1]), (be, [1, 1, 1]), (cat, [1, 1, 1, 1]), (dog, [1, 1])
Reducer output: key-value pairs produced by the reduce() function and written to the output file (part-r-00000), e.g. (a, 6), (and, 2), (be, 3), (cat, 4), (dog, 2)

3 Basic Template for Hadoop Program
import org.apache.hadoop.*;   // specify the Hadoop libraries used
import java.util.*;           // specify the Java libraries used

public class ClassName {      // name of the main class

  public static class MapperClass
      extends Mapper<InputKeyType, InputValueType, OutputKeyType, OutputValueType> {
    ...
  }

  public static class ReducerClass
      extends Reducer<InputKeyType, InputValueType, OutputKeyType, OutputValueType> {
    ...
  }

  public static void main(String[] args) throws Exception {
    ...
  }
}

4 Hadoop Data Types for Keys & Values
Keys and values cannot be arbitrary types, because the Hadoop framework must serialize the key/value pairs in order to move them across the cluster's network
Only classes that support this kind of serialization can function as keys or values in the framework:
- Classes that implement the WritableComparable interface can be used as either keys or values
- Classes that implement the Writable interface can be used as the data type for values

5 Hadoop Data Types for Keys & Values
These classes are defined in the org.apache.hadoop.io package
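Some commonly used classes from this package (a partial list, based on the standard Hadoop API):

  import org.apache.hadoop.io.BooleanWritable;  // wraps a boolean
  import org.apache.hadoop.io.IntWritable;      // wraps an int
  import org.apache.hadoop.io.LongWritable;     // wraps a long
  import org.apache.hadoop.io.FloatWritable;    // wraps a float
  import org.apache.hadoop.io.DoubleWritable;   // wraps a double
  import org.apache.hadoop.io.Text;             // wraps a String (UTF-8 text)
  import org.apache.hadoop.io.NullWritable;     // placeholder when no key or value is needed

These classes implement WritableComparable, so they can be used as either keys or values.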

6 Hadoop Data Types for Keys & Values
Use the get and set functions to convert variables between Hadoop and standard Java data types. Example:

  IntWritable count = new IntWritable(50);
  int value = count.get();          // value = 50

  Text word = new Text();
  String str = new String("twitter");
  word.set(str);                    // word now holds "twitter"

7 Hadoop Program for WordCount
[Diagram: WordCount data flow with Hadoop data types]
Mapper input (Object, Text): e.g. (0, This is a cat!), (14, cat is ok), (24, walk the dog)
Mapper output (Text, IntWritable): produced by the map() function, e.g. (this, 1), (is, 1), (a, 1), (cat, 1)
Reducer input (Text, IntWritable): e.g. (a, [1, 1, 1, 1, 1, 1]), (and, [1, 1]), (be, [1, 1, 1]), (cat, [1, 1, 1, 1]), (dog, [1, 1])
Reducer output (Text, IntWritable): produced by the reduce() function, e.g. (a, 6), (and, 2), (be, 3), (cat, 4), (dog, 2)

8 Hadoop Program for WordCount
import org.apache.hadoop.*;   // specify the Hadoop libraries used
import java.util.*;           // specify the Java libraries used

public class WordCount {      // name of the main class

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    ...
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    ...
  }

  public static void main(String[] args) throws Exception {
    ...
  }
}

9 Mapper Class You only need to implement the map() function
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    ...
  }
}

The Context object is used to interact with the Hadoop system, e.g., to write key-value pairs as mapper/reducer output: context.write(key, value)

10 WordCount Example for Mapper
By default, Hadoop assumes each input to the map() function is a line from a file
The key is the byte offset of the line from the beginning of the file, and the value is the content of the line itself
The default input format can be changed
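A minimal sketch of the mapper, following the standard Hadoop WordCount example (the field names one and word are as in that example):

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Split the line into tokens and emit (token, 1) for each token
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }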

11 Reducer Class You only need to implement the reduce() function
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    ...
  }
}

values is an Iterable object. In the reduce() function, you will often iterate through the values while applying some aggregation function.

12 WordCount Example for Reducer
The reduce() function will:
- Convert the sum from int to IntWritable
- Write the output of the reducer
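A minimal sketch of the reducer, following the standard Hadoop WordCount example:

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {   // iterate through the list of counts for this word
        sum += val.get();
      }
      result.set(sum);                   // convert the sum from int to IntWritable
      context.write(key, result);        // write the output of the reducer
    }
  }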

13 Main Program

14 Main Program
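A minimal sketch of the main program, following the standard Hadoop WordCount example (the job name "word count" is illustrative):

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input path from the first argument
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory from the second argument
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }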

15 Example: Count # of Wikipedia Edits
Count the number of edits for each article
Input data: wiki_edit.txt
Output: the number of edits for each Wikipedia article

16 Example: Count # of Wikipedia Edits
Count the number of edits for each article
Input data: wiki_edit.txt
Mapper output: key = title of article, value = 1
Reducer output: key = title of article, value = count (# of edits)

17 countEdits.java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class countEdits {
  // This is the main class for the Wikipedia edit counting program
}

18 Output type of Mapper must be consistent with input type of Reducer
countEdits.java

public class countEdits {

  public static class EditMapper
      extends Mapper<Object, Text, Text, IntWritable> {      // input types: Object, Text; output types: Text, IntWritable
    ...
  }

  public static class EditReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> { // input types: Text, IntWritable; output types: Text, IntWritable
    ...
  }

  public static void main(String[] args) throws Exception {
    ...
  }
}

The output type of the Mapper (Text, IntWritable) must be consistent with the input type of the Reducer.

19 Mapper for Counting # of Edits
The map() function will process the input split one record at a time
Each record is one line in the input file (by default)
You can change the definition of a "record" (e.g., to handle multiple lines) by modifying the InputFormat of a Hadoop job (to be discussed in the next lecture)
The map() function will:
- Read each line and split it into tokens
- Identify the article_title (i.e., the 4th token in the line)
- Write (article_title, 1) as the output of the mapper

20 Mapper for Counting # of Edits
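A minimal sketch of what the map() function could look like, following the steps on the previous slide; the variable names and the assumption that fields are separated by whitespace are illustrative, not the exact code from the lecture:

  public static class EditMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text articleTitle = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Split the line into tokens (assumes whitespace-separated fields)
      String[] tokens = value.toString().split("\\s+");
      if (tokens.length >= 4) {
        articleTitle.set(tokens[3]);        // the article title is the 4th token
        context.write(articleTitle, one);   // emit (article_title, 1)
      }
    }
  }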

21 Mapper for Counting # of Edits
Make sure the data types for key, value pairs match

22 Reducer for Counting # of Edits
The input key for the reducer can be any class that implements WritableComparable
The input values for the reducer are passed as an Iterable object (used for lists)
- So if the output value of the Mapper is IntWritable, the input values for the Reducer should be Iterable<IntWritable>
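A minimal sketch of the reduce() function; it is essentially the same as the IntSumReducer in WordCount, so the variable names here are illustrative:

  public static class EditReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {   // iterate through the 1's emitted for this article
        sum += val.get();
      }
      result.set(sum);                   // convert the sum from int to IntWritable
      context.write(key, result);        // write (article_title, number of edits)
    }
  }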

23 countEdits Main Program
public class countEdits {
  ...
}

The combiner class is set to the same class as the reducer class.
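A minimal sketch of the main program, modeled on the standard Hadoop driver; the job name string and the use of command-line arguments for the input/output paths are assumptions:

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "count edits");
    job.setJarByClass(countEdits.class);
    job.setMapperClass(EditMapper.class);
    job.setCombinerClass(EditReducer.class);   // combiner class is the same as the reducer class
    job.setReducerClass(EditReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }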

24 Example: Counting # of Wikipedia Edits
STEP 1: Download code and data
STEP 2: Compile and archive the code
STEP 3: Upload your input data to HDFS
STEP 4: Run the Hadoop job
STEP 5: Display and download results
STEP 6: Terminate cluster

25 Step 1: Download Code and Data
You can download the data and the source code from the class website

26 Step 2: Compilation Set the corresponding environment variables
To compile the code and then create a jar file to archive the class files (example commands shown below):
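A sketch of what these commands might look like, based on the standard Hadoop tutorial; the environment-variable values and the file names (countEdits.java, countEdits.jar) depend on your installation and code:

  export JAVA_HOME=/usr/lib/jvm/default-java            # path depends on your installation
  export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar    # so hadoop can find the Java compiler

  hadoop com.sun.tools.javac.Main countEdits.java       # compile the code
  jar cf countEdits.jar countEdits*.class               # archive the class files into a jar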

27 Step 3: Upload data to HDFS
You can upload the file using either:
  hadoop fs -copyFromLocal <source> <destination>
or:
  hadoop fs -put <source> <destination>
Note: <source> is the source file located in your local directory; <destination> is located on HDFS
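For example, assuming wiki_edit.txt is in the current local directory and should be placed in your HDFS home directory (the paths are illustrative):

  hadoop fs -put wiki_edit.txt wiki_edit.txt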

28 Step 3: Upload data to HDFS
View the file in HDFS to make sure it is correct. Each record contains fields such as the name of the editor and the title of the edited Wikipedia article.

29 Step 4: Run the Job
Syntax: hadoop jar <jarfile> <classfile> <arguments>
The arguments specify the input file and the output directory.
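For this example, the command might look like the following (the jar and directory names are assumptions based on the earlier steps):

  hadoop jar countEdits.jar countEdits wiki_edit.txt output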

30 Step 5: Display and Download Result
Display the result
_SUCCESS: the program was executed successfully
part-r-00000: output file of the reducer
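For example, assuming the output directory is named output (as in the previous step):

  hadoop fs -ls output                   # should show _SUCCESS and part-r-00000
  hadoop fs -cat output/part-r-00000     # display the (article title, edit count) pairs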

31 Step 5: Display and Download Result
Download the results
You can then ftp the results back to arctic.cse.msu.edu or any other CSE server

32 Step 6: Terminate Cluster
Remember to terminate the EMR cluster

