AirlineCount
CSCE 587 Fall 2017
Preliminary steps in the VM
Load the data into the Linux filesystem of the VM: use wget to download it, or use vi to create a text file.
Preliminary steps in the VM
Transfer the file from the VM's Linux filesystem to the Hadoop filesystem:
hadoop fs -put test_25K.csv /user/share/student/
Convince yourself by checking HDFS:
hadoop fs -ls /user/share/student/
Ex:
[student@sandbox ~]$ hadoop fs -ls /user/share/student/
Found 3 items
-rw-r--r--   1 student hdfs    2391989 2016-03-24 10:43 /user/share/student/test_25K.csv
drwxr-xr-x   - student hdfs          0 2015-11-09 18:41 /user/share/student/out
drwxr-xr-x   - student hdfs          0 2015-11-09 18:47 /user/share/student/out2
Preliminary RStudio steps
# Set environment variables
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.0.0-2557.jar")
# Load the following packages in this order
library(rhdfs)
library(rmr2)
# Initialize the connection from RStudio to Hadoop
hdfs.init()
Our second mapreduce program
# Doing a simple mapreduce on airline data
# Our map function, which returns the keyval <airline_ID, 1>
map1 = function(k, flights) {
  return(keyval(as.character(flights[[9]]), 1))
}
# Our reduce function, which sums up the flights for each airline
reduce1 = function(carrier, counts) {
  keyval(carrier, sum(counts))
}
Our second mapreduce program
# Our mapreduce function, which invokes map1 and reduce1 and parses
# the input file, which is expected to be comma delimited
mr1 = function(input, output = NULL) {
  mapreduce(input = input,
            output = output,
            input.format = make.input.format("csv", sep=","),
            map = map1,
            reduce = reduce1)
}
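Conceptually, the map1/reduce1 pair emits one (carrier, 1) pair per flight record, then sums the 1s for each carrier after the shuffle groups them by key. A rough in-memory sketch of that grouping-and-summing in plain Java (no Hadoop involved; the carrier codes below are made-up sample data, not results from the class dataset):

```java
import java.util.HashMap;
import java.util.Map;

public class CarrierCountSketch {
    public static void main(String[] args) {
        // Hypothetical sample of the carrier column (flights[[9]] in the R job)
        String[] carriers = {"AA", "DL", "AA", "UA", "DL", "AA"};

        // The "map" step emits (carrier, 1); the "reduce" step sums per key.
        // A HashMap merge does both in one pass for this in-memory sketch.
        Map<String, Integer> counts = new HashMap<>();
        for (String c : carriers) {
            counts.merge(c, 1, Integer::sum);
        }
        System.out.println(counts);  // counts per carrier: AA=3, DL=2, UA=1
    }
}
```

The real job does the same arithmetic, but the grouping happens across the cluster in the shuffle phase rather than in one process's memory.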
Submitting our first mapreduce job
# Specify the path
hdfs.root = '/user/share/student'
# Append the data filename to the pathname
hdfs.data = file.path(hdfs.root, 'testDataNoHdr.csv')
# Append the output filename to the pathname
hdfs.out = file.path(hdfs.root, 'out')
Submitting our first mapreduce job
# Invoke your map-reduce functions on your input file and output file
out = mr1(hdfs.data, hdfs.out)
# If "out" already exists, then the mapreduce job will fail and you will
# have to delete "out" first:
# hadoop fs -rmr /user/share/student/out
VM: Check for changes to HDFS
[student@sandbox ~]$ hadoop fs -ls /user/share/student/
Found 4 items
-rw-r--r--   1 student hdfs       1746 2016-03-03 12:36 /user/share/student/g.txt
drwxr-xr-x   - student hdfs          0 2015-11-09 18:41 /user/share/student/out
drwxr-xr-x   - student hdfs          0 2015-11-09 18:47 /user/share/student/out2
drwxr-xr-x   - student hdfs          0 2016-03-03 12:56 /user/share/student/out
RStudio: Fetch the results from HDFS
results = from.dfs(out)
results.df = as.data.frame(results, stringsAsFactors=F)
colnames(results.df) = c('Carrier', 'Flights')
# head(results.df)
results.df
Java Version: Wordcount Program Outline
package org.myorg;
import java.io.IOException;
...
public class WordCount {
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException { .... }
  }
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException { .... }
  }
  public static void main(String[] args) throws Exception { .... }
}
Java Version: Wordcount Program
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

This portion lists all of the classes that our program might reference.
Java Version: Airlinecount Program
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

This portion lists all of the classes that our program might reference.
No change from wordcount.
The new program will be AirlineCount
Change from:
public class WordCount { ........ }
To:
public class AirlineCount { ........ }
Java Version: Wordcount Program
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}
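To see what the wordcount mapper's tokenizing step does on its own, here is a small stand-alone sketch (plain Java, no Hadoop; the sample sentence is made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeSketch {
    public static void main(String[] args) {
        String line = "the quick brown fox the fox";
        // StringTokenizer splits on whitespace by default, so each token
        // becomes a word that the mapper would emit with count 1
        StringTokenizer tokenizer = new StringTokenizer(line);
        List<String> words = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            words.add(tokenizer.nextToken());
        }
        System.out.println(words);  // [the, quick, brown, fox, the, fox]
    }
}
```

Note that duplicates are emitted as-is: deduplication and counting are the reducer's job, not the mapper's.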
Java Version: Airlinecount Program
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String[] column = line.split(",");
    word.set(column[8]);
    context.write(word, one);
  }
}
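The only change from wordcount is the tokenizing step: instead of splitting on whitespace, the mapper splits each CSV record on commas and emits column[8], the ninth field, matching flights[[9]] in the R version (Java arrays are zero-indexed). A stand-alone sketch with a hypothetical record in the airline-data column order:

```java
public class SplitSketch {
    public static void main(String[] args) {
        // Made-up airline record; only the field positions matter here
        String line = "2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW";
        String[] column = line.split(",");
        // column[8] is the 9th CSV field, the carrier code --
        // the same field the R job selects with flights[[9]]
        System.out.println(column[8]);  // WN
    }
}
```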
Java Version: Wordcount Program
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
Java Version: Airlinecount Program
No change!
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
Java Version: Wordcount Program Driver
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "WordCount");
  job.setJarByClass(WordCount.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(Map.class);
  job.setReducerClass(Reduce.class);
  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.waitForCompletion(true);
}
Java Version: Airlinecount Program Driver
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "AirlineCount");
  job.setJarByClass(AirlineCount.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(Map.class);
  job.setReducerClass(Reduce.class);
  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.waitForCompletion(true);
}
Creating the AirlineCount Jar File
# 1) Create a directory airlinecount and move to that directory
mkdir airlinecount
cd airlinecount
# 2) Get the source file AirlineCount.java
#    (you could use scp to copy it)
# 3) Create a subdirectory to store the class files
mkdir airlinec
Creating the AirlineCount Jar File
# 4) Compile AirlineCount.java to create the class files
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop/lib/commons-cli-1.2.jar -d ./airlinec/ *.java
N.B. This is all one line: no carriage returns until the very end.
Creating the AirlineCount Jar File
# 5) Create the JAR file from these classes
jar -cvf AirlineCount.jar -C /home/student/airlinecount/airlinec .
Launch our new program!
# 8) Launch the AirlineCount program from Hadoop
# hadoop jar jarfile program input output
hadoop jar /home/student/airlinecount/AirlineCount.jar org/myorg/AirlineCount /user/share/student/test_25K.csv /user/share/student/outputalc