Airlinecount CSCE 587 Fall 2017.

Preliminary steps in the VM
Load the data into the Linux filesystem of the VM: download it with wget, or use vi to create a text file.

Preliminary steps in the VM
Transfer the file from the VM's Linux filesystem to the Hadoop filesystem:
    hadoop fs -put test_25K.csv /user/share/student/
Convince yourself by checking HDFS:
    hadoop fs -ls /user/share/student/
Ex:
    ~]$ hadoop fs -ls /user/share/student/
    Found 3 items
    -rw-r--r student hdfs :43 /user/share/student/test_25K.csv
    drwxr-xr-x - student hdfs :41 /user/share/student/out
    drwxr-xr-x - student hdfs :47 /user/share/student/out2

Preliminary Rstudio steps
# Set environment variables
# (the HDP version directory after /usr/hdp/ and the jar's version suffix are missing here;
#  use the ones from your installation)
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/hdp/ /hadoop-mapreduce/hadoop-streaming jar")
# Load the following packages in the following order
library(rhdfs)
library(rmr2)
# Initialize the connection from RStudio to Hadoop
hdfs.init()

Our second mapreduce program
# Doing simple mapreduce on airline data
# Our map function, which returns the keyval <airline_ID, 1>
# (column 9 of the input holds the airline ID)
map1 = function(k, flights) {
  return(keyval(as.character(flights[[9]]), 1))
}
# Our reduce function, which sums up the flights for each airline
reduce1 = function(carrier, counts) {
  keyval(carrier, sum(counts))
}

Our second mapreduce program
# Our mapreduce function, which invokes map1 and reduce1 and parses
# the input file, expecting it to be comma delimited
mr1 = function(input, output = NULL) {
  mapreduce(input = input,
            output = output,
            input.format = make.input.format("csv", sep=","),
            map = map1,
            reduce = reduce1)
}

Submitting our first mapreduce job
# Specify the path
hdfs.root = '/user/share/student'
# Append the data filename to the pathname
hdfs.data = file.path(hdfs.root, 'testDataNoHdr.csv')
# Append the output filename to the pathname
hdfs.out = file.path(hdfs.root, 'out')

Submitting our first mapreduce job
# Invoke your map-reduce functions on your input file and output file
out = mr1(hdfs.data, hdfs.out)
# If "out" already exists, then the mapreduce job will fail and you will
# have to delete "out" first:
# hadoop fs -rm -r /user/share/student/out

VM: Check for changes to HDFS
~]$ hadoop fs -ls /user/share/student/
Found 4 items
-rw-r--r-- 1 student hdfs :36 /user/share/student/g.txt
drwxr-xr-x - student hdfs :41 /user/share/student/out
drwxr-xr-x - student hdfs :47 /user/share/student/out2
drwxr-xr-x - student hdfs :56 /user/share/student/out

RStudio: Fetch the results from HDFS
results = from.dfs(out)
results.df = as.data.frame(results, stringsAsFactors=FALSE)
colnames(results.df) = c('Carrier', 'Flights')
# head(results.df)
results.df

Java Version: Wordcount Program Outline
package org.myorg;
import java.io.IOException;
...
public class WordCount {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException { ... }
    }
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException { ... }
    }
    public static void main(String[] args) throws Exception { ... }
}

Java Version: Wordcount Program
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
This portion lists all of the classes that our program might reference.

Java Version: Airlinecount Program
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
This portion lists all of the classes that our program might reference.
No change from WordCount.

The new program will be AirlineCount
Change from
    public class WordCount { ... }
to
    public class AirlineCount { ... }

// WordCount mapper: emits <word, 1> for every token in the input line
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

// AirlineCount mapper: emits <carrier, 1> for every input line
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Split the comma-delimited record into its columns
        String[] column = line.split(",");
        // column[8] is the 9th field (zero-based index), the same column as flights[[9]] in R
        word.set(column[8]);
        context.write(word, one);
    }
}
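The mapper above indexes column[8] directly, which throws an ArrayIndexOutOfBoundsException on any row with fewer than nine fields. The following is a minimal sketch of a defensive version of that extraction, written in plain Java so it can be tried without Hadoop. The class name CarrierField, the method carrierOf, and the sample rows are hypothetical and are not part of the original AirlineCount source.

```java
import java.util.Optional;

// Hypothetical helper (an assumption for illustration, not from the original program).
class CarrierField {
    // Zero-based index of the carrier column, matching column[8] in the mapper.
    static final int CARRIER_INDEX = 8;

    // Returns the carrier field of a comma-delimited row, or an empty Optional
    // when the row has fewer than nine fields or the field is blank.
    static Optional<String> carrierOf(String line) {
        // A limit of -1 keeps trailing empty fields, so the field count is reliable.
        String[] column = line.split(",", -1);
        if (column.length <= CARRIER_INDEX) {
            return Optional.empty();
        }
        String carrier = column[CARRIER_INDEX].trim();
        return carrier.isEmpty() ? Optional.empty() : Optional.of(carrier);
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        System.out.println(carrierOf("2008,1,3,4,2003,1955,2211,2225,WN,335")); // Optional[WN]
        System.out.println(carrierOf("too,short"));                             // Optional.empty
    }
}
```

Inside map(), one way to use such a check would be to call context.write only when a carrier is present, so that a single malformed row does not fail the whole job.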

// WordCount reducer: sums the counts emitted for each key
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

No change!
// AirlineCount reducer: identical to the WordCount reducer
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
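To check the map and reduce logic before submitting a job, the same computation can be imitated on a few lines in plain Java. This is only a sketch of what the framework does conceptually (map, group by key, reduce), not Hadoop code. The class name LocalAirlineCount and the sample rows are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local stand-in for the Map and Reduce classes; runs without a cluster.
class LocalAirlineCount {
    static Map<String, Integer> countByCarrier(List<String> lines) {
        // Map + shuffle: emit a 1 for each row and group the 1s under the carrier key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] column = line.split(",");
            if (column.length > 8) { // skip rows that have no 9th field
                grouped.computeIfAbsent(column[8], k -> new ArrayList<>()).add(1);
            }
        }
        // Reduce: sum the grouped values for each carrier.
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            totals.put(entry.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up rows; only the 9th field matters here.
        List<String> rows = Arrays.asList(
                "0,1,2,3,4,5,6,7,AA,9",
                "0,1,2,3,4,5,6,7,WN,9",
                "0,1,2,3,4,5,6,7,AA,9");
        System.out.println(countByCarrier(rows)); // {AA=2, WN=1}
    }
}
```

On the cluster the grouping step is performed by Hadoop between the map and reduce phases, which is why the Reduce class only has to sum the values it receives for one key.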

Java Version: Wordcount Program Driver
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "WordCount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    // args[0] is the HDFS input path, args[1] the HDFS output directory
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}

Java Version: Airlinecount Program Driver
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "AirlineCount");
    job.setJarByClass(AirlineCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    // args[0] is the HDFS input path, args[1] the HDFS output directory
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}

Creating the AirlineCount Jar File
# 1) Create a directory airlinecount and move to that directory
mkdir airlinecount
cd airlinecount
# 2) Get the source file AirlineCount.java
#    (you could use scp to copy it)
# 3) Create a subdirectory to store the class files
mkdir airlinec

Creating the AirlineCount Jar File
# 4) Compile AirlineCount.java to create the class files
javac -classpath /usr/hdp/ /hadoop/hadoop-common jar:/usr/hdp/ /hadoop-mapreduce/hadoop-mapreduce-client-core jar:/usr/hdp/ /hadoop/lib/commons-cli-1.2.jar -d ./airlinec/ *.java
N.B. This is all one line: no carriage returns until the very end.
(The HDP version directory after /usr/hdp/ and the version suffixes of the first two jars are missing here; use the ones from your installation.)

Creating the AirlineCount Jar File
# 5) Create the JAR file from these classes
jar -cvf AirlineCount.jar -C /home/student/airlinecount/airlinec .

Launch our new program!
# 6) Launch the AirlineCount program from Hadoop
# hadoop jar jarfile program input output
hadoop jar /home/student/airlinecount/AirlineCount.jar org/myorg/AirlineCount /user/share/student/test_25K.csv /user/share/student/outputalc
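Because the driver uses TextOutputFormat, each line the job writes is a key and a value separated by a tab. The sketch below shows one possible way to turn such lines into carrier counts sorted by number of flights, much as the R version builds results.df. The class name CarrierCounts, the method parseAndSort, and the sample counts are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical post-processing helper (an assumption, not part of the original program).
class CarrierCounts {
    // Parses "carrier<TAB>count" lines and sorts them by count, largest first.
    static List<Map.Entry<String, Integer>> parseAndSort(List<String> outputLines) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>();
        for (String line : outputLines) {
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                continue; // ignore anything that is not a key/value pair
            }
            rows.add(new SimpleEntry<>(parts[0], Integer.parseInt(parts[1].trim())));
        }
        rows.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return rows;
    }

    public static void main(String[] args) {
        // Made-up counts, for illustration only.
        List<String> lines = Arrays.asList("AA\t2", "WN\t5", "DL\t3");
        for (Map.Entry<String, Integer> row : parseAndSort(lines)) {
            System.out.println(row.getKey() + " " + row.getValue());
        }
    }
}
```

The input lines could come from something like hadoop fs -cat /user/share/student/outputalc/part-r-00000, assuming the job ran with a single reducer and the default output file naming.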
