Airlinecount CSCE 587 Spring 2016

Preliminary steps in the VM
First, log in to the VM:
    ssh vm-hadoop-XX.cse.sc.edu -p222
where XX is the VM number assigned to you.
If you haven't already done so, change your password:
    passwd
You will be prompted for your current password, then for a new password.

Preliminary steps in the VM
Load data into the Linux filesystem of the VM: use SSH secure file transfer, or use vi to create a text file.
Ex:
    [student@sandbox ~]$ scp -P222 rose@l-1d39-08.cse.sc.edu:public_html/587/CSV/test_25K.csv test_25K.csv
where:
    scp: secure copy command
    -P222: use port 222
    rose@l-1d39-08.cse.sc.edu:public_html/587/CSV/test_25K.csv: source file
    test_25K.csv: destination file

Preliminary steps in the VM
Transfer the file from the VM's Linux filesystem to the Hadoop filesystem (HDFS):
    hadoop fs -put test_25K.csv /user/share/student/
Convince yourself by checking HDFS:
    hadoop fs -ls /user/share/student/
Ex:
    [student@sandbox ~]$ hadoop fs -ls /user/share/student/
    Found 3 items
    -rw-r--r--   1 student hdfs    2391989 2016-03-24 10:43 /user/share/student/test_25K.csv
    drwxr-xr-x   - student hdfs          0 2015-11-09 18:41 /user/share/student/out
    drwxr-xr-x   - student hdfs          0 2015-11-09 18:47 /user/share/student/out2

Preliminary RStudio steps
    # Set environment variables
    Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
    Sys.setenv(HADOOP_STREAMING="/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.0.0-2557.jar")
    # Load the following packages in the following order
    library(rhdfs)
    library(rmr2)
    # Initialize the connection from RStudio to Hadoop
    hdfs.init()

Our second mapreduce program
    # Doing simple mapreduce on airline data
    # Our map function, which returns the keyval (carrier, 1) for each flight
    map1 = function(k, flights) {
      return(keyval(as.character(flights[[9]]), 1))
    }
    # Our reduce function, which sums up the flights for each airline
    reduce1 = function(carrier, counts) {
      keyval(carrier, sum(counts))
    }

Our second mapreduce program
    # Our mapreduce function, which invokes map1 and reduce1 and parses
    # the input file, expecting it to be comma delimited
    mr1 = function(input, output = NULL) {
      mapreduce(input = input,
                output = output,
                input.format = make.input.format("csv", sep=","),
                map = map1,
                reduce = reduce1)
    }
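To see what map1 and reduce1 compute without waiting on the VM, the same count can be simulated in plain Java on a few in-memory lines. This is only a sketch, not part of the course code: the class name LocalAirlineCount, the method countByCarrier, and the sample rows are made up for illustration, and the grouping that Hadoop performs between map and reduce is replaced here by a TreeMap.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local stand-in for the map/reduce pair; no Hadoop required.
public class LocalAirlineCount {
    // The carrier code is the 9th CSV field (index 8), matching flights[[9]] in map1.
    static final int CARRIER_COLUMN = 8;

    // "Map" each line to (carrier, 1), then "reduce" by summing the ones per carrier.
    static Map<String, Integer> countByCarrier(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] column = line.split(",");
            if (column.length <= CARRIER_COLUMN) {
                continue; // skip rows that have no carrier field
            }
            counts.merge(column[CARRIER_COLUMN], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up rows in the same column layout as the airline CSV.
        List<String> sample = Arrays.asList(
            "2008,1,3,4,2003,1955,2211,2225,WN,335",
            "2008,1,3,4,754,735,1002,1000,WN,3231",
            "2008,1,3,4,628,620,804,750,AA,448");
        System.out.println(countByCarrier(sample));
    }
}
```

Each carrier ends up with one entry whose value is the number of rows seen for it, which is what reduce1 returns per key.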

Submitting our first mapreduce job
    # Specify the path
    hdfs.root = '/user/share/student'
    # Append the data filename to the pathname
    hdfs.data = file.path(hdfs.root, 'testDataNoHdr.csv')
    # Append the output filename to the pathname
    hdfs.out = file.path(hdfs.root, 'out')

Submitting our first mapreduce job
    # Invoke your map-reduce functions on your input file and output file
    out = mr1(hdfs.data, hdfs.out)
    # Pour yourself a cup of coffee and wait...
    # Hadoop is fast enough, but the VM is deadly slow...

Note: you cannot overwrite existing files
    # If "out" already exists, then the mapreduce job will fail and you will
    # have to delete "out":
    # hadoop fs -rmr /user/share/student/out

VM: Check for changes to HDFS
    [student@sandbox ~]$ hadoop fs -ls /user/share/student/
    Found 4 items
    -rw-r--r--   1 student hdfs       1746 2016-03-03 12:36 /user/share/student/g.txt
    drwxr-xr-x   - student hdfs          0 2015-11-09 18:41 /user/share/student/out
    drwxr-xr-x   - student hdfs          0 2015-11-09 18:47 /user/share/student/out2
    drwxr-xr-x   - student hdfs          0 2016-03-03 12:56 /user/share/student/out

RStudio: Fetch the results from HDFS
    results = from.dfs(out)
    results.df = as.data.frame(results, stringsAsFactors=F)
    colnames(results.df) = c('Carrier', 'Flights')
    # head(results.df)
    results.df

Java Version: Wordcount Program Outline
    package org.myorg;
    import java.io.IOException;
    ...
    public class WordCount {
      public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException { …. }
      }
      public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException { …. }
      }
      public static void main(String[] args) throws Exception { …. }
    }

Java Version: Wordcount Program
    package org.myorg;
    import java.io.IOException;
    import java.util.*;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
This portion lists all of the classes that our program might reference.

Java Version: Airlinecount Program
    package org.myorg;
    import java.io.IOException;
    import java.util.*;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
This portion lists all of the classes that our program might reference.
No change from wordcount.

The new program will be AirlineCount
Change from
    public class WordCount { …….. }
to
    public class AirlineCount { …….. }

Java Version: Wordcount Program
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        // Emit (word, 1) for every whitespace-separated token in the line
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          context.write(word, one);
        }
      }
    }

Java Version: Airlinecount Program
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        // Split the comma-delimited record; the carrier code is the 9th field (index 8)
        String[] column = line.split(",");
        word.set(column[8]);
        context.write(word, one);
      }
    }
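The mapper above indexes column[8] directly, so a blank line, a truncated record, or a header row would either throw or be counted as a carrier. One possible guard is sketched below in plain Java so it can be tried without Hadoop. The class name CarrierField, the method carrierOf, and the header label "UniqueCarrier" are assumptions made for this illustration, not part of the course code; check the label against your own file.

```java
// Hypothetical helper: pulls the carrier code out of one CSV record.
public class CarrierField {
    static final int CARRIER_COLUMN = 8;                // 9th field, as in the mapper
    static final String HEADER_LABEL = "UniqueCarrier"; // assumed header text

    // Returns the carrier code, or null when the line should be skipped.
    static String carrierOf(String line) {
        if (line == null) {
            return null;
        }
        // The -1 limit keeps trailing empty fields, so field positions stay stable.
        String[] column = line.split(",", -1);
        if (column.length <= CARRIER_COLUMN) {
            return null; // blank or truncated record
        }
        String carrier = column[CARRIER_COLUMN].trim();
        if (carrier.isEmpty() || carrier.equals(HEADER_LABEL)) {
            return null; // missing value or header row
        }
        return carrier;
    }

    public static void main(String[] args) {
        System.out.println(carrierOf("2008,1,3,4,2003,1955,2211,2225,WN,335"));
        System.out.println(carrierOf("too,short"));
    }
}
```

Inside map(), a null result would simply mean returning without calling context.write.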

Java Version: Wordcount Program
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        // Sum the counts emitted by the mappers for this key
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }

Java Version: Airlinecount Program
No change!
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        // Sum the counts emitted by the mappers for this carrier
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }

Java Version: Wordcount Program Driver
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = new Job(conf, "WordCount");
      job.setJarByClass(WordCount.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      job.setMapperClass(Map.class);
      job.setReducerClass(Reduce.class);
      job.setInputFormatClass(TextInputFormat.class);
      job.setOutputFormatClass(TextOutputFormat.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.waitForCompletion(true);
    }

Java Version: Airlinecount Program Driver
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = new Job(conf, "AirlineCount");
      job.setJarByClass(AirlineCount.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      job.setMapperClass(Map.class);
      job.setReducerClass(Reduce.class);
      job.setInputFormatClass(TextInputFormat.class);
      job.setOutputFormatClass(TextOutputFormat.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.waitForCompletion(true);
    }

Creating the AirlineCount Jar File
    # 1) Create a directory airlinecount and move to that directory
    mkdir airlinecount
    cd airlinecount
    # 2) Get the source file AirlineCount.java
    #    (you could use scp to copy it)
    # 3) Create a subdirectory to store the class files
    mkdir airlinec

Creating the AirlineCount Jar File
    # 4) Compile AirlineCount.java to create the class files
    javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop/lib/commons-cli-1.2.jar -d ./airlinec/ *.java
N.B. This is all one line: no carriage returns until the very end.

Creating the AirlineCount Jar File
    # 5) Create the JAR file from these classes
    jar -cvf AirlineCount.jar -C /home/student/airlinecount/airlinec .

Launch our new program!
    # 8) Launch the AirlineCount program from Hadoop
    # hadoop jar jarfile program input output
    hadoop jar /home/student/airlinecount/AirlineCount.jar org/myorg/AirlineCount /user/share/student/test_25K.csv /user/share/student/outputalc
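Once the job finishes, the counts sit in the output directory as text. With TextOutputFormat and default settings, each line holds a key and a value separated by a tab, typically in files named like part-r-00000. The sketch below shows one way such lines could be read back into a map after copying them out of HDFS. The class name OutputLines, the method parse, and the sample values are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for "carrier<TAB>count" lines from the job output.
public class OutputLines {
    // Parses each line into (carrier, count); lines without a tab are skipped.
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // not a key/value line
            }
            String carrier = line.substring(0, tab);
            int flights = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(carrier, flights);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not real results.
        List<String> sample = Arrays.asList("AA\t120", "WN\t340");
        for (Map.Entry<String, Integer> e : parse(sample).entrySet()) {
            System.out.println(e.getKey() + " flew " + e.getValue() + " flights");
        }
    }
}
```

A LinkedHashMap is used so the carriers keep the order in which they appear in the file.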
