Published by Beatrice Daniel. Modified over 8 years ago.
1
Airlinecount CSCE 587 Spring 2016
2
Preliminary steps in the VM
First: log in to the VM
Ex: ssh vm-hadoop-XX.cse.sc.edu -p222
Where: XX is the VM number assigned to you
If you haven’t already done so, change your password
Ex: passwd
You will be prompted for your current password
Next you will be prompted for a new password
3
Preliminary steps in the VM
Load data into the Linux filesystem of the VM: use SSH secure file transfer, or use vi to create a text file
Ex:
[student@sandbox ~]$ scp -P222 rose@l-1d39-08.cse.sc.edu:public_html/587/CSV/test_25K.csv test_25K.csv
scp – secure copy command
-P222 – use port 222
rose@l-1d39-08.cse.sc.edu:public_html/587/CSV/test_25K.csv – source file
test_25K.csv – destination file
4
Preliminary steps in the VM
Transfer the file from the VM's Linux filesystem to the Hadoop filesystem (HDFS):
hadoop fs -put test_25K.csv /user/share/student/
Convince yourself by checking the HDFS:
hadoop fs -ls /user/share/student/
Ex:
[student@sandbox ~]$ hadoop fs -ls /user/share/student/
Found 3 items
-rw-r--r-- 1 student hdfs 2391989 2016-03-24 10:43 /user/share/student/test_25K.csv
drwxr-xr-x - student hdfs 0 2015-11-09 18:41 /user/share/student/out
drwxr-xr-x - student hdfs 0 2015-11-09 18:47 /user/share/student/out2
5
Preliminary RStudio steps
# Set environment variables
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.0.0-2557.jar")
# Load the following packages in the following order
library(rhdfs)
library(rmr2)
# Initialize the connection from RStudio to Hadoop
hdfs.init()
6
Our second mapreduce program
# Doing simple mapreduce on airline data
# Our map function, which returns a keyval of (carrier, 1) for each flight
map1 = function(k, flights) {
  return(keyval(as.character(flights[[9]]), 1))
}
# Our reduce function, which sums up the flights for each airline
reduce1 = function(carrier, counts) {
  keyval(carrier, sum(counts))
}
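The map1/reduce1 pair above can be mirrored in a few lines of plain Java to see what the job computes. This is only a sketch with no Hadoop involved: the class name CarrierCountSketch and the sample rows are made up for illustration, and the only detail taken from the slide is that the carrier is the 9th comma-separated column (index 8 in Java).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CarrierCountSketch {
    // Map step: pick the carrier code (9th CSV column, index 8) from one record.
    static String mapCarrier(String line) {
        return line.split(",")[8];
    }

    // Shuffle + reduce step: group one count per record by carrier and sum them.
    static Map<String, Integer> countByCarrier(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            counts.merge(mapCarrier(line), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical rows: only the 9th column (the carrier) matters here.
        List<String> rows = Arrays.asList(
            "2008,1,3,4,2003,1955,2211,2225,WN,335",
            "2008,1,3,4,754,735,1002,1000,WN,3231",
            "2008,1,3,4,628,620,804,750,AA,448");
        System.out.println(countByCarrier(rows)); // prints {AA=1, WN=2}
    }
}
```

In the real job, Hadoop performs the grouping between the map and reduce phases; the TreeMap here stands in for that shuffle.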
7
Our second mapreduce program
# Our mapreduce function, which invokes map1 and reduce1 and parses
# the input file, expecting it to be comma delimited
mr1 = function(input, output = NULL) {
  mapreduce(input = input,
            output = output,
            input.format = make.input.format("csv", sep=","),
            map = map1,
            reduce = reduce1)
}
8
Submitting our first mapreduce job
# Specify the path
hdfs.root = '/user/share/student'
# Append the data filename to the pathname
hdfs.data = file.path(hdfs.root, 'testDataNoHdr.csv')
# Append the output filename to the pathname
hdfs.out = file.path(hdfs.root, 'out')
9
Submitting our first mapreduce job
# Invoke your map-reduce functions on your input file and output file
out = mr1(hdfs.data, hdfs.out)
# Pour yourself a cup of coffee and wait...
# Hadoop is fast enough, but the VM is deadly slow.
10
Note: you cannot overwrite existing files
# If "out" already exists, then the mapreduce job will fail and you will
# have to delete "out":
# hadoop fs -rm -r /user/share/student/out
11
VM: Check for changes to HDFS
[student@sandbox ~]$ hadoop fs -ls /user/share/student/
Found 4 items
-rw-r--r-- 1 student hdfs 1746 2016-03-03 12:36 /user/share/student/g.txt
drwxr-xr-x - student hdfs 0 2015-11-09 18:41 /user/share/student/out
drwxr-xr-x - student hdfs 0 2015-11-09 18:47 /user/share/student/out2
drwxr-xr-x - student hdfs 0 2016-03-03 12:56 /user/share/student/out
12
RStudio: Fetch the results from HDFS
results = from.dfs(out)
results.df = as.data.frame(results, stringsAsFactors=F)
colnames(results.df) = c('Carrier', 'Flights')
# head(results.df)
results.df
13
Java Version: Wordcount Program Outline
package org.myorg;
import java.io.IOException;
...
public class WordCount {
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { …. }
  }
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { …. }
  }
  public static void main(String[] args) throws Exception { …. }
}
14
Java Version: Wordcount Program
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
This portion lists all of the classes that our program might reference.
15
Java Version: Airlinecount Program
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
This portion lists all of the classes that our program might reference.
No change from wordcount.
16
The new program will be AirlineCount
Change from
public class WordCount { …….. }
to
public class AirlineCount { …….. }
17
Java Version: Wordcount Program
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    // Split the line on whitespace and emit (word, 1) for every token
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}
18
Java Version: Airlinecount Program
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    // Split the CSV record and emit (carrier, 1); the carrier is the 9th column (index 8)
    String[] column = line.split(",");
    word.set(column[8]);
    context.write(word, one);
  }
}
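The mapper above indexes column[8] directly, so a blank or truncated line would throw an ArrayIndexOutOfBoundsException and fail the task. A minimal sketch of a guard, written as plain Java so it can be tried without Hadoop; the class name CarrierField and the helper extractCarrier are hypothetical, not part of the course code:

```java
import java.util.Optional;

class CarrierField {
    // Same column the slide's mapper reads (9th field, index 8).
    static final int CARRIER_INDEX = 8;

    // Returns the carrier code, or empty when the line is too short or the field is blank.
    static Optional<String> extractCarrier(String line) {
        // A limit of -1 keeps trailing empty fields, so column counts stay honest.
        String[] column = line.split(",", -1);
        if (column.length <= CARRIER_INDEX) {
            return Optional.empty();
        }
        String carrier = column[CARRIER_INDEX].trim();
        if (carrier.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(carrier);
    }

    public static void main(String[] args) {
        System.out.println(extractCarrier("a,b,c,d,e,f,g,h,WN,335").isPresent()); // prints true
        System.out.println(extractCarrier("a,b,c").isPresent());                  // prints false
    }
}
```

Inside map(), one would call context.write only when the result is present. Note that this guard does not detect a header row; a header's 9th field would still be counted as if it were a carrier.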
19
Java Version: Wordcount Program
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    // Sum all the counts emitted for this key
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
20
Java Version: Airlinecount Program
No change!
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    // Sum all the counts emitted for this carrier
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
21
Java Version: Wordcount Program Driver
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "WordCount");
  job.setJarByClass(WordCount.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(Map.class);
  job.setReducerClass(Reduce.class);
  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.waitForCompletion(true);
}
22
Java Version: Airlinecount Program Driver
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "AirlineCount");
  job.setJarByClass(AirlineCount.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(Map.class);
  job.setReducerClass(Reduce.class);
  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.waitForCompletion(true);
}
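The driver reads args[0] and args[1] without checking them, so launching it with a missing path fails with an ArrayIndexOutOfBoundsException rather than a helpful message. One possible check, sketched as plain Java; the class name DriverArgs, the helper hasValidArgs, and the usage text are hypothetical additions, not part of the slides:

```java
class DriverArgs {
    static final String USAGE = "Usage: AirlineCount <input path> <output path>";

    // True only when exactly two non-empty, distinct paths were supplied.
    static boolean hasValidArgs(String[] args) {
        return args.length == 2
            && !args[0].isEmpty()
            && !args[1].isEmpty()
            && !args[0].equals(args[1]);
    }

    public static void main(String[] args) {
        // Sample argument lists, using the paths from the launch slide.
        String[] good = {"/user/share/student/test_25K.csv", "/user/share/student/outputalc"};
        String[] bad = {"/user/share/student/test_25K.csv"};
        System.out.println(hasValidArgs(good)); // prints true
        if (!hasValidArgs(bad)) {
            System.out.println(USAGE);
        }
    }
}
```

In a real driver, the same test would sit at the top of main(), printing the usage line and exiting before the Job is created.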
23
Creating the AirlineCount Jar File
# 1) Create a directory airlinecount and move to that directory
mkdir airlinecount
cd airlinecount
# 2) Get the source file AirlineCount.java
# (you could use scp to copy it)
# 3) Create a subdirectory to store the class files
mkdir airlinec
24
Creating the AirlineCount Jar File
# 4) Compile AirlineCount.java to create the class files
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop/lib/commons-cli-1.2.jar -d ./airlinec/ *.java
N.B. This is all one line: no carriage returns until the very end.
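The long javac line is easier to read once you see that the -classpath value is just three jar paths joined by colons. The sketch below rebuilds the command string from its parts; the class name ClasspathSketch and the helper joinClasspath are hypothetical, and the jar paths are the ones from the slide:

```java
import java.util.Arrays;
import java.util.List;

class ClasspathSketch {
    // Joins jar paths with ':' (the classpath separator on Linux).
    static String joinClasspath(List<String> jars) {
        return String.join(":", jars);
    }

    public static void main(String[] args) {
        List<String> jars = Arrays.asList(
            "/usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar",
            "/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar",
            "/usr/hdp/2.3.0.0-2557/hadoop/lib/commons-cli-1.2.jar");
        // Print the full compile command as a single line.
        System.out.println("javac -classpath " + joinClasspath(jars) + " -d ./airlinec/ *.java");
    }
}
```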
25
Creating the AirlineCount Jar File
# 5) Create the JAR file from these classes
jar -cvf AirlineCount.jar -C /home/student/airlinecount/airlinec .
26
Launch our new program!
# 8) Launch the AirlineCount program from Hadoop
# hadoop jar jarfile program input output
hadoop jar /home/student/airlinecount/AirlineCount.jar org.myorg.AirlineCount /user/share/student/test_25K.csv /user/share/student/outputalc
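Once the job finishes, the results sit in part files inside the output directory, and TextOutputFormat writes each record as the key and value separated by a tab by default. A minimal sketch of reading such lines and ranking carriers by flight count, in plain Java; the class name OutputReader, the helper parseAndSort, and the sample counts are made up for illustration:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class OutputReader {
    // Parses "carrier<TAB>count" lines and sorts them by count, largest first.
    static List<Map.Entry<String, Integer>> parseAndSort(List<String> lines) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>();
        for (String line : lines) {
            String[] kv = line.split("\t");
            if (kv.length != 2) {
                continue; // skip anything that is not a key/value pair
            }
            rows.add(new AbstractMap.SimpleEntry<>(kv[0], Integer.parseInt(kv[1].trim())));
        }
        rows.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical output lines, not real results from test_25K.csv.
        List<String> lines = Arrays.asList("AA\t3", "WN\t10", "DL\t7");
        for (Map.Entry<String, Integer> row : parseAndSort(lines)) {
            System.out.println(row.getKey() + " " + row.getValue());
        }
    }
}
```

The lines themselves could be obtained with hadoop fs -cat on the part files under /user/share/student/outputalc.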