Airlinecount CSCE 587 Fall 2017.


Preliminary steps in the VM
Load the data into the Linux filesystem of the VM: use wget to download it, or use vi to create a text file.

Preliminary steps in the VM
Transfer the file from the VM's Linux filesystem to the Hadoop filesystem (HDFS):
    hadoop fs -put test_25K.csv /user/share/student/
Convince yourself by checking the HDFS:
    hadoop fs -ls /user/share/student/
Ex:
    [student@sandbox ~]$ hadoop fs -ls /user/share/student/
    Found 3 items
    -rw-r--r-- 1 student hdfs 2391989 2016-03-24 10:43 /user/share/student/test_25K.csv
    drwxr-xr-x - student hdfs 0 2015-11-09 18:41 /user/share/student/out
    drwxr-xr-x - student hdfs 0 2015-11-09 18:47 /user/share/student/out2

Preliminary RStudio steps
    # Set environment variables
    Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
    Sys.setenv(HADOOP_STREAMING="/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.0.0-2557.jar")
    # Load the following packages in the following order
    library(rhdfs)
    library(rmr2)
    # Initialize the connection from RStudio to Hadoop
    hdfs.init()

Our second mapreduce program
    # Doing simple mapreduce on airline data
    # Our map function, which returns the keyval <airline_ID, 1>
    # (field 9 of each record holds the airline ID)
    map1 = function(k, flights) {
      return(keyval(as.character(flights[[9]]), 1))
    }
    # Our reduce function, which sums up the flights for each airline
    reduce1 = function(carrier, counts) {
      keyval(carrier, sum(counts))
    }
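To see what the map and reduce functions above compute, the same logic can be sketched in plain Java without Hadoop. This is only an in-memory illustration, not how the job actually runs on the cluster: the class name LocalAirlineCount, its method names, and the placeholder records are assumptions made for this sketch. The carrier is taken from the 9th comma-separated field (index 8 when counting from zero), matching flights[[9]] in the R code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class (not from the slides): simulates
// map -> shuffle -> reduce in memory for the airline count.
final class LocalAirlineCount {

    // Zero-based index of the carrier column (the 9th field).
    static final int CARRIER_COLUMN = 8;

    // "Map" step: produce the key (carrier) for one CSV record.
    static String mapToCarrier(String line) {
        return line.split(",")[CARRIER_COLUMN];
    }

    // "Shuffle" and "reduce" steps: group by carrier, adding 1 per record.
    static Map<String, Integer> countByCarrier(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            counts.merge(mapToCarrier(line), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder records; only the 9th field matters here.
        List<String> sample = Arrays.asList(
            "f1,f2,f3,f4,f5,f6,f7,f8,WN,f10",
            "f1,f2,f3,f4,f5,f6,f7,f8,AA,f10",
            "f1,f2,f3,f4,f5,f6,f7,f8,WN,f10");
        System.out.println(countByCarrier(sample)); // prints {AA=1, WN=2}
    }
}
```

In the real job, Hadoop performs the grouping between the map and reduce phases; the single loop here merges both steps only because everything fits in memory.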

Our second mapreduce program
    # Our mapreduce function, which invokes map1 and reduce1 and parses
    # the input file, expecting it to be comma-delimited
    mr1 = function(input, output = NULL) {
      mapreduce(input = input,
                output = output,
                input.format = make.input.format("csv", sep=","),
                map = map1,
                reduce = reduce1)
    }

Submitting our first mapreduce job
    # Specify the path
    hdfs.root = '/user/share/student'
    # Append the data filename to the pathname
    hdfs.data = file.path(hdfs.root, 'testDataNoHdr.csv')
    # Append the output filename to the pathname
    hdfs.out = file.path(hdfs.root, 'out')

Submitting our first mapreduce job
    # Invoke your map-reduce functions on your input file and output file
    out = mr1(hdfs.data, hdfs.out)
    # If "out" already exists, the mapreduce job will fail and you will
    # have to delete "out" first:
    # hadoop fs -rmr /user/share/student/out

VM: Check for changes to HDFS
    [student@sandbox ~]$ hadoop fs -ls /user/share/student/
    Found 4 items
    -rw-r--r-- 1 student hdfs 1746 2016-03-03 12:36 /user/share/student/g.txt
    drwxr-xr-x - student hdfs 0 2015-11-09 18:41 /user/share/student/out
    drwxr-xr-x - student hdfs 0 2015-11-09 18:47 /user/share/student/out2
    drwxr-xr-x - student hdfs 0 2016-03-03 12:56 /user/share/student/out

RStudio: Fetch the results from HDFS
    results = from.dfs(out)
    results.df = as.data.frame(results, stringsAsFactors=F)
    colnames(results.df) = c('Carrier', 'Flights')
    # head(results.df)
    results.df

Java Version: Wordcount Program Outline
    package org.myorg;
    import java.io.IOException;
    ...
    public class WordCount {
        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { ... }
        }
        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { ... }
        }
        public static void main(String[] args) throws Exception { ... }
    }

Java Version: Wordcount Program
    package org.myorg;
    import java.io.IOException;
    import java.util.*;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
This portion lists all of the classes that our program might reference.

Java Version: Airlinecount Program
    package org.myorg;
    import java.io.IOException;
    import java.util.*;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
This portion lists all of the classes that our program might reference. No change from wordcount.

The new program will be AirlineCount
Change from
    public class WordCount { ... }
to
    public class AirlineCount { ... }

Java Version: Wordcount Program
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            // Split the line on whitespace and emit <word, 1> for each token
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

Java Version: Airlinecount Program
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            // Split the CSV record into its fields
            String[] column = line.split(",");
            // Field 9 (index 8) is the airline ID; emit <airline_ID, 1>.
            // The word field declared above is reused, so no local Text is needed.
            word.set(column[8]);
            context.write(word, one);
        }
    }
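The mapper above indexes column[8] directly, so a blank or truncated input line would make it throw an ArrayIndexOutOfBoundsException. A minimal sketch of a more defensive field extraction follows, written in plain Java so it can be tried without Hadoop. The class name CarrierField and its extract method are hypothetical names introduced for this illustration; they are not part of the slides or of the Hadoop API.

```java
import java.util.Optional;

// Hypothetical helper (not from the slides): pulls the carrier field out of
// one CSV record and reports "no value" instead of throwing on bad input.
final class CarrierField {

    // Zero-based index of the carrier column (the 9th field).
    static final int CARRIER_COLUMN = 8;

    static Optional<String> extract(String line) {
        // A limit of -1 keeps trailing empty fields, so the field count
        // does not shrink when the last columns are empty.
        String[] column = line.split(",", -1);
        if (column.length <= CARRIER_COLUMN) {
            return Optional.empty(); // blank or truncated record
        }
        String carrier = column[CARRIER_COLUMN].trim();
        return carrier.isEmpty() ? Optional.empty() : Optional.of(carrier);
    }

    public static void main(String[] args) {
        System.out.println(extract("f1,f2,f3,f4,f5,f6,f7,f8,AA,f10")); // prints Optional[AA]
        System.out.println(extract(""));                               // prints Optional.empty
    }
}
```

Inside the mapper, one possible use would be to call context.write(word, one) only when extract returns a value, so that malformed lines are skipped rather than failing the task. Whether skipping is the right policy depends on the data.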

Java Version: Wordcount Program
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Sum all of the 1s emitted for this key
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

Java Version: Airlinecount Program
No change!
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Sum all of the 1s emitted for this airline
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

Java Version: Wordcount Program Driver
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "WordCount");
        job.setJarByClass(WordCount.class);
        // Types of the output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // args[0] is the input path, args[1] is the output path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }

Java Version: Airlinecount Program Driver
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "AirlineCount");
        job.setJarByClass(AirlineCount.class);
        // Types of the output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // args[0] is the input path, args[1] is the output path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }

Creating the AirlineCount Jar File
    # 1) Create a directory airlinecount and move to that directory
    mkdir airlinecount
    cd airlinecount
    # 2) Get the source file AirlineCount.java
    #    (you could use scp to copy it)
    # 3) Create a subdirectory to store the class files
    mkdir airlinec

Creating the AirlineCount Jar File
    # 4) Compile AirlineCount.java to create the class files
    javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop/lib/commons-cli-1.2.jar -d ./airlinec/ *.java
N.B. This is all one line: no carriage returns until the very end.

Creating the AirlineCount Jar File
    # 5) Create the JAR file from these classes
    jar -cvf AirlineCount.jar -C /home/student/airlinecount/airlinec .

Launch our new program!
    # 8) Launch the AirlineCount program from Hadoop
    # hadoop jar jarfile program input output
    hadoop jar /home/student/airlinecount/AirlineCount.jar org/myorg/AirlineCount /user/share/student/test_25K.csv /user/share/student/outputalc
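Once the job finishes, the output directory holds text files written by TextOutputFormat, which by default puts one key and one value on each line, separated by a tab. As a rough sketch of how such lines could be read back into a carrier-to-count table on the Java side (similar in spirit to the from.dfs step in RStudio), consider the plain-Java helper below. The class name AirlineCountOutput, the parse method, and the sample lines are assumptions made for illustration only; they are not real job output.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not from the slides): parses "carrier<TAB>count" lines.
final class AirlineCountOutput {

    static Map<String, Integer> parse(List<String> lines) {
        // LinkedHashMap keeps the carriers in the order they were read.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // skip blank lines or lines without a separator
            }
            String carrier = line.substring(0, tab);
            int flights = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(carrier, flights);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up lines in the key<TAB>value layout.
        List<String> sample = Arrays.asList("AA\t12", "WN\t30");
        System.out.println(parse(sample)); // prints {AA=12, WN=30}
    }
}
```

The lines themselves would first have to be copied out of HDFS, for example with hadoop fs -cat on the files in the output directory.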