
Airlinecount CSCE 587 Spring 2016

Preliminary steps in the VM

First: log in to the VM
Ex: ssh vm-hadoop-XX.cse.sc.edu -p222
Where: XX is the VM number assigned to you

If you haven't already done so, change your password
Ex: passwd
You will be prompted for your current password, then for a new password

Preliminary steps in the VM

Load data into the Linux filesystem of the VM: use SSH secure file transfer, or use vi to create a text file
Ex: ~]$ scp -P222 test_25K.csv
  scp – secure copy command
  -P222 – use port 222
  followed by the source file and the destination file

Preliminary steps in the VM

Transfer the file from the VM's Linux filesystem to the Hadoop filesystem:
hadoop fs -put test_25K.csv /user/share/student/

Convince yourself by checking the HDFS:
hadoop fs -ls /user/share/student/

Ex: ~]$ hadoop fs -ls /user/share/student/
Found 3 items
-rw-r--r--   1 student hdfs  :43 /user/share/student/test_25K.csv
drwxr-xr-x   - student hdfs  :41 /user/share/student/out
drwxr-xr-x   - student hdfs  :47 /user/share/student/out2

Preliminary RStudio steps

# Set environmental variables
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/hdp/ /hadoop-mapreduce/hadoop-streaming jar")

# Load the following packages in the following order
library(rhdfs)
library(rmr2)

# Initialize the connection from RStudio to Hadoop
hdfs.init()

Our second mapreduce program

# Doing simple mapreduce on airline data

# Our map function, which returns the keyval
map1 = function(k, flights) {
  return( keyval(as.character(flights[[9]]), 1) )
}

# Our reduce function, which sums up the flights for each airline
reduce1 = function(carrier, counts) {
  keyval(carrier, sum(counts))
}
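Taken together, map1 and reduce1 compute one count per carrier: the map step emits (carrier, 1) for each flight record, and the reduce step sums the 1s grouped under each carrier. A minimal, Hadoop-free Java sketch of the same aggregation (the sample rows below are invented for illustration):

```java
import java.util.*;

public class CarrierCount {
    // Emit (carrier, 1) per row, then sum per key -- the map1/reduce1 pipeline in miniature
    public static Map<String, Integer> count(List<String[]> flights) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] row : flights) {
            String carrier = row[8];               // 9th field, as in flights[[9]]
            counts.merge(carrier, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"2008","1","3","4","2003","1955","2211","2225","WN"},
            new String[]{"2008","1","3","4","754","735","1002","1000","AA"},
            new String[]{"2008","1","3","4","628","620","804","750","WN"});
        System.out.println(count(rows));   // {AA=1, WN=2}
    }
}
```

The shuffle phase of a real Hadoop job does the grouping that the TreeMap does here.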

Our second mapreduce program

# Our mapreduce function, which invokes map1 and reduce1 and parses
# the input file, expecting it to be comma delimited
mr1 = function(input, output = NULL) {
  mapreduce(input = input,
            output = output,
            input.format = make.input.format("csv", sep=","),
            map = map1,
            reduce = reduce1)
}

Submitting our first mapreduce job

# Specify the path
hdfs.root = '/user/share/student'

# Append the data filename to the pathname
hdfs.data = file.path(hdfs.root, 'testDataNoHdr.csv')

# Append the output filename to the pathname
hdfs.out = file.path(hdfs.root, 'out')

Submitting our first mapreduce job

# Invoke your map-reduce functions on your input file and output file
out = mr1(hdfs.data, hdfs.out)

# Pour yourself a cup of coffee and wait…
# Hadoop is fast enough, but the VM is deadly slow…

Note: you cannot overwrite existing files

# If "out" already exists, then the mapreduce job will fail and you will
# have to delete "out":
# hadoop fs -rmr /user/share/student/out

VM: Check for changes to HDFS

~]$ hadoop fs -ls /user/share/student/
Found 4 items
-rw-r--r--   1 student hdfs  :36 /user/share/student/g.txt
drwxr-xr-x   - student hdfs  :41 /user/share/student/out
drwxr-xr-x   - student hdfs  :47 /user/share/student/out2
drwxr-xr-x   - student hdfs  :56 /user/share/student/out

RStudio: Fetch the results from HDFS

results = from.dfs(out)
results.df = as.data.frame(results, stringsAsFactors=F)
colnames(results.df) = c('Carrier', 'Flights')
# head(results.df)
results.df

Java Version: Wordcount Program Outline

package org.myorg;
import java.io.IOException;
...
public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException { …. }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException { …. }
  }

  public static void main(String[] args) throws Exception { …. }
}

Java Version: Wordcount Program

package org.myorg;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

This portion lists all of the classes that our program might reference.

Java Version: Airlinecount Program

package org.myorg;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

This portion lists all of the classes that our program might reference.
No change from wordcount.

The new program will be AirlineCount

Change from:
public class WordCount { …….. }
To:
public class AirlineCount { …….. }

Java Version: Wordcount Program

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}

Java Version: Airlinecount Program

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String[] column = line.split(",");
    word.set(column[8]);
    context.write(word, one);
  }
}
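The only real change from wordcount is the parsing step: instead of tokenizing on whitespace, the mapper splits the CSV line on commas and keeps column 8 (0-indexed), the carrier code. The split-and-extract logic in isolation (the sample record below is invented for illustration):

```java
public class ExtractCarrier {
    // Mirrors the mapper's parsing: split on commas and take the 9th field
    static String carrierOf(String line) {
        return line.split(",")[8];
    }

    public static void main(String[] args) {
        String record = "2008,1,3,4,2003,1955,2211,2225,WN,335";
        System.out.println(carrierOf(record));   // WN
    }
}
```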

Java Version: Wordcount Program

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

Java Version: Airlinecount Program

No change!

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
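Stripped of the Hadoop wrapper types, the reducer body is just a running sum over the values grouped under one key. A plain-Java sketch, with ints standing in for IntWritable:

```java
import java.util.Arrays;

public class SumPerKey {
    // The reduce loop in miniature: add up every value delivered for one key
    static int sum(Iterable<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Four (carrier, 1) pairs grouped under the same carrier
        System.out.println(sum(Arrays.asList(1, 1, 1, 1)));   // 4
    }
}
```

Because the reducer never looks at what the key means, the same class works unchanged for word counts and for carrier counts.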

Java Version: Wordcount Program Driver

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "WordCount");
  job.setJarByClass(WordCount.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(Map.class);
  job.setReducerClass(Reduce.class);
  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.waitForCompletion(true);
}

Java Version: Airlinecount Program Driver

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "AirlineCount");
  job.setJarByClass(AirlineCount.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(Map.class);
  job.setReducerClass(Reduce.class);
  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.waitForCompletion(true);
}

Creating the AirlineCount Jar File

# 1) Create a directory airlinecount and move to that directory
mkdir airlinecount
cd airlinecount

# 2) Get the source file AirlineCount.java
#    (you could use scp to copy it)

# 3) Create a subdirectory to store the class files
mkdir airlinec

Creating the AirlineCount Jar File

# 4) Compile AirlineCount.java to create the class files
javac -classpath /usr/hdp/ /hadoop/hadoop-common jar:/usr/hdp/ /hadoop-mapreduce/hadoop-mapreduce-client-core jar:/usr/hdp/ /hadoop/lib/commons-cli-1.2.jar -d ./airlinec/ *.java

N.B. This is all one line – no carriage returns until the very end.

Creating the AirlineCount Jar File

# 5) Create the JAR file from these classes
jar -cvf AirlineCount.jar -C /home/student/airlinecount/airlinec .

N.B. The trailing "." is an argument: relative to the -C directory, it names what goes into the jar.

Launch our new program!

# 8) Launch the AirlineCount program from Hadoop
# hadoop jar jarfile program input output
hadoop jar /home/student/airlinecount/AirlineCount.jar org/myorg/AirlineCount /user/share/student/test_25K.csv /user/share/student/outputalc