
Cloud Computing MapReduce (2) Keke Chen

Outline
- Hadoop streaming example
- Hadoop Java API: framework, important APIs
- Mini-project

A nice book
- Hadoop: The Definitive Guide
- You can read it online from the campus network: OhioLINK -> ebook center -> Safari Online

Hadoop streaming
- A simple and powerful interface for programming
- Application developers do not need to learn the Hadoop Java APIs
- Good for simple, ad hoc tasks

Note:
- Map/Reduce uses the local Linux file system for processing and for hosting temporary data
- HDFS is used to host application data
(Slide figure: HDFS layered on top of each node's local file system.)

Hadoop streaming
Documentation: .../current/streaming.html

/usr/local/hadoop/bin/hadoop jar \
  /usr/local/hadoop/hadoop-streaming.jar \
  -input myInputDirs -output myOutputDir \
  -mapper myMapper -reducer myReducer

- The reducer can be empty: -reducer NONE
- myMapper and myReducer can be any executables
- The mapper/reducer reads from stdin and writes to stdout
- Files in myInputDirs are fed to the mapper as stdin
- The mapper's output becomes the reducer's input

Packaging files with job submission

/usr/local/hadoop/bin/hadoop jar \
  /usr/local/hadoop/hadoop-streaming.jar \
  -input "/user/hadoop/inputdata" \
  -output "/user/hadoop/outputdata" \
  -mapper "python myPythonScript.py myDictionary.txt" \
  -reducer "/bin/wc" \
  -file myPythonScript.py \
  -file myDictionary.txt

- -file is good for small files
- myDictionary.txt is an input parameter for the script

Using hadoop library classes

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -D mapred.reduce.tasks=12 \
  -input myInputDirs \
  -output myOutputDir \
  -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
  -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
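For example, KeyFieldBasedPartitioner can partition the map output on a prefix of the key fields, so records sharing those fields reach the same reducer. A sketch adapted from the Hadoop streaming documentation (the four-field, dot-separated key layout is illustrative):

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -D stream.map.output.field.separator=. \
  -D stream.num.map.output.key.fields=4 \
  -D map.output.key.field.separator=. \
  -D mapred.text.key.partitioner.options=-k1,2 \
  -input myInputDirs -output myOutputDir \
  -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
  -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

Here -k1,2 means that the first two of the four key fields decide the partition.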

Large files and archives
- Upload large files to HDFS first
- Use the -files option in streaming, which copies the files to each task's local working directory:
  -files hdfs://host:fs_port/user/testfile.txt#testlink
  -archives hdfs://host:fs_port/user/testfile.jar#testlink
- If cache1.txt and cache2.txt are in testfile.jar, they appear locally as testlink/cache1.txt and testlink/cache2.txt (the text after # names the local symlink)

Wordcount
- Problem: count the frequency of each word in a large document collection
- Implement the mapper and the reducer in Python
- Many good Python tutorials are available online

Mapper.py

import sys

# Read lines from stdin; emit "word<TAB>1" for every word.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t1' % word

Reducer.py

import sys

# Sum the counts received from the mappers, keyed by word.
word2count = {}
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        pass  # ignore lines whose count is not a number

for word in word2count:
    print '%s\t%s' % (word, word2count[word])

Running wordcount

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -mapper "python mapper.py" \
  -reducer "python reducer.py" \
  -input text -output output2 \
  -file /localpath/mapper.py -file /localpath/reducer.py

Running wordcount

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -mapper "python mapper.py" \
  -reducer "python reducer.py" \
  -input text -output output2 \
  -file mapper.py -file reducer.py \
  -jobconf mapred.reduce.tasks=2 \
  -jobconf mapred.map.tasks=4

If the mapper/reducer takes files as parameters:

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -mapper "python mapper.py" \
  -reducer "python reducer.py myfile" \
  -input text -output output2 \
  -file /localpath/mapper.py -file /localpath/reducer.py -file /localpath/myfile

Hadoop Java APIs
- hadoop.apache.org/common/docs/current/api/
- Benefits:
  - Java code is more efficient than streaming
  - More parameters for control and tuning
  - Better for iterative MR programs

Important base classes
- Mapper: function map(Object, Writable, Context)
- Reducer: function reduce(WritableComparable, Iterable, Context)
- Combiner
- Partitioner

The framework

public class Wordcount {
  public static class MapClass extends Mapper<Object, Text, Text, IntWritable> {
    public void setup(Mapper.Context context) {...}
    public void map(Object key, Text value, Context context) throws IOException {...}
  }

  public static class ReduceClass extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void setup(Reducer.Context context) {...}
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException {...}
  }

  public static void main(String[] args) throws Exception {...}
}

The wordcount example in Java
- .../current/mapred_tutorial.html#Example%3A+WordCount+v1.0
- Old/new framework: the old framework is for versions prior to 0.20

Mapper of wordcount

public static class WCMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Tokenize the input line and emit (word, 1) for each token.
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

WordCount Reducer

public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Sum all partial counts for this word and emit the total.
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

Function parameters
- Define the map/reduce parameter types according to your application
- You have to use the writable classes in org.apache.hadoop.io, e.g. Text, LongWritable, IntWritable
- The generic (template) parameters and the function parameters must match
- The map's output types and the reduce's input types must match (see the sketch below)
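If the map output types differ from the job's final output types, they must be declared explicitly on the Job; for WordCount they happen to coincide, so the first two calls below are optional. A sketch, assuming the Job object created on the configuration slide that follows:

// Inside main(), after the Job is created (sketch):
job.setMapOutputKeyClass(Text.class);          // the mapper's output key type
job.setMapOutputValueClass(IntWritable.class); // the mapper's output value type
job.setOutputKeyClass(Text.class);             // the reducer's (final) output key type
job.setOutputValueClass(IntWritable.class);    // the reducer's (final) output value type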

Configuring map/reduce
- Pass global parameter settings to each map/reduce process
- In the main function, set parameters in a Configuration object:

Configuration conf = new Configuration();
Job job = new Job(conf, "cloudvista");
job.setJarByClass(Wordcount.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(WCMapper.class);
//job.setCombinerClass(WCReducer.class);
job.setReducerClass(WCReducer.class);
//job.setPartitionerClass(WCPartitioner.class);
job.setNumReduceTasks(num_reduce);
FileInputFormat.setInputPaths(job, input);
FileOutputFormat.setOutputPath(job, new Path(output_path));
System.exit(job.waitForCompletion(true) ? 0 : 1);

How to run your app
1. Compile and package your classes into a jar file
2. On the command line: hadoop jar your_jar your_parameters
Normally you need to pass in:
- the number of reducers
- the input files
- the output directory
- any other application-specific parameters
A hypothetical invocation is sketched below.
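For example, an invocation matching the main() shown earlier (the jar name, driver class name, and argument order are illustrative assumptions, not from the slides):

hadoop jar wordcount.jar Wordcount 2 /user/hadoop/text /user/hadoop/output2

Here 2 would be the number of reducers, followed by the input path and the output directory.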

Access files in HDFS? Example: in the map function:

public void setup(Mapper.Context context) {
  Configuration conf = context.getConfiguration();
  String filename = conf.get("yourfile");
  Path p = new Path(filename);          // Path is used for opening the file
  FileSystem fs = FileSystem.get(conf); // determines local or HDFS
  FSDataInputStream file = fs.open(p);
  while (file.available() > 0) {
    ...
  }
  file.close();
}
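The matching setting on the submission side, as a sketch (the key "yourfile" comes from the slide; the path is a hypothetical example):

// In main(), before creating the Job:
Configuration conf = new Configuration();
conf.set("yourfile", "/user/hadoop/mydata.txt"); // hypothetical HDFS path
Job job = new Job(conf, "myjob");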

Combiner
- Applies the reduce function to the intermediate results locally, after each map generates its output
(Slide figure: a map task's local output, keys key1..keyN with their values, passes through the combiner before reaching the reducers.)
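In the Java API the combiner is often the reducer class itself, enabled with one call on the Job (the commented-out line on the configuration slide above). A sketch; reusing the reducer is valid for WordCount only because summation is associative and commutative:

// Inside main(), after the Job is created:
job.setCombinerClass(WCReducer.class); // run WCReducer locally on each map's output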

Partitioner
- The map phase may generate N distinct keys (N > R, where R is the number of reducers)
- By default, keys are distributed over the R reducers by hashing (HashPartitioner)
- You can use a custom partitioner to define how the keys are distributed to the reducers, as sketched below
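A minimal custom partitioner sketch. The class name WCPartitioner matches the commented-out setPartitionerClass line on the configuration slide; the first-letter routing scheme is purely illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route each word by its first character, so all words sharing a
// first letter are handled by the same reducer.
public class WCPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    String s = key.toString();
    if (s.isEmpty() || numReduceTasks == 0) {
      return 0;
    }
    return Character.toLowerCase(s.charAt(0)) % numReduceTasks;
  }
}

Enable it in main() with job.setPartitionerClass(WCPartitioner.class) and a matching job.setNumReduceTasks(...).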

Mini project 1
1. Learn to use HDFS
2. Read and run the wordcount example (.../mapred_tutorial.html)
3. Write a MR program for inverted index over /user/hadoop/prj1.txt
Implement two versions:
- script/exe + streaming
- Hadoop Java API
The input file has one "docID \t docContent" record per line. Generate the inverted index as lines of: word \t a list of "docID:position" entries. A map-side sketch for the Java version follows.
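A hedged sketch of the map side of the Java version (the parsing and the output encoding are assumptions about the assignment, the class name IndexMapper is invented, and "position" is taken to mean the word's offset within the document):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// For each word in a document, emit (word, "docID:position").
public class IndexMapper extends Mapper<Object, Text, Text, Text> {
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t", 2); // "docID \t docContent"
    if (parts.length < 2) return;                     // skip malformed lines
    StringTokenizer itr = new StringTokenizer(parts[1]);
    int pos = 0;
    while (itr.hasMoreTokens()) {
      context.write(new Text(itr.nextToken()), new Text(parts[0] + ":" + pos));
      pos++;
    }
  }
}

The reducer then only needs to concatenate, per word, the "docID:position" values it receives.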