MAPREDUCE Massive Data Processing (I)
Outline MapReduce Introduction Sample Code Program Prototype Programming using Eclipse
MapReduce Introduction: Overview
What is MapReduce A programming model for expressing distributed computations at a massive scale. A patented software framework introduced by Google, which used it to process more than 20 petabytes of data per day. Popularized by the open-source Hadoop project. Used at Yahoo!, Facebook, Amazon, …
Why MapReduce Scale "out", not "up": symmetric multiprocessing (SMP) and large shared-memory machines hit their limits. Move computing to the data: clusters have limited bandwidth. Hide system-level details from developers: no more race conditions, lock contention, etc. Separate the what from the how: the developer specifies the computation that needs to be performed; the execution framework ("runtime") handles the actual execution.
Locality Don’t move data to workers… move workers to the data! Store data on the local disks of nodes in the cluster Start up the workers on the node that has the data Why? Not enough RAM to hold all the data in memory Disk access is slow, but disk throughput is reasonable A distributed file system is the answer GFS (Google File System) for Google’s MapReduce HDFS (Hadoop Distributed File System) for Hadoop
MapReduce Introduction: Programming Model
Typical Large-Data Problem Iterate over a large number of records. Extract something of interest from each. Shuffle and sort intermediate results. Aggregate intermediate results. Generate final output. Key idea: provide a functional abstraction for two of these operations: Map (extract) and Reduce (aggregate).
How to Abstract
The framework is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms.
Map(...): N → N. Ex: [1, 2, 3, 4] --(*2)--> [2, 4, 6, 8]
Reduce(...): N → 1. Ex: [1, 2, 3, 4] --(sum)--> 10
Programmers specify two functions:
map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v3)
All values with the same key are sent to the same reducer.
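To make the analogy concrete, here is a minimal sketch in plain Java (using java.util.stream rather than Hadoop; the list and functions are illustrative):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapReduceAnalogy {
    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);

        // Map: apply a function to every element: [1,2,3,4] --(*2)--> [2,4,6,8]
        List<Integer> doubled = input.stream()
                .map(x -> x * 2)
                .collect(Collectors.toList());

        // Reduce: fold all elements down to a single value: [1,2,3,4] --(sum)--> 10
        int sum = input.stream().reduce(0, Integer::sum);

        System.out.println(doubled + " / " + sum);  // prints [2, 4, 6, 8] / 10
    }
}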
How to Abstract (cont.) The execution framework (runtime) handles: Scheduling: assigns workers to map and reduce tasks. Data distribution: moves processes to data. Synchronization: gathers, sorts, and shuffles intermediate data. Errors and faults: detects worker failures and restarts. Everything happens on top of a Distributed File System (DFS).
MapReduce Introduction: Implementation
Execution Overview
MapReduce: High Level
Nodes, Trackers, Tasks JobTracker: runs on the master node; accepts job requests from clients. TaskTracker: runs on the slave nodes; forks a separate Java process for each task instance.
Hadoop MapReduce w/ HDFS
Example - Wordcount Input: "Hello Cloud TA cool Hello TA cool". The mappers emit (Hello, 1), (Cloud, 1), (TA, 1), (cool, 1), (Hello, 1), (TA, 1), (cool, 1). Sort/Copy and Merge group the pairs by key: Hello [1 1], TA [1 1], Cloud [1], cool [1 1]. The reducers sum each list, producing the output: Hello 2, TA 2, Cloud 1, cool 2.
Outline MapReduce Introduction Sample Code Program Prototype Programming using Eclipse
Main function

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
  }
  Job job = new Job(conf, "word count");
  job.setJarByClass(wordcount.class);
  job.setMapperClass(mymapper.class);
  job.setCombinerClass(myreducer.class);
  job.setReducerClass(myreducer.class);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Mapper

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class mymapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
Mapper (cont.) Walking one line of the input file /user/hadoop/input/hi through the mapper: the input value is the line "Hi Cloud TA say Hi" (the input key is its offset in the file). value.toString() turns the Text value into a Java String; new StringTokenizer(line) splits the line into the tokens Hi, Cloud, TA, say, Hi; and the while loop emits one (word, 1) pair per token via context.write(word, one).
Reducer

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class myreducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Reducer (cont.) E.g., the key "Hi" arrives with the value list [1, 1], and the reducer emits ("Hi", 2).
Outline MapReduce Introduction Sample Code Program Prototype Programming using Eclipse
MapReduce Terminology Job: a "full program", an execution of a Mapper and Reducer across a data set. Task: an execution of a Mapper or a Reducer on a slice of data. Task Attempt: a particular instance of an attempt to execute a task on a machine.
Main Class

class MR {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "job name");
    job.setJarByClass(thisMainClass.class);
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
Job Identify the classes implementing the Mapper and Reducer interfaces: job.setMapperClass(), job.setReducerClass(). Specify inputs and outputs: FileInputFormat.addInputPath(), FileOutputFormat.setOutputPath(). Optionally, other options too: job.setNumReduceTasks(), job.setOutputFormatClass(), …
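As a sketch of the optional settings (the values and class name here are illustrative; the new mapreduce API sets the output format via setOutputFormatClass):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class JobOptions {
    // Apply the optional settings mentioned above to a Job.
    public static Job configure(Configuration conf) throws Exception {
        Job job = new Job(conf, "word count");            // same constructor as the sample code
        job.setNumReduceTasks(4);                          // run four reduce tasks instead of the default
        job.setOutputFormatClass(TextOutputFormat.class);  // one "key<TAB>value" line per record
        return job;
    }
}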
Class Mapper Maps input key/value pairs to a set of intermediate key/value pairs. Ex:

class MyMapper extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  // global variables
  public void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    // local variables
    ...
    context.write(key', value'); // emit an intermediate (key, value) pair
  }
}

The four generic parameters name the input (key, value) and output (key, value) classes.
Text, IntWritable, LongWritable, … Hadoop defines its own "box" classes: Strings: Text; Integers: IntWritable; Longs: LongWritable. Any (WritableComparable, Writable) can be sent to the reducer. All keys are instances of WritableComparable; all values are instances of Writable.
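A minimal sketch of the box classes in action (plain Java with the Hadoop client library on the classpath; the names are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class BoxClassDemo {
    public static void main(String[] args) {
        Text word = new Text("hello");               // String -> Text
        IntWritable count = new IntWritable(1);      // int    -> IntWritable
        LongWritable offset = new LongWritable(42L); // long   -> LongWritable

        // Unbox with get(), rebox with set()
        count.set(count.get() + 1);

        System.out.println(word + "\t" + count + "\t" + offset); // hello  2  42
    }
}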
Read Data
Mappers Upper-case Mapper Ex: let map(k, v) = emit(k.toUpper(), v.toUpper()) (“foo”, “bar”) → (“FOO”, “BAR”) (“Foo”, “other”) → (“FOO”, “OTHER”) (“key2”, “data”) → (“KEY2”, “DATA”) Explode Mapper let map(k, v) = for each char c in v: emit(k, c) (“A”, “cats”) → (“A”, “c”), (“A”, “a”), (“A”, “t”), (“A”, “s”) (“B”, “hi”) → (“B”, “h”), (“B”, “i”) Filter Mapper let map(k, v) = if (isPrime(v)) then emit(k, v) (“foo”, 7) → (“foo”, 7) (“test”, 10) → (nothing)
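As a sketch, the first of these might look like the following in real Hadoop code (new mapreduce API; the class name and Text/Text types are illustrative assumptions):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Upper-case Mapper: let map(k, v) = emit(k.toUpper(), v.toUpper())
public class UpperCaseMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(key.toString().toUpperCase()),
                      new Text(value.toString().toUpperCase()));
    }
}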
Class Reducer Reduces a set of intermediate values which share a key to a smaller set of values. Ex:

class MyReducer extends Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  // global variables
  public void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
      throws IOException, InterruptedException {
    // local variables
    ...
    context.write(key', value'); // emit a final (key, value) pair
  }
}

The four generic parameters name the input (key, value) and output (key, value) classes.
Reducers
Sum Reducer: let reduce(k, vals) = { sum = 0; foreach int v in vals: sum += v; emit(k, sum) }
("A", [42, 100, 312]) → ("A", 454)
Identity Reducer: let reduce(k, vals) = foreach v in vals: emit(k, v)
("A", [42, 100, 312]) → ("A", 42), ("A", 100), ("A", 312)
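A sketch of the Identity Reducer in the new mapreduce API (the Sum Reducer already appears in full as myreducer in the sample code; the class name and Text types are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Identity Reducer: let reduce(k, vals) = foreach v in vals: emit(k, v)
public class IdentityReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text v : values) {
            context.write(key, v); // pass every value through unchanged
        }
    }
}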
Performance Consideration Ideal scaling characteristics: Twice the data, twice the running time Twice the resources, half the running time Why can’t we achieve this? Synchronization requires communication Communication kills performance Thus… avoid communication! Reduce intermediate data via local aggregation Combiners can help
Partitioner and Combiner Partitioner function: routes the same keys to the same reducer over the network. A default partitioning function based on hashing is provided; in some cases it is useful to partition data by some other function of the key. Combiner function: avoids communication via local aggregation. Synchronization requires communication, and communication kills performance; partial combining significantly speeds up certain classes of MapReduce operations. A sketch of a custom partitioner follows.
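For example, a custom partitioner replacing the default hash partitioner might route keys by their first letter (the class name and policy are illustrative assumptions):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Send all keys that start with the same letter to the same reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        char first = s.isEmpty() ? '\0' : Character.toLowerCase(s.charAt(0));
        return (first & Integer.MAX_VALUE) % numPartitions; // non-negative bucket index
    }
}

It would be registered with job.setPartitionerClass(FirstLetterPartitioner.class); a combiner is registered the same way via job.setCombinerClass(), as the WordCount driver above already does with myreducer.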
Outline MapReduce Introduction Sample Code Program Prototype Programming using Eclipse
MR package, Mapper Class
Reducer Class
MR Driver (Main class)
Run on Hadoop
Run on Hadoop (cont.)
MapReduce Example Example name: Wordcount. Counts the number of occurrences of each word in a file. See orial.html#Example%3A+WordCount+v1.0. This is the most basic example for learning MapReduce.
Example: MapReduce (Mapper)

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

For the complete code, see the MapReduce Tutorial.
Example: MapReduce (Reducer)

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

For the complete code, see the MapReduce Tutorial.
WordCount Practice
1. Change to the home directory: cd ~
2. Create a text file named after yourself with an _input suffix, containing "I like ITRI.": echo "I like ITRI." > name_input
3. Create a directory named after yourself in HDFS: sudo hadoop fs -mkdir /user/name
4. Check that the directory was created: sudo hadoop fs -ls /user
5. Change the owner of the directory: sudo hadoop fs -chown user1:user1 /user/name
WordCount Practice
6. Check that the owner was changed: hadoop fs -ls /user
7. Put the file into HDFS: hadoop fs -put name_input /user/name
8. Check that the file was uploaded: hadoop fs -ls /user/name
9. Change to the working directory to prepare to run the job
10. Run wordcount: sudo hadoop jar hadoop dev-examples.jar wordcount /user/name/name_input /user/name/name_output
Job done.
11. Check that the output was produced: hadoop fs -ls /user/name/name_output
12. View the output file: hadoop fs -cat /user/name/name_output/part-r-*
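For the one-line input file above ("I like ITRI."), the output should contain one tab-separated line per distinct word with its count, along the lines of (assuming the default whitespace tokenization and key ordering):

I	1
ITRI.	1
like	1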