Introduction to Google MapReduce WING Group Meeting 13 Oct 2006 Hendra Setiawan.

Slides:



Advertisements
Similar presentations
MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
Advertisements

Intro to Map-Reduce Feb 21, map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…
Problem-solving on large-scale clusters: theory and applications Lecture 3: Bringing it all together.
Chapter 2 Data Models Database Systems: Design, Implementation, and Management, Eleventh Edition, Coronel & Morris.
Google MapReduce Framework A Summary of: MapReduce & Hadoop API Slides prepared by Peter Erickson
CS246 TA Session: Hadoop Tutorial Peyman kazemian 1/11/2011.
Intro to Map-Reduce Feb 4, map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…
CS 525 Advanced Distributed Systems Spring 2015 Indranil Gupta (Indy) Lecture 3 Cloud Computing (Contd.) January 27, 2015 All slides © IG 1.
Hadoop: Nuts and Bolts Data-Intensive Information Processing Applications ― Session #2 Jimmy Lin University of Maryland Tuesday, February 2, 2010 This.
Poly Hadoop CSC 550 May 22, 2007 Scott Griffin Daniel Jackson Alexander Sideropoulos Anton Snisarenko.
An Introduction to MapReduce: Abstractions and Beyond! -by- Timothy Carlstrom Joshua Dick Gerard Dwan Eric Griffel Zachary Kleinfeld Peter Lucia Evan May.
CS 425 / ECE 428 Distributed Systems Fall 2014 Indranil Gupta (Indy) Lecture 3: Mapreduce and Hadoop All slides © IG.
L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.
MapReduce : Simplified Data Processing on Large Clusters Hongwei Wang & Sihuizi Jin & Yajing Zhang
MapReduce CSE 454 Slides based on those by Jeff Dean, Sanjay Ghemawat, and Dan Weld.
Google Distributed System and Hadoop Lakshmi Thyagarajan.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Hadoop, Hadoop, Hadoop!!! Jerome Mitchell Indiana University.
Lecture 3-1 Computer Science 425 Distributed Systems CS 425 / ECE 428 Fall 2013 Indranil Gupta (Indy) Sep 3, 2013 Lecture 3 Cloud Computing - 2  2013,
MapReduce Programming Yue-Shan Chang. split 0 split 1 split 2 split 3 split 4 worker Master User Program output file 0 output file 1 (1) fork (2) assign.
Lecture 3-1 Computer Science 425 Distributed Systems CS 425 / CSE 424 / ECE 428 Fall 2012 Indranil Gupta (Indy) Sep 4, 2012 Lecture 3 Cloud Computing -
SIDDHARTH MEHTA PURSUING MASTERS IN COMPUTER SCIENCE (FALL 2008) INTERESTS: SYSTEMS, WEB.
1 Map-reduce programming paradigm Some slides are from lecture of Matei Zaharia, and distributed computing seminar by Christophe Bisciglia, Aaron Kimball,
MapReduce.
Computing & Information Sciences Kansas State University Kansas State University Olathe Workshop on Big Data – August, 2014 KSU Laboratory for Knowledge.
Take a Close Look at MapReduce Xuanhua Shi. Acknowledgement  Most of the slides are from Dr. Bing Chen,
MapReduce and Hadoop 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 2: MapReduce and Hadoop Mining Massive.
Introduction to MapReduce and Hadoop
Computing & Information Sciences Kansas State University How-To Wednesday, 24 Mar 2010 Laboratory for Knowledge Discovery in Databases KSU CIS Department.
MapReduce Costin Raiciu Advanced Topics in Distributed Systems, 2011.
Map Reduce Programming Waue Chen. Why ? Moore’s law ?  每隔 18 個月, CPU 的主頻就會增加一倍  2005 開始失效 多核及平行運算時代來臨.
f ACT s  Data intensive applications with Petabytes of data  Web pages billion web pages x 20KB = 400+ terabytes  One computer can read
HAMS Technologies 1
MapReduce How to painlessly process terabytes of data.
Hadoop Introduction Wang Xiaobo Outline Install hadoop HDFS MapReduce WordCount Analyzing Compile image data TeleNav Confidential.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
MapReduce Costin Raiciu Advanced Topics in Distributed Systems, 2012.
Writing a MapReduce Program 1. Agenda  How to use the Hadoop API to write a MapReduce program in Java  How to use the Streaming API to write Mappers.
Database Applications (15-415) Part II- Hadoop Lecture 26, April 21, 2015 Mohammad Hammoud.
MapReduce Programming Model. HP Cluster Computing Challenges  Programmability: need to parallelize algorithms manually  Must look at problems from parallel.
– 1 – MapReduce Programming Oct 25, 2011 Topics Large-scale computing Traditional high-performance computing (HPC) Cluster computing MapReduce Definition.
Before we start, please download: VirtualBox: – The Hortonworks Data Platform: –
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
Map-Reduce Big Data, Map-Reduce, Apache Hadoop SoftUni Team Technical Trainers Software University
Big Data,Map-Reduce, Hadoop. Presentation Overview What is Big Data? What is map-reduce? input/output data types why is it useful and where is it used?
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
Introduction to Hadoop Owen O’Malley Yahoo!, Grid Team Modified by R. Cook.
C-Store: MapReduce Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 22, 2009.
Team3: Xiaokui Shu, Ron Cohen CS5604 at Virginia Tech December 6, 2010.
Map Reduce. Functional Programming Review r Functional operations do not modify data structures: They always create new ones r Original data still exists.
Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project.
MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.
PARLab Parallel Boot Camp Cloud Computing with MapReduce and Hadoop Matei Zaharia Electrical Engineering and Computer Sciences University of California,
Data-Intensive Computing with MapReduce Jimmy Lin University of Maryland Thursday, January 31, 2013 Session 2: Hadoop Nuts and Bolts This work is licensed.
Airlinecount CSCE 587 Spring Preliminary steps in the VM First: log in to vm Ex: ssh vm-hadoop-XX.cse.sc.edu -p222 Where: XX is the vm number assigned.
HADOOP Priyanshu Jha A.D.Dilip 6 th IT. Map Reduce patented[1] software framework introduced by Google to support distributed computing on large data.
MapReduce: Simplied Data Processing on Large Clusters Written By: Jeffrey Dean and Sanjay Ghemawat Presented By: Manoher Shatha & Naveen Kumar Ratkal.
Lecture 4: Mapreduce and Hadoop
Introduction to Google MapReduce
Map-reduce programming paradigm
Lecture 4: Mapreduce and Hadoop
Central Florida Business Intelligence User Group
Lecture 3: Bringing it all together
Database Applications (15-415) Hadoop Lecture 26, April 19, 2016
MIT 802 Introduction to Data Platforms and Sources Lecture 2
Lecture 18 (Hadoop: Programming Examples)
Lecture 4: Mapreduce and Hadoop
Big Data Technology: Introduction to Hadoop
MIT 802 Introduction to Data Platforms and Sources Lecture 2
MapReduce: Simplified Data Processing on Large Clusters
Presentation transcript:

Introduction to Google MapReduce WING Group Meeting 13 Oct 2006 Hendra Setiawan

What is MapReduce?  A programming model (& its associated implementation)  For processing large data set  Exploits large set of commodity computers  Executes process in distributed manner  Offers high degree of transparencies  In other words:  simple and maybe suitable for your tasks !!!

Distributed Grep Very big data Split data grep matches cat All matches

Distributed Word Count Very big data Split data count merge merged count

Map Reduce  Map:  Accepts input key/value pair  Emits intermediate key/value pair  Reduce :  Accepts intermediate key/value* pair  Emits output key/value pair Very big data Result MAPMAP REDUCEREDUCE Partitioning Function

Partitioning Function

Partitioning Function (2)  Default : hash(key) mod R  Guarantee:  Relatively well-balanced partitions  Ordering guarantee within partition  Distributed Sort  Map: emit(key,value)  Reduce (with R=1): emit(key,value)

MapReduce  Distributed Grep  Map: if match(value,pattern) emit(value,1)  Reduce: emit(key,sum(value*))  Distributed Word Count  Map: for all w in value do emit(w,1)  Reduce: emit(key,sum(value*))

MapReduce Transparencies Plus Google Distributed File System :  Parallelization  Fault-tolerance  Locality optimization  Load balancing

Suitable for your task if  Have a cluster  Working with large dataset  Working with independent data (or assumed)  Can be cast into map and reduce

MapReduce outside Google  Hadoop (Java)  Emulates MapReduce and GFS  The architecture of Hadoop MapReduce and DFS is master/slave MasterSlave MapReducejobtrackertasktracker DFSnamenodedatanode

Example Word Count (1)  Map public static class MapClass extends MapReduceBase implements Mapper { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException { String line = ((Text)value).toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); }

Example Word Count (2)  Reduce public static class Reduce extends MapReduceBase implements Reducer { public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += ((IntWritable) values.next()).get(); } output.collect(key, new IntWritable(sum)); }

Example Word Count (3)  Main public static void main(String[] args) throws IOException { //checking goes here JobConf conf = new JobConf(); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(MapClass.class); conf.setCombinerClass(Reduce.class); conf.setReducerClass(Reduce.class); conf.setInputPath(new Path(args[0])); conf.setOutputPath(new Path(args[1])); JobClient.runJob(conf); }

One time setup  set hadoop-site.xml and slaves  Initiate namenode  Run Hadoop MapReduce and DFS  Upload your data to DFS  Run your process…  Download your data from DFS

Summary  A simple programming model for processing large dataset on large set of computer cluster  Fun to use, focus on problem, and let the library deal with the messy detail

References  Original paper (  On wikipedia (  Hadoop – MapReduce in Java (  Starfish - MapReduce in Ruby (