An Introduction to MapReduce: Abstractions and Beyond! -by- Timothy Carlstrom Joshua Dick Gerard Dwan Eric Griffel Zachary Kleinfeld Peter Lucia Evan May.

Slides:



Advertisements
Similar presentations
MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
Advertisements

Intro to Map-Reduce Feb 21, map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…
Beyond map/reduce functions partitioner, combiner and parameter configuration Gang Luo Sept. 9, 2010.
Parallel and Distributed Computing: MapReduce Alona Fyshe.
Chapter 2 Data Models Database Systems: Design, Implementation, and Management, Eleventh Edition, Coronel & Morris.
Google MapReduce Framework A Summary of: MapReduce & Hadoop API Slides prepared by Peter Erickson
Lecture 11 – Hadoop Technical Introduction. Terminology Google calls it:Hadoop equivalent: MapReduceHadoop GFSHDFS BigtableHBase ChubbyZookeeper.
CS246 TA Session: Hadoop Tutorial Peyman kazemian 1/11/2011.
Intro to Map-Reduce Feb 4, map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…
MapReduce: Simplified Data Processing on Large Clusters Cloud Computing Seminar SEECS, NUST By Dr. Zahid Anwar.
CS 425 / ECE 428 Distributed Systems Fall 2014 Indranil Gupta (Indy) Lecture 3: Mapreduce and Hadoop All slides © IG.
L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.
Introduction to Google MapReduce WING Group Meeting 13 Oct 2006 Hendra Setiawan.
Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.
MapReduce : Simplified Data Processing on Large Clusters Hongwei Wang & Sihuizi Jin & Yajing Zhang
MapReduce CSE 454 Slides based on those by Jeff Dean, Sanjay Ghemawat, and Dan Weld.
Lecture 3-1 Computer Science 425 Distributed Systems CS 425 / ECE 428 Fall 2013 Indranil Gupta (Indy) Sep 3, 2013 Lecture 3 Cloud Computing - 2  2013,
MapReduce Programming Yue-Shan Chang. split 0 split 1 split 2 split 3 split 4 worker Master User Program output file 0 output file 1 (1) fork (2) assign.
Lecture 3-1 Computer Science 425 Distributed Systems CS 425 / CSE 424 / ECE 428 Fall 2012 Indranil Gupta (Indy) Sep 4, 2012 Lecture 3 Cloud Computing -
1 Map-reduce programming paradigm Some slides are from lecture of Matei Zaharia, and distributed computing seminar by Christophe Bisciglia, Aaron Kimball,
MapReduce.
Introduction to Parallel Programming MapReduce Except where otherwise noted all portions of this work are Copyright (c) 2007 Google and are licensed under.
Map Reduce and Hadoop S. Sudarshan, IIT Bombay
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
Introduction to MapReduce and Hadoop
MapReduce Costin Raiciu Advanced Topics in Distributed Systems, 2011.
Map Reduce Programming Waue Chen. Why ? Moore’s law ?  每隔 18 個月, CPU 的主頻就會增加一倍  2005 開始失效 多核及平行運算時代來臨.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
HAMS Technologies 1
Google’s MapReduce Connor Poske Florida State University.
Hadoop Introduction Wang Xiaobo Outline Install hadoop HDFS MapReduce WordCount Analyzing Compile image data TeleNav Confidential.
Big Data for Relational Practitioners Len Wyatt Program Manager Microsoft Corporation DBI225.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
MapReduce Costin Raiciu Advanced Topics in Distributed Systems, 2012.
Writing a MapReduce Program 1. Agenda  How to use the Hadoop API to write a MapReduce program in Java  How to use the Streaming API to write Mappers.
MapReduce Programming Model. HP Cluster Computing Challenges  Programmability: need to parallelize algorithms manually  Must look at problems from parallel.
SLIDE 1IS 240 – Spring 2013 MapReduce, HBase, and Hive University of California, Berkeley School of Information IS 257: Database Management.
Before we start, please download: VirtualBox: – The Hortonworks Data Platform: –
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
Introduction to Hadoop Owen O’Malley Yahoo!, Grid Team Modified by R. Cook.
C-Store: MapReduce Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 22, 2009.
Team3: Xiaokui Shu, Ron Cohen CS5604 at Virginia Tech December 6, 2010.
Map Reduce. Functional Programming Review r Functional operations do not modify data structures: They always create new ones r Original data still exists.
Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project.
Airlinecount CSCE 587 Spring Preliminary steps in the VM First: log in to vm Ex: ssh vm-hadoop-XX.cse.sc.edu -p222 Where: XX is the vm number assigned.
HADOOP Priyanshu Jha A.D.Dilip 6 th IT. Map Reduce patented[1] software framework introduced by Google to support distributed computing on large data.
Distributed Systems Lecture 3 Big Data and MapReduce 1.
Hadoop&Hbase Developed Using JAVA USE NETBEANS IDE.
MapReduce: Simplied Data Processing on Large Clusters Written By: Jeffrey Dean and Sanjay Ghemawat Presented By: Manoher Shatha & Naveen Kumar Ratkal.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Item Based Recommender System SUPERVISED BY: DR. MANISH KUMAR BAJPAI TARUN BHATIA ( ) VAIBHAV JAISWAL( )
Sort in MapReduce. MapReduce Block 1 Block 2 Block 3 Block 4 Block 5 Map Reduce Output 1 Output 2 Shuffle/Sort.
Lecture 3 – MapReduce: Implementation CSE 490h – Introduction to Distributed Computing, Spring 2009 Except as otherwise noted, the content of this presentation.
Lecture 4: Mapreduce and Hadoop
Introduction to Google MapReduce
Map Reduce.
Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn.
Map-reduce programming paradigm
Central Florida Business Intelligence User Group
Lecture 11 – Hadoop Technical Introduction
Ministry of Higher Education
Airlinecount CSCE 587 Fall 2017.
MIT 802 Introduction to Data Platforms and Sources Lecture 2
Lecture 18 (Hadoop: Programming Examples)
Charles Tappert Seidenberg School of CSIS, Pace University
Chapter X: Big Data.
Big Data Technology: Introduction to Hadoop
MIT 802 Introduction to Data Platforms and Sources Lecture 2
MapReduce: Simplified Data Processing on Large Clusters
Presentation transcript:

An Introduction to MapReduce: Abstractions and Beyond! -by- Timothy Carlstrom Joshua Dick Gerard Dwan Eric Griffel Zachary Kleinfeld Peter Lucia Evan May Lauren Olver Dylan Streb Ryan Svoboda

What We’ll Be Covering… Background information/overview Map abstraction – Pseudocode example Reduce abstraction – Yet another pseudocode example Combining the map and reduce abstractions Why MapReduce is “better” Examples and applications of MapReduce

Before MapReduce… Large scale data processing was difficult! – Managing hundreds or thousands of processors – Managing parallelization and distribution – I/O Scheduling – Status and monitoring – Fault/crash tolerance MapReduce provides all of these, easily! Source:

MapReduce Overview What is it? – Programming model used by Google – A combination of the Map and Reduce models with an associated implementation – Used for processing and generating large data sets

MapReduce Overview How does it solve our previously mentioned problems? – MapReduce is highly scalable and can be used across many computers. – Many small machines can be used to process jobs that normally could not be processed by a large machine.

Map Abstraction Inputs a key/value pair – Key is a reference to the input value – Value is the data set on which to operate Evaluation – Function defined by user – Applies to every value in value input Might need to parse input Produces a new list of key/value pairs – Can be different type from input pair

Map Example

Reduce Abstraction Starts with intermediate Key / Value pairs Ends with finalized Key / Value pairs Starting pairs are sorted by key Iterator supplies the values for a given key to the Reduce function.

Reduce Abstraction Typically a function that: – Starts with a large number of key/value pairs One key/value for each word in all files being greped (including multiple entries for the same word) – Ends with very few key/value pairs One key/value for each unique word across all the files with the number of instances summed into this entry Broken up so a given worker works with input of the same key.

Reduce Example

How Map and Reduce Work Together Map returns information Reduces accepts information Reduce applies a user defined function to reduce the amount of data

How Map and Reduce Work Together Map returns information Reduces accepts information Reduce applies a user defined function to reduce the amount of data

Other Applications Yahoo! – Webmap application uses Hadoop to create a database of information on all known webpages Facebook – Hive data center uses Hadoop to provide business statistics to application developers and advertisers Rackspace – Analyzes sever log files and usage data using Hadoop

Why is this approach better? Creates an abstraction for dealing with complex overhead – The computations are simple, the overhead is messy Removing the overhead makes programs much smaller and thus easier to use – Less testing is required as well. The MapReduce libraries can be assumed to work properly, so only user code needs to be tested Division of labor also handled by the MapReduce libraries, so programmers only need to focus on the actual computation

MapReduce Example 1/4 package org.myorg; import java.io.IOException; import java.util.*; import org.apache.hadoop.fs.Path; import org.apache.hadoop.conf.*; import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; import org.apache.hadoop.util.*; public class WordCount { public static class Map extends MapReduceBase implements Mapper { private final static IntWritable one = new IntWritable(1); private Text word = new Text();

MapReduce Example 2/4 public void map(LongWritable key, Text value, OutputCollector output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); output.collect(word, one); }

MapReduce Example 3/4 public static class Reduce extends MapReduceBase implements Reducer { public void reduce(Text key, Iterator values, OutputCollector output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); }

MapReduce Example 4/4 public static void main(String[] args) throws Exception { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(Map.class); conf.setCombinerClass(Reduce.class); conf.setReducerClass(Reduce.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); conf.setInputPath(new Path(args[0])); conf.setOutputPath(new Path(args[1])); JobClient.runJob(conf); }

Summary Map reads in text and creates a {, 1} pair for every word read Reduce then takes all of those pairs and counts them up to produce the final count. – If there were 20 {word, 1} pairs, the final output of Reduce would be a single {word, 20} pair