Beyond map/reduce functions: partitioner, combiner and parameter configuration. Gang Luo, Sept. 9, 2010.


Partitioner
Determines which reducer (partition) a record should go to.
Given the key, the value, and the number of partitions, it returns an integer:
–Partition: (K2, V2, #Partitions) → integer

Partitioner
Interface:
public interface Partitioner<K2, V2> extends JobConfigurable {
    int getPartition(K2 key, V2 value, int numPartitions);
}
Implementation:
public class myPartitioner implements Partitioner<K2, V2> {
    public int getPartition(K2 key, V2 value, int numPartitions) {
        // your logic!
    }
}

Partitioner
Example:
public class myPartitioner implements Partitioner<Text, Text> {
    public int getPartition(Text key, Text value, int numPartitions) {
        int hashCode = key.hashCode();
        // mask off the sign bit so a negative hashCode still yields a valid index
        int partitionIndex = (hashCode & Integer.MAX_VALUE) % numPartitions;
        return partitionIndex;
    }
}
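The hashing logic in the example can be tried outside Hadoop. Below is a minimal plain-Java sketch (the class name HashPartitionDemo is illustrative, and there is no Hadoop dependency); it shows why the sign bit is masked: hashCode() can be negative, and an unmasked negative result would produce an invalid partition index.

```java
// Plain-Java sketch of hash partitioning: map a key to one of
// numPartitions buckets, masking the sign bit so negative hash
// codes still produce an index in [0, numPartitions).
public class HashPartitionDemo {
    static int getPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] keys = {"apple", "banana", "cherry", "polygenelubricants"};
        for (String k : keys) {
            int p = getPartition(k, 4);
            System.out.println(k + " -> partition " + p);
        }
    }
}
```

Note that the same key always lands in the same partition, which is exactly what the shuffle relies on to bring all values for a key to one reducer.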

Combiner
Reduces the amount of intermediate data before it is sent to the reducers (pre-aggregation).
The interface is exactly the same as the reducer's.
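To see what pre-aggregation buys, here is a plain-Java sketch (the class name CombinerDemo is illustrative, no Hadoop dependency) that simulates a word-count combiner summing counts on the map side before anything is shipped across the network:

```java
import java.util.HashMap;
import java.util.Map;

// Simulates a word-count combiner: locally sum the (word, 1)
// records a mapper emits, so fewer records reach the shuffle.
public class CombinerDemo {
    static Map<String, Integer> combine(String[] mapOutputKeys) {
        Map<String, Integer> partial = new HashMap<>();
        for (String key : mapOutputKeys) {
            partial.merge(key, 1, Integer::sum); // local pre-aggregation
        }
        return partial;
    }

    public static void main(String[] args) {
        String[] words = {"the", "cat", "the", "hat", "the"};
        Map<String, Integer> combined = combine(words);
        // 5 intermediate (word, 1) records collapse to 3 partial sums
        System.out.println(combined.size() + " records instead of " + words.length);
        System.out.println("the -> " + combined.get("the"));
    }
}
```

The reducers then add up the partial sums; this only works because addition is associative and commutative, which is why a combiner must be an optional optimization, not required logic.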

Combiner
Example:
public class myCombiner extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {
        // your logic
    }
}
The combiner's input key/value types must match the map output key/value types, and its output key/value types must match the reducer's input key/value types.

Parameter
Cluster-level parameters (e.g. HDFS block size)
Job-specific parameters (e.g. number of reducers, map output buffer size)
–Configurable; important for job performance
User-defined parameters
–Used to pass information from the driver to the mapper/reducer
–Help make your mapper/reducer more generic

Parameter
JobConf conf = new JobConf(Driver.class);

conf.setNumReduceTasks(10);        // set the number of reducers via a built-in setter
conf.set("io.sort.mb", "200");     // set the map output buffer size by parameter name
conf.set("delimiter", "\t");       // set a user-defined parameter

int numReducers = conf.getNumReduceTasks();       // read a value via a built-in getter
String buffSize = conf.get("io.sort.mb", "200");  // read a value by name (with a default)
String delimiter = conf.get("delimiter", "\t");   // read a user-defined parameter (with a default)

Parameter
There are some built-in parameters managed by Hadoop. We are not supposed to change them, but we can read them:
–String inputFile = jobConf.get("map.input.file");
–Gets the path to the current input file
–Useful when joining datasets
*For the new API, use instead:
FileSplit split = (FileSplit) context.getInputSplit();
String inputFile = split.getPath().toString();

More about Hadoop
Identity mapper/reducer
–Output == input; no modification
Why would we need map/reduce functions with no logic in them?
–Sorting!
–More generally, whenever you only want the basic functionality Hadoop provides (e.g. sorting/grouping)
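The point can be simulated in plain Java (the class name IdentitySortDemo is illustrative, no Hadoop dependency): with identity map and reduce functions, the only work left is the framework's shuffle, which sorts records by key, so the "no-op" job is really a distributed sort.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Identity map -> shuffle (sort by key) -> identity reduce.
// The user code does nothing; the sorted order comes for free
// from the framework's shuffle phase.
public class IdentitySortDemo {
    static List<Map.Entry<String, String>> shuffleSort(List<Map.Entry<String, String>> records) {
        records.sort(Map.Entry.comparingByKey()); // what the shuffle does by key
        return records;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        records.add(new SimpleEntry<>("banana", "v1"));
        records.add(new SimpleEntry<>("apple", "v2"));
        records.add(new SimpleEntry<>("cherry", "v3"));

        // identity map: emit records unchanged (nothing to do)
        shuffleSort(records); // framework's sort-by-key
        // identity reduce: emit records unchanged

        for (Map.Entry<String, String> r : records) {
            System.out.println(r.getKey() + "\t" + r.getValue()); // apple, banana, cherry
        }
    }
}
```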

More about Hadoop
How is the number of splits determined?
–If a file is large enough and splittable, it is split into multiple pieces (split size = block size)
–If a file is non-splittable, there is only one split
–If a file is small (smaller than a block), there is one split per file, unless...
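The rules above can be sketched as a small calculation (the class name SplitCountDemo is illustrative, no Hadoop dependency; this ignores details of the real implementation such as the slop factor that lets the last split be slightly larger than a block):

```java
// Sketch of the split-count rules: one split for non-splittable
// or small files, otherwise ceil(fileSize / blockSize) splits.
public class SplitCountDemo {
    static long numSplits(long fileSize, long blockSize, boolean splittable) {
        if (!splittable) return 1;                      // e.g. a gzip file: one split
        if (fileSize <= blockSize) return 1;            // small file: one split
        return (fileSize + blockSize - 1) / blockSize;  // ceil(fileSize / blockSize)
    }

    public static void main(String[] args) {
        long block = 128L * 1024 * 1024; // assume a 128 MB block size
        System.out.println(numSplits(300L * 1024 * 1024, block, true));  // 3
        System.out.println(numSplits(10L * 1024 * 1024, block, true));   // 1
        System.out.println(numSplits(300L * 1024 * 1024, block, false)); // 1
    }
}
```

This is also why many small files are a problem: each file costs at least one split, hence one mapper, which motivates CombineFileInputFormat below.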

More about Hadoop
CombineFileInputFormat
–Merges multiple small files into one split, which is processed by one mapper
–Saves mapper slots and reduces overhead
Other options for handling small files:
–hadoop fs -getmerge src dest