Hadoop: The Definitive Guide Chap. 8 MapReduce Features


Hadoop: The Definitive Guide Chap. 8 MapReduce Features
Kisung Kim

Contents
- Counters
- Sorting
- Joins
- Side Data Distribution

Counters
- Counters are a useful channel for gathering statistics about a job and for problem diagnosis, e.g. counting the number of invalid records
- Easier to use and to retrieve than logging
- Built-in counters report various metrics for MapReduce jobs

Some of the built-in counters:
- Map-Reduce Framework: map input records, map skipped records, combine input records, reduce output records
- File Systems: filesystem bytes read, filesystem bytes written
- Job Counters: launched map tasks, failed map tasks, data-local map tasks, rack-local map tasks

User-Defined Java Counters
- MapReduce allows user code to define a set of counters, which are then incremented as desired in the mapper or reducer
- Counters are defined by a Java enum: the name of the enum is the group name, and the enum's fields are the counter names
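
As a minimal sketch (old org.apache.hadoop.mapred API), a counter enum and its use in a mapper might look like this; the Temperature group and the MISSING/MALFORMED names follow the book's MaxTemperatureWithCounters example, while the record-validity check is an illustrative stand-in:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MaxTemperatureMapperWithCounters extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      // Enum name = counter group; fields = counter names
      enum Temperature { MISSING, MALFORMED }

      @Override
      public void map(LongWritable key, Text value,
          OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        String line = value.toString();
        if (line.isEmpty()) { // stand-in for a real validity check
          reporter.incrCounter(Temperature.MISSING, 1);
          return;
        }
        // ... parse the record and emit (year, temperature) as usual
      }
    }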

User-Defined Java Counters
- When the job has completed successfully, it prints out the counters at the end
- Readable names for counters: create a properties file named after the enum, using an underscore as a separator for nested classes, e.g. MaxTemperatureWithCounters_Temperature.properties
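
For the Temperature enum above, the properties file would contain entries along these lines (the display names follow the book's example and should be treated as illustrative):

    CounterGroupName=Air Temperature Records
    MISSING.name=Missing
    MALFORMED.name=Malformed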

Sorting
- By default, MapReduce sorts records by their keys
- A job with, say, 30 reducers produces 30 output files, each of which is sorted
- However, there is no easy way to combine the files into a globally sorted result (this is only a partial sort)
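
A minimal sketch of such a job (old API; the class and path names are assumptions), in which each of the 30 reducers emits one sorted SequenceFile:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class PartialSortSketch {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PartialSortSketch.class);
        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        conf.setNumReduceTasks(30); // 30 sorted files, no global order
        FileInputFormat.addInputPath(conf, new Path("input"));
        FileOutputFormat.setOutputPath(conf, new Path("output"));
        JobClient.runJob(conf); // identity map/reduce; the shuffle sorts by key
      }
    }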

Total Sort
- Produce a set of sorted files that, if concatenated, would form a globally sorted file
- Use a partitioner that respects the total order of the output, e.g. a range partitioner (see the sketch below)
- Although this approach works, you have to choose your partition sizes carefully to ensure that they are fairly even, so that job times aren't dominated by a single reducer; fixed cut points over skewed data are an example of bad partitioning
- To construct more even partitions, we need a better understanding of the key distribution for the whole dataset
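
A minimal sketch of a hand-rolled range partitioner over temperature keys (old API); the fixed, evenly spaced cut points are an assumption, and on skewed data they produce exactly the uneven partitions described above:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class TemperatureRangePartitioner
        implements Partitioner<IntWritable, Text> {

      @Override
      public void configure(JobConf job) { }

      @Override
      public int getPartition(IntWritable key, Text value, int numPartitions) {
        // Evenly spaced cut points over an assumed range of -100..400
        // (tenths of a degree Celsius); real data is rarely this uniform
        int bucket = (key.get() + 100) * numPartitions / 500;
        return Math.max(0, Math.min(numPartitions - 1, bucket));
      }
    }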

Sampling
- It is possible to get a fairly even set of partitions by sampling the key space
- The InputSampler class defines a nested Sampler interface whose implementations return a sample of keys, given an InputFormat and a JobConf
- Types of sampler: RandomSampler, SplitSampler, IntervalSampler

Example of Sampling
- RandomSampler chooses keys with a uniform probability (here, 0.1)
- There are also parameters for the maximum number of samples to take and the maximum number of splits to sample (here, 10,000 and 10)
- Samplers run on the client, making it important to limit the number of splits that are downloaded, so the sampler runs quickly
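
Put together, the sampling step might look roughly like this (old API; the input/output scaffolding assumes a SequenceFile of IntWritable keys):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.hadoop.mapred.lib.InputSampler;
    import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

    public class TotalSortSketch {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TotalSortSketch.class);
        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        conf.setPartitionerClass(TotalOrderPartitioner.class);
        conf.setNumReduceTasks(30);
        FileInputFormat.addInputPath(conf, new Path("input"));
        FileOutputFormat.setOutputPath(conf, new Path("output-sorted"));

        // Sample 10% of the keys, up to 10,000 samples from at most 10 splits
        InputSampler.Sampler<IntWritable, Text> sampler =
            new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10);
        // Writes the cut points that TotalOrderPartitioner reads at run time
        InputSampler.writePartitionFile(conf, sampler);

        JobClient.runJob(conf);
      }
    }

The book's version also adds the partition file to the distributed cache so that every task can read it locally.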

Secondary Sort
- For any particular key, the values are not sorted; their order is not even stable from one run to the next
- Example: calculating the maximum temperature for each year
- It is possible to impose an order on the values by sorting and grouping the keys in a particular way (see the sketch below):
  1. Make the key a composite of the natural key and the natural value
  2. The sort comparator should order by the composite key, that is, the natural key and the natural value
  3. The partitioner and the grouping comparator for the composite key should consider only the natural key for partitioning and grouping
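
A minimal sketch of steps 1 and 3, assuming a hypothetical composite key IntPair (a WritableComparable holding year and temperature whose compareTo already gives the full composite order, as in the book's example):

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class SecondarySortSketch {

      // Partition by the natural key (year) only
      public static class FirstPartitioner
          implements Partitioner<IntPair, NullWritable> {
        @Override
        public void configure(JobConf job) { }
        @Override
        public int getPartition(IntPair key, NullWritable value,
            int numPartitions) {
          return Math.abs(key.getFirst() * 127) % numPartitions;
        }
      }

      // Group by the natural key only, so one reduce() call sees all the
      // values for a year, delivered in composite-key (sorted) order
      public static class GroupComparator extends WritableComparator {
        protected GroupComparator() {
          super(IntPair.class, true);
        }
        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
          return Integer.compare(((IntPair) w1).getFirst(),
                                 ((IntPair) w2).getFirst());
        }
      }

      public static void configureJob(JobConf conf) {
        conf.setPartitionerClass(FirstPartitioner.class);
        conf.setOutputValueGroupingComparator(GroupComparator.class);
      }
    }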

Map-Side Join
- A map-side join works by performing the join before the data reaches the map function
- Requirements:
  - Each input dataset must be divided into the same number of partitions
  - Each source must be sorted by the same key (the join key)
  - All the records for a particular key must reside in the same partition
- These requirements match the description of the output of a MapReduce job, so a map-side join can be used to join the outputs of several jobs that had the same number of reducers, the same keys, and output files that are not splittable
- Use CompositeInputFormat from the org.apache.hadoop.mapred.join package to run a map-side join (see the sketch below)

[Figure: each dataset is first sorted by its own MapReduce job; a map-only job then merges the sorted outputs]
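
A sketch of the CompositeInputFormat wiring (old API); the input paths and the use of KeyValueTextInputFormat are assumptions:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    public class MapSideJoinSketch {
      public static void configureJoin(JobConf conf) {
        conf.setInputFormat(CompositeInputFormat.class);
        // Inner join of two identically partitioned, sorted datasets
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
            "inner", KeyValueTextInputFormat.class,
            "hdfs://namenode/data/sorted-a", "hdfs://namenode/data/sorted-b"));
      }
    }

Each map invocation then receives the join key together with a TupleWritable holding the matching record from each source.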

Reduce-Side Join
- More general than a map-side join: the input datasets don't have to be structured in any particular way
- Less efficient, as both datasets have to go through the MapReduce shuffle
- Idea:
  - The mapper tags each record with its source
  - It uses the join key as the map output key, so that records with the same key are brought together in the reducer
- Multiple inputs: the input sources for the datasets typically have different formats, so use the MultipleInputs class to separate the logic for parsing and tagging each source (see the sketch below)
- Secondary sort: to perform the join, it is important for the reducer to see the data from one source before the other
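
A sketch of the multiple-inputs wiring (old API); the paths are assumptions, and JoinRecordMapper/JoinStationMapper are the per-source tagging mappers, named as in the book's example:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.MultipleInputs;

    public class ReduceSideJoinSketch {
      public static void configureInputs(JobConf conf) {
        // Each source gets its own mapper, which tags the record and
        // emits the join key (station ID) as the map output key
        MultipleInputs.addInputPath(conf, new Path("input/ncdc/all"),
            TextInputFormat.class, JoinRecordMapper.class);
        MultipleInputs.addInputPath(conf, new Path("input/ncdc/metadata"),
            TextInputFormat.class, JoinStationMapper.class);
      }
    }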

Example: Reduce-Side Join
- The code assumes that every station ID in the weather records has exactly one matching record in the station dataset
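
Under that assumption, and with a secondary sort that delivers the station record first within each reducer group, the join itself is short. This sketch follows the shape of the book's join reducer; TextPair is a hypothetical composite key whose getFirst() returns the station ID:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class JoinReducerSketch extends MapReduceBase
        implements Reducer<TextPair, Text, Text, Text> {

      @Override
      public void reduce(TextPair key, Iterator<Text> values,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // The secondary sort guarantees the single station record comes first
        Text stationName = new Text(values.next());
        while (values.hasNext()) {
          Text record = values.next();
          output.collect(key.getFirst(),
              new Text(stationName + "\t" + record));
        }
      }
    }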

Side Data Distribution
- Side data is extra read-only data needed by a job to process the main dataset
- The challenge is to make side data available to all the map or reduce tasks, which are spread across the cluster
- Mechanisms: caching in memory in a static field, using the job configuration, and the distributed cache

Using the Job Configuration
- Set arbitrary key-value pairs in the job configuration using the various setter methods on JobConf
- Useful for passing a small piece of metadata to tasks
- Don't use this mechanism for transferring more than a few kilobytes of data: the job configuration is read by the jobtracker, the tasktracker, and the child JVM, and each time it is read, all of its entries are loaded into memory, even if they are not used
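
A sketch of both ends of this mechanism; the property name is an arbitrary assumption:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    public class SideDataViaConfSketch {

      // Driver side: stash a small piece of metadata in the configuration
      public static void setUp(JobConf conf) {
        conf.set("myjob.reference.date", "2010-01-01"); // hypothetical key
      }

      // Task side: read it back when the task starts up
      public static class TaskSide extends MapReduceBase {
        private String refDate;

        @Override
        public void configure(JobConf job) {
          refDate = job.get("myjob.reference.date", "1970-01-01");
          // ... use refDate while processing records
        }
      }
    }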

Distributed Cache
- Distribute datasets using Hadoop's distributed cache mechanism, which provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run
- With GenericOptionsParser, specify the files to be distributed as a comma-separated list of URIs as the argument to the -files option
- Such a command copies the local file stations-fixed-width.txt to the task nodes (see the sketch below)
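
For example, a launch command might look like the following (the jar, class, and input/output paths are assumptions in the spirit of the book's example):

    % hadoop jar job.jar MaxTemperatureByStationName \
        -files stations-fixed-width.txt input/ncdc/all output

A task can then open the distributed file by name from its working directory, typically in configure():

    import java.io.File;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    public class CacheFileReaderSketch extends MapReduceBase {
      @Override
      public void configure(JobConf conf) {
        // Files passed with -files show up in the task's working directory
        File stations = new File("stations-fixed-width.txt");
        // ... parse the station metadata into an in-memory lookup table
      }
    }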

Distributed Cache
- Hadoop copies the files specified by the -files and -archives options to the jobtracker's filesystem (normally HDFS)
- Before a task is run, the tasktracker copies the files from the jobtracker's filesystem to a local disk
- The tasktracker also maintains a reference count for the number of tasks using each file in the cache
- After a task has run, the file's reference count is decreased by one; when it reaches zero, the file is eligible for deletion
- Files are deleted to make room for new files when the cache exceeds a certain size (10 GB by default)