O’Reilly – Hadoop: The Definitive Guide Ch.5 Developing a MapReduce Application 2 July 2010 Taewhi Lee.

Outline
 The Configuration API
 Configuring the Development Environment
 Writing a Unit Test
 Running Locally on Test Data
 Running on a Cluster
 Tuning a Job
 MapReduce Workflows

The Configuration API
 org.apache.hadoop.conf.Configuration class
–Reads properties from resources (XML configuration files)
 Name
–String
 Value
–Java primitives: boolean, int, long, float, …
–Other useful types: String, Class, java.io.File, …
 Example resource: configuration-1.xml
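The property-access behavior above can be sketched in plain Java. This is a minimal stand-in for the real org.apache.hadoop.conf.Configuration class, and the property names (color, size, weight) are made-up examples:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of Configuration semantics: string names, typed accessors
// with defaults. Not the Hadoop class itself.
class MiniConf {
    private final Map<String, String> props = new HashMap<>();

    void set(String name, String value) { props.put(name, value); }

    String get(String name) { return props.get(name); }

    int getInt(String name, int defaultValue) {
        String v = props.get(name);
        return v == null ? defaultValue : Integer.parseInt(v);
    }

    boolean getBoolean(String name, boolean defaultValue) {
        String v = props.get(name);
        return v == null ? defaultValue : Boolean.parseBoolean(v);
    }
}

public class ConfDemo {
    public static void main(String[] args) {
        MiniConf conf = new MiniConf();
        conf.set("color", "yellow");
        conf.set("size", "10");
        System.out.println(conf.get("color"));        // yellow
        System.out.println(conf.getInt("size", 0));   // 10
        System.out.println(conf.getInt("weight", 2)); // unset, falls back to 2
    }
}
```

The typed getters mirror the idea that names are always strings while values are read back as Java primitives or other useful types.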

Combining Resources
 Properties defined in later resources override earlier definitions
 Properties marked final cannot be overridden
 This is used to separate the default properties from the site-specific overrides
 Example resource: configuration-2.xml
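The override rules can be sketched with a small plain-Java stand-in (not the Hadoop class): later resources override earlier ones, unless a property was marked final when first defined. The resource contents used here are hypothetical:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Stand-in for resource combining: later resources override earlier ones,
// but a property marked final when first defined keeps its value.
class LayeredConf {
    private final Map<String, String> props = new HashMap<>();
    private final Set<String> finals = new HashSet<>();

    // Apply one resource; finalProps lists names this resource marks final.
    void addResource(Map<String, String> resource, Set<String> finalProps) {
        for (Map.Entry<String, String> e : resource.entrySet()) {
            if (!finals.contains(e.getKey())) {  // final properties win
                props.put(e.getKey(), e.getValue());
            }
        }
        finals.addAll(finalProps);
    }

    String get(String name) { return props.get(name); }
}
```

This is exactly the default-versus-site-specific pattern: load the defaults first (marking anything that must not change as final), then load the site overrides.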

Variable Expansion
 Properties can be defined in terms of other properties
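A sketch of the expansion rule, assuming the usual ${name} syntax; in this toy version unknown properties expand to the empty string and cycles are not detected:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of variable expansion: ${name} inside a value is replaced by the
// value of property "name", repeatedly, until no references remain.
class Expander {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    static String expand(String value, Map<String, String> props) {
        Matcher m = VAR.matcher(value);
        while (m.find()) {
            // Unknown names expand to ""; a self-reference would loop forever.
            String replacement = props.getOrDefault(m.group(1), "");
            value = value.replace(m.group(0), replacement);
            m = VAR.matcher(value);
        }
        return value;
    }
}
```

So a property defined as `${size},${weight}` picks up the current values of `size` and `weight` when it is read.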

Configuring the Development Environment
 Development environment
–Download & unpack the version of Hadoop on your machine
–Add all the JAR files in the Hadoop root & lib directories to the classpath
 Hadoop cluster
–Specify which configuration you are using:

Property            Local      Pseudo-distributed   Distributed
fs.default.name     file:///   hdfs://localhost/    hdfs://namenode/
mapred.job.tracker  local      localhost:8021       jobtracker:8021
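Building the classpath can be scripted along these lines; the /opt/hadoop default and the exact layout of the unpacked distribution are assumptions:

```shell
# Sketch: put Hadoop's JARs on the classpath. HADOOP_HOME and the
# directory layout of the unpacked distribution are assumptions.
HADOOP_HOME=${HADOOP_HOME:-/opt/hadoop}
CLASSPATH="$HADOOP_HOME/conf"
for jar in "$HADOOP_HOME"/*.jar "$HADOOP_HOME"/lib/*.jar; do
  # Skip unmatched globs (e.g. when lib/ is empty)
  [ -e "$jar" ] || continue
  CLASSPATH="$CLASSPATH:$jar"
done
export CLASSPATH
```

Putting the conf directory first lets the configuration files on the classpath select which cluster the tools talk to.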

Running Jobs from the Command Line
 Tool, ToolRunner
–Provide a convenient way to run jobs
–Use the GenericOptionsParser class internally
 Interprets common Hadoop command-line options & sets them on a Configuration object

GenericOptionsParser & ToolRunner Options
 To specify configuration files: -conf
 To set individual properties: -D property=value
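The -D handling can be illustrated with a small stand-in parser (the real GenericOptionsParser supports more options, such as -conf and -fs):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for "-D property=value" handling: such pairs are peeled off the
// command line and set as configuration properties; everything else is left
// over for the application itself.
class OptionSketch {
    final Map<String, String> props = new HashMap<>();
    final List<String> remaining = new ArrayList<>();

    OptionSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if ("-D".equals(args[i]) && i + 1 < args.length) {
                String[] kv = args[++i].split("=", 2);
                props.put(kv[0], kv.length > 1 ? kv[1] : "");
            } else {
                remaining.add(args[i]);
            }
        }
    }
}
```

A driver written against Tool gets exactly this behavior for free: ToolRunner strips the generic options and hands the rest to run().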

Writing a Unit Test – Mapper (1/4)
 Unit test for MaxTemperatureMapper

Writing a Unit Test – Mapper (2/4)
 Mapper that passes MaxTemperatureMapperTest

Writing a Unit Test – Mapper (3/4)
 Test for missing value

Writing a Unit Test – Mapper (4/4)
 Mapper that handles missing value
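Since the mapper code on these slides appeared as images, here is a plain-Java sketch of the logic being tested: extract the year and temperature from a fixed-width NCDC record, treating 9999 as a missing reading. The character offsets follow the book's running example and should be checked against the real format:

```java
import java.util.Optional;

// Sketch of the parsing logic the mapper tests exercise. The fixed-width
// offsets (year at columns 15-18, temperature at 87-91) follow the book's
// NCDC example and are assumptions here; 9999 encodes a missing reading.
class NcdcParseSketch {
    static final int MISSING = 9999;

    static String year(String record) {
        return record.substring(15, 19);
    }

    // Returns the air temperature in tenths of a degree,
    // or empty if the reading is missing.
    static Optional<Integer> temperature(String record) {
        int temp;
        if (record.charAt(87) == '+') {       // strip explicit plus sign
            temp = Integer.parseInt(record.substring(88, 92));
        } else {
            temp = Integer.parseInt(record.substring(87, 92));
        }
        return temp == MISSING ? Optional.empty() : Optional.of(temp);
    }
}
```

The "missing value" test then boils down to asserting that a record carrying +9999 produces no output at all.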

Writing a Unit Test – Reducer (1/2)
 Unit test for MaxTemperatureReducer

Writing a Unit Test – Reducer (2/2)
 Reducer that passes MaxTemperatureReducerTest
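The reducer under test simply keeps the maximum of the values for a key; a plain-Java sketch of that logic:

```java
import java.util.List;

// Sketch of MaxTemperatureReducer's logic: for one key (a year),
// keep the maximum of all temperatures seen for it.
class MaxTemperatureReduceSketch {
    static int reduce(List<Integer> temps) {
        int max = Integer.MIN_VALUE;  // empty input yields MIN_VALUE
        for (int t : temps) {
            max = Math.max(max, t);
        }
        return max;
    }
}
```

Because the function is pure, the unit test only needs a key, a list of values, and one expected output.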

Running a Job in a Local Job Runner (1/2)
 Driver to run our job for finding the maximum temperature by year

Running a Job in a Local Job Runner (2/2)
 To run in a local job runner

Fixing the Mapper
 A class for parsing weather records in NCDC format

Fixing the Mapper
 Mapper that uses a utility class to parse records

Testing the Driver
 Two approaches
–Use the local job runner & run the job against a test file on the local filesystem
–Run the driver using a “mini-” cluster
 MiniDFSCluster, MiniMRCluster classes
–Create an in-process cluster for testing against the full HDFS and MapReduce machinery
 ClusterMapReduceTestCase
–A useful base class for writing such a test
–Handles the details of starting and stopping the in-process HDFS and MapReduce clusters in its setUp() and tearDown() methods
–Generates a suitable JobConf object that is configured to work with the clusters

Running on a Cluster
 Packaging
–Package the program as a JAR file to send to the cluster
–Use Ant for convenience
 Launching a job
–Run the driver with the -conf option to specify the cluster
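A packaging step along these lines could be written in Ant; the target name, paths, and JAR name below are all hypothetical:

```xml
<!-- Sketch of an Ant target that packages compiled classes into a job JAR.
     The paths and JAR name are hypothetical. -->
<target name="job" depends="compile">
    <jar destfile="build/job.jar" basedir="build/classes"/>
</target>
```

The resulting JAR is then launched with the driver, e.g. something like `hadoop jar build/job.jar MaxTemperatureDriver -conf conf/hadoop-cluster.xml input output` (file names hypothetical).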

Running on a Cluster
 The output of a cluster run includes more useful information, such as job and task IDs, progress, and counter values

The MapReduce Web UI
 Useful for finding a job’s progress, statistics, and logs
 The Jobtracker page

The MapReduce Web UI
 The Job page

Retrieving the Results
 Each reducer produces one output file
–e.g., part … part
 Retrieving the results
–Copy the results from HDFS to the local machine (the -getmerge option is useful)
–Use the -cat option to print the output files to the console
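The effect of -getmerge can be imitated locally: concatenate the per-reducer part files into one file. The file names and contents below are made up for illustration:

```shell
# Local imitation of what `hadoop fs -getmerge` achieves: concatenate the
# per-reducer part files into a single local file. Names/values are made up.
mkdir -p max-temp
printf '1949\t111\n' > max-temp/part-00000
printf '1950\t22\n'  > max-temp/part-00001
cat max-temp/part-* > max-temp-local
```

On a real cluster the equivalent would be `hadoop fs -getmerge max-temp max-temp-local`, and `hadoop fs -cat 'max-temp/part-*'` prints the files to the console instead.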

Debugging a Job
 Via print statements
–Difficult to examine the output, which may be scattered across the nodes
 Using Hadoop features
–Task’s status message
 Prompts us to look in the error log
–Custom counter
 Counts the total number of records with implausible data
 If the amount of log data is large
–Write the information to the map’s output, rather than to standard error, for analysis and aggregation by the reduce
–Write a program to analyze the logs
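The custom-counter idea, sketched in plain Java. In a real job the counter would be incremented through the task context; the over-100°C threshold, with temperatures stored in tenths of a degree, follows the book's example of an implausible reading:

```java
import java.util.List;

// Plain-Java sketch of the custom-counter idea: count implausible records
// while processing instead of printing them. Temperatures are in tenths of
// a degree Celsius, so > 1000 means "over 100 degrees C".
class CounterSketch {
    static long countOver100(List<Integer> temps) {
        long over100 = 0;
        for (int t : temps) {
            if (t > 1000) {
                over100++;  // in Hadoop: increment a counter via the context
            }
        }
        return over100;
    }
}
```

After the job finishes, the counter totals are reported alongside the built-in counters, so suspect data is visible without trawling logs.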

Debugging a Job
 The tasks page

Debugging a Job
 The task details page

Using a Remote Debugger
 It is hard to set up a debugger when running the job on a cluster
–We don’t know which node is going to process which part of the input
 Capture & replay debugging
–Keep all the intermediate data generated during the job run
 Set the configuration property keep.failed.task.files to true
–Rerun the failing task in isolation with a debugger attached
 Run a special task runner called IsolationRunner with the retained files as input

Tuning a Job
 Tuning checklist: number of mappers & reducers, combiners, intermediate compression, custom serialization, shuffle tweaks
 Profiling & optimizing at the task level

MapReduce Workflows
 Decomposing a problem into MapReduce jobs
–Think about adding more jobs, rather than adding complexity to jobs
–For more complex problems, consider a higher-level language than MapReduce (e.g., Pig, Hive, Cascading)
 Running dependent jobs
–Linear chain of jobs
 Run each job one after another
–DAG of jobs
 Use the org.apache.hadoop.mapred.jobcontrol package
 JobControl class
–Represents a graph of jobs to be run
–Runs the jobs in the dependency order defined by the user
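JobControl's dependency-ordering behavior can be sketched as a tiny job graph; the job names are hypothetical and this toy version does not detect cycles:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy version of JobControl's scheduling idea: a job runs only after all
// jobs it depends on are done. No cycle detection; job names are made up.
class JobGraphSketch {
    private final Map<String, List<String>> deps = new HashMap<>();

    void addJob(String name, String... dependsOn) {
        deps.put(name, Arrays.asList(dependsOn));
    }

    // Returns the jobs in an order that respects every dependency.
    List<String> runOrder() {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        while (done.size() < deps.size()) {
            for (Map.Entry<String, List<String>> e : deps.entrySet()) {
                if (!done.contains(e.getKey()) && done.containsAll(e.getValue())) {
                    order.add(e.getKey());
                    done.add(e.getKey());
                }
            }
        }
        return order;
    }
}
```

A linear chain is just the degenerate case where each job depends only on its predecessor, which is why running jobs one after another needs no scheduler at all.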