Workflow Management CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

APACHE OOZIE

Problem! "Okay, Hadoop is great, but how do people actually do this?“ – A Real Person – Package jobs? – Chaining actions together? – Run these on a schedule? – Pre and post processing? – Retry failures?

Apache Oozie
Workflow Scheduler for Hadoop: a scalable, reliable, and extensible workflow scheduler system to manage Apache Hadoop jobs
– Workflow jobs are DAGs of actions
– Coordinator jobs are recurrent Oozie workflow jobs triggered by time and data availability
– Supports several types of jobs: Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop, DistCp, Java programs, and shell scripts

Why should I care?
– Retry jobs in the event of a failure
– Execute jobs at a specific time or when data is available
– Correctly order job execution based on dependencies
– Provide a common framework for communication
– Use the workflow to couple resources instead of some home-grown code base

Layers of Oozie (highest to lowest):
– Bundles
– Coordinators
– Workflows
– Actions

Actions
– Each action has a type, and each type has a defined set of configuration variables
– Each action must specify what to do based on success or failure

Workflow DAGs
(Diagram: a workflow begins at a start node and finishes at an end node. In between, actions such as Java main, M/R streaming, Pig, M/R, and FS jobs are wired together with decision, fork, and join control nodes. Each action transitions on OK; a decision node chooses a branch, e.g. MORE runs further work while ENOUGH heads to the end.)

Workflow Language

Flow-control nodes:
– Decision: expresses "switch-case" logic
– Fork: splits one path of execution into multiple concurrent paths
– Join: waits until every concurrent execution path of a previous fork node arrives to it
– Kill: forces a workflow job to abort execution

Action nodes:
– java: invokes the main() method of the specified Java class
– fs: manipulates files and directories in HDFS; supports the move, delete, and mkdir commands
– map-reduce: starts a Hadoop map/reduce job; can be a Java MR job, a streaming job, or a pipes job
– pig: runs a Pig job
– sub-workflow: runs a child workflow job
– hive: runs a Hive job
– shell: runs a shell command
– ssh: starts a shell command on a remote machine as a remote secure shell
– sqoop: runs a Sqoop job
– email: sends emails from an Oozie workflow application
– distcp: runs a Hadoop DistCp MapReduce job
– custom: does what you program it to do
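To make the control nodes concrete, here is a minimal workflow sketch that forks two actions, joins them, and then decides whether to run an HDFS cleanup step. The node names and the needsCleanup/tempDir parameters are hypothetical, and the action bodies are elided:

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.2">
        <start to="forkNode"/>
        <!-- fork: start two concurrent execution paths -->
        <fork name="forkNode">
            <path start="pigAction"/>
            <path start="mrAction"/>
        </fork>
        <action name="pigAction">
            <pig><!-- Pig job definition elided --></pig>
            <ok to="joinNode"/>
            <error to="fail"/>
        </action>
        <action name="mrAction">
            <map-reduce><!-- map-reduce job definition elided --></map-reduce>
            <ok to="joinNode"/>
            <error to="fail"/>
        </action>
        <!-- join: wait for every forked path to arrive -->
        <join name="joinNode" to="decisionNode"/>
        <!-- decision: "switch-case" on a workflow EL expression -->
        <decision name="decisionNode">
            <switch>
                <case to="cleanup">${needsCleanup eq "true"}</case>
                <default to="end"/>
            </switch>
        </decision>
        <action name="cleanup">
            <fs><delete path="${tempDir}"/></fs>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Workflow failed</message>
        </kill>
        <end name="end"/>
    </workflow-app>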

Oozie Workflow Application
An HDFS directory containing:
– Definition file: workflow.xml
– Configuration file: config-default.xml
– App files: a lib/ directory with JARs and other dependencies
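On HDFS, such an application directory might look like this (the JAR name is hypothetical):

    oozie/app/my_job/
        workflow.xml              (workflow definition)
        config-default.xml        (default configuration)
        lib/
            my-java-classes.jar   (JARs and other dependencies)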

WordCount Workflow
A two-node workflow: start transitions to a map-reduce action, which goes to end on OK and to a kill node on error. The XML markup below is reconstructed around the surviving values; the app and node names are placeholders:

    <workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.1">
        <start to="wordcount"/>
        <action name="wordcount">
            <map-reduce>
                <job-tracker>foo.com:9001</job-tracker>
                <name-node>hdfs://bar.com:9000</name-node>
                <configuration>
                    <property>
                        <name>mapred.input.dir</name>
                        <value>${inputDir}</value>
                    </property>
                    <property>
                        <name>mapred.output.dir</name>
                        <value>${outputDir}</value>
                    </property>
                </configuration>
            </map-reduce>
            <ok to="end"/>
            <error to="kill"/>
        </action>
        <kill name="kill">
            <message>Workflow failed</message>
        </kill>
        <end name="end"/>
    </workflow-app>

Coordinators
Oozie executes workflows based on:
– Time dependency
– Data dependency
(Diagram: an Oozie client submits coordinators and workflows to the Oozie server, which runs in Tomcat and exposes a WS API; the coordinator checks data availability in Hadoop and triggers workflows.)

Time Triggers
A coordinator that runs a workflow every 15 minutes between its start and end times. The markup is reconstructed around the surviving values:

    <!-- the start/end dates are placeholders; the originals were lost -->
    <coordinator-app name="coord1" frequency="15"
                     start="2009-01-01T00:00Z" end="2010-01-01T00:00Z"
                     xmlns="uri:oozie:coordinator:0.1">
        <action>
            <workflow>
                <app-path>hdfs://bar:9000/apps/processor-wf</app-path>
                <configuration>
                    <property>
                        <name>key1</name>
                        <value>value1</value>
                    </property>
                </configuration>
            </workflow>
        </action>
    </coordinator-app>

Data Triggers
A dataset template describes where each hourly instance of the input lives, an input event requires the current instance, and the workflow receives its path. The markup is reconstructed around the surviving values:

    <!-- the name, frequency, start/end, and initial-instance are placeholders -->
    <coordinator-app name="coord2" frequency="${coord:hours(1)}"
                     start="2009-01-01T00:00Z" end="2010-01-01T00:00Z" timezone="UTC"
                     xmlns="uri:oozie:coordinator:0.1">
        <datasets>
            <dataset name="inputLogs" frequency="${coord:hours(1)}"
                     initial-instance="2009-01-01T00:00Z" timezone="UTC">
                <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
            </dataset>
        </datasets>
        <input-events>
            <data-in name="inputLogs" dataset="inputLogs">
                <instance>${current(0)}</instance>
            </data-in>
        </input-events>
        <action>
            <workflow>
                <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
                <configuration>
                    <property>
                        <name>inputData</name>
                        <value>${dataIn('inputLogs')}</value>
                    </property>
                </configuration>
            </workflow>
        </action>
    </coordinator-app>

Bundles
– Bundles are higher-level abstractions that batch a set of coordinators together
– There are no explicit dependencies between the coordinators in a bundle, but they can be used to define a pipeline
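A bundle definition is itself a small XML file; a sketch, with hypothetical coordinator names and paths:

    <bundle-app name="my-pipeline" xmlns="uri:oozie:bundle:0.1">
        <coordinator name="ingest-coord">
            <app-path>hdfs://bar:9000/apps/ingest-coord</app-path>
        </coordinator>
        <coordinator name="process-coord">
            <app-path>hdfs://bar:9000/apps/process-coord</app-path>
        </coordinator>
    </bundle-app>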

Interacting with Oozie
– Read-only web console
– CLI
– Java client
– Web service endpoints
– Directly with the Oozie DB using SQL
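For example, via the CLI (the server URL and job ID are hypothetical):

    # submit and start a workflow job
    oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run

    # check on a running job
    oozie job -oozie http://oozie-host:11000/oozie -info 0000004-150101000000000-oozie-oozi-W

    # list recent jobs
    oozie jobs -oozie http://oozie-host:11000/oozie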

Extending Oozie
– Oozie ships a minimal workflow language containing a handful of controls and actions
– It is extensible via custom action nodes
– Creating a custom action requires:
  – A Java implementation, extending ActionExecutor
  – An implementation of the action's XML schema, which defines the action's configuration parameters
  – Packaging of the Java implementation and configuration schema into a JAR, which is added to the Oozie WAR
  – Extending oozie-site.xml to register information about the custom executor
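The registration step means listing the executor class (and its schema) in oozie-site.xml; a sketch, assuming a hypothetical com.example.MyActionExecutor and my-action-0.1.xsd:

    <property>
        <name>oozie.service.ActionService.executor.ext.classes</name>
        <value>com.example.MyActionExecutor</value>
    </property>
    <property>
        <name>oozie.service.SchemaService.wf.ext.schemas</name>
        <value>my-action-0.1.xsd</value>
    </property>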

What do I need to deploy a workflow?
– coordinator.xml
– workflow.xml
– Libraries
– Properties, containing things like the NameNode and ResourceManager addresses and other job-specific properties

Configuring Workflows
Three mechanisms to configure a workflow:
– config-default.xml
– job.properties
– Job arguments

Processed as such:
– Use all of the parameters from the command-line invocation
– Anything unresolved? Use job.properties
– Use config-default.xml for everything else
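A job.properties for the WordCount workflow above might look like this sketch (the hosts and paths are hypothetical):

    nameNode=hdfs://bar.com:9000
    jobTracker=foo.com:9001
    inputDir=/user/hadoop/input
    outputDir=/user/hadoop/output

    # HDFS location of the deployed workflow application
    oozie.wf.application.path=${nameNode}/user/hadoop/oozie/app/my_job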

Okay, I've built those
Now you can put the application in HDFS and run it:

    hdfs dfs -put my_job oozie/app
    oozie job -run -config job.properties

Java Action
– A Java action will execute the main method of the specified Java class
– Java classes should be packaged in a JAR and placed in the workflow application's lib/ directory:
  – wf-app-dir/workflow.xml
  – wf-app-dir/lib
  – wf-app-dir/lib/myJavaClasses.JAR

Java Action
The java action's configuration mirrors a command-line invocation. This command:

    $ java -Xms512m a.b.c.MyMainClass arg1 arg2 ...

corresponds to (markup reconstructed from the surviving values):

    <java>
        ...
        <main-class>a.b.c.MyMainClass</main-class>
        <java-opts>-Xms512m</java-opts>
        <arg>arg1</arg>
        <arg>arg2</arg>
        ...
    </java>

Java Action Execution
Executed as a MapReduce job with a single task, so the action definition needs the MapReduce job information (markup reconstructed from the surviving values):

    <java>
        <job-tracker>foo.bar:8021</job-tracker>
        <!-- the name-node port is missing from the source; 8020 is a placeholder -->
        <name-node>foo1.bar:8020</name-node>
        <main-class>abc</main-class>
        <arg>def</arg>
    </java>

Capturing Output
How do you pass a parameter from your Java action to other actions?
– Add the <capture-output/> element to your Java action
– Reference the parameter in your following actions
– Write some Java code to link them

The Java action, with <capture-output/> added (the action name java1 comes from the reference in the Pig action below; the transition targets are illustrative):

    <action name="java1">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <main-class>org.apache.oozie.test.MyTest</main-class>
            <arg>${outputFileName}</arg>
            <capture-output/>
        </java>
        <ok to="pig1"/>
        <error to="fail"/>
    </action>

A following Pig action reads the captured value through the wf:actionData EL function:

    <action name="pig1">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <script>script.pig</script>
            <param>MY_VAR=${wf:actionData('java1')['PASS_ME']}</param>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>

And the Java main method that writes PASS_ME to the file Oozie provides:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.util.Properties;

    public static void main(String[] args) {
        String fileName = args[0];
        try {
            // Oozie passes the output-properties file location in this system property
            File file = new File(
                System.getProperty("oozie.action.output.properties"));
            Properties props = new Properties();
            props.setProperty("PASS_ME", "123456");
            OutputStream os = new FileOutputStream(file);
            props.store(os, "");
            os.close();
            System.out.println(file.getAbsolutePath());
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.exit(0);
    }

Web Console
(Screenshots: the coordinator list, coordinator details, job details, the job DAG, action details, and the corresponding JobTracker view.)

A Use Case: Hourly Jobs
Replace a CRON job that runs a bash script once a day:
1. A Java main class pulls data from a file stream and dumps it to HDFS
2. Run a MapReduce job on the files
3. Email a person when finished
4. Start within X amount of time
5. Complete within Y amount of time
6. And retry Z times on failure

The workflow covers steps 1–3. The markup is reconstructed around the surviving values; the action names are placeholders, and the email recipient and one name-node port were stripped from the transcript:

    <workflow-app name="test_job" xmlns="uri:oozie:workflow:0.2">
        <start to="pullFileStream"/>
        <!-- 1: pull data from the file stream into HDFS -->
        <action name="pullFileStream">
            <java>
                <job-tracker>foo:9001</job-tracker>
                <name-node>hdfs://bar:9000</name-node>
                <main-class>org.foo.bar.PullFileStream</main-class>
            </java>
            <ok to="processData"/>
            <error to="fail"/>
        </action>
        <!-- 2: run a MapReduce job on the files -->
        <action name="processData">
            <map-reduce>
                <job-tracker>foo:9001</job-tracker>
                <!-- the name-node port is missing from the source; 9000 is a placeholder -->
                <name-node>hdfs://bar:9000</name-node>
                <!-- job configuration elided -->
            </map-reduce>
            <ok to="notify"/>
            <error to="fail"/>
        </action>
        <!-- 3: email a person when finished -->
        <action name="notify">
            <email xmlns="uri:oozie:email-action:0.1">
                <!-- the recipient address was stripped from the source -->
                <to>someone@foo.com</to>
                <subject>notification</subject>
                <body>The wf completed</body>
            </email>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail"><message>Workflow failed</message></kill>
        <end name="end"/>
    </workflow-app>

The coordinator schedules the workflow daily and uses the Oozie SLA schema for steps 4 and 5 (the duplicated default xmlns in the source is corrected to an sla namespace prefix):

    <coordinator-app name="daily_job_coord" frequency="${coord:days(1)}"
                     start="${COORD_START}" end="${COORD_END}" timezone="UTC"
                     xmlns="uri:oozie:coordinator:0.1"
                     xmlns:sla="uri:oozie:sla:0.1">
        <action>
            <workflow>
                <app-path>hdfs://bar:9000/user/hadoop/oozie/app/test_job</app-path>
            </workflow>
            <sla:info>
                <sla:nominal-time>${coord:nominalTime()}</sla:nominal-time>
                <!-- 4: start within X minutes -->
                <sla:should-start>${X * MINUTES}</sla:should-start>
                <!-- 5: complete within Y minutes -->
                <sla:should-end>${Y * MINUTES}</sla:should-end>
            </sla:info>
        </action>
    </coordinator-app>

Step 6, retrying Z times on failure, would typically be handled with the retry-max and retry-interval attributes on the workflow's actions.

Review
– Oozie ties together many Hadoop ecosystem components to "productionalize" this stuff
– Advanced control flow and action extensibility let Oozie do whatever you need it to do at any point in the workflow
– XML is gross
