1 Workflow Management CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

2 APACHE OOZIE

3 Problem! "Okay, Hadoop is great, but how do people actually do this?" – A Real Person
– Package jobs?
– Chain actions together?
– Run them on a schedule?
– Handle pre- and post-processing?
– Retry failures?

4 Apache Oozie Workflow Scheduler for Hadoop
A scalable, reliable, and extensible workflow scheduler system to manage Apache Hadoop jobs
– Workflow jobs are DAGs of actions
– Coordinator jobs are recurrent Oozie Workflow jobs, triggered by time and data availability
– Supports several types of jobs: Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop, DistCp, Java programs, and shell scripts

5 Why should I care?
– Retry jobs in the event of a failure
– Execute jobs at a specific time or when data is available
– Correctly order job execution based on dependencies
– Provide a common framework for communication
– Use the workflow to couple resources instead of some home-grown code base

6 Layers of Oozie Bundles Coordinators Workflows Actions

7 Actions have a type, and each type has a defined set of configuration variables. Each action must specify what to do based on success or failure.

8 Workflow DAGs [diagram: start → Java Main → M/R streaming job → decision (MORE / ENOUGH) → fork → Pig job ∥ M/R job → join → Java Main → FS job → OK → end]

9 Workflow Language

Flow-control nodes:
– Decision: expresses "switch-case" logic
– Fork: splits one path of execution into multiple concurrent paths
– Join: waits until every concurrent execution path of a previous fork node arrives at it
– Kill: forces a workflow job to abort execution

Action nodes:
– java: invokes the main() method of the specified Java class
– fs: manipulates files and directories in HDFS; supports the commands move, delete, mkdir
– map-reduce: starts a Hadoop map/reduce job; can be a Java MR job, streaming job, or pipes job
– pig: runs a Pig job
– sub-workflow: runs a child workflow job
– hive: runs a Hive job
– shell: runs a shell command
– ssh: starts a shell command on a remote machine as a remote secure shell
– sqoop: runs a Sqoop job
– email: sends emails from an Oozie workflow application
– distcp: runs a Hadoop DistCp MapReduce job
– custom: does what you program it to do

10 Oozie Workflow Application An HDFS directory containing:
– Definition file: workflow.xml
– Configuration file: config-default.xml
– App files: lib/ directory with JARs and other dependencies

11 WordCount Workflow [workflow.xml fragment, tags lost in transcription: job tracker foo.com:9001, name node hdfs://bar.com:9000, properties mapred.input.dir=${inputDir} and mapred.output.dir=${outputDir}; flow: Start → M-R wordcount → End on OK, Kill on Error]
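A hedged reconstruction of slide 11's fragment as a standard Oozie map-reduce workflow definition. Only the host names, properties, and flow shown on the slide are taken from the source; the workflow-app name, kill message, and schema version are assumptions, and the mapper/reducer configuration is elided:

```xml
<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.1">
    <start to="wordcount"/>
    <action name="wordcount">
        <map-reduce>
            <job-tracker>foo.com:9001</job-tracker>
            <name-node>hdfs://bar.com:9000</name-node>
            <configuration>
                <!-- mapper/reducer class properties omitted on the slide -->
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>WordCount failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```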

12 Coordinators Oozie executes workflows based on
– Time Dependency
– Data Dependency
[diagram: an Oozie client submits jobs via the WS API to the Oozie server (in Tomcat); the Oozie coordinator checks data availability in Hadoop and triggers Oozie workflows]

13 Time Triggers [coordinator.xml fragment: <coordinator-app name="coord1" start="2009-01-01T00:00Z" end="2010-01-01T00:00Z" frequency="15" xmlns="uri:oozie:coordinator:0.1">; workflow app-path hdfs://bar:9000/apps/processor-wf, property key1=value1]
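A hedged sketch of the time-triggered coordinator from slide 13, keeping the attributes and values that survived transcription; the nesting of the action/workflow elements is reconstructed from the usual coordinator layout:

```xml
<coordinator-app name="coord1" frequency="15"
                 start="2009-01-01T00:00Z" end="2010-01-01T00:00Z"
                 xmlns="uri:oozie:coordinator:0.1">
    <action>
        <workflow>
            <app-path>hdfs://bar:9000/apps/processor-wf</app-path>
            <configuration>
                <property>
                    <name>key1</name>
                    <value>value1</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
```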

14 Data Triggers [coordinator.xml fragment: dataset URI template hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}, input instance ${current(0)}, workflow app-path hdfs://bar:9000/usr/abc/logsprocessor-wf, property inputData=${dataIn('inputLogs')}]
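A hedged sketch of the data-triggered coordinator from slide 14. The URI template, instance expression, app path, and inputData property come from the slide; the coordinator-app attributes, dataset name, and dataset frequency are assumptions added to make the fragment well-formed:

```xml
<coordinator-app name="data-coord" frequency="${coord:hours(1)}"
                 start="2009-01-01T00:00Z" end="2010-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
    <datasets>
        <dataset name="inputLogs" frequency="${coord:hours(1)}"
                 initial-instance="2009-01-01T00:00Z" timezone="UTC">
            <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="inputLogs" dataset="inputLogs">
            <instance>${current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
            <configuration>
                <property>
                    <name>inputData</name>
                    <value>${dataIn('inputLogs')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
```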

15 Bundle Bundles are higher-level abstractions that batch a set of coordinators together. There are no explicit dependencies between the coordinators, but a bundle can be used to define a pipeline.

16 Interacting with Oozie
– Read-only web console
– CLI
– Java client
– Web service endpoints
– Directly with the Oozie DB using SQL

17 Extending Oozie Minimal workflow language containing a handful of controls and actions, with extensibility for custom action nodes. Creating a custom action requires:
– A Java implementation extending ActionExecutor
– An XML schema defining the action's configuration parameters
– Packaging the Java implementation and configuration schema into a JAR, which is added to the Oozie WAR
– Extending oozie-site.xml to register information about the custom executor

18 What do I need to deploy a workflow?
– coordinator.xml
– workflow.xml
– Libraries
– Properties: contains things like NameNode and ResourceManager addresses and other job-specific properties

19 Configuring Workflows Three mechanisms to configure a workflow:
– config-default.xml
– job.properties
– Job arguments
Resolution order:
1. Use all of the parameters from the command-line invocation
2. For anything unresolved, use job.properties
3. Use config-default.xml for everything else
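A minimal job.properties sketch to make the mechanism concrete; the host names and path are hypothetical placeholders drawn from the examples elsewhere in the deck, and oozie.wf.application.path points at the workflow application directory in HDFS:

```properties
# Cluster endpoints (placeholder hosts)
nameNode=hdfs://bar.com:9000
jobTracker=foo.com:9001
queueName=default

# Where the workflow application (workflow.xml, lib/) lives in HDFS
oozie.wf.application.path=${nameNode}/user/hadoop/oozie/app
```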

20 Okay, I've built those Now you can put it in HDFS and run it:
hdfs dfs -put my_job oozie/app
oozie job -run -config job.properties

21 Java Action A Java action will execute the main method of the specified Java class. Java classes should be packaged in a JAR and placed in the workflow application's lib/ directory:
– wf-app-dir/workflow.xml
– wf-app-dir/lib
– wf-app-dir/lib/myJavaClasses.JAR

22 Java Action Command-line equivalent: $ java -Xms512m a.b.c.MyMainClass arg1 arg2... [java action fragment: main-class a.b.c.MyJavaMain, java-opts -Xms512m, args arg1 arg2...]
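A hedged sketch of the java action fragment from slides 22–23. Note the slide itself uses two different class names (a.b.c.MyMainClass on the command line, a.b.c.MyJavaMain in the XML); the sketch keeps the XML one. The host names come from slide 23, and the enclosing action/ok/error elements are assumptions added for well-formedness:

```xml
<action name="my-java-action">
    <java>
        <job-tracker>foo.bar:8021</job-tracker>
        <name-node>hdfs://foo1.bar:8020</name-node>
        <main-class>a.b.c.MyJavaMain</main-class>
        <java-opts>-Xms512m</java-opts>
        <arg>arg1</arg>
        <arg>arg2</arg>
    </java>
    <ok to="end"/>
    <error to="kill"/>
</action>
```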

23 Java Action Execution Executed as an MR job with a single task, so you need to supply the MR information [java action fragment: job-tracker foo.bar:8021, name-node foo1.bar:8020, ... abc def]

24 Capturing Output How do you pass a parameter from a Java action to other actions?
– Add the capture-output element to your Java action
– Reference the parameter in your following actions (e.g. via wf:actionData)
– Write some Java code to link them

25 [java action fragment: job-tracker ${jobTracker}, name-node ${nameNode}, property mapred.job.queue.name=${queueName}, main-class org.apache.oozie.test.MyTest, arg ${outputFileName}]

26 [pig action fragment: job-tracker ${jobTracker}, name-node ${nameNode}, property mapred.job.queue.name=${queueName}, script script.pig, param MY_VAR=${wf:actionData('java1')['PASS_ME']}]

27 public static void main(String[] args) {
    String fileName = args[0];
    try {
        // Oozie tells the action where to write its output properties
        File file = new File(
            System.getProperty("oozie.action.output.properties"));
        Properties props = new Properties();
        props.setProperty("PASS_ME", "123456");
        OutputStream os = new FileOutputStream(file);
        props.store(os, "");
        os.close();
        System.out.println(file.getAbsolutePath());
    } catch (Exception e) {
        e.printStackTrace();
    }
    System.exit(0);
}

28 Web Console

29 Coordinators

30 Coordinator Details

31 Job Details

32 Job DAG

33 Job Details

34 Action Details

35 Job Tracker

36 A Use Case: Hourly Jobs Replace a CRON job that runs a bash script once a day
1. Java main class that pulls data from a file stream and dumps it to HDFS
2. Runs a MapReduce job on the files
3. Emails a person when finished
4. Start within X amount of time
5. Complete within Y amount of time
6. And retry Z times on failure

37 [workflow.xml fragment: (1) java action running org.foo.bar.PullFileStream on foo:9001 / bar:9000, (2) map-reduce action on foo:9001 / bar:9000 ..., (3) email action to customer@foo.bar and employee@foo.bar, subject "Email notification", body "The wf completed"]
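A hedged sketch of step 3 of the workflow above, the email action. The recipients, subject, and body are from the slide; the action name, transition targets, and email-action schema version are assumptions:

```xml
<action name="notify">
    <email xmlns="uri:oozie:email-action:0.1">
        <to>customer@foo.bar,employee@foo.bar</to>
        <subject>Email notification</subject>
        <body>The wf completed</body>
    </email>
    <ok to="end"/>
    <error to="kill"/>
</action>
```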

38 <coordinator-app end="${COORD_END}" frequency="${coord:days(1)}" name="daily_job_coord" start="${COORD_START}" timezone="UTC" xmlns="uri:oozie:coordinator:0.1" xmlns:sla="uri:oozie:sla:0.1"> [fragment: workflow app-path hdfs://bar:9000/user/hadoop/oozie/app/test_job; SLA: nominal-time ${coord:nominalTime()}, should-start ${X * MINUTES}, should-end ${Y * MINUTES}, alert-contact foo@bar.com; covers requirements 4, 5, and 6]

39 Review Oozie ties together many Hadoop ecosystem components to "productionalize" this stuff. Advanced control flow and action extensibility let Oozie do whatever you need it to do at any point in the workflow. XML is gross.

40 References
http://oozie.apache.org
https://cwiki.apache.org/confluence/display/OOZIE/Index
http://www.slideshare.net/mattgoeke/oozie-riot-games
http://www.slideshare.net/mislam77/oozie-sweet-13451212
http://www.slideshare.net/ChicagoHUG/everything-you-wanted-to-know-but-were-afraid-to-ask-about-oozie
