06 | Automating Big Data Processing
Graeme Malcolm | Data Technology Specialist, Content Master
Pete Harris | Learning Product Planner, Microsoft
Module Overview
- Overview of Big Data Processing
- Storage and Schema Considerations
- HCatalog
- Oozie
Overview of Big Data Processing
Big Data Processing Workflow
- Upload source data to HDFS in a Windows Azure storage blob container
- Transform the data using Pig, Hive, and Map/Reduce
- Consume the results of the transformation for reporting and analysis
- Provision Windows Azure HDInsight on demand
- Ensure data processing operations are repeatable
- Minimize hard-coded dependencies
Storage and Schema Considerations
Hard-coded paths and schema can break scripts:

    SourceData = LOAD '/data/source' USING PigStorage(',')
                 AS (col1:chararray, col2:float);
    SortedData = ORDER SourceData BY col1 ASC;
    STORE SortedData INTO '/data/output';

HCatalog uses Hive tables to abstract storage and schema:

    SourceData = LOAD 'StagingTable' USING org.apache.hcatalog.pig.HCatLoader();
    SortedData = ORDER SourceData BY col1 ASC;
    STORE SortedData INTO 'OutputTable' USING org.apache.hcatalog.pig.HCatStorer();
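The Hive tables that HCatLoader and HCatStorer reference must already exist in the HCatalog metastore before the Pig script runs. A minimal sketch of HiveQL that could define them; the table and column names mirror the Pig snippets above, while the storage locations are illustrative assumptions:

```sql
-- Hypothetical DDL for the tables used by the HCatalog Pig script above.
-- Table and column names match the snippets; LOCATION paths are assumptions.
CREATE EXTERNAL TABLE StagingTable (col1 STRING, col2 FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/data/source';

CREATE EXTERNAL TABLE OutputTable (col1 STRING, col2 FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/data/output';
```

Because the Pig script then addresses tables rather than paths, relocating the data later means re-pointing the table definitions, not editing every script.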
Demo: HCatalog
In this demonstration, you will see how to:
- Use HCatalog to execute HiveQL
- Use HCatalog in a Pig Latin script
Automating Big Data Processing Tasks
- Windows Azure PowerShell
- The Windows Azure HDInsight .NET SDK
- Oozie
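Of these, Windows Azure PowerShell is the quickest way to script an on-demand cluster. A rough sketch using the classic Azure PowerShell HDInsight cmdlets; the cluster, storage, and location values here are placeholders, not taken from the module:

```powershell
# Sketch: provision an HDInsight cluster on demand, run jobs, then remove it.
# All names and sizes below are illustrative placeholders.
$clusterName    = "example-hdinsight"
$storageAccount = "examplestorage"
$container      = "examplecontainer"
$creds          = Get-Credential   # admin credentials for the new cluster

New-AzureHDInsightCluster -Name $clusterName -Location "North Europe" `
    -DefaultStorageAccountName "$storageAccount.blob.core.windows.net" `
    -DefaultStorageAccountKey (Get-AzureStorageKey $storageAccount).Primary `
    -DefaultStorageContainerName $container `
    -Credential $creds -ClusterSizeInNodes 4

# ... submit Pig, Hive, or Oozie work here ...

# Removing the cluster when processing completes limits cost to compute time;
# the data remains in the blob storage container.
Remove-AzureHDInsightCluster -Name $clusterName
```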
Introduction to Oozie
- Oozie workflow document: an XML file defining workflow actions
- Script files: files used by workflow actions (for example, a HiveQL query file); these can contain parameters
- The job.properties file: a configuration file setting parameter values
- HDInsight configuration files: files that configure the execution context (for example, Hive-Default.xml)
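Together, these pieces typically live as a folder in HDFS, with job.properties kept on the client and supplied when the job is submitted. A possible layout, assuming the /example/workflow application path used later in this module:

```
/example/workflow/workflow.xml      <- Oozie workflow document
/example/workflow/CreateTable.q     <- HiveQL script used by a workflow action
/example/workflow/hive-default.xml  <- Hive configuration for the execution context
job.properties                      <- local file passed at job submission
```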
Oozie Workflow File
A workflow consists of actions. In this example, the first action runs a parameterized Hive script, and the workflow branches based on each action's outcome:

    <workflow-app xmlns="uri:oozie:workflow:0.2" name="MyWorkflow">
      <start to="FirstAction"/>
      <action name="FirstAction">
        <hive xmlns="uri:oozie:hive-action:0.2">
          <script>CreateTable.q</script>
          <param>TABLE_NAME=${tableName}</param>
          <param>LOCATION=${tableFolder}</param>
        </hive>
        <ok to="SecondAction"/>
        <error to="fail"/>
      </action>
      <action name="SecondAction">
        …
      </action>
      <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
      </kill>
      <end name="end"/>
    </workflow-app>
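The elided SecondAction could be any supported Oozie action type. Purely as an illustration (not the slide's actual content), a Pig action running a script might look like the following; the script name is hypothetical, and while ${jobTracker} matches a job.properties setting shown later in this module, ${nameNode} is an additional assumed property:

```xml
<!-- Hypothetical Pig action; script name and parameters are illustrative -->
<action name="SecondAction">
  <pig>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <script>ProcessData.pig</script>
    <param>TABLE_NAME=${tableName}</param>
  </pig>
  <ok to="end"/>
  <error to="fail"/>
</action>
```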
Script Files
Action-specific script files (for example, HiveQL scripts) receive parameters passed from the Oozie workflow file:

    DROP TABLE IF EXISTS ${TABLE_NAME};
    CREATE EXTERNAL TABLE ${TABLE_NAME}
      (Col1 STRING, Col2 FLOAT, Col3 FLOAT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE LOCATION '${LOCATION}';
Job.properties
The job.properties file sets Oozie job settings, the path to the workflow file in HDFS, and variables (for example, values for script parameters):

    jobTracker=jobtrackerhost:9010
    queueName=default
    oozie.use.system.libpath=true
    oozie.wf.application.path=/example/workflow/
    tableName=ExampleTable
    tableFolder=/example/ExampleTable
Demo: Oozie
In this demonstration, you will see how to:
- Prepare Oozie workflow files
- Run an Oozie workflow
Module Summary
- Design processes that are repeatable, with minimal dependencies
- Use HCatalog to abstract data storage location and schema
- Automate Big Data processing with Windows Azure PowerShell, the Microsoft Hadoop .NET SDK, or Oozie