06 | Automating Big Data Processing


06 | Automating Big Data Processing
Graeme Malcolm | Data Technology Specialist, Content Master
Pete Harris | Learning Product Planner, Microsoft

Module Overview
- Overview of Big Data Processing
- Storage and Schema Considerations
- HCatalog
- Oozie

Overview of Big Data Processing

Big Data Processing Workflow:
- Upload source data to HDFS (in a Windows Azure storage blob container)
- Transform the data using Pig, Hive, and Map/Reduce
- Consume the results of the transformation for reporting and analysis

Design goals:
- Provision Windows Azure HDInsight on demand
- Ensure data processing operations are repeatable
- Minimize hard-coded dependencies
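As an illustration of this workflow, the following Windows Azure PowerShell sketch uploads a source file and provisions an HDInsight cluster on demand. It is a minimal sketch, not part of the module's demos: the storage account, container, cluster name, and file paths are all hypothetical, and it assumes the Azure PowerShell module is loaded and a subscription has already been selected.

  # Hypothetical names throughout; assumes Select-AzureSubscription has been run.
  $storageAccount = "mystorageaccount"
  $container      = "mycontainer"
  $clusterName    = "myhdicluster"
  $key = Get-AzureStorageKey $storageAccount | %{ $_.Primary }

  # Upload source data to the blob container that will back HDFS
  $ctx = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $key
  Set-AzureStorageBlobContent -File "C:\data\source.csv" -Container $container `
                              -Blob "data/source/source.csv" -Context $ctx

  # Provision an HDInsight cluster on demand, run jobs, then delete it when done
  $cred = Get-Credential   # admin credentials for the new cluster
  New-AzureHDInsightCluster -Name $clusterName -Location "North Europe" `
      -DefaultStorageAccountName "$storageAccount.blob.core.windows.net" `
      -DefaultStorageAccountKey $key -DefaultStorageContainerName $container `
      -ClusterSizeInNodes 4 -Credential $cred
  # ... submit transformation jobs here ...
  Remove-AzureHDInsightCluster -Name $clusterName

Because the data lives in the storage account rather than on the cluster, the cluster itself is disposable: deleting it when the jobs finish stops the compute charges without losing any data.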

Storage and Schema Considerations

Hard-coded paths and schema can break scripts:

  SourceData = LOAD '/data/source' USING PigStorage(',')
               AS (col1:chararray, col2:float);
  SortedData = ORDER SourceData BY col1 ASC;
  STORE SortedData INTO '/data/output';

HCatalog uses Hive tables to abstract storage and schema:

  SourceData = LOAD 'StagingTable' USING org.apache.hcatalog.pig.HCatLoader();
  SortedData = ORDER SourceData BY col1 ASC;
  STORE SortedData INTO 'OutputTable' USING org.apache.hcatalog.pig.HCatStorer();
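The HCatalog version assumes that Hive tables named StagingTable and OutputTable already exist; HCatLoader and HCatStorer resolve their locations and schemas at run time, so the Pig Latin script never mentions a path or column types. As a minimal, hypothetical sketch (the cluster name and table definition are illustrative, not from the module), the staging table could be created from Windows Azure PowerShell:

  # Hypothetical sketch: create the Hive table that HCatLoader reads.
  # Assumes the Azure PowerShell module and an existing cluster named myhdicluster.
  Use-AzureHDInsightCluster "myhdicluster"
  Invoke-Hive -Query @"
  CREATE EXTERNAL TABLE StagingTable
    (col1 STRING, col2 FLOAT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE LOCATION '/data/source';
  "@

If the data is later moved or its schema extended, only the table definition changes; the Pig Latin script is unaffected.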

Demo: HCatalog

In this demonstration, you will see how to:
- Use HCatalog to execute HiveQL
- Use HCatalog in a Pig Latin script
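To automate the Pig Latin part of this demo rather than running it interactively, the script can be submitted as a batch job with HCatalog support enabled. This is a hedged sketch: the script path wasb:///scripts/transform.pig and the cluster name are hypothetical.

  # Hypothetical sketch: run an HCatalog-enabled Pig Latin script as a batch job.
  $pigJob = New-AzureHDInsightPigJobDefinition -File "wasb:///scripts/transform.pig" `
                                               -Arguments "-useHCatalog"
  $job = Start-AzureHDInsightJob -Cluster "myhdicluster" -JobDefinition $pigJob
  Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600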

Automating Big Data Processing Tasks
- Windows Azure PowerShell
- The Windows Azure HDInsight .NET SDK
- Oozie
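As an illustration of the first option (Oozie is covered next), the following sketch submits a HiveQL script as a batch job and retrieves its console output. The cluster name and script path are hypothetical placeholders.

  # Hypothetical sketch: automate a Hive transformation with Windows Azure PowerShell.
  $clusterName = "myhdicluster"
  $hiveJob = New-AzureHDInsightHiveJobDefinition -File "wasb:///scripts/summarize.hql"
  $job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $hiveJob
  Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600
  # Retrieve the job's console output for logging or troubleshooting
  Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardOutput

The same submit/wait/get-output pattern applies to Pig and Map/Reduce job definitions, which makes it easy to wrap an entire processing pipeline in a single repeatable script.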

Introduction to Oozie
- Oozie workflow document: an XML file defining the workflow actions
- Script files: files used by workflow actions (for example, a HiveQL query file); can contain parameters
- The job.properties file: a configuration file that sets parameter values
- HDInsight configuration files: files that configure the execution context (for example, hive-default.xml)
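Oozie reads the workflow document and its script files from HDFS, so they must be uploaded before the job is submitted. A minimal sketch, reusing the hypothetical storage names from earlier and the /example/workflow/ application path that appears in job.properties below:

  # Hypothetical sketch: stage the Oozie workflow files in the cluster's blob storage (HDFS).
  $key = Get-AzureStorageKey "mystorageaccount" | %{ $_.Primary }
  $ctx = New-AzureStorageContext -StorageAccountName "mystorageaccount" -StorageAccountKey $key
  Set-AzureStorageBlobContent -File "C:\oozie\workflow.xml" -Container "mycontainer" `
                              -Blob "example/workflow/workflow.xml" -Context $ctx
  Set-AzureStorageBlobContent -File "C:\oozie\CreateTable.q" -Container "mycontainer" `
                              -Blob "example/workflow/CreateTable.q" -Context $ctx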

Oozie Workflow File

A workflow consists of actions:

<workflow-app xmlns="uri:oozie:workflow:0.2" name="MyWorkflow">
  <start to="FirstAction"/>
  <!-- This action runs a parameterized Hive script -->
  <action name="FirstAction">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <script>CreateTable.q</script>
      <param>TABLE_NAME=${tableName}</param>
      <param>LOCATION=${tableFolder}</param>
    </hive>
    <!-- The workflow branches based on the action outcome -->
    <ok to="SecondAction"/>
    <error to="fail"/>
  </action>
  <action name="SecondAction">
    …
  </action>
  <kill name="fail">
    <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>

Script Files

Action-specific script files (for example, HiveQL scripts) receive parameters passed from the Oozie workflow file:

  DROP TABLE IF EXISTS ${TABLE_NAME};
  CREATE EXTERNAL TABLE ${TABLE_NAME}
    (Col1 STRING, Col2 FLOAT, Col3 FLOAT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE LOCATION '${LOCATION}';

Job.properties

The job.properties file holds the Oozie job settings, the path to the workflow file in HDFS, and variables (for example, values for script parameters):

  # Oozie job settings
  nameNode=wasb://my_container@my_storage_account.blob.core.windows.net
  jobTracker=jobtrackerhost:9010
  queueName=default
  oozie.use.system.libpath=true

  # Path to the workflow file in HDFS
  oozie.wf.application.path=/example/workflow/

  # Variables used to set values for script parameters
  tableName=ExampleTable
  tableFolder=/example/ExampleTable
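On HDInsight, an Oozie job is typically submitted over Oozie's REST API, with the job.properties values supplied as an XML configuration payload. The following is a condensed, hypothetical sketch (cluster name and credentials are placeholders, and most properties are omitted for brevity):

  # Hypothetical sketch: submit and start an Oozie job via the cluster's REST endpoint.
  $clusterName = "myhdicluster"
  $cred = Get-Credential   # HTTP user credentials for the cluster
  # The payload carries the job.properties values as an XML <configuration> document
  $payload = @"
  <configuration>
    <property><name>nameNode</name><value>wasb://my_container@my_storage_account.blob.core.windows.net</value></property>
    <property><name>oozie.wf.application.path</name><value>/example/workflow/</value></property>
    <property><name>user.name</name><value>admin</value></property>
    <!-- add the remaining job.properties values (jobTracker, tableName, and so on) here -->
  </configuration>
  "@
  $jobsUri = "https://$clusterName.azurehdinsight.net:443/oozie/v2/jobs"
  $response = Invoke-RestMethod -Method Post -Uri $jobsUri -Credential $cred `
                                -Body $payload -ContentType "application/xml"
  $startUri = "https://$clusterName.azurehdinsight.net:443/oozie/v2/job/$($response.id)?action=start"
  Invoke-RestMethod -Method Put -Uri $startUri -Credential $cred

The first call creates the job and returns its ID; the second call starts it. The same endpoint can then be polled to check the job's status.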

Demo: Oozie

In this demonstration, you will see how to:
- Prepare Oozie workflow files
- Run an Oozie workflow

Module Summary
- Design processes that are repeatable, with minimal dependencies
- Use HCatalog to abstract data storage location and schema
- Automate Big Data processing with:
  - PowerShell
  - The Microsoft Hadoop .NET SDK
  - Oozie