06 | Automating Big Data Processing


06 | Automating Big Data Processing
Graeme Malcolm | Data Technology Specialist, Content Master
Pete Harris | Learning Product Planner, Microsoft

Module Overview
- Overview of Big Data Processing
- Storage and Schema Considerations
- HCatalog
- Oozie

Overview of Big Data Processing

Big Data Processing Workflow:
- Upload source data to HDFS (in a Windows Azure storage blob container)
- Transform the data using Pig, Hive, and Map/Reduce
- Consume the results of the transformation for reporting and analysis

Design goals:
- Provision Windows Azure HDInsight on demand
- Ensure data processing operations are repeatable
- Minimize hard-coded dependencies
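As an illustration of this workflow, the following Windows Azure PowerShell sketch uploads a source file and provisions an HDInsight cluster on demand. It is a minimal sketch, not part of the module's demos: the storage account, container, cluster name, and file paths are all hypothetical, and it assumes the Azure PowerShell module is loaded and a subscription has already been selected.

  # Hypothetical names throughout; assumes Select-AzureSubscription has been run.
  $storageAccount = "mystorageaccount"
  $container      = "mycontainer"
  $clusterName    = "myhdicluster"
  $key = Get-AzureStorageKey $storageAccount | %{ $_.Primary }

  # Upload source data to the blob container that will back HDFS
  $ctx = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $key
  Set-AzureStorageBlobContent -File "C:\data\source.csv" -Container $container `
                              -Blob "data/source/source.csv" -Context $ctx

  # Provision an HDInsight cluster on demand, run jobs, then delete it when done
  $cred = Get-Credential   # admin credentials for the new cluster
  New-AzureHDInsightCluster -Name $clusterName -Location "North Europe" `
      -DefaultStorageAccountName "$storageAccount.blob.core.windows.net" `
      -DefaultStorageAccountKey $key -DefaultStorageContainerName $container `
      -ClusterSizeInNodes 4 -Credential $cred
  # ... submit transformation jobs here ...
  Remove-AzureHDInsightCluster -Name $clusterName

Because the data lives in the storage account rather than on the cluster, the cluster itself is disposable: deleting it when the jobs finish stops the compute charges without losing any data.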

Storage and Schema Considerations

Hard-coded paths and schema can break scripts:

  SourceData = LOAD '/data/source' USING PigStorage(',')
               AS (col1:chararray, col2:float);
  SortedData = ORDER SourceData BY col1 ASC;
  STORE SortedData INTO '/data/output';

HCatalog uses Hive tables to abstract storage and schema:

  SourceData = LOAD 'StagingTable' USING org.apache.hcatalog.pig.HCatLoader();
  SortedData = ORDER SourceData BY col1 ASC;
  STORE SortedData INTO 'OutputTable' USING org.apache.hcatalog.pig.HCatStorer();
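The HCatalog version assumes that Hive tables named StagingTable and OutputTable already exist; HCatLoader and HCatStorer resolve their locations and schemas at run time, so the Pig Latin script never mentions a path or column types. As a minimal, hypothetical sketch (the cluster name and table definition are illustrative, not from the module), the staging table could be created from Windows Azure PowerShell:

  # Hypothetical sketch: create the Hive table that HCatLoader reads.
  # Assumes the Azure PowerShell module and an existing cluster named myhdicluster.
  Use-AzureHDInsightCluster "myhdicluster"
  Invoke-Hive -Query @"
  CREATE EXTERNAL TABLE StagingTable
    (col1 STRING, col2 FLOAT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE LOCATION '/data/source';
  "@

If the data is later moved or its schema extended, only the table definition changes; the Pig Latin script is unaffected.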

Demo: HCatalog

In this demonstration, you will see how to:
- Use HCatalog to execute HiveQL
- Use HCatalog in a Pig Latin script
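To automate the Pig Latin part of this demo rather than running it interactively, the script can be submitted as a batch job with HCatalog support enabled. This is a hedged sketch: the script path wasb:///scripts/transform.pig and the cluster name are hypothetical.

  # Hypothetical sketch: run an HCatalog-enabled Pig Latin script as a batch job.
  $pigJob = New-AzureHDInsightPigJobDefinition -File "wasb:///scripts/transform.pig" `
                                               -Arguments "-useHCatalog"
  $job = Start-AzureHDInsightJob -Cluster "myhdicluster" -JobDefinition $pigJob
  Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600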

Automating Big Data Processing Tasks
- Windows Azure PowerShell
- The Windows Azure HDInsight .NET SDK
- Oozie
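As an illustration of the first option (Oozie is covered next), the following sketch submits a HiveQL script as a batch job and retrieves its console output. The cluster name and script path are hypothetical placeholders.

  # Hypothetical sketch: automate a Hive transformation with Windows Azure PowerShell.
  $clusterName = "myhdicluster"
  $hiveJob = New-AzureHDInsightHiveJobDefinition -File "wasb:///scripts/summarize.hql"
  $job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $hiveJob
  Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600
  # Retrieve the job's console output for logging or troubleshooting
  Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardOutput

The same submit/wait/get-output pattern applies to Pig and Map/Reduce job definitions, which makes it easy to wrap an entire processing pipeline in a single repeatable script.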

Introduction to Oozie
- Oozie workflow document: an XML file defining the workflow actions
- Script files: files used by workflow actions (for example, a HiveQL query file); can contain parameters
- The job.properties file: a configuration file that sets parameter values
- HDInsight configuration files: files that configure the execution context (for example, hive-default.xml)
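Oozie reads the workflow document and its script files from HDFS, so they must be uploaded before the job is submitted. A minimal sketch, reusing the hypothetical storage names from earlier and the /example/workflow/ application path that appears in job.properties below:

  # Hypothetical sketch: stage the Oozie workflow files in the cluster's blob storage (HDFS).
  $key = Get-AzureStorageKey "mystorageaccount" | %{ $_.Primary }
  $ctx = New-AzureStorageContext -StorageAccountName "mystorageaccount" -StorageAccountKey $key
  Set-AzureStorageBlobContent -File "C:\oozie\workflow.xml" -Container "mycontainer" `
                              -Blob "example/workflow/workflow.xml" -Context $ctx
  Set-AzureStorageBlobContent -File "C:\oozie\CreateTable.q" -Container "mycontainer" `
                              -Blob "example/workflow/CreateTable.q" -Context $ctx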

Oozie Workflow File

A workflow consists of actions:

<workflow-app xmlns="uri:oozie:workflow:0.2" name="MyWorkflow">
  <start to="FirstAction"/>
  <!-- This action runs a parameterized Hive script -->
  <action name="FirstAction">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <script>CreateTable.q</script>
      <param>TABLE_NAME=${tableName}</param>
      <param>LOCATION=${tableFolder}</param>
    </hive>
    <!-- The workflow branches based on the action outcome -->
    <ok to="SecondAction"/>
    <error to="fail"/>
  </action>
  <action name="SecondAction">
    …
  </action>
  <kill name="fail">
    <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>

Script Files

Action-specific script files (for example, HiveQL scripts) receive parameters passed from the Oozie workflow file:

  DROP TABLE IF EXISTS ${TABLE_NAME};
  CREATE EXTERNAL TABLE ${TABLE_NAME}
    (Col1 STRING, Col2 FLOAT, Col3 FLOAT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE LOCATION '${LOCATION}';

Job.properties

The job.properties file holds the Oozie job settings, the path to the workflow file in HDFS, and variables (for example, values for script parameters):

  # Oozie job settings
  nameNode=wasb://my_container@my_storage_account.blob.core.windows.net
  jobTracker=jobtrackerhost:9010
  queueName=default
  oozie.use.system.libpath=true

  # Path to the workflow file in HDFS
  oozie.wf.application.path=/example/workflow/

  # Variables used to set values for script parameters
  tableName=ExampleTable
  tableFolder=/example/ExampleTable
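On HDInsight, an Oozie job is typically submitted over Oozie's REST API, with the job.properties values supplied as an XML configuration payload. The following is a condensed, hypothetical sketch (cluster name and credentials are placeholders, and most properties are omitted for brevity):

  # Hypothetical sketch: submit and start an Oozie job via the cluster's REST endpoint.
  $clusterName = "myhdicluster"
  $cred = Get-Credential   # HTTP user credentials for the cluster
  # The payload carries the job.properties values as an XML <configuration> document
  $payload = @"
  <configuration>
    <property><name>nameNode</name><value>wasb://my_container@my_storage_account.blob.core.windows.net</value></property>
    <property><name>oozie.wf.application.path</name><value>/example/workflow/</value></property>
    <property><name>user.name</name><value>admin</value></property>
    <!-- add the remaining job.properties values (jobTracker, tableName, and so on) here -->
  </configuration>
  "@
  $jobsUri = "https://$clusterName.azurehdinsight.net:443/oozie/v2/jobs"
  $response = Invoke-RestMethod -Method Post -Uri $jobsUri -Credential $cred `
                                -Body $payload -ContentType "application/xml"
  $startUri = "https://$clusterName.azurehdinsight.net:443/oozie/v2/job/$($response.id)?action=start"
  Invoke-RestMethod -Method Put -Uri $startUri -Credential $cred

The first call creates the job and returns its ID; the second call starts it. The same endpoint can then be polled to check the job's status.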

Demo: Oozie

In this demonstration, you will see how to:
- Prepare Oozie workflow files
- Run an Oozie workflow

Module Summary
- Design processes that are repeatable, with minimal dependencies
- Use HCatalog to abstract data storage location and schema
- Automate Big Data processing with:
  - PowerShell
  - The Microsoft Hadoop .NET SDK
  - Oozie