1
Graeme Malcolm | Microsoft gmalc@microsoft.com @graeme_malcolm
2
Agenda
01 | Getting Started with Big Data, Hadoop, and HDInsight
02 | Processing Big Data with Hive and Pig
03 | Building a Big Data Workflow with Sqoop, Oozie, and the .NET SDK
04 | Real-time Big Data Processing with HBase and Storm
05 | In-Memory Big Data Processing with Spark
3
Setting Expectations
Prerequisites:
– Familiarity with database concepts and basic SQL query syntax
– Familiarity with programming fundamentals
– Experience with Microsoft Windows
Experience with Visual Studio and Azure is preferable, but not required
4
Demo Environment
Microsoft Azure subscription
– Free trial available in some regions
Windows client computer
– Azure PowerShell
– Visual Studio 2015 and the Azure SDK
– Excel
– Power BI Desktop
5
Getting Started with Big Data, Hadoop, and HDInsight
6
What is Big Data?
Data that is too large or complex for analysis in traditional relational databases
Typified by the "3 Vs":
– Volume – huge amounts of data to process
– Variety – a mixture of structured and unstructured data
– Velocity – new data generated extremely frequently
Example scenarios: web server log reporting, social media sentiment analysis, sensor and IoT processing
7
What is Hadoop?
Hadoop
– Open source distributed data processing cluster
– Data processed in the Hadoop Distributed File System (HDFS)
Related projects
– Hive
– Pig
– Oozie
– Sqoop
– Others
[Diagram: a Hadoop cluster with a name node and data nodes sharing HDFS]
8
MapReduce
1. Source data is divided among data nodes
2. Map phase generates key/value pairs
3. Reduce phase aggregates the values for each key
Example: given the input lines "Lorem ipsum sit amet magna sit elit" and "Fusce magna sed sit amet magna", the map phase emits a (word, 1) pair for every word, and the reduce phase sums the values for each key, giving Lorem=1, ipsum=1, sit=3, amet=2, magna=3, elit=1, Fusce=1, sed=1.
9
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      // Emit a (word, 1) pair for every token in the input line
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      // Sum the counts emitted for each word
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}
10
What is HDInsight?
Hadoop as an Azure service
Hortonworks HDP on Azure VMs
– Windows Server 2012 R2
– Linux
Azure Storage provides the HDFS layer
Azure SQL Database stores Hive/Oozie metadata
[Diagram: an HDInsight cluster of VMs, with Azure Storage providing HDFS and a SQL Database holding the Hive/Oozie metadata]
11
DEMO Getting Started with HDInsight
12
Processing Big Data with Hive and Pig
13
What is Hive?
A metadata service that projects tabular schemas over folders
Enables the contents of folders to be queried as tables, using SQL-like query semantics
Queries are translated into MapReduce jobs
14
Creating Hive Tables

-- Internal table in the default location (/hive/warehouse/table1); the folder is deleted when the table is dropped
CREATE TABLE table1 (col1 STRING, col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

-- Internal table stored in a custom location; still internal, so the folder is deleted when the table is dropped
CREATE TABLE table2 (col1 STRING, col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/data/table2';

-- External table; folders and files are left intact in the Azure Blob store when the table is dropped
CREATE EXTERNAL TABLE table3 (col1 STRING, col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/data/table3';
15
DEMO Creating Hive Tables
16
Loading Data into Hive Tables
– Save data files in table folders
– Use the LOAD statement
– Use the INSERT statement
– Use a CREATE TABLE AS SELECT (CTAS) statement

LOAD DATA LOCAL INPATH '/data/source' INTO TABLE MyTable;

INSERT INTO TABLE MyTable
SELECT Col1, Col2 FROM StagingTable;

CREATE TABLE SummaryTable
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/data/summarytable'
AS SELECT Col1, SUM(Col2) AS Total FROM MyTable GROUP BY Col1;
17
A Small Diversion – What is Tez?
Tez is an alternative execution engine for Hive queries
[Diagram: by default a SELECT… query runs as MapReduce; after set hive.execution.engine=tez; the same query runs on the Tez engine]
18
DEMO Querying Hive Tables
19
Running Hive Jobs from PowerShell
The New-AzureHDInsightHiveJobDefinition cmdlet
– Create a job definition
– Use -Query for explicit HiveQL statements, or -File to reference a saved script
– Run the job with the Start-AzureHDInsightJob cmdlet
The Invoke-Hive cmdlet
– Simpler syntax to run a HiveQL query
– Use -Query for explicit HiveQL statements, or -File to reference a saved script
20
DEMO Using Hive in PowerShell
21
Hive and ODBC
1. Download and install the Hive ODBC Driver for HDInsight
2. Optionally, create a data source name (DSN) for your HDInsight cluster
3. Use an ODBC connection to query Hive tables
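As a minimal sketch of step 3 from C# (assuming the Hive ODBC driver is installed and a DSN has been created; the DSN name, credentials, and table used here are hypothetical):

using System;
using System.Data.Odbc;

class HiveOdbcExample
{
    static void Main()
    {
        // Connect through a previously configured DSN (placeholder name and credentials)
        string connectionString = "DSN=HiveDSN;UID=admin;PWD=password";
        using (var connection = new OdbcConnection(connectionString))
        {
            connection.Open();
            // Run a HiveQL query against a hypothetical table and print the results
            using (var command = new OdbcCommand("SELECT col1, col2 FROM mytable LIMIT 10", connection))
            using (OdbcDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0}\t{1}", reader[0], reader[1]);
                }
            }
        }
    }
}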
22
DEMO Accessing Hive via ODBC
23
What is Pig?
Pig performs a series of transformations on data relations based on Pig Latin statements
Relations are loaded using schema-on-read semantics to project table structure at runtime
You can run Pig Latin statements interactively in the Grunt shell, or save them in a script file and run them as a batch
24
Relations, Bags, Tuples, and Fields
A relation is an outer bag
– A bag is a collection of tuples
– A tuple is an ordered set of fields
– A field is a data item
A field can contain an inner bag
A bag can contain tuples with non-matching schemas
Example bag: (a, 1) (b, 2) (c, 3) (d, {(4, 5), (6, 7)}) (e) (f, 8, 9)
25
-- Load comma-delimited source data. Default data type is chararray, but temp is a long
Readings = LOAD '/weather/data.txt' USING PigStorage(',') AS (date, temp:long);
-- Group the tuples by date
GroupedReadings = GROUP Readings BY date;
-- Get the average temp value for each date grouping
GroupedAvgs = FOREACH GroupedReadings GENERATE group, AVG(Readings.temp) AS avgtemp;
-- Ungroup the dates with the average temp
AvgWeather = FOREACH GroupedAvgs GENERATE FLATTEN(group) AS date, avgtemp;
-- Sort the results by date
SortedResults = ORDER AvgWeather BY date ASC;
-- Save the results in the /weather/summary folder
STORE SortedResults INTO '/weather/summary';

Sample input (/weather/data.txt):
2013-06-01,12
2013-06-01,14
2013-06-01,16
2013-06-02,9
2013-06-02,12
2013-06-02,9
...

Sample output (/weather/summary):
2013-06-01 14.00
2013-06-02 10.00
26
Common Pig Latin Operations
LOAD, FILTER, FOREACH … GENERATE, ORDER, JOIN, GROUP, FLATTEN, LIMIT, DUMP, STORE
27
Pig Latin and MapReduce
Pig generates MapReduce code from Pig Latin
MapReduce jobs are generated on:
– DUMP
– STORE

Readings = LOAD '/weather/data.txt' USING PigStorage(',') AS (date, temp:long);
GroupedReadings = GROUP Readings BY date;
GroupedAvgs = FOREACH GroupedReadings GENERATE group, AVG(Readings.temp) AS avgtemp;
AvgWeather = FOREACH GroupedAvgs GENERATE FLATTEN(group) AS date, avgtemp;
SortedResults = ORDER AvgWeather BY date ASC;
STORE SortedResults INTO '/weather/summary';  -- MapReduce code generated here
28
DEMO Using Pig
29
Building a Big Data Workflow with Sqoop, Oozie, and the .NET SDK
30
What is Sqoop? Sqoop is a database integration service –Built on open source Hadoop technology –Enables bi-directional data transfer between Hadoop clusters and databases
31
Sqoop Syntax
Basic syntax: sqoop command --arg1 --arg2 … --argN
Commands: import, export, help, import-all-tables, create-hive-table, list-databases, list-tables, eval, codegen, version
32
Using the Import Command
sqoop import
  --connect jdbc_connection_string
  --username user_name
  --password password | -P
  --table table_name
  --columns col1,...colN | --query 'SELECT…'
  --warehouse-dir | --target-dir path
  --fields-terminated-by char
  --lines-terminated-by char
  --hive-import [--hive-overwrite]
  -m | --num-mappers number_of_mappers
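For example, a hypothetical import of an Orders table from an Azure SQL Database into the /data/orders folder could combine these arguments as follows (the server, database, credentials, and table name are placeholders):

sqoop import --connect "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb" --username sqluser --password Pa55w.rd --table Orders --target-dir /data/orders --fields-terminated-by ',' -m 1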
33
Using the Export Command
sqoop export
  --connect jdbc_connection_string
  --username user_name
  --password password | -P
  --table table_name
  --export-dir path
  --fields-terminated-by char
  --lines-terminated-by char
  -m | --num-mappers number_of_mappers
34
Using Sqoop from PowerShell
1. Define the Sqoop job: New-AzureHDInsightSqoopJobDefinition
2. Submit the Sqoop job: Start-AzureHDInsightJob
3. Get the job output: Wait-AzureHDInsightJob, Get-AzureHDInsightJobOutput
35
What is Oozie?
A workflow engine for actions in a Hadoop cluster
– Hive
– Pig
– Sqoop
– Others
Supports parallel workstreams and conditional branching
36
Anatomy of an Oozie Application
Oozie workflow file
– XML file defining workflow actions
Script files
– Files used by workflow actions - for example, HiveQL or Pig Latin scripts
37
Oozie Workflow Document
[Workflow XML not shown: the workflow starts at a Hive action that runs the CreateTable.hql script file in the workflow folder, then branches based on the action outcome; on failure it reaches a kill node with the message "Workflow failed. [${wf:errorMessage(wf:lastErrorNode())}]"]
38
Oozie Command Line

job.properties:
nameNode=wasb://hdfiles@hdstore.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=/data/workflow/

oozie job -oozie http://localhost:11000/oozie -config c:\files\job.properties -run
39
Oozie Parameters

job.properties:
nameNode=wasb://my_container@my_storage_account.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=/example/workflow/
hiveScript=CreateTable.hql
tableName=mytable
tableFolder=/data/mytable

Hive action in the workflow (script and parameters):
${hiveScript}
TABLE_NAME=${tableName}
LOCATION=${tableFolder}

CreateTable.hql:
DROP TABLE IF EXISTS ${TABLE_NAME};
CREATE EXTERNAL TABLE ${TABLE_NAME} (Col1 STRING, Col2 FLOAT, Col3 FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '${LOCATION}';
40
Running Oozie Jobs from PowerShell
1. Create configuration XML (in place of job.properties):
   $oozieConfig = @" … "@
2. Use the REST interface to create and start the job, and to retrieve the job status:
   Invoke-RestMethod
41
DEMO Running an Oozie Workflow
42
What is Avro?
Apache Avro is a splittable serialization and data interchange format
Based on JSON - language-agnostic
Serializes schema and data
Supports compression
[Diagram: Avro data exchanged between C#, Pig, and Hive]
43
Using Avro in .NET
1. Import the Microsoft Azure HDInsight Avro NuGet package
   using Microsoft.Hadoop.Avro;
2. Use Avro classes to serialize a stream
– Use AvroSerializer to serialize data only
  – Use reflection to serialize .NET objects as data only
  – Serialize other data with a generic JSON record schema
– Use AvroContainer to serialize schema and data
  – Use reflection to serialize .NET objects and schema
  – Serialize other data with a generic JSON record schema
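A minimal sketch of reflection-based, data-only serialization with AvroSerializer (the SensorReading class and its values are hypothetical):

using System;
using System.IO;
using System.Runtime.Serialization;
using Microsoft.Hadoop.Avro;

[DataContract]
public class SensorReading
{
    [DataMember] public string Sensor { get; set; }
    [DataMember] public double Value { get; set; }
}

class AvroExample
{
    static void Main()
    {
        // Create a reflection-based serializer for the SensorReading type
        var serializer = AvroSerializer.Create<SensorReading>();

        using (var stream = new MemoryStream())
        {
            // Serialize the object (data only; the schema is not embedded in the stream)
            serializer.Serialize(stream, new SensorReading { Sensor = "Sensor1", Value = 125.9 });

            // Deserialize it back from the same stream
            stream.Position = 0;
            SensorReading reading = serializer.Deserialize(stream);
            Console.WriteLine("{0}: {1}", reading.Sensor, reading.Value);
        }
    }
}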
44
Using Azure Storage in .NET
1. Import the Azure Storage NuGet package
2. Create a connection string for your storage account (DefaultEndpointsProtocol=https;AccountName=…;AccountKey=…)
3. Create a CloudBlobClient object
4. Create a CloudBlobContainer object that references your container
5. Create a CloudBlockBlob object that references a blob
6. Read or write a stream to/from the blob
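A minimal sketch of steps 2-6 using the Microsoft.WindowsAzure.Storage client library (the account, key, container, and blob names are placeholders):

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class BlobExample
{
    static void Main()
    {
        // Connection string for the storage account (placeholder values)
        string connectionString =
            "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=<storage-key>";

        // Parse the connection string and create a blob client
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudBlobClient blobClient = account.CreateCloudBlobClient();

        // Reference the container and a block blob within it
        CloudBlobContainer container = blobClient.GetContainerReference("mycontainer");
        CloudBlockBlob blob = container.GetBlockBlobReference("data/sample.txt");

        // Write text to the blob, then read it back
        blob.UploadText("Hello, HDInsight");
        Console.WriteLine(blob.DownloadText());
    }
}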
45
Submitting Hadoop Jobs in .NET
1. Import the HDInsight Management NuGet package
– Includes the Hadoop client
2. Create a *Credential object for your cluster
– Certificate
– Basic Authentication
3. Create a *JobCreateParameters object to define the job
– MapReduce, Streaming, Pig, Hive, etc.
4. Use the JobSubmissionClientFactory class to create a client from your credentials
5. Use the client to create the job based on your parameters
6. Use the client to get the job ID and check its status until complete
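A minimal sketch of these steps for a Hive job using basic authentication (the cluster name, credentials, and query are placeholders, and exact class names and overloads may vary between SDK versions):

using System;
using System.Threading;
using Microsoft.Hadoop.Client;

class SubmitHiveJob
{
    static void Main()
    {
        // Basic-authentication credentials for the cluster gateway (placeholder values)
        var credentials = new BasicAuthCredential
        {
            Server = new Uri("https://mycluster.azurehdinsight.net"),
            UserName = "admin",
            Password = "password"
        };

        // Define the Hive job to run
        var hiveJob = new HiveJobCreateParameters
        {
            JobName = "Sample Hive job",
            StatusFolder = "/samplejob/status",
            Query = "SELECT col1, COUNT(*) FROM mytable GROUP BY col1;"
        };

        // Connect to the cluster and submit the job
        var jobClient = JobSubmissionClientFactory.Connect(credentials);
        JobCreationResults jobResults = jobClient.CreateHiveJob(hiveJob);

        // Poll the job status until it completes or fails
        JobDetails job = jobClient.GetJob(jobResults.JobId);
        while (job.StatusCode != JobStatusCode.Completed && job.StatusCode != JobStatusCode.Failed)
        {
            Thread.Sleep(5000);
            job = jobClient.GetJob(jobResults.JobId);
        }
        Console.WriteLine("Job finished with status: " + job.StatusCode);
    }
}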
46
DEMO Using the.NET SDK for HDInsight
47
Real-time Big Data Processing with HBase and Storm
48
What is HBase?
Low-latency NoSQL store
Schema groups fields into column families
Read/write operations include:
– put
– get
– scan

Example - the 'readings' table (column families 'sensor' and 'reading'):
key | sensor:id | sensor:location | reading:datetime | reading:value
1   | Sensor1   | Building 1      | 2015-01-01       | 125.9
2   | Sensor2   | Building 2      | 2015-01-01       | 152.3
3   | Sensor1   | Building 1      | 2015-01-02       | 87.3
4   | Sensor2   | Building 2      | 2015-01-02       | 151.8

put 'readings', '5', 'sensor:id', 'Sensor1'
put 'readings', '5', 'sensor:location', 'Building 1'
put 'readings', '5', 'reading:datetime', '2015-01-03'
put 'readings', '5', 'reading:value', '126.3'

These puts add a new row:
5   | Sensor1   | Building 1      | 2015-01-03       | 126.3
49
DEMO Using HBase
50
What is Storm?
An event processor for streaming data
Defines a streaming topology that consists of:
– Spouts: consume data sources and emit streams that contain tuples
– Bolts: operate on tuples in streams
Storm topologies run indefinitely on unbounded streams of data
– Real-time monitoring
– Event logging
51
DEMO Storm
52
In-Memory Big Data Processing with Spark
53
What is Spark?
An in-memory parallel processing framework
Support for:
– SQL-like querying
– Streaming
– Machine learning pipelines
The HDInsight Spark service is currently in preview
54
DEMO Spark
55
©2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Office, Azure, System Center, Dynamics and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.