
1 Graeme Malcolm | Microsoft gmalc@microsoft.com @graeme_malcolm

2 Agenda
01 | Getting Started with Big Data, Hadoop, and HDInsight
02 | Processing Big Data with Hive and Pig
03 | Building a Big Data Workflow with Sqoop, Oozie, and the .NET SDK
04 | Real-time Big Data Processing with HBase and Storm
05 | In-Memory Big Data Processing with Spark

3 Setting Expectations
Prerequisites:
–Familiarity with database concepts and basic SQL query syntax
–Familiarity with programming fundamentals
–Experience with Microsoft Windows
Experience with Visual Studio and Azure is preferable, but not required

4 Demo Environment
Microsoft Azure Subscription
–Free trial available in some regions
Windows client computer
–Azure PowerShell
–Visual Studio 2015 and Azure SDK
–Excel
–Power BI Desktop

5 Getting Started with Big Data, Hadoop, and HDInsight

6 What is Big Data?
Data that is too large or complex for analysis in traditional relational databases
Typified by the “3 V’s”:
–Volume – Huge amounts of data to process
–Variety – A mixture of structured and unstructured data
–Velocity – New data generated extremely frequently
Example scenarios: web server log reporting, social media sentiment analysis, sensor and IoT processing

7 What is Hadoop?
Hadoop
–Open source distributed data processing cluster
–Data processed in Hadoop Distributed File System (HDFS)
Related projects
–Hive
–Pig
–Oozie
–Sqoop
–Others
[Diagram: a Hadoop cluster with a name node and data nodes storing HDFS data]

8 MapReduce
1. Source data is divided among data nodes
2. Map phase generates key/value pairs
3. Reduce phase aggregates values for each key

Word-count example:
Input split across two nodes:
  "Lorem ipsum sit amet magna sit elit"
  "Fusce magna sed sit amet magna"
MAP output (key/value pairs per node):
  Node 1: (Lorem,1) (ipsum,1) (sit,1) (amet,1) (magna,1) (sit,1) (elit,1)
  Node 2: (Fusce,1) (magna,1) (sed,1) (sit,1) (amet,1) (magna,1)
REDUCE output (values aggregated per key):
  (Lorem,1) (ipsum,1) (sit,3) (amet,2) (magna,3) (elit,1) (Fusce,1) (sed,1)

9 // Word-count Map and Reduce classes. Requires java.io.IOException,
// java.util.StringTokenizer, org.apache.hadoop.io.*, and org.apache.hadoop.mapreduce.*.
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Emit (word, 1) for every token in the input line
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  // Sum the counts emitted for each word
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

10 What is HDInsight?
Hadoop as an Azure service: Hortonworks HDP on Azure VMs
–Windows Server 2012 R2
–Linux
Azure Storage provides the HDFS layer
Azure SQL Database stores Hive/Oozie metadata
[Diagram: HDInsight cluster (VMs), with Azure Storage providing HDFS and SQL Database holding the Hive/Oozie metadata]

11 DEMO Getting Started with HDInsight

12 Processing Big Data with Hive and Pig

13 What is Hive?
A metadata service that projects tabular schemas over folders
Enables the contents of folders to be queried as tables, using SQL-like query semantics
Queries are translated into MapReduce jobs

14 Creating Hive Tables

-- Internal table in the default location (/hive/warehouse/table1);
-- folders are deleted when the table is dropped
CREATE TABLE table1 (col1 STRING, col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

-- Stored in a custom location (but still internal, so the folder
-- is deleted when the table is dropped)
CREATE TABLE table2 (col1 STRING, col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/data/table2';

-- External table (folders and files are left intact in Azure Blob
-- Store when the table is dropped)
CREATE EXTERNAL TABLE table3 (col1 STRING, col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/data/table3';

15 DEMO Creating Hive Tables

16 Loading Data into Hive Tables
–Save data files in table folders
–Use the LOAD statement
–Use the INSERT statement
–Use a CREATE TABLE AS SELECT (CTAS) statement

LOAD DATA LOCAL INPATH '/data/source' INTO TABLE MyTable;

INSERT INTO TABLE MyTable
SELECT Col1, Col2 FROM StagingTable;

CREATE TABLE SummaryTable
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/data/summarytable'
AS SELECT Col1, SUM(Col2) AS Total FROM MyTable GROUP BY Col1;

17 A Small Diversion – What is Tez?
By default, Hive queries are executed as MapReduce jobs. Tez is an alternative execution engine that runs a query as a single directed acyclic graph of tasks, which typically improves performance. Enable it for the session before running a query:

set hive.execution.engine=tez;
SELECT…

18 DEMO Querying Hive Tables

19 Running Hive Jobs from PowerShell
The New-AzureHDInsightHiveJobDefinition cmdlet
–Create a job definition
–Use Query for explicit HiveQL statements, or File to reference a saved script
–Run the job with the Start-AzureHDInsightJob cmdlet
The Invoke-Hive cmdlet
–Simpler syntax to run a HiveQL query
–Use Query for explicit HiveQL statements, or File to reference a saved script
A sketch of both approaches follows.
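A minimal sketch using the classic Azure PowerShell cmdlets named above; the cluster name and query are placeholders:

# Approach 1: define a job, submit it, and fetch its output
$clusterName = "my-hdinsight-cluster"
$hiveJob = New-AzureHDInsightHiveJobDefinition -Query "SELECT Col1, COUNT(*) AS Num FROM MyTable GROUP BY Col1;"
$job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $hiveJob
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardOutput

# Approach 2: simpler syntax with Invoke-Hive
Use-AzureHDInsightCluster $clusterName
Invoke-Hive -Query "SELECT Col1, COUNT(*) AS Num FROM MyTable GROUP BY Col1;"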

20 DEMO Using Hive in PowerShell

21 Hive and ODBC
1. Download and install the Hive ODBC Driver for HDInsight
2. Optionally, create a data source name (DSN) for your HDInsight cluster
3. Use an ODBC connection to query Hive tables
A PowerShell sketch of step 3 follows.
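A minimal sketch using the .NET ODBC classes from PowerShell; the DSN and table names are placeholder assumptions:

# Connect using a DSN created for the cluster (step 2)
$conn = New-Object System.Data.Odbc.OdbcConnection("DSN=MyHiveDSN")
$conn.Open()
$cmd = New-Object System.Data.Odbc.OdbcCommand("SELECT Col1, Col2 FROM MyTable LIMIT 10", $conn)
$reader = $cmd.ExecuteReader()
while ($reader.Read()) {
    # Print each row returned by the Hive query
    Write-Output ("{0}: {1}" -f $reader.GetValue(0), $reader.GetValue(1))
}
$conn.Close()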

22 DEMO Accessing Hive via ODBC

23 What is Pig?
Pig performs a series of transformations to data relations based on Pig Latin statements
Relations are loaded using schema-on-read semantics to project table structure at runtime
You can run Pig Latin statements interactively in the Grunt shell, or save a script file and run them as a batch

24 Relations, Bags, Tuples, and Fields
A relation is an outer bag
–A bag is a collection of tuples
–A tuple is an ordered set of fields
–A field is a data item
A field can contain an inner bag
A bag can contain tuples with non-matching schemas
Example bag:
(a, 1)
(b, 2)
(c, 3)
(d, {(4, 5), (6, 7)})
(e)
(f, 8, 9)

25 -- Load comma-delimited source data. Default data type is chararray, but temp is a long int
Readings = LOAD '/weather/data.txt' USING PigStorage(',') AS (date, temp:long);
-- Group the tuples by date
GroupedReadings = GROUP Readings BY date;
-- Get the average temp value for each date grouping
GroupedAvgs = FOREACH GroupedReadings GENERATE group, AVG(Readings.temp) AS avgtemp;
-- Ungroup the dates with the average temp
AvgWeather = FOREACH GroupedAvgs GENERATE FLATTEN(group) AS date, avgtemp;
-- Sort the results by date
SortedResults = ORDER AvgWeather BY date ASC;
-- Save the results in the /weather/summary folder
STORE SortedResults INTO '/weather/summary';

Sample input (/weather/data.txt):
2013-06-01,12
2013-06-01,14
2013-06-01,16
2013-06-02,9
2013-06-02,12
2013-06-02,9
...

Output (/weather/summary):
2013-06-01 14.00
2013-06-02 10.00

26 Common Pig Latin Operations
LOAD
FILTER
FOREACH … GENERATE
ORDER
JOIN
GROUP
FLATTEN
LIMIT
DUMP
STORE

27 Pig Latin and MapReduce
Pig generates MapReduce code from Pig Latin
MapReduce jobs are generated on:
–DUMP
–STORE

Readings = LOAD '/weather/data.txt' USING PigStorage(',') AS (date, temp:long);
GroupedReadings = GROUP Readings BY date;
GroupedAvgs = FOREACH GroupedReadings GENERATE group, AVG(Readings.temp) AS avgtemp;
AvgWeather = FOREACH GroupedAvgs GENERATE FLATTEN(group) AS date, avgtemp;
SortedResults = ORDER AvgWeather BY date ASC;
STORE SortedResults INTO '/weather/summary';   -- MapReduce code generated here

28 DEMO Using Pig

29 Building a Big Data Workflow with Sqoop, Oozie, and the .NET SDK

30 What is Sqoop?
Sqoop is a database integration service
–Built on open source Hadoop technology
–Enables bi-directional data transfer between Hadoop clusters and databases

31 Sqoop Syntax
Basic syntax: sqoop command --arg1 --arg2 ... --argN
Commands:
–import
–export
–help
–import-all-tables
–create-hive-table
–list-databases
–list-tables
–eval
–codegen
–version

32 Using the Import Command
sqoop import
  --connect jdbc_connection_string
  --username user_name
  --password password | -P
  --table table_name
  --columns col1,...colN | --query 'SELECT…'
  --warehouse-dir | --target-dir path
  --fields-terminated-by char
  --lines-terminated-by char
  --hive-import [--hive-overwrite]
  -m | --num-mappers number_of_mappers
A concrete example follows.
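For example, a hedged import from an Azure SQL Database; the server, database, credentials, and table names are placeholders:

sqoop import --connect "jdbc:sqlserver://myserver.database.windows.net;database=mydb;user=myuser@myserver;password=mypassword" --table SourceTable --target-dir /data/sourcetable --fields-terminated-by ',' -m 1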

33 Using the Export Command
sqoop export
  --connect jdbc_connection_string
  --username user_name
  --password password | -P
  --table table_name
  --export-dir path
  --fields-terminated-by char
  --lines-terminated-by char
  -m | --num-mappers number_of_mappers

34 Using Sqoop from PowerShell
1. Define the Sqoop job: New-AzureHDInsightSqoopJobDefinition
2. Submit the Sqoop job: Start-AzureHDInsightJob
3. Get job output: Wait-AzureHDInsightJob, Get-AzureHDInsightJobOutput
A sketch follows.
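A minimal sketch tying the three steps together; the connection string, table, and folder are placeholders:

# 1. Define a Sqoop job that exports a Hive results folder to a database table
$sqoopDef = New-AzureHDInsightSqoopJobDefinition -Command "export --connect jdbc:sqlserver://myserver.database.windows.net;database=mydb;user=myuser@myserver;password=mypassword --table ResultsTable --export-dir /data/summarytable --fields-terminated-by \t"

# 2. Submit the job to the cluster
$sqoopJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $sqoopDef

# 3. Wait for completion and retrieve the output (Sqoop logs to StandardError)
Wait-AzureHDInsightJob -Job $sqoopJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $sqoopJob.JobId -StandardError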

35 What is Oozie?
A workflow engine for actions in a Hadoop cluster
–Hive
–Pig
–Sqoop
–Others
Supports parallel workstreams and conditional branching

36 Anatomy of an Oozie Application
Oozie workflow file
–XML file defining workflow actions
Script files
–Files used by workflow actions - for example, HiveQL or Pig Latin

37 Oozie Workflow Document
An XML document that defines the workflow:
–Execution starts at the start node
–A Hive action runs a script file (CreateTable.hql) in the workflow folder
–The workflow branches based on the action outcome: success continues to the end node; an error routes to a kill node that reports "Workflow failed. [${wf:errorMessage(wf:lastErrorNode())}]"
A reconstructed example follows.
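A reconstructed sketch of a minimal workflow document consistent with the annotations above and the parameters on slide 39; the node names and schema versions are assumptions:

<workflow-app xmlns="uri:oozie:workflow:0.2" name="HiveWorkflow">
  <start to="CreateTable"/>
  <action name="CreateTable">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>${hiveScript}</script>
      <param>TABLE_NAME=${tableName}</param>
      <param>LOCATION=${tableFolder}</param>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Workflow failed. [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>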

38 Oozie Command Line
oozie job -oozie http://localhost:11000/oozie -config c:\files\job.properties -run

job.properties:
nameNode=wasb://hdfiles@hdstore.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=/data/workflow/

39 Oozie Parameters

job.properties:
nameNode=wasb://my_container@my_storage_account.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=/example/workflow/
hiveScript=CreateTable.hql
tableName=mytable
tableFolder=/data/mytable

Hive action in the workflow (passes the parameters through to the script):
${hiveScript}
TABLE_NAME=${tableName}
LOCATION=${tableFolder}

CreateTable.hql:
DROP TABLE IF EXISTS ${TABLE_NAME};
CREATE EXTERNAL TABLE ${TABLE_NAME} (Col1 STRING, Col2 FLOAT, Col3 FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '${LOCATION}';

40 Running Oozie Jobs from PowerShell
1. Create configuration XML (in place of job.properties):
$oozieConfig = @"
…
"@
2. Use the REST interface to create and start the job, and to retrieve job status: Invoke-RestMethod
A sketch follows.
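A minimal sketch, assuming the standard Oozie REST API exposed through the cluster gateway; the cluster name, credentials, and property values are placeholders:

$clusterName = "my-hdinsight-cluster"
$creds = Get-Credential   # HTTP user credentials for the cluster

# 1. Configuration XML in place of job.properties
$oozieConfig = @"
<configuration>
  <property><name>nameNode</name><value>wasb://my_container@my_storage_account.blob.core.windows.net</value></property>
  <property><name>jobTracker</name><value>jobtrackerhost:9010</value></property>
  <property><name>queueName</name><value>default</value></property>
  <property><name>oozie.use.system.libpath</name><value>true</value></property>
  <property><name>oozie.wf.application.path</name><value>/example/workflow/</value></property>
</configuration>
"@

# 2. Create and start the job; the response body contains the job ID
$response = Invoke-RestMethod -Method Post -Credential $creds -Uri "https://$clusterName.azurehdinsight.net/oozie/v2/jobs?action=start" -Body $oozieConfig -ContentType "application/xml"
$jobId = $response.id

# Poll the job status
Invoke-RestMethod -Method Get -Credential $creds -Uri "https://$clusterName.azurehdinsight.net/oozie/v2/job/${jobId}?show=info"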

41 DEMO Running an Oozie Workflow

42 What is Avro?
Apache Avro is a splittable serialization and data interchange format
–Based on JSON - language-agnostic
–Serializes schema and data
–Supports compression
Usable from C#, Pig, Hive, and more
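For illustration, an Avro schema is itself a JSON document; the record and field names in this sketch are hypothetical:

{
  "type": "record",
  "name": "SensorReading",
  "fields": [
    { "name": "id",   "type": "string" },
    { "name": "date", "type": "string" },
    { "name": "temp", "type": "long" }
  ]
}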

43 Using Avro in .NET
1. Import the Microsoft Azure HDInsight Avro NuGet package
   using Microsoft.Hadoop.Avro;
2. Use Avro classes to serialize a stream
–Use AvroSerializer to serialize data only
   Use reflection to serialize .NET objects as data only
   Serialize other data in a generic JSON record schema
–Use AvroContainer to serialize schema and data
   Use reflection to serialize .NET objects and schema
   Serialize other data with a generic JSON record schema

44 Using Azure Storage in .NET
1. Import the Azure Storage NuGet package
2. Create a connection string for your storage account
   (DefaultEndpointsProtocol=https;AccountName=account;AccountKey=key)
3. Create a CloudBlobClient object
4. Create a CloudBlobContainer object that references your container
5. Create a CloudBlockBlob object that references a blob
6. Read or write a stream to/from the blob
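The steps above use the .NET storage client; as a hedged alternative sketch using the Azure PowerShell storage cmdlets instead, with placeholder account, key, container, and file names:

# Build a storage context from the account name and key
$ctx = New-AzureStorageContext -StorageAccountName "my_storage_account" -StorageAccountKey "account_key"

# Write: upload a local file to a block blob in the container
Set-AzureStorageBlobContent -File "C:\data\source.txt" -Container "my_container" -Blob "data/source.txt" -Context $ctx

# Read: download the blob back to a local file
Get-AzureStorageBlobContent -Blob "data/source.txt" -Container "my_container" -Destination "C:\data\downloaded.txt" -Context $ctx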

45 Submitting Hadoop Jobs in .NET
1. Import the HDInsight Management NuGet package
–Includes the Hadoop client
2. Create a *Credential object for your cluster
–Certificate
–Basic Authentication
3. Create a *JobCreateParameters object to define the job
–MapReduce, Streaming, Pig, Hive, etc.
4. Use the JobSubmissionClientFactory class to create a client from your credentials
5. Use the client to create the job based on your parameters
6. Use the client to get the job ID and check status until complete

46 DEMO Using the.NET SDK for HDInsight

47 Real-time Big Data Processing with HBase and Storm

48 What is HBase?
Low-latency NoSQL store
Schema groups fields into column families
Read/write operations include: put, get, scan

'readings' table (column families sensor and reading):
key | sensor:id | sensor:location | reading:datetime | reading:value
1   | Sensor1   | Building 1      | 2015-01-01       | 125.9
2   | Sensor2   | Building 2      | 2015-01-01       | 152.3
3   | Sensor1   | Building 1      | 2015-01-02       | 87.3
4   | Sensor2   | Building 2      | 2015-01-02       | 151.8

Adding a row with put:
put 'readings', '5', 'sensor:id', 'Sensor1'
put 'readings', '5', 'sensor:location', 'Building 1'
put 'readings', '5', 'reading:datetime', '2015-01-03'
put 'readings', '5', 'reading:value', '126.3'

Resulting row:
5   | Sensor1   | Building 1      | 2015-01-03       | 126.3
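The slide shows only put; for completeness, the corresponding read operations look like this in the HBase shell (row key and column names taken from the table above):

get 'readings', '5'
scan 'readings', {COLUMNS => ['sensor:id', 'reading:value']}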

49 DEMO Using HBase

50 What is Storm?
An event processor for streaming data
Defines a streaming topology that consists of:
–Spouts: Consume data sources and emit streams that contain tuples
–Bolts: Operate on tuples in streams
Storm topologies run indefinitely on unbounded streams of data
–Real-time monitoring
–Event logging
[Diagram: a spout emitting a stream of tuples to bolts]

51 DEMO Storm

52 In-Memory Big Data Processing with Spark

53 What is Spark?
An in-memory parallel processing framework
Support for:
–SQL-like querying
–Streaming
–Machine learning pipelines
The Spark service on HDInsight is currently in preview

54 DEMO Spark

55 ©2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Office, Azure, System Center, Dynamics and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

