9/12/2018 11:12 PM BRK3323 Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platform, and intelligent Michael Rys Principal Program.


1 BRK3323 Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platform, and intelligent
Michael Rys, Principal Program Manager, Big Data Team, @MikeDoesBigData
© Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

2 Agenda
Modern Data Warehouse architectures with Data Lake: From ETL to LETS
Introduction to Azure Data Lake and U-SQL
Job submission with Azure Data Factory
Core U-SQL "LETS":
- Schematizing unstructured data
- Other data formats
- Scaling out over many files
- Sharing schematized data
Examples of scaling out your custom ETL processing code with U-SQL:
- JSON Processing
- Custom Image processing

3 The Traditional Data Warehouse
[Diagram: data sources (OLTP, ERP, CRM, LOB) feed a data warehouse through ETL; the warehouse serves BI and analytics (dashboards, reporting). Non-relational data (devices, web, sensors, social) sits outside this flow.]
Numbered pressures on the traditional design:
1. Increasing data volumes
2. Real-time data
3. New data sources and types
4. Cloud-born data

4 The Data Lake Approach
Designed for the Questions you don't YET know!
- Ingest all data regardless of requirements
- Store all data in native format without schema definition
- Do analysis: Hadoop, Spark, R, Azure Data Lake Analytics (ADLA)
[Diagram: sources (devices, social, LOB applications, video, sensors, web, relational, clickstream) feed batch queries, interactive queries, real-time analytics, machine learning, and the data warehouse.]

5 Azure Data Lake
An on-demand, real-time stream processing service with no-limits data lake built to support massively parallel analytics
- Performance at scale
- Optimized for analytics
- Multiple analytics engines
- Single repository sharing
[Diagram labels: ADL Store (HDFS-compatible REST API); ADL Analytics (.NET, SQL, Python, R scaled out by U-SQL); HDInsight (open source Apache Hadoop, Hive, ADL client).]

6 Big data pipeline and data flow in Azure
[Diagram, left to right: DATA, INTELLIGENCE, ACTION. Ingestion: bulk ingestion and event ingestion from business apps, custom apps, and sensors and devices. Discovery: Azure Data Catalog. Preparation, analytics and machine learning: Azure Data Lake Store, Data Lake Analytics, HDInsight (Hadoop and Spark), Stream Analytics, Machine Learning, SQL DW. Visualization: Power BI. Consumers: people.]

7 Big Data Warehouse
[Diagram labels. Sources: ERP, CRM, and other LOB data; OLTP and other RDBMS; clickstream logs and events; sensors, social, weather, and other unstructured data. Storage: Azure Data Lake Store. Processing (ETL): Azure Data Lake Analytics (U-SQL), Apache Hadoop on HDInsight, Apache Spark on HDInsight. Serving: Azure SQL Data Warehouse (PolyBase), BI models, reports and dashboards. Cross-cutting: governance and master data management, data quality and lineage. Roles: analyst, power user, data engineer, data scientist.]

8 Realtime Processing with Lambda Architecture
[Diagram labels. Inputs: IoT sensors and/or user activity streams; social, trends, weather, etc.; clickstream, batch files, server logs; images, videos, and other unstructured data. Streaming layer: event broker (Event Hubs, Apache Kafka); clean, curate, aggregate; combine reference data; perform scoring from ML models; event broker. Batch layer: Azure Data Lake Store with Azure Data Lake Analytics (U-SQL), Apache Hadoop on HDInsight, Apache Spark on HDInsight, producing reference data and trained machine learning models. Outputs: real-time dashboards, automated systems. Roles: analyst, data engineer, data scientist.]

9 Load-Extract-Transform-Store & Share
- Schematizing unstructured data (Load-Extract-Transform-Store) for analysis
- Cook data for other users (LETS and Share):
  - As unstructured data
  - As structured data
- Large-scale custom processing with custom code
- Cloud-scale Cognitive Processing
- Augment big data with high-value data from where it lives

10 Azure Data Lake Analytics

11 Data Lake Analytics Workloads
With BATCH workloads, Data Lake Analytics is ideal for:
- The transformation and preparation of data for use in other systems
- Analytics on VERY LARGE amounts of data
- Massively parallel programs written in .NET, Python and R, scaled out with U-SQL
- Performing cognition at scale on large collections

12 Introducing U-SQL: A framework for Big Data
- Scales out your custom code in .NET, Python, R over your Data Lake
- Familiar syntax to millions of SQL & .NET developers
- Unifies the declarative nature of SQL with the imperative power of C#
- Processing of structured, semi-structured and unstructured data
- Querying multiple Azure Data Sources (Federated Query)
- Analyzing with Batch, Interactive, Streaming, & Machine Learning in one language

13 Query data where it lives
Easily query data in multiple Azure data stores without moving it to a single store
U-SQL can query data from multiple sources in Azure. Where possible, data transformation is pushed close to the remote query engine to minimize data transfer and maximize performance.
[Diagram: Azure Data Lake Analytics (U-SQL) queries and writes Azure Data Lake Storage and Azure Storage Blobs, and queries Azure SQL in VMs, Azure SQL DB, and Azure SQL Data Warehouse.]

14 Embedded Artificial Intelligence
Host Deep Neural Networks (DNNs)
6 built-in cognitive functions:
- Face API
- Image Tagging
- Emotion analysis
- OCR
- Text Key Phrase Extraction
- Text Sentiment Analysis

15 Azure Data Factory
Compose, orchestrate & monitor data services at scale
- Fully managed service to support orchestration of data movement and transformation
- Connect to relational or non-relational data that is on-premises or in the cloud
- Single pane of glass to monitor and manage data processing pipelines
- Globally deployed service infrastructure
- Cost effective
[Diagram labels: No SQL, DB, ADL, Stored Procedures, Hadoop on Azure, Data Lake Analytics, Custom Code, Machine Learning, VM, Trusted data, BI & analytics.]

16 Azure Data Factory Connects ADL Store out-of-the-box to all your stores
[Table columns: Category, Data store, Supported as source, Supported as sink. The per-store check marks did not survive extraction; the stores listed per category are:]
Azure: Azure Data Lake Store, Azure Blob storage, Azure SQL Database, Azure SQL Data Warehouse, Azure Table storage, Azure DocumentDB
Databases: SQL Server*, Oracle*, MySQL*, DB2*, Teradata*, PostgreSQL*, Sybase*, Cassandra*, MongoDB*, Amazon Redshift
File: File System*, HDFS*, Amazon S3
Others: Salesforce, Generic ODBC*, Generic OData, Web Table (table from HTML), GE Historian*
* Can be on-premises or on Azure IaaS, enabled using Data Management Gateway

17 Submit Azure Data Lake Jobs with Azure Data Factory
ADF v1:
- Data Lake Analytics U-SQL Activity against ADLA linked service
- Service Principal or OAuth AAD credential
- Provides parameter model
- Can specify special runtime
ADF v2 (in preview):
- Service Principal (via Application entity in AAD)
- Simplified, with the same power as ADF v1
- Adds support for ADLA pipeline and recurring job insights
Example pipeline definition (JSON):
    {
      "name": "ComputeEventsByRegionPipeline",
      "properties": {
        "description": "This is a U-SQL pipeline.",
        "activities": [
          {
            "type": "DataLakeAnalyticsU-SQL",
            "typeProperties": {
              "scriptPath": "scripts\\kona\\SearchLogProcessing.txt",
              "scriptLinkedService": "StorageLinkedService",
              "degreeOfParallelism": 3,
              "priority": 100,
              "parameters": {
                "in": "/datalake/input/SearchLog.tsv",
                "out": "/datalake/output/Result.tsv"
              }
            },
            "inputs": [ { "name": "DataLakeTable" } ],
            "outputs": [ { "name": "EventsByRegionTable" } ],
            "policy": {
              "timeout": "06:00:00",
              "concurrency": 1,
              "executionPriorityOrder": "NewestFirst",
              "retry": 1
            },
            "scheduler": { "frequency": "Day", "interval": 1 },
            "name": "EventsByRegion",
            "linkedServiceName": "AzureDataLakeAnalyticsLinkedService"
          }
        ],
        "start": " T00:00:00Z",
        "end": " T01:00:00Z",
        "isPaused": false
      }
    }

18 Gain insights from ADLA pipeline & recurring jobs
Original Jobs View:
- List jobs submitted in the last 30 days
- Aggregate trends of jobs over 30 days
- Order and filter list of jobs
New Pipeline Jobs View:
- Superset of original jobs view
- Adds grouping of jobs by pipelines & recurrences
- Jobs and consumption trends per pipeline
- Quickly identify pipelines and jobs to troubleshoot
- Quickly compare failed jobs with "last known good" instance
- Manage pipeline cost, improve efficiency and predict future cost
How to use:
- Create ADF v2 pipelines containing ADLA U-SQL activities
- Pipelines and recurrences automatically appear in the ADLA portal
- Submit and monitor pipeline/recurring jobs using Azure PowerShell, ADLA SDK and REST APIs

19 Schematize unstructured data

20 Expression-flow Programming Style
- Automatic "in-lining" of U-SQL expressions: the whole script leads to a single execution model.
- Execution plan that is optimized out-of-the-box and without user intervention.
- Per-job and user-driven level of parallelization.
- Detailed visibility into execution steps, for debugging.
- Heatmap-like functionality to identify performance bottlenecks.

21 "Unstructured" Files
Key points: schema on read; write to file; built-in and custom Extractors and Outputters; ADL Storage and Azure Blob Storage
EXTRACT Expression
    @s = EXTRACT a string, b int
         FROM "filepath/file.csv"
         USING Extractors.Csv(encoding: Encoding.Unicode);
- Built-in Extractors: Csv, Tsv, Text with lots of options, Parquet
- Custom Extractors: e.g., JSON, XML, etc.
OUTPUT Expression
    OUTPUT @s
    TO "filepath/file.csv"
    USING Outputters.Csv();
- Built-in Outputters: Csv, Tsv, Text, Parquet
- Custom Outputters: e.g., JSON, XML, etc.
Filepath URIs
- Relative URI to default ADL Storage account: "filepath/file.csv"
- Absolute URIs:
  - ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"
  - WASB:
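Schema on read, as in the EXTRACT expression above, means the file stays untyped in storage and column names and types are applied only when a query reads it. The sketch below models that idea in plain Python rather than U-SQL; the helper name `extract_csv` and the sample rows are assumptions made for illustration and are not part of any ADLA API.

```python
import csv
import io

def extract_csv(text, schema):
    """Apply (name, type) pairs to raw CSV text at read time (hypothetical helper)."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != len(schema):
            raise ValueError(f"expected {len(schema)} columns, got {len(record)}")
        # Cast each raw string with the type declared for its column.
        rows.append({name: cast(value) for (name, cast), value in zip(schema, record)})
    return rows

# Mirrors "EXTRACT a string, b int" from the slide.
schema = [("a", str), ("b", int)]
raw = "x,1\ny,2\n"
rows = extract_csv(raw, schema)
print(rows)
```

The same raw text could be read again later with a different schema, which is the flexibility the slide attributes to keeping files "unstructured".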

22 Announcing: Processing Parquet in Azure Data Lake with U-SQL

23 Scale out over many files: File Sets!

24 File Sets
Simple pattern language on filename and path
    DECLARE @pattern string = "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";
- Binds two columns, date and suffix
- Wildcards the filename
- Limits on number of files and file sizes can be improved with
      SET @@FeaturePreviews = "FileSetV2Dot5:on,GroupedInputArray:on";
  (Will become default between now and end of year)
Virtual columns
    EXTRACT name string
          , suffix string // virtual column
          , date DateTime // virtual column
    FROM @pattern
    USING Extractors.Csv();
- Refer to virtual columns in query predicates to get partition elimination
- Warning gets raised if no partition elimination was found
Only on EXTRACT for now (on OUTPUT by early 2018)
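To make the file-set pattern concrete, the following plain-Python sketch (not U-SQL) turns the slide's pattern into a regular expression, binds the virtual columns `date` and `suffix` from each path, and then uses them to pick files before reading any data, which is the effect of partition elimination. The function names, the regex translation, and the sample paths are assumptions for illustration only; the real U-SQL compiler works differently.

```python
import re
from datetime import datetime

PATTERN = "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}"

# Regex fragments for the .NET-style date parts used on the slide.
DATE_PARTS = {"yyyy": r"\d{4}", "MM": r"\d{2}", "dd": r"\d{2}"}

def compile_file_set(pattern):
    """Turn a file-set pattern into a regex with one named group per placeholder."""
    out, pos = [], 0
    for m in re.finditer(r"\{([^}]*)\}", pattern):
        out.append(re.escape(pattern[pos:m.start()]))
        token = m.group(1)
        if token == "*":
            out.append(r"[^/]*")                      # wildcard, binds no column
        elif ":" in token:
            name, part = token.split(":", 1)          # e.g. "date", "yyyy"
            out.append(f"(?P<{name}_{part}>{DATE_PARTS[part]})")
        else:
            out.append(f"(?P<{token}>[^/]+)")
        pos = m.end()
    out.append(re.escape(pattern[pos:]))
    return re.compile("^" + "".join(out) + "$")

def virtual_columns(regex, path):
    """Return the virtual columns bound for a path, or None if it does not match."""
    m = regex.match(path)
    if m is None:
        return None
    g = m.groupdict()
    return {
        "date": datetime(int(g["date_yyyy"]), int(g["date_MM"]), int(g["date_dd"])),
        "suffix": g["suffix"],
    }

paths = [
    "/input/2017/09/25/log1.csv",
    "/input/2017/09/26/log2.csv",
    "/input/2017/09/26/notes.txt",
    "/other/2017/09/26/log3.csv",
]
regex = compile_file_set(PATTERN)
bound = {p: virtual_columns(regex, p) for p in paths}

# Partition elimination: decide from the path alone which files need to be read.
selected = [p for p, cols in bound.items()
            if cols and cols["date"] == datetime(2017, 9, 26) and cols["suffix"] == "csv"]
print(selected)
```

A predicate that does not mention `date` or `suffix` would leave every matching file selected, which corresponds to the case where the slide says a warning is raised.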

25 Cook Data and share with others

26 Meta Data Object Model
[Diagram labels. ADLA Account/Catalog contains [1,n] Database. Database contains [1,n] Schema, [0,n] C# Assemblies, Credentials, Data Source. Schema contains tables (Clustered Index, partitions, Statistics), Ext. tables, views, TVFs, Procedures, Table Types, Packages. C# objects: Extractors, Outputters, Processors, Appliers, Reducers, Combiners, Fns, UDTs, UDAgg. Legend: user objects; MD Name; C# Name; relations "Contains", "Refers to", "Implemented and named by".]

27 U-SQL Catalog
Naming
- Default database and schema context: master.dbo
- Quote identifiers with []: [my table]
- Stores data in ADL Storage /catalog folder
Discovery
- Visual Studio Server Explorer
- Azure Data Lake Analytics Portal
- SDKs and Azure PowerShell commands
Sharing
- Within an Azure Data Lake Analytics account
- Across ADLA accounts that share the same Azure Active Directory:
  - Referencing assemblies
  - Calling TVFs and procedures, and referencing tables and views
  - Inserting into tables
Securing
- Secured with AAD principals at catalog and database level

28 VIEWs and TVFs
Summary: views for simple cases; TVFs for parameterization and most cases
Views
    CREATE VIEW V AS EXTRACT …
    CREATE VIEW V AS SELECT …
- Cannot contain user-defined objects (e.g., UDFs or UDOs)!
- Will be inlined
Table-Valued Functions (TVFs)
    CREATE FUNCTION F (@arg string = "default")
    RETURNS @res [TABLE ( … )]
    AS BEGIN
        …
        @res = …
    END;
  (The names @arg and @res are placeholders; the original parameter and result names are missing from the transcript.)
- Provides parameterization
- One or more results
- Can contain multiple statements
- Can contain user code (needs assembly reference)
- Will always be inlined
- Infers schema or checks against specified return schema

29 Procedures
Allows encapsulation of U-SQL scripts
    CREATE PROCEDURE P (@arg string = "default")
    AS BEGIN
        …;
        OUTPUT @res TO …;
        INSERT INTO T …;
    END;
  (The names @arg and @res are placeholders; the original names are missing from the transcript.)
- Provides parameterization
- No result, but writes into file or table
- Can contain multiple statements
- Can contain user code (needs assembly reference)
- Will always be inlined
- Can contain DDL (but no CREATE, DROP FUNCTION/PROCEDURE)

30 Tables
CREATE TABLE
    CREATE TABLE T (col1 int
                  , col2 string
                  , col3 SQL.MAP<string,string>
                  , INDEX idx CLUSTERED (col2 ASC)
                    PARTITIONED BY (col1)
                    DISTRIBUTED BY HASH (col2)
    );
- Structured data, built-in data types only (no UDTs)
- Clustered index (needs to be specified): row-oriented
- Fine-grained distribution (needs to be specified): HASH, DIRECT HASH, RANGE, ROUND ROBIN
- Addressable partitions (optional)
CREATE TABLE AS SELECT
    CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …;
    CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT …;
    CREATE TABLE T (INDEX idx CLUSTERED …) AS myTVF(DEFAULT);
- Infers the schema from the query
- Still requires index and distribution (does not support partitioning)

31 When to use Tables
Benefits of table clustering and distribution
- Faster lookup of data provided by distribution and clustering when the right distribution/cluster is chosen
- Data distribution provides better localized scale out
- Used for filters, joins and grouping
Benefits of table partitioning
- Provides data life cycle management ("expire" old partitions)
- Partial re-computation of data at partition level
- Query predicates can provide partition elimination
Do not use when…
- No filters, joins and grouping
- No reuse of the data for future queries
If in doubt: use sampling (e.g., SAMPLE ANY(x)) and test.
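The lookup benefit of hash distribution can be illustrated with a small plain-Python model (not U-SQL): rows are assigned to buckets by a hash of the distribution key, so a filter on that key only has to scan one bucket. The CRC32 hash, the bucket count of 4, and the sample rows are assumptions for this sketch; ADLA's actual hash function and storage layout are not described in the slides.

```python
import zlib
from collections import defaultdict

def bucket_of(key, buckets):
    """Stable hash of the distribution key (CRC32 is a stand-in, not ADLA's hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % buckets

def distribute(rows, key, buckets):
    """Assign each row to exactly one bucket based on its distribution key."""
    table = defaultdict(list)
    for row in rows:
        table[bucket_of(row[key], buckets)].append(row)
    return table

rows = [{"col1": i % 3, "col2": f"user{i % 10}"} for i in range(100)]
table = distribute(rows, "col2", buckets=4)

# A filter on the distribution key only has to scan the bucket that key hashes to.
wanted = "user7"
bucket = table[bucket_of(wanted, 4)]
hits = [r for r in bucket if r["col2"] == wanted]
print(len(hits), "matches found after scanning", len(bucket), "of", len(rows), "rows")
```

A filter on a column other than the distribution key would have to scan every bucket, which is why the slide ties the benefit to choosing the right distribution.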

32 Large-scale custom processing with custom code (more details in BRK3350!)

33 U-SQL extensibility
Extend U-SQL with C#/.NET:
- Built-in operators, functions, aggregates
- C# expressions (in SELECT expressions)
- User-defined functions (UDFs)
- User-defined aggregates (UDAGGs)
- User-defined operators (UDOs)

34 What are UDOs?
Custom operator extensions in the language of your choice, scaled out by U-SQL
- User-Defined Extractors: convert files into a rowset (see BRK3323 for more examples)
- User-Defined Outputters: convert a rowset into files (see BRK3323 for more examples)
- User-Defined Processors: take one row and produce one row; pass-through versus transforming
- User-Defined Appliers: take one row and produce 0 to n rows; used with OUTER/CROSS APPLY
- User-Defined Combiners: combine rowsets (like a user-defined join)
- User-Defined Reducers: take n rows and produce m rows (normally m<n)
Scaled out with explicit U-SQL syntax that takes a UDO instance (created as part of the execution): EXTRACT, OUTPUT, CROSS APPLY, PROCESS, COMBINE, REDUCE
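The row-count contracts of processors, appliers, and reducers listed above can be modeled in a few lines of plain Python. This is only a sketch of the cardinalities (1 to 1, 1 to 0..n, n to m); real UDOs are .NET classes invoked by U-SQL, and the function names and sample rows here are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter

def processor(row):
    """Processor: one row in, one row out (here: add a derived column)."""
    return {**row, "name_upper": row["name"].upper()}

def applier(row):
    """Applier: one row in, zero to n rows out (here: one row per tag)."""
    for tag in row["tags"]:
        yield {"name": row["name"], "tag": tag}

def reducer(rows):
    """Reducer: n rows of one group in, m rows out (here: one count per group)."""
    rows = list(rows)
    yield {"tag": rows[0]["tag"], "count": len(rows)}

data = [
    {"name": "car.jpg", "tags": ["car", "outdoor"]},
    {"name": "empty.jpg", "tags": []},
    {"name": "race.jpg", "tags": ["car"]},
]

# PROCESS-like step: every input row yields exactly one output row.
processed = [processor(r) for r in data]

# CROSS APPLY-like step: rows whose applier yields nothing disappear.
applied = [out for r in data for out in applier(r)]

# REDUCE-like step: group on a key, then hand each group to the reducer.
applied.sort(key=itemgetter("tag"))          # groupby needs sorted input
reduced = [out
           for _, group in groupby(applied, key=itemgetter("tag"))
           for out in reducer(group)]
print(reduced)
```

Because each function only sees one row or one group, the calls are independent of each other, which is the property that lets an engine spread them across many nodes.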

35 JSON Processing
How do I extract data from JSON documents?

36 JSON Processing
Architecture of Sample Format Assembly
- Single JSON document per file: use JsonExtractor
- Multiple JSON documents per file:
  - Do not allow row delimiter (e.g., CR/LF) in JSON
  - Use built-in Text Extractor to extract
  - Use JsonTuple to schematize (with CROSS APPLY)
- Currently loads the full JSON document into memory; better to use JSONReader processing if docs are large
Microsoft.Analytics.Samples.Formats assembly provides Extractors and UDOs/UDFs for JSON, XML, and AVRO (built on NewtonSoft.Json, System.Xml, Microsoft.Hadoop.Avro)
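The "multiple JSON documents per file" case relies on each document occupying exactly one line, so a text extractor can split on the row delimiter before any JSON parsing happens. A minimal plain-Python sketch of that approach follows; the function name and the sample documents are assumptions, and it stands in for the built-in Text Extractor plus JsonTuple combination rather than reproducing them.

```python
import io
import json

def extract_json_lines(stream):
    """Yield one parsed document per non-empty line, holding only one line at a time."""
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            # A document that spans lines breaks the one-row-per-line assumption.
            raise ValueError(f"line {line_no} is not a complete JSON document") from err

# Two documents, one per line; no CR/LF inside a document.
sample = io.StringIO(
    '{"personid": 1, "name": "Ann"}\n'
    '{"personid": 2, "name": "Bob"}\n'
)
docs = list(extract_json_lines(sample))
print([d["name"] for d in docs])
```

Reading line by line also keeps memory use proportional to the largest single document rather than to the whole file, which relates to the slide's concern about loading full documents into memory.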

37 JSON Processing
    @json =
        EXTRACT personid int, name string, addresses string
        FROM @input
        USING new Json.JsonExtractor("[*].person");
    @person =
        SELECT personid, name,
               Json.JsonFunctions.JsonTuple(addresses)["address"] AS address_array
        FROM @json;
    @addresses =
        SELECT personid, name,
               Json.JsonFunctions.JsonTuple(address) AS address
        FROM @person
             CROSS APPLY EXPLODE (Json.JsonFunctions.JsonTuple(address_array).Values) AS A(address);
    @result =
        SELECT address["addressid"] AS addressid,
               address["street"] AS street,
               address["postcode"] AS postcode,
               address["city"] AS city
        FROM @addresses;
  (The rowset names @json and @input, and the SELECT/FROM clauses around them, are reconstructed placeholders; they are missing from the transcript.)
Callouts:
- EXTRACT columns: key to field, relative to the objects in JsonExtractor
- "[*].person": JPath expression mapping objects to rows
- JsonTuple: generates 1-level key-value pairs as SqlMap
- ["address"]: gets the value from the map as a string
- JsonTuple(address): gets the object map for an array item
- CROSS APPLY EXPLODE: converts the string array into a map and pivots all values into rows
- address["…"]: gets the desired keys from the object map
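The flow on this slide (extract person objects, peel off one level of keys with JsonTuple, then explode an array into rows) can be traced with a plain-Python model. The `json_tuple` helper below only imitates the "one level of key-value pairs, values kept as strings" behavior described in the callouts, and the sample document shape is an assumption inferred from the column names; neither is taken from the sample assembly's source.

```python
import json

def json_tuple(text):
    """Model of JsonTuple: one level of key/value pairs, nested values kept as JSON strings."""
    doc = json.loads(text)
    items = doc.items() if isinstance(doc, dict) else enumerate(doc)
    return {str(k): (v if isinstance(v, str) else json.dumps(v)) for k, v in items}

# Hypothetical input shaped to match the columns used on the slide.
document = json.dumps([
    {"person": {"personid": 1, "name": "Ann",
                "addresses": {"address": [
                    {"addressid": 10, "street": "1 Main St", "postcode": "12345", "city": "Redmond"},
                    {"addressid": 11, "street": "2 Side St", "postcode": "67890", "city": "Orlando"},
                ]}}},
])

result = []
for item in json.loads(document):                    # "[*].person": one row per person
    person = item["person"]
    addresses = json.dumps(person["addresses"])      # nested object stays a string column
    address_array = json_tuple(addresses)["address"]
    # EXPLODE over the map's values: one output row per array item.
    for address_text in json_tuple(address_array).values():
        address = json_tuple(address_text)
        result.append({
            "personid": person["personid"],
            "addressid": address["addressid"],
            "street": address["street"],
            "postcode": address["postcode"],
            "city": address["city"],
        })
print(result)
```

Note that in this model every value pulled from the map is a string (for example `addressid` comes back as `"10"`), matching the callout that values are retrieved from the map as strings.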

38 Cloud-scale Cognitive Processing
[Example images with generated tags: Racing, Parked, Car, Green, Outdoor]
See my session BRK3350 for more details and many more examples

39 Related Ignite Presentations
- BRK Understanding big data on Azure - structured, unstructured and streaming, Tuesday, September 26, 10:45 AM - 12:00 PM, OCCC W307
- BRK Data on Azure: The big picture, Wednesday, September 27, 12:30 PM - 1:45 PM, Hyatt Plaza International G
- BRK Run Python, R and .NET code at Data Lake scale with U-SQL in Azure Data Lake, Thursday, September 28, 10:45 AM - 12:00 PM, Hyatt Plaza International D-F
Stop by the booth and the Hands-On Labs!

40 Additional Resources
- Blogs and community page: (U-SQL Github)
- Documentation, presentations and articles
- Getting Started with R in U-SQL
- ADL forums and feedback
- Continue your education at Microsoft Virtual Academy online.

41 Please evaluate this session
- From your PC or tablet: visit MyIgnite
- From your phone: download and use the Microsoft Ignite mobile app
Your input is important!
