Building Analytics At Scale With USQL and C#
Josh Fennessy, Principal, BlueGranite
[Architecture overview: Data → Intelligence → Action. Data sources (apps, web, mobile, sensors and devices) flow into information management (Data Factory, Data Catalog, Event Hubs), land in big data stores (Data Lake Store, SQL Data Warehouse), and are processed by machine learning and analytics services (Machine Learning, Data Lake Analytics, HDInsight with Hadoop and Spark, Stream Analytics). Intelligence services (Cognitive Services, Bot Framework, Cortana) deliver results to people and automated systems through apps, bots, and Power BI dashboards and visualizations.]
Azure Data Lake: Store and Analytics

Store
- Hyper-scale distributed storage
- Integrated with Azure Active Directory
- No file size or account size limits
- Compatible with WebHDFS
- Pay for what you use

Analytics
- Clusterless distributed computing platform
- Based on C# and USQL
- Build complex data processing jobs in Visual Studio
- Pay per job instead of per hour
Requirements

Required
- Azure Subscription
- Azure Data Lake Store Account

Recommended
- Visual Studio 2015/2017
- Azure Data Lake Tools
Creating an Account
Setting up Visual Studio
Now What?
- Process data for later analysis: the T in ELT/ETL
- Do actual analysis of data and save the results
- Give structure to unstructured or semi-structured data for later use
Basic Job Components
- ROWSET: a description of data stored in one or more files, read using an EXTRACTOR
- USQL: looks like T-SQL, smells like T-SQL, but used to do parallel batch processing of data
- OUTPUT: the results of the transformation, written back to storage using an OUTPUTTER to format the file
Demo: Basic Job Structure
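To make the three components concrete, here is a minimal sketch of a complete job (the path, schema, and names are hypothetical, not from the deck):

// ROWSET: describe the data in the file and read it with an extractor
@searchlog =
    EXTRACT UserId int,
            Query string,
            Duration int
    FROM "/input/searchlog.tsv"
    USING Extractors.Tsv();

// USQL: a T-SQL-like transformation, executed as a parallel batch
@result =
    SELECT Query,
           COUNT(*) AS QueryCount
    FROM @searchlog
    GROUP BY Query;

// OUTPUT: write the result back to storage with an outputter
OUTPUT @result
TO "/output/querycounts.csv"
USING Outputters.Csv();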
Pay Per Job
ADLA is charged per job:
- Each successful execution will incur charges
- Pay per Analytic Unit (AU) per compute hour, prorated to the minute (a worked example follows)
- Let's take a look at the previous job that was run
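A worked example of the billing math (the rate is hypothetical; check current Azure pricing): a job that runs for 30 minutes with 10 AUs allocated consumes 10 AUs × 0.5 hours = 5 AU-hours. At, say, $2 per AU-hour, that single execution costs about $10, whether or not the allocated AUs were actually busy.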
Other Job Components
- USQL Tables: store data permanently when it is accessed often; support partitioning, bucketing, and many other Big Data features used by other distributed processing environments
- User Defined Code: build custom processing tools to extend the out-of-the-box capabilities; all user defined code is written in C# and deployed to the job via assemblies (see the sketch below)
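As a sketch of how user defined code plugs in (all names here are hypothetical): with the Azure Data Lake Tools, C# placed in the script's code-behind file is compiled and shipped with the job, while shared code is registered as an assembly.

// Script.usql.cs (code-behind file next to the USQL script)
namespace Demo.Udfs
{
    public static class Formatters
    {
        // Normalize a raw name value before it is stored.
        public static string CleanName(string raw)
        {
            return string.IsNullOrWhiteSpace(raw) ? null : raw.Trim().ToUpperInvariant();
        }
    }
}

// Script.usql: call the helper inline like any other C# expression
@cleaned =
    SELECT Demo.Udfs.Formatters.CleanName(Name) AS Name
    FROM @rows;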
Demo: Working with Tables
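A minimal sketch of a managed USQL table (hypothetical schema; managed tables require a clustered index and a distribution scheme):

CREATE DATABASE IF NOT EXISTS SalesDb;

CREATE TABLE IF NOT EXISTS SalesDb.dbo.Orders
(
    OrderDate DateTime,
    Region string,
    Amount decimal,
    INDEX idx_orders CLUSTERED (Region ASC)
    DISTRIBUTED BY HASH (Region)
);

// Load the table from a rowset defined earlier in the script
INSERT INTO SalesDb.dbo.Orders
SELECT OrderDate, Region, Amount
FROM @orders;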
Demo: Multiple Rowsets
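A sketch of a job built from multiple rowsets (hypothetical files and columns): each @name is a rowset expression, and later expressions can combine earlier ones before a single OUTPUT.

@orders =
    EXTRACT OrderDate DateTime,
            Region string,
            Amount decimal
    FROM "/data/orders.csv"
    USING Extractors.Csv(skipFirstNRows : 1);

@regions =
    EXTRACT Region string,
            Manager string
    FROM "/data/regions.csv"
    USING Extractors.Csv();

// Join the two rowsets; note the C#-style == in the join condition
@summary =
    SELECT r.Manager,
           o.Region,
           SUM(o.Amount) AS TotalAmount
    FROM @orders AS o
         INNER JOIN @regions AS r
         ON o.Region == r.Region
    GROUP BY r.Manager, o.Region;

OUTPUT @summary
TO "/output/summary.csv"
USING Outputters.Csv(outputHeader : true);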
External Scripts
Execute R or Python scripts:
- Embed the script inline or store it in a separate file
- Pass data to the script
- Return data back to USQL and output it directly or use it in further transformations
Demo: Call an External R Script
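A sketch of the inline R pattern (assumes the account has the U-SQL Advanced Analytics extensions installed; columns and logic are hypothetical). The rowset arrives in the script as inputFromUSQL, and the data frame assigned to outputToUSQL comes back as a rowset:

REFERENCE ASSEMBLY [ExtR];

DECLARE @myRScript string = @"
outputToUSQL <- data.frame(MeanDuration = mean(inputFromUSQL$Duration))
";

@searchlog =
    EXTRACT Region string,
            Duration double
    FROM "/input/searchlog.tsv"
    USING Extractors.Tsv();

// Run the R script once per Region partition
@means =
    REDUCE @searchlog ON Region
    PRODUCE Region string,
            MeanDuration double
    READONLY Region
    USING new Extension.R.Reducer(command : @myRScript, rReturnType : "dataframe");

OUTPUT @means
TO "/output/means.csv"
USING Outputters.Csv();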
Unstructured Data
ADLA can also work with unstructured data:
- Data that has no defined structure – but it has structure
- Cognitive Services can be useful for making sense of unstructured data
Demo: Process Images with Cognitive Services
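A sketch of built-in image tagging (assumes the cognitive extensions are enabled for the account; the paths and the Tags string signature follow the 2017-era samples):

REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY ImageTagging;

// Read raw image bytes; {FileName} becomes a virtual column
@images =
    EXTRACT FileName string,
            ImgData byte[]
    FROM "/images/{FileName}.jpg"
    USING new Cognition.Vision.ImageExtractor();

// Tag the objects detected in each image
@tags =
    PROCESS @images
    PRODUCE FileName,
            NumObjects int,
            Tags string
    READONLY FileName
    USING new Cognition.Vision.ImageTagger();

OUTPUT @tags
TO "/output/imagetags.csv"
USING Outputters.Csv(outputHeader : true);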
Operationalize
Building a USQL job is only part of the process. Azure Data Factory is the easiest way to schedule and run USQL in production.
Operationalize
PowerShell can also be used to execute a USQL job.
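(As a pointer, not a recipe: the Azure PowerShell Data Lake Analytics module exposes a Submit-AdlJob cmdlet that takes an account name and a USQL script file; exact module names and parameters vary by Azure PowerShell version.)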
When Things Go Wrong
Job failures are bound to happen. Don't worry! There is a process:
1. Browse to Job Management in the Azure Data Lake Analytics portal
2. Find your failed job and select it
3. Review errors, inputs, and outputs to locate the root cause and remediate
Optimizing Performance
Performance optimization is a balance between job cost and total execution time:
- Allocating more AUs may improve performance, but may also greatly increase cost
- Controlling USQL code is the most important step in optimizing performance; ask yourself: is this the most efficient way to do this operation?
- AU efficiency is the most important metric for understanding performance versus cost (a worked example follows)
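A worked example with hypothetical numbers: a job allocated 10 AUs for a one-hour run is billed for 10 AU-hours; if its vertices only perform about 2 AU-hours of actual work, AU efficiency is 2 / 10 = 20%, and rerunning at 3 AUs would likely cost far less with only a modest increase in execution time.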
Optimizing Performance
User Defined Objects (UDOs):
- Use UDOs sparingly; the optimizer cannot help at all with performance issues related to UDOs
- Consider replacing UDO logic with SELECT…CROSS APPLY, which covers up to 90% of the cases for a UDO (see the sketch below)
- UDOs for EXTRACTORS or OUTPUTTERS are usually OK, but avoid them for data transformation
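A sketch of that replacement (hypothetical columns): instead of a custom processor that splits a delimited Tags column into rows, explode an array inline where the optimizer can see it.

@expanded =
    SELECT FileName,
           T.Tag
    FROM @tagged
         CROSS APPLY
             EXPLODE(new SQL.ARRAY<string>(Tags.Split(';'))) AS T(Tag);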
Optimizing Performance
Final thoughts:
- Deeply understand the query lifecycle
- Monitor for data skew in your ADLA tables
- Use partitioning wisely
- Avoid UDOs
- Optimize for the right balance of cost and performance
- Good performance at small scale != good performance at large scale; do full-scale testing and analysis too!
Recap
Azure Data Lake Analytics:
- Designed for ETL/ELT on extremely large data
- Familiar to SQL and/or C# users
- Priced per job; no charge when idle
- Linked to a Data Lake Store account
- Rowsets describe data stored in files
- Load distributed tables for data that is referenced frequently or is extremely large
- Integrate with external languages: R or Python
- Cognitive Services built in; no extra charge!
- Operationalize with Azure Data Factory or PowerShell
- Performance is subjective, based on the balance between cost and execution time
- Learn the query lifecycle to truly understand performance characteristics