Building Analytics at Scale with U-SQL and C#
Josh Fennessy, Principal, BlueGranite
[Diagram: Azure data and analytics platform (Data → Intelligence → Action): data sources (apps, sensors and devices) feed information management services (Data Factory, Data Catalog, Event Hubs); land in big data stores (Data Lake Store, SQL Data Warehouse); are processed by machine learning and analytics services (Machine Learning, Data Lake Analytics, HDInsight with Hadoop and Spark, Stream Analytics); surface through intelligence services (Cognitive Services, Bot Framework, Cortana); and drive action through Power BI dashboards and visualizations, apps (web, mobile, bots), and automated systems.]
Azure Data Lake
Store: hyper-scale distributed storage; integrated with Azure Active Directory; no file size or account size limits; compatible with WebHDFS; pay for what you use.
Analytics: clusterless distributed computing platform; based on C# and U-SQL; build complex data processing jobs in Visual Studio; pay per job instead of per hour.
Requirements
Required: Azure subscription; Azure Data Lake Store account.
Recommended: Visual Studio 2015/2017; Azure Data Lake Tools for Visual Studio.
Creating an Account
Setting up Visual Studio
Now what?
Process data for later analysis (the T in ETL/ELT).
Perform actual analysis of data and save the results.
Give structure to unstructured or semi-structured data for later use.
Basic job components
ROWSET: a description of data stored in one or more files, read using an EXTRACTOR.
U-SQL: looks like T-SQL, smells like T-SQL, but used for parallel batch processing of data.
OUTPUT: the results of the transformation written back to storage, using an OUTPUTTER to format the file.
(A minimal example follows.)
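To make these components concrete, here is a minimal sketch of a complete U-SQL job; the input path, schema, and output path are hypothetical.

    // Rowset: describe the data in the file and read it with a built-in extractor
    @searchlog =
        EXTRACT UserId int,
                Start DateTime,
                Region string,
                Duration int
        FROM "/input/SearchLog.tsv"
        USING Extractors.Tsv();

    // U-SQL transformation: T-SQL-like syntax, executed as a parallel batch job
    @totals =
        SELECT Region,
               SUM(Duration) AS TotalDuration
        FROM @searchlog
        GROUP BY Region;

    // OUTPUT: write the result back to storage, formatted by an outputter
    OUTPUT @totals
    TO "/output/TotalDurationByRegion.csv"
    USING Outputters.Csv();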
Demo: Basic Job Structure
Pay per job
ADLA is charged per job; each successful execution incurs charges.
You pay per Analytics Unit (AU) per compute hour, prorated to the minute.
Let's take a look at the previous job that was run.
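As a rough illustration (the dollar figure is an assumption; check current Azure pricing): a job that ran with 10 AUs allocated for 15 minutes consumes 10 × 0.25 = 2.5 AU-hours, so at a hypothetical rate of $2 per AU-hour it would cost about $5, whether or not all 10 AUs were actually kept busy.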
Other job components
U-SQL tables: store data permanently when it is accessed often; they support partitioning, bucketing, and many other big data features used by other distributed processing environments (see the sketch below).
User-defined code: build custom processing tools to extend the out-of-the-box capabilities; all user-defined code is written in C# and deployed to the job via assemblies.
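A minimal sketch of a managed U-SQL table, reusing the hypothetical @searchlog rowset from the earlier example; the table name, columns, and distribution key are illustrative, and a PARTITIONED BY clause can be added for partitioned tables.

    // Create a managed table with a clustered index and hash distribution
    CREATE TABLE IF NOT EXISTS dbo.RegionTotals
    (
        Region string,
        TotalDuration long,
        INDEX cix_RegionTotals CLUSTERED (Region ASC) DISTRIBUTED BY HASH (Region)
    );

    // Load the table from a rowset so later jobs can read it without re-extracting files
    INSERT INTO dbo.RegionTotals
    SELECT Region,
           (long) SUM(Duration) AS TotalDuration
    FROM @searchlog
    GROUP BY Region;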
Demo: Working with Tables
Demo: Multiple Rowsets
External scripts
Execute R or Python scripts from a U-SQL job.
Embed the script inline or store it in a separate file.
Pass data to the script, then return data back to U-SQL and output it directly or use it in further transformations (see the sketch below).
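A minimal sketch of calling an inline R script, assuming the U-SQL R extensions (the ExtR assembly) are installed in the ADLA catalog; the rowset, columns, and script body are hypothetical.

    REFERENCE ASSEMBLY [ExtR];

    // Inline R script: ADLA exposes each partition to R as inputFromUSQL
    // and reads the result back from outputToUSQL
    DECLARE @rScript string = @"
    outputToUSQL <- data.frame(
        Region = inputFromUSQL$Region[1],
        MeanDuration = mean(inputFromUSQL$Duration))
    ";

    @scored =
        REDUCE @searchlog ON Region
        PRODUCE Region string, MeanDuration double
        USING new Extension.R.Reducer(command : @rScript, rReturnType : "dataframe");

    OUTPUT @scored
    TO "/output/MeanDurationByRegion.csv"
    USING Outputters.Csv();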
Demo: Call External R Script
Unstructured data
ADLA can also work with unstructured data: data that has no pre-defined schema, but still has structure.
Cognitive Services can be useful for making sense of unstructured data (see the sketch below).
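A minimal sketch of image tagging with the built-in cognitive extensions, assuming the U-SQL cognitive assemblies are registered in the catalog; the input path is hypothetical, and the exact columns produced can vary by extension version.

    REFERENCE ASSEMBLY ImageCommon;
    REFERENCE ASSEMBLY ImageTagging;

    // Read raw image bytes from storage
    @images =
        EXTRACT FileName string, ImgData byte[]
        FROM "/images/{FileName}.jpg"
        USING new Cognition.Vision.ImageExtractor();

    // Ask the vision tagger to describe each image
    @tags =
        PROCESS @images
        PRODUCE FileName, NumObjects int, Tags string
        READONLY FileName
        USING new Cognition.Vision.ImageTagger();

    OUTPUT @tags
    TO "/output/ImageTags.csv"
    USING Outputters.Csv();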
Demo: Process Images with Cognitive Services
Operationalize
Building a U-SQL job is only part of the process.
Azure Data Factory is the easiest way to schedule and implement U-SQL in production.
Operationalize
PowerShell can also be used to execute a U-SQL job.
When things go wrong
Job failures are bound to happen. Don't worry! There is a process:
1. Browse to Job Management in the Azure Data Lake Analytics portal.
2. Find your failed job and select it.
3. Review errors, inputs, and outputs to locate the root cause and remediate.
Optimizing performance
Performance optimization is a balance between job cost and total execution time.
Allocating more AUs may improve performance, but it may also greatly increase cost.
Controlling your U-SQL code is the most important step in optimizing performance. Ask yourself: is this the most efficient way to do this operation?
AU efficiency is the most important metric for understanding performance versus cost.
Optimizing performance: user-defined objects
Use UDOs sparingly; the optimizer cannot help at all with performance issues related to UDOs.
Consider replacing the logic in your UDO with SELECT … CROSS APPLY, which covers up to 90% of the cases where a UDO seems necessary (see the sketch below).
UDOs used as EXTRACTORs or OUTPUTTERs are usually fine, but avoid them for data transformation.
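For example, instead of writing a custom processor to split a delimited tag string into rows, CROSS APPLY with EXPLODE does the job; a minimal sketch, assuming a hypothetical rowset @tags with a FileName column and a semicolon-delimited Tags string column (like the one produced by the image tagger above):

    // Turn "dog;grass;outdoor" style strings into one row per tag
    // without a user-defined processor
    @tagRows =
        SELECT FileName,
               SingleTag
        FROM @tags
             CROSS APPLY
             EXPLODE (new SQL.ARRAY<string>(Tags.Split(';'))) AS T(SingleTag);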
Optimizing performance: final thoughts
Deeply understand the query lifecycle.
Monitor for data skew in your ADLA tables.
Use partitioning wisely.
Avoid UDOs.
Optimize for the right balance of cost and performance.
Good performance at small scale != good performance at large scale, so do full-scale testing and analysis too!
RECAP
Azure Data Lake Analytics
Designed for ETL/ELT on extremely large data.
Familiar to SQL and/or C# users.
Priced per job, with no charge when idle.
Linked to a Data Lake Store account.
Rowsets describe data stored in files.
Azure Data Lake Analytics
Load distributed tables for data that is referenced frequently or is extremely large.
Integrate with external languages: R or Python.
Cognitive Services are built in, at no extra charge!
Operationalize with Azure Data Factory or PowerShell.
Azure Data Lake Analytics
Performance is subjective, based on the balance between cost and execution time.
Learn the query lifecycle to truly understand performance characteristics.