Building Analytics At Scale With USQL and C#
Josh Fennessy, Principal, BlueGranite
[Architecture overview: Data → Intelligence → Action. Data sources (apps, web, mobile, sensors and devices) flow into information management (Data Factory, Data Catalog, Event Hubs), land in big data stores (Data Lake Store, SQL Data Warehouse), and are processed by machine learning and analytics services (Machine Learning, Data Lake Analytics, HDInsight with Hadoop and Spark, Stream Analytics). Intelligence services (Cognitive Services, Bot Framework, Cortana) deliver results to people and automated systems through apps, bots, and Power BI dashboards and visualizations.]
Azure Data Lake: Store and Analytics

Store
- Hyper-scale distributed storage
- Integrated with Azure Active Directory
- No file size or account size limits
- Compatible with WebHDFS
- Pay for what you use

Analytics
- Clusterless distributed computing platform
- Based on C# and USQL
- Build complex data processing jobs in Visual Studio
- Pay per job instead of per hour
Requirements

Required
- Azure Subscription
- Azure Data Lake Store Account

Recommended
- Visual Studio 2015/2017
- Azure Data Lake Tools
Creating an Account
Setting up Visual Studio
Now What?
- Process data for later analysis: the T in ELT/ETL
- Do actual analysis of data and save the results
- Give structure to unstructured or semi-structured data for later use
Basic Job Components
- ROWSET: a description of data stored in one or more files, read using an EXTRACTOR
- USQL: looks like T-SQL, smells like T-SQL, but used to do parallel batch processing of data
- OUTPUT: the results of the transformation, written back to storage using an OUTPUTTER to format the file
Demo: Basic Job Structure
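To make the three components concrete, here is a minimal sketch of a complete job (the path, schema, and names are hypothetical, not from the deck):

// ROWSET: describe the data in the file and read it with an extractor
@searchlog =
    EXTRACT UserId int,
            Query string,
            Duration int
    FROM "/input/searchlog.tsv"
    USING Extractors.Tsv();

// USQL: a T-SQL-like transformation, executed as a parallel batch
@result =
    SELECT Query,
           COUNT(*) AS QueryCount
    FROM @searchlog
    GROUP BY Query;

// OUTPUT: write the result back to storage with an outputter
OUTPUT @result
TO "/output/querycounts.csv"
USING Outputters.Csv();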
Pay Per Job
ADLA is charged per job:
- Each successful execution will incur charges
- Pay per Analytic Unit (AU) per compute hour, prorated to the minute (a worked example follows)
- Let's take a look at the previous job that was run
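A worked example of the billing math (the rate is hypothetical; check current Azure pricing): a job that runs for 30 minutes with 10 AUs allocated consumes 10 AUs × 0.5 hours = 5 AU-hours. At, say, $2 per AU-hour, that single execution costs about $10, whether or not the allocated AUs were actually busy.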
Other Job Components
- USQL Tables: store data permanently when it is accessed often; support partitioning, bucketing, and many other Big Data features used by other distributed processing environments
- User Defined Code: build custom processing tools to extend the out-of-the-box capabilities; all user defined code is written in C# and deployed to the job via assemblies (see the sketch below)
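As a sketch of how user defined code plugs in (all names here are hypothetical): with the Azure Data Lake Tools, C# placed in the script's code-behind file is compiled and shipped with the job, while shared code is registered as an assembly.

// Script.usql.cs (code-behind file next to the USQL script)
namespace Demo.Udfs
{
    public static class Formatters
    {
        // Normalize a raw name value before it is stored.
        public static string CleanName(string raw)
        {
            return string.IsNullOrWhiteSpace(raw) ? null : raw.Trim().ToUpperInvariant();
        }
    }
}

// Script.usql: call the helper inline like any other C# expression
@cleaned =
    SELECT Demo.Udfs.Formatters.CleanName(Name) AS Name
    FROM @rows;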
Demo: Working with Tables
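A minimal sketch of a managed USQL table (hypothetical schema; managed tables require a clustered index and a distribution scheme):

CREATE DATABASE IF NOT EXISTS SalesDb;

CREATE TABLE IF NOT EXISTS SalesDb.dbo.Orders
(
    OrderDate DateTime,
    Region string,
    Amount decimal,
    INDEX idx_orders CLUSTERED (Region ASC)
    DISTRIBUTED BY HASH (Region)
);

// Load the table from a rowset defined earlier in the script
INSERT INTO SalesDb.dbo.Orders
SELECT OrderDate, Region, Amount
FROM @orders;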
Demo: Multiple Rowsets
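A sketch of a job built from multiple rowsets (hypothetical files and columns): each @name is a rowset expression, and later expressions can combine earlier ones before a single OUTPUT.

@orders =
    EXTRACT OrderDate DateTime,
            Region string,
            Amount decimal
    FROM "/data/orders.csv"
    USING Extractors.Csv(skipFirstNRows : 1);

@regions =
    EXTRACT Region string,
            Manager string
    FROM "/data/regions.csv"
    USING Extractors.Csv();

// Join the two rowsets; note the C#-style == in the join condition
@summary =
    SELECT r.Manager,
           o.Region,
           SUM(o.Amount) AS TotalAmount
    FROM @orders AS o
         INNER JOIN @regions AS r
         ON o.Region == r.Region
    GROUP BY r.Manager, o.Region;

OUTPUT @summary
TO "/output/summary.csv"
USING Outputters.Csv(outputHeader : true);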
External Scripts
Execute R or Python scripts:
- Embed the script inline or store it in a separate file
- Pass data to the script
- Return data back to USQL and output it directly or use it in further transformations
Demo: Call an External R Script
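A sketch of the inline R pattern (assumes the account has the U-SQL Advanced Analytics extensions installed; columns and logic are hypothetical). The rowset arrives in the script as inputFromUSQL, and the data frame assigned to outputToUSQL comes back as a rowset:

REFERENCE ASSEMBLY [ExtR];

DECLARE @myRScript string = @"
outputToUSQL <- data.frame(MeanDuration = mean(inputFromUSQL$Duration))
";

@searchlog =
    EXTRACT Region string,
            Duration double
    FROM "/input/searchlog.tsv"
    USING Extractors.Tsv();

// Run the R script once per Region partition
@means =
    REDUCE @searchlog ON Region
    PRODUCE Region string,
            MeanDuration double
    READONLY Region
    USING new Extension.R.Reducer(command : @myRScript, rReturnType : "dataframe");

OUTPUT @means
TO "/output/means.csv"
USING Outputters.Csv();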
Unstructured Data
ADLA can also work with unstructured data:
- Data that has no defined structure – but it has structure
- Cognitive Services can be useful for making sense of unstructured data
Demo: Process Images with Cognitive Services
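A sketch of built-in image tagging (assumes the cognitive extensions are enabled for the account; the paths and the Tags string signature follow the 2017-era samples):

REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY ImageTagging;

// Read raw image bytes; {FileName} becomes a virtual column
@images =
    EXTRACT FileName string,
            ImgData byte[]
    FROM "/images/{FileName}.jpg"
    USING new Cognition.Vision.ImageExtractor();

// Tag the objects detected in each image
@tags =
    PROCESS @images
    PRODUCE FileName,
            NumObjects int,
            Tags string
    READONLY FileName
    USING new Cognition.Vision.ImageTagger();

OUTPUT @tags
TO "/output/imagetags.csv"
USING Outputters.Csv(outputHeader : true);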
Operationalize
Building a USQL job is only part of the process. Azure Data Factory is the easiest way to schedule and run USQL in production.
Operationalize
PowerShell can also be used to execute a USQL job.
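(As a pointer, not a recipe: the Azure PowerShell Data Lake Analytics module exposes a Submit-AdlJob cmdlet that takes an account name and a USQL script file; exact module names and parameters vary by Azure PowerShell version.)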
When Things Go Wrong
Job failures are bound to happen. Don't worry! There is a process:
1. Browse to Job Management in the Azure Data Lake Analytics portal
2. Find your failed job and select it
3. Review errors, inputs, and outputs to locate the root cause and remediate
Optimizing Performance
Performance optimization is a balance between job cost and total execution time:
- Allocating more AUs may improve performance, but may also greatly increase cost
- Controlling USQL code is the most important step in optimizing performance; ask yourself: is this the most efficient way to do this operation?
- AU efficiency is the most important metric for understanding performance versus cost (a worked example follows)
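A worked example with hypothetical numbers: a job allocated 10 AUs for a one-hour run is billed for 10 AU-hours; if its vertices only perform about 2 AU-hours of actual work, AU efficiency is 2 / 10 = 20%, and rerunning at 3 AUs would likely cost far less with only a modest increase in execution time.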
Optimizing Performance
User Defined Objects (UDOs):
- Use UDOs sparingly; the optimizer cannot help at all with performance issues related to UDOs
- Consider replacing UDO logic with SELECT…CROSS APPLY, which covers up to 90% of the cases for a UDO (see the sketch below)
- UDOs for EXTRACTORS or OUTPUTTERS are usually OK, but avoid them for data transformation
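A sketch of that replacement (hypothetical columns): instead of a custom processor that splits a delimited Tags column into rows, explode an array inline where the optimizer can see it.

@expanded =
    SELECT FileName,
           T.Tag
    FROM @tagged
         CROSS APPLY
             EXPLODE(new SQL.ARRAY<string>(Tags.Split(';'))) AS T(Tag);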
Optimizing Performance
Final thoughts:
- Deeply understand the query lifecycle
- Monitor for data skew in your ADLA tables
- Use partitioning wisely
- Avoid UDOs
- Optimize for the right balance of cost and performance
- Good performance at small scale != good performance at large scale; do full-scale testing and analysis too!
Recap
Azure Data Lake Analytics:
- Designed for ETL/ELT on extremely large data
- Familiar to SQL and/or C# users
- Priced per job; no charge when idle
- Linked to a Data Lake Store account
- Rowsets describe data stored in files
- Load distributed tables for data that is referenced frequently or is extremely large
- Integrate with external languages: R or Python
- Cognitive Services built in; no extra charge!
- Operationalize with Azure Data Factory or PowerShell
- Performance is subjective, based on the balance between cost and execution time
- Learn the query lifecycle to truly understand performance characteristics