Published by Amy Hensley. Modified over 6 years ago.
1
BRK3350: Run Python, R and .NET code at Data Lake scale with U-SQL in Azure Data Lake
Michael Rys, Principal Program Manager, Big Data Team, @MikeDoesBigData
6/17/2018 8:38 PM
© Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
2
Agenda
- Characteristics of Big Data Analytics Programming
- Scaling out existing code with U-SQL: Scaling out Cognitive Libraries
- Introduction to U-SQL’s Extensibility Framework
- Scaling out .NET with U-SQL: Custom Image processing
- Scaling out Python with U-SQL
- Scaling out R with U-SQL: Model generation, Model testing and scoring
3
Characteristics of Big Data Analytics
Some sample use cases:
- Digital Crime Unit – Analyze complex attack patterns to understand botnets and to predict and mitigate future attacks by analyzing log records with complex custom algorithms
- Image Processing – Large-scale image feature extraction and classification using custom code
- Shopping Recommendation – Complex pattern analysis and prediction over shopping records using proprietary algorithms
Characteristics of Big Data Analytics:
- Requires processing of any type of data
- Allows use of custom algorithms
- Scales to any size and is efficient
Bring your own coding expertise and existing code and scale it out?
4
Status Quo: SQL for Big Data
- Declarativity does scaling and parallelization for you
- Extensibility is bolted on and not “native”:
  - hard to work with anything other than structured data
  - difficult to extend with custom code: complex installations and frameworks
  - limited to one or two languages
5
Status Quo: Programming Languages for Big Data
- Extensibility through custom code is “native”
- Declarativity is bolted on and not “native”:
  - User often has to care about scale and performance
  - SQL is second class (embedded in strings), with only local optimizations
  - Often no code reuse or sharing across queries
6
Why U-SQL?
Declarativity and extensibility are equally native! Get the benefits of both!
U-SQL scales out your custom imperative code (written in .NET, Python, R, and more to come) in a declarative SQL-based framework.
[Diagram: .NET, Python and R on top of the U-SQL framework]
7
Declarative Framework
[Diagram: U-SQL example. A declarative framework of Extract, Process and Output steps, each of which runs user code (user extensions).]
8
The origins of U-SQL
- SCOPE – Microsoft’s internal Big Data language framework
  - SQL and C# integration model
  - Optimization and scaling model
  - Runs 100’000s of jobs daily
- Hive
  - Complex data types (Maps, Arrays)
  - Data format alignment for text files
- T-SQL/ANSI SQL
  - Many of the SQL capabilities (windowing functions, metadata model, etc.)
9
Scale Out Cognitive Library
[Image on slide, labeled with the tags: Racing, Parked, Car, Green, Outdoor]
Speaking points:
- U-SQL lets you wrap existing libraries and scale them out
- U-SQL uses its own framework to provide added value
10
Imaging

    REFERENCE ASSEMBLY ImageCommon;
    REFERENCE ASSEMBLY FaceSdk;
    REFERENCE ASSEMBLY ImageEmotion;
    REFERENCE ASSEMBLY ImageTagging;
    REFERENCE ASSEMBLY ImageOcr;

    @imgs =
        EXTRACT FileName string, ImgData byte[]
        USING new Cognition.Vision.ImageExtractor();

    // Extract the number of objects on each image and tag them
    @objects =
        PROCESS @imgs
        PRODUCE FileName,
                NumObjects int,
                Tags SqlMap<string, float?>
        READONLY FileName
        USING new Cognition.Vision.ImageTagger();

    OUTPUT @objects
    TO "/objects.tsv"
    USING Outputters.Tsv();
11
Text Analysis

    REFERENCE ASSEMBLY [TextSentiment];
    REFERENCE ASSEMBLY [TextKeyPhrase];

    @WarAndPeace =
        EXTRACT No int,
                Year string,
                Book string,
                Chapter string,
                Text string
        USING Extractors.Csv();

    @sentiment =
        PROCESS @WarAndPeace
        PRODUCE No, Year, Book, Chapter, Text,
                Sentiment string,
                Conf double
        USING new Cognition.Text.SentimentAnalyzer(true);

    OUTPUT @sentiment
    TO "/sentiment.tsv"
    USING Outputters.Tsv();
12
U-SQL/Cognitive Example
1. Identify objects in images (tags)
2. Identify faces and emotions in images
3. Join datasets – find out which tags are associated with happiness
[Diagram: Images feed Objects and Emotions, followed by filter, join and aggregate steps]

    REFERENCE ASSEMBLY ImageCommon;
    REFERENCE ASSEMBLY FaceSdk;
    REFERENCE ASSEMBLY ImageEmotion;
    REFERENCE ASSEMBLY ImageTagging;

    @objects =
        PROCESS MegaFaceView
        PRODUCE FileName, NumObjects int, Tags SqlMap<string,float?>
        READONLY FileName
        USING new Cognition.Vision.ImageTagger();

    @tags =
        SELECT FileName, T.Tag
        FROM @objects
        CROSS APPLY EXPLODE(Tags.Split) AS T(Tag, Conf)
        WHERE Tag.Contains("dog") OR Tag.Contains("cat");

    @emotion =
        SELECT ImageName, Details.Emotion
        FROM MegaFaceView
        CROSS APPLY new Cognition.Vision.EmotionApplier(imgCol:"image")
            AS Details(NumFaces int, FaceIndex int, RectX float, RectY float,
                       Width float, Height float, Emotion string, Confidence float);

    @correlation =
        SELECT T.FileName, Emotion, Tag
        FROM @emotion AS E
        INNER JOIN @tags AS T
        ON E.FileName == T.FileName;
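The explode, filter and join steps of this example can be tried out locally on a handful of rows. The following is a minimal sketch in plain Python, not U-SQL: the sample file names, tags, confidences and emotions are invented, and `explode_tags` and `correlate` are hypothetical helper names rather than part of any Cognitive library.

```python
# Local illustration of the explode -> filter -> join flow.
# All sample data and helper names are hypothetical.

def explode_tags(objects):
    """Turn one row per image (with a tag map) into one row per (image, tag)."""
    for row in objects:
        for tag, conf in row["Tags"].items():
            yield {"FileName": row["FileName"], "Tag": tag, "Conf": conf}


def correlate(objects, emotions, wanted=("dog", "cat")):
    """Keep tags containing a wanted word, then join them to emotions by file name."""
    tags = [t for t in explode_tags(objects)
            if any(w in t["Tag"] for w in wanted)]
    emotions_by_file = {}
    for e in emotions:
        emotions_by_file.setdefault(e["FileName"], []).append(e["Emotion"])
    return [(t["FileName"], emotion, t["Tag"])
            for t in tags
            for emotion in emotions_by_file.get(t["FileName"], [])]


objects = [
    {"FileName": "a.jpg", "Tags": {"dog": 0.9, "grass": 0.8}},
    {"FileName": "b.jpg", "Tags": {"car": 0.7}},
]
emotions = [
    {"FileName": "a.jpg", "Emotion": "happiness"},
    {"FileName": "b.jpg", "Emotion": "neutral"},
]
result = correlate(objects, emotions)
print(result)
```

In U-SQL the same logic is declarative, so the engine decides how to partition and parallelize the join; the sketch only shows what each step does to the rows.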
13
U-SQL extensibility
Extend U-SQL with C#/.NET, Python, R etc.
- Built-in operators, functions, aggregates
- C# expressions (in SELECT expressions)
- User-defined functions (UDFs)
- User-defined aggregates (UDAGGs)
- User-defined operators (UDOs)
14
What are UDOs?
Custom operator extensions in the language of your choice, scaled out by U-SQL.
- User-Defined Extractors: convert files into a rowset (see BRK3323 for more examples)
- User-Defined Outputters: convert a rowset into files (see BRK3323 for more examples)
- User-Defined Processors: take one row and produce one row; pass-through versus transforming
- User-Defined Appliers: take one row and produce 0 to n rows; used with OUTER/CROSS APPLY
- User-Defined Combiners: combine rowsets (like a user-defined join)
- User-Defined Reducers: take n rows and produce m rows (normally m<n)
Scaled out with explicit U-SQL syntax that takes a UDO instance (created as part of the execution): EXTRACT, OUTPUT, PROCESS, COMBINE, REDUCE, CROSS APPLY
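The row-count contracts of processors, appliers and reducers can be illustrated with a few generator functions. This is a sketch in plain Python under the assumption that a row is a simple dict; `run_processor`, `run_applier` and `run_reducer` are hypothetical names and do not correspond to the U-SQL UDO interfaces.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def run_processor(rows: Iterable[Row], fn: Callable[[Row], Row]) -> Iterator[Row]:
    """Processor contract: each input row yields exactly one output row."""
    for row in rows:
        yield fn(row)


def run_applier(rows: Iterable[Row], fn: Callable[[Row], Iterable[Row]]) -> Iterator[Row]:
    """Applier contract: each input row yields zero to n output rows."""
    for row in rows:
        yield from fn(row)


def run_reducer(rows: Iterable[Row], key: str,
                fn: Callable[[object, List[Row]], Iterable[Row]]) -> Iterator[Row]:
    """Reducer contract: the n rows of each key group yield m output rows."""
    groups: Dict[object, List[Row]] = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    for k, group in groups.items():
        yield from fn(k, group)


rows = [
    {"author": "a", "text": "x y"},
    {"author": "a", "text": "z"},
    {"author": "b", "text": ""},
]
processed = list(run_processor(rows, lambda r: {**r, "text": r["text"].upper()}))
applied = list(run_applier(
    rows, lambda r: ({"author": r["author"], "word": w} for w in r["text"].split())))
reduced = list(run_reducer(
    rows, "author", lambda k, g: [{"author": k, "count": len(g)}]))
```

Here the processor keeps the row count at three, the applier produces two, one and zero rows for the three inputs, and the reducer collapses three rows into one per author.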
15
Scaling out C# with U-SQL
[Table on slide: image metadata with the columns Copyright, Camera Make, Camera Model and Thumbnail; sample values shown are Michael, Canon, 70D, Samsung, S7]
16
How to specify .NET UDOs?
.NET API provided to build UDOs:
- Any .NET language is usable; however, only C# is first-class in tooling
- Use U-SQL specific .NET DLLs
Deploying UDOs:
- Compile DLL
- Upload DLL to ADLS
- Register with U-SQL script
- Visual Studio provides tool support
UDOs can:
- Invoke managed code
- Invoke native code deployed with UDO assemblies
- Invoke other language runtimes (e.g., Python, R)
- Be scaled out by the U-SQL execution framework
UDOs cannot:
- Communicate between different UDO invocations
- Call web services or reach outside the vertex boundary
17
How to specify UDOs? Code behind C#, Python, R
18
How to specify UDOs? C# Class Project for U-SQL
19
UDO model
- Marking UDOs
- Parameterizing UDOs
- UDO signature
- UDO-specific processing pattern
- Rowsets and their schemas in UDOs
- Setting results: by position or by name

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;
    using Microsoft.Analytics.Interfaces;

    [SqlUserDefinedExtractor]
    public class DriverExtractor : IExtractor
    {
        private byte[] _row_delim;
        private string _col_delim;
        private Encoding _encoding;

        // Define a non-default constructor since I want to pass in my own parameters
        public DriverExtractor(string row_delim = "\r\n", string col_delim = ",", Encoding encoding = null)
        {
            _encoding = encoding == null ? Encoding.UTF8 : encoding;
            _row_delim = _encoding.GetBytes(row_delim);
            _col_delim = col_delim;
        } // DriverExtractor

        // Converting text to target schema
        private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow)
        {
            var schema = outputrow.Schema;
            if (schema[i].Type == typeof(int))
            {
                var tmp = Convert.ToInt32(c);
                outputrow.Set(i, tmp);
            }
            ...
        } // OutputValueAtCol_I

        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
        {
            // Split the input into rows, then each row into columns
            foreach (var row in input.Split(_row_delim))
            {
                using (var s = new StreamReader(row, _encoding))
                {
                    int i = 0;
                    foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None))
                    {
                        OutputValueAtCol_I(c, i++, outputrow);
                    } // foreach
                } // using
                yield return outputrow.AsReadOnly();
            } // foreach
        } // Extract
    } // class DriverExtractor
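The processing pattern of this extractor (split the byte stream on the row delimiter, split each row on the column delimiter, convert each cell according to the target schema) can be sketched in Python for local experimentation. The `extract` function and its `schema` argument, a list of (column name, type) pairs, are hypothetical stand-ins for the row schema that the U-SQL runtime supplies; skipping the empty piece after a trailing delimiter is an addition that the C# code does not show.

```python
import io


def extract(stream, schema, row_delim=b"\r\n", col_delim=",", encoding="utf-8"):
    """Yield one dict per input row, converting each column by the target schema.

    `schema` is a list of (column name, Python type) pairs. It stands in for
    the output row schema that the U-SQL runtime hands to an extractor.
    """
    for raw_row in stream.read().split(row_delim):
        if not raw_row:
            continue  # skip the empty piece after a trailing row delimiter
        cells = raw_row.decode(encoding).split(col_delim)
        # Convert each cell with the type the schema declares for its position.
        yield {name: typ(cell) for (name, typ), cell in zip(schema, cells)}


data = io.BytesIO(b"1,Alice\r\n2,Bob\r\n")
rows = list(extract(data, [("id", int), ("name", str)]))
print(rows)
```

Driving the conversion from the schema, rather than hard-coding column types, mirrors how the C# extractor inspects `outputrow.Schema` so that one extractor can serve different EXTRACT statements.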
20
Managing Assemblies
Create assemblies:
- CREATE ASSEMBLY db.assembly FROM @path;
- CREATE ASSEMBLY db.assembly FROM byte[];
- Can also include additional resource files
Reference assemblies:
- REFERENCE ASSEMBLY db.assembly;
- Referencing .NET Framework assemblies:
  - Always accessible system namespaces: U-SQL specific (e.g., for SQL.MAP), and everything provided by system.dll, system.core.dll, system.data.dll, System.Runtime.Serialization.dll, mscorlib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)
  - Add all other .NET Framework assemblies with: REFERENCE SYSTEM ASSEMBLY [System.XML];
Enumerate assemblies:
- PowerShell command
- U-SQL Studio Server Explorer and Azure Portal
Drop assemblies:
- DROP ASSEMBLY db.assembly;
Visual Studio makes registration easy!
21
DEPLOY RESOURCE
Deploys additional files into each vertex’s local directory, where they can be accessed from custom code.
Syntax: 'DEPLOY' 'RESOURCE' file_path_URI { ',' file_path_URI }.
Example: DEPLOY RESOURCE "/config/configfile.xml", "package.zip";
Semantics:
- Files have to be in ADLS or WASB
- Files are deployed to the vertex and are accessible from any custom code
Limits:
- Single resource file limit is 400MB
- Overall limit for deployed resource files is 3GB
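The two limits can be checked before a job is submitted. The helper below is a hypothetical pre-flight check, not part of any Azure tooling; it assumes that MB and GB mean binary units (2^20 and 2^30 bytes), which the slide does not specify, and that the caller already knows each file's size.

```python
MB = 1024 * 1024          # assumption: binary megabytes
GB = 1024 * MB            # assumption: binary gigabytes
SINGLE_FILE_LIMIT = 400 * MB   # per-file limit quoted on the slide
TOTAL_LIMIT = 3 * GB           # overall limit quoted on the slide


def check_resources(sizes):
    """Return a list of problems for a {path: size_in_bytes} mapping."""
    problems = [f"{path} exceeds 400 MB"
                for path, size in sizes.items()
                if size > SINGLE_FILE_LIMIT]
    if sum(sizes.values()) > TOTAL_LIMIT:
        problems.append("total size exceeds 3 GB")
    return problems


print(check_resources({"/config/configfile.xml": 10 * MB, "package.zip": 500 * MB}))
```

A set of files can pass the per-file check and still fail the overall one, for example eight files of 399 MB each.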
22
U-SQL Vertex Code (.NET)
[Diagram: what is deployed to the vertices]
- Compilation output (in job folder), produced by Compilation and Optimization: C# managed dll, C++ native dll, Algebra
- Additional non-dll files and deployed resources, from the U-SQL Metadata Service (REFERENCE ASSEMBLY) and from ADLS (DEPLOY RESOURCE)
- System files (built-in runtimes, core DLLs, OS)
23
Scale Out Python With U-SQL
[Tables on slide]
Input:
Author | Tweet
MikeDoesBigData | @AzureDataLake: Come and see the #SQLSaturday sessions on #USQL
AzureDataLake | What are your recommendations for Python
Output:
Author | Mentions | Topics
MikeDoesBigData | | {#SQLSaturday, #USQL}
AzureDataLake | | {#SQLSaturday}
24
Python Extensions
Use U-SQL to create a massively distributed program, executing Python code across many nodes and using standard libraries such as numpy and pandas.
- Python script as string or in file
- Header contract: uses dataframe
- Returns dataframe result
- Scale out Python script over date partitions
- Documentation:

    REFERENCE ASSEMBLY [ExtPython];

    def get_mentions(tweet):
        return ';'.join( ( w[1:] for w in tweet.split() if ) )

    def usqlml_main(df):
        del df['time']
        del df['author']
        df['mentions'] = df.tweet.apply(get_mentions)
        del df['tweet']
        return df
    ";

    @t =
        SELECT * FROM
            (VALUES
                Hello
                Hello
            ) AS D( date, time, author, tweet );

    @m =
        REDUCE @t ON date
        PRODUCE date string, mentions string
        USING new
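The mention-extraction logic can be tried out locally before it is embedded in a U-SQL script. The sketch below uses plain dicts instead of a pandas data frame so that it runs without extra packages. The filter condition is elided in the slide text, so treating a mention as a word that starts with `@` is an assumption, and the sample tweets are invented.

```python
def get_mentions(tweet):
    # Assumption: a mention is any whitespace-separated word starting with "@".
    return ";".join(w[1:] for w in tweet.split() if w.startswith("@"))


def main(rows):
    """Mirror the slide's contract with plain dicts instead of a data frame:
    drop time, author and tweet, and add a mentions column."""
    return [{"date": r["date"], "mentions": get_mentions(r["tweet"])} for r in rows]


# Invented sample rows with the column names used on the slide.
rows = [
    {"date": "D1", "time": "T1", "author": "A1", "tweet": "@foo Hello World @bar"},
    {"date": "D2", "time": "T2", "author": "A2", "tweet": "Hello World"},
]
out = main(rows)
print(out)
```

The output keeps only the columns declared in the PRODUCE clause (`date` and `mentions`), which is why the slide's function deletes the other columns before returning.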
25
U-SQL Vertex Code (Python)
[Diagram: what is deployed to the vertices]
- Compilation output (in job folder), produced by Compilation and Optimization: C# managed dll, C++ native dll, Algebra
- Python engine and libs, from the U-SQL Metadata Service (REFERENCE ASSEMBLY ExtPython)
- Additional Python libs and script (Script.py, OtherLibs.zip), from ADLS (DEPLOY RESOURCE)
- System files (built-in runtimes, core DLLs, OS)
26
Python (and R) Extension Execution Paradigm
[Diagram: Reduce stage. Rowset partitions 1 to N are handled by reduce vertices 1 to N. In each vertex, the Python/R.Reducer maps the rowset to a data frame (type mapping), your Python/R code turns that data frame into a result data frame, and the result is mapped back.]
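The paradigm can be mimicked locally: partition the rows by key, convert each partition to a column-oriented frame, run the user code on it, and convert the result back to rows. This is a sketch in plain Python; `to_frame`, `from_frame`, `reduce_vertex`, `run_reduce_stage` and the `add_length` user function are hypothetical names, a dict of lists stands in for a data frame, and the partitions run sequentially here, whereas U-SQL runs them on separate vertices.

```python
from collections import defaultdict


def to_frame(rows):
    """Type mapping in: rows (list of dicts) -> frame (dict of column lists)."""
    frame = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            frame[name].append(value)
    return dict(frame)


def from_frame(frame):
    """Type mapping out: frame (dict of column lists) -> rows (list of dicts)."""
    names = list(frame)
    return [dict(zip(names, values)) for values in zip(*(frame[n] for n in names))]


def reduce_vertex(partition_rows, user_code):
    """One reduce vertex: map rows to a frame, run user code, map back."""
    return from_frame(user_code(to_frame(partition_rows)))


def run_reduce_stage(rows, key, user_code):
    """Partition rows by `key` and run one reduce vertex per partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    out = []
    for k in sorted(partitions):  # sorted only to make the output deterministic
        out.extend(reduce_vertex(partitions[k], user_code))
    return out


def add_length(frame):
    """Hypothetical user code: add a column with the length of each tweet."""
    frame["length"] = [len(t) for t in frame["tweet"]]
    return frame


rows = [
    {"date": "D2", "tweet": "abc"},
    {"date": "D1", "tweet": "hello"},
    {"date": "D1", "tweet": "hi"},
]
result = run_reduce_stage(rows, "date", add_length)
```

Because the user code only ever sees one partition, anything it computes (counts, models, aggregates) is per partition, which is why the choice of the ON column matters.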
27
Scale Out R With U-SQL
28
R running in U-SQL: Generate a linear model

    REFERENCE ASSEMBLY [ExtR];

    string = @"/usqlext/samples/R/iris.csv";
    string
    inputFromUSQL$Species = as.factor(inputFromUSQL$Species)
    lm.fit=lm(unclass(Species)~.-Par, data=inputFromUSQL)
    #do not return readonly columns and make sure that the column names are the same in usql and r scripts,
    outputToUSQL=data.frame(summary(lm.fit)$coefficients)
    colnames(outputToUSQL) <- c(""Estimate"", ""StdError"", ""tValue"", ""Pr"")
    outputToUSQL";

    @InputData =
        EXTRACT SepalLength double,
                SepalWidth double,
                PetalLength double,
                PetalWidth double,
                Species string
        USING Extractors.Csv();

    @ExtendedData =
        SELECT 0 AS Par, *
        FROM @InputData;

    @ModelCoefficients =
        REDUCE @ExtendedData ON Par
        PRODUCE Par, Estimate double, StdError double, tValue double, Pr double
        READONLY Par
        USING new rReturnType:"dataframe");

    USING Outputters.Tsv();

SampleScript_LM_Iris.R:

    lm.fit=lm(unclass(Species)~.-Par, data=inputFromUSQL)
    output2USQL=summary(lm.fit)
29
R running in U-SQL: Use a previously generated model

    REFERENCE ASSEMBLY master.ExtR;

    DEPLOY // Prediction Model
    string
    string
    int = 10;

    // R script to run
    load(""my_model_LM_Iris.rda"")
    outputToUSQL=data.frame(predict(lm.fit, inputFromUSQL, interval=""confidence""))";

    @InputData =
        EXTRACT SepalLength double,
                SepalWidth double,
                PetalLength double,
                PetalWidth double,
                Species string
        USING Extractors.Csv();

    // Randomly partition the data to apply the model in parallel
    @ExtendedData =
        SELECT AS Par, *
        FROM @InputData;

    // Predict Species
    @RScriptOutput =
        REDUCE @ExtendedData ON Par
        PRODUCE Par, fit double, lwr double, upr double
        READONLY Par
        USING new rReturnType:"dataframe", stringsAsFactors:false);

    USING Outputters.Csv(outputHeader:true);
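The idea of randomly partitioning rows so that a pre-trained model can be applied to each partition independently can be sketched in Python. Everything below is an assumption made for illustration: the `assign_partitions`, `predict` and `score` helpers are hypothetical, the coefficients are invented rather than taken from a real iris model, and a plain linear formula stands in for R's `predict`.

```python
import random


def assign_partitions(rows, partition_count, seed=0):
    """Attach a random partition number Par in [0, partition_count) to each row."""
    rng = random.Random(seed)
    return [dict(row, Par=rng.randrange(partition_count)) for row in rows]


def predict(row, coef):
    """Hypothetical linear model: intercept plus weighted sum of the features."""
    return coef["intercept"] + sum(
        weight * row[name] for name, weight in coef.items() if name != "intercept")


def score(rows, coef, partition_count=10):
    """Group rows by Par and score each group independently of the others."""
    groups = {}
    for row in assign_partitions(rows, partition_count):
        groups.setdefault(row["Par"], []).append(row)
    return [{"Par": par, "fit": predict(row, coef)}
            for par, group in groups.items()
            for row in group]


coef = {"intercept": 1.0, "SepalLength": 2.0}   # invented coefficients
rows = [{"SepalLength": 1.0}, {"SepalLength": 2.0}, {"SepalLength": 3.0}]
scored = score(rows, coef)
```

Scoring is independent per row, so the partition assignment changes only how the work is spread out, not the predictions; that is what makes a random partition key safe here, whereas model training on random partitions would produce a different model per partition.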
30
U-SQL Vertex Code (R)
[Diagram: what is deployed to the vertices]
- Compilation output (in job folder), produced by Compilation and Optimization: C# managed dll, C++ native dll, Algebra
- R engine and libs, from the U-SQL Metadata Service (REFERENCE ASSEMBLY ExtR)
- Additional R libs and script (Script.R, OtherLibs.zip), from ADLS (DEPLOY RESOURCE)
- System files (built-in runtimes, core DLLs, OS)
31
Summary
32
Scaling Out your Code and Language with U-SQL
Bring your code or write your custom operator extensions in:
- .NET (C#, F#, etc.)
- Python
- R
- …
Scaled out by U-SQL
33
Related Ignite Presentations
- BRK Understanding big data on Azure - structured, unstructured and streaming, Tuesday, September 26, 10:45 AM - 12:00 PM, OCCC W307
- BRK Data on Azure: The big picture, Wednesday, September 27, 12:30 PM - 1:45 PM, Hyatt Plaza International G
- BRK Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platform, and intelligent, Wednesday, September 27, 9:00 AM - 10:15 AM, OCCC W307
Stop by the booth and the Hands-On Labs!
34
Additional Resources
- Blogs and community page: (U-SQL Github)
- Documentation, presentations and articles: Getting Started with R in U-SQL
- ADL forums and feedback
35
Please evaluate this session
- From your PC or tablet: visit MyIgnite
- From your phone: download and use the Microsoft Ignite mobile app
Your input is important!