INTELLIGENT DATA SOLUTIONS COM

Intro to Data Factory
PASS Cloud Virtual Chapter, March 23, 2015
Steve Hughes, Architect
About the Presenter

Steve Hughes, Architect for Pragmatic Works
Blog:
LinkedIn: linkedin.com/in/dataonwheels
What is Data Factory?

- Cloud-based, highly scalable data movement and transformation tool
- Built on Azure for integrating all kinds of data
- Still in preview, so it is likely not yet feature complete (e.g., the Machine Learning activity was added in December 2014)
Data Factory Components

Linked Services
- SQL Server Database: PaaS, IaaS, on-premises
- Azure Storage: Blob, Table

Datasets
- Input/output defined in JSON and deployed with PowerShell

Pipelines
- Activities defined in JSON and deployed with PowerShell
- Copy, HDInsight, Azure Machine Learning
Current Activities Supported

- CopyActivity: copies data from a source to a sink (destination)
- HDInsightActivity: Pig, Hive, and MapReduce transformations
- MLBatchScoringActivity: can be used to score data with the ML Batch Scoring API
- StoredProcedureActivity: executes stored procedures in an Azure SQL Database
- Custom activities written in C# or .NET
Data for the Demo

- Movies.txt in Azure Blob Storage
- Movies table in an Azure SQL Database
Building a Data Factory Pipeline

1. Create the Data Factory
2. Create Linked Services
3. Create Input and Output Tables (Datasets)
4. Create the Pipeline
5. Set the Active Period for the Pipeline
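The five steps above map onto the preview-era Azure PowerShell cmdlets used later in this deck. A rough end-to-end sketch follows; the resource names and JSON file paths are illustrative, and it assumes the early-2015 Azure PowerShell module running in AzureResourceManager mode:

```powershell
# Data Factory cmdlets live in the Resource Manager mode of the 2015-era module
Switch-AzureMode AzureResourceManager

# 1. Create the Data Factory
New-AzureDataFactory -ResourceGroupName shughes-datafactory -Name shughes-datafactory -Location "West US"

# 2. Create Linked Services from JSON definitions (file name is illustrative)
New-AzureDataFactoryLinkedService -ResourceGroupName shughes-datafactory -DataFactoryName shughes-datafactory -File c:\data\JSON\StorageLinkedService.json

# 3. Create the input and output datasets (tables)
New-AzureDataFactoryTable -ResourceGroupName shughes-datafactory -DataFactoryName shughes-datafactory -File c:\data\JSON\MoviesFromBlob.json
New-AzureDataFactoryTable -ResourceGroupName shughes-datafactory -DataFactoryName shughes-datafactory -File c:\data\JSON\MoviesToSqlDb.json

# 4. Create the pipeline
New-AzureDataFactoryPipeline -ResourceGroupName shughes-datafactory -DataFactoryName shughes-datafactory -File c:\data\JSON\MoviesPipeline.json

# 5. Set the active period (start and end values are placeholders)
Set-AzureDataFactoryPipelineActivePeriod -ResourceGroupName shughes-datafactory -DataFactoryName shughes-datafactory -Name MoviesPipeline -StartDateTime <start> -EndDateTime <end>
```

Each of these steps is covered in detail on the following slides.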
Step 1 – Create a Data Factory in Windows Azure
Step 2 – Create Linked Services

1. Click the Linked Services tile
2. Add Data Stores
   - Add Blob Storage
   - Add SQL Database

Three data store types are supported:
- Azure Storage Account
- Azure SQL Database
- SQL Server

Data Gateways can also be used for on-premises SQL Server sources.
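The deck does not show the linked service JSON itself. In the preview-era schema, the two data stores used in this demo would be defined roughly as below; the type names follow the early Data Factory format, and all connection string values are placeholders:

```json
{
    "name": "Shughes Blob Storage",
    "properties": {
        "type": "AzureStorageLinkedService",
        "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
}
```

```json
{
    "name": "Media Library DB",
    "properties": {
        "type": "AzureSqlLinkedService",
        "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>;Password=<password>;Encrypt=True"
    }
}
```

These would be deployed with New-AzureDataFactoryLinkedService, analogous to the dataset deployment shown in Step 3.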
Step 3 – Create Datasets/Tables

A JSON file defines each dataset:
- Structure: column names and types (String, Int, Decimal, Guid, Boolean, Date), e.g. { "name": "ThisName", "type": "String" }
- Location: Azure Table, Azure Blob, or SQL Database
- Availability: the "cadence in which a slice of the table is produced"
Step 3 – Input JSON

```json
{
    "name": "MoviesFromBlob",
    "properties": {
        "structure": [
            { "name": "MovieTitle", "type": "String" },
            { "name": "Studio", "type": "String" },
            { "name": "YearReleased", "type": "Int" }
        ],
        "location": {
            "type": "AzureBlobLocation",
            "folderPath": "data-factory-files/Movies.csv",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            },
            "linkedServiceName": "Shughes Blob Storage"
        },
        "availability": {
            "frequency": "hour",
            "interval": 4
        }
    }
}
```

- name: the dataset name
- structure: defines the structure of the data in the file
- location: defines the location and file format information
- availability: sets the cadence to once every 4 hours
Step 3 – Output JSON

```json
{
    "name": "MoviesToSqlDb",
    "properties": {
        "structure": [
            { "name": "MovieName", "type": "String" },
            { "name": "Studio", "type": "String" },
            { "name": "YearReleased", "type": "Int" }
        ],
        "location": {
            "type": "AzureSQLTableLocation",
            "tableName": "Movies",
            "linkedServiceName": "Media Library DB"
        },
        "availability": {
            "frequency": "hour",
            "interval": 4
        }
    }
}
```

- name: the dataset name
- structure: defines the table structure; only the fields targeted are mapped
- location: defines the location and the table name
- availability: sets the cadence to once every 4 hours
Step 3 – Deploy Datasets

Deployment is done via PowerShell:

```powershell
PS C:\> New-AzureDataFactoryTable -ResourceGroupName shughes-datafactory -DataFactoryName shughes-datafactory -File c:\data\JSON\MoviesFromBlob.json
PS C:\> New-AzureDataFactoryTable -ResourceGroupName shughes-datafactory -DataFactoryName shughes-datafactory -File c:\data\JSON\MoviesToSqlDb.json
```
Step 4 – Pipeline JSON

```json
{
    "name": "MoviesPipeline",
    "properties": {
        "description": "Copy data from csv file in Azure storage to Azure SQL database table",
        "activities": [
            {
                "name": "CopyMoviesFromBlobToSqlDb",
                "description": "Add new movies to the Media Library",
                "type": "CopyActivity",
                "inputs": [ { "name": "MoviesFromBlob" } ],
                "outputs": [ { "name": "MoviesToSqlDb" } ],
                "transformation": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "SqlSink" }
                },
                "policy": {
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "style": "StartOfInterval",
                    "retry": 0,
                    "timeout": "01:00:00"
                }
            }
        ]
    }
}
```

- name: the pipeline name
- activities: each activity defines a type (CopyActivity), inputs, and outputs
- transformation: defines the source and sink for the CopyActivity
- policy: required for a SqlSink; concurrency must be set or deployment fails
Step 4 – Deploy Pipeline

```powershell
New-AzureDataFactoryPipeline -ResourceGroupName shughes-datafactory -DataFactoryName shughes-datafactory -File c:\data\JSON\MoviesPipeline.json
```
Step 4 – Deployed Pipeline
Step 4 – Pipeline Diagram
Step 5 – Set Active Period

```powershell
Set-AzureDataFactoryPipelineActivePeriod -ResourceGroupName shughes-datafactory -DataFactoryName shughes-datafactory -StartDateTime -EndDateTime -Name MoviesPipeline
```

This sets the duration during which data slices will be available to be processed. The frequency itself is set in the dataset parameters.
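The StartDateTime and EndDateTime values are omitted on the slide. An illustrative invocation (the dates here are hypothetical) might look like:

```powershell
Set-AzureDataFactoryPipelineActivePeriod -ResourceGroupName shughes-datafactory -DataFactoryName shughes-datafactory -Name MoviesPipeline -StartDateTime 2015-03-23T00:00:00Z -EndDateTime 2015-03-30T00:00:00Z
```

With the datasets' availability set to every 4 hours, a one-week active period like this would yield 42 four-hour slices (6 per day for 7 days).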
Exploring Blades in the Azure Portal

- Start with the Diagram
- Drill to various details in the pipeline
- Latest update: full online design capability
Looking at Monitoring

- Review monitoring information in the Azure portal
Common Use Cases

- Log import for analysis
Resources

- Azure Storage Explorer (codeplex.com)
- azure.microsoft.com – Data Factory
- azure.microsoft.com – Azure PowerShell
Products: Improve the quality, productivity, and performance of your SQL Server and BI solutions.
Services: Speed development through training and rapid development services from Pragmatic Works.
Foundation: Helping those who don't have the means to get into information technology achieve their dreams.

Questions? Contact me at m
Blog:
Pragmatic Works: