Published by James Sharp. Modified over 9 years ago.
4
"… data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing."
– Gartner, "The State of Data Warehousing in 2012"
Data sources
5
1. Increasing data volumes
2. Real-time data
3. Non-relational data; new data sources & types
4. Cloud-born data
6
[Diagram: An ETL tool (SSIS, etc.) extracts the original data, transforms it, and loads the transformed data into the EDW (SQL Server, Teradata, etc.). The EDW feeds Data Marts, Data Lake(s), BI Tools, Dashboards, and Apps.]
7
[Diagram: Same flow as the previous slide (ETL tool extracts, transforms, and loads into the EDW, which feeds Data Marts, Data Lake(s), BI Tools, Dashboards, and Apps), with an added path: Ingest (EL) of the original data.]
8
[Diagram: The ETL tool (SSIS, etc.) still extracts, transforms, and loads into the EDW (SQL Server, Teradata, etc.). In addition, original data and streaming data are ingested (EL) into scale-out storage & compute (HDFS, Blob Storage, etc.), where the Transform & Load step runs. Consumers are Data Marts, Data Lake(s), BI Tools, Dashboards, and Apps.]
10
[Diagram: Information Production in three stages: Connect & Collect, Transform & Enrich, Publish. Data Sources (Import From) are ingested into Data Hubs (Storage & Compute); data moves among Hubs and then on to data marts, etc. Consumers are Data Marts, Data Lake(s), BI Tools, Dashboards, and Apps.]
11
[Diagram: Information Production in three stages: Connect & Collect, Transform & Enrich, Publish. Data Connectors import from Data Sources into a Data Hub (Storage & Compute), import/export among Hubs, and export from a Hub to a data store. Consumers are Data Marts, Data Lake(s), BI Tools, Dashboards, and Apps. Cross-cutting services: Coordination & Scheduling, Monitoring & Mgmt, Data Lineage.]
14
Example Scenario: Customer Profiling (game usage analytics)
15
Log Files Snippet (10s of TBs per day in cloud storage)
2277,2013-06-01 02:26:54.3943450,111,164.234.187.32,24.84.225.233,true,8,1,2058
2277,2013-06-01 03:26:23.2240000,111,164.234.187.32,24.84.225.233,true,8,1,2058-2123-2009-2068-2166
2277,2013-06-01 04:22:39.4940000,111,164.234.187.32,24.84.225.233,true,8,1,
2277,2013-06-01 05:43:54.1240000,111,164.234.187.32,24.84.225.233,true,8,1,2058-225545-2309-2068-2166
2277,2013-06-01 06:11:23.9274300,111,164.234.187.32,24.84.225.233,true,8,1,223-2123-2009-4229-9936623
2277,2013-06-01 07:37:01.3962500,111,164.234.187.32,24.84.225.233,true,8,1,
2277,2013-06-01 08:12:03.1109790,111,164.234.187.32,24.84.225.233,true,8,1,234322-2123-2234234-12432-344323
…

User Table
UserID | FirstName | LastName | State | …
2277 | Pratik | Patel | Oregon
664432 | Dave | Nettleton | Washington
8853 | Mike | Flasko | California

New User Activity Per Week By Region
profileid | day | state | duration, rank, weaponsused, interactedwith (values run together in the source)
1148 | 6/2/2013 | Oregon | 2163315
1004 | 6/2/2013 | Missouri | 224062
292 | 6/1/2013 | Georgia | 20113715
1059 | 6/2/2013 | Oregon | 2710452
675 | 6/2/2013 | California | 6516432
1348 | 6/3/2013 | Nebraska | 219552
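As a rough illustration of what reading one of these log lines involves, here is a minimal Python sketch. The slide does not name the log columns, so the field names (`profile_id`, `timestamp`, `item_ids`) are assumptions made for this example only. The one detail worth noting is that the timestamps carry seven fractional digits, while Python's `%f` directive accepts at most six, so the sketch trims the last digit.

```python
from datetime import datetime

def parse_log_line(line):
    """Parse one comma-separated log line into a small dict.

    Field names are hypothetical; the slide does not label the columns.
    """
    fields = line.strip().split(",")
    profile_id = int(fields[0])
    # The source timestamps have 7 fractional digits; %f takes at most 6,
    # so keep only the first 26 characters (date, time, 6 digits).
    timestamp = datetime.strptime(fields[1][:26], "%Y-%m-%d %H:%M:%S.%f")
    # The last field is a dash-separated list of ids and may be empty.
    last = fields[-1].strip()
    item_ids = [int(part) for part in last.split("-")] if last else []
    return {"profile_id": profile_id, "timestamp": timestamp, "item_ids": item_ids}

if __name__ == "__main__":
    sample = ("2277,2013-06-01 03:26:23.2240000,111,164.234.187.32,"
              "24.84.225.233,true,8,1,2058-2123-2009-2068-2166")
    print(parse_log_line(sample))
```

At the volumes the slide mentions (10s of TBs per day), this parsing would run inside a distributed engine rather than a single Python process; the sketch only shows the per-line logic.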
16
Data Factory Walkthrough
17
New-AzureDataFactory -Name "HaloTelemetry" -Location "West-US"
New-AzureDataFactory -Name "GameTelemetry" -Location "West-US"
18
New-AzureDataFactoryLinkedService -Name "MyHDInsightCluster" -DataFactory "GameTelemetry" -File HDIResource.json
New-AzureDataFactoryLinkedService -Name "MyStorageAccount" -DataFactory "GameTelemetry" -File BlobResource.json
19
[Diagram: Azure Data Factory spans two data sources: the New User View in an on-premises SQL Server and 1000's of log files in Azure Blob Storage.]
20
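[Diagram: Inside Azure Data Factory, tables are defined as views of the sources: Game Usage (view of the 1000's of log files in Azure Blob Storage) and New Users (view of the New User View in the on-premises SQL Server). The target table is New User Activity.]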
21
[Diagram: A pipeline is added. Its first activity, Copy "NewUsers" to Blob Storage, copies New Users (view of the on-premises SQL Server New User View) into Cloud New Users. Game Usage remains a view of the 1000's of log files in Azure Blob Storage; the target table is New User Activity.]
22
[Diagram: The pipeline gains a second activity on HDInsight. Mask & Geo-Code takes Game Usage (view of the log files in Azure Blob Storage) and the Geo Dictionary and produces Geo Coded Game Usage. Copy NewUsers to Blob Storage still copies New Users from the on-premises SQL Server into Cloud New Users; the target table is New User Activity.]
23
[Diagram: The complete pipeline. Copy NewUsers to Blob Storage copies New Users (view of the on-premises SQL Server New User View) into Cloud New Users. Mask & Geo-Code runs on HDInsight, taking Game Usage (view of the 1000's of log files in Azure Blob Storage) and the Geo Dictionary and producing Geo Coded Game Usage. Join & Aggregate, also on HDInsight, combines Cloud New Users with Geo Coded Game Usage to produce New User Activity.]
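The two transformation steps named in the pipeline (Mask & Geo-Code, then Join & Aggregate) can be sketched in plain Python to show the data flow. This is only an illustrative stand-in: in the deck these steps run on HDInsight (the feature list mentions Hive, Pig, and C# activities), and every name here, including the prefix-based `GEO_DICTIONARY`, the row field names, and the masking rule, is an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical geo dictionary mapping an IP prefix to a state.
GEO_DICTIONARY = {"24.84.": "Oregon", "71.12.": "Washington"}

def mask_ip(ip):
    """Hide the last octet of an IPv4 address."""
    parts = ip.split(".")
    return ".".join(parts[:3] + ["xxx"])

def geo_code(ip, dictionary=GEO_DICTIONARY):
    """Look up the state for an IP by prefix; 'Unknown' if no prefix matches."""
    for prefix, state in dictionary.items():
        if ip.startswith(prefix):
            return state
    return "Unknown"

def mask_and_geo_code(usage_rows):
    """Step 1: add a state to each usage row and mask the raw IP."""
    return [
        {"profile_id": row["profile_id"],
         "ip": mask_ip(row["ip"]),
         "state": geo_code(row["ip"])}
        for row in usage_rows
    ]

def join_and_aggregate(geo_rows, new_user_ids):
    """Step 2: keep rows for new users only and count activity per state."""
    counts = defaultdict(int)
    for row in geo_rows:
        if row["profile_id"] in new_user_ids:
            counts[row["state"]] += 1
    return dict(counts)

if __name__ == "__main__":
    usage = [
        {"profile_id": 2277, "ip": "24.84.225.233"},
        {"profile_id": 8853, "ip": "71.12.1.1"},
        {"profile_id": 2277, "ip": "24.84.225.233"},
    ]
    geo_coded = mask_and_geo_code(usage)
    print(join_and_aggregate(geo_coded, new_user_ids={2277}))
```

The geo lookup happens before masking so that the state can still be derived from the full address; the masked value is what flows downstream.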
25
“GeoCoded Game Usage” Table:
26
Pipeline Definition:
27
# Deploy Table
New-AzureDataFactoryTable -DataFactory "GameTelemetry" -File NewUserActivityPerRegion.json
# Deploy Pipeline
New-AzureDataFactoryPipeline -DataFactory "GameTelemetry" -File NewUserTelemetryPipeline.json
# Start Pipeline
Set-AzureDataFactoryPipelineActivePeriod -Name "NewUserTelemetryPipeline" -DataFactory "GameTelemetry" -StartTime "10/29/2014 12:00:00"
28
"availability": { "frequency": "Day", "interval": 1 }
GameUsage: Hourly slices (12-1, 1-2, 2-3)
Activity: (e.g. Hive)
29
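[Diagram: A Hive Activity reads GameUsage (hourly slices: 12-1, 1-2, 2-3) and GeoCodeDictionary (daily slices: Monday, Tuesday, Wednesday) and produces Geo-Coded GameUsage (daily slices: Monday, Tuesday, Wednesday). The diagram also carries the generic labels Dataset2 and Dataset3.]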
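The slicing idea on these two slides (a dataset is produced in fixed windows set by a frequency and an interval, and a daily output depends on the hourly input slices it covers) can be sketched in Python. This is not how Data Factory computes slices internally; the function names and the `FREQUENCY_STEP` table are hypothetical, and only the `frequency`/`interval` vocabulary comes from the slide.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from a frequency name to a time step.
FREQUENCY_STEP = {"Hour": timedelta(hours=1), "Day": timedelta(days=1)}

def slices(start, end, frequency, interval=1):
    """Return the complete (slice_start, slice_end) windows within [start, end)."""
    step = FREQUENCY_STEP[frequency] * interval
    windows = []
    current = start
    while current + step <= end:
        windows.append((current, current + step))
        current += step
    return windows

def input_slices_for(output_slice, input_frequency, input_interval=1):
    """Return the input slices that an output slice depends on."""
    return slices(output_slice[0], output_slice[1], input_frequency, input_interval)

if __name__ == "__main__":
    # Three hourly slices, like the 12-1, 1-2, 2-3 boxes on the slide.
    for window in slices(datetime(2014, 10, 29, 12), datetime(2014, 10, 29, 15), "Hour"):
        print(window)
    # One daily output slice depends on 24 hourly input slices.
    day = slices(datetime(2014, 10, 27), datetime(2014, 10, 28), "Day")[0]
    print(len(input_slices_for(day, "Hour")))
```

Thinking in slices is what makes incremental rerun tractable: a failed or late window can be recomputed on its own, along with the downstream slices that depend on it.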
30
Is my data successfully getting produced? Is it produced on time? Am I alerted quickly of failures? What about troubleshooting information? Are there any policy warnings or errors?
34
Easily move data to my existing data marts for consumption by my existing BI tools: Azure DB, SQL Server on premises.
35
Automation & Management: Automation/Coordination Layer (Coordination, Scheduling, Management)
Data Transformation & Movement: Execution Layer (Data Storage & Processing)

ADF Pricing Per Month
 | Cloud, 0-100 activities | Cloud, 100+ activities | On Premises, 0-100 activities | On Premises, 100+ activities
Low Frequency | $0.60 | $0.48 | $1.50 | $1.20
High Frequency | $1.00 | $0.80 | $2.50 | $2.00

Resources Used to Execute Activities in a Pipeline: HDInsight (hrs), Compute/VM (hrs), Data Transfer (GB)
Note: public preview = 50% discount on the rates shown above
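To make the rate table concrete, here is a small Python sketch of a monthly estimate for the automation layer. It rests on two assumptions that the slide does not spell out: that the rates are per activity per month, and that the tiers are graduated (the first 100 activities at the "0-100" rate, any further activities at the "100+" rate). The function and constant names are hypothetical, and the resources used to execute activities (HDInsight, Compute/VM, Data Transfer) are not included.

```python
from decimal import Decimal

# Rates from the slide: (rate for the first 100 activities, rate beyond 100).
RATES = {
    ("low", "cloud"): (Decimal("0.60"), Decimal("0.48")),
    ("low", "on_premises"): (Decimal("1.50"), Decimal("1.20")),
    ("high", "cloud"): (Decimal("1.00"), Decimal("0.80")),
    ("high", "on_premises"): (Decimal("2.50"), Decimal("2.00")),
}
TIER_LIMIT = 100  # assumed boundary between the two activity tiers

def monthly_activity_cost(activities, frequency, location, preview=False):
    """Estimate the monthly charge for a number of activities.

    Assumes per-activity rates and graduated tiers; both are readings of
    the slide, not confirmed billing rules.
    """
    base_rate, volume_rate = RATES[(frequency, location)]
    in_first_tier = min(activities, TIER_LIMIT)
    beyond_first_tier = max(activities - TIER_LIMIT, 0)
    cost = in_first_tier * base_rate + beyond_first_tier * volume_rate
    if preview:
        cost = cost / 2  # the slide notes a 50% discount during public preview
    return cost

if __name__ == "__main__":
    print(monthly_activity_cost(10, "low", "cloud"))
    print(monthly_activity_cost(150, "high", "on_premises", preview=True))
```

`Decimal` is used instead of floats so that currency amounts stay exact.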
36
Coordination: Rich scheduling; Complex dependencies; Incremental rerun
Authoring: JSON & PowerShell/C#
Management: Lineage; Data production policies (late data, rerun, latency, etc.)
Hub: Azure Hub (HDInsight + Blob storage)
Activities: Hive, Pig, C#
Data Connectors: Blobs, Tables, Azure DB, On Prem SQL Server, MDS [internal]
38
Contact me: mike.flasko@microsoft.com
40
www.microsoft.com/learning http://microsoft.com/technet http://channel9.msdn.com/Events/TechEd http://developer.microsoft.com