1
The Modern Data Warehouse and Azure
Chris Seferlis – Sr Microsoft
2
Who am I?
Former CIO – 20+ years in IT
US Army Veteran
Wife and 2 Girls
Outdoors person (Run, Hike, Cycle, Ski, Fish)
Child Herder (aka Jr Soccer Coach)
Video Feed Coming soon: Youtube.com/bizdataviz
Questions:
3
Why We’re Here: Modern Data Warehouse
4
And… Streaming & Big Data
5
And… Advanced Analytics
6
Decisions… decisions…
7
Traditional Method
8
Traditional Data Architecture for BI Programs
Stages 1 to 7: Source; Extract & Load; Raw Data Store; Transform; Structure; Semantic Layer; Data Delivery
Across all stages: Audit, Balance & Control; Data Governance/Catalog/Dictionary
Sources: Source 1, Source 2, Source 3, Source 4, Source 5, Source 6; On-Prem SQL Server
Tools shown: API Call; SFTP; SSIS; Azure SQL DB; Azure SQL DB Views; SSAS; Power BI
9
Why Migrate to Azure?
Flexibility
Scale
Offset Limited Local IT Resources
Event Based File Ingestion
Unstructured Data
Large Data Volumes
Near Real Time Requirements
Data Science Capabilities
Development Time to Production
Support for large audiences
Mobile
Collaboration
10
Azure Data Architecture for BI Programs
Stages: Source; Data Pull or Push; Temporary Store; Raw Data Store; Transform & Load; Standardized Data Store; Enterprise; Multi-file Consolidation To Data Models; Dimensional model; Semantic Layer; Delivery; Analyze; Visualize; Data Science
Sources: Source 1, Source 2, Source 3, Cloud, On-Prem, 4, Source 5, Source 6, …, Source 7, Unstructured
Ingestion: SFTP; API Calls; Self-hosted Integration Runtime; Azure Logic App SFTP File Watcher; Azure Function ABS Watcher; Azure Data Factory Pipeline Ingestion “Orchestrators”
Stores: Azure Blob Storage; Azure Data Lake; Cosmos DB; Azure SQL DW; Azure SQL DB
Files: Permanent; Current File + Deltas (Separate New Update, Delete) Files; Generate Current Version File; Separate Delta
Processing: Databricks; Spark; PolyBase; t-SQL
Logging: Azure Logic App & SQL Server Procedure event logging to Cosmos DB or Azure SQL Database; PBI Logs
Delivery: Subject area OLAP Model; Logical Model + Metadata; AI, ML Tools; Dashboards; Workbooks; Reports; Power BI
11
Azure Data Architecture ~ Traditional Comparison
Stages: Source; Data Pull or Push; Temporary Store; Raw Data Store; Transform & Load; Standardized Data Store; Enterprise; Multi-file Consolidation To Data Models; Dimensional model; Semantic Layer; Delivery; Analyze; Visualize
Sources: Source 1, Source 2, Source 3, Cloud, On-Prem, 4, Source 5, Source 6, …, Source 7
Ingestion: SFTP; API Calls; Self-hosted Integration Runtime; Azure Logic App SFTP File Watcher; Azure Data Factory Pipeline Ingestion “Orchestrators”
Stores: Azure Blob Storage; Azure Data Lake; Cosmos DB; Azure SQL DW; Azure SQL DB
Processing: Databricks; Spark; PolyBase; t-SQL
Logging: Azure Logic App & SQL Server Procedure event logging to Cosmos DB or Azure SQL Database; PBI Logs
Delivery: Subject area OLAP Model; Dashboards; Workbooks; Reports; Power BI
Traditional: SSIS; SQL DB; Tabular; PBI
12
Azure Data Architecture ~ Value Add
Stages: Source; Data Pull or Push; Temporary Store; Raw Data Store; Transform & Load; Standardized Data Store; Enterprise; Multi-file Consolidation To Data Models; Dimensional model; Semantic Layer; Delivery; Analyze; Visualize; Data Science
Sources: Source 1, Source 2, Source 3, Cloud, On-Prem, 4, Source 5, Source 6, …, Source 7, Unstructured
Ingestion: SFTP; API Calls; Self-hosted Integration Runtime; Azure Logic App SFTP File Watcher; Azure Function ABS Watcher; Azure Data Factory Pipeline Ingestion “Orchestrators”
Stores: Azure Blob Storage; Azure Data Lake; Cosmos DB; Azure SQL DW; Azure SQL DB
Files: Permanent; Current File + Deltas (Separate New Update, Delete) Files; Generate Current Version File; Separate Delta
Processing: Databricks; Spark; PolyBase; t-SQL
Logging: Azure Logic App & SQL Server Procedure event logging to Cosmos DB or Azure SQL Database; PBI Logs
Delivery: Subject area OLAP Model; Logical Model + Metadata; AI, ML Tools; Dashboards; Workbooks; Reports; AAS; Power BI
Traditional: SSIS; SQL DB; Tabular; PBI
13
Do I need a data lake?
14
Azure Data Lake Storage Gen2
A “no-compromises” Data Lake: secure, performant, massively-scalable Data Lake storage that brings the cost and scale profile of object storage together with the performance and analytics feature set of data lake storage
SECURE: Support for fine-grained ACLs, protecting data at the file and folder level; multi-layered protection via at-rest Storage Service encryption and Azure Active Directory integration
MANAGEABLE: Automated Lifecycle Policy Management; Object Level tiering
FAST: Atomic file operations mean jobs complete faster; High throughput
SCALABLE: No limits on data store size; Global footprint (50 regions)
COST EFFECTIVE: Object store pricing levels; File system operations minimize transactions required for job completion
INTEGRATION READY: Optimized for Spark and Hadoop Analytic Engines; Tightly integrated with Azure end to end analytics solutions
15
Convergence of two Storage Services
Azure Blob Storage (General Purpose Object Storage): Global scale – All Azure regions; Full BCDR capabilities; Tiered - Hot/Cool/Archive; Cost Efficient; Large partner ecosystem
Azure Data Lake Store (Optimized for Big Data analytics): Built for Hadoop; Hierarchical namespace; ACLs, AAD and RBAC; Performance tuned for big data; Very high scale capacity and throughput
Azure Data Lake Storage Gen2: The best of Blobs and ADLS
16
Azure Data Lake Storage Gen2 architecture
Blob API
ADLS API
HIERARCHICAL FILE SYSTEM
Performance Enhancements
Scale and Cost Effectiveness
Security
Blob Storage
Object Tiering and Lifecycle Policy Management
AAD Integration, RBAC, Storage Account Security
HA/DR support through ZRS and RA-GRS
Data Governance and Management
17
Data processing
18
Azure Logic App ~ SFTP Listener
Push from Source
Stages: Source; Data Pull or Push; Temporary Data Store; Raw Data Store
Sources: SFTP; Source 5; Source 6
SFTP File Watchers (steps 2 to 3): Logic App SFTP File Watcher (SFTP File Added or Changed); Logic App Log Event (SFTP File Found); Azure Database Stored Proc (Log File Found); Logic App Log Event & Call ADF Pipeline; Azure Data Factory SFTP Orchestrator; Azure Blob Storage
Events: Event Hub Send Event
Tools shown: Azure Logic App; Azure Blob Storage; Azure Data Factory
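The watcher flow on this slide (file added or changed, log the event, call the ADF pipeline) can be illustrated in plain Python. This is only a sketch of the control flow, not Logic Apps code: the listing dictionaries, `log_event`, and `call_pipeline` are hypothetical stand-ins for the SFTP trigger, the event logging step, and the pipeline call.

```python
def detect_changes(previous, current):
    """Compare two listings of {file name: version marker}."""
    added = sorted(n for n in current if n not in previous)
    changed = sorted(
        n for n in current if n in previous and current[n] != previous[n]
    )
    return added, changed


def on_listing(previous, current, log_event, call_pipeline):
    """Log and hand off every added or changed file, then return the new state."""
    added, changed = detect_changes(previous, current)
    for name in added + changed:
        # "Log Event: File Found" (hypothetical event shape)
        log_event({"event": "File Found", "file": name})
        # "Call ADF Pipeline" (hypothetical callback)
        call_pipeline(name)
    return dict(current)
```

Each call returns the listing to compare against next time, so a file is handed to the pipeline once per change rather than on every poll.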
19
Azure Data Factory Orchestrator
Scheduled Pull from Source (traditional SSIS)
Stages: Source; Data Pull or Push; Temporary Store; Raw Data Store
Sources: Source 1, Source 2, Source 3, Cloud, On-Prem, 4, …
Connectivity: API Calls; Self-hosted Integration Runtime
ADF Orchestrator (steps 2 to 3): Azure Database Stored Proc (Get Start Date); Azure Data Factory Triggered Pipeline; Update Run Date; Copy Dataset; Azure Blob Storage
Logging: Logic App Log Event after every activity!!; Event Hub Send Event
Tools shown: Azure Blob Storage; Azure Data Factory
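The "Get Start Date", "Copy Dataset", "Update Run Date" steps read like a watermark-based incremental pull. A minimal sketch, assuming the run date is advanced only after a successful copy (the slide does not state the order) and using a hypothetical `modified` field and an in-memory dictionary in place of the Azure database table:

```python
from datetime import datetime, timezone


def incremental_pull(source_rows, watermark_store, source_name, now=None):
    """Copy rows modified after the stored run date, then advance the run date."""
    now = now or datetime.now(timezone.utc)
    # "Get Start Date": last successful run date, or None on the first run.
    start = watermark_store.get(source_name)
    # "Copy Dataset": keep only rows changed since the last run.
    copied = [
        row for row in source_rows
        if start is None or row["modified"] > start
    ]
    # "Update Run Date": reached only if the copy above did not raise.
    watermark_store[source_name] = now
    return copied
```

The first run copies everything because no run date exists yet; later runs copy only rows modified after the stored date.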
20
Azure Blob Storage ~ Preprocessing
Stages: Source; Temporary Data Store; Raw Data Store
No Deletes Needed: Source; Azure Data Factory; Azure Blob Storage finalContainer; Azure Function
Unapproved Departments Must Delete (steps 2b, 2c): Azure Blob Storage tempContainer; Databricks or ADFgen2 (Delete.py /or/ Pipeline); Cleansed CSV File
Azure Function ABS Watcher (steps 5a to e): If found; Logic App; Cosmos DB; Logic App; Data Factory; Data Lake Store; Databricks
Full or incremental load parameter passed to ADL Orchestrator
Also shown: Azure SQL Database; Azure Blob Storage; Cosmos DB
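The slide names a `Delete.py` step that removes rows for unapproved departments before the cleansed CSV reaches the final container. The script itself is not shown, so the following is only a guess at its shape: the `department` column name and the in-memory CSV text are assumptions.

```python
import csv
import io


def drop_unapproved(csv_text, unapproved, column="department"):
    """Return (cleansed CSV text, number of rows removed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    removed = 0
    for row in reader:
        if row[column] in unapproved:
            # Row belongs to an unapproved department: leave it out.
            removed += 1
            continue
        writer.writerow(row)
    return out.getvalue(), removed
```

Returning the removed-row count alongside the cleansed text gives the logging step something concrete to record.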
21
Azure Data Factory Orchestrator
Scheduled Pull from Source
Stages: Source; Temporary Data Store; Raw Data Store
No Preprocessing Needed: Source; Azure Data Factory; Azure Blob Storage finalContainer; Azure Function
Azure Function ABS Watcher (steps 5a to e): If found; Logic App; Cosmos DB; Logic App; Data Factory; Data Lake Store; Databricks
Full or incremental load parameter passed to ADL Orchestrator
Also shown: Azure SQL Database; Azure Blob Storage; Cosmos DB
22
Wash, Rinse, Repeat…
Stages: Temporary Data Store; Raw Data Store
Steps 1 to 4: Some ingestion method; Azure Blob Storage finalContainer; Azure Function
Azure Function ABS Watcher (steps 5a to e): If found; Logic App; Cosmos DB; Logic App; Data Factory; Data Lake Store; Databricks
Full or incremental load parameter passed to ADL Orchestrator
Also shown: Azure Blob Storage; Cosmos DB
23
Azure Data Lake Ingestion
For all Sources
Stages: Temporary Data Store; Raw Data Store (Generate Current Version File + Separate Delta Files); Transform & Load; Standardized Data Store (Current File + Deltas (Separate New Update, Delete) Files)
ABS File Watcher on the Root Container (steps 3 to 6): Azure Blob; Azure Function (ABS File Added or Changed); Logic App Log Event (ABS File Found); Logic App Log Event & Call ADF Pipeline; Azure Data Factory ADL Orchestrator; Azure Data Lake Store
Events: Event Hub Send Event
Tools shown: Azure Function ABS Watcher; Azure Blob Storage; Databricks; Azure Data Lake; Azure Data Factory
24
Decision Point… ADF vs SPs vs Databricks
ADF: Visual Designer; Great Orchestration; Data Flows for Transformation
SPs: Easy Lift and Shift; Lots of Resources; Standard SQL Code
Databricks: Granular control; Spark Engine; Flexible File Capabilities
25
Azure Data Factory Orchestrator
One Orchestrator Pipeline For all Sources
Source: Azure Blob
ADL Orchestrator Pipeline: Ingestion Pipeline; AsIs Pipeline
PySpark: Create row-level checksum; Create delta files; Create AsIs Files
Logging: All ADF Metadata Logging; Logic App Log Event (Success or Failure); Event Hub Send Event
Azure Data Lake Store: Separate New, Changed & Deleted Files; Single “AsIs” Current File
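The row-level checksum and delta-file steps can be sketched without Spark. The deck runs this in PySpark; the version below uses plain Python and `hashlib` so it runs anywhere, and the `id` key column and the output dictionary layout are assumptions made for illustration.

```python
import hashlib


def row_checksum(row, key):
    """Hash every non-key column so any value change alters the checksum."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row) if k != key)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def split_deltas(previous, incoming, key="id"):
    """Split an incoming file into new, changed, and deleted rows plus the AsIs file."""
    prev = {r[key]: row_checksum(r, key) for r in previous}
    curr = {r[key]: r for r in incoming}
    new = [r for k, r in curr.items() if k not in prev]
    changed = [
        r for k, r in curr.items()
        if k in prev and row_checksum(r, key) != prev[k]
    ]
    deleted = [r for r in previous if r[key] not in curr]
    # The "AsIs" file is simply the current full version of the data.
    return {"new": new, "changed": changed, "deleted": deleted,
            "as_is": list(curr.values())}
```

Comparing checksums instead of whole rows keeps the comparison cheap when rows are wide, which is presumably why the slide computes one per row.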
26
Decision Point… ASDB vs ASDW
ASDB: Symmetric Multi-Processing; Transactional Data; Database < 1TB
ASDW: Massively Parallel Processing; Analytical Data; Database > 1TB
27
Azure Data Warehouse Ingestion
For all Sources
Stages: Standardized Data Store (Current File + Deltas (Separate New Update, Delete) Files); Transform & Load; Enterprise Data Store (Multi-file Consolidation To Data Models)
Enterprise Data Store schemas: 3NF Schema (Subject area specific integrated Data Hub With historical tracking); OLAP Schema
Steps 6 to 9: Azure Data Lake Store; Azure Data Factory Orchestrator (Execute series of Stored Procedures); Azure SQL Data Warehouse External Tables; Azure SQL Data Warehouse 3NF Tables; Azure SQL Data Warehouse Logging Tables; Azure SQL DB or ADW
Events: Event Hub Send Event
Tools shown: Azure Data Lake; PolyBase; t-SQL and/or Azure Data Factory
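The "Execute series of Stored Procedures" step with logging tables can be pictured as an ordered runner that records each outcome and stops at the first failure. A rough sketch in Python: the step names and the list used as a log are hypothetical, and stopping on failure is an assumption, since the slide does not say how errors are handled.

```python
def run_steps(steps, log):
    """Run (name, callable) pairs in order; log each result; stop on first failure."""
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            log.append({"step": name, "status": "Failure", "error": str(exc)})
            return False
        log.append({"step": name, "status": "Success"})
    return True


if __name__ == "__main__":
    log = []
    ok = run_steps(
        [
            ("load_external_tables", lambda: None),  # hypothetical procedure
            ("load_3nf_tables", lambda: None),       # hypothetical procedure
            ("load_olap_schema", lambda: None),      # hypothetical procedure
        ],
        log,
    )
    print(ok, log)
```

Halting on the first failure keeps later loads from running against tables that were only partially populated.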
28
Is that all?
29
What about…
Data Quality?
Master Data Management?
Data Catalog?
Data Glossary?
30
Cloud Tools
Tool: Purpose
1 Azure Logic Apps: SFTP "watcher"; Event logging; Blob storage and data lake delete methodologies; Notifications; Automatic s; Cosmos DB document upload and deletions
2 Azure Function: Azure Blob Storage "listener"
3 Azure Event Hub: event handling
4 Azure Blob Storage: temporary work space
5 Azure Data Factory: Process flow orchestrators; Data copy; QA methodologies
31
Cloud Tools (continued)
Tool: Purpose
6 Databricks: Data processing and write to Azure Data Lake; Other pre-processing data requirements
7 Azure Data Lake: Delta files -- change data capture at the file level; Current “AsIs” files; Data science self-service; Power BI self-service
8 Cosmos DB SQL API: Logging ELT metadata
9 Azure Key Vault: Supports Dev/QA/Prod Migration
32
Cloud Tools (continued)
Tool: Purpose
10 Azure SQL Database: ELT metadata
11 Azure SQL Data Warehouse: Both Inmon and Kimball data stores (loosely speaking)
12 Azure Analysis Services: Tabular semantic layer
13 Power BI: Reporting and self-service
14 Azure Data Catalog: Data catalog for reports, sources, etc
15 Master Data Management
16 Data Quality Tool
33
Development Tools
Tool: Purpose
1 Visual Studio Python project: Auto generate the file-level metadata for complete file ingestion to Azure Data Lake
2 Visual Studio Azure Data Warehouse project: Team Foundation Server source code control for Azure Data Warehouses
3 Visual Studio Logic App Project: Team Foundation Server or Git source code control for Azure Logic Apps
4 Visual Studio Database Project: Team Foundation Server or Git source code control for Azure SQL Databases
5 GitHub or Azure DevOps: Source code control for Azure Data Factory and Databricks