Data Wrangling for ETL enthusiasts

Presentation transcript:

Data Wrangling for ETL enthusiasts. Mohamed Kabiruddin, Cloud Solutions Architect (Data) @ Microsoft (mdkabir)

ETL Process

Landing → Staging → Dimensional Model

Did I just spend 10 hours perfecting that lookup? And still apply indexes to gain performance?

CONSTRUCT THAT POWERFUL SURROGATE KEY
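As a quick illustration (a hedged sketch, not taken from the deck; the staging rows and dim_customer naming are hypothetical), one common way to construct a surrogate key when loading a dimension in Spark is a row_number over the incoming rows, offset by the key's current maximum:

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Incoming (hypothetical) customer rows from the staging area
staging = spark.createDataFrame(
    [("C001", "Alice"), ("C002", "Bob")],
    ["customer_code", "customer_name"])

# Highest surrogate key already present in the dimension (0 if the table is empty);
# in practice this would be read from something like spark.table("dim_customer")
max_key = 0

# Assign new surrogate keys on top of the existing maximum
w = Window.orderBy("customer_code")
new_dim_rows = staging.withColumn("customer_sk", F.row_number().over(w) + F.lit(max_key))
new_dim_rows.show()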

ULTIMATE GOAL: KEEP THE DATA WAREHOUSE 1. UPDATED 2. RELEVANT 3. OPERATIONAL

DATA STAGES IN WRANGLING: RAW, REFINED, PRODUCTION

Data Lake Design Considerations
Data Lake Zones:
- Transient Landing Zone: temporary storage of data to meet regulatory and quality control requirements. Limited access. May not be required, depending on requirements.
- Raw Zone: original source data, ready for consumption. Metadata publicly available, but access to the data still limited.
- Trusted Zone: standardized and enriched datasets ready for consumption by those with appropriate role-based access. Metadata available to all.
- Curated/Refined Zone: data transformed from the Trusted Zone to meet specific business requirements.
- Sandbox Zone: playground for data scientists for ad hoc exploratory use cases.
Data Governance Considerations: security and compliance, access control, encryption, row-level security, metadata management, data quality, lifecycle management.
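To make the zones concrete, here is a minimal sketch (assuming a Databricks notebook where dbutils is available; the mount point and folder names are hypothetical) of promoting a day's files from the transient landing zone to the raw zone:

# Hypothetical zone layout inside one data lake account
landing = "/mnt/datalake/landing/sales/2019-05-18/"
raw = "/mnt/datalake/raw/sales/2019-05-18/"

# Inspect what arrived in the transient landing zone
for f in dbutils.fs.ls(landing):
    print(f.path, f.size)

# Promote the files to the raw zone once quality checks pass
dbutils.fs.mv(landing, raw, recurse=True)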

Are data wrangling and ETL the same then? Data wrangling is the process of transforming or preparing data for analysis. Consider ETL to be one type of data wrangling, specifically a type managed and overseen by an organization's shared services or IT organization. Data wrangling can also be handled by business users in desktop tools like Excel, or by data scientists in coding languages like Python or R.

TOOLSET FROM MICROSOFT: Power BI, Excel, SSIS, T-SQL, U-SQL, PolyBase, Azure Data Explorer, Azure Data Factory, Azure Stream Analytics, Azure HDInsight (R & Python), Azure Databricks (R, Python, and Spark SQL)

Modern data warehousing: canonical operations. A: load and ingest (transfer and store). B: process (process and clean). C: serve (serve and analyze).

Modern data warehousing pattern in Azure: orchestration by Azure Data Factory, data processing with Azure Databricks.
Logs, files, and media (unstructured): Azure Data Factory loads flat files into the data lake (Azure Storage / Data Lake Store, the ingest storage) on a schedule; Azure Databricks reads the files using DBFS, processes them, and loads the processed data into tables optimized for analytics in Azure SQL DW (the serving storage).
Business and custom apps (structured): applications manage their transactional data directly in SQL DB (the transactional storage); Azure Data Factory extracts and transforms the relational data and loads it into SQL DW tables.
Dashboards are served from Azure SQL DW.
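The Databricks leg of this pattern can be sketched as follows (a hedged example: spark is assumed to be the notebook's session, the paths and table names are placeholders, and the SQL DW connector options shown are the commonly documented ones rather than anything taken from the deck):

from pyspark.sql import functions as F

# Read the files that Azure Data Factory landed in the data lake (via a DBFS mount)
logs = spark.read.json("/mnt/ingest/app-logs/2019/05/")

# Process and clean: shape the data for analytics
daily = (logs
         .withColumn("event_date", F.to_date("event_timestamp"))
         .groupBy("event_date", "event_type")
         .count())

# Load the processed data into an Azure SQL DW table optimized for analytics
(daily.write
      .format("com.databricks.spark.sqldw")
      .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<dw>")
      .option("tempDir", "wasbs://tempdata@<storageaccount>.blob.core.windows.net/tmp")
      .option("forwardSparkAzureStorageCredentials", "true")
      .option("dbTable", "dbo.DailyEvents")
      .mode("append")
      .save())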

Data Factory: data flows

Wrangling data flows: a Wrangling Data Flow translates the M script generated by the Power Query Online mashup editor into Spark code for cloud-scale execution, and provides a best-in-class monitoring experience.
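For intuition only, a simple mashup authored in the Power Query editor (filter rows, rename a column) corresponds conceptually to Spark operations like the ones below; this is an illustrative sketch with made-up column names, not the code the service actually generates:

# Hypothetical input already available in the data lake
orders = spark.read.parquet("/mnt/raw/orders/")

# "Filtered rows" step: keep only completed orders
completed = orders.filter(orders["Status"] == "Completed")

# "Renamed columns" step: rename OrderTotal to Amount
wrangled = completed.withColumnRenamed("OrderTotal", "Amount")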

Mapping Data Flow: no-code data transformation at scale. Data cleansing, transformation, aggregation, conversion, etc. Cloud scale via Spark execution. Easily build resilient data flows.

Azure Databricks: a fast, easy, and collaborative Apache Spark based analytics platform optimized for Azure.
Best of Databricks: designed in collaboration with the founders of Apache Spark; one-click setup and streamlined workflows; an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
Best of Microsoft: native integration with Azure services (Power BI, SQL DW, Cosmos DB, ADLS, Azure Storage, Azure Data Factory, Azure AD, Event Hub, IoT Hub, HDInsight Kafka, SQL DB); enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs).

Azure Databricks Notebooks
Notebooks are a popular way to develop and run Spark applications. They are not only for authoring Spark applications; they can be run/executed directly on clusters: press Shift+Enter, click the run icon at the top right of a cell, or submit the notebook via a job. Fine-grained permissions support means notebooks can be securely shared with colleagues for collaboration. Notebooks are well suited for prototyping, rapid development, exploration, discovery, and iterative development.
With Azure Databricks notebooks you have a default language, but you can mix multiple languages in the same notebook:
- %python: execute Python code in a notebook (even if that notebook's default language is not Python)
- %sql: execute SQL code in a notebook (even if that notebook's default language is not SQL)
- %r: execute R code in a notebook (even if that notebook's default language is not R)
- %scala: execute Scala code in a notebook (even if that notebook's default language is not Scala)
- %sh: execute shell code in your notebook
- %fs: use Databricks Utilities (dbutils) filesystem commands
- %md: include rendered Markdown
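As a small sketch of mixing languages (assuming a Databricks notebook whose default language is Python, with spark predefined; the file path and view name are hypothetical), a Python cell can register a view that a later %sql cell then queries:

# Python cell (the notebook's default language)
df = spark.read.csv("/mnt/raw/sales.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("sales_raw")   # make the DataFrame visible to SQL cells

# A separate cell can switch language with a magic command, for example:
# %sql
# SELECT Region, SUM(Amount) AS Total
# FROM sales_raw
# GROUP BY Region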

Table Operations. Azure Databricks tables support the following operations:
- Listing databases and tables
- Viewing table details, including the schema and sample data
- Reading from tables
- Updating tables: the table schema is immutable, but a user can update table data by changing the underlying files
- Deleting tables: a user can delete tables either through the UI or programmatically
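For example (a minimal sketch assuming a Databricks notebook where spark is already defined and a hypothetical table default.sales exists), these operations map to straightforward Spark calls:

# List databases and tables
spark.sql("SHOW DATABASES").show()
spark.sql("SHOW TABLES IN default").show()

# View table details, including its schema and sample data
spark.sql("DESCRIBE TABLE default.sales").show(truncate=False)
spark.table("default.sales").show(5)

# Read from a table into a DataFrame
sales_df = spark.table("default.sales")

# Updating data means changing the underlying files, e.g. rewriting a cleaned copy
sales_df.filter("amount > 0").write.mode("overwrite").saveAsTable("default.sales_clean")

# Delete a table programmatically
spark.sql("DROP TABLE IF EXISTS default.sales_clean")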

Bronze, Silver, Gold
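A minimal sketch of how data might move through these layers in a Databricks notebook (spark is assumed to be predefined; the paths and column names below are hypothetical, not from the presentation):

from pyspark.sql import functions as F

# Bronze: ingest the raw files as-is
bronze = spark.read.json("/mnt/lake/bronze/orders/")

# Silver: clean and conform
silver = (bronze
          .dropDuplicates(["order_id"])
          .withColumn("order_date", F.to_date("order_date"))
          .filter(F.col("amount").isNotNull()))
silver.write.mode("overwrite").parquet("/mnt/lake/silver/orders/")

# Gold: aggregate into a table ready for analytics and serving
gold = silver.groupBy("order_date").agg(F.sum("amount").alias("daily_revenue"))
gold.write.mode("overwrite").saveAsTable("gold_daily_revenue")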

Your feedback is appreciated