Architecture for Real-Time ETL


Architecture for Real-Time ETL Nashville BI User Group 3 October 2017

Jon Boulineau
Delivery Lead, HCA
I lead data integration and data warehouse teams for HCA. My current project is the implementation of a streaming ETL pipeline using Hadoop ecosystem tools and SQL Server.
Nashville PASS Leader
jonboulineau @jboulineau jboulineau@gmail.com Rivulet.io
ETL Architecture for Real-Time BI


Applications
- Monolithic
- Localized RDBMS persistency
- Mutable databases with spotty audit logs
- Vertically scaled

ETL / Staging / Data Warehouse
- Significant overhead to make up for mutability of application databases (e.g. delta detection)
- Forced into (shrinking) batch windows due to overhead
- ‘Lossy’ snapshots
- Inaccurate in processing time, inaccurate in event time
- Brittle: failures tend to be catastrophic, partition intolerant, low availability
- Fault intolerant
- Expensive / difficult to scale
- Many points of failure
- How well is your ETL logic managed?
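To make the delta-detection overhead and the ‘lossy’ snapshot problem concrete, here is a minimal Python sketch. It assumes a hypothetical source table represented as a dict keyed by primary key; `detect_delta` and the sample rows are illustrative names and data, not taken from the talk.

```python
def detect_delta(previous, current):
    """Compare two full snapshots (dicts keyed by primary key).

    Every row in both snapshots is touched, which is the overhead
    that pushes this work into a batch window.
    """
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes


# Hypothetical nightly snapshots of a mutable application table.
monday = {1: "admitted", 2: "admitted"}

# During Tuesday, row 1 went admitted -> transferred -> discharged,
# and row 3 was inserted and then deleted before the next batch ran.
tuesday = {1: "discharged", 2: "admitted"}

inserts, updates, deletes = detect_delta(monday, tuesday)
print(inserts, updates, deletes)
```

Only the final state of row 1 is detected. The intermediate "transferred" state and the short-lived row 3 never reach the warehouse, which is one way a snapshot can be lossy.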

ETL and Data Marts
- Forced into (shrinking) batch windows due to overhead
- ‘Lossy’ snapshots
- Inaccurate in processing time, inaccurate in event time
- Brittle: failures tend to be catastrophic, partition intolerant, low availability
- Fault intolerant
- Expensive / difficult to scale
- Many points of failure
- Data accuracy skew due to divergent logic

Change Drivers
Batch Inadequacy: Catastrophic Failure, ‘Lossy’, Time-based inaccuracies, Scale, Fault Intolerance, Complexity (Tool / Pattern Specific)

Change Drivers
BI Use Cases: Hybrid Applications, Machine Learning, Industry Specific Competition
Batch Inadequacy: Catastrophic Failure, ‘Lossy’, Time-based inaccuracies, Scale, Fault Intolerance, Complexity (Tool / Pattern Specific)

Change Drivers
Application Architecture: Polyglot Persistence, Event-Driven Architecture, Distributed Microservices
BI Use Cases: Hybrid Applications, Machine Learning, Industry Specific Competition
Batch Inadequacy: Catastrophic Failure, ‘Lossy’, Time-based inaccuracies, Scale, Fault Intolerance, Complexity (Tool / Pattern Specific)
Real-time ETL is as much about overcoming limitations of batch processing and adapting to changes in application architecture as it is about enabling BI use cases.

Patterns for Real-Time ETL

Application Architecture: Polyglot Persistence, Event-Driven Architecture, Distributed Microservices
BI Use Cases: Hybrid Applications, Machine Learning, Industry Specific Competition
Batch Inadequacy: Catastrophic Failure, ‘Lossy’, Time-based inaccuracies, Scale, Fault Intolerance, Complexity (Tool / Pattern Specific)

Microbatch
Application Architecture: Polyglot Persistence, Event-Driven Architecture, Distributed Microservices
BI Use Cases: Hybrid Applications, Machine Learning, Industry Specific Competition
Batch Inadequacy: Catastrophic Failure, ‘Lossy’, Time-based inaccuracies, Scale, Fault Intolerance, Complexity (Tool / Pattern Specific)

Data Virtualization Image source: http://virtualization.sys-con.com/node/1849158

Data Virtualization
Application Architecture: Polyglot Persistence, Event-Driven Architecture, Distributed Microservices
BI Use Cases: Hybrid Applications, Machine Learning, Industry Specific Competition
Batch Inadequacy: Catastrophic Failure, ‘Lossy’, Time-based inaccuracies, Scale, Fault Intolerance, Complexity (Tool / Pattern Specific)

Is it possible?
Application Architecture: Polyglot Persistence, Event-Driven Architecture, Distributed Microservices
BI Use Cases: Hybrid Applications, Machine Learning, Industry Specific Competition
Batch Inadequacy: Catastrophic Failure, ‘Lossy’, Time-based inaccuracies, Scale, Fault Intolerance, Complexity (Tool / Pattern Specific)

Catastrophic Failure; Tool Specific Complexity *; Hybrid Applications; Industry Specific Competition; Event Driven Architecture; Time-Based Inaccuracies; Lossy; Polyglot Persistence; Distributed Microservices; Scale; Fault Tolerance; Machine Learning
* Reprocessing; Stateful Algorithms; Processing inefficiencies

Catastrophic Failure; Tool Specific Complexity *; Hybrid Applications; Industry Specific Competition; Event Driven Architecture; Lossy; Polyglot Persistence; Distributed Microservices
* Reprocessing; Time-Based Inaccuracies; Scale; Fault Tolerance; Machine Learning; Stateful Algorithms; Processing inefficiencies

Catastrophic Failure; Tool Specific Complexity *; Hybrid Applications; Industry Specific Competition; Event Driven Architecture; Lossy; Polyglot Persistence; Distributed Microservices
* Reprocessing; Stateful Algorithms; Machine Learning; Time-Based Inaccuracies; Scale; Fault Tolerance; Processing inefficiencies

Lambda Architecture
http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
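As a rough illustration of the Lambda Architecture idea (a batch layer that recomputes views from the full dataset, a speed layer that covers events the batch has not processed yet, and a serving step that merges the two at query time), here is a minimal Python sketch. The function names and the in-memory event lists are assumptions made for illustration; a real deployment would use separate batch and streaming systems.

```python
from collections import Counter


def batch_view(master_events):
    """Batch layer: recompute counts from the full, immutable dataset."""
    return Counter(e["key"] for e in master_events)


def speed_view(recent_events):
    """Speed layer: incremental counts for events the batch has not seen."""
    view = Counter()
    for e in recent_events:
        view[e["key"]] += 1
    return view


def query(key, batch, speed):
    """Serving step: merge both views at read time."""
    return batch[key] + speed[key]


# Hypothetical data: 'master' was covered by the last batch run,
# 'recent' arrived after it.
master = [{"key": "a"}, {"key": "b"}, {"key": "a"}]
recent = [{"key": "a"}]

result = query("a", batch_view(master), speed_view(recent))
print(result)
```

Note that the counting logic exists twice, once per layer. Keeping two implementations of the same logic in sync is the main objection raised in the second link above.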

Streaming

Events
The Log: What every software engineer should know about real-time data's unifying abstraction
https://goo.gl/j6mqP6
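The log abstraction referenced here can be sketched in a few lines of Python: an append-only sequence of events, each identified by an offset, from which any consumer can rebuild state by replaying. `EventLog`, `apply_event`, `replay`, and the sample events are hypothetical names and data for illustration, not the API of any particular product.

```python
class EventLog:
    """Append-only log; each record gets a monotonically increasing offset."""

    def __init__(self):
        self._records = []

    def append(self, event):
        self._records.append(event)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        """Return (offset, event) pairs starting at the given offset."""
        return list(enumerate(self._records[offset:], start=offset))


def apply_event(state, event):
    """Fold one change event into a key -> value table."""
    op, key, value = event
    if op == "delete":
        state.pop(key, None)
    else:  # "upsert"
        state[key] = value


def replay(log, end_offset):
    """Rebuild the table as it was just before end_offset."""
    state = {}
    for offset, event in log.read_from(0):
        if offset >= end_offset:
            break
        apply_event(state, event)
    return state


log = EventLog()
log.append(("upsert", 1, "admitted"))
log.append(("upsert", 1, "transferred"))
log.append(("upsert", 1, "discharged"))
log.append(("delete", 1, None))

print(replay(log, 2))  # table after the first two events
print(replay(log, 4))  # table after all four events
```

Because every change is retained in order, intermediate states stay recoverable, and a consumer that fails can resume from its last offset or reprocess from the beginning.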

Events Image source: https://dzone.com/articles/how-use-sql-server-transaction

Events Image source: https://kafka.apache.org/documentation/

Stream Processing

Catastrophic Failure; Tool Specific Complexity; Hybrid Applications; Industry Specific Competition; Event Driven Architecture; Lossy; Polyglot Persistence; Distributed Microservices; Reprocessing; Stateful Algorithms; Machine Learning; Time-Based Inaccuracies; Scale; Fault Tolerance; Processing inefficiencies
https://www.confluent.io/blog/stream-data-platform-1/

VoltDB
Apache Kafka (Confluent), Amazon Kinesis, Google Cloud Pub/Sub, Microsoft Event Hub
Spark Streaming, Flink, Beam, Storm, Samza, +1M Hadoop ecosystem projects
Amazon Kinesis Analytics, Google Dataflow, Microsoft Stream Analytics
Azure Data Warehouse, Teradata, SQL Server, Amazon Redshift
Etc. etc. etc. etc.

Advanced Topics in Streaming
Distributed Systems: CAP / PACELC; Serial / Parallel algorithms
Reasoning About Time: Event Time / Processing Time; Windowing
Correctness Semantics: Exactly-once, at-least-once, at-most-once
Persistency Layer
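The difference between event time and processing time is easiest to see with a small windowing example. The sketch below groups events into fixed 60-second (tumbling) windows, once by when each event happened and once by when the pipeline saw it. The field names, window size, and sample timestamps are assumptions chosen for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed tumbling-window size


def window_start(ts):
    """Start of the fixed window that contains timestamp ts (in seconds)."""
    return ts - (ts % WINDOW_SECONDS)


def count_by_window(events, time_field):
    """Count events per window, keyed by window start time."""
    counts = defaultdict(int)
    for e in events:
        counts[window_start(e[time_field])] += 1
    return dict(counts)


# event_time: when it happened; processing_time: when the pipeline saw it.
events = [
    {"event_time": 10, "processing_time": 12},
    {"event_time": 50, "processing_time": 55},
    {"event_time": 58, "processing_time": 125},  # arrived about a minute late
    {"event_time": 70, "processing_time": 72},
]

by_event = count_by_window(events, "event_time")
by_processing = count_by_window(events, "processing_time")
print(by_event)
print(by_processing)
```

The late event is counted in the first window under event time but lands two windows later under processing time, so the two groupings disagree about what happened in each minute. Deciding how long to wait for late data before emitting a window is where the correctness and latency trade-offs in this topic list come from.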

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101