The “perfect” data platform

Presentation transcript:

The “perfect” data platform
Aleksander Djurka (Sasho), VP for Data @Vestiaire Collective
AI Ukraine, 2019-09-21

Thanks to the organizers for their kind invitation, and to the attendees for their interest. I am Sasho, working in Paris as the VP for Data at an e-commerce company called Vestiaire Collective (I invite you to check it out during the break). I previously worked at Alibaba, the Chinese e-commerce giant, where I had the privilege of seeing the large-scale approach in action, along with the Alicloud data tools that are still mostly not available in Western markets. Today I will be talking about “the perfect data platform”, which of course is a deliberately absurd title: there is no such thing. We will, however, try to get as close as possible.

Who in the audience is… … working as a Data Scientist?

Who in the audience is… … working as a Data Engineer?

Who in the audience is… … managing a data team?

Who in the audience is… ... planning to create a data platform from scratch?

1st Task

2nd Task

n-th Task Is this still the same problem?

(Chart: difficulty of the task vs. size of the task)

When to give up?
(Chart: difficulty of the task vs. size of the task, with zones labeled Easy, Discomfort, I hate my job, Complete collapse)

n-th Task

(Chart: difficulty of the task vs. size of the task, with points marked X)

Big Data The part that creates problems The part that creates value

Will it scale?*
* The number 1 job of any capable data platform

Vertical scaling: using more efficient infrastructure
Horizontal scaling: adding more infrastructure

(Chart: capacity over time, with three migrations marked)

What is a data platform?
Three sections: IN, Data Model, OUT

Goals for each section
IN: Include all data; Real-time collection
Model: Usability; Stability; Efficiency
OUT: Maximize usage; Maximize utility

Use-case focus / Platform focus / Cloud-enabled
(Diagram: Use case 1, Use case 2, Use case 3; Platform development; Cloud; one of the three approaches is marked “Focus for today”)

OK, enough of this, let’s get specific

What do we need to do?
1: Start with use cases: what are the needs?
2: Make several design choices, such as DWH infra, ETL tools, Data Stream architecture, BI tools…
3: Design the working processes

What do we need?
What kind of data?
From which sources?
How much data?
How will we use it?
What are the latency requirements (batch/real-time)?
Examples:
System to process orders for customers
System to calculate number of orders by age and gender of consumer
System to subscribe to news articles
System to display number of times “AIUkraine” was mentioned in all articles over time

Why Data Warehouse?
Transactional Database: fast at reading/writing single records, but slow at computing over large datasets *
Analytical Database: great for analyzing/writing large datasets, but queries take too long to be used for transactional purposes *
Examples:
System to process orders for customers
System to calculate number of orders by age and gender of consumer
System to subscribe to news articles
System to display number of times “AIUkraine” was mentioned in all articles over time
* The differences are getting smaller, perhaps to become insignificant for many use cases
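To make the two access patterns concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, its columns and the sample rows are made up for illustration. SQLite is itself a transactional-style engine, so the sketch only shows the shape of the two workloads (single-record lookup vs. full-table aggregation), not their relative performance.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a tiny, made-up orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, age_group TEXT, gender TEXT, amount REAL)"
    )
    rows = [
        (1, "18-24", "F", 120.0),
        (2, "18-24", "M", 80.0),
        (3, "25-34", "F", 200.0),
        (4, "25-34", "F", 50.0),
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def fetch_order(conn, order_id):
    """Transactional-style access: read one record by its key."""
    return conn.execute(
        "SELECT order_id, age_group, gender, amount FROM orders WHERE order_id = ?",
        (order_id,),
    ).fetchone()


def orders_by_age_and_gender(conn):
    """Analytical-style access: scan and aggregate the whole table."""
    return conn.execute(
        "SELECT age_group, gender, COUNT(*) FROM orders "
        "GROUP BY age_group, gender ORDER BY age_group, gender"
    ).fetchall()


conn = build_demo_db()
single = fetch_order(conn, 3)
summary = orders_by_age_and_gender(conn)
```

The first query touches one row and is what an order-processing system does constantly; the second touches every row and is what a reporting system does.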

Comparison of out-of-the-box solutions
Amazon Redshift. Scaling: purchase instances from Amazon; each instance brings additional CPU and storage capacity
Google BigQuery. Scaling: pay-per-query, depending on the volume of data being scanned, decoupled from storage cost
Snowflake. Scaling: pay per computing instance usage time, control computing power, decoupled from storage
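The three scaling models can be compared with simple arithmetic. The sketch below is only an illustration: the function names, the workload figures and every price are hypothetical placeholders, not vendor list prices, and real bills include storage and other items not modeled here.

```python
def instance_based_cost(instances, hours, price_per_instance_hour):
    """Instance model: pay for every provisioned instance-hour, busy or idle."""
    return instances * hours * price_per_instance_hour


def per_query_cost(queries, scanned_units_per_query, price_per_scanned_unit):
    """Pay-per-query model: pay only for what each query scans."""
    return queries * scanned_units_per_query * price_per_scanned_unit


def compute_time_cost(compute_hours, price_per_compute_hour):
    """Compute-time model: pay for the hours the compute cluster is running."""
    return compute_hours * price_per_compute_hour


# Hypothetical month: 4 instances always on, 1000 queries, 200 hours of active compute.
monthly_instance = instance_based_cost(instances=4, hours=720, price_per_instance_hour=1.0)
monthly_per_query = per_query_cost(queries=1000, scanned_units_per_query=0.5, price_per_scanned_unit=5.0)
monthly_compute = compute_time_cost(compute_hours=200, price_per_compute_hour=8.0)
```

Plugging in your own expected query volume and usage hours shows quickly which model penalizes idle capacity and which penalizes heavy scanning.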

DWH structure
(Big) Data Warehouse layers: Business System Data, Operational Data Store, Common Data/Dimension Model, Application Data Service
Data consumers: Finance, DS, BI, …

Orchestration of data pipelines
(Diagram: data pipeline orchestration across the (Big) Data Warehouse layers: Business System Data, Operational Data Store, Common Data/Dimension Model, Application Data Service; data consumers: Finance, DS, BI, …)
Tools for development, scheduling and production monitoring are needed.
Some of the options for vendors/tools: Airflow, Talend, Matillion, Pentaho, DBT
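At its core, an orchestrator runs tasks in dependency order. The sketch below shows that idea with Python's standard-library graphlib (Python 3.9+); it does not use the API of Airflow or of any other tool named on the slide. The task names are hypothetical and simply mirror the warehouse layers, and run_task stands in for whatever actually executes a step.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical names).
PIPELINE = {
    "operational_data_store": {"business_system_data"},
    "common_data_model": {"operational_data_store"},
    "application_data_service": {"common_data_model"},
    "finance_report": {"application_data_service"},
    "bi_dashboard": {"application_data_service"},
}


def run_pipeline(dag, run_task):
    """Run every task after all of its upstream dependencies; return the order."""
    executed = []
    for task in TopologicalSorter(dag).static_order():
        run_task(task)
        executed.append(task)
    return executed


# A no-op runner stands in for real extract/transform/load work.
executed = run_pipeline(PIPELINE, lambda task: None)
```

Real orchestration tools add what this sketch leaves out: scheduling, retries, and production monitoring.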

Lambda vs. Kappa architecture for real-time computing
Lambda: Events → Batch + Stream → View. Query = λ (Complete data) = λ (Live streaming data) * λ (Stored data)
Kappa: Events → Stream → View. Query = K (New data) = K (Live streaming data)
Source: https://towardsdatascience.com/a-brief-introduction-to-two-data-processing-architectures-lambda-and-kappa-for-big-data-4f35c28005bb
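A toy version of the two architectures, using the “AIUkraine mentions over time” use case. This is a sketch under simplifying assumptions: events are (day, text) tuples held in memory, the function names are made up, and the slide's combining operator is interpreted as merging per-day counts. In Lambda the query merges a batch view over stored data with a speed view over live data; in Kappa one stream processor handles the whole event log.

```python
from collections import Counter


def count_mentions(events, keyword):
    """Count, per day, the events whose text mentions the keyword."""
    counts = Counter()
    for day, text in events:
        if keyword.lower() in text.lower():
            counts[day] += 1
    return counts


def lambda_query(stored_events, live_events, keyword):
    """Lambda: compute a batch view and a speed view separately, then merge."""
    batch_view = count_mentions(stored_events, keyword)
    speed_view = count_mentions(live_events, keyword)
    return batch_view + speed_view


def kappa_query(event_log, keyword):
    """Kappa: a single stream processor over the full (replayable) event log."""
    return count_mentions(event_log, keyword)


stored = [("2019-09-20", "AIUkraine starts tomorrow"), ("2019-09-20", "unrelated news")]
live = [("2019-09-21", "Live from aiukraine"), ("2019-09-21", "AIUkraine keynote")]

lambda_result = lambda_query(stored, live, "AIUkraine")
kappa_result = kappa_query(stored + live, "AIUkraine")
```

Both paths give the same answer here; the practical difference is that Lambda maintains two code paths (batch and stream), while Kappa maintains one and relies on replaying the log.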

Consuming the data (BI use case)
Option 1: Analytical Database → In-memory cube / staging → Visualization
Option 2: Analytical Database → Visualization (if sufficiently performant)

The full(?) Data Platform toolkit
(Diagram: Business System Data, Operational Data Store, Common Data/Dimension Model, Application Data Service within the (Big) Data Warehouse; data consumers: Finance, DS, BI, …)
Compatibility of data consumption tools with core platform
Data stability monitoring and alerting
Automated Data Quality control
Data staging with connectivity to real-time and batch platforms
IDE for ease of deployment and ops
Role-based access controls on column level
Data labeling and anonymization, particularly for PII data
Standardization of CDM with clear naming convention
Meta-data and lineage management tool
Real-time and Batch versions of the data model
Computing and storage capacity management
Data transfer capacity management tool
Real-time log collection
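As one illustration of the “data labeling and anonymization” item, the sketch below replaces values in columns labeled as PII with a keyed hash. The column labels, the record and the key are hypothetical. Strictly speaking this is pseudonymization rather than full anonymization: the same input always maps to the same token, which keeps joins working but means the key must be protected.

```python
import hashlib
import hmac

# Hypothetical labels: which columns are considered PII.
PII_COLUMNS = {"email", "phone"}


def pseudonymize(value, secret_key):
    """Replace a value with a keyed SHA-256 digest (HMAC), hex-encoded."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(record, secret_key, pii_columns=PII_COLUMNS):
    """Return a copy of the record with PII-labeled columns pseudonymized."""
    return {
        column: pseudonymize(str(value), secret_key) if column in pii_columns else value
        for column, value in record.items()
    }


record = {"order_id": 1, "email": "a@example.com", "amount": 10.0}
masked = anonymize_record(record, secret_key=b"demo-key-not-for-production")
```

Non-PII columns pass through unchanged, so downstream analytics such as order counts keep working on the masked data.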

Join us! Our teams in Paris are hiring!
Currently looking for: Data Scientists, BI Developers, Data Engineers … and many other roles in the tech team.
Check it out at https://www.vestiairecollective.com/about/join-us/ or contact me at aleksander.djurka@vestiairecollective.com