1
The “perfect” data platform
Aleksander Djurka (Sasho), VP for Data at Vestiaire Collective, AI Ukraine
Thanks to the organizers for their kind invitation, and to the attendees for their interest. I am Sasho, working in Paris as the VP for Data at an e-commerce company called Vestiaire Collective (I invite you to check it out during the break). I previously worked at Alibaba, the Chinese e-commerce giant, where I had the privilege of seeing the large-scale approach in action, along with the Alicloud data tools that are still mostly unavailable in Western markets. Today I will be talking about “the perfect data platform”, which is of course a deliberately absurd title: there is no such thing. We will, however, try to get as close as possible.
2
Who in the audience is… … working as a Data Scientist?
3
Who in the audience is… … working as a Data Engineer?
4
Who in the audience is… … managing a data team?
5
Who in the audience is… ... planning to create a data platform from scratch?
6
1st Task
7
2nd Task
8
n-th Task Is this still the same problem?
9
Chart: difficulty of the task vs. size of the task
10
Chart: difficulty of the task vs. size of the task. When to give up?
Chart: difficulty of the task vs. size of the task, with zones labeled Easy, Discomfort, I hate my job, Complete collapse
12
n-th Task
14
Chart: difficulty of the task vs. size of the task, with individual tasks marked X
15
Big Data The part that creates problems The part that creates value
16
Will it scale?*
* The number 1 job of any capable data platform
17
Vertical scaling vs. horizontal scaling
Vertical scaling: using more efficient infrastructure
Horizontal scaling: adding more infrastructure
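To make the distinction concrete, here is a minimal Python sketch. The capacity units, the node sizes and both function names are hypothetical and only illustrate the two strategies: vertical scaling looks for a single bigger machine, horizontal scaling adds more machines of the same size.

```python
import math


def smallest_sufficient_node(required_capacity, available_node_sizes):
    """Vertical scaling: pick the smallest single node that covers the load.

    Returns None when no single node is big enough, which is the point
    where vertical scaling stops being an option.
    """
    candidates = [size for size in available_node_sizes if size >= required_capacity]
    return min(candidates) if candidates else None


def nodes_needed(required_capacity, capacity_per_node):
    """Horizontal scaling: count identical nodes needed to cover the load."""
    return math.ceil(required_capacity / capacity_per_node)


# Hypothetical capacities, in arbitrary units.
sizes = [100, 200, 400]
vertical_choice = smallest_sufficient_node(250, sizes)
horizontal_choice = nodes_needed(250, 100)
```

Once the load exceeds the largest available node, only the horizontal option still returns an answer.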
18
Chart: capacity vs. time, marked by repeated migrations
19
What is a data platform? Data Model
20
What is a data platform? Data Model IN
21
What is a data platform? Data Model IN
22
What is a data platform? Data Model IN OUT
23
Goals for each section (IN, Model, OUT)
IN: include all data, real-time collection
24
Goals for each section (IN, Model, OUT)
IN: include all data, real-time collection
Model: usability, stability, efficiency
25
Goals for each section (IN, Model, OUT)
IN: include all data, real-time collection
Model: usability, stability, efficiency
OUT: maximize usage, maximize utility
26
Use-case focus, platform focus, cloud-enabled. Focus for today.
Diagram elements: platform development; use cases 1, 2 and 3; cloud
27
OK, enough of this, let’s get specific
28
What do we need to do?
1. Start with use cases: what are the needs?
2. Make several design choices, such as DWH infra, ETL tools, data stream architecture, BI tools…
3. Design the working processes
29
What do we need?
- What kind of data?
- From which sources?
- How much data?
- How will we use it?
- What are the latency requirements (batch/real-time)?
Examples:
- System to process orders for customers
- System to calculate number of orders by age and gender of consumer
- System to subscribe to news articles
- System to display number of times “AIUkraine” was mentioned in all articles over time
30
Why Data Warehouse?
Transactional database: fast for reading/writing single records, but slow at computing over large datasets*
Analytical database: great for analyzing/writing large datasets, but queries take too long to be used for transactional purposes*
* The differences are getting smaller, perhaps to become insignificant for many use cases
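The two access patterns can be illustrated with a small Python sketch. It uses the standard-library sqlite3 module only because it runs anywhere; the table, its columns and the sample rows are invented for the illustration, and the sketch shows the shape of each query rather than any performance difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, age_group TEXT, gender TEXT, amount REAL)"
)
# Invented sample data.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "18-24", "F", 120.0),
        (2, "25-34", "M", 80.0),
        (3, "18-24", "F", 45.5),
        (4, "25-34", "F", 210.0),
    ],
)

# Transactional pattern: read one record by its key
# (e.g. a system that processes orders for customers).
single_order = conn.execute(
    "SELECT order_id, amount FROM orders WHERE order_id = ?", (3,)
).fetchone()

# Analytical pattern: scan and aggregate the whole table
# (e.g. number of orders by age and gender of consumer).
orders_by_segment = conn.execute(
    "SELECT age_group, gender, COUNT(*) FROM orders "
    "GROUP BY age_group, gender ORDER BY age_group, gender"
).fetchall()
```

The first query touches one row; the second has to read every row, which is the workload an analytical database is built for.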
31
Comparison of out-of-the-box solutions
Amazon Redshift. Scaling: purchase instances from Amazon; each instance brings additional CPU and storage capacity
Google BigQuery. Scaling: pay per query, depending on the amount of data being scanned; decoupled from storage cost
Snowflake. Scaling: pay per computing instance usage time; control computing power; decoupled from storage
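The three scaling models lead to three differently shaped cost formulas. The Python sketch below only illustrates those shapes: the function names are hypothetical, every price is a made-up placeholder rather than a vendor price, and real bills include storage and other items not modeled here.

```python
def per_instance_cost(instances, hours, price_per_instance_hour):
    """Instance-based model: pay for provisioned instances, busy or idle."""
    return instances * hours * price_per_instance_hour


def per_query_cost(units_scanned, price_per_unit_scanned):
    """Pay-per-query model: pay for what the queries scan."""
    return units_scanned * price_per_unit_scanned


def per_compute_time_cost(compute_hours, price_per_compute_hour):
    """Compute-time model: pay for the hours compute is actually running."""
    return compute_hours * price_per_compute_hour


# Placeholder numbers for one month; none of these are real prices.
monthly_instances = per_instance_cost(instances=2, hours=720, price_per_instance_hour=1.0)
monthly_queries = per_query_cost(units_scanned=10, price_per_unit_scanned=5.0)
monthly_compute = per_compute_time_cost(compute_hours=40, price_per_compute_hour=2.0)
```

Plugging expected workloads into each formula is a quick way to see which model suits a steady load and which suits occasional heavy queries.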
32
DWH structure
(Big) Data Warehouse layers: Business System Data, Operational Data Store, Common Data/Dimension Model, Application Data Service
Data consumers: Finance, DS, BI, …
33
Data pipeline orchestration
Diagram: orchestration of data pipelines across the (Big) Data Warehouse layers (Business System Data, Operational Data Store, Common Data/Dimension Model, Application Data Service) up to the data consumers (Finance, DS, BI, …)
Tools for development, scheduling and production monitoring are needed. Some of the options for vendors/tools:
- Airflow
- Talend
- Matillion
- Pentaho
- DBT
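At its core, an orchestrator runs tasks in dependency order. The Python sketch below shows that idea with the standard-library graphlib module (Python 3.9+); it is not the API of any of the tools listed. The task names mirror the warehouse layers, and run_pipeline is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
pipeline = {
    "operational_data_store": {"business_system_data"},
    "common_data_model": {"operational_data_store"},
    "application_data_service": {"common_data_model"},
    "bi_dashboard": {"application_data_service"},
    "finance_report": {"application_data_service"},
}


def run_pipeline(dag, run_task):
    """Run every task after all of its dependencies; return the run order."""
    executed = []
    for task in TopologicalSorter(dag).static_order():
        run_task(task)
        executed.append(task)
    return executed


# A no-op task runner stands in for real ETL jobs.
executed = run_pipeline(pipeline, lambda task: None)
```

Real orchestration tools add scheduling, retries and monitoring on top of this ordering step.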
34
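Lambda vs. Kappa architecture for real-time computing
Lambda: events go to both a batch layer and a stream layer; each produces a view, and the query combines the views
Query = λ(Complete data) = λ(Live streaming data) * λ(Stored data)
Kappa: events go to a single stream layer, which produces the view that the query reads
Query = K(New data) = K(Live streaming data)
Source: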
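The two query formulas can be sketched in Python using the “mentions over time” example. Everything here is illustrative: events are plain dictionaries with a hypothetical "day" field, and the function names are invented. In the Lambda version the query merges a batch view with a speed view; in the Kappa version one stream computation over the full event log produces the view.

```python
from collections import Counter


def count_mentions(events):
    """Shared computation: number of mentions per day."""
    return Counter(event["day"] for event in events)


def lambda_query(stored_events, live_events):
    """Lambda: combine a batch view (stored data) with a speed view (live data)."""
    batch_view = count_mentions(stored_events)
    speed_view = count_mentions(live_events)
    return batch_view + speed_view


def kappa_query(event_log):
    """Kappa: a single streaming pass over the whole log builds the view."""
    view = Counter()
    for event in event_log:
        view[event["day"]] += 1
    return view
```

Both functions return the same counts for the same events; the difference is that Lambda maintains two code paths and two views, while Kappa maintains one.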
35
Consuming the data (BI use case)
Option 1: visualization served from an in-memory cube / staging layer, which is fed by the analytical database
Option 2: visualization served directly from the analytical database, if sufficiently performant
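The idea behind an in-memory cube is to aggregate once and then answer dashboard queries by lookup. Here is a minimal Python sketch under invented assumptions: the rows, the "country" and "category" dimensions, the "revenue" measure and the build_cube helper are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate a measure for every combination of dimension values."""
    cube = defaultdict(float)
    for row in rows:
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = tuple((dim, row[dim]) for dim in dims)
                cube[key] += row[measure]
    return dict(cube)


# Invented sample data.
rows = [
    {"country": "FR", "category": "bags", "revenue": 100.0},
    {"country": "FR", "category": "shoes", "revenue": 50.0},
    {"country": "UA", "category": "bags", "revenue": 70.0},
]
cube = build_cube(rows, ["country", "category"], "revenue")

# Dashboard queries become dictionary lookups instead of table scans.
total_revenue = cube[()]
revenue_france = cube[(("country", "FR"),)]
```

The trade-off is memory and refresh time: the cube grows with the number of dimension combinations, which is why it can be skipped when the analytical database is fast enough on its own.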
36
The full(?) Data Platform toolkit
Diagram: (Big) Data Warehouse layers (Business System Data, Operational Data Store, Common Data/Dimension Model, Application Data Service) and data consumers (Finance, DS, BI, …)
- Compatibility of data consumption tools with core platform
- Data stability monitoring and alerting
- Automated data quality control
- Data staging with connectivity to real-time and batch platforms
- IDE for ease of deployment and ops
- Role-based access controls on column level
- Data labeling and anonymization, particularly for PII data
- Standardization of CDM with clear naming convention
- Meta-data and lineage management tool
- Real-time and batch versions of the data model
- Computing and storage capacity management
- Data transfer capacity management tool
- Real-time log collection
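Two of the toolkit items, column-level role-based access and PII handling, can be sketched together in Python. The roles, column names, salt and helper functions below are hypothetical. Note that a salted hash is pseudonymization rather than full anonymization; it is used here only to keep the example short.

```python
import hashlib

# Hypothetical labeling of PII columns and per-role column permissions.
PII_COLUMNS = {"email", "phone"}
ROLE_COLUMNS = {
    "finance": {"order_id", "amount", "email"},
    "bi": {"order_id", "amount"},
}


def pseudonymize(value, salt):
    """Replace a PII value with a truncated salted SHA-256 digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def read_row(row, role, salt="demo-salt"):
    """Return only the columns the role may see, with PII columns masked."""
    allowed = ROLE_COLUMNS.get(role, set())
    visible = {}
    for column, value in row.items():
        if column not in allowed:
            continue
        visible[column] = pseudonymize(str(value), salt) if column in PII_COLUMNS else value
    return visible


row = {"order_id": 1, "amount": 99.0, "email": "user@example.com", "phone": "000"}
bi_view = read_row(row, "bi")
finance_view = read_row(row, "finance")
```

An unknown role gets an empty row, so access is denied by default.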
37
Join us! Our teams in Paris are hiring!
Currently looking for: Data Scientists, BI Developers, Data Engineers … and many other roles in the tech team.
Check it out at or contact me at