The “perfect” data platform


Presentation on theme: "The “perfect” data platform"— Presentation transcript:

1 The “perfect” data platform
Aleksander Djurka (Sasho), VP for Collective AI Ukraine
Thanks to the organizers for their kind invitation, and to the attendees for their interest. I am Sasho, working in Paris as the VP for Data at an e-commerce company called Vestiaire Collective (I invite you to check it out during the break). I previously worked at Alibaba, the Chinese e-commerce giant, where I had the privilege of seeing the large-scale approach in action, along with the Alicloud data tools that are still mostly not available in Western markets. Today I will be talking about “the perfect data platform”, which of course is a deliberately absurd title – there is no such thing. We will, however, try to get as close as possible.

2 Who in the audience is… … working as a Data Scientist?

3 Who in the audience is… … working as a Data Engineer?

4 Who in the audience is… … managing a data team?

5 Who in the audience is… ... planning to create a data platform from scratch?

6 1st Task

7 2nd Task

8 n-th Task Is this still the same problem?

9 (Chart: difficulty of the task vs. size of the task)

10 When to give up?
(Chart: difficulty of the task vs. size of the task, with zones labelled Easy, Discomfort, I hate my job, Complete collapse)

11

12 n-th Task

13

14 (Chart: difficulty of the task vs. size of the task, with points marked X)

15 Big Data The part that creates problems The part that creates value

16 Will it scale?*
* The number 1 job of any capable data platform

17 Vertical scaling vs. horizontal scaling
Vertical scaling: using more efficient infrastructure
Horizontal scaling: adding more infrastructure

18 Migration
(Chart: capacity vs. time, with three points each labelled Migration)

19 What is a data platform? Data Model

20 What is a data platform? Data Model IN

21 What is a data platform? Data Model IN

22 What is a data platform? Data Model IN OUT

23 Goals for each section
IN: Include all data, Real-time collection

24 Goals for each section
IN: Include all data, Real-time collection
Model: Usability, Stability, Efficiency

25 Goals for each section
IN: Include all data, Real-time collection
Model: Usability, Stability, Efficiency
OUT: Maximize usage, Maximize utility

26 Use-case focus, Platform focus, Cloud-enabled
(Diagram comparing the three approaches, built from the blocks Use case 1, Use case 2, Use case 3, Platform development and Cloud. One approach is marked "Focus for today".)

27 OK, enough of this, let’s get specific

28 What do we need to do?
1: Start with use cases – what are the needs?
2: Make several design choices, such as DWH infra, ETL tools, Data Stream architecture, BI tools…
3: Design the working processes

29 What do we need?
What kind of data?
From which sources?
How much data?
How will we use it?
What are the latency requirements (batch/real-time)?
Examples:
System to process orders for customers
System to calculate number of orders by age and gender of consumer
System to subscribe to news articles
System to display number of times “AIUkraine” was mentioned in all articles over time

30 Why Data Warehouse?
Transactional Database: fast at reading/writing single records… but slow at computing over large datasets *
Analytical Database: great for analyzing / writing large datasets… but queries take too long to be used for transactional purposes *
Examples:
System to process orders for customers
System to calculate number of orders by age and gender of consumer
System to subscribe to news articles
System to display number of times “AIUkraine” was mentioned in all articles over time
* The differences are getting smaller, perhaps to become insignificant for many use cases
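The two access patterns contrasted on this slide can be sketched with the first two examples. This is only an illustration using Python's built-in `sqlite3`; the `orders` table, its columns and the sample rows are assumptions made up for the sketch, and a single in-memory database stands in for both kinds of system.

```python
import sqlite3

# Hypothetical schema and data, invented for this illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "age_group TEXT, gender TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, 101, "18-24", "F", 40.0),
        (2, 102, "25-34", "M", 55.0),
        (3, 101, "18-24", "F", 20.0),
        (4, 103, "25-34", "F", 75.0),
    ],
)


def get_order(order_id):
    """Transactional pattern: fetch one record by its key."""
    return conn.execute(
        "SELECT order_id, customer_id, amount FROM orders WHERE order_id = ?",
        (order_id,),
    ).fetchone()


def orders_by_age_and_gender():
    """Analytical pattern: scan the whole table and aggregate."""
    return conn.execute(
        "SELECT age_group, gender, COUNT(*) FROM orders "
        "GROUP BY age_group, gender ORDER BY age_group, gender"
    ).fetchall()
```

The first function touches a single row through the primary key, while the second has to read every row, which is the workload an analytical database is designed for.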

31 Comparison of out-of-the-box solutions
Examples: Amazon Redshift, Google BigQuery, Snowflake
Amazon Redshift: Scaling: purchase instances from Amazon; each instance brings additional CPU and storage capacity
Google BigQuery: Scaling: pay-per-query, depending on the volume of data scanned, decoupled from storage cost
Snowflake: Scaling: pay per computing instance usage time, control computing power, decoupled from storage
Examples: System to process orders for customers; System to calculate number of orders by age and gender of consumer; System to subscribe to news articles; System to display number of times “AIUkraine” was mentioned in all articles over time
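The three scaling models differ mainly in what the bill is tied to. A rough sketch follows, with the caveat that every function name, parameter and number is hypothetical and none of it reflects any vendor's actual price list.

```python
def instance_based_cost(instances, price_per_instance_hour, hours):
    """Instance model: pay for each purchased instance for as long as it
    runs; compute and storage grow together with the instance count."""
    return instances * price_per_instance_hour * hours


def per_query_cost(units_scanned, price_per_unit):
    """Pay-per-query model: pay for what each query scans;
    storage is billed separately."""
    return units_scanned * price_per_unit


def compute_time_cost(compute_hours, price_per_compute_hour,
                      units_stored, price_per_unit_stored):
    """Usage-time model: pay for the hours compute is running,
    plus a separate storage charge."""
    return (compute_hours * price_per_compute_hour
            + units_stored * price_per_unit_stored)


# Made-up workload, for illustration only.
monthly_instances = instance_based_cost(4, 1.0, 720)
monthly_queries = per_query_cost(10, 5.0)
monthly_usage = compute_time_cost(100, 2.0, 10, 20.0)
```

Under the instance model an idle platform costs as much as a busy one, whereas the other two models follow actual query or compute usage.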

32 DWH structure
(Big) Data Warehouse layers: Business System Data, Operational Data Store, Common Data/Dimension Model, Application Data Service
Data consumers: Finance, DS, BI

33 Data pipeline orchestration
(Diagram: orchestration of data pipelines across the (Big) Data Warehouse layers Business System Data, Operational Data Store, Common Data/Dimension Model and Application Data Service, with data consumers Finance, DS, BI)
Tools for development, scheduling and production monitoring are needed.
Some of the options for vendors/tools: Airflow, Talend, Matillion, Pentaho, DBT
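At its core, an orchestrator runs tasks in dependency order. A minimal sketch of that idea follows, using only Python's standard-library `graphlib` (Python 3.9+) and not the API of any tool listed above; the task names are hypothetical and simply mirror the warehouse layers.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "load_ods": {"extract_business_systems"},
    "build_cdm": {"load_ods"},
    "build_ads_finance": {"build_cdm"},
    "build_ads_bi": {"build_cdm"},
}


def run_order(dag):
    """Return an execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


def run(dag, tasks):
    """Execute the callables in `tasks` in dependency order; return the names run."""
    done = []
    for name in run_order(dag):
        tasks[name]()
        done.append(name)
    return done
```

Real tools add what the slide asks for on top of this ordering: scheduling, retries and production monitoring.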

34 Lambda vs. Kappa architecture for real-time computing
Lambda: Events → Batch and Stream → View
Query = λ (Complete data) = λ (live streaming data) * λ (Stored data)
Kappa: Events → Stream → View
Query = K (New Data) = K (Live streaming data)
Source:
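The difference between the two architectures can be sketched with the “AIUkraine” mention counter from the earlier examples. This is a toy illustration under assumptions: events are hypothetical `(day, text)` tuples, plain Python lists stand in for the stored data and the live stream, and the function names are invented for the sketch.

```python
from collections import Counter


def count_mentions(events, keyword="AIUkraine"):
    """Shared processing logic: count keyword mentions per day."""
    counts = Counter()
    for day, text in events:
        mentions = text.count(keyword)
        if mentions:
            counts[day] += mentions
    return counts


def lambda_query(stored_events, live_events):
    """Lambda: a batch view over stored data and a speed view over live
    data are computed separately, then merged at query time."""
    batch_view = count_mentions(stored_events)
    speed_view = count_mentions(live_events)
    return batch_view + speed_view


def kappa_query(event_log):
    """Kappa: one streaming path processes the whole event log;
    history is handled by replaying the log through the same code."""
    return count_mentions(event_log)
```

Both functions return the same counts for the same events. The trade-off is operational: Lambda maintains two code paths, while Kappa maintains one and relies on being able to replay the stream.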

35 Consuming the data (BI use case)
Option 1: Analytical Database → In-memory cube / staging → Visualization
Option 2: Analytical Database → Visualization (if sufficiently performant)

36 The full(?) Data Platform toolkit
(Diagram: (Big) Data Warehouse layers Business System Data, Operational Data Store, Common Data/Dimension Model, Application Data Service, with data consumers Finance, DS, BI)
Compatibility of data consumption tools with core platform
Data stability monitoring and alerting
Automated Data Quality control
Data staging with connectivity to real-time and batch platforms
IDE for ease of deployment and ops
Role-based access controls on column level
Data labeling and anonymization, particularly for PII data
Standardization of CDM with clear naming convention
Meta-data and lineage management tool
Real-time and Batch versions of the data model
Computing and storage capacity management
Data transfer capacity management tool
Real-time log collection
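Two of the toolkit items, column-level role-based access and PII handling, can be illustrated together. The sketch below is hypothetical throughout: the roles, column names, policy table and salt are invented, and a salted hash is pseudonymization rather than full anonymization.

```python
import hashlib

# Hypothetical policy: for each role, the columns it may see and how.
COLUMN_POLICY = {
    "analyst": {"order_id": "clear", "email": "masked", "amount": "clear"},
    "finance": {"order_id": "clear", "amount": "clear"},
}


def pseudonymize(value, salt="demo-salt"):
    """Replace a PII value with a short salted hash (illustrative only)."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:12]


def read_row(row, role):
    """Return only the columns the role is allowed to see, masking PII."""
    policy = COLUMN_POLICY.get(role, {})
    visible = {}
    for column, mode in policy.items():
        if column not in row:
            continue
        value = row[column]
        visible[column] = value if mode == "clear" else pseudonymize(value)
    return visible
```

A role with no policy entry sees nothing, so access is denied by default; in a real platform this logic would normally be enforced by the warehouse itself rather than in application code.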

37 Join us!
Our teams in Paris are hiring! Currently looking for: Data Scientists, BI Developers, Data Engineers … and many other roles in the tech team.
Check it out at or contact me at

