The “perfect” data platform. Aleksander Djurka (Sasho), VP for Data @ Vestiaire Collective. AI Ukraine, 2019-09-21.
Thanks to the organizers for their kind invitation, and to the attendees for their interest. I am Sasho, working in Paris as the VP for Data at an e-commerce company called Vestiaire Collective (I invite you to check it out during the break). I previously worked at Alibaba, the Chinese e-commerce giant, where I had the privilege of seeing the large-scale approach in action, along with the Alibaba Cloud data tools that are still mostly unavailable in Western markets. Today I will be talking about “the perfect data platform”, which is of course a deliberately absurd title: there is no such thing. We will, however, try to get as close as possible.
Who in the audience is… … working as a Data Scientist?
Who in the audience is… … working as a Data Engineer?
Who in the audience is… … managing a data team?
Who in the audience is… … planning to create a data platform from scratch?
1st Task
2nd Task
n-th Task Is this still the same problem?
[Chart: difficulty of the task vs. size of the task]
[Chart: difficulty of the task vs. size of the task, with zones labeled “Easy”, “Discomfort”, “I hate my job”, “Complete collapse”, and the question “When to give up?”]
n-th Task
[Chart: difficulty of the task vs. size of the task, with successive tasks marked as X]
Big Data: “Big” is the part that creates problems; “Data” is the part that creates value.
Will it scale?* (* The number 1 job of any capable data platform)
Vertical scaling: using more efficient infrastructure. Horizontal scaling: adding more infrastructure.
[Chart: capacity over time, growing in steps, with each step labeled “Migration”]
What is a data platform? IN → Data Model → OUT
Goals for each section. IN: include all data, real-time collection. Model: usability, stability, efficiency. OUT: maximize usage, maximize utility.
[Diagram: three approaches: use-case focus, platform focus, cloud-enabled. Focus for today: platform development, with use cases 1, 2 and 3 built on top of the platform, which in turn runs on the cloud.]
OK, enough of this, let’s get specific
What do we need to do?
1: Start with use cases: what are the needs?
2: Make several design choices, such as DWH infra, ETL tools, data stream architecture, BI tools…
3: Design the working processes
What do we need? What kind of data? From which sources? How much data? How will we use it? What are the latency requirements (batch/real-time)? Examples: system to process orders for customers; system to calculate number of orders by age and gender of consumer; system to subscribe to news articles; system to display number of times “AIUkraine” was mentioned in all articles over time.
Why a Data Warehouse? Transactional database: fast at reading/writing single records, but slow at computing over large datasets.* Analytical database: great at analyzing/writing large datasets, but queries take too long to be used for transactional purposes.* Examples: system to process orders for customers; system to calculate number of orders by age and gender of consumer; system to subscribe to news articles; system to display number of times “AIUkraine” was mentioned in all articles over time. (* The differences are getting smaller, perhaps to become insignificant for many use cases.)
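To make the contrast concrete, here is a minimal sketch of the two access patterns, using an in-memory SQLite table whose columns are hypothetical: a transactional system reads or writes one order at a time by key, while an analytical workload scans and aggregates the whole table.

```python
import sqlite3

# Hypothetical "orders" table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_age INTEGER,
        customer_gender TEXT,
        amount REAL
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 34, "F", 120.0), (2, 28, "M", 75.5), (3, 41, "F", 210.0)],
)

# Transactional pattern: fetch a single record by key (what an OLTP system optimizes for).
one_order = conn.execute(
    "SELECT * FROM orders WHERE order_id = ?", (2,)
).fetchone()

# Analytical pattern: scan and aggregate the whole table
# (e.g. number of orders by age and gender, as in the example use case).
by_segment = conn.execute(
    "SELECT customer_age, customer_gender, COUNT(*) "
    "FROM orders GROUP BY customer_age, customer_gender"
).fetchall()

print(one_order)
print(by_segment)
```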
Comparison of out-of-the-box solutions. Amazon Redshift, scaling: purchase instances from Amazon, each instance brings additional CPU and storage capacity. Google BigQuery, scaling: pay per query, depending on the amount of data scanned, decoupled from storage cost. Snowflake, scaling: pay per computing-instance usage time, control computing power, decoupled from storage. Examples: system to process orders for customers; system to calculate number of orders by age and gender of consumer; system to subscribe to news articles; system to display number of times “AIUkraine” was mentioned in all articles over time.
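A rough sketch of how the three pricing/scaling models differ in shape. All numbers below are made-up assumptions for illustration, not vendor quotes, and real bills depend on many more factors.

```python
# Illustrative-only comparison of the three scaling/pricing shapes.

def instance_based_cost(n_instances: int, hours: float, price_per_instance_hour: float) -> float:
    """Redshift-style: pay for provisioned instances (CPU and storage bundled)."""
    return n_instances * hours * price_per_instance_hour

def per_query_cost(tb_scanned: float, price_per_tb: float) -> float:
    """BigQuery-style: pay per query for data scanned; storage billed separately."""
    return tb_scanned * price_per_tb

def compute_time_cost(credit_hours: float, price_per_credit_hour: float) -> float:
    """Snowflake-style: pay for compute usage time; storage billed separately."""
    return credit_hours * price_per_credit_hour

# Hypothetical month: 4 always-on instances vs. 50 TB scanned vs. 300 compute hours.
print(instance_based_cost(4, 730, 1.0))   # cost scales with what you provision
print(per_query_cost(50, 5.0))            # cost scales with what you scan
print(compute_time_cost(300, 2.0))        # cost scales with how long compute runs
```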
DWH structure: Business System Data → Operational Data Store → Common Data/Dimension Model → Application Data Service → data consumers (Finance, DS, BI, …). The Operational Data Store, Common Data/Dimension Model and Application Data Service layers together make up the (Big) Data Warehouse.
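One common way to make these layers visible is a table-naming convention. The sketch below is a hypothetical example of such a convention (the ods_/cdm_/ads_ prefixes and table names are assumptions, not prescribed by the talk).

```python
# Hypothetical table-naming sketch for the warehouse layers described above.
LAYERS = {
    "ods": "Operational Data Store: raw copies of business-system tables",
    "cdm": "Common Data/Dimension Model: cleaned, conformed facts and dimensions",
    "ads": "Application Data Service: aggregates shaped for one consumer (Finance, DS, BI)",
}

PIPELINE = [
    ("orders",          "ods_orders"),                   # ingest business-system data as-is
    ("ods_orders",      "cdm_fact_orders"),              # conform, deduplicate, model
    ("cdm_fact_orders", "ads_bi_orders_by_segment"),     # aggregate for the BI use case
]

for source, target in PIPELINE:
    layer = target.split("_", 1)[0]
    print(f"{source:>20} -> {target:<30} ({LAYERS[layer].split(':')[0]})")
```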
Orchestration of data pipelines: tools for development, scheduling and production monitoring are needed across all layers of the warehouse, from Business System Data through the Operational Data Store and Common Data/Dimension Model to the Application Data Service and the data consumers. Some of the options for vendors/tools: Airflow, Talend, Matillion, Pentaho, dbt.
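As a taste of what orchestration code looks like, here is a minimal Airflow DAG sketch. It assumes Apache Airflow is installed; the DAG id, schedule, task names and callables are hypothetical placeholders, and exact import paths and parameter names vary between Airflow versions.

```python
# Minimal Airflow DAG sketch: one daily run that refreshes each warehouse layer in order.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_orders():
    print("pull orders from the business system into the ODS layer")

def build_cdm():
    print("rebuild the common data/dimension model tables")

def refresh_ads():
    print("refresh application data service tables for Finance / DS / BI")

with DAG(
    dag_id="daily_dwh_refresh",          # hypothetical DAG name
    start_date=datetime(2019, 9, 1),
    schedule_interval="@daily",          # classic parameter name; newer versions use `schedule`
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    model = PythonOperator(task_id="build_cdm", python_callable=build_cdm)
    serve = PythonOperator(task_id="refresh_ads", python_callable=refresh_ads)

    extract >> model >> serve  # task order mirrors the warehouse layers
```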
Lambda vs. Kappa architecture for real-time computing. Lambda: events feed both a batch layer and a stream layer, each producing its own view, and queries combine them: Query = λ(Complete data) = λ(Live streaming data) * λ(Stored data). Kappa: events feed a single stream layer: Query = K(New data) = K(Live streaming data). Source: https://towardsdatascience.com/a-brief-introduction-to-two-data-processing-architectures-lambda-and-kappa-for-big-data-4f35c28005bb
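A toy sketch of the two query formulas in plain Python, with no real streaming framework involved; the event data is invented, and the "*" in the Lambda formula is read here as "merge the two views", not arithmetic multiplication.

```python
# Toy illustration of Lambda vs. Kappa query composition.
stored_events = [("order", 100), ("order", 250)]   # historical data (batch layer input)
live_events = [("order", 40)]                      # recent data (stream layer input)

def batch_view(events):
    """Lambda's precomputed view over the complete stored dataset."""
    return sum(v for _, v in events)

def speed_view(events):
    """Incremental view over streaming events (Lambda's speed layer, Kappa's only layer)."""
    return sum(v for _, v in events)

# Lambda: Query = λ(Live streaming data) * λ(Stored data), i.e. merge both views.
lambda_answer = batch_view(stored_events) + speed_view(live_events)

# Kappa: Query = K(Live streaming data), i.e. one stream pipeline that can replay history.
kappa_answer = speed_view(stored_events + live_events)

print(lambda_answer, kappa_answer)  # both give 390 for this toy example
```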
Consuming the data (BI use case): Analytical Database → in-memory cube / staging → Visualization; or, if the analytical database is sufficiently performant, Analytical Database → Visualization directly.
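A tiny sketch of what an in-memory cube/staging step can look like, using pandas (assumed available); the column names and data are hypothetical.

```python
# Pre-aggregate a small "cube" so the visualization layer can slice it
# without hitting the analytical database on every click.
import pandas as pd

orders = pd.DataFrame(
    {
        "age_band": ["18-25", "26-35", "26-35", "36-45"],
        "gender": ["F", "M", "F", "F"],
        "orders": [1, 1, 1, 1],
    }
)

cube = pd.pivot_table(
    orders,
    values="orders",
    index="age_band",
    columns="gender",
    aggfunc="sum",
    fill_value=0,
)
print(cube)  # orders by age band and gender, ready for a BI front end
```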
The full(?) Data Platform toolkit, layered over the warehouse diagram (Business System Data, Operational Data Store, Common Data/Dimension Model, Application Data Service, data consumers: Finance, DS, BI):
- Compatibility of data consumption tools with the core platform
- Data stability monitoring and alerting
- Automated data quality control (see the sketch below)
- Data staging with connectivity to real-time and batch platforms
- IDE for ease of deployment and ops
- Role-based access controls at column level
- Data labeling and anonymization, particularly for PII data
- Standardization of the CDM with a clear naming convention
- Meta-data and lineage management tool
- Real-time and batch versions of the data model
- Computing and storage capacity management
- Data transfer capacity management tool
- Real-time log collection
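To ground the "automated data quality control" and "monitoring and alerting" items, here is a sketch of the kind of checks such tooling runs: row counts, null rates and freshness. The thresholds, counts and table names are hypothetical; in practice these checks would be executed and alerted on by the orchestrator.

```python
# Sketch of simple automated data-quality checks feeding an alerting channel.
from datetime import datetime, timedelta

def check_row_count(row_count: int, expected_min: int) -> bool:
    """Table should not silently shrink below an expected volume."""
    return row_count >= expected_min

def check_null_rate(null_rows: int, total_rows: int, max_rate: float = 0.01) -> bool:
    """Key columns should stay almost entirely populated."""
    return total_rows > 0 and (null_rows / total_rows) <= max_rate

def check_freshness(last_loaded_at: datetime, max_lag: timedelta) -> bool:
    """Data should have been loaded recently enough for its consumers."""
    return datetime.utcnow() - last_loaded_at <= max_lag

checks = {
    "cdm_fact_orders row count": check_row_count(1_250_000, expected_min=1_000_000),
    "cdm_fact_orders null customer_id": check_null_rate(300, 1_250_000),
    "cdm_fact_orders freshness": check_freshness(
        datetime.utcnow() - timedelta(hours=2), max_lag=timedelta(hours=6)
    ),
}

failed = [name for name, ok in checks.items() if not ok]
if failed:
    print("ALERT:", failed)   # hook this into the monitoring/alerting channel
else:
    print("all data quality checks passed")
```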
Join us! Our teams in Paris are hiring! Currently looking for: Data Scientists, BI Developers, Data Engineers, and many other roles in the tech team. Check it out at https://www.vestiairecollective.com/about/join-us/ or contact me at aleksander.djurka@vestiairecollective.com