1
The Evolution of Big Data Platform @ Netflix
Eva Tse July 22, 2015
6
Our biggest challenge is scale
7
Netflix Key Business Metrics
65+ million members
50 countries
1000+ devices supported
10 billion hours / quarter
8
Global Expansion
200 countries by end of 2016
9
Big Data Size
Total ~20 PB DW on S3
Read ~10% DW daily
Write ~10% of read data daily
~500 billion events daily
~350 active users
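The percentages on this slide translate into absolute daily volumes. The sketch below is back-of-the-envelope arithmetic only; it assumes decimal units (1 PB = 1000 TB) and treats every "~" figure as exact, which the slide does not claim.

```python
# Back-of-the-envelope arithmetic for the slide's figures.
# Assumptions: decimal units (1 PB = 1000 TB), a 24-hour day,
# and the approximate figures taken at face value.
TB_PER_PB = 1000
SECONDS_PER_DAY = 24 * 60 * 60

warehouse_tb = 20 * TB_PER_PB                  # ~20 PB DW on S3
daily_read_tb = warehouse_tb * 10 // 100       # read ~10% of the DW daily
daily_write_tb = daily_read_tb * 10 // 100     # write ~10% of read data daily
events_per_second = 500_000_000_000 / SECONDS_PER_DAY

print(f"daily read:  ~{daily_read_tb} TB")
print(f"daily write: ~{daily_write_tb} TB")
print(f"events/sec:  ~{events_per_second / 1e6:.1f} million")
```

Under these assumptions the warehouse serves roughly 2 PB of reads and takes roughly 200 TB of writes per day.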
10
Our traditional BI stack is our competition
11
How do we meet the functionality bar and yet make it scale?
How do we make big data bite-size again?
12
Our North Star
Infrastructure: No undifferentiated heavy lifting
Architecture: Scalable and sustainable
Self-serve: Ecosystem of tools
13
Data Pipelines
Event Data: Cloud apps -> Suro/Kafka -> Ursula -> AWS S3 (15 min)
Dimension Data: Cassandra SS Tables -> Aegisthus -> AWS S3 (Daily)
14
Storage: AWS S3, Parquet FF
Compute: (Hadoop clusters)
Service: (Federated execution service), Metacat (Federated metadata service)
Tools: Big Data API, Big Data Portal, API Portal, Data movement, Data visualization, Pig workflow visualization, Job/Cluster perf, Data lineage, Data quality
15
Evolving Big Data Processing Needs
Analytics
ETL
Interactive data exploration
Interactive slice & dice
RT analytics & iterative/ML algo
16
Evolving Services/Tools Ecosystem
(Federated execution service)
Metacat (Federated metadata service)
Big Data API
Big Data Portal
API Portal
Data movement
Data visualization
Data lineage
Data quality
Pig workflow visualization
Job/Cluster perf visualization
17
AWS S3 as our DW Storage
S3 as single source of truth (not HDFS)
11 9’s durability and 4 9’s availability
Separate compute and storage
Key enablement to:
multiple clusters
easy upgrade via red/black (r/b) deployment
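To make the "nines" concrete, the sketch below converts them into downtime and expected object loss. It assumes both figures are annual, and the 10-million-object count is an illustrative number, not one from the talk.

```python
# Convert "N nines" into concrete numbers.
# Assumptions: both figures are annual; the object count is illustrative.
MINUTES_PER_YEAR = 365 * 24 * 60


def failure_fraction(nines: int) -> float:
    """Fraction left over after N nines, e.g. 4 nines -> 0.0001."""
    return 10.0 ** -nines


# 4 nines of availability: allowed unavailability per year, in minutes.
downtime_minutes = failure_fraction(4) * MINUTES_PER_YEAR

# 11 nines of durability: expected objects lost per year out of 10 million.
stored_objects = 10_000_000
expected_loss_per_year = failure_fraction(11) * stored_objects

print(f"downtime: ~{downtime_minutes:.1f} minutes/year")
print(f"expected loss: ~{expected_loss_per_year:.4f} objects/year")
```

Read this way, four nines allows under an hour of unavailability per year, while eleven nines makes object loss negligible even at millions of objects.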
18
Evolution of Big Data Processing Systems
20
Analytics
Hive-QL is close to ANSI SQL syntax
Hive metastore serves as single source of truth for metadata for big data
21
Better language construct for ETL
Contributions since 0.11
Customization
Integration with Metacat to Hive Metastore
Integration with S3
22
Interactive data exploration and experimentation
Why do we like Presto?
Integration with Hive metastore
Easy integration with S3
Works at petabyte scale
ANSI SQL for usability
Fast
23
Our contributions
S3 file system
Query optimizations
Complex types support
Parquet file format integration
Working on predicate pushdown
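Predicate pushdown means evaluating a filter as close to the storage layer as possible so that data which cannot match is never read. The toy sketch below illustrates the idea with per-row-group min/max statistics. It is not Presto's or Parquet's implementation; `RowGroup` and `scan_greater_than` are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical chunk of one column with precomputed statistics."""
    min_value: int
    max_value: int
    values: list


def scan_greater_than(row_groups, threshold):
    """Return matching values and how many row groups were actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        if group.max_value <= threshold:
            continue  # statistics prove nothing here can match, so skip the read
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


groups = [
    RowGroup(1, 10, [1, 5, 10]),
    RowGroup(11, 20, [11, 15, 20]),
    RowGroup(21, 30, [21, 25, 30]),
]
matches, groups_read = scan_greater_than(groups, 18)
print(matches, groups_read)
```

Here the filter `value > 18` reads two of the three row groups; the first is skipped on its statistics alone.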
24
Parquet
Columnar file format
Supported across Hive, Pig, Presto, Spark
Performance benefits across different processing engines
Working on vectorized read, lazy load and lazy materialization
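A columnar layout stores each column's values together, so a query that needs one column does not have to touch the others. The sketch below shows only that layout idea in plain Python; it says nothing about Parquet's actual encoding, and the sample rows are made up for illustration.

```python
# Row-oriented records (made-up sample data, for illustration only).
rows = [
    {"member_id": 1, "country": "US", "hours": 12.5},
    {"member_id": 2, "country": "BR", "hours": 3.0},
    {"member_id": 3, "country": "US", "hours": 7.25},
]

# Pivot into a column-oriented layout: one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column now reads a single contiguous list
# instead of walking every field of every row.
total_hours = sum(columns["hours"])
print(total_hours)
```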
25
Interactive dashboard for slicing and dicing
Column-based in-memory data store for time series data
Serves a specific use case very well
26
ETL, RT analytics, ML algorithms
Why do we like Spark?
Cohesive environment – batch and ‘stream’ processing
Multiple language support – Scala, Python
Performance benefits
Run on top of YARN for multi-tenancy
Community momentum
27
Evolution of Services/Tools Ecosystem
28
Federated execution engine
Expose [your fave big data engine] as a service
Flexible data model to support future job types
Cluster configuration management
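One way to picture "engine as a service" with a flexible job model is a request type that carries free-form parameters, plus a registry that maps job types to clusters. This is a minimal sketch under that assumption, not the actual service's API; `JobRequest`, `ExecutionService`, and the cluster name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class JobRequest:
    """Hypothetical job description; params is free-form so that
    future job types need no change to the request shape."""
    job_type: str
    script: str
    params: dict = field(default_factory=dict)


class ExecutionService:
    """Hypothetical front door that routes jobs to configured clusters."""

    def __init__(self):
        self._clusters = {}  # job_type -> cluster name

    def register(self, job_type, cluster):
        self._clusters[job_type] = cluster

    def submit(self, request):
        cluster = self._clusters.get(request.job_type)
        if cluster is None:
            raise ValueError(f"no cluster configured for {request.job_type!r}")
        # A real service would launch the job; this sketch only reports routing.
        return f"{request.job_type} job routed to {cluster}"


service = ExecutionService()
service.register("presto", "presto-adhoc")
result = service.submit(JobRequest("presto", "SELECT 1"))
print(result)
```

Because clients name only a job type, clusters can be added or swapped behind the registry without changing client code.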
29
Metacat
Federated metadata catalog for the whole data platform
Proxy service to different metadata sources
Data metrics, data usage, ownership, categorization and retention policy …
Common interface for tools to interact with metadata
To be open sourced in 2015 on Netflix OSS
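The "proxy service" and "common interface" points can be sketched as a catalog that forwards lookups to whichever registered source owns a table. This is an illustration of the pattern only, not Metacat's API; the class names, the `source/table` naming scheme, and the sample table are assumptions made for this example.

```python
from typing import Protocol


class MetadataSource(Protocol):
    """Common interface every backing metadata store is assumed to offer."""
    def get_table(self, name: str) -> dict: ...


class InMemorySource:
    """Stand-in for a real source such as a Hive metastore."""
    def __init__(self, tables):
        self._tables = tables

    def get_table(self, name):
        return self._tables[name]


class FederatedCatalog:
    """Hypothetical proxy: one entry point, many metadata sources."""
    def __init__(self):
        self._sources = {}

    def register(self, source_name, source):
        self._sources[source_name] = source

    def get_table(self, qualified_name):
        # Assumed naming scheme: "<source>/<table>".
        source_name, table_name = qualified_name.split("/", 1)
        return self._sources[source_name].get_table(table_name)


catalog = FederatedCatalog()
catalog.register("hive", InMemorySource({"playback_events": {"owner": "data-eng"}}))
table = catalog.get_table("hive/playback_events")
print(table)
```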
30
Service: (Federated execution service), Metacat (Federated metadata service)
Tools: Big Data API, Big Data Portal, API Portal, Data movement, Data visualization, Data lineage, Data quality, Pig workflow visualization, Job/Cluster perf visualization
36
Big Data API
Integration layer for our ecosystem of tools and services
Python library (called Kragle)
Building block for our ETL workflow
Building block for Big Data Portal
39
Big Data Portal
One-stop shop for all big data related tools and services
Built on top of Big Data API
44
Open source is an integral part of our strategy to achieve scale
45
Big Data Processing Systems
Services/Tools Ecosystem
46
Why use Open Source?
Collaborate with other internet-scale tech companies
Uncharted area/scale, lock-in is not desirable
Need the flexibility to achieve scalability
BUT…
Lots of choices
White box approach
47
Why contribute back?
Non-IP or trade secret
Help shape direction of projects
Don’t want to fork and diverge
Attract top talent
48
Why contribute our own tool?
Share our goodness
Set industry standard
Community can help evolve the tool
50
Is open source right for you?
52
Measuring big data - understanding data by usage
By Charles Smith, Netflix 1:40-2:20pm
53
Eva Tse jobs.netflix.com