The Evolution of Big Data Platform @ Netflix
Eva Tse
July 22, 2015
Our biggest challenge is scale
Netflix Key Business Metrics
- 65+ million members
- 50 countries
- 1000+ devices supported
- 10 billion hours / quarter
Global Expansion: 200 countries by end of 2016
Big Data Size
- Total ~20 PB DW on S3
- Read ~10% of the DW daily
- Write ~10% of read data daily
- ~500 billion events daily
- ~350 active users
Our traditional BI stack is our competition
How do we meet the functionality bar and yet make it scale? How do we make big data bite-size again?
Our North Star
- Infrastructure: no undifferentiated heavy lifting
- Architecture: scalable and sustainable
- Self-serve: ecosystem of tools
Data Pipelines
- Event data: cloud apps → Suro/Kafka → Ursula → AWS S3, every 15 min
- Dimension data: Cassandra SSTables → Aegisthus → AWS S3, daily
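To make the event path concrete, here is a minimal sketch of listing one 15-minute batch landed on S3. The bucket name and partition layout are assumptions for illustration; the real pipeline paths are internal to Netflix.

```python
# Sketch: listing one 15-minute event batch on S3 (hypothetical bucket
# and partition layout, not the actual internal paths).
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("example-dw-bucket")  # hypothetical bucket name

# Ursula lands event data in small time-based batches; assume a layout like
#   events/<event_type>/dateint=YYYYMMDD/hour=HH/batch=MM/...
prefix = "events/playback/dateint=20150722/hour=10/batch=15/"

for obj in bucket.objects.filter(Prefix=prefix):
    print(obj.key, obj.size)
```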
[Architecture diagram]
- Storage: AWS S3 (Parquet file format)
- Compute: Hadoop clusters
- Services: Metacat (federated metadata service), federated execution service
- Tools: Big Data API, Big Data Portal, data movement, data visualization, Pig workflow visualization, data lineage, data quality, job/cluster perf
Evolving Big Data Processing Needs
- Analytics
- ETL
- Interactive data exploration
- Interactive slice & dice
- RT analytics & iterative/ML algorithms
Evolving Services/Tools Ecosystem
[Same ecosystem diagram as above]
AWS S3 as our DW Storage
- S3 as the single source of truth (not HDFS)
- 11 9's durability and 4 9's availability
- Separates compute and storage
- Key enabler for multiple clusters and easy upgrades via red/black deployment
Evolution of Big Data Processing Systems
Analytics: Hive
- Hive-QL is close to ANSI SQL syntax
- Hive metastore serves as the single source of truth for big data metadata
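As a sketch of the analytics path, here is a HiveQL query issued from Python. PyHive is used purely for illustration (the deck does not name a client), and the gateway host and table are hypothetical.

```python
# Sketch: running Hive-QL against the warehouse. PyHive is one common client;
# the host and table below are hypothetical, not Netflix's actual setup.
from pyhive import hive

conn = hive.Connection(host="hive-gateway.example.com", port=10000)
cursor = conn.cursor()

# Hive-QL is close to ANSI SQL, so analysts can write familiar queries.
cursor.execute("""
    SELECT dateint, COUNT(*) AS plays
    FROM dw.playback_events          -- hypothetical table
    WHERE dateint = 20150722
    GROUP BY dateint
""")
print(cursor.fetchall())
```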
ETL: Pig
- Better language constructs for ETL
- Contributions since Pig 0.11: customization, Metacat integration with the Hive metastore, integration with S3
Interactive Data Exploration and Experimentation: Presto
Why do we like Presto?
- Integration with the Hive metastore
- Easy integration with S3
- Works at petabyte scale
- ANSI SQL for usability
- Fast
Our contributions:
- S3 file system
- Query optimizations
- Complex types support
- Parquet file format integration
- Working on predicate pushdown
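For interactive exploration, the same warehouse tables can be queried through Presto. A minimal sketch using PyHive's Presto client; the coordinator host and table are assumptions.

```python
# Sketch: the same SQL via Presto for low-latency exploration. PyHive is
# again used for illustration; coordinator host and table are hypothetical.
from pyhive import presto

conn = presto.Connection(host="presto-coordinator.example.com", port=8080,
                         catalog="hive", schema="dw")
cursor = conn.cursor()

# Presto shares the Hive metastore, so the same tables are queryable
# with ANSI SQL, but at interactive speed.
cursor.execute("SELECT country, COUNT(*) FROM playback_events "
               "WHERE dateint = 20150722 GROUP BY country LIMIT 10")
print(cursor.fetchall())
```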
Parquet
- Columnar file format
- Supported across Hive, Pig, Presto, Spark
- Performance benefits across different processing engines
- Working on vectorized read, lazy load, and lazy materialization
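A minimal Parquet round-trip in PySpark illustrates the cross-engine story: files written once in columnar form can then be read by Hive, Pig, Presto, or Spark. The paths are hypothetical, and the modern SparkSession API is used for brevity.

```python
# Sketch: Parquet write/read from (Py)Spark; the same files are readable
# from the other engines. Path and data are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame(
    [(20150722, "US", 123), (20150722, "BR", 45)],
    ["dateint", "country", "plays"],
)

# Columnar layout means engines read only the columns a query touches.
df.write.mode("overwrite").parquet("/tmp/plays_parquet")  # S3 in production
back = spark.read.parquet("/tmp/plays_parquet")
back.select("country", "plays").show()
```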
Interactive Dashboard for Slicing and Dicing
- Column-based in-memory data store for time-series data
- Serves a specific use case very well
ETL, RT Analytics, ML Algorithms: Spark
Why do we like Spark?
- Cohesive environment for batch and 'stream' processing
- Multiple language support: Scala, Python
- Performance benefits
- Runs on top of YARN for multi-tenancy
- Community momentum
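The cohesion point can be sketched as one transformation shared between a batch RDD and a Spark Streaming DStream. The socket source host/port below are hypothetical.

```python
# Sketch: one piece of logic reused for batch and 'stream' processing,
# which is the cohesion argument for Spark. Source host/port are hypothetical.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def count_by_country(rdd):
    # Shared logic: parse "country,count" records and sum per country.
    return (rdd.map(lambda line: line.split(","))
               .map(lambda f: (f[0], int(f[1])))
               .reduceByKey(lambda a, b: a + b))

sc = SparkContext(appName="batch-and-stream")

# Batch: run the logic over a static RDD.
print(count_by_country(sc.parallelize(["US,3", "BR,2", "US,1"])).collect())

# Streaming: run the identical logic over 10-second micro-batches.
ssc = StreamingContext(sc, 10)
lines = ssc.socketTextStream("events.example.com", 9999)  # hypothetical source
lines.transform(count_by_country).pprint()
ssc.start()
ssc.awaitTermination()
```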
Evolution of Services/Tools Ecosystem
Federated Execution Engine
- Expose [your favorite big data engine] as a service
- Flexible data model to support future job types
- Cluster configuration management
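Netflix's open-source Genie service plays this federated-execution role. The sketch below submits a job over REST, but the endpoint and payload shapes are simplified illustrations, not Genie's exact API.

```python
# Sketch: submitting a job to a federated execution service over REST.
# Endpoint and payload here are simplified/hypothetical, not Genie's
# actual contract.
import requests

job = {
    "name": "daily-plays-aggregation",
    "type": "hive",                       # the engine exposed as a service
    "commandArgs": "-f aggregate_plays.hql",
    "tags": ["etl", "dw"],
}

resp = requests.post("https://genie.example.com/api/jobs", json=job)
resp.raise_for_status()
print("job id:", resp.json()["id"])       # poll this id for status/output
```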
Metacat
- Federated metadata catalog for the whole data platform
- Proxy service to different metadata sources
- Data metrics, data usage, ownership, categorization, retention policy …
- Common interface for tools to interact with metadata
- To be open sourced in 2015 on Netflix OSS
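A sketch of what the common interface enables: one HTTP call to look up a table, regardless of which metastore actually owns it. The endpoint shape here is illustrative, not Metacat's actual REST contract.

```python
# Sketch: tools interacting with table metadata through one common interface.
# Metacat proxies to the underlying metastores; this endpoint shape is
# illustrative only.
import requests

url = ("https://metacat.example.com/api/catalog/"
       "prodhive/database/dw/table/playback_events")
table = requests.get(url).json()

# One call, regardless of which metadata source owns the table.
print(table["name"], table.get("metrics"), table.get("retention"))
```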
[Ecosystem diagram, highlighting Big Data API and Big Data Portal]
Big Data API
- Integration layer for our ecosystem of tools and services
- Python library (called Kragle)
- Building block for our ETL workflows
- Building block for the Big Data Portal
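Kragle itself is internal, so the sketch below only models the idea of a single Python entry point that routes work to the right engine; every name in it is a hypothetical stand-in, not Kragle's real interface.

```python
# Sketch: the idea behind the Big Data API. Kragle is internal to Netflix,
# so run_query below is a hypothetical stand-in, not its real interface.

def run_query(engine: str, sql: str) -> list:
    """Route a SQL statement to the chosen engine (stubbed for illustration)."""
    print(f"[{engine}] {sql.strip()}")
    return []  # a real client would return rows or a result handle

# ETL workflows and the Big Data Portal would both call the same layer:
rows = run_query("presto", "SELECT COUNT(*) FROM dw.playback_events")
```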
Big Data Portal
- One-stop shop for all big data related tools and services
- Built on top of the Big Data API
Open source is an integral part of our strategy to achieve scale
[Diagram: open-source components across Big Data Processing Systems and the Services/Tools Ecosystem]
Why use open source?
- Collaborate with other internet-scale tech companies
- Uncharted area/scale; lock-in is not desirable
- Need the flexibility to achieve scalability
BUT…
- Lots of choices
- White-box approach
Why contribute back?
- Not IP or trade secrets
- Help shape the direction of projects
- Don't want to fork and diverge
- Attract top talent
Why contribute our own tool?
- Share our goodness
- Set industry standard
- Community can help evolve the tool
Is open source right for you?
Measuring big data: understanding data by usage
Charles Smith, Netflix
Tomorrow @ 1:40-2:20pm
Eva Tse etse@netflix.com jobs.netflix.com