The Evolution of Big Data Netflix

Name: The Evolution of Big Data Netflix
Uploaded: 2017-07-15T09:44:55+00:00
Duration: PT11M14S
Channel: Rolf Patterson
Description: The Evolution of Big Data Netflix

The Evolution of Big Data Platform @ Netflix
Eva Tse July 22, 2015

Our biggest challenge is scale

Netflix Key Business Metrics
65+ million members
50 countries
1000+ devices supported
10 billion hours / quarter

Global Expansion
200 countries by end of 2016

Big Data Size
Total ~20 PB DW on S3
Read ~10% DW daily
Write ~10% of read data daily
~500 billion events daily
~350 active users
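To put the slide's percentages into absolute terms, here is a small back-of-the-envelope calculation. It uses only the figures quoted on the slide; the choice of decimal units (1 PB = 1000 TB) and the even spread of events over the day are assumptions made for illustration.

```python
# Back-of-the-envelope arithmetic from the figures on the slide.
# Assumption: decimal units (1 PB = 1000 TB); the slide does not say which.
WAREHOUSE_PB = 20          # "~20 PB DW on S3"
READ_FRACTION = 0.10       # "Read ~10% DW daily"
WRITE_FRACTION = 0.10      # "Write ~10% of read data daily"
EVENTS_PER_DAY = 500e9     # "~500 billion events daily"

SECONDS_PER_DAY = 24 * 60 * 60

read_pb_per_day = WAREHOUSE_PB * READ_FRACTION
write_tb_per_day = read_pb_per_day * WRITE_FRACTION * 1000
# Assumption: events arrive evenly over the day, so this is only a daily average.
events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY

print(f"read per day:  ~{read_pb_per_day:.0f} PB")
print(f"write per day: ~{write_tb_per_day:.0f} TB")
print(f"events/second: ~{events_per_second / 1e6:.1f} million (daily average)")
```

Under these assumptions the warehouse serves roughly 2 PB of reads and 200 TB of writes per day, and ingests several million events per second on average.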

Our traditional BI stack is our competition

How do we meet the functionality bar and yet make it scale?
How do we make big data bite-size again?

Our North Star
Infrastructure: No undifferentiated heavy lifting
Architecture: Scalable and sustainable
Self-serve: Ecosystem of tools

Data Pipelines
Event data: Cloud apps -> Suro/Kafka -> Ursula -> AWS S3 (every 15 min)
Dimension data: Cassandra SSTables -> Aegisthus -> AWS S3 (daily)
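The 15-minute cadence of the event pipeline can be pictured as assigning every event to a fixed time bucket before it lands in S3. The sketch below is illustrative only: the function name, the table name, and the key layout are hypothetical and are not taken from the talk or from Ursula.

```python
from datetime import datetime, timezone

BATCH_MINUTES = 15  # batch interval shown on the slide


def batch_prefix(event_time: datetime, table: str) -> str:
    """Return a hypothetical S3 key prefix for the 15-minute batch containing event_time."""
    t = event_time.astimezone(timezone.utc)
    # Round the minute down to the start of its 15-minute bucket.
    minute = (t.minute // BATCH_MINUTES) * BATCH_MINUTES
    return f"{table}/dateint={t:%Y%m%d}/hour={t:%H}/batch={minute:02d}"


example = datetime(2015, 7, 22, 13, 47, 5, tzinfo=timezone.utc)
print(batch_prefix(example, "playback_events"))
```

Events stamped 13:45:00 through 13:59:59 UTC would share one prefix in this layout, so a downstream job can pick up a complete batch once the interval closes.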

[Architecture diagram; layers: Storage, Compute, Service, Tools]
Storage: AWS S3, Parquet FF
Compute: Hadoop clusters
Service: Metacat (Federated metadata service), (Federated execution service), Big Data API
Tools: Big Data Portal, Pig workflow visualization, Data movement, Data visualization, Job/Cluster perf visualization, Data lineage, Data quality

Evolving Big Data Processing Needs
Analytics
ETL
Interactive data exploration
Interactive slice & dice
RT analytics & iterative/ML algo

Evolving Services/Tools Ecosystem
Big Data API
Big Data Portal
Metacat (Federated metadata service)
(Federated execution service)
Data movement
Data visualization
Data lineage
Data quality
Pig workflow visualization
Job/Cluster perf visualization

AWS S3 as our DW Storage
S3 as single source of truth (not HDFS)
11 9’s durability and 4 9’s availability
Separate compute and storage
Key enablement to:
multiple clusters
easy upgrade via red/black (r/b) deployment
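For a sense of what "11 9's" and "4 9's" mean in practice, the calculation below converts them into expected downtime and expected object loss per year. The object count is an illustrative number, not a figure from the talk, and the helper name is hypothetical.

```python
def failure_probability(n_nines: int) -> float:
    """Return the failure probability for n nines, e.g. 4 nines -> 0.0001."""
    return 10.0 ** (-n_nines)


MINUTES_PER_YEAR = 365 * 24 * 60

# 4 nines of availability: the fraction of the year the service may be unreachable.
downtime_minutes = failure_probability(4) * MINUTES_PER_YEAR

# 11 nines of durability: the annual probability of losing any given object.
objects_stored = 10_000_000            # illustrative object count, not from the talk
expected_losses_per_year = objects_stored * failure_probability(11)

print(f"allowed downtime: ~{downtime_minutes:.1f} minutes/year")
print(f"expected objects lost per year: {expected_losses_per_year:.1e}")
```

With ten million objects, this works out to under an hour of unavailability per year and about one lost object every ten thousand years, which is why durability rather than availability is the stronger guarantee when treating S3 as the source of truth.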

Evolution of Big Data Processing Systems

Analytics
Hive-QL is close to ANSI SQL syntax
Hive metastore serves as single source of truth for metadata for big data

Better language construct for ETL
Contributions since 0.11
Customization:
Integration with Metacat to Hive Metastore
Integration with S3

Interactive data exploration and experimentation
Why we like Presto?
Integration with Hive metastore
Easy integration with S3
Works at petabyte scale
ANSI SQL for usability
Fast

Our contributions
S3 file system
Query optimizations
Complex types support
Parquet file format integration
Working on predicate pushdown
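Predicate pushdown, the last item above, means letting the storage layer discard data that cannot match a filter before the query engine reads it. The following is a minimal conceptual sketch using per-chunk min/max statistics; it is not Presto's or Parquet's implementation, and the class, function, and column names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A chunk of rows plus min/max statistics for the filter column."""
    min_value: int
    max_value: int
    rows: list


def scan_equals(row_groups, column, target):
    """Return matching rows and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        # Pushdown step: skip the group if its statistics rule out a match.
        if target < group.min_value or target > group.max_value:
            continue
        groups_read += 1
        matches.extend(r for r in group.rows if r[column] == target)
    return matches, groups_read


groups = [
    RowGroup(1, 10, [{"country_id": 3, "hours": 5}, {"country_id": 10, "hours": 2}]),
    RowGroup(11, 20, [{"country_id": 15, "hours": 7}, {"country_id": 11, "hours": 1}]),
    RowGroup(21, 30, [{"country_id": 25, "hours": 4}]),
]
rows, groups_read = scan_equals(groups, "country_id", 15)
print(rows, groups_read)
```

Only the middle group is read here; the other two are ruled out by their statistics alone, which is where the I/O saving comes from when the data sits in remote storage such as S3.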

Parquet
Columnar file format
Supported across Hive, Pig, Presto, Spark
Performance benefits across different processing engines
Working on vectorized read, lazy load and lazy materialization
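The benefit of a columnar layout combined with lazy materialization can be shown with a toy in-memory table: the filter column is read first, and the output column is fetched only if some rows match. This is a conceptual sketch, not Parquet's reader; the table contents and function names are made up for illustration.

```python
# Column-oriented layout: one list per column instead of one record per row.
table = {
    "member_id": [101, 102, 103, 104],
    "country":   ["US", "BR", "US", "DE"],
    "hours":     [12.5, 3.0, 8.25, 1.0],
}
columns_touched = []  # records which columns were actually read


def read_column(name):
    columns_touched.append(name)
    return table[name]


def hours_for_country(country):
    # Step 1: read only the filter column and record matching row positions.
    positions = [i for i, c in enumerate(read_column("country")) if c == country]
    if not positions:
        return []  # nothing matched, so the output column is never read
    # Step 2 (lazy materialization): fetch the output column for those positions only.
    hours = read_column("hours")
    return [hours[i] for i in positions]


result = hours_for_country("US")
print(result, columns_touched)
```

The `member_id` column is never touched, and a query with no matching rows would read only `country`.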

Interactive dashboard for slicing and dicing
Column-based in-memory data store for time series data
Serves a specific use case very well

ETL, RT analytics, ML algorithms
Why we like Spark?
Cohesive environment: batch and ‘stream’ processing
Multiple language support: Scala, Python
Performance benefits
Run on top of YARN for multi-tenancy
Community momentum

Evolution of Services/Tools Ecosystem

Federated execution engine
Expose [your fave big data engine] as a service
Flexible data model to support future job types
Cluster configuration management
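One way to picture the three points above is a thin service that accepts an engine-agnostic job description, looks up a cluster by tag, and dispatches to a per-engine runner. Everything in this sketch (class names, fields, cluster tags, endpoints) is hypothetical and is not the API of the service described in the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class JobRequest:
    """Engine-agnostic job description (hypothetical data model)."""
    job_type: str                      # e.g. "hive", "pig", "presto", "spark"
    script: str                        # query or script text
    cluster_tag: str = "adhoc"         # which cluster configuration to use
    params: Dict[str, str] = field(default_factory=dict)  # open-ended, for future job types


class ExecutionService:
    """Routes a JobRequest to a registered runner and a configured cluster."""

    def __init__(self, clusters: Dict[str, str]):
        self._clusters = clusters      # cluster tag -> endpoint (cluster configuration)
        self._runners: Dict[str, Callable[[JobRequest, str], str]] = {}

    def register(self, job_type: str, runner: Callable[[JobRequest, str], str]) -> None:
        self._runners[job_type] = runner

    def submit(self, request: JobRequest) -> str:
        if request.job_type not in self._runners:
            raise ValueError(f"unsupported job type: {request.job_type}")
        endpoint = self._clusters[request.cluster_tag]
        return self._runners[request.job_type](request, endpoint)


service = ExecutionService({"adhoc": "cluster-a.example", "sla": "cluster-b.example"})
# Stand-in runner: a real one would hand the script to the engine's client.
service.register("presto", lambda req, endpoint: f"presto job on {endpoint}: {req.script}")

result = service.submit(JobRequest("presto", "SELECT 1", cluster_tag="adhoc"))
print(result)
```

Supporting a new engine in this design means registering one more runner; callers keep submitting the same `JobRequest` shape, and swapping a cluster only changes the tag-to-endpoint mapping.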

Metacat
Federated metadata catalog for the whole data platform
Proxy service to different metadata sources
Data metrics, data usage, ownership, categorization and retention policy …
Common interface for tools to interact with metadata
To be open sourced in 2015 on Netflix OSS
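The "proxy service" and "common interface" ideas can be sketched as a catalog that forwards lookups to whichever metadata source owns a table and then layers business metadata (ownership, retention) on top. This is a conceptual illustration only; the class names, the `catalog/table` naming scheme, and the fields are hypothetical and are not Metacat's actual API.

```python
from typing import Dict, Protocol


class MetadataSource(Protocol):
    """Anything that can describe a table, e.g. a Hive metastore or an RDS schema."""
    def get_table(self, name: str) -> dict: ...


class InMemorySource:
    """Stand-in for a real metadata source."""
    def __init__(self, tables: Dict[str, dict]):
        self._tables = tables

    def get_table(self, name: str) -> dict:
        return self._tables[name]


class MetadataCatalog:
    """Federates several sources behind one lookup interface."""

    def __init__(self):
        self._sources: Dict[str, MetadataSource] = {}
        self._business: Dict[str, dict] = {}   # ownership, retention policy, ...

    def register_source(self, catalog_name: str, source: MetadataSource) -> None:
        self._sources[catalog_name] = source

    def annotate(self, qualified_name: str, **attrs) -> None:
        self._business.setdefault(qualified_name, {}).update(attrs)

    def get_table(self, qualified_name: str) -> dict:
        catalog_name, _, table = qualified_name.partition("/")
        info = dict(self._sources[catalog_name].get_table(table))  # proxy to the source
        info.update(self._business.get(qualified_name, {}))        # add business metadata
        return info


catalog = MetadataCatalog()
catalog.register_source("hive", InMemorySource({"plays": {"columns": ["member_id", "hours"]}}))
catalog.annotate("hive/plays", owner="data-eng", retention_days=90)
info = catalog.get_table("hive/plays")
print(info)
```

Tools written against `MetadataCatalog.get_table` do not need to know which underlying store holds the schema, which is the point of putting a federated layer in front of the individual metadata sources.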

Big Data API
Integration layer for our ecosystem of tools and services
Python library (called Kragle)
Building block for our ETL workflow
Building block for Big Data Portal

Big Data Portal
One stop shop for all big data related tools and services
Built on top of Big Data API

Open source is an integral part of our strategy to achieve scale

Big Data Processing Systems
Services/Tools Ecosystem

Why use Open Source?
Collaborate with other internet-scale tech companies
Uncharted area/scale, lock-in is not desirable
Need the flexibility to achieve scalability
BUT…
Lots of choices
White box approach

Why contribute back?
Non IP or trade secret
Help shape direction of projects
Don’t want to fork and diverge
Attract top talent

Why contribute our own tool?
Share our goodness
Set industry standard
Community can help evolve the tool

Is open source right for you?

Measuring big data - understanding data by usage
By Charles Smith, Netflix 1:40-2:20pm

Eva Tse jobs.netflix.com
