Samza: Stateful Scalable Stream Processing at LinkedIn


Samza: Stateful Scalable Stream Processing at LinkedIn
Shadi A. Noghabi*, Kartik Paramasivam^, Yi Pan^, Navina Ramesh^, Jon Bringhurst^, Indranil Gupta*, Roy Campbell*
* University of Illinois at Urbana-Champaign  ^ LinkedIn Corp.

Stream (Data in Motion) Processing
Use cases: click-stream processing, interactive user feeds, security and fraud detection, application monitoring, Internet of Things, ads, gaming, trading, etc.

Data Processing at LinkedIn
Clients (browsers, devices, ...) talk to the services tier; the data pipeline spans three areas: ingestion, processing, and serving.
- Ingestion: Apache Kafka and Brooklin, pulling from sources such as Espresso, Oracle, Azure EventHub, and AWS Kinesis
- Processing: real-time processing with Apache Samza

Scale of Processing at LinkedIn
In Apache Kafka alone: 2.1 trillion messages/day; 0.5 PB in and 2 PB out per day (compressed); peaks of 16 million messages/sec. Many applications need state along with processing: several TB for a single application.

Apache Samza
- A battle-tested and scalable stream/data-processing framework
- Top-level Apache project since 2014
- In use at LinkedIn, Uber, Metamarkets, Netflix, Intuit, TripAdvisor, VMware, Optimizely, Redfin, etc.
- Powers hundreds of apps in LinkedIn's production

Samza's Goals
- Scalability: input partitioning; parallel and independent tasks
- Fast Recovery & Restart: parallel recovery; host affinity; compaction
- Efficient Stateful Processing: local state; incremental checkpointing; 3-tier caching
- Unified Data Processing API: for stream and batch; stream processing as a library and Stream Processing as a Service (SPaaS)

Samza's Goals (recap)
First up: Scalability (input partitioning; parallel and independent tasks).

Processing from a Partitioned Source
Clients send to a Kafka topic or EventHub with a partition key. A Samza application is made up of tasks, and every task processes a unique collection of input partitions. (Repartitioning might be required to process from an unpartitioned source.)
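In configuration terms, the partition-to-task mapping falls out of the job definition; a minimal sketch of a classic Samza job config (the task class and topic names are hypothetical):

    # One task is created per input partition (the default GroupByPartition grouper)
    task.class=com.example.PageViewParserTask
    task.inputs=kafka.page-views
    systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory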

Joining Across Co-Partitioned Streams
An ad-view stream and an ad-click stream are partitioned the same way (co-partitioned), so partition i of each stream is consumed by the same task i; each task joins its partitions to produce the ad click-through-rate stream.
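A minimal sketch of how such a join could look in Samza's low-level API, buffering one side in a local store (the stream names, store name, and string payloads are hypothetical):

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.*;

    // Views and clicks are co-partitioned by ad id, so both records for an ad
    // always arrive at the same task and can be joined against local state.
    public class AdCtrJoinTask implements StreamTask, InitableTask {
      private static final SystemStream OUTPUT = new SystemStream("kafka", "ad-ctr");
      private KeyValueStore<String, String> viewStore; // adId -> serialized ad view

      @Override
      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        viewStore = (KeyValueStore<String, String>) context.getStore("ad-views");
      }

      @Override
      public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                          TaskCoordinator coordinator) {
        String adId = (String) envelope.getKey();
        if ("ad-views".equals(envelope.getSystemStreamPartition().getStream())) {
          viewStore.put(adId, (String) envelope.getMessage()); // buffer the view
        } else { // an ad click: join it with the buffered view, if present
          String view = viewStore.get(adId);
          if (view != null) {
            collector.send(new OutgoingMessageEnvelope(OUTPUT, adId, view + "|clicked"));
          }
        }
      }
    }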

Multi-Stage Dataflow Example
Application logic: count the number of "Page Views" for each member in a 5-minute window.
Pipeline: "Page View" input stream -> repartition by member id (via an intermediate stream) -> window -> map -> sendTo "Page View per Member" output stream.

Multi-Stage Dataflow Example
The same pipeline, expressed with Samza's built-in transform functions:

    public class PageViewCountApplication implements StreamApplication {
      @Override
      public void init(StreamGraph graph, Config config) {
        MessageStream<PageViewEvent> pageViewEvents =
            graph.getInputStream("pageViewStream");
        OutputStream<MyStreamOutput> pageViewPerMember =
            graph.getOutputStream("pageViewPerMemberStream");

        pageViewEvents
            .partitionBy(m -> m.memberId)            // repartition by member id
            .window(Windows.keyedTumblingWindow(     // 5-minute tumbling window
                m -> m.memberId, Duration.ofMinutes(5),
                () -> 0, (m, c) -> c + 1))           // initial count, fold function
            .map(MyStreamOutput::new)                // map
            .sendTo(pageViewPerMember);              // sendTo
      }
    }

Samza's Goals (recap)
Next: Efficient Stateful Processing (local state; incremental checkpointing).

Stateful Processing: Aggregations, Windowed Joins, ...
Example: count the number of "Page Views" for each member in a 5-minute window. Each task of the Samza application consumes a partition of the "Page View" Kafka stream, stores the count of page views per member as state, and emits to the "Page View Per Member" Kafka stream.
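A sketch of how a task could keep this count in a local key-value store using the low-level API (the store, stream, and class names are hypothetical; error handling omitted):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.Entry;
    import org.apache.samza.storage.kv.KeyValueIterator;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.*;

    // Counts page views per member in local state; emits and resets the
    // counts whenever the configured window (task.window.ms) fires.
    public class PageViewCountTask implements StreamTask, WindowableTask, InitableTask {
      private static final SystemStream OUTPUT = new SystemStream("kafka", "page-view-per-member");
      private KeyValueStore<String, Integer> counts;

      @Override
      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        counts = (KeyValueStore<String, Integer>) context.getStore("page-view-counts");
      }

      @Override
      public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                          TaskCoordinator coordinator) {
        String memberId = (String) envelope.getKey();
        Integer current = counts.get(memberId); // read-modify-write on local state
        counts.put(memberId, current == null ? 1 : current + 1);
      }

      @Override
      public void window(MessageCollector collector, TaskCoordinator coordinator) {
        List<String> emitted = new ArrayList<>();
        KeyValueIterator<String, Integer> it = counts.all();
        while (it.hasNext()) {
          Entry<String, Integer> e = it.next();
          collector.send(new OutgoingMessageEnvelope(OUTPUT, e.getKey(), e.getValue()));
          emitted.add(e.getKey());
        }
        it.close();
        for (String key : emitted) {
          counts.delete(key); // reset for the next window
        }
      }
    }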

Local State
Each task keeps this state (the per-member page-view counts) in a store local to the task rather than in a remote database. But what about failures? How do we avoid losing state?

Failure Recovery: Changelog
Each task's local DB partition is backed by a changelog partition (e.g., a Kafka log-compacted topic). State changes are saved to this durable changelog. Periodically, at a checkpoint, input offsets are flushed along with the state. Upon failure, the task recovers from the previous checkpoint.
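In job configuration, a store and its changelog might be declared like this (a sketch; the store and topic names are hypothetical, and serde registration is elided):

    # Local RocksDB store, backed by a log-compacted Kafka changelog topic
    stores.page-view-counts.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
    stores.page-view-counts.changelog=kafka.page-view-counts-changelog
    stores.page-view-counts.key.serde=string
    stores.page-view-counts.msg.serde=integer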

Failure Recovery: Changelog (example)
A state update (X=10) in a task's DB partition is appended to that task's changelog partition, and the task's current input offset (1005) is flushed at the checkpoint.

Failure Recovery: Incremental Checkpoint
Checkpointing is incremental: the changelog (a Kafka log-compacted topic) already holds the state changes (X=10), so the checkpoint itself only records the input offsets (offset 1005) in a separate offsets topic, also log-compacted.

Comparing Checkpointing Options
- Full state checkpointing: simply does not scale for non-trivial application state, but makes it easier to achieve "repeatable results" when recovering from failure.
- Incremental state checkpointing: scales to arbitrarily large application state (e.g., some apps have ~2 TB of state in production at LinkedIn); achieving repeatable results requires additional techniques (e.g., mechanisms for de-duplicating data).

Fast Restarts with Local State
A durable task-container-host mapping is maintained (e.g., Task 1, Task 4 -> Host-A; Task 2 -> Host-B; Task 3 -> Host-C). Host affinity in YARN tries to place each task on the same host after an upgrade, so its local state is still on disk and only the tail of the changelog needs replaying, minimizing state-rebuilding overhead.
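Host affinity is a YARN-deployment setting; in classic Samza configs it is switched on roughly like this (the exact key has varied across Samza versions, so treat this as a sketch):

    # Ask the YARN AM to request containers on the hosts that ran them before
    yarn.samza.host-affinity.enabled=true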

Local State Summary
Pros:
- ~100x better performance
- No issues with accidental DoS on the remote DB
- No need to over-provision the remote DB
Cons:
- Does NOT work when adjunct data is large and cannot be co-partitioned with the input stream
- Auto-scaling becomes harder
- Repartitioning the input stream can mess up local state

Samza's Goals (recap)
Next: Unified Data Processing API (stream and batch; stream processing as a library and as a service).

Stream Application in Batch
Application logic: count the number of "Page Views" for each member in a 5-minute window and send the counts to "Page View Per Member". The same pipeline (repartition by member id -> window -> map -> sendTo) runs over HDFS with zero code changes; only the stream bindings change:
PageView: hdfs://mydbsnapshot/PageViewFiles/
PageViewPerMember: hdfs://myoutputdb/PageViewPerMemberFiles
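In configuration, this amounts to binding the stream ids to an HDFS system instead of Kafka; a rough sketch (the system name is hypothetical and the exact keys depend on the Samza version):

    # Read input from HDFS files rather than a Kafka topic
    systems.hdfs.samza.factory=org.apache.samza.system.hdfs.HdfsSystemFactory
    streams.pageViewStream.samza.system=hdfs
    streams.pageViewStream.samza.physical.name=hdfs://mydbsnapshot/PageViewFiles/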

Stream Processing as a Library
The same application is launched as an embedded stream processor, again with zero code changes to the pipeline (repartition by member id -> window -> map -> sendTo):

    public static void main(String[] args) {
      CommandLine cmdLine = new CommandLine();
      OptionSet options = cmdLine.parser().parse(args);
      Config config = cmdLine.loadConfig(options);
      LocalApplicationRunner runner = new LocalApplicationRunner(config);
      PageViewCountApplication app = new PageViewCountApplication();
      runner.run(app);
      runner.waitForFinish();
    }

Coordination is configured through ZooKeeper:

    job.coordinator.factory=org.apache.samza.zk.ZkJobCoordinatorFactory
    job.coordinator.zk.connect=my-zk.server:2191

Stream Processing as a Library: Architecture
Each StreamProcessor embeds a Samza container plus a job coordinator, and the coordinators cooperate through ZooKeeper. The job coordinator is pluggable, with multiple implementations: YARN-based coordination (the non-library option), ZooKeeper-based coordination, and Azure Storage-based coordination.
Why do we need it?
- Some applications want to directly embed stream processing as part of a broader app (e.g., a web frontend).
- LinkedIn tools are not natively built for YARN, so there is some developer friction.
- Some applications need full control (e.g., unique security or deployment considerations).
- It solves API proliferation (Kafka APIs, Databus APIs, etc.).

Evaluation

Evaluation Setup
- Production cluster: a 500-node YARN cluster running real-world applications.
- Small cluster (used for evaluation): 6 nodes, each with 64 GB RAM, 24-core CPUs, and a 1.6 TB SSD, running micro-benchmarks.
- Workloads: read-only (~ adjunct data in a join) and read-write (~ aggregation over time); message size is 100 bytes (read-only) or 80 bytes (read-write).

Local State: Throughput
With no batching or caching for the remote DB, remote state is 30-150x worse than local state. On-disk local state with caching is comparable to in-memory state, and the changelog adds minimal overhead.

Local State: Latency
Remote state is more than two orders of magnitude slower than local state. On-disk local state with caching is comparable to in-memory state, and the changelog adds minimal overhead.

Samza HDFS Benchmark
Profile count, grouped by country, over 500 files (250 GB of input).

Apache Samza: A Real-Time Data Processing Framework, Battle-Tested at Scale!
- Scalability: input partitioning; parallel and independent tasks
- Fast Recovery & Restart: parallel recovery; host affinity
- Efficient Stateful Processing: local state; incremental checkpointing
- Unified Data Processing API: for stream and batch; stream processing as a library and Stream Processing as a Service (SPaaS)