Scaling Apache Flink® to very large State

Stephan Ewen (@StephanEwen)

State in Streaming Programs

Pipeline (figure): Source → map() → keyBy → mapWithState() → filter() → keyBy → window() / sum()

    case class Event(producer: String, evtType: Int, msg: String)
    case class Alert(msg: String, count: Long)

    env.addSource(…)
      .map(bytes => Event.parse(bytes))
      .keyBy("producer")
      .mapWithState { (event: Event, state: Option[Int]) =>
        // pattern rules
      }
      .filter(alert => alert.msg.contains("CRITICAL"))
      .keyBy("msg")
      .timeWindow(Time.seconds(10))
      .sum("count")

State in Streaming Programs (continued; the same pipeline, annotated by operator type): map() and filter() are stateless, while mapWithState(), window(), and sum() are stateful.

Internal & External State

External state (kept in a separate data store):
- "State capacity" can scale independently of the stream processor
- Usually much slower than internal state
- Hard to get "exactly-once" guarantees

Internal state (kept in the stream processor):
- Faster than external state
- Always exactly-once consistent
- The stream processor itself has to handle scalability
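To make internal state concrete, here is a minimal sketch using Flink's keyed state API of that era; the class name, state name, and stream types are illustrative:

    import org.apache.flink.api.common.functions.RichFlatMapFunction
    import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.util.Collector

    // Counts events per key; the count lives in internal, keyed state.
    class CountPerKey extends RichFlatMapFunction[(String, Int), (String, Long)] {
      private var count: ValueState[Long] = _

      override def open(parameters: Configuration): Unit = {
        // Descriptor with a default value, as in the Flink 1.1/1.2 API.
        count = getRuntimeContext.getState(
          new ValueStateDescriptor[Long]("count", classOf[Long], 0L))
      }

      override def flatMap(in: (String, Int), out: Collector[(String, Long)]): Unit = {
        val next = count.value() + 1  // read internal state
        count.update(next)            // write internal state
        out.collect((in._1, next))
      }
    }

Applied to a keyed stream (e.g., stream.keyBy(0).flatMap(new CountPerKey)), the count is sharded by key, checkpointed with the job, and restored on recovery.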

Scaling Stateful Computation

State sharding:
- Operators keep state shards (partitions)
- Stream partitioning and state partitioning are symmetric → all state operations are local
- Increasing the operator parallelism is like adding nodes to a key/value store

Larger-than-memory state:
- State is naturally fastest in main memory
- Some applications have a lot of historic data → a lot of state at moderate throughput
- Flink has a RocksDB-based state backend that keeps state partially in memory, partially on disk
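A minimal sketch of selecting that backend; the checkpoint URI is a placeholder:

    import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Working state lives in local RocksDB instances (memory + disk);
    // snapshots are written to the given file system URI.
    env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"))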

Scaling State Fault Tolerance

Scale checkpointing (performance during regular operation):
- Checkpoint asynchronously
- Checkpoint less (incremental checkpoints)

Scale recovery (performance at recovery time):
- Recover fewer operators
- Replicate state
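Checkpointing is enabled per job; a minimal sketch, with an illustrative interval:

    // Draw a consistent snapshot of all operator state every 10 seconds.
    env.enableCheckpointing(10000)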

Asynchronous Checkpoints

Asynchronous Checkpoints

Setup (figure): [Source / filter() / map()] → [window() / sum()]
- Events flow without replication or synchronous writes
- Operator state is kept in a state index (e.g., RocksDB)
- Events are persistent and ordered (per partition / key) in a log (e.g., Apache Kafka)

Asynchronous Checkpoints

Step 1 (figure): trigger the checkpoint by injecting a checkpoint barrier at the sources; the barrier flows through the pipeline with the events.

Asynchronous Checkpoints

Step 2 (figure): when the barrier reaches a stateful operator, RocksDB triggers a copy-on-write of the state and takes a state snapshot.

Asynchronous Checkpoints

Step 3 (figure): the state snapshots are durably persisted asynchronously while the processing pipeline continues.

Asynchronous Checkpoints

Why this is cheap with RocksDB (figure: RocksDB LSM tree): RocksDB organizes state as an LSM tree whose on-disk files are immutable, so a snapshot can reference existing files instead of copying live state.

Asynchronous Checkpoints

- Asynchronous checkpoints work with the RocksDBStateBackend
- In Flink 1.1.x, enable them via RocksDBStateBackend.enableFullyAsyncSnapshots()
- In Flink 1.2.x, this is the default mode
- The FsStateBackend and MemStateBackend are not yet fully asynchronous
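Putting the calls from the slide together, a minimal sketch for Flink 1.1.x (the checkpoint URI is a placeholder):

    val backend = new RocksDBStateBackend("hdfs:///flink/checkpoints")
    // Materialize snapshots in a background thread so the
    // processing pipeline is not stalled during checkpoints.
    backend.enableFullyAsyncSnapshots()
    env.setStateBackend(backend)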

The following slides show ideas, designs, and work in progress. The techniques that end up in Flink releases may differ, depending on results.

Incremental Checkpointing

Full Checkpointing (figure): at each checkpoint time @t1, @t2, @t3, the complete state (entries A through I) is written out as Checkpoint 1, 2, and 3, including entries that did not change since the previous checkpoint.

Incremental Checkpointing (figure): at @t1, @t2, @t3, each checkpoint writes only the entries that changed since the previous checkpoint; unchanged entries are referenced from earlier checkpoints.
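A toy illustration of the difference between the two figures (plain Scala, not Flink code; deletions are ignored for brevity):

    // A full checkpoint writes the whole state map.
    def fullCheckpoint(state: Map[String, Int]): Map[String, Int] = state

    // An incremental checkpoint writes only the entries that changed
    // relative to the previous checkpoint.
    def incrementalCheckpoint(prev: Map[String, Int],
                              curr: Map[String, Int]): Map[String, Int] =
      curr.filter { case (k, v) => !prev.get(k).contains(v) }

    val t1 = Map("A" -> 1, "B" -> 1, "C" -> 1)
    val t2 = Map("A" -> 1, "B" -> 2, "C" -> 1, "D" -> 1)
    incrementalCheckpoint(t1, t2)  // Map(B -> 2, D -> 1)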

Incremental Checkpointing (figure): the checkpoint store holds a full checkpoint C1, deltas d2 and d3 for checkpoints 2 and 3, and then a fresh full checkpoint C4.

Incremental Checkpointing: Discussion
- To avoid applying many deltas on recovery, perform a full checkpoint once in a while (a toy policy is sketched below):
  - Option 1: every N checkpoints
  - Option 2: once the accumulated deltas are as large as a full checkpoint
- Ideally, a separate process would merge the deltas; see the later slides on state replication
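A toy sketch of the two options; the names and the default threshold are illustrative, not Flink API:

    def shouldTakeFullCheckpoint(checkpointsSinceFull: Int,
                                 deltaBytesSinceFull: Long,
                                 fullCheckpointBytes: Long,
                                 everyN: Int = 10): Boolean =
      checkpointsSinceFull >= everyN ||          // Option 1: every N checkpoints
      deltaBytesSinceFull >= fullCheckpointBytes // Option 2: deltas as large as a full checkpoint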

Incremental Recovery

Full Recovery
- Flink's recovery provides "global consistency": after recovery, all state is as if a failure-free run had happened
- This holds even in the presence of non-determinism: the network, external lookups, and other non-deterministic user code
- All operators rewind to the latest completed checkpoint

Incremental Recovery (figure sequence): instead of rewinding all operators to the checkpoint, only the operators affected by the failure rewind and reload their state.

State Replication

Standby State Replication
- The biggest delay during recovery is loading state
- The only way to alleviate this delay is for the recovery machines to not need to load state:
  → keep state outside the stream processor, or
  → have hot standbys that can proceed immediately
- Standbys:
  - Replicate state to N other TaskManagers
  - For failures of up to (N-1) TaskManagers, no state loading is necessary
  - Replication consistency is managed by the checkpoints
  - Replication can happen in addition to checkpointing to DFS
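A toy model of the failure-tolerance arithmetic (plain Scala, not Flink API; names are illustrative): with state replicated to N TaskManagers, any failure of up to N-1 of them leaves a replica that can proceed without loading state.

    case class Assignment(operator: String, replicas: Set[String])

    // Recovery needs no state loading as long as one replica survives.
    def canRecoverWithoutLoading(a: Assignment, failed: Set[String]): Boolean =
      (a.replicas -- failed).nonEmpty

    // State replicated to N = 3 TaskManagers tolerates N - 1 = 2 failures.
    val a = Assignment("window-sum", Set("tm-1", "tm-2", "tm-3"))
    canRecoverWithoutLoading(a, failed = Set("tm-1", "tm-2"))  // true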

Thank you! Questions?