MillWheel: Fault-Tolerant Stream Processing at Internet Scale


MillWheel: Fault-Tolerant Stream Processing at Internet Scale Tyler Akidau et al., Google VLDB 2013 CS 632 Advanced Database Systems 13/03/2016

Outline Motivation/Introduction Running Example System Overview and Concepts MillWheel API Fault Tolerance Implementation and Results

Motivation Characteristics of the input data: fast, continuous, real-time input and output Example: Google’s web-query trend tracker, called Zeitgeist

Google Zeitgeist

Requirements for MillWheel Real-time processing of data Persistent state abstractions available to user code Out-of-order data handling Monotonically increasing low watermark Low latency at high scalability Guaranteed exactly-once delivery

System Overview At a high level, MillWheel is a graph of computations Computations are parallelized over machines; the user is not concerned with specifying these low-level details The user specifies: a directed graph of computations, and the application logic for each computation node Users can add/remove computations dynamically without restarting the entire system.

System Overview Delivery guarantees: Per-key aggregation Internal updates are atomically checkpointed per key Records are delivered exactly once

System Overview Input to a computation is represented as a triple: <Key, Value, Timestamp> Key: metadata field with semantic meaning Value: byte string holding the entire record Timestamp: assigned by the user
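A minimal sketch of this record triple as it might look in client code (the field names and types are assumptions for illustration, not MillWheel's actual representation):

```python
from dataclasses import dataclass

@dataclass
class Record:
    key: str        # semantically meaningful key used for routing and aggregation
    value: bytes    # opaque byte string holding the entire record payload
    timestamp: int  # event-time timestamp, assigned by the user or injector

# e.g. one search query flowing into the Zeitgeist pipeline
r = Record(key="world cup", value=b"<raw query log entry>", timestamp=1678700000000)
```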

Computation Holds the application logic Invoked on receipt of an incoming record or when a timer fires Code operates in the context of a single key Agnostic to the distribution of keys across machines

Computation

Computation Processing is serialized per key, but parallelized across different keys.

Key Abstraction for aggregation and comparison between different records (similar to MapReduce) A key extraction function (specified by the user) assigns a key to each record. Computation code runs in the context of a specific key (and accesses state for that key only)

Streams Pub-sub delivery mechanism: computations subscribe to streams as well as publish streams A key extraction function is specified for each incoming stream
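As a hedged illustration (the function and stream names are hypothetical), a key-extraction function for the Zeitgeist example might simply pull the query string out of the record value, with one extractor registered per subscribed stream:

```python
import json

def extract_query_key(value: bytes) -> str:
    """Hypothetical key extractor: key search-log records by their query string."""
    return json.loads(value.decode("utf-8"))["query"]

# sketch of wiring a computation to its input streams
subscriptions = {
    "search_log_stream": extract_query_key,  # consumed stream -> key function
}
```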

Low Watermark Provides a bound on the timestamps of future records Definition: low_watermark(A) = min(oldest pending work of A, min over C of low_watermark(C)), where C ranges over the computations that feed input to A Seed low watermark values are provided by injectors
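A sketch of this recursive definition, assuming each computation reports the timestamp of its oldest pending work and we know its upstream producers (the data structures here are assumptions, not MillWheel's implementation):

```python
def low_watermark(computation, oldest_pending, upstream):
    """
    low_watermark(A) = min(oldest pending work of A,
                           min over upstream producers C of low_watermark(C)).
    Injectors at the edge of the graph seed their own low watermark values,
    which then propagate through the pipeline.
    """
    lw = oldest_pending[computation]
    for producer in upstream.get(computation, []):
        lw = min(lw, low_watermark(producer, oldest_pending, upstream))
    return lw
```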

Low Watermark

Timers Timers are per-key programmatic hooks that trigger at a specific wall time or low watermark value Timers are journaled in persistent state and can survive process restarts and machine failures When fired, a timer runs the specified user function and provides the same exactly-once guarantees

Computation API
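The actual API is C++; the following is a loose Python-flavored sketch (the Context stand-in, method names, and helper calls are all assumptions, not MillWheel's real interface) of a Zeitgeist-style computation that buckets counts per key and emits them when a window timer fires:

```python
class Context:
    """Minimal stand-in for the framework-provided per-key context (an assumption)."""
    def __init__(self, key):
        self.key = key
        self.state = {}        # per-key persistent state, checkpointed by the framework
        self.timers = []       # journaled timers
        self.productions = []  # records produced to downstream streams

    def set_timer(self, tag, when):
        self.timers.append((tag, when))

    def produce(self, stream, key, value, timestamp):
        self.productions.append((stream, key, value, timestamp))


class WindowCountComputation:
    """Sketch of user logic: count records per key per one-second window."""
    WINDOW_MS = 1000

    def process_record(self, ctx, record):
        window = record.timestamp - record.timestamp % self.WINDOW_MS
        counts = ctx.state.setdefault("counts", {})
        counts[window] = counts.get(window, 0) + 1
        # fire once the low watermark passes the end of the window
        ctx.set_timer(tag=window, when=window + self.WINDOW_MS)

    def process_timer(self, ctx, timer_tag):
        counts = ctx.state.get("counts", {})
        total = counts.pop(timer_tag, 0)
        ctx.produce(stream="per_window_counts",
                    key=ctx.key, value=total, timestamp=timer_tag)
```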

Example

Example

Injector and Low Watermark API Each computation calculates a low watermark value over all of its pending work Users rarely deal with low watermarks directly; they manipulate them indirectly by assigning timestamps to records Injectors bring in external data and seed low watermark values for the rest of the pipeline If an injector is distributed across multiple processes, the least watermark among all processes is used

Fault Tolerance: Delivery Guarantees Exactly-once delivery: the steps performed on receipt of an input record for a computation are: The record is checked for duplication User code is run for the input Pending changes are committed to the backing store The sender is ACKed Pending productions are sent (retried until they are ACKed) The system assigns a unique ID to every record at production time; these IDs are stored to identify duplicate records during retries Optimization: batch multiple records into a single checkpoint
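A sketch of that per-record receive path (all parameter names and store operations are assumptions; the real system batches these steps):

```python
def handle_record(record, user_code, seen_ids, state_store, sender, downstream):
    """Sketch of the exactly-once receive path described on the slide above."""
    # 1. Check for duplication via the unique ID assigned at production time
    if record.id in seen_ids:
        sender.ack(record.id)                  # duplicate retry: just re-ACK
        return
    # 2. Run user code against the current per-key state
    new_state, productions = user_code(record, state_store.read(record.key))
    # 3. Commit state changes (and the record ID) to the backing store
    state_store.write(record.key, new_state, record_id=record.id)
    seen_ids.add(record.id)
    # 4. ACK the sender so it stops retrying
    sender.ack(record.id)
    # 5. Send pending productions downstream, retrying until they are ACKed
    for production in productions:
        downstream.send_with_retry(production)
```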

Fault Tolerance: Delivery Guarantees Strong productions: produced records are checkpointed before delivery Checkpointing is done in the same atomic write as the state modification If a process restarts, checkpoints are scanned into memory and replayed Checkpoint data is deleted once productions are ACKed Exactly-once delivery and strong productions together make user logic effectively idempotent with respect to framework-level retries
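A sketch of the restart path under strong productions (method names are assumptions): checkpointed records are scanned and resent, and receivers deduplicate them by record ID.

```python
def recover_key(key, state_store, downstream):
    """Replay checkpointed (strong) productions after a process restart (sketch)."""
    for production in state_store.read_pending_productions(key):
        downstream.send_with_retry(production)   # receivers dedupe by record ID
        state_store.delete_production_checkpoint(key, production.id)  # drop once ACKed
```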

Fault Tolerance: Delivery Guarantees Weak productions: broadcast downstream deliveries optimistically, prior to persisting state Each stage waits for the downstream ACKs of its records End-to-end completion time grows with the number of consecutive stages, so the chance of experiencing a failure increases A small fraction of productions are checkpointed, allowing those stages to ACK their senders early
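A rough sketch of that selective checkpointing idea (the fraction, names, and send helpers are assumptions, not the paper's mechanism verbatim):

```python
import random

def produce_weak(record, downstream, state_store, checkpoint_fraction=0.05):
    """
    Weak productions (sketch): send downstream optimistically, before persisting
    state. A small fraction of productions are checkpointed so a stage can ACK
    its sender without waiting for the full chain of downstream ACKs, which
    bounds the effective pipeline depth seen by any one sender.
    """
    if random.random() < checkpoint_fraction:
        state_store.checkpoint_production(record)  # lets us ACK upstream now
        downstream.send_async(record)               # delivery completed later
        return "acked-early"
    downstream.send_with_retry(record)              # wait for the downstream ACK
    return "acked-after-downstream"
```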

Fault Tolerance: Delivery Guarantees Weak Productions:

State Manipulation State: persistent state, user state, and timers The following user-visible guarantees must be satisfied: No data loss Updates must ensure exactly-once semantics All persisted data must be consistent Low watermarks must reflect all pending state in the system Timers must fire in order for a given key

State Manipulation To avoid inconsistencies in persistent state: per-key updates are wrapped in a single atomic operation To avoid stale writes from network remnants (zombie writers): a single writer per key at any given time; a sequencer token is attached to each write; a mediator in front of the backing store checks the sequencer before allowing the write; a new worker invalidates any extant sequencer
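A sketch of the single-writer check (store operations and names are assumptions):

```python
def attempt_write(store, key, value, sequencer):
    """
    Each write carries a sequencer token; the backing-store mediator rejects
    writes whose token has been invalidated, so a zombie worker's stale writes
    cannot land (sketch).
    """
    if not store.is_valid_sequencer(key, sequencer):
        raise PermissionError("stale sequencer: another worker now owns this key")
    store.atomic_write(key, value, sequencer)

def take_ownership(store, key):
    """A new worker invalidates any extant sequencer before it starts writing."""
    return store.issue_new_sequencer(key)  # previously issued tokens become invalid
```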

State Manipulation

System Implementation A distributed system with a dynamic set of host servers Each computation in a pipeline runs on one or more machines Streams are delivered via RPC On each machine, the MillWheel system marshals incoming work, manages process-level metadata, and delegates data to the appropriate computation

System Implementation (cont.) Load distribution and balancing is handled by a master Each computation is divided into a set of lexicographic key intervals Intervals are assigned to a set of machines Depending on load, intervals can be merged or split Low watermarks: a central authority tracks all low watermark values in the system and journals them to persistent state Consumer computations subscribe to the low watermarks of all their senders and use the minimum of those values The central authority’s low watermark value is at most that of the workers’
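A small sketch of how the authority and a consumer might combine these values, assuming per-interval reports keyed by (computation, interval) (a simplification of the slide, not the actual implementation):

```python
def authority_low_watermark(interval_reports):
    """Collapse per-key-interval low watermark reports to one value per computation."""
    per_computation = {}
    for (computation, _interval), lw in interval_reports.items():
        per_computation[computation] = min(per_computation.get(computation, lw), lw)
    return per_computation

def consumer_low_watermark(sender_values):
    """A consumer subscribes to all of its senders and uses the minimum value."""
    return min(sender_values)
```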

Experimentation Latency distribution for records when running over 200 CPUs: median record delay is 3.6 milliseconds and 95th-percentile latency is 30 milliseconds With both strong productions and the exactly-once guarantee enabled, median latency jumps to 33.7 milliseconds and 95th-percentile latency to 90 milliseconds

Experimentation Median latency (50th percentile) stays roughly constant, regardless of system size 99th-percentile latency does get significantly worse as system size increases.

Experimentation Simple three-stage MillWheel pipeline on 200 CPUs Each computation’s low watermark value was polled once per second Latency (low watermark lag) for the three stages is around 1795 ms, 1954 ms, and 2081 ms

Experimentation Load: aggregate CPU load of MillWheel and the storage system

Map-Update System Event: a four-tuple <Stream ID, Timestamp, Key, Value>; keys are non-unique and drawn from a global key space; values are blob fields associated with the key Map: Map(event) → event* Update: functions with memory, kept in a slate S_{U,K} per (update function U, key K); Update(event, slate) → event*
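A minimal sketch of the Map and Update signatures under these definitions (the concrete types and example bodies are illustrative, not Muppet's actual code):

```python
from typing import Dict, List, Tuple

# Event: <stream_id, timestamp, key, value>
Event = Tuple[str, int, str, bytes]
Slate = Dict[str, int]  # per (update function, key) memory; the shape is an assumption

def map_fn(event: Event) -> List[Event]:
    """Map(event) -> event*: stateless; may emit zero or more events."""
    stream_id, ts, key, value = event
    return [("mapped_stream", ts, key.lower(), value)]

def update_fn(event: Event, slate: Slate) -> List[Event]:
    """Update(event, slate) -> event*: stateful; the slate S_{U,K} persists across events."""
    stream_id, ts, key, value = event
    slate["count"] = slate.get("count", 0) + 1
    return [("counts_stream", ts, key, str(slate["count"]).encode())]
```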

Muppet System Distributed Map and Update functions Hashing is used to distribute events across workers Events are passed on to the next function in the workflow

Muppet: Slate management The persistent key-value store Cassandra is used to store slates The K-V store is indexed by <key, column>, with the update function used as the column A cache of key-value records is kept in memory Flash storage is used for persistent state: it speeds up cache warm-up and offers higher random-I/O capacity to facilitate K-V store compaction Configuration knobs: flush rate, quorum for the Cassandra store, TTL for slates
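A sketch of a slate cache in front of the persistent store with a configurable flush rate (class, store interface, and defaults are assumptions for illustration):

```python
class SlateCache:
    """Write-back cache of slates in front of a Cassandra-like K-V store (sketch)."""
    def __init__(self, store, flush_every: int = 100):
        self.store = store        # persistent K-V store, indexed by <key, column>
        self.cache = {}           # in-memory copy of hot slates
        self.dirty = set()
        self.flush_every = flush_every
        self.ops = 0

    def get(self, key, column):
        # column is the update function's name, per the indexing scheme above
        if (key, column) not in self.cache:
            self.cache[(key, column)] = self.store.read(key, column)
        return self.cache[(key, column)]

    def put(self, key, column, slate):
        self.cache[(key, column)] = slate
        self.dirty.add((key, column))
        self.ops += 1
        if self.ops % self.flush_every == 0:  # configurable flush rate
            self.flush()

    def flush(self):
        for key, column in self.dirty:
            self.store.write(key, column, self.cache[(key, column)])
        self.dirty.clear()
```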

Muppet: Failure Handling Machine crash: detected by workers and reported to the master (unlike MapReduce) A hash ring is used to find the next worker In-flight events are lost; low latency is the prime target Updater crash: unflushed changes to the K-V store are lost, and the input queue is also lost Queue overflow: drop incoming events, use special overflow-queue processing, or slow down event passing
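A sketch of the hash-ring lookup used to reroute keys when a worker disappears (a generic consistent-hashing illustration, assuming keys map to the next worker clockwise on the ring):

```python
import hashlib

def ring_position(name: str) -> int:
    """Map a worker name or event key to a position on the hash ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

def owner_for(key: str, workers: list) -> str:
    """Find the worker responsible for a key; a crashed worker is simply removed
    from the list and its keys fall through to the next worker on the ring."""
    ring = sorted(workers, key=ring_position)
    h = ring_position(key)
    for worker in ring:
        if ring_position(worker) >= h:
            return worker
    return ring[0]  # wrap around past the end of the ring
```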

Muppet 2.0 Enhancements Code sharing among workers on the same machine No data passing among workers on the same machine Central cache of slates on each machine Number of workers per machine configured according to the number of cores Primary and secondary queues for updaters on a machine Reduced queue and slate lock contention Better load balancing Reduced hotspots

Muppet: Extensions Changing the number of machines in the cluster on the fly Handling hotspots: exploit commutativity and associativity of computations; slow down the pace of events in the workflow for computations that do not need near-real-time guarantees Placing mappers and updaters close to their data Bulk-reading of slates (side effects?)

Thank You