1
Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, Sam Whittle
Google
2
Motivation and requirements
Introduction
High-level overview of the system
Fundamental abstractions of the MillWheel model
API
Fault tolerance
System implementation
Experimental results
Related work
3
Records over one-second intervals (buckets) are compared to the expected traffic that the model predicts
A consistent mismatch over n windows indicates that a query is spiking or dipping
The model is updated with newly received data
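As a rough, hypothetical illustration of that check (not Zeitgeist's actual code), the per-bucket comparison might look like the sketch below; the class name, the tolerance parameter, and the mismatch test are all assumptions.

    #include <cmath>
    #include <cstddef>
    #include <deque>

    // Hypothetical sketch: compare each one-second bucket of observed query
    // counts against the model's prediction, and flag a suspected spike/dip
    // after n consecutive mismatching windows.
    class SpikeDipDetector {
     public:
      SpikeDipDetector(int n_windows, double tolerance)
          : n_windows_(n_windows), tolerance_(tolerance) {}

      // Feed one bucket; returns true once the last n buckets have all
      // disagreed with the model's prediction.
      bool AddBucket(double observed, double predicted) {
        const bool mismatch =
            std::fabs(observed - predicted) > tolerance_ * predicted;
        recent_.push_back(mismatch);
        if (recent_.size() > static_cast<std::size_t>(n_windows_)) {
          recent_.pop_front();
        }
        if (recent_.size() < static_cast<std::size_t>(n_windows_)) return false;
        for (const bool m : recent_) {
          if (!m) return false;
        }
        return true;  // consistent mismatch over n windows
      }

     private:
      const int n_windows_;      // consecutive windows required
      const double tolerance_;   // relative deviation counted as a mismatch
      std::deque<bool> recent_;  // mismatch flags for the most recent buckets
    };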
4
Requires both short-term and long-term storage
Needs duplicate prevention, as duplicate record deliveries could cause spurious spikes
Should distinguish whether expected data is merely delayed or actually absent
◦ MillWheel uses the low watermark for this
5
Real-time processing of data
Persistent state abstractions exposed to the user
Handling of out-of-order data
Constant latency as the system scales to more machines
Guaranteed exactly-once delivery of records
6
A framework for building low-latency data-processing applications
The system manages
◦ Persistent state
◦ The continuous flow of records
◦ Fault-tolerance guarantees
Provides a notion of logical time
7
Provides fault tolerance at the framework level
◦ Correctness is ensured in case of failure
◦ Records are handled in an idempotent fashion
Ensures exactly-once delivery of records from the user's perspective
Checkpointing is done at a fine granularity
◦ Eliminates buffering of pending data for long periods between checkpoints
8
9
At a high level, MillWheel is a graph of computation nodes
◦ Users specify a directed computation graph and application code for the individual nodes
◦ Each node takes input and produces output
◦ Computations are also called transformations
Transformations are parallelized
◦ Users are not concerned with load balancing at a fine-grained level
10
Users can add and remove computations dynamically
All internal updates are atomically checkpointed per key
◦ User code can access a per-key, per-computation persistent store
◦ Allows for powerful per-key aggregations
◦ Uses a replicated, highly available data store (e.g. Spanner)
11
Inputs and outputs in MillWheel are represented by triples
◦ (key, value, timestamp)
◦ key is a metadata field with semantic meaning for the computation
◦ value is an arbitrary byte string
◦ timestamp can be an arbitrary value
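A minimal sketch of the triple as a struct; the concrete field types here are assumptions (the paper treats the value as an opaque byte string).

    #include <cstdint>
    #include <string>

    // Sketch of a MillWheel-style record: the key carries semantic meaning
    // for the computation, the value is an opaque byte string, and the
    // timestamp is an arbitrary value assigned by the producer.
    struct Record {
      std::string key;       // e.g. a search query or a cookie id
      std::string value;     // arbitrary, application-defined byte string
      int64_t timestamp_us;  // arbitrary timestamp, e.g. microseconds since epoch
    };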
12
13
Computations hold the application logic
Code is invoked upon receipt of input data
Code operates in the context of a single key
14
Keys are the abstraction for aggregation and comparison between different records (similar to MapReduce)
A key-extraction function (specified by the consumer) assigns a key to each record
Computation code is run in the context of a specific key and accesses state for that key only
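For instance (a purely hypothetical sketch), a key-extraction function for a query-counting computation could look like this; the tab-separated value layout is an illustrative assumption.

    #include <string>

    // Hypothetical consumer-specified key-extraction function: key each record
    // by the search query it contains, so all records for the same query are
    // aggregated and compared together.
    std::string ExtractKey(const std::string& record_value) {
      // Take the first tab-separated field of the value as the key.
      const std::string::size_type tab = record_value.find('\t');
      return tab == std::string::npos ? record_value : record_value.substr(0, tab);
    }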
15
Provides a bound on the timestamps of future records
The low watermark of a computation A is
◦ min(oldest work of A, low watermark of C)
◦ oldest work of A is the timestamp of the oldest unfinished record in A
◦ C ranges over the computations that produce output consumed by A
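Read as code, that recursive definition amounts to taking a minimum; the function and parameter names below are illustrative only.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Sketch of the recursive low-watermark definition:
    //   low_watermark(A) = min(oldest pending work of A,
    //                          low_watermark(C) for every C that sends to A)
    int64_t LowWatermark(int64_t oldest_pending_work,
                         const std::vector<int64_t>& upstream_watermarks) {
      int64_t wm = oldest_pending_work;
      for (int64_t upstream : upstream_watermarks) {
        wm = std::min(wm, upstream);
      }
      return wm;  // no future record should carry a timestamp below this bound
    }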
16
17
Timers are per-key programmatic hooks that trigger at a specific wall time or low watermark value
Timers are journaled in persistent state and can survive process restarts and machine failures
When fired, a timer runs the specified user function and provides the same exactly-once guarantees as record deliveries
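As a loose illustration of what gets journaled, a persisted timer entry might carry fields like the following; this layout is an assumption, not MillWheel's actual schema.

    #include <cstdint>
    #include <string>

    // Sketch of a journaled per-key timer entry: persisted so it survives
    // process restarts, and fired either at a wall-clock time or when the
    // low watermark passes the given value.
    struct TimerEntry {
      std::string key;          // timers are per-key
      std::string tag;          // user-chosen name, e.g. "window_end_10:00:01"
      int64_t fire_at_us;       // wall time or low-watermark threshold
      bool is_watermark_timer;  // true: fire when the low watermark >= fire_at_us
    };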
18
19
The user implements a custom subclass of the Computation class (see the sketch below)
◦ Provides methods for accessing MillWheel abstractions
◦ The ProcessRecord and ProcessTimer hooks provide the two main entry points into user code
◦ Hooks are triggered in reaction to record receipt and timer expiration
Per-key serialization is handled at the framework level
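The entry points described above can be pictured roughly as the following C++ interface, in the spirit of the API figure in the paper; the exact types and signatures here are simplified assumptions.

    #include <cstdint>
    #include <string>

    // Simplified stand-ins for the framework's record and timer types.
    struct Record { std::string key, value; int64_t timestamp; };
    struct Timer  { std::string key, tag;   int64_t time; };

    // Sketch of the user-facing API: users subclass Computation and override
    // the two hooks, which the framework always invokes in the context of a
    // single key; SetTimer and ProduceRecord expose MillWheel abstractions.
    class Computation {
     public:
      virtual ~Computation() = default;

      // Hooks invoked by the framework on record receipt and timer expiration.
      virtual void ProcessRecord(const Record& record) = 0;
      virtual void ProcessTimer(const Timer& timer) = 0;

     protected:
      // Accessors to framework abstractions (illustrative signatures).
      void SetTimer(const std::string& tag, int64_t time);
      void ProduceRecord(const Record& record, const std::string& output_stream);
      std::string* MutablePersistentState();  // per-key, per-computation state
    };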
20
21
22
23
Each computation calculates a low watermark value for all of its pending work
◦ Users rarely deal with low watermarks directly
◦ Users manipulate them indirectly through the timestamps they assign to records
Injectors bring external data into the pipeline and seed low watermark values for the rest of the pipeline
◦ If an injector is distributed across multiple processes, the minimum watermark among all processes is used
24
Exactly-Once Delivery
◦ Steps performed on receipt of an input record for a computation (see the sketch below):
1. The record is checked against deduplication data; duplicates are discarded
2. User code is run for the input record
3. Pending changes are committed to the backing store
4. The sender is ACKed
5. Pending productions are sent (retried until they are ACKed)
◦ The system assigns unique IDs to all records at production time; these are stored so duplicate records can be identified during retries
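A hypothetical sketch of those five steps as the framework might execute them; the class, the in-memory dedup set, and the stubbed storage/RPC calls are stand-ins for MillWheel's real machinery.

    #include <cstdint>
    #include <set>
    #include <string>
    #include <vector>

    // Record as delivered downstream, carrying the unique ID assigned at
    // production time so retried deliveries can be recognized.
    struct Delivery {
      int64_t record_id;
      std::string key, value;
    };

    class ReceivePath {
     public:
      void OnDelivery(const Delivery& d) {
        // 1. Check the record ID against IDs already seen (dedup on retry).
        if (seen_ids_.count(d.record_id)) { AckSender(d); return; }

        // 2. Run user code; it may update state, set timers, produce records.
        std::vector<std::string> pending = RunUserCode(d);

        // 3. Commit state changes, the new record ID, and (with strong
        //    productions) the pending productions in one atomic write.
        seen_ids_.insert(d.record_id);
        CommitAtomically(d, pending);

        // 4. ACK the sender so it stops retrying this delivery.
        AckSender(d);

        // 5. Send the productions downstream, retrying until they are ACKed.
        for (const std::string& p : pending) SendUntilAcked(p);
      }

     private:
      std::set<int64_t> seen_ids_;  // stands in for persistent dedup state
      std::vector<std::string> RunUserCode(const Delivery&) { return {}; }
      void CommitAtomically(const Delivery&, const std::vector<std::string>&) {}
      void AckSender(const Delivery&) {}
      void SendUntilAcked(const std::string&) {}
    };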
25
Strong Productions
◦ Produced records are checkpointed before delivery
◦ Checkpointing is done in the same atomic write as the state modification
◦ If a process restarts, checkpoints are scanned into memory and the productions are replayed (see the sketch below)
◦ Checkpoint data is deleted once the productions are ACKed
Exactly-once delivery and strong productions together ensure that user logic is effectively idempotent
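The restart path can be pictured with a sketch like the one below; the names and the storage/RPC stubs are assumptions.

    #include <string>
    #include <vector>

    // Hypothetical sketch of the recovery side of strong productions: after a
    // restart, checkpointed-but-unacknowledged productions are scanned back
    // into memory and resent; the consumer's dedup by record ID makes the
    // replay safe, and each checkpoint is deleted once delivery is ACKed.
    struct CheckpointedProduction {
      std::string checkpoint_key;  // location of the checkpoint in the store
      std::string record;          // the produced record, written before delivery
    };

    // Stubs standing in for the RPC and storage layers.
    void SendAndWaitForAck(const std::string& /*record*/) {}
    void DeleteCheckpoint(const std::string& /*checkpoint_key*/) {}

    void ReplayAfterRestart(const std::vector<CheckpointedProduction>& checkpoints) {
      for (const CheckpointedProduction& cp : checkpoints) {
        SendAndWaitForAck(cp.record);         // consumer discards duplicates by ID
        DeleteCheckpoint(cp.checkpoint_key);  // safe to drop once ACKed
      }
    }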
26
Some computations may already be idempotent
◦ For these, strong productions and/or exactly-once delivery can be disabled
Weak Productions
◦ Downstream deliveries are broadcast optimistically, prior to persisting state
◦ Each stage waits for the downstream ACKs of its records
◦ The completion time of consecutive stages increases, so the chance of experiencing a failure increases
27
To overcome this, a small fraction of productions are checkpointed, allowing those stages to ACK their senders without waiting for downstream ACKs
This selective checkpointing can both improve end-to-end latency and reduce overall resource consumption
28
The following user-visible guarantees must be satisfied:
◦ No data loss
◦ Updates must ensure exactly-once semantics
◦ All persisted data must be consistent
◦ Low watermarks must reflect all pending state in the system
◦ Timers must fire in order for a given key
29
To avoid inconsistencies in persistent state
◦ Per-key updates are wrapped in a single atomic operation
To avoid stale writes from network remnants and zombie workers
◦ A sequencer token is attached to each write
◦ The mediator of the backing store checks the sequencer before allowing the write
◦ A new worker invalidates any extant sequencers
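A simplified, single-process sketch of that check; the class and method names, and the integer-token framing, are assumptions rather than the actual implementation.

    #include <cstdint>
    #include <string>

    // Sketch of the sequencer mechanism: every write carries the sequencer
    // token of the worker that issued it, and the storage mediator rejects
    // writes whose token has been invalidated (e.g. because a new worker took
    // over the key range). This keeps stale "zombie" writes out of state.
    class StorageMediator {
     public:
      // Called when a new worker takes ownership: invalidates the old sequencer.
      void TakeOwnership(int64_t new_sequencer) { current_sequencer_ = new_sequencer; }

      // Applies the write only if it comes from the current owner.
      bool TryWrite(int64_t write_sequencer, const std::string& /*mutation*/) {
        if (write_sequencer != current_sequencer_) return false;  // stale writer
        // ... apply the mutation atomically ...
        return true;
      }

     private:
      int64_t current_sequencer_ = 0;
    };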
30
31
MillWheel runs as a distributed system on a dynamic set of host servers
Each computation in a pipeline runs on one or more machines
◦ Streams are delivered via RPC
On each machine, the MillWheel system
◦ Marshals incoming work
◦ Manages process-level metadata
◦ Delegates data to the appropriate computation
32
Load distribution and balancing are handled by a master
◦ Each computation is divided into a set of lexicographic key intervals
◦ Intervals are assigned to a set of machines
◦ Depending on load, intervals can be merged or split
Low Watermarks
◦ A central authority tracks all low watermark values in the system and journals them to persistent state
33
In-memory data structures are used to store aggregated timestamp data
Consumer computations subscribe to low watermark updates from all of their senders
◦ They use the minimum of all reported values
◦ The central authority's low watermark values are at least as conservative as those of the workers
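A minimal sketch (names assumed) of that in-memory bookkeeping: keep the latest low watermark reported by each sender and use the minimum as the consumer's bound.

    #include <algorithm>
    #include <cstdint>
    #include <limits>
    #include <map>
    #include <string>

    // Sketch of a consumer's view of its senders' low watermarks.
    class SenderWatermarks {
     public:
      // Record the most recent value reported by a given sender.
      void OnWatermarkUpdate(const std::string& sender, int64_t watermark) {
        latest_[sender] = watermark;
      }

      // The consumer's effective bound: the minimum over all senders.
      int64_t Min() const {
        int64_t wm = std::numeric_limits<int64_t>::max();
        for (const auto& entry : latest_) wm = std::min(wm, entry.second);
        return wm;
      }

     private:
      std::map<std::string, int64_t> latest_;  // sender name -> last reported value
    };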
34
Latency distribution for records when running on 200 CPUs
◦ With strong productions and exactly-once delivery disabled, the median record delay is 3.6 milliseconds and the 95th-percentile latency is 30 milliseconds
◦ With both enabled, the median latency jumps to 33.7 milliseconds
35
Median latency stays roughly constant regardless of system size
99th-percentile latency does get significantly worse as the system size increases
36
A simple three-stage MillWheel pipeline was run on 200 CPUs
Each computation's low watermark value was polled once per second
37
Increasing the available cache improves CPU usage roughly linearly (after 550MB most data is cached, so further increases were not helpful)