The Dataflow Model
Introduction
- Huge datasets are very common, and people want to analyze and manipulate them in many different ways
- Current systems all fall short somehow: fault tolerance, scalability, correctness (exactly-once semantics, event-time windowing), or simplicity
- Correctness, latency, and cost can't all be optimized at the same time
- We want a programming model that's simple and flexible
What to design around?
- Input data will never be complete
- Allow the user to easily balance correctness, latency, and cost without worrying about the underlying implementation
- Combine batch, micro-batch, streaming systems, and the Lambda Architecture into one tool
- Driven by the needs of current applications; not about performance
Goals
- Give event-time-ordered results with flexible windowing to tune correctness, latency, and cost
- Allow simple pipeline implementation along four questions (sketched below):
  - What results are being computed
  - Where in event time they are being computed
  - When in processing time they are materialized
  - How earlier results relate to later refinements
- Separate these needs from the underlying implementation
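Apache Beam is the open-source successor of the Dataflow model, so a minimal sketch of how the four questions map onto a pipeline can be written against its Java SDK. The input collection `events`, the window size, and the lateness bound below are hypothetical:

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// `events` is a hypothetical PCollection<KV<String, Integer>> of keyed scores.
PCollection<KV<String, Integer>> sums = events
    .apply(Window.<KV<String, Integer>>into(
            FixedWindows.of(Duration.standardMinutes(2)))  // Where in event time
        .triggering(AfterWatermark.pastEndOfWindow())      // When in processing time
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes())                           // How refinements relate
    .apply(Sum.integersPerKey());                          // What is being computed
```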
Windowing
- Slicing data up into chunks
- Aligned: applied across all the data; unaligned: applied to only a subset of the data (e.g., per key)
- Fixed: set window size (e.g., hourly windows)
- Sliding: window size + slide period
- Sessions: periods of activity terminated by a timeout gap (all three shapes are sketched below)
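The three shapes correspond to built-in WindowFns in Beam's Java SDK; a minimal sketch, where the element type and durations are hypothetical:

```java
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Fixed: hourly windows.
Window.<String>into(FixedWindows.of(Duration.standardHours(1)));

// Sliding: one-hour windows, a new one starting every ten minutes.
Window.<String>into(SlidingWindows.of(Duration.standardHours(1))
    .every(Duration.standardMinutes(10)));

// Sessions: per-key bursts of activity closed by a 30-minute gap.
Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(30)));
```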
Time Domains
- Event time: when the event occurred
- Processing time: when the event is observed by the system
- The gap (skew) between the two changes constantly
- Watermark: an estimate of the earliest event time that has yet to be processed
- It is a heuristic, not totally accurate, which is why data can still arrive late
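A toy sketch of one possible watermark heuristic, assuming the system can track the event times of records that have arrived but are not yet fully processed. The class and bookkeeping are illustrative, not how MillWheel actually implements it:

```java
import java.time.Instant;
import java.util.TreeMap;

/** Toy heuristic watermark: the oldest event time still in flight. */
class WatermarkTracker {
  // Event time -> number of in-flight records with that event time.
  private final TreeMap<Instant, Integer> inFlight = new TreeMap<>();
  private Instant lastSeen = Instant.EPOCH;

  void recordArrival(Instant eventTime) {
    inFlight.merge(eventTime, 1, Integer::sum);
    if (eventTime.isAfter(lastSeen)) lastSeen = eventTime;
  }

  void recordCompletion(Instant eventTime) {
    inFlight.merge(eventTime, -1, Integer::sum);
    if (inFlight.get(eventTime) <= 0) inFlight.remove(eventTime);
  }

  /** Estimate of the earliest event time not yet fully processed. */
  Instant watermark() {
    // Nothing in flight: advance to the newest event time seen.
    return inFlight.isEmpty() ? lastSeen : inFlight.firstKey();
  }
}
```

Because a source can only estimate what data is still coming, a watermark like this can pass an event time before every record with that time has arrived, which is exactly how late data arises.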
Dataflow Model
- ParDo: element-wise transformation, analogous to map
- GroupByKey: key-grouped aggregation, analogous to reduce
(Both primitives are sketched below.)
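A minimal sketch of the two primitives in Beam's Java SDK; the `words` input collection is hypothetical:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// `words` is a hypothetical PCollection<String>.
PCollection<KV<String, Iterable<Integer>>> grouped = words
    // ParDo: element-wise processing, like "map" in MapReduce.
    .apply(ParDo.of(new DoFn<String, KV<String, Integer>>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(KV.of(c.element(), 1));
      }
    }))
    // GroupByKey: collect all values for a key,
    // like the shuffle feeding a reduce.
    .apply(GroupByKey.<String, Integer>create());
```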
Windowing
- Treat all windowing as unaligned: a simpler model, with aligned cases optimized under the hood
- Window assignment: makes a new copy of the element for each window it falls into; can happen at any point in the pipeline (even after other transformations)
- Window merging (needed for sessions) happens inside GroupByKeyAndWindow, which decomposes into DropTimestamps, GroupByKey, MergeWindows, GroupAlsoByWindow, and ExpandToElements (the merge step is sketched below)
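A toy sketch of just the MergeWindows step for sessions, assuming each element was first assigned a proto-session window [start, start + gap); overlapping proto-sessions for the same key collapse into one session. The class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy MergeWindows for sessions: collapse overlapping proto-windows. */
class SessionMerger {
  record Window(long start, long end) {}

  static List<Window> merge(List<Window> protoSessions) {
    List<Window> sorted = new ArrayList<>(protoSessions);
    sorted.sort(Comparator.comparingLong(Window::start));
    List<Window> merged = new ArrayList<>();
    for (Window w : sorted) {
      int last = merged.size() - 1;
      if (!merged.isEmpty() && w.start() <= merged.get(last).end()) {
        // Overlaps the previous session: extend it.
        merged.set(last, new Window(merged.get(last).start(),
            Math.max(merged.get(last).end(), w.end())));
      } else {
        merged.add(w);
      }
    }
    return merged;
  }
}
```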
Triggers & Incremental Processing
- When should a window end and emit results?
- Watermarks alone are either too fast (late data is missed) or too slow (a single slow datum holds everything back), so the model features triggers
- A trigger determines when to emit results (a composed example is sketched below); it can fire on:
  - Watermark progress
  - Points in processing time
  - Data arrival
  - User-defined events
- On triggering, earlier results can be handled in three ways:
  - Discard previous results
  - Accumulate new data on top of previous results
  - Accumulate & retract: accumulate and store the value, and also retract the previously emitted value; useful if multiple triggers fire on the same window
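A sketch in Beam's Java SDK composing the firing conditions named above; the element type, window size, delays, and lateness bound are hypothetical:

```java
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

Window.<String>into(FixedWindows.of(Duration.standardMinutes(2)))
    .triggering(AfterWatermark.pastEndOfWindow()       // fire on watermark progress
        .withEarlyFirings(AfterProcessingTime          // plus periodic early panes
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))
        .withLateFirings(AfterPane.elementCountAtLeast(1)))  // re-fire on late data
    .withAllowedLateness(Duration.standardMinutes(30))
    .accumulatingFiredPanes();  // or .discardingFiredPanes() to drop prior panes
```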
Examples
Batch
- Single global window
1-minute triggers
- Accumulating vs. discarding output modes (both sketched below)
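A sketch of this configuration in Beam's Java SDK: a global window that fires once per minute of processing time. The element type is hypothetical:

```java
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

Window.<String>into(new GlobalWindows())
    .triggering(Repeatedly.forever(AfterProcessingTime
        .pastFirstElementInPane()
        .plusDelayOf(Duration.standardMinutes(1))))
    .withAllowedLateness(Duration.ZERO)
    // Accumulating: each pane repeats everything seen so far.
    // Swap in .discardingFiredPanes() to emit only per-minute deltas.
    .accumulatingFiredPanes();
```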
Tuple-based
- Fire after a fixed number of records arrive, regardless of time (sketched below)
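A sketch of a tuple-based (record-count) trigger in Beam's Java SDK; the element type and the count of two are arbitrary:

```java
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

Window.<String>into(new GlobalWindows())
    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(2)))
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes();  // emit each group of records exactly once
```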
Batch & Micro-Batch (Event time)
- Batch: wait for all the data, then process it in event time
- Micro-batch: wait for each minute's worth of data, processing each batch in event time
Streaming: Fixed Windows
- Watermark trigger only: wait for the watermark to pass the end of each window, retriggering on late data; poor latency
- Watermark trigger plus processing-time-based early triggers: higher cost, but lower latency than micro-batch
Session windowing + Retractions
- Sessions with a 1-minute timeout gap
- Early results from 1-minute processing-time triggers
- Final results once the watermark passes; when sessions merge, the earlier partial results are retracted (setup sketched below)
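A sketch of the session setup in Beam's Java SDK. Beam's public Window API offers accumulating and discarding modes but not the paper's accumulate-and-retract mode, so accumulating panes stand in for it here; the element type and lateness bound are hypothetical:

```java
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Sessions closed by a 1-minute gap; early panes once per minute of
// processing time, final panes when the watermark passes the session.
Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(1)))
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1))))
    .withAllowedLateness(Duration.standardMinutes(10))
    .accumulatingFiredPanes();
```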
Implementation & Design
- Built on FlumeJava, a library for writing parallel pipelines, and MillWheel, a framework for building streaming pipelines
- Fault tolerance is offloaded to these underlying systems
- Design principles:
  - Data is never-ending
  - Flexibility: support many kinds of pipelines
  - Improve on previous execution engines
  - Make code clear
Use Cases
- Combining streaming and batch pipelines into one implementation
- Tracking sessions
- Billing: eliminates extra logic for dealing with late data
- Aggregating statistics: percentile watermarks
- Batch processing: eliminates extra logic for early termination
- Building recommendations: processing-time triggers
- Anomaly detection: data-driven triggers
My own questions
- What does moving from MillWheel + FlumeJava to Dataflow look like? Are fewer lines of code needed?
- Can this system support record-at-a-time processing with very low latency? Is there any performance overhead?