The Dataflow Model
Introduction
- Huge datasets are very common, and people want to analyze and manipulate them in many different ways
- Current systems all fall short somehow: fault tolerance, scalability, correctness (exactly-once semantics, event-time windowing), or simplicity
- Correctness, latency, and cost can't all be optimized at the same time
- We want a programming model that's simple and flexible
What to design around?
- Input data will never be complete
- Allow the user to easily balance correctness, latency, and cost without worrying about the underlying implementation
- Combine batch, micro-batch, streaming systems, and the Lambda Architecture into one tool
- Driven by the needs of current applications; not about performance
Goals
- Give event-time-ordered results with flexible windowing to tune correctness, latency, and cost
- Allow simple pipeline implementation along four questions (sketched below):
  - What results are being computed
  - Where in event time they are being computed
  - When in processing time they are materialized
  - How earlier results relate to later refinements
- Separate these needs from the underlying implementation
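Apache Beam is the open-source successor of the Dataflow model, so a minimal sketch of how the four questions map onto a pipeline can be written against its Java SDK. The input collection `events`, the window size, and the lateness bound below are hypothetical:

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// `events` is a hypothetical PCollection<KV<String, Integer>> of keyed scores.
PCollection<KV<String, Integer>> sums = events
    .apply(Window.<KV<String, Integer>>into(
            FixedWindows.of(Duration.standardMinutes(2)))  // Where in event time
        .triggering(AfterWatermark.pastEndOfWindow())      // When in processing time
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes())                           // How refinements relate
    .apply(Sum.integersPerKey());                          // What is being computed
```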
Windowing
- Slicing data up into chunks
- Aligned: applied across all the data; unaligned: applied to only a subset of the data (e.g., per key)
- Fixed: set window size (e.g., hourly windows)
- Sliding: window size + slide period
- Sessions: periods of activity terminated by a timeout gap (all three shapes are sketched below)
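The three shapes correspond to built-in WindowFns in Beam's Java SDK; a minimal sketch, where the element type and durations are hypothetical:

```java
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Fixed: hourly windows.
Window.<String>into(FixedWindows.of(Duration.standardHours(1)));

// Sliding: one-hour windows, a new one starting every ten minutes.
Window.<String>into(SlidingWindows.of(Duration.standardHours(1))
    .every(Duration.standardMinutes(10)));

// Sessions: per-key bursts of activity closed by a 30-minute gap.
Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(30)));
```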
Time Domains
- Event time: when the event occurred
- Processing time: when the event is observed by the system
- The gap (skew) between the two changes constantly
- Watermark: an estimate of the earliest event time that has yet to be processed
- It is a heuristic, not totally accurate, which is why data can still arrive late
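A toy sketch of one possible watermark heuristic, assuming the system can track the event times of records that have arrived but are not yet fully processed. The class and bookkeeping are illustrative, not how MillWheel actually implements it:

```java
import java.time.Instant;
import java.util.TreeMap;

/** Toy heuristic watermark: the oldest event time still in flight. */
class WatermarkTracker {
  // Event time -> number of in-flight records with that event time.
  private final TreeMap<Instant, Integer> inFlight = new TreeMap<>();
  private Instant lastSeen = Instant.EPOCH;

  void recordArrival(Instant eventTime) {
    inFlight.merge(eventTime, 1, Integer::sum);
    if (eventTime.isAfter(lastSeen)) lastSeen = eventTime;
  }

  void recordCompletion(Instant eventTime) {
    inFlight.merge(eventTime, -1, Integer::sum);
    if (inFlight.get(eventTime) <= 0) inFlight.remove(eventTime);
  }

  /** Estimate of the earliest event time not yet fully processed. */
  Instant watermark() {
    // Nothing in flight: advance to the newest event time seen.
    return inFlight.isEmpty() ? lastSeen : inFlight.firstKey();
  }
}
```

Because a source can only estimate what data is still coming, a watermark like this can pass an event time before every record with that time has arrived, which is exactly how late data arises.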
Dataflow Model
- ParDo: element-wise transformation, analogous to map
- GroupByKey: key-grouped aggregation, analogous to reduce
(Both primitives are sketched below.)
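A minimal sketch of the two primitives in Beam's Java SDK; the `words` input collection is hypothetical:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// `words` is a hypothetical PCollection<String>.
PCollection<KV<String, Iterable<Integer>>> grouped = words
    // ParDo: element-wise processing, like "map" in MapReduce.
    .apply(ParDo.of(new DoFn<String, KV<String, Integer>>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(KV.of(c.element(), 1));
      }
    }))
    // GroupByKey: collect all values for a key,
    // like the shuffle feeding a reduce.
    .apply(GroupByKey.<String, Integer>create());
```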
Windowing
- Treat all windowing as unaligned: a simpler model, with aligned cases optimized under the hood
- Window assignment: makes a new copy of the element for each window it falls into; can happen at any point in the pipeline (even after other transformations)
- Window merging (needed for sessions) happens inside GroupByKeyAndWindow, which decomposes into DropTimestamps, GroupByKey, MergeWindows, GroupAlsoByWindow, and ExpandToElements (the merge step is sketched below)
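A toy sketch of just the MergeWindows step for sessions, assuming each element was first assigned a proto-session window [start, start + gap); overlapping proto-sessions for the same key collapse into one session. The class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy MergeWindows for sessions: collapse overlapping proto-windows. */
class SessionMerger {
  record Window(long start, long end) {}

  static List<Window> merge(List<Window> protoSessions) {
    List<Window> sorted = new ArrayList<>(protoSessions);
    sorted.sort(Comparator.comparingLong(Window::start));
    List<Window> merged = new ArrayList<>();
    for (Window w : sorted) {
      int last = merged.size() - 1;
      if (!merged.isEmpty() && w.start() <= merged.get(last).end()) {
        // Overlaps the previous session: extend it.
        merged.set(last, new Window(merged.get(last).start(),
            Math.max(merged.get(last).end(), w.end())));
      } else {
        merged.add(w);
      }
    }
    return merged;
  }
}
```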
Triggers & Incremental Processing
- When should a window end and emit results?
- Watermarks alone are either too fast (late data is missed) or too slow (a single slow datum holds everything back), so the model features triggers
- A trigger determines when to emit results (a composed example is sketched below); it can fire on:
  - Watermark progress
  - Points in processing time
  - Data arrival
  - User-defined events
- On triggering, earlier results can be handled in three ways:
  - Discard previous results
  - Accumulate new data on top of previous results
  - Accumulate & retract: accumulate and store the value, and also retract the previously emitted value; useful if multiple triggers fire on the same window
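A sketch in Beam's Java SDK composing the firing conditions named above; the element type, window size, delays, and lateness bound are hypothetical:

```java
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

Window.<String>into(FixedWindows.of(Duration.standardMinutes(2)))
    .triggering(AfterWatermark.pastEndOfWindow()       // fire on watermark progress
        .withEarlyFirings(AfterProcessingTime          // plus periodic early panes
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))
        .withLateFirings(AfterPane.elementCountAtLeast(1)))  // re-fire on late data
    .withAllowedLateness(Duration.standardMinutes(30))
    .accumulatingFiredPanes();  // or .discardingFiredPanes() to drop prior panes
```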
Examples
Batch
- Single global window
1-minute triggers
- Accumulating vs. discarding output modes (both sketched below)
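A sketch of this configuration in Beam's Java SDK: a global window that fires once per minute of processing time. The element type is hypothetical:

```java
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

Window.<String>into(new GlobalWindows())
    .triggering(Repeatedly.forever(AfterProcessingTime
        .pastFirstElementInPane()
        .plusDelayOf(Duration.standardMinutes(1))))
    .withAllowedLateness(Duration.ZERO)
    // Accumulating: each pane repeats everything seen so far.
    // Swap in .discardingFiredPanes() to emit only per-minute deltas.
    .accumulatingFiredPanes();
```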
Tuple-based
- Fire after a fixed number of records arrive, regardless of time (sketched below)
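A sketch of a tuple-based (record-count) trigger in Beam's Java SDK; the element type and the count of two are arbitrary:

```java
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

Window.<String>into(new GlobalWindows())
    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(2)))
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes();  // emit each group of records exactly once
```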
Batch & Micro-Batch (Event time)
- Batch: wait for all the data, then process it in event time
- Micro-batch: wait for each minute's worth of data, processing each batch in event time
Streaming: Fixed Windows
- Watermark trigger only: wait for the watermark to pass the end of each window, retriggering on late data; poor latency
- Watermark trigger plus processing-time-based early triggers: higher cost, but lower latency than micro-batch
Session windowing + Retractions
- Sessions with a 1-minute timeout gap
- Early results from 1-minute processing-time triggers
- Final results once the watermark passes; when sessions merge, the earlier partial results are retracted (setup sketched below)
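A sketch of the session setup in Beam's Java SDK. Beam's public Window API offers accumulating and discarding modes but not the paper's accumulate-and-retract mode, so accumulating panes stand in for it here; the element type and lateness bound are hypothetical:

```java
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Sessions closed by a 1-minute gap; early panes once per minute of
// processing time, final panes when the watermark passes the session.
Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(1)))
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1))))
    .withAllowedLateness(Duration.standardMinutes(10))
    .accumulatingFiredPanes();
```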
Implementation & Design
- Built on FlumeJava, a library for writing parallel pipelines, and MillWheel, a framework for building streaming pipelines
- Fault tolerance is offloaded to these underlying systems
- Design principles:
  - Data is never-ending
  - Flexibility: support many kinds of pipelines
  - Improve on previous execution engines
  - Make code clear
Use Cases
- Combining streaming and batch pipelines into one implementation
- Tracking sessions
- Billing: eliminates extra logic for dealing with late data
- Aggregating statistics: percentile watermarks
- Batch processing: eliminates extra logic for early termination
- Building recommendations: processing-time triggers
- Anomaly detection: data-driven triggers
My own questions
- What does moving from MillWheel + FlumeJava to Dataflow look like? Are fewer lines of code needed?
- Can this system support record-at-a-time processing with very low latency? Is there any performance overhead?