The Dataflow Model.

Introduction
- Huge datasets are very common
- People want to analyze and manipulate them in many different ways
- Current systems all fall short somehow:
  - Fault tolerance
  - Scalability
  - Correctness
  - Exactly-once semantics
  - Event-time windowing
  - Simplicity
- Can't optimize correctness, latency, and cost at the same time
- Want a programming model that's simple and flexible

What to design around?
- Input data will never be complete
- Allow the user to easily balance correctness, latency, and cost without worrying about the underlying implementation
- Combine batch, micro-batch, streaming systems, and the lambda architecture into one tool
- Needs of current applications
- Not about performance

Goals
- Give event-time ordered results with flexible windowing to tune correctness, latency, and cost
- Allow simple pipeline implementation:
  - What results are being computed
  - Where in event time they are being computed
  - When in processing time they are materialized
  - How earlier results relate to later refinements
- Separate needs from underlying implementation

Windowing
- Slicing data up into chunks
- Aligned: applied across all data
- Unaligned: applied on a subset of the data
- Fixed: set window size (e.g. hourly windows)
- Sliding: window size + slide period
- Sessions: period of activity, closed by a timeout gap
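The three window types above can be sketched as functions that map an event timestamp to the window(s) it belongs to. This is a minimal illustration, not the paper's implementation: the function names, the integer timestamps, and the half-open `(start, end)` tuple representation are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical representation: a window is a half-open interval [start, end)
# in event time, with timestamps as plain integers (e.g. seconds).
Window = Tuple[int, int]


def fixed_windows(ts: int, size: int) -> List[Window]:
    """Fixed windows: each timestamp falls into exactly one window."""
    start = ts - ts % size
    return [(start, start + size)]


def sliding_windows(ts: int, size: int, period: int) -> List[Window]:
    """Sliding windows: a new window of length `size` starts every `period`,
    so one timestamp can fall into several overlapping windows."""
    last_start = ts - ts % period
    # Walk backwards over every window start that still covers ts.
    starts = range(last_start, ts - size, -period)
    return sorted((s, s + size) for s in starts)


def session_proto_window(ts: int, gap: int) -> List[Window]:
    """Sessions: each element starts as its own window [ts, ts + gap);
    overlapping windows for the same key are merged later."""
    return [(ts, ts + gap)]
```

For instance, with a window size of 4 and a slide period of 2, timestamp 5 lands in the two windows `(2, 6)` and `(4, 8)`.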

Time Domains
- Event time: when the event occurred
- Processing time: when the event is processed
- Constantly changing gap between the two
- Watermark: estimate of the earliest event time that has yet to be processed
  - Heuristic: not totally accurate
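To make the idea of a heuristic watermark concrete, here is a small sketch. It assumes one particular heuristic that the slides do not specify: the watermark trails the largest event time seen so far by a fixed `max_skew`. The class and method names are hypothetical.

```python
class HeuristicWatermark:
    """Assumed heuristic (not from the slides): events arrive at most
    `max_skew` time units out of order, so the watermark is the largest
    event time observed so far minus `max_skew`."""

    def __init__(self, max_skew: int) -> None:
        self.max_skew = max_skew
        self.max_event_time = None  # no events observed yet

    def observe(self, event_time: int) -> None:
        """Record the event time of an element as it is processed."""
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

    def current(self) -> float:
        """Current estimate: no unprocessed event should be older than this."""
        if self.max_event_time is None:
            return float("-inf")
        return self.max_event_time - self.max_skew

    def is_late(self, event_time: int) -> bool:
        """An element behind the watermark is late data: the estimate was wrong."""
        return event_time < self.current()
```

Because the estimate is only a guess, an element can still show up behind it, which is exactly the late-data case the model has to handle.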

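Dataflow Model
- ParDo: element-wise parallel processing, comparable to Map
- GroupByKey: collects values by key, comparable to the grouping step ahead of Reduce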
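The two primitives can be mimicked on in-memory lists to show their shape. This is a toy sketch in plain Python, not the Dataflow SDK: `par_do` and `group_by_key` are hypothetical helper names, and a real system would run them in parallel over distributed data.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


def par_do(fn: Callable[[object], Iterable[object]],
           elements: Iterable[object]) -> List[object]:
    """Apply fn to every element; each call may emit zero or more outputs."""
    out: List[object] = []
    for element in elements:
        out.extend(fn(element))
    return out


def group_by_key(pairs: Iterable[Tuple[Hashable, object]]) -> Dict[Hashable, List[object]]:
    """Collect all values that share a key."""
    groups: Dict[Hashable, List[object]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


# Word count: ParDo emits (word, 1) pairs, GroupByKey gathers them per word.
pairs = par_do(lambda line: [(w, 1) for w in line.split()], ["a b", "a"])
counts = {word: sum(ones) for word, ones in group_by_key(pairs).items()}
```

Note that `group_by_key` needs to see all values for a key before it can emit, which is why unbounded input requires windowing first.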

Windowing
- Treat all windowing as unaligned: simpler
  - Optimize under the hood
- Window Assignment
  - New copy of the element for each window
  - Can happen at any point in the pipeline (after other transformations)
- Window Merging (for sessions)
  - GroupByKeyAndWindow:
    - DropTimestamps
    - GroupByKey
    - MergeWindows
    - GroupAlsoByWindow
    - ExpandToElements
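The GroupByKeyAndWindow steps listed above can be traced on a small in-memory example with session windows. This is a sketch under assumptions: elements are `(key, value, timestamp)` tuples with integer timestamps, windows are half-open `(start, end)` tuples, and the output timestamp is set to the end of the merged window (one valid choice among several). The function names are hypothetical.

```python
from collections import defaultdict


def merge_windows(windows):
    """MergeWindows: fold overlapping half-open windows into one."""
    merged = []
    for start, end in sorted(windows):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def group_by_key_and_window(elements, gap):
    """elements: iterable of (key, value, timestamp) tuples."""
    # Window assignment + DropTimestamps + GroupByKey:
    # keep only (value, window) per key, the raw timestamp is no longer needed.
    by_key = defaultdict(list)
    for key, value, ts in elements:
        by_key[key].append((value, (ts, ts + gap)))

    results = []
    for key, items in by_key.items():
        merged = merge_windows([window for _, window in items])
        # GroupAlsoByWindow: route each value to the merged window covering it.
        by_window = defaultdict(list)
        for value, (start, _end) in items:
            target = next(m for m in merged if m[0] <= start < m[1])
            by_window[target].append(value)
        # ExpandToElements: one output per (key, window); timestamp = window end
        # (an assumption of this sketch).
        for window in merged:
            results.append((key, by_window[window], window[1], window))
    return results


sessions = group_by_key_and_window(
    [("k1", "v1", 2), ("k2", "v2", 14), ("k1", "v3", 57), ("k1", "v4", 20)],
    gap=30,
)
```

Here `v1` and `v4` for `k1` are 18 time units apart, less than the gap of 30, so their windows merge into one session, while `v3` arrives long after and forms its own.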

Triggers & Incremental Processing
- When to end a window and emit results?
  - Watermarks alone: too fast (late data) or too slow (a single slow datum)
- Feature: triggers
  - Determine when to emit results, based on:
    - Watermarks
    - Points in processing time
    - Data arrival
    - User-defined events
- On triggering…
  - Discard previous results
  - Accumulate new data on top of previous results
  - Accumulate & retract: accumulate and store the value, retract the previous value
    - Useful if multiple triggers fire on the same window
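The three refinement modes differ only in what a window emits each time a trigger fires. The sketch below shows this for a running sum over one window. The function name and mode strings are hypothetical, and representing a retraction as the negated previous sum is an assumption that only works for additive aggregates.

```python
def emit_panes(panes, mode):
    """panes: list of lists, the new values that arrived in one window
    between consecutive trigger firings. Returns what each firing emits."""
    outputs = []
    total = 0        # running sum kept across firings (accumulating modes)
    previous = None  # last emitted total (needed to retract it)
    for pane in panes:
        pane_sum = sum(pane)
        if mode == "discarding":
            # Emit only what arrived since the last firing.
            outputs.append([pane_sum])
        elif mode == "accumulating":
            # Emit the refined total; it supersedes the previous one.
            total += pane_sum
            outputs.append([total])
        elif mode == "accumulating_and_retracting":
            # Emit a retraction of the previous total, then the new total.
            total += pane_sum
            emitted = [] if previous is None else [-previous]
            emitted.append(total)
            outputs.append(emitted)
            previous = total
        else:
            raise ValueError(f"unknown mode: {mode}")
    return outputs
```

With panes `[[5, 7], [3, 4], [8]]`, discarding emits 12, 7, 8; accumulating emits 12, 19, 27; and accumulating and retracting emits 12, then -12 and 19, then -19 and 27. Retractions matter when a downstream step regroups the results, since it would otherwise count both 12 and 19.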

Examples

Batch
- Single global window

1-minute triggers
- Accumulating
- Discarding

Tuple-based

Batch & Micro-Batch (event time)
- Batch:
  - Wait for all data
  - Process in event time
- Micro-batch:
  - Wait for all data each minute
  - Process in event time

Streaming: Fixed Windows
- Wait for the watermark to pass a certain event time
  - Retrigger on late data
  - Poor latency
- Wait for the watermark to pass a certain event time + processing-time-based triggers
  - Higher cost, but lower latency than micro-batch

Session windowing + Retractions
- 1 min timeout
- 1 min processing-time window
- Watermark

Implementation & Design
- Implementation
  - FlumeJava: library for writing parallel pipelines
  - MillWheel: framework for building streaming pipelines
  - Fault tolerance is offloaded to these
- Design
  - Data is never-ending
  - Flexibility: support many kinds of pipelines
  - Improve on previous execution engines
  - Make code clear

Use Cases
- Combining streaming and batch pipelines into one implementation
- Tracking sessions
- Billing: eliminate extra logic for dealing with late data
- Aggregating statistics: percentile watermarks
- Batch processing: eliminate extra logic for early termination
- Building recommendations: processing-time triggers
- Anomaly detection: data-driven triggers

My own questions
- What does moving from MillWheel + FlumeJava → Dataflow look like? Are fewer LOC needed?
- Can this system support record-at-a-time processing with very low latency?
- Is there any performance overhead?