Chain: Operator Scheduling for Memory Minimization in Data Stream Systems Authors: Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani (Dept.

Presentation transcript:

Chain: Operator Scheduling for Memory Minimization in Data Stream Systems
Authors: Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani (Dept. Computer Science, Stanford University)
SIGMOD 2003, June 9-12, San Diego, CA.
The following slides were downloaded as is from:
Sincere thanks to all who made the slides!

Chain: Operator Scheduling for Memory Minimization in Data Stream Systems Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani

Data Streams are Bursty
- Data stream arrival rates are often:
  - Fast
  - Irregular
- Examples:
  - Network traffic (IP, telephony, etc.)
  - messages
  - Web page access patterns
- Peak rate much higher than average rate
  - 1-2 orders of magnitude
- Impractical to provision system for peak rate

Bursts Create Backlogs
- Arrival rate temporarily exceeds throughput
  - Queues of unprocessed elements build up
- Two options when memory fills up
  - Page to disk
    - Slows system, lowers throughput
  - Admission control (i.e. drop packets)
    - Data is lost, answer quality suffers
- Our goal: reduce memory needed for queueing

Outline
- Problem Definition
- Intuition Behind the Solution
- Chain Scheduling Algorithm
- Near-Optimality of Chain Scheduling
- Open Problems

Problem Definition
- Inputs:
  - Data flow path(s) consisting of sequences of operators
  - For each operator we know:
    - Execution time (per block)
    - Selectivity
[Figure: a stream feeding two query plans, Query #1 (operators σ, Σ) and Query #2 (operators σ, σ); each operator i = 1..4 is annotated with Time: t_i and Selectivity: s_i.]

Progress charts
[Figure: progress chart plotting block size against time for three σ operators, Opt1, Opt2 and Opt3; the chart passes through (0,1), (1,0.5), (4,0.25) and (6,0).]
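As a rough illustration of how a progress chart is built, the sketch below accumulates execution times and multiplies selectivities along one pipeline. The function name `progress_chart` and the per-operator (time, selectivity) values are assumptions, inferred so that the result matches the points on the slide; they are not taken from the paper.

```python
def progress_chart(operators):
    """Build progress-chart points for one pipeline.

    operators -- list of (time_per_block, selectivity) in pipeline order.
    Returns [(cumulative_time, block_size), ...], starting at (0, 1).
    """
    points = [(0.0, 1.0)]
    for time_per_block, selectivity in operators:
        prev_time, prev_size = points[-1]
        # Each operator adds its running time and scales the block size.
        points.append((prev_time + time_per_block, prev_size * selectivity))
    return points


# Hypothetical operators chosen to reproduce the slide's chart.
ops = [(1, 0.5), (3, 0.5), (2, 0.0)]
print(progress_chart(ops))
# [(0.0, 1.0), (1.0, 0.5), (4.0, 0.25), (6.0, 0.0)]
```

A selectivity of 0 for the last operator simply models output that leaves the system and no longer occupies queue memory.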

Problem Definition
- Inputs:
  - Data flow path(s) consisting of sequences of operators
  - For each operator we know:
    - Execution time (per block)
    - Selectivity
- At each time step:
  - Adversary may add blocks of tuples to initial input queue(s)
  - Scheduler selects one block of tuples
  - Selected block moves one step on its progress chart
- Objective:
  - Minimize peak memory usage (sum of queue sizes)
[Figure: a stream feeding two query plans, Query #1 (operators σ, Σ) and Query #2 (operators σ, σ); each operator i = 1..4 is annotated with Time: t_i and Selectivity: s_i.]

Main Solution Idea
- Fast, selective operators release memory quickly
- Therefore, to minimize memory:
  - Give preference to fast, selective operators
  - Postpone slow, unselective operators
- Greedy algorithm:
  - Operator priority = memory released per unit time, (1 − s_i)/t_i
  - Always schedule the highest-priority available operator
- Greedy doesn’t quite work…
  - A “good” operator that follows a “bad” operator rarely runs
  - The “bad” operator doesn’t get scheduled
  - Therefore there is no input available for the “good” operator

Bad Example for Greedy
[Figure: progress chart (block size vs. time) for Opt1, Opt2 and Opt3, with the annotation "Tuples build up here".]
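The problem with Greedy can be seen numerically. The sketch below reads an operator's priority as the rate at which it shrinks a block, that is, the steepness of its own segment of the progress chart. The chart points are the ones that appear on the example slides; the helper name `greedy_priorities` is hypothetical.

```python
def greedy_priorities(chart):
    """Per-operator priority: drop in block size per unit of running time."""
    return [
        (size_in - size_out) / (t_out - t_in)
        for (t_in, size_in), (t_out, size_out) in zip(chart, chart[1:])
    ]


chart = [(0, 1.0), (1, 0.5), (4, 0.25), (6, 0.0)]
opt1, opt2, opt3 = greedy_priorities(chart)
print(opt1, opt3)   # 0.5 0.125
print(opt3 > opt2)  # True
```

Opt3 outranks Opt2, but Opt3 only receives input once Opt2 has run, so under Greedy blocks pile up in front of the low-priority Opt2.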

Chain Scheduling Algorithm
[Figure: progress chart (block size vs. time) for Opt1, Opt2 and Opt3, with its lower envelope drawn.]

Chain Scheduling Algorithm
- Calculate lower envelope
- Priority = |slope of lower envelope segment|
- Always schedule highest-priority available operator
- Break ties using operator order in pipeline
  - Favor later operators
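A minimal sketch of the priority computation, assuming the lower envelope is built by repeatedly jumping from the current point to the later point reached by the steepest descent. The names `lower_envelope` and `chain_priorities` are illustrative, and the chart is the one from the example slides.

```python
def lower_envelope(chart):
    """Return indices of the chart points that lie on the lower envelope."""
    indices = [0]
    while indices[-1] < len(chart) - 1:
        i = indices[-1]
        x0, y0 = chart[i]
        # Steepest descent from point i; on ties prefer the farther point.
        best = max(
            range(i + 1, len(chart)),
            key=lambda j: ((y0 - chart[j][1]) / (chart[j][0] - x0), j),
        )
        indices.append(best)
    return indices


def chain_priorities(chart):
    """Give every operator the descent rate of its envelope segment."""
    env = lower_envelope(chart)
    priorities = [0.0] * (len(chart) - 1)
    for a, b in zip(env, env[1:]):
        slope = (chart[a][1] - chart[b][1]) / (chart[b][0] - chart[a][0])
        for op in range(a, b):
            priorities[op] = slope
    return priorities


chart = [(0, 1.0), (1, 0.5), (4, 0.25), (6, 0.0)]
print(lower_envelope(chart))    # [0, 1, 3]
print(chain_priorities(chart))  # [0.5, 0.1, 0.1]
```

Here Opt2 and Opt3 fall on the same envelope segment and share one priority, so the tie-breaking rule (favor later operators) decides between them.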

FIFO: Example
[Figure: progress chart (block size vs. time) for Opt1, Opt2 and Opt3 through (0,1), (1,0.5), (4,0.25) and (6,0).]

Chain: Example
[Figure: progress chart (block size vs. time) for Opt1, Opt2 and Opt3 through (0,1), (1,0.5), (4,0.25) and (6,0), with its lower envelope drawn.]

Memory Usage
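To compare the two policies on the example chart, the following toy simulator may help. It rests on several assumptions that are not from the slides: time advances in unit steps, one block receives one unit of processing per step, a block keeps its input size until its current operator finishes, and the workload is a burst of one block per step for four steps. The Chain priorities are the lower-envelope slopes for this chart.

```python
def simulate(chart, arrivals, priorities=None):
    """Return total queued block size at each step (FIFO if priorities is None).

    chart      -- progress chart points [(time, size), ...]
    arrivals   -- arrivals[t] = number of blocks arriving at step t
    priorities -- per-operator priorities (Chain), or None (FIFO)
    """
    durations = [x1 - x0 for (x0, _), (x1, _) in zip(chart, chart[1:])]
    blocks = []  # each block: [arrival_number, operator_index, units_done]
    trace, arrived, t = [], 0, 0
    while t < len(arrivals) or blocks:
        for _ in range(arrivals[t] if t < len(arrivals) else 0):
            blocks.append([arrived, 0, 0])
            arrived += 1
        # Assumption: size only changes when an operator completes.
        trace.append(sum(chart[op][1] for _, op, _ in blocks))
        if blocks:
            if priorities is None:
                # FIFO: keep working on the oldest block in the system.
                block = min(blocks, key=lambda b: b[0])
            else:
                # Chain: highest priority, ties favor later operators,
                # then the oldest block within that operator's queue.
                block = max(blocks, key=lambda b: (priorities[b[1]], b[1], -b[0]))
            block[2] += 1
            if block[2] == durations[block[1]]:
                block[1] += 1
                block[2] = 0
                if block[1] == len(durations):
                    blocks.remove(block)  # fully processed, leaves the system
        t += 1
    return trace


chart = [(0, 1.0), (1, 0.5), (4, 0.25), (6, 0.0)]
burst = [1, 1, 1, 1]
print(max(simulate(chart, burst)))                   # 3.5 (FIFO peak)
print(max(simulate(chart, burst, [0.5, 0.1, 0.1])))  # 2.5 (Chain peak)
```

Under these assumptions Chain runs the cheap, highly selective first operator on every arriving block immediately, which is why its peak is lower during the burst.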

Chain is Near-Optimal
- Memory within constant factor of optimal offline algorithm
- Proof sketch:
  - Greedy scheduling is optimal for convex progress charts
    - “Best” operators are immediately available
  - Lower envelope is convex
  - Lower envelope closely approximates actual progress chart
    - Proof on next slide…
- Theorem: Given a system with k queries, all operator selectivities ≤ 1, let C(t) = # of blocks of memory used by Chain at time t. At every time t, any algorithm must use ≥ C(t) − k memory.

Lemma: Lower Envelope is Close to Actual Progress Chart
- At most one block in the middle of each lower envelope segment
  - Due to tie-breaking rule
- (Lower envelope + 1) gives upper bound on actual memory usage
  - Additive error of 1 block per progress chart
[Figure: progress chart and its lower envelope, with the gap between them labeled "Difference".]

Performance Comparison
[Figure: memory usage over time for the compared schedulers, with the annotation "spike in memory due to burst".]

Open Problems
- Avoid starvation
  - Introduce deadlines (maximum allowable latency)
- Handle sharing between query plans
  - Shared computation
  - Shared data (queues store pointers)
- Sliding-window joins between streams
  - Time synchronization complicates scheduling
- Heuristics proposed in SIGMOD ‘03 paper

Thanks for Listening!