Jun-Ki Min, School of Computer Engineering, Korea University of Technology and Education
Stream data
◦ A growing number of applications generate streams of data
  - Performance measurements in network monitoring and traffic management
  - Call detail records in telecommunications
  - Transactions in retail chains, ATM operations in banks
  - Log records generated by Web servers
  - Sensor network data
◦ Application characteristics
  - Massive volumes of data (several terabytes)
  - Records arrive at a rapid rate
Traditional data processing
◦ Stable repository
◦ Query the data many times
Stream data processing
◦ Data arrives continuously
◦ Data is processed without the benefit of multiple passes
◦ Users register continuous queries in advance
Using an RDBMS
◦ Data streams as relation inserts, continuous queries as triggers or materialized views
◦ Problems with this approach
  - Inserts are typically batched, incurring high overhead
  - Limited expressiveness: simple conditions only (triggers), no built-in notion of sequence (views)
  - No notion of approximation or resource allocation
  - Current systems don't scale to a large number of triggers
STREAM [2]
◦ Research project at Stanford
Telegraph [3]
◦ Research project at UC Berkeley
AURORA [1]
◦ MIT, Brown University, Brandeis University
The Stanford Data Stream Management System
◦ Data streams and stored relations
◦ Declarative language (CQL) for registering continuous queries
◦ Flexible query plans and execution strategies
  - Continuous monitoring and reoptimization subsystem
◦ Aggressive sharing of state and computation among queries
◦ Load shedding by introducing approximation
◦ Tools to monitor and manipulate query plans
[Figure: example query plan; the legend shows the properties annotated on the plan: join selectivity, rate of tuple flow, and queue size]
Research project at UC Berkeley
Challenges
◦ Adaptivity
  - Eddies: tuple routing and operator scheduling
◦ Shared continuous queries
  - Amortizing query-processing costs by sharing the execution of multiple long-running queries
Assumptions of Telegraph's design
◦ Very volatile, unpredictable environments
  - The Internet, sensor networks, wide-area federated software including peer-to-peer systems
◦ Performance is volatile
  - Data rates change from moment to moment
  - Services speed up, slow down, disappear, and reappear over time
  - Code behaves differently from moment to moment
  - Data quality changes from moment to moment
MIT, Brown University, Brandeis University
Features
1. Designed for scalability
2. QoS-driven resource management
3. Continuous and historical queries
4. Stream storage management
[Figure: Aurora runtime architecture; a router feeds input queues (q1, ..., qn) into box processors, coordinated by a scheduler, QoS monitor, catalog, and a buffer/storage manager backed by a persistent store]
Query Operators (Boxes)
◦ Simple: FILTER, MAP
◦ Binary: UNION, JOIN
◦ Windowed: AGGREGATE, WSORT (sliding and tumbling windows)
[Figure: a query network of boxes whose outputs feed per-application QoS specifications]
The properties of stream data vary over time
◦ Adaptiveness is required to generate an efficient plan with respect to changing data properties
◦ Improving the performance of stream query processing
  - Operator scheduling (NEXT WEEK)
  - Operator ordering
  - Query optimization
  - Query indexing
Operator Scheduling
◦ Select one operator among the executable operators
  - Primitive scheduling
  - Eddy [4]
  - Chain [5]
  - Train [6]
  - Adaptive scheduling [7]
[Figure: a stream source feeding queues between operators O1, O2, O3]
Process scheduling from the OS
◦ FIFO
  - Tuples are processed in the order that they arrive
  - Advantage: a consistent throughput
◦ Round robin
  - Works by placing all runnable operators in a circular queue and allocating a fixed time slice to each
  - Advantage: avoidance of starvation
Disadvantage
◦ Does not adapt at all to changing stream conditions
  - Large queue sizes, poor output rate
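The round-robin policy above can be sketched in a few lines. This is a minimal illustration, assuming a toy Operator class and a time slice of one tuple per operator per cycle (both are assumptions, not part of any of the cited systems):

```python
from collections import deque

class Operator:
    """Toy stream operator: an input queue plus a record of processed tuples."""
    def __init__(self, name):
        self.name = name
        self.queue = deque()
        self.processed = []

    def process_one(self):
        # Process at most one tuple; an empty queue simply yields the slice.
        if self.queue:
            self.processed.append(self.queue.popleft())

def round_robin(operators, cycles):
    """Visit all runnable operators in a fixed circular order,
    giving each one fixed time slice (here: one tuple) per cycle."""
    for _ in range(cycles):
        for op in operators:
            op.process_one()
```

Because every operator gets a slice on every cycle, no operator starves; but the policy ignores queue lengths and data rates, which is exactly the non-adaptivity criticized above.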
Eddy
◦ Lottery-type scheduler
◦ Adapts to long-running queries
  - ready bit: indicates which operators can be applied to a tuple
  - done bit: indicates the operators to which a tuple has already been routed
[Figure: an eddy for the query SELECT * FROM R WHERE R.a > 10 AND R.b < 15, routing tuples R1 (a=5, b=25) and R2 (a=15) between the operators (R.a > 10) and (R.b < 15) while tracking per-tuple ready/done bits]
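The ready/done bookkeeping can be sketched as follows. This is a simplified model of the routing in [4]: the lottery policy is reduced to "pick the first ready, not-yet-done operator", and the predicates and tuples are illustrative assumptions:

```python
# Sketch of eddy-style routing: each tuple carries ready/done bitmaps and is
# bounced between operators until all are done or a predicate rejects it.

def make_tuple(values, n_ops):
    return {"values": values, "ready": [True] * n_ops, "done": [False] * n_ops}

def eddy(tuples, predicates):
    """Route each tuple through all predicates; emit it only if all pass."""
    output = []
    for t in tuples:
        alive = True
        while alive and not all(t["done"]):
            # Choose any operator that is ready and not yet applied to t.
            i = next(j for j, d in enumerate(t["done"]) if not d and t["ready"][j])
            t["done"][i] = True
            if not predicates[i](t["values"]):
                alive = False  # predicate fails: drop the tuple
        if alive:
            output.append(t["values"])
    return output
```

A real eddy adapts by biasing the choice of the next operator (the lottery) toward cheap, selective operators; here that choice point is the `next(...)` expression.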
Chain [5] (STREAM)
Purpose
◦ Minimize memory utilization
Assumptions
◦ Each operator has a processing time t
◦ Each operator has a selectivity s
Progress chart
◦ m+1 operator points (t0, s0), (t1, s1), ..., (tm, sm)
◦ The i-th operator oi takes time ti - ti-1 with selectivity si/si-1
◦ For a point (t, s) with ti-1 <= t < ti and m >= j >= i, the derivative is d(t, s, j) = -(sj - s)/(tj - t)
◦ The steepest derivative is D(t, s) = max over m >= j >= i of d(t, s, j)
◦ The Steepest Descent Operator Point is SDOP(t, s) = (tb, sb), where b = min{j | m >= j >= i and d(t, s, j) = D(t, s)}
◦ Lower envelope
  - Connect the sequence of SDOPs
Chain
◦ At each time unit, schedule the tuple that lies on the segment with the steepest slope in its lower-envelope simulation. If there are multiple such tuples, select the tuple with the earliest arrival time
◦ Chain is optimal with respect to memory utilization for single-stream queries (e.g., simple selections)
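The SDOP computation above can be sketched directly from its definition. The progress-chart points below are illustrative assumptions:

```python
# Sketch of Chain's steepest-descent step on a progress chart (after [5]).
# points[j] = (t_j, s_j): cumulative time and remaining selectivity after
# operator j; selectivity decreases as j grows, so d(t, s, j) is positive
# and larger d means a steeper drop.

def steepest_descent_point(t, s, points, i):
    """Return (SDOP, D(t,s)) for a tuple at (t, s): the earliest point
    (t_j, s_j), j >= i, maximizing d(t,s,j) = -(s_j - s)/(t_j - t)."""
    best_j, best_d = None, None
    for j in range(i, len(points)):
        tj, sj = points[j]
        d = -(sj - s) / (tj - t)
        if best_d is None or d > best_d:  # strict '>' keeps the smallest j on ties
            best_j, best_d = j, d
    return points[best_j], best_d
```

Connecting successive SDOPs starting from (t0, s0) yields the lower envelope that Chain schedules against.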
Extending Chain to joins
◦ (t, s): processing time t and selectivity s
◦ Average number of tuples in S: LS (similarly LR for R)
◦ Window size (time): t'
◦ Input size: t'(LR + LS)
◦ Output size: t'(LR * aw(S) + LS * aw(R)), where aw(S) is the semijoin selectivity of stream R with the sliding window for S
◦ Time for a run: t'(LR * tR + LS * tS), where tX is the average time to process a tuple from stream X
◦ Selectivity s for a join: (LR * aw(S) + LS * aw(R)) / (LR + LS)
◦ Processing time t for a join: (LR * tR + LS * tS) / (LR + LS)
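The two closing formulas, which collapse a windowed join into a single Chain operator point, can be written out directly. The rates and selectivities in the example are illustrative assumptions:

```python
# Sketch: a sliding-window join as one Chain operator with an aggregate
# (time, selectivity) pair, following the formulas above.

def join_operator_point(L_R, L_S, t_R, t_S, a_wS, a_wR):
    """Average per-tuple processing time t and selectivity s of the join.
    L_R, L_S: average tuple counts; t_R, t_S: per-tuple processing times;
    a_wS, a_wR: semijoin selectivities against the opposite window."""
    t = (L_R * t_R + L_S * t_S) / (L_R + L_S)
    s = (L_R * a_wS + L_S * a_wR) / (L_R + L_S)
    return t, s
```

With the join reduced to a single (t, s) point, it can be placed on a progress chart like any unary operator.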
Aurora data stream manager
Two-level scheduling
◦ Which query to process (i.e., select a query)
  - Static: application-at-a-time; use various scheduling policies (e.g., round robin)
  - Dynamic: top-k spanner, QoS-driven
◦ How the selected query is processed
  - Operator scheduling
Operator scheduling
◦ Traversing the query tree
◦ Three goals
  - Throughput
  - Latency
  - Memory requirement
◦ QoS-driven scheduling
Min-Cost (MC)
◦ Optimize per-output-tuple processing cost
◦ Traverse the query tree in post-order: b4-b5-b3-b2-b6-b1
◦ Assume a processing cost per tuple p and a box call overhead o
  - Every selectivity is 1
  - Each operator has a queue with a single tuple
◦ Total cost: 15p + 5o
◦ Average output latency: 12.5p + o
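The post-order traversal can be sketched as follows. The tree shape (b1 at the root with children b2 and b6, b3 under b2, and b4 and b5 under b3) is reconstructed from the traversal order quoted above and is an assumption:

```python
# Sketch of Min-Cost's post-order traversal of an Aurora query tree.

def post_order(tree, root):
    """Yield operators children-first, so each box drains its inputs
    before its parent runs; this minimizes per-box call overhead."""
    order = []
    for child in tree.get(root, []):
        order.extend(post_order(tree, child))
    order.append(root)
    return order

# Assumed tree consistent with the order b4-b5-b3-b2-b6-b1.
tree = {"b1": ["b2", "b6"], "b2": ["b3"], "b3": ["b4", "b5"]}
```

One box call per operator (5 interior calls beyond the leaves in the example) is what keeps the overhead term at 5o, at the cost of delaying the first output.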
Min-Latency (ML)
◦ The average latency of output tuples can be reduced by producing initial output tuples as fast as possible
◦ output_cost(b): an estimate of the output latency, computed over D(b), the set of operators downstream from b
◦ Under the same conditions as MC: b1-b2-b1-b6-b1-b4-b2-b1-b3-b2-b1-b5-b3-b2-b1
◦ Total cost: 15p + 15o
◦ Average latency: 7.17p + 7.17o
Min-Memory (MM)
◦ Maximize the consumption of data per unit time
◦ Expected memory reduction rate mem_rr(b) for each operator b, where tsize(b) is the size of a tuple residing in b's input queue
◦ Assume (selectivity, cost) pairs: b1 = (0.9, 2), b2 = (0.4, 2), b3 = (0.4, 3), b4 = (1.0, 2), b5 = (0.4, 3), b6 = (0.6, 1)
  - All tuple sizes are 1
◦ mem_rr: 0.05, 0.3, 0.5, 0, 0.2, 0.4
◦ Memory requirement: MM (36), MC (39), ML (40)
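The reduction rate can be sketched under an assumed form, mem_rr(b) = (1 - selectivity(b)) * tsize(b) / cost(b); the exact formula in [6] may differ, so treat this as an illustration rather than the definitive definition:

```python
# Sketch of Min-Memory's expected memory-reduction rate, using the assumed
# form (1 - selectivity) * tsize / cost and the (selectivity, cost) pairs
# from the example above, with tuple size 1.

def mem_rr(selectivity, cost, tsize=1.0):
    return (1.0 - selectivity) * tsize / cost

boxes = {"b1": (0.9, 2), "b2": (0.4, 2), "b3": (0.4, 3),
         "b4": (1.0, 2), "b5": (0.4, 3), "b6": (0.6, 1)}
rates = {b: mem_rr(s, c) for b, (s, c) in boxes.items()}
```

MM then schedules operators in decreasing order of this rate, draining memory fastest first; b4, with selectivity 1, frees no memory and is scheduled last.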
QoS-driven scheduling
Each operator b has priority(b) = (utility(b), urgency(b))
◦ utility(b) = gradient(eol(b))
  - eol(b) = latency(b) + cost(D(b)), where D(b) is the set of operators downstream from b and cost(D(b)) is an estimate of how long they will take to process
  - latency(b) is the average latency of the tuples in b's input queue
◦ urgency(b) = -est(b), where est(b) indicates how close an operator is to a critical point (a point where QoS changes sharply)
◦ priority(b) = (utility(b), -est(b))
◦ Select the operator with the highest utility; break ties by choosing the one with the minimum slack time
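The selection rule reduces to a lexicographic comparison on the priority pair. A minimal sketch, assuming utility and est have already been computed for each operator (the names and numbers are illustrative):

```python
# Sketch of QoS-driven operator selection: priority(b) = (utility(b), -est(b));
# highest utility wins, ties broken by smaller est (i.e., less slack, more urgent).

def pick_operator(stats):
    """stats: {name: (utility, est)}. Returns the operator to run next."""
    return max(stats, key=lambda b: (stats[b][0], -stats[b][1]))
```

Encoding urgency as -est(b) lets the plain tuple comparison implement "minimum slack time" without a second pass.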
Worcester Polytechnic Institute
◦ Master's thesis
Raindrop system
◦ No single scheduler is superior
◦ Diverse QoS requirements
  - Output rate
  - Intermediate queue size
  - Tuple delay
◦ A single requirement for all queries
Update the related statistics periodically
◦ The algorithm score s is the mean of a statistic of a scheduler
◦ H is the mean for the historical category, and (maxH - minH) is the spread of its values
◦ decay reflects the unreliability of the scores of algorithms that have not run for a long time (0 <= decay < 1)
◦ time is the elapsed time since s was updated
◦ If the quantifier is "maximize", zi = zi; otherwise, zi = 1 - zi
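One way to combine these ingredients is sketched below. The exact formula in [7] is not reproduced here, so this combination is an assumption, chosen only to match the behavior described on this slide and the next: a score equal to the historical mean yields 0.5, and long-idle schedulers decay toward 0.5:

```python
# Assumed score combination (not necessarily the one in [7]): the normalized
# deviation of s from the historical mean H, scaled by decay**time so that
# stale scores fade toward the neutral value 0.5.

def score(s, mean_H, max_H, min_H, decay, time, maximize=True):
    z = 0.5 + (decay ** time) * (s - mean_H) / (2 * (max_H - min_H))
    return z if maximize else 1.0 - z
```

Both degenerate cases criticized on the next slide fall out of this form: s == H gives exactly 0.5, and decay**time drives any score to 0.5 as idle time grows.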
Roulette wheel strategy
◦ Assign each algorithm a slice of a circular "roulette wheel", with the size of the slice proportional to the individual's score
Problems with this work
◦ How to obtain statistics for schedulers that have never run
◦ Inaccuracy of the score function
  - A scheduler that has not run for a long time scores 0.5 (due to decay)
  - A scheduler that runs very well also scores 0.5 (since s == H)
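Roulette-wheel selection is standard fitness-proportionate sampling; a minimal sketch, with illustrative scheduler names and scores:

```python
import random

# Sketch of roulette-wheel selection: each algorithm owns a slice of the
# wheel proportional to its score; a uniform spin picks the slice it lands in.

def roulette_pick(scores, rng=random):
    """scores: {name: positive score}. Returns the selected name."""
    total = sum(scores.values())
    spin = rng.uniform(0, total)
    acc = 0.0
    for name, s in scores.items():
        acc += s
        if spin <= acc:
            return name
    return name  # guard against floating-point edge at the top of the wheel
```

Higher-scoring schedulers are chosen more often, but every scheduler with a nonzero score keeps some probability of running, which is how the strategy keeps gathering statistics for currently out-of-favor algorithms.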
[1] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik, "Monitoring streams: a new class of data management applications," In Proc. 28th Intl. Conf. on Very Large Data Bases (VLDB), Aug. 2002.
[2] A. Arasu, B. Babcock, S. Babu, M. Datar, K. Ito, R. Motwani, I. Nishizawa, U. Srivastava, D. Thomas, R. Varma, and J. Widom, "STREAM: The Stanford stream data manager," IEEE Data Engineering Bulletin, Vol. 26, No. 1, 2003.
[3] J. M. Hellerstein, M. J. Franklin, S. Chandrasekaran, A. Deshpande, K. Hildrum, S. Madden, V. Raman, and M. A. Shah, "Adaptive query processing: Technology in evolution," IEEE Data Engineering Bulletin, Vol. 23, No. 2, pp. 7-18, 2000.
[4] R. Avnur and J. M. Hellerstein, "Eddies: Continuously adaptive query processing," In Proceedings of ACM SIGMOD Conference, 2000.
[5] B. Babcock et al., "Chain: Operator scheduling for memory minimization in data stream systems," ACM SIGMOD, 2003.
[6] D. Carney et al., "Operator scheduling in a data stream manager," VLDB, 2003.
[7] B. Pielech, "Adaptive scheduling algorithm selection in a streaming query system," Master's thesis, Worcester Polytechnic Institute, 2003.
[8] N. Tatbul, U. Cetintemel, S. Zdonik, M. Cherniack, and M. Stonebraker, "Load shedding in a data stream manager," VLDB, 2003.
[9] S. Babu, R. Motwani, K. Munagala, I. Nishizawa, and J. Widom, "Adaptive ordering of pipelined stream filters," In Proceedings of ACM SIGMOD Conference, pp. 407-418, 2004.
[10] S. Madden, M. A. Shah, J. M. Hellerstein, and V. Raman, "Continuously adaptive continuous queries over streams," In Proceedings of ACM SIGMOD Conference, 2002.
[11] J. Lee, S. Kang, Y. Lee, S. Lee, and J. Song, "BMQ-Processor: A high-performance border crossing event detection framework for large-scale monitoring applications," IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 21, No. 2, February 2009.
[12] S. Madden et al., "TAG: a Tiny AGgregation service for ad-hoc sensor networks," OSDI, 2002.
[13] N. Shrivastava et al., "Medians and beyond: New aggregation techniques for sensor networks," ACM SenSys, 2004.
[14] N. Trigoni et al., "Multi-query optimization for sensor networks," DCOSS, 2005.
[15] N. Trigoni et al., "Routing and processing multiple aggregate queries in sensor networks," ACM SenSys.
[16] A. Deshpande et al., "Model-driven data acquisition in sensor networks," VLDB, 2004.
[17] D. Chu et al., "Approximate data collection in sensor networks using probabilistic models," ICDE, 2006.
[18] D. Tulone et al., "PAQ: Time series forecasting for approximate query answering in sensor networks," European Conf. on Wireless Sensor Networks, 2006.
[19] A. Deligiannakis et al., "Compressing historical information in sensor networks," ACM SIGMOD, 2004.
[20] A. Jain et al., "Adaptive stream resource management using Kalman filters," ACM SIGMOD, 2004.
[21] X. Yang et al., "In-network execution of monitoring queries in sensor networks," ACM SIGMOD.
[22] M. Stern et al., "Towards efficient processing of general-purpose joins in sensor networks," ICDE.
[23] A. Pandit et al., "Communication-efficient implementation of range-joins in sensor networks," International Conference on Database Systems for Advanced Applications (DASFAA), 2006.
[24] H. Yu et al., "In-network join processing for sensor networks," APWeb.
[25] A. Coman et al., "On join location in sensor networks," MDM.
[26] H. S. Lim, J. G. Lee, M. J. Lee, K. Y. Whang, and I. Y. Song, "Continuous query processing in data streams using duality of data and queries," ACM SIGMOD, 2006.
[27] B. Mozafari and C. Zaniolo, "Optimal load shedding with aggregates and mining queries," ICDE, 2010.