Published by Tamsyn Warren. Modified over 9 years ago.
1
Aurora: A new model and architecture for data stream management
Group 19: Chu Xuân Tình, Trần Nhật Tuấn, Huỳnh Thái Tâm
Lecturer: Associate Professor Dr.techn. Dang Tran Khanh
2
Outline
Introduction
The Aurora stream query algebra
Run-time architecture
3
Aurora system architecture
Aurora is a new model and architecture for data stream management: a new system that manages data streams for monitoring applications.
Monitoring software must process and react to continual inputs from many sources (e.g., sensors) rather than from human operators. This requirement motivates Aurora, a new DBMS currently under construction at Brandeis University, Brown University, and M.I.T.
4
Currently used DB systems
Classical DBMS:
  Passive repository storing data (HADP model: human-active, DBMS-passive)
  Only the current state of the data is important
  Data are synchronized; queries have exact answers (no support for approximation)
Monitoring applications are difficult to implement in a traditional DBMS:
  The basic computation model is wrong: DBMSs follow the HADP model, while monitoring applications often require a DAHP model (DBMS-active, human-passive)
  Triggers and alerters are second-class citizens
  Required data are hard to obtain from historical time series
  Developing dedicated middleware is expensive
Conclusion: these systems are ill suited to applications that alert humans when an abnormal situation occurs, because such applications expect the DAHP model
5
Aurora: main assumptions
Data come from various, uniquely identified data sources (data streams)
Each incoming tuple is timestamped
Aurora is expected to process incoming streams
Tuples flow through a loop-free, directed graph
Outputs from the system are presented to applications
Aurora maintains historical storage
7
Aurora system overview
Any box can filter a stream (select operation)
A box can compute stream aggregates by applying an aggregate function across a window of values in the stream
The output of any box can be an input for several other boxes (split operation)
Each box can gather tuples from many inputs (union operation)
8
Aurora query model
[Figure: a network of boxes b1 to b7 with connection points, storage S1 to S3, and an application output. Paths are labeled continuous query, view, and ad-hoc query; the figure also shows a "Keep 2 hr" persistence specification and a QoS spec.]
Each connection point (CP) and view should have a persistence specification (e.g., "keep data for 2 hr")
Each output is associated with a QoS specification (helps to allocate the processing elements along the path)
9
Queries in Aurora
Continuous queries:
  The query continuously processes tuples
  Output tuples are delivered to an application
Ad-hoc queries:
  The system processes data and delivers answers starting from the earliest time stored in the connection point
  The semantics are the same as for a continuous query that started execution at t_now - (persistence specification)
  The query continues until explicit termination
Views:
  Similar to materialized or partially materialized views in classical DB systems
  An application may connect to the end of this path whenever there is a need
10
Queries in Aurora
Connection points:
  Support dynamic modification of the network
  Support data caching (persistence specification), which is helpful for ad-hoc queries
  A connection point without an upstream node can be used as a stored data set (as in a classical DBMS)
  Tuples from a connection point can be pushed through the system (e.g., when the connection point is "materialized" and its stored tuples are passed as a stream to the downstream nodes)
  Alternatively, a downstream node can pull the data (helpful when executing filtering or joining operations)
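A minimal sketch of how a connection point with a persistence specification might cache tuples for later ad-hoc queries. The class name `ConnectionPoint`, the `keep_seconds` parameter, and the dictionary tuple layout are illustrative assumptions, not Aurora's actual interface; timestamps are assumed to arrive in nondecreasing order.

```python
from collections import deque


class ConnectionPoint:
    """Hypothetical connection point that keeps tuples for a fixed horizon."""

    def __init__(self, keep_seconds):
        self.keep_seconds = keep_seconds  # persistence specification, e.g. 2 hr = 7200 s
        self._buffer = deque()

    def append(self, tup):
        """Store a new tuple and evict tuples older than the horizon."""
        self._buffer.append(tup)
        horizon = tup["ts"] - self.keep_seconds
        while self._buffer and self._buffer[0]["ts"] < horizon:
            self._buffer.popleft()

    def history(self):
        """Tuples an ad-hoc query would replay, earliest stored tuple first."""
        return list(self._buffer)


cp = ConnectionPoint(keep_seconds=7200)  # "keep data for 2 hr"
for ts in (0, 3600, 8000):
    cp.append({"ts": ts})
stored = [t["ts"] for t in cp.history()]
```

After the tuple with timestamp 8000 arrives, the tuple with timestamp 0 is older than two hours and is evicted, so an ad-hoc query attached at this point would start from timestamp 3600.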
11
Application domains
Online auctions
Network traffic management
Habitat monitoring
Military logistics
Immersive environments
Road traffic monitoring
System monitoring
12
SQuAl: the Aurora [S]tream [Qu]ery [Al]gebra
7 operators:
  Order-agnostic (Filter, Map, Union)
  Order-sensitive (BSort, Aggregate, Join, Resample)
Model:
  A stream is an append-only sequence of tuples with uniform type
  A stream type has the form (TS, A_1, …, A_n)
  Stream tuples have the form (ts, v_1, …, v_n)
  A_i: application-specific data fields
  ts: timestamp
13
Order-agnostic operators
Input tuples have the form t = (TS = ts, A_1 = v_1, …, A_k = v_k)
3 operators:
  Filter: similar to relational selection
    filters on multiple predicates
    routes tuples according to which predicates they satisfy
  Map: similar to relational projection
    applies arbitrary functions to tuples (including user-defined functions)
  Union: merges 2 or more streams of common schema
14
Filter
Acts much like a case statement
Can be used to route input tuples to alternative streams
Form: Filter(P_1, …, P_m)(S)
  P_i: predicates over tuples on the input stream S
Its output consists of m + 1 streams: one per predicate, plus one for tuples that satisfy none of the predicates
Output tuples have the same schema and values as input tuples, including their QoS timestamps
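The case-statement behavior of Filter can be sketched in a few lines. This is an illustration only: the function name `filter_box`, the dictionary tuple layout, and the first-match routing rule (a tuple goes to the output of the first predicate it satisfies) are assumptions made for the example.

```python
def filter_box(predicates, stream):
    """Route each tuple to one of len(predicates) + 1 output streams.

    Output i receives tuples whose first satisfied predicate is predicates[i];
    the last output receives tuples that satisfy no predicate.
    """
    outputs = [[] for _ in range(len(predicates) + 1)]
    for tup in stream:
        for i, predicate in enumerate(predicates):
            if predicate(tup):
                outputs[i].append(tup)  # tuple is passed through unchanged
                break
        else:
            outputs[-1].append(tup)
    return outputs


readings = [
    {"ts": 1, "temp": 35},
    {"ts": 2, "temp": 10},
    {"ts": 3, "temp": 22},
]
hot, cold, rest = filter_box(
    [lambda t: t["temp"] > 30, lambda t: t["temp"] < 15],
    readings,
)
```

With two predicates the box produces three streams: `hot`, `cold`, and `rest`, each holding unmodified input tuples, timestamps included.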
15
Map
Is a generalized projection operator
Form: Map(B_1 = F_1, …, B_m = F_m)(S)
  B_i: name of an attribute
  F_i: function over tuples on the input stream S
The output tuple for each input tuple t has the form (TS = t.TS, B_1 = F_1(t), …, B_m = F_m(t))
The resulting stream can have a different schema than the input stream, but the timestamps of input tuples are preserved in the corresponding output tuples
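A small sketch of Map as a generalized projection, assuming tuples are dictionaries with a `ts` field. The name `map_box` and the temperature example are hypothetical.

```python
def map_box(functions, stream):
    """Apply each named function to every tuple, preserving the timestamp.

    `functions` maps an output attribute name B_i to a function F_i over
    the whole input tuple.
    """
    for tup in stream:
        out = {"ts": tup["ts"]}  # TS = t.TS
        for name, func in functions.items():
            out[name] = func(tup)  # B_i = F_i(t)
        yield out


celsius = [{"ts": 7, "temp_c": 100}]
fahrenheit = list(map_box({"temp_f": lambda t: t["temp_c"] * 9 / 5 + 32}, celsius))
```

The output schema (`ts`, `temp_f`) differs from the input schema (`ts`, `temp_c`), while the timestamp is carried over unchanged.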
16
Union
Is used to merge 2 or more streams into a single output stream
Form: Union(S_1, …, S_n)
  S_i: streams with a common schema
Union can output tuples in any order
Output tuples have the same schema and values as input tuples, including their QoS timestamps
17
Order-sensitive operators
Require order specification arguments
Order specification: describes the tuple arrival order the operator expects
Order specifications have the form Order(On A, Slack n, GroupBy B_1, …, B_m)
  A, B_i: attributes
  n: non-negative integer
4 operators:
  BSort: an approximate sort operator with semantics equivalent to a bounded-pass bubble sort
  Aggregate: applies a window function to sliding windows over its input stream
  Join: a binary operator that resembles a band join applied to infinite streams
  Resample: an interpolation operator used to align streams
18
BSort
Is an approximate sort operator
Form: BSort(Assuming O)(S)
  O = Order(On A, Slack n, GroupBy B_1, …, B_m) is a specification of the assumed ordering over the output stream
Performs a buffer-based approximate sort
Equivalent to n passes of a bubble sort
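One way to picture the buffer-based approximate sort is a buffer that holds at most n + 1 tuples and emits its minimum whenever it fills. The sketch below makes several assumptions: the GroupBy clause is ignored, the function name `bsort` is hypothetical, and the buffer is flushed at the end only because the example input is finite (a real stream never ends).

```python
import heapq


def bsort(stream, key, slack):
    """Approximate sort with a buffer of slack + 1 tuples."""
    buffer = []
    for seq, tup in enumerate(stream):
        # seq breaks ties so tuples themselves are never compared
        heapq.heappush(buffer, (key(tup), seq, tup))
        if len(buffer) > slack:  # buffer holds slack + 1 tuples: emit the minimum
            yield heapq.heappop(buffer)[2]
    while buffer:  # flush for finite demo input only
        yield heapq.heappop(buffer)[2]


nearly_sorted = list(bsort([1, 3, 2, 5, 4], key=lambda x: x, slack=1))
badly_sorted = list(bsort([4, 3, 2, 1], key=lambda x: x, slack=1))
```

With Slack 1, an input that is at most one position out of order comes out fully sorted, while a reversed input is only partially repaired, which is why the sort is called approximate.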
19
BSort
20
Aggregate
Applies "window functions" to sliding windows over its input stream
Form: Aggregate(F, Assuming O, Size s, Advance i)(S)
  F: "window function" (an SQL-style aggregate operation or a Postgres-style user-defined function)
  O = Order(On A, Slack n, GroupBy B_1, …, B_m) is an order specification over the input stream S
  s: size of the window (measured in terms of values of A)
  i: an integer or predicate that specifies how to advance the window when it slides
Output tuples have the form (TS = ts, A = a, B_1 = u_1, …, B_m = u_m) ++ (F(W))
  W: "window" of tuples from the input stream with values of A between a and a + s - 1
  ts: the smallest timestamp associated with the tuples in W
  ++: denotes concatenation of 2 tuples
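The windowing rule can be illustrated on a finite, already ordered list of tuples. This sketch assumes Slack 0, no GroupBy attributes, and an integer Advance; the function name `aggregate` and the stock-price data are hypothetical. A window is emitted only once a tuple beyond its upper bound has been seen, mirroring how a streaming operator knows a window is complete.

```python
def aggregate(tuples, f, on, size, advance):
    """Sliding-window aggregate over `tuples` ordered on attribute `on`.

    `advance` must be a positive integer.
    """
    out = []
    if not tuples:
        return out
    last = tuples[-1][on]
    a = tuples[0][on]  # first window starts at the first value of A
    while a + size - 1 < last:  # window [a, a + size - 1] is closed
        window = [t for t in tuples if a <= t[on] <= a + size - 1]
        if window:
            ts = min(t["ts"] for t in window)  # smallest timestamp in W
            out.append({"ts": ts, on: a, **f(window)})  # (TS, A) ++ F(W)
        a += advance
    return out


quotes = [{"ts": 100 + h, "hour": h, "price": 10.0 * h} for h in range(1, 7)]


def average_price(window):
    return {"avg": sum(t["price"] for t in window) / len(window)}


averages = aggregate(quotes, average_price, on="hour", size=3, advance=2)
```

With Size 3 and Advance 2 over hours 1 to 6, the windows covering hours 1 to 3 and 3 to 5 are emitted; the window starting at hour 5 stays open because no tuple beyond hour 7 has arrived.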
21
Aggregate
22
Aggregate (continued)
Slack = 1 or more
Blocking: the operator waits for lost or late tuples to arrive before it can finish window calculations
Optional Timeout argument: Aggregate(F, Assuming O, Size s, Advance i, Timeout t)
23
Join
Is a binary join operator
Form: Join(P, Size s, Left Assuming O_1, Right Assuming O_2)(S_1, S_2)
  P: predicate over pairs of tuples from the input streams S_1 and S_2
  s: integer
  O_1: order specification on some numeric or time-based attribute of S_1 (A)
  O_2: order specification on some numeric or time-based attribute of S_2 (B)
For every in-order tuple t in S_1 and u in S_2, the concatenation of t and u (t ++ u) is output if:
  |t.A - u.B| ≤ s
  P holds of t and u
The QoS timestamp of the output tuple is the minimum of the timestamps of t and u
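The band-join condition can be checked with a nested loop over two finite lists. This is a sketch under simplifying assumptions: both inputs are finite and in order, the concatenation t ++ u is represented by keeping the two tuples under `left` and `right` keys to avoid name clashes, and the name `band_join` and the sample data are hypothetical.

```python
def band_join(left, right, pred, size, a, b):
    """Pair tuples t, u with |t[a] - u[b]| <= size for which pred(t, u) holds."""
    out = []
    for t in left:
        for u in right:
            if abs(t[a] - u[b]) <= size and pred(t, u):
                out.append({
                    "ts": min(t["ts"], u["ts"]),  # QoS timestamp of the output
                    "left": t,
                    "right": u,
                })
    return out


s1 = [{"ts": 1, "pos": 10, "id": "x"}, {"ts": 5, "pos": 50, "id": "y"}]
s2 = [
    {"ts": 2, "pos": 12, "id": "x"},
    {"ts": 3, "pos": 11, "id": "z"},
    {"ts": 6, "pos": 70, "id": "y"},
]
pairs = band_join(s1, s2, lambda t, u: t["id"] == u["id"], size=5, a="pos", b="pos")
```

Only the two `x` tuples are joined: the `z` tuple is within the band but fails the predicate, and the `y` tuples satisfy the predicate but are 20 apart, outside the band of 5.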
24
Join
25
Resample
Is an asymmetric, semijoin-like synchronization operator
Can be used to align pairs of streams
Form: Resample(F, Size s, Left Assuming O_1, Right Assuming O_2)(S_1, S_2)
  F: "window function" over S_1
  s: integer
  O_1: order specification on some numeric or time-based attribute of S_1 (A)
  O_2: order specification on some numeric or time-based attribute of S_2 (B)
For every tuple t from S_1, the output tuple is (B_1 : u.B_1, …, B_m : u.B_m, A : t.A) ++ F(W(t))
  W(t) = {u ∈ S_2 | u in order wrt O_2 in S_2 ∧ |t.A - u.B| ≤ s}
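The window W(t) can be computed directly from its definition on finite inputs. The sketch below follows the W(t) formula, so the window function is applied to tuples of the second stream; it ignores GroupBy attributes and skips tuples whose window is empty, both of which are assumptions made for the example. The name `resample` and the sensor data are hypothetical.

```python
def resample(left, right, f, size, a, b):
    """For each t in `left`, apply f to W(t) = {u in right : |t[a] - u[b]| <= size}."""
    out = []
    for t in left:
        window = [u for u in right if abs(t[a] - u[b]) <= size]
        if window:  # assumption: emit nothing when no tuple falls in the window
            out.append({a: t[a], **f(window)})  # (A : t.A) ++ F(W(t))
    return out


ticks = [{"ts": 1, "t": 10}, {"ts": 2, "t": 20}]
samples = [
    {"ts": 1, "t": 8, "v": 1.0},
    {"ts": 2, "t": 11, "v": 3.0},
    {"ts": 3, "t": 19, "v": 5.0},
]


def mean_value(window):
    return {"v": sum(u["v"] for u in window) / len(window)}


aligned = resample(ticks, samples, mean_value, size=2, a="t", b="t")
```

Each tick in the first stream receives an interpolated value built from the samples that lie within 2 units of it, which aligns the irregular sample stream to the tick stream.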
26
Resample
27
Run-time architecture
[Figure: run-time architecture. Components shown: Inputs, Router, Outputs, Scheduler, Box Processors, Load Shedder, QoS Monitor, Storage Manager (with queues Q1, Q2, …, Qi, …, Qj, …, Qn), Buffer Manager, and Persistent Storage.]
28
Quality of Service (QoS)
QoS is, in general, a multidimensional function of several attributes of an Aurora system:
  Response times (production of output tuples)
  Tuple drops
  Values produced (importance of the produced values)
The administrator specifies QoS graphs for each output based on one or more of these functions
Other types of QoS functions can be defined too
29
QoS graphs
Graphs are expected to be normalized
Graphs should allow a properly sized network to operate with all outputs in a "good zone"
Graphs should be convex (the value-based graph is an exception)
[Figure: three QoS graphs, each with a vertical axis from 0 to 1, plotted against delay, % tuples delivered, and output value; the "good zone" is marked.]
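A normalized QoS graph can be represented as a piecewise-linear function from a measured quantity to a utility between 0 and 1. The sketch below is illustrative: the helper `make_qos_graph` and the breakpoints (full utility up to 100 ms of delay, falling to zero at 300 ms) are made-up values, not figures from the slides.

```python
from bisect import bisect_right


def make_qos_graph(points):
    """Build a piecewise-linear utility function from sorted (x, utility) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]

    def utility(x):
        if x <= xs[0]:
            return ys[0]
        if x >= xs[-1]:
            return ys[-1]
        i = bisect_right(xs, x)  # xs[i - 1] <= x < xs[i]
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    return utility


# Hypothetical delay-based graph: the "good zone" is a delay of at most 100 ms.
delay_qos = make_qos_graph([(0, 1.0), (100, 1.0), (300, 0.0)])
in_good_zone = delay_qos(50)
degraded = delay_qos(200)
too_late = delay_qos(400)
```

A delay of 50 ms yields full utility, 200 ms yields half, and anything past 300 ms yields none.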
30
Aurora Storage Manager (ASM): queue management
There is one queue at the output of each box; this queue is shared by all successor boxes
Queues are stored in memory and on disk
Queues may change length
[Figure: queue organization over time for boxes b1 and b2, showing processed tuples.]
31
Scheduling in Aurora
The scheduler (and Aurora as a whole) aims to reduce overall tuple execution cost
It exploits two nonlinearities in tuple processing:
  Interbox nonlinearity:
    Minimize tuple thrashing (if buffer space is insufficient, tuples have to be shuttled between memory and disk)
    Avoid copying data from output to buffer (the ASM can be bypassed when one box is scheduled right after another)
  Intrabox nonlinearity:
    The cost of processing a tuple may decrease as the number of tuples available in the queue increases
32
Scheduling in Aurora
Aurora's approach: (1) have boxes queue as many tuples as possible, (2) process them all at once (train scheduling), and (3) pass them to subsequent boxes without going to disk (superbox scheduling)
Two goals: (1) minimize the number of I/O operations and (2) minimize the number of box calls per tuple
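The second goal, fewer box calls per tuple, can be made concrete with a back-of-the-envelope count. This is a simplified model, not Aurora's cost model: it assumes a linear chain of boxes, that every tuple passes through every box, and that one box call processes one whole train. The function name `box_calls` is hypothetical.

```python
def box_calls(num_tuples, num_boxes, train_size):
    """Number of box invocations needed to push all tuples through a chain of boxes."""
    trains = -(-num_tuples // train_size)  # ceiling division
    return trains * num_boxes


tuple_at_a_time = box_calls(1000, 4, train_size=1)
with_trains = box_calls(1000, 4, train_size=100)
```

Under these assumptions, processing 1000 tuples through 4 boxes takes 4000 calls one tuple at a time but only 40 calls with trains of 100 tuples.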
33
Scheduler performance
[Figure: bar chart of time (ms, axis from 0 to 300), split into execution costs and scheduling overhead, for three strategies: tuple at a time, trains, and superboxes.]
34
Priority assignment in the scheduler
The latency of each output tuple is the sum of the tuple's processing delay and its waiting delay (the latter is primarily a function of scheduling)
The goal of the scheduler is to assign priorities to box outputs so as to maximize the overall QoS
The scheduler's approach has two aspects:
  State-based analysis assigns priorities to outputs and picks the output with the highest utility for scheduling
  Feedback-based analysis observes the overall system and increases the priorities of outputs that are not doing well (based on the QoS graph)
35
Load shedding
A reaction to overload
Drop is a system-level operator that randomly drops tuples from a stream at a specified rate
Two approaches:
  1. Load shedding by dropping tuples
  2. Load shedding by filtering tuples
36
Load shedding
Load shedding by dropping tuples:
  Reduces the amount of Aurora processing by dropping randomly selected tuples at strategic points in the network
37
Load shedding
Load shedding by filtering tuples:
  Idea: remove less important tuples rather than randomly chosen ones
  It uses value-based QoS information
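The two load-shedding strategies differ only in how they choose which tuples to discard. The sketch below contrasts them; the function names `drop` and `shed_by_value`, the utility function, and the threshold are hypothetical stand-ins for whatever the value-based QoS graph would provide.

```python
import random


def drop(stream, drop_rate, rng):
    """Randomly discard roughly `drop_rate` of the tuples (0.0 keeps all, 1.0 drops all)."""
    return [t for t in stream if rng.random() >= drop_rate]


def shed_by_value(stream, utility, threshold):
    """Discard tuples whose value-based utility is below `threshold`."""
    return [t for t in stream if utility(t) >= threshold]


readings = [{"ts": i, "temp": v} for i, v in enumerate([20, 95, 21, 99])]


def importance(tup):
    # Assumed value-based QoS: abnormal (very hot) readings matter most.
    return 1.0 if tup["temp"] > 90 else 0.2


sampled = drop(readings, drop_rate=0.5, rng=random.Random(0))
important = shed_by_value(readings, importance, threshold=0.5)
```

Random dropping may discard an abnormal reading, whereas value-based filtering keeps both readings above 90 and sheds only the unremarkable ones.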
39
Questions
1. Which of the following operators output tuples that have the same schema and values as input tuples?
  a. Aggregate
  b. BSort (x)
  c. Filter (x)
  d. Join
  e. Map
  f. Resample
  g. Union (x)
40
Questions
2. What does Aurora's primary run-time architecture include?
  a. Router
  b. Storage manager (x)
  c. Scheduler (x)
  d. Box processor
  e. QoS monitor (x)
  f. Resample
  g. Load shedder (x)
41
Three broad application types
Aurora addresses three broad application types in a single, unique framework:
  1. Real-time monitoring applications continuously monitor the present state of the world and are thus interested in the most current data as it arrives from the environment. In these applications, there is little or no need (or time) to store such data.
  2. Archival applications are typically interested in the past. They are primarily concerned with processing large amounts of finite data stored in a time-series repository.
  3. Spanning applications involve both the present and past states of the world, and require combining and comparing incoming live data with stored historical data. These applications are the most demanding because they must balance real-time requirements with efficient processing of large amounts of disk-resident data.