S. Sudarshan CS632 Course, Mar 2004 IIT Bombay


1 Data Streams
S. Sudarshan CS632 Course, Mar 2004 IIT Bombay

2 Overview
Two approaches:
  Motwani et al. [CIDR03]
    Concentrate on query processing and approximations
  Cherniack et al. [CIDR03]
    Concentrate on system architecture with data-flow style processing
    DAG of operators to be created by users

3 Motwani et al.
Query language (SQL extension)
Semantics of stream queries
Query plans with sharing and approximation
Reducing memory requirements for query processing by exploiting constraints on data
Techniques for static and dynamic approximation of query results
Resource allocation to queries
Techniques for data compression

4 Motwani: Language and Semantics
Relations and streams
  Streams are timestamped
  Tuple s arrives at time t: <t, s>
Istream and Dstream operators
  Create a stream from the inserts/deletes on a relation
Query language implicitly converts from streams to relations and vice versa
  Any query with a stream at the outer level gives stream output
  Window operations convert streams to relations
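
The Istream/Dstream and window bullets above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the STREAM implementation: relations are modelled as sets (no duplicates), the window is time-based, and the relation is only re-evaluated at timestamps where a tuple arrives. The function names are hypothetical.

```python
def window_snapshots(stream, width):
    """Turn a stream of (t, s) pairs into a time-varying relation.

    Assumption: the relation at time t holds every tuple whose
    timestamp lies in (t - width, t].
    """
    times = sorted({t for t, _ in stream})
    return [(t, {s for ts, s in stream if t - width < ts <= t}) for t in times]


def istream(snapshots):
    """Emit <t, s> for each s inserted into the relation at time t."""
    prev, out = set(), []
    for t, rel in snapshots:
        out.extend((t, s) for s in sorted(rel - prev))
        prev = rel
    return out


def dstream(snapshots):
    """Emit <t, s> for each s deleted from the relation at time t."""
    prev, out = set(), []
    for t, rel in snapshots:
        out.extend((t, s) for s in sorted(prev - rel))
        prev = rel
    return out


stream = [(1, "a"), (2, "b"), (3, "c")]
snaps = window_snapshots(stream, width=2)
inserted = istream(snaps)
deleted = dstream(snaps)
```

With a window of width 2, tuple "a" leaves the relation at time 3, so it is the only tuple Dstream reports.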

5 Motwani: Operators and Queues
  Create, changemem, run
Synopsis
  Summarizes tuples seen on a stream
  Create, changemem, insert/delete, query
Resource sharing
  Sharing of synopses

6 Motwani: Constraints
Constraints:
  Many-one join and referential integrity constraints between two streams
  Clustered/ordered arrival constraints
Can be exploited to avoid storing/examining history
  Implicit addition of now window

7 Motwani: Scheduling
Greedy approach:
  Schedule the operator that consumes the largest number of tuples per time unit and produces the fewest tuples
  Minimize queue lengths
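
A minimal sketch of the greedy rule above, assuming each operator is described by a consumption rate and a selectivity (output tuples per input tuple). The priority formula rate * (1 - selectivity), i.e. the net number of queued tuples removed per time unit, is one plausible reading of the slide, not necessarily the exact formula used by Motwani et al. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    rate: float         # tuples consumed per time unit (assumed known)
    selectivity: float  # tuples produced per tuple consumed (assumed known)
    queued: int         # current length of the operator's input queue


def priority(op):
    """Net reduction in queued tuples per time unit if op is scheduled."""
    return op.rate * (1.0 - op.selectivity)


def pick_next(ops):
    """Greedy choice: among operators with pending input, take the one
    that shrinks total queue length fastest. Returns None if all idle."""
    ready = [op for op in ops if op.queued > 0]
    if not ready:
        return None
    return max(ready, key=priority)


ops = [
    Op("filter", rate=10, selectivity=0.1, queued=5),
    Op("join", rate=4, selectivity=2.0, queued=5),
    Op("project", rate=10, selectivity=1.0, queued=0),
]
first = pick_next(ops)
```

Here the filter wins because it consumes quickly and passes on few tuples, while the join would actually grow the downstream queue.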

8 Motwani: Approximation
Static and dynamic approximation
Static: modify the query
  Window reduction
  Sampling rate reduction
Dynamic
  Synopsis compression
  Sampling
  Load shedding
Resource allocation to maximize precision
  Simple precision metric: (FP, FN), i.e. false positives and false negatives
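
As a rough illustration of load shedding and the (FP, FN) metric, the sketch below drops tuples at random and then compares an approximate answer set against the exact one. It assumes query answers can be treated as sets and that FP and FN are simple counts; the function names and the drop-probability parameter are hypothetical.

```python
import random


def shed(tuples, drop_prob, seed=0):
    """Load shedding by random sampling: keep each tuple with
    probability 1 - drop_prob. The seed makes runs repeatable."""
    rng = random.Random(seed)
    return [t for t in tuples if rng.random() >= drop_prob]


def precision(exact, approx):
    """Return (FP, FN): answers reported but wrong, and answers missed."""
    exact, approx = set(exact), set(approx)
    return len(approx - exact), len(exact - approx)


fp_fn = precision(exact={1, 2, 3}, approx={2, 3, 4})
```

In this example the approximate answer contains one spurious value (4) and misses one true value (1), so the metric is (1, 1).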

9 Motwani: Implementation
Entities
  Operators, queues, synopses
Control tables
  Attribute-value pairs used to control the entity
Query plan
  Network of entities

10 Cherniack et al.
Query model
  Set of relational operations
  Push-based query processing
  “tumble” aggregate operation outputs results whenever window is complete
Distribution model
  Aurora* vs Medusa
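
The push-based "tumble" idea can be sketched as an operator that buffers incoming tuples and pushes one aggregate downstream each time a window fills. The fixed-size, count-based window used here is an assumption made for brevity and is not claimed to match Aurora's own window definition; the class name and constructor parameters are hypothetical.

```python
class Tumble:
    """Push-based tumbling aggregate over non-overlapping windows."""

    def __init__(self, size, agg, emit):
        self.size = size   # tuples per window (assumed count-based)
        self.agg = agg     # aggregate function over a list of values
        self.emit = emit   # downstream callback, called once per window
        self.buf = []

    def push(self, value):
        """Called by the upstream operator for every arriving tuple."""
        self.buf.append(value)
        if len(self.buf) == self.size:
            # Window complete: push the result downstream and start over.
            self.emit(self.agg(self.buf))
            self.buf = []


results = []
op = Tumble(size=3, agg=sum, emit=results.append)
for v in range(1, 8):
    op.push(v)
```

After pushing the values 1 through 7, two windows have completed and the seventh value is still buffered, waiting for its window to fill.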

11 Cherniack: Communication Infrastructure
Naming and discovery
  Catalogs to find streams/queries
  Each Aurora network binds its inputs and outputs to streams
Tuples are routed based on who consumes the stream
  Single copy of stream per participant
Single TCP connection between sites
  Streams multiplexed within connection
  Allows control over QoS
Remote definition of streams

12 Cherniack: Economic Model
Economic model of load management and sharing
  Agoric system
  Source is paid, sink pays

13 Cherniack: Load Management
Redistribution of computation while system is active
Periodic repartitioning of computation

14 Cherniack: Re-partitioning
Operator nodes can be reallocated across processor boundaries when required (box sliding)
  Pairwise interaction
Box splitting to parallelize operation
  Filters to decide which tuples go where
  Subsequent re-merging of streams
  Merge has to handle groups partially aggregated before split
  Deciding on split criterion
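
To make the point about merging partially aggregated groups concrete, here is a small sketch of splitting an average-per-group box across two copies. Each copy produces (sum, count) partials, and the merge step combines them before dividing. The split predicate, the (key, value) tuple layout, and all function names are assumptions for illustration only.

```python
from collections import defaultdict


def split(tuples, pred):
    """Filter step: route each tuple to one of two box copies."""
    left = [t for t in tuples if pred(t)]
    right = [t for t in tuples if not pred(t)]
    return left, right


def partial_agg(tuples):
    """Run on each copy: per-group (sum, count), not the final average."""
    acc = defaultdict(lambda: [0, 0])
    for key, value in tuples:
        acc[key][0] += value
        acc[key][1] += 1
    return {k: tuple(v) for k, v in acc.items()}


def merge(*partials):
    """Combine partial aggregates, then finish the average per group."""
    total = defaultdict(lambda: [0, 0])
    for part in partials:
        for key, (s, c) in part.items():
            total[key][0] += s
            total[key][1] += c
    return {k: s / c for k, (s, c) in total.items()}


data = [("a", 1), ("b", 10), ("a", 3), ("b", 20), ("a", 5)]
left, right = split(data, lambda t: t[1] >= 5)
merged = merge(partial_agg(left), partial_agg(right))
```

Group "a" ends up on both sides of the split, which is why the copies must ship sums and counts rather than averages: averaging the two per-copy averages would give the wrong result.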

15 High Availability
Basic idea:
  Each server can act as backup for its downstream server
  Low overhead
  Tuples are discarded after ensuring they are saved elsewhere if they may be needed again
K-safe
  Data is safe as long as < k sites fail
  Keep copies of in-transit tuples at k upstream servers

16 Cherniack: Availability
Remote queue truncation technique
Flow messages
Process pair
Sequence numbering of tuples
  Earliest tuple that a box depends on
  E.g. stateless (e.g. filter) and stateful (e.g. aggregates)
  Stored and sent back periodically to upstream server
More complex with split operators
  Split operator “merges” messages
  When to truncate: min of all truncation msgs
Failure detection and recovery
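
The truncation rule ("min of all truncation msgs") can be sketched as an upstream output log keyed by sequence number. Each downstream consumer periodically reports the earliest sequence number it still depends on, and the upstream server discards only what every consumer has released. The class, method names, and message shape are hypothetical and do not describe the Aurora*/Medusa wire protocol.

```python
class UpstreamQueue:
    """Output log kept by an upstream server acting as backup."""

    def __init__(self, consumers):
        self.log = {}                              # seq -> tuple
        self.next_seq = 0
        self.needed = {c: 0 for c in consumers}    # earliest seq each needs

    def send(self, tup):
        """Assign a sequence number and retain a copy for recovery."""
        seq = self.next_seq
        self.log[seq] = tup
        self.next_seq += 1
        return seq

    def on_truncation_msg(self, consumer, earliest_needed):
        """A consumer reports the earliest tuple it still depends on."""
        self.needed[consumer] = max(self.needed[consumer], earliest_needed)
        cutoff = min(self.needed.values())         # safe for all consumers
        for seq in [s for s in self.log if s < cutoff]:
            del self.log[seq]

    def replay(self):
        """Tuples to resend, in order, if a downstream server fails."""
        return [self.log[s] for s in sorted(self.log)]


q = UpstreamQueue(consumers=["x", "y"])
for i in range(5):
    q.send(f"t{i}")
q.on_truncation_msg("x", 3)   # y has reported nothing yet: keep everything
kept_after_x = len(q.log)
q.on_truncation_msg("y", 2)   # min(3, 2) = 2: drop seq 0 and 1
```

A stateless box such as a filter can report the number just past the last tuple it processed, whereas a stateful box such as an aggregate has to report the oldest tuple still contributing to an open window, so it holds the cutoff back longer.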

17 Cherniack: Availability
Process pair techniques
  Traditional way in distributed systems
  More msgs during operation, less work during recovery
  Further extension to virtual machines

18 Cherniack: Policy Specification and Guidelines
QoS based control in Aurora*
Economic contract based control in Medusa

