Data Streams
S. Sudarshan, CS632 Course, Mar 2004, IIT Bombay
Overview
- Two approaches
- Motwani et al. [CIDR03]
  - Concentrate on query processing, approximations
- Cherniack et al. [CIDR03]
  - Concentrate on system architecture with data-flow style processing
  - DAG of operators to be created by users
Motwani et al.
- Query language (SQL extension)
- Semantics of stream queries
- Query plans with sharing and approximation
- Reducing memory requirements for query processing by exploiting constraints on data
- Techniques for static and dynamic approximation of query results
- Resource allocation to queries
- Techniques for data compression
Motwani: Language and Semantics
- Relations and streams
  - Streams are timestamped
  - Tuple s arrives at time t: <t, s>
- Istream and Dstream operators
  - Create stream from relation insert/delete
- Query language implicitly converts from streams to relations and vice versa
  - Any query with stream at outer level gives stream output
  - Window operations convert streams to relations
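The stream/relation conversions above can be illustrated with a minimal sketch. It assumes a relation is given as a list of (time, set of tuples) snapshots and uses sets rather than bags for brevity; the function names `istream`, `dstream`, and `time_window` are hypothetical and chosen only for illustration.

```python
def istream(snapshots):
    """Emit <t, s> for each tuple s inserted into the relation at time t.

    snapshots: list of (t, set_of_tuples) in increasing time order.
    """
    out, prev = [], set()
    for t, rel in snapshots:
        out.extend((t, s) for s in sorted(rel - prev))
        prev = rel
    return out


def dstream(snapshots):
    """Emit <t, s> for each tuple s deleted from the relation at time t."""
    out, prev = [], set()
    for t, rel in snapshots:
        out.extend((t, s) for s in sorted(prev - rel))
        prev = rel
    return out


def time_window(stream, now, width):
    """Stream -> relation: tuples whose timestamp lies in (now - width, now]."""
    return {s for t, s in stream if now - width < t <= now}
```

For example, with snapshots `[(1, {"a"}), (2, {"a", "b"}), (3, {"b"})]`, `istream` reports the insertions of "a" and "b" and `dstream` reports the deletion of "a" at time 3.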
Motwani: Operators and Queues
- Operations: create, changemem, run
- Synopsis
  - Summarizes tuples seen on a stream
  - Operations: create, changemem, insert/delete, query
- Resource sharing
  - Sharing of synopses
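A synopsis with the operations listed above might look like the following sketch. The class `WindowSynopsis` and its eviction policy (drop the oldest tuples when memory shrinks) are assumptions made for illustration, not the interface of any actual system.

```python
from collections import deque


class WindowSynopsis:
    """Hypothetical synopsis that keeps at most max_tuples recent tuples."""

    def __init__(self, max_tuples):          # "create"
        self.max_tuples = max_tuples
        self.tuples = deque()

    def changemem(self, max_tuples):
        """Change the memory budget; evict oldest tuples if it shrank."""
        self.max_tuples = max_tuples
        self._evict()

    def insert(self, tup):
        self.tuples.append(tup)
        self._evict()

    def delete(self, tup):
        """Remove one occurrence of tup, if present."""
        try:
            self.tuples.remove(tup)
        except ValueError:
            pass

    def query(self, predicate=lambda tup: True):
        return [t for t in self.tuples if predicate(t)]

    def _evict(self):
        while len(self.tuples) > self.max_tuples:
            self.tuples.popleft()
```

Shrinking the budget with `changemem` makes later `query` answers approximate, since evicted tuples are no longer visible.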
Motwani: Constraints
- Constraints:
  - Many-one join and referential integrity constraints between two streams
  - Clustered/ordered arrival constraints
- Can exploit to avoid storing/examining history
  - Implicit addition of now window
Motwani: Scheduling
- Greedy approach:
  - Schedule operator that consumes largest number of tuples per time unit and produces the fewest tuples
  - Minimize queue lengths
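A minimal sketch of the greedy choice follows. The slide gives two criteria (high consumption rate, low output); combining them as `rate * (1 - selectivity)`, i.e. the net number of queued tuples removed per time unit, is an assumption made here, and the `Operator` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    name: str
    rate: float         # tuples consumed per time unit
    selectivity: float  # output tuples produced per input tuple
    queued: int         # tuples waiting in the input queue


def pick_next(operators):
    """Return the ready operator that shrinks total queue size fastest."""
    ready = [op for op in operators if op.queued > 0]
    if not ready:
        return None
    # Assumed priority: net tuples removed from queues per time unit.
    return max(ready, key=lambda op: op.rate * (1.0 - op.selectivity))
```

Under this priority a selective filter is run before a projection with the same backlog, because the filter shortens downstream queues while the projection only moves tuples along.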
Motwani: Approximation
- Static and dynamic approximation
- Static: modify the query
  - Window reduction
  - Sampling rate reduction
- Dynamic
  - Synopsis compression
  - Sampling
  - Load shedding
- Resource allocation to maximize precision
  - Simple precision metric: (FP, FN)
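The sketch below illustrates load shedding by random dropping and a (FP, FN) precision measure. The definitions used here (FP as the fraction of the approximate answer that is not in the exact answer, FN as the fraction of the exact answer that is missing) are one plausible reading of the metric, stated as an assumption; the function names are hypothetical.

```python
import random


def shed_load(tuples, keep_prob, rng):
    """Drop tuples at random; each tuple is kept with probability keep_prob."""
    return [t for t in tuples if rng.random() < keep_prob]


def precision(exact, approx):
    """Return (FP, FN) for an approximate answer against the exact one.

    Assumed definitions: FP = |approx - exact| / |approx|,
                         FN = |exact - approx| / |exact|.
    """
    exact, approx = set(exact), set(approx)
    fp = len(approx - exact) / len(approx) if approx else 0.0
    fn = len(exact - approx) / len(exact) if exact else 0.0
    return fp, fn
```

For a simple filter query, shedding input tuples can only lose answers, so FP stays at zero and the cost shows up entirely in FN.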
Motwani: Implementation
- Entities
  - Operators, queues, synopses
- Control tables
  - Attribute value pairs used to control the entity
- Query plan
  - Network of entities
Cherniack et al.
- Query model
  - Set of relational operations
  - Push based query processing
  - "tumble" aggregate operation outputs results whenever window is complete
- Distribution model
  - Aurora* vs Medusa
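A push-style tumbling aggregate can be sketched as a generator that emits a result only once a window is known to be complete. The sketch assumes fixed-width, non-overlapping time windows and in-order timestamps; this is a simplification for illustration and not necessarily how the Aurora operator defines its windows.

```python
def tumble(stream, width, agg):
    """Yield (window_start, agg(values)) each time a window closes.

    stream: iterable of (timestamp, value) with nondecreasing timestamps.
    A window [start, start + width) closes when a later tuple arrives.
    """
    current = None
    values = []
    for t, v in stream:
        start = (t // width) * width
        if current is not None and start != current:
            yield (current, agg(values))   # window is complete: push result
            values = []
        current = start
        values.append(v)
    # The last window is still open, so nothing is emitted for it.
```

With `width=5` and `agg=sum`, the tuples `(0, 1), (3, 2), (5, 10), (9, 20), (12, 7)` produce results for the windows starting at 0 and 5, while the window starting at 10 remains open.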
Cherniack: Communication Infrastructure
- Naming and discovery
  - Catalogs to find streams/queries
  - Each Aurora network binds its inputs and outputs to streams
- Tuples are routed based on who consumes the stream
  - Single copy of stream per participant
- Single TCP connection between sites
  - Streams multiplexed within connection
  - Allows control on QoS
- Remote definition of streams
Cherniack: Economic Model
- Economic model of load management and sharing
- Agoric system
- Source is paid, sink pays
Cherniack: Load Management
- Redistribution of computation while system is active
- Periodic repartitioning of computation
Cherniack: Re-partitioning
- Operator nodes can be reallocated across processor boundaries when required (box sliding)
  - Pairwise interaction
- Box splitting to parallelize operation
  - Filters to decide which tuples go where
  - Subsequent re-merging of streams
  - Merge has to handle groups partially aggregated before split
  - Deciding on split criterion
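Box splitting for an aggregate can be sketched as follows: a filter predicate routes each tuple to one of two copies of the box, each copy computes partial per-group results, and a merge step combines the partial groups. The example uses a per-group count and hypothetical function names; it is an illustration of the idea, not the system's actual operators.

```python
from collections import Counter


def split(tuples, predicate):
    """Route each tuple to one of two box copies based on a filter predicate."""
    left = [t for t in tuples if predicate(t)]
    right = [t for t in tuples if not predicate(t)]
    return left, right


def partial_count(tuples):
    """Per-group count computed by one copy of the aggregate box."""
    return Counter(group for group, _ in tuples)


def merge(partials):
    """Combine partial per-group counts; a group may appear in several inputs."""
    total = Counter()
    for p in partials:
        total.update(p)   # Counter.update adds counts for existing keys
    return total
```

Because the split predicate here need not align with the grouping attribute, the same group can be partially counted on both sides, which is why the merge must add partial results rather than simply concatenate them.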
High Availability
- Basic idea: K-safe
  - Each server can act as backup for its downstream server
  - Low overhead
  - Tuples are discarded after ensuring they are saved elsewhere if they may be needed again
- K-safe
  - Data is safe as long as < k sites fail
  - Keep copies of in-transit tuples at k upstream servers
Cherniack: Availability
- Remote queue truncation technique
  - Flow messages
  - Process pair
- Sequence numbering of tuples
  - Earliest tuple that a box depends on
  - E.g. stateless (e.g. filter) and stateful (e.g. aggregates)
  - Stored and sent back periodically to upstream server
- More complex with split operators
  - Split operator "merges" messages
  - When to truncate: min of all truncation msgs
- Failure detection and recovery
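The truncation rule can be sketched with an upstream output log that is trimmed only up to the minimum sequence number still needed by any downstream consumer. The class `UpstreamQueue` and its method names are hypothetical, and the message format is an assumption made for illustration.

```python
class UpstreamQueue:
    """Hypothetical upstream output log kept for downstream recovery."""

    def __init__(self, consumers):
        self.log = []                         # list of (seq, tuple)
        self.next_seq = 0
        # Earliest sequence number each downstream consumer still depends on.
        self.needed = {c: 0 for c in consumers}

    def emit(self, tup):
        """Assign the next sequence number and retain the tuple."""
        seq = self.next_seq
        self.log.append((seq, tup))
        self.next_seq += 1
        return seq

    def on_truncation_msg(self, consumer, earliest_needed):
        """Record a consumer's report and truncate up to the min over all."""
        self.needed[consumer] = max(self.needed[consumer], earliest_needed)
        cutoff = min(self.needed.values())
        self.log = [(s, t) for s, t in self.log if s >= cutoff]

    def replay(self, from_seq):
        """Tuples to resend to a recovering consumer."""
        return [t for s, t in self.log if s >= from_seq]
```

A stateless box such as a filter can report the latest tuple it has processed, while a stateful box such as an aggregate must report the earliest tuple contributing to any open window, so its truncation point trails further behind.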
Cherniack: Availability
- Process pair techniques
  - Traditional way in distributed systems
  - More msgs during operation, less work during recovery
- Further extension to virtual machines
Cherniack: Policy Specification and Guidelines
- QoS based control in Aurora*
- Economic contract based control in Medusa