1
Models and Issues in Data Streaming
Presented by: Ankur Jain, Department of Computer Science, 6/23/03
A list of relevant papers is available at http://www.cs.ucsb.edu/~ankurj/papers.html
2
Outline
- Data Streaming
  - Introduction to Data Streaming and Sensor Data
  - Challenges
  - Common Applications
  - Active research projects in the area
  - Common models and issues
- Adaptive Filters for Continuous Queries over Distributed Data Streams (Widom, Olston and Jiang)
  - Presentation of the algorithm
  - Experimental results and Conclusions
3
Data Streams
- Data input as continuous and ordered data streams
- Characteristics of Data Streams
  - Sequential access
  - Bounded main memory
  - History/arrival order is significant
  - Applications may have real time requirements
  - Unpredictable/variable data arrival and characteristics
  - Imprecise/noisy data
  - Continuous queries (CQ)
4
Challenges
- Application stream rates exceed DBMS capacity
- Management of high stream rates using efficient and adaptive sampling techniques
- Building efficient query plans
- Optimize memory/disk usage
5
Applications
- Network monitoring and traffic engineering
- Telecom call records
  - Caller IDs, call destinations
- Network Security
  - Packet source/destination addresses
- Financial Applications
  - Stock Tickers
- Sensor Networks
  - Monitoring temperature, sound, light, motion tracking
- Manufacturing processes
  - Process monitoring
- Web logs and click streams
- Massive data sets
6
Issues in Sensor Data Streams
- Communication
  - High variance, limited bandwidth, frequently dropped packets
- Computation
  - Limited computational capacity and limited memory size
- Uncertainty in sensor readings
- Low Power Capacity
7
Data Stream Projects
- The Cougar Project (Cornell)
  - Sensors form a distributed database system
  - Cross-layer optimizations (data management layer and the routing layer)
- Telegraph Project (Berkeley)
  - Adaptive routing between the sensor nodes
  - Adaptive query processing
- STREAM (STanford stREam dAta Manager) (Stanford)
  - Building a new data stream management system
  - TRAPP (Tradeoff in Replication Precision and Performance)
    - Have adaptive filters at remote sources
  - Constructing adaptive query plans
  - Optimize communication costs
- Aurora Project (Brown/MIT)
8
Interesting research areas
The main issues in data streaming appear to be:
- Data integration/fusion
- Adaptive data filtering
  - Some relevant work has been done (from the STREAM project, to be presented shortly)
- Managing data on non-linear models
  - No relevant work done so far
Most of the projects consider simple stream problems such as traffic monitoring or stock tickers
- Consider the problem of data integration from multiple cameras in a parking lot
- Monitoring traffic on a stretch of a freeway
- Monitoring enemy activity over an area using multiple cameras
Need for adaptive sampling where rapid updates are available from interesting areas such as:
- Areas where there is increased enemy activity
- Freeway stretch where there is unusually slow/fast traffic
- Haphazard/suspicious vehicle movement in a parking lot
9
Adaptive Filters for Continuous Queries over Distributed Data Streams
Appeared in SIGMOD 2003
Chris Olston, Jing Jiang and Jennifer Widom
Stanford University
10
Environment in Consideration
- Some applications do not require exact precision for their queries.
- Distributed sources (sensors) at remote locations continuously update streams to a central stream processor
- Users register continuous queries (CQ) with the central processor with quantitative precision constraints
- The central processor installs filters at remote locations with bound widths depending on the given precision constraint
11
Goals
- Reduce the communication overhead incurred in the presence of rapid stream updates
- Trade precision for communication overhead at a fine granularity
- The filters should have the capability to adapt to changing conditions to minimize stream rates
12
Example Applications
- Wireless Sensor Networks
  - Monitoring environmental conditions such as light, temperature, sound etc.
- Stock quote services
- Network Traffic Monitoring
  - Network packet arrival logs at router level
- Online Auctions
- Wide Area resource accounting
- Load Balancing for replicated servers
13
Overview
- A bounded approximate answer is a pair of real values L and H that define an interval [L, H]
- A precision constraint δ ≥ 0 for a CQ is defined such that 0 ≤ H - L ≤ δ at all times
- For each remote object O the filter maintains a bound [L_O, H_O] of width W_O
- If V is the latest value for O that passed the filter then L_O := V - W_O / 2 and H_O := V + W_O / 2
- The central stream processor keeps a cached copy of [L_O, H_O] based on filtered updates from O's source
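The source-side filter described on this slide can be sketched in a few lines of Python. This is only an illustration: the class name BoundFilter and the methods centered_on and offer are hypothetical, and the sketch assumes that a value lying exactly on L or H counts as inside the bound.

```python
from dataclasses import dataclass


@dataclass
class BoundFilter:
    """Illustrative filter for one object O (names are hypothetical)."""
    width: float  # W_O
    low: float    # L_O
    high: float   # H_O

    @classmethod
    def centered_on(cls, value: float, width: float) -> "BoundFilter":
        # L_O := V - W_O / 2 and H_O := V + W_O / 2
        return cls(width, value - width / 2, value + width / 2)

    def offer(self, value: float) -> bool:
        """Return True if the value must be sent to the stream processor."""
        if self.low <= value <= self.high:
            return False  # inside the bound: the update is suppressed
        # Outside the bound: the update is sent and the bound is recentered on it
        self.low = value - self.width / 2
        self.high = value + self.width / 2
        return True
```

With a width of 4 centered on 10, the bound is [8, 12]: a reading of 11 is suppressed, while a reading of 13 is transmitted and moves the bound to [11, 15].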
14
The System (diagram)
- Data Sources: generate streams of updates (V_1 ... V_n)
- Filters: intercept update streams and forward those updates that fall outside the bound [L_i, H_i]; periodically shrink the bound
- Stream Processor
  - CQ Evaluator: produces bounded answers
  - Bound Cache: maintains a copy of the bound [L_i, H_i] for each object
  - Precision Manager: reallocates bound width and sends growth messages (bound shrinking, selective growing)
- User: registers queries + precision constraints and receives bounded answers
15
Algorithm Details
- Initially the bounds can be set in any way as long as they meet the precision constraints (e.g. by uniform allocation)
- The bounds are reallocated adaptively among the objects participating in each query (bound shrinking and selective growing)
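As a small illustration of uniform allocation, the sketch below assumes a SUM query over n objects, where the answer bound is taken to be the sum of the individual bounds, so the widths must add up to at most δ. The function names are hypothetical and the SUM case is an assumption chosen for the example, not something stated on the slide.

```python
def uniform_widths_for_sum(delta: float, n: int) -> list[float]:
    """Split the precision constraint delta evenly across n objects."""
    return [delta / n] * n


def sum_answer_bound(bounds: list[tuple[float, float]]) -> tuple[float, float]:
    """Answer bound of a SUM query: add up the lower and the upper ends."""
    return sum(low for low, _ in bounds), sum(high for _, high in bounds)


# Three objects with current values 10, 20, 30 and delta = 6
widths = uniform_widths_for_sum(6.0, 3)
bounds = [(v - w / 2, v + w / 2) for v, w in zip([10.0, 20.0, 30.0], widths)]
low, high = sum_answer_bound(bounds)
```

Here each object receives a width of 2, and the answer bound is [57, 63], whose width of 6 meets the constraint.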
16
Bound Shrinking (Algo. details cont.)
- Periodically, every T time units, O_i's bound width is decreased symmetrically at both the source and the stream coordinator as W_i := W_i · (1 - S), where T (adjustment period) and S (shrink percentage) are determined experimentally
- Each time the bound width shrinks, the source must reapply the filter to the current data value V_i. If this value does not pass the filter, the source must put it on the update stream.
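A rough sketch of one shrink step follows. It assumes that "symmetrically" means the bound shrinks around its midpoint, and that a value falling outside the shrunk bound is sent and the bound is recentered on it; both are assumptions of this example, and the function names are hypothetical.

```python
def shrink_bound(low: float, high: float, s: float) -> tuple[float, float]:
    """Shrink the bound around its midpoint: W := W * (1 - S)."""
    center = (low + high) / 2
    new_width = (high - low) * (1 - s)
    return center - new_width / 2, center + new_width / 2


def shrink_and_refilter(low: float, high: float, current_value: float, s: float):
    """Shrink, then reapply the filter to the current value V_i.

    Returns (low, high, must_send). If the value no longer fits, it must be
    put on the update stream and the bound is recentered on it.
    """
    low, high = shrink_bound(low, high, s)
    must_send = not (low <= current_value <= high)
    if must_send:
        width = high - low
        low, high = current_value - width / 2, current_value + width / 2
    return low, high, must_send
```

For example, shrinking [8, 12] with S = 0.25 gives [8.5, 11.5], which still contains a current value of 10.5, so nothing is sent. With S = 0.5 the bound becomes [9, 11], and a current value of 11.9 has to be transmitted.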
17
Bound Growing (Algo. details cont.)
- Each object is assigned a burden score B_i based on its stream transmission cost C_i, estimated stream update period P_i and the current bound width W_i.
- Each query is assigned a burden target T_i by either averaging burden scores or invoking a linear solver
- A deviation value D_i is based on the difference between the burden score and the burden target
- The objects are considered in order of decreasing deviation and each object is assigned the maximum possible bound growth ∆W_i
18
Burden Score and Burden Target (Algo. details cont.)
- The burden score B_i is computed as B_i = C_i / (P_i · W_i)
  - C_i is the cost to send a stream update of object O_i, W_i is the bound width
  - P_i = T / N_i, where N_i is the number of updates of O_i received by the stream coordinator in the last T time units
- The burden target T_i is the lowest overall burden required of the objects in the query at all times. For simple cases it is equal to the average of the burden scores of objects in the query
- Deviation: [equation image not preserved]
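The burden score and the simple-case burden target can be computed directly from the definitions above. The sketch below is illustrative only: the function names are hypothetical, and treating an object with no updates in the last T time units as having an infinite update period (and therefore zero burden) is an assumption of this example.

```python
def update_period(t: float, n_updates: int) -> float:
    """P_i = T / N_i. With no updates, the period is taken as infinite (assumption)."""
    return t / n_updates if n_updates > 0 else float("inf")


def burden_score(cost: float, period: float, width: float) -> float:
    """B_i = C_i / (P_i * W_i)."""
    return cost / (period * width)


def simple_burden_target(scores: list[float]) -> float:
    """Simple case: the average of the burden scores of the objects in the query."""
    return sum(scores) / len(scores)


# Two objects observed over T = 10 time units, unit transmission cost
b1 = burden_score(1.0, update_period(10.0, 5), 4.0)   # P = 2, W = 4
b2 = burden_score(1.0, update_period(10.0, 10), 2.0)  # P = 1, W = 2
target = simple_burden_target([b1, b2])
```

The second object updates more often and has a narrower bound, so its burden score (0.5) is higher than the first object's (0.125); the simple target is their average.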
19
Maximum bound growth (Algo. details cont.)
- The maximum possible amount by which the bound can be grown is: [equation image not preserved]
- For each nonzero growth value, the precision manager increases the width for O_i by setting L_i := L_i - ∆W_i / 2 and H_i := H_i + ∆W_i / 2
- After all the growth has been allocated, the precision manager sends update messages to all sources whose bound width has been modified (grown)
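The growth step itself can be sketched as follows. Because the formula for the maximum growth is not reproduced in this text, the sketch takes the deviations D_i and the growth values ∆W_i as given inputs; the function name apply_growth and the dictionary layout are hypothetical.

```python
def apply_growth(bounds: dict, deviations: dict, growth: dict) -> list:
    """Grow bounds in order of decreasing deviation.

    bounds:     object id -> (L_i, H_i), updated in place
    deviations: object id -> D_i (assumed to be computed elsewhere)
    growth:     object id -> delta W_i (assumed to be computed elsewhere)
    Returns the ids of the sources that need a growth message.
    """
    notified = []
    for obj in sorted(bounds, key=lambda o: deviations[o], reverse=True):
        dw = growth.get(obj, 0.0)
        if dw > 0:
            low, high = bounds[obj]
            # L_i := L_i - dW_i / 2 and H_i := H_i + dW_i / 2
            bounds[obj] = (low - dw / 2, high + dw / 2)
            notified.append(obj)
    return notified


bounds = {"a": (8.0, 12.0), "b": (0.0, 2.0), "c": (5.0, 6.0)}
deviations = {"a": 0.1, "b": 0.4, "c": 0.0}
growth = {"a": 1.0, "b": 2.0, "c": 0.0}
notified = apply_growth(bounds, deviations, growth)
```

Only objects with nonzero growth are modified, so only their sources would receive a message; object "c" keeps its bound and is not notified.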
20
Precision Constraint Adjustments and Latency (Algo. details cont.)
- If δ_j increases then the additional bound width is allocated automatically by the bound growth algorithm
- If δ_j decreases (stronger precision) then the automatic bound shrinking will reduce the answer bound until the requested precision level is reached. For immediate improvement the precision manager needs to send explicit shrink messages
- Source filters timestamp all updates transmitted to the stream processor
- The precision manager timestamps all bound width updates with an adjustment period boundary
21
Experiments
The performance of the proposed model was tested on network traffic volumes, which are of interest to ISPs for security, billing, and infrastructure planning. Some example queries include:
- Q_1: Monitor the volume of remote login requests
- Q_2: Monitor the volume of incoming traffic received within the organization
- Q_3: Monitor the volume of incoming SYN packets
22
Results
Comparison of the overall communication cost (not including growth message communication costs) incurred by the adaptive algorithm against uniform static allocation, with cost measured over 21 hours. The CQ monitors the average traffic level with varying precision constraint δ
23
Results (cont.)
Results of comparing the idealized version of the proposed algorithm against the optimized static allocation, using a continuous AVG query over 10 data sources under uniform costs
24
Conclusions
- Experimental results show that the proposed approach saves communication cost at fine granularity by individually adjusting precision constraints
- The experiments were based on simple examples of network traffic with a few hosts.
- The values of S and T were determined experimentally. The effect of varying T on the quality of answers is not available. Evaluating S experimentally may not be feasible in all cases
- The streamed update period P_i = T / N_i takes into consideration only the updates in the last T time units. Considering the complete history of updates (Kalman filter) might show interesting results!