1 Stream and Sensor Data Management
Zachary G. Ives, University of Pennsylvania
CIS 650 – Implementing Data Management Systems
November 17, 2008

2 Converting between Streams & Relations
- Stream-to-relation operators:
  - Sliding window: tuple-based (last N rows) or time-based (within a time range)
  - Partitioned sliding window: groups by keys, then applies a sliding window within each group
  - Is this necessary or minimal?
- Relation-to-stream operators:
  - Istream: stream-ifies any insertions to the relation
  - Dstream: stream-ifies the deletions
  - Rstream: the stream contains the set of tuples in the relation
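To make the two operator classes concrete, here is a minimal Python sketch of a tuple-based window, a time-based window, and an Istream operator. All class and function names are invented for illustration; this shows the semantics only and is not the STREAM implementation.

```python
from collections import deque

class RowsWindow:
    """Tuple-based sliding window: the relation is the last n stream tuples."""
    def __init__(self, n):
        self.n = n
        self.buf = deque()

    def insert(self, tup):
        self.buf.append(tup)
        if len(self.buf) > self.n:
            self.buf.popleft()                 # oldest tuple falls out of the window

    def relation(self):
        return list(self.buf)

class RangeWindow:
    """Time-based sliding window: the relation is all tuples within the last `seconds`."""
    def __init__(self, seconds):
        self.seconds = seconds
        self.buf = deque()                     # holds (timestamp, tuple) pairs

    def insert(self, ts, tup):
        self.buf.append((ts, tup))
        while self.buf and self.buf[0][0] < ts - self.seconds:
            self.buf.popleft()                 # expire tuples outside the time range

    def relation(self):
        return [t for _, t in self.buf]

def istream(previous, current):
    """Istream: emit the tuples inserted into the relation since the previous instant."""
    return [t for t in current if t not in previous]
```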

3 Some Examples
- Select *
  From S1 [Rows 1000], S2 [Range 2 Minutes]
  Where S1.A = S2.A And S1.A > 10
- Select Rstream(S.A, R.B)
  From S [Now], R
  Where S.A = R.A

4 Building a Stream System
- Basic data item is the element: a tuple tagged with a timestamp and an op, where op ∈ {+, -}
- Query plans need a few new (?) items:
  - Queues
    - Used for hooking operators together, especially over windows
    - (The assumption is that pipelining is generally not possible, and we may need to drop some tuples from the queue)
  - Synopses
    - The intermediate state an operator needs to carry around
    - Note that this is usually bounded by windows
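A minimal Python sketch of these plan-level pieces, assuming an element carries a tuple, a timestamp, and an op in {+, -}. The names and the drop-oldest policy are illustrative assumptions, not the STREAM implementation.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any

@dataclass
class Element:
    tup: Any          # the data tuple
    ts: float         # timestamp
    op: str           # '+' for an insertion, '-' for a deletion

class Queue:
    """Connects two operators; may have to drop elements under overload."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = deque()

    def push(self, elem):
        if len(self.buf) >= self.capacity:
            self.buf.popleft()                 # crude load shedding: drop the oldest element
        self.buf.append(elem)

    def pop(self):
        return self.buf.popleft() if self.buf else None

class JoinSynopsis:
    """Per-operator state: the window contents one join input must remember,
    bounded because the window itself is bounded."""
    def __init__(self):
        self.window = {}                       # join key -> tuples currently in the window

    def apply(self, elem, key):
        bucket = self.window.setdefault(key, [])
        if elem.op == '+':
            bucket.append(elem.tup)
        else:
            bucket.remove(elem.tup)

    def probe(self, key):
        return self.window.get(key, [])
```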

5 Example Query Plan
What's different here?

6 Some Tricks for Performance
- Sharing synopses across multiple operators
  - In a few cases, more than one operator may join with the same synopsis
- Can exploit punctuations or "k-constraints"
  - Analogous to interesting orders
  - Referential-integrity k-constraint: bound of k between the arrival of a "many" element and its corresponding "one" element
  - Ordered-arrival k-constraint: a window of at most k is needed to sort (see the sketch below)
  - Clustered-arrival k-constraint: bound on the distance between items with the same grouping attributes
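For instance, an ordered-arrival k-constraint says no element arrives more than k positions after its in-order position, so a buffer of k+1 elements suffices to emit a fully sorted stream. The sketch below is illustrative Python, not STREAM's code.

```python
import heapq

def sort_with_k_constraint(stream, k):
    """Yield (ts, payload) pairs in timestamp order, assuming no element
    arrives more than k positions later than its in-order position."""
    heap = []
    for seq, (ts, payload) in enumerate(stream):
        heapq.heappush(heap, (ts, seq, payload))
        if len(heap) > k:                      # k+1 buffered: the smallest is now safe to emit
            ts_out, _, p_out = heapq.heappop(heap)
            yield ts_out, p_out
    while heap:                                # flush the remaining buffered elements
        ts_out, _, p_out = heapq.heappop(heap)
        yield ts_out, p_out

# Elements at most 1 position out of order, sorted with a buffer of just 2:
print(list(sort_with_k_constraint([(1, 'a'), (3, 'c'), (2, 'b'), (4, 'd')], k=1)))
# [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
```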

7 Query Processing – "Chain Scheduling"
- Similar in many ways to eddies
- May decide to apply operators as follows:
  - Assume we know how many tuples can be processed in a time unit
  - Cluster groups of operators into "chains" that maximize the reduction in queue size per unit time
  - Greedily forward tuples into the most selective chain (see the sketch below)
  - Within a chain, process in FIFO order
- They also do a form of join reordering
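A highly simplified sketch of the greedy choice: pick the chain with the steepest drop in queued data per unit of processing time. Chains here are just lists of (selectivity, per-tuple cost) pairs; this illustrates the idea only and is not the paper's algorithm in full.

```python
def pick_chain(chains):
    """Pick the chain with the largest (fraction of tuples eliminated) per unit
    of per-tuple processing time, i.e., the steepest queue reduction."""
    def reduction_rate(chain):
        selectivity, cost = 1.0, 0.0
        for sel, c in chain:                   # each operator: (selectivity, per-tuple cost)
            selectivity *= sel
            cost += c
        return (1.0 - selectivity) / cost
    return max(chains, key=reduction_rate)

# Chain A filters aggressively but is slow; chain B is fast but keeps most tuples.
chain_a = [(0.1, 5.0)]                         # keeps 10% of tuples, 5 time units per tuple
chain_b = [(0.9, 1.0)]                         # keeps 90% of tuples, 1 time unit per tuple
assert pick_chain([chain_a, chain_b]) is chain_a   # 0.90/5 = 0.18 beats 0.10/1 = 0.10
```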

8 Scratching the Surface: Approximation
- They point out two areas where we might need to approximate output:
  - CPU is limited, and we need to drop some stream elements according to some probabilistic metric
    - Collect statistics via a profiler
    - Use the Hoeffding inequality to derive a sampling rate that maintains a confidence interval (see the sketch below)
  - May need to do similar things if memory usage is a constraint
- Are there other options? When might they be useful?
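As a rough illustration of how the Hoeffding inequality turns a desired error bound into a sample size (an assumption about how the profiler might be used, not the paper's exact formula): for values bounded in a range of width R, P(|sample mean − true mean| ≥ ε) ≤ 2·exp(−2nε²/R²), which can be solved for n.

```python
import math

def hoeffding_sample_size(value_range, epsilon, delta):
    """Number of samples so the sample mean is within epsilon of the true mean
    with probability at least 1 - delta, for values bounded in a range of
    width value_range (solving the Hoeffding bound for n)."""
    return math.ceil((value_range ** 2) * math.log(2.0 / delta) / (2.0 * epsilon ** 2))

# Values in [0, 100], error at most 5 with 95% confidence:
print(hoeffding_sample_size(100, 5, 0.05))     # 738 samples
```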

9 STREAM in General
- "Logical semantics first"
  - Starts with a basic data model: streams as timestamped sets
  - Develops a language and semantics, heavily based on SQL
- Proposes a relatively straightforward implementation
  - Interesting ideas like k-constraints
  - Interesting approaches like chain scheduling
- No real consideration of distributed processing

10 Aurora
- "Implementation first; mix and match operations from past literature"
- Basic philosophy: most of the ideas in streams existed in previous research
  - Sliding windows, load shedding, approximation, …
  - So let's borrow those ideas and focus on how to build a real system with them!
- Emphasis is on building a scalable, robust system
- Distributed implementation: Medusa

11 Queries in Aurora
- Oddly, there was no declarative query language in the initial version! (One was added for the commercial product)
- Queries are workflows of physical query operators (SQuAl)
- Many operators resemble relational algebra operators

12 Example Query

13 Some Interesting Aspects
- A relatively simple adaptive query optimizer
  - Can push filtering and mapping into many operators
  - Can reorder some operators (e.g., joins, unions)
- Needs built-in error handling
  - If a data source fails to respond within a certain amount of time, create a special alarm tuple (see the sketch below)
  - This propagates through the query plan
- Incorporates built-in load shedding and real-time scheduling to support QoS
- Has a notion of combining a query over historical data with data from a stream
  - Switches from a pull-based mode (reading from disk) to a push-based mode (reading from the network)
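A toy Python sketch of the alarm-tuple idea: if a source stays silent past a deadline, inject a marker tuple that flows through the plan like data. The ALARM sentinel and the poll() interface are invented for illustration and are not Aurora's actual representation.

```python
import time

ALARM = {"type": "alarm"}                      # sentinel "alarm tuple" injected into the plan

def read_with_timeout(source, timeout_s):
    """Poll a data source; if it fails to respond within timeout_s seconds,
    return an alarm tuple that downstream operators can recognize and propagate."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        tup = source.poll()                    # assumed interface: returns a tuple or None
        if tup is not None:
            return tup
        time.sleep(0.01)                       # back off briefly before polling again
    return ALARM                               # no data in time: signal the failure downstream
```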

14 The Medusa Processor
- Distributed coordinator between many Aurora nodes
- Scalability through federation and distribution
- Fail-over
- Load balancing

15 Main Components
- Lookup
  - Distributed catalog – schemas, where to find streams, where to find queries
- Brain
  - Query setup, load monitoring via I/O queues and statistics
  - A load distribution and balancing scheme is used
  - Very reminiscent of Mariposa!

16 Load Balancing
- Migration – an operator can be moved from one node to another
  - The initial implementation didn't support moving operator state
    - The state is simply dropped, and operator processing resumes
    - Implications for semantics?
  - There are plans to support state migration
- "Agoric system model to create incentives"
  - Clients pay nodes for processing queries
  - Nodes pay each other to handle load – pairwise contracts negotiated offline
  - Bounded-price mechanism – a price for migrating load, plus a specification of what a node will take on
- Does this address the weaknesses of the Mariposa model?

17 Some Applications They Tried
- Financial services (stock ticker)
  - The main issue is not volume, but problems with feeds
  - Two-level alarm system, where a higher-level alarm helps diagnose problems
  - Shared computation among queries
  - User-defined aggregation and mapping
  - This is the main application for the commercial version (StreamBase)
- Linear Road (sensor monitoring)
  - Traffic sensors on a toll road – the toll changes depending on how many cars are on the road
  - Combination of historical and continuous queries
- Environmental monitoring
  - Sliding-window calculations

18 Lessons Learned
- Historical data is important – not just stream data
  - (Summaries?)
- Sometimes synchronization is needed for consistency
  - "ACID for streams"?
- Streams can be out of order and bursty
  - "Stream cleaning"?
- Adaptors (and also XML) are important
  - … But we already knew that!
- Performance is critical
  - They spent a great deal of time running microbenchmarks and optimizing

19 Sensors and Sensor Networks
- Trends:
  - Cameras and other sensors are very cheap
  - Microprocessors and microcontrollers can be very small
  - Wireless networks are easy to build
- Why not instrument the physical world with tiny wireless sensors and networks?
- Vision: "smart dust"
  - Berkeley motes, RF tags, cameras, camera phones, temperature sensors, etc.
- Today we already see pieces of this:
  - Penn buildings and SCADA system
  - 250+ surveillance cameras throughout campus

20 What Can We Do with Sensor Networks?
- Many "passive" monitoring applications:
  - Environmental monitoring: temperature in different parts of a building, air quality, etc.
  - Law enforcement: video feeds and anomalous behavior
  - Research studies: studying ocean temperature and currents, monitoring the status of eggs in endangered birds' nests, ZebraNet
  - Fun: recording sporting events or performances from every angle (video & audio)
- Ultimately, build reactive systems as well: robotics, Mars landers, …

21 Some Challenges
- Highly distributed!
  - May have thousands of nodes
  - A node knows about a few nodes within proximity, and may not know its location
  - Nodes' transmissions may interfere with one another
- Power and resource constraints
  - Most of these devices are wireless, tiny, and battery-powered
  - Can only transmit data every so often
  - Limited CPU and memory – can't run sophisticated code
- High rate of failure
  - Collisions, battery failures, sensor calibration, …

22 The Target Platform
- Most sensor network research argues for the Berkeley mote as a target platform:
  - Mote: 4 MHz, 8-bit CPU
  - 128 KB RAM
  - 512 KB flash memory
  - 40 kbps radio, 100 ft range
- Sensors:
  - Light, temperature, microphone
  - Accelerometer
  - Magnetometer
- http://robotics.eecs.berkeley.edu/~pister/SmartDust/

23 Sensor Net Data Acquisition
- First: build the routing tree
- Second: begin sensing and aggregation

24 Sensor Net Data Acquisition (Sum)
[Figure: each node in the routing tree holds its own local reading]
- First: build the routing tree
- Second: begin sensing and aggregation (e.g., sum)

25 Sensor Net Data Acquisition (Sum)
[Figure: partial sums accumulate as readings flow up the routing tree toward the base station]
- First: build the routing tree
- Second: begin sensing and aggregation (e.g., sum)
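The aggregation in these figures can be sketched as a post-order walk of the routing tree: each node adds its own reading to the partial sums its children send up, so only one number crosses each link. This is an illustrative Python rendering; real motes do this with radio messages, not recursion.

```python
def in_network_sum(node, readings, children):
    """readings: node -> sensed value; children: node -> list of child nodes.
    Each node reports its own reading plus the partial sums of its subtrees."""
    total = readings[node]
    for child in children.get(node, []):
        total += in_network_sum(child, readings, children)   # child sends one partial sum up
    return total

# Tiny example: root 0 with children 1 and 2; node 1 has child 3.
readings = {0: 5, 1: 5, 2: 8, 3: 7}
children = {0: [1, 2], 1: [3]}
print(in_network_sum(0, readings, children))   # 25, with one value sent per link
```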

26 Sensor Network Research
- Routing: need to aggregate and consolidate data in a power-efficient way
  - Ad hoc routing – generate a routing tree to the base station
  - Generally need to merge computation with routing
- Robustness: need to combine info from many sensors to account for individual errors
  - What aggregation functions make sense?
- Languages: how do we express what we want to do with sensor networks?
  - Many proposals here

27 A First Try: TinyOS and nesC
- TinyOS: a custom OS for sensor nets, written in nesC
- Assumes a low-power CPU
- Very limited concurrency support: events (signaled asynchronously) and tasks (cooperatively scheduled)
- Applications are built from "components"
  - Basically, small objects without any local state
  - Various features live in libraries that may or may not be included
- Example nesC interface:

  interface Timer {
    // start a one-shot or repeating timer with the given interval
    command result_t start(char type, uint32_t interval);
    // cancel the timer
    command result_t stop();
    // event signaled each time the timer expires
    event result_t fired();
  }

28 Drawbacks of this Approach
- Need to write very low-level code for sensor net behavior
- Only simple routing policies are built into TinyOS – some routing algorithms may have to be implemented by hand
- Has required many follow-up papers to fill in some of the missing pieces, e.g., Hood (object tracking and state sharing), …

29 An Alternative
- "Much" of the computation being done in sensor nets looks like what we were discussing with STREAM
- Today's sensor networks look a lot like databases, pre-Codd
  - Custom "access paths" to get to data
  - One-off custom code
- So why not look at mapping sensor network computation to SQL?
  - Not very many joins here, but significant aggregation
- Now the challenge is picking a distribution and routing strategy that provides appropriate guarantees and minimizes power usage

30 TinyDB and TinySQL
- Treat the entire sensor network as a universal relation
  - Each type of sensor data is a column in a global table
  - Tuples are created according to a sample interval; each sampling period is an epoch
  - (Implications of this model?)
- Example:

  SELECT nodeid, light, temp
  FROM sensors
  SAMPLE INTERVAL 1s FOR 10s

31 Storage Points and Windows
- Like Aurora and STREAM, TinyDB can materialize portions of the data:

  CREATE STORAGE POINT recentlight SIZE 8
  AS (SELECT nodeid, light FROM sensors SAMPLE INTERVAL 10s)

- And we can use windowed aggregates:

  SELECT WINAVG(volume, 30s, 5s)
  FROM sensors
  SAMPLE INTERVAL 1s
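A sketch of what a windowed aggregate such as WINAVG(volume, 30s, 5s) computes, assuming one sample per second: an average over the last 30 seconds, reported every 5 seconds. This illustrates the semantics only and is not TinyDB's implementation.

```python
def winavg(samples, window_s, slide_s, sample_interval_s=1):
    """samples: readings taken every sample_interval_s seconds.
    Returns one average per slide_s seconds, each over the last window_s seconds."""
    window_n = window_s // sample_interval_s
    slide_n = slide_s // sample_interval_s
    averages = []
    for end in range(window_n, len(samples) + 1, slide_n):
        window = samples[end - window_n:end]
        averages.append(sum(window) / len(window))
    return averages

# 60 one-second volume readings; a 30 s window sliding every 5 s yields 7 averages.
volumes = list(range(60))
print(winavg(volumes, window_s=30, slide_s=5))
```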

32 Events
- Example:

  ON EVENT bird-detect(loc):
    SELECT AVG(light), AVG(temp), event.loc
    FROM sensors AS s
    WHERE dist(s.loc, event.loc) < 10m
    SAMPLE INTERVAL 2s FOR 30s

- How do we know about events?
- Contrast to UDFs? Triggers?

33 Power and TinyDB
- A cost-based optimizer tries to find the query plan with the lowest overall power consumption
  - Different sensors have different power usage
  - Try to order sampling according to selectivity (sound familiar?) – see the sketch below
    - Assumes a uniform distribution of values over the range
- Batching of queries (multi-query optimization)
  - Converts a series of events into a stream join – does this resemble anything we've seen recently?
- Also need to consider where the query is processed…
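A sketch of the acquisition-ordering idea mentioned above: rank predicates by power cost per tuple eliminated, so cheap and selective sensors are sampled first and expensive sensors are only powered up when the earlier predicates still hold. The numbers and names are made up for illustration.

```python
def order_acquisitions(predicates):
    """predicates: list of (name, selectivity, power_cost).
    Sort by cost per tuple eliminated: cheaper, more selective sensors first."""
    return sorted(predicates, key=lambda p: p[2] / (1.0 - p[1]))

def evaluate(ordered, sample_and_test):
    """Sample sensors in the chosen order; stop as soon as one predicate fails,
    so the remaining (more expensive) sensors are never sampled for this tuple."""
    for name, _, _ in ordered:
        if not sample_and_test[name]():        # name -> function that samples and tests
            return False
    return True

preds = [("light > 100", 0.5, 90),             # keeps 50% of tuples, 90 uJ per sample
         ("accel > threshold", 0.1, 900)]      # very selective but 10x the energy
print([p[0] for p in order_acquisitions(preds)])   # light sampled first: 180 < 1000
```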

34 Dissemination of Queries
- Based on the semantic routing tree (SRT) idea
- The SRT build request is flooded first
  - Node n gets to choose its parent p, based on radio range from the root
  - A parent knows its children
    - It maintains an interval of values for each child
    - It forwards requests to children only as appropriate (see the sketch below)
- Maintenance:
  - If its interval changes, a child notifies its parent
  - If a node disappears, its parent learns of this when it fails to get a response to a query
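A sketch of the forwarding decision: the parent keeps a value interval for each child's subtree and forwards a query only to children whose interval overlaps the query's predicate range, so other subtrees can stay asleep. Illustrative Python; the real SRT also covers construction, maintenance, and failures.

```python
def forward_query(child_intervals, query_lo, query_hi):
    """child_intervals: child id -> (min_value, max_value) seen in that subtree.
    Return the children whose interval overlaps [query_lo, query_hi]."""
    targets = []
    for child, (lo, hi) in child_intervals.items():
        if hi >= query_lo and lo <= query_hi:  # intervals overlap: the subtree may contribute
            targets.append(child)
    return targets

# Query: temperature between 30 and 40; only child B's subtree can contribute.
intervals = {"A": (10, 25), "B": (28, 45), "C": (50, 60)}
print(forward_query(intervals, 30, 40))        # ['B']
```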

35 Query Processing
- Mostly consists of sleeping!
  - Wake briefly, sample, compute operators, then route onwards
  - Nodes are time-synchronized
  - Awake time is proportional to the neighborhood size (why?)
- Computation is based on partial state records
  - Basically, each operation carries a partial aggregate value, plus the reading from the sensor (see the sketch below)
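A sketch of a partial state record for AVG: nodes forward (sum, count) rather than an average, so partial aggregates from different subtrees merge exactly at each hop. The init/merge/evaluate split follows the general in-network aggregation recipe; the code itself is illustrative.

```python
def psr_init(reading):
    """Partial state record for AVG over a single reading: (sum, count)."""
    return (reading, 1)

def psr_merge(a, b):
    """Combine partial state records arriving from different subtrees."""
    return (a[0] + b[0], a[1] + b[1])

def psr_evaluate(psr):
    """At the root, turn the final partial state record into the answer."""
    total, count = psr
    return total / count

# Three nodes report readings 10, 20, 30 up the tree:
psr = psr_init(10)
psr = psr_merge(psr, psr_init(20))
psr = psr_merge(psr, psr_init(30))
print(psr_evaluate(psr))                       # 20.0
```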

36 Load Shedding & Approximation
- What if the router queue is overflowing?
  - Need to prioritize tuples and drop the ones we don't want
  - FIFO vs. averaging the head of the queue vs. delta-proportional weighting (see the sketch below)
- Later work considers using approximation for more power efficiency
  - If sensors in one region change less frequently, we can sample less frequently (or fewer times) in that region
  - If sensors change less frequently, we can sample readings that take less power but are correlated (e.g., battery voltage vs. temperature)
- Thursday, 4:30 PM, DB Group Meeting: I'll discuss some of this work
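A toy sketch contrasting drop policies, using one plausible reading of delta-proportional weighting: when the queue overflows, keep the readings that changed most relative to their predecessor rather than simply dropping the oldest. This is a guess at the intent, offered only for illustration.

```python
def shed_fifo(queue, keep_n):
    """FIFO shedding: drop the oldest readings, keep the newest keep_n."""
    return queue[-keep_n:]

def shed_by_delta(queue, keep_n):
    """Keep the keep_n readings that differ most from their predecessor."""
    deltas = [(abs(queue[i] - queue[i - 1]) if i > 0 else float("inf"), i)
              for i in range(len(queue))]
    kept = sorted(i for _, i in sorted(deltas, reverse=True)[:keep_n])
    return [queue[i] for i in kept]

readings = [10, 10, 11, 30, 30, 31, 80]
print(shed_fifo(readings, 3))                  # [30, 31, 80]
print(shed_by_delta(readings, 3))              # [10, 30, 80]: keeps the big jumps
```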

37 The Future of Sensor Nets?
- TinySQL is a nice way of formulating the problem of query processing with motes
  - View the sensor net as a universal relation
  - Can define views to abstract some concepts, e.g., an object being monitored
- But:
  - What about when we have multiple instances of an object to be tracked? Correlations between objects?
  - What if we have more complex data? More CPU power?
  - What if we want to reason about accuracy?

