Stream and Sensor Data Management
Zachary G. Ives
University of Pennsylvania
CIS 650 – Implementing Data Management Systems
November 17, 2008

2 Converting between Streams & Relations
 Stream-to-relation operators:
   Sliding window: tuple-based (last N rows) or time-based (within a time range)
   Partitioned sliding window: groups by keys, then applies a sliding window within each group
   Is this set necessary or minimal?
 Relation-to-stream operators (sketched below):
   Istream: stream-ifies any insertions over a relation
   Dstream: stream-ifies the deletes
   Rstream: the stream contains the set of tuples in the relation
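As an illustration, here is a minimal Python sketch of a tuple-based sliding window and an Istream-style diff; the names (SlidingWindow, istream) are hypothetical, not STREAM's API:

    from collections import deque

    class SlidingWindow:
        """Tuple-based sliding window: the relation is the last n stream elements."""
        def __init__(self, n):
            self.buf = deque(maxlen=n)   # older tuples fall out automatically

        def push(self, tup):
            self.buf.append(tup)

        def relation(self):
            return list(self.buf)        # current window contents, as a relation

    def istream(old_relation, new_relation):
        """Istream: emit the tuples inserted between consecutive relation states."""
        return [t for t in new_relation if t not in old_relation]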

3 Some Examples
 Select *
  From S1 [Rows 1000], S2 [Range 2 minutes]
  Where S1.A = S2.A And S1.A > 10
 Select Rstream(S.A, R.B)
  From S [Now], R
  Where S.A = R.A

4 Building a Stream System
 The basic data item is the element: a tuple tagged with a timestamp and an operation op, where op ∈ {+, –} (insertion or deletion)
 Query plans need a few new (?) items:
   Queues
    Used for hooking together operators, especially over windows
    (The assumption is that pipelining is generally not possible, and we may need to drop some tuples from the queue)
   Synopses (sketched below)
    The intermediate state an operator needs to carry around
    Note that this is usually bounded by windows
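A minimal Python sketch of per-input synopses in the style of a windowed symmetric join; this is an assumed structure for illustration, not the STREAM implementation:

    from collections import defaultdict

    class WindowJoin:
        """Joins two streams on a key, keeping a time-bounded synopsis per input."""
        def __init__(self, window):
            self.window = window
            self.synopsis = [defaultdict(list), defaultdict(list)]  # key -> [(ts, tuple)]

        def push(self, side, ts, key, tup):
            # Evict expired state: synopses stay bounded by the window.
            for syn in self.synopsis:
                for k in list(syn):
                    syn[k] = [(t, v) for (t, v) in syn[k] if ts - t <= self.window]
                    if not syn[k]:
                        del syn[k]
            # Probe the other side's synopsis, then insert into our own.
            other = self.synopsis[1 - side]
            results = [(tup, v) for (_, v) in other.get(key, [])]
            self.synopsis[side][key].append((ts, tup))
            return results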

5 Example Query Plan
 [query-plan figure omitted] What’s different here?

6 Some Tricks for Performance
 Sharing synopses across multiple operators
   In a few cases, more than one operator may join with the same synopsis
 Can exploit punctuations or “k-constraints”
   Analogous to interesting orders
   Referential integrity k-constraint: bound of k between the arrival of a “many” element and its corresponding “one” element
   Ordered-arrival k-constraint: a window of at most k suffices to sort (see the sketch below)
   Clustered-arrival k-constraint: bound on the distance between items with the same grouping attributes
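A small Python sketch of how an ordered-arrival k-constraint bounds the memory needed to produce sorted output; illustrative only:

    import heapq

    def sort_with_k_constraint(stream, k):
        """If every element arrives at most k positions from its sorted place,
        a heap of size k+1 always contains the global minimum of the remaining
        elements, so bounded memory yields fully sorted output."""
        heap = []
        for ts in stream:
            heapq.heappush(heap, ts)
            if len(heap) > k:
                yield heapq.heappop(heap)
        while heap:
            yield heapq.heappop(heap)

    # e.g., list(sort_with_k_constraint([2, 1, 3, 5, 4], k=1)) -> [1, 2, 3, 4, 5]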

7 Query Processing – “Chain Scheduling”
 Similar in many ways to eddies
 May decide to apply operators as follows (sketched below):
   Assume we know how many tuples can be processed in a time unit
   Cluster groups of operators into “chains” that maximize the reduction in queue size per unit time
   Greedily forward tuples into the most selective chain
   Within a chain, process in FIFO order
 They also do a form of join reordering
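A simplified Python sketch of the greedy choice: pick the runnable chain with the steepest reduction in queue size per unit of work. This compresses the heuristic for illustration and omits the full chain-construction step:

    def chain_priority(ops):
        """Each op is (cost, selectivity); a chain turns one queued tuple into
        prod(selectivities) tuples using sum(costs) time, so its priority is
        the size reduction per unit time."""
        total_cost = sum(cost for cost, _ in ops)
        surviving = 1.0
        for _, sel in ops:
            surviving *= sel
        return (1.0 - surviving) / total_cost

    def schedule(chains, queues):
        """Greedily choose the non-empty queue whose chain sheds load fastest."""
        runnable = [i for i, q in enumerate(queues) if q]
        if not runnable:
            return None
        return max(runnable, key=lambda i: chain_priority(chains[i]))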

8 Scratching the Surface: Approximation
 They point out two areas where we might need to approximate output:
   The CPU is limited, and we need to drop some stream elements according to some probabilistic metric
    Collect statistics via a profiler
    Use the Hoeffding inequality to derive a sampling rate that maintains a confidence interval (see the sketch below)
   We may need to do similar things if memory usage is a constraint
 Are there other options? When might they be useful?
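For concreteness, the Hoeffding bound says that for n independent samples of a value bounded in an interval of width R, P(|sample mean − true mean| > ε) ≤ 2·exp(−2nε²/R²); solving for n gives a sample size, and hence a sampling rate, for a desired confidence. A small Python sketch:

    import math

    def hoeffding_sample_size(eps, delta, value_range=1.0):
        """Smallest n with P(|sample mean - true mean| > eps) <= delta:
        n >= (range^2 / (2 eps^2)) * ln(2 / delta)."""
        return math.ceil((value_range ** 2) / (2 * eps ** 2) * math.log(2 / delta))

    # e.g., +/-0.05 error with 95% confidence over [0,1]-bounded values:
    # hoeffding_sample_size(0.05, 0.05) -> 738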

9 STREAM in General
 “Logical semantics first”
   Starts with a basic data model: streams as timestamped sets
   Develops a language and semantics, heavily based on SQL
 Proposes a relatively straightforward implementation
   Interesting ideas like k-constraints
   Interesting approaches like chain scheduling
 No real consideration of distributed processing

10 Aurora
 “Implementation first; mix and match operations from past literature”
 Basic philosophy: most of the ideas in streams existed in previous research
   Sliding windows, load shedding, approximation, …
   So let’s borrow those ideas and focus on how to build a real system with them!
 Emphasis is on building a scalable, robust system
 Distributed implementation: Medusa

11 Queries in Aurora
 Oddly, there was no declarative query language in the initial version! (One was added for the commercial product)
 Queries are workflows of physical query operators (SQuAl)
 Many operators resemble relational algebra ops

12 Example Query
 [SQuAl workflow figure omitted]

13 Some Interesting Aspects
 A relatively simple adaptive query optimizer
   Can push filtering and mapping into many operators
   Can reorder some operators (e.g., joins, unions)
 Needs built-in error handling (sketched below)
   If a data source fails to respond within a certain amount of time, create a special alarm tuple
   This propagates through the query plan
 Incorporates built-in load shedding and real-time scheduling to support QoS
 Has a notion of combining a query over historical data with data from a stream
   Switches from a pull-based mode (reading from disk) to a push-based mode (reading from the network)
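A minimal Python sketch of the alarm-tuple idea: a timeout turns a silent source into an alarm tuple, and operators pass alarms through unchanged. The sentinel and the non-blocking poll() are hypothetical, not Aurora's actual encoding:

    import time

    ALARM = ("ALARM",)   # hypothetical sentinel tuple

    def read_with_timeout(source, timeout_s):
        """Return the next tuple, or an alarm tuple if the source stays silent."""
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            tup = source.poll()          # assumed non-blocking: tuple or None
            if tup is not None:
                return tup
            time.sleep(0.01)
        return ALARM

    def filter_op(tup, predicate):
        """Operators forward alarm tuples unconditionally, so the failure
        propagates through the plan to the query output."""
        if tup is ALARM:
            return tup
        return tup if predicate(tup) else None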

14 The Medusa Processor
 Distributed coordinator between many Aurora nodes
 Scalability through federation and distribution
 Fail-over
 Load balancing

15 Main Components
 Lookup
   Distributed catalog – schemas, where to find streams, where to find queries
 Brain
   Query setup; load monitoring via I/O queues and statistics
   A load distribution and balancing scheme is used
   Very reminiscent of Mariposa!

16 Load Balancing
 Migration – an operator can be moved from one node to another
   The initial implementation didn’t support moving state
    The state is simply dropped, and operator processing resumes
    Implications for semantics?
   There are plans to support state migration
 “Agoric system model to create incentives”
   Clients pay nodes for processing queries
   Nodes pay each other to handle load – pairwise contracts negotiated offline
   Bounded-price mechanism – a price for migrating load, plus a spec for what a node will take on
 Does this address the weaknesses of the Mariposa model?

17 Some Applications They Tried
 Financial services (stock ticker)
   The main issue is not volume, but problems with feeds
   Two-level alarm system, where the higher-level alarm helps diagnose problems
   Shared computation among queries
   User-defined aggregation and mapping
   This is the main application for the commercial version (StreamBase)
 Linear Road (sensor monitoring)
   Traffic sensors on a toll road – the toll changes depending on how many cars are on the road
   Combination of historical and continuous queries
 Environmental monitoring
   Sliding-window calculations

18 Lessons Learned
 Historical data is important – not just stream data (summaries?)
 Sometimes synchronization is needed for consistency
   “ACID for streams”?
 Streams can be out of order and bursty
   “Stream cleaning”?
 Adaptors (and also XML) are important
   … but we already knew that!
 Performance is critical
   They spent a great deal of time running microbenchmarks and optimizing

19 Sensors and Sensor Networks
 Trends:
   Cameras and other sensors are very cheap
   Microprocessors and microcontrollers can be very small
   Wireless networks are easy to build
 Why not instrument the physical world with tiny wireless sensors and networks?
 Vision: “smart dust”
   Berkeley motes, RF tags, cameras, camera phones, temperature sensors, etc.
 Today we already see pieces of this:
   Penn buildings and SCADA system
   250+ surveillance cameras across campus

20 What Can We Do with Sensor Networks?
 Many “passive” monitoring applications:
   Environmental monitoring:
    Temperature in different parts of a building
    Air quality
    etc.
   Law enforcement:
    Video feeds and anomalous behavior
   Research studies:
    Study ocean temperature and currents
    Monitor the status of eggs in endangered birds’ nests
    ZebraNet
   Fun:
    Record sporting events or performances from every angle (video & audio)
 Ultimately, build reactive systems as well: robotics, Mars landers, …

21 Some Challenges
 Highly distributed!
   May have thousands of nodes
   A node knows about a few nodes within proximity; it may not know its location
   Nodes’ transmissions may interfere with one another
 Power and resource constraints
   Most of these devices are wireless, tiny, and battery-powered
   They can only transmit data every so often
   Limited CPU and memory – they can’t run sophisticated code
 High rate of failure
   Collisions, battery failures, sensor calibration, …

22 The Target Platform
 Most sensor network research argues for the Berkeley mote as a target platform:
   Mote: 4 MHz, 8-bit CPU
   128 KB RAM
   512 KB flash memory
   40 kbps radio, 100 ft range
 Sensors:
   Light, temperature, microphone
   Accelerometer
   Magnetometer

23–25 Sensor Net Data Acquisition (Sum)
 First: build the routing tree
 Second: begin sensing and aggregation (e.g., sum)
 [three animation frames of the routing-tree and aggregation figure omitted]

26 Sensor Network Research
 Routing: need to aggregate and consolidate data in a power-efficient way
   Ad hoc routing – generate a routing tree to the base station
   Generally need to merge computation with routing
 Robustness: need to combine info from many sensors to account for individual errors
   What aggregation functions make sense?
 Languages: how do we express what we want to do with sensor networks?
   Many proposals here

27 A First Try: TinyOS and nesC
 TinyOS: a custom OS for sensor nets, written in nesC
   Assumes a low-power CPU
   Very limited concurrency support: events (signaled asynchronously) and tasks (cooperatively scheduled)
   Applications built from “components”
    Basically, small objects without any local state
    Various features in libraries that may or may not be included

 interface Timer {
   // commands are called on the component providing the interface
   command result_t start(char type, uint32_t interval);
   command result_t stop();
   // events are signaled back to the component using the interface
   event result_t fired();
 }

28 Drawbacks of this Approach
 Need to write very low-level code for sensor net behavior
 Only simple routing policies are built into TinyOS – some routing algorithms may have to be implemented by hand
 Has required many follow-up papers to fill in some of the missing pieces, e.g., Hood (object tracking and state sharing), …

29 An Alternative
 “Much” of the computation being done in sensor nets looks like what we were discussing with STREAM
 Today’s sensor networks look a lot like databases, pre-Codd
   Custom “access paths” to get to the data
   One-off custom code
 So why not look at mapping sensor network computation to SQL?
   Not very many joins here, but significant aggregation
 Now the challenge is picking a distribution and routing strategy that provides appropriate guarantees and minimizes power usage

30 TinyDB and TinySQL
 Treat the entire sensor network as a universal relation
   Each type of sensor data is a column in a global table
   Tuples are created according to a sample interval (separated by epochs)
   (Implications of this model?)

 SELECT nodeid, light, temp
 FROM sensors
 SAMPLE INTERVAL 1s FOR 10s

31 Storage Points and Windows
 Like Aurora and STREAM, TinyDB can materialize portions of the data:

 CREATE STORAGE POINT recentlight SIZE 8
 AS (SELECT nodeid, light FROM sensors
     SAMPLE INTERVAL 10s)

 … and we can use windowed aggregates (semantics sketched below):

 SELECT WINAVG(volume, 30s, 5s)
 FROM sensors
 SAMPLE INTERVAL 1s
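A small Python sketch of the sliding-window semantics of WINAVG(volume, 30s, 5s) – a 30-second window that advances every 5 seconds – assuming timestamp-ordered (ts, value) samples; illustrative only, not TinyDB's implementation:

    from collections import deque

    def winavg(samples, window_s=30, slide_s=5):
        buf = deque()
        next_emit = slide_s
        for ts, val in samples:
            buf.append((ts, val))
            while buf and ts - buf[0][0] > window_s:
                buf.popleft()                    # evict expired readings
            if ts >= next_emit:                  # emit once per slide interval
                yield ts, sum(v for _, v in buf) / len(buf)
                next_emit += slide_s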

32 Events
 ON EVENT bird-detect(loc):
  SELECT AVG(light), AVG(temp), event.loc
  FROM sensors AS s
  WHERE dist(s.loc, event.loc) < 10m
  SAMPLE INTERVAL 2s FOR 30s

 How do we know about events?
 Contrast to UDFs? Triggers?

33 Power and TinyDB
 A cost-based optimizer tries to find the query plan with the lowest overall power consumption
   Different sensors have different power usage
   Tries to order sampling according to selectivity (sound familiar?) – see the sketch below
   Assumes a uniform distribution of values over the range
 Batching of queries (multi-query optimization)
   Converts a series of events into a stream join – does this resemble anything we’ve seen recently?
 Also need to consider where the query is processed…
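A Python sketch of the ordering idea: sample cheap, selective attributes first so costly sensors are only activated when earlier predicates pass. The classic rank for expensive predicates is (1 − selectivity)/cost, highest first; this simplifies what TinyDB's optimizer actually does:

    def order_acquisitions(preds):
        """preds: list of (name, power_cost, selectivity)."""
        return sorted(preds, key=lambda p: (1 - p[2]) / p[1], reverse=True)

    def expected_power(ordered):
        """Expected per-tuple power for a given acquisition order."""
        total, pass_prob = 0.0, 1.0
        for _, cost, sel in ordered:
            total += pass_prob * cost    # sampled only if earlier predicates passed
            pass_prob *= sel
        return total

    # e.g., order_acquisitions([("temp", 1.0, 0.9), ("mic", 10.0, 0.1)])
    # samples temp first: expected power 10.0 instead of 10.1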

34 Dissemination of Queries
 Based on the semantic routing tree (SRT) idea
 The SRT build request is flooded first
   Node n gets to choose its parent p, based on radio range from the root
   The parent knows its children
   It maintains an interval of values for each child
   It forwards requests to children as appropriate (see the sketch below)
 Maintenance:
   If the interval changes, the child notifies its parent
   If a node disappears, the parent learns of this when it fails to get a response to a query
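The forwarding decision in a semantic routing tree reduces to an interval-overlap test; a minimal Python sketch, where children is a hypothetical map from child id to the value interval of its subtree:

    def srt_forward(children, query_lo, query_hi):
        """Forward a range query only to children whose subtree interval
        overlaps the query's predicate range."""
        return [cid for cid, (lo, hi) in children.items()
                if lo <= query_hi and query_lo <= hi]

    # e.g., srt_forward({"n1": (10, 20), "n2": (35, 50)}, 15, 30) -> ["n1"]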

35 Query Processing
 Mostly consists of sleeping!
   Wake briefly, sample, compute operators, then route onwards
   Nodes are time-synchronized
   Awake time is proportional to the neighborhood size (why?)
 Computation is based on partial state records (sketched below)
   Basically, each operation carries a partial aggregate value, plus the reading from the sensor
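Partial state records make non-decomposable aggregates like AVG mergeable in the network; a TAG-style Python sketch in which AVG travels as a (sum, count) pair:

    def psr_init(reading):
        """A leaf initializes the partial state record from its own reading."""
        return (reading, 1)

    def psr_merge(a, b):
        """An interior node merges children's PSRs with its own; AVG is not
        directly combinable, but (sum, count) is."""
        return (a[0] + b[0], a[1] + b[1])

    def psr_final(psr):
        """The base station evaluates the final answer from the root PSR."""
        s, c = psr
        return s / c

    # e.g., psr_final(psr_merge(psr_init(10), psr_init(20))) -> 15.0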

36 Load Shedding & Approximation
 What if the router queue is overflowing?
   Need to prioritize tuples and drop the ones we don’t want
   FIFO vs. averaging the head of the queue vs. delta-proportional weighting (one option sketched below)
 Later work considers using approximation for more power efficiency
   If sensors in one region change less frequently, we can sample less frequently (or fewer times) in that region
   If sensors change less frequently, we can sample readings that take less power but are correlated (e.g., battery voltage vs. temperature)
 Thursday, 4:30 PM, DB Group Meeting: I’ll discuss some of this work
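One possible reading of the "average the head of the queue" option, as a hedged Python sketch: when the queue exceeds capacity, merge the two oldest (timestamp, value) tuples into their average rather than dropping one outright. This illustrates the option named above, not the paper's exact algorithm:

    def enqueue_with_merge(queue, tup, capacity):
        queue.append(tup)
        if len(queue) > capacity:
            (t1, v1), (t2, v2) = queue[0], queue[1]
            queue[0:2] = [((t1 + t2) / 2, (v1 + v2) / 2)]   # merge, don't lose both
        return queue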

37 The Future of Sensor Nets?
 TinySQL is a nice way of formulating the problem of query processing with motes
   View the sensor net as a universal relation
   Can define views to abstract some concepts, e.g., an object being monitored
 But:
   What about when we have multiple instances of an object to be tracked? Correlations between objects?
   What if we have more complex data? More CPU power?
   What if we want to reason about accuracy?