Sensor Data Management Egemen Tanin Department of Computer Science and Software Engineering University of Melbourne
Goals & Fundamental Observations Goal: Improve sensor network lifetime AND: Maintain the current DBMS abstraction and facilities while introducing algorithms to run queries efficiently Add new capabilities for emerging applications such as summarization of data for getting rid of irrelevant data Observations: Communication in sensors is much more energy hungry than computation Sensor Networks are made up of simple devices with no extensive data storage
Additional challenges The overall system is very volatile Changes in environment conditions can render readings inaccessible Failure of nodes cannot be easily fixed Nodes can run low on power over time Data is dynamic New data is being appended all the time Serving multiple queries concurrently is problematic Sensors are very limited in what they can physically observe at a given time
Fundamental Approaches Collect all the sensor data to one or more data centers Use a classical DBMS Energy inefficiencies due to redundant data collection, central point of failure, hot spots near root, has to collect data at the highest frequency for all potential queries and all the time Current DBMSs not fast enough for high-update applications Many facilities are redundant: RDBMS were built 25+ years ago Lack of certain convenient operations, e.g., continuous queries Rebuild a DBMS for sensor networks and fix some of the problems on a central setting? Still energy inefficient due to centralization
Approaches Contd. In-network storage and processing along with a capability to inject and collect data from anywhere in the network for any number of centers Already implied by communication costs dominating the computation costs in the network But storage limitations require eliminating some data Fundamentally different than current commercial RDBMSs
Query Classification for Sensor Networks Continuous queries: that commonly span some long period of time Snapshot queries: that collect data about now or some other point in time Historical queries: collect summary data about past
Additional Operators Use of only some of the sensors Aggregation of data from multiple sensors Correlation of data from sensors
Example Query
SELECT min(humidity), town
FROM sensors
WHERE state = 'Queensland'
GROUP BY town
HAVING max(temperature) > 30
DURATION [now, now min]
SAMPLING PERIOD 30 min
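The relational part of the example query (WHERE, GROUP BY, HAVING, and the min aggregate) can be sketched in Python over a handful of readings. This is only an illustration of the query semantics for a single sampling period; the readings, the tuple layout, and the function name are hypothetical, and the DURATION and SAMPLING PERIOD clauses are not modelled.

```python
from collections import defaultdict

# Hypothetical readings for one sampling period:
# (town, state, humidity, temperature)
READINGS = [
    ("Cairns", "Queensland", 80, 33),
    ("Cairns", "Queensland", 75, 29),
    ("Toowoomba", "Queensland", 55, 24),
    ("Toowoomba", "Queensland", 60, 26),
    ("Geelong", "Victoria", 50, 35),
]


def min_humidity_by_town(readings, state, temp_threshold):
    """Return {town: min humidity} for towns in `state` whose
    maximum temperature exceeds `temp_threshold`."""
    groups = defaultdict(list)
    for town, reading_state, humidity, temperature in readings:
        if reading_state == state:                      # WHERE
            groups[town].append((humidity, temperature))  # GROUP BY town
    result = {}
    for town, rows in groups.items():
        if max(t for _, t in rows) > temp_threshold:    # HAVING
            result[town] = min(h for h, _ in rows)      # SELECT min(humidity)
    return result


print(min_humidity_by_town(READINGS, "Queensland", 30))
```

In a continuous query the same evaluation would be repeated once per sampling period for the length of the query duration.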
Extending SQL Example: Cougar Sensor Network Database System (by Yong Yao and Johannes Gehrke) Uses SQL like interface After in-network processing, data is fed to a center Optimizes for both resource usage and reaction time Assumes that sensors are time synchronized Each type of sensor is represented as an Abstract Data Type (ADT) Each sensor is then an object of that ADT Relations are virtual and append-only relations
Cougar Contd. Has SELECT, FROM, WHERE, GROUP BY, HAVING, DURATION, and EVERY clauses Now extended to have Gaussian ADTs (GADTs) to run probabilistic queries as sensors collect data with noise from physical phenomena:
SELECT * FROM sensors WHERE sensor.temp.prob([10,20]) >= 0.6
'Get the temperature data from sensors if it is within ±5 of 15 degrees with at least 60 percent probability'
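The probabilistic predicate can be illustrated with a short sketch, assuming each sensor's temperature is modelled as a Gaussian with a known mean and standard deviation. The function names and the default arguments below are hypothetical and are not Cougar's API; they only show how an interval probability can be compared against a threshold.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def prob_in_interval(mean, std, low, high):
    """Probability that a Gaussian(mean, std) value lies in [low, high]."""
    return normal_cdf(high, mean, std) - normal_cdf(low, mean, std)


def passes(mean, std, low=10.0, high=20.0, min_prob=0.6):
    """True if the sensor satisfies prob([low, high]) >= min_prob."""
    return prob_in_interval(mean, std, low, high) >= min_prob


# A sensor centred on 15 degrees with std 5 puts about 68% of its
# probability mass in [10, 20], so it passes the 0.6 threshold.
print(passes(15.0, 5.0))
# A noisier sensor (std 10) has only about 38% in the interval.
print(passes(15.0, 10.0))
```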
Execution Steps Broadcast the query to the network Collect data back Not all data may be relevant and summarization of data may be utilized Further analysis on a central system can be done if needed later Note: Either a human or an automated system can be the origin of the query
Data Collection Energy x Delay is the main composite metric Methods: Direct Independent Transmission PEGASIS (Power-Efficient Gathering for Sensor Information Systems) Binary Chain-based Scheme Chain-based Three-level Scheme Directed Diffusion Tree-based Schemes Multi-path Schemes Hybrids
Direct Independent Transmission Each node transmits to a center independently Very energy inefficient Nodes must watch out for collision and take turns Hence the last message can be transmitted after a significant delay First response may be very fast
PEGASIS By Stephanie Lindsey and Cauligi Raghavendra Assumes all nodes know the location of every other node All nodes should be able to transmit data to the center in one hop A greedy algorithm is used to construct a chain of sensor nodes starting farthest from the center The chain is formed a priori After every hop, data aggregation can be done Leadership is transferred sequentially May be energy efficient but delay is O(n)
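The greedy chain construction can be sketched as follows, assuming nodes are 2D points and that "greedy" means repeatedly appending the nearest node not yet in the chain. The function name and the sample coordinates are hypothetical; leader rotation and aggregation along the chain are not modelled.

```python
import math


def build_chain(nodes, center):
    """Greedy chain: start at the node farthest from the center, then
    repeatedly append the nearest node that is not yet in the chain."""
    remaining = list(nodes)
    start = max(remaining, key=lambda p: math.dist(p, center))
    chain = [start]
    remaining.remove(start)
    while remaining:
        nearest = min(remaining, key=lambda p: math.dist(p, chain[-1]))
        chain.append(nearest)
        remaining.remove(nearest)
    return chain


# Hypothetical layout: four sensors on a line, center far to the right.
sensors = [(2, 0), (0, 0), (5, 0), (1, 0)]
print(build_chain(sensors, center=(10, 0)))
```

With n nodes, data passed hop by hop along such a chain needs on the order of n sequential transmissions, which is where the O(n) delay comes from.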
PEGASIS (figure omitted; labels: Sensors, Start, End, Leader, To Center)
Binary Chain-based Scheme By the same authors as PEGASIS It is a chain-based scheme like PEGASIS Nodes are classified into levels All nodes receiving a message at one level rise to the next level At each level, the number of nodes is halved This is a CDMA-only scheme (to prevent collisions) Delay is O(log n)
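The halving of nodes per level can be illustrated with a small simulation. It assumes that neighbours on the chain pair up at every level and that the receiver of each pair combines the two values and rises; the function name is hypothetical, and the final transmission to the center is not counted.

```python
import operator


def binary_aggregate(values, combine):
    """Combine values pairwise, level by level.

    Returns (aggregate, levels), where `levels` is the number of
    parallel transmission rounds needed inside the network."""
    if not values:
        raise ValueError("at least one node is required")
    level = list(values)
    levels = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                # The receiver merges its partner's value and rises.
                next_level.append(combine(level[i], level[i + 1]))
            else:
                # An unpaired node rises unchanged.
                next_level.append(level[i])
        level = next_level
        levels += 1
    return level[0], levels


print(binary_aggregate([1, 2, 3, 4, 5, 6, 7, 8], operator.add))
```

Eight nodes need three rounds, compared with seven sequential hops on a plain chain.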
Binary Chain-based Scheme Contd. (figure omitted; labels: Step 1, Step 2, Step 3, Step 4, To Center)
Chain-based Three-level Scheme By the same authors as PEGASIS The binary scheme does not work in non-CDMA settings Again, a chain, like PEGASIS, is formed but the network is partitioned into groups that are far away from each other for simultaneous transmissions Within a group, nodes transmit at the same time One node of the group aggregates and goes to the next level In the next level, all nodes are divided into two groups Finally, all send to one node which sends to a center
Directed Diffusion By Chalermek Intanagonwiwat and Ramesh Govindan and Deborah Estrin Consists of: Interest propagation E.g., location=[(100,100),(10, 200)], temperature=[10,20] Gradient setup Data delivery along reinforced path
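How a node might test a reading against the example interest can be sketched as below. The sketch assumes the two location pairs are opposite corners of a rectangle and that the temperature pair is a closed range; both are interpretations of the example, and the dictionary layout and function name are hypothetical. Gradient setup and path reinforcement are not modelled.

```python
def matches_interest(reading, rect, temp_range):
    """True if the reading lies inside the rectangle and its temperature
    falls within the closed range [low, high]."""
    (x1, y1), (x2, y2) = rect
    x, y = reading["location"]
    in_rect = (min(x1, x2) <= x <= max(x1, x2)
               and min(y1, y2) <= y <= max(y1, y2))
    low, high = temp_range
    return in_rect and low <= reading["temperature"] <= high


# Interest from the example above.
interest_rect = [(100, 100), (10, 200)]
interest_temp = (10, 20)

reading = {"location": (50, 150), "temperature": 15}
print(matches_interest(reading, interest_rect, interest_temp))
```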
Directed Diffusion Contd.
Tree-based Schemes A routing tree rooted at a base station is used The tree that is used to distribute the query is also used to collect the data Example: TinyDB (by Samuel Madden and Michael Franklin and Joseph Hellerstein and Wei Hong)
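In-network aggregation over a routing tree can be sketched as follows for an average. Each node merges its own reading with the partial (sum, count) records of its children and forwards a single record to its parent. The tree encoding, node names, and function name are hypothetical, and this is not TinyDB's actual interface.

```python
def aggregate_avg(tree, readings, root):
    """Average of all readings in the subtree under `root`.

    `tree` maps a node to the list of its children; `readings` maps a
    node to its local sensor value."""
    def partial(node):
        total, count = readings[node], 1
        for child in tree.get(node, []):
            child_total, child_count = partial(child)
            total += child_total
            count += child_count
        # One (sum, count) record travels up instead of every reading.
        return total, count

    total, count = partial(root)
    return total / count


# Hypothetical tree: a is next to the base station; d reports through b.
tree = {"a": ["b", "c"], "b": ["d"]}
readings = {"a": 10, "b": 20, "c": 30, "d": 40}
print(aggregate_avg(tree, readings, "a"))
```

Forwarding (sum, count) rather than a local average keeps the final result exact, since averages of averages would weight subtrees incorrectly.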
TinyDB Contd. Uses an epoch-based mechanism Main disadvantage is that it can lose large subtrees/data due to central point of failure
Extensions Report data only if it has changed from the previous report or consider whether a re-report will affect the final aggregation at all Adapting to changing conditions in the network:
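The first extension, reporting only on change, can be sketched as a simple filter. The `tolerance` parameter and the function name are assumptions added for illustration; a tolerance of zero reports every change.

```python
def reports(samples, tolerance=0.0):
    """Return the (index, value) pairs a node would transmit: the first
    sample, then any sample that differs from the last reported value
    by more than `tolerance`."""
    sent = []
    last_reported = None
    for index, value in enumerate(samples):
        if last_reported is None or abs(value - last_reported) > tolerance:
            sent.append((index, value))
            last_reported = value
    return sent


samples = [20, 20, 21, 21, 25]
print(reports(samples))                # every change is reported
print(reports(samples, tolerance=1))   # small changes are suppressed
```

Each suppressed report saves a transmission, which matters because communication costs far more energy than the comparison itself.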
Multi-path Schemes To prevent failures, the same sensor value can be sent along multiple paths The main disadvantage is that the final value now may contain an approximation rather than an exact value E.g., by Suman Nath and Philip Gibbons and Srinivasan Seshan and Zachary Anderson:
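One way to see why multi-path aggregates become approximate is a duplicate-insensitive counting sketch. The example below uses a Flajolet-Martin style bitmap, chosen here as an illustration rather than taken from the slides: merging is a bitwise OR, so a value that arrives along several paths is counted once, but the resulting count is only an estimate. All names are hypothetical.

```python
import hashlib

BITS = 32


def _rho(item):
    """Position of the lowest set bit in a deterministic 32-bit hash."""
    digest = hashlib.sha256(str(item).encode()).digest()
    h = int.from_bytes(digest[:4], "big")
    if h == 0:
        return BITS - 1
    return (h & -h).bit_length() - 1


def sketch(items):
    """Bitmap summarising a set of distinct items."""
    bitmap = 0
    for item in items:
        bitmap |= 1 << _rho(item)
    return bitmap


def merge(a, b):
    """Combine two sketches; idempotent, so duplicates do no harm."""
    return a | b


def estimate(bitmap):
    """Approximate number of distinct items summarised by the bitmap."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / 0.77351


left = sketch(["s1", "s2", "s3"])
right = sketch(["s3", "s4"])
# s3 reaches the base station along both paths but is not double counted.
combined = merge(left, right)
print(combined == sketch(["s1", "s2", "s3", "s4"]))
```

A single bitmap gives a coarse estimate; averaging over several independent bitmaps is the usual way to tighten it.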
Hybrids E.g., by Amit Manjhi and Suman Nath and Philip Gibbons Benefits of both a tree mechanism as well as a multi-path mechanism (figure omitted; labels: Base Station, Tree, Multi-path)
Storing Data versus Data Collection Rather than collecting data from individual sensors for every given query, sensors can be made to store their data in the network for point retrieval at a later time Similar to creating rendezvous points
Geographic Hash Tables (GHTs) By Sylvia Ratnasamy and Brad Karp and Li Yin and Fang Yu and Deborah Estrin and Ramesh Govindan and Scott Shenker Assumes each node knows its location Limited to point queries Hashes keys to geographic locations Stores a key-value pair on the sensor closest to the location Geographic routing is used to access this data with a key later on Replication on nearby nodes can be used for load sharing and failure resistance Regions of data, rather than individual sensor readings, can also be hashed as an extension The idea is, in general, similar to publish-subscribe
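The core of the idea, hashing a key to a location and storing the pair on the closest node, can be sketched as below. The sketch assumes a rectangular field and uses SHA-256 purely as a convenient deterministic hash; the class and function names are hypothetical, and geographic routing and replication are not modelled.

```python
import hashlib
import math


def hash_to_location(key, width, height):
    """Map a key to a deterministic point inside the field."""
    digest = hashlib.sha256(key.encode()).digest()
    x = int.from_bytes(digest[:4], "big") / 2**32 * width
    y = int.from_bytes(digest[4:8], "big") / 2**32 * height
    return (x, y)


class GeographicHashTable:
    def __init__(self, nodes, width, height):
        self.nodes = list(nodes)
        self.width = width
        self.height = height
        self.storage = {node: {} for node in self.nodes}

    def home_node(self, key):
        """Node closest to the location the key hashes to."""
        target = hash_to_location(key, self.width, self.height)
        return min(self.nodes, key=lambda p: math.dist(p, target))

    def put(self, key, value):
        self.storage[self.home_node(key)].setdefault(key, []).append(value)

    def get(self, key):
        return self.storage[self.home_node(key)].get(key, [])


ght = GeographicHashTable([(10, 10), (90, 10), (50, 90)], 100, 100)
ght.put("elephant-sighting", {"sensor": 7, "time": 1200})
print(ght.get("elephant-sighting"))
```

Because the data source and any later query source hash the same key to the same location, they meet at the same node without knowing about each other.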
GHTs Contd. (figure omitted; labels: Data Source, Storage, Query Source)
Range Queries GHTs do not work for range queries A similar approach to Binary Chain-based Schemes can be used in one-dimensional settings, but storage, rather than collection, is the goal (figure omitted; labels: Road, Sensors a to j)
Multidimensional Indexing For multidimensional indexing, we can use: Grid files with multidimensional range hashing Quadtrees with block hashing It is less clear how to map R-trees or k-d trees using hashing In general, research on this front is in its infancy Load balancing as well as minimizing communication overhead are critical issues
DIMENSIONS System By Deepak Ganesan and Ben Greenstein and Denis Perelyubskiy and Deborah Estrin and John Heidemann
DIFS System By Benjamin Greenstein and Deborah Estrin and Ramesh Govindan and Sylvia Ratnasamy and Scott Shenker A multi-rooted method Nodes hold histograms Even load distribution I.e., we have many roots
Fractional Cascading By Jie Gao and Leonidas Guibas and John Hershberger and Li Zhang Requests are commonly local, i.e., from a given node GHTs can store data afar Hence: Keep a fraction of distant data and keep detailed local data (use exponential decay)
Locality Preserving Hashing: DIM system By Xin Li and Young Jin Kim and Ramesh Govindan and Wei Hong
Additional Issues: Data Aging Algorithms are needed for summarizing aging data on sensors: E.g., DIMENSIONS uses a monotonically decreasing function to discard data over time by creating new summaries
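Aging by repeated summarization can be illustrated with a deliberately simple stand-in: each aging step replaces adjacent pairs of samples with their mean, so older data occupies less storage at a coarser resolution. Pairwise averaging is an assumption made for this sketch and is not claimed to be the summarization DIMENSIONS uses; the function names and the `budget` parameter are hypothetical.

```python
def coarsen(samples):
    """Replace each adjacent pair of samples by its mean.
    A trailing unpaired sample is kept as is."""
    summary = []
    for i in range(0, len(samples), 2):
        pair = samples[i:i + 2]
        summary.append(sum(pair) / len(pair))
    return summary


def age(samples, budget):
    """Coarsen repeatedly until the summary fits in `budget` values."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    summary = list(samples)
    while len(summary) > budget:
        summary = coarsen(summary)
    return summary


old_data = [1, 3, 5, 7]
print(age(old_data, budget=2))  # half the storage, half the resolution
print(age(old_data, budget=1))  # a single summary value
```

Lowering the budget as data gets older gives a storage allocation that decreases monotonically with age.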
Summary and Future Directions In-network processing is gaining momentum Either collect data in an efficient manner Or store data by creating good rendezvous-based mechanisms Complex data aggregation mechanisms for sophisticated data analysis are commonly cited as a good research direction Subquery generation and subquery trading are also good research directions Indexing with complex query processing is also in its infancy