Sensor Data Management and XML Data Management Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems November 19, 2008.

Sensor Data Management and XML Data Management Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems November 19, 2008

Administrivia
- By next Tuesday, please email me with a status report on your project
  - … We are well under a month from the deadline!
- For next time:
  - Please read and review the TurboXPath paper

Sensor Networks: Target Platform
- Most sensor network research argues for the Berkeley mote as a target platform
- Mote:
  - 4 MHz, 8-bit CPU
  - 128B RAM (original)
  - 512B Flash memory (original)
  - 40 kbps radio, 100 ft range
- Sensors:
  - Light, temperature, microphone
  - Accelerometer
  - Magnetometer

Sensor Net Data Acquisition
- First: build routing tree
- Second: begin sensing and aggregation

Sensor Net Data Acquisition (Sum)
- First: build routing tree
- Second: begin sensing and aggregation (e.g., sum)
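The two phases can be simulated in a few lines. This is only a sketch under stated assumptions: the topology, node names, and readings are made up, the flood is modeled as a breadth-first search from the base station, and radio loss is ignored. It shows why in-network summation is cheap: each non-root node sends exactly one partial sum to its parent.

```python
from collections import deque

def build_routing_tree(links, base):
    """Breadth-first flood from the base station; each node adopts the
    first node it hears the request from as its parent."""
    parent = {base: None}
    frontier = deque([base])
    while frontier:
        node = frontier.popleft()
        for neighbor in sorted(links.get(node, ())):
            if neighbor not in parent:
                parent[neighbor] = node
                frontier.append(neighbor)
    return parent

def depth(parent, node):
    """Number of hops from node up to the base station."""
    hops = 0
    while parent[node] is not None:
        node = parent[node]
        hops += 1
    return hops

def aggregate_sum(parent, readings):
    """Fold readings up the tree, deepest nodes first. Every non-root
    node sends one partial sum to its parent."""
    partial = {node: readings.get(node, 0) for node in parent}
    messages = 0
    for node in sorted(parent, key=lambda n: depth(parent, n), reverse=True):
        if parent[node] is not None:
            partial[parent[node]] += partial[node]
            messages += 1
    root = next(n for n, p in parent.items() if p is None)
    return partial[root], messages

# Hypothetical five-node topology with symmetric radio links.
links = {
    "base": ["a", "b"],
    "a": ["base", "c", "d"],
    "b": ["base"],
    "c": ["a"],
    "d": ["a"],
}
readings = {"base": 0, "a": 3, "b": 4, "c": 5, "d": 6}

tree = build_routing_tree(links, "base")
total, messages = aggregate_sum(tree, readings)
print(total, messages)
```

Shipping every raw reading to the base station instead would cost one message per hop per reading, so the saving grows with the depth of the tree.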

Sensor Network Research
- Routing: need to aggregate and consolidate data in a power-efficient way
  - Ad hoc routing: generate routing tree to base station
  - Generally need to merge computation with routing
- Robustness: need to combine info from many sensors to account for individual errors
  - What aggregation functions make sense?
- Languages: how do we express what we want to do with sensor networks?
  - Many proposals here

A First Try: TinyOS and nesC
- TinyOS: a custom OS for sensor nets, written in nesC
  - Assumes low-power CPU
  - Very limited concurrency support: events (signaled asynchronously) and tasks (cooperatively scheduled)
- Applications built from “components”
  - Basically, small objects without any local state
  - Various features in libraries that may or may not be included

    interface Timer {
      command result_t start(char type, uint32_t interval);
      command result_t stop();
      event result_t fired();
    }
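The event/task split can be illustrated with a small single-threaded simulation. This is not nesC and not the TinyOS scheduler; `Scheduler` and `TimerClient` are hypothetical names chosen for the sketch. The point is the pattern: an event handler does minimal work and posts a task, and posted tasks later run to completion in FIFO order.

```python
from collections import deque

class Scheduler:
    """Hypothetical cooperative scheduler: tasks run one at a time, in
    FIFO order, and each runs to completion before the next starts."""
    def __init__(self):
        self.tasks = deque()

    def post(self, task):
        self.tasks.append(task)

    def run(self):
        while self.tasks:
            task = self.tasks.popleft()
            task()

class TimerClient:
    """Hypothetical stand-in for a component that handles Timer.fired()."""
    def __init__(self, scheduler, log):
        self.scheduler = scheduler
        self.log = log

    def fired(self):
        # Event handler: record the event and defer the real work.
        self.log.append("event: fired")
        self.scheduler.post(self.process_sample)

    def process_sample(self):
        # Task: the longer-running work, executed later by the scheduler.
        self.log.append("task: process sample")

log = []
scheduler = Scheduler()
client = TimerClient(scheduler, log)
client.fired()      # two timer events arrive before any task runs
client.fired()
scheduler.run()     # the deferred tasks now run in posting order
print(log)
```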

Drawbacks of this Approach
- Need to write very low-level code for sensor net behavior
- Only simple routing policies are built into TinyOS; some of the routing algorithms may have to be implemented by hand
- Has required many follow-up papers to fill in some of the missing pieces, e.g., Hood (object tracking and state sharing), …

An Alternative
- “Much” of the computation being done in sensor nets looks like what we were discussing with STREAM
- Today’s sensor networks look a lot like databases, pre-Codd
  - Custom “access paths” to get to data
  - One-off custom code
- So why not look at mapping sensor network computation to SQL?
  - Not very many joins here, but significant aggregation
  - Now the challenge is in picking a distribution and routing strategy that provides appropriate guarantees and minimizes power usage

TinyDB and TinySQL
- Treat the entire sensor network as a universal relation
  - Each type of sensor data is a column in a global table
  - Tuples are created according to a sample interval (separated by epochs)
  - (Implications of this model?)

    SELECT nodeid, light, temp
    FROM sensors
    SAMPLE INTERVAL 1s FOR 10s
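One way to picture the universal-relation model is to simulate the query above. The sketch below is an assumption-laden illustration, not TinyDB code: `run_query` is a hypothetical helper, each node is modeled as a function from epoch number to a dict of readings, and a node that lacks a sensor yields `None` (standing in for NULL) in that column. That last point is one implication of the model: the global table is sparse wherever nodes carry different sensors.

```python
def run_query(nodes, columns, sample_interval_s, duration_s):
    """Emit one tuple per node per epoch, as in
    SELECT nodeid, <columns> FROM sensors SAMPLE INTERVAL i FOR d."""
    epochs = int(duration_s // sample_interval_s)
    rows = []
    for epoch in range(epochs):
        for nodeid in sorted(nodes):
            reading = nodes[nodeid](epoch)
            row = {"epoch": epoch, "nodeid": nodeid}
            for column in columns:
                # A missing sensor shows up as None in the global table.
                row[column] = reading.get(column)
            rows.append(row)
    return rows

# Hypothetical motes: node 1 has light and temperature, node 2 only light.
nodes = {
    1: lambda epoch: {"light": 100 + epoch, "temp": 20},
    2: lambda epoch: {"light": 50},
}

rows = run_query(nodes, ["light", "temp"], sample_interval_s=1, duration_s=10)
for row in rows[:4]:
    print(row)
```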

Storage Points and Windows
- Like Aurora and STREAM, can materialize portions of the data:

    CREATE STORAGE POINT recentlight SIZE 8
    AS (SELECT nodeid, light FROM sensors SAMPLE INTERVAL 10s)

- and we can use windowed aggregates:

    SELECT WINAVG(volume, 30s, 5s)
    FROM sensors
    SAMPLE INTERVAL 1s
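A rough sketch of both ideas, under assumptions: the storage point is modeled as a bounded buffer of the 8 most recent tuples (reading `SIZE 8` as a tuple count is an assumption), and `WINAVG(volume, 30s, 5s)` is read as a 30-second window that slides forward every 5 seconds. The helper name `winavg` and the sample data are made up.

```python
from collections import deque

# Storage point: keep only the most recent 8 (nodeid, light) tuples.
recentlight = deque(maxlen=8)
for epoch in range(12):
    recentlight.append((1, 100 + epoch))   # older tuples fall off the front

def winavg(samples, window_s, slide_s, sample_interval_s=1):
    """Average over a sliding window: one output per slide, once a full
    window of samples is available."""
    size = int(window_s // sample_interval_s)   # samples per window
    step = int(slide_s // sample_interval_s)    # samples per slide
    averages = []
    for end in range(size, len(samples) + 1, step):
        window = samples[end - size:end]
        averages.append(sum(window) / size)
    return averages

volume = list(range(40))          # hypothetical 1 Hz volume readings
print(list(recentlight))
print(winavg(volume, window_s=30, slide_s=5))
```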

Events

    ON EVENT bird-detect(loc):
      SELECT AVG(light), AVG(temp), event.loc
      FROM sensors AS s
      WHERE dist(s.loc, event.loc) < 10m
      SAMPLE INTERVAL 2s FOR 30s

Power and TinyDB
- Cost-based optimizer tries to find a query plan that yields the lowest overall power consumption
  - Different sensors have different power usage
  - Try to order sampling according to selectivity (sound familiar?)
  - Assumption of uniform distribution of values over range
- Batching of queries (multi-query optimization)
  - Convert a series of events into a stream join with a table
- Also need to consider where the query is processed…
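The ordering idea is the same one used for expensive predicates in conventional optimizers. The sketch below is a simplified illustration, not TinyDB's optimizer: sensor names, energy costs, and selectivities are invented, predicates are assumed independent, and selectivity is estimated from a uniform distribution over the sensor's range. Sampling stops at the first predicate that fails, so cheap and selective predicates should go first; ordering by cost / (1 - selectivity) minimizes expected energy under these assumptions.

```python
from collections import namedtuple
from itertools import permutations

Predicate = namedtuple("Predicate", "sensor sample_cost selectivity")

def uniform_selectivity_gt(threshold, lo, hi):
    """Fraction of a uniform [lo, hi] range that satisfies value > threshold."""
    fraction = (hi - threshold) / (hi - lo)
    return min(1.0, max(0.0, fraction))

def expected_cost(order):
    """Expected energy per tuple when sampling stops at the first failure."""
    cost, p_reached = 0.0, 1.0
    for pred in order:
        cost += p_reached * pred.sample_cost
        p_reached *= pred.selectivity
    return cost

def order_by_rank(predicates):
    """Sample cheap, selective predicates first."""
    def rank(pred):
        if pred.selectivity >= 1.0:
            return float("inf")      # never filters anything: sample last
        return pred.sample_cost / (1.0 - pred.selectivity)
    return sorted(predicates, key=rank)

# Hypothetical costs (arbitrary energy units) and selectivities.
predicates = [
    Predicate("accel", 600, 0.1),
    Predicate("light", 90, uniform_selectivity_gt(500, 0, 1000)),
    Predicate("temp", 90, 0.9),
]

plan = order_by_rank(predicates)
print([p.sensor for p in plan], expected_cost(plan))
```

On this data the expensive accelerometer is not sampled first even though it is the most selective, because its cost outweighs its filtering power.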

Dissemination of Queries
- Based on semantic routing tree (SRT) idea
  - SRT build request is flooded first
  - Node n gets to choose its parent p, based on radio range from root
  - Parent knows its children
    - Maintains an interval on values for each child
    - Forwards requests to children as appropriate
- Maintenance:
  - If interval changes, child notifies its parent
  - If a node disappears, parent learns of this when it fails to get a response to a query
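The forwarding rule can be sketched as follows. This is an illustrative model only: `SRTNode` is a hypothetical class, each node holds one static attribute value, the tree is built bottom-up by hand rather than by flooding, and maintenance messages are omitted. Each parent records, per child, the interval covering that child's whole subtree, and forwards a range query only to children whose interval overlaps it.

```python
class SRTNode:
    def __init__(self, nodeid, value):
        self.nodeid = nodeid
        self.value = value
        self.children = {}
        self.child_interval = {}   # child id -> (lo, hi) over its subtree

    def interval(self):
        """Interval covering this node's value and all of its descendants."""
        lo = hi = self.value
        for child_lo, child_hi in self.child_interval.values():
            lo, hi = min(lo, child_lo), max(hi, child_hi)
        return lo, hi

    def add_child(self, child):
        # The child's subtree must already be attached, because the
        # interval is captured at this point (no maintenance in this sketch).
        self.children[child.nodeid] = child
        self.child_interval[child.nodeid] = child.interval()

    def route(self, q_lo, q_hi, visited):
        """Return ids of nodes whose value lies in [q_lo, q_hi]; record
        every node the query actually reaches in visited."""
        visited.append(self.nodeid)
        matches = [self.nodeid] if q_lo <= self.value <= q_hi else []
        for child_id, (lo, hi) in self.child_interval.items():
            if lo <= q_hi and q_lo <= hi:          # intervals overlap
                matches += self.children[child_id].route(q_lo, q_hi, visited)
        return matches

# Hypothetical tree: root -> a -> {c, d}, root -> b -> e
root, a, b = SRTNode("root", 10), SRTNode("a", 5), SRTNode("b", 20)
c, d, e = SRTNode("c", 1), SRTNode("d", 7), SRTNode("e", 25)
a.add_child(c)
a.add_child(d)
b.add_child(e)
root.add_child(a)
root.add_child(b)

visited = []
print(root.route(18, 30, visited), visited)
```

The subtree under `a` never sees the query, so those nodes can stay asleep.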

Query Processing
- Mostly consists of sleeping!
  - Wake briefly, sample, and compute operators, then route onwards
  - Nodes are time synchronized
  - Awake time is proportional to the neighborhood size (why?)
- Computation is based on partial state records
  - Basically, each operation is a partial aggregate value, plus the reading from the sensor
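For AVG, a partial state record has to carry more than the running average, because averages of averages are wrong when subtrees differ in size. A minimal sketch, assuming the record is a (sum, count) pair; the function names are made up for illustration:

```python
def init(reading):
    """Partial state record for a single local reading."""
    return (reading, 1)

def merge(left, right):
    """Combine two partial state records on the way up the tree."""
    return (left[0] + right[0], left[1] + right[1])

def evaluate(record):
    """Turn the final record into the answer at the base station."""
    total, count = record
    return total / count

# A node with local reading 30 merges records received from two children.
from_children = merge(init(10), init(20))
at_parent = merge(from_children, init(30))
print(at_parent, evaluate(at_parent))

# Averaging the children's average with the local reading would be wrong:
naive = (evaluate(from_children) + 30) / 2
print(naive)
```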

Load Shedding & Approximation
- What if the router queue is overflowing?
  - Need to prioritize tuples, drop the ones we don’t want
  - FIFO vs. averaging the head of the queue vs. delta-proportional weighting
- Later work considers the question of using approximation for more power efficiency
  - If sensors in one region change less frequently, can sample less frequently (or fewer times) in that region
  - If sensors change less frequently, can sample readings that take less power but are correlated (e.g., battery voltage vs. temperature)

The Future of Sensor Nets?
- TinySQL is a nice way of formulating the problem of query processing with motes
  - View the sensor net as a universal relation
  - Can define views to abstract some concepts, e.g., an object being monitored
- But:
  - What about when we have multiple instances of an object to be tracked? Correlations between objects? (Joins)
  - What if we have more complex data? More CPU power?
  - What if we want to reason about accuracy?

XML: A Format of Many Uses
- XML has become the standard for data interchange, and for many document representations
- Sometimes we’d like to store it…
  - Collections of text documents, e.g., the Web, doc DBs
    - … How would we want to query those?
    - IR/text queries, path queries, XQueries?
  - Interchanging data
    - SOAP messages, RSS, XML streams
    - Perhaps subsets of data from RDBMSs
  - Storing native, database-like XML data
  - Caching
  - Logging of XML messages

XML: Hierarchical Data and Its Challenges
- It’s not normalized…
  - It conceptually centers around some origin, meaning that navigation becomes central to querying and visualizing
    - Contrast with E-R diagrams
  - How to store the hierarchy?
    - Complex navigation may include going up, sideways in tree
    - Updates, locking
    - Optimization
- Also, it’s ordered
  - May restrict order of evaluation (or at least presentation)
  - Makes updates more complex
- Many of these issues aren’t unique to XML
  - Semistructured databases, esp. with ordered collections, were similar
  - But our efforts in that area basically failed…

Two Ways of Thinking of XML Processing
- XML databases (today)
  - Hierarchical storage + locking (Natix, TIMBER, BerkeleyDB, Tamino, …)
  - Query optimization
- “Streaming XML” (next time)
  - RDBMS → XML export
  - Partitioning of computation between source and mediator
  - “Streaming XPath” engines
- The difference is in storage (or lack thereof)

XML in a Database
- Use a legacy RDBMS
  - Shredding [Shanmugasundaram+99] and many others
  - Path-based encodings [Cooper+01]
  - Region-based encodings [Bruno+02][Chen+04]
  - Order preservation in updates [Tatarinov+02], …
  - What’s novel here? How does this relate to materialized views and warehousing?
- Native XML databases
  - Hierarchical storage (Natix, TIMBER, BerkeleyDB, Tamino, …)
  - Updates and locking
  - Query optimization (e.g., that on Galax)
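As a concrete picture of a region-based encoding, the sketch below numbers each element with a (start, end, level) triple during a depth-first walk, so that ancestor and parent tests become simple comparisons that an RDBMS can evaluate as join predicates. It is a generic illustration, not the exact scheme of any cited paper; the row layout and function names are assumptions, and text nodes and attributes are ignored.

```python
import xml.etree.ElementTree as ET

def region_encode(root):
    """Return one row per element with start/end positions and depth."""
    rows = []
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        row = {"tag": elem.tag, "start": counter, "level": level}
        rows.append(row)                 # rows stay in document order
        for child in elem:
            visit(child, level + 1)
        counter += 1
        row["end"] = counter             # assigned after all descendants

    visit(root, 0)
    return rows

def is_ancestor(anc, desc):
    """anc contains desc iff desc's region nests strictly inside anc's."""
    return anc["start"] < desc["start"] and desc["end"] < anc["end"]

def is_parent(anc, desc):
    return is_ancestor(anc, desc) and anc["level"] + 1 == desc["level"]

doc = ET.fromstring("<book><title/><chapter><section/></chapter></book>")
rows = region_encode(doc)
for row in rows:
    print(row)
```

Each row could be stored as a tuple in a single relational table; the cost of this layout shows up on updates, since an insertion can force renumbering of later regions.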

Query Processing for XML
- Why is optimization harder?
  - Hierarchy means many more joins (conceptually)
    - “traverse”, “tree-match”, “x-scan”, “unnest”, “path”, … operators
    - Though typically parent-child relationships
    - Often don’t have good measure of “fan-out”
    - More ways of optimizing this
  - Order preservation limits processing in many ways
  - Nested content ~ left outer join
    - Except that we need to cluster a collection with the parent
    - Relationship with NF² approach
  - Tags (don’t really add much complexity except in trying to encode efficiently)
  - Complex functions and recursion
    - Few real DB systems implement these fully
- Why is storage harder?
  - That’s the focus of Natix, really

The Natix System
- In contrast to many pieces of work on XML, focuses on the bottom layers, equivalent to System R’s RSS
  - Physical layout
  - Indexing
  - Locking/concurrency control
  - Logging/recovery

Physical Layout
- What are our options in storing XML trees?
  - At some level, it’s all smoke-and-mirrors
  - Need to map to “flat” byte sequences on disk
- But several options:
  - Shred completely, as in many RDBMS mappings
    - Each path may get its own contiguous set of pages, e.g., vectorized XML [Buneman et al.]
    - An element may get its 1:1 children, e.g., shared inlining [Shanmugasundaram+] and [Chen+]
    - All content may be in one table, e.g., [Florescu/Kossmann] and most interval encoded XML
  - We may embed a few items on the same page and “overflow” the rest
    - How collections are often stored in ORDBMS
  - We may try to cluster XML trees on the same page, as “interpreted BLOBs”
    - This is Natix’s approach (and also IBM’s DB2)
- Pros and cons of these approaches?

Challenges of the Page-per-Tree Approach
- How big of a tree?
  - What happens if the XML overflows the tree?
- Natix claims an adaptive approach to choosing the tree’s granularity
  - Primarily based on balancing the tree, constraints on children that must appear with a parent
  - What other possibilities make sense?
- Natix uses a B+ Tree-like scheme for achieving balance and splitting a tree across pages

Example
[Figure: split point in parent page; note the “proxy” nodes]

That Was Simple – But What about Updates?
- Clearly, insertions and deletions can affect things
  - Deletion may ultimately require us to rebalance
  - Ditto with insertion
- But insertion also may make us run out of space – what to do?
  - Their approach: add another page; ultimately may need to split at multiple levels, as in B+ Tree
  - Others have studied this problem and used integer encoding schemes (plus B+ Trees) for the order

Does this Help?
- According to general lore, yes
  - The Natix experiments in this paper were limited in their query and adaptivity loads
  - But the IBM people say their approach, which is similar, works significantly better than Oracle’s shredded approach

There’s More to Updates than the Pages
- What about concurrency control and recovery?
- We already have a notion of hierarchical locks, but they claim:
  - If we want to support IDREF traversal, and indexing directly to nodes, we need more
- What’s the idea behind SPP locking?

Logging
- They claim ARIES needs some modifications. Why?
- Their changes:
  - Need to make subtree updates more efficient: don’t want to write a log entry for each subtree insertion
    - Use (a copy of) the page itself as a means of tracking what was inserted, then batch-apply to WAL
  - “Annihilators”: if we undo a tree creation, then we probably don’t need to worry about undoing later changes to that tree
  - A few minor tweaks to minimize undo/redo when only one transaction touches a page

Annihilators
[Figure: annihilators]

Assessment
- Native XML storage isn’t really all that different from other means of storage
  - There are probably some good reasons to make a few tweaks in locking
  - Optimization remains harder
- A real solution to materialized view creation would probably make RDBMSs come close to delivering the same performance, modulo locking