HiFi: Network-centric Query Processing in the Physical World SAP Research Forum February 2005 Mike Franklin UC Berkeley
Introduction: Receptors everywhere! Wireless sensor networks, RFID technologies, digital homes, network monitors, ... Large-scale deployments will take the form of High Fan-In Systems.
High Fan-in Systems: Large numbers of receptors = large data volumes. Hierarchical, successive aggregation. The “Bowtie”.
High Fan-in Example (SCM): Diagram levels, from the edge inwards: Receptors; Dock doors, Shelves; Warehouses, Stores; Regional Centers; Headquarters.
Properties: High Fan-In, globally-distributed architecture. Large data volumes generated at edges. Filtering and cleaning must be done there. Successive aggregation as you move inwards. Summaries/anomalies continually, details later. Strong temporal focus. Strong spatial/geographic focus. Streaming data and stored data. Integration within and across enterprises.
Design Space: Time. Axis: Time Scale, from seconds to years. Stages along the axis: Filtering, Cleaning, Alerts; Monitoring, Time-series; Data mining (recent history); Archiving (provenance and schema evolution). Processing modes: On-the-fly processing; Stream/Disk processing; Disk-based processing.
Design Space: Geography. Axis: Geographic Scope, from local to global. Stages along the axis: Filtering, Cleaning, Alerts; Monitoring, Time-series; Data mining (recent history); Archiving (provenance and schema evolution). Sites: Several Readers; Regional Centers; Central Office.
Design Space: Resources. Axis: Individual Resources, from tiny to huge. Stages along the axis: Filtering, Cleaning, Alerts; Monitoring, Time-series; Data mining (recent history); Archiving (provenance and schema evolution). Platforms: Devices; Stargates/Desktops; Clusters/Grids.
Design Space: Data. Axes: Degree of Detail and Aggregate Data Volume. Stages along the axes: Filtering, Cleaning, Alerts; Monitoring, Time-series; Data mining (recent history); Archiving (provenance and schema evolution). Data kept: Dup Elim (history: hrs); Interesting Events (history: days); Trends/Archive (history: years).
State of the Art: Current approaches are hand-coded and script-based: expensive, one-off, brittle, hard to deploy and keep running. Piecemeal/stovepipe systems: each type of receptor (RFID, sensors, etc.) is handled separately. Standards efforts are not addressing this: protocol design bent; different “data models” at each level; reinventing “query languages” at each level. No end-to-end, integrated middleware for managing distributed receptor data.
HiFi: A data management infrastructure for high fan-in environments. Uniform Declarative Framework: every node is a data stream processor that speaks SQL-ese; stream-oriented queries at all levels. Hierarchical, stream-based views as an organizing principle.
Why Declarative? (database dogma): Independence (data, location, platform): allows the system to adapt over time. Many optimization opportunities: in a complex system, automatic optimization is key; also, optimization across multiple applications. Simplifies Programming???
Building HiFi
Integrating RFID & Sensors (the “loudmouth” query)
A Tale of Two Systems: TinyDB: declarative query processing for wireless sensor networks; in-network aggregation; released as part of the TinyOS Open Source Distribution. TelegraphCQ: data stream processor; continuous, adaptive query processing with aggressive sharing; built by modifying PostgreSQL; open source “beta” release out now, new release soon.
TinyDB: The Network is the Database. Basic idea: treat the sensor net as a “virtual table”. System hides details/complexities of devices, changing topologies, failures, … System is responsible for efficient execution. Developed on TinyOS/Motes. Example query:
    SELECT MAX(mag)
    FROM sensors
    WHERE mag > thresh
    SAMPLE PERIOD 64ms
Diagram labels: App; Query, Trigger; Data; TinyDB; Sensor Network.
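A minimal Python sketch of the idea behind the example query, not TinyDB's actual API. It assumes a hypothetical routing tree given as nested dicts; the names `in_network_max`, `mag`, and `children` are illustrative only. Each node applies the `mag > thresh` filter to its own reading and passes just one partial maximum to its parent, which is the in-network aggregation pattern the previous slide mentions.

```python
from typing import Optional


def in_network_max(node: dict, thresh: float) -> Optional[float]:
    """Return MAX(mag) over readings with mag > thresh in this node's subtree.

    `node` is a hypothetical dict: {"mag": float, "children": [node, ...]}.
    Returns None when no reading in the subtree passes the filter.
    """
    # Apply the WHERE clause to the local reading.
    best = node["mag"] if node["mag"] > thresh else None
    # Merge each child's partial result; one value travels up per child.
    for child in node.get("children", []):
        partial = in_network_max(child, thresh)
        if partial is not None and (best is None or partial > best):
            best = partial
    return best


# One sampling epoch over a small, made-up three-level tree.
root = {
    "mag": 1.0,
    "children": [
        {"mag": 4.5, "children": [{"mag": 7.2, "children": []}]},
        {"mag": 0.3, "children": []},
    ],
}
print(in_network_max(root, thresh=2.0))  # 7.2
```

In a real deployment this evaluation would repeat once per sample period; the sketch covers a single epoch.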
TelegraphCQ: Data Stream Monitoring. Streaming data: network monitors; sensor networks, RFID; news feeds, stock tickers, … B2B and enterprise apps: trade reconciliation, order processing, etc. (Quasi) real-time flow of events and data. Manage these flows to drive business processes. Can mine flows to create and adjust business rules. Can also “tap into” flows for on-line analysis.
Data Stream Processing: Diagram labels: Traditional Database (Queries, Data) versus Data Stream Processor (Queries, Data, Result Tuples). Data streams are unending. Continuous, long running queries. Real-time processing.
Windowed Queries: A typical streaming query:
    SELECT S.city, AVG(temp)
    FROM SOME_STREAM S [range by '5 seconds' slide by '5 seconds']
    WHERE S.state = 'California'
    GROUP BY S.city
Window clause: range by means “I want to look at 5 seconds worth of data”; slide by means “I want a result tuple every 5 seconds”. Diagram labels: Data Stream; Window; Result Tuple(s).
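The window clause can be illustrated with a small Python sketch. This is an assumption-laden model rather than TelegraphCQ's implementation: tuples are `(timestamp_seconds, city, state, temp)`, windows start at time 0 and cover `[start, start + range_s)`, and the function name `windowed_avg` is hypothetical.

```python
from collections import defaultdict


def windowed_avg(tuples, range_s, slide_s, state):
    """Emit (window_start, {city: avg_temp}) once per slide interval.

    Assumed window model: window k covers [k*slide_s, k*slide_s + range_s).
    """
    if not tuples:
        return []
    last_ts = max(t[0] for t in tuples)
    results = []
    start = 0.0
    while start <= last_ts:
        sums = defaultdict(lambda: [0.0, 0])  # city -> [sum of temps, count]
        for ts, city, st, temp in tuples:
            # WHERE S.state = ... restricted to the current window
            if start <= ts < start + range_s and st == state:
                sums[city][0] += temp
                sums[city][1] += 1
        # GROUP BY S.city with AVG(temp)
        results.append((start, {c: s / n for c, (s, n) in sums.items()}))
        start += slide_s
    return results


stream = [
    (0, "Berkeley", "California", 60.0),
    (1, "Berkeley", "California", 62.0),
    (2, "Reno", "Nevada", 50.0),
    (6, "Oakland", "California", 65.0),
    (7, "Berkeley", "California", 64.0),
]
for window_start, averages in windowed_avg(stream, 5, 5, "California"):
    print(window_start, averages)
```

With range and slide both set to 5 seconds the windows do not overlap, so every input tuple contributes to exactly one result.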
TelegraphCQ Architecture: Diagram labels: Proxy. TelegraphCQ Front End: Listener, Parser, Planner, Mini-Executor, Catalog. TelegraphCQ Wrapper ClearingHouse: Wrappers. Shared Memory: Query Plan Queue, Eddy Control Queue, Query Result Queues. Shared Memory Buffer Pool. Disk. TelegraphCQ Back End (shown twice): Modules, Scans, CQEddy, Split.
The HiFi System: Diagram labels: TelegraphCQ; TinyDB; Stargates; Sensor Networks & RFID Readers; RFID Wrappers; PC.
Basic HiFi Architecture: Hierarchical federation of nodes. Each node: Data Stream Query Processor (DSQP) and HiFi Glue. Views drive system functionality. Metadata Repository (MDR). Functions listed: Management; Query Planning; Archiving; Internode coordination and communication.
HiFi Processing Pipelines: The CSAVA Framework. Stages and their scope: Clean (Single Tuple); Smooth (Window); Arbitrate (Multiple Receptors); Validate (Join w/Stored Data); Analyze (On-line Data Mining). CSAVA Generalization.
CSAVA Processing: Clean
    CREATE VIEW cleaned_rfid_stream AS
    (SELECT receptor_id, tag_id
     FROM rfid_stream rs
     WHERE read_strength >= strength_T)
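As a rough Python equivalent of the Clean view, assuming readings arrive as simple records: the `Reading` type and the `clean` function are hypothetical names introduced here, and the threshold value in the example is made up.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Reading(NamedTuple):
    receptor_id: str
    tag_id: str
    read_strength: float


def clean(rfid_stream: Iterable[Reading], strength_t: float) -> Iterator[Tuple[str, str]]:
    """Per-tuple filter: keep readings with read_strength >= strength_t."""
    for r in rfid_stream:
        if r.read_strength >= strength_t:
            # Project to the two columns the view selects.
            yield (r.receptor_id, r.tag_id)


raw = [
    Reading("shelf0", "tagA", 0.9),
    Reading("shelf1", "tagA", 0.2),  # weak read, dropped
    Reading("shelf1", "tagB", 0.7),
]
print(list(clean(raw, strength_t=0.5)))
```

Because the filter looks at one tuple at a time, it needs no window and no state.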
CSAVA Processing: Smooth (pipeline so far: Clean, Smooth)
    CREATE VIEW smoothed_rfid_stream AS
    (SELECT receptor_id, tag_id
     FROM cleaned_rfid_stream [range by '5 sec', slide by '5 sec']
     GROUP BY receptor_id, tag_id
     HAVING count(*) >= count_T)
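A hedged Python sketch of the Smooth step. It assumes cleaned readings carry a timestamp in seconds, `(ts, receptor_id, tag_id)`, and that the 5-second windows do not overlap and start at time 0. The function name `smooth` and the sample data are illustrative.

```python
from collections import Counter


def smooth(cleaned, window_s, count_t):
    """Per window, keep (receptor_id, tag_id) pairs read at least count_t times.

    Returns {window_index: sorted list of surviving pairs}.
    """
    windows = {}
    for ts, receptor_id, tag_id in cleaned:
        idx = int(ts // window_s)  # non-overlapping window the reading falls in
        windows.setdefault(idx, Counter())[(receptor_id, tag_id)] += 1
    return {
        idx: sorted(pair for pair, n in counts.items() if n >= count_t)
        for idx, counts in sorted(windows.items())
    }


cleaned = [
    (0.5, "shelf0", "tagA"),
    (1.0, "shelf0", "tagA"),
    (2.0, "shelf1", "tagA"),  # seen once in window 0, dropped
    (6.0, "shelf1", "tagB"),
    (7.5, "shelf1", "tagB"),
    (8.0, "shelf0", "tagB"),  # seen once in window 1, dropped
]
print(smooth(cleaned, window_s=5, count_t=2))
```

Requiring a minimum count per window suppresses isolated stray reads while keeping tags that a receptor sees consistently.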
CSAVA Processing: Arbitrate (pipeline so far: Clean, Smooth, Arbitrate)
    CREATE VIEW arbitrated_rfid_stream AS
    (SELECT receptor_id, tag_id
     FROM smoothed_rfid_stream rs [range by '5 sec', slide by '5 sec']
     GROUP BY receptor_id, tag_id
     HAVING count(*) >= ALL
       (SELECT count(*)
        FROM smoothed_rfid_stream [range by '5 sec', slide by '5 sec']
        WHERE tag_id = rs.tag_id
        GROUP BY receptor_id))
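The `>= ALL` condition assigns each tag to whichever receptor read it most often in the window. A small Python sketch of that rule follows, assuming the per-window read counts are already available as a dict; the function name `arbitrate` and the counts are made up for illustration. As in the SQL, ties are kept for every receptor that reaches the maximum.

```python
def arbitrate(counts):
    """Keep, for each tag, the receptor(s) with the highest read count.

    `counts` maps (receptor_id, tag_id) -> number of reads in one window.
    """
    # Highest count observed for each tag across all receptors.
    best = {}
    for (receptor_id, tag_id), n in counts.items():
        best[tag_id] = max(best.get(tag_id, 0), n)
    # A pair survives when its count is >= every other receptor's count.
    return sorted(pair for pair, n in counts.items() if n == best[pair[1]])


window_counts = {
    ("shelf0", "tagA"): 5,
    ("shelf1", "tagA"): 2,  # loses tagA to shelf0
    ("shelf0", "tagB"): 3,
    ("shelf1", "tagB"): 3,  # tie: both receptors keep tagB
}
print(arbitrate(window_counts))
```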
CSAVA Processing: Validate (pipeline so far: Clean, Smooth, Arbitrate, Validate)
    CREATE VIEW validated_tags AS
    (SELECT tag_name
     FROM arbitrated_rfid_stream rs [range by '5 sec', slide by '5 sec'],
          known_tag_list tl
     WHERE tl.tag_id = rs.tag_id)
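Validation is a join between the stream and a stored table. A minimal Python sketch, assuming `known_tag_list` fits in memory as a dict from `tag_id` to `tag_name`; the function name `validate` and the sample tags are hypothetical.

```python
def validate(arbitrated, known_tags):
    """Join arbitrated (receptor_id, tag_id) pairs with a stored tag list.

    Unknown tag ids have no join partner and are dropped.
    """
    return [known_tags[tag_id] for _, tag_id in arbitrated if tag_id in known_tags]


known_tag_list = {"tagA": "milk", "tagB": "eggs"}
arbitrated = [("shelf0", "tagA"), ("shelf1", "tagB"), ("shelf1", "tagZ")]
print(validate(arbitrated, known_tag_list))
```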
CSAVA Processing: Analyze (full pipeline: Clean, Smooth, Arbitrate, Validate, Analyze)
    CREATE VIEW tag_count AS
    (SELECT tag_name, count(*)
     FROM validated_tags vt [range by '5 min', slide by '1 min']
     GROUP BY tag_name)
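Here the range (5 min) exceeds the slide (1 min), so consecutive windows overlap. The Python sketch below models that under stated assumptions: input tuples are `(ts_seconds, tag_name)`, a result is emitted at each slide boundary `end`, and it covers `[end - range_s, end)`. The name `tag_count` mirrors the view, but the function itself is illustrative, and the example uses a shorter range so that expiry is visible.

```python
from collections import Counter


def tag_count(validated, range_s=300, slide_s=60):
    """Emit (window_end, {tag_name: count}) every slide_s seconds."""
    if not validated:
        return []
    last_ts = max(ts for ts, _ in validated)
    results = []
    end = slide_s
    while end - slide_s <= last_ts:
        # Count every reading that falls inside the current window.
        window = Counter(name for ts, name in validated if end - range_s <= ts < end)
        results.append((end, dict(window)))
        end += slide_s
    return results


validated = [(10, "milk"), (70, "milk"), (80, "eggs"), (130, "milk")]
for window_end, counts in tag_count(validated, range_s=120, slide_s=60):
    print(window_end, counts)
```

In the last window of this example the reading at t=10 has aged out, which is why the milk count stops growing.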
Ongoing Work: Bridging the physical-digital divide: VICE, a “Virtual Device” Interface. Hierarchical query processing: automatic query planning & dissemination. Complex event processing: unifying event and data processing.
Virtual Device (VICE) Layer: “Metaphysical* Data Independence”. *Metaphysics: the branch of philosophy that deals with the ultimate nature of reality and existence. (Name due to Shawn Jeffery.)
The Virtues of VICE: A simple RFID experiment. 2 adjacent shelves, 8 ft each. 10 EPC-tagged items each, plus 5 moved between them. RFID antenna on each shelf.
Ground Truth
Raw RFID Readings
After VICE Processing: Under the covers (in this case): Cleaning, Smoothing, and Arbitration.
Other VICE Uses: Once you have the right abstractions: “Soft Sensors”; quality and lineage streams; pushdown of external validation information; power management and other optimizations; data archiving; model-based sensing; “non-declarative” code; …
Hierarchical Query Processing: “I provide raw readings for Soda Hall”; “I provide avg daily values for Berkeley”; “I provide avg weekly values for California”; “I provide national monthly values for the US”. Continuous and streaming: automatic placement and optimization. Hierarchical: temporal granularity vs. geographic scope; sharing of lower-level streams.
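One way to picture this successive roll-up is a Python sketch in which each level consumes the stream produced by the level below. It is an illustration under assumptions, not HiFi's planner: the helper `rollup` and the sample readings are hypothetical, and averages are carried as `(sum, count)` partials so that a higher level can combine lower-level results without revisiting raw readings.

```python
def rollup(partials, bucket_of):
    """Combine (key, (total, count)) partials into coarser buckets."""
    out = {}
    for key, (total, count) in partials:
        bucket = bucket_of(key)
        s, c = out.get(bucket, (0.0, 0))
        out[bucket] = (s + total, c + count)
    return out


# Raw readings at the lowest level: (day_index, value). Values are made up.
raw = [(0, 10.0), (0, 20.0), (1, 30.0), (7, 40.0)]

# Level 1: daily partials, built directly from raw readings.
daily = rollup(((day, (value, 1)) for day, value in raw), lambda day: day)

# Level 2: weekly partials, built from the daily stream rather than from raw data.
weekly = rollup(daily.items(), lambda day: day // 7)

weekly_avg = {week: total / count for week, (total, count) in weekly.items()}
print(weekly_avg)  # {0: 20.0, 1: 40.0}
```

Averaging the daily averages directly would weight days unequally, which is why the sketch forwards sums and counts instead.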
Complex Event Processing: Needed for monitoring and actuation. Key to prioritization (e.g., of detail data). Exploit duality of data and events. Shared processing. “Semantic Windows”. Challenge: a single system that simultaneously handles events spanning seconds to years.
Next Steps: Archiving and detail data: dealing with transient overloads; rate matching between stored and streaming data; scheduling large archive transfers. System design & deployment: tools for provisioning and evaluating receptor networks. System monitoring & management: leverage monitoring infrastructure for introspection.
Conclusions: Receptors everywhere lead to High Fan-In Systems. Current middleware solutions are complex & brittle. Uniform declarative framework is the key. The HiFi project is exploring this approach. Our initial prototype: leveraged TelegraphCQ and TinyDB; demonstrated RFID/multiple sensor integration; validated the HiFi approach. We have an ambitious on-going research agenda. See for more info.
Acknowledgements: Team HiFi: Shawn Jeffery, Sailesh Krishnamurthy, Frederick Reiss, Shariq Rizvi, Eugene Wu, Nathan Burkhart, Owen Cooper, Anil Edakkunni. Experts in VICE: Gustavo Alonso, Wei Hong, Jennifer Widom. Funding and/or Reduced-Price Gizmos from NSF, Intel, UC MICRO program, and Alien Technologies.