SSDS: Data for Science
A Walkthrough of Proposed SSDS Capabilities
4 April 2002
John Graybeal
Topics
What you want to hear:
  What data is in SSDS
  How to access data
  How to display data
  How to command instruments
What else you should know:
  How easy to use is it?
  Are we sure the data’s OK?
    –Raw data always available?
    –Is it reliable? Is time right?
  What if there’s a problem?
    –Can we tell what happened?
    –Can we gracefully recover?
  Is data distributable/secure?
  What aren’t you getting?
What Data is Available?
1. All data produced by MOOS instruments
   Data is available ‘right away’ if sent to shore, or
   Data could be loaded later, directly from device
2. Other data which has been submitted to SSDS
   Submitted data must follow basic ISI/SSDS guidelines
   Can be brand new (e.g., calibrations), or derived (e.g., from other SSDS data)
3. “Metadata” (descriptive info) about the above
Notes
  –SSDS should not replicate external data stores
  –Someday could re-process existing MBARI data
  –Operational data can also be sent to SSDS and ingested
Metadata “Explained”
Metadata is just “data about other data”
  –My metadata may be your science data, or vice versa
4 metadata types MOOS will handle (≈static)
  –Packet headers (source, timestamp, sequence)
  –Packet descriptions (item 1=“Depth”, 2=“Lat”)
  –Device (data source) descriptions
  –Rich science metadata (status, calibration info)
Everything else is ‘just data’
Wherever possible, we’ll try to keep it simple
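As a rough illustration of the first two metadata types, the sketch below pairs a packet header with a packet description and uses the description to label raw values. The class names, field types, the device name "ctd-01", and the sample values are all assumptions made for this example; the slides do not define an actual packet format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PacketHeader:
    """Hypothetical packet header: source, timestamp, sequence."""
    source: str        # device identifier (format assumed)
    timestamp: float   # seconds since the Unix epoch (assumed convention)
    sequence: int      # per-source packet counter


@dataclass(frozen=True)
class PacketDescription:
    """Hypothetical packet description: position i names item i + 1."""
    item_names: Tuple[str, ...]


def label_items(description: PacketDescription, values: List[float]) -> Dict[str, float]:
    """Attach item names to the raw values of one packet."""
    if len(values) != len(description.item_names):
        raise ValueError("packet does not match its description")
    return dict(zip(description.item_names, values))


header = PacketHeader(source="ctd-01", timestamp=1017878400.0, sequence=42)
description = PacketDescription(item_names=("Depth", "Lat"))
record = label_items(description, [12.5, 36.8])
print(header.source, header.sequence, record)
```

Keeping the description separate from the packets reflects the "≈static" note: the item names are sent once, while headers and values arrive with every packet.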
How To Access the Data?
Ask (catalog) for data of interest (search by device, date, data item name, or combination)
Choose a data set (sets?) of interest, click to access
  –Probably multiple text formats; what’s important? (ASCII CSV? ODV? netCDF? other?)
  –Do you need to monitor or process ‘streaming’ data?
What more advanced features are needed? Desired?
  –Displaying same item across multiple data sets?
  –Selecting specific items or times within data set?
  –Processed data products: sub-setting or interpolating data by time or item? Averaging? Filtering? …?
  –Combining 2 data sets using time as reference?
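To make "combining 2 data sets using time as reference" concrete, here is one possible sketch: each sample of the first series is paired with the second series linearly interpolated at the same timestamp. The function names, the (timestamp, value) tuple layout, and the sample numbers are assumptions for illustration only. Linear interpolation is just one of several merge policies the slide leaves open.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


def interpolate_at(samples: List[Sample], t: float) -> Optional[float]:
    """Linearly interpolate a time-sorted series at time t.

    Returns None when t lies outside the series, rather than extrapolating.
    """
    if not samples:
        return None
    times = [ts for ts, _ in samples]
    if t < times[0] or t > times[-1]:
        return None
    i = bisect_left(times, t)
    if times[i] == t:
        return samples[i][1]
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def merge_on_time(a: List[Sample], b: List[Sample]) -> List[Tuple[float, float, float]]:
    """Pair each sample of a with b interpolated at the same timestamp."""
    merged = []
    for t, value_a in a:
        value_b = interpolate_at(b, t)
        if value_b is not None:
            merged.append((t, value_a, value_b))
    return merged


depth = [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0)]
temperature = [(5.0, 14.0), (15.0, 12.0)]
merged = merge_on_time(depth, temperature)
print(merged)
```

Only the depth sample at t = 10 falls inside the temperature series, so the other two are dropped. A real system would need to decide whether to drop, pad, or flag such samples.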
How to Display the Data?
Basic plots will be available via web interface
  –Quick look in the truest sense
  –We don’t want to create yet another plotting program
Data will be available to existing tools
  –Minimum capability is usable files (ASCII, netCDF, ?)
  –Ideal is to embed SSDS data access directly into tools
      In this model, software within Matlab (for example) can open anything in the archive
      Browsing from within application would be a big plus
      Some (many) tools may do this for free; others we can ‘help’
Before discussing further, you should understand the way we want SSDS (and MOOS) to work
MOOS Data Architecture
[Diagram labels: Ocean Side / Shore Side; Devices; Observing Platform; Shore Side Data System; Communications; Cataloging; Archiving; Data Presentation; Applications/Interfaces; User Applications (User Tools)]
How to Access Instrument (by the way, it’s not an SSDS task)
[Diagram: the MOOS data architecture figure, repeated]
How Data Access Works
[Diagram: the MOOS data architecture figure, repeated]
How Data Access Works
1. SSDS automatically notified of instrument information
   –Instrument qualification and installation on MOOS
   –Instrument configuration (default settings, changes)
   –Data record descriptions (syntactic and semantic)
   –Arrival of new data records
2. SSDS automatically catalogs, archives all arriving data
3. Users search catalog for data of interest
   –References to archived data returned with search results
   –Source data can be accessed via the references
4. User can then view (or subscribe to?) the source data
   –Various formats provided, including basic plots
   –Connections to advanced presentation packages supported
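Steps 2 through 4 above can be sketched as a small in-memory model: arriving records are archived and cataloged together, a search returns references, and a reference leads back to the untouched raw data. Every name here (ShoreSideStore, CatalogEntry, the "device/index" reference scheme, the sample devices) is hypothetical. This is a toy under those assumptions, not the SSDS design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CatalogEntry:
    reference: str          # key into the archive (reference scheme assumed)
    device: str
    item_names: List[str]
    timestamp: float


@dataclass
class ShoreSideStore:
    archive: Dict[str, bytes] = field(default_factory=dict)
    catalog: List[CatalogEntry] = field(default_factory=list)

    def on_data_arrival(self, device: str, item_names: List[str],
                        timestamp: float, raw: bytes) -> str:
        """Step 2: archive the raw record and catalog it in one action."""
        reference = f"{device}/{len(self.catalog)}"
        self.archive[reference] = raw
        self.catalog.append(CatalogEntry(reference, device, list(item_names), timestamp))
        return reference

    def search(self, device: Optional[str] = None, item_name: Optional[str] = None,
               start: Optional[float] = None, end: Optional[float] = None) -> List[str]:
        """Step 3: return references for entries matching every given criterion."""
        hits = []
        for entry in self.catalog:
            if device is not None and entry.device != device:
                continue
            if item_name is not None and item_name not in entry.item_names:
                continue
            if start is not None and entry.timestamp < start:
                continue
            if end is not None and entry.timestamp > end:
                continue
            hits.append(entry.reference)
        return hits

    def fetch(self, reference: str) -> bytes:
        """Step 4: follow a reference back to the raw archived data."""
        return self.archive[reference]


store = ShoreSideStore()
store.on_data_arrival("ctd-01", ["Depth", "Lat"], 100.0, b"12.5,36.8")
store.on_data_arrival("ctd-01", ["Depth", "Lat"], 200.0, b"13.0,36.8")
store.on_data_arrival("gps-01", ["Lat", "Lon"], 150.0, b"36.8,-121.9")
refs = store.search(device="ctd-01", start=150.0)
print(refs, store.fetch(refs[0]))
```

Returning references instead of copies mirrors the slide: search results point at archived source data, so the raw records stay in one place.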
How easy to use is it?
The Hard Part I: Providing ISI instrument drivers
  –Templates should be available, useful for most devices
The Hard Part II: Describing your data streams
  –Must define instrument data streams before deploying
  –Even this can be easy (define your data as a “blob”; but…)
Steps to get data should be pretty easy (1-step?)
  –Find it in catalog (may be many items with similar names)
  –Ask for it in your favorite basic format
  –Plug it in to your favorite application
MOOS/ISI/SSDS makes many things simple
  –Timestamps: synchronous, reliable, available
  –Data transfer, archive, backup all handled automatically
  –Operational relationships (particularly location) tracked
Are we sure the data is OK?
Raw data always available?
  –The system is designed around this core concept
  –Even if SSDS dies, raw data won’t go away
Is data reliable (what you see is what was sent)?
  –Same software for ALL data communication and management: excellent reliability, less work
Is time base correct for the data?
  –Uniform time base for all MOOS/ISI components
  –Of course, you have to send data via ISI data paths
      If you keep it in the instrument, ISI can’t timestamp it
What if there’s a problem?
Can we tell what happened (and avoid it)?
  –Certain systematic information will be available
      Other data arrivals from device/platform/observatory
      Indications of instrument events, reconfigurations
  –Operational data can be sent and maintained
      Transfer rates, connection reliability, power status
      Systemic events and errors
Can we gracefully recover? Yes! (within reason)
  –All the transferred raw data is kept in SSDS
  –All the instrument’s raw data is saved on wet side
  –System designed for graceful data (re)processing
Is data distributable? Is data secure?
Request: Give colleagues access to ‘my’ data
  –Model A: Everyone has access to all data (w/fuzz)
  –Model B: MBARI Internal vs MBARI External
      Option 1: Make ‘your’ data available externally
      Option 2: Bring them to MBARI
      Option 3: Send them a report of your data
  –Model C: Configurable data access security
      Notionally follow Unix (self, group, other) model
      Note this model costs more (amount TBD) to implement
(Note: Access security is also central to confidently enforcing proprietary periods.)
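Model C could look something like the following sketch, which applies the Unix convention that the first matching class (self, then group, then other) decides access. The DataSetAccess class, the can_read function, and the user and group names are invented for this example; the slide only says the model would "notionally" follow Unix.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class DataSetAccess:
    """Hypothetical read permissions for one data set (self, group, other)."""
    owner: str
    group: str
    owner_read: bool = True
    group_read: bool = False
    other_read: bool = False


def can_read(access: DataSetAccess, user: str, user_groups: Set[str]) -> bool:
    """Unix-style check: the first class the user falls into decides."""
    if user == access.owner:
        return access.owner_read
    if access.group in user_groups:
        return access.group_read
    return access.other_read


access = DataSetAccess(owner="alice", group="moos-science", group_read=True)
print(can_read(access, "alice", set()))             # owner
print(can_read(access, "bob", {"moos-science"}))    # group member
print(can_read(access, "carol", {"visitors"}))      # everyone else
```

Under this sketch, a proprietary period could be enforced by leaving other_read off until the period ends, though how SSDS would actually do that is not stated in the slides.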
What aren’t you getting?
Totally transparent way of doing business
  –Some accommodation to infrastructure is required
Very low latencies in data streaming, archiving
  –Latency may be from sensor to shore, and from shore to archival interface
  –Total latency not to exceed 1 hour (?)
Domain-specific data (re-)processing
Advanced data merging and reprocessing
Sophisticated data plotting/analysis via web interface
(High-bandwidth, always-on access to device)
A perfect, fully functional system on day 1
Data Mgt Architecture
Conclusions
SSDS should improve data management for all users
  –At minimum, easier access to your own data and plots
  –Straightforward access to all MBARI data (references)
  –More reliable data storage, time references, metadata links
  –Better long-term usability (gives us more time)
Development will be incremental
  –Full-featured release targeted for MSE 2003
  –Prototypes will exist before then (soon!), but may evolve
  –Features will grow with third-party solutions
Many questions about first-order science priorities
  –Which general-purpose functions do you really need?
  –What are most useful data formats? application interfaces?
  –How important is fine-grained access security?