1 ODM present and futures for internal discussion 5 September 2007 Ilya Zaslavsky, Dave Valentine, Tom Whitenack, Catharine van Ingen

2 Preamble
We've now had experience using the first version of the ODM information model, including the WaterOneFlow services, WaterML schema, and ODM database schema. We've learned a lot. It's time to start work on the next version while continuing to support the testbed and other users. Before beginning that work, let's look at what we've learned and what has changed.

3 PART I: LESSONS LEARNED

4 ODM is an Information Model
ODM defines a canonical information model and semantics for hydrologic observations.
ODM also provides a relational implementation of the model, tuned to locally collected observations typically under the control of a PI.
There are many application scenarios for which the ODM information model should be useful, including:
–data discovery via site catalogs,
–data transformations and versioning within the database,
–managing streaming data,
–long-term preservation of observations data,
–community annotation of hydrologic data and model results, ...
Within the same application scenario, the data content may differ from canonical ODM recommendations.

5 Learning: The Good News
The semantics behind ODM is accepted by the community.
ODM services are standardized.
ODM service creation and tuning is simple.
Pairing a data model with a relational database has provided a general storage schema for observations.
–Enables data curation and archive.
The testbeds have enough to get started.
–Services from key agencies, DASH, and Hydroseek are up and running.
–We will learn a lot from the early community use.

6 Learning: The Bad News
The ODM schema is complex.
–Even we get confused as to what the various schema elements mean.
–Many scenarios need only a subset of the schema. For example, data discovery doesn't need the DataValues table, and real-time data loading doesn't need Samples.
The ODM variable structure is confusing.
–Even we get confused as to how to fill out (or search on) the various category or quality fields.
–Unit conversions, temporal granularity, and other data attributes seem hidden.
ODM lacks collection versioning objects and other provenance mechanisms.
–The present ODM structure is centered on the DataValues table. Unless a database has DataValues, we can't mix discovery information (sites and variables) from different sources in the same database or service. For example, today we need separate services for NWIS daily and NWIS real-time data.
–Collections are difficult, and possibly at the wrong granularity. Relating individual data points can be confusing, and such relationships will not scale to very large data sets.
–ODM tries to work around idiosyncrasies of individual collections, but this needs to be matched by adding provenance information to stay true to sources.
–In some scenarios, we need to keep several versions of the data, with all transformations and roll-back abilities.

7 Enabling Community Development
To date, CUAHSI HIS has been developed by relatively few people working closely together, rather than by a loosely connected community of small projects.
–We've covered many scenarios and needs. Enabling access to agency data repositories via web services is a major step forward. The testbeds and other users can leverage the infrastructure to avoid writing everything from scratch.
–We haven't thought of everything. An example is the photos table in the BearRiverOD sample database (a sketch of such an extension follows below).
It's time to enable all of the community to extend the infrastructure and develop tools.
–The challenge is how to enable organic growth without organic chaos and confusion.
–We need a general checklist and simple tests to ensure that we are all "community aware".
–Some chaos is unavoidable, but we should plan for it.
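
As a concrete illustration of this kind of community extension, the sketch below sets up a hypothetical site-photos table alongside the core schema. The actual BearRiverOD photos table is not reproduced here; all table names and columns are assumptions, and the table is deliberately not prefixed with "ODM" so local additions stay distinguishable from core tables.

```python
import sqlite3

# Sketch of a community extension table in the spirit of the photos table
# in the BearRiverOD sample database. Names and columns are illustrative
# assumptions; the extension is NOT prefixed "ODM", marking it as local.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE SitePhotos (
    PhotoID   INTEGER PRIMARY KEY,
    SiteID    INTEGER NOT NULL,   -- references the core Sites table
    TakenDate TEXT,
    Caption   TEXT,
    FileURL   TEXT NOT NULL
)
""")
```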

8 PART II: PROPOSAL

9 The Proposal
Focus on data discovery, access, and analysis scenarios.
Layer the information model.
–Each layer can be used to solve a specific set of scenarios.
–Layers build on the core and on one another.
–All tools written to a specific layer have guarantees of completeness (and a test suite).
Define both web service and database schema interfaces.
–Direct access to database tables is used when needed for speed or when required by SSIS, SSAS, or other applications.
–Build compliance tests to ensure robustness of the interface (see the sketch below).
Define a few "rules of the road" for community extensions.
–How methods, tables, or vocabularies can be added.
–How to share extensions with others.
Start with the immediately useful and relatively well understood scenarios.
–The first such extension demonstrates the process.
–Document what we're already doing and saying today.
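
One way the compliance idea could work is sketched below: each layer's interface is written down as a required set of tables and web-service methods, and a small test reports what an implementation is missing. The layer names follow the slides, but the specific table and method lists are assumptions for illustration only, not the agreed interface definitions.

```python
# Sketch: express each ODM layer's interface as required tables and
# web-service methods, so a compliance test can verify completeness.
# The table/method sets below are illustrative assumptions.
LAYERS = {
    "ODMCore":    {"tables": {"Sites", "Variables", "Sources"},
                   "methods": {"GetSites", "GetVariableInfo"}},
    "ODMCatalog": {"tables": {"Sites", "Variables", "Sources", "SeriesCatalog"},
                   "methods": {"GetSites", "GetVariableInfo", "GetSiteInfo"}},
}

def check_layer(layer: str, tables: set[str], methods: set[str]) -> list[str]:
    """Return what an implementation is missing for the named layer."""
    spec = LAYERS[layer]
    missing = [f"table {t}" for t in sorted(spec["tables"] - tables)]
    missing += [f"method {m}" for m in sorted(spec["methods"] - methods)]
    return missing

# Example: a catalog server that has not implemented GetSiteInfo yet.
print(check_layer("ODMCatalog",
                  tables={"Sites", "Variables", "Sources", "SeriesCatalog"},
                  methods={"GetSites", "GetVariableInfo"}))
```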

10 ODM Layers: High-Level Overview
[Layer diagram. Web Service Applications and Database Applications sit on top. Below them, the proposed layers: ODMSensor (streaming data loaders), ODMProject (data assembly and analysis), ODMCatalog (data discovery and catalogs), and ODMNext ??? (data archive, education, publication). Each layer is exposed through Web Service methods and through database tables and views, and all are built on ODMCore.]

11 Put the Series Catalog at the Center
Needed for all scenarios, from discovery through archive.
Adding provenance at the series level gives clear tracking while being lighter weight than at the data value level.
–Include data source information (where did the data come from?).
–Include change information (when was the data series created, last changed, and by whom?).
–Identifying vocabularies for variables and sites allows translation and abstraction.
Build out from here for each scenario, adding tables and rows (see the sketch below).
The trade-off: the overhead of dealing with a series of one.
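
A minimal sketch of what a series-centric catalog with series-level provenance might look like follows. The table and column names are illustrative assumptions, not the actual ODM schema or the final proposed design.

```python
import sqlite3

# Sketch of a series-centric catalog carrying provenance at the series
# level rather than per data value. Names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sources (
    SourceID     INTEGER PRIMARY KEY,
    Organization TEXT NOT NULL,          -- agency, PI, or archive
    ContactName  TEXT
);

CREATE TABLE SeriesCatalog (
    SeriesID      INTEGER PRIMARY KEY,
    SiteID        INTEGER NOT NULL,      -- foreign key into a Sites table
    VariableID    INTEGER NOT NULL,      -- foreign key into a Variables table
    SourceID      INTEGER NOT NULL REFERENCES Sources(SourceID),
    BeginDateTime TEXT,
    EndDateTime   TEXT,
    ValueCount    INTEGER,
    -- provenance tracked at the series level
    CreatedDate   TEXT NOT NULL,
    ModifiedDate  TEXT,
    LastChangerID INTEGER
);
""")
```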

12 Example Simplified Activities
I've been archiving real-time discharge data and want to compare it to the daily discharge data I just downloaded.
–Different data series distinguish them (see the aggregation sketch below).
NWIS reports turbidity, grain size, and suspended sediment measurements at different stations and times. I want to use all of them to get the best suspended sediment estimates I can.
–A new data series contains values computed from the originals.
I'm analyzing phosphorus dynamics.
–Need to convert/aggregate different measures.
I'm computing evapotranspiration.
–A new derived data series depends on series containing air temperature, radiation, latent heat, and precipitation.
I want to plot discharge over time and tag the gage by agency, so I need to extend the Matlab object.
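
For the first activity, a minimal sketch of the aggregation step is shown below, assuming the archived real-time discharge has already been loaded into a pandas DataFrame; the column names and values are made up for illustration. The aggregated result would be stored as a new data series with its own identity and a provenance link back to the real-time source series.

```python
import pandas as pd

# Sketch: aggregate archived real-time (e.g., 15-minute) discharge to
# daily means so it can be compared with a downloaded daily series.
# Column names and values are illustrative assumptions.
realtime = pd.DataFrame(
    {"discharge_cfs": [10.2, 10.4, 11.0, 10.8]},
    index=pd.to_datetime([
        "2007-09-01 00:00", "2007-09-01 00:15",
        "2007-09-02 00:00", "2007-09-02 00:15",
    ]),
)

# Daily means derived from the real-time series; this becomes a new
# series rather than overwriting or mixing with the original values.
daily_from_realtime = realtime["discharge_cfs"].resample("D").mean()
print(daily_from_realtime)
```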

13 A Few Observations
The lower-level database design is not intended to be human friendly.
–Separate the information model from the database model.
–Views and automatically generated machine schemas can be built to present the "information model" that is ODM.
Plan now for localization.
–Language and character sets change as you move around the world.
–Location descriptions and addresses change (e.g. Name and Address Markup Language).
While web service calls are stateless, applications often need to preserve state across calls.
–Use identifiers (analogous to cookies) for efficiency and robustness.
–At the same time, avoid using identifiers for persistence.
Writing software is cheap; maintaining software is expensive.
–Resist the urge to write more software than you have to! (Individual software is not a deliverable in infrastructure projects.)
–Leverage, reuse, and document the software you have, especially infrastructure software; follow standards and common models.

14 Example Checklist Items
Do all controlled vocabularies have a default (none) or (unknown) value?
Does the implementation have at least one data source? Does a testbed have a data source identifying itself?
Are all ODM tables present? Are any additional tables not prefixed by ODM?
Does the series catalog correctly represent the contents of the DataSeries table?
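
A few of these items lend themselves to automation. The sketch below checks a database for required core tables, for a default term in the controlled vocabularies, and for a non-empty Sources table. The specific table names, vocabulary tables, and rules are assumptions for illustration, not the official checklist.

```python
import sqlite3

# Sketch of automating a few checklist items against an ODM-style database.
# Required tables and vocabulary names below are illustrative assumptions.
REQUIRED_TABLES = {"Sites", "Variables", "Sources", "SeriesCatalog"}
CV_TABLES = {"CV_SampleMedium", "CV_ValueType"}   # assumed vocabulary tables

def check_database(path: str) -> list[str]:
    problems = []
    conn = sqlite3.connect(path)
    existing = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}

    # Are all core tables present?
    for table in sorted(REQUIRED_TABLES - existing):
        problems.append(f"missing core table: {table}")

    # Do controlled vocabularies include a default (Unknown)/(None) term?
    for cv in sorted(CV_TABLES & existing):
        count = conn.execute(
            f"SELECT COUNT(*) FROM {cv} WHERE Term IN ('Unknown', 'None')"
        ).fetchone()[0]
        if count == 0:
            problems.append(f"{cv} has no (Unknown)/(None) default term")

    # Is there at least one data source?
    if "Sources" in existing:
        if conn.execute("SELECT COUNT(*) FROM Sources").fetchone()[0] == 0:
            problems.append("Sources table is empty")
    return problems
```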

15 Next Steps
Define ODMCore, a.k.a. the WaterOneFlow tables.
Define the ODMCatalog proposal (David Valentine and friends).
–Supports data discovery from agency web sites and testbed network servers.
–Used to implement all catalog servers.
–The web service interface is a subset of WaterML.
–Database tables include those necessary for DASH.
Define the ODMSensor proposal (Jeff Horsburgh and friends).
–Supports data streaming from real-time sensors.
–Includes sensor configurations and definitions.
–Web service extensions (for monitoring).
–Only those database tables and columns necessary for streaming, assuming the initial configuration (variables, sites, etc.) exists.
Compare these proposals.
–What would it mean to migrate?
–What's still to do?
No change to testbeds at this time.

16 PART III: INITIAL STEPS

17 Usage Impact of Changes
Allow for management of data loading, for both individual and aggregated (national) data sets.
Make it easier to populate the database.
–Smaller core set of information.
Develop appropriate tables to manage data for the simplified activities.
–Suggested methods and table extensions for the simplified activities.

18 SeriesCatalog at the Center
The series catalog is the primary collection object or "data folder".
All provenance and versioning happens here.
–Larger groupings for analysis or archive can be constructed via a spline table.
–M:N derivations are tracked by the same spline table, replacing the DerivedFrom table (see the sketch below).
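
A sketch of such an M:N linking table is below: derived series are related to the series they were computed from at the series level, rather than value by value. The table and column names are illustrative assumptions rather than the proposed schema.

```python
import sqlite3

# Sketch of an M:N linking table relating derived series to their source
# series, replacing a per-value DerivedFrom table. Names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SeriesCatalog (
    SeriesID    INTEGER PRIMARY KEY,
    Description TEXT
);

CREATE TABLE SeriesDerivation (
    DerivedSeriesID INTEGER NOT NULL REFERENCES SeriesCatalog(SeriesID),
    SourceSeriesID  INTEGER NOT NULL REFERENCES SeriesCatalog(SeriesID),
    MethodNote      TEXT,                -- how the derivation was done
    PRIMARY KEY (DerivedSeriesID, SourceSeriesID)
);
""")

# Example: an evapotranspiration series (4) derived from air temperature (1),
# radiation (2), and precipitation (3) series.
conn.executemany("INSERT INTO SeriesCatalog VALUES (?, ?)",
                 [(1, "air temperature"), (2, "radiation"),
                  (3, "precipitation"), (4, "evapotranspiration (derived)")])
conn.executemany("INSERT INTO SeriesDerivation VALUES (?, ?, ?)",
                 [(4, 1, "ET input"), (4, 2, "ET input"), (4, 3, "ET input")])
```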

19 Sources near the Center
All data series are associated with a source.
–Includes agencies, individual investigators, data archives, etc.
Each source can define a variable and site vocabulary (see the translation sketch below).
–Translation tables between sources are built up over time.
–A given source can always reuse an existing vocabulary.
–The authoritative list of sources is kept somewhere.
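
The sketch below shows one possible shape for per-source vocabularies plus a translation table between them. Apart from the NWIS discharge parameter code 00060 used in the example row, the table names, codes, and source IDs are assumptions for illustration.

```python
import sqlite3

# Sketch of per-source variable vocabularies with a translation table that
# is built up over time as sources are connected. Names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE VariableVocabulary (
    SourceID     INTEGER NOT NULL,
    VariableCode TEXT NOT NULL,          -- the source's own code
    VariableName TEXT,
    PRIMARY KEY (SourceID, VariableCode)
);

CREATE TABLE VariableTranslation (
    FromSourceID INTEGER NOT NULL,
    FromCode     TEXT NOT NULL,
    ToSourceID   INTEGER NOT NULL,
    ToCode       TEXT NOT NULL,
    PRIMARY KEY (FromSourceID, FromCode, ToSourceID)
);
""")

# Example: NWIS parameter 00060 (discharge) mapped onto a local code "Q".
conn.execute("INSERT INTO VariableVocabulary VALUES (1, '00060', 'Discharge, cubic feet per second')")
conn.execute("INSERT INTO VariableVocabulary VALUES (2, 'Q', 'Streamflow')")
conn.execute("INSERT INTO VariableTranslation VALUES (1, '00060', 2, 'Q')")
```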

20 SeriesCatalog Changes
Columns that duplicate foreign key table columns, such as VariableCode, are removed.
–A view can (and should) be used (see the sketch below).
–All foreign keys are identity columns, to allow localization.
New provenance information is added to track create and modify actions.
Source is replaced by indirection through Site and Variable.
DataType and GeneralCategory are replaced by indirection through Variable.
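
The view idea might look like the sketch below, which re-attaches VariableCode and related descriptive columns to the slimmed-down SeriesCatalog so readers and discovery tools still see a denormalized record. The names are illustrative assumptions.

```python
import sqlite3

# Sketch: keep SeriesCatalog normalized (identity foreign keys only) and
# present the readable "information model" through a view. Names are
# illustrative assumptions, not the actual ODM schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Variables (
    VariableID   INTEGER PRIMARY KEY,
    VariableCode TEXT NOT NULL,
    VariableName TEXT NOT NULL,
    UnitsName    TEXT
);

CREATE TABLE SeriesCatalog (
    SeriesID   INTEGER PRIMARY KEY,
    SiteID     INTEGER NOT NULL,
    VariableID INTEGER NOT NULL REFERENCES Variables(VariableID)
);

-- The view restores the descriptive columns removed from SeriesCatalog.
CREATE VIEW SeriesCatalogView AS
SELECT s.SeriesID, s.SiteID,
       v.VariableCode, v.VariableName, v.UnitsName
FROM SeriesCatalog s
JOIN Variables v ON v.VariableID = s.VariableID;
""")
```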

21 Provenance Additions
A ProcessGroup table holds descriptive information for one or more data series.
DataSeries and ProcessGroup carry CreatedDate / ModifiedDate / LastChangerID summary information (see the sketch below).
–A log table to track all changes can be added if desired.
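
A sketch of the summary provenance columns plus the optional change log follows, using a trigger to keep the two in step. All names, and the trigger mechanism itself, are assumptions for illustration rather than the proposed design.

```python
import sqlite3

# Sketch: per-row summary provenance plus an optional change-log table,
# maintained here by a trigger. Names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ProcessGroup (
    ProcessGroupID INTEGER PRIMARY KEY,
    Description    TEXT,
    CreatedDate    TEXT DEFAULT (datetime('now')),
    ModifiedDate   TEXT,
    LastChangerID  INTEGER
);

-- Optional: record every change for auditing and roll-back.
CREATE TABLE ChangeLog (
    LogID     INTEGER PRIMARY KEY,
    TableName TEXT NOT NULL,
    RowID     INTEGER NOT NULL,
    ChangedAt TEXT DEFAULT (datetime('now')),
    ChangerID INTEGER
);

CREATE TRIGGER process_group_modified
AFTER UPDATE ON ProcessGroup
BEGIN
    UPDATE ProcessGroup
        SET ModifiedDate = datetime('now')
        WHERE ProcessGroupID = NEW.ProcessGroupID;
    INSERT INTO ChangeLog (TableName, RowID, ChangerID)
        VALUES ('ProcessGroup', NEW.ProcessGroupID, NEW.LastChangerID);
END;
""")
```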

22 Site Changes
Site properties are separated from the site identifier (see the sketch below).
–A view can be used for common properties.
Properties may be reported in different units (e.g. multiple spatial reference datums).
Properties may change over time (e.g. a resurvey may change latitude/longitude).
The properties of interest depend on the science (e.g. including IGBP class).
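
A sketch of separating unit-qualified, time-varying properties from the site identifier follows; the table layout, site codes, and example rows are assumptions for illustration.

```python
import sqlite3

# Sketch: the site identifier stays minimal, while properties live in a
# separate table with units/datum and a validity interval. Names and
# example values are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sites (
    SiteID   INTEGER PRIMARY KEY,
    SiteCode TEXT NOT NULL UNIQUE,
    SiteName TEXT
);

CREATE TABLE SiteProperties (
    SiteID        INTEGER NOT NULL REFERENCES Sites(SiteID),
    PropertyName  TEXT NOT NULL,          -- e.g. Latitude, IGBP class
    PropertyValue TEXT NOT NULL,
    Units         TEXT,                   -- or a spatial reference/datum
    ValidFrom     TEXT,                   -- lets a resurvey supersede a value
    ValidTo       TEXT
);
""")

# A resurvey adds a new latitude row instead of overwriting the old one.
conn.execute("INSERT INTO Sites VALUES (1, 'EX_SITE_01', 'Example river site')")
conn.executemany("INSERT INTO SiteProperties VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Latitude", "41.7435", "NAD27", "1990-01-01", "2005-06-30"),
    (1, "Latitude", "41.7436", "NAD83", "2005-07-01", None),
])
```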

