BADC, BODC, CCLRC, PML and SOC Distributed Data, Distributed Governance, Distributed Vocabularies, with a dash of CLADDIER: The NERC DataGrid [ ]= Bryan Lawrence (on behalf of a big team, and note also a substantial piece of work with specific authorship included herein)
UKOLN, July 2006 Outline Introduction to the BADC Motivation Standards –Feature Types Taxonomy Overall Architecture NDG Products –Discovery Portal –Data Extractor –MOLES (NumSim relationship with NMM) –CSML CSML –Description –Prototyping in MarineXML –Round-Tripping Vocabulary Issues IN NDG (Hughes, Kondapalli, Lowry) NDG Timeline
UKOLN, July 2006 BADC Role The BADC role is to assist UK researchers to locate, access and interpret atmospheric data and to ensure the long-term integrity of atmospheric data produced by NERC projects. –Facilitation and Curation/Preservation!
UKOLN, July 2006 BADC Data Holdings A BADC dataset is an aggregation of data files, documents and metadata sharing common administrative policies. These policies could be file validation, access control or retention schemes. Datasets vary from TBs in millions of files to a few MBs in a single file. There are presently over 100 datasets.
UKOLN, July 2006 BADC User examples Atmospheric chemistry models. Pollution chemistry measurement campaigns.
UKOLN, July 2006 User examples Bird feeding habits.
UKOLN, July 2006 User examples Radio communication modelling. Wind power research. A & E influenza cases.
UKOLN, July 2006 User examples Castle mortar decay. Discomfort indices.
UKOLN, July 2006 User Diversity These stats quite old now, but the distributions haven’t changed.
UKOLN, July 2006 Climate in – A graphic Illustration Figures from Gary Strand, NCAR, ESG website March 2006, 2.5 PB Typically, two-thirds of this data will never see the light of the day: why? No one can remember what it was, or, if they can remember that, where it is!
UKOLN, July Data as Evidence What McIntyre got right:
UKOLN, July 2006 Data Retention Policies University of Cambridge Research Division: Data generated in the course of research should be kept securely in paper or electronic format, as appropriate. Back-up records should always be kept for data stored on a computer. The [AMRC] considers a minimum of ten years to be an appropriate period. However, research based on clinical samples or relating to public health may require longer storage to allow for long-term follow-up to occur. [AMRC: Association of Medical Research Charities] University of Oxford Research Services Office: A successful laboratory notebook allows for ready verification of quality and integrity of research data and enables another investigator to reproduce the procedure which has been documented and get the same result. …. A successful laboratory notebook allows for ready verification of quality and integrity of research data and enables another investigator to reproduce the procedure which has been documented and get the same result. Natural Environment Research Council: … Scientists will frequently process the data they have collected selectively, or with specific application packages, in order to prepare material for publication in the scientific literature. But the full value of the data collected may only be realised if the entire dataset is subjected to generic processing (eg to ensure calibration and adequate quality control) and is sufficiently documented to allow others to re-use it at a later date. The original collector may be the only person in a position to undertake such work, and so to unlock the full potential of the data. Those holding data collected under NERC funding will be expected to cooperate in validating and publishing them in their entirety - when this can be justified in terms of their scientific value - rather than merely creaming off a subset for immediate publication in the literature. …
UKOLN, July British Atmospheric Data Centre British Oceanographic Data Centre Complexity + Volume + Remote Access = Grid Challenge NCAR
UKOLN, July 2006 NDG Assumptions 1.No one would change their data storage systems! 2.Need to support a wide range of “metadata- maturity”! 3.No NDG-wide user management system possible. It is illegal to share user information without each and every user agreeing … implies no way of having one virtual organisation with common user management! With a large enough group it is impossible to agree on common roles that could be associated with access control. … but we want single-sign on … and trust relationships between data providers …
UKOLN, July 2006 Integration – semantics Want interdisciplinary semantic access to information, not abstract data –getData(potential temperature from ERA-40 dataset in North Atlantic from 1990 to 2000) –not: getData(“era40.nc”, ‘PTMP’, 20:50, 300:340, 190:200) –or even worse: for j=1990:2000 getData(“era40_”+j+“.nc”, ‘PTMP’, 20:50, 300:340) Lossy is OK! –Care less about completeness of representation than semantic unification
UKOLN, July 2006 Standards ISO 19101: Geographic information – Reference model A geospatial dataset… …consists of features and related objects… …in a defined logical structure… …delivered through services… …and described by metadata.
UKOLN, July 2006 Standards Geographic ‘features’ –“abstraction of real world phenomena” [ISO 19101] –Type or instance –Encapsulate important semantics in universe of discourse –“Something you can name” Application schema –Defines semantic content and logical structure –ISO standards provide toolkit: spatial/temporal referencing geometry (1-, 2-, 3-D) topology dictionaries (phenomena, units, etc.) –GML – canonical encoding [from ISO “Geographic information – Rules for Application Schema”]
UKOLN, July 2006 Architecture: NDG Metadata Taxonomy … not one schema, not one solution! CSML NCML+CF MOLES THREDDS DIF -> ISO19115 CLADDIER
UKOLN, July 2006 Architecture: Deployment Data Providers NDG Core Services Users NDG GUI Interface(s) Vocab Services
UKOLN, July 2006 Architecture: Deployment NDG Core Services Users NDG GUI Interface(s) Vocab Services
UKOLN, July 2006 Architecture: Deployment Users NDG GUI Interface(s) Vocab Services
UKOLN, July 2006 Architecture: Deployment Users Vocab Services
Current Status
UKOLN, July 2006 MOLES: implementation Core linking concept is the deployment Deployment Activity on behalf of an Activity of a Data Production Toolat an Observation Station that produces a Data Entity Data Production Tool Observation Station Data Entity Each of the main metadata objects has security data attached to it. This means that this can be applied to queries on the metadata Links the metadata records into a structure that can be turned into a navigable structure
UKOLN, July 2006 Simulators as data production tools: NumSim NDG Products: NumSim
UKOLN, July 2006 NumSim Example
UKOLN, July 2006 International Discovery - Climate
UKOLN, July 2006 NDG “Pseudo-Demo” EXPLOITING DISCOVERY WEB SERVICE (running interface on my laptop last night)
UKOLN, July 2006 More Browse Scrolling Down
UKOLN, July 2006 MOLES Navigation Actually, this is where we plan to use NumSim and Numerical Model Metadata
UKOLN, July 2006 MOLES to Secure Dx
UKOLN, July 2006 NDG Authentication Offering up trusted host list …
UKOLN, July 2006 Data Extractor
UKOLN, July 2006 NDG-A: Climate Science Modelling Language Aims: –provide semantic integration mechanism for NDG data –explore new standards-based interoperability framework –emphasise content, not container Design principles: –offload semantics onto parameter type (‘phenomenon’, observable, measurand) e.g. wind-profiler, balloon temperature sounding –offload semantics onto CRS e.g. scanning radar, sounding radar –‘sensible plotting’ as discriminant ‘in-principle’ unsupervised portrayal –explicitly aim for small number of weakly-typed features (in accordance with governance principle and NDG remit)
UKOLN, July 2006 Climate Science Modelling Language CSML feature types –defined on basis of geometric and topologic structure CSML feature type DescriptionExamples TrajectoryFeatu re Discrete path in time and space of a platform or instrument. ship’s cruise track, aircraft’s flight path PointFeatureSingle point measurement. raingauge measurement ProfileFeature Single ‘profile’ of some parameter along a directed line in space. wind sounding, XBT, CTD, radiosonde GridFeatureSingle time-snapshot of a gridded field.gridded analysis field PointSeriesFeat ure Series of single datum measurements. tidegauge, rainfall timeseries ProfileSeriesFe ature Series of profile-type measurements. vertical or scanning radar, shipborne ADCP, thermistor chain timeseries GridSeriesFeat ure Timeseries of gridded parameter fields. numerical weather prediction model, ocean general circulation model
UKOLN, July 2006 Climate Science Modelling Language CSML feature types –examples... ProfileSeriesFeature ProfileFeature GridFeature
UKOLN, July 2006 Climate Science Modelling Language Numerical array descriptors –provides ‘wrapper’ architecture for legacy data files –‘Connected’ to data model numerical content through ‘xlink:href’ Three subtypes: –InlineArray –ArrayGenerator –FileExtract (NASAAmes, NetCDF, GRIB) Composite design pattern for aggregation
UKOLN, July 2006 Climate Science Modelling Language Inline array Array generator 5 2 udunits.xml#degreeC float s/10/9/ge udunits.xml#minute float 0:5:50000
UKOLN, July 2006 Climate Science Modelling Language File extract 526 double /data/BADC/macehead/mh cf1 CFC radar_data.nc az double /e40/ggas rsn.grb
UKOLN, July 2006 XML Parser SeeMyDENC Data Dictionary S52 Portrayal Library SENC Marine GML (NDG) Feature Types XML Biological Species Chl-a from Satellite Modelled Hydrodynamics XSLT For each XSD (for the source data) there is an XSLT to translate the data to the Feature Types (FT) defined by CSML. The FT’s and XSLT are maintained in a ‘MarineXML registry’ The FTs can then be translated to equivalent FTs for display in the ECDIS system XSLT Features in the source XSD must be present in the data dictionary. XSD XML The result of the translation is an encoding that contains the marine data in weakly typed (i.e. generic) Features XSLT Phenomena in the XSD must have an associated portrayal ECDIS acts as an example client for the data. Data from different parts of the marine community conforming to a variety of schema (XSD) Measured Hydrodynamics S-57v3 GML XML XSD XML XSD Feature described using S-57v3.1Application Schema can be imported and are equivalent to the same features in CSML’ Slide adapted from Kieran Millard (AUKEGGS, 2005) MarineXML Testbed
UKOLN, July 2006 Biological sampling station with attributes for the species sampled at each Grid of Chl-a from the MERIS instrument on ENVISAT Predicted and measured wave climate timeseries (height, direction and period) Vectors of currents from instruments MarineXML Testbed Slide adapted from Kieran Millard (AUKEGGS, 2005)
UKOLN, July 2006 The Concept of re-using Features Here structured XML is converted to plain ascii text in the form required for a numerical model HTML warning service pages are generated ‘on the fly’ XML can also be converted to SVG to display data graphically Here the same XML is converted to the SENC format used in a proprietary tool for viewing electronic navigation charts. All this requires agreement on standards Slide adapted from Kieran Millard (AUKEGGS, 2005)
UKOLN, July 2006 CSML Present Other User Example: Norwegian Met Office
UKOLN, July 2006 CSML Round Tripping - 1 Managing semantics UGAS GML app schema XML GML dataset instance conceptual model Conforms to New Dataset Application produces parser V1.0 will be in NDG Alpha
UKOLN, July 2006 CSML Round Tripping - 2 Managing data - 1 parser V1.0 in NDG Alpha GML dataset scanner V1.0 in NDG Alpha GML app schema XML instance CF Dataset Application produces CF
UKOLN, July 2006 Managing Data CF Dataset GML dataset scanner XSLT ISO19115 XML PUBLISH DECISION PROCESSES CF Dataset Define Dataset Add Information
UKOLN, July 2006 CSML (and friends): software stack
UKOLN, July 2006 Future of CSML Some new features: (Swath, Ragged ProfileSeries) But more importantly, factor out the storage and introduce the concept of “processing affordance”. (see
UKOLN, July 2006 Architecture: Deployment
Vocabulary Management for NERC DataGrid Michael Hughes, V.Siva Kondapalli and Roy Lowry
UKOLN, July 2006 Vocabulary Presentation Outline Problem and Solution NERC DataGrid Vocabulary Model Vocabulary Technical Governance Vocabulary Content Governance Mappings and Thesaurus Server Potential Role of Local Mappings
UKOLN, July 2006 The Problem NERC DataGrid cannot function operationally without metadata and data semantic interoperability This will never be achieved without: –Readily accessible standard terms whose meaning is clearly understood –Readily accessible semantic maps both within and between lists of standard terms –Semantic maps between local terms and standard terms
UKOLN, July 2006 The Solution? Implementation of a Vocabulary Server Building OWL ontologies mapping between domain-relevant de-facto standard vocabularies Deploying the ontologies through a Web Service thesaurus server Making tools available for users to build and deploy local ontologies
UKOLN, July 2006 NDG Vocabulary Model The vocabulary resource is built from Entries –The representation of a single object in the real world comprising: Key - A bit pattern that represents an entity. It must be unique, permanent and free from semantics. Term – Text used to label the entity to facilitate human recognition. Abbreviation – An shortened version of the term for use where space is tight. Target size is bytes. Definition – Text that unambiguously specifies the entity. –Entries are aggregated into Lists (entity class or subclass e.g. UK post towns) –Lists are aggregated into Constraints (entity class e.g. post towns of the world)
UKOLN, July 2006 Vocabulary Technical Governance The story so far –Lists are available as flat ASCII files or XML documents as URLs e.g. ftp://ftp.pol.ac.uk/pub/bodc/jgofs/datadict/new/parameter_group.csv –Some (BODC, SEA-SEARCH) include keys –Some (CF, BODC) include definitions –None are properly versioned
UKOLN, July 2006 Vocabulary Technical Governance Versioning should: –Provide a unique label for each instantiation of the list –Enable any previous instantiation of the list to be recreated –Provide timestamp information for creation and modification of every object in the vocabulary system Delivery should –Be from the master, not a copy –Be accessible to software agents to allow automated synchronisation of local copies –Have a ‘hotline’ to content governance
UKOLN, July 2006 Vocabulary Technical Governance NERC DataGrid Vocabulary Server –Back End Fully automated record archive, timestamps and version numbering. Live April (of 115) lists publicly accessible. –Front End Web Service API. Live June XML list downloads from website (July 2006?). Web-form tools (August 2006?).
UKOLN, July 2006 Vocabulary Content Governance Standard lists need to respond to ever expanding user requirements Change needs to be rapid or users lose interest Standard lists need to maintain information quality and internal consistency Content governance has to resolve these conflicting requirements
UKOLN, July 2006 Vocabulary Content Governance Content governance in oceanographic and atmospheric domains is based on: –Moderated discussion lists –‘Benign Dictator’ and well-meaning volunteers Variable success depending on right people having ‘spare’ time at the right moments More formalism underpinned by more resources required But need to be careful about going too far or levels of service become unacceptable
UKOLN, July 2006 Mappings and Thesaurus Server There will never be a single list for a given topic Term mapping therefore an essential part of semantic interoperability Marine Metadata Interoperability ( have developed tooling and trialled mappings in the measurement phenomena arenahttp://marinemetdata.org
UKOLN, July 2006 Mappings and Thesaurus Server MMI approach –Harmonise lists to be mapped in OWL (Voc2OWL tool) –Map on basis of ‘same as’, ‘broader than’ and ‘narrower than’ relationships (VINE tool) –Place a Web Service API over the map to implement a term or thesaurus server
UKOLN, July 2006 Mappings and Thesaurus Server NERC DataGrid Plans –Use MMI technology plus domain expertise available in BODC, BADC and their user communities to build a complete map between BODC Parameter Discovery Vocabulary (300 terms) CF Standard Names (5-600 terms) GCMD Parameter Valids (2-300 relevant terms) –Incorporate this map into the NDG Discovery Service to facilitate smart searching (e.g. ‘pigments’ finds dataset labelled ‘chlorophyll’) through MMI Web Service –Integrate ontology maintenance into source list maintenance
UKOLN, July 2006 Role of Local Mappings There will always be local terms and understanding ‘Pigment data sets’ could mean: –Chlorophyll OR carotenoids OR phaeopigments –Chlorophyll AND carotenoids AND phaeopigments Depends on point of view
UKOLN, July 2006 Role of Local Mappings Possible solution to this: –User builds an ontology reflecting local perception of the mapping between local terms and standard terms –Discovery or data integration tools use ontology as a ‘plug-in’ allowing user to operate with local terminology Tools (e.g. VINE) could be made available to facilitate this
UKOLN, July 2006 NDG Timeline NDG2 runs until September 2007: NDG-Alpha (June 2006) –Not all components are in place (particularly delivery broker) –Not many products are deployable by non-NDG participants (too much hard work installing things that haven’t been optimised for installation) –Discovery portal is usable, linking to NCAR data etc, but isn’t very user friendly (options not obvious etc). NDG-Beta (Feb 2007) –Most components should work, but deployment of software may still be difficult by non-participants NDG-Prod (Jun 2007) –Should be deployable and far more user friendly (spending from Feb-June working on deployment and friendliness, no new functionality) Last few months working on sustainability etc