AUKEGGS Canberra, Exposing legacy file-based data (interop-for-files) Andrew Woolf CCLRC Rutherford Appleton Laboratory
Outline
  Introduction
  The feature model as integration key
  An interoperability approach for files
  xlink review and proposed profile for legacy data
  Examples
  Issues
Introduction
  Much ‘earth-science’ data exists as large legacy file-stores
    – e.g. ECMWF: 2 PB of file-based data
    – e.g. British Atmospheric Data Centre: 40 TB of file-based data
  Interoperability demands common approaches
  BUT a multitude of formats masks the commonality
    – netCDF, HDF4, HDF5, GRIB, NASA Ames, PP,...
Introduction
  File-centred data management focusses on the container rather than content
  File API is fundamental point of reference
    – binary format details not always exposed or guaranteed
    – public API may be only supported access mechanism
    – often implemented as performant optimised native library
  Conclusion: can’t/shouldn’t migrate
Introduction
  Want to expose information, not format...
Introduction
  Information structures may be composed across files
The feature model
  Common pattern with file-data:
    – need to integrate information structures across multiple files
    – (relational tables provide this implicitly)
  Semantics provide an integration key
    – e.g. an oceanographer and meteorologist can share a conversation about data despite format differences
The feature model (figure)
A model for file-based interoperability
  Retain file-based persistence format
  Supplement with feature-based conceptual model
  ‘Cast’ legacy data onto conceptual model
    – interoperableData = (featureModel) legacyData
  Legacy file data + GML-encoded conceptual ‘metadata’ = ‘interoperable view’
    – may be exposed through W*S (OGC web services such as WMS, WFS, WCS)
A model for file-based interoperability
  GML provides conceptual feature ‘skeleton’
  File provides ‘flesh’
  GML ‘by-reference’ pattern for property values
    – uses simple xlink
    – “The value of a GML property that carries an xlink:href attribute is the resource returned by traversing the link”
xlink review: extended xlink (diagram)
  extended link: [role] [title]
  local resource A, local resource D: [role] [title] [label]
  remote resource B, remote resource C: [href] [role] [title] [label]
  arc 1, arc 2, arc 3: [arcrole] [title] [show] [actuate]
xlink review: simple xlink (diagram)
  simple link: [role] [title]
  local resource: [role] [title] [label]
  remote resource: [href] [role] [title] [label]
  arc: [arcrole] [title] [show] [actuate]
xlink review
  ‘role’ (URI):
    – indicates a property of the remote resource
    – must be a URI reference that “identifies some resource that describes the intended property”
  ‘arcrole’ (URI):
    – describes the “meaning of the arc’s ending resource relative to its starting resource”
    – corresponds to RDF notion of a property: starting-resource HAS arc-role ending-resource
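The RDF reading of ‘arcrole’ can be sketched in a few lines of Python. This is only an illustration, not part of the proposed profile: the element name, the example.org URIs and the helper `xlink_to_triple` are hypothetical, and the starting resource is passed in explicitly because a simple xlink does not name it.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"


def xlink_to_triple(element, starting_resource):
    """Read a simple xlink as (starting resource, arcrole, ending resource)."""
    href = element.get(f"{{{XLINK_NS}}}href")
    arcrole = element.get(f"{{{XLINK_NS}}}arcrole")
    if href is None or arcrole is None:
        return None  # not enough information to form a statement
    return (starting_resource, arcrole, href)


# Hypothetical element and URIs, for illustration only.
axis = ET.fromstring(
    '<axis xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xlink:arcrole="http://example.org/arcrole/hasValues" '
    'xlink:href="http://example.org/data/myfile.nc#lon"/>'
)

triple = xlink_to_triple(axis, "http://example.org/feature/1#axis")
print(triple)
```

The arcrole sits in the predicate position, which is what lets the same link be read as a statement about the two resources.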
xlink patterns for files: extended xlink
  GML feature instance
  Aggregation semantics determined by xlink arc traversal rules
xlink patterns for files: simple xlink
  GML feature instance
  Aggregation semantics determined by storage descriptor
xlink proposal
  href examples:
    – netCDF#variable
    – RDBMS#SQLQuery
    – GRIBFile#recordNumber
    – CSMLStorageDescriptor#arrayID
  <someGMLElement
      xlink:arcrole="hasRemoteContentEmbeddedAt#localXpath"
      xlink:href="storageDescriptor#portion"
      xlink:role="storageSchemaIdentifier"
      xlink:show="embed"
      xlink:actuate="onRequest | onLoad"/>
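A minimal Python sketch of the proposed attribute pattern, assuming the href always has the form storageDescriptor#portion. The helpers `split_storage_href` and `make_by_reference` are hypothetical names introduced here, and the attribute values are the placeholders from the slide rather than registered URIs.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK_NS)


def split_storage_href(href):
    """Split 'storageDescriptor#portion' into (descriptor, portion)."""
    descriptor, portion = urldefrag(href)
    return descriptor, portion


def make_by_reference(tag, href, role, arcrole, actuate="onRequest"):
    """Build an element carrying the five xlink attributes of the proposal."""
    if actuate not in ("onRequest", "onLoad"):
        raise ValueError("actuate must be 'onRequest' or 'onLoad'")
    elem = ET.Element(tag)
    attributes = (
        ("arcrole", arcrole),
        ("href", href),
        ("role", role),
        ("show", "embed"),  # the profile always embeds the remote content
        ("actuate", actuate),
    )
    for name, value in attributes:
        elem.set(f"{{{XLINK_NS}}}{name}", value)
    return elem


# Placeholder values taken from the slide, not real identifiers.
elem = make_by_reference(
    "someGMLElement",
    href="myfile.nc#lon",
    role="storageSchemaIdentifier",
    arcrole="hasRemoteContentEmbeddedAt#localXpath",
)
print(ET.tostring(elem, encoding="unicode"))
print(split_storage_href("GRIBFile#recordNumber"))
```

Keeping the fragment format-specific means the part before `#` only has to identify the store, while the part after it is interpreted by whatever reader the xlink:role selects.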
Example
  GML CR
    – ISO CV_ReferenceableGrid
  (element content only; XML tags not recoverable)
    x y
    Geodetic longitude    x      Linear
    Geodetic latitude     x y    Linear
Example
  netCDF ASCII dump:
    netcdf myfile {
    dimensions:
        x = 8 ;
        y = 5 ;
    variables:
        float lon(x) ;
            lon:long_name = "longitude" ;
            lon:units = "degrees_east" ;
        float lat(x,y) ;
            lat:long_name = "latitude" ;
            lat:units = "degrees_north" ;
        float temp(x,y) ;
            temp:coordinates = "lon lat" ;
            temp:long_name = "temperature" ;
            temp:units = "degC" ;
    data:
        lon = 13.5, 24.9, 32.4, 37.7, 41.5, 46.8, 54.4, 65.7 ;
        lat = 53.1, 48.7, 46.2, 44.7, 43.9, 43.3, 43.1, 44.0, 46.2, 43.2, 41.5,...
Example
  Geodetic longitude    x    Linear
  <gml:coordAxisValues
      xlink:arcrole="..."
      xlink:href="myfile.nc#lon"
      xlink:role="..."
      xlink:show="embed">
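How the href myfile.nc#lon could be traversed and embedded can be sketched as follows. This is an assumption-laden illustration: a plain dictionary (`FAKE_STORE`) stands in for the netCDF file so the example runs without a netCDF library, the helpers `resolve_href` and `embed_values` are hypothetical, and the GML namespace URI is assumed. The lon values are those from the netCDF dump.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

XLINK_NS = "http://www.w3.org/1999/xlink"

# Stand-in for the netCDF file: file name -> variable name -> values.
# A real resolver would read the variable through a netCDF library.
FAKE_STORE = {
    "myfile.nc": {"lon": [13.5, 24.9, 32.4, 37.7, 41.5, 46.8, 54.4, 65.7]},
}


def resolve_href(href, store):
    """Traverse 'file#variable' and return the variable's values."""
    filename, variable = urldefrag(href)
    return store[filename][variable]


def embed_values(element, store):
    """Replace the element's content with the linked values (show='embed')."""
    href = element.get(f"{{{XLINK_NS}}}href")
    show = element.get(f"{{{XLINK_NS}}}show")
    if href is not None and show == "embed":
        values = resolve_href(href, store)
        element.text = " ".join(str(v) for v in values)
    return element


axis = ET.fromstring(
    '<gml:coordAxisValues xmlns:gml="http://www.opengis.net/gml" '
    'xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xlink:href="myfile.nc#lon" xlink:show="embed"/>'
)
embed_values(axis, FAKE_STORE)
print(axis.text)  # prints: 13.5 24.9 32.4 37.7 41.5 46.8 54.4 65.7
```

The GML element stays a thin skeleton; the eight longitude values only appear once the link is traversed.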
Issues
  Need to ‘get as close as possible’ to target
    – ‘merge’ semantics consistent with GML? (Opportunity: no best practice for GML yet!)
  “If both a link and content are present in an instance of a property element, then the object found by traversing the xlink:href link shall be the normative value of the property. The object included as content shall be used by the data recipient only if the remote instance cannot be resolved; this may be considered to be a "cached" version of the object.” [GML ]
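The precedence rule in the GML quotation (link first, inline content only as a cached fallback) can be sketched as below. The function `property_value`, the two resolver functions and the sample element are hypothetical; an unresolvable link is assumed to surface as a `LookupError`.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
HREF = f"{{{XLINK_NS}}}href"


def property_value(element, resolver):
    """Return the normative value of a property element.

    The linked resource wins; inline content is used only when the
    link cannot be resolved (the "cached" copy).
    """
    href = element.get(HREF)
    if href is not None:
        try:
            return resolver(href)
        except LookupError:
            pass  # link could not be resolved: fall back to inline content
    return element.text


# Hypothetical property carrying both a link and cached inline content.
prop = ET.fromstring(
    '<values xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xlink:href="myfile.nc#lon">13.5 24.9</values>'
)

REMOTE = {"myfile.nc#lon": "13.5 24.9 32.4"}


def resolve_ok(href):
    return REMOTE[href]


def resolve_missing(href):
    raise KeyError(href)  # simulates an unreachable remote resource


print(property_value(prop, resolve_ok))       # remote value is normative
print(property_value(prop, resolve_missing))  # cached inline content
```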
Issues
  xlink:href (URI) for remote resource fragment (format-specific)
    – e.g. RDBMS#SQLQuery, netCDF#variable, etc...
  xlink:role (URI) for resource format
    – e.g. reference PRONOM-type format repository?
    – implied conversion to GML target content type
  xlink:arcrole (URI) for ‘embed remote content’ semantics
    – ‘insert at relative XPath’ essential
  simple xlink can’t handle multiple resources
    – application-specific ‘storage descriptor’ schemas for file aggregation semantics
Conclusion
  Presented a profile for xlink with files in absence of current best practice
  Meets key practical requirements
    – retain file-based persistence formats
    – provide interoperability ‘wrapper’
    – focus on logical content, not container (feature model)
  Semantic governance at appropriate points
  Enables powerful, scalable mechanism for real data
    – e.g. large meteorological datasets