Infrastructure requirements for linked e-science
The requirements of the agINFRA VRC for e-infrastructures.
Miguel-Angel Sicilia, University of Alcalá, Spain
msicilia@uah.es
agINFRA coordinator

Why agINFRA?

Why share data?
Sharing research data is “an intricate and difficult problem” (Borgman, 2011, JASIST).
Little data sharing may actually be taking place, with exceptions in some domains.
Sharing takes different forms, from private data exchange to posting online, including journal supplementary materials.
There are few standards for giving shared data the computational semantics required to build automated tools.
…however, reusing data is at the core of the principles of the scientific method
… and a major concern for scientists and policy makers.

What kinds of data?
Primary data:
–Structured data, e.g. datasets as tables
–Digitized data: images, videos, etc.
Secondary data:
–Elaborations of the primary data, e.g. a dendrogram
Provenance information, including authors, their organizations and projects
Methods and procedures followed
Reports, including papers
Secondary documents, e.g. training resources
Metadata about the above, e.g. EML

Sharing example: EML
Ecological Metadata Language (EML): modular, extensible, comprehensive.
Modules shown in the diagram:
–Identity and Discovery Information
–Coverage: Space, Time, Taxa
–Methods
–Logical Data Model
–Physical Data Format
–Access and Distribution
From: Matthew B. Jones, “Data, Metadata, and Ontology in Ecology”

EML Model: Attribute structure
Describes data tables and their variables/attributes.
A typical data table with 10 attributes:
–some metadata are likely apparent, others ambiguous
–definitions need to be explicit, as well as data typing
YEAR  MONTH  DATE        SITE  TRANSECT  SECTION  SP_CODE  SIZE  OBS_CODE  NOTES
2001  8      2001-08-22  ABUR  1         0-20     CLIN     5     06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     11    06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     10    06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     14    06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     7     06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     19    06        .
2001  8      2001-08-22  ABUR  1         21-40    COTT     5     06        .
2001  8      2001-08-22  ABUR  2         0-20     CLIN     5     06        .
2001  8      2001-08-22  ABUR  2         21-40    NF       0     06        .
2001  8      2001-08-27  AHND  1         0-20     NF       0     03        .
Annotations on the slide: species codes, value bounds, date format, code definitions.
From: Matthew B. Jones, “Data, Metadata, and Ontology in Ecology”
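The point that definitions and data typing need to be explicit can be illustrated with a minimal Python sketch that checks a row against declared attribute constraints. Everything here is an assumption for illustration: the function name `check_row`, the code list (read off the sample rows only) and the SIZE bound are hypothetical; in practice they would be taken from the EML attribute metadata.

```python
from datetime import datetime

# Assumed constraints. The code list is read off the sample rows and the
# SIZE bound is invented; real values would come from the attribute metadata.
ALLOWED_SP_CODES = {"CLIN", "OPIC", "COTT", "NF"}
SIZE_BOUNDS = (0, 100)


def check_row(row):
    """Return a list of problems found in one data row (empty if none)."""
    problems = []
    # Date format: the DATE column is declared as YYYY-MM-DD.
    try:
        date = datetime.strptime(row["DATE"], "%Y-%m-%d")
    except ValueError:
        problems.append("DATE is not in YYYY-MM-DD format")
    else:
        if (date.year, date.month) != (int(row["YEAR"]), int(row["MONTH"])):
            problems.append("DATE disagrees with YEAR/MONTH")
    # Code definitions: SP_CODE must belong to the declared code list.
    if row["SP_CODE"] not in ALLOWED_SP_CODES:
        problems.append("unknown SP_CODE: " + row["SP_CODE"])
    # Value bounds: SIZE must fall inside the declared range.
    low, high = SIZE_BOUNDS
    if not low <= int(row["SIZE"]) <= high:
        problems.append("SIZE outside the declared bounds")
    return problems


row = {"YEAR": "2001", "MONTH": "8", "DATE": "2001-08-22",
       "SITE": "ABUR", "SP_CODE": "CLIN", "SIZE": "5"}
print(check_row(row))  # prints []
```

Without explicit metadata, none of these checks could be written by a tool; a human would have to guess the date format and the meaning of each code.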

Example (i)
Air temperature at Lake Hoare
–Approximate location in
–Temporal extent in
–Method in human readable form: “sample sensors every 30 seconds and send summary statistics […] to solid-state storage modules every 10 minutes”
–The instrument record at least identifies a recognizable change:
1993-1994 - 1999-2000: Campbell Scientific 207 temp/rh probe.
1999-2000 - present: Campbell Scientific 107 temp probe.

Example (ii) Entity name: Air_Temperature_Units Units are correctly specified using range and precision.

The good and bad of EML
Metadata schemas such as EML:
–Provide raw data, e.g. text tables, with the possibility of relating them
–Provide reasonable support for measurement units and instruments
–Are comprehensive in describing the context
But:
–They do not reference entities and attributes formally; a human is needed to identify them
–Provenance information is not linked to other systems
–Methods and procedures are in textual form, along with other information items
EML is a good vehicle for sharing, but it still does not support computation, contrast and repetition.

Example (iii)
The dataset can be made semantically rich by adding mappings to existing ontologies.
–NASA SWEET (Semantic Web for Earth and Environmental Terminology) is a candidate.
Map the entity name to an appropriate ontology term:
–SWEET: Temperature, a ThermodynamicProperty, for the Characteristic (attribute definition in EML)
–SWEET: Atmosphere, a PlanetaryRealm, for the Entity
Measurement conditions that refine the attribute definition require new definitions in SWEET.

Example (iv)
Make the entity concrete:
–E = #Realm[#partOf #Atmosphere] [boundedBy ]
Further classify the entity: E boundedBy #Lake
Make the measurement concrete:
–M = #Temperature[measurementCondition #Altitude 3 m]
–M #measuredBy #Instrument[commercialName = “Campbell Scientific 107 temp probe”]
The relation between E and M (and the unit expression) is already available in OBOE (an observation ontology): concretely, an observation that has ofEntity E and hasMeasurement M’ ofCharacteristic M.

Example (v) The entity is unambiguously expressed. Refines the incorrect use of “air temperature” (attribute measured instead of entity) Makes formal the expression of measurement conditions.

Example (vi)
The mapping enables different kinds of matching across datasets.
Entity matching:
–Measurements for “atmosphere segments at the same latitude”
–Approximate matching: “at similar latitude”
Measurement matching:
–Measurements equivalent to M “with similar precision” (requires a detailed model of instruments)
–Measurements other than M for entities like E (“atmospheric regions bounded by a lake”)
–All measurements of M in the temporal range 1990-2010
All of the above can be expressed in triple query languages such as SPARQL.
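As a rough illustration of this kind of matching, the following Python sketch emulates a triple-pattern query over a tiny in-memory triple list; a real deployment would issue a SPARQL query against an RDF store instead. All identifiers (`obs1`, `E2`, `Glacier`, the predicate names and the years) are hypothetical and chosen only to mirror the example of temperature measurements for atmospheric regions bounded by a lake in 1990-2010.

```python
# Triples are (subject, predicate, object). All identifiers are illustrative.
TRIPLES = [
    ("E", "partOf", "Atmosphere"),
    ("E", "boundedBy", "Lake"),
    ("obs1", "ofEntity", "E"),
    ("obs1", "ofCharacteristic", "Temperature"),
    ("obs1", "year", 1995),
    ("E2", "partOf", "Atmosphere"),
    ("E2", "boundedBy", "Glacier"),
    ("obs2", "ofEntity", "E2"),
    ("obs2", "ofCharacteristic", "Temperature"),
    ("obs2", "year", 2015),
]


def match(pattern, triples=TRIPLES):
    """Yield the triples that fit a pattern; None acts as a wildcard."""
    for triple in triples:
        if all(p is None or p == v for p, v in zip(pattern, triple)):
            yield triple


def temperature_observations(bounded_by, first_year, last_year):
    """Observations of Temperature for entities bounded by `bounded_by`."""
    entities = {s for s, _, _ in match((None, "boundedBy", bounded_by))}
    hits = []
    for obs, _, entity in match((None, "ofEntity", None)):
        if entity not in entities:
            continue
        if not any(match((obs, "ofCharacteristic", "Temperature"))):
            continue
        years = [y for _, _, y in match((obs, "year", None))]
        if any(first_year <= y <= last_year for y in years):
            hits.append(obs)
    return hits


print(temperature_observations("Lake", 1990, 2010))  # prints ['obs1']
```

Approximate matching ("at similar latitude", "with similar precision") would replace the equality tests with distance thresholds, which is where the detailed instrument model becomes necessary.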

agINFRA – the linked data view (i) The above can be achieved through tools that progressively help in refining metadata into more formal representations. Sharing can be enhanced via linked data, i.e. using RDF(S) combined with terminologies/ontologies.

agINFRA and data
Elements shown in the diagram:
–EML
–LTER node
–FAO rep.
–Triplification (to RDF)
–Bootstrapping (concept identification, automated tagging, etc.)
–Concept/KOS server (with mappings)
–Exposure (virtual data INFRA layer)
–… …
–Service registry (agINFRA RING)

Why is linked data not enough?
Linked data is only a set of conventions for publishing semantically rich data on the Web.
It allows expressing data in relation to ontologies.
But an LD endpoint does not necessarily:
–Support computation beyond SPARQL queries
–Support high traffic
–Be reliable and robust
–Be scalable
–Provide services explicitly targeted at researchers
It does not support the full lifecycle across datasets
–see Bechhofer et al. (2010), “Why Linked Data is Not Enough for Scientists”

The complete picture

What are the requirements (for infrastructures)?

Two sources of requirements
KOS maintenance and use:
–Storage: distributed, heterogeneous, replicated?
–Harmonization: mapping, multiple representations
–KOS retrieval: bulk, navigation using structure, free (SPARQL)
–Evolution: bulk update, lazy clients
KOS-enabled processing:
–Dataset management
–Schema management
–Retrieval: bulk, distributed query (SPARQL)
–Research support: tools, instrumentation, scripts
–Meta-analysis: dataset alignment, contrast
–Replication: workflow

Example

Example: search
Two demanding processes:
–Traversal of large terminologies
–Search on large and distributed metadata (triple) stores
This introduces a requirement for high availability of concept (KOS) servers.
Scalability in RDF search: use cluster algorithms such as MapReduce?
Navigation: how to support reliable links between systems?
Build massive metadata repositories, or implement a distributed search protocol?
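To make the MapReduce question concrete, here is a single-process Python sketch of the pattern applied to partitioned triple stores: each partition is mapped to local counts, and the partial results are merged in a reduce step. It only shows the shape of the computation; the function names and sample triples are hypothetical, and a real cluster framework would distribute the map step across nodes.

```python
from collections import Counter
from functools import reduce


def map_partition(triples, predicate):
    """Map step: count, per object, the triples in one partition that use `predicate`."""
    return Counter(o for _, p, o in triples if p == predicate)


def merge_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


def distributed_count(partitions, predicate):
    """Count objects of `predicate` across all partitions."""
    mapped = (map_partition(part, predicate) for part in partitions)
    return reduce(merge_counts, mapped, Counter())


# Two hypothetical metadata stores holding (subject, predicate, object) triples.
partitions = [
    [("d1", "about", "Temperature"), ("d2", "about", "Soil")],
    [("d3", "about", "Temperature"), ("d3", "creator", "someone")],
]
print(distributed_count(partitions, "about")["Temperature"])  # prints 2
```

Because the reduce step is associative, partial counts can be merged in any order, which is what allows the stores to stay distributed rather than being copied into one massive repository.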

Example: repetition (i)
Check the model of temperature decrease in Doran et al. (2002)
Extend and repeat automatically with new data (same entity)
Mix with observations from nearby places (different entity, same characteristic)

Example: repetition (ii)
The data is semantically identified…
–…but what about the objectives/methods?
Following the previous example:
–Hypothesis: “TemperatureSeries of E is Growing[Decreasing]” (a classification of #DynamicPropertySeries). The assertions of the hypothesis are generated outside the formal ontology language.
–More general hypothesis: “TemperatureSeries(?t) of AtmospherePart(?a) -> Growing(?t)”. Since E is #partOf #Atmosphere, it should hold for all the transitively related parts.
–What happens with the outcome of the rule and of the computational mechanism?

Example: repetition (iii)
Requirements:
–Define dynamic properties of measurements: growing disjointWith decreasing
–Define techniques for generating the properties of the series; in this case, Regression requires a model of regression methods and parameters that can be used for, e.g., generating MATLAB or R scripts
–Define rules with general hypotheses
–These will generate facts such as Decreasing(?t) that produce an inconsistency when reasoning!
…obviously this does not exhaust all the cases.
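The interplay between a general rule and a computed property can be sketched in Python as follows. This is a minimal sketch under stated assumptions: ordinary least squares stands in for the regression technique, the sign of the slope alone decides the classification (no significance test), and the temperature values are invented for illustration, not taken from any dataset.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def classify(xs, ys):
    """Derive a dynamic property of the series from the sign of the slope."""
    b = slope(xs, ys)
    if b > 0:
        return "Growing"
    if b < 0:
        return "Decreasing"
    return None  # flat series: neither property is asserted


# Growing disjointWith Decreasing: a series cannot carry both properties.
DISJOINT = [frozenset({"Growing", "Decreasing"})]


def inconsistent(facts):
    """True if the facts about one series contain a disjoint pair."""
    return any(pair <= facts for pair in DISJOINT)


years = [1990, 1991, 1992, 1993]
temps = [-17.0, -17.5, -18.1, -18.4]   # invented values, in degrees Celsius

facts = {"Growing"}                    # asserted by the general rule
derived = classify(years, temps)       # produced by the regression technique
if derived:
    facts.add(derived)
print(derived, inconsistent(facts))    # prints Decreasing True
```

The inconsistency is the useful outcome here: it signals that the data for this series contradicts the general hypothesis, which is what an automated repetition is meant to detect.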

Final remarks
agINFRA is aimed at developing a linked data infrastructure.
Linked data exposure is just the basic sharing mechanism.
Requirements for infrastructure are derived from the commitment to linked data and shared semantics.

References
Borgman, Christine L. (2011, submitted). The conundrum of sharing research data. Journal of the American Society for Information Science and Technology.
Bechhofer, S., Ainsworth, J., Bhagat, J., Buchan, I., Couch, P., Cruickshank, D., Delderfield, M., Dunlop, I., Gamble, M., Goble, C., Michaelides, D., Missier, P., Owen, S., Newman, D., De Roure, D. and Sufi, S. (2010). Why Linked Data is Not Enough for Scientists. In: Sixth IEEE e-Science conference (e-Science 2010), December 2010, Brisbane, Australia.
