Publication of facility investigations Brian Matthews Scientific Information Group Scientific Computing Department STFC Rutherford Appleton Laboratory
Scientific Computing develops and operates computing infrastructure - HPC, PB Datastore, s/w, data management… Funds and operates large scale science for UK Research base - physics, astronomy - chemistry, materials ESO: Alma Array STFC
Major Science Facilities Big Science Particle Physics - exploring the very small Space Science - exploring the very large Small Science Understanding the world around us at a molecular level Lasers, Neutron & Light Source – ISIS & Diamond
Facilities Support Big Facilities for Small Science Diamond ISIS CLF
Science at STFC Facilities data Computing Analysis Modelling knowledge beam sample Imaging detector Neutrons and photons Provide complementary views of matter: Photons “see” electric charge – high atomic number nuclei Neutrons “see” nucleons – especially hydrogen atoms
The science we do - Structure of materials Fitting experimental data to model Bioactive glass for bone growth Structure of cholesterol in crude oil Hydrogen storage for zero emission vehicles Magnetic moments in electronic storage ~30,000 user visitors each year in Europe: –physics, chemistry, biology, medicine, –energy, environmental, materials, culture –pharmaceuticals, petrochemicals, microelectronics Longitudinal strain in aircraft wing Diffraction pattern from sample Visit facility on research campus Place sample in beam Billions of € of investment –c. £400M for DLS –+ running costs Over high impact publications per year in Europe –But so far no integrated data repositories –Lacking sustainability & traceability
Similar architecture used for DLS Scaling is a constant concern Data rates keep increasing 70TB per month and rising Tailored ICAT Reengineered StorageD
Proposals Once awarded beamtime at ISIS, an entry will be created in ICAT that describes your proposed experiment. Experiment Data collected from your experiment will be indexed by ICAT (with additional experimental conditions) and made available to your experimental team. Analysed Data You will be able to upload any analysed data and associate it with your experiments. Publication Using ICAT you will also be able to associate publications with your experiment and even reference data from your publications. B-lactoglobulin protein interfacial structure Example ISIS Proposal GEM – High intensity, high resolution neutron diffractometer H2-(zeolite) vibrational frequencies vs polarising potential of cations Central Facility Secure access to user’s data Flexible data searching Scalable and extensible architecture Integration with analysis tools Access to high-performance resources Linking to other scientific outputs Data policy aware
Investigation Publication Keyword Topic Sample Sample Parameter Dataset Dataset Parameter Datafile Datafile Parameter Investigator Related Datafile Parameter Authorisation Core Scientific Metadata Model (CSMD) The Core Metadata model forms the information model for ICAT. Designed to describe facility-based experiments in Structural Science.
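The entity names on this slide suggest a containment hierarchy: an Investigation holds Samples and Datasets, a Dataset holds Datafiles, and each level can carry Parameters. A minimal sketch of that shape follows. The field names, types and the helper `count_datafiles` are assumptions made for illustration; they are not the actual CSMD or ICAT schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Parameter:
    """A name/value pair attached to a sample, dataset or datafile."""
    name: str
    value: str
    units: str = ""


@dataclass
class Datafile:
    name: str
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class Dataset:
    name: str
    datafiles: List[Datafile] = field(default_factory=list)
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class Sample:
    name: str
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class Investigation:
    # Hypothetical field names; the slide only lists the entity names.
    title: str
    investigators: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    samples: List[Sample] = field(default_factory=list)
    datasets: List[Dataset] = field(default_factory=list)
    publications: List[str] = field(default_factory=list)


def count_datafiles(investigation: Investigation) -> int:
    """Total number of datafiles across all datasets of an investigation."""
    return sum(len(ds.datafiles) for ds in investigation.datasets)
```

Keeping Parameters as generic name/value pairs, rather than fixed columns, is one way to let different instruments record different experimental conditions.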
TopCat
DOIs for Data Publication
Is this enough? What we have so far is good for: –us to manage data –users to access their own data –citation of raw data But –Traceability and Validation? –Reuse of the data? Need to make context more explicit –The dataset alone is the wrong subject of discourse
Support the wider Facilities Lifecycle Proposal Approval Scheduling Experiment Data storage Record Publication Scientist submits application for beamtime Facility committee approves application Facility registers, trains, and schedules scientist’s visit Scientist visits, facility runs experiment Subsequent publication registered with facility Raw data filtered and stored Data analysis Tools for processing made available As in PanData-ODI – D6.1 (which has much more detail)
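The six headline stages on this slide can be read as an ordered sequence that an investigation record moves through. A minimal sketch of that reading follows. The linear ordering and the names `STAGES` and `next_stage` are assumptions for illustration; a real lifecycle may branch or repeat stages, and data analysis is left out because the slide does not fix its position.

```python
from typing import Optional

# Stage names taken from the slide; the strict linear order is an assumption.
STAGES = [
    "Proposal",
    "Approval",
    "Scheduling",
    "Experiment",
    "Data storage",
    "Record publication",
]


def next_stage(current: str) -> Optional[str]:
    """Return the stage that follows `current`, or None at the end.

    Raises ValueError if `current` is not a known stage.
    """
    index = STAGES.index(current)
    return STAGES[index + 1] if index + 1 < len(STAGES) else None
```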
Publishing Investigations So what we want is a record of EXPERIMENTS not data. Thus want the record of the context –The experimental intention and actors –The instruments and configurations used –The sample –The environmental parameters and context –The Raw Data Thus we want to publish a record of the whole INVESTIGATION –Can get most of this from what we have The Investigation becomes a “first class” research object –Published –Identified and treated as a single entity –Cited and credited –Record of the output of the facility Analogous to a Journal Article –Investigation as the unit of discourse for scientific facilities. But also as an access point for validation and reuse –Because we have a record of what actually happened.
Our DataCite entries are in fact Investigations (red is for “data” notion, and green is for “investigation”)
“DataCite abuse” As we have seen, we use DataCite for Investigations, with Datasets only referred to from them. Other data curators sometimes use DataCite for Publications (“documents”) that contain data. So “data” DOIs tend to resolve into either Investigations or Publications. Extend the Resource Type. Also may not want to have a landing page for all DOIs.
Research Objects Represent the “investigation” as a Research Object –Research Objects (ROs) are semantically rich aggregations of resources that bring together data, methods and people in scientific investigations. Their goal is to create a class of artifacts that can encapsulate our digital knowledge and provide a mechanism for sharing and discovering assets of reusable research and scientific knowledge (WorkFlow4Ever and elsewhere). Represent Investigation as a Research Object –Build a graph structure for the links in the research object. –Using an RDF representation, URIs –Publish as a linked data object
Bechhofer et al. Why Linked Data is Not Enough for Scientists. Proceedings of the 10th IEEE e-Science Conference, Brisbane, Australia (2010).
Arif Shaon, Sarah Callaghan, Bryan Lawrence, Brian Matthews. Opening up Climate Research: a linked data approach to publishing data provenance. 7th Int Digital Curation Conference (2011).
RDF representation of CSMD model Investigation An investigation or experiment Facility An experimental facility Dataset A collection of data files and part of an investigation Datafile A data file
After proposal: Initialise the Research Object Investigation #n DOI:STFC.xxx.n :instrument :investigator
:n a csmd:Investigation ;
    csmd:investigation_doi doi:stfc.xxx.n ;
    csmd:investigation_investigationUser :iu1 ;
    csmd:investigation_instrument :inst1 .
:iu1 a csmd:investigationUser ;
    csmd:investigationUser_user :u1 .
:u1 a csmd:User .
:inst1 a csmd:Instrument .
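The initial Research Object on this slide is a small set of statements about the investigation, its user and its instrument. A minimal sketch of producing those statements programmatically follows, using plain (subject, predicate, object) tuples rather than an RDF library. The function names `init_research_object` and `to_lines` are hypothetical; the `csmd:` terms and the placeholder DOI pattern `stfc.xxx.n` are copied from the slide as they appear.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def init_research_object(n: str) -> List[Triple]:
    """Build the seed statements for investigation `n` as prefixed-name triples."""
    inv = f":{n}"
    return [
        (inv, "a", "csmd:Investigation"),
        # "xxx" is the slide's own placeholder and is kept verbatim.
        (inv, "csmd:investigation_doi", f"doi:stfc.xxx.{n}"),
        (inv, "csmd:investigation_investigationUser", ":iu1"),
        (inv, "csmd:investigation_instrument", ":inst1"),
        (":iu1", "a", "csmd:investigationUser"),
        (":iu1", "csmd:investigationUser_user", ":u1"),
        (":u1", "a", "csmd:User"),
        (":inst1", "a", "csmd:Instrument"),
    ]


def to_lines(triples: List[Triple]) -> List[str]:
    """Write each triple as one 'subject predicate object .' line."""
    return [f"{s} {p} {o} ." for s, p, o in triples]
```

Later lifecycle steps (datasets, publications, software) can then be modelled as further triples appended to this seed.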
After the experiment Experimental Data Metadata Investigation #n DOI:STFC.xxx.n :dataset :instrument :investigator Own metadata format (CSMD) More or less what ICAT currently supports Adds extra details on parameters, datasets, formats etc. :sample Data Storage
Linking Publication into Investigation Raw Data Repository Publication Repository :dataset :publication :investigator cito:cites Investigation #n DOI:STFC.xxx.n :instrument :sample Publication Store
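Linking a publication adds two edges to the graph: one from the investigation to the publication, and a citation edge. A minimal sketch follows, with the graph held as a set of triples. The direction chosen for `cito:cites` (publication cites investigation) follows the usual CiTO reading of "citing entity cites cited entity", but the diagram's actual arrow direction is not recoverable from the slide text, so treat it as an assumption. `link_publication` is a hypothetical helper name.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def link_publication(graph: Set[Triple], investigation: str, publication: str) -> Set[Triple]:
    """Return a new graph with the publication attached to the investigation."""
    return graph | {
        (investigation, ":publication", publication),
        # Assumed direction: the paper cites the investigation record.
        (publication, "cito:cites", investigation),
    }
```

Returning a new set instead of mutating the input keeps earlier states of the Research Object available, which can help when recording how it grew over the lifecycle.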
Raw Data Repository Derived Data Repository Publication Repository :dataset :publication :investigator Investigation #n DOI:STFC.xxx.n :instrument :sample Note that derived data could be on a different site :relatedDataset Linking the derived data into the Investigation
Linking the software into the Investigation :dataset :relatedDataset :publication :investigator W3C Prov ontology Assume that the software is in a repository Software Package 1 cito:cites :inputDataset :outputDataset :application Software Repository Investigation #n DOI:STFC.xxx.n :instrument :sample
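Once software runs are recorded with `:inputDataset` and `:outputDataset` links, the graph can answer the validation question "which data was this derived dataset computed from?". A minimal sketch of that lookup follows. The property names come from the slide; the function `trace_inputs` and the identifiers in the usage example are hypothetical, and mapping these properties onto W3C PROV terms is left out.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def trace_inputs(graph: Set[Triple], derived: str) -> Set[str]:
    """Return the datasets used as input by any application that output `derived`."""
    applications = {s for s, p, o in graph if p == ":outputDataset" and o == derived}
    return {o for s, p, o in graph if p == ":inputDataset" and s in applications}
```

For example, with a graph containing `(":app1", ":inputDataset", ":raw1")` and `(":app1", ":outputDataset", ":derived1")`, calling `trace_inputs(graph, ":derived1")` returns the set containing `":raw1"`.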
Generate Landing page from RO
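One way to generate a landing page is to walk the Research Object's links and render them as a simple HTML list. A minimal sketch follows. The function `render_landing_page`, its parameters and the page layout are assumptions for illustration, not the facility's actual landing page generator.

```python
from html import escape
from typing import Dict, List


def render_landing_page(doi: str, title: str, links: Dict[str, List[str]]) -> str:
    """Render a title, a DOI line and one list item per (relation, target) link."""
    items = []
    for relation in sorted(links):
        for target in links[relation]:
            # Escape every value so identifiers cannot inject markup.
            items.append(f"<li>{escape(relation)}: {escape(target)}</li>")
    return (
        f"<h1>{escape(title)}</h1>\n"
        f"<p>DOI: {escape(doi)}</p>\n"
        "<ul>\n" + "\n".join(items) + "\n</ul>"
    )
```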
Setting the Boundary: It depends on your Point of View Investigations Extended Publication E-Portfolio
Setting a boundary: OAI-ORE
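In OAI-ORE, a Resource Map describes an Aggregation, and the Aggregation lists the resources it aggregates; whatever is listed is inside the boundary. A minimal sketch follows, again with plain triples. The `ore:` terms are from the ORE vocabulary; the helper names and the identifiers passed in are hypothetical, and which members to include is exactly the point-of-view decision the previous slide describes.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def build_aggregation(resource_map: str, aggregation: str, members: List[str]) -> List[Triple]:
    """Describe an aggregation and its member resources using ORE terms."""
    triples = [
        (resource_map, "rdf:type", "ore:ResourceMap"),
        (resource_map, "ore:describes", aggregation),
        (aggregation, "rdf:type", "ore:Aggregation"),
    ]
    triples += [(aggregation, "ore:aggregates", member) for member in members]
    return triples


def inside_boundary(triples: List[Triple], aggregation: str, resource: str) -> bool:
    """A resource is inside the boundary if the aggregation aggregates it."""
    return (aggregation, "ore:aggregates", resource) in triples
```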
Preserving Investigations Now becomes preserving the research object. –Preserving a linked data graph –Persistency of identifiers –Managing integrity of external artefacts. –Link checking –Copying and mirroring – checking consistency Representation Information to give more context on the objects –And on the aggregate as a whole PDI (Provenance, Integrity etc) on the whole aggregate object –As well as components
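Checking the consistency of copies and mirrors is commonly done with a fixity manifest: record a checksum per component, then recompute and compare later. A minimal sketch follows, using SHA-256 from the standard library. The function names and the in-memory `bytes` representation of components are assumptions for illustration; a real system would stream files and store the manifest alongside the Research Object.

```python
import hashlib
from typing import Dict, List


def fixity_manifest(components: Dict[str, bytes]) -> Dict[str, str]:
    """Map each component name to the SHA-256 hex digest of its content."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in components.items()}


def changed_components(manifest: Dict[str, str], components: Dict[str, bytes]) -> List[str]:
    """Names from the manifest that are missing or whose content no longer matches."""
    current = fixity_manifest(components)
    return sorted(name for name in manifest if current.get(name) != manifest[name])
```

Running `changed_components` against a mirror flags both altered and missing components, which covers the integrity and consistency checks listed above; link checking for external artefacts would need a separate step.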
Adding Preservation Information – Rep Info for various items :dataset :relatedDataset :publication :investigator Would probably be more Work into a RepInfo Repository Would also have a RepInfo Network :application Investigation #n DOI:STFC.xxx.n :instrument :sample Instrument description (website) Raw data format description (e.g. NeXus) Parameter description (e.g. NXDL, Con Vocab) Software classification Software description Sample description Analysed data format description Publication format description
Adding Preservation Information – Rep Info for the whole aggregate :dataset :relatedDataset :publication :investigator :application Investigation #n DOI:STFC.xxx.n :instrument :sample Software classification CSMD Vocabulary description
Summary Investigation appropriate unit of discourse for facilities science –Publishable, Citable, Reportable –Can be used as a vehicle for validation and reuse Basic principles of building research objects for facilities science –Follow research lifecycle –Consider Investigation a RO “seed” –Apply Linked Data principles –Re-use existing vocabularies and ontologies –Share ROs via recognizable data formats and APIs Applicable beyond Facilities –Other analogous objects: –“experiments”, “observations”, “studies” The subject of preservation –How do we maintain the integrity of Investigation objects?
Thank You Questions?