Publication of facility investigations Brian Matthews Scientific Information Group Scientific Computing Department STFC Rutherford Appleton Laboratory.

STFC
Funds and operates large-scale science for the UK research base
– physics, astronomy
– chemistry, materials
Scientific Computing develops and operates computing infrastructure
– HPC, PB Datastore, s/w, data management…
ESO: ALMA Array

Major Science Facilities
Big Science
– Particle Physics: exploring the very small
– Space Science: exploring the very large
Small Science
– Understanding the world around us at a molecular level
– Lasers, Neutron & Light Source: ISIS & Diamond

Facilities Support: Big Facilities for Small Science
Diamond, ISIS, CLF

Science at STFC Facilities
[Diagram labels: beam, sample, imaging detector, data, computing, analysis, modelling, knowledge]
Neutrons and photons provide complementary views of matter:
– Photons “see” electric charge: high atomic number nuclei
– Neutrons “see” nucleons: especially hydrogen atoms

The science we do - Structure of materials
Visit facility on research campus; place sample in beam; diffraction pattern from sample; fitting experimental data to model
Examples:
– Bioactive glass for bone growth
– Structure of cholesterol in crude oil
– Hydrogen storage for zero emission vehicles
– Magnetic moments in electronic storage
– Longitudinal strain in aircraft wing
~30,000 user visitors each year in Europe:
– physics, chemistry, biology, medicine,
– energy, environmental, materials, culture
– pharmaceuticals, petrochemicals, microelectronics
Billions of € of investment
– c. £400M for DLS
– + running costs
Over high impact publications per year in Europe
– But so far no integrated data repositories
– Lacking sustainability & traceability

Similar architecture used for DLS
Scaling is a constant concern
Data rates keep increasing: 70 TB per month and rising
Tailored ICAT
Reengineered StorageD

Proposals: Once awarded beamtime at ISIS, an entry will be created in ICAT that describes your proposed experiment.
Experiment: Data collected from your experiment will be indexed by ICAT (with additional experimental conditions) and made available to your experimental team.
Analysed Data: You will have the capability to upload any desired analysed data and associate it with your experiments.
Publication: Using ICAT you will also be able to associate publications with your experiment and even reference data from your publications.
Examples shown:
– B-lactoglobulin protein interfacial structure
– Example ISIS Proposal
– GEM: High intensity, high resolution neutron diffractometer
– H2-(zeolite) vibrational frequencies vs polarising potential of cations
Central Facility
– Secure access to users’ data
– Flexible data searching
– Scalable and extensible architecture
– Integration with analysis tools
– Access to high-performance resources
– Linking to other scientific outputs
– Data policy aware

Core Scientific Metadata Model (CSMD)
The Core Metadata model forms the information model for ICAT. It is designed to describe facilities-based experiments in Structural Science.
Entities: Investigation, Publication, Keyword, Topic, Sample, Sample Parameter, Dataset, Dataset Parameter, Datafile, Datafile Parameter, Investigator, Related Datafile, Parameter, Authorisation
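The entity list above can be read as a simple containment hierarchy. The following is a minimal sketch of that reading, not the ICAT schema: the class names come from the slide, but every attribute (title, location, units and so on) and the helper count_datafiles are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative only: class names follow the CSMD entity names on the slide;
# the attributes are assumptions and do not reproduce the real ICAT schema.


@dataclass
class Parameter:
    name: str
    value: str
    units: str = ""


@dataclass
class Datafile:
    name: str
    location: str
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class Dataset:
    name: str
    datafiles: List[Datafile] = field(default_factory=list)
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class Sample:
    name: str
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    investigators: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    samples: List[Sample] = field(default_factory=list)
    datasets: List[Dataset] = field(default_factory=list)
    publications: List[str] = field(default_factory=list)


def count_datafiles(inv: Investigation) -> int:
    """Total number of data files across all datasets of an investigation."""
    return sum(len(ds.datafiles) for ds in inv.datasets)
```

In this sketch the Investigation is the root object and everything else hangs off it, which mirrors the argument made later in the talk that the investigation, not the dataset, is the natural unit to publish.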

TopCat

DOIs for Data Publication

Is this enough?
What we have so far is good for:
– us to manage data
– users to access their own data
– citation of raw data
But
– Traceability and Validation?
– Reuse of the data?
Need to make context more explicit
– Focussing on the dataset is the wrong subject of discourse

Support the wider Facilities Lifecycle
– Proposal: Scientist submits application for beamtime
– Approval: Facility committee approves application
– Scheduling: Facility registers, trains, and schedules scientist’s visit
– Experiment: Scientist visits, facility runs experiment
– Data storage: Raw data filtered and stored
– Data analysis: Tools for processing made available
– Record Publication: Subsequent publication registered with facility
As in PanData-ODI D6.1 (which has much more detail)
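One way to picture the lifecycle is as an ordered sequence of stages that an investigation record moves through. The sketch below assumes a strictly linear order with data analysis placed between data storage and publication; the slide does not state this ordering explicitly, and the names Stage and next_stage are hypothetical.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    # Stage names and descriptions are taken from the slide; the linear
    # ordering below is an assumption made for this sketch.
    PROPOSAL = "Scientist submits application for beamtime"
    APPROVAL = "Facility committee approves application"
    SCHEDULING = "Facility registers, trains, and schedules scientist's visit"
    EXPERIMENT = "Scientist visits, facility runs experiment"
    DATA_STORAGE = "Raw data filtered and stored"
    DATA_ANALYSIS = "Tools for processing made available"
    RECORD_PUBLICATION = "Subsequent publication registered with facility"


# Enum members iterate in definition order, which gives the stage sequence.
ORDER = list(Stage)


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None at the end of the lifecycle."""
    idx = ORDER.index(stage)
    return ORDER[idx + 1] if idx + 1 < len(ORDER) else None
```

A real facility workflow would likely allow loops (for example repeated analysis), so this is only a reading aid for the diagram.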

Publishing Investigations
So what we want is a record of EXPERIMENTS, not data.
Thus we want the record of the context
– The experimental intention and actors
– The instruments and configurations used
– The sample
– The environmental parameters and context
– The Raw Data
Thus we want to publish a record of the whole INVESTIGATION
– Can get most of this from what we have
The Investigation becomes a “first class” research object
– Published
– Identified and treated as a single entity
– Cited and credited
– Record of the output of the facility
Analogous to a Journal Article
– Investigation as the unit of discourse for scientific facilities.
But also as an access point for validation and reuse
– Because we have a record of what actually happened.

Our DataCite entries are in fact Investigations (red is for “data” notion, and green is for “investigation”)

“DataCite abuse”
As we have seen, we use DataCite for Investigations, with Datasets only referred to from them.
Other data curators sometimes use DataCite for Publications (“documents”) that contain data.
So “data” DOIs tend to resolve either into Investigations or Publications
– Extend the Resource Type
– Also may not want to have a landing page for all DOIs

Research Objects
Represent the “investigation” as a Research Object
– Research Objects (ROs) are semantically rich aggregations of resources that bring together data, methods and people in scientific investigations. Their goal is to create a class of artifacts that can encapsulate our digital knowledge and provide a mechanism for sharing and discovering assets of reusable research and scientific knowledge (WorkFlow4Ever and elsewhere)
Represent Investigation as a Research Object
– Build a graph structure for the links in the research object.
– Using an RDF representation, URIs
– Publish as a linked data object
References:
Bechhofer et al. Why Linked Data is Not Enough for Scientists, Proceedings of the 10th IEEE e-Science Conference, Brisbane, Australia (2010)
Arif Shaon, Sarah Callaghan, Bryan Lawrence, Brian Matthews. Opening up Climate Research: a linked data approach to publishing data provenance, 7th Int Digital Curation Conference (2011)

RDF representation of CSMD model
– Investigation: An investigation or experiment
– Facility: An experimental facility
– Dataset: A collection of data files and part of an investigation
– Datafile: A data file

After proposal: Initialise the Research Object
[Diagram: Investigation #n (DOI:STFC.xxx.n) with links :instrument, :investigator]
:n a csmd:Investigation ;
    csmd:investigation_doi doi:stfc.xxx.n ;
    csmd:investigation_investigationUser :iu1 ;
    csmd:investigation_instrument :inst1 .
:iu1 a csmd:investigationUser ;
    csmd:investigationUser_user :u1 .
:u1 a csmd:User .
:inst1 a csmd:Instrument .
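The initial graph on this slide could be produced programmatically once a proposal is accepted. The sketch below uses only the Python standard library and plain (subject, predicate, object) tuples; the property and class names come from the slide, while the namespace URI in CSMD, the URI layout under `base`, and the function names are assumptions for illustration. The DOI is passed in by the caller because the slide shows only a placeholder.

```python
# Hypothetical namespace: the slide uses the prefix csmd: without giving its URI.
CSMD = "http://example.org/csmd#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def init_research_object(base, n, doi_uri, user_id, instrument_id):
    """Build the seed triples for investigation `n` as a set of URI tuples."""
    inv = f"{base}investigation/{n}"
    iu = f"{base}investigationUser/{n}-{user_id}"
    user = f"{base}user/{user_id}"
    inst = f"{base}instrument/{instrument_id}"
    return {
        (inv, RDF_TYPE, CSMD + "Investigation"),
        (inv, CSMD + "investigation_doi", doi_uri),
        (inv, CSMD + "investigation_investigationUser", iu),
        (inv, CSMD + "investigation_instrument", inst),
        (iu, RDF_TYPE, CSMD + "investigationUser"),
        (iu, CSMD + "investigationUser_user", user),
        (user, RDF_TYPE, CSMD + "User"),
        (inst, RDF_TYPE, CSMD + "Instrument"),
    }


def to_ntriples(triples):
    """Serialise URI-only triples as N-Triples lines, sorted for stable output."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in sorted(triples))
```

Later lifecycle stages would then add triples to the same set rather than create a new object, which is the sense in which the Investigation acts as a "seed".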

After the experiment
[Diagram: Investigation #n (DOI:STFC.xxx.n) with links :dataset, :instrument, :investigator, :sample; experimental data and metadata held in Data Storage]
– Own metadata format (CSMD)
– More or less what ICAT currently supports
– Adds extra details on parameters, datasets, formats etc.

Linking Publication into Investigation
[Diagram: Investigation #n (DOI:STFC.xxx.n) with links :dataset, :publication, :investigator, :instrument, :sample; repositories: Raw Data Repository, Publication Repository (Publication Store); relation: cito:cites]

Linking the derived data into the Investigation
[Diagram: Investigation #n (DOI:STFC.xxx.n) with links :dataset, :relatedDataset, :publication, :investigator, :instrument, :sample; repositories: Raw Data Repository, Derived Data Repository, Publication Repository]
Note that derived data could be on a different site

Linking the software into the Investigation
[Diagram: Investigation #n (DOI:STFC.xxx.n) with links :dataset, :relatedDataset, :publication, :investigator, :instrument, :sample, :application; Software Package 1 in a Software Repository with :inputDataset and :outputDataset; relation: cito:cites]
– W3C Prov ontology
– Assume that the software is in a repository
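The linking steps for publications, derived data and software can be sketched as functions that add edges to the investigation graph. This is an illustration under stated assumptions: the graph is a plain Python set of URI tuples, the EX namespace is hypothetical, the direction of cito:cites (publication cites investigation) is assumed, and the mapping of the slide's :inputDataset and :outputDataset onto the PROV terms prov:used and prov:wasGeneratedBy is one possible choice rather than the talk's stated design.

```python
# Real vocabulary terms (CiTO and W3C PROV-O).
CITO_CITES = "http://purl.org/spar/cito/cites"
PROV_USED = "http://www.w3.org/ns/prov#used"
PROV_WAS_GENERATED_BY = "http://www.w3.org/ns/prov#wasGeneratedBy"

# Hypothetical namespace for the unprefixed link names shown on the slides.
EX = "http://example.org/ro#"


def link_publication(graph, investigation, publication):
    """Attach a publication to the investigation and record the citation."""
    graph.add((investigation, EX + "publication", publication))
    # Assumed direction: the publication cites the investigation.
    graph.add((publication, CITO_CITES, investigation))


def link_analysis(graph, investigation, analysis_run, raw_dataset, derived_dataset):
    """Attach an analysis run (software applied to data) and its derived dataset."""
    graph.add((investigation, EX + "application", analysis_run))
    graph.add((investigation, EX + "relatedDataset", derived_dataset))
    # Provenance: the run used the raw data and generated the derived data.
    graph.add((analysis_run, PROV_USED, raw_dataset))
    graph.add((derived_dataset, PROV_WAS_GENERATED_BY, analysis_run))
```

Because every node is identified by a URI, the derived data or the software can live in a repository on a different site and still be linked into the same investigation.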

Generate Landing page from RO
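As a rough illustration of generating a landing page from a research object, the sketch below renders a title, a DOI string and a list of labelled links as minimal HTML. The function name, its parameters and the page layout are all assumptions; the slide does not describe how the actual landing pages are produced.

```python
from html import escape


def render_landing_page(title, doi, links):
    """Render a minimal HTML landing page.

    `links` is a list of (label, uri) pairs, e.g. the datasets, publications
    and instrument linked from the investigation. All values are HTML-escaped.
    """
    items = [
        '<li>{}: <a href="{}">{}</a></li>'.format(
            escape(label), escape(uri, quote=True), escape(uri)
        )
        for label, uri in links
    ]
    parts = [
        "<html>",
        "<head><title>{}</title></head>".format(escape(title)),
        "<body>",
        "<h1>{}</h1>".format(escape(title)),
        "<p>DOI: {}</p>".format(escape(doi)),
        "<ul>",
        *items,
        "</ul>",
        "</body>",
        "</html>",
    ]
    return "\n".join(parts)
```

The point of the sketch is only that the page is derived from the graph, so it stays consistent with the research object as links are added.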

Setting the Boundary: It depends on your Point of View
– Investigations
– Extended Publication
– E-Portfolio

Setting a boundary: OAI-ORE
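In OAI-ORE terms, a boundary is drawn by an Aggregation that explicitly lists the resources it aggregates, described by a Resource Map. The sketch below builds those triples with the standard ORE terms (ore:ResourceMap, ore:Aggregation, ore:describes, ore:aggregates); the function names and the use of plain tuples are assumptions for illustration.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def resource_map(rem_uri, aggregation_uri, members):
    """Build the triples of a resource map describing one aggregation."""
    triples = {
        (rem_uri, RDF_TYPE, ORE + "ResourceMap"),
        (rem_uri, ORE + "describes", aggregation_uri),
        (aggregation_uri, RDF_TYPE, ORE + "Aggregation"),
    }
    for member in members:
        triples.add((aggregation_uri, ORE + "aggregates", member))
    return triples


def inside_boundary(triples, aggregation_uri, resource):
    """True if `resource` is explicitly aggregated, i.e. inside the boundary."""
    return (aggregation_uri, ORE + "aggregates", resource) in triples
```

Different points of view (investigation, extended publication, e-portfolio) would then correspond to different aggregations over the same underlying graph.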

Preserving Investigations
Now becomes preserving the research object.
– Preserving a linked data graph
– Persistency of identifiers
– Managing integrity of external artefacts
– Link checking
– Copying and mirroring: checking consistency
Representation Information to give more context on the objects
– And on the aggregate as a whole
PDI (Provenance, Integrity etc.) on the whole aggregate object
– As well as components
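Link checking and integrity management could be approached with a fixity manifest: record a checksum for each artefact in the aggregate, then periodically re-fetch and compare. The sketch below is one possible approach, not the method used by the facility; the manifest layout and the `fetch` callable are hypothetical, and SHA-256 is chosen only as a common default.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """SHA-256 hex digest used as the fixity value for an artefact."""
    return hashlib.sha256(content).hexdigest()


def check_integrity(manifest, fetch):
    """Compare current artefact contents against recorded fingerprints.

    manifest: dict mapping URI -> expected SHA-256 hex digest.
    fetch:    callable taking a URI and returning bytes, or None if the
              link no longer resolves (hypothetical; supplied by the caller).
    Returns a dict mapping URI -> "ok", "changed" or "missing".
    """
    report = {}
    for uri, expected in manifest.items():
        content = fetch(uri)
        if content is None:
            report[uri] = "missing"      # broken link
        elif fingerprint(content) != expected:
            report[uri] = "changed"      # content drifted since recording
        else:
            report[uri] = "ok"
    return report
```

The same check can be run against a mirror to test consistency between copies, by passing a `fetch` that reads from the mirror instead of the primary store.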

Adding Preservation Information: Rep Info for various items
[Diagram: Investigation #n (DOI:STFC.xxx.n) with links :dataset, :relatedDataset, :publication, :investigator, :application, :instrument, :sample]
Rep Info attached to items:
– Instrument description (website)
– Raw data format description (e.g. NeXus)
– Parameter description (e.g. NXDL, Con Vocab)
– Software classification
– Software description
– Sample description
– Analysed data format description
– Publication format description
Notes:
– Would probably be more
– Work into a RepInfo Repository
– Would also have a RepInfo Network

Adding Preservation Information: Rep Info for the whole aggregate
[Diagram: Investigation #n (DOI:STFC.xxx.n) with links :dataset, :relatedDataset, :publication, :investigator, :application, :instrument, :sample]
Rep Info attached to the aggregate:
– Software classification
– CSMD Vocabulary description

Summary
Investigation is the appropriate unit of discourse for facilities science
– Publishable, Citable, Reportable
– Can be used as a vehicle for validation and reuse
Basic principles of building research objects for facilities science
– Follow research lifecycle
– Consider Investigation a RO “seed”
– Apply Linked Data principles
– Re-use existing vocabularies and ontologies
– Share ROs via recognizable data formats and APIs
Applicable beyond Facilities
– Other analogous objects:
– “experiments”, “observations”, “studies”
The subject of preservation
– How do we maintain the integrity of Investigation objects?

Thank You Questions?