Requirements Gathering (work in progress) Manjula Patel, UKOLN & DCC I2S2 Models Workshop 11 th February 2010 STFC, RAL, Didcot

Slides:



Advertisements
Similar presentations
Data Publishing Service Indiana University Stacy Kowalczyk April 9, 2010.
Advertisements

DRIVER Building a worldwide scientific data repository infrastructure in support of scholarly communication 1 JISC/CNI Conference, Belfast, July.
DRIVER Long Term Preservation for Enhanced Publications in the DRIVER Infrastructure 1 WePreserve Workshop, October 2008 Dale Peters, Scientific Technical.
The benefits of more effective research data management in UK universities Neil Beagrie Charles Beagrie Limited MRD Programme Conference Birmingham March.
A centre of expertise in digital information management IMS Digital Repositories Interoperability Andy Powell UKOLN,
Philip LordDigital Archiving Consultancy Alison Macdonald Digital Archiving Consultancy Liz LyonDigital Curation Centre David GiarettaDigital Curation.
A centre of expertise in data curation and preservation DCC/NeSC eScience Workshop, June 2008 Working in partnership with the eScience community This work.
S.J. Coles a*, M.B. Hursthouse a, R.A. Stephenson a, P. Cliff b, E. Lyon b, M. Patel b J. Downing c & P. Murray-Rust.
© S.J. Coles 2006 Usability WS, NeSC Jan 06 Enabling the reusability of scientific data: Experiences with designing an open access infrastructure for sharing.
© S.J. Coles 2006 Digital Repositories as a Mechanism for the Capture, Management and Dissemination of Chemical Data Simon Coles School of Chemistry, University.
Data and Publication Discovery Brian Matthews, Information Management Group, STFC Rutherford Appleton Laboratory CLADDIER workshop, Chilworth, Southampton,
S.J. Coles a*, J.G. Frey a, M.B. Hursthouse a, L. Carr b & C.J. Gutteridge b. a School of Chemistry, University of Southampton, UK.; b School of Electronics.
© S.J. Coles 2006 Digital Repositories as a Mechanism for the Capture, Management and Dissemination of Chemical Data Simon Coles School of Chemistry, University.
RCUK, Octiber Archiving research data and research publications. Dr Leslie Carr, Intelligence, Agents Multimedia, University of Southampton Dr Simon.
Supporting education and research Repositories in Context Digital repositories as components of an integrated infrastructure for education Leona Carpenter.
Collection-level description & collection management: tool for the trade or information trade-off? Collection Description Focus Workshop 4 Newcastle, 8.
The PREMIS Data Dictionary Michael Day Digital Curation Centre UKOLN, University of Bath JORUM, JISC and DCC.
A centre of expertise in digital information management UKOLN is supported by: Curating the Scientific Record: The Challenges Ahead Dr.
Towards an information model for I2S2
Joint Information Systems Committee Digital Library Services BL/JISC Workshop Rachel Bruce JISC Programme Director The Digital Library and its Services,
A centre of expertise in digital information management UKOLN is supported by: British Academy e-Resources Policy Review: UKOLN Report.
Federation The eCrystals Federation Dr Simon Coles, University of Southampton, UK Dr Liz Lyon, UKOLN, University of Bath, UK Open Repositories 2008, University.
A centre of expertise in digital information management UKOLN is supported by: UK Perspectives on the Curation and Preservation of Scientific.
UK Digital Curation Centre : enabling research data management at the coalface Dr Liz Lyon Associate Director DCC / Director UKOLN University of Bath,
Federation eCrystals Federation: Open Repositories for Data-driven Science Dr Liz Lyon, UKOLN, University of Bath, UK Dr Simon Coles, University of Southampton,
A centre of expertise in digital information management UKOLN is supported by: Memory institutions and the social fabric of the Web Dr.
UKOLN is supported by: JISC Information Environment update Repositories and Preservation Programme meeting, October 24-25, 2006 Rachel Heery UKOLN
The Australian National Data Service Ross Wilkinson Feb 26 th
© S.J. Coles 2006 Institutional Data Repositories for Chemistry Simon Coles School of Chemistry, University of Southampton, U.K.
EBankII Workshop 1 Making Scientific Data Openly Available Simon Coles School of Chemistry, University of Southampton.
I2S2 - Infrastructure for Integration in Structural Sciences Cross-Institutional Pilot
I2S2 - Infrastructure for Integration in Structural Sciences Information Model Development Workshop RAL 11 th February 2010
Digital Repositories: interoperability & common services Closing Remarks Dr Liz Lyon, UKOLN, University of Bath, UK
Collection-level description & the Information Landscape: users evaluate strategies for resource discovery Collection Description Focus Workshop 5 Cambridge,
A centre of expertise in data curation and preservation DigCCur2007 Symposium, Chapel Hill, N.C., April 18-20, 2007 Co-operation for digital preservation.
APA CONFERENCE, FRASCATI 6 November 2012 Data management planning at the DCC Martin Donnelly Digital Curation Centre University of Edinburgh.
ICAT + Information Model Brian Matthews Scientific Information Group E-Science Centre STFC Rutherford Appleton Laboratory
Collection description & Collection Description Focus JISC/DNER Moving Image & Sound Cluster Steering Group meeting, HEFCE Office, London, 24 September.
An introduction to collections and collection-level description Collection-Level Description & NOF-digitise projects NOF-digitise programme seminar, London,
PaN-data WP7 - Integration Brian Matthews STFC-e-Science.
The UK Research Data Service Project Jean Sykes Librarian and Director of IT Services London School of Economics.
University of Southampton, U.K.
EPrints Workshop, January eBank UK: Dissemination of research data using EPrints Simon Coles, School of Chemistry, University of Southampton.
New DFG Information Infrastructure Projects Dr. Stefan Winkler-Nees; Birmingham, 28. March 2011 New DFG Information Infrastructure Projects.
Co-funded by the European Union under FP7-ICT Alliance Permanent Access to the Records of Science in Europe Network Co-ordinated by aparsen.eu #APARSEN.
Saturday 1 SN4CI. November 2005SNAC2 Words (used across 3 or more groups) Defined: community, scope Identifying: developers, early adopters, mechanism.
Elements of a Data Management Plan Bill Michener University Libraries University of New Mexico Data Management Practices for.
Caring and Sharing Collaboration in Digital Curation outside North America Ross Harvey Simmons College, Boston Curation Matters: 17 June 2010.
Towards a European network for digital preservation Ideas for a proposal Mariella Guercio, University of Urbino.
EBank UK: linking scientific data, scholarly communication and learning Michael Day and Rachel Heery UKOLN, University of Bath
Manjula Patel Scaling-up to Integrated Research Data Management Workshop 6 th International Digital Curation Conference Holiday Inn, Mart Plaza Chicago,
Context and Linking in the Research Lifecycle CERIF and other standards Catherine Jones Scientific Information Group Scientific Computing Department STFC.
Life Cycle Models & Principles Jake Carlson Associate Professor of Library Science Data Services Specialist Purdue University Libraries.
UKOLN is supported by: Digital Preservation Benefits Tools Project Dissemination Workshop Dr Liz Lyon, Associate Director, UK Digital Curation Centre Director,
Metadata for structural science Workshop on research metadata in context Nijmegen, 7–8 September 2010 Simon Lambert STFC e-Science UK.
UKOLN is supported by: Introduction to UKOLN Dr Liz Lyon, Director UKOLN, University of Bath, UK Grand Challenge Meeting, June a centre.
Preservation metadata and the Cedars project Michael Day UKOLN: UK Office for Library and Information Networking University of Bath
CombeDay Making Data Openly Available Simon Coles.
11 Researcher practice in data management Margaret Henty.
Infrastructure Breakout What capacities should we build now to manage data and migrate it over the future generations of technologies, standards, formats,
Open Science (publishing) as-a-Service Paolo Manghi (OpenAIRE infrastructure) Institute of Information Science and Technologies Italian Research Council.
Research Data Management 26 th April 2016 Federica Fina, Data Scientist, University of St Andrews Library.
CESSDA SaW Training on Trust, Identifying Demand & Networking
GISELA & CHAIN Workshop Digital Cultural Heritage Network
Research Data Context Preservation in SCAPE
eCrystals Federation: Open Repositories for global Open Science
Metadata for digital long-term preservation
GISELA & CHAIN Workshop Digital Cultural Heritage Network
Bird of Feather Session
eCrystals Federation: Open Repositories for global Open Science
Presentation transcript:

Requirements Gathering (work in progress) Manjula Patel, UKOLN & DCC I2S2 Models Workshop 11 th February 2010 STFC, RAL, Didcot

Outline D1.2 Requirements Report –Requirements Analysis Desk study Data Management Planning Tools –Immersive Studies –Gap Analysis

Objectives Identify requirements for a data-driven research infrastructure –Understand localised data management practices –Understand data management infrastructure in large centralised facilities Examine 3 complementary infrastructure axes: Scale and complexity: small lab equipment; institutional Installations; large scale facilities e.g. DLS & ISIS, STFC Inter-discipline: research across domain boundaries Data lifecycle: data flows and data transformations

Desk Study The Data Imperative, Managing the UKs research data for future use, UKRDS The UK Research Data Feasibility Study, Report and Recommendations to HEFCE, UKRDS,19th Dec 2008 Dealing with Data: Roles, Rights, Responsibilities and Relationships, Consultancy Report to JISC, Liz Lyon, 19th June 2007 Open Science at Web-Scale: Optimising Participation and Predictive Potential, Consultative Report to JISC and DCC, Liz Lyon, 6th November 2009 Advocacy to benefit from changes: Scholarly communications discipline-based advocacy, Final report prepared for JISC Scholarly Communications Group by Lara Burns, Nicki Dennis, Deborah Kahn and Bill Town, Publishing Directions, 9 th April 2009 Stewardship of digital research data: a framework of principles and guidelines, Responsibilities of research institutions and funders, data managers, learned societies and publishers, RIN, Jan 2008 To Share or not to Share: Publication and Quality Assurance of Research Data Outputs, Report commissioned by the Research Information Network (RIN), June 2008

Desk Study Patterns of information use and exchange: case studies of researchers in the life sciences. A report by the Research Information Network and the British Library, November 2009 Infrastructure Planning and Curation, A Comparative Study of International approaches to enabling the sharing of Research Data, Prepared by Raivo Ruusalepp for the JISC and DCC, 30th November 2008 Data Dimensions: Disciplinary Differences in Research Data Sharing, Reuse and Long term Viability. A comparative review based on sixteen case studies, A report commissioned by the DCC and SCARP Project, Key Perspectives Ltd, 18th January 2010 Harnessing the Power of Digital Data for Science and Society, Report of the Interagency Working Group on Digital Data to the Committee on Science of the National Science and Technology Council, Jan 2009 ParseInsight (Insight into digital preservation of research output in Europe), Survey Report, 9th Dec 2009 Chemistry for the Next Decade and Beyond, International Perceptions of the UK Chemistry Research Base, International Review of UK Chemistry Research, April 2009, EPSRC

Desk Study: summary Research teams capture, manage, discuss and disseminate their data in relative isolation with highly fragmented data infrastructures and poorly integrated software applications Conventional systems of publication lead to insufficient information relating to provenance of results and irreproducible experiments The processes for recognition lead to a lack of inclination and incentive to share or make all the supporting information for a study publicly available A low awareness of data curation and preservation issues leads to data loss and reduced productivity

Laboratories & Large Scale Facilities University of Cambridge (Earth Sciences) University of Cambridge (Chemistry) EPSRC National Crystallography Service DLS & ISIS, STFC

Mini Immersive Studies Visit 17th Nov 2009 Visit 24th Nov 2009 Visit 7 th & 14 th Dec 2009 (excluding ISIS User Office) Visit 15th Jan 2010 (including DLS User Office) Visit 4 th Mar 2010 (pending)

An Idealised Data Lifecycle Model Effective validation, reuse and repurposing of data requires –trust and a thorough understanding of the data –contextual information detailing how the data was generated, processed, analysed and managed Research Data includes (all information relating to an experiment): –raw, reduced, derived and results data (processed? intermediate?) –research and experiment proposals –results of the peer-review process –laboratory notebooks –equipment configuration and calibration data –wikis and blogs –metadata (context, provenance etc.) –documentation for interpretation and understanding (semantics) –administrative and safety data –processing software and control parameters

Process & Analyse Derived Data Research Concept or Experiment Design Write Research Proposal Peer-review of Proposal Conduct Experiment Generate, Create, & Collect Raw Data Check & Clean Raw Data Interpret & Analyse Results Data Storage, Archive, Preservation & Curation Appraisal, Documentation (Metadata, Context etc.) & Quality Control IPR, Embargo & Access Control Discover, Validate, Reuse & Repurpose Data Publish Research Outputs An Idealised Scientific Research Data Lifecycle Model

Profile: Earth Sciences, Cambridge Lone researcher scenario Experiment and data collection conducted at ISIS (GEM) using neutron beams Little or no shared infrastructure –Data sharing with colleagues via , ftp, memory stick etc. –Data received from ISIS is currently stored on laptops or WebDAV server Management of intermediate, derived and results data a major issue –Data managed by individual researcher on own laptop –No departmental or central institutional facility

Earth Sciences: typical workflow Martin Dove & Erica Yang

Earth Sciences: issues Processing pipeline is dependent on a suite of software –instrument specific (GUDRUN, Arial) –closed (GSAS) –open source (data2config) –written in-house (RMCProfile) Sustainability issues with regard to software tools and utilities Contextual information is not routinely captured Main analysis is reliant on scientists knowledge and experience in selecting parameters and interpreting data –much of which is not recorded or captured other than in a lab note book The actual workflow or processing pipeline is not recorded –Much of the visualisation is done within MS Excel spreadsheets Raw and reduced data are stored at ISIS All other data are managed and maintained by the individual scientist on his/her laptop

Earth Sciences: early observations… Data management needs largely so that a researcher (or another team member) can return to and validate results in the future Need department or research group level data storage and management infrastructure to capture, manage and maintain: –Metadata and contextual information (including provenance); –Control files and parameters; –Processing software; –Workflow for a particular analysis; –Derived and results data; –Links between all the datasets relating to a specific experiment or analysis Any changes should be embedded into scientists workflow and be non- intrusive

Profile: EPSRC NCS, Southampton Service provision function –Local x-ray diffraction instruments + use of DLS (beamline I19) –Full structure determination or data collection only Operates nationally across institutions Moderate infrastructure Raw data generated in-house is stored at ATLAS Data Store (STFC) Local institutional repository (eCrystals) for intermediate, derived and results data –Metadata application profile –Public and private parts (embargo system) –Digital Object Identifier, InChi Manages experiment proposals and instrument time allocation Experiments conducted and data collected by NCS scientists either in- house or at DLS (I19) Acts as interface between end-researcher and DLS (I19)

GETDATAXPREPSHELXSSHELXLENCIFERCHECKCIFBABELCML & INCHI RAWDERIVED DATARESULTS DATA.htm.hkl _0kl.jpg _h0l.jpg _hk0.jpg _crystal.jpg.prp _xs.lst _xl.lst.res.cif _checkcif.htm.mol.cml _inchi.cml Initialisation: mount new sample Collection: collect data Processing: process and correct images Solution: solve structures Refinement: refine structure CIF: produce Crystallographic Information File Validation: chemical & crystallographic checks Report: generate Crystal Structure Report CML, INChI EPSRC NCS: typical workflow

EPSRC NCS: issues Funding stream critical to service function Processing pipeline is dependent on a suite of software –instrument specific (clean and reduce raw data) –written in-house (archive, upload scripts –for transferring raw and reduced data to ATLAS; supergui) –open source (SHELX suite for data work-up) Raw data storage at ATLAS is very basic (no metadata) Sustainability issues with regard to software tools and utilities Contextual information is not routinely captured Main analysis is reliant on scientists knowledge and experience in selecting parameters and interpreting data –much of which is not recorded or captured other than in a lab note book The actual workflow or processing pipeline is not recorded Considerable amount of paper-based scheduling and record keeping Data needs to be in a form capable of being transferred to and understood by end-researcher

EPSRC NCS: early observations… Service function implies an obligation to: –Retain experiment data –Maintain administrative and safety data –Transfer data to end-researcher eCrystals repository, appears to be working well –Metadata application profile may need to be reviewed –Some issues with the underlying repository software? Labour-intensive paper-based administration and records-keeping –Paper-based system for scheduling experiments –Paper copies of Experiment Risk Assessment (ERA) get annotated by scientist and photocopied several times –Several identifiers per sample (researcher assigned; researcher institution assigned, NCS assigned) Administrative functions require streamlining between NCS and DLS –e.g. standardisation of ERA forms

Profile: Chemistry, Cambridge …site visit pending; 4 th March 2010

Profile: DLS & ISIS, STFC Operate on behalf of multiple institutions and communities Scientific (peer) and technical review of proposals for beam time allocation User offices manage administrative and safety information Several FTEs per beamline Visiting scientists need to undergo safety training and test every 6 months Large infrastructure, engineered to manage raw data –ICAT implementation of Core Scientific Metadata Model (CSMD) Derived data taken off site on laptops, removable drives etc. Results data independently worked up by individual researchers

DLS & ISIS: early observations… Service function implies an obligation to retain raw data Need to work across organisational boundaries (integrated approach) No storage or management of derived and results data (IPR and ownership issues?) CSMD and its implementation in ICAT, need to be extended –For additional info e.g. costs; preservation –For use beyond STFC Experiment/Sample identifiers based on beam line number

Gap Analysis … in progress based on idealised scientific research data lifecycle and case studies – NCS & DLS – Earth Sciences & ISIS – Cambridge Chemistry – DLS & ISIS

Data Management Planning Tools Largely aimed at an institutional context –Data Audit/Asset Framework (DAF) –Data Management Plan (DMP) Checklist –Assessing Institutional Data Assets (AIDA) Toolkit –Digital Repository Audit Method Based On Risk Assessment (DRAMBORA) –Life Cycle Information for e-Literature (LIFE) –Keeping Research Data Safe Surveys (KRDS 1 & 2) Most are heavy-weight and institution-facing –Not enough resources to implement fully and formally –Draw on particular aspects of tools –May be able to draw on results of institution facing projects (IDMB, INCREMENTAL) –DMP checklist, DAF and KRDS in project proposal and plan –DMP checklist results due 22 nd Feb 2010

Early Requirements… Basic requirement for data storage and backup facilities –Research group, department or institution level Adequate metadata and contextual information to support: –Maintenance and management –Linking together all data associated with an experiment –Referencing and citation –Authenticity –Integrity –Provenance –Discovery, search and retrieval –Preservation and curation –IPR, embargo and access management –Interoperability and data exchange

Early Requirements… Relevant Technologies –Persistent Identifiers (URIs, DOIs etc.) –Metadata schema (PREMIS, XML, CML, RDF?) –Controlled vocabularies (ontologies?) –Integrated information model (structured, linked data?) –Extensions to CSMD & ICAT –Interoperability and exchange (OAI-PMH, file formats) –Data packaging (OAI-ORE) –OAIS Representation Information? Cultural Issues –Best practice guidelines –Use of Standards –Advocacy –Training

Summary Considerable variation in requirements between individual research scientists and service facilities At present individual researcher, group, department, institution, facilities all working within their own frameworks There is merit in adopting an integrated approach which caters for all scales of science: –Efficient exchange and reuse of data across disciplinary boundaries –Aggregation and/or cross-searching of related datasets –Data mining to identify patterns or trends

Questions & Discussion