eCrystals Federation: Open Repositories for Data-driven Science. Dr Liz Lyon, UKOLN, University of Bath, UK; Dr Simon Coles, University of Southampton, UK. Chemical Informatics Workshop, Manchester, March 2008. This work is licensed under a Creative Commons Licence Attribution-ShareAlike 3.0.
Themes: 1. Context: institutional data repositories, crystallography exemplar; 2. Scale: repository federations; 3. Longevity: digital curation and preservation; 4. Integration: semantic challenges
eBank Project – building the eCrystals Data Repository: ePrints Southampton; institutional repository exemplar; embedded in workflow; started Sept 2003; scholarly knowledge cycle context; UKOLN-led interdisciplinary team
Scaling Up Report, Phase 3 findings: data policy should reflect lab practice & institutional model; diverse lab practice; LIMS proprietary formats; data quality criteria/validation; prior publication problem; we need automated assignment of terms for data discovery; no discipline preservation model
The eCrystals Repository: ePrints.org v3.0. Bragg's law: nλ = 2d sin θ
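The Bragg relation shown on this slide, nλ = 2d sin θ, links the X-ray wavelength λ and the diffraction angle θ to the spacing d between lattice planes. As a minimal worked sketch (the function name and the example angle are illustrative assumptions, not taken from the talk), solving for d looks like this:

```python
import math


def d_spacing(wavelength: float, theta_degrees: float, order: int = 1) -> float:
    """Solve n * lambda = 2 * d * sin(theta) for d.

    The result is in the same length unit as `wavelength`.
    """
    sin_theta = math.sin(math.radians(theta_degrees))
    if sin_theta <= 0:
        raise ValueError("theta must lie strictly between 0 and 180 degrees")
    return order * wavelength / (2.0 * sin_theta)


# Mo K-alpha radiation (0.71073 angstrom) is a common laboratory source;
# the 10 degree angle is an arbitrary illustrative value.
MO_K_ALPHA = 0.71073
d = d_spacing(MO_K_ALPHA, 10.0)
print(f"d = {d:.4f} angstrom")
```

Substituting the result back into 2d sin θ should recover the wavelength, which is a quick sanity check on any such calculation.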
Repository Foundations. Using simple Dublin Core: Crystal structure; Title (Systematic IUPAC Name); Authors; Affiliation; Creation Date. Additional chemical information through Qualified Dublin Core: Empirical formula; International Chemical Identifier (InChI); Compound Class & Keywords. Specifies which datasets are present in an entry. Application Profile. DOI links. Rights & Citation. Learned society + subject repository support.
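To make the metadata layering concrete, the following is a minimal sketch of how one entry might be serialised. The `dc` namespace is the standard Dublin Core element set; the `ex` namespace, its element names (`affiliation`, `empiricalFormula`, `inchi`, `compoundClass`), the use of `dc:type` for "Crystal structure", and all sample values are hypothetical stand-ins, since the actual eBank application profile terms are not given on the slide.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"        # simple Dublin Core
EX = "http://example.org/ecrystals-sketch/"    # hypothetical namespace for qualified terms

ET.register_namespace("dc", DC)
ET.register_namespace("ex", EX)


def build_record(entry: dict) -> ET.Element:
    """Build a flat XML record from a dictionary describing one crystal structure."""
    record = ET.Element("record")

    def add(namespace: str, tag: str, value: str) -> None:
        child = ET.SubElement(record, f"{{{namespace}}}{tag}")
        child.text = value

    # Simple Dublin Core fields listed on the slide.
    add(DC, "type", "Crystal structure")       # assumed mapping
    add(DC, "title", entry["iupac_name"])
    for author in entry["authors"]:
        add(DC, "creator", author)
    add(DC, "date", entry["created"])

    # Chemistry-specific fields; element names here are illustrative only.
    add(EX, "affiliation", entry["affiliation"])
    add(EX, "empiricalFormula", entry["formula"])
    add(EX, "inchi", entry["inchi"])
    add(EX, "compoundClass", entry["compound_class"])
    for keyword in entry["keywords"]:
        add(DC, "subject", keyword)
    return record


# Invented sample entry (benzene), for illustration only.
sample = {
    "iupac_name": "benzene",
    "authors": ["A. Researcher", "B. Crystallographer"],
    "affiliation": "Example University",
    "created": "2008-03-01",
    "formula": "C6 H6",
    "inchi": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",
    "compound_class": "organic",
    "keywords": ["aromatic"],
}
record = build_record(sample)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Keeping the general fields in plain Dublin Core means a generic harvester can still read title, creator and date, while the chemistry-specific terms sit in their own namespace.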
Federation interoperability & linking services: roll-out in 2 phases led by University of Southampton; establish Federation policies, application profile, mappings; bi-directional links with derived articles in publisher repositories, IUCr, Royal Society of Chemistry (RSC), Chemistry Central: scholarly knowledge cycle; StOReLink project - test linking options: StORe middleware and CLADDIER; OAI-ORE Testbed eChemistry project
Laboratory practice & workflow: community standard CIF; mixed lab practice – central service facility versus single staff crystallographer in department; achieve end-to-end workflow; challenge of instrument manufacturers with proprietary formats; Repository Lite for smaller lab operations? X-ray diffractometers
eBank-UK Phase 3 Curation & Preservation Study: Sustainability issues uk/curation/ Examined four main areas: 1. Audit and certification (TRAC, DRAMBORA, NESTOR, ISO International repository audit and certification BOF Group); 2. The Open Archival Information System (OAIS) and Representation Information (RI); 3. eBank-UK application profile and preservation metadata; 4. ePrints.org repository platform. Recommendations: self-assessment using DRAMBORA; consider Representation Information in wider context; develop preservation strategy; capture preservation metadata - PREMIS
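As a rough illustration of what "capture preservation metadata" can mean in practice, the sketch below records size, format and a fixity digest for one deposited file. The dictionary keys are only loosely modelled on the kind of object characteristics PREMIS describes and are not PREMIS element names; the function name and sample content are hypothetical.

```python
import hashlib


def preservation_summary(name: str, content: bytes, format_name: str) -> dict:
    """Return basic technical metadata that lets a later audit detect silent changes."""
    return {
        "name": name,
        "size_bytes": len(content),
        "format": format_name,
        "digest_algorithm": "SHA-256",
        "digest": hashlib.sha256(content).hexdigest(),
    }


# Invented sample content standing in for a deposited CIF file.
cif_bytes = b"data_example\n_cell_length_a 10.234(2)\n"
summary = preservation_summary("example.cif", cif_bytes, "CIF")
print(summary)
```

Recomputing the digest at a later date and comparing it with the stored value is one simple fixity check a repository could run.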
Semantic issues: Crystallographic schema underpins CIF (Crystallographic Information Framework), but is limited to data parameters, e.g. cell_length_a
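The point about CIF being limited to data parameters can be seen by reading one. The following is a minimal sketch that collects single-line `_name value` items from CIF text (in a CIF file, data names carry a leading underscore, e.g. `_cell_length_a`). It deliberately ignores loops and multi-line values, the helper names are hypothetical, and the sample values are invented; a real application would use an established CIF parser.

```python
import re


def parse_cif_items(text: str) -> dict:
    """Collect simple one-line `_name value` pairs; loops and text blocks are skipped."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:          # a bare data name (e.g. inside loop_) has no value
            items[parts[0]] = parts[1].strip().strip("'\"")
    return items


def numeric_value(raw: str) -> float:
    """Drop a trailing standard uncertainty such as '(2)' and convert to float."""
    return float(re.sub(r"\(\d+\)$", "", raw))


# Invented sample data for illustration only.
SAMPLE = """\
data_example
_cell_length_a    10.234(2)
_cell_length_b    11.500(3)
_cell_angle_alpha 90
_symmetry_space_group_name_H-M 'P 21/c'
"""

items = parse_cif_items(SAMPLE)
a = numeric_value(items["_cell_length_a"])
print(a, items["_symmetry_space_group_name_H-M"])
```

Everything recovered this way is a measured or derived parameter; nothing in these items says what kind of compound it is or why it matters, which is the gap a chemistry vocabulary would need to fill.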
IUCr Acta Cryst 1992: limited set of keywords describing methods, properties & applications, compounds, attributes. No established crystallography dictionary or controlled vocabulary to give chemistry context.
What do we want to do? Support depositors' keyword/term assignment; facilitate and improve automated indexing; support advanced search / browse; allow metadata validation & enhancement; apply across a heterogeneous Federation; cross search, cross browse functionality; link data to all associated digital objects; develop domain semantics / vocabulary; use domain-specific authority files; mine to discover rather than find; achieve full inter-disciplinary integration
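One of the aims above, supporting depositors with term assignment, can be sketched as a lookup against a controlled vocabulary. The vocabulary below is entirely hypothetical (the slides note that no established crystallography vocabulary exists), and simple variant matching stands in for whatever indexing method a real service would use.

```python
import re

# Hypothetical controlled vocabulary: preferred term -> accepted variants.
VOCABULARY = {
    "Metal-organic framework": ["metal-organic framework", "mof"],
    "Hydrogen bonding": ["hydrogen bond", "hydrogen bonding", "h-bond"],
    "Polymorphism": ["polymorph", "polymorphism", "polymorphic"],
}


def suggest_terms(text: str, vocabulary: dict = VOCABULARY) -> list:
    """Return preferred terms whose variants occur as whole words in the text."""
    lowered = text.lower()
    found = []
    for preferred, variants in vocabulary.items():
        for variant in variants:
            if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
                found.append(preferred)
                break  # one match per preferred term is enough
    return found


suggestions = suggest_terms("A new polymorph stabilised by hydrogen bonding")
print(suggestions)
```

Mapping variants onto a single preferred term is what lets cross-search work across a heterogeneous federation, since each repository can then index on the same label.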
Some (semantic) issues: How are terms assigned? Informal tags and/or structured KOS? How is a vocabulary curated and maintained? Can a vocabulary be transformed into a (Semantic Web related understanding) ontology? Disambiguation, acronyms, IUPAC names; persistent identification for data citation; granularity of data citation; data (and metadata) quality, provenance, validation; embedding within complex workflows. Use collaborative social approaches? Community adoption: becomes part of the culture.
Questions? Slides will be available at :