Enhancing access to research data: the challenge of crystallography Rachel Heery, Monica Duke, Michael Day (UKOLN, University of Bath) Leslie Carr, Simon Coles (University of Southampton) UKOLN: a centre of expertise in digital information management JCDL 2005, June 7-11, Denver
Enhancing access to research data: overview Crystallography as an exemplar Impact of digital technologies on scientific research process Need new modes of data curation eBank project: applying digital library techniques to support data curation Next steps
Changes in scientific research process Increasing data volumes from eScience / Grid-enabled / cyber-infrastructure applications, big science Changing research methods: high throughput technologies, automation, smart labs Potential for re-use of data, new inter-disciplinary research Different types of data (observational, experimental, computational) have different stewardship requirements
Data Overload! How do we disseminate? EPSRC National Crystallography Service The data deluge: crystallography
Data overload & the publication bottleneck [Figure labels: 25,000,000; 2,000, ,000]
Current Publishing Process Journal articles: aims, ideas, context, conclusions – only most significant data Raw & underlying data required by peers not readily available
Context: existing data repositories National data archives: –UK Data Archive, Arts and Humanities Data Service, US National Archives and Records Administration (NARA), Atlas Datastore Discipline specific archives: –GenBank, Protein Data Bank Crystallography archives –Cambridge Crystallographic Data Centre (Cambridge Structural Database), Indiana University Molecular Structure Center (Crystal Data Server, Reciprocal Net), FIZ Karlsruhe (Inorganic crystals), Toth Information Systems (CHRYSTMET) Journals require deposit of data to support articles –Typically deposit of summary data…. partial coverage
Crystallography workflow RAW DATA, DERIVED DATA, RESULTS DATA Initialisation: mount new sample on diffractometer & set up data collection Collection: collect data Processing: process and correct images Solution: solve structures Refinement: refine structure CIF: produce CIF (Crystallographic Information File) Validation: chemical & crystallographic checks
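The workflow stages above can be sketched as an ordered pipeline. This is only an illustrative model: the stage names come from the slide, while the `Stage` enum, `next_stage` and `remaining` are hypothetical names and are not part of any eBank software.

```python
from enum import IntEnum
from typing import List, Optional


class Stage(IntEnum):
    """Workflow stages in the order listed on the slide."""
    INITIALISATION = 1  # mount sample, set up data collection
    COLLECTION = 2      # collect data
    PROCESSING = 3      # process and correct images
    SOLUTION = 4        # solve structures
    REFINEMENT = 5      # refine structure
    CIF = 6             # produce Crystallographic Information File
    VALIDATION = 7      # chemical and crystallographic checks


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after validation."""
    if stage is Stage.VALIDATION:
        return None
    return Stage(stage + 1)


def remaining(stage: Stage) -> List[Stage]:
    """Return every stage still to be completed after `stage`."""
    return [s for s in Stage if s > stage]


if __name__ == "__main__":
    print(next_stage(Stage.COLLECTION).name)
    print([s.name for s in remaining(Stage.REFINEMENT)])
```

An ordered enum makes it straightforward to record how far through the workflow each data file has progressed.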
eBank UK project overview JISC funded in 2003, now in Phase 2 to 2006 Joint effort between crystallographers, computer scientists, digital library researchers Investigating contribution of existing digital library technologies to enable publication at source Partners have interest in dissemination of chemistry research data, open access, OAI, institutional repositories
eBank project team University of Bath, UKOLN Michael Day, Monica Duke, Rachel Heery, Liz Lyon, Traugott Koch University of Southampton, School of Chemistry Simon Coles, Jeremy Frey, Mike Hursthouse University of Southampton, School of Electronics and Computer Science Leslie Carr, Chris Gutteridge University of Manchester, PSIgate John Blunden-Ellis
eBank phase one: achievements Gathered requirements from crystallographers Established pilot institutional repository for crystallography data at Southampton with web interface Developed a demonstrator aggregator service at UKOLN (CCDC exploring aggregation service) Developed appropriate schema Demonstrated a search interface as an embedded service at PSIgate portal Demonstrated an added value service linking research data to papers (one-off)
Institutional repositories…publication at source Institution establishes repository(s) Institution pro-actively supports deposit process OAI provides basis for interoperability Potential for added value services And/Or ….international subject based archives?
Crystallography good fit…. Crystallography has well defined data creation workflow Tradition of sharing using standard file format, the Crystallographic Information File (CIF) What about other chemistry sub-disciplines? other scientific disciplines?
Data Flow in eBank UK [Diagram: data files and metadata are created and submitted to the institutional repository, which stores/links them and presents HTML; the eBank aggregator harvests the metadata (XML) via OAI-PMH, indexes and searches it, and presents HTML]
Southampton digital repository
Access to ALL underlying data
OAI-PMH: harvesting and aggregating eBank aggregator at UKOLN demo/ Demonstrating potential for linking between data and journal article
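A harvesting step of this kind might look like the sketch below. The `ListRecords` verb, the `metadataPrefix` and `resumptionToken` arguments, and the response namespace are defined by OAI-PMH 2.0; the base URL, the sample response and the function names are assumptions made for illustration, and no network request is made.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_list_records_url(base_url: str,
                           metadata_prefix: str = "oai_dc",
                           resumption_token: Optional[str] = None) -> str:
    """Build a ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: when it is
    supplied, metadataPrefix must not be sent again.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text: str) -> Tuple[List[str], Optional[str]]:
    """Return (record identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in
           root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Hypothetical response used only to exercise the parser.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(build_list_records_url("http://repository.example.org/oai", "ebank_dc"))
    print(parse_list_records(SAMPLE_RESPONSE))
```

An aggregator would repeat the request with each returned resumption token until none is returned, then index the harvested records for searching.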
Embedded search service at PSIgate PSIgate subject gateway: service provider
Schema for records made available for harvesting Data holding (collection of files associated with experiment) Qualified Dublin Core data elements plus additional chemical properties –Empirical formula –International Chemical Identifier (InChI) –Compound Class Individual data files Separate records for stage status of each file Description set wrapped into one XML record using METS Research metadata/data as a complex object
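To make the shape of such a record concrete, the sketch below builds a small XML description of a data holding using Dublin Core elements plus the three chemical properties listed above. The Dublin Core namespace is the real one; the `ebank` namespace URI, the element names `empiricalFormula`, `inchi` and `compoundClass`, the wrapper element and all sample values are assumptions, not the published ebank_dc schema, and the METS wrapping is omitted.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
EBANK = "http://example.org/ebank_dc/"  # hypothetical namespace URI


def build_record(identifier: str, title: str, formula: str,
                 inchi: str, compound_class: str) -> ET.Element:
    """Build an illustrative data-holding record as an XML element."""
    ET.register_namespace("dc", DC)
    ET.register_namespace("ebank", EBANK)
    root = ET.Element(f"{{{EBANK}}}record")

    def add(namespace: str, name: str, text: str) -> None:
        child = ET.SubElement(root, f"{{{namespace}}}{name}")
        child.text = text

    # Dublin Core elements
    add(DC, "identifier", identifier)
    add(DC, "title", title)
    add(DC, "type", "CrystalStructure")
    # Additional chemical properties (element names are assumptions)
    add(EBANK, "empiricalFormula", formula)
    add(EBANK, "inchi", inchi)
    add(EBANK, "compoundClass", compound_class)
    return root


if __name__ == "__main__":
    record = build_record(
        "http://repository.example.org/structure/1",  # placeholder
        "Example crystal structure",
        "C6H6",
        "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",
        "organic",
    )
    print(ET.tostring(record, encoding="unicode"))
```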
eBank data model (model input: Andy Powell, UKOLN) [Diagram] Data side: crystal structure (data holding), crystal structure report (HTML) and datasets, described by an ebank_dc record (XML) with dc:type=CrystalStructure, dc:identifier, and dcterms:references linking to the eprint Eprint side: eprint jump-off page (HTML) and eprint manifestation (e.g. PDF), described by an oai_dc record (XML) with dc:type=Eprint and/or Text, dc:identifier, and dcterms:isReferencedBy linking back to the data Deposit: both are deposited in institutional repositories Harvesting: eBank UK aggregator service, ePrint UK aggregator service, and other aggregators and services harvest via OAI-PMH using ebank_dc and/or oai_dc
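The linking in this model can be illustrated with a short sketch: an aggregator that has harvested data records carrying dcterms:references can invert those links to obtain the dcterms:isReferencedBy view for each eprint. The `Record` class, the `referenced_by` function and the identifiers are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Record:
    """Minimal stand-in for a harvested metadata record."""
    identifier: str   # dc:identifier
    dc_type: str      # dc:type, e.g. "CrystalStructure" or "Eprint"
    references: List[str] = field(default_factory=list)  # dcterms:references


def referenced_by(records: Iterable[Record]) -> Dict[str, List[str]]:
    """Invert dcterms:references: map each target to the records citing it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        for target in record.references:
            index[target].append(record.identifier)
    return dict(index)


if __name__ == "__main__":
    harvested = [
        Record("data:1", "CrystalStructure", ["eprint:9"]),
        Record("data:2", "CrystalStructure", ["eprint:9"]),
        Record("eprint:9", "Eprint"),
    ]
    print(referenced_by(harvested))
```

Keeping the link on both sides lets a service start from either the paper or the data and reach the other.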
Creating the metadata Potential to embed deposit and dissemination in the chemist's workflow in an automated way
Data Collection [Flowchart labels: Sample Tray; BruNo Mount; Setup via GUI; PreScans; Diffraction (Yes/No); Unit Cell; Success; Strategy; Data Collection; Data Process; BruNo Unmount; System Y]
eBank phase two work areas Sub-disciplines of chemistry and physical sciences Pursue generic data model Use of identifiers for citing datasets Subject approach to discovering research data Access to research data in teaching and learning context Liaise with other digital repository initiatives
For the future… Who provides added value services? –Authority files, automated subject indexing, annotation, data mining, visualisation What are the preservation issues? –UK Digital Curation Centre –National Science Board Draft report on long-lived data collections How to manage complex object descriptions within OAI Digital curation of research data presents new roles for scientists, computer scientists, data managers…. data scientists
Thank you. Comments, questions? Acknowledgement to all project partners for their contributions to this presentation.