1ETD 2008_Morgan_The SPECTRa-T Project www.lib.cam.ac.uk/spectra-t/ Extracting and re-using research data from chemistry e-theses: the SPECTRa-T project.

Presentation transcript:

Slide 1: Extracting and re-using research data from chemistry e-theses: the SPECTRa-T project
Peter Morgan, SPECTRa-T Project Director
Head of Medical and Science Libraries, Cambridge University Library

Slide 2: Outline
Why SPECTRa-T?
Getting started
Mining the text
– PDFs
– .docx
Workflows
Further thoughts

Slide 3: Why SPECTRa-T?

Slide 4: "theses should be semantic and interactive" (Peter Murray-Rust, ETD 2007 keynote address)

Slide 5: SPECTRa-T background
SPECTRa-T = Submission, Preservation, & Exposure of Chemistry Teaching and Research data from Theses
SPECTRa-T funded by JISC Digital Repositories Programme
1-year project (April 2007 – March 2008)
partners:
– University of Cambridge (Chemistry + Library)
– Imperial College London (Chemistry + ICT)
team had previously worked together on SPECTRa

Slide 6: Why SPECTRa-T?
research chemists produce experimental data (materials, reactions, properties = recipes)
these data are the basis of further research
theses are a rich source of data
– c.10k chemistry papers p.a. worldwide
– a typical thesis contains preparations
– 20% will be published in research papers
– 80% are not published

Slide 7: Why SPECTRa-T?
text-mining can retrieve these data
two basic data types:
– Named Chemical Entities (NCEs) (e.g. words/phrases describing properties, procedures, instruments, etc)
– Chemical Objects (COs) (e.g. molecules, spectra)
our Semantic Web aim:
– extract both data types
– create RDF triples and chemical objects
– link them to enable semantic querying

Slide 8: RDF triples
RDF triples are statements containing a subject (resource), predicate (property), and object (value)
example: water (subject) boils at (predicate) 100 degrees Celsius (object)
the value of one property can be used as the resource for another
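The subject/predicate/object idea on this slide can be sketched in a few lines of Python. This is only an illustration using plain tuples, not the project's triplestore; the `ex:` identifiers, the `Triple` type and the `follow` helper are all hypothetical names introduced here.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # the resource being described
    predicate: str  # the property
    obj: str        # the value


# Hypothetical identifiers: "water boils at 100 degrees Celsius",
# with the boiling-point value reused as the subject of further triples.
triples = [
    Triple("ex:water", "ex:boilsAt", "ex:temp100C"),
    Triple("ex:temp100C", "ex:hasValue", "100"),
    Triple("ex:temp100C", "ex:hasUnit", "degrees Celsius"),
]


def objects_of(subject: str, predicate: str) -> list[str]:
    """Return every value stated for this subject/predicate pair."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


def follow(subject: str, *predicates: str) -> list[str]:
    """Walk a chain of predicates, using each value as the next subject."""
    current = [subject]
    for predicate in predicates:
        current = [o for s in current for o in objects_of(s, predicate)]
    return current


print(follow("ex:water", "ex:boilsAt", "ex:hasUnit"))
```

The `follow` call shows the last bullet in action: the value of `ex:boilsAt` becomes the resource whose unit is then looked up.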

Slide 9: Getting started

Slide 10: Test material
100 PDF chemistry theses from Caltech, MIT, St Andrews & Stirling
– some MIT theses OCR-derived (later removed from analysis because of misassigned characters)
20 Word chemistry theses from Cambridge (converted to Office Open XML (.docx) mark-up format)

Slide 11: Software
OSCAR3 (Open Source Chemistry Analysis Routines) as text-mining tool
– developed by SciBorg Project (Cambridge)
– natural language processing to identify chemical terms
– converts human-readable text into XML marked-up content that machines can manipulate
– prefers SciXML documents
– uses ChEBI Ontology for chemical name recognition

Slide 12: OSCAR3 parsing
[Figure: highlighted experimental procedures created by OSCAR3]

Slide 13: Mining the text

Slide 14: PDF...
wraps text in simple high-level elements
is optimized for human, not machine, readability
produces poor SciXML
– line breaks = loss of continuous text and paragraph structures
– chemical drawings replaced by text and disconnected lines
– loss of subscript and superscript characters
– non-printing characters
– OCR-derived text produces erroneous character assignment (e.g. i, l, 1)

Slide 15: PDF processing
SPECTRa-T tools...
– removed line-breaks
– removed non-printing characters
– removed text fragments resulting from broken drawings
– used UTF-8 Unicode to preserve Greek characters (lost in ASCII)
(note: PDF/A can avoid some but not all such problems)
text then converted to SciXML
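A rough sketch of the first, second and fourth clean-up steps listed above, in Python. It is not the SPECTRa-T tooling: the function names are hypothetical, and the rule that a blank line marks a paragraph break is an assumption made for the example. Removing fragments left by broken drawings is not attempted here.

```python
import re
import unicodedata


def strip_non_printing(text: str) -> str:
    """Drop control and format characters (Unicode categories C*), keeping newlines."""
    return "".join(
        ch for ch in text
        if ch == "\n" or not unicodedata.category(ch).startswith("C")
    )


def clean_pdf_text(raw: str) -> str:
    """Rejoin hard-wrapped lines into paragraphs and remove non-printing characters."""
    # Normalise line endings, page breaks and tabs before filtering.
    text = (raw.replace("\r\n", "\n").replace("\r", "\n")
               .replace("\x0c", "\n").replace("\t", " "))
    text = strip_non_printing(text)
    # Assumption: a blank line separates paragraphs; other breaks are layout only.
    paragraphs = re.split(r"\n\s*\n", text)
    joined = (
        " ".join(line.strip() for line in p.split("\n") if line.strip())
        for p in paragraphs
    )
    return "\n\n".join(p for p in joined if p)


raw = "The \u03b1-anomer was\nisolated in 85%\x00 yield.\r\n\r\nNext para\u200bgraph.\x0c"
cleaned = clean_pdf_text(raw)
print(cleaned)

# UTF-8 round-trips the Greek letter; forcing ASCII silently drops it.
utf8_copy = cleaned.encode("utf-8").decode("utf-8")
ascii_copy = cleaned.encode("ascii", errors="ignore").decode("ascii")
```

Comparing `utf8_copy` with `ascii_copy` shows why the slide singles out UTF-8: the alpha survives in the first and disappears from the second.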

Slide 16: SciXML from PDF
OSCAR retrieves Named Chemical Entities
OSCAR creates SAFXML (Standoff Annotated Format XML) output
NCE metadata transformed by XSL stylesheets into RDF triples
RDF triplestore can be queried
BUT... OSCAR cannot identify Chemical Objects
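The step from stand-off annotations to queryable triples might look roughly like the sketch below. Two loud assumptions: the XML layout in `SAF_LIKE` is a simplified, hypothetical stand-in and not OSCAR3's actual SAFXML schema, and Python's `xml.etree.ElementTree` is used here in place of the XSL stylesheets the project describes. The `ex:` predicates are likewise invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-off annotation file: each <annot> points at a character
# range in the thesis text and records the entity type and the matched words.
SAF_LIKE = """<saf document="thesis-001">
  <annot id="a1" type="compound" from="120" to="127"><surface>toluene</surface></annot>
  <annot id="a2" type="reaction" from="140" to="149"><surface>oxidation</surface></annot>
  <annot id="a3" type="compound" from="300" to="307"><surface>toluene</surface></annot>
</saf>"""


def annotations_to_triples(xml_text: str) -> list[tuple[str, str, str]]:
    """Turn each annotation into (subject, predicate, object) statements."""
    root = ET.fromstring(xml_text)
    doc = "ex:" + root.get("document")
    triples = []
    for annot in root.iter("annot"):
        node = f"{doc}#{annot.get('id')}"
        triples.append((doc, "ex:mentions", node))
        triples.append((node, "ex:entityType", annot.get("type")))
        triples.append((node, "ex:surfaceText", annot.findtext("surface")))
    return triples


def mentions_of(triples: list[tuple[str, str, str]], surface: str) -> list[str]:
    """Query: which annotation nodes carry this surface text?"""
    return sorted(s for s, p, o in triples
                  if p == "ex:surfaceText" and o == surface)


triples = annotations_to_triples(SAF_LIKE)
print(mentions_of(triples, "toluene"))
```

Note that nothing in this output describes a molecule or spectrum as such, which mirrors the slide's closing caveat: the PDF route yields named entities only.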

Slide 17: .docx processing
Word theses converted to Office Open XML (.docx) using MS Word 2007
XML is converted into rich SciXML
SciXML structure enables OSCAR3 to identify Experimental sections and extract Chemical Objects
XML converted to CML (Chemical Markup Language)
URIs assigned to CO metadata & associated with NCEs
CML COs deposited in lightweight data repository
RDF triplestore and CO data repository, linked by URIs, can now be queried semantically
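Part of what makes .docx easier to mine is that the file is a ZIP archive whose main text lives in `word/document.xml` as explicit paragraph (`w:p`) and text-run (`w:t`) elements. The sketch below reads those paragraphs with the Python standard library. It is not the SPECTRa-T converter; `docx_paragraphs` and `make_demo_docx` are hypothetical helpers, and the demo archive holds only the one part needed for the example, so it is not a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source) -> list[str]:
    """Return the text of each non-empty paragraph in a .docx file or file-like object."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs


def make_demo_docx(paragraphs: list[str]) -> io.BytesIO:
    """Build a minimal in-memory archive with just word/document.xml (demo only)."""
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


demo = make_demo_docx(["Experimental", "Compound 1 was prepared from toluene."])
print(docx_paragraphs(demo))
```

Because paragraph boundaries are explicit in the mark-up, there is no need to guess them from line breaks as with PDF-derived text.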

Slide 18: Workflows

Slide 19: PDF workflow
[Diagram: processing of PDF e-theses to yield named chemical entities in a queryable RDF triplestore. PDF flow: thesis input (PDF document, text) → SPECTRa-T text processing tools → SciXML → OSCAR3 → SAFXML → XSL stylesheet → RDF → triplestore (NCEs) → query. Text and lines in red indicate SPECTRa-T tools.]

Slide 20: .docx workflow
[Diagram: processing of DOCX e-theses to yield named chemical entities and linked chemical objects in a semantically queryable linked RDF triplestore and data repository. Elements shown in the DOCX flow: thesis input (.docx document, XML mark-up); SPECTRa-T text processing tools; SciXML; OSCAR3; SAFXML; XSL stylesheet; RDF; triplestore (NCEs); data XML; create URI; add URI link; CML chemical objects; URI; data repository (COs); semantic query. Text and lines in red indicate SPECTRa-T tools.]
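The "add URI link" idea in this workflow, where triples about named entities point at chemical objects held in a separate repository, can be illustrated with a toy example. Everything here is hypothetical: the `ex:` predicates, the `http://example.org/` URIs, the in-memory dictionary standing in for the data repository, and the placeholder strings standing in for real CML documents.

```python
# Triples about named chemical entities; some point at a chemical object by URI.
triples = [
    ("ex:thesis-001#a1", "ex:surfaceText", "toluene"),
    ("ex:thesis-001#a1", "ex:hasChemicalObject", "http://example.org/co/1"),
    ("ex:thesis-001#a2", "ex:surfaceText", "oxidation"),
    ("ex:thesis-002#a7", "ex:surfaceText", "toluene"),
    ("ex:thesis-002#a7", "ex:hasChemicalObject", "http://example.org/co/9"),
]

# Stand-in for the data repository: URI -> stored chemical object.
# The values are placeholders, not valid CML.
repository = {
    "http://example.org/co/1": '<molecule id="co1"/>',
    "http://example.org/co/9": '<spectrum id="co9"/>',
}


def chemical_objects_for(term: str) -> list[tuple[str, str]]:
    """Find entities whose text matches `term`, then fetch their linked objects by URI."""
    nodes = {s for s, p, o in triples if p == "ex:surfaceText" and o == term}
    uris = sorted({o for s, p, o in triples
                   if p == "ex:hasChemicalObject" and s in nodes})
    return [(uri, repository[uri]) for uri in uris if uri in repository]


for uri, stored in chemical_objects_for("toluene"):
    print(uri, stored)
```

The shared URI is the only thing the two stores need to agree on, which is what lets a text-level query return data-level objects.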

Slide 21: Further thoughts

Slide 22: Caveats
SPECTRa-T a proof-of-concept approach
restricted to a few chemistry sub-disciplines
investigated only 2 file formats
dangerous to generalise too far
but our specific observations raise questions about broader implications...

Slide 23: File formats
PDF has some value for text-mining
born-digital PDF is better than OCR-derived
PDF/A will resolve some problems
but both still contain broken text and unreliable structure for text-mining
– (and most legacy material is still only in PDF)
XML better at providing structured documents for text-mining
– (and may be good for preservation as well)

Slide 24: Role of institutional repository
preservation versus re-usability?
should a central IR require both PDF and Word/XML versions of a thesis?
which file format(s) should be openly accessible?
– cf. UKPMC XML policy for research papers
should subject data be held in subject-specific data repositories managed by domain experts?
can subject-based departmental repositories co-exist with a central IR?
how can librarians and repository managers understand researchers' needs?

Slide 25: IPR
institutions can best realise the value of their research data assets by encouraging their discovery
facts cannot be copyrighted
derived data and databases raise complex legal issues
ownership and licensing issues need urgent clarification

Slide 26: Fit for purpose?
need to be clear why we collect theses
are they intended to be fully re-usable?
what does this entail for each subject?
do librarians understand researchers?
do thesis regulations ensure appropriate formats and submission processes?
do IPR policies facilitate re-use?
in short, are our e-theses fit for purpose?

Slide 27: Thanks...
thanks to my colleagues on the Project team
– at Cambridge: Jim Downing, Peter Murray-Rust, Diana Stewart, Alan Tonge, Joe Townsend
– at Imperial College London: Matt Harvey, Henry Rzepa
thanks to the Joint Information Systems Committee (JISC) for funding the project (see for Final Report)
... and thanks to you for listening!