The use of OCR in the digitisation of herbarium specimens
Robyn E Drinkwater, Robert Cubey & Elspeth Haston

Presentation transcript:

What is happening in digitisation?

… and these minimal data records are going to need data added to them.

What are the options when using optical character recognition (OCR)?
– Parse OCR text directly into the database fields
– Use OCR data to prepare the specimens for manual / semi-automated data entry
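To make the first option concrete, here is a minimal sketch of parsing OCR text into database fields with regular expressions. The field names, the label patterns and the sample label are all assumptions made for illustration; they are not taken from the RBGE workflow, and real herbarium labels vary far more than these patterns allow.

```python
import re

# Hypothetical label patterns, for illustration only.
FIELD_PATTERNS = {
    "collector": re.compile(r"(?:Coll\.|Leg\.|Collector)[:\s]+(.+)", re.IGNORECASE),
    "date": re.compile(r"\b(\d{1,2}\s+[A-Za-z]{3,9}\.?\s+\d{4})\b"),
    "altitude_m": re.compile(r"(?:Alt\.?|Altitude)[:\s]+(\d+)\s*m\b", re.IGNORECASE),
}


def parse_label(ocr_text):
    """Return a dict of field -> extracted string, or None where nothing matched."""
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        record[field] = match.group(1).strip() if match else None
    return record


# Made-up label text, not a real specimen.
sample = "Coll.: A. Example 1234\n12 May 1957\nAlt. 850 m"
parsed = parse_label(sample)
```

Every field that fails to match is left as None, which hints at why direct parsing is the harder of the two options: each unmatched or mis-read field still needs manual checking.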

We have had a digitisation project running to digitise all the specimens from SW Asia and the Middle East at RBGE.
Minimal data had been captured originally*:
– Filing name
– Geographical filing region
– Barcode
We have been routinely processing all our specimen images through ABBYY OCR software.
* E Haston, R Cubey, DJ Harris (2011). Data concepts and their relevance for data capture in large scale digitisation of biological collections. International Journal of Humanities and Arts Computing 6 (1-2),

Exploring the data…

Step One
We used the OCR output text to pull out over 7,000 specimen images and associated data records.
These were then prepared into batches:
– some random
– some sorted by collector and / or country
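The selection and batching in Step One could be sketched roughly as follows. This is an illustration under assumptions, not the procedure used at RBGE: each record is assumed to be a dict with hypothetical keys "ocr_text", "collector" and "country", and the batch size is arbitrary.

```python
import random
from collections import defaultdict


def select_records(records, search_terms):
    """Keep records whose OCR text contains any search term (case-insensitive)."""
    terms = [t.lower() for t in search_terms]
    return [r for r in records if any(t in r["ocr_text"].lower() for t in terms)]


def sorted_batches(records, batch_size):
    """Group records by (collector, country), then cut each group into batches."""
    groups = defaultdict(list)
    for r in records:
        # Missing values become "" so that the group keys stay sortable.
        key = (r.get("collector") or "", r.get("country") or "")
        groups[key].append(r)
    batches = []
    for key in sorted(groups):
        group = groups[key]
        for i in range(0, len(group), batch_size):
            batches.append(group[i:i + batch_size])
    return batches


def random_batches(records, batch_size, seed=0):
    """Shuffle a copy of the records and cut it into batches (the unsorted control)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]
```

Keeping a random set alongside the sorted batches is what allows the two to be compared on digitisation time later.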

Step Two
A team of six digitisers at RBGE completed a series of trials.
They used two different protocols for data entry:
– complete records
– partial records (including collector and geographical information but not habitat and description)
In total, 7,200 specimens were processed.

Results…
– Compared to unsorted, random specimens, those which were sorted based on data from the OCR output were quicker to digitise.
– Of the methods tested here, the most efficient used a protocol based on partial data entry, working with specimens which had been filtered by Collector and Country.

The human factor…
– Digitisation staff preferred working with sorted specimens.
– They also preferred working with physical specimens rather than images.

Some more thoughts…
– This work is more easily applied than parsing data from the OCR output.
– It can be used in conjunction with other tools later in the digitisation process, since these other processes will almost certainly be more efficient with sorted batches of specimens.
– Other tasks can also be built on top of this, e.g. condition assessment, QC, etc.

It’s surprising what can be used to help filter specimens – the black art of search terms!
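One reason search terms can feel like a black art is that OCR output often misreads characters, so an exact match on a collector or country name can miss specimens. The sketch below shows one possible way to tolerate such errors, using approximate matching from Python's standard difflib module. It is an assumption made for illustration, not the method used in this work; the cutoff value and the sample strings are made up.

```python
import difflib
import re


def fuzzy_contains(ocr_text, term, cutoff=0.8):
    """True if any word in the OCR text is a close match for the search term."""
    words = re.findall(r"\w+", ocr_text.lower())
    matches = difflib.get_close_matches(term.lower(), words, n=1, cutoff=cutoff)
    return bool(matches)


# "Turkcy" stands in for a typical OCR misreading of "Turkey".
hit = fuzzy_contains("Flora of Turkcy 1957", "Turkey")
miss = fuzzy_contains("Flora of Turkcy 1957", "Iran")
```

Short terms are easily confused with one another under approximate matching, so the cutoff would need tuning against real OCR output before relying on it.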

Acknowledgments
– The digitisation team at RBGE: Nicky Sharp, David Braidwood, Muhammad Ghazali, Lorna Glancy, Dorota Jaworska, Esther Nieto
– The Andrew W Mellon Foundation
– Dr Antje Ahrends (RBGE) & Dr Chris Glaseby (BIOSS) for statistical advice