OCR implementation in The Caribbean Plants Digitization Project A project to image and catalog over 150,000 Caribbean specimens at the New York Botanical.

Slides:



Advertisements
Similar presentations
Processing New Herbarium Collections Using EMu The last bits of paper Nicole Tarnowsky.
Advertisements

LDS Church Implementations Aleph / Primo Implementation & Gap Application Development Ken Jensen David Benge.
 Goals and Scope  Research Question  Overall Workflow  Imaging Approach  OCR, NLP, Geo-referencing  Outreach and Crowd Sourcing.
Sylvia OrliSylvia Orli Department of BotanyDepartment of Botany National Museum of Natural HistoryNational Museum of Natural History Smithsonian InstitutionSmithsonian.
Databasing and Digitization of a Smaller Herbarium at a Smaller Institution it CAN be done and funded, too.
NYBG + KE EMu The New York Botanical Garden + KE EMu Melissa Tulig Botanical Information Management.
Web-based Specimen Databasing: Lessons from the Plant Bug Planetary Biodiversity Inventory Project presented by Randall T. Schuh Curator and Chair Division.
Implementation of KE EMu at The New York Botanical Garden Barbara Thiers Anthony Kirchgessner Shannon Dominick Emily Ashley Melissa Tulig.
National Herbarium of New South Wales Royal Botanic Gardens & Domain Trust, Sydney New South Wales Flora Online Karen Wilson and Gary Chapple.
Importing Transfer Equivalencies: How to Maximize Efficiency How Columbia College Office of Registrar improved productivity through third party solutions.
Collections Management Museums EMu – Data Cleaning with EMu Data Cleaning with EMu Warren Hindley.
FOSSIL INSECT DIGITIZATION WORKFLOW AT THE UNIVERSITY OF COLORADO Talia Karim 1, Lindsay Walker 1, Richard Levy 2 1 CU Museum of Natural History 2 Denver.
Computerization of Large Collections: Challenges, Benefits and Lessons Learned Barbara M. Thiers Director of the Herbarium New York Botanical Garden.
IDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF ).
IDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF ).
DIGITAL FLORA La Selva Biological Station Website: Supported by:
Toward Automatic Processing and Indexing of Microfilm.
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
Supporting high-throughput digitisation workflows in EMu
Smart Web Exhibits: Carnegie Museum of Natural History Presented at Web-Wise 2002, Johns Hopkins University.
The New York Botanical Garden The Macroalgae Digitization Project Advancing online algal collections at the New York Botanical Garden and beyond Stephen.
Discovering Effective Workflows How can iDigBio help the biological and paleontological community with workflow development? support from NSF grant: Advancing.
IDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF ).
Digitizing California Arthropod Collections Peter Oboyski, Phuc Nguyen, Serge Belongie, Rosemary Gillespie Essig Museum of Entomology University of California.
Improving the Quality of Tax Statistics: Recent Innovations in Editing and Imputation Techniques at the Statistics of Income Division of the U.S. Internal.
The Macroalgal Herbarium Consortium ACCESSING 150 YEARS OF SPECIMEN DATA TO UNDERSTAND CHANGES IN THE MARINE/AQUATIC ENVIRONMENT.
1 Newspaper Digitisation Workflows Rose Holley- Manager ANDP Presentation to Cultural Heritage Digitisation professionals 26 November 2008.
1 Australian Newspapers Digitisation Program Development of the Newspapers Content Management System Rose Holley – ANDP Manager ANPlan/ANDP Workshop, 28.
The use of OCR in the digitisation of herbarium specimens Robyn E Drinkwater, Robert Cubey & Elspeth Haston.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
IDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF ).
The Macroalgal Digitization Project Chris Neefus, Department of Biological Sciences University of New Hampshire, Durham, New Hampshire.
Enabling E Research ANU Data Commons. What is it ? Building a repository for data sets o data can be deposited o updated o published to Research Data.
Presented by: Michael Bevans Information Manager for Digitization
Automated Georeferencing of Natural History Museum Data Nelson E. Rios Discussion The Tulane University Fish Collection, with 7.1 million fluid-preserved.
OCR and SALIX Parsing Daryl Lafferty Arizona State University October, 2012.
Field Work, Herbaria, Databases, Floras, and Monographs for Plant Systematics Spring 2014.
OCLC Online Computer Library Center Kathy Kie December 2007 OCLC Cataloging & Metadata Services an introduction.
Designing a Database (Part I) -Identify all fields needed to produce the required information -Group related fields into tables -Determine Each Table’s.
 How are changes in distribution patterns of lichens and bryophytes over time correlated with man-made environmental changes?  How accurately can we.
Edward Gilbert Corinna Gries Thomas H. Nash III Robert Anglin.
Image Workflow Processes Elspeth Haston, Robert Cubey, Martin Pullan & David J Harris.
Metadata Extraction for NASA Collection June 21, 2007 Kurt Maly, Steve Zeil, Mohammad Zubair {maly, zeil,
Internet Information Systems Writing to Databases and Amending Data.
CBoL Taipei, september 2007 BARCODE DATA, MUSEUM CATALOGS AND GBIF Simon Tillier.
The Macroalgal Herbarium Consortium ACCESSING 150 YEARS OF SPECIMEN DATA TO UNDERSTAND CHANGES IN THE MARINE/AQUATIC ENVIRONMENT.
Index Herbariorum An Overview Barbara M. Thiers Biodiversity Collections Index Kick-Off Meeting U.S. Natural History Museum,28-29 Jan 2008.
Corinna Gries Edward Gilbert Thomas H. Nash III. Lichens Bryophytes Climate Change  NSF ADBC funding 2011 ~ 2.3 million specimen (90%) ○ 900,000 lichens.
Statistical Expertise for Sound Decision Making Quality Assurance for Census Data Processing Jean-Michel Durr 28/1/20111Fourth meeting of the TCG - Lubjana.
IDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF ).
IMu Rapid Data Entry Andrew Brown. Overview Browser-based Desktop Tablet Phone Project-based Authenticated access.
Context: The Strategic Plan for Establishing the Network Integrated Biocollections Alliance Judith E. Skog, Office of the Assistant Director, Biological.
Edward Gilbert Corinna Gries Thomas H. Nash III Robert Anglin.
The William and Linda Steere Herbarium The New York Botanical Garden
Royal Botanic Garden Edinburgh Funded mostly by Scottish Government Martin Pullan – Biodiversity informatics David Harris – Herbarium Curator.
Data Imports Tony Hayes, Quality Assurance Manager.
 Research Question  Goals and Scope  Digitization Workflow  Geo-referencing  Dissemination  Outreach and Crowd Sourcing.
BBG Brooklyn Botanical Garden By Lena Mei, Savita Ramlall, and Edwin Gonzalez Mentor: Dr. Gerry Moore Co Mentor: Paul and Cathy.
Georeferencing Botanical Data Using Text Analysis Tools Clare A Llewellyn, Elspeth Haston & Claire Grover.
Image Field Data Digital Image File Name 1132 Image Title Carl Van Vechten. Portrait of Paul Robeson as Othello. Silver gelatin.
Making a Herbarium Specimen / Voucher
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No Herbadrop.
OnCore Current Status and Implementation Project Plan
What are our collections being used for?
Image1 Field Data Digital Image File Name 35159
Crowd-sourcing, Public Participation, and Data Enrichment – Using crowd-sourcing tools Biological Collections Digitisation in the Pacific , Symposium.
(Dublin, Ireland, April 2014)
Deb Paul, iDigBio aOCR WG
INHS Insect collection digitization workflow
Module 4: CITES and Scientific Institutions
Presentation transcript:

OCR implementation in The Caribbean Plants Digitization Project A project to image and catalog over 150,000 Caribbean specimens at the New York Botanical Garden *Legend: estimated number of specimens per country The New York Botanical Garden Presented by: Stephen Gottschalk

The New York Botanical Garden NYBG’s Caribbean Collections  More than 100 expeditions sponsored by the garden since  Notable and prolific collections by current and former Garden staff including the Garden’s founder, Nathaniel Lord Britton  Approximately 75 % of the specimen data could be digitized from field books at NYBG and other institutions, or from published itineraries which provide the same information

The New York Botanical Garden Caribbean Project workflow summary: Curation and rapid barcoding of specimens Specimen imaging Optical Character Recognition (OCR) and data parsing Field book entries Manual keying of specimen data Specimen Catalog Record

The New York Botanical Garden Sample ideal fieldbook: Plant family Plant description Collection locality No. of duplicates Determination Collection no. Collection date Habitat

The New York Botanical Garden Sample fieldbook - the product:

The New York Botanical Garden Sample Caribbean fieldbooks, less than ideal: Vol 132, J. A. Safer, 1909Vol. 69, Van Hermann, 1904

The New York Botanical Garden OCR assists in attaching fieldbook records: OCR derived fields Fieldbook entries user input IRN

The New York Botanical Garden Using OCR to populate fields: Python script finds line of query term User detects pattern to update fields Query raw OCR to extract records of a given label type SELECT * FROM OCR_all where label like "*New*Yor*Bot*Gar*Exp*Cub*"; Example:

The New York Botanical Garden Using OCR to populate fields: Python script finds line of query term User detects pattern to update fields Query raw OCR to extract records of a given label type Example: Return line containing “Col”

The New York Botanical Garden Using OCR to populate fields: Python script finds line of query term User detects pattern to update fields Query raw OCR to extract records of a given label type Example: Length of string Find position of “j. a.” find “sha” Find “afer” J. A. Shafer collections!

The New York Botanical Garden Avoid false positives: F. S. Earle – no!

The New York Botanical Garden Consider pattern training and a second OCR pass: Wright Labels, 162 total, generally low quality: Percentage correctly OCR’d OCR Pattern Training Used

The New York Botanical Garden Consider pattern training and a second OCR pass: Zanoni Labels, 114 total, generally typed: OCR Pattern Training Used Percentage correctly OCR’d

The New York Botanical Garden Closing thoughts: OCR plus human parsing works well with very little programming. Works well for large, self contained data sets but maybe not for partial or changing data sets – automation would be helpful for addressing this. Allows for creation of “digital” fieldbooks (ie order by collector, collection number and place).

The New York Botanical Garden Acknowledgements  National Science Foundation  Barbara Thiers, Jacquelyn Kallunki, Michael Bevans, Anthony Kirchgessner, Melissa Tulig, Benito Santos, Nicole Tarnowsky, Tom Zanoni, Benjamin Saracco, Stephen Sinon, Vinson Doyle, Jessica Allen, Sarah Dutton, Lane Gibbons, Elizabeth Kiernan, Brandy Watts, Charles Zimmerman  Visit the Virtual Herbarium: