iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210).

Presentation transcript:

Using optical character recognition (OCR) output in digitization: See your data before it's in the database and after. SPNHC, June 26, 2014. Symposium: Progress in Natural History Collections Digitisation. Canolfan Mileniwm Cymru \ Wales Millennium Centre, Cardiff Bay. Deborah Paul, Andrea Matsunaga, Miao Chen, Jason Best, Sylvia Orli, Elspeth Haston. find Deb. #spnhc2014 #digitization #collections. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

What is iDigBio? NIBA - NSF - ADBC - iDigBio - TCN - PEN. Facilitate use of biodiversity data. Enable digitisation. Portal access. Sustainability – community collaboration.

Minimal Data Capture: “filed as” name, higher geography, barcode, image. All sheets in a folder get the same initial data; only the barcode differs. Biological collection data capture: a rapid approach using curatorial data. Trend: filed as name.
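As a rough illustration of the folder-level approach on this slide, the sketch below creates one skeletal record per sheet from data entered once per folder. The record fields, the image naming scheme, and the sample values are all assumptions made for the example, not part of any iDigBio tool.

```python
from dataclasses import dataclass


@dataclass
class SkeletalRecord:
    barcode: str
    filed_as_name: str
    higher_geography: str
    image_file: str  # hypothetical naming scheme: "<barcode>.jpg"


def records_for_folder(filed_as_name, higher_geography, barcodes):
    """Create one skeletal record per sheet in a folder.

    Every sheet shares the folder-level data; only the barcode
    (and the image file named after it) differs.
    """
    return [
        SkeletalRecord(
            barcode=barcode,
            filed_as_name=filed_as_name,
            higher_geography=higher_geography,
            image_file=f"{barcode}.jpg",
        )
        for barcode in barcodes
    ]


# Hypothetical folder: the name and geography are typed once, barcodes are scanned.
folder = records_for_folder("Salix arctica", "North America", ["E0001", "E0002", "E0003"])
for record in folder:
    print(record)
```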

Would you like to…? Enter records faster? Use the ditto feature often? Find duplicates quickly? Find the labels with lots of handwriting? Create your own record sets to transcribe (by collector, by country or county, by your Great Aunt Penelope, by taxon, by language)? Create cogent sets to speed up validation and database updates? Make transcribers' / validators' jobs easier (paid and volunteer)?
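One way "find duplicates quickly" could work on OCR text is sketched below: normalize each label's text and compare pairs with Python's standard `difflib.SequenceMatcher`. The function names, the 0.9 threshold, and the sample labels are assumptions for illustration; this is not a description of how any particular portal does it.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and collapse everything but letters and digits to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def likely_duplicates(ocr_by_barcode, threshold=0.9):
    """Return barcode pairs whose normalized OCR text is at least `threshold` similar."""
    pairs = []
    for (a, text_a), (b, text_b) in combinations(ocr_by_barcode.items(), 2):
        ratio = SequenceMatcher(None, normalize(text_a), normalize(text_b)).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


# Hypothetical OCR output keyed by barcode; B2 differs from B1 only in punctuation.
labels = {
    "B1": "Collector, A. E. Porsild July 23-25, 1934",
    "B2": "Collector. A. E. Porsild  July 23-25, 1934",
    "B3": "Flora of Texas. Coll. J. Smith 1951",
}
print(likely_duplicates(labels))
```

Comparing every pair is quadratic in the number of labels, so a real collection would need to narrow the candidates first (for example by collector or date).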

Got Text? Got Handwriting?

OCR Label: No.....2L31. National Herbarium of Canada FLORA OF’T TERRITORIES. Hab. and Loc., Arctic Coast west of Mackenzie River delta: Between King Pt. and Kay Pt., 69° 12’ N., and 138° to 138° 30’ W.. Collector, A. E. Porsild July 23-25, 1934. Next, imagine output from 1000s of labels or notebooks or text files!

Web Service-Based Word Cloud: created by sending a text file to this cloud generator.
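The slide's word cloud came from a web service. As a local stand-in, the sketch below computes the word frequencies that any word cloud is built from. The stopword list, the minimum word length, and the sample text are assumptions; note that the simple `[a-z]+` pattern ignores digits and non-ASCII letters.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be longer.
STOPWORDS = {"of", "and", "the", "to", "in"}


def word_frequencies(ocr_text, min_length=3):
    """Count lowercase ASCII words in OCR text, skipping short words and stopwords."""
    words = re.findall(r"[a-z]+", ocr_text.lower())
    return Counter(w for w in words if len(w) >= min_length and w not in STOPWORDS)


# Hypothetical OCR output from two labels, concatenated.
sample = (
    "National Herbarium of Canada. Collector, A. E. Porsild. "
    "National Herbarium of Canada."
)
freqs = word_frequencies(sample)
for word, count in freqs.most_common():
    print(word, count)
```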

OCR text

Robyn E Drinkwater, Robert Cubey, Elspeth Haston at TDWG Seeing the dark data…

It’s surprising what can be used to help filter specimens – the black art of search terms!
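Filtering by search terms can be as simple as a case-insensitive substring match over the OCR text, as in the sketch below. The function name and sample labels are hypothetical.

```python
def filter_by_terms(ocr_by_barcode, terms):
    """Return barcodes whose OCR text contains any of the search terms (case-insensitive)."""
    lowered = [term.lower() for term in terms]
    return [
        barcode
        for barcode, text in ocr_by_barcode.items()
        if any(term in text.lower() for term in lowered)
    ]


# Hypothetical OCR output keyed by barcode.
labels = {
    "B1": "Harriman Alaska Expedition 1899",
    "B2": "National Herbarium of Canada",
    "B3": "HARRIMAN EXPEDITION, Kodiak",
}
print(filter_by_terms(labels, ["harriman"]))
```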

Inside the 1899 Harriman Expedition

Overall Word Cloud Workflow. Components: Images; OCR Engine; OCR Output; Crowd sourcing (BVP); Index (Solr); OCR confidence (n-gram); DwC Parsed Output; Word Cloud; Cluster (Carrot2); Histogram (Google Charts, Facet Explorer); Web Service (Jason Davies). Tools referenced: Google Charts, N-gram, Facet explorer, Jason Davies WC, Apache Solr, Carrot2. Some work from the recent iDigBio CITSCribe Hackathon.

Word Clouds using N-gram Scoring, Faceting, Solr + Carrot2
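The slides do not say how the n-gram scoring was computed. One simple possibility, shown purely as an assumption, is to score each OCR token by the share of its character trigrams that also occur in a vocabulary of known-good words. All names and the tiny vocabulary below are hypothetical.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with one space on each side."""
    padded = f" {word.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_reference(vocabulary, n=3):
    """Collect every n-gram seen in a list of known-good words."""
    return {gram for word in vocabulary for gram in char_ngrams(word, n)}


def ngram_score(word, reference, n=3):
    """Fraction of the word's n-grams found in the reference set (0.0 to 1.0)."""
    grams = char_ngrams(word, n)
    if not grams:
        return 0.0
    return sum(gram in reference for gram in grams) / len(grams)


# Hypothetical vocabulary; in practice this might come from validated records.
reference = build_reference(["herbarium", "national", "collector", "canada"])
for token in ["herbarium", "h3rb@r1um"]:
    print(token, round(ngram_score(token, reference), 2))
```

Tokens with low scores are candidates for OCR errors or handwriting, which could feed the initial sort or validation steps mentioned elsewhere in the talk.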

Use for initial sort or validation. Imagine integration with current software.

Working Group Collaboration - Workflows: Setting up OCR; Running OCR; Machine Learning; Natural Language Processing.

Sample workflows with OCR integrated. New workflow. Sample OCR protocols. Got one? Got a resource for these? Got new ideas for how to use the text data to improve the data? Let’s share!

Managing your crowdsourcing data behind the scenes. OCR too!

OCR use, a bit more… aOCR WG, JRA Synthesys3, …; user-interface interest group; exemplar ML and NLP workflows; combining with voice recognition software (Macroalgal TCN). Got Text? Got Handwriting?

Diolch yn fawr! (Thank you very much!) Work presented here made possible by many and especially… Andrea Matsunaga, Researcher, iDigBio; Miao Chen, Indiana University, Data to Insight Center; Jason Best, Botanical Research Institute of Texas; Sylvia Orli, IT Head, Smithsonian Botany Department; William Ulate, Technical Director, BHL; Reed Beaman, Informatics Specialist, iDigBio; Elspeth Haston, et al., Royal Botanic Garden Edinburgh (RBGE); iDigBio Augmenting Optical Character Recognition WG; MaCC TCN; SALIX.