iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210).

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Using optical character recognition (OCR) output in digitization. SPNHC, June 26, 2014. Symposium: Progress in Natural History Collections Digitisation. Canolfan Mileniwm Cymru \ Wales Millennium Centre, Cardiff Bay. Deborah Paul, Andrea Matsunaga, Miao Chen, Jason Best, Sylvia Orli, Elspeth Haston. Find Deb on Twitter: @idbdeb, @iDigBio. See your data before it's in the database and after. #spnhc2014 #digitization #collections

What is iDigBio? NIBA - NSF - ADBC - iDigBio - TCN - PEN. Facilitate use of biodiversity data. Enable digitisation. Portal access. Sustainability: community collaboration.

Minimal Data Capture: “filed as” name, higher geography, barcode, image. All sheets in a folder get the same initial data; only the barcode differs. Trend: filed as name. See "Biological collection data capture: a rapid approach using curatorial data".
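The folder-level capture described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `MinimalRecord` and `capture_folder` are hypothetical, and the image file name is assumed to be supplied per sheet alongside its barcode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinimalRecord:
    """One herbarium sheet with the minimal fields named on the slide."""
    barcode: str
    filed_as_name: str
    higher_geography: str
    image: str


def capture_folder(filed_as_name, higher_geography, sheets):
    """Create one record per sheet in a folder.

    `sheets` is an iterable of (barcode, image file name) pairs.
    Every record shares the folder-level name and geography;
    only the per-sheet values differ.
    """
    return [
        MinimalRecord(
            barcode=barcode,
            filed_as_name=filed_as_name,
            higher_geography=higher_geography,
            image=image,
        )
        for barcode, image in sheets
    ]
```

Typing the folder-level values once and scanning barcodes for each sheet is what makes this kind of capture fast: the per-sheet work is reduced to a scan and a photograph.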

Would you like to…? Enter records faster? Use the ditto feature often? Find duplicates quickly? Find the labels with lots of handwriting? Create your own record sets to transcribe: by collector, by country or county, by taxon, by language, by your Great Aunt Penelope? Create cogent sets to speed up validation and database updates? Make transcribers' / validators' jobs easier (paid and volunteer)?

Got Text? Got Handwriting?

Next, imagine output from 1000s of labels or notebooks or text files! OCR label output: No.....2L31. National Herbarium of Canada FLORA OF’T TERRITORIES. Hab. and Loc., Arctic Coast west of Mackenzie River delta: Between King Pt. and Kay Pt., 69° 12’ N., and 138° to 138° 30’ W.. Collector, A. E. Porsild July 23-25, 1934

Web Service-Based Word Cloud. Source text: http://aocr1.acis.ufl.edu/datasets/lichens/silver/ocr/WebrootDatasetsLichensSilverOcr.txt Created by sending a text file to this cloud generator: http://www.jasondavies.com/wordcloud/#http%3A%2F%2Faocr1.acis.ufl.edu%2Fdatasets%2Flichens%2Fsilver%2Focr%2FWebrootDatasetsLichensSilverOcr.txt
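The generator link on this slide appears to be the word cloud page followed by `#` and the percent-encoded address of the text file. A minimal Python sketch of building such a link, plus a local word count to preview what the cloud would emphasise, might look like this. The function names and the `min_length` cut-off are assumptions for illustration, not part of the original workflow.

```python
import re
from collections import Counter
from urllib.parse import quote

WORDCLOUD_BASE = "http://www.jasondavies.com/wordcloud/"


def wordcloud_url(text_url):
    """Build a generator link: base page, '#', then the percent-encoded text URL."""
    return WORDCLOUD_BASE + "#" + quote(text_url, safe="")


def word_counts(ocr_text, min_length=3):
    """Count words in OCR text, ignoring case, digits and very short tokens."""
    words = re.findall(r"[^\W\d_]+", ocr_text.lower())
    return Counter(word for word in words if len(word) >= min_length)
```

Calling `word_counts(text).most_common(20)` on a file of concatenated OCR output gives a quick look at the dominant terms before any cloud is drawn.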

OCR text

Robyn E Drinkwater, Robert Cubey, Elspeth Haston at TDWG 2013. Seeing the dark data…

It’s surprising what can be used to help filter specimens – the black art of search terms!
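Filtering by search terms can be sketched as a case-insensitive substring match over OCR text keyed by barcode. This is a minimal illustration, assuming the OCR output is already in memory as a dictionary; `record_sets` and its arguments are hypothetical names, and real OCR text would usually need fuzzier matching than this.

```python
from collections import defaultdict


def record_sets(ocr_by_barcode, terms):
    """Group barcodes by the search terms found in their OCR text.

    `ocr_by_barcode` maps a barcode to its raw OCR text.
    Returns a dict mapping each matched term to a sorted list of barcodes.
    """
    sets = defaultdict(list)
    for barcode, text in sorted(ocr_by_barcode.items()):
        lowered = text.lower()
        for term in terms:
            if term.lower() in lowered:
                sets[term].append(barcode)
    return dict(sets)
```

A collector surname, a country name or a herbarium name each yields a ready-made set of labels that can be handed to one transcriber or validated together.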

http://tinyurl.com/LichenRecords

Inside the 1899 Harriman Expedition

Overall Word Cloud Workflow. Components shown on the slide: Images; OCR Engine; OCR Output; Crowd sourcing (BVP); DwC Parsed Output; Index (Solr); OCR confidence (n-gram); Word Cloud; Cluster (Carrot2); Histogram (Google Charts, Facet Explorer); Web Service (Jason Davies). Some work from the recent iDigBio CITSCribe Hackathon.
Google Charts: http://developers.google.com/chart/interactive/docs/gallery
N-gram: http://github.com/idigbio-citsci-hackathon/OCR-Error-Estimation
Facet explorer: http://github.com/idigbio-citsci-hackathon/facet-explorer
Jason Davies WC: http://www.jasondavies.com/wordcloud/
Apache Solr: http://lucene.apache.org/solr/
Carrot2: http://project.carrot2.org/
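To give a feel for the "OCR confidence (n-gram)" step, here is one simple way such a score could be computed: the share of a text's character trigrams that also occur in a trusted reference text. This is an assumption-laden sketch and not the method used in the OCR-Error-Estimation repository; the function names and the choice of trigrams are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the character n-grams of text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def ngram_confidence(ocr_text, reference_text, n=3):
    """Fraction of the OCR text's n-grams that also occur in the reference text.

    Returns a value between 0.0 and 1.0; text too short to form
    an n-gram scores 0.0.
    """
    known = set(char_ngrams(reference_text, n))
    grams = char_ngrams(ocr_text, n)
    if not grams:
        return 0.0
    return sum(gram in known for gram in grams) / len(grams)
```

With a reference built from clean label text, a well-recognised line scores near 1.0 while a garbled fragment such as `No.....2L31.` scores near 0.0, which is enough to rank output for an initial sort.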

Word Clouds using N-gram Scoring, Faceting, Solr + Carrot2

Use for initial sort or validation. Imagine integration with current software.

Working Group Collaboration - Workflows: Setting up OCR; Running OCR; Machine Learning; Natural Language Processing

Sample Workflows with OCR integrated. New workflow; sample OCR protocols. Got one? Got a resource for these? Got new ideas for how to use the text data to improve the data? Let’s share!

Managing your crowdsourcing data behind the scenes. OCR too!

OCR use, a bit more… aOCR WG, JRA Synthesys3, …; user-interface interest group; exemplar ML and NLP workflows; combining with voice recognition software (Macroalgal TCN). Got Text? Got Handwriting?

Diolch yn fawr! (Thank you very much!) Work presented here made possible by many and especially: Andrea Matsunaga, Researcher, iDigBio; Miao Chen, Indiana University, Data to Insight Center; Jason Best, Botanical Research Institute of Texas; Sylvia Orli, IT Head, Smithsonian Botany Department; William Ulate, Technical Director, BHL; Reed Beaman, Informatics Specialist, iDigBio; Elspeth Haston et al., Royal Botanic Garden Edinburgh (RBGE); iDigBio Augmenting Optical Character Recognition WG; MaCC TCN; SALIX.
