iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210).


1 iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Data Discovery and Doer Happiness: Uses for Optical Character Recognition (OCR) Output. Presenter: Deborah Paul, Florida State University, Integrated Digitized Biocollections (iDigBio), at the Biodiversity Information Standards (TDWG) 2014 Conference, Elmia Congress Centre, Rydberg Hall, Jönköping, Sweden, Oct 27th, 2014. Authors: Deborah Paul, Andrea Matsunaga, Miao Chen, Jason Best, Sylvia Orli, William Ulate, Reed Beaman.

2 Trend: Minimal Data Capture. Capture the “filed as” name, higher geography, barcode, and image. All sheets in a folder get the same initial data; only the barcode differs. See: Biological collection data capture: a rapid approach using curatorial data.

3 Raw OCR output, warts and all, can be used to:
– enter records faster
– use the database entry ditto feature
– find duplicates quickly
– find the labels with lots of handwriting
– create your own record sets to transcribe by: collector, country or county, your Great Aunt Penelope, taxon, language
– create cogent sets to speed up validation and database updates
– make transcribers' / validators' jobs easier and more fun!
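
Two of the uses above (building a record set by search term and finding probable duplicates) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a tool from the talk: the barcodes, the `ocr_by_barcode` dictionary, and the helper names are hypothetical, and the duplicate test is a plain comparison of normalized text, which will miss labels whose OCR errors differ.

```python
import re
from collections import defaultdict

# Hypothetical sample data: barcode -> raw OCR text.
ocr_by_barcode = {
    "CAN-0001": "National Herbarium of Canada Collector, A. E. Porsild July 23-25, 1934",
    "CAN-0002": "National  Herbarium of Canada\nCollector, A. E. Porsild  July 23-25, 1934",
    "XX-0003": "Flora of Texas  Coll. J. Smith  1952",
}


def normalize(text):
    """Lowercase the text and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def record_set(records, term):
    """Return the sorted barcodes whose OCR text contains the search term."""
    needle = normalize(term)
    return sorted(b for b, t in records.items() if needle in normalize(t))


def duplicate_candidates(records):
    """Group barcodes whose normalized OCR text is identical."""
    groups = defaultdict(list)
    for barcode, text in records.items():
        groups[normalize(text)].append(barcode)
    return [sorted(g) for g in groups.values() if len(g) > 1]


if __name__ == "__main__":
    print(record_set(ocr_by_barcode, "Porsild"))
    print(duplicate_candidates(ocr_by_barcode))
```

A record set built this way could be handed to one transcriber, so that the collector, locality, and date fields repeat from label to label.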

4 Got Text? Got Handwriting?

5 OCR Label
No.....2L31. National Herbarium of Canada FLORA OF’T TERRITORIES. Hab. and Loc., Arctic Coast west of Mackenzie River delta: Between King Pt. and Kay Pt., 69° 12’ N., and 138° to 138° 30’ W.. Collector, A. E. Porsild July 23-25, 1934
Next, imagine output from 1000s of labels or notebooks or text files!

6 Robyn E Drinkwater, Robert Cubey, Elspeth Haston at TDWG 2013. Seeing the dark data…

7 It’s surprising what can be used to help filter specimens – the black art of search terms!

8 Overall Word Cloud Workflow. Components shown: Images; OCR Engine; OCR Output; Crowd sourcing (BVP); Index (Solr); OCR confidence (n-gram); DwC Parsed Output; Word Cloud; Cluster (Carrot2); Histogram (Google Charts, Facet Explorer); Web Service (Jason Davies). Some work from the iDigBio CITSCribe Hackathon.
Google Charts: http://developers.google.com/chart/interactive/docs/gallery
N-gram: http://github.com/idigbio-citsci-hackathon/OCR-Error-Estimation
Facet explorer: http://github.com/idigbio-citsci-hackathon/facet-explorer
Jason Davies WC: http://www.jasondavies.com/wordcloud/
Apache Solr: http://lucene.apache.org/solr/
Carrot2: http://project.carrot2.org/

9 Word Clouds with… N-gram Scoring, Faceting, Solr + Carrot2
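
To give a feel for what n-gram scoring of OCR output can look like, here is a minimal Python sketch. It is an assumption-laden illustration, not the method used in the hackathon's OCR-Error-Estimation repository: it scores each token by the fraction of its character trigrams that also occur in a small reference vocabulary. The vocabulary, `REFERENCE_WORDS`, and all function names are hypothetical.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with start/end markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_model(reference_words, n=3):
    """Collect every n-gram seen in the reference vocabulary."""
    model = set()
    for word in reference_words:
        model.update(char_ngrams(word, n))
    return model


def score(word, model, n=3):
    """Fraction of the word's n-grams found in the model (0.0 to 1.0)."""
    grams = char_ngrams(word, n)
    if not grams:
        return 0.0
    return sum(g in model for g in grams) / len(grams)


# Hypothetical reference vocabulary; a real one would be far larger.
REFERENCE_WORDS = ["national", "herbarium", "canada", "flora", "collector", "territories"]
MODEL = build_model(REFERENCE_WORDS)

if __name__ == "__main__":
    for token in ["Herbarium", "Herbariun", "2L31"]:
        print(token, round(score(token, MODEL), 2))
```

Tokens with low scores are candidates for OCR errors or handwriting, so an average score per label could serve as a rough sorting key.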

10 Imagine integration with current software. Use for initial sort or validation.

11

12 Managing your crowdsourcing data behind the scenes – OCR too!

13 Work on Automated Parsing Algorithms. The aOCR group is finishing up a study that compares parsing algorithm strategies against a known standard, to better define what is currently possible for automated parsing of OCR output to standard Darwin Core terms.
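
As a rough idea of what parsing OCR output to Darwin Core terms involves, the sketch below applies a few regular expressions to the Porsild label text quoted on the OCR Label slide. It is a toy example under stated assumptions, not one of the strategies the aOCR group compared: the patterns and `parse_label` are hypothetical, they cover only this label layout, and the longitude range is deliberately left unparsed.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Darwin Core term -> pattern (illustrative patterns only).
PATTERNS = {
    "recordedBy": re.compile(r"Collector,?\s+((?:[A-Z]\.\s*)+[A-Z][a-z]+)"),
    "verbatimEventDate": re.compile(
        rf"((?:{MONTHS})\s+\d{{1,2}}(?:-\d{{1,2}})?,\s*\d{{4}})"
    ),
    "verbatimLatitude": re.compile(r"(\d{1,2}°\s*\d{1,2}[’']\s*[NS])"),
}


def parse_label(ocr_text):
    """Return a dict of Darwin Core terms found in one label's OCR text."""
    record = {}
    for term, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            record[term] = match.group(1)
    # Derive the year from the last four characters of the verbatim date.
    if "verbatimEventDate" in record:
        record["year"] = record["verbatimEventDate"][-4:]
    return record


LABEL = ("No.....2L31. National Herbarium of Canada FLORA OF’T TERRITORIES. "
         "Hab. and Loc., Arctic Coast west of Mackenzie River delta: Between "
         "King Pt. and Kay Pt., 69° 12’ N., and 138° to 138° 30’ W.. "
         "Collector, A. E. Porsild July 23-25, 1934")

if __name__ == "__main__":
    print(parse_label(LABEL))
```

Fields the patterns cannot find are simply omitted, which leaves them for a human transcriber or validator.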

14 http://tinyurl.com/LichenRecords

15 Inside the 1899 Harriman Expedition

16 Inside the 1899 Harriman Expedition

17 Workflow Modules and Sample Digitization Workflows with OCR integrated. The iDigBio DROID and aOCR groups produced a step-by-step series of tasks for implementing OCR in a digitization workflow. Project-specific workflows are available from RBGE, NYBG, SALIX2, ASU Herbarium, ScioTR, TTD-TCN, … Yours?

18 OCR use, Voice Recognition, User Interface Optimization, Image Analysis,… aOCR WG and Synthesys3 user-interface interest group; exemplar ML and NLP workflows; combining OCR with voice recognition software in Symbiota (Macroalgal TCN); automated image analysis; combining touch-screen technology into the digitization workflow (ScioTR) ($6.99). Got Text? Got Handwriting?

19 Tack så mycket! (Thank you very much!) Work presented here made possible by many and especially…
– Andrea Matsunaga, Researcher, iDigBio
– Miao Chen, Indiana University, Data to Insight Center
– Jason Best, Botanical Research Institute of Texas
– Sylvia Orli, IT Head, Smithsonian Botany Department
– William Ulate, Technical Director, BHL
– Reed Beaman, Informatics Specialist, iDigBio
– Elspeth Haston, et al., Royal Botanic Garden Edinburgh (RBGE)
– Stephen Gottschalk, New York Botanical Garden (NYBG)
– iDigBio Augmenting Optical Character Recognition WG
Also: MaCC TCN, SALIX2, Smithsonian

20 Find out more at iDigBio: www.idigbio.org, facebook.com/iDigBio, twitter.com/iDigBio, vimeo.com/idigbio, idigbio.org/rss-feed.xml, webcal://www.idigbio.org/events-calendar/export.ics

21 Web Service-Based Word Cloud
Source text: http://aocr1.acis.ufl.edu/datasets/lichens/silver/ocr/WebrootDatasetsLichensSilverOcr.txt
Created by sending a text file to this cloud generator: http://www.jasondavies.com/wordcloud/#http%3A%2F%2Faocr1.acis.ufl.edu%2Fdatasets%2Flichens%2Fsilver%2Focr%2FWebrootDatasetsLichensSilverOcr.txt
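
The generator link on this slide appears to be the generator's address followed by `#` and the percent-encoded address of the text file. Assuming that pattern holds, a link for any hosted OCR text file could be built as in the Python sketch below; `word_cloud_link` is a hypothetical helper name, and the sketch does not check whether the generator still accepts links of this form.

```python
from urllib.parse import quote

GENERATOR = "http://www.jasondavies.com/wordcloud/"


def word_cloud_link(text_url, generator=GENERATOR):
    """Append the fully percent-encoded text-file URL as the link's fragment."""
    # safe="" makes quote() encode ":" and "/" as well.
    return f"{generator}#{quote(text_url, safe='')}"


if __name__ == "__main__":
    source = ("http://aocr1.acis.ufl.edu/datasets/lichens/silver/ocr/"
              "WebrootDatasetsLichensSilverOcr.txt")
    print(word_cloud_link(source))
```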

22 OCR text

