Published by Osborn Collins. Modified over 9 years ago.
1
Data Discovery and Doer Happiness: Uses for Optical Character Recognition (OCR) Output
Presenter: Deborah Paul, Florida State University, Integrated Digitized Biocollections (iDigBio)
Biodiversity Information Standards (TDWG) 2014 Conference, Elmia Congress Centre, Rydberg Hall, Jönköping, Sweden, Oct 27th, 2014
Authors: Deborah Paul, Andrea Matsunaga, Miao Chen, Jason Best, Sylvia Orli, William Ulate, Reed Beaman
iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
2
2 Minimal Data Capture
– “filed as” name
– higher geography
– barcode
– image
All sheets in a folder get the same initial data; only the barcode differs.
Trend: “filed as” name
Biological collection data capture: a rapid approach using curatorial data
3
3 Raw OCR output, warts and all, can be used to:
– enter records faster
– use the database entry “ditto” feature
– find duplicates quickly
– find the labels
– find the labels with lots of handwriting
– create your own record sets to transcribe by:
  – collector
  – country or county
  – your Great Aunt Penelope
  – taxon
  – language
– create cogent sets to speed up validation and database updates
– make transcribers’ / validators’ jobs easier and more fun!
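One way to "find duplicates quickly" from raw OCR text is to compare label texts pairwise and flag those that are nearly identical. The following is a minimal sketch using Python's standard `difflib`; the barcodes, label texts, and the 0.85 threshold are hypothetical values chosen for illustration, not something taken from the talk.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return " ".join(text.lower().split())


def likely_duplicates(ocr_by_barcode, threshold=0.85):
    """Return (barcode_a, barcode_b, ratio) for label texts that look alike.

    The threshold is an assumption; tune it against labels you know.
    """
    cleaned = {code: normalize(text) for code, text in ocr_by_barcode.items()}
    pairs = []
    for a, b in combinations(sorted(cleaned), 2):
        ratio = SequenceMatcher(None, cleaned[a], cleaned[b]).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 3)))
    return pairs


# Hypothetical OCR output keyed by barcode (BC002 contains an OCR error).
sample = {
    "BC001": "Flora of Canada Collector A. E. Porsild July 23-25, 1934",
    "BC002": "Flora of Canada  Collector A. E. Porsi1d July 23-25, 1934",
    "BC003": "Lichens of Florida. Collected by J. Smith, 1972",
}

for a, b, ratio in likely_duplicates(sample):
    print(a, b, ratio)
```

Pairwise comparison grows quadratically with the number of labels, so for thousands of records a search index is likely a better fit; this sketch only shows the idea on a small set.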
4
4 Got Text? Got Handwriting?
5
5 Next imagine output from 1000s of labels or notebooks or text files!
OCR Label:
No.....2L31. National Herbarium of Canada FLORA OF’T TERRITORIES. Hab. and Loc., Arctic Coast west of Mackenzie River delta: Between King Pt. and Kay Pt., 69° 12’ N., and 138° to 138° 30’ W.. Collector, A. E. Porsild July 23-25, 1934
6
6 Robyn E Drinkwater, Robert Cubey, Elspeth Haston at TDWG 2013. Seeing the dark data…
7
7 It’s surprising what can be used to help filter specimens – the black art of search terms!
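Filtering specimens by search terms can be sketched as a simple keyword match over the raw OCR text, producing one record set per term. This is an illustrative sketch only; the function name, barcodes, and sample texts are hypothetical.

```python
import re


def build_record_sets(ocr_by_barcode, search_terms):
    """Group barcodes by the search terms found in their OCR text.

    Matching is case-insensitive and on whole terms, so "Kay" does not
    match "Mackay". A record can land in more than one set.
    """
    patterns = {
        term: re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        for term in search_terms
    }
    record_sets = {term: [] for term in search_terms}
    for barcode, text in sorted(ocr_by_barcode.items()):
        for term, pattern in patterns.items():
            if pattern.search(text):
                record_sets[term].append(barcode)
    return record_sets


# Hypothetical OCR output keyed by barcode.
sample = {
    "BC001": "National Herbarium of Canada. Collector, A. E. Porsild",
    "BC002": "Flora of Florida. Collector, J. Smith",
    "BC003": "Herbier du Canada. Leg. A. E. Porsild",
}

print(build_record_sets(sample, ["Porsild", "Canada", "Florida"]))
```

Because raw OCR contains errors, exact matching will miss some labels; misspelled variants of a term can be added to the list as extra search terms.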
8
8 Overall Word Cloud Workflow
Workflow components:
– Images
– OCR Engine
– OCR Output
– OCR confidence (n-gram)
– Crowd sourcing (BVP)
– DwC Parsed Output
– Index (Solr)
– Cluster (Carrot2)
– Histogram (Google Charts, Facet Explorer)
– Word Cloud
– Web Service (Jason Davies)
Links:
– Google Charts: http://developers.google.com/chart/interactive/docs/gallery
– N-gram: http://github.com/idigbio-citsci-hackathon/OCR-Error-Estimation
– Facet explorer: http://github.com/idigbio-citsci-hackathon/facet-explorer
– Jason Davies WC: http://www.jasondavies.com/wordcloud/
– Apache Solr: http://lucene.apache.org/solr/
– Carrot2: http://project.carrot2.org/
Some work from the iDigBio CITSCribe Hackathon
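The "OCR confidence (n-gram)" step can be pictured with a small character n-gram score: build a set of n-grams from text known to be clean, then measure what fraction of an OCR output's n-grams appear in it. This is a simplified sketch under that assumption and is not the method implemented in the hackathon's OCR-Error-Estimation repository; the reference texts and function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of the lowercased, whitespace-collapsed text."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_model(reference_texts, n=3):
    """Count the n-grams seen in text that is known to be clean."""
    model = Counter()
    for text in reference_texts:
        model.update(char_ngrams(text, n))
    return model


def confidence(ocr_text, model, n=3):
    """Fraction of the OCR text's n-grams that also occur in the reference model."""
    grams = char_ngrams(ocr_text, n)
    if not grams:
        return 0.0
    known = sum(1 for gram in grams if gram in model)
    return known / len(grams)


# Hypothetical clean reference text.
model = build_model([
    "National Herbarium of Canada",
    "Flora of the Northwest Territories",
])

print(confidence("National Herbarium of Canada", model))
print(confidence("N4t!on@l H3rb#r1um", model))
```

Low-scoring outputs could be routed to human transcribers first, while high-scoring ones go to automated parsing.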
9
9 Word Clouds with… N-gram Scoring, Faceting, Solr + Carrot2
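Behind a word cloud or a facet histogram sits a table of term counts. A minimal sketch of producing such counts from OCR text follows; the stopword list, minimum word length, and sample texts are hypothetical choices for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be longer.
STOPWORDS = {"of", "the", "and", "to"}


def word_frequencies(ocr_texts, min_length=3):
    """Count lowercased alphabetic words across all OCR texts."""
    counts = Counter()
    for text in ocr_texts:
        # [^\W\d_]+ matches runs of letters, including accented ones.
        for word in re.findall(r"[^\W\d_]+", text.lower()):
            if len(word) >= min_length and word not in STOPWORDS:
                counts[word] += 1
    return counts


# Hypothetical OCR texts.
texts = [
    "Flora of Canada",
    "National Herbarium of Canada",
    "Canada lichens",
]

for word, count in word_frequencies(texts).most_common(3):
    print(word, count)
```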
10
10 Imagine
– Integration with current software
– Use for initial sort or validation
11
11
12
12 Managing your crowdsourcing data behind the scenes – OCR too!
13
13 Work on Automated Parsing Algorithms
The aOCR group is finishing up a study that compares parsing algorithm strategies against a known standard, to better define what is currently possible for automated parsing of OCR output into standard Darwin Core terms.
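To make "parsing OCR output to Darwin Core terms" concrete, here is a small rule-based sketch that pulls three fields out of the sample label shown earlier in the deck. It is not one of the strategies compared in the aOCR study; the regular expressions are assumptions fitted to this one label. The output keys `recordedBy`, `verbatimEventDate`, and `verbatimLatitude` are Darwin Core term names.

```python
import re

# Raw OCR text of the sample label, errors included.
LABEL = (
    "No.....2L31. National Herbarium of Canada FLORA OF’T TERRITORIES. "
    "Hab. and Loc., Arctic Coast west of Mackenzie River delta: "
    "Between King Pt. and Kay Pt., 69° 12’ N., and 138° to 138° 30’ W.. "
    "Collector, A. E. Porsild July 23-25, 1934"
)

# Illustrative patterns only; real labels vary far more than this.
COLLECTOR_RE = re.compile(r"Collector,?\s+((?:[A-Z]\.\s*)+[A-Z][a-z]+)")
DATE_RE = re.compile(
    r"((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"
    r"\s+\d{1,2}(?:\s*-\s*\d{1,2})?,\s*\d{4})"
)
LATITUDE_RE = re.compile(r"(?<!\d)(\d{1,2}°\s*\d{1,2}\s*[’']\s*[NS])(?![A-Za-z])")


def first_match(pattern, text):
    """Return the first captured group, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


def parse_label(text):
    """Map raw OCR text to a few Darwin Core terms, keeping values verbatim."""
    return {
        "recordedBy": first_match(COLLECTOR_RE, text),
        "verbatimEventDate": first_match(DATE_RE, text),
        "verbatimLatitude": first_match(LATITUDE_RE, text),
    }


print(parse_label(LABEL))
```

The longitude on this label is a range ("138° to 138° 30’ W"), which a single-value pattern would truncate, so it is left out here; cases like this are part of why parsing results need to be checked against a known standard.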
14
14 http://tinyurl.com/LichenRecords
15
15 Inside the 1899 Harriman Expedition
16
16 Inside the 1899 Harriman Expedition
17
17 Workflow Modules and Sample Digitization Workflows with OCR integrated
The iDigBio DROID and aOCR groups produced a step-by-step series of tasks for implementing OCR in a digitization workflow.
Project-specific workflows are available from RBGE, NYBG, SALIX2, ASU Herbarium, ScioTR, TTD-TCN, … Yours?
18
18 OCR use, Voice Recognition, User Interface Optimization, Image Analysis, …
– aOCR WG and Synthesys3 user-interface interest group
– exemplar ML and NLP workflows
– combining OCR with voice recognition software in Symbiota (Macroalgal TCN)
– automated image analysis
– combining touch-screen technology into the digitization workflow (ScioTR) ($6.99)
Got Text? Got Handwriting?
19
19 Tack så mycket! (Thank you very much!)
Work presented here made possible by many and especially…
– Andrea Matsunaga, Researcher, iDigBio
– Miao Chen, Indiana University, Data to Insight Center
– Jason Best, Botanical Research Institute of Texas
– Sylvia Orli, IT Head, Smithsonian Botany Department
– William Ulate, Technical Director, BHL
– Reed Beaman, Informatics Specialist, iDigBio
– Elspeth Haston, et al., Royal Botanic Garden Edinburgh (RBGE)
– Stephen Gottschalk, New York Botanical Garden (NYBG)
– iDigBio Augmenting Optical Character Recognition WG
– MaCC TCN
– SALIX2
– Smithsonian
20
Find out more at iDigBio
– www.idigbio.org
– facebook.com/iDigBio
– twitter.com/iDigBio
– vimeo.com/idigbio
– idigbio.org/rss-feed.xml
– webcal://www.idigbio.org/events-calendar/export.ics
21
21 Web Service-Based Word Cloud
Text file: http://aocr1.acis.ufl.edu/datasets/lichens/silver/ocr/WebrootDatasetsLichensSilverOcr.txt
Created by sending a text file to this cloud generator: http://www.jasondavies.com/wordcloud/#http%3A%2F%2Faocr1.acis.ufl.edu%2Fdatasets%2Flichens%2Fsilver%2Focr%2FWebrootDatasetsLichensSilverOcr.txt
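The generator link on this slide is the word cloud page address followed by `#` and the percent-encoded URL of the text file. A short sketch of building such a link with Python's standard library is below; the function name is hypothetical, and the pattern is inferred from the link shown on the slide rather than from any documented API.

```python
from urllib.parse import quote

# Base address of the word cloud generator, as shown on the slide.
WORDCLOUD_BASE = "http://www.jasondavies.com/wordcloud/#"


def wordcloud_link(text_url):
    """Append the fully percent-encoded text file URL after the '#'."""
    # safe="" also encodes ':' and '/', matching the link on the slide.
    return WORDCLOUD_BASE + quote(text_url, safe="")


ocr_file = (
    "http://aocr1.acis.ufl.edu/datasets/lichens/silver/ocr/"
    "WebrootDatasetsLichensSilverOcr.txt"
)
print(wordcloud_link(ocr_file))
```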
22
22 OCR text