Published by Barbara Morton. Modified over 9 years ago.
1
Digitizing California Arthropod Collections
Peter Oboyski, Phuc Nguyen, Serge Belongie, Rosemary Gillespie
Essig Museum of Entomology
University of California
Berkeley, California, USA
2
Who is CalBug?
o Essig Museum of Entomology
o California Academy of Sciences
o California State Collection of Arthropods
o Bohart Museum, UC Davis
o Entomology Research Museum, UC Riverside
o San Diego Natural History Museum
o LA County Museum
o Santa Barbara Museum of Natural History
4
Digitization workflow
Handling & Imaging
o (Optional) Sort by locality, date, sex, etc.
o Remove labels, add unique identifier
o Take digital image, name and save file
o Replace labels, return to collection
Data Capture
o Manually enter data into MySQL database
o Online crowd-sourcing of manual data entry
o Optical Character Recognition (OCR) & automated data parsing
Data Manipulation
o Error checking
o Geographic referencing
o Aggregate data in online cache
o Temporospatial analyses
5
Why Image Specimens/Labels?
o Data capture can be done remotely
o Magnify difficult-to-read labels
o Potential for OCR
o Verbatim digital archive of label data
6
1st generation - DinoLite digital microscope
8
2nd generation - Digital camera (Canon G9)
9
o Higher resolution
o Labels flat & unobstructed
o Scale bar, controlled light
o Important to add species name to image or file name, e.g. EMEC218958 Paracotalpa ursina.jpg
o ~150,000 images waiting to be databased
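The file-naming convention on this slide makes each image self-describing. As a minimal sketch, assuming every file name follows the pattern of the one example shown (an identifier made of letters and digits, a space, then the species name), the two parts could be recovered like this. The function name and the pattern are illustrative assumptions, not part of the CalBug tooling.

```python
import re
from pathlib import Path

# Assumed pattern, inferred from the single example on the slide:
# uppercase letters + digits (unique identifier), whitespace, species name.
FILENAME_PATTERN = re.compile(r"^(?P<identifier>[A-Z]+\d+)\s+(?P<taxon>.+)$")


def parse_image_name(filename):
    """Split an image file name into (identifier, species name)."""
    stem = Path(filename).stem  # drop the ".jpg" extension
    match = FILENAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return match.group("identifier"), match.group("taxon").strip()


print(parse_image_name("EMEC218958 Paracotalpa ursina.jpg"))
```

Files that do not match the assumed pattern raise an error rather than being silently mis-filed, which seems the safer choice for a backlog of this size.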
10
Data capture
Manually enter data into MySQL database
o Using our own MySQL database (EssigDB)
o Built-in error checking
o Data carry-over from one record to the next
o Taxonomy automatically added
Online crowd-sourcing of manual data entry
o "Notes from Nature"
o Collaboration with Zooniverse
o Citizen Scientist transcription of labels
Optical Character Recognition (OCR) & automated data parsing
o Collaboration with UC San Diego
o Improved word spotting & OCR
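Data carry-over saves typing when consecutive specimens share a collecting event. The sketch below shows one way such a feature might behave. It is not EssigDB code: the field names and the choice of which fields carry over are hypothetical.

```python
def carry_over(previous, entered, carried_fields=("locality", "date", "collector")):
    """Start a new record from the previous one, then apply what was typed.

    previous       -- the record just saved (a dict)
    entered        -- fields typed for the new record; blanks are ignored
    carried_fields -- hypothetical list of fields worth carrying forward
    """
    # Copy only the fields that are likely to repeat between specimens.
    record = {f: previous[f] for f in carried_fields if f in previous}
    # Anything actually typed for the new record overrides the carried value.
    record.update({k: v for k, v in entered.items() if v not in (None, "")})
    return record


first = {"catalog": "X1", "locality": "Berkeley", "date": "1950-05-01"}
second = carry_over(first, {"catalog": "X2", "date": ""})
print(second)
```

Per-specimen fields such as the unique identifier are deliberately left out of `carried_fields`, so they must be entered fresh for each record.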
12
Notes from Nature
Citizen Science data transcription
15
Integrating OCR with crowd sourcing
o Spotting words within images
o Copy-paste, highlight-drag fields
o Auto-detecting repeated "words"
  o e.g. species, states, counties
o Providing an additional "vote" for transcription consensus
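The last point, OCR as an additional "vote", can be illustrated with a simple majority rule. This is a sketch under assumptions: the slides do not say how consensus is computed, so the normalization step and the more-than-half threshold here are illustrative choices.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not split votes."""
    return " ".join(text.lower().split())


def consensus(transcriptions, ocr_reading=None, min_share=0.5):
    """Return the reading backed by more than min_share of the votes, else None.

    transcriptions -- readings typed by volunteers for one label field
    ocr_reading    -- optional OCR output, counted as one extra vote
    """
    votes = [normalize(t) for t in transcriptions if t.strip()]
    if ocr_reading and ocr_reading.strip():
        votes.append(normalize(ocr_reading))
    if not votes:
        return None
    winner, count = Counter(votes).most_common(1)[0]
    return winner if count / len(votes) > min_share else None


# Two volunteers disagree; the OCR reading breaks the tie.
print(consensus(["Alameda", "Almeda"]))
print(consensus(["Alameda", "Almeda"], ocr_reading="Alameda"))
```

With only the two volunteer readings no value has a majority, so the field would stay unresolved; adding the OCR vote tips it.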
16
The OCR challenge for specimen labels
DETECTION: Finding text in a complex matrix
o Machine-typed vs. hand-written labels
o Sliding-window classifier creating text bounding boxes
o >95% detection and localization using pixel-overlap measures
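The slide does not specify which pixel-overlap measure was used. One common choice is intersection over union between a predicted text box and a hand-annotated one, sketched below. The box format `(left, top, right, bottom)` and the 0.5 threshold are assumptions for illustration.

```python
def box_area(box):
    """Area in pixels of a (left, top, right, bottom) box; empty boxes give 0."""
    left, top, right, bottom = box
    return max(0, right - left) * max(0, bottom - top)


def overlap_ratio(a, b):
    """Intersection over union of two boxes, from 0.0 (disjoint) to 1.0 (identical)."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    intersection = box_area(inter)
    union = box_area(a) + box_area(b) - intersection
    return intersection / union if union else 0.0


def detection_rate(truth_boxes, predicted_boxes, threshold=0.5):
    """Share of annotated text boxes matched by some prediction above the threshold."""
    if not truth_boxes:
        return 0.0
    hits = sum(
        1
        for truth in truth_boxes
        if any(overlap_ratio(truth, pred) >= threshold for pred in predicted_boxes)
    )
    return hits / len(truth_boxes)


truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predicted = [(0, 0, 10, 9)]
print(detection_rate(truth, predicted))
```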
17
Current progress in OCR recognition
RECOGNITION: Using the Tesseract OCR engine
Machine type
o 74% word-level accuracy
o 82% character-level accuracy
Handwriting
o 5.4% word-level accuracy
o 9.2% character-level accuracy
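To make the two accuracy levels concrete, here is one plausible way to compute them. The slides do not give the exact definitions, so both are assumptions: word-level accuracy as the share of words recognized exactly, and character-level accuracy as one minus the edit distance divided by the length of the true text.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(truth, recognized):
    """1 - (edit distance / length of the true text), floored at 0."""
    if not truth:
        return 1.0 if not recognized else 0.0
    return max(0.0, 1 - edit_distance(truth, recognized) / len(truth))


def word_accuracy(truth_words, recognized_words):
    """Share of true words matched exactly; assumes the two lists are aligned."""
    if not truth_words:
        return 0.0
    correct = sum(1 for t, r in zip(truth_words, recognized_words) if t == r)
    return correct / len(truth_words)


print(word_accuracy(["Alameda", "Co."], ["Almeda", "Co."]))
print(character_accuracy("Alameda", "Almeda"))
```

A single wrong character fails the whole word, so under these definitions word-level accuracy will generally sit below character-level accuracy, as it does in the figures on the slide.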
19
Where do we go from here?
o Improved recognition of handwriting
o Incorporate OCR into crowd sourcing
o Develop (semi-)automated data parsing
20
Thank you http://calbug.berkeley.edu