Published by Frank McKinney. Modified over 9 years ago.
1
Edward Gilbert, Corinna Gries, Thomas H. Nash III, Robert Anglin
2
NSF ADBC (#1115116)
~2.3 million specimens
90% of all specimens
900,000 lichens
1.4 million bryophytes
> 60 non-governmental US herbaria (95%)
Mexico, US, Canada
16 digitization centers
4
Lichen Consortium: http://lichenportal.org, 34 collections, 902,664 records
Bryophyte Consortium: http://bryophyteportal/, 26 collections, 1,300,135 records
Symbiota software
5
Imaging Stage
Capture Image: barcode in file name
Create Skeleton File: species name, country, state, exsiccati, etc.
Upload to FTP server
Image processing: extract barcode, create web versions, map to portal DBs
Herbarium Database
Automated OCR: Tesseract, ABBYY
Existing Record: simply link image
Upload to FTP server
Image URLs
Manage Specimen Data in Portal
Manage / Review Records in Portal
Symbiota Editor: review, edit, keystroke
Create New Record: barcode, image, skeletal data
Automated NLP: Darwin Core Parsing
6
1. Iterate through “unprocessed” images
2. OCR via Tesseract (version 3)
   a) In focus, good lighting, minimal noise
   b) Resolution: >20px x-height
3. Database raw text block
4. Progress to next step
   a) Low OCR return => hand processing
   b) Natural Language Processing
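The loop above can be sketched roughly as follows. This is a minimal illustration, not the portal's actual code: the `ocr` callable, the `OcrResult` type, and the `MIN_CHARS` threshold for a "low OCR return" are all assumptions made for the example. The OCR engine is passed in as a function so the routing logic can be read (and run) without Tesseract installed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical cutoff: fewer usable characters than this counts as a
# "low OCR return". The real threshold is not given in the slides.
MIN_CHARS = 20


@dataclass
class OcrResult:
    image: str       # image identifier (path or URL)
    raw_text: str    # raw OCR text block, stored unmodified
    next_step: str   # "nlp" or "hand_processing"


def process_unprocessed(
    images: Iterable[str],
    ocr: Callable[[str], str],
    min_chars: int = MIN_CHARS,
) -> List[OcrResult]:
    """Run OCR on each unprocessed image and decide the next step."""
    results = []
    for image in images:
        raw_text = ocr(image)
        # Count letters and digits only, so a block of punctuation
        # noise does not look like a successful OCR run.
        usable = sum(ch.isalnum() for ch in raw_text)
        step = "nlp" if usable >= min_chars else "hand_processing"
        results.append(OcrResult(image, raw_text, step))
    return results
```

With Tesseract available, `ocr` could be a small wrapper such as `lambda path: pytesseract.image_to_string(Image.open(path))`; writing each `OcrResult` to a database is left out of the sketch.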
7
Issues: old fonts; faded labels; form labels; handwritten labels; specialized terms
Solutions: image treatments; OCR tuning; dictionaries; consensus OCR
Example OCR output:
¢_].L.|»‘¢.'».f.'._..‘~,(.J fin-x‘*\'a:"511z:1 wf.~\:'i/.onli State University P.’~.r"~2=,_. gg J:.2 " J*J*" †(=:\‘-“ax "»..'\-12 ‘ “ "‘ ;T~;‘~7i?»-1_1_\f;>sf`;,' ESX Z»ie+‘-». “~'.»te;~:i_.t<» ff`t;~f3":.f.“ » »4 xx,, """‘“”T"’.t;;a¢f~rus ’ V4 J 'if. r°'° M '1?nies ivain.) Sav. neutal Station - " '1 ~»r';;4-\P ` 1. T11./P..,J..-. ELEV. ' `.fJL_\ LATL Q _‘ 1 _ Y’ DATE _,. W5. (> f-, -:‘; i f>i_T ~~. A 1: ». v\.-v »~. 4. a xvala 8/27/73
PLANTS OF NEW r~1ExIco Herbarium of Arizona State University Parmelia ulophyllodes (Vain.) Sav. COUNTY “°”““ Joranada Experimental Station - New Mexico State University "“““' on Juniperus ELEV. ‘ 4400 EEILLEETUR DATE DU T. H. Nash #7914 8/27/73 T. H. N.
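One simple way to combine the "dictionaries" and "consensus OCR" ideas is to score each candidate OCR output by the share of its words found in a vocabulary, then keep the best-scoring candidate. The sketch below is an assumption about how that could look, not the method used in the project; `dictionary_score`, `best_candidate`, and the vocabulary are hypothetical.

```python
import re
from typing import Iterable, Set


def dictionary_score(text: str, vocabulary: Set[str]) -> float:
    """Fraction of alphabetic tokens (3+ letters) found in the vocabulary."""
    tokens = re.findall(r"[A-Za-z]{3,}", text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in vocabulary)
    return hits / len(tokens)


def best_candidate(candidates: Iterable[str], vocabulary: Set[str]) -> str:
    """Pick the candidate OCR output with the highest dictionary score."""
    return max(candidates, key=lambda text: dictionary_score(text, vocabulary))
```

The candidates might come from different image treatments of the same label or from different engines (the workflow mentions both Tesseract and ABBYY). A vocabulary built from taxon names and place names would cover the "specialized terms" issue better than a general word list.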
8
1. Iterate through raw OCR text blocks
2. Parse text block
   a) Darwin Core
   b) Populate database
3. Review
   a) Adjust content
   b) Approve
   c) Handwritten => keystroke
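As a rough illustration of the parsing step, the sketch below pulls a few Darwin Core fields out of a raw OCR block with regular expressions. The patterns are assumptions tuned to labels like the "T. H. Nash #7914 8/27/73" example and would not cover the variety of real label layouts; `parse_label` is a hypothetical helper. The dictionary keys are standard Darwin Core term names.

```python
import re
from typing import Dict

# Initials followed by a surname, then "#" and a collector number.
COLLECTOR_RE = re.compile(
    r"(?P<recordedBy>(?:[A-Z]\.\s*)+[A-Z][a-z]+)\s*#\s*(?P<recordNumber>\d+)"
)
# Dates written as month/day/year with a 2- or 4-digit year.
DATE_RE = re.compile(r"\b(?P<date>\d{1,2}/\d{1,2}/\d{2,4})\b")
# "ELEV." followed by up to five noise characters and a number.
ELEV_RE = re.compile(r"ELEV\.?\D{0,5}(?P<elev>\d{2,5})", re.IGNORECASE)


def parse_label(raw_text: str) -> Dict[str, str]:
    """Extract a partial Darwin Core record from one raw OCR text block."""
    record: Dict[str, str] = {}
    match = COLLECTOR_RE.search(raw_text)
    if match:
        record["recordedBy"] = re.sub(r"\s+", " ", match.group("recordedBy"))
        record["recordNumber"] = match.group("recordNumber")
    match = DATE_RE.search(raw_text)
    if match:
        # Kept verbatim: "8/27/73" is ambiguous without more context.
        record["verbatimEventDate"] = match.group("date")
    match = ELEV_RE.search(raw_text)
    if match:
        record["verbatimElevation"] = match.group("elev")
    return record
```

Fields the patterns cannot find are simply left out, which leaves them for the review step to fill in by hand.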
9
Issues: variable layouts; loose standards; OCR error
Solutions: authority tables; Levenshtein distance; word stats; format recognition; parsing profiles; duplicate harvesting
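Authority tables and Levenshtein distance work well together: an OCR token that is within a small edit distance of an authority entry can be corrected to that entry. The following is a minimal sketch under assumed names (`match_authority`, `max_distance`); the threshold of 2 edits is an arbitrary choice for the example.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def match_authority(
    value: str, authority: Iterable[str], max_distance: int = 2
) -> Optional[str]:
    """Return the closest authority entry within max_distance, else None."""
    best, best_distance = None, max_distance + 1
    for entry in authority:
        distance = levenshtein(value.lower(), entry.lower())
        if distance < best_distance:
            best, best_distance = entry, distance
    return best
```

For example, `match_authority("Parmclia", ["Parmelia", "Peltigera"])` resolves the OCR error to `"Parmelia"`, while a token far from every entry returns `None` and stays flagged for review.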
10
1. Extract collector data
   a) Last name, number, date
2. Harvest duplicates from consortium DB
   a) Exact duplicates
   b) Duplicate events
3. High similarity indexes
4. OCR block comparison
5. Consensus record
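The exact-duplicate case can be sketched as a key match on collector last name, number, and date, followed by a per-field majority vote to build the consensus record. This covers only steps 1, 2a, and 5 in a simplified form; the function names are hypothetical, and taking the last whitespace-separated token as the last name is an assumption that fails for names with suffixes such as "III".

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]


def duplicate_key(record: Record) -> Tuple[str, str, str]:
    """Key on collector last name, collector number, and event date."""
    names = record.get("recordedBy", "").split()
    last_name = names[-1].lower() if names else ""
    return (
        last_name,
        record.get("recordNumber", "").strip(),
        record.get("eventDate", "").strip(),
    )


def harvest_duplicates(target: Record, consortium: Iterable[Record]) -> List[Record]:
    """Return consortium records sharing the target's duplicate key."""
    key = duplicate_key(target)
    return [record for record in consortium if duplicate_key(record) == key]


def consensus_record(duplicates: Iterable[Record], fields: Iterable[str]) -> Record:
    """For each field, keep the most common non-empty value."""
    duplicates = list(duplicates)
    consensus: Record = {}
    for field in fields:
        values = [record[field] for record in duplicates if record.get(field)]
        if values:
            consensus[field] = Counter(values).most_common(1)[0][0]
    return consensus
```

Duplicate events (same collector and date, different number) and similarity-based matching would need looser keys or a string-similarity score in place of the equality test.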
11
1. Target similar label formats
2. Use raw OCR to locate “Nash” labels
3. Targeted parsing algorithms
4. Exclude:
   a) Determined by Nash
   b) Author of scientific name
   c) Associated collector
   d) County
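A context check around each occurrence of "Nash" gives a rough idea of how such labels could be located while applying some of the exclusions. The patterns below are assumptions for illustration: they handle the determiner, associated-collector, and county cases only, and do not attempt to detect "Nash" as the author of a scientific name.

```python
import re

NASH_RE = re.compile(r"\bNash\b")
# Text just before "Nash" that marks a determiner or associated collector,
# optionally followed by initials (e.g. "det. T. H. ").
BEFORE_EXCLUDES = re.compile(
    r"(?:\bdet\.?|\bdetermined by|&|\band|\bwith)\s*(?:[A-Z]\.\s*)*$",
    re.IGNORECASE,
)
# Text just after "Nash" that marks a county name.
AFTER_EXCLUDES = re.compile(r"^\s*(?:County|Co\.)", re.IGNORECASE)


def has_nash_collector(raw_text: str) -> bool:
    """True if some occurrence of "Nash" is not in an excluded context."""
    for match in NASH_RE.finditer(raw_text):
        before = raw_text[:match.start()]
        after = raw_text[match.end():]
        if BEFORE_EXCLUDES.search(before) or AFTER_EXCLUDES.search(after):
            continue
        return True
    return False
```

Labels that pass this filter would then go to the parsing routines targeted at that label format; the rest stay in the general queue.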
13
Michael Adamo, Bruce Allen, Meredith Blackwell, Bill Buck, Alina Freire-Fierro, John Freudenstein, Alan Fryday, David Giblin, Karen Hughes, Steffi Ickert-Bond, Timothy James, Jennifer S. Kluse, Matt Von Konrat, Ben Legler, Tatyana Livshultz, Robert Lücking, Francois Lutzoni, Bob Magill, Andrew Miller, Brent Mishler, Donald Pfister, Richard Rabeler, Malcolm Sargent, Edward Schilling, Michaela Schmull, Blanka Shaw, Jon Shaw, Carol Shearer, Larry StClair, Barbara Thiers
Funded by the NSF ADBC program