Published by Edward Greene. Modified over 9 years ago.
1
Edward Gilbert, Corinna Gries, Thomas H. Nash III, Robert Anglin
2
16 digitization centers
> 60 non-governmental US herbaria (95%)
Mexico, US, Canada
~ 2.3 million specimens
90% of all specimens
900,000 lichens
1.4 million bryophytes
3
http://lbcc.limnology.wisc.edu/
5
Lichen Consortium: http://lichenportal.org
  Started in 2009
  24 Collections
  ~ 797,916 Records
Bryophyte Consortium: http://bryophyteportal/
  Started in 2010
  16 Collections
  1,059,063 Records
6
Imaging Stage
Capture Image: barcode in file name
Create Skeleton File: barcode, species name, exsiccati, etc.
Upload to FTP server
Image processing: extract barcode, create web versions, map to portal DBs
Duplicate Harvesting
Existing Herbarium Database
Automated Processing: OCR / NLP / Georeferencing
  augmented with raw OCR, parsed fields, coordinates, etc.
Existing Record: simply link image
Upload to FTP server
Image URLs
Manage Specimen Data in Portal
Manage / Review Records in Portal
Symbiota Editor: review, edit, keystroke, and finalize
Create New Record: barcode, image, skeletal data
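The "barcode in file name" convention can be sketched as follows. The slides do not give the barcode format, so the pattern below (an institution prefix followed by digits, optionally followed by a suffix such as `_1`) and the function name `barcode_from_filename` are assumptions for illustration only.

```python
import re
from pathlib import PurePath
from typing import Optional

# Hypothetical barcode format: letters (institution code) followed by digits.
# The real format used by the digitization centers is not given in the slides.
BARCODE = re.compile(r"^([A-Za-z]+\d+)")


def barcode_from_filename(path: str) -> Optional[str]:
    """Return the barcode at the start of an image file name, or None."""
    stem = PurePath(path).stem          # file name without directory or extension
    match = BARCODE.match(stem)
    return match.group(1).upper() if match else None
```

A file named `ASU0012345_1.jpg` would map to the barcode `ASU0012345`, while a file with no leading barcode would be returned as `None` and left for manual handling.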
7
Image all specimens / specimen labels
Collect and load skeletal data
  Barcode, scientific name, country, state
Upload to portal
  Record exists => link image to existing record
  Record absent => create empty “unprocessed” record
Automated OCR of label
  Block of raw text => database
Automated NLP (field parsing)
Review data
  Keystroke full record
  Collector name & number => look for duplicates
  Reparse full record => learnable parsers
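The "record exists / record absent" branch of the upload step might look roughly like this. It is a minimal in-memory sketch: the `Record` class, the `attach_image` function, and the use of a dictionary keyed by barcode are all hypothetical stand-ins for the portal database, which the slides do not describe.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Record:
    """Hypothetical stand-in for a portal specimen record."""
    barcode: str
    status: str = "unprocessed"                      # status given to newly created records
    image_urls: List[str] = field(default_factory=list)
    skeletal: Dict[str, str] = field(default_factory=dict)


def attach_image(records: Dict[str, Record], barcode: str, image_url: str,
                 skeletal: Optional[Dict[str, str]] = None) -> Record:
    """Link an image to the record with this barcode, creating the record if absent."""
    record = records.get(barcode)
    if record is None:
        # Record absent: create an empty "unprocessed" record with skeletal data only.
        record = Record(barcode=barcode, skeletal=dict(skeletal or {}))
        records[barcode] = record
    # In both cases the image is linked to the record.
    record.image_urls.append(image_url)
    return record
```

An existing record keeps its current status and data; only newly created records start as "unprocessed" and move on to OCR.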
8
Tesseract V3
Dual cycle
  Automatic
  Manual review
Expected hurdles
  Handwritten labels
  Old fonts
  Faded labels
  Form labels
Adjustable image variables
[Sample OCR output 1: largely illegible; only fragments such as "State University", "Station", "ELEV.", "DATE", and "8/27/73" are recognizable]
[Sample OCR output 2: mostly legible; OCR errors preserved]
  PLANTS OF NEW r~1ExIco
  Herbarium of Arizona State University
  Parmelia ulophyllodes (Vain.) Sav.
  COUNTY
  Joranada Experimental Station - New Mexico State University
  on Juniperus
  ELEV. 4400
  EEILLEETUR DATE DU
  T. H. Nash #7914 8/27/73
  T. H. N.
9
1. Iterate through new “unprocessed” images
   1. 81,439 bryophyte images
   2. 147,122 lichen images
2. OCR via Tesseract (version 3)
   a) Untreated image
   b) Treated image (contrast, brightness, etc.)
3. Store raw text linked to skeletal record
4. Progress to next step
   1. Low OCR return => hand processing
   2. “Unprocessed-OCR” => NLP
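The OCR cycle above can be sketched without committing to a particular Tesseract binding by passing the OCR and image-treatment steps in as functions. Several details here are assumptions, not taken from the slides: that the better of the two OCR results is kept, that "low OCR return" is measured by counting alphanumeric characters, and the threshold value itself.

```python
from typing import Callable, Tuple, TypeVar

Image = TypeVar("Image")

MIN_USEFUL_CHARS = 40  # hypothetical cutoff for a "low OCR return"


def useful_chars(text: str) -> int:
    """Count letters and digits, ignoring punctuation noise typical of failed OCR."""
    return sum(ch.isalnum() for ch in text)


def ocr_label(image: Image,
              ocr: Callable[[Image], str],
              treat: Callable[[Image], Image],
              min_chars: int = MIN_USEFUL_CHARS) -> Tuple[str, str]:
    """Run OCR on the untreated and treated image and route the result.

    Returns (raw_text, next_status), where next_status is either
    "unprocessed-OCR" (continue to NLP) or "hand processing".
    """
    untreated_text = ocr(image)          # a) untreated image
    treated_text = ocr(treat(image))     # b) contrast, brightness, etc. adjusted
    best = max(untreated_text, treated_text, key=useful_chars)
    status = "unprocessed-OCR" if useful_chars(best) >= min_chars else "hand processing"
    return best, status
```

In practice `ocr` would wrap a call to Tesseract and `treat` an image-adjustment routine; keeping both as parameters makes the routing logic easy to test on its own.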
10
1. Iterate through raw OCR text blocks
   a) 147,122 lichen OCR blocks
   b) 81,439 bryophyte OCR blocks
2. Collector, number, and date
   a) Attempt duplicate harvesting
3. Field-by-field parsing
4. Full parsing
5. Parsing based on NLP profiles
   1. E.g. targeted label formats
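Step 2 (collector, number, and date) could be approached with a regular expression along these lines. The pattern is an illustrative assumption modeled on a single label line, "T. H. Nash #7914 8/27/73"; the parsers actually used in the project are not shown in the slides, and real labels vary far more than this pattern allows.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern: initials + last name, optional "#", collector number, numeric date.
COLLECTOR_EVENT = re.compile(
    r"(?P<collector>(?:[A-Z]\.\s*)+[A-Z][a-z]+)"   # e.g. "T. H. Nash"
    r"\s+#?\s*(?P<number>\d+)"                     # e.g. "#7914"
    r"\s+(?P<date>\d{1,2}/\d{1,2}/\d{2,4})"        # e.g. "8/27/73"
)


def extract_collector_event(ocr_text: str) -> Optional[Dict[str, str]]:
    """Return last name, collector number, and date from raw OCR text, if found."""
    match = COLLECTOR_EVENT.search(ocr_text)
    if match is None:
        return None
    return {
        "last_name": match.group("collector").split()[-1],
        "number": match.group("number"),
        "date": match.group("date"),
    }
```

The three extracted values are the key used to look for duplicates; blocks where the pattern finds nothing would fall through to the other parsing strategies in the list.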
11
1. Extract collector data
   a) Last name, number, date
2. Harvest duplicates from consortium DB
   a) Exact duplicates
   b) Duplicate events
3. Compare return field-by-field
4. Compare fields with raw OCR
5. Populate fields that have high similarity indexes
6. Processing status: “pending review”
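Steps 4 and 5 can be illustrated with a simple similarity check. The slides do not say how the similarity index is computed, so this sketch substitutes Python's `difflib.SequenceMatcher` ratio over a sliding window of OCR words; the 0.8 threshold and both function names are likewise assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict


def best_similarity(value: str, ocr_text: str) -> float:
    """Highest similarity between a field value and any equally long run of OCR words."""
    target_words = value.lower().split()
    ocr_words = ocr_text.lower().split()
    n = len(target_words)
    if n == 0 or len(ocr_words) < n:
        return 0.0
    target = " ".join(target_words)
    best = 0.0
    for i in range(len(ocr_words) - n + 1):
        window = " ".join(ocr_words[i:i + n])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return best


def populate_from_duplicate(duplicate: Dict[str, str], ocr_text: str,
                            threshold: float = 0.8) -> Dict[str, str]:
    """Keep only the duplicate's fields whose values are supported by the raw OCR."""
    return {
        name: value
        for name, value in duplicate.items()
        if value and best_similarity(value, ocr_text) >= threshold
    }
```

Fields that pass the threshold would be copied into the new record, which is then marked "pending review" so that a person confirms the harvested values.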
12
1. Premise: Target similar label formats
2. Use raw OCR to locate “Nash” labels
3. Need to exclude:
   a) Determined by Nash
   b) Author of scientific name
   c) Associated collector
4. Test for similarity to target label format
5. Targeted parsing algorithms
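Steps 2 and 3 might be approximated by looking at the text immediately before each occurrence of "Nash". All three context patterns below are hypothetical heuristics written for this sketch, not the project's rules; in particular the author test (a Genus-plus-epithet pair or an opening parenthesis right before the name) is crude and will misclassify some labels.

```python
import re

# "Nash", optionally preceded by initials such as "T. H."
NASH = re.compile(r"(?:[A-Z]\.\s*)*Nash\b")

# Hypothetical context cues, tested against the text that precedes the name.
DETERMINER = re.compile(r"(?:\bdet\.?|\bdetermined\s+by)[:\s]*$", re.IGNORECASE)
ASSOCIATE = re.compile(r"(?:&|\band\b|\bwith\b)\s*$", re.IGNORECASE)
AUTHOR = re.compile(r"\b[A-Z][a-z]+\s+[a-z]{3,}(?:\s+\([^)]*\))?\s+$|\(\s*$")


def nash_is_collector(ocr_text: str) -> bool:
    """True if some occurrence of "Nash" is not a determiner, name author, or associate."""
    for match in NASH.finditer(ocr_text):
        before = ocr_text[:match.start()]
        if DETERMINER.search(before):   # a) determined by Nash
            continue
        if AUTHOR.search(before):       # b) author of scientific name
            continue
        if ASSOCIATE.search(before):    # c) associated collector
            continue
        return True
    return False
```

Labels that pass this filter would still need the format-similarity test in step 4 before a targeted parser is applied.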