Edward Gilbert, Corinna Gries, Thomas H. Nash III, Robert Anglin


1
Edward Gilbert, Corinna Gries, Thomas H. Nash III, Robert Anglin

2
• NSF ADBC (#1115116)
• ~2.3 million specimens
• 90% of all specimens
• 900,000 lichens
• 1.4 million bryophytes
• > 60 non-governmental US herbaria (95%)
• Mexico, US, Canada
• 16 digitization centers

3

4
• Lichen Consortium
  • http://lichenportal.org
  • 34 Collections
  • 902,664 Records
• Bryophyte Consortium
  • http://bryophyteportal/
  • 26 Collections
  • 1,300,135 Records
• Symbiota software

5
Workflow diagram (boxes as labeled on the slide):
• Imaging Stage
• Capture Image: barcode in file name
• Create Skeleton File: species name, country, state, exsiccati, etc.
• Upload to FTP server
• Image processing: extract barcode, create web versions, map to portal DBs
• Herbarium Database
• Automated OCR: Tesseract, ABBYY
• Existing Record: simply link image
• Upload to FTP server
• Image URLs
• Manage Specimen Data in Portal
• Manage / Review Records in Portal
• Symbiota Editor: review, edit, keystroke
• Create New Record: barcode, image, skeletal data
• Automated NLP: Darwin Core Parsing
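The workflow relies on the barcode being part of the image file name so that image processing can extract it and link the image to a record. A minimal sketch of that extraction step follows. The naming convention (letters, then digits, with an optional `_suffix` for extra views) and the function name `barcode_from_filename` are assumptions for illustration; the slides do not specify the actual pattern.

```python
import re
from pathlib import PurePath
from typing import Optional

# Hypothetical convention: institution prefix (letters) followed by digits,
# optionally followed by a suffix such as "_1" for additional views.
BARCODE_PATTERN = re.compile(r"^([A-Za-z]+-?\d+)(?:_\w+)?$")


def barcode_from_filename(path: str) -> Optional[str]:
    """Return the barcode embedded in an image file name, or None if absent."""
    stem = PurePath(path).stem  # file name without directory or extension
    match = BARCODE_PATTERN.match(stem)
    return match.group(1).upper() if match else None
```

Files that return `None` would presumably need manual attention before they can be mapped to a portal record.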

6
1. Iterate through “unprocessed” images
2. OCR via Tesseract (version 3)
   a) In focus, good lighting, minimal noise
   b) Resolution: >20px x-height
3. Database raw text block
4. Progress to next step
   1. Low OCR return => hand processing
   2. Natural Language Processing
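The routing decision in step 4 can be sketched as a simple check on how much usable text the OCR pass returned. This is only an illustration: the threshold `MIN_WORDS`, the word-like token heuristic, and the function names are assumptions, since the slides do not say how "low OCR return" is measured. The raw text is taken as already produced by the OCR engine.

```python
import re

MIN_WORDS = 5  # hypothetical threshold; the slides do not give a value


def count_wordlike_tokens(raw_text: str) -> int:
    """Count runs of three or more letters, a crude signal of usable OCR."""
    return len(re.findall(r"[A-Za-z]{3,}", raw_text))


def next_step(raw_text: str) -> str:
    """Route a raw OCR text block to NLP or to hand processing."""
    if count_wordlike_tokens(raw_text) >= MIN_WORDS:
        return "nlp"
    return "hand_processing"
```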

7
• Issues
  • Old fonts
  • Faded labels
  • Form labels
  • Handwritten labels
  • Specialized terms
• Solutions
  • Image treatments
  • OCR tuning
  • Dictionaries
  • Consensus OCR

Raw OCR output shown on the slide (two versions of the same label):

¢_].L.|»‘¢.'».f.'._..‘~,(.J fin-x‘*\'a:"511z:1 wf.~\:'i/.onli State University P.’~.r"~2=,_. gg J:.2 " J*J*" ” (=:\‘-“ax "»..'\-12 ‘ “ "‘ ;T~;‘~7i?»-1_1_\f;>sf`;,' ESX Z»ie+‘-». “~'.»te;~:i_.t<» ff`t;~f3":.f.“ » »4 xx,, """‘“”T"’.t;;a¢f~rus ’ V4 J 'if. r°'° M '1?nies ivain.) Sav. neutal Station - " '1 ~»r';;4-\P ` 1. T11./P..,J..-. ELEV. ' `.fJL_\ LATL Q _‘ 1 _ Y’ DATE _,. W5. (> f-, -:‘; i f>i_T ~~. A 1: ». v\.-v »~. 4. a xvala 8/27/73

PLANTS OF NEW r~1ExIco Herbarium of Arizona State University Parmelia ulophyllodes (Vain.) Sav. COUNTY “°”““ Joranada Experimental Station - New Mexico State University "“““' on Juniperus ELEV. ‘ 4400 EEILLEETUR DATE DU T. H. Nash #7914 8/27/73 T. H. N.
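The slide lists dictionaries and consensus OCR among the solutions but does not describe how they are applied. One simple way to combine the two ideas, sketched here under stated assumptions, is to score each OCR output of the same label by the share of its tokens found in a vocabulary and keep the best one. This selects a whole output rather than merging outputs, so it is a simplification of a true consensus; the function names and the vocabulary are hypothetical.

```python
import re
from typing import Iterable, Set


def dictionary_score(text: str, vocabulary: Set[str]) -> float:
    """Fraction of alphabetic tokens in `text` that appear in `vocabulary`."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in vocabulary)
    return hits / len(tokens)


def best_ocr_output(candidates: Iterable[str], vocabulary: Set[str]) -> str:
    """Pick the candidate OCR output with the highest dictionary score.

    Assumes at least one candidate is supplied.
    """
    return max(candidates, key=lambda text: dictionary_score(text, vocabulary))
```

The vocabulary is expected in lowercase; in practice it could hold general words plus the specialized terms (taxon names, localities) that the slide mentions as a problem for OCR.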

8
1. Iterate through raw OCR text blocks
2. Parse text block
   1. Darwin Core
   2. Populate database
3. Review
   1. Adjust content
   2. Approve
   3. Handwritten => keystroke
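To make the parsing step concrete, here is a rough sketch that pulls a few Darwin Core terms (`recordedBy`, `recordNumber`, `verbatimEventDate`, `verbatimElevation`) out of a raw text block with regular expressions. The patterns are assumptions tuned to a label laid out like the "T. H. Nash #7914 8/27/73" example on the previous slide; they are not the portal's actual parser, and real labels vary far more.

```python
import re
from typing import Dict

# Hypothetical patterns for one label layout.
COLLECTOR = re.compile(r"((?:[A-Z]\.\s*)+[A-Z][a-z]+)\s+#?(\d+)")
DATE = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{2,4})\b")
ELEVATION = re.compile(r"ELEV\.?\D{0,5}(\d+)")


def parse_label(raw_text: str) -> Dict[str, str]:
    """Map a raw OCR text block to a partial Darwin Core record."""
    record: Dict[str, str] = {}
    collector = COLLECTOR.search(raw_text)
    if collector:
        record["recordedBy"] = collector.group(1)
        record["recordNumber"] = collector.group(2)
    date = DATE.search(raw_text)
    if date:
        record["verbatimEventDate"] = date.group(1)
    elevation = ELEVATION.search(raw_text)
    if elevation:
        # Units are not stated on the label, so the value is kept verbatim.
        record["verbatimElevation"] = elevation.group(1)
    return record
```

Fields that cannot be found are simply omitted, which leaves them for the review step.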

9
• Issues
  • Variable layouts
  • Loose standards
  • OCR error
• Solutions
  • Authority tables
  • Levenshtein distance
  • Word stats
  • Format recognition
  • Parsing profiles
  • Duplicate harvesting
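Authority tables and Levenshtein distance work together: an OCR token that is a small edit distance from an authority entry can be snapped to that entry. The sketch below implements the standard dynamic-programming edit distance and a lookup helper. The helper name `closest_authority`, the `max_distance` cutoff, and the sample authority entries in the usage are assumptions for illustration.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_authority(term: str, authority: Iterable[str],
                      max_distance: int = 2) -> Optional[str]:
    """Return the nearest authority entry within max_distance, else None."""
    best = min(authority,
               key=lambda entry: levenshtein(term.lower(), entry.lower()),
               default=None)
    if best is None or levenshtein(term.lower(), best.lower()) > max_distance:
        return None
    return best
```

For example, with a hypothetical locality table `["Jornada", "Juniperus", "Parmelia"]`, the OCR token "Joranada" is one deletion away from "Jornada" and would be matched to it.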

10
1. Extract collector data
   a) Last name, number, date
2. Harvest duplicates from consortium DB
   a) Exact duplicates
   b) Duplicate events
3. High similarity indexes
4. OCR block comparison
5. Consensus record
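The exact-duplicate case can be sketched as grouping records on the key the slide names: collector last name, collector number, and date. The record layout (dicts with `recordedBy`, `recordNumber`, `eventDate`) and the function names are assumptions; the consortium database schema is not shown in the slides, and duplicate events, similarity indexes, and consensus building are not covered here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str, str]


def duplicate_key(record: Dict[str, str]) -> Key:
    """Build (last name, number, date) from a record's collector data."""
    name_parts = record["recordedBy"].split()
    last_name = name_parts[-1].lower() if name_parts else ""
    return (last_name, record["recordNumber"].strip(), record["eventDate"].strip())


def find_duplicates(records: Iterable[Dict[str, str]]) -> Dict[Key, List[Dict[str, str]]]:
    """Group records sharing a key and keep only groups with more than one member."""
    groups: Dict[Key, List[Dict[str, str]]] = defaultdict(list)
    for record in records:
        groups[duplicate_key(record)].append(record)
    return {key: group for key, group in groups.items() if len(group) > 1}
```

Taking the last whitespace-separated word as the last name is a simplification that lets "T. H. Nash" and "Nash" fall into the same group.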

11
1. Target similar label formats
2. Use raw OCR to locate “Nash” labels
3. Targeted parsing algorithms
4. Exclude:
   a) Determined by Nash
   b) Author of scientific name
   c) Associated collector
   d) County
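A crude way to apply these exclusions to raw OCR text is to blank out every occurrence of "Nash" that appears in an excluded role and then check whether any occurrence is left. All four patterns below are hypothetical heuristics written for this sketch (for instance, "det." for determinations and "Genus epithet (Auth.) Nash" for authorship); they will misclassify some labels and are not the targeted parsing algorithms the slide refers to.

```python
import re

INITIALS = r"(?:[A-Z]\.\s*)*"

# Hypothetical patterns, one per exclusion listed on the slide.
EXCLUSION_PATTERNS = [
    # a) determined by Nash: "det. T. H. Nash", "Determined by Nash"
    re.compile(r"\b[Dd]et(?:ermined by|\.)?:?\s+" + INITIALS + r"Nash\b"),
    # b) author of a scientific name: "Genus epithet (Auth.) Nash"
    re.compile(r"\b[A-Z][a-z]+ [a-z]+ (?:\([^)]*\)\s*)?Nash\b"),
    # c) associated collector: "J. Smith & T. H. Nash"
    re.compile(r"(?:&|\band\b|\bwith\b)\s+" + INITIALS + r"Nash\b"),
    # d) county: "Nash County", "Nash Co."
    re.compile(r"\bNash\s+(?:County|Co\.)"),
]


def is_nash_collector_label(raw_text: str) -> bool:
    """True if 'Nash' still occurs after removing the excluded contexts."""
    remaining = raw_text
    for pattern in EXCLUSION_PATTERNS:
        remaining = pattern.sub(" ", remaining)
    return re.search(r"\bNash\b", remaining) is not None
```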

12

13
• Michael Adamo
• Bruce Allen
• Meredith Blackwell
• Bill Buck
• Alina Freire-Fierro
• John Freudenstein
• Alan Fryday
• David Giblin
• Karen Hughes
• Steffi Ickert-Bond
• Timothy James
• Jennifer S. Kluse
• Matt Von Konrat
• Ben Legler
• Tatyana Livshultz
• Robert Lücking
• Francois Lutzoni
• Bob Magill
• Andrew Miller
• Brent Mishler
• Donald Pfister
• Richard Rabeler
• Malcolm Sargent
• Edward Schilling
• Michaela Schmull
• Blanka Shaw
• Jon Shaw
• Carol Shearer
• Larry StClair
• Barbara Thiers

Funded by the NSF ADBC program

