Edward Gilbert, Corinna Gries, Thomas H. Nash III, Robert Anglin



- 16 digitization centers
- > 60 non-governmental US herbaria (95%)
- Mexico, US, Canada
- ~ 2.3 million specimens
  - 90% of all specimens
  - 900,000 lichens
  - 1.4 million bryophytes

- Lichen Consortium
  - Started in 2009
  - 24 Collections
  - ~ 797,916 Records
- Bryophyte Consortium
  - Started in 2010
  - 16 Collections
  - 1,059,063 Records

Workflow diagram (box labels, in extraction order):
- Imaging Stage
- Capture Image: barcode in file name
- Create Skeleton File: barcode, species name, exsiccati, etc.
- Upload to FTP server
- Image processing: extract barcode, create web versions, map to portal DBs
- Duplicate Harvesting
- Existing Herbarium Database
- Automated Processing (OCR / NLP / Georeferencing): augmented with raw OCR, parsed fields, coordinates, etc.
- Existing Record: simply link image
- Upload to FTP server
- Image URLs
- Manage Specimen Data in Portal
- Manage / Review Records in Portal
- Symbiota Editor: review, edit, keystroke, and finalize
- Create New Record: barcode, image, skeletal data

- Image all specimens / specimen labels
- Collect and load skeletal data
  - Barcode, scientific name, country, state
- Upload to portal
  - Record exists => link image to existing record
  - Record absent => create empty "unprocessed" record
- Automated OCR of label
  - Block of raw text => database
- Automated NLP (field parsing)
- Review data
  - Keystroke full record
  - Collector name & number => look for dups
  - Reparse full record => learnable parsers
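The upload step above (link the image when a record exists, otherwise create an empty "unprocessed" record) can be sketched as follows. This is a minimal illustration, not the portal's actual code: the `Record` class, the `attach_image` function, and the in-memory dictionary keyed by barcode are all hypothetical stand-ins for the portal database.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for a portal specimen record."""
    barcode: str
    status: str
    images: list = field(default_factory=list)


def attach_image(records, barcode, image_url):
    """Link an image to the record with this barcode, creating one if absent."""
    record = records.get(barcode)
    if record is None:
        # Record absent: create an empty "unprocessed" record for later OCR/NLP.
        record = Record(barcode=barcode, status="unprocessed")
        records[barcode] = record
    # Record exists (or was just created): simply link the image.
    record.images.append(image_url)
    return record
```

Keying on the barcode mirrors the slides, where the barcode is carried in the image file name and in the skeletal data.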

- Tesseract V3
- Dual cycle
  - Automatic
  - Manual review
- Expected hurdles
  - Handwritten labels
  - Old fonts
  - Faded labels
  - Form labels
- Adjustable image variables

Raw OCR text from a specimen label image, reproduced as extracted:

¢_].L.|»‘¢.'».f.'._..‘~,(.J fin-x‘*\'a:"511z:1 wf.~\:'i/.onli State University P.’~.r"~2=,_. gg J:.2 " J*J*" ” (=:\‘-“ax "»..'\-12 ‘ “ "‘ ;T~;‘~7i?»-1_1_\f;>sf`;,' ESX Z»ie+‘-». “~'.»te;~:i_.t<» ff`t;~f3":.f.“ » »4 xx,, """‘“”T"’.t;;a¢f~rus ’ V4 J 'if. r°'° M '1?nies ivain.) Sav. neutal Station - " '1 ~»r';;4-\P ` 1. T11./P..,J..-. ELEV. ' `.fJL_\ LATL Q _‘ 1 _ Y’ DATE _,. W5. (> f-, -:‘; i f>i_T ~~. A 1: ». v\.-v »~. 4. a xvala 8/27/73

PLANTS OF NEW r~1ExIco Herbarium of Arizona State University Parmelia ulophyllodes (Vain.) Sav. COUNTY “°”““ Joranada Experimental Station - New Mexico State University "“““' on Juniperus ELEV. ‘ 4400 EEILLEETUR DATE DU T. H. Nash #7914 8/27/73 T. H. N.

1. Iterate through new "unprocessed" images
   - bryophytes images
   - lichens images
2. OCR via Tesseract (version 3)
   a) Untreated image
   b) Treated image (contrast, brightness, etc.)
3. Store raw text linked to skeletal record
4. Progress to next step
   1. Low OCR return => hand processing
   2. "Unprocessed-OCR" => NLP
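A rough sketch of the OCR steps above, under stated assumptions. The slides do not say how the untreated and treated results are combined or what counts as a "low OCR return", so this example keeps whichever pass yields more alphanumeric characters and routes the record using a hypothetical threshold, `MIN_USABLE_CHARS`. The OCR engine and the image treatment are passed in as callables (`run_ocr`, `treat`), so no particular Tesseract binding is assumed.

```python
MIN_USABLE_CHARS = 20  # hypothetical cut-off for a "low OCR return"


def usable_chars(text):
    """Count letters and digits, ignoring punctuation noise."""
    return sum(ch.isalnum() for ch in text)


def ocr_dual_pass(image, run_ocr, treat, min_chars=MIN_USABLE_CHARS):
    """OCR the untreated and treated image; return (best_text, next_status)."""
    untreated_text = run_ocr(image)
    treated_text = run_ocr(treat(image))  # e.g. contrast/brightness adjusted
    # Assumption: keep the pass with more alphanumeric characters.
    best = max(untreated_text, treated_text, key=usable_chars)
    if usable_chars(best) < min_chars:
        return best, "hand-processing"   # low OCR return
    return best, "unprocessed-OCR"       # ready for the NLP step
```

Injecting `run_ocr` and `treat` keeps the routing logic testable without image files; in practice they would wrap the OCR engine and the image adjustments.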

1. Iterate through raw OCR text blocks
   a) lichen OCR blocks
   b) bryophyte OCR blocks
2. Collector, number, and date
   a) Attempt duplicate harvesting
3. Field-by-field parsing
4. Full-parsing
5. Parsing based on NLP profiles
   1. E.g. targeted label formats

1. Extract collector data
   a) Last name, number, date
2. Harvest duplicates from consortium DB
   a) Exact duplicates
   b) Duplicate events
3. Compare return field-by-field
4. Compare fields with raw OCR
5. Populate fields that have high similarity indexes
6. Processing status: "pending review"
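Steps 4 to 6 above can be illustrated with a small sketch. The slides do not name the similarity measure, so this example assumes Python's `difflib.SequenceMatcher` ratio, compared against a sliding window of OCR words, with a hypothetical threshold of 0.8. The function names and the field names in the usage example are illustrative only.

```python
from difflib import SequenceMatcher


def best_window_similarity(value, ocr_text):
    """Best similarity between a field value and any same-length word window."""
    words = ocr_text.split()
    width = max(1, len(value.split()))
    target = value.lower()
    best = 0.0
    for start in range(max(1, len(words) - width + 1)):
        window = " ".join(words[start:start + width]).lower()
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return best


def populate_from_duplicate(duplicate, ocr_text, threshold=0.8):
    """Copy only the duplicate's fields that are well supported by the raw OCR."""
    accepted = {
        name: value
        for name, value in duplicate.items()
        if value and best_window_similarity(value, ocr_text) >= threshold
    }
    accepted["processingStatus"] = "pending review"
    return accepted
```

For example, with raw OCR containing "Parmelia ulophyllodes (Vain.) Sav. on Juniperus", a harvested duplicate's scientific name and substrate would be accepted, while a county value that never appears in the OCR would be left for manual review.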

1. Premise: Target similar label formats
2. Use raw OCR to locate "Nash" labels
3. Need to exclude:
   a) Determined by Nash
   b) Author of scientific name
   c) Associated collector
4. Test for similarity to target label format
5. Targeted parsing algorithms
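One way to approximate steps 2 and 3 above is a pattern match on the raw OCR. The sketch below is a heuristic built on assumptions, not the project's parser: it treats "Nash" as the collector only when the name is followed by a collection number marked with "#" or "No.", and it skips matches preceded by a determiner phrase ("det.", "determined by") or by a connector that suggests an associated collector ("with", "and", "&"). A bare author citation after a scientific name has no collection number and so never matches.

```python
import re

# Assumed collector pattern: optional initials, the surname, then "#"/"No." and digits.
COLLECTOR = re.compile(r"(?:T\.\s*H\.\s*)?Nash\s*(?:#|No\.?)\s*(\d+)", re.IGNORECASE)

# Assumed exclusion cues immediately before the name.
EXCLUDED_PREFIX = re.compile(
    r"(?:\b(?:det(?:ermined)?\.?(?:\s+by)?|with|and)|&)\s*$", re.IGNORECASE
)


def nash_collection_number(ocr_text):
    """Return Nash's collection number if he appears as primary collector, else None."""
    for match in COLLECTOR.finditer(ocr_text):
        if EXCLUDED_PREFIX.search(ocr_text[:match.start()]):
            continue  # determiner or associated collector, not the primary collector
        return match.group(1)
    return None
```

Labels that pass this filter would still need the similarity test in step 4 before a targeted parser is applied.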