Edward Gilbert, Corinna Gries, Thomas H. Nash III, Robert Anglin


• NSF ADBC (# )
• ~2.3 million specimens
    • 90% of all specimens
    • 900,000 lichens
    • 1.4 million bryophytes
• > 60 non-governmental US herbaria (95%)
• Mexico, US, Canada
• 16 digitization centers

• Lichen Consortium
    • 34 Collections
    • 902,664 Records
• Bryophyte Consortium
    • 26 Collections
    • 1,300,135 Records
• Symbiota software

Workflow diagram (box labels as they appear on the slide):
• Imaging Stage
    • Capture Image: barcode in file name
    • Create Skeleton File: species name, country, state, exsiccati, etc.
    • Upload to FTP server
• Image processing: extract barcode, create web versions, map to portal DBs
• Herbarium Database
    • Upload to FTP server
    • Image URLs
• Existing Record: simply link image
• Create New Record: barcode, image, skeletal data
• Automated OCR: Tesseract, ABBYY
• Automated NLP: Darwin Core Parsing
• Manage Specimen Data in Portal
• Manage / Review Records in Portal
• Symbiota Editor: review, edit, keystroke

1. Iterate through “unprocessed” images
2. OCR via Tesseract (version 3)
    a) In focus, good lighting, minimal noise
    b) Resolution: >20px x-height
3. Database raw text block
4. Progress to next step
    a) Low OCR return => hand processing
    b) Natural Language Processing
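The loop above can be sketched roughly as follows. This is a minimal illustration, not the Symbiota implementation: the table name `raw_ocr`, the `MIN_CHARS` threshold used to decide what counts as a "low OCR return", and the `ocr` callable (standing in for a Tesseract call) are all assumptions made for the example.

```python
import sqlite3
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical cut-off: fewer characters than this counts as a "low OCR return".
MIN_CHARS = 20


def process_unprocessed(
    conn: sqlite3.Connection,
    images: Iterable[Tuple[str, str]],
    ocr: Callable[[str], str],
) -> Dict[str, str]:
    """OCR each (image_id, path) pair, store the raw text, and pick the next step.

    `ocr` is any function mapping an image path to text; in a real pipeline it
    would wrap an OCR engine such as Tesseract.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_ocr ("
        "image_id TEXT PRIMARY KEY, raw_text TEXT, next_step TEXT)"
    )
    routed = {}
    for image_id, path in images:
        text = ocr(path).strip()
        # Short or empty output goes to hand processing; the rest goes on to NLP.
        next_step = "nlp" if len(text) >= MIN_CHARS else "hand"
        conn.execute(
            "INSERT OR REPLACE INTO raw_ocr VALUES (?, ?, ?)",
            (image_id, text, next_step),
        )
        routed[image_id] = next_step
    conn.commit()
    return routed
```

Passing the OCR engine in as a callable keeps the routing logic testable without an OCR installation.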

 Issues  Old fonts  Faded labels  Form labels  Handwritten labels  Specialized terms  Solutions  Image treatments  OCR tuning  Dictionaries  Consensus OCR ¢_].L.|»‘¢.'».f.'._..‘~,(.J fin-x‘*\'a:"511z:1 wf.~\:'i/.onli State University P.’~.r"~2=,_. gg J:.2 " J*J*" ” (=:\‘-“ax "»..'\-12 ‘ “ "‘ ;T~;‘~7i?»-1_1_\f;>sf`;,' ESX Z»ie+‘-». “~'.»te;~:i_.t<» ff`t;~f3":.f.“ » »4 xx,, """‘“”T"’.t;;a¢f~rus ’ V4 J 'if. r°'° M '1?nies ivain.) Sav. neutal Station - " '1 ~»r';;4-\P ` 1. T11./P..,J..-. ELEV. ' `.fJL_\ LATL Q _‘ 1 _ Y’ DATE _,. W5. (> f-, -:‘; i f>i_T ~~. A 1: ». v\.-v »~. 4. a xvala 8/27/73 PLANTS OF NEW r~1ExIco Herbarium of Arizona State University Parmelia ulophyllodes (Vain.) Sav. COUNTY “°”““ Joranada Experimental Station - New Mexico State University "“““' on Juniperus ELEV. ‘ 4400 EEILLEETUR DATE DU T. H. Nash #7914 8/27/73 T. H. N.

1. Iterate through raw OCR text blocks
2. Parse text block
    a) Darwin Core
    b) Populate database
3. Review
    a) Adjust content
    b) Approve
    c) Handwritten => keystroke
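As a rough sketch of the "parse text block" step, the snippet below pulls collector, collector number and date out of a raw OCR block and maps them to the Darwin Core terms `recordedBy`, `recordNumber` and `verbatimEventDate`. The regular expression is a hypothetical pattern fitted to labels of the form "T. H. Nash #7914 8/27/73"; it is not the parser used in the portal, and real labels would need many such patterns.

```python
import re
from typing import Dict

# Hypothetical pattern: initials + surname, "#" + collector number, m/d/y date.
COLLECTOR_LINE = re.compile(
    r"(?P<recordedBy>(?:[A-Z]\.\s*)+[A-Z][a-z]+)\s+"
    r"#(?P<recordNumber>\d+)\s+"
    r"(?P<verbatimEventDate>\d{1,2}/\d{1,2}/\d{2,4})"
)


def parse_label(raw_text: str) -> Dict[str, str]:
    """Return Darwin Core fields found in a raw OCR block, or {} if none match.

    The date is kept verbatim because a two-digit year is ambiguous;
    an empty result would be routed to manual review / keystroking.
    """
    match = COLLECTOR_LINE.search(raw_text)
    return match.groupdict() if match else {}
```

Applied to the cleaner OCR sample from the slides, this would yield `recordedBy` "T. H. Nash", `recordNumber` "7914" and `verbatimEventDate` "8/27/73".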

 Issues  Variable layouts  Loose standards  OCR error  Solutions  Authority tables  Levenshtein distance  Word stats  Format recognition  Parsing profiles  Duplicate harvesting

1. Extract collector data
    a) Last name, number, date
2. Harvest duplicates from consortium DB
    a) Exact duplicates
    b) Duplicate events
3. High similarity indexes
4. OCR block comparison
5. Consensus record
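The duplicate-harvesting steps can be sketched as below. The slides do not define the two duplicate classes, so this example assumes that an "exact duplicate" shares collector last name, number and date, and that a "duplicate event" shares last name and date but carries a different number. Field names, the use of `difflib` for OCR block comparison, and the majority-vote consensus are likewise illustrative choices.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]


def collector_key(rec: Record) -> Tuple[str, str, str]:
    """Normalized (last name, number, date) used to match specimens."""
    return (rec["last_name"].strip().lower(),
            rec["number"].strip(),
            rec["date"].strip())


def harvest_duplicates(target: Record,
                       consortium: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Split consortium records into exact duplicates and duplicate events."""
    name, number, date = collector_key(target)
    exact, events = [], []
    for rec in consortium:
        r_name, r_number, r_date = collector_key(rec)
        if (r_name, r_date) != (name, date):
            continue  # different collector or day: unrelated record
        (exact if r_number == number else events).append(rec)
    return exact, events


def ocr_similarity(a: str, b: str) -> float:
    """Similarity index between two raw OCR blocks, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def consensus_record(records: Iterable[Record], fields: Iterable[str]) -> Record:
    """Majority vote per field across duplicate records (empty values ignored)."""
    records = list(records)
    consensus = {}
    for field in fields:
        values = [r[field] for r in records if r.get(field)]
        if values:
            consensus[field] = Counter(values).most_common(1)[0][0]
    return consensus
```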

1. Target similar label formats
2. Use raw OCR to locate “Nash” labels
3. Targeted parsing algorithms
4. Exclude:
    a) Determined by Nash
    b) Author of scientific name
    c) Associated collector
    d) County
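One way to picture the exclusion step is a filter that accepts a raw OCR block only if "Nash" appears somewhere outside the four excluded contexts. The regular expressions below are hypothetical heuristics written for this sketch (for example, treating a preceding "Genus species" pair as a sign that the name is a taxon author); they are not the targeted parsing algorithms referred to on the slide.

```python
import re

NAME = re.compile(r"\bNash\b")
# Optional initials sitting directly before the surname, e.g. "T. H. ".
INITIALS = r"(?:[A-Z]\.\s*)*$"

# Contexts immediately BEFORE the name that rule it out as primary collector.
EXCLUDE_BEFORE = [
    # a) "Det. by T. H. Nash" / "Determined by Nash"
    re.compile(r"(?:[Dd]et\.|[Dd]etermined)\s*(?:by)?\s*" + INITIALS),
    # c) associated collector: "J. Smith & T. H. Nash", "with Nash"
    re.compile(r"(?:&|\band|\bwith)\s+" + INITIALS),
    # b) author of a scientific name: "Genus species (Auth.) Nash"
    re.compile(r"[A-Z][a-z]+\s+[a-z]{3,}\s+(?:\([^)]*\)\s*)?" + INITIALS),
]
# Contexts immediately AFTER the name that rule it out.
EXCLUDE_AFTER = [
    re.compile(r"^\s+(?:County|Co\.)"),  # d) "Nash County"
]


def nash_is_collector(raw_ocr: str) -> bool:
    """True if at least one 'Nash' occurrence is outside every excluded context."""
    for m in NAME.finditer(raw_ocr):
        before, after = raw_ocr[:m.start()], raw_ocr[m.end():]
        if any(p.search(before) for p in EXCLUDE_BEFORE):
            continue
        if any(p.search(after) for p in EXCLUDE_AFTER):
            continue
        return True
    return False
```

Blocks that pass this filter would then be handed to a parser tuned to that collector's label format; anything ambiguous is better left for review than parsed with the wrong profile.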

• Michael Adamo
• Bruce Allen
• Meredith Blackwell
• Bill Buck
• Alina Freire-Fierro
• John Freudenstein
• Alan Fryday
• David Giblin
• Karen Hughes
• Steffi Ickert-Bond
• Timothy James
• Jennifer S. Kluse
• Matt Von Konrat
• Ben Legler
• Tatyana Livshultz
• Robert Lücking
• Francois Lutzoni
• Bob Magill
• Andrew Miller
• Brent Mishler
• Donald Pfister
• Richard Rabeler
• Malcolm Sargent
• Edward Schilling
• Michaela Schmull
• Blanka Shaw
• Jon Shaw
• Carol Shearer
• Larry StClair
• Barbara Thiers

Funded by the NSF ADBC program