OCR and SALIX Parsing Daryl Lafferty Arizona State University October, 2012.



SALIX: Semi-Automatic Label Information eXtraction
SALIX was developed at Arizona State University from 2009 through [year missing in source].
Over 55,000 ASU Herbarium specimen labels were digitized using SALIX.

Ideal SALIX Process Flow
The ideal process flow is:
1. Photograph the specimen label.
2. Perform OCR on the photograph.
3. Have SALIX parse the resulting text into database categories.
4. Upload the results to the database.

Practical SALIX Process Flow
The actual process flow has added steps:
1. Photograph the specimen label.
2. Perform OCR on the photograph.
3. Correct any OCR errors and tweak the text layout.
4. Have SALIX parse the resulting text into database categories.
5. Correct any mis-parsed results.
6. Upload the results to the database.
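To make the parse and review steps concrete, here is a minimal sketch of sorting label text into database categories and flagging the fields that still need a person's attention. It is not SALIX's actual parsing method, which the slides do not describe. The field names, the regular expressions, and the sample label are all assumptions made up for illustration.

```python
import re

# Hypothetical field patterns, chosen only for this illustration.
# They are not taken from SALIX.
FIELD_PATTERNS = {
    "collector": re.compile(r"(?:Coll\.|Collector:?)\s*(?P<value>.+)", re.IGNORECASE),
    "elevation": re.compile(
        r"(?:Elev\.|Elevation:?)\s*(?P<value>\d[\d,]*\s*(?:m|ft))", re.IGNORECASE
    ),
    "date": re.compile(r"(?P<value>\d{1,2}\s+[A-Z][a-z]+\.?\s+\d{4})"),
}


def parse_label(text):
    """Return a dict of the fields found in one label's OCR text."""
    record = {}
    for line in text.splitlines():
        for field, pattern in FIELD_PATTERNS.items():
            if field in record:
                continue  # keep the first match for each field
            match = pattern.search(line)
            if match:
                record[field] = match.group("value").strip()
    return record


def fields_needing_review(record):
    """List the fields the parser could not fill, for manual correction."""
    return [field for field in FIELD_PATTERNS if field not in record]


# Made-up label text standing in for corrected OCR output.
label_text = (
    "Plants of Arizona\n"
    "Quercus gambelii Nutt.\n"
    "Elev. 2100 m\n"
    "Coll. A. Smith\n"
    "12 May 1998"
)
record = parse_label(label_text)
print(record)
print(fields_needing_review(record))
```

Any field listed by `fields_needing_review` would be filled in by hand before upload, which corresponds to the "correct any mis-parsed results" step.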

OCR Workflow
We use ABBYY Professional Version 10.
We capture an image of the full specimen, and another of just the label for OCR.
Processing is done in batch mode, usually run overnight on a folder containing hundreds of images.
The result is a single text file with one label per page.
OCR errors are corrected in the text file before processing with SALIX.
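A batch output file with one label per page has to be split back into individual labels before each one can be parsed. The sketch below assumes pages are separated by a form-feed character, a common convention in plain-text exports. The slides do not say which separator the OCR output actually uses, and the sample text is made up.

```python
def split_labels(ocr_text, page_break="\f"):
    """Split batch OCR output into one text block per label.

    The page_break default (form feed) is an assumption about the
    export format, not something stated in the presentation.
    """
    pages = (page.strip() for page in ocr_text.split(page_break))
    return [page for page in pages if page]  # drop empty trailing pages


# Made-up two-label sample standing in for a batch OCR text file.
sample = "Plants of Arizona\nQuercus gambelii\fPlants of Arizona\nPinus ponderosa\f"
labels = split_labels(sample)
print(len(labels))
```

If the export uses a different separator, only the `page_break` argument needs to change.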

The SALIX User Interface

Manual Data Entry

A label that results in many OCR errors

A label that results in few OCR errors

Label Length and Quality
We first categorized 4 different label types, with the following average characteristics:
We then had 3 students each process 10 labels of each category (40 labels total) through SALIX and typed into the Symbiota form.

Sample Throughput Data

Conclusions
OCR quality has a strong effect on semi-automated parsing throughput using SALIX.
OCR using ABBYY in batch mode was most efficient for our workflow.
The relationship is roughly: [equation missing in source], where S = ratio of SALIX throughput to typing throughput, and E = OCR error rate, stated as OCR errors per 100 words.
(The relationship isn't accurate as E approaches zero, i.e. below about 2 errors per 100 words.)
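The two quantities defined above, E and S, can be computed directly from counts collected during a trial. This sketch covers only those definitions and does not attempt the relationship between them. The function names and the example numbers are made up for illustration and are not data from the study.

```python
def errors_per_100_words(error_count, word_count):
    """E: OCR errors per 100 words of label text."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return 100.0 * error_count / word_count


def throughput_ratio(salix_labels_per_hour, typing_labels_per_hour):
    """S: SALIX throughput divided by typing throughput."""
    if typing_labels_per_hour <= 0:
        raise ValueError("typing_labels_per_hour must be positive")
    return salix_labels_per_hour / typing_labels_per_hour


# Illustrative numbers only.
E = errors_per_100_words(error_count=6, word_count=120)
S = throughput_ratio(salix_labels_per_hour=30.0, typing_labels_per_hour=20.0)
print(E, S)
```

A value of S above 1 means SALIX was faster than typing for that batch, and a value below 1 means typing was faster.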

Acknowledgements
All of the data presented here comes from Anne Barber's Master's Thesis, completed at ASU in May [year missing in source].
Anne also developed the process flow that helped optimize SALIX throughput.
The overall project was under the direction of Les Landrum, curator of the ASU Herbarium.