Improving search in scanned documents: Looking for OCR mismatches

Improving search in scanned documents: Looking for OCR mismatches
David Morse, David King, Anton Dil, Alistair Willis, David Roberts, Chris Lyal

Introduction
Biological taxonomy
  manage names of organisms
  relationships between organisms
Heavily publication based
  extensive legacy literature from the 15th century
  observations appear in a wide variety of publications: learned society documents, encyclopaedias, institution reports, etc.
  occurrence data, historical trends, geographic clues, etc.
  biodiversity management

Digital libraries and curation
Digitised historical collections
  necessary for practising taxonomists
  possible Natural Language Processing research? (compare Genia/Medline)
Must be searchable on taxonomic names
  also other non-standard English: names (authorities), locations, etc.
  technical language
  non-dictionary terms

Mass digitisation
Most biodiversity literature still on paper only
  but collections are being scanned at a high rate, e.g. the Biodiversity Heritage Library
Several major repositories of biodiversity documents
  BHL
  INOTAXA (BHL + TEI-Lite)
  eFloras
  Plazi
  Internet Archive
Lack of consistent markup
  SciXML, NLM DTD, ...

Markup in ABLE
Previously been using TEI-Lite
  INOTAXA
  structural markup
  semantic content in layout, e.g. indentation for hierarchy
Moving towards taXMLit
  retains structural markup
  includes semantic markup: taxonomic name, authority, operation (new name, synonym, etc.), date, etc.

Digital libraries and curation
Large-scale scanning
  BHL currently: 22,000 volumes, 9.2 million pages
  growth rate: 1,500 volumes / month, 600,000 pages / month
No chance of manual correction/markup
  automatic markup necessary
INOTAXA
  cheaper to manually correct and rekey

Digital libraries and curation
Taxonomic nomenclature recognition difficult to automate
  OCR errors
  variation in fonts
  meaning in layout not generally captured by OCR
  non-dictionary words
Example: "I should refer to Peritaxia, the first ventral suture being nearly straight"

Digital libraries and curation
Huge terminological variation (GBIF)
  Actinobacillus actimomycetemcomitans
  Actinobacillus actimycetemcomitans
  Actinobacillus actinmycetemcomitans
  Actinobacillus actinomicetemcomitans
  Actinobacillus actinomy
  Actinobacillus actinomyce
  Actinobacillus actinomycemcomitans
  Actinobacillus actinomyceremcomitans
  Actinobacillus actinomycetam
  Actinobacillus actinomycetamcomitans
  Actinobacillus actinomycetecomitans
  Actinobacillus actinomycetemcmitans
  Actinobacillus actinomycetemcomintans
  Actinobacillus actinomycetemcomitance
  Actinobacillus actinomycetemcomitans
  Actinobacillus actinomycetemcomitants
  Actinobacillus actinomycetemcommitans
  Actinobacillus actinomycetemocimitans
  Actinobacillus actinomycetencomitans
  Actinobacillus actinomycetum
  Actinobacillus actinomyctemcomitans
  Actinobacillus actinomyectomcomitans
  Actinobacillus actinomyetemcomitans
  Actinobacillus actinonmycetemcomitans
  Actinobacillus actionomycetemcomitans
  Actinobacillus actynomicetemcomitans
  Actinobacillus antinomycetemcomitans
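To give a feel for how close these variants are to one another, the following is a minimal sketch that measures the Levenshtein edit distance from one spelling to a few of the others. Treating "Actinobacillus actinomycetemcomitans" as the reference spelling is an assumption made for illustration; the slide does not say which form is canonical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


# Assumption: this spelling is used as the reference for illustration only.
reference = "Actinobacillus actinomycetemcomitans"

# A few of the variants listed on the slide.
variants = [
    "Actinobacillus actinomycetemcomitants",
    "Actinobacillus actinomycetencomitans",
    "Actinobacillus actinomycetecomitans",
    "Actinobacillus antinomycetemcomitans",
    "Actinobacillus actinomy",
]

distances = {v: levenshtein(reference, v) for v in variants}
for variant, distance in distances.items():
    print(f"{distance:2d}  {variant}")
```

Most of the variants sit a single edit away from the reference, while the truncated forms are much further away, so a single distance threshold is unlikely to catch every variant.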

Variation with OCR
Want to help manage databases of taxons
  lots for taxonomy: GBIF, ITIS, Species 2000, Catalogue of Life, uBIO, ...
  very incomplete
  too few specialists to maintain/integrate
Literature is used to manage these resources, but only if terms in documents can be identified and searched
OCR struggles with these examples

OCR accuracy
Typical OCR accuracy around 95-96% (by word)
  generally on born-digital documents
  known fonts
    for legacy literature, font may be unique to a publication
  standard dictionaries
Error rates for taxons
  20-35% F-score (TaxonFinder, FAT)
Is terminology in the 5% of errors?
  specialist English
  italics in non-standard fonts

Problems with the terminology variation
Misreadings may not be identifiable as errors
  Homa / Homo
  Pica / Pioa
  no canonical reference
Even with multiple OCR readings, may not get the correct form
  RHYNCHOPHOBA (correct)
  BHYNCHOPHOKA (PDF maker)
  KHYNCHOPHOBA (ABBYY)
Attempting to correct the OCR errors is not an option
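Even when neither reading can be trusted, comparing the two OCR outputs character by character shows where they disagree. The sketch below does this for the two readings quoted on the slide; the function name is hypothetical, and it assumes readings of equal length.

```python
def disagreement_positions(a: str, b: str) -> list[int]:
    """Return the indices at which two equal-length readings differ."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles readings of equal length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


pdf_maker = "BHYNCHOPHOKA"  # reading attributed to PDF maker on the slide
abbyy = "KHYNCHOPHOBA"      # reading attributed to ABBYY on the slide

positions = disagreement_positions(pdf_maker, abbyy)
for i in positions:
    print(f"position {i}: {pdf_maker[i]!r} vs {abbyy[i]!r}")
```

The comparison locates the suspect characters but says nothing about which letter, if either, is right.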

New taxons
Not in the business of correcting OCR
  no canonical reference
  aim is to identify new taxons
  community too small to correct most errors
Is an unrecognised word a new taxon?
  Pioa: is it Pica (a type of magpie) or a new term?
  could also be Roa

Proposed approach
No possibility of getting the correct version from the two interpretations
  but we can tell where something's up
Differences between collections of OCRed versions may provide clues
Compare outputs using a sequence alignment algorithm
  Needleman-Wunsch
  word-by-word comparison plus Levenshtein edit distance
  hand-keyed INOTAXA text as reference

What we’re working with
Can’t get at the internals of OCR systems
Have a hand-corrected version of a document
  Biologia Centrali-Americana, Coleoptera v.4 pt.3
  180,553 words
  INOTAXA hand keyed
  + scanned version with 2 OCRs
    BHL NHM (PDF maker)
    IA ABBYY FineReader
    taken from the same JPEG

Needleman-Wunsch algorithm
Global sequence alignment algorithm
Match similar terms against each other or insert gaps
  score for match, mismatch, gap insertion
  minimise score
To align ABCE with ABDE:

  A - A        A - A        A - A
  B - B        B - B        B - B
  C - [ ]      [ ] - D      C - D
  [ ] - D      C - [ ]      E - E
  E - E        E - E

Depending on score function
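The ABCE/ABDE example can be reproduced with a small cost-minimising implementation. This is a sketch, not the authors' code: the cost values (match 0, mismatch 1 or 3, gap 1) are assumptions chosen to show how the score function decides between substituting C for D and inserting two gaps.

```python
def needleman_wunsch(a, b, mismatch=1, gap=1):
    """Global alignment minimising total cost; a match costs 0.

    Returns (total_cost, pairs), where None in a pair marks a gap.
    """
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                cost[i - 1][j] + gap,      # gap in b
                cost[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                pairs.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs


# Cheap mismatch: C is aligned directly with D.
cost_sub, pairs_sub = needleman_wunsch("ABCE", "ABDE", mismatch=1, gap=1)
# Expensive mismatch: two gaps are cheaper than one substitution.
cost_gap, pairs_gap = needleman_wunsch("ABCE", "ABDE", mismatch=3, gap=1)

print(cost_sub, pairs_sub)
print(cost_gap, pairs_gap)
```

With the cheap mismatch the result is the third alignment on the slide; with the expensive mismatch it is one of the two gapped alignments, and which of those two is returned depends only on tie-breaking in the traceback.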

Needleman-Wunsch on OCR outputs
Compare word sequences
  The study of the Otiorhynchinæ Alatæ has unfortunately been delayed
  The study of the Otiorhynchinse Alatee has unfortunately been delayed
  The study of the Otiorhynchinœ Alatse has unfortunately been delayed
Finds mismatches for similar words
  lower penalty for similar terms
  Levenshtein comparison
Gap sequences for lengthy mismatches
So alignment preferable to e.g. diff
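The word-level comparison described on this slide can be sketched as a Needleman-Wunsch alignment whose substitution cost is a normalised Levenshtein distance, so that similar words are cheap to pair and dissimilar words are pushed into gaps. The normalisation and the gap penalty of 0.6 are assumptions made for illustration; the slides do not give the actual scoring values.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def word_cost(x: str, y: str) -> float:
    """Substitution cost in [0, 1]: lower for more similar words."""
    if x == y:
        return 0.0
    return levenshtein(x, y) / max(len(x), len(y))


def align_words(a, b, gap=0.6):
    """Align two word lists; None in a returned pair marks a gap."""
    n, m = len(a), len(b)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], move[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        cost[0][j], move[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (cost[i - 1][j - 1] + word_cost(a[i - 1], b[j - 1]), "diag"),
                (cost[i - 1][j] + gap, "up"),
                (cost[i][j - 1] + gap, "left"),
            ]
            cost[i][j], move[i][j] = min(options)

    # Follow the stored moves back from the bottom-right corner.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif step == "up":
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


ocr_a = "The study of the Otiorhynchinse Alatee has unfortunately been delayed".split()
ocr_b = "The study of the Otiorhynchinœ Alatse has unfortunately been delayed".split()

# Pairs whose two sides differ are the places where "something's up".
mismatches = [(x, y) for x, y in align_words(ocr_a, ocr_b) if x != y]
print(mismatches)
```

Here the two OCR outputs disagree on exactly the two taxonomic terms, which is the kind of signal the approach relies on.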

Sequence of gaps

  schwarzi     schwarzi     MATCH
  34           34           MATCH
  1            1            MATCH
  obsoletus    obsoletus    MATCH
  273,         273,         MATCH

Results
Not bad
  precision good
  recall currently difficult to measure
  hand markup not always consistent
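For reference, precision and recall for flagged mismatches could be computed as below. This is only a sketch: the word positions are hypothetical toy values, not results from the study. It also shows why recall is harder to pin down, since it depends directly on how complete and consistent the reference markup is.

```python
def precision_recall(flagged: set, reference: set) -> tuple[float, float]:
    """Precision and recall of flagged items against a reference set."""
    true_positives = len(flagged & reference)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical word positions, for illustration only.
flagged = {4, 5, 17}       # positions flagged by comparing OCR outputs
reference = {4, 5, 9, 17}  # positions marked as taxonomic terms by hand

precision, recall = precision_recall(flagged, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```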

Future work
Fuzzy search
  highlight difficult terms in the markup
  requires development of appropriate markup language
  partial disambiguation with collocations?
IE possible from large enough collection?
  "tail is longer, ♂, shorter, ♀, with green tip"
  OCR doesn’t recognise the symbols either...
  idiosyncratic language often unique to author
Interpretation from layout conventions
  current OCR not fine grained enough
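As an illustration of what fuzzy search over OCR output might look like, the sketch below uses `difflib.get_close_matches` from the Python standard library to find OCR tokens that resemble a query term. The query spelling "Otiorhynchinae", the helper name and the cutoff of 0.8 are assumptions; the slides do not describe a specific fuzzy search method.

```python
import difflib


def fuzzy_find(query: str, tokens: list[str], cutoff: float = 0.8) -> list[str]:
    """Return distinct tokens similar to the query, best match first."""
    by_lower = {t.lower(): t for t in tokens}  # case-insensitive, de-duplicated
    hits = difflib.get_close_matches(
        query.lower(), list(by_lower), n=len(by_lower), cutoff=cutoff
    )
    return [by_lower[h] for h in hits]


# Tokens from the two OCR outputs quoted earlier in the talk.
tokens = (
    "The study of the Otiorhynchinse Alatee has unfortunately been delayed".split()
    + "The study of the Otiorhynchinœ Alatse has unfortunately been delayed".split()
)

hits = fuzzy_find("Otiorhynchinae", tokens)
print(hits)
```

A search for the intended term would then surface both misread forms, even though neither matches the query exactly.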

Thank you Any questions?