Improving search in scanned documents: Looking for OCR mismatches

David Morse, David King, Anton Dil, Alistair Willis, David Roberts, Chris Lyal

Introduction
  Biological taxonomy
    manage names of organisms
    relationships between organisms
  Heavily publication based
    extensive legacy literature from 15th century
    observations appear in wide variety of publications
      learned society documents, encyclopaedias, institution reports, etc.
    occurrence data, historical trends, geographic clues etc.
    biodiversity management

Digital libraries and curation
  Digitised historical collections
    necessary for practising taxonomists
    possible Natural Language Processing research?
      compare Genia/Medline
  Must be searchable on taxonomic names
    also other non-standard English:
      names (authorities)
      locations etc.
      technical language
      non-dictionary terms

Mass digitisation
  Most biodiversity literature still on paper only
    but collections being scanned at high rate
    eg. the Biodiversity Heritage Library
  Several major repositories of biodiversity documents
    BHL
    INOTAXA (BHL + TEI-Lite)
    eFloras
    Plazi
    Internet Archive
  Lack of consistent markup
    SciXML, NLM DTD, ...

Markup in ABLE
  Previously been using TEI-Lite
    INOTAXA
    structural markup
    semantic content in layout
      eg. indentation for hierarchy
  Moving towards taXMLit
    retains structural markup
    includes semantic markup
      taxonomic name, authority, operation (new name, synonym etc), date etc.

Large scale scanning
  BHL
    currently: 22,000 volumes, 9.2 million pages
    growth rate: 1,500 volumes / month, 600,000 pages / month
    No chance of manual correction/markup
    Automatic markup necessary
  INOTAXA
    Cheaper to manually correct and rekey

Taxonomic nomenclature
  recognition difficult to automate
    OCR errors
    variation in fonts
    meaning in layout not generally captured by OCR
    non-dictionary words
      "I should refer to Peritaxia, the first ventral suture being nearly straight"

huge terminological variation (GBIF)
  Actinobacillus actimomycetemcomitans
  Actinobacillus actimycetemcomitans
  Actinobacillus actinmycetemcomitans
  Actinobacillus actinomicetemcomitans
  Actinobacillus actinomy
  Actinobacillus actinomyce
  Actinobacillus actinomycemcomitans
  Actinobacillus actinomyceremcomitans
  Actinobacillus actinomycetam
  Actinobacillus actinomycetamcomitans
  Actinobacillus actinomycetecomitans
  Actinobacillus actinomycetemcmitans
  Actinobacillus actinomycetemcomintans
  Actinobacillus actinomycetemcomitance
  Actinobacillus actinomycetemcomitans
  Actinobacillus actinomycetemcomitants
  Actinobacillus actinomycetemcommitans
  Actinobacillus actinomycetemocimitans
  Actinobacillus actinomycetencomitans
  Actinobacillus actinomycetum
  Actinobacillus actinomyctemcomitans
  Actinobacillus actinomyectomcomitans
  Actinobacillus actinomyetemcomitans
  Actinobacillus actinonmycetemcomitans
  Actinobacillus actionomycetemcomitans
  Actinobacillus actynomicetemcomitans
  Actinobacillus antinomycetemcomitans
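One way to see how close these variants are to one another is to count character edits between them. The sketch below is illustrative only: it takes "actinomycetemcomitans" as the reference spelling (an assumption, since the slide does not mark any form as canonical) and measures Levenshtein distance to three of the listed variants. The function and variable names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


# Assumed reference form; the slide does not say which spelling is canonical.
reference = "actinomycetemcomitans"
variants = [
    "actinomycetemcomitants",
    "actinomycetecomitans",
    "antinomycetemcomitans",
]
distances = {v: levenshtein(reference, v) for v in variants}

for variant, d in distances.items():
    print(f"{variant}: {d} edit(s) from {reference}")
```

Each of these three variants is a single edit away from the assumed reference, which is why exact-match search misses them and a tolerance-based comparison is attractive.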

Variation with OCR
  Want to help manage databases of taxons
    lots for taxonomy
      GBIF, ITIS, Species 2000, Catalogue of life, uBIO, ...
    very incomplete
    too few specialists to maintain/integrate
  Literature used to manage these resources, but only if can identify/search terms in documents.
  OCR struggles with these examples

OCR accuracy
  Typical OCR accuracy around 95-96% (by word)
    Generally on born digital documents
      known fonts
        for legacy literature, font may be unique to a publication
      standard dictionaries
  Error rates for taxons
    20-35% F-score (TaxonFinder, FAT)
  Is terminology in the 5% of errors?
    specialist English
    italics in non-standard fonts

Problems with the terminology variation
  Misreadings may not be identifiable as errors
    Homa / Homo
    Pica / Pioa
    no canonical reference
  Even with multiple OCR readings, may not get the correct form
    RHYNCHOPHOBA (correct)
    BHYNCHOPHOKA (PDF maker)
    KHYNCHOPHOBA (ABBYY)
  Attempting to correct the OCR errors is not an option

New taxons
  Not in the business of correcting OCR
    no canonical reference
    aim is to identify new taxons
    community too small to correct most errors
  Is an unrecognised word a new taxon?
    Pioa
      Pica? (type of magpie)
      or a new term?
      could also be Roa

Proposed approach
  No possibility of getting the correct version from the two interpretations
  But we can tell where something’s up
    Differences between collections of OCRed versions may provide clues
  Compare outputs using sequence alignment algorithm
    Needleman-Wunsch
    word by word comparison plus Levenshtein edit distance
    hand-keyed INOTAXA text as reference

What we’re working with
  Can’t get at the internals of OCR systems
  Have a hand-corrected version of a document
    Biologia Centrali-Americana, coleoptera v.4 pt.3
    180,553 words
    INOTAXA hand keyed
  + scanned version with 2 OCRs
    BHL NHM (pdf maker)
    IA ABBYY Finereader
    taken from same jpeg

Needleman-Wunsch algorithm
  Global sequence alignment algorithm
  Match similar terms against each other or insert gaps
    score for match, mismatch, gap insertion
    minimise score
  To align ABCE with ABDE:
    A - A      A - A      A - A
    B - B      B - B      B - B
    C - [ ]    [ ] - D    C - D
    [ ] - D    C - [ ]
    E - E      E - E      E - E
  Depending on score function
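The ABCE / ABDE example can be reproduced with a small cost-minimising version of the algorithm. This is a minimal sketch, not the presenters' implementation: the function name, the cost values, and the tie-breaking order in the traceback are all assumptions chosen for illustration. A match costs 0, and the mismatch and gap costs are parameters, so changing them shows how the result depends on the score function.

```python
def needleman_wunsch(a, b, mismatch=1, gap=1):
    """Global alignment of sequences a and b, minimising total cost (match = 0)."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap                # a[:i] aligned entirely against gaps
    for j in range(1, m + 1):
        cost[0][j] = j * gap                # b[:j] aligned entirely against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            cost[i][j] = min(cost[i - 1][j - 1] + sub,   # pair a[i-1] with b[j-1]
                             cost[i - 1][j] + gap,       # a[i-1] against a gap
                             cost[i][j - 1] + gap)       # b[j-1] against a gap

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                pairs.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            pairs.append((a[i - 1], None))   # None stands for a gap, shown as [ ] above
            i -= 1
            continue
        pairs.append((None, b[j - 1]))
        j -= 1
    pairs.reverse()
    return cost[n][m], pairs


# Cheap mismatch: C is paired directly with D.
cost_sub, pairs_sub = needleman_wunsch("ABCE", "ABDE", mismatch=1, gap=1)
# Expensive mismatch: two gaps are cheaper than pairing C with D.
cost_gap, pairs_gap = needleman_wunsch("ABCE", "ABDE", mismatch=3, gap=1)

print(cost_sub, pairs_sub)
print(cost_gap, pairs_gap)
```

With a mismatch cost of 1 the alignment pairs C with D; raising it to 3 makes the algorithm prefer two gap insertions instead, which corresponds to the gapped alignments on the slide.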

Needleman-Wunsch on OCR outputs
  Compare word sequences
    The study of the Otiorhynchinæ Alatæ has unfortunately been delayed
    The study of the Otiorhynchinse Alatee has unfortunately been delayed
    The study of the Otiorhynchinœ Alatse has unfortunately been delayed
  Finds mismatches for similar words
    lower penalty for similar terms
    Levenshtein comparison
  Gap sequences for lengthy mismatches
  So alignment preferable to eg. DIFF
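The word-level comparison can be sketched by aligning whole words and using a length-normalised Levenshtein distance as the mismatch penalty, so that similar words are paired cheaply and dissimilar stretches fall into gaps. This is an illustrative sketch under assumptions, not the presenters' code: the gap cost of 0.4, the normalisation, and all names are hypothetical, and the first sentence is assumed to be the hand-keyed reference with the other two being OCR outputs.

```python
def levenshtein(a, b):
    """Character-level edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def word_cost(x, y):
    """0.0 for identical words, up to 1.0 for completely different ones."""
    if x == y:
        return 0.0
    return levenshtein(x, y) / max(len(x), len(y))


def align_words(ref, ocr, gap=0.4):
    """Needleman-Wunsch over word lists; gap cost is an assumed value."""
    n, m = len(ref), len(ocr)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (cost[i - 1][j - 1] + word_cost(ref[i - 1], ocr[j - 1]), "diag"),
                (cost[i - 1][j] + gap, "up"),
                (cost[i][j - 1] + gap, "left"),
            ]
            cost[i][j], back[i][j] = min(options, key=lambda o: o[0])
    # Follow the stored back-pointers to rebuild the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            pairs.append((ref[i - 1], ocr[j - 1]))
            i, j = i - 1, j - 1
        elif move == "up":
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, ocr[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


reference = "The study of the Otiorhynchinæ Alatæ has unfortunately been delayed".split()
ocr_a = "The study of the Otiorhynchinse Alatee has unfortunately been delayed".split()
ocr_b = "The study of the Otiorhynchinœ Alatse has unfortunately been delayed".split()

# Keep only the aligned positions where the two texts disagree.
mismatches_a = [(r, o) for r, o in align_words(reference, ocr_a) if r != o]
mismatches_b = [(r, o) for r, o in align_words(reference, ocr_b) if r != o]
print(mismatches_a)
print(mismatches_b)
```

Because the two taxonomic names differ from the reference by only a couple of characters, they are paired as mismatches rather than split into gaps, which flags exactly the words where the OCR readings are suspect.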

Sequence of gaps
  schwarzi     schwarzi     MATCH
  34           34           MATCH
  1            1            MATCH
  obsoletus    obsoletus    MATCH
  273,         273,         MATCH

Results
  Not bad
  Precision good
  Recall currently difficult to measure
    hand markup not always consistent

Future work
  Fuzzy search
    highlight difficult terms in the markup
    requires development of appropriate markup language
    partial disambiguation with collocations?
  IE possible from large enough collection?
    tail is longer, ♂, shorter, ♀, with green tip
    OCR doesn’t recognise the symbols either...
    idiosyncratic language often unique to author
  Interpretation from layout conventions
    current OCR not fine grained enough

Thank you
  Any questions?
