Dr. István Marosi Recosoft Ltd., Hungary SSIP 2002, Budapest Internal Structure of the Character Recognition Engine used inside Omnipage Pro Dr. István Marosi Recosoft Ltd., Hungary
Some „Marketing talk” Main tasks of an OCR system: Image acquisition Layout recognition Text recognition User assisted correction Result exportation 1/13/2019 Recosoft Ltd
Some „Marketing talk” Main tasks of an OCR system: Image acquisition Get image B/W Scanning Gray Scanning Color Scanning Load from image file Preprocess image Layout recognition Text recognition User assisted correction Result exportation 1/13/2019 Recosoft Ltd
Some „Marketing talk” Main tasks of an OCR system: Image acquisition Get image Preprocess image Color separation Thresholding Despeckling Rotation Deskewing Layout recognition Text recognition User assisted correction Result exportation 1/13/2019 Recosoft Ltd
The Preprocessed Image Joined chars 1/13/2019 Recosoft Ltd
The Preprocessed Image Joined chars 1/13/2019 Recosoft Ltd
The Preprocessed Image Broken chars 1/13/2019 Recosoft Ltd
The Preprocessed Image Broken chars 1/13/2019 Recosoft Ltd
Some „Marketing talk” Main tasks of an OCR system: Layout recognition Image acquisition Layout recognition Text zones Columns of flowed text Tables Inverse text Graphic zones Text recognition User assisted correction Result exportation 1/13/2019 Recosoft Ltd
Some „Marketing talk” Main tasks of an OCR system: Layout recognition Image acquisition Layout recognition Text zones Graphic zones Line Art Photo Text recognition User assisted correction Result exportation 1/13/2019 Recosoft Ltd
Some „Marketing talk” Main tasks of an OCR system: Text recognition Image acquisition Layout recognition Text recognition ... Let’s do it when the marketing staff is over... User assisted correction Result exportation 1/13/2019 Recosoft Ltd
Some „Marketing talk” Main tasks of an OCR system: Image acquisition Layout recognition Text recognition User assisted correction By the user’s random editing... Pop-up verifier Manual Training By proofreading of doubtful words Result exportation 1/13/2019 Recosoft Ltd
Some „Marketing talk” Main tasks of an OCR system: Image acquisition Layout recognition Text recognition User assisted correction By the user’s random editing... By proofreading of doubtful words Correct: User dictionary Changed: IntelliTrain Remember trained characters Apply them on following pages Result exportation 1/13/2019 Recosoft Ltd
IntelliTrain Recognized word: sorneUüng 1/13/2019 Recosoft Ltd
IntelliTrain Recognized word: sorneUüng Fixed word: something 1/13/2019 Recosoft Ltd
IntelliTrain Recognized word: sorneUüng Fixed word: something 1/13/2019 Recosoft Ltd
IntelliTrain Recognized word: sorneUüng Fixed word: something Substitutions found: m rn thi Uü 1/13/2019 Recosoft Ltd
IntelliTrain Recognized word: sorneUüng Fixed word: something Substitutions found: m rn thi Uü Perform automatically: Learn image pattern and substitution info Find similar substituted (‘blue’) text on actual page Match against pattern of substitution and correct Find such errors on following pages, too 1/13/2019 Recosoft Ltd
Some „Marketing talk” Main tasks of an OCR system: Result exportation Image acquisition Layout recognition Text recognition User assisted correction Result exportation Combine pages into a Document Header / Footer recognition Page numbers Hyperlinks (e.g. „See Table 20”) Save results 1/13/2019 Recosoft Ltd
Some „Marketing talk” Main tasks of an OCR system: Result exportation Image acquisition Layout recognition Text recognition User assisted correction Result exportation Combine pages into a Document Save results doc file e-mail Speech synthesizer 1/13/2019 Recosoft Ltd
OP11 Internals Text recognition in ScanSoft’s OP11 OCR Engines available: Caere’s engine (codename: Salt & Pepper) Recognita’s engine (codename: Paprika) 1/13/2019 Recosoft Ltd
OP11 Internals Text recognition in ScanSoft’s OP11 OCR Engines available: Caere’s engine (Salt & Pepper) Uses a Matrix Matching based algorithm feature set: 40 cells of an 8x5 grid good overall description of a shape weaker at detailed structure Recognita’s engine (Paprika) Uses a Contour Tracing based algorithm feture set: convex and concave arcs on the contour good detailed description of a shape weaker at overall structure 1/13/2019 Recosoft Ltd
OP11 Internals Text recognition in ScanSoft’s OP11 OCR Engines available: Caere’s engine (Salt & Pepper) Recognita’s engine (Paprika) Segmentation algorithms: 1/13/2019 Recosoft Ltd
Segmentation What are those pixel groups belonging to a single letter?
Segmentation What are those pixel groups belonging to a single letter?
Segmentation What are those pixel groups belonging to a single letter?
Segmentation What are those pixel groups belonging to a single letter?
Segmentation What are those pixel groups belonging to a single letter?
Segmentation What are those pixel groups belonging to a single letter?
OP11 Internals Text recognition in ScanSoft’s OP11 OCR Engines available: Caere’s engine (Salt & Pepper) Recognita’s engine (Paprika) Segmentation algorithms: Developed by independent groups Have different strengths and weaknesses 1/13/2019 Recosoft Ltd
OP11 Internals Text recognition in ScanSoft’s OP11 Conclusion: OCR Engines available: Caere’s engine (Salt & Pepper) Recognita’s engine (Paprika) Segmentation algorithms Conclusion: They are complementary Let’s create a voting system 1/13/2019 Recosoft Ltd
OP11 Internals Voting strategies External „Black box” voting Image Paprika Salt & Pepper Txt 2 Txt 1 Vote? Final Txt 1/13/2019 Recosoft Ltd
OP11 Internals Voting strategies External „Black box” voting Image Paprika Salt & Pepper Txt 2 Txt 1 Dict Vote Final Txt 1/13/2019 Recosoft Ltd
OP11 Internals Voting strategies External „Black box” voting ~15% gain Image Voting strategies External „Black box” voting ~15% gain Paprika Salt & Pepper Txt 2 Txt 1 Dict Vote Final Txt 1/13/2019 Recosoft Ltd
OP11 Internals Voting strategies Internal „Shape” voting Image External „Black box” voting Internal „Shape” voting Salt & Pepper Paprika Txt 1 Txt 2 Dict Bronze Final Txt 1/13/2019 Recosoft Ltd
Recognize original segmentation Image OP11 Internals Recognize original segmentation Paprika Original segmentation: Every independent connected component is a character Good segmentation: recognize Bad segmentation: reject K.B. 1/13/2019 Recosoft Ltd
OP11 Internals Paprika Image Recognize original segmentation K.B. Train adaptive classifier from original shapes Txt 1 Adaptive K.B. 1/13/2019 Recosoft Ltd
OP11 Internals Paprika Image Recognize original segmentation K.B. Try several segmentations Loop if unrecognizable K.B. Train adaptive classifier from original shapes Txt 1 Adaptive K.B. Recognize broken and joined shapes 1/13/2019 Recosoft Ltd
OP11 Internals Paprika Image Recognize original segmentation K.B. Train adaptive classifier from original shapes Txt 1 Adaptive K.B. Recognize broken and joined shapes Train adaptive classifier from ‘ugly’ shapes 1/13/2019 Recosoft Ltd
OP11 Internals Paprika Image Recognize original segmentation K.B. Train adaptive classifier from original shapes Txt 1 Adaptive K.B. Recognize broken and joined shapes Train adaptive classifier from ‘ugly’ shapes Recognize more broken and joined shapes Try several segmentations Loop if unrecognizable Txt 2 1/13/2019 Recosoft Ltd
OP11 Internals Voting strategies ~45% gain Image Salt & Pepper Txt 1 Paprika Txt 1 Txt 2 Dict Bronze Final Txt 1/13/2019 Recosoft Ltd
OP12 Voting strategies +20% gain Image Fire- Salt & Pepper worx Txt 1A Paprika Txt 1A Txt 1B Txt 2 Dict Bronze Final Txt 1/13/2019 Recosoft Ltd