1
Internal Structure of the Character Recognition Engine used inside OmniPage Pro
Dr. István Marosi, Recosoft Ltd., Hungary
SSIP 2002, Budapest
2
Some „Marketing talk”
Main tasks of an OCR system:
- Image acquisition
- Layout recognition
- Text recognition
- User assisted correction
- Result exportation
1/13/2019 Recosoft Ltd
3
Some „Marketing talk”
Main tasks of an OCR system: Image acquisition
- Image acquisition
  - Get image
    - B/W Scanning
    - Gray Scanning
    - Color Scanning
    - Load from image file
  - Preprocess image
- Layout recognition
- Text recognition
- User assisted correction
- Result exportation
4
Some „Marketing talk”
Main tasks of an OCR system: Image acquisition
- Image acquisition
  - Get image
  - Preprocess image
    - Color separation
    - Thresholding
    - Despeckling
    - Rotation
    - Deskewing
- Layout recognition
- Text recognition
- User assisted correction
- Result exportation
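Two of the preprocessing steps listed on this slide, thresholding and despeckling, can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of rows with values 0..255, a fixed global cutoff, and a "speck" defined as an ink pixel with no ink among its eight neighbours. The function names and the cutoff are hypothetical; the slides do not say how OmniPage implements these steps.

```python
from typing import List

Image = List[List[int]]


def threshold(gray: Image, cutoff: int = 128) -> Image:
    """Binarize: pixels darker than `cutoff` become ink (1), others background (0)."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


def despeckle(img: Image) -> Image:
    """Clear every ink pixel that has no ink pixel among its 8 neighbours."""
    h = len(img)
    w = len(img[0]) if h else 0
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1:
                continue
            has_neighbour = any(
                img[ny][nx] == 1
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                out[y][x] = 0  # isolated pixel: treat as scanner noise
    return out


gray = [
    [0, 255, 255, 255, 255],    # lone dark pixel in the corner
    [255, 255, 255, 255, 255],
    [255, 255, 255, 10, 20],    # two adjacent dark pixels
    [255, 255, 255, 255, 255],
]
binary = threshold(gray)
clean = despeckle(binary)
```

Here the lone corner pixel is removed while the two adjacent dark pixels survive. A production system would typically use an adaptive threshold and size-based speck removal; this sketch only shows the idea.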
5
The Preprocessed Image
[Image: joined chars]
7
The Preprocessed Image
[Image: broken chars]
9
Some „Marketing talk”
Main tasks of an OCR system: Layout recognition
- Image acquisition
- Layout recognition
  - Text zones
    - Columns of flowed text
    - Tables
    - Inverse text
  - Graphic zones
- Text recognition
- User assisted correction
- Result exportation
10
Some „Marketing talk”
Main tasks of an OCR system: Layout recognition
- Image acquisition
- Layout recognition
  - Text zones
  - Graphic zones
    - Line Art
    - Photo
- Text recognition
- User assisted correction
- Result exportation
11
Some „Marketing talk”
Main tasks of an OCR system: Text recognition
- Image acquisition
- Layout recognition
- Text recognition
  - ... Let’s do it when the marketing stuff is over...
- User assisted correction
- Result exportation
12
Some „Marketing talk”
Main tasks of an OCR system:
- Image acquisition
- Layout recognition
- Text recognition
- User assisted correction
  - By the user’s random editing...
  - Pop-up verifier
  - Manual Training
  - By proofreading of doubtful words
- Result exportation
13
Some „Marketing talk”
Main tasks of an OCR system:
- Image acquisition
- Layout recognition
- Text recognition
- User assisted correction
  - By the user’s random editing...
  - By proofreading of doubtful words
    - Correct: User dictionary
    - Changed: IntelliTrain
      - Remember trained characters
      - Apply them on following pages
- Result exportation
14
IntelliTrain
Recognized word: sorneUüng
15
IntelliTrain
Recognized word: sorneUüng
Fixed word: something
17
IntelliTrain
Recognized word: sorneUüng
Fixed word: something
Substitutions found:
- m (recognized as rn)
- thi (recognized as Uü)
18
IntelliTrain
Recognized word: sorneUüng
Fixed word: something
Substitutions found:
- m (recognized as rn)
- thi (recognized as Uü)
Perform automatically:
- Learn image pattern and substitution info
- Find similar substituted (‘blue’) text on actual page
- Match against pattern of substitution and correct
- Find such errors on following pages, too
19
Some „Marketing talk”
Main tasks of an OCR system: Result exportation
- Image acquisition
- Layout recognition
- Text recognition
- User assisted correction
- Result exportation
  - Combine pages into a Document
    - Header / Footer recognition
    - Page numbers
    - Hyperlinks (e.g. „See Table 20”)
  - Save results
20
Some „Marketing talk”
Main tasks of an OCR system: Result exportation
- Image acquisition
- Layout recognition
- Text recognition
- User assisted correction
- Result exportation
  - Combine pages into a Document
  - Save results
    - doc file
    - Speech synthesizer
21
OP11 Internals
Text recognition in ScanSoft’s OP11
OCR Engines available:
- Caere’s engine (codename: Salt & Pepper)
- Recognita’s engine (codename: Paprika)
22
OP11 Internals
Text recognition in ScanSoft’s OP11
OCR Engines available:
- Caere’s engine (Salt & Pepper)
  - Uses a Matrix Matching based algorithm
  - Feature set: 40 cells of an 8x5 grid
  - Good overall description of a shape
  - Weaker at detailed structure
- Recognita’s engine (Paprika)
  - Uses a Contour Tracing based algorithm
  - Feature set: convex and concave arcs on the contour
  - Good detailed description of a shape
  - Weaker at overall structure
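The matrix-matching feature set (40 cells of an 8x5 grid) can be illustrated with a small sketch. It assumes the grid has 8 rows and 5 columns, that each feature is the fraction of ink pixels in a cell, and that classification picks the nearest stored template by squared distance. All of these choices and the function names are assumptions made for illustration; the slide does not describe Salt & Pepper's actual features or matching rule.

```python
from typing import Dict, List

Glyph = List[List[int]]  # binary image of one character, 1 = ink


def grid_features(glyph: Glyph, rows: int = 8, cols: int = 5) -> List[float]:
    """Ink density in each cell of a rows x cols grid laid over the glyph."""
    h, w = len(glyph), len(glyph[0])
    features = []
    for r in range(rows):
        y0, y1 = r * h // rows, (r + 1) * h // rows
        for c in range(cols):
            x0, x1 = c * w // cols, (c + 1) * w // cols
            cell = [glyph[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(cell) / len(cell) if cell else 0.0)
    return features


def distance(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(glyph: Glyph, templates: Dict[str, List[float]]) -> str:
    """Return the label of the template closest to the glyph's grid features."""
    feats = grid_features(glyph)
    return min(templates, key=lambda label: distance(feats, templates[label]))


bar_small = [[0, 0, 1, 0, 0] for _ in range(8)]                 # 8x5 vertical bar
block_small = [[1] * 5 for _ in range(8)]                        # 8x5 filled block
bar_large = [[0, 0, 0, 0, 1, 1, 0, 0, 0, 0] for _ in range(16)]  # same bar at 16x10

templates = {"I": grid_features(bar_small), "#": grid_features(block_small)}
print(classify(bar_large, templates))
```

Because each cell averages over an area, the features are insensitive to scale and to small local detail, which matches the slide's remark that this kind of feature set describes overall shape well but is weaker at detailed structure.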
23
OP11 Internals
Text recognition in ScanSoft’s OP11
OCR Engines available:
- Caere’s engine (Salt & Pepper)
- Recognita’s engine (Paprika)
Segmentation algorithms:
24
Segmentation What are those pixel groups belonging to a single letter?
30
OP11 Internals
Text recognition in ScanSoft’s OP11
OCR Engines available:
- Caere’s engine (Salt & Pepper)
- Recognita’s engine (Paprika)
Segmentation algorithms:
- Developed by independent groups
- Have different strengths and weaknesses
31
OP11 Internals
Text recognition in ScanSoft’s OP11
- OCR Engines available:
  - Caere’s engine (Salt & Pepper)
  - Recognita’s engine (Paprika)
- Segmentation algorithms
Conclusion:
- They are complementary
- Let’s create a voting system
32
OP11 Internals
Voting strategies
- External „Black box” voting
[Diagram labels: Image, Paprika, Salt & Pepper, Txt 2, Txt 1, Vote?, Final Txt]
33
OP11 Internals
Voting strategies
- External „Black box” voting
[Diagram labels: Image, Paprika, Salt & Pepper, Txt 2, Txt 1, Dict, Vote, Final Txt]
34
OP11 Internals
Voting strategies
- External „Black box” voting: ~15% gain
[Diagram labels: Image, Paprika, Salt & Pepper, Txt 2, Txt 1, Dict, Vote, Final Txt]
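External „Black box” voting treats each engine as opaque and compares only their text outputs, with a dictionary as tie-breaker. A minimal sketch of that idea follows. It assumes both outputs are already aligned word by word and that the first engine wins when the dictionary cannot decide. The `vote` function and its tie-breaking rule are hypothetical; the slides do not specify the actual voting rule.

```python
from typing import List, Set


def vote(words_a: List[str], words_b: List[str], dictionary: Set[str]) -> List[str]:
    """Word-level vote between two OCR outputs, using a dictionary to break ties."""
    if len(words_a) != len(words_b):
        raise ValueError("this sketch assumes word-aligned outputs of equal length")
    final = []
    for a, b in zip(words_a, words_b):
        if a == b:
            final.append(a)  # both engines agree
        elif b.lower() in dictionary and a.lower() not in dictionary:
            final.append(b)  # only the second engine produced a real word
        else:
            final.append(a)  # default to the first engine
    return final


txt1 = ["sorne", "text", "is"]
txt2 = ["some", "tcxt", "is"]
print(vote(txt1, txt2, {"some", "text", "is"}))
```

Each engine misreads a different word here, and the vote recovers both, which is the benefit expected when two engines have complementary weaknesses.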
35
OP11 Internals
Voting strategies
- External „Black box” voting
- Internal „Shape” voting
[Diagram labels: Image, Salt & Pepper, Paprika, Txt 1, Txt 2, Dict, Bronze, Final Txt]
36
OP11 Internals
Recognize original segmentation
- Original segmentation: every independent connected component is a character
- Good segmentation: recognize
- Bad segmentation: reject
[Diagram labels: Image, Paprika, K.B.]
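The "original segmentation" rule, where every independent connected component is treated as one character, can be sketched with a plain breadth-first labelling pass. The sketch assumes a binary image (1 = ink) and 8-connectivity, and returns one bounding box per component. The function name and the choice of connectivity are assumptions, not details taken from the slides.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def connected_components(img: List[List[int]]) -> List[Box]:
    """Bounding boxes of 8-connected ink components, sorted left to right."""
    h = len(img)
    w = len(img[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cy, cx = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                # Visit all 8 neighbours that are ink and not yet labelled.
                for ny in range(max(0, cy - 1), min(h, cy + 2)):
                    for nx in range(max(0, cx - 1), min(w, cx + 2)):
                        if img[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)


page = [
    [1, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
]
print(connected_components(page))
```

This rule fails in exactly the two cases shown earlier in the talk: joined characters come out as a single component, and a broken character comes out as several, so those shapes have to be rejected and handled by later passes.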
37
OP11 Internals
Paprika
1. Recognize original segmentation
2. Train adaptive classifier from original shapes
[Diagram labels: Image, K.B., Adaptive K.B., Txt 1]
38
OP11 Internals
Paprika
1. Recognize original segmentation
2. Train adaptive classifier from original shapes
3. Recognize broken and joined shapes
   - Try several segmentations
   - Loop if unrecognizable
[Diagram labels: Image, K.B., Adaptive K.B., Txt 1]
39
OP11 Internals
Paprika
1. Recognize original segmentation
2. Train adaptive classifier from original shapes
3. Recognize broken and joined shapes
4. Train adaptive classifier from ‘ugly’ shapes
[Diagram labels: Image, K.B., Adaptive K.B., Txt 1]
40
OP11 Internals
Paprika
1. Recognize original segmentation
2. Train adaptive classifier from original shapes
3. Recognize broken and joined shapes
4. Train adaptive classifier from ‘ugly’ shapes
5. Recognize more broken and joined shapes
   - Try several segmentations
   - Loop if unrecognizable
[Diagram labels: Image, K.B., Adaptive K.B., Txt 1, Txt 2]
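The "try several segmentations, loop if unrecognizable" step can be illustrated with a toy model. Here a joined blob is represented as a string of column symbols and the classifier is any function that returns a label for a recognizable piece or `None` for a reject. The search tries the widest piece first and backtracks when the remainder cannot be recognized. Both the string representation and the name `try_segmentations` are hypothetical simplifications; the real engine works on image shapes, and the slides do not describe its search order.

```python
from functools import lru_cache
from typing import Callable, Optional, Tuple


def try_segmentations(
    blob: str,
    recognize: Callable[[str], Optional[str]],
    max_width: int = 4,
) -> Optional[Tuple[str, ...]]:
    """Split `blob` into pieces that are all recognized, or return None (reject)."""

    @lru_cache(maxsize=None)
    def solve(start: int) -> Optional[Tuple[str, ...]]:
        if start == len(blob):
            return ()  # everything consumed: a valid segmentation
        # Try the widest candidate piece first, then progressively narrower ones.
        for end in range(min(len(blob), start + max_width), start, -1):
            label = recognize(blob[start:end])
            if label is None:
                continue  # this cut gives an unrecognizable piece: try another
            rest = solve(end)
            if rest is not None:
                return (label,) + rest
        return None  # no cut works from this position

    return solve(0)


knowledge_base = {"xx": "A", "y": "B"}  # toy shapes, not real glyph data
print(try_segmentations("xxyxx", knowledge_base.get))
print(try_segmentations("xz", knowledge_base.get))
```

In the pipeline on this slide the classifier used for such a search would also include the adaptive knowledge base trained in the earlier passes, so shapes rejected in one pass may become recognizable in the next.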
41
OP11 Internals
Voting strategies: ~45% gain
[Diagram labels: Image, Salt & Pepper, Txt 1, Paprika, Txt 1, Txt 2, Dict, Bronze, Final Txt]
42
OP12
Voting strategies: +20% gain
[Diagram labels: Image, Fireworx, Salt & Pepper, Txt 1A, Paprika, Txt 1A, Txt 1B, Txt 2, Dict, Bronze, Final Txt]