1 Internal Structure of the Character Recognition Engine used inside Omnipage Pro
Dr. István Marosi, Recosoft Ltd., Hungary. SSIP 2002, Budapest.

2 Some "Marketing talk"
Main tasks of an OCR system: image acquisition, layout recognition, text recognition, user-assisted correction, result exportation.
1/13/2019 Recosoft Ltd

3 Some "Marketing talk": Image acquisition
Main tasks of an OCR system: image acquisition (get image: B/W scanning, gray scanning, color scanning, load from image file; preprocess image), layout recognition, text recognition, user-assisted correction, result exportation.

4 Some "Marketing talk": Image acquisition
Main tasks of an OCR system: image acquisition (get image; preprocess image: color separation, thresholding, despeckling, rotation, deskewing), layout recognition, text recognition, user-assisted correction, result exportation.
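
Two of the preprocessing steps named on this slide, thresholding and despeckling, can be pictured with a minimal sketch. This is not the method used in the product: the fixed cutoff of 128, the "isolated pixel" rule, and the function names `threshold` and `despeckle` are all assumptions made for illustration.

```python
from typing import List

Gray = List[List[int]]    # rows of 0..255 intensities
Binary = List[List[int]]  # rows of 0 (white) / 1 (black)


def threshold(gray: Gray, cutoff: int = 128) -> Binary:
    """Global thresholding: pixels darker than `cutoff` become black (1).

    The cutoff value is an assumption; real systems often pick it adaptively.
    """
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


def despeckle(img: Binary) -> Binary:
    """Remove isolated black pixels, i.e. those with no black 8-neighbour."""
    h = len(img)
    w = len(img[0]) if h else 0
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1:
                continue
            has_neighbour = any(
                img[ny][nx] == 1
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                out[y][x] = 0  # a lone speck: treat it as noise
    return out
```

A single dark pixel far from any other is dropped, while pixels that touch each other (even diagonally) survive, so thin strokes are not erased by this rule.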

5-6 The Preprocessed Image
Joined characters

7-8 The Preprocessed Image
Broken characters

9 Some "Marketing talk": Layout recognition
Main tasks of an OCR system: image acquisition, layout recognition (text zones: columns of flowed text, tables, inverse text; graphic zones), text recognition, user-assisted correction, result exportation.

10 Some "Marketing talk": Layout recognition
Main tasks of an OCR system: image acquisition, layout recognition (text zones; graphic zones: line art, photo), text recognition, user-assisted correction, result exportation.

11 Some "Marketing talk": Text recognition
Main tasks of an OCR system: image acquisition, layout recognition, text recognition (... let's do it when the marketing stuff is over ...), user-assisted correction, result exportation.

12 Some "Marketing talk"
Main tasks of an OCR system: image acquisition, layout recognition, text recognition, user-assisted correction (by the user's random editing...; pop-up verifier; manual training; by proofreading of doubtful words), result exportation.

13 Some "Marketing talk"
Main tasks of an OCR system: image acquisition, layout recognition, text recognition, user-assisted correction (by the user's random editing...; by proofreading of doubtful words), result exportation.
Proofreading of doubtful words. Correct: user dictionary. Changed: IntelliTrain (remember trained characters; apply them on following pages).

14 IntelliTrain
Recognized word: sorneUüng

15-16 IntelliTrain
Recognized word: sorneUüng. Fixed word: something.

17 IntelliTrain
Recognized word: sorneUüng. Fixed word: something.
Substitutions found: m → rn, thi → Uü

18 IntelliTrain
Recognized word: sorneUüng. Fixed word: something.
Substitutions found: m → rn, thi → Uü
Perform automatically: learn the image pattern and substitution info; find similar substituted ("blue") text on the current page; match against the pattern of substitution and correct; find such errors on following pages, too.
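
The "substitutions found" step can be approximated at the text level by aligning the recognized word with the user's fixed word and collecting the spans that differ. The sketch below uses Python's standard `difflib.SequenceMatcher` for the alignment; that choice and the function name `find_substitutions` are assumptions, and the slides do not say how IntelliTrain aligns the two words (it also works on image patterns, which this sketch ignores).

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def find_substitutions(recognized: str, fixed: str) -> List[Tuple[str, str]]:
    """Return (correct, recognized) pairs for every span where the words differ."""
    matcher = SequenceMatcher(None, recognized, fixed, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # fixed[j1:j2] is what the user typed, recognized[i1:i2] what OCR produced
            pairs.append((fixed[j1:j2], recognized[i1:i2]))
    return pairs


print(find_substitutions("sorneUüng", "something"))
# [('m', 'rn'), ('thi', 'Uü')]
```

For the example on the slide this recovers the same two substitutions, m → rn and thi → Uü.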

19 Some "Marketing talk": Result exportation
Main tasks of an OCR system: image acquisition, layout recognition, text recognition, user-assisted correction, result exportation (combine pages into a document: header / footer recognition, page numbers, hyperlinks, e.g. "See Table 20"; save results).

20 Some "Marketing talk": Result exportation
Main tasks of an OCR system: image acquisition, layout recognition, text recognition, user-assisted correction, result exportation (combine pages into a document; save results: doc file, speech synthesizer).

21 OP11 Internals
Text recognition in ScanSoft's OP11. OCR engines available: Caere's engine (codename: Salt & Pepper) and Recognita's engine (codename: Paprika).

22 OP11 Internals
Text recognition in ScanSoft's OP11. OCR engines available:
Caere's engine (Salt & Pepper) uses a matrix-matching-based algorithm. Feature set: 40 cells of an 8x5 grid. Good overall description of a shape; weaker at detailed structure.
Recognita's engine (Paprika) uses a contour-tracing-based algorithm. Feature set: convex and concave arcs on the contour. Good detailed description of a shape; weaker at overall structure.
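
To make the "40 cells of an 8x5 grid" feature set concrete, here is a minimal sketch of grid features with nearest-template matching. It is an illustration only, not Caere's algorithm: reading 8x5 as 8 rows by 5 columns, using the fraction of black pixels per cell as the feature, comparing by squared distance, and the names `grid_features` and `classify` are all assumptions.

```python
from typing import Dict, List


def grid_features(glyph: List[List[int]], rows: int = 8, cols: int = 5) -> List[float]:
    """Split a binary glyph (1 = black) into rows x cols cells and
    return the fraction of black pixels in each cell, row by row."""
    h, w = len(glyph), len(glyph[0])
    feats = []
    for r in range(rows):
        y0, y1 = r * h // rows, (r + 1) * h // rows
        for c in range(cols):
            x0, x1 = c * w // cols, (c + 1) * w // cols
            cell = [glyph[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            feats.append(sum(cell) / len(cell) if cell else 0.0)
    return feats


def classify(feats: List[float], templates: Dict[str, List[float]]) -> str:
    """Return the label of the template closest in squared distance."""
    def dist(label: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(feats, templates[label]))
    return min(templates, key=dist)
```

Because each cell averages over many pixels, such features describe the overall shape well but blur fine detail, which matches the strengths and weaknesses listed on the slide.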

23 OP11 Internals
Text recognition in ScanSoft's OP11. OCR engines available: Caere's engine (Salt & Pepper) and Recognita's engine (Paprika). Segmentation algorithms:

24-29 Segmentation
Which pixel groups belong to a single letter?

30 OP11 Internals
Text recognition in ScanSoft's OP11. OCR engines available: Caere's engine (Salt & Pepper) and Recognita's engine (Paprika). Segmentation algorithms: developed by independent groups; have different strengths and weaknesses.

31 OP11 Internals
Text recognition in ScanSoft's OP11. OCR engines available: Caere's engine (Salt & Pepper) and Recognita's engine (Paprika). Segmentation algorithms.
Conclusion: they are complementary. Let's create a voting system.

32 OP11 Internals
Voting strategies: external "black box" voting.
Diagram: the image is fed to both Paprika and Salt & Pepper; their outputs (Txt 1, Txt 2) go to a vote ("Vote?") that yields the final text (Final Txt).

33 OP11 Internals
Voting strategies: external "black box" voting.
Diagram: as on the previous slide, with a dictionary (Dict) added as an input to the vote.

34 OP11 Internals
Voting strategies: external "black box" voting, ~15% gain.
Diagram: unchanged from the previous slide.
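
External "black box" voting sees only the text each engine produces, so a dictionary is the natural tie-breaker. The sketch below is one possible word-level rule, not the rule used in OP11: the function name `vote`, the assumption that both outputs split into the same number of words, and the fallback to the first engine are all hypothetical.

```python
from typing import Set


def vote(txt1: str, txt2: str, dictionary: Set[str]) -> str:
    """Combine two OCR outputs word by word, preferring dictionary words."""
    words1, words2 = txt1.split(), txt2.split()
    if len(words1) != len(words2):
        # Aligning outputs of different length is out of scope for this sketch.
        return txt1
    final = []
    for w1, w2 in zip(words1, words2):
        if w1 == w2:
            final.append(w1)  # both engines agree
        elif w2.lower() in dictionary and w1.lower() not in dictionary:
            final.append(w2)  # only the second engine gives a known word
        else:
            final.append(w1)  # otherwise fall back to the first engine
    return " ".join(final)


words = {"something", "is", "wrong"}
print(vote("sorneUüng is wrong", "something is wrcng", words))
# something is wrong
```

A vote of this kind can only choose between whole answers; it cannot repair a word that both engines got wrong, which is one motivation for voting inside the engines instead.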

35 OP11 Internals
Voting strategies: external "black box" voting; internal "shape" voting.
Diagram: the image is fed to Salt & Pepper and Paprika; the outputs Txt 1 and Txt 2, together with the dictionary (Dict), go to Bronze, which yields the final text (Final Txt).

36 OP11 Internals
Diagram (Paprika): Image, then "Recognize original segmentation" (using K.B.).
Original segmentation: every independent connected component is a character. Good segmentation: recognize. Bad segmentation: reject.
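
The "original segmentation" treats every connected component of black pixels as one candidate character. A minimal sketch of that step follows; the use of 8-connectivity, the bounding-box output, and the function name `connected_components` are assumptions, since the slides do not describe how Paprika finds components.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def connected_components(img: List[List[int]]) -> List[Box]:
    """Return bounding boxes of 8-connected groups of black (1) pixels,
    sorted left to right."""
    h = len(img)
    w = len(img[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited black pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny in range(max(0, cy - 1), min(h, cy + 2)):
                    for nx in range(max(0, cx - 1), min(w, cx + 2)):
                        if img[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)
```

Under this rule two touching letters come out as one component and a letter split by a gap comes out as several, which is why joined and broken characters need the further segmentation attempts described on the following slides.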

37 OP11 Internals
Diagram (Paprika): Image; recognize original segmentation (K.B.), giving Txt 1; train adaptive classifier from original shapes (Adaptive K.B.).

38 OP11 Internals
Diagram (Paprika): Image; recognize original segmentation (K.B.), giving Txt 1; train adaptive classifier from original shapes (Adaptive K.B.); recognize broken and joined shapes (K.B.): try several segmentations, loop if unrecognizable.

39 OP11 Internals
Diagram (Paprika): Image; recognize original segmentation (K.B.), giving Txt 1; train adaptive classifier from original shapes (Adaptive K.B.); recognize broken and joined shapes; train adaptive classifier from "ugly" shapes.

40 OP11 Internals
Diagram (Paprika): Image; recognize original segmentation (K.B.), giving Txt 1; train adaptive classifier from original shapes (Adaptive K.B.); recognize broken and joined shapes; train adaptive classifier from "ugly" shapes; recognize more broken and joined shapes (try several segmentations, loop if unrecognizable), giving Txt 2.

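41 OP11 Internals
Voting strategies: ~45% gain.
Diagram: the image is fed to Salt & Pepper (labeled Txt 1) and to Paprika (labeled Txt 1 and Txt 2); these outputs, together with the dictionary (Dict), go to Bronze, which yields the final text (Final Txt).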

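42 OP12
Voting strategies: +20% gain.
Diagram: the image is fed to Fireworx, Salt & Pepper, and Paprika; the outputs (labeled Txt 1A twice, Txt 1B, and Txt 2 on the slide), together with the dictionary (Dict), go to Bronze, which yields the final text (Final Txt).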

