1
Document Image Analysis (CSE 717): An Introduction
2
Document Image Analysis
DIA is the theory and practice of recovering the symbol structures of digital images scanned from paper or produced by computer.
DIA is a subfield of digital image processing.
Digital images of natural objects (X-rays, fingerprints, faces, scenery, etc.) are NOT part of DIA.
Digital images of symbolic objects (postal addresses, printed articles, forms, music sheets, engineering drawings, topographic maps) belong to DIA.
Sources: scanners, printers, fax machines, hand!
Incidental text: license plates, billboards, and subtitles in photos and video.
WWW ??
DIA's grand goal is to take us to the land of the paperless office.
3
Document Image Analysis
  Textual Processing
    Optical Character Recognition: text
    Page Layout Analysis: skew, blocks, paragraphs
  Graphical Processing
    Line Processing: lines, curves, corners
    Region and Symbol Processing: filled regions
4
Document Image Analysis: processing levels for text and graphics
Pixels
  Text: Preprocessing (representation, noise removal, binarization, skew, script ID, font ID)
  Graphics: Preprocessing (representation, noise removal, binarization, thinning, vectorization)
Primitives
  Text: Glyph Recognition (connected components, strokes, punctuations, words)
  Graphics: Primitive Recognition (straight lines, curve segments, junctions, nodes, loops, characters)
Structures
  Text: Text Recognition (word segmentation, text line reconstruction, table analysis, linguistics)
  Graphics: Structure Recognition (text fields, legends, labels, dimensions, graphics symbols)
Documents
  Text: Page Layout Analysis (text versus non-text, physical component analysis, logical component analysis, functional component analysis, compression)
  Graphics: Interpretation (component recognition, connectivity analysis, CAD layer separation, database attribute extraction, compression)
Corpus
  Text: Information Retrieval (document classification, indexing, search, security, authentication, privacy)
  Graphics: Database, CAD (validation, search, update)
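The pixel-level and primitive-level steps named on this slide (binarization, then connected components) can be illustrated with a minimal sketch. The slide does not name specific algorithms, so the choices here are assumptions: Otsu's method for the threshold, dark ink on light paper, and 8-connectivity for components. The function names are hypothetical and the image is a plain list of rows of 0-255 gray levels.

```python
from collections import deque


def otsu_threshold(pixels):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form one class, all others the second class.
    """
    hist = [0] * 256
    for v in pixels:
        hist[v] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_score = 0, -1.0
    w0 = 0      # number of pixels at or below t
    sum0 = 0    # sum of gray levels at or below t
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance.
        score = w0 * w1 * (m0 - m1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(image, threshold):
    """Map dark pixels (assumed to be ink) to 1 and light pixels to 0."""
    return [[1 if v <= threshold else 0 for v in row] for row in image]


def count_components(binary):
    """Count 8-connected groups of 1-pixels with a breadth-first search."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] == 1 and not seen[r][c]:
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    cr, cc = queue.popleft()
                    for dr in (-1, 0, 1):
                        for dc in (-1, 0, 1):
                            nr, nc = cr + dr, cc + dc
                            if (0 <= nr < rows and 0 <= nc < cols
                                    and binary[nr][nc] == 1
                                    and not seen[nr][nc]):
                                seen[nr][nc] = True
                                queue.append((nr, nc))
    return count
```

A typical call flattens the image for `otsu_threshold`, passes the result to `binarize`, and feeds the binary image to `count_components`. In a real system each component would then go on to glyph or primitive recognition rather than just being counted.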
5
Postal Examples
Meter Mark
Sender's Address
Delivery Address
Linear Code
Digital Post Mark
Endorsement
In Case of Undeliverable as Addressed Return to Sender
6
Forms
7
Unconstrained Text
8
Graphics Documents
9
References
Handbook of Character Recognition and Document Image Analysis, H. Bunke and P. S. P. Wang (editors), World Scientific Press
Document Image Analysis, O'Gorman and Kasturi, IEEE Computer Society Press
International Conference on Document Analysis and Recognition proceedings
International Workshop on Document Analysis Systems proceedings
Symposium on Document Image Understanding Technology
10
OCR Features and Systems
  – Script ID, Devanagari OCR, Tamil OCR, machine-printed (MP) versus handwritten (HW) text
Handwriting Recognition
  – Postal applications, Arabic documents
Classifiers and Learning
  – Multi-classifier systems
Layout Analysis
  – Skew correction, geometric methods, text/graphics separation, logical labeling
Tables and Forms
  – Detecting tables in HTML documents, use of graph grammars, semantics
Document Engineering
  – Processing of historical documents (palm leaf manuscripts)
Camera-Based DIA
  – Locating and reading barcodes
New Applications
  – CAPTCHA
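Skew correction, listed under Layout Analysis, first needs an estimate of the skew angle. The slides do not say which method the course covers. The sketch below assumes one common approach, the projection profile: for each candidate angle, foreground pixels are projected onto rows as if the page were rotated back by that angle, and the angle whose row histogram is most sharply peaked wins. The function names, the `(x, y)` point format, and the sign convention (a text line follows y = y0 + x·tan(angle), with y growing downward) are all assumptions made for illustration.

```python
import math
from collections import Counter


def profile_score(points, angle_deg):
    """Sum of squared row counts after undoing a skew of angle_deg.

    points is a list of (x, y) foreground pixel coordinates. The score is
    highest when the pixels of each text line collapse into few rows.
    """
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    rows = Counter(round(y * cos_a - x * sin_a) for x, y in points)
    return sum(n * n for n in rows.values())


def estimate_skew(points, candidates_deg):
    """Return the candidate angle (in degrees) with the peakiest profile."""
    return max(candidates_deg, key=lambda d: profile_score(points, d))
```

For example, `estimate_skew(points, range(-5, 6))` searches whole-degree angles from -5 to 5. A finer search grid, or a coarse-to-fine search, trades run time for precision.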