1
Document Analysis: Introduction
Prof. Rolf Ingold, University of Fribourg
Master course, spring semester 2008
2
© Prof. Rolf Ingold
Outline
Introduction: definition and aims
Applications overview
Methodologies
Possibilities and limits
Experience of the DIVA research group
Course content and structure
3
What is a document?
Data: an abstract binary representation of any kind of information to be stored, transmitted or processed by computers
Information: data associated with an implicit or explicit interpretation
Document: a piece of information that can be perceived and interpreted by humans
To be perceived, documents have to be rendered: displayed or projected on screens, printed, played on speakers, ...
4
Taxonomy of documents
Documents may be:
synthetic (structured) or captured (unstructured)
static (non-temporal, printable) or dynamic (temporal)
viewable, audible or tactile
[Figure: documents classified as synthetic or captured data and as static or dynamic documents; the examples shown are graphics, printed text, images, off-line handwriting, on-line handwriting, animation, synthetic speech, audio and video]
5
What is document analysis?
Document analysis aims at extracting symbolic information:
text (words, expressions, continuous text)
graphics (vector graphics, shapes, symbols)
layout structures
logical structures
numeric data
writer / speaker identities, ...
from different captured sources:
images (scanned, camera-based, synthesized)
video
on-line handwriting
sound
6
Importance of document structures
Document = Content + Structures
Structures convey abstract high-level information
They are revealed by styles
7
Structural document analysis
[Figure: sample document image (text fragments: A Master / Slave Monitor … Network; D.Jacobson … M. Shafiq M…)]
Document analysis = image analysis of static documents to extract content and structures
Document analysis is applicable to:
captured images (from scanner, camera)
synthetic images of electronic documents, available in unstructured or poorly structured form
8
Analysis of Electronic Documents
Most electronic documents are unstructured or poorly structured
Document understanding can be seen as a reverse-engineering task that uses a fixed-layout document format (such as PDF or XPS) as a pivot format
[Figure: diagram of document formats; the only surviving label is ASCII]
9
Visual Audio Processing Chain
Visual Audio aims at recovering sound from old records by image analysis
10
Usefulness of document analysis
Extracting information from captured documents is useful in different contexts:
to avoid cumbersome keyboarding
to capture information remotely
to study the document’s content
to categorize, classify and index digitized documents for digital libraries and culture preservation
to reuse document chunks
to reedit and restyle an existing document
to extract information for integrated applications: office automation, database management, information systems
to perform multimodal alignment
11
Typical applications of document analysis
Commercial products are available for:
text reading (OCR products)
office automation (mail reading and dispatching)
form processing (for dedicated applications)
More specialized products:
postal address reading
check reading and processing
12
Form processing
Performance of form processing depends on form complexity and on form variability
Fields are located easily if their positions are fixed or when different colors are used
Content recognition is hard for several reasons: degraded images, approximate positioning of symbols, variability of handwriting
13
Check reading
Check reading can be automated at a rate above 90%
Difficulties: textured background, variability of writing
Helpful factors: fixed vocabulary, redundancy (legal & courtesy amount), availability of contextual information (client database)
[Figure: sample check with labeled fields: legal amount, payee name, MICR, date, courtesy amount, signature; the source credit is truncated]
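The redundancy between the legal amount (written in words) and the courtesy amount (written in digits) can be illustrated with a minimal sketch. It assumes an English vocabulary and whole currency units only; the function names `legal_to_number` and `amounts_agree` are hypothetical and do not come from the lecture or from any commercial system.

```python
# Fixed vocabulary of the legal amount (assumption: English, whole units only).
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def legal_to_number(text):
    """Convert a recognized legal amount such as 'two hundred five' to an int."""
    total, current = 0, 0
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current *= 100
        elif word == "thousand":
            total += current * 1000
            current = 0
        else:
            # A word outside the fixed vocabulary signals a recognition error.
            raise ValueError(f"word outside the fixed vocabulary: {word!r}")
    return total + current


def amounts_agree(legal_text, courtesy_digits):
    """Accept a check only if both recognized amounts denote the same value."""
    return legal_to_number(legal_text) == int(courtesy_digits)
```

In such a scheme, a check whose two amounts disagree would be routed to a human operator rather than processed automatically.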
14
Table of contents recognition
Aim: to extract information from tables of contents (TOC) in order to index journals; to associate titles and authors with page numbers
Advantages: very precise goal; regular layout for a given journal
Difficulties: complex layout; great variability when considering journals universally
15
Analysis of historical documents
Aim: to extract information in order to index historical documents
Challenges: degradations; irregular layout; rich typography, ornaments; old scripts (no OCR)
Possible approach: word spotting
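The slide names word spotting without describing a method. One common way to match word images without OCR is to compare per-column feature sequences with dynamic time warping (DTW); the sketch below assumes that each word image has already been reduced to a list of numbers (for example, ink counts per pixel column). The names `dtw_distance` and `spot`, and the threshold parameter, are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def spot(query, candidates, max_distance):
    """Return the names of candidate words whose profile is close to the query."""
    return [name for name, profile in candidates
            if dtw_distance(query, profile) <= max_distance]
```

Because DTW tolerates local stretching, a wider rendering of the same word shape can still match the query, which is useful when the same word is written with varying widths.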
16
Logical & physical document structures
Logical document structures: reflect the author’s point of view; are independent of presentation; are composed of application-dependent logical entities (chapters, sections); are specific to the application and document class
Physical document structures: reflect the editor’s point of view; are composed of a hierarchy of physical entities (text blocks, text lines and tokens; graphical primitives); are universal and independent of the document class
17
Document processing cycle
[Figure: cycle linking logical document, physical document, paper document and document image through the steps formatting, rendering, printing, digitizing, and analysis and recognition]
Document analysis can be considered as the reverse of formatting
18
Relation between logical and physical structure
[Figure: logical structure and physical structure linked by formatting (driven by styles) and by analysis; further labels: edit, print, display]
Document formatting is straightforward...
But document analysis is a non-trivial task that generally cannot be fully automated
19
Processing chain
[Figure: processing chain with the steps preprocessing, segmentation, layout analysis, OCR, OFR, post-analysis and document understanding, and the data items image, blocks, fonts, simple text and structured document]
20
Pre-processing
Pre-processing aims at preparing the document image for further analysis; it includes:
brightness / contrast enhancement
noise removal
skew / aberration correction
binarization / color clustering
shape smoothing
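As an illustration of the binarization step, here is a minimal sketch of global thresholding with Otsu's method. The slide does not name a particular method; Otsu's is used here only as a well-known example. The function names `otsu_threshold` and `binarize`, the grayscale image given as a list of rows with values 0 to 255, and the convention that dark (ink) pixels become 1 are all assumptions of this sketch.

```python
def otsu_threshold(image):
    """Return the gray level that maximizes the between-class variance."""
    hist = [0] * 256
    for row in image:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels with gray level <= t
    sum_bg = 0  # sum of the gray levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to the constant factor 1 / total**2.
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image, threshold):
    """Map pixels at or below the threshold to 1 (ink), all others to 0."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]
```

A single global threshold is usually inadequate for textured backgrounds or uneven illumination, where locally adaptive thresholds are typically preferred.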
21
Segmentation
Document segmentation aims at splitting the image into regions of interest; it includes:
page segmentation into blocks
separation of text, graphics and images
detection of hairlines and frames
text block segmentation into text lines, words and characters
in form processing, field separation
graphics segmentation into vectors and symbols
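Segmenting a text block into text lines can be sketched with a horizontal projection profile, a classical technique that the slide does not name explicitly. The sketch assumes a deskewed binary image given as a list of rows in which 1 marks ink; `horizontal_profile`, `segment_lines` and the `min_ink` parameter are hypothetical names.

```python
def horizontal_profile(binary):
    """Count the ink pixels in each row of a binary image (1 = ink)."""
    return [sum(row) for row in binary]


def segment_lines(binary, min_ink=1):
    """Return (first_row, last_row) pairs for each run of rows containing ink."""
    lines = []
    start = None
    for y, ink in enumerate(horizontal_profile(binary)):
        if ink >= min_ink and start is None:
            start = y                      # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y - 1))   # the text line ended on the previous row
            start = None
    if start is not None:
        lines.append((start, len(binary) - 1))
    return lines
```

The same idea applied to columns within one text line gives a first approximation of word and character boundaries, although touching or skewed lines would need a more robust method.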
22
Optical Character Recognition (OCR)
OCR aims at extracting character codes (ASCII) from text images
OCR was one of the earliest computer vision applications
Early patents were filed in the 1910s, 30 years before the computer age!
OCR deals with many situations:
isolated characters vs. complete words or phrases
different character classes (digits, uppercase letters, full text, …)
restricted or open vocabulary
machine-printed vs. handwritten text
different languages (with various diacritics) and different scripts (Latin, Greek, Hebrew, Arabic, Farsi, various Asian scripts, …)
imperfect image quality (low resolution, textured background, distortions, noise, …)
23
Text recognition related problems
Text analysis must also consider other aspects
In the case of printed text: font recognition (family, size and style); font categorization (with/without serifs, fixed-width vs. proportional font)
In the case of handwritten text: writer identification or verification; writer classification
24
Layout analysis
Layout analysis aims at extracting physical structures of documents; it consists of:
locating, delimiting and identifying text blocks, graphics, tables, formulas, handwritten text fields and annotations
associating figures and captions
locating and delimiting headers and footers
recovering the reading order (of multicolumn documents)
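Recovering the reading order of a multicolumn page can be sketched as follows, assuming the text blocks have already been located as axis-aligned bounding boxes. The `Block` class and the `reading_order` function are hypothetical, and the heuristic (group horizontally overlapping blocks into columns, then read columns left to right and blocks top to bottom) is a simplification that ignores headers or figures spanning several columns.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    x: int  # left edge
    y: int  # top edge
    w: int  # width
    h: int  # height


def reading_order(blocks):
    """Order blocks column by column (left to right), top to bottom in each."""
    columns = []  # each entry: [x_min, x_max, blocks in that column]
    for b in sorted(blocks, key=lambda blk: blk.x):
        for col in columns:
            # The block joins a column if their horizontal extents overlap.
            if b.x < col[1] and b.x + b.w > col[0]:
                col[0] = min(col[0], b.x)
                col[1] = max(col[1], b.x + b.w)
                col[2].append(b)
                break
        else:
            columns.append([b.x, b.x + b.w, [b]])

    ordered = []
    for col in sorted(columns, key=lambda c: c[0]):
        ordered.extend(sorted(col[2], key=lambda blk: blk.y))
    return ordered
```

For a two-column page, the blocks of the left column are returned first, from top to bottom, followed by those of the right column.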
25
Example: layout modeling of scientific journals
26
Optical Font Recognition (OFR)
OFR aims at identifying the fonts used
OFR is useful:
for improving OCR accuracy, by using dedicated classifiers to distinguish “O” and “0”, “I” and “1”, …
for assigning logical labels, for logical structure recognition
Two strategies may be applied for OFR:
a priori OFR (without considering the content)
a posteriori OFR (when the content is supposed to be known)
27
Document structure recognition
Document structure recognition (also referred to as document understanding) is the first step towards document interpretation
Document understanding deals with logical labeling and logical structure recognition
Two levels of granularity are considered: macro-structure analysis (labeling paragraphs / blocks) and micro-structure analysis (labeling words / strings)
Document structure recognition is still considered an open issue
There is no universal approach
Solutions exist for dedicated document classes (museum notices, checks, tables of contents, scientific papers, newspapers, …)
28
Two Levels of Structural Document Analysis
Physical structure analysis (also called layout analysis):
to locate and identify text blocks, graphics, tables, formulas, handwritten text fields, annotations, …
to recover the reading order
Logical structure analysis (also called document understanding):
to assign a hierarchy of logical labels
first step towards interpretation
29
Use Case: Intelligent Newspaper Indexing
Full-text indexing is not adequate for complex documents
The following items have to be identified:
headlines
editorial
articles (with title, author & function, summary, content, links, ...)
captions (associated with images)
readers’ letters
advertisements
...
30
Use case: Understanding Museum Notices
[Figure: museum notices annotated with logical labels: Group Vedette; Area Title (Principal Title, End of the title); Area Address / Date (Address, Date); Area Collection; Group Cote. From A. Belaïd, LORIA-CNRS Nancy]
31
Possibilities and limits of DA
Layout analysis is considered almost solved for printed documents
It can be achieved generically
Problems remain for textured backgrounds and degraded documents (historical & handwritten documents)
Document understanding is much less mature
Solutions are application-dependent
Application of specific knowledge is needed (document models)
32
Need for Document Recognition Models
There is no universal approach!
Document recognition systems must be tuned for specific applications and for specific document classes
Contextual information is required
Models provide information such as:
generic document structures (DTD or XML Schema)
geometrical and typographical attributes (style information)
semantic information (keywords, dictionaries, databases, ...)
statistical information
33
Content of document models
Generic structure: Document Type Definition (DTD) or XML Schema
Style information: absolute or relative positioning; typographical attributes & formatting rules
Semantics (if available): linguistic information, keywords; application-specific ontology
Probabilistic information: frequencies of items or sequences, co-occurrences
34
Trouble with document models
Document models are hard to produce and to maintain:
implicit models (hard-coded in the application) => hard to modify, adapt, extend
explicit models, written in a formal language => cumbersome to produce, need high expertise
abstract models, learned automatically => need a lot of training data (with ground truth!)
Need for more flexible tools: assisted environments with friendly user interfaces; recognition improving with use; models learned incrementally
35
Pattern Based Document Understanding (2-CREM) [Robaday 03]
Configurations consist of:
a set of vertices, labeled (type) and attributed (pos, typo, ...)
edges between vertices, labeled (neighborhood relation) and attributed (geom, ...)
The model consists of:
extraction rules
for each class, an attribute selector and a list of patterns
[Figure: 2-CREM data flow; surviving labels: document image, extraction, configuration, classification, model, rules, patt., selector, id]
36
Performance evaluation
Performance evaluation is an important issue:
to compare algorithms
to estimate correction costs of real applications
Ground-truthed databases are required:
cost reduction by document analysis tools (bootstrap)
synthetic data as an alternative
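One common way to compare recognition output against ground truth is the character error rate, based on the Levenshtein edit distance. The slide does not prescribe a metric, so this is only an illustrative sketch; the function names `edit_distance` and `character_error_rate` are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(prev[j] + 1,              # delete r
                            curr[j - 1] + 1,          # insert h
                            prev[j - 1] + (r != h)))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the ground-truth text."""
    if not reference:
        raise ValueError("the ground-truth text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

The edit distance also gives a rough estimate of correction cost, since each edit corresponds to one elementary correction an operator would have to make. For example, reading "document" as "docurnent" costs two edits.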
37
List of Lessons
1. Introduction to document analysis and recognition
2. Document image processing
3. Fundamentals of pattern recognition I
4. Fundamentals of pattern recognition II
5. Printed text recognition
6. Font recognition
7. Layout analysis and segmentation
8. Logical structure analysis
9. Graphics recognition
10. Handwriting recognition
11. Reverse engineering of documents
12. Multimodal applications
38
Conclusion on document analysis
Document analysis is useful for many applications
Commercial systems solve some of them
Advanced document analysis prototypes are developed in many research labs around the world
No universal document analysis system is in sight
User-assisted approaches may be a good trade-off for midsize applications
Structural document analysis will not disappear with exclusive electronic document handling (paperless office)
39
Organization of the course
Professor: Rolf Ingold, Pérolles-2, B421, 026 300 84 66
Assistant: Jean-Luc Bloechle, Pérolles-2, B440, 026 300 92 94
Course: Tuesday, 09:15-10:00 & 10:15-11:00
Exercise: Wednesday, 11:15-12:00; requirements: 2/3 of series returned, 1/2 considered satisfactory
Homework: estimated at 4-6 hours a week
Website: http://diuf.unifr.ch/diva/web/
Examination: oral, 20 minutes (alternatively written, 120 min), after spring semester (June 2008) or summer (August-September 2008)
Credits: 5 ECTS