1
Document Analysis: Structure Recognition
Prof. Rolf Ingold, University of Fribourg
Master course, spring semester 2008
2
Outline
- Objectives
- Physical and logical structures
- Examples of applications
- Methodologies for structure recognition
- Microstructures vs. macrostructures
- Model driven approaches
- Interactive systems
© Prof. Rolf Ingold
3
Importance of document structures
- Document = Content + Structures
- Structures convey abstract high level information
- Structures are revealed by styles
4
Applications of document structure recognition
- Information extraction
  - form analysis (check readers, ...)
  - business applications: mail distribution, invoice processing, ...
  - analysis of museum & library notices
  - analysis of bibliographical references
- Document mining, content analysis
  - business reports
  - legal documents
  - scientific publications
- Intelligent indexing
  - laws
  - magazines & newspapers
- Document restyling
  - teaching material
  - ...
5
Extended Processing Chain
[Diagram labels: Blocks, Image, Simple Text, Preprocessing, Postanalysis, OCR, Segmentation, Fonts, OFR, Logical labeling, Struct. Document, Layout analysis]
6
Physical document structures
- Reveal the publisher's view
- Composed of a hierarchy of physical entities
  - text blocks, text lines and tokens
  - graphical primitives
- Universal, i.e. independent of the document class
[Figure: tree with nodes labeled document, region, block, hr, frm]
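The hierarchy of physical entities can be pictured as a plain tree. The sketch below is only an illustration: the class name `PhysicalNode`, the `bbox` field, and all coordinates are assumptions, and the node kinds are borrowed from the labels visible in the slide's figure.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class PhysicalNode:
    """One physical entity (hypothetical representation)."""
    kind: str                          # e.g. "document", "region", "block", "hr"
    bbox: Tuple[int, int, int, int]    # (x0, y0, x1, y1), assumed pixel units
    children: List["PhysicalNode"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "PhysicalNode"]]:
        """Yield (depth, node) pairs in pre-order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Made-up page: two regions, the first holding a text block and a rule.
page = PhysicalNode("document", (0, 0, 600, 800), [
    PhysicalNode("region", (0, 0, 600, 100), [
        PhysicalNode("block", (20, 20, 580, 80)),
        PhysicalNode("hr", (20, 90, 580, 92)),
    ]),
    PhysicalNode("region", (0, 100, 600, 800), [
        PhysicalNode("block", (20, 120, 580, 700)),
    ]),
])

kinds = [(depth, node.kind) for depth, node in page.walk()]
```

Nothing in this tree depends on the document class, which is the sense in which the physical structure is universal.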
7
Illustration of physical document structure (from A. Belaïd)
8
Illustration of logical document structure
9
Logical structures
- Reflect the author's mind
- Independent of presentation
  - can be mapped on various physical structures
- Composed of application dependent logical entities
- Specific to the application and document class
[Figure: tree with nodes labeled document, article, title, author, hdln, link, p]
10
Relation between logical and physical structure
- There is no 1:1 relation between physical and logical structure
- There are some correspondences between the two (illustrated by a figure on the original slide)
11
Role of style sheets
[Diagram labels: analysis, formatting, Stylesheet, Logical Structure, Physical Structure, edit, print, display]
- Document formatting is straightforward...
- But document analysis is a non-trivial task that generally cannot be fully automated
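To make the asymmetry concrete, here is a minimal sketch of the formatting direction only. The labels, the style values, and the function `format_document` are hypothetical and not taken from the course; each logical entity is placed as a single line for simplicity.

```python
from typing import Dict, List, Tuple

# Hypothetical logical structure: (label, text) pairs in reading order.
logical: List[Tuple[str, str]] = [
    ("title", "Structure Recognition"),
    ("author", "R. Ingold"),
    ("p", "Documents have structure."),
]

# Hypothetical style sheet: one entry per logical label, values in points.
stylesheet: Dict[str, Dict[str, int]] = {
    "title": {"font_size": 18, "space_before": 24},
    "author": {"font_size": 12, "space_before": 6},
    "p": {"font_size": 10, "space_before": 6},
}


def format_document(entities, styles, top=0):
    """Return one (y, height, font_size, text) block per logical entity."""
    blocks = []
    y = top
    for label, text in entities:
        style = styles[label]
        y += style["space_before"]
        height = style["font_size"]  # assumption: one line per entity
        blocks.append((y, height, style["font_size"], text))
        y += height
    return blocks


blocks = format_document(logical, stylesheet)
```

Formatting is a deterministic lookup per label. Going the other way means guessing the label from observed sizes and spacing, which becomes ambiguous as soon as two labels share a style.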
12
Methodologies
- Document structural analysis can be
  - data-driven: the recognition task is based on image analysis
  - model-driven: the recognition task is ...
- Methods of structural document analysis can be classified into
  - geometrical approaches
  - syntactic approaches based on formal grammars
  - structural approaches based on graphs
  - rule based approaches
  - expert systems (artificial intelligence)
  - machine learning
13
Syntactic Document Recognition [Ingold89]
- Fully model driven approach
- Formal document description language
  - attributed grammar translated into an analysis graph
- Top down matching algorithm with backtracking
  - for macro-structure as well as micro-structure recognition
- Very generic approach
- Sensitive to noise (no error recovery)
- Theoretically exponential complexity
14
Document Description Language [Ingold89]
- Document class specific formal description composed of
  - composition rules (context-free grammar)
  - typographical rules (attributes)

    Act:DOC => ActNumber ActContent FootNotes Headings ;
    ActNumber:FRG => {Number $ Period} ;
    ActContent:PRT => ActTitle ActDate Otgan {Provis} Formul {Chapter} [Validity] ;
    ...
    Chapter:PRT => ChTitle ({Section} | {Article}) ;
    ChTitle.zone = Inherited
    ChTitle.alignment = (Allowed, Centered, 0pt, 0pt, Undefined) ;
    ChTitle.lineHeight = 11pt ;
    ChTitle.spaceBefore = (Allowed,[6pt, 60pt] ) ;
    ChTitle.interSpace = (Forbidden, [2pt, 3pt]) ;
    ChTitle.font = (Times, 11pt, Bold, Roman);
    Article.spaceBefore = <FST: (Forbidden, [6pt, 30pt]), NXT: (Allowed, [6pt, 30pt])> ;
    ...
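As a rough illustration of how a typographical rule could be checked against measurements, the sketch below encodes three of the ChTitle attributes listed on this slide. The class `TypoRule` and the function `satisfies` are hypothetical; the Allowed/Forbidden flags are deliberately left out because the slide does not define them, and the interval is assumed to be inclusive.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TypoRule:
    """Subset of the typographical attributes of one entity (assumed encoding)."""
    font: Tuple[str, float, str, str]   # (family, size in pt, weight, slant)
    line_height: float                  # in pt
    space_before: Tuple[float, float]   # assumed inclusive [min, max] range in pt


# Values copied from the ChTitle rules on the slide.
CH_TITLE = TypoRule(
    font=("Times", 11.0, "Bold", "Roman"),
    line_height=11.0,
    space_before=(6.0, 60.0),
)


def satisfies(rule: TypoRule, font, line_height: float, space_before: float) -> bool:
    """Crisp (yes/no) test of measured attributes against a rule."""
    low, high = rule.space_before
    return (
        font == rule.font
        and line_height == rule.line_height
        and low <= space_before <= high
    )


ok = satisfies(CH_TITLE, ("Times", 11.0, "Bold", "Roman"), 11.0, 12.0)
too_tight = satisfies(CH_TITLE, ("Times", 11.0, "Bold", "Roman"), 11.0, 3.0)
```

A crisp test like this rejects a candidate on any small measurement error, which is one way to read the noise sensitivity noted for the syntactic approach.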
15
Analysis graph [Ingold89]
- Analysis graph for syntactic analysis, where each node has two links
  - successor (in case of successful match)
  - alternative (in case of unsuccessful match)
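The two-link structure can be sketched as a small top-down matcher with backtracking. This is a simplified illustration, not the [Ingold89] implementation: nodes match bare token labels, the names `Node`, `ACCEPT`, `FAIL`, and `match` are hypothetical, and the graph encodes one assumed reading of the Chapter rule (a ChTitle followed by one or more Sections or by one or more Articles).

```python
from dataclasses import dataclass
from typing import List

ACCEPT = -1   # sentinel: the rule is complete
FAIL = -2     # sentinel: no further alternative


@dataclass
class Node:
    symbol: str        # token label this node tries to match
    successor: int     # next node after a successful match
    alternative: int   # node to try when the match fails


GRAPH: List[Node] = [
    Node("ChTitle", successor=1, alternative=FAIL),    # 0
    Node("Section", successor=2, alternative=3),       # 1: first Section, else Articles
    Node("Section", successor=2, alternative=ACCEPT),  # 2: further Sections, else done
    Node("Article", successor=4, alternative=FAIL),    # 3: first Article
    Node("Article", successor=4, alternative=ACCEPT),  # 4: further Articles, else done
]


def match(graph: List[Node], tokens: List[str], node: int = 0, pos: int = 0) -> bool:
    """Return True if the whole token sequence is accepted."""
    if node == ACCEPT:
        return pos == len(tokens)
    if node == FAIL:
        return False
    current = graph[node]
    if pos < len(tokens) and tokens[pos] == current.symbol:
        if match(graph, tokens, current.successor, pos + 1):
            return True
    # Either the token did not match or the successor path failed: backtrack.
    return match(graph, tokens, current.alternative, pos)


sections_ok = match(GRAPH, ["ChTitle", "Section", "Section"])
articles_ok = match(GRAPH, ["ChTitle", "Article"])
mixed_ok = match(GRAPH, ["ChTitle", "Section", "Article"])
```

Because the alternative link is retried after a failed successor path, the search may revisit the same input position many times, which is where the exponential worst case comes from.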
16
Fuzzy document structure recognition [Hu94]
- The previous approach has been adapted to be less sensitive to matching errors
- Matching uses fuzzy logic
17
Fuzzy document structure recognition [Hu94]
- Pattern matching uses fuzzy logic
- Parsing is expressed as a cost function to be optimized
  - finding the shortest path in a graph (solved by linear programming)
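The idea of replacing yes/no matches by graded ones and then minimizing a total cost can be sketched as follows. This is not the [Hu94] method: the labels, expected font sizes, allowed transitions, and the triangular membership function are all assumptions, and the shortest path is found by simple dynamic programming over a layered graph rather than by the linear programming mentioned on the slide.

```python
from typing import Dict, List, Tuple


def triangular(x: float, center: float, width: float) -> float:
    """Fuzzy membership in [0, 1]: 1 at center, 0 beyond center +/- width."""
    return max(0.0, 1.0 - abs(x - center) / width)


# Hypothetical model: expected font size per label, allowed label sequences.
EXPECTED: Dict[str, float] = {"title": 18.0, "heading": 14.0, "body": 10.0}
ALLOWED = {("title", "heading"), ("heading", "body"),
           ("body", "body"), ("body", "heading")}


def best_labeling(sizes: List[float], width: float = 4.0) -> Tuple[float, List[str]]:
    """Cheapest label sequence for a non-empty list of observed font sizes."""
    labels = list(EXPECTED)
    # best[label] = (total cost, path) of the cheapest path ending in label
    best = {l: (1.0 - triangular(sizes[0], EXPECTED[l], width), [l]) for l in labels}
    for size in sizes[1:]:
        nxt = {}
        for l in labels:
            local = 1.0 - triangular(size, EXPECTED[l], width)
            candidates = [(cost + local, path + [l])
                          for prev, (cost, path) in best.items()
                          if (prev, l) in ALLOWED]
            if candidates:
                nxt[l] = min(candidates, key=lambda item: item[0])
        best = nxt
    return min(best.values(), key=lambda item: item[0])


cost, path = best_labeling([17.0, 13.5, 10.5, 9.0])
```

None of the four sizes equals its expected value exactly, yet the sequence is still labeled, which a crisp matcher with exact comparisons would refuse to do.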
18
Graphein: Blackboard approach [Chenevoy92]
19
Model of Graphein [Chenevoy92]
20
Complex Layout Analysis [Azokly95]
21
Modeling of Scientific Journals [Azokly95]
22
Model for a Scientific Journal
...
23
Use of Document Recognition Models
- There is no universal approach!
- Document recognition systems must be tuned
  - for specific applications
  - for specific document classes
- Contextual information is required
- Models provide information like
  - generic document structures (DTD or XML-schema)
  - geometrical and typographical attributes (style information)
  - semantic information (keywords, dictionaries, databases, ...)
  - statistical information
24
Content of document models
- Generic structure
  - Document Type Definition (DTD) or XML-schema
- Style information
  - Absolute or relative positioning
  - Typographical attributes & formatting rules
- Semantics (if available)
  - Linguistic information, keywords
  - Application specific ontology
- Probabilistic information
  - Frequencies of items or sequences, co-occurrences
25
Trouble with document models
- Document models are hard to produce and to maintain
  - implicit models (hard coded in the application) => hard to modify, adapt, extend
  - explicit models, written in a formal language => cumbersome to produce, need high expertise
  - abstract models, learned automatically => need a lot of training data (with ground-truth!)
- Need for more flexible tools: assisted environments with
  - friendly user interfaces
  - recognition improving with use
  - models learned incrementally
26
Pattern Based Document Understanding [Robaday 03]
- Configurations consist of
  - Set of vertices
    - Labeled (type)
    - Attributed (pos, typo, ...)
  - Edges between vertices
    - Labeled (neighborhood relation)
    - Attributed (geom, ...)
- Model consists of
  - Extraction rules
  - For each class
    - Attribute selector
    - List of patterns
[Diagram labels: document image, extraction, rules, configuration, model, patt. selector, classification, id]
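A configuration as described here is a labeled, attributed graph. The sketch below shows one possible encoding; the class names, attribute keys, relation names, and values are assumptions chosen for illustration and do not come from [Robaday 03].

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Vertex:
    label: str                                              # type of the entity
    attrs: Dict[str, object] = field(default_factory=dict)  # pos, typo, ...


@dataclass
class Edge:
    source: int                                             # index into vertices
    target: int
    label: str                                              # neighborhood relation
    attrs: Dict[str, object] = field(default_factory=dict)  # geom, ...


@dataclass
class Configuration:
    vertices: List[Vertex]
    edges: List[Edge]

    def neighbors(self, index: int, relation: str) -> List[Vertex]:
        """Vertices reached from `index` through edges labeled `relation`."""
        return [self.vertices[e.target] for e in self.edges
                if e.source == index and e.label == relation]


# Made-up configuration: a bold line above a regular line, next to an image.
config = Configuration(
    vertices=[
        Vertex("text", {"pos": (20, 20), "typo": "bold"}),
        Vertex("text", {"pos": (20, 60), "typo": "regular"}),
        Vertex("image", {"pos": (300, 60)}),
    ],
    edges=[
        Edge(0, 1, "above", {"gap": 12}),
        Edge(1, 2, "left_of", {"gap": 30}),
    ],
)

below_first = config.neighbors(0, "above")
```

A pattern for a class could then be expressed as a small graph of the same shape, to be compared with the neighborhood of each candidate vertex.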
27
Evolution of 2-CREM performance
- Improvement of correct labeling as a function of the number of clicks used for correcting labels manually
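The general idea of a model that improves with each manual correction can be sketched with a toy labeler. This is not 2-CREM: the class `IncrementalLabeler`, the single font-size feature, the nearest-mean decision, and the data are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


class IncrementalLabeler:
    """Toy nearest-mean labeler updated only from user corrections."""

    def __init__(self) -> None:
        self.stats: Dict[str, Tuple[float, int]] = {}  # label -> (sum, count)

    def predict(self, font_size: float) -> Optional[str]:
        if not self.stats:
            return None
        return min(self.stats,
                   key=lambda l: abs(font_size - self.stats[l][0] / self.stats[l][1]))

    def correct(self, font_size: float, label: str) -> None:
        total, count = self.stats.get(label, (0.0, 0))
        self.stats[label] = (total + font_size, count + 1)


def run_session(labeler: IncrementalLabeler,
                blocks: List[Tuple[float, str]]) -> int:
    """Label blocks in order; return the number of manual corrections (clicks)."""
    clicks = 0
    for font_size, truth in blocks:
        if labeler.predict(font_size) != truth:
            clicks += 1                       # the user fixes the label
            labeler.correct(font_size, truth)
    return clicks


blocks = [(18.0, "title"), (10.0, "body"), (10.5, "body"),
          (17.0, "title"), (9.5, "body")]
labeler = IncrementalLabeler()
first_pass = run_session(labeler, blocks)
second_pass = run_session(labeler, blocks)
```

On this made-up data the number of clicks drops from the first pass to the second, which is the kind of trend the slide's performance curve reports.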
28
Conclusion
- Structure recognition of documents is still an open issue
- Solutions exist for specialized applications
- Generic approaches are not mature
  - models are hard to establish
  - training data is missing
- As an alternative: interactive systems with incremental model adaptation