Document Analysis: Structure Recognition. Prof. Rolf Ingold, University of Fribourg. Master course, spring semester 2008.

1 Document Analysis: Structure Recognition. Prof. Rolf Ingold, University of Fribourg. Master course, spring semester 2008

2 Outline (© Prof. Rolf Ingold)
- Objectives
- Physical and logical structures
- Examples of applications
- Methodologies for structure recognition
- Microstructures vs. macrostructures
- Model driven approaches
- Interactive systems

3 Importance of document structures
- Document = Content + Structures
- Structures convey abstract high level information
- Structures are revealed by styles

4 Applications of document structure recognition
- Information extraction
  - form analysis (check readers, ...)
  - business applications: mail distribution, invoice processing, ...
  - analysis of museum and library notices
  - analysis of bibliographical references
- Document mining, content analysis
  - business reports
  - legal documents
  - scientific publications
- Intelligent indexing
  - laws
  - magazines and newspapers
- Document restyling
  - teaching material
  - ...

5 Extended Processing Chain
[Diagram labels: Image, Preprocessing, Segmentation, Blocks, OCR, OFR, Fonts, Simple Text, Layout analysis, Logical labeling, Postanalysis, Struct. Document]

6 Physical document structures
- Reveal the publisher's view
- Composed of a hierarchy of physical entities
  - text blocks, text lines and tokens
  - graphical primitives
- Universal, i.e. independent of the document class
[Diagram: tree with nodes labeled document, region, block, hr, frm]
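
The hierarchy of physical entities described on this slide can be pictured as a simple tree. The following is a minimal sketch only: the class name `PhysicalEntity`, the helper `collect` and the sample page are hypothetical and do not come from the lecture.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhysicalEntity:
    # kind is a free-form label such as "document", "region", "block", "line"
    kind: str
    children: List["PhysicalEntity"] = field(default_factory=list)


def collect(entity, kind):
    """Depth-first search returning every entity of the given kind."""
    found = [entity] if entity.kind == kind else []
    for child in entity.children:
        found.extend(collect(child, kind))
    return found


# Hypothetical page: one region holding a two-line text block and a rule (hr).
page = PhysicalEntity("document", [
    PhysicalEntity("region", [
        PhysicalEntity("block", [PhysicalEntity("line"), PhysicalEntity("line")]),
        PhysicalEntity("hr"),
    ]),
])

text_lines = collect(page, "line")
```

Because the tree only uses generic entity kinds, the same structure applies to any document class, which is the sense in which the slide calls physical structures universal.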

7 Illustration of physical document structure (from A. Belaïd)

8 Illustration of logical document structure

9 Logical structures
- Reflect the author's mind
- Independent of presentation
  - can be mapped onto various physical structures
- Composed of application-dependent logical entities
- Specific to the application and document class
[Diagram: tree with nodes labeled document, article, hdln, title, author, link, p]

10 Relation between logical and physical structure
- There is no 1:1 relation between physical and logical structure
- There are some correspondences between them, as shown in the slide's figure

11 Role of style sheets
[Diagram labels: analysis, formatting, Stylesheet, Logical Structure, Physical Structure, edit, print, display]
- Document formatting is straightforward ...
- But document analysis is a nontrivial task that generally cannot be fully automated

12 Methodologies
- Document structural analysis can be
  - data-driven: the recognition task is based on image analysis
  - model-driven: the recognition task is ...
- Methods of structural document analysis can be classified into
  - geometrical approaches
  - syntactic approaches based on formal grammars
  - structural approaches based on graphs
  - rule-based approaches
  - expert systems (artificial intelligence)
  - machine learning

13 Syntactic Document Recognition [Ingold89]
- Fully model-driven approach
- Formal document description language
  - attributed grammar
  - translated into an analysis graph
- Top-down matching algorithm with backtracking
  - for macro-structure as well as micro-structure recognition
- Very generic approach
- Sensitive to noise (no error recovery)
- Theoretically exponential complexity

14 Document Description Language [Ingold89]
- Document class specific formal description composed of
  - composition rules (context-free grammar)
  - typographical rules (attributes)

    Act:DOC => ActNumber ActContent FootNotes Headings ;
    ActNumber:FRG => {Number $ Period} ;
    ActContent:PRT => ActTitle ActDate Otgan {Provis} Formul {Chapter} [Validity] ;
    ...
    Chapter:PRT => ChTitle ({Section} | {Article}) ;
    ChTitle.zone = Inherited
    ChTitle.alignment = (Allowed, Centered, 0pt, 0pt, Undefined) ;
    ChTitle.lineHeight = 11pt ;
    ChTitle.spaceBefore = (Allowed, [6pt, 60pt]) ;
    ChTitle.interSpace = (Forbidden, [2pt, 3pt]) ;
    ChTitle.font = (Times, 11pt, Bold, Roman) ;
    Article.spaceBefore = <FST: (Forbidden, [6pt, 30pt]), NXT: (Allowed, [6pt, 30pt])> ;
    ...
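
To illustrate how typographical rules of this kind might constrain a candidate block, here is a minimal sketch in Python. It is not the [Ingold89] implementation. The names `Block`, `CH_TITLE_RULES` and `satisfies` are hypothetical, only three of the `ChTitle` attributes are modelled, and the `Allowed`/`Forbidden` flags are left out because the slide does not define their meaning.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Block:
    # Measurements assumed to come from earlier layout and font analysis.
    font: Tuple[str, int, str, str]   # (family, size in pt, weight, slant)
    line_height: float                # in pt
    space_before: float               # in pt


# Values copied from the ChTitle rules on the slide; the dict layout is an assumption.
CH_TITLE_RULES = {
    "font": ("Times", 11, "Bold", "Roman"),
    "line_height": 11,
    "space_before": (6, 60),          # admissible interval in pt
}


def satisfies(block, rules):
    """Return True if the block meets every modelled typographical rule."""
    low, high = rules["space_before"]
    return (
        block.font == rules["font"]
        and block.line_height == rules["line_height"]
        and low <= block.space_before <= high
    )


candidate = Block(("Times", 11, "Bold", "Roman"), 11, 12)
is_chapter_title = satisfies(candidate, CH_TITLE_RULES)
```

An exact comparison like this one is brittle on noisy measurements, which matches the sensitivity to noise noted for the syntactic approach.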

15 Analysis graph [Ingold89]
- Analysis graph for syntactic analysis where each node has two links
  - successor (in case of successful match)
  - alternative (in case of unsuccessful match)
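
The two-link node structure lends itself to a short recursive matcher with backtracking. The sketch below is a simplified illustration, not the algorithm of [Ingold89]: the names `Node` and `matches` are hypothetical, nodes compare plain string labels instead of attributed entities, and repetition is not handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    symbol: str                            # label expected at this node
    successor: Optional["Node"] = None     # followed after a successful match
    alternative: Optional["Node"] = None   # tried after an unsuccessful match


def matches(node, tokens, pos=0):
    """Top-down match of tokens[pos:] against the graph, with backtracking."""
    if node is None:
        # End of the graph: succeed only if every token was consumed.
        return pos == len(tokens)
    if pos < len(tokens) and tokens[pos] == node.symbol:
        if matches(node.successor, tokens, pos + 1):
            return True
    # Either this node did not match or the rest of the path failed:
    # backtrack and try the alternative link, if there is one.
    return node.alternative is not None and matches(node.alternative, tokens, pos)


# Simplified form of "Chapter => ChTitle (Section | Article)".
article = Node("Article")
section = Node("Section", alternative=article)
chapter = Node("ChTitle", successor=section)

ok = matches(chapter, ["ChTitle", "Article"])
```

In the worst case every alternative is retried at every position, which is where the exponential complexity mentioned for this approach comes from.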

16 Fuzzy document structure recognition [Hu94]
- The previous approach has been adapted to be less sensitive to matching errors
  - matching uses fuzzy logic

17 Fuzzy document structure recognition [Hu94]
- Pattern matching uses fuzzy logic
- Parsing is expressed as a cost function to be optimized
  - finding the shortest path in a graph (solved by linear programming)
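
The idea of parsing as a shortest-path problem can be sketched as follows. Note the substitutions: the slide states that [Hu94] solves the problem by linear programming, whereas this sketch uses Dijkstra's algorithm for brevity. The function name `shortest_path`, the sample graph and the convention "cost = 1 - fuzzy membership" are assumptions made for illustration only.

```python
import heapq


def shortest_path(graph, start, goal):
    """Dijkstra search. graph maps a node to a list of (neighbor, cost) pairs.

    Returns (total_cost, path), or (inf, []) if the goal is unreachable.
    """
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, edge_cost in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + edge_cost, neighbor, path + [neighbor]))
    return float("inf"), []


# Hypothetical interpretation graph: each edge cost is 1 minus the fuzzy
# membership of the next block in the target logical class.
graph = {
    "start": [("title", 0.1), ("paragraph", 0.6)],
    "title": [("author", 0.2), ("paragraph", 0.5)],
    "author": [("end", 0.4)],
    "paragraph": [("end", 0.3)],
}

best_cost, best_path = shortest_path(graph, "start", "end")
```

A poor local match only raises the cost of a path instead of rejecting it outright, which is how this formulation tolerates matching errors.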

18 Graphein: Blackboard approach [Chenevoy92]

19 Model of Graphein [Chenevoy92]

20 Complex Layout Analysis [Azokly95]

21 Modeling of Scientific Journals [Azokly95]

22 Model for a Scientific Journal

23 Use of Document Recognition Models
- There is no universal approach!
- Document recognition systems must be tuned
  - for specific applications
  - for specific document classes
- Contextual information is required
- Models provide information like
  - generic document structures (DTD or XML Schema)
  - geometrical and typographical attributes (style information)
  - semantic information (keywords, dictionaries, databases, ...)
  - statistical information

24 Content of document models
- Generic structure
  - Document Type Definition (DTD) or XML Schema
- Style information
  - Absolute or relative positioning
  - Typographical attributes and formatting rules
- Semantics (if available)
  - Linguistic information, keywords
  - Application-specific ontology
- Probabilistic information
  - Frequencies of items or sequences, co-occurrences

25 Trouble with document models
- Document models are hard to produce and to maintain
  - implicit models (hard-coded in the application)
    => hard to modify, adapt, extend
  - explicit models, written in a formal language
    => cumbersome to produce, need high expertise
  - abstract models, learned automatically
    => need a lot of training data (with ground truth!)
- Need for more flexible tools:
  - assisted environments with friendly user interfaces
  - recognition improving with use
  - models are learned incrementally

26 Pattern Based Document Understanding [Robaday 03]
- Configurations consist of
  - Set of vertices
    - Labeled (type)
    - Attributed (pos, typo, ...)
  - Edges between vertices
    - Labeled (neighborhood relation)
    - Attributed (geom, ...)
- Model consists of
  - Extraction rules
  - For each class
    - Attribute selector
    - List of patterns
[Diagram labels: document image, extraction, rules, configuration, classification, model, patt., selector, id]
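
A configuration as described here is a labeled, attributed graph. The sketch below shows one possible in-memory representation. All class, field and relation names (`Vertex`, `Edge`, `Configuration`, `neighbors`, `"below"`) are hypothetical and are not taken from [Robaday 03].

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Vertex:
    label: str                                    # type of the element
    attrs: Dict[str, object] = field(default_factory=dict)   # pos, typo, ...


@dataclass
class Edge:
    source: int                                   # index into Configuration.vertices
    target: int
    label: str                                    # neighborhood relation
    attrs: Dict[str, object] = field(default_factory=dict)   # geom, ...


@dataclass
class Configuration:
    vertices: List[Vertex] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

    def neighbors(self, index, relation):
        """Vertices reached from vertices[index] through edges with the given label."""
        return [
            self.vertices[e.target]
            for e in self.edges
            if e.source == index and e.label == relation
        ]


# Hypothetical configuration: a bold line with a regular text block below it.
config = Configuration(
    vertices=[
        Vertex("line", {"pos": (72, 90), "typo": "bold"}),
        Vertex("block", {"pos": (72, 120), "typo": "regular"}),
    ],
    edges=[Edge(0, 1, "below", {"geom": {"distance_pt": 18}})],
)

below_first = config.neighbors(0, "below")
```

A pattern for a class could then be expressed as a small configuration to look for inside the one extracted from the document image, although the slide does not detail how the matching is done.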

27 Evolution of 2-CREM performance
- Improvement of correct labeling as a function of the number of clicks used for correcting labels manually

28 Conclusion
- Structure recognition of documents is still an open issue
- Solutions exist for specialized applications
- Generic approaches are not mature
  - models are hard to establish
  - training data is missing
- As an alternative
  - interactive systems
  - with incremental model adaptation

