Document Analysis: Structure Recognition
Prof. Rolf Ingold, University of Fribourg
Master course, spring semester 2008



© Prof. Rolf Ingold

Slide 2: Outline
- Objectives
- Physical and logical structures
- Examples of applications
- Methodologies for structure recognition
- Microstructures vs. macrostructures
- Model driven approaches
- Interactive Systems

Slide 3: Importance of document structures
- Document = Content + Structures
- Structures convey abstract, high-level information
- Structures are revealed by styles

Slide 4: Applications of document structure recognition
- Information extraction
  - form analysis (check readers, ...)
  - business applications: mail distribution, invoice processing, ...
  - analysis of museum & library notices
  - analysis of bibliographical references
- Document mining, content analysis
  - business reports
  - legal documents
  - scientific publications
- Intelligent indexing
  - laws
  - magazines & newspapers
- Document restyling
  - teaching material
  - ...

Slide 5: Extended Processing Chain
[Diagram; labels: Image, Preprocessing, Segmentation, Layout analysis, Blocks, OCR, OFR, Fonts, Simple Text, Logical labeling, Postanalysis, Struct. Document]

Slide 6: Physical document structures
- Reveal the publisher's view
- Composed of a hierarchy of physical entities
  - text blocks, text lines and tokens
  - graphical primitives
- Universal, i.e. independent of the document class
[Diagram: tree of physical entities; node labels: document, region, block, hr, frm]

Slide 7: Illustration of physical document structure (from A. Belaïd)

Slide 8: Illustration of logical document structure

Slide 9: Logical structures
- Reflect the author's mind
- Independent of presentation
  - can be mapped onto various physical structures
- Composed of application-dependent logical entities
- Specific to the application and document class
[Diagram: tree of logical entities; node labels: document, article, hdln, title, author, p, link]

Slide 10: Relation between logical and physical structure
- There is no 1:1 relation between physical and logical structure
- There are some correspondences between the two, as illustrated on the original slide

Slide 11: Role of style sheets
[Diagram; labels: Logical Structure, Stylesheet, Physical Structure, formatting, analysis, edit, print, display]
- Document formatting is straightforward...
- But document analysis is a nontrivial task that generally cannot be fully automated
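
The "formatting" direction can be illustrated with a minimal Python sketch. It is not taken from the slides: the element names, the `stylesheet` dictionary, the `line_gap` parameter and the function `format_document` are all hypothetical, and real formatters handle far more than vertical stacking.

```python
# Hypothetical logical structure: (logical label, text) pairs in reading order.
logical = [
    ("title", "A sample title"),
    ("author", "A sample author"),
    ("p", "A sample paragraph."),
]

# Hypothetical style sheet: one typographical style per logical label.
stylesheet = {
    "title": {"font_size": 18, "bold": True},
    "author": {"font_size": 12, "bold": False},
    "p": {"font_size": 10, "bold": False},
}

def format_document(logical, stylesheet, line_gap=4):
    """Map logical elements to physical blocks stacked from top to bottom."""
    blocks, y = [], 0
    for label, text in logical:
        style = stylesheet[label]
        # The physical block keeps text, position and typography only.
        blocks.append({"text": text, "y": y, **style})
        y += style["font_size"] + line_gap
    return blocks

blocks = format_document(logical, stylesheet)
for block in blocks:
    print(block)
```

The resulting blocks no longer carry their logical labels. Recovering those labels from position and typography alone is the analysis direction, which is why it is the harder of the two.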

Slide 12: Methodologies
- Document structural analysis can be
  - data-driven: the recognition task is based on image analysis
  - model-driven: the recognition task is ...
- Methods of structural document analysis can be classified into
  - geometrical approaches
  - syntactic approaches based on formal grammars
  - structural approaches based on graphs
  - rule-based approaches
  - expert systems (artificial intelligence)
  - machine learning

Slide 13: Syntactic Document Recognition [Ingold89]
- Fully model-driven approach
- Formal document description language
  - attributed grammar
  - translated into an analysis graph
- Top-down matching algorithm with backtracking
  - for macro-structure as well as micro-structure recognition
- Very generic approach
- Sensitive to noise (no error recovery)
- Theoretically exponential complexity

Slide 14: Document Description Language [Ingold89]
- Document-class-specific formal description composed of
  - composition rules (context-free grammar)
  - typographical rules (attributes)

    Act:DOC => ActNumber ActContent FootNotes Headings ;
    ActNumber:FRG => {Number $ Period} ;
    ActContent:PRT => ActTitle ActDate Otgan {Provis} Formul {Chapter} [Validity] ;
    ...
    Chapter:PRT => ChTitle ({Section} | {Article}) ;

    ChTitle.zone = Inherited
    ChTitle.alignment = (Allowed, Centered, 0pt, 0pt, Undefined) ;
    ChTitle.lineHeight = 11pt ;
    ChTitle.spaceBefore = (Allowed, [6pt, 60pt]) ;
    ChTitle.interSpace = (Forbidden, [2pt, 3pt]) ;
    ChTitle.font = (Times, 11pt, Bold, Roman) ;
    Article.spaceBefore = <FST: (Forbidden, [6pt, 30pt]), NXT: (Allowed, [6pt, 30pt])> ;
    ...

Slide 15: Analysis graph [Ingold89]
- Analysis graph for syntactic analysis, where each node has two links
  - successor (followed after a successful match)
  - alternative (followed after an unsuccessful match)
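
The two-link idea can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not the [Ingold89] implementation: the class `Node`, the function `matches` and the token labels are hypothetical, a match is plain string equality, and repetition and typographical attributes are left out.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Node:
    """One node of a hypothetical analysis graph."""
    expected: str                          # label this node tries to match
    successor: Optional["Node"] = None     # followed after a successful match
    alternative: Optional["Node"] = None   # followed after an unsuccessful match

def matches(node: Optional[Node], tokens: Sequence[str], pos: int = 0) -> bool:
    """Top-down matching with backtracking over successor/alternative links."""
    if node is None:
        # End of a successor chain: succeed only if all tokens were consumed.
        return pos == len(tokens)
    if pos < len(tokens) and tokens[pos] == node.expected:
        if matches(node.successor, tokens, pos + 1):
            return True
        # The rest of the successor path failed: backtrack and try the alternative.
    if node.alternative is None:
        return False
    return matches(node.alternative, tokens, pos)

# Simplified version of "Chapter => ChTitle (Section | Article)".
article = Node("Article")
section = Node("Section", alternative=article)
chapter = Node("ChTitle", successor=section)

print(matches(chapter, ["ChTitle", "Section"]))
print(matches(chapter, ["ChTitle", "Article"]))
print(matches(chapter, ["ChTitle", "FootNote"]))
```

Because every failed path is retried through its alternatives, the number of explored paths can grow exponentially with the depth of the graph, which is consistent with the complexity remark on slide 13.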

Slide 16: Fuzzy document structure recognition [Hu94]
- The previous approach has been adapted to be less sensitive to matching errors
  - matching uses fuzzy logic

Slide 17: Fuzzy document structure recognition [Hu94]
- Pattern matching uses fuzzy logic
- Parsing is expressed as a cost function to be optimized
  - finding the shortest path in a graph (solved by linear programming)
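
A rough Python sketch of the cost-based view follows. It makes several assumptions that are not in the slides: the trapezoidal membership function, the cost definition `1 - membership`, all numeric values and the node names are hypothetical. The slide states that the shortest path is solved by linear programming; the sketch swaps in Dijkstra's algorithm, which is a different technique for the same shortest-path problem.

```python
import heapq

def trapezoid(x, a, b, c, d):
    """Fuzzy membership: 1 inside [b, c], 0 outside (a, d), linear in between."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def cheapest_path(graph, start, goal):
    """Dijkstra over {node: [(neighbor, cost), ...]}; returns (cost, path)."""
    queue = [(0.0, start, [start])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, step_cost in graph.get(node, []):
            if neighbor not in settled:
                heapq.heappush(queue, (cost + step_cost, neighbor, path + [neighbor]))
    return float("inf"), []

# Hypothetical measurement: a block preceded by 5pt of white space.
space_before = 5
mu_title = trapezoid(space_before, 4, 6, 60, 70)    # partial match
mu_article = trapezoid(space_before, 2, 4, 30, 40)  # full match

# Each candidate label is an edge whose cost falls as the fuzzy match improves.
graph = {
    "start": [("ChTitle", 1 - mu_title), ("Article", 1 - mu_article)],
    "ChTitle": [("end", 0.0)],
    "Article": [("end", 0.0)],
}
print(cheapest_path(graph, "start", "end"))
```

Unlike the strict matcher of the syntactic approach, a poor match here only raises the cost of a path instead of rejecting it outright, which is what makes the parse tolerant to matching errors.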

Slide 18: Graphein: Blackboard approach [Chenevoy92]

Slide 19: Model of Graphein [Chenevoy92]

Slide 20: Complex Layout Analysis [Azokly95]

Slide 21: Modeling of Scientific Journals [Azokly95]

Slide 22: Model for a Scientific Journal

Slide 23: Use of Document Recognition Models
- There is no universal approach!
- Document recognition systems must be tuned
  - for specific applications
  - for specific document classes
- Contextual information is required
- Models provide information such as
  - generic document structures (DTD or XML Schema)
  - geometrical and typographical attributes (style information)
  - semantic information (keywords, dictionaries, databases, ...)
  - statistical information

Slide 24: Content of document models
- Generic structure
  - Document Type Definition (DTD) or XML Schema
- Style information
  - Absolute or relative positioning
  - Typographical attributes & formatting rules
- Semantics (if available)
  - Linguistic information, keywords
  - Application-specific ontology
- Probabilistic information
  - Frequencies of items or sequences, co-occurrences

Slide 25: Trouble with document models
- Document models are hard to produce and to maintain
  - implicit models (hard-coded in the application)
    => hard to modify, adapt, extend
  - explicit models, written in a formal language
    => cumbersome to produce, need high expertise
  - abstract models, learned automatically
    => need a lot of training data (with ground truth!)
- Need for more flexible tools:
  - assisted environments with friendly user interfaces
  - recognition improving with use
  - models are learned incrementally

Slide 26: Pattern Based Document Understanding [Robaday 03]
- Configurations consist of
  - Set of vertices
    - Labeled (type)
    - Attributed (pos, typo, ...)
  - Edges between vertices
    - Labeled (neighborhood relation)
    - Attributed (geom, ...)
- Model consists of
  - Extraction rules
  - For each class
    - Attribute selector
    - List of patterns
[Diagram; labels: document image, extraction, rules, configuration, model, classification, patt., selector, id]
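
The notion of a configuration as a labeled, attributed graph can be sketched as a small Python data structure. This is an assumption-laden illustration, not the [Robaday 03] system: the classes `Vertex`, `Edge` and `Configuration`, the function `select_attributes`, and all attribute names and values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    label: str                                  # type, e.g. "text_line"
    attrs: dict = field(default_factory=dict)   # position, typography, ...

@dataclass
class Edge:
    source: int                                 # index of the source vertex
    target: int                                 # index of the target vertex
    label: str                                  # neighborhood relation, e.g. "above"
    attrs: dict = field(default_factory=dict)   # geometry, e.g. a distance

@dataclass
class Configuration:
    vertices: list
    edges: list

    def neighbors(self, index, relation):
        """Indices of vertices reached from `index` through `relation` edges."""
        return [e.target for e in self.edges
                if e.source == index and e.label == relation]

def select_attributes(config, names):
    """Hypothetical attribute selector: keep only the chosen vertex attributes."""
    return [[v.attrs.get(name) for name in names] for v in config.vertices]

config = Configuration(
    vertices=[
        Vertex("text_line", {"font_size": 14, "bold": True, "y": 0}),
        Vertex("text_line", {"font_size": 10, "bold": False, "y": 20}),
    ],
    edges=[Edge(0, 1, "above", {"distance": 6})],
)
print(config.neighbors(0, "above"))
print(select_attributes(config, ["font_size", "bold"]))
```

A classifier for a given class would then compare the selected attributes of an extracted configuration against that class's stored patterns; that comparison step is not shown here.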

Slide 27: Evolution of 2-CREM performance
[Plot: improvement of correct labeling as a function of the clicks used for correcting labels manually]

Slide 28: Conclusion
- Structure recognition of documents is still an open issue
- Solutions exist for specialized applications
- Generic approaches are not mature
  - models are hard to establish
  - training data is missing
- As an alternative
  - interactive systems
  - with incremental model adaptation