Interactive and Incremental Analysis of Document Images
JY Ramel et al.
Laboratoire d’Informatique de Tours, France
Document Image Analysis Systems: strategies and tools
LABRI Seminar
- Introduction
  - Context of the presented work
  - Let’s dive into the semantic gap…
- Characterization and representation of document images
  - Selection of low-level primitives
  - A graph representation for the layout or structure
- Analysis and recognition of the contents
  - Contextual and incremental analysis
  - Operators and scenarios
  - AGORA and RETRO prototypes
- Conclusion
Work carried out over a series of collaborations… thanks to all contributors!
Introduction: Context of the work
- Preservation of the cultural heritage
- The CESR in Tours: a training and research centre
  - Historians working on various domains of the Renaissance
  - A rich library of rare books (Loire Valley)
- An initial project: the Humanistic Virtual Library (BVH in French)
  - Collaboration with the RFAI research team
- A multidisciplinary collaboration: experts in DIA + experts in rare books + end users
- How to bridge the semantic gap?
  - A new idea: introduce more interaction into DIA systems
Introduction: Let’s dive into the semantic gap
- Which segmentation methods are able to extract such EoC (Elements of Content)?
- Data-driven methods? No: too much noise and too many parameters
- Model-driven methods? No: not much variability allowed in the document model
Introduction: Let’s dive into the semantic gap
- Low-level information (data driven)
  - Pixels, regions, contours, primitives
  - Low-level processing according to the specificities of the images (data)
- High-level entities (model driven)
  - Indexing = reading = a priori knowledge = model
  - Domain-specific processing
- Genericity
- Most of the time: no user / everything is encoded?
Introduction: Let’s dive into the semantic gap
- Bridging the gap with the help of the user, before the analysis?
  - Learning of the model or of the shapes to recognize
Introduction: Let’s dive into the semantic gap
- Bridging the gap with the help of the user, before or after the analysis?
  - Interactive construction of the processing sequence (Ariane / Pandore, GraphEdit / DirectShow)
  - A posteriori intervention
  - Error correction by relevance feedback
Introduction: Let’s dive into the semantic gap
- Bridging the gap with the help of the user during the analysis: our proposition
- Incremental analysis
  - Segmentation for recognition, recognition for segmentation
  - From the simplest to the most difficult
- Interactive analysis
  - User-driven method
  - Adaptation according to images
  - Adaptation according to user objectives
- It requires
  - An initial representation of the image content
  - A set of interoperable and compatible processing operators for segmentation and recognition
Part I: Characterization and representation of document images
- Information about the shapes using contour vectorization
- Information about the structure (layout) using a graph representation
- Towards a generic structural representation
- VectoGraph
Characterization and representation of document images: information about the shapes using contour vectorization
- Which primitives for describing shapes in a document?
- Binarization to extract contours (vectors, quadrilaterals) and connected components (CC)
Characterization and representation of document images: information about structure using graphs
- Idea: an evolving graph of EoC
- Two types of EoC
  - Primitives / elementary EoC: connected components, vectors, quadrilaterals
  - User-defined EoC: characters, words, ornamental letters, triangles, diodes, …
Characterization and representation of document images: information about structure using graphs, towards a generic representation with graphs
- Node = primitive or EoC
  - Type of EoC
  - Centre (X, Y) of the bounding box
  - Bounding box (BB)
  - Bounding rectangle BR: (P1, P2, P3, P4)
  - Orientation = inertia axis
  - Density of black and white pixels inside the BB
  - Color: average color
  - List of the elementary EoC
  - Number of elementary EoC
  - Confidence rate
- Edge = relation between EoC
  - Minimal distance between two EoC
  - Angle between EoC
  - Relation: Inside, Overlap, L, T, P, X, S, undefined
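The node and edge attributes listed on this slide can be sketched as a small attributed-graph data structure. This is only an illustration: the class names `EoCNode` and `EoCEdge`, the field names, and the `(x_min, y_min, x_max, y_max)` box convention are assumptions made for the example, not the representation used in the actual prototypes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Relation labels as listed on the slide.
RELATIONS = ("Inside", "Overlap", "L", "T", "P", "X", "S", "undefined")


@dataclass
class EoCNode:
    """One node of the graph: a primitive or a user-defined EoC (hypothetical layout)."""
    eoc_type: str                                   # e.g. "CC", "Text", "Ornamental letter"
    bbox: Tuple[int, int, int, int]                 # assumed (x_min, y_min, x_max, y_max)
    orientation: float = 0.0                        # inertia axis, in degrees (assumed unit)
    density: float = 0.0                            # ratio of black pixels inside the box
    avg_color: Tuple[int, int, int] = (0, 0, 0)     # average color
    children: List["EoCNode"] = field(default_factory=list)  # elementary EoC
    confidence: float = 1.0                         # confidence rate

    @property
    def centre(self) -> Tuple[float, float]:
        """Centre (X, Y) of the bounding box."""
        x0, y0, x1, y1 = self.bbox
        return ((x0 + x1) / 2, (y0 + y1) / 2)

    @property
    def n_elementary(self) -> int:
        """Number of elementary EoC grouped under this node."""
        return len(self.children)


@dataclass
class EoCEdge:
    """One edge of the graph: a relation between two EoC, referenced by index."""
    source: int
    target: int
    min_distance: float
    angle: float
    relation: str = "undefined"

    def __post_init__(self) -> None:
        if self.relation not in RELATIONS:
            raise ValueError(f"unknown relation: {self.relation}")


# Two connected components grouped into a user-defined "Word" EoC.
c1 = EoCNode("CC", (0, 0, 10, 20))
c2 = EoCNode("CC", (12, 0, 22, 20))
word = EoCNode("Word", (0, 0, 22, 20), children=[c1, c2])
edge = EoCEdge(source=0, target=1, min_distance=2.0, angle=0.0, relation="S")
```

Deriving the centre and the number of elementary EoC from the stored fields, rather than storing them separately, keeps a node consistent when EoC are merged later in the analysis.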
Initial representation for old documents: graph of EoCs + background map
- Background map
- Graph of the connected components
  - Tagging of the nodes according to their size: Noise, Text, Images
  - Edges between close shapes (CC)
  - Horizontal (H) / vertical (V) neighborhood
Characterization and representation of document images: information about structure using graphs, towards a generic structural representation
- Representation = structural graph (domain independent)
  - Document: pixels, points, …
  - Node (shapes & EoC): primitive or EoC, i.e. connected component, vector, quadrilateral
  - Edge (layout & relations): relation between primitives or EoCs, i.e. distance, angle, topology
- Analysis = domain dependent
Part II: Analysis and recognition of image contents
- Contextual and incremental analysis of old document images
- Three operators with simple parameters
- Interactive construction of scenarios
- Examples
- AGORA prototype
Analysis and recognition of image contents: strategy of analysis
- Proposition: user-driven analysis (scenario), incremental and interactive approach
- No predefined EoCs (model of document)
  - Users can define the required EoCs themselves
  - Interactive definition of the model of the document
  - Incremental analysis (from simple to difficult)
- Ease of use
  - Easy-to-use interfaces, user assistant
  - No complex image processing algorithms, just: tagging (extraction-recognition), merging, deletion
Analysis and recognition of image contents: three operators
- Tagging (extraction) of EoC (nodes) according to
  - rules about spatial position in the pages
  - rules about neighborhood relationships (using edges)
  - rules about internal properties (node attributes)
- Merging of EoC according to
  - rules using the distance computed from the background map (edge attributes)
  - on a specific type of EoC
- Deletion of EoC according to label and user decision
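The three operators can be sketched on a plain list of nodes as follows. This is a minimal illustration under stated assumptions: nodes are dictionaries with a `label` and a `bbox` given as `(x0, y0, x1, y1)`, the function names are hypothetical, and the gap between bounding boxes stands in for the distance that the slide says is computed from the background map.

```python
import math


def tag(nodes, label, rule):
    """Tagging: relabel every node for which rule(node) is true."""
    return [dict(n, label=label) if rule(n) else n for n in nodes]


def box_gap(a, b):
    """Euclidean gap between two boxes; 0 if they touch or overlap.
    Assumption: a stand-in for the background-map distance."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return math.hypot(dx, dy)


def merge(nodes, label, max_gap):
    """Merging: repeatedly fuse two nodes of the given label closer than max_gap."""
    nodes = list(nodes)
    changed = True
    while changed:
        changed = False
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                a, b = nodes[i], nodes[j]
                same_label = a["label"] == label == b["label"]
                if same_label and box_gap(a["bbox"], b["bbox"]) < max_gap:
                    ax, bx = a["bbox"], b["bbox"]
                    union = (min(ax[0], bx[0]), min(ax[1], bx[1]),
                             max(ax[2], bx[2]), max(ax[3], bx[3]))
                    nodes[i] = {"label": label, "bbox": union}
                    del nodes[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan after each fusion
    return nodes


def delete(nodes, label):
    """Deletion: drop every node carrying the given label."""
    return [n for n in nodes if n["label"] != label]


def area(node):
    x0, y0, x1, y1 = node["bbox"]
    return (x1 - x0) * (y1 - y0)


# Toy page: three character-sized components and one speck.
nodes = [
    {"label": "CC", "bbox": (0, 0, 10, 10)},
    {"label": "CC", "bbox": (12, 0, 20, 10)},
    {"label": "CC", "bbox": (100, 0, 110, 10)},
    {"label": "CC", "bbox": (50, 50, 51, 51)},
]
nodes = tag(nodes, "Noise", lambda n: area(n) < 10)
nodes = tag(nodes, "Text", lambda n: n["label"] == "CC")
nodes = merge(nodes, "Text", max_gap=5)
nodes = delete(nodes, "Noise")
```

The first two components are 2 units apart and fuse into one Text node, the third is too far away to be merged, and the speck is tagged as Noise and then deleted.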
Analysis and recognition of image contents: scenarios
- User-defined processing sequence
- Graph analysis and modification
- Defined by users on a typical image
- Depending on the user objectives and on the images
- Can be saved, edited and applied in batch mode
- …
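A scenario that can be saved, edited and applied in batch mode can be sketched as plain data: an ordered list of operator names with their parameters. The operator registry, the step format and the use of JSON below are assumptions chosen for illustration, not the file format of the actual tools.

```python
import json


def tag_by_size(nodes, label, min_size, max_size):
    """Relabel nodes whose size lies in [min_size, max_size)."""
    return [dict(n, label=label) if min_size <= n["size"] < max_size else n
            for n in nodes]


def delete_label(nodes, label):
    """Drop nodes carrying the given label."""
    return [n for n in nodes if n["label"] != label]


# Hypothetical registry mapping operator names to functions.
OPERATORS = {"tag_by_size": tag_by_size, "delete": delete_label}


def run_scenario(scenario, nodes):
    """Apply each step of the scenario, in order, to the node list."""
    for step in scenario:
        nodes = OPERATORS[step["op"]](nodes, **step["params"])
    return nodes


# A two-step scenario defined once, e.g. on a typical image.
scenario = [
    {"op": "tag_by_size",
     "params": {"label": "Noise", "min_size": 0, "max_size": 10}},
    {"op": "delete", "params": {"label": "Noise"}},
]

# Saving and reloading: the scenario is plain data, so it can also be edited.
saved = json.dumps(scenario, indent=2)
restored = json.loads(saved)

# Batch mode: the same scenario applied to several pages (toy data).
pages = {
    "page_1": [{"label": "CC", "size": 3}, {"label": "CC", "size": 40}],
    "page_2": [{"label": "CC", "size": 8}, {"label": "CC", "size": 9}],
}
results = {name: run_scenario(restored, nodes) for name, nodes in pages.items()}
```

Keeping thresholds inside the step parameters means a user can adjust a scenario for a new set of images by editing values, without touching the operators themselves.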
Analysis and recognition of image contents: examples
- Initial representation = primal EoC
Analysis and recognition of image contents: examples
- Tagging the primal EoC: Text, Graphic, Noise
Analysis and recognition of image contents: examples
- Merging of EoC = Text: Word, Line, Paragraph
Analysis and recognition of image contents: examples
- Automatic tagging: Lettrine (ornamental letter)
- Vertical position, horizontal position: avg = 0.46, std = 0.41; avg = 0.51, std = 0.07
With the collaboration of Nicholas…
Analysis and recognition of image contents: examples
- Error
- Automatic tagging with manual validation/modification of the rule
With the collaboration of Nicholas…
Analysis and recognition of image contents: examples
Example of an obtained scenario applicable to a set of images:
- Primal sketch construction
- Img type = connected component of size > …
- Noise type = connected component of size < 10
- Text type = connected component of size between 10 and …
- Horizontal and vertical fusion of Text with d < …
- Border type = Img with width/height ratio between 3 and 10
- Ornamental Letter type = Img close to nothing on the left and close to Text on the right
- Img type = Ornamental Letter with width/height ratio 1.2
- Img type = Ornamental Letter on the right < 75%
- Left Margin type = Text on the left with 25%
- Right Margin type = Text on the right with 25%
- Vertical fusion of the Left and Right Margins with d < …
- Horizontal fusion of the Text with d < …
- Pagination type = Text in top with 10%
- Text type = Pagination with a number of connected components > 3
- Signature type = Text in bottom with 25%
- Text type = Signature with a number of connected components > 5
- Text type = Signature with Text below, on the left or on the right
- Suppression of the EoC labelled Text
- Suppression of the EoC labelled Noise
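The first tagging rules of this scenario can be sketched as a single pass over the connected components. Several thresholds are missing from the slide, so the upper bound `text_max_size` is a hypothetical parameter supplied by the caller, and the example assumes that the Img rule starts where the Text rule ends. The meaning of "size" (taken here as a plain number attached to each component) is also an assumption.

```python
def tag_components(components, text_max_size):
    """Tag components as Noise, Text, Img or Border.

    components: list of dicts with 'size', 'width' and 'height' (all > 0).
    text_max_size: hypothetical threshold; its real value is not given on the slide.
    """
    tagged = []
    for c in components:
        if c["size"] < 10:                  # Noise: size < 10 (from the slide)
            label = "Noise"
        elif c["size"] < text_max_size:     # Text: between 10 and an assumed bound
            label = "Text"
        else:                               # Img: above the assumed bound
            label = "Img"
        # Border: Img with width/height ratio between 3 and 10 (from the slide).
        if label == "Img" and 3 <= c["width"] / c["height"] <= 10:
            label = "Border"
        tagged.append(dict(c, label=label))
    return tagged


components = [
    {"size": 4, "width": 2, "height": 2},          # speck
    {"size": 30, "width": 6, "height": 5},         # character-sized
    {"size": 5000, "width": 500, "height": 100},   # wide image, ratio 5
    {"size": 5000, "width": 100, "height": 100},   # square image, ratio 1
]
labels = [c["label"] for c in tag_components(components, text_max_size=1000)]
```

Later rules of the scenario (ornamental letters, margins, pagination, signature) depend on neighborhood and page position, which would need the graph edges and page dimensions and are left out of this sketch.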
Analysis and recognition of image contents: examples
- Results, with the labels: Margin, Title, Text, Legend, Lettrine (ornamental letter), Noise
Analysis and recognition of image contents: AGORA
- A user-driven approach [see IJDAR]
- Graph representation
- Simple operators
- Scenarios
- Some interfaces
- Used since 2004 at the CESR
- Always to be improved…
- Download:
Analysis and recognition of image contents: from AGORA to RETRO
Analysis and recognition of image contents: from AGORA to RETRO
- Contextual, manual and automatic transcription
- Example of a partial transcription: … La l[21]ngueu[7] du chevalet depuis s[21]n pied jusques au c[7][21]chet d’en haut, p[21]rte d[21]uze t[7][21]us …
- Accuracy, trigram frequency, dictionary
Analysis and recognition of image contents: from AGORA to RETRO
- Experiment on one complete book (Vésale)
  - Book of 150 pages; classes (clusters) have been built from its connected components (pseudo characters)
- 90% of these classes contain fewer than 10 occurrences
  - Ignoring these classes during transcription means missing one character in 14, i.e. more than one on each text line!
- 57% of the classes contain a single shape
  - Why? Noise and spots, touching characters, split characters
  - The same holds for words
- The 200 largest classes correspond to 85% of the text
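The kind of statistics reported on this slide can be computed from a list of cluster labels, one per connected component. The function name, the returned keys and the toy data below are assumptions for illustration; they do not reproduce the figures of the experiment.

```python
from collections import Counter


def class_statistics(labels, small=10, top_k=200):
    """Summarize cluster sizes.

    labels: one cluster label per connected component (pseudo character).
    small: classes with fewer occurrences than this count as small.
    top_k: number of largest classes used for the coverage figure.
    """
    sizes = sorted(Counter(labels).values(), reverse=True)
    total = sum(sizes)
    n_classes = len(sizes)
    small_occurrences = sum(s for s in sizes if s < small)
    return {
        "n_classes": n_classes,
        # Share of classes with fewer than `small` occurrences.
        "small_class_ratio": sum(1 for s in sizes if s < small) / n_classes,
        # Share of classes made of a single shape.
        "singleton_ratio": sum(1 for s in sizes if s == 1) / n_classes,
        # Share of all characters that fall in small classes.
        "small_occurrence_ratio": small_occurrences / total,
        # Share of all characters covered by the top_k largest classes.
        "top_k_coverage": sum(sizes[:top_k]) / total,
    }


# Toy data: one frequent class, one medium class, three singletons.
toy_labels = ["a"] * 12 + ["b"] * 5 + ["c", "d", "e"]
stats = class_statistics(toy_labels, small=10, top_k=1)
```

On the toy data, 4 of the 5 classes are small yet they hold only 8 of the 20 characters, which mirrors the slide's observation that many small classes can still account for a noticeable share of the text.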
Conclusion
- Proposition of a global approach: from images to their interpretation
- Modelling of the data: representation of image content
  - Genericity: thin and filled shapes, lines and curves, shapes and structure or layout
  - Contour vectorization + relationship analysis
  - Use of attributed graphs
- Modelling of the processing: operators, recognition
  - Use of contextual information during EoC extraction and recognition
  - Early involvement of the user in the processing sequence: user-driven analysis
- Proposition of new structural PR techniques
Thanks. Questions?
Analysis and recognition of image contents: scenarios
- Document: image, pixels
- Representation = graph of EoC (symbol, …, title, ornamental letter)
- User-defined scenarios = succession of operators + thresholds (Scenario 1, Scenario 2, Scenario 3)