NLP&CC 2012
Integration of Text Information and Graphic Composite for PDF Document Analysis
Presenter: Xu Canhui (许灿辉)
Affiliation: Institute of Computer Science and Technology, Peking University
November 4, 2012
Outline
1. Background
2. Integration of text information and graphic composite
3. Experimental results and discussion
1.1 Background DAR
Document Layout Analysis: to extract the physical structure. The detection and labeling of the different zones (or blocks) embedded in a document, such as text body, illustrations, math symbols and tables, is called geometric layout analysis.
1.1 Background DAR
Document Layout Understanding: to obtain the logical structure. Text zones play different logical roles inside the document (titles, captions, footnotes, etc.), and this kind of semantic labeling is the scope of logical layout analysis. The logical structure includes logical attributes, hierarchical relations and logical label association.
1.1 Background
Image-document-based layout analysis and understanding
[Figure: Image doc]
1.1 Background
Digitized-document-based layout analysis and understanding
[Figure: Digitized doc]
1.1 Background
Mostly solved problems for document analysis in PDF format:
- Text line and text block segmentation
- Table detection
- Formula detection
- Core detection
- Header and footer detection
- List recognition
- Paragraph recognition
1.1 Background
Unsolved open problems for document understanding in PDF format:
- Graphic recognition
- Table recognition
- Formula recognition
- ToC Reference detection
- …
2.1 Preprocessing Hierarchical
For each document page, three files are provided:
- A physical XML description of the page elements and their attributes, including text elements, image elements and path operations, each with a unique ID.
- A .png image with a resolution of 300 dpi. This synthetic image is rendered from the selected page.
- A labeled ground-truth file. It contains information for performance evaluation, such as bounding boxes and element IDs.
2.1 Preprocessing Hierarchical
A multi-layer concept incorporating both structural representations and image-based analysis is proposed for segmentation. The page images are divided into a text layer and a non-text layer.
- Text layer analysis: clustering the text elements according to proximity and feature similarity.
- Non-text layer analysis: connected-component-based graphic object segmentation.
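As a rough illustration of the layer split, the sketch below parses a page description and separates text elements from everything else. The slides do not give the XML schema, so the tag names (`text`, `image`, `path`), the attributes (`id`, `x`, `y`, `w`, `h`) and the function `split_layers` are all hypothetical, and splitting purely by element type is an assumption about how the two layers are formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical page description; the real schema is not given in the slides.
PAGE_XML = """
<page id="p1">
  <text id="t1" x="72" y="90" w="200" h="12">Introduction</text>
  <image id="i1" x="72" y="120" w="300" h="180"/>
  <path id="g1" x="72" y="320" w="300" h="1"/>
  <text id="t2" x="72" y="340" w="180" h="12">Figure 1</text>
</page>
"""


def split_layers(xml_text):
    """Return (text_layer, non_text_layer) as lists of element records."""
    root = ET.fromstring(xml_text)
    text_layer, non_text_layer = [], []
    for elem in root:
        record = {
            "id": elem.get("id"),
            "kind": elem.tag,
            # Bounding box as (x, y, width, height) in page units.
            "box": tuple(float(elem.get(k)) for k in ("x", "y", "w", "h")),
        }
        # Assumption: anything that is not a text element goes to the non-text layer.
        if elem.tag == "text":
            text_layer.append(record)
        else:
            non_text_layer.append(record)
    return text_layer, non_text_layer


text_layer, non_text_layer = split_layers(PAGE_XML)
```

Keeping the element IDs in each record makes it possible to compare a later segmentation against the ground-truth file, which is keyed by bounding boxes and element IDs.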
2.1 Preprocessing Hierarchical
[Figure panels: marginal pictorial decorations; decorative lines; photographic images; drawings integrated with text]
2.2 Non-text layer analysis CC
Connected component detection is considered from a visual perspective. The spatial arrangement of intensities, described by image texture features, is applied for graphic component segmentation.
Gray level co-occurrence matrix: the value of $P(i, j)$ indicates the frequency with which value $i$ co-occurs with value $j$ in a pre-defined spatial relationship.
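A minimal sketch of a gray level co-occurrence matrix may help here. It assumes the "pre-defined spatial relationship" is a fixed pixel offset `(dx, dy)` and that the matrix is normalised to relative frequencies; the slides specify neither, and the function name `glcm` is illustrative.

```python
from collections import Counter


def glcm(image, dx=1, dy=0, levels=None):
    """Normalised gray level co-occurrence matrix for the offset (dx, dy).

    image is a list of rows of integer gray levels in [0, levels).
    Entry P[i][j] is the relative frequency with which a pixel of value i
    has a pixel of value j at the given offset.
    """
    height, width = len(image), len(image[0])
    if levels is None:
        levels = max(max(row) for row in image) + 1
    counts = Counter()
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                counts[(image[y][x], image[ny][nx])] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("offset leaves no pixel pairs inside the image")
    return [[counts[(i, j)] / total for j in range(levels)] for i in range(levels)]


# Toy 3x3 patch with two gray levels; pairs are taken with the right-hand neighbour.
patch = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]
P = glcm(patch)
```

For this patch the six horizontal pairs are one (0, 0), two (0, 1) and three (1, 1), so `P` is `[[1/6, 2/6], [0, 3/6]]`.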
2.2 Non-text layer analysis CC
Local texture entropy: the entropy is highest when all entries in the co-occurrence matrix $P$ are equal.
Morphological filtering: a morphological filter consisting of conditional dilation is utilized to fill the holes.
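The two steps on this slide can be sketched as follows. The entropy is assumed to be the Shannon entropy of the normalised co-occurrence matrix, and hole filling is assumed to be the usual reconstruction by conditional dilation: background is grown inward from the image border, restricted to background pixels, and whatever it cannot reach is a hole. The function names and the 4-connected neighbourhood are illustrative choices, not taken from the slides.

```python
import math


def glcm_entropy(P):
    """Shannon entropy (in bits) of a normalised matrix; zero entries are skipped."""
    return -sum(p * math.log2(p) for row in P for p in row if p > 0)


def fill_holes(binary):
    """Fill holes in a 0/1 image by dilating a border marker inside the background."""
    h, w = len(binary), len(binary[0])
    # Mask: the background pixels, which bound how far the marker may grow.
    mask = [[binary[y][x] == 0 for x in range(w)] for y in range(h)]
    # Marker: background pixels lying on the image border.
    marker = [[mask[y][x] and (y in (0, h - 1) or x in (0, w - 1))
               for x in range(w)] for y in range(h)]
    changed = True
    while changed:  # repeat the conditional dilation until nothing changes
        changed = False
        for y in range(h):
            for x in range(w):
                if marker[y][x] or not mask[y][x]:
                    continue
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and marker[ny][nx]:
                        marker[y][x] = True
                        changed = True
                        break
    # Pixels never reached from the border are foreground or enclosed holes.
    return [[0 if marker[y][x] else 1 for x in range(w)] for y in range(h)]


uniform = [[0.25, 0.25], [0.25, 0.25]]
peaked = [[1.0, 0.0], [0.0, 0.0]]
ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
filled = fill_holes(ring)
```

The uniform matrix gives the largest entropy a 2x2 matrix can have (2 bits), while the single-entry matrix gives zero, which matches the statement that entropy peaks when all entries are equal.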
2.2 Non-text layer analysis CC
Aiming at the reflowable reconstruction of PDF documents, graphical component segmentation seeks the rectangular bounding box of a holistic graphic composite, which is mostly depicted by path operations in PDFs, rather than the fine edge boundaries of the graphic's detailed contents. The outer bounding box of a graphic object is then identified on the specific connected component.
At this point, the non-textual .png image $I$ is segmented into $N$ partitions $R_1, \dots, R_N$. Each subregion $R_i$, $i = 1, \dots, N$, is a connected component. In most cases, a whole graphic figure consists of multiple connected components. Further merging and splitting processes are applied to group CCs into the desired regions, based on predefined criteria that are closely related to the inter-text-line spacing.
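The bounding-box and merging steps might look roughly like the sketch below. It labels 8-connected components with a breadth-first search, records each component's bounding box, and then merges boxes that lie within `max_gap` pixels of each other. The threshold `max_gap` stands in for a criterion derived from the inter-text-line spacing; the actual criteria are not given in the slides, and the splitting step is not shown.

```python
from collections import deque


def component_boxes(binary):
    """Bounding boxes (x0, y0, x1, y1) of 8-connected foreground components."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if not binary[sy][sx] or seen[sy][sx]:
                continue
            seen[sy][sx] = True
            queue = deque([(sy, sx)])
            x0 = x1 = sx
            y0 = y1 = sy
            while queue:
                y, x = queue.popleft()
                x0, x1 = min(x0, x), max(x1, x)
                y0, y1 = min(y0, y), max(y1, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((x0, y0, x1, y1))
    return boxes


def box_gap(a, b):
    """Larger of the horizontal and vertical coordinate gaps (0 if overlapping)."""
    gap_x = max(a[0] - b[2], b[0] - a[2], 0)
    gap_y = max(a[1] - b[3], b[1] - a[3], 0)
    return max(gap_x, gap_y)


def merge_boxes(boxes, max_gap):
    """Merge boxes pairwise until no two boxes are within max_gap of each other."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if box_gap(boxes[i], boxes[j]) <= max_gap:
                    a, b = boxes[i], boxes[j]
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes


page = [
    [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 0, 0, 1],
]
boxes = component_boxes(page)
regions = merge_boxes(boxes, max_gap=3)
```

In the toy page the first two components are 3 pixels apart and are merged into one region, while the isolated pixel on the right stays separate.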
2.3 Text layer analysis Graph based
2.3 Text layer analysis Graph based
Construct a graph $G = (V, E)$ with edge weights $w(v_i, v_j)$.
Elements in a component should be similar, so edges between two vertices in the same component should have relatively low weights.
Elements in different components should be dissimilar, so edges between vertices in different components should have higher weights.
2.3 Text layer analysis Graph based
$C \subseteq V$ is a component of $G$.
The internal difference: [equation missing in source]
Difference between two components: [equation missing in source]
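The terms used on this slide match the graph-based segmentation criterion of Felzenszwalb and Huttenlocher. Assuming the slide follows that formulation, the two quantities would read as below; the symbols $C$, $MST(C, E)$ (the minimum spanning tree of component $C$) and $w$ (the edge weight) come from that formulation and are not confirmed by the slide.

```latex
% Internal difference: the largest edge weight in the minimum spanning tree of C.
Int(C) = \max_{e \in MST(C, E)} w(e)

% Difference between two components: the smallest weight of an edge joining them.
Dif(C_1, C_2) = \min_{v_i \in C_1,\; v_j \in C_2,\; (v_i, v_j) \in E} w(v_i, v_j)
```

If no edge joins $C_1$ and $C_2$, that formulation sets $Dif(C_1, C_2) = \infty$.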
2.3 Text layer analysis Graph based
Region comparison predicate: [equation missing in source]
Maximum internal difference: [equation missing in source]
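A small sketch of how such a predicate can drive the clustering is given below. It follows the standard Felzenszwalb and Huttenlocher formulation, in which two components are merged when the lightest edge between them is no heavier than $\min(Int(C_1) + k/|C_1|,\ Int(C_2) + k/|C_2|)$. Note that the slide labels its quantity "maximum internal difference", whereas this standard form takes a minimum, so the author's exact variant may differ. The constant `k`, the function name `segment_graph` and the use of horizontal gaps as edge weights are assumptions for illustration.

```python
def segment_graph(num_vertices, edges, k):
    """Cluster vertices; edges is an iterable of (weight, u, v) tuples.

    Returns one label per vertex; vertices with equal labels share a component.
    """
    parent = list(range(num_vertices))
    size = [1] * num_vertices
    internal = [0.0] * num_vertices  # Int(C), stored at each component's root

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # Process edges in non-decreasing weight order, as in Kruskal's algorithm.
    for weight, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue
        threshold = min(internal[ru] + k / size[ru],
                        internal[rv] + k / size[rv])
        if weight <= threshold:
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru
            size[ru] += size[rv]
            # Edges arrive sorted, so this edge is the heaviest in the merged MST.
            internal[ru] = weight
    return [find(v) for v in range(num_vertices)]


# Five text elements on one line at x = 0, 1, 2, 10, 11;
# each edge weight is the horizontal gap between neighbouring elements.
edges = [(1, 0, 1), (1, 1, 2), (8, 2, 3), (1, 3, 4)]
labels = segment_graph(5, edges, k=2)
```

With `k=2` the three closely spaced elements form one component and the two elements after the wide gap form another; a larger `k` favours larger components.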
3.1 Experimental results Text
3.1 Experimental results Text
3.1 Experimental results Segmentation
3.2 Discussion Overlap