1 ITEC810 Final Report: Inferring Document Structure
Wieyen Lin (41348133)
Supervised by Jette Viethen
2 Outline
Part A
  Introduction
  Related work
Part B
  Material
  Methodology
Part C
  Implementation
  Conclusion
3 Part A: Introduction
4 Introduction
5 Introduction (cont’d)
Research Objective
  Analyze a document image and detect its logical structure with annotated labels
Project Scope
  Focus: academic articles
  Source corpus: Association for Computational Linguistics (ACL) Anthology Corpus
6 Related Work
Physical Layout Analysis
  Top-down methods
  Bottom-up methods
Logical Structure Analysis
  Syntactic methods
  Rule-based methods
7 Part B: Methodology
8 Material: XML Source by Text
An example of an input file for the project
9 Methodology
Phase I: Aggregation of Homogeneous Blocks (Physical Structure)
  XML source by text
  1a. Grouping texts into lines
  XML source by line
  1b. Aggregating lines into blocks
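Step 1a (grouping texts into lines) can be illustrated with a minimal sketch. The slides do not show the project's XML schema or its grouping rule, so everything here is an assumption: the `TextFragment` fields, the `group_into_lines` helper, and the `y_tolerance` threshold are hypothetical names chosen for illustration. The sketch treats fragments whose top edges lie within a small vertical tolerance as one line.

```python
from dataclasses import dataclass


@dataclass
class TextFragment:
    """Hypothetical text fragment; field names are assumptions, not the project's schema."""
    text: str
    x: float          # left edge of the fragment
    y: float          # top edge of the fragment (grows downward)
    font_size: float


def group_into_lines(fragments, y_tolerance=2.0):
    """Group fragments into lines by comparing top edges (assumed rule)."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        # Compare against the first fragment of the current line.
        if lines and abs(frag.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, order fragments from left to right.
    return [sorted(line, key=lambda f: f.x) for line in lines]


def line_text(line):
    """Join the fragments of one line into a single string."""
    return " ".join(f.text for f in line)


fragments = [
    TextFragment("Structure", 120, 100.5, 14),
    TextFragment("Inferring", 10, 100, 14),
    TextFragment("Document", 60, 101, 14),
    TextFragment("Abstract", 10, 130, 10),
]
lines = group_into_lines(fragments)
print([line_text(line) for line in lines])
```

A fixed tolerance is the simplest choice; a tolerance proportional to the font size would likely cope better with mixed text sizes.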
10 Methodology (cont’d)
Phase II: Detection of Logical Structure
  1b. Aggregating lines into blocks
  XML source by block
  2. Annotating each block with a logical label
  Logical Structure
11 Methodology (cont’d)
Algorithm for aggregating blocks in Phase II
[Flowchart: read in 3 lines at a time, check the dominant font size, then check the spacing]
Font-size patterns: A1 A2 A3; A A B; A B B; A1 B A2; A B C
Spacing check for A1 A2 A3:
  s1 = s2: A1, A2 and A3 belong to the same block
  s1 > s2: A1 is separate; A2 and A3 belong to the same block
  s1 < s2: A1 and A2 belong to the same block; A3 is separate
Legend:
  A, B, C: lines of text with different dominant font sizes
  A1, A2: lines of text with the same dominant font size
  s1: spacing between A1 and A2
  s2: spacing between A2 and A3
  Boxed A: belongs to the same block
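The three-line comparison on this slide can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the `Line` record, the `group_triple` function, and the tolerance `tol` are hypothetical, and the handling of the mixed-font patterns (A A B, A B B, A1 B A2, A B C) is assumed to group only adjacent lines that share a dominant font size.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical line record; the field names are assumptions."""
    text: str
    top: float        # y of the top edge (grows downward)
    bottom: float     # y of the bottom edge
    font_size: float  # dominant font size of the line


def group_triple(l1, l2, l3, tol=0.5):
    """Split three consecutive lines into blocks by font size, then by spacing."""
    same12 = abs(l1.font_size - l2.font_size) <= tol
    same23 = abs(l2.font_size - l3.font_size) <= tol
    if same12 and same23:             # pattern A1 A2 A3
        s1 = l2.top - l1.bottom       # spacing between A1 and A2
        s2 = l3.top - l2.bottom       # spacing between A2 and A3
        if abs(s1 - s2) <= tol:       # s1 = s2, within an assumed tolerance
            return [[l1, l2, l3]]
        if s1 > s2:
            return [[l1], [l2, l3]]
        return [[l1, l2], [l3]]       # s1 < s2
    if same12:                        # pattern A A B (assumed outcome)
        return [[l1, l2], [l3]]
    if same23:                        # pattern A B B (assumed outcome)
        return [[l1], [l2, l3]]
    return [[l1], [l2], [l3]]         # patterns A1 B A2 and A B C (assumed outcome)


title = Line("Inferring Document Structure", 0, 14, 14)
body1 = Line("First body line", 20, 30, 10)
body2 = Line("Second body line", 32, 42, 10)
blocks = group_triple(title, body1, body2)
print([[line.text for line in block] for block in blocks])
```

Exact equality of spacings rarely holds for extracted coordinates, which is why the sketch compares s1 and s2 with a tolerance. Sliding this decision over a whole page would also need a rule for merging overlapping triples, which the slide does not specify.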
12 Part C: Outcomes
13 Current Outcome
Original PDF document
Physical layout outcome in HTML
14 Current Outcome (cont’d)
Logical structure outcome in HTML
15 Implementation: Class Diagram
16 Implementation: User Interfaces
17 Conclusion: Information Evaluation
Summary of detection results out of 40 randomly selected documents

Error Type                                                             Error Found   Accuracy of Detection
Incorrect title or missing title                                       1             97.5% (39/40)
Incorrect Abstract heading or missing Abstract heading                 4             90.0% (36/40)
Incorrect Abstract or missing Abstract                                 4             90.0% (36/40)
Incorrect Affiliation(s) or missing Affiliation(s)                     11            72.5% (29/40)
Missing >50% of Page number(s) or erroneous Page number(s) found       15            62.5% (25/40)
Missing >50% of Section heading(s) or erroneous Section heading(s) found   11        72.5% (29/40)
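The accuracy figures in this summary follow from the error counts: each accuracy equals the number of documents without the error divided by the 40 sampled documents. The short check below reproduces that arithmetic. The dictionary keys are shortened labels chosen here for readability, and the `accuracy` helper is a hypothetical name.

```python
SAMPLE_SIZE = 40  # documents in the random sample, as stated on the slide

# Error counts copied from the summary; keys are shortened labels.
errors_found = {
    "Title": 1,
    "Abstract heading": 4,
    "Abstract": 4,
    "Affiliation(s)": 11,
    "Page number(s)": 15,
    "Section heading(s)": 11,
}


def accuracy(errors, total=SAMPLE_SIZE):
    """Percentage of sampled documents in which the error was not found."""
    return 100.0 * (total - errors) / total


for label, errors in errors_found.items():
    correct = SAMPLE_SIZE - errors
    print(f"{label}: {accuracy(errors):.1f}% ({correct}/{SAMPLE_SIZE})")
```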
18 Conclusion: Future Work
Improving Algorithms
  Aggregation of homogeneous blocks
  Detection of Abstract heading, Section heading, and Paragraph
Removing Noise
  Incomplete table contents
  Incomplete mathematical formulas