1
Aletheia
Apostolos Antonacopoulos
PRImA Lab, The University of Salford, United Kingdom
www.primaresearch.org
2
Outline
  – PRImA work overview
  – Aletheia
  – Performance evaluation
  – Demo
3
Digitisation Workflow
Main steps:
① Scanning
② Image enhancement
  – Page splitting
  – Border removal
  – Page curl removal
  – Dewarping
③ Layout analysis
  – Segmentation of regions, lines, words and characters
  – Region classification
  – Logical layout analysis
④ OCR
⑤ Post-processing
4
Aletheia
Ground Truthing
  – Page border marking
  – Layout regions (incl. logical layout)
  – Text lines
  – Words
  – Glyphs
  – Text insertion at all levels of segmentation
5
Ground Truthing Historical Documents
Complex Reading Order
  – Groups of ordered and unordered objects
Full Unicode Support (incl. special characters for historical documents)
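The idea of a reading order built from groups of ordered and unordered objects can be pictured as a small tree. The following is a minimal Python sketch under stated assumptions: the `Group` class, the helper functions and the region ids are hypothetical names chosen for illustration and are not taken from Aletheia or from the PAGE format.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple, Union

# Hypothetical structure for illustration only.
@dataclass
class Group:
    ordered: bool  # True: members must be read in the stored sequence
    members: List[Union["Group", str]] = field(default_factory=list)

Node = Union[Group, str]  # a leaf is simply a region id


def leaves(node: Node) -> List[str]:
    """Collect all region ids below a node."""
    if isinstance(node, str):
        return [node]
    return [leaf for m in node.members for leaf in leaves(m)]


def precedence(node: Node) -> Set[Tuple[str, str]]:
    """Return every (a, b) pair where region a must be read before region b.

    Only ordered groups contribute pairs; members of an unordered group
    stay unconstrained relative to each other.
    """
    if isinstance(node, str):
        return set()
    pairs: Set[Tuple[str, str]] = set()
    for m in node.members:
        pairs |= precedence(m)
    if node.ordered:
        for i, earlier in enumerate(node.members):
            for later in node.members[i + 1:]:
                pairs |= {(a, b) for a in leaves(earlier) for b in leaves(later)}
    return pairs


# A page with a heading, two adverts in no particular order, then two columns.
page = Group(True, [
    "heading",
    Group(False, ["advert1", "advert2"]),
    Group(True, ["column1", "column2"]),
])
constraints = precedence(page)
```

In this sketch the two adverts carry no constraint between them, while both must still follow the heading and precede the columns, which is one way to read "groups of ordered and unordered objects".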
6
Ground Truth – Image Enhancement
7
Ground Truth – Segmentation
8
The IMPACT Dataset
A comprehensive dataset of historical document images is being created as part of the IMPACT project.
  – Reflects collections and digitisation programmes of 14 Content Holders (most national and major libraries in Europe)
  – 700,000 images with basic metadata
  – Printed documents in 17 languages, 11 scripts
  – From the 17th to the early 20th century
  – 32,000 pages ground-truthed (down to region outlines and full text in Unicode); over 50,000 expected in December
  – Available very soon via the IMPACT Centre of Competence
9
Performance Evaluation Overview
  – Evaluation Tools
  – Image Repository
  – Evaluation Results
Compatibility through one common format (PAGE)
10
The PAGE Format Framework
Page Analysis and Ground-truth Elements (PAGE)
Two-level architecture:
  – Root structure
  – Task-specific sub-formats (GTS objects)
Separate XML Schema definitions
Currently supported GTS formats: Deskewing, Dewarping, Binarisation, Border Removal, Cropping and Page Content
Processing results or ground truth (e.g. binarisation, dewarping, page content)
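To make the two-level architecture more concrete, here is a minimal Python sketch of a shared root structure wrapping one task-specific payload. Every element and attribute name used (`Document`, `Metadata`, `Payload`, `PageContent`, `Region`, `Outline`, `Binarisation`) is a placeholder chosen for illustration and is not taken from the PAGE schemas; the actual vocabulary is defined by the separate XML Schema definitions.

```python
import xml.etree.ElementTree as ET


def build_document(task: str, payload: ET.Element, creator: str) -> ET.Element:
    """Wrap a task-specific payload in a root structure shared by all tasks."""
    root = ET.Element("Document")                 # level 1: common root
    meta = ET.SubElement(root, "Metadata")
    ET.SubElement(meta, "Creator").text = creator
    body = ET.SubElement(root, "Payload", {"task": task})
    body.append(payload)                          # level 2: task-specific part
    return root


def page_content_payload() -> ET.Element:
    """Hypothetical page-content sub-format: one text region with an outline."""
    page = ET.Element("PageContent")
    region = ET.SubElement(page, "Region", {"id": "r1", "type": "text"})
    ET.SubElement(region, "Outline", {"points": "10,10 200,10 200,80 10,80"})
    return page


def binarisation_payload() -> ET.Element:
    """Hypothetical binarisation sub-format: a reference to a bitonal image."""
    return ET.Element("Binarisation", {"image": "page_bin.png"})


# The same root wrapper carries two different task-specific payloads.
doc_a = build_document("page-content", page_content_payload(), "example")
doc_b = build_document("binarisation", binarisation_payload(), "example")
xml_text = ET.tostring(doc_a, encoding="unicode")
```

The point of the sketch is only the separation of concerns: tools that understand the root structure can handle any document, while each sub-format can evolve under its own schema.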
11
Evaluation Tools
  – Segmentation and layout
  – OCR text
12
Evaluation Metrics and Scenarios
Metrics
  – Measurements of conditions (types of errors)
Scenarios
  – Expression of metrics in application context
  – Combinations of weighted metrics
Overall score combines individual weighted scores according to
  – Type & size of region
  – Neighbourhood of errors
Horizontal mergers & splits in text regions, maintaining reading order, attract small penalties
Vertical mergers & splits (e.g. merged columns) will attract higher penalties
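A scenario as a combination of weighted metrics might be sketched as follows in Python. This is an illustration under stated assumptions, not the evaluation tools' actual formula: the `SegError` class, the `SCENARIO` table and every weight value are hypothetical, the merge or split direction stands in for "maintains reading order or not", and the neighbourhood of errors is not modelled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SegError:
    kind: str                 # e.g. "merge", "split", "miss"
    region_type: str          # e.g. "text", "image"
    area: float               # affected area in pixels
    direction: str = "none"   # "horizontal" or "vertical" for merges/splits


# Hypothetical scenario: all weights below are illustrative assumptions.
SCENARIO: Dict[str, Dict[str, float]] = {
    "kind": {"merge": 1.0, "split": 1.0, "miss": 1.0},
    "region_type": {"text": 1.0, "image": 0.5},
    # Horizontal errors that keep the reading order are penalised lightly,
    # vertical ones (e.g. merged columns) heavily.
    "direction": {"horizontal": 0.1, "vertical": 1.0, "none": 1.0},
}


def penalty(err: SegError, scenario: Dict[str, Dict[str, float]] = SCENARIO) -> float:
    """Weight one error by its kind, region type and direction, scaled by area."""
    weight = (
        scenario["kind"].get(err.kind, 1.0)
        * scenario["region_type"].get(err.region_type, 1.0)
        * scenario["direction"].get(err.direction, 1.0)
    )
    return weight * err.area


def overall_score(errors: List[SegError], page_area: float,
                  scenario: Dict[str, Dict[str, float]] = SCENARIO) -> float:
    """Combine weighted penalties into a score in [0, 1]; 1 means no errors."""
    total = sum(penalty(e, scenario) for e in errors)
    return max(0.0, 1.0 - total / page_area)


h_merge = SegError("merge", "text", 5000.0, "horizontal")
v_merge = SegError("merge", "text", 5000.0, "vertical")
```

Swapping in a different weight table changes the application context without touching the underlying error measurements, which is the separation between metrics and scenarios described on the slide.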
13
Further Information
PRImA: http://www.primaresearch.org
IMPACT: http://www.impact-project.eu