Aletheia Apostolos Antonacopoulos PRImA Lab, The University of Salford, United Kingdom www.primaresearch.org.

Presentation transcript:

Aletheia Apostolos Antonacopoulos PRImA Lab, The University of Salford, United Kingdom

Outline
• PRImA work overview
• Aletheia
• Performance evaluation
• Demo

Digitisation Workflow
Main steps:
① Scanning
② Image enhancement
   – Page splitting
   – Border removal
   – Page curl removal
   – Dewarping
③ Layout analysis
   – Segmentation of regions, lines, words and characters
   – Region classification
   – Logical layout analysis
④ OCR
⑤ Post-processing

Aletheia: Ground Truthing
• Page border marking
• Layout regions (incl. logical layout)
• Text lines
• Words
• Glyphs
• Text insertion at all levels of segmentation
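The segmentation levels listed on this slide nest inside one another, and text can be attached at any of them. A minimal sketch of such a hierarchy follows. The class name `Element`, the level names and the `count_levels` helper are illustrative assumptions, not Aletheia's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Nesting order taken from the slide; the identifiers themselves are hypothetical.
LEVELS = ("region", "text_line", "word", "glyph")


@dataclass
class Element:
    level: str                       # one of LEVELS
    outline: List[Tuple[int, int]]   # polygon as (x, y) points
    text: str = ""                   # text may be entered at any level
    children: List["Element"] = field(default_factory=list)

    def add(self, child: "Element") -> "Element":
        """Attach a child, allowing only the next finer level."""
        next_index = LEVELS.index(self.level) + 1
        if next_index >= len(LEVELS) or child.level != LEVELS[next_index]:
            raise ValueError(f"{child.level} cannot be nested in {self.level}")
        self.children.append(child)
        return child


def count_levels(element: Element,
                 counts: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Count how many elements exist at each level below (and including) element."""
    counts = {} if counts is None else counts
    counts[element.level] = counts.get(element.level, 0) + 1
    for child in element.children:
        count_levels(child, counts)
    return counts


box = [(0, 0), (10, 0), (10, 10), (0, 10)]  # placeholder outline
region = Element("region", box, text="In the")
line = region.add(Element("text_line", box, text="In the"))
word = line.add(Element("word", box, text="In"))
word.add(Element("glyph", box, text="I"))
word.add(Element("glyph", box, text="n"))
line.add(Element("word", box, text="the"))
```

Calling `count_levels(region)` on this toy page reports one region, one text line, two words and two glyphs.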

Ground Truthing Historical Documents
• Complex reading order
   – Groups of ordered and unordered objects
• Full Unicode support (incl. special characters for historical documents)
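Reading order built from groups of ordered and unordered objects can be pictured as a tree. The sketch below is an illustration only: the `Group` class, the region names and both helper functions are assumptions made for this example, not part of Aletheia.

```python
from dataclasses import dataclass, field
from math import factorial
from typing import List, Union


@dataclass
class Group:
    ordered: bool  # True: members are read in sequence; False: any order is acceptable
    members: List[Union["Group", str]] = field(default_factory=list)


def linearise(node: Union[Group, str]) -> List[str]:
    """Return one valid reading sequence (members in stored order)."""
    if isinstance(node, str):
        return [node]
    sequence: List[str] = []
    for member in node.members:
        sequence.extend(linearise(member))
    return sequence


def count_orderings(node: Union[Group, str]) -> int:
    """Count the reading sequences the tree accepts as correct."""
    if isinstance(node, str):
        return 1
    total = 1
    for member in node.members:
        total *= count_orderings(member)
    if not node.ordered:
        # An unordered group accepts every permutation of its members.
        total *= factorial(len(node.members))
    return total


# Hypothetical newspaper page: a heading, two adverts in no particular
# order, then two columns that must be read left to right.
page = Group(ordered=True, members=[
    "heading",
    Group(ordered=False, members=["advert_1", "advert_2"]),
    Group(ordered=True, members=["column_1", "column_2"]),
])
```

For this page, `count_orderings(page)` is 2, because only the two adverts may swap places.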

Ground Truth – Image Enhancement

Ground Truth – Segmentation

The IMPACT Dataset
• A comprehensive dataset of historical document images is being created as part of the IMPACT project
   – Reflects collections and digitisation programmes of 14 Content Holders (most national and major libraries in Europe)
   – 700,000 images with basic metadata
   – Printed documents in 17 languages, 11 scripts
   – From the 17th to the early 20th century
   – 32,000 pages ground-truthed (down to region outlines and full text in Unicode); will have over 50,000 in December
   – Available very soon via the IMPACT Centre of Competence

Performance Evaluation Overview
• Evaluation Tools
• Image Repository
• Evaluation Results
• Compatibility through one common format (PAGE)

The PAGE Format Framework
• Page Analysis and Ground-truth Elements (PAGE)
• Two-level architecture:
   – root structure
   – task-specific sub-formats (GTS objects)
• Separate XML Schema definitions
• Currently supported GTS formats: Deskewing, Dewarping, Binarisation, Border Removal, Cropping and Page Content
• Processing results or ground truth (e.g. binarisation, dewarping, page content)
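To make the page-content sub-format more concrete, here is a minimal sketch that reads region outlines and text from a small PAGE-style document. The sample XML is invented for this example. The element names (`PcGts`, `Page`, `TextRegion`, `Coords`, `TextEquiv`, `Unicode`), the `points` attribute and the namespace reflect my understanding of one published page-content schema version and should be checked against the XSD actually in use. The parser matches on local names so that it does not depend on a particular schema version.

```python
import xml.etree.ElementTree as ET

# Invented sample; element names and namespace are assumptions to verify
# against the PAGE schema version you work with.
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="sample.tif" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1" type="heading">
      <Coords points="100,100 900,100 900,200 100,200"/>
      <TextEquiv><Unicode>Aletheia</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2" type="paragraph">
      <Coords points="100,250 900,250 900,1400 100,1400"/>
      <TextEquiv><Unicode>Ground truthing</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>
""".strip()


def local(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_regions(xml_text: str) -> list:
    """Return id, type, outline points and text for every TextRegion."""
    root = ET.fromstring(xml_text)
    regions = []
    for element in root.iter():
        if local(element.tag) != "TextRegion":
            continue
        points, text = [], ""
        for child in element:
            name = local(child.tag)
            if name == "Coords":
                # "x,y x,y ..." becomes [(x, y), ...]
                points = [tuple(int(v) for v in pair.split(","))
                          for pair in child.get("points", "").split()]
            elif name == "TextEquiv":
                for sub in child:
                    if local(sub.tag) == "Unicode":
                        text = sub.text or ""
        regions.append({"id": element.get("id"), "type": element.get("type"),
                        "points": points, "text": text})
    return regions


regions = read_regions(SAMPLE)
```

Each entry in `regions` holds one outline polygon with its transcription, which is the kind of information a segmentation evaluation compares against ground truth.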

Evaluation Tools
• Segmentation and layout
• OCR text

Evaluation Metrics and Scenarios
• Metrics
   – Measurements of conditions (types of errors)
• Scenarios
   – Expression of metrics in application context
   – Combinations of weighted metrics
• Overall score combines individual weighted scores according to
   – Type and size of region
   – Neighbourhood of errors
• Horizontal mergers and splits in text regions that maintain reading order attract small penalties
• Vertical mergers and splits (e.g. merged columns) attract higher penalties
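The idea of a scenario as a set of weights over error types can be sketched as follows. This is a toy illustration, not the PRImA evaluation method: the weight values, the area-based normalisation, and the names `SegmentationError`, `penalty_weight` and `success_rate` are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SegmentationError:
    kind: str                  # "merge" or "split"
    direction: str             # "horizontal" or "vertical"
    area: float                # affected area in pixels
    keeps_reading_order: bool  # True if the text is still read in the right order


# Hypothetical scenario weights: small penalty for harmless horizontal
# errors, full penalty for everything else (e.g. merged columns).
SMALL_PENALTY = 0.25
FULL_PENALTY = 1.0


def penalty_weight(error: SegmentationError) -> float:
    """Choose the weight for one error under this made-up scenario."""
    if error.direction == "horizontal" and error.keeps_reading_order:
        return SMALL_PENALTY
    return FULL_PENALTY


def success_rate(errors: Iterable[SegmentationError], page_area: float) -> float:
    """Combine weighted, area-scaled errors into one score in [0, 1]."""
    weighted = sum(penalty_weight(e) * e.area for e in errors)
    return max(0.0, 1.0 - weighted / page_area)


errors = [
    SegmentationError("split", "horizontal", area=2000, keeps_reading_order=True),
    SegmentationError("merge", "vertical", area=3000, keeps_reading_order=False),
]
score = success_rate(errors, page_area=10000)
```

Here the horizontal split contributes 500 weighted pixels and the vertical merge 3000, so the score is 0.65. Swapping in a different weight table gives a different scenario over the same measured errors.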

Further Information
• PRImA
• IMPACT