April 2004 – METS Opening Day West
docWORKS/METAe: Automated Conversion of Printed Documents into Fully Tagged METS Objects
Claus Gravenhorst, Content Conversion Specialists
What is docWORKS/METAe?
- Production tool for converting printed documents into fully tagged digital objects
- The METAe edition of docWORKS is the result of the EU-funded project METAe
- Start of project: September 2000
- End of project: August 2003
- Product launch: March 2003, CeBIT exhibition
The project group
1. Leopold-Franzens-Universität Innsbruck (Co-ordinator), Austria
2. Universität Linz, Institut für Angewandte Informatik, Austria
3. Mitcom Neue Medien GmbH (ABBYY Europe), Germany
4. CCS Compact Computer Systeme, Germany
5. Universidad de Alicante, Spain
6. Friedrich-Ebert-Stiftung, Germany
7. Cornell University Library, Department of Preservation and Conservation, USA
8. Bibliothèque nationale de France
9. The National Library of Norway, Rana division, Norway
10. Biblioteca Statale A. Baldini, Italy
11. Dipartimento di Sistemi e Informatica, University of Florence, Italy
12. Karl-Franzens-Universität Graz, Universitätsbibliothek, Austria
13. Scuola Normale Superiore, Centro di Ricerche Informatiche per i Beni Culturali, Italy
14. Higher Education Digitisation Service HEDS, UK
Challenges
Digitization and retro-conversion of printed or textual material is becoming increasingly important:
- Keep knowledge and cultural heritage alive
- Preserve the originals
- Enable quick and enhanced access through highly structured documents
- Open up new dimensions of research
- Provide standardized output formats
Goals
- Automate the conversion process
- Make digitization more effective and safer
- Increase the added value of digitized collections
- Provide a standardized output format so that metadata can be transformed for use in various applications and systems
docWORKS – System Overview
[Diagram: the docWORKS engine between input and output]
- Input: document
- Processing steps: Scanning, Import, Image Pre-Processing, Layout Analysis, Character Recognition, Structural Analysis, Correction, Export
- Rules DB
- Output: METS, ALTO, TIFF, JPEG
docWORKS – as much metadata as possible!
Available data
- Formats: library records, e.g. MARC; TIFF images
- docWORKS engine: import of subsets, linking to record
- User mode: automated
Descriptive metadata
- Formats: METS Dublin Core, linking to catalogue record
- docWORKS engine: creates descriptive records for articles, pictures, …
- User mode: semi-automated, correction recommended
Administrative metadata
- Formats: METS incl. NISO (mix)
- docWORKS engine: records metadata
- User mode: fully automated after defining a profile
Structural metadata, logical
- Formats: METS Structural map
- docWORKS engine: suggests labels of logical elements and structures
- User mode: automated, correction recommended
Structural metadata, physical
- Formats: ALTO (Analyzed Layout and Text Object)
- docWORKS engine: provides a suggestion for the physical structure
- User mode: automated, correction in special cases
docWORKS – Matching of Image Files and Page Numbers
Image file | Pagination | Page number
tif | Not counted | Np
tif | Not counted | Np
tif | Counted | I
tif | Counted | II
tif | Counted | III
tif | Counted | IV
tif | Counted | V
tif | Counted | VI
tif | Counted |
tif | Counted, not paginated | (2)
tif | Counted |
tif | Counted | 4
placeholder | Missing page | 5
placeholder | Missing page |
tif | Counted |
tif | Counted | 8
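The matching shown on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the docWORKS algorithm: the `Page` record, its fields, and the rule that an unprinted number is written in parentheses are hypothetical and were chosen to reproduce the labels visible in the table (Np, roman numerals, arabic numbers, placeholders for missing pages).

```python
from dataclasses import dataclass


@dataclass
class Page:
    # All field names are hypothetical; they are not taken from docWORKS.
    source: str = "tif"       # "tif" image or "placeholder" for a missing page
    counted: bool = True      # False: the page is outside the page count ("Np")
    scheme: str = "arabic"    # numbering sequence: "roman" or "arabic"
    printed: bool = True      # False: counted, but no number is printed on the page


def to_roman(n: int) -> str:
    """Convert a positive integer to an upper-case roman numeral."""
    numerals = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
                (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
                (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def assign_labels(pages):
    """Return one page label per entry, in scan order."""
    counters = {"roman": 0, "arabic": 0}
    labels = []
    for page in pages:
        if not page.counted:
            labels.append("Np")          # not counted: no page number
            continue
        counters[page.scheme] += 1       # placeholders advance the count too
        n = counters[page.scheme]
        label = to_roman(n) if page.scheme == "roman" else str(n)
        labels.append(label if page.printed else f"({label})")
    return labels


pages = (
    [Page(counted=False)] * 2
    + [Page(scheme="roman")] * 6
    + [Page(), Page(printed=False), Page(), Page(),
       Page(source="placeholder"), Page(source="placeholder"),
       Page(), Page()]
)
labels = assign_labels(pages)
```

Because missing pages are represented by placeholders that still advance the counter, the labels of the pages after a gap stay aligned with the printed numbers.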
Traditional OCR - Output
THE AMERICAN MISSIONARY.
Vo.. XXXII JANUARY, 1878 No. 1
American Missionary Association
[Body text, shown on the slide as an unstructured block of placeholder characters]
More information available
- Title page
- Title of series
- Volume number
- Issue number
- Motto
- Date
docWORKS – Structural Analysis
- FRONT
- MAIN
- BACK
docWORKS – Structural Analysis
- Chapter 1
- Chapter 2
- Subchapter 1
- Subchapter 2
docWORKS – Structural Analysis
- Preface
- Table of contents
- Title page
- Statement page
docWORKS – Document layers
Document layers are differentiated automatically. Restricting a query to particular layers enables well-targeted searches and lets the electronic text be presented without unnecessary items.
- Body text, independent of its presentation
- Margin notes, footnotes
- Pictures and captions
- Advertisements
- Annexes and supplements
- Navigation layer: table of contents, running title, document index, page number, volume index
- Book: separation of "intellectual" and "artificial" content
docWORKS – Digitization of books and journals (METAe)
docWORKS – Digitization of scientific documents
docWORKS – Basic Workflow
1. Digitization: Scanning
2. Quality Control: Images
3. Conversion
4. Quality Control: Output
5. Export
6. Presentation: XML/METS, PDF
Also shown in the diagram: DB, OPAC, MARC
docWORKS – Scalable Client / Server architecture
- Servers: Server 1, Server 2, Server 3, ... Server n
- Workstations: Scan, Import, Quality Control
- Server tasks: Auto-Import, Image Preprocessing, Layout Analysis, OCR, Structural Analysis, Export
docWORKS – METS / ALTO
[Diagram labels: document, METS, TIFF, ALTO]
ALTO – Analyzed Layout and Text Object
docWORKS – METS
- Header
- DC, descriptive metadata
- NISO (mix), technical metadata
- Structural Map: Physical Structure
- Structural Map: Logical Structure
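As a rough illustration of how the five parts listed on this slide sit inside one METS file, the following Python sketch builds an empty skeleton with the standard library. The element names and `MDTYPE` values come from the public METS schema; the IDs `DMD1` and `TECH1` are placeholders invented for the example, the file section is left out because the slide does not list it, and nothing here is claimed to match the exact output of docWORKS.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def mets_skeleton() -> ET.Element:
    """Build an empty METS document with the five parts named on the slide."""
    mets = ET.Element(q("mets"))

    # Header
    ET.SubElement(mets, q("metsHdr"))

    # Descriptive metadata, wrapped as Dublin Core (ID is a placeholder)
    dmd = ET.SubElement(mets, q("dmdSec"), ID="DMD1")
    ET.SubElement(dmd, q("mdWrap"), MDTYPE="DC")

    # Technical metadata, wrapped as NISO image metadata (ID is a placeholder)
    amd = ET.SubElement(mets, q("amdSec"))
    tech = ET.SubElement(amd, q("techMD"), ID="TECH1")
    ET.SubElement(tech, q("mdWrap"), MDTYPE="NISOIMG")

    # Two structural maps: physical (pages) and logical (chapters, articles, ...)
    ET.SubElement(mets, q("structMap"), TYPE="PHYSICAL")
    ET.SubElement(mets, q("structMap"), TYPE="LOGICAL")
    return mets


skeleton = mets_skeleton()
xml_text = ET.tostring(skeleton, encoding="unicode")
```

Keeping the physical and logical views in separate structural maps is what allows the same page images to be navigated either page by page or by chapter and article.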
docWORKS – ALTO
Styles
- Paragraph (alignment, line spacing, etc.)
- Font (name, size, bold, italic, etc.)
Layout
- Printspace
- TopMargin
- InnerMargin
- OuterMargin
- BottomMargin
Objects in the 5 areas above:
- Text block
- Text lines
- Strings [coordinates, string (as printed), substitution (hyphenation)]
- Spaces
- Composed block
- Picture
- Table
- Formula
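The string entry above stores both the text as printed and a substitution for hyphenated words. A minimal Python sketch of why that matters follows. The sample fragment is invented for illustration, it omits the ALTO namespace for brevity, and the attribute names (`CONTENT`, `SUBS_TYPE`, `SUBS_CONTENT`, `HYP`) follow the published ALTO schema, which may differ in detail from the version shown in the talk.

```python
import xml.etree.ElementTree as ET

# Invented sample: one text block with a word hyphenated across two lines.
SAMPLE = """
<TextBlock ID="TB1">
  <TextLine>
    <String CONTENT="Those" HPOS="100" VPOS="200" WIDTH="60" HEIGHT="20"/>
    <SP/>
    <String CONTENT="who" HPOS="170" VPOS="200" WIDTH="40" HEIGHT="20"/>
    <SP/>
    <String CONTENT="remem" SUBS_TYPE="HypPart1" SUBS_CONTENT="remember"
            HPOS="220" VPOS="200" WIDTH="70" HEIGHT="20"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="ber" SUBS_TYPE="HypPart2" SUBS_CONTENT="remember"
            HPOS="100" VPOS="230" WIDTH="35" HEIGHT="20"/>
    <SP/>
    <String CONTENT="the" HPOS="145" VPOS="230" WIDTH="35" HEIGHT="20"/>
  </TextLine>
</TextBlock>
"""


def printed_lines(block):
    """Text exactly as printed, line by line, including the hyphen."""
    lines = []
    for line in block.iter("TextLine"):
        parts = []
        for element in line:
            if element.tag == "SP":
                parts.append(" ")
            else:  # String or HYP
                parts.append(element.get("CONTENT", ""))
        lines.append("".join(parts))
    return lines


def searchable_text(block):
    """Running text for indexing: hyphenated words are joined via SUBS_CONTENT."""
    words = []
    for string in block.iter("String"):
        subs_type = string.get("SUBS_TYPE")
        if subs_type == "HypPart1":
            words.append(string.get("SUBS_CONTENT"))   # full word, emitted once
        elif subs_type == "HypPart2":
            continue                                   # already emitted with part 1
        else:
            words.append(string.get("CONTENT"))
    return " ".join(words)


block = ET.fromstring(SAMPLE)
as_printed = printed_lines(block)
for_search = searchable_text(block)
```

The printed form keeps the page faithful to the original for display and highlighting, while the substituted form lets a full-text search for "remember" find the hyphenated occurrence.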
docWORKS – METS / physical structure
[Diagram] METS sections: DC, FILEGRP, PHYS, LOGICAL
Page attributes shown for PHYS:
- ORDER: …
- LABEL: II III IV V VI …
- ORDERLABEL: I II III IV V VI …
docWORKS – METS / physical structure
[Diagram] METS sections: DC, FILEGRP, PHYS, LOGICAL
- PHYS: DIV (page) with a par fptr pointing to FILEID ALTO and FILEID IMAGE
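The page-level pointer described above can be sketched with the Python standard library. The sketch assumes the standard METS elements `div`, `fptr`, `par` and `area`, and uses the `ORDER` and `ORDERLABEL` attributes that appear on the physical-structure slides. The file IDs are hypothetical, and the function is an illustration rather than the docWORKS export code.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def physical_struct_map(pages):
    """Build a physical structMap.

    pages: list of (order, order_label, image_file_id, alto_file_id) tuples.
    Each page div gets one fptr whose par groups the ALTO file and the image,
    meaning that both files represent the same page side by side.
    """
    struct_map = ET.Element(q("structMap"), TYPE="PHYSICAL")
    for order, order_label, image_id, alto_id in pages:
        div = ET.SubElement(struct_map, q("div"), TYPE="page",
                            ORDER=str(order), ORDERLABEL=order_label)
        fptr = ET.SubElement(div, q("fptr"))
        par = ET.SubElement(fptr, q("par"))
        ET.SubElement(par, q("area"), FILEID=alto_id)
        ET.SubElement(par, q("area"), FILEID=image_id)
    return struct_map


# Hypothetical file IDs, for illustration only.
struct_map = physical_struct_map([
    (1, "I", "IMG00001", "ALTO00001"),
    (2, "II", "IMG00002", "ALTO00002"),
])
area_path = "/".join(q(t) for t in ("div", "fptr", "par", "area"))
file_ids = [area.get("FILEID") for area in struct_map.findall(area_path)]
```

A `par` (parallel) group is the natural fit for a page because the image and its ALTO description are two views of the same content, not consecutive pieces of it.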
docWORKS – METS / logical structure
[Diagram] METS sections: DC, FILEGRP, PHYS, LOGICAL
- DIV (volume): DCMD_PHYS, DCMD_ELEC
- DIV (issue): DCMD_ISSUE#
- DIV (contrib.): DCMD_#CONT#
- DIV (chapter): DCMD_CHAP#
- DIV (paragraph): seq fptr
- File pointer targets: FILEID ALTO with BEGIN (text block); FILEID with Coordinates
- Text obtained via XSLT: "Those who have read the History of Columbus will, doubtless, remember the character and exploits..."
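To make the seq fptr idea concrete, here is a small Python sketch that resolves a logical division to its text by following `FILEID` and `BEGIN` into ALTO text blocks. The slide names XSLT for this step; Python is swapped in here purely for illustration. All IDs and sample fragments are invented, the ALTO fragments omit their namespace for brevity, and `BETYPE="IDREF"` is the METS schema's way of saying that `BEGIN` holds an element ID.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Invented logical division: one paragraph that continues across two pages.
LOGICAL = """
<mets:div xmlns:mets="http://www.loc.gov/METS/" TYPE="chapter" DMDID="DCMD_CHAP1">
  <mets:div TYPE="paragraph">
    <mets:fptr>
      <mets:seq>
        <mets:area FILEID="ALTO0001" BETYPE="IDREF" BEGIN="TB2"/>
        <mets:area FILEID="ALTO0002" BETYPE="IDREF" BEGIN="TB1"/>
      </mets:seq>
    </mets:fptr>
  </mets:div>
</mets:div>
"""

# Invented ALTO pages, keyed by the file ID used in the METS file section.
ALTO_FILES = {
    "ALTO0001": (
        '<Page><TextBlock ID="TB1"><TextLine><String CONTENT="Preface"/></TextLine></TextBlock>'
        '<TextBlock ID="TB2"><TextLine><String CONTENT="Those"/><SP/>'
        '<String CONTENT="who"/></TextLine></TextBlock></Page>'
    ),
    "ALTO0002": (
        '<Page><TextBlock ID="TB1"><TextLine><String CONTENT="have"/><SP/>'
        '<String CONTENT="read"/></TextLine></TextBlock></Page>'
    ),
}


def block_text(alto_root, block_id):
    """Return the words of the text block with the given ID."""
    for block in alto_root.iter("TextBlock"):
        if block.get("ID") == block_id:
            return " ".join(s.get("CONTENT") for s in block.iter("String"))
    raise KeyError(block_id)


def div_text(div, alto_files):
    """Concatenate the text blocks a logical div points to, in seq order."""
    parsed = {file_id: ET.fromstring(xml) for file_id, xml in alto_files.items()}
    parts = []
    for area in div.iterfind(".//mets:fptr/mets:seq/mets:area", NS):
        alto_root = parsed[area.get("FILEID")]
        parts.append(block_text(alto_root, area.get("BEGIN")))
    return " ".join(parts)


chapter = ET.fromstring(LOGICAL)
chapter_text = div_text(chapter, ALTO_FILES)
```

A `seq` (sequence) group fits the logical view because a paragraph or article is read as consecutive blocks, possibly spread over several pages, whereas the physical view groups parallel files for one page.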
docWORKS – ALTO / page layout and text content
docWORKS – ALTO / hyphenated word
Daniel!
Thank you!
Claus Gravenhorst, Daniel Lanz
Content Conversion Specialists