Published by Lorraine Newton. Modified over 9 years ago.
1
April 2004 – METS Opening Day West – www.ccs-gmbh.de
docWORKS/METAe: Automated Conversion of Printed Documents into Fully Tagged METS Objects
Claus Gravenhorst, Content Conversion Specialists
2
What is docWORKS/METAe?
- Production tool for converting printed documents into fully tagged digital objects
- The METAe edition of docWORKS is the result of the EU-funded project METAe
- Start of project: September 2000
- End of project: August 2003
- Product launch: March 2003, CeBIT exhibition
3
The project group
1. Leopold-Franzens-Universität Innsbruck (Co-ordinator), Austria
2. Universität Linz, Institut für Angewandte Informatik, University of Linz, Austria
3. Mitcom Neue Medien GmbH (ABBYY Europe), Germany
4. CCS Compact Computer Systeme, Germany
5. Universidad de Alicante, Spain
6. Friedrich-Ebert-Stiftung, Germany
7. Cornell University Library, Department of Preservation and Conservation, USA
8. Bibliothèque nationale de France
9. The National Library of Norway, Rana division, Norway
10. Biblioteca Statale A. Baldini, Italy
11. Dipartimento di Sistemi e Informatica, University of Florence, Italy
12. Karl-Franzens-Universität Graz, Universitätsbibliothek, Austria
13. Scuola Normale Superiore, Centro di Ricerche Informatiche per i Beni Culturali, Italy
14. Higher Education Digitisation Service HEDS, UK
4
Challenges
Digitization and retro-conversion of printed or textual material is becoming increasingly important:
- Keep knowledge and cultural heritage alive
- Preserve the originals
- Enable quick and enhanced access through highly structured documents
- Open up new dimensions of research
- Provide standardized output formats
5
Goals
- Automate the conversion process
- Make digitization more effective and safer
- Increase the added value of digitized collections
- Provide a standardized output format so that metadata can be transformed for use in various applications and systems
6
docWORKS – System Overview
- Input: document; Scanning, Import
- docWORKS engine: Image Pre-Processing, Layout Analysis, Character Recognition, Structural Analysis; Rules DB
- Output: Correction, Export; METS, ALTO, TIFF, JPEG
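The engine stages named on this slide can be read as a fixed processing chain. The following is only a minimal sketch of that idea, not the docWORKS implementation: the `make_stage` and `run_engine` helpers and the dictionary-based document are hypothetical, and each stage merely records its name instead of doing real image or text processing.

```python
from typing import Callable, Dict, List

Stage = Callable[[Dict], Dict]

# Stage names taken from the slide; everything else here is illustrative.
ENGINE_STAGES: List[str] = [
    "image pre-processing",
    "layout analysis",
    "character recognition",
    "structural analysis",
]


def make_stage(name: str) -> Stage:
    """Return a placeholder stage that only records that it ran."""
    def stage(doc: Dict) -> Dict:
        return {**doc, "history": doc["history"] + [name]}
    return stage


def run_engine(doc: Dict, stages: List[Stage]) -> Dict:
    """Pass the document through every stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_engine(
    {"source": "000001.tif", "history": []},
    [make_stage(name) for name in ENGINE_STAGES],
)
print(result["history"])
```

In a real system each stage would consume the output of the previous one (for example, character recognition working on the regions found by layout analysis), which is why the order matters.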
7
docWORKS – as much metadata as possible!
Available data
- Formats: library records (e.g. MARC), TIFF images
- docWORKS engine: import of subsets, linking to record
- User mode: automated
Descriptive metadata
- Formats: METS Dublin Core, linking to catalogue record
- docWORKS engine: creates descriptive records for articles, pictures, …
- User mode: semi-automated, correction recommended
Administrative metadata
- Formats: METS incl. NISO (MIX)
- docWORKS engine: records metadata
- User mode: fully automated after defining a profile
Structural metadata, logical
- Formats: METS structural map
- docWORKS engine: suggests labels of logical elements and structures
- User mode: automated, correction recommended
Structural metadata, physical
- Formats: ALTO (Analyzed Layout and Text Object)
- docWORKS engine: provides suggestion for physical structure
- User mode: automated, correction in special cases
8
docWORKS – Matching of Image Files and Page Numbers
Image file     Pagination                Page number
000001.tif     Not counted               Np
000002.tif     Not counted               Np
000003.tif     Counted                   I
000004.tif     Counted                   II
000005.tif     Counted                   III
000006.tif     Counted                   IV
000007.tif     Counted                   V
000008.tif     Counted                   VI
000009.tif     Counted                   1
000010.tif     Counted, not paginated    (2)
000011.tif     Counted                   3
000012.tif     Counted                   4
placeholder    Missing page              5
placeholder    Missing page              6
000013.tif     Counted                   7
000014.tif     Counted                   8
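The matching rules in this table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the docWORKS algorithm: the `Page` record, the scheme names `"none"`, `"roman"` and `"arabic"`, and the `label_pages` function are hypothetical. Uncounted pages get `Np`, counted pages advance a Roman or Arabic counter, counted pages without a printed number are shown in parentheses, and missing pages are placeholders that still advance the counter.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


def to_roman(n: int) -> str:
    """Convert a positive integer to an upper-case Roman numeral."""
    numerals = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
                (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
                (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


@dataclass
class Page:
    image: Optional[str]   # None marks a placeholder for a missing page
    scheme: str            # "none", "roman" or "arabic" (hypothetical names)
    printed: bool = True   # False: counted, but no number printed on the page


def label_pages(pages: List[Page]) -> List[Tuple[str, str]]:
    """Return (image file or 'placeholder', page label) for each page."""
    counters = {"roman": 0, "arabic": 0}
    rows = []
    for page in pages:
        name = page.image or "placeholder"
        if page.scheme == "none":          # not counted at all
            rows.append((name, "Np"))
            continue
        counters[page.scheme] += 1         # missing pages still count
        n = counters[page.scheme]
        label = to_roman(n) if page.scheme == "roman" else str(n)
        if not page.printed:
            label = f"({label})"
        rows.append((name, label))
    return rows


# The sequence from the slide.
pages = (
    [Page("000001.tif", "none"), Page("000002.tif", "none")]
    + [Page(f"{i:06d}.tif", "roman") for i in range(3, 9)]
    + [Page("000009.tif", "arabic"),
       Page("000010.tif", "arabic", printed=False),
       Page("000011.tif", "arabic"), Page("000012.tif", "arabic"),
       Page(None, "arabic"), Page(None, "arabic"),
       Page("000013.tif", "arabic"), Page("000014.tif", "arabic")]
)
labels = [label for _, label in label_pages(pages)]
print(labels)
```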
9
Traditional OCR - Output
THE AMERICAN MISSIONARY. Vo.. XXXII JANUARY, 1878 No. 1
American Missionary Association 1877 - 1888
[body text, shown on the slide as rows of placeholder characters]
10
More information available
- Title page
- Title of series
- Volume number
- Issue number
- Motto
- Date
11
docWORKS – Structural Analysis
- FRONT
- MAIN
- BACK
12
docWORKS – Structural Analysis
- Chapter 1
- Chapter 2
- Subchapter 1
- Subchapter 2
13
docWORKS – Structural Analysis
- Preface
- Table of contents
- Title page
- Statement page
14
docWORKS – Document layers
Document layers are differentiated automatically. Selecting particular layers enables targeted searches and presentation of the electronic text without unnecessary items.
- Body text, independent of its presentation
- Margin notes, footnotes
- Pictures and captions
- Advertisements
- Annexes and supplements
- Navigation layer: table of contents, running title, document index, page number, volume index
- Book: separation of "intellectual" and "artificial" content
15
docWORKS – Digitization of books and journals (METAe)
16
docWORKS – Digitization of books and journals (METAe)
17
docWORKS – Digitization of scientific documents
18
docWORKS – Basic Workflow
- Digitization / Scanning
- DB, OPAC, MARC
- Quality Control: Images
- Conversion
- Quality Control: Output
- Export
- Presentation: XML/METS, PDF
19
docWORKS – Scalable Client / Server architecture
- Servers: Server 1, Server 2, Server 3, ... Server n
- Clients: Scan / Import, Quality Control
- Server tasks: Auto-Import, Image Preprocessing, Layout Analysis, OCR, Structural Analysis, Export
20
docWORKS – METS / ALTO
- METS document
- TIFF
- ALTO (Analyzed Layout and Text Object)
21
docWORKS – METS
- Header
- DC, descriptive metadata
- NISO Z39.87 (MIX), technical metadata
- Structural Map: Physical Structure
- Structural Map: Logical Structure
22
docWORKS – ALTO
Styles
- Paragraph (alignment, line spacing, etc.)
- Font (name, size, bold, italic, etc.)
Layout
- Printspace
- TopMargin
- InnerMargin
- OuterMargin
- BottomMargin
Objects in the 5 areas above:
- Text block
- Text lines
- Strings [coordinates, string (as printed), substitution (hyphenation)]
- Spaces
- Composed block
- Picture
- Table
- Formula
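To make the block, line and string hierarchy concrete, here is a small sketch that reads word strings and their coordinates from an ALTO-like fragment using Python's standard `xml.etree.ElementTree`. The element and attribute names (`TextBlock`, `TextLine`, `String`, `SP`, `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`) follow the ALTO schema, but the fragment is simplified: the namespace is omitted and all coordinate values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free ALTO fragment; coordinates are invented.
ALTO_SAMPLE = """
<alto>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="THE" HPOS="300" VPOS="200" WIDTH="90" HEIGHT="40"/>
            <SP/>
            <String CONTENT="AMERICAN" HPOS="410" VPOS="200" WIDTH="260" HEIGHT="40"/>
            <SP/>
            <String CONTENT="MISSIONARY." HPOS="690" VPOS="200" WIDTH="330" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
""".strip()


def words_with_boxes(root):
    """Return (text, (hpos, vpos, width, height)) for every String element."""
    result = []
    for string in root.iter("String"):
        box = tuple(float(string.get(key))
                    for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        result.append((string.get("CONTENT"), box))
    return result


words = words_with_boxes(ET.fromstring(ALTO_SAMPLE))
for text, box in words:
    print(text, box)
```

Keeping a bounding box per word is what allows a viewer to highlight search hits directly on the page image.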
23
docWORKS – METS / physical structure
METS: DC, FILEGRP, PHYS, LOGICAL
ORDER: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 …
LABEL: II III IV V VI 2 3 4 5 6 …
ORDERLABEL: I II III IV V VI 1 2 3 4 5 6 …
24
docWORKS – METS / physical structure
METS: DC, FILEGRP, PHYS, LOGICAL
DIV (page): fptr, par
- FILEID ALTO
- FILEID IMAGE
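A physical structural map of this shape can be sketched with the standard library. The METS element and attribute names used here (`structMap`, `div`, `fptr`, `par`, `area`, `TYPE`, `ORDER`, `ORDERLABEL`, `FILEID`) come from the METS schema; the function name `physical_struct_map`, the file identifiers and the choice to put the image and ALTO files side by side in one `par` are assumptions made for illustration, not a dump of actual docWORKS output.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def physical_struct_map(pages):
    """Build a PHYSICAL structMap.

    pages: list of (orderlabel, image_file_id, alto_file_id) tuples,
    given in physical page order. File IDs are hypothetical.
    """
    smap = ET.Element(q("structMap"), {"TYPE": "PHYSICAL"})
    for order, (label, image_id, alto_id) in enumerate(pages, start=1):
        div = ET.SubElement(smap, q("div"), {
            "TYPE": "page",
            "ORDER": str(order),      # running sequence number
            "ORDERLABEL": label,      # page number as printed
        })
        # par: the image and its ALTO description belong together
        par = ET.SubElement(ET.SubElement(div, q("fptr")), q("par"))
        ET.SubElement(par, q("area"), {"FILEID": image_id})
        ET.SubElement(par, q("area"), {"FILEID": alto_id})
    return smap


smap = physical_struct_map([
    ("I", "IMG00003", "ALTO00003"),
    ("II", "IMG00004", "ALTO00004"),
])
print(ET.tostring(smap, encoding="unicode"))
```

`ORDER` always counts upward from 1, while `ORDERLABEL` carries whatever the page itself shows, which is how Roman and Arabic page sequences can coexist in one document.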
25
docWORKS – METS / logical structure
METS: DC, FILEGRP, PHYS, LOGICAL
DIV (volume): DCMD_PHYS, DCMD_ELEC
DIV (issue): DCMD_ISSUE#
DIV (contrib.): DCMD_#CONT#
DIV (chapter): DCMD_CHAP#
DIV (paragraph): fptr, seq
- FILEID ALTO
- BEGIN: text block
- FILEID Coordinates
XSLT
Example text: "Those who have read the History of Columbus will, doubtless, remember the character and exploits..."
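The link from a logical division to its text can be followed programmatically: each `area` in the `seq` names an ALTO file (`FILEID`) and a text block inside it (`BEGIN`). The sketch below resolves such references. It is a simplified illustration, assuming namespace-free fragments, hypothetical file and block IDs, a hypothetical `alto_page` helper that fabricates tiny ALTO stand-ins, and a paragraph that happens to continue on a second page.

```python
import xml.etree.ElementTree as ET

# Simplified logical div: one paragraph spread over two ALTO files.
LOGICAL = ET.fromstring("""
<div TYPE="chapter" LABEL="Chapter 1">
  <div TYPE="paragraph">
    <fptr><seq>
      <area FILEID="ALTO0009" BETYPE="IDREF" BEGIN="TB1"/>
      <area FILEID="ALTO0010" BETYPE="IDREF" BEGIN="TB1"/>
    </seq></fptr>
  </div>
</div>
""".strip())


def alto_page(block_id, words):
    """Build a tiny stand-in for an ALTO page with one text block."""
    root = ET.Element("alto")
    block = ET.SubElement(root, "TextBlock", ID=block_id)
    line = ET.SubElement(block, "TextLine")
    for word in words:
        ET.SubElement(line, "String", CONTENT=word)
    return root


ALTO_FILES = {
    "ALTO0009": alto_page("TB1", "Those who have read the History".split()),
    "ALTO0010": alto_page("TB1", "of Columbus will, doubtless, remember".split()),
}


def text_of(div, alto_files):
    """Collect the text of a logical div by following its area references."""
    parts = []
    for area in div.iter("area"):          # document order = reading order
        page = alto_files[area.get("FILEID")]
        begin = area.get("BEGIN")
        block = page.find(f".//TextBlock[@ID='{begin}']")
        parts.extend(s.get("CONTENT") for s in block.iter("String"))
    return " ".join(parts)


chapter_text = text_of(LOGICAL, ALTO_FILES)
print(chapter_text)
```

Because `text_of` walks every `area` below the given division, calling it on a chapter returns the text of all its paragraphs, and calling it on a single paragraph returns only that paragraph.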
26
docWORKS – ALTO / page layout and text content
27
docWORKS – ALTO / hyphenated word
28
docWORKS – ALTO / hyphenated word
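ALTO keeps both the string as printed and its substitution, so a word split across two lines can be shown faithfully or searched as a whole. The sketch below illustrates this with the schema's `SUBS_TYPE` (`HypPart1`, `HypPart2`), `SUBS_CONTENT` and `HYP` names. The fragment is a simplified, namespace-free example, and the line break inside "character" is invented for illustration; the slide's actual example is an image that did not survive extraction.

```python
import xml.etree.ElementTree as ET

# Simplified ALTO text block with one word hyphenated across two lines.
BLOCK = ET.fromstring("""
<TextBlock ID="TB1">
  <TextLine>
    <String CONTENT="remember"/><SP/><String CONTENT="the"/><SP/>
    <String CONTENT="charac" SUBS_TYPE="HypPart1" SUBS_CONTENT="character"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="ter" SUBS_TYPE="HypPart2" SUBS_CONTENT="character"/><SP/>
    <String CONTENT="and"/><SP/><String CONTENT="exploits..."/>
  </TextLine>
</TextBlock>
""".strip())


def printed_lines(block):
    """Reproduce each line as printed, including the hyphen."""
    lines = []
    for line in block.iter("TextLine"):
        parts = []
        for child in line:
            if child.tag == "String":
                parts.append(child.get("CONTENT"))
            elif child.tag == "SP":
                parts.append(" ")
            elif child.tag == "HYP":
                parts.append(child.get("CONTENT", "-"))
        lines.append("".join(parts))
    return lines


def reading_text(block):
    """Return running text with hyphenated words rejoined."""
    words = []
    for string in block.iter("String"):
        part = string.get("SUBS_TYPE")
        if part == "HypPart1":
            words.append(string.get("SUBS_CONTENT"))   # the whole word, once
        elif part == "HypPart2":
            continue                                   # already emitted
        else:
            words.append(string.get("CONTENT"))
    return " ".join(words)


print(printed_lines(BLOCK))
print(reading_text(BLOCK))
```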
29
Daniel!
30
Thank you!
Claus Gravenhorst, claus.gravenhorst@ccs-gmbh.de
Daniel Lanz, daniel.lanz@ccs-gmbh.de
Content Conversion Specialists, www.ccs-gmbh.de
http://meta-e.uibk.ac.at/