docWORKS/METAe: Automated Conversion Of Printed Documents Into Fully Tagged METS Objects. Claus Gravenhorst. April 2004 – METS Opening Day West. www.ccs-gmbh.de

Presentation transcript:

Slide 1: docWORKS/METAe – Automated Conversion Of Printed Documents Into Fully Tagged METS Objects
Claus Gravenhorst, Content Conversion Specialists
April 2004 – METS Opening Day West

Slide 2: What is docWORKS/METAe?
- Production tool for converting printed documents into fully tagged digital objects
- The METAe edition of docWORKS is the result of the EU-funded project METAe
- Start of project: September 2000
- End of project: August 2003
- Product launch: March 2003, CeBIT exhibition

Slide 3: The project group
1. Leopold-Franzens-Universität Innsbruck (Co-ordinator), Austria
2. Universität Linz, Institut für Angewandte Informatik, University of Linz, Austria
3. Mitcom Neue Medien GmbH (ABBYY Europe), Germany
4. CCS Compact Computer Systeme, Germany
5. Universidad de Alicante, Spain
6. Friedrich-Ebert-Stiftung, Germany
7. Cornell University Library, Department of Preservation and Conservation, USA
8. Bibliothèque nationale de France
9. The National Library of Norway, Rana division, Norway
10. Biblioteca Statale A. Baldini, Italy
11. Dipartimento di Sistemi e Informatica, University of Florence, Italy
12. Karl-Franzens-Universität Graz, Universitätsbibliothek, Austria
13. Scuola Normale Superiore, Centro di Ricerche Informatiche per i Beni Culturali, Italy
14. Higher Education Digitisation Service HEDS, UK

Slide 4: Challenges
Digitization and retro-conversion of printed or textual material is becoming increasingly important:
- Keep knowledge and cultural heritage alive
- Preserve the originals
- Enable quick and enhanced access through highly structured documents
- Open up new dimensions of research
- Provide standardized output formats

Slide 5: Goals
- Automate the conversion process
- Make digitization more effective and safer
- Increase the added value of digitized collections
- Provide a standardized output format in order to allow transformation of metadata into various applications and systems

Slide 6: docWORKS – System Overview
Diagram labels:
- Input: document; Scanning, Import
- docWORKS engine: Image Pre-Processing, Layout Analysis, Character Recognition, Structural Analysis, Correction; Rules DB
- Output: Export; METS, ALTO, TIFF, JPEG

Slide 7: docWORKS – as much metadata as possible!
Available data
- Formats: Library records, e.g. MARC; TIFF images
- docWORKS engine: Import of subsets, linking to record
- User mode: Automated
Descriptive metadata
- Formats: METS Dublin Core, linking to catalogue record
- docWORKS engine: Creates descriptive records for articles, pictures, …
- User mode: Semi-automated; correction recommended
Administrative metadata
- Formats: METS incl. NISO (mix)
- docWORKS engine: Records metadata
- User mode: Fully automated after defining a profile
Structural metadata - logical
- Formats: METS Structural map
- docWORKS engine: Suggests labels of logical elements and structures
- User mode: Automated; correction recommended
Structural metadata - physical
- Formats: ALTO (Analyzed Layout and Text Object)
- docWORKS engine: Provides suggestion for physical structure
- User mode: Automated; correction in special cases

Slide 8: docWORKS – Matching of Image Files and Page Numbers
Image file | Pagination | Page number
tif | Not counted | Np
tif | Not counted | Np
tif | Counted | I
tif | Counted | II
tif | Counted | III
tif | Counted | IV
tif | Counted | V
tif | Counted | VI
tif | Counted |
tif | Counted, not paginated | (2)
tif | Counted |
tif | Counted | 4
placeholder | Missing page | 5
placeholder | Missing page |
tif | Counted |
tif | Counted | 8
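The matching scheme on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions, not the docWORKS implementation: the `Leaf` record, the `label_pages` function, and the file names are hypothetical, and the rule that counted leaves advance a Roman or Arabic counter is inferred from the values that survive in the table above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Leaf:
    """One position in the scanned sequence (hypothetical record)."""
    image: Optional[str]    # image file name, or None when the page is missing
    counted: bool = True    # does the leaf take part in the page count?
    printed: bool = True    # is a page number actually printed on the page?
    scheme: str = "arabic"  # "roman" for front matter, "arabic" for the main part


def to_roman(n: int) -> str:
    """Convert a positive integer to an upper-case Roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
             (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
             (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, numeral in pairs:
        while n >= value:
            out.append(numeral)
            n -= value
    return "".join(out)


def label_pages(leaves: List[Leaf]) -> List[Tuple[str, str, str]]:
    """Return (image or 'placeholder', pagination status, page label) rows."""
    counters = {"roman": 0, "arabic": 0}
    rows = []
    for leaf in leaves:
        name = leaf.image if leaf.image is not None else "placeholder"
        if not leaf.counted:
            rows.append((name, "Not counted", "Np"))
            continue
        # Counted leaves advance the counter even if the page is missing.
        counters[leaf.scheme] += 1
        n = counters[leaf.scheme]
        label = to_roman(n) if leaf.scheme == "roman" else str(n)
        if leaf.image is None:
            status = "Missing page"
        elif not leaf.printed:
            status = "Counted, not paginated"
            label = f"({label})"  # number is implied, not printed
        else:
            status = "Counted"
        rows.append((name, status, label))
    return rows


# Hypothetical sequence shaped like the slide: cover leaves, front matter, main part.
example = (
    [Leaf("0001.tif", counted=False), Leaf("0002.tif", counted=False)]
    + [Leaf(f"{i:04d}.tif", scheme="roman") for i in range(3, 9)]
    + [Leaf("0009.tif"), Leaf("0010.tif", printed=False),
       Leaf("0011.tif"), Leaf("0012.tif"),
       Leaf(None), Leaf(None),
       Leaf("0013.tif"), Leaf("0014.tif")]
)
rows = label_pages(example)
```

Keeping the counter running across placeholders is what lets a missing page keep its slot, so later image files still line up with their printed page numbers.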

Slide 9: Traditional OCR - Output
THE AMERICAN MISSIONARY. Vo.. XXXII JANUARY, 1878 No. 1 American Missionary Association
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Slide 10: More information available
- Title page
- Title of series
- Volume number
- Issue number
- Motto
- Date

Slide 11: docWORKS – Structural Analysis
- FRONT
- MAIN
- BACK

Slide 12: docWORKS – Structural Analysis
- Chapter 1
- Chapter 2
- Subchapter 1
- Subchapter 2

Slide 13: docWORKS – Structural Analysis
- Preface
- Table of contents
- Title page
- Statement page

Slide 14: docWORKS – Document layers
Document layers are distinguished automatically. Selecting specific layers enables well-targeted searches and the presentation of electronic text without unnecessary items:
- Body text, independent of its presentation
- Margin notes, footnotes
- Pictures and captions
- Advertisements
- Annexes and supplements
- Navigation layer: table of contents, running title, document index, page number, volume index
- Book: separation of "intellectual" and "artificial" content

Slide 15: docWORKS – Digitization of books and journals (METAe)

Slide 16: docWORKS – Digitization of books and journals (METAe)

Slide 17: docWORKS – Digitization of scientific documents

Slide 18: docWORKS – Basic Workflow
Diagram labels, in workflow order:
- Digitization (Scanning)
- DB (OPAC, MARC)
- Quality Control Images
- Conversion
- Quality Control Output
- Export
- Presentation (XML/METS, PDF)

Slide 19: docWORKS – Scalable Client / Server architecture
Diagram labels:
- Scan Import, Quality Control
- Server 1, Server 2, Server 3, ... Server n
Server tasks:
- Auto-Import
- Image Preprocessing
- Layout Analysis
- OCR
- Structural Analysis
- Export

Slide 20: docWORKS – METS / ALTO
Diagram labels: METS document, TIFF, ALTO
ALTO – Analyzed Layout and Text Object

Slide 21: docWORKS – METS
- Header
- DC, descriptive metadata
- NISO (mix), technical metadata
- Structural Map: Physical Structure
- Structural Map: Logical Structure

Slide 22: docWORKS – ALTO
Styles
- Paragraph (alignment, linespacing, etc.)
- Font (name, size, bold, italic, etc.)
Layout
- Printspace
- TopMargin
- InnerMargin
- OuterMargin
- BottomMargin
Objects in 5 areas above:
- Text block
- Text lines
- Strings [coordinates, string (as printed), substitution (hyphenation)]
- Spaces
- Composed block
- Picture
- Table
- Formula
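The string model listed above (string as printed, plus a substitution for hyphenation) can be illustrated with a small Python sketch. It is not taken from the slides: the fragment is made up, the XML namespace is omitted for brevity, and the attribute names (`CONTENT`, `SUBS_TYPE`, `SUBS_CONTENT`, and the `HYP` and `SP` elements) follow the published ALTO schema as far as I know it, so the 2004 docWORKS output may differ in detail.

```python
import xml.etree.ElementTree as ET

# Made-up ALTO-like fragment; real ALTO files declare a namespace and more attributes.
ALTO_FRAGMENT = """
<TextBlock ID="TB1">
  <TextLine>
    <String CONTENT="the" HPOS="100" VPOS="200" WIDTH="40" HEIGHT="20"/>
    <SP WIDTH="10"/>
    <String CONTENT="char" SUBS_TYPE="HypPart1" SUBS_CONTENT="character"
            HPOS="150" VPOS="200" WIDTH="50" HEIGHT="20"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="acter" SUBS_TYPE="HypPart2" SUBS_CONTENT="character"
            HPOS="100" VPOS="225" WIDTH="60" HEIGHT="20"/>
    <SP WIDTH="10"/>
    <String CONTENT="and" HPOS="170" VPOS="225" WIDTH="40" HEIGHT="20"/>
  </TextLine>
</TextBlock>
"""


def printed_lines(block):
    """Return each text line exactly as printed, hyphen included."""
    lines = []
    for line in block.iter("TextLine"):
        parts = []
        for el in line:
            if el.tag in ("String", "HYP"):
                parts.append(el.get("CONTENT", ""))
            elif el.tag == "SP":
                parts.append(" ")
        lines.append("".join(parts))
    return lines


def search_text(block):
    """Return running text with hyphenated words rejoined via SUBS_CONTENT."""
    words = []
    for s in block.iter("String"):
        subs_type = s.get("SUBS_TYPE")
        if subs_type == "HypPart2":
            continue  # second half is already covered by the first half's substitution
        if subs_type == "HypPart1":
            words.append(s.get("SUBS_CONTENT"))
        else:
            words.append(s.get("CONTENT"))
    return " ".join(words)


block = ET.fromstring(ALTO_FRAGMENT.strip())
```

Storing both forms keeps the page faithful to the print for display while letting a full-text index match the whole word.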

Slide 23: docWORKS – METS / physical structure
Diagram labels:
- METS sections: DC, FILEGRP, PHYS, LOGICAL
- Page entries in PHYS: ORDER …; ORDERLABEL I, II, III, IV, V, VI, …

Slide 24: docWORKS – METS / physical structure
Diagram labels:
- METS sections: DC, FILEGRP, PHYS, LOGICAL
- DIV (page): fptr, par
- FILEID ALTO, FILEID IMAGE
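As a rough illustration of the physical structure shown on slides 23 and 24, the Python sketch below builds a METS physical structural map in which each page `div` carries `ORDER` and `ORDERLABEL` and points, through `fptr` and `par`, at an image file and an ALTO file. The element and attribute names are those of the METS schema; the function name, the file IDs, and the `TYPE` value of the root `div` are hypothetical, and docWORKS itself may lay the map out differently.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def physical_structmap(pages):
    """Build a physical structMap.

    pages: iterable of (order, orderlabel, image_file_id, alto_file_id).
    The file IDs are assumed to match <file ID="..."> entries in the fileSec.
    """
    struct_map = ET.Element(q("structMap"), TYPE="PHYSICAL")
    # METS expects a single root div; its TYPE value here is a placeholder.
    root_div = ET.SubElement(struct_map, q("div"), TYPE="PhysicalStructure")
    for order, label, image_id, alto_id in pages:
        page = ET.SubElement(root_div, q("div"), TYPE="page",
                             ORDER=str(order), ORDERLABEL=label)
        fptr = ET.SubElement(page, q("fptr"))
        # <par>: the image and the ALTO file describe the same page in parallel.
        par = ET.SubElement(fptr, q("par"))
        ET.SubElement(par, q("area"), FILEID=image_id)
        ET.SubElement(par, q("area"), FILEID=alto_id)
    return struct_map


# Hypothetical pages: third and fourth image of a volume, labelled I and II.
struct_map = physical_structmap([
    (3, "I", "IMG0003", "ALTO0003"),
    (4, "II", "IMG0004", "ALTO0004"),
])
xml_text = ET.tostring(struct_map, encoding="unicode")
```

`ORDER` records the position in the scanned sequence, while `ORDERLABEL` keeps the page number as printed, which is what lets Roman-numbered front matter and uncounted leaves coexist in one sequence.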

Slide 25: docWORKS – METS / logical structure
Diagram labels:
- METS sections: DC, FILEGRP, PHYS, LOGICAL
- Logical divisions and metadata records: DIV (volume) with DCMD_PHYS and DCMD_ELEC; DIV (issue) with DCMD_ISSUE#; DIV (contrib.) with DCMD_#CONT#; DIV (chapter) with DCMD_CHAP#; DIV (paragraph)
- File pointer: fptr, seq; FILEID ALTO; text block BEGIN; FILEID; Coordinates
- XSLT
- Sample text: "Those who have read the History of Columbus will, doubtless, remember the character and exploits..."
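The link from a logical division to its text can be sketched as follows. The slide names XSLT for this step; the Python below is a stand-in that only illustrates the idea, under several assumptions: the `seq` of `area` elements uses `FILEID` plus `BEGIN` to name an ALTO text block, the file IDs and `DMDID` values are invented, and the `alto_page`, `block_text`, and `logical_text` helpers are hypothetical. In a real METS document, `FILEID` would be resolved through the file section rather than a Python dictionary.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
M = f"{{{METS_NS}}}"

# Made-up logical structMap: one paragraph that runs across two pages.
LOGICAL = f"""
<structMap xmlns="{METS_NS}" TYPE="LOGICAL">
  <div TYPE="volume" DMDID="DCMD_PHYS DCMD_ELEC">
    <div TYPE="issue" DMDID="DCMD_ISSUE1">
      <div TYPE="chapter" DMDID="DCMD_CHAP1">
        <div TYPE="paragraph">
          <fptr>
            <seq>
              <area FILEID="ALTO0009" BETYPE="IDREF" BEGIN="TB3"/>
              <area FILEID="ALTO0010" BETYPE="IDREF" BEGIN="TB1"/>
            </seq>
          </fptr>
        </div>
      </div>
    </div>
  </div>
</structMap>
"""


def alto_page(blocks):
    """Build a tiny ALTO-like page from {block_id: text} (namespace omitted)."""
    page = ET.Element("Page")
    for block_id, text in blocks.items():
        text_block = ET.SubElement(page, "TextBlock", ID=block_id)
        line = ET.SubElement(text_block, "TextLine")
        for word in text.split():
            ET.SubElement(line, "String", CONTENT=word)
    return page


def block_text(page, block_id):
    """Return the words of one text block, joined by single spaces."""
    for text_block in page.iter("TextBlock"):
        if text_block.get("ID") == block_id:
            return " ".join(s.get("CONTENT") for s in text_block.iter("String"))
    raise KeyError(block_id)


def logical_text(div, alto_pages):
    """Follow fptr/seq/area in reading order and concatenate the text blocks."""
    parts = []
    for area in div.iterfind(f"{M}fptr/{M}seq/{M}area"):
        page = alto_pages[area.get("FILEID")]
        parts.append(block_text(page, area.get("BEGIN")))
    return " ".join(parts)


# Stand-in for resolving FILEID through the METS file section.
alto_pages = {
    "ALTO0009": alto_page({"TB3": "Those who have read the History of Columbus will,"}),
    "ALTO0010": alto_page({"TB1": "doubtless, remember the character and exploits"}),
}
struct_map = ET.fromstring(LOGICAL.strip())
paragraph = next(d for d in struct_map.iter(f"{M}div") if d.get("TYPE") == "paragraph")
text = logical_text(paragraph, alto_pages)
```

Using `seq` rather than `par` expresses reading order, so a paragraph that continues on the next page is reassembled in the right sequence.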

Slide 26: docWORKS – ALTO / page layout and text content

Slide 27: docWORKS – ALTO / hyphenated word

Slide 28: docWORKS – ALTO / hyphenated word

Slide 29: Daniel!

Slide 30: Thank you!
Claus Gravenhorst, Daniel Lanz
Content Conversion Specialists