Creating textual resources: Printed documents. Content of this session: types of printed documents; methods of capture; some examples.

Presentation transcript:

Creating textual resources: Printed documents

Content of this session: types of printed documents; methods of capture; some examples

Types of documents: largely textual. Books; periodicals; newspapers; grey literature; documentary surrogates (microfilm etc.)

Other types of documents: miscellaneous materials including musical scores, ephemera, advertisements, cartoons, posters, etc. These fall more closely into the visual images category, to be discussed later.

Diamond Sutra, world's earliest dated printed book, AD 868

Gutenberg Bible, 1450s

Goettingen; British Library; Texas; Keio, Japan

News of the World, June 1851; News of the World, June 1918

Penny Illustrated, October 1861; Weekly Dispatch, June 1856

Chopin First Edition

Trade card, 18th C.

Advertisement for booksellers

Imperial War Museum Spanish Civil War Collection: Poster

Reel of microfilm

Microfiche

Characteristics of documents: books. Printed books can date back to the 15th century (Gutenberg Bible, 1450s; Early English Books Online). These early books may need to be treated more like manuscript materials.

Characteristics of documents: books. Almost certain to be bound: is it possible to disbind? Will they be discarded after scanning? May be printed on unstable media. Different sizes. May have image-rich content. Likely to have language, font or character set issues.

Characteristics of documents: books. Varied internal structures depending on topic and type: recipe books, art history books, children's books. Some common structural features: table of contents, index, bibliography, chapters, footnotes, pages.

Characteristics of documents: periodicals. May be bound: is it possible to disbind? Will they be discarded after scanning? May be printed on unstable media. Different sizes, supplements, etc. May have image-rich content. Likely to have language, font or character set issues.

Characteristics of documents: periodicals. Will have different structures according to type, but structure is likely to be regular within a title: comics, popular magazines, trade magazines, academic journals. Some common features: articles, images, advertisements, columns, diagrams, footnotes, bibliography, table of contents, etc.

Characteristics of newspapers: large in format; prolific in output; designed as essentially ephemeral; fragile; complex and multipart; change over time; many different types of content (text, images, advertisements).

Characteristics of newspapers: difficult to index; difficult to store because of bulk and volume; inherently unstable (paper weak and brittle, deteriorates rapidly); of great interest to researchers; difficult to extract information from.

Characteristics of documents: grey literature. Catch-all category. Includes many different kinds of unpublished or semi-published materials: reports, personal papers, conference papers, newsletters.

Characteristics of documents: grey literature. Difficult to characterize. A collection may have many different formats, periods and conditions. Difficult to catalogue.

Characteristics of documents: microform. A good long-term storage alternative but a poor substitute for reading: loss of the sense of the physicality of the original; linear; small format; tiring to read; impossible to search; harder to scan (by eye) than the originals.

Capture methods. Depends on the type of material. There may be more than one option. What is the purpose of the digitization: a forensic record of the original, the textual content, or both?

Capture methods. The more human input the materials need, the higher the cost is likely to be. It is possible to create good, searchable digital surrogates from certain kinds of documents by largely automated means. Other materials may need more handling and human intervention.

Capture methods. Scanning: book scanner; flatbed scanner; drum feed scanner; microfilm scanner.

Digitization issues: preparation of materials; assessing the collection; organization of data resources.

Scanning into electronic formats. Preparation of materials; assess the collection. STOP POINT 1.

Scanning into electronic formats. STOP 2: OCR for indexing. STOP 3: OCR/rekeying for end-user presentation. STOP 4: metadata. STOP 5: SGML/XML.

Digitization issues. In every case you have to: assess the nature of the collection; prepare the collection for digitization; decide how to organize the end information resource.

Creating full text. If digital images are scanned with no added value, digital microfilm is the result. This has many advantages for access. But much more is possible...

Creating full text. There are a number of ways to create manipulable text: rekeying (relatively expensive); OCR (Optical Character Recognition) with correction (expensive); uncorrected OCR (relatively low cost). These will be discussed further later.

Rekeying. Most costly option, but less expensive than it was! Very accurate if done well. Can be used instead of providing a digital image, or attached to a digital image as a means of searching.

Case study: Old Bailey Papers. Largest single digital resource on non-elite peoples. 58,000 pages, or more than 250 million characters, rekeyed. Rekeying is the most effective way to address the content of the originals. XML markup is the only way to deliver the content in a structured way.
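
To give a feel for the scale of the rekeying job, the two figures quoted on the slide can be turned into an average page density. This is only a back-of-envelope sketch: ">250 million" is treated as a lower bound, so the result is a minimum average rather than a measured value.

```python
# Figures quoted on the slide; the character count is a lower bound (">250 million").
pages = 58_000
characters = 250_000_000

# Average number of characters that had to be keyed for each page.
chars_per_page = characters / pages

print(f"at least {chars_per_page:,.0f} characters per page on average")
```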

OCR. Pattern recognition algorithms which can convert images of alphanumeric characters into ASCII code. Been around since the 1970s: KDEM (Kurzweil Data Entry Machine), hardware and software very expensive, so specialist bureaux offered it as a service. Move to desktop OCR in the mid-to-late 1990s. See handout for OCR guidance.
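
The idea of pattern recognition turning character images into ASCII codes can be illustrated with a toy template matcher. This is a deliberately simplified sketch, not how any particular OCR product works: the 3x5 bitmaps, the three-letter alphabet and the function names are all invented for illustration.

```python
# Toy 3x5 bitmaps for three glyphs ('#' = ink, '.' = blank paper).
# Real OCR engines use far richer models; these templates are illustrative only.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def mismatches(glyph, template):
    """Count the pixels that differ between a scanned glyph and a template."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognise(glyph):
    """Return the character whose template differs from the glyph in the fewest pixels."""
    return min(TEMPLATES, key=lambda ch: mismatches(glyph, TEMPLATES[ch]))


def to_ascii_codes(glyphs):
    """Convert a sequence of glyph bitmaps into ASCII code points."""
    return [ord(recognise(g)) for g in glyphs]


# A slightly damaged "T" (one ink pixel missing) followed by a clean "L".
scanned = [
    ["###", ".#.", "...", ".#.", ".#."],
    ["#..", "#..", "#..", "#..", "###"],
]

text = "".join(recognise(g) for g in scanned)
print(text, to_ascii_codes(scanned))  # TL [84, 76]
```

The damaged "T" is still recognised because it is closer to the "T" template than to any other, which also hints at why poor image quality eventually produces wrong characters.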

OCR accuracy. This depends on the quality of the image being processed. 99% is possible. To what degree is accuracy important? This can depend on the intended use of the captured text.
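
What a 99% character accuracy means in practice can be sketched with a little arithmetic. The page size and word length below are hypothetical, and the word-level figure assumes character errors are independent of one another, which real OCR errors often are not.

```python
def expected_errors(char_accuracy, characters):
    """Expected number of wrong characters in a text of the given length."""
    return (1 - char_accuracy) * characters


def word_accuracy(char_accuracy, word_length):
    """Chance that every character of a word is right, assuming independent errors."""
    return char_accuracy ** word_length


page_chars = 4_000  # hypothetical page size, not a figure from the slides
word_length = 6     # hypothetical length of a search term

print(f"{expected_errors(0.99, page_chars):.0f} wrong characters per page")
print(f"{word_accuracy(0.99, word_length):.1%} of {word_length}-letter words fully correct")
```

Under these assumptions a page still carries dozens of errors, and roughly one six-letter word in seventeen contains a mistake, which matters more for text shown to end users than for text used only as a search index.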

Case study: Refugee Studies Centre Digital Library. Grey literature collection. Earliest documents from the 1960s, so copyright a critical issue. Making content widely available the key aim. Forensic fidelity unimportant. Need to capture a large volume.

Case study: Refugee Studies Centre Digital Library. Methods: can do destructive scanning; digitization outsourced to HEDS (Higher Education Digitisation Service); initially, uncorrected OCR also done by HEDS; later, use of Olive Software's Active Paper Archive; OCR for searching, page image for viewing.
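
The "OCR for searching, page image for viewing" arrangement can be sketched as a word index that points back to page images. This is a minimal illustration under stated assumptions, not a description of how HEDS or Olive Software implemented it: the file names, sample text and function names are hypothetical.

```python
import re
from collections import defaultdict


def build_index(ocr_pages):
    """Map each lower-cased word in the OCR text to the page images containing it."""
    index = defaultdict(set)
    for image_name, ocr_text in ocr_pages.items():
        for word in re.findall(r"[a-z]+", ocr_text.lower()):
            index[word].add(image_name)
    return index


def search(index, word):
    """Return the page images to display for a one-word query."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical uncorrected OCR output; "regi0n" stands in for a typical misread.
ocr_pages = {
    "report_p001.tif": "Refugee camps in the regi0n were surveyed",
    "report_p002.tif": "The survey covered refugee households",
}

index = build_index(ocr_pages)
print(search(index, "refugee"))  # ['report_p001.tif', 'report_p002.tif']
print(search(index, "region"))   # [] because the OCR misread is never corrected
```

The reader only ever sees the page image, so OCR errors do not affect what is displayed; they only cause some searches to miss pages, as the second query shows.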

Case study: British Library Newspaper Pilot. Methods: scanned from microfilm by OCLC; Olive Software's Active Paper Archive used for processing and delivery; all processing and metadata extraction is automatic; papers divided into components using profiles: articles (title/body), images (picture/caption), ads, etc.
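
The division of a newspaper page into components can be pictured as a small data model. The sketch below is only an assumption about how such a structure might look; it is not Active Paper Archive's actual format, and every class name, field name and sample value is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    title: str
    body: str


@dataclass
class Image:
    picture: str  # placeholder path to the cropped picture
    caption: str


@dataclass
class Advertisement:
    text: str


@dataclass
class Page:
    newspaper: str
    date: str
    components: list = field(default_factory=list)

    def of_type(self, kind):
        """Return the components of one kind, e.g. all articles on the page."""
        return [c for c in self.components if isinstance(c, kind)]


# Placeholder content, for illustration only.
page = Page(
    newspaper="Example Weekly",
    date="1861-10",
    components=[
        Article(title="Sample headline", body="Sample body text."),
        Image(picture="page1_img1.tif", caption="Sample caption"),
        Advertisement(text="Sample advertisement"),
    ],
)

print([a.title for a in page.of_type(Article)])  # ['Sample headline']
```

Keeping each component as its own record is what allows searching and display at article level rather than only at whole-page level.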