Creating textual and visual resources. Overview of this session Types of manuscripts Types of printed documents Types of visual resources Methods of capture.

Presentation transcript:

Creating textual and visual resources

Overview of this session: types of manuscripts; types of printed documents; types of visual resources; methods of capture; some examples; how-to guidance for rekeying and OCR; OCR exercise.

Types of documents (largely textual): manuscripts; books; periodicals; newspapers; grey literature; documentary surrogates (microfilm etc.).

Types of manuscripts: a huge range, from several centuries BC to the present day (2,500 years of materials). Written on many different materials: papyrus, animal skins, lead tablets, stone, paper, etc. Many different languages, with font and script issues. Also music and images.

Characteristics of manuscripts: unique (even if there are many copies, they will all be different); may be fragile; may have bindings; will need special handling; will need specialist equipment.

What do you want from mss? Capture once for all time? A complete record, including covers, bindings, blank pages, glosses, erasures and palimpsests?

What do you want from mss? Complete colour fidelity? Enhancements at capture stage (UV, infrared)? A record of minute details? What is significant? For example, with parchment, do you want to see the pore marks?

Copyright Corpus Christi College, Cambridge

DIAMM Mss

Handling: conservation practice in human handling; special stands and cradles; light levels; dust-free environment; temperature and humidity to be rigorously controlled. Heat is a real danger.

A note on metadata: intellectual metadata is very difficult to create for visual images, because describing or categorizing images in words is difficult. Can use full-text descriptions, or thesauri and classification systems: Art and Architecture Thesaurus (AAT), Visual Resources Association (VRA), ICONCLASS.

Printed items (largely textual): books; periodicals; newspapers; grey literature; documentary surrogates (microfilm etc.); miscellaneous materials, including musical scores, ephemera, advertisements, cartoons, posters, etc.

Diamond Sutra, world's earliest printed book, AD 868

Goettingen; British Library; Texas; Keio, Japan

News of the World, June 1851; News of the World, June 1918

Penny Illustrated, October 1861; Weekly Dispatch, June 1856

Chopin First Edition

Trade card, 18th C.

Advertisement for booksellers

Imperial War Museum Spanish Civil War Collection: Poster

Reel of microfilm

Microfiche

Characteristics of documents: books. Printed books can date back to the 15th century (e.g. the Gutenberg Bible, Early English Books Online). Such early books may need to be treated more like manuscript materials.

Characteristics of documents: books. Almost certain to be bound: is it possible to disbind? Will they be discarded after scanning? May be printed on unstable media. Different sizes. May have image-rich content. Likely to have language, font and character set issues.

Characteristics of documents: books. Varied internal structures depending on topic and type: recipe books, art history books, children's books. Some common structural features: table of contents, index, bibliography, chapters, footnotes, pages.

Characteristics of documents: periodicals. Will have different structures according to type, but structure is likely to be regular within a title: comics, popular magazines, trade magazines, academic journals. Some common features: articles, images, advertisements, columns, diagrams, footnotes, bibliography, table of contents, etc.

Characteristics of newspapers: large in format; prolific in output; designed as essentially ephemeral; fragile; complex and multipart; change over time; many different types of content (text, images, advertisements).

Characteristics of newspapers: difficult to index; difficult to store because of bulk and volume; inherently unstable (paper is weak and brittle, and deteriorates rapidly); of great interest to researchers; difficult to extract information from.

Characteristics of documents: grey literature. A catch-all category that includes many different kinds of unpublished or semi-published materials: reports, personal papers, conference papers, newsletters. Difficult to characterize: a collection may have many different formats, periods and conditions. Difficult to catalogue.

Characteristics of documents: microform. A good long-term storage alternative but a poor substitute for reading: you lose the sense of the physicality of the original; linear; small format; tiring to read; impossible to search; harder to scan (by eye) than the originals.

Visual artefacts: a huge category of visual materials, including paintings and drawings, fabrics, art objects, technical drawings, maps and 3-D objects.

Croyland, Lincolnshire, John Sell Cotman

Bacchanal, Cecily Brown

Suffragette Banner, Women's Library

A Cosy Couple, Amanda Francis

Technical drawing design

1930 map locating Painswick village, inside a folded, printed change-of-address flier for Phyllis Barron and Dorothy Larcher

Spellman Music Covers Collection, Reading University

Types of photographs: a wide range. Prints; negatives (acetate, nitrate, glass plates, paper); transparencies; slides; daguerreotypes and other special formats; digital originals.

Dressmaking class 1936: preparation for dress parade

John Ruskin's Daguerreotype of a group of windows in the façade of the Casa degli Zane, Venice

Glass plate negative

35mm B&W negative

Digital original

Characteristics of photographs: multiple versions possible (the negative, the print, and copy photography); colour and monotones, where fidelity is vital; may be fragile, dirty and even combustible; may be flexible or rigid, mounted or in strips (e.g. albums, slides, negative strips); will probably need special handling; will benefit from specialist equipment.

Handling: every single interaction with a fragile original can compromise it. Many of these may be hundreds of years old, and we want them to last for hundreds more years. So special handling is crucial.

Handling: conservation practice in human handling; heat levels (most critical, due to build-up); light levels; dust-free environment.

Image quality: how do we know if it is good enough? Check visual sharpness, laterally reversed images, dirt, skew and image completeness. Guidance is available from the RLG and in publications by Franziska Frey (Rochester Institute of Technology).

Capture methods: depend on the nature of the original material, on the available resources, and on the purpose of the digitization. Is it a forensic record of the original, or is it to externalise the textual content? Also consider audience, delivery options and information goals.

Capture methods: scanning. Book scanner; flatbed scanner; drum feed scanner; microfilm scanner. Pictured: Sunrise microfilm scanner; Zeutschel OS10000 A1 Bookscanner.

Case Study: CVMA (Corpus Vitrearum Medii Aevi, medieval stained glass). The content is only renderable from photographs of the subject. Comprehensive database with high levels of descriptive metadata. Further additions will include maps and church plans linked to window images.

Specifications of CVMA Digitisation
Source: 35mm slides, medium and large format transparencies, photographic prints
Scanning resolution: 35mm, 2,700 dpi; medium format, 1,200 dpi; large format, 1,000 dpi; print, 600 dpi
Colour: all 24-bit RGB
File formats: TIFF master (uncompressed); JPEG for web
Courtesy of CVMA Project, Courtauld Institute of Art
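As a rough illustration of what these specifications mean for storage, the sketch below estimates the size of the uncompressed pixel data in a master file from the scan dimensions, resolution and bit depth. The function name, the nominal 36 x 24 mm frame size for a 35mm slide and the 10 x 8 inch print are assumptions made for the example; they do not come from the CVMA project, and real TIFF files also carry some header overhead.

```python
MM_PER_INCH = 25.4


def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel=24):
    """Approximate size of raw pixel data for an uncompressed scan.

    Ignores file headers and metadata, so a real TIFF will be slightly larger.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumption: a 35mm frame is nominally 36 x 24 mm.
slide_bytes = uncompressed_size_bytes(36 / MM_PER_INCH, 24 / MM_PER_INCH, 2700)

# Assumption: a 10 x 8 inch photographic print, scanned at 600 dpi.
print_bytes = uncompressed_size_bytes(10, 8, 600)

print(f"35mm slide at 2,700 dpi: about {slide_bytes / 1e6:.1f} MB")
print(f"10 x 8 in print at 600 dpi: about {print_bytes / 1e6:.1f} MB")
```

Multiplying a figure like this by the number of items in a collection gives a first estimate of master storage, before any JPEG derivatives for the web are added.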

Case Study: Shetland Isles Museum. Glass plate collection of more than 80,000 items. In-house scanning using flatbed scanners. Specification: 600 dpi, 8-bit greyscale. Delivered on the web with the option to buy content. Online images are thus relatively small.

Digitization issues: preparation of materials; assessing the collection; organization of data resources.

Scanning into electronic formats. Preparation of materials; assess the collection. STOP POINT 1.

Scanning into electronic formats. STOP 2: OCR for indexing. STOP 3: OCR/rekeying for end-user presentation. STOP 4: metadata. STOP 5: SGML/XML.

Digitization issues: in every case you have to assess the nature of the collection, prepare the collection for digitization, and decide how to organize the end information resource.

Creating full text: if documents are scanned to digital images with no added value, the result is digital microfilm. This has many advantages for access, but much more is possible...

Creating full text: there are a number of ways to create manipulable text: rekeying; OCR (Optical Character Recognition) with correction; uncorrected OCR. These will be discussed in detail later.

Rekeying: the most costly option, but less expensive than it was! Very accurate if done well. Can be used instead of providing a digital image, or attached to a digital image as a means of searching.

Case study: Old Bailey Court Session Papers. Largest single digital resource on non-elite peoples. 58,000 pages, amounting to more than 250 million characters rekeyed. Rekeying is the most effective way to address the content of the originals. XML markup is the only way to deliver the content in a structured way.
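To give a sense of scale, the arithmetic behind these figures can be sketched as follows. The page and character counts are the ones on the slide (taking 250 million as a floor), and the accuracy rates are the general figures for single and double rekeying quoted in the how-to guidance later in this session; they are not measured results for the Old Bailey project.

```python
PAGES = 58_000
CHARACTERS = 250_000_000  # the slide says "more than" 250 million


def expected_errors(total_chars, accuracy):
    """Expected number of wrong characters at a given accuracy rate."""
    return round(total_chars * (1 - accuracy))


chars_per_page = CHARACTERS / PAGES

# General accuracy figures from the rekeying guidance, used as assumptions.
single_pass_errors = expected_errors(CHARACTERS, 0.995)
double_pass_errors = expected_errors(CHARACTERS, 0.9999)

print(f"Average characters per page: {chars_per_page:,.0f}")
print(f"Expected errors, single rekeying: {single_pass_errors:,}")
print(f"Expected errors, double rekeying: {double_pass_errors:,}")
```

At this volume, the difference between the two accuracy rates is the difference between roughly a million residual errors and a few tens of thousands, which is one reason accuracy targets need to be agreed before work starts.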

OCR: pattern recognition algorithms which can convert images of alphanumeric characters into ASCII code. Been around since the 1970s: KDEM (Kurzweil Data Entry Machine); hardware and software were very expensive, so specialist bureaux offered it as a service; move to desktop OCR in the mid-to-late 1990s. See handout for OCR guidance.

OCR accuracy: this depends on the quality of the image being processed. 99%+ is possible. To what degree is accuracy important? This can depend on the intended use of the captured text.
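One common way to put a number on OCR accuracy is to compare the OCR output for a sample page against a carefully checked transcription and count the character-level edits needed to turn one into the other. The sketch below is a minimal illustration of that idea using the Levenshtein edit distance; the function names and the sample strings are made up for the example, and real evaluations usually work on larger samples.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth, ocr_output):
    """Share of ground-truth characters the OCR output got right (0.0 to 1.0)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_output)
    return max(0.0, 1 - errors / len(ground_truth))


# Hypothetical sample: a checked transcription and a typical OCR misreading.
truth = "The quick brown fox"
ocr = "Tbe quick brovvn fox"

print(f"Character errors: {edit_distance(truth, ocr)}")
print(f"Character accuracy: {character_accuracy(truth, ocr):.1%}")
```

Whether the resulting figure is good enough depends, as the slide says, on the intended use: text used only for background searching can tolerate more errors than text presented to the end user.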

Case study: Refugee Studies Centre Library. Grey literature collection. Earliest documents from the 1960s, so copyright is a critical issue. Making content widely available is the key aim. Forensic fidelity unimportant. Need to capture a large volume.

Case study: Refugee Studies Centre Library. Methods: can do destructive scanning; digitization outsourced; initially, uncorrected OCR also outsourced; later, use Olive Software Active Paper Archive; OCR for searching, page image for viewing.
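The "OCR for searching, page image for viewing" pattern can be sketched in a few lines: the uncorrected OCR text is never shown to the reader, it only feeds an index that points back to the page images. The code below is an illustrative sketch and not how Olive Software's product works; the file names and page texts are hypothetical.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word in the OCR text to the page images it appears on.

    `pages` maps a page image file name to that page's (uncorrected) OCR text.
    """
    index = defaultdict(set)
    for image_name, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(image_name)
    return index


def search(index, query):
    """Return the page images whose OCR text contains every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


# Hypothetical pages; the second contains a typical OCR error ("rn" for "m").
pages = {
    "report_0001.tif": "Refugee settlement survey 1968",
    "report_0002.tif": "Survey of camp conditions and settlernent data",
}
index = build_index(pages)

print(search(index, "survey"))
print(search(index, "settlement survey"))
```

The second query finds only the first page, because the OCR error on the second page hides the word "settlement". This is the trade-off of uncorrected OCR: the reader always sees a faithful page image, but some relevant pages will not be retrieved.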

How-to guidance: rekeying. Single rekeying: one pass with checks; generally 99.5% accurate. Double rekeying: keyed twice, differences checked; generally 99.99% accurate. Rekeyers should key what they see, not what they think! Assume they know nothing. Textual layout and structure provide clues for rekeyers. Detail all the variations, special characters and spellings that you can.
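The "keyed twice, differences checked" step of double rekeying can be illustrated with a simple comparison of the two keyed versions, so that only the places where they disagree need to be checked against the original. This is a minimal sketch using Python's standard difflib module; the function name and sample sentences are made up, and production rekeying bureaux will have their own tools.

```python
import difflib


def keying_differences(first_pass, second_pass):
    """List the places where two independently keyed versions disagree.

    Each entry is (operation, text in first pass, text in second pass).
    """
    matcher = difflib.SequenceMatcher(None, first_pass, second_pass,
                                      autojunk=False)
    return [
        (tag, first_pass[i1:i2], second_pass[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical example: the second keyer dropped one letter.
first = "The prisoner was indicted for stealing"
second = "The prisoner was indited for stealing"

for operation, in_first, in_second in keying_differences(first, second):
    print(f"{operation}: first pass {in_first!r}, second pass {in_second!r}")
```

The gain in accuracy comes from the fact that two keyers rarely make the same mistake in the same place, so most errors show up as disagreements to be resolved.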

How-to guidance: rekeying. Example from the handout. Note the detail, the variations and the quality assurance.

How-to guidance: OCR. See the handout. Note the need to understand the nature of the document: nature of the original; nature of the printing; language; uniformity; text alignment; complexity of alignment; lines, graphics and pictures; handwriting.

OCR quiz: look at the examples on screen. Make a note of any features you think might affect OCR accuracy. Have a guess at what you think the accuracy, in percentage terms, might be.