Planning a Digital Library
  How to Build a Digital Library
  Ian H. Witten and David Bainbridge
Planning a Digital Library
  Responsibilities
  Technology to be used
    Greenstone, DSpace, Fedora, Eprints
  Metadata standard to be used
    Dublin Core, METS, etc.
  Types of access
  Retrospective or born digital?
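
To make the metadata-standard decision above more concrete, here is a minimal sketch of a Dublin Core record built with Python's standard library. The namespace URI and the element names `title` and `creator` come from the Dublin Core element set; the wrapper element `metadata` and the helper `dublin_core_record` are assumptions made for illustration, not the format required by any of the systems listed.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Serialize {element name: [values]} as a small Dublin Core XML record.

    The <metadata> wrapper is a hypothetical container chosen for this sketch.
    """
    root = ET.Element("metadata")
    for name, values in fields.items():
        for value in values:
            element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            element.text = value
    return ET.tostring(root, encoding="unicode")


record = dublin_core_record({
    "title": ["How to Build a Digital Library"],
    "creator": ["Ian H. Witten", "David Bainbridge"],
})
print(record)
```

Repeating an element, as with `creator` here, is how Dublin Core usually records several values for one property.
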
Responsibilities
  Legal issues
    Distributing information carries responsibilities
    Copyright
  Social issues
    Respect the customs of the community
    Both source and use communities
  Ethical issues
Ideology
  Ideology: a clear conception of what you plan to achieve with the collection of information
  Ideology of a collection
    Purpose
    Objectives
    Principles
      Guide what is to be included in the collection
  Placed in the introduction to the digital library
Document versus Work
  Work
    The disembodied content of a message
    Pure information
  Document
    Traditional library: a physical object that embodies the work
    Digital library: a particular electronic encoding of a work
  How are distinctions made between different manifestations of a single work?
Converting an Existing Library
  Digitizing an existing paper-based collection is the most expensive kind of project
  Consider whether it is worth the effort and expense
  16th-century Mexican library
  Incunabula
  Broadsides
Advantages of Digital Libraries
  Easier to access remotely than conventional libraries
  Powerful search and browsing
  Easier to add services
  Easier to organize and reorganize
  Easier to maintain?
  Easier to preserve?
  Does your collection have these advantages?
Questions to Address
  Will the digital library coexist with an existing physical one?
  What is the collection’s growth rate?
  How dynamic is the collection?
  Should you consider outsourcing the whole digital library operation?
  Could user needs be satisfied in alternative ways?
Prioritizing Materials
  Special collections and unique materials
  Rare books and manuscripts
  High-use items
  Research and teaching materials
  Low-use items
Criteria for Digital Conversion
  Intellectual content
    Scholarly value
    Desire to enhance access to information
    Funding available
  Educational value
    Classroom support
    Background reading
    Distance education
  Institutional
    Resource sharing
    Promote strengths of an institution
    Reduce handling of fragile originals
    Cost and space savings
Building a New Collection
  New material
    The copyright holder may be the best one to create a digital collection
  Metadata
    Where will it come from?
Bibliographic Entities
  Documents
  Works
    Distinction between document and work
  Editions
    Electronic documents use terms such as version, release, and revision
  Authors
    Authority control: standardized names for authors
  Titles
    Attributes of works
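
Authority control, as named on the slide above, can be pictured as a lookup from name variants to one standardized form. The sketch below is a toy version under stated assumptions: the `AUTHORITY_FILE` contents, the variant spellings, and the helper names are all hypothetical, and a real authority file would hold far more than spelling variants.

```python
# Hypothetical authority file: standardized name -> known variant spellings.
AUTHORITY_FILE = {
    "Witten, Ian H.": {"Ian H. Witten", "I. H. Witten", "Witten, I. H."},
    "Bainbridge, David": {"David Bainbridge", "D. Bainbridge"},
}


def normalize(name):
    """Lowercase a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def build_lookup(authority_file):
    """Map every normalized variant (and the standardized form itself) to the standardized name."""
    lookup = {}
    for authorized, variants in authority_file.items():
        lookup[normalize(authorized)] = authorized
        for variant in variants:
            lookup[normalize(variant)] = authorized
    return lookup


def authorized_name(name, lookup):
    """Return the standardized form of a name, or None if it is not in the file."""
    return lookup.get(normalize(name))


lookup = build_lookup(AUTHORITY_FILE)
print(authorized_name("  ian h.  witten ", lookup))
print(authorized_name("D. Bainbridge", lookup))
print(authorized_name("Unknown Author", lookup))
```

Returning `None` for an unknown name leaves the decision to a cataloguer instead of guessing a match.
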
Bibliographic Entities
  Subjects
    Two approaches to assigning subjects automatically:
      Key-phrase extraction
      Key-phrase assignment
    Literary and artistic works
      Style, form, content, genre
  Library of Congress Subject Headings (LCSH)
    Controlled vocabularies: 30,000 pages, 2,000,000 entries
    Hierarchical relationship of broader and narrower topics
  Subject classifications
    Traditional libraries have a linear arrangement
    A digital collection can be rearranged at the click of a mouse
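
The difference between the two automatic approaches named above can be shown with a toy sketch: extraction selects phrases that occur in the document itself, while assignment selects headings from a controlled vocabulary, whether or not the heading text appears in the document. Everything here is an assumption for illustration: the stopword list, the frequency-based scoring, and the three-entry vocabulary with its cue words are hypothetical and are not taken from LCSH or from any production system.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def extract_key_phrases(text, top_n=3):
    """Extraction: return the most frequent two-word phrases found in the text."""
    words = tokenize(text)
    pairs = [
        f"{first} {second}"
        for first, second in zip(words, words[1:])
        if first not in STOPWORDS and second not in STOPWORDS
    ]
    return [phrase for phrase, _ in Counter(pairs).most_common(top_n)]


def assign_key_phrases(text, vocabulary):
    """Assignment: return vocabulary headings whose cue words occur in the text."""
    words = set(tokenize(text))
    return sorted(heading for heading, cues in vocabulary.items() if words & cues)


# Hypothetical controlled vocabulary: heading -> cue words that trigger it.
VOCABULARY = {
    "Information retrieval": {"search", "index"},
    "Book digitization": {"scanned", "ocr"},
    "Copyright": {"copyright", "license"},
}

text = ("Digital libraries store scanned books. "
        "Digital libraries also index scanned books for search.")
print(extract_key_phrases(text, top_n=2))
print(assign_key_phrases(text, VOCABULARY))
```

Note that the assigned headings never appear verbatim in the sample text, which is the practical appeal of a controlled vocabulary.
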
Digitizing Documents
  Digitization
    The process of taking traditional library materials and converting them to electronic form
    Allows storage and manipulation by a computer
    The process is time-consuming and expensive
Stages of Digitization
  Scanning
    Creates a digitized image of each page
    Usually presented to the user
  Optical character recognition (OCR)
    Creates an encoded representation of the textual content of the pages
    Necessary for full-text indexing
    Allows searching
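
To illustrate why OCR text is needed for full-text indexing and searching, the sketch below builds a very small inverted index over page text. It is a simplified illustration, not how any particular digital library system indexes its collections; the page contents and function names are hypothetical.

```python
import re
from collections import defaultdict


def words_in(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page ids whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in words_in(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages that contain every word of the query."""
    words = words_in(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical OCR output, keyed by page number.
pages = {
    1: "Rare books and manuscripts",
    2: "Rare broadsides",
    3: "Research and teaching materials",
}
index = build_index(pages)
print(search(index, "rare"))
print(search(index, "rare books"))
```

A scanned page image alone offers nothing to put in such an index, which is why the recognition stage matters for searching.
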
Decisions in Scanning
  Black-and-white, grayscale, or color
  Resolution
    Number of pixels per linear unit
  Bits per pixel
    Monochrome display: 16 or 256 levels of gray (4 or 8 bits per pixel)
    Color display: up to 24 or 32 bpp
  Quality
    Higher quality increases storage space and access time
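
The trade-off between quality and storage can be made concrete with a little arithmetic: uncompressed size is pixel width times pixel height times bits per pixel. The sketch below assumes a hypothetical 8.5 by 11 inch page scanned at 300 dots per inch; the page size, the resolution, and the function name are illustrative choices, and real files are usually compressed and therefore smaller.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the uncompressed size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Hypothetical page: 8.5 x 11 inches scanned at 300 dpi.
for label, bpp in [("black-and-white", 1), ("grayscale", 8), ("color", 24)]:
    size = uncompressed_size_bytes(8.5, 11, 300, bpp)
    print(f"{label:16s} {bpp:2d} bpp  {size / 1_000_000:.1f} MB")
```

Under these assumptions a single color page takes 24 times the space of a black-and-white one before compression, which is why the color and resolution decisions dominate storage planning.
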
Optical Character Recognition
  Manual cleanup is necessary
  Less efficient than manual keying when the accuracy rate drops below 95 percent
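
One way to check a sample page against the 95 percent figure is to compare the OCR output with a hand-keyed reference and compute character accuracy from the edit distance. This is a sketch under assumptions: the slide does not say how the rate is measured, so the definition used here (one minus edit distance divided by reference length) and the sample strings are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text, reference_text):
    """Fraction of reference characters recovered correctly (assumed definition)."""
    if not reference_text:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(ocr_text, reference_text)
    return max(0.0, 1.0 - errors / len(reference_text))


THRESHOLD = 0.95  # the 95 percent figure from the slide

accuracy = character_accuracy("digita1 1ibrary", "digital library")
print(f"accuracy = {accuracy:.3f}")
print("consider manual keying" if accuracy < THRESHOLD else "OCR plus cleanup")
```
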
Interactive OCR
  Optical character recognition should be done as an interactive process
  Acquisition
    Input from a scanner or read from a file
  Cleanup
    Filtering, deskewing, and manual cleanup of unwanted areas
  Page analysis
    Examine the layout
  Recognition
    The “OCR” part
  Checking
  Saving
    Plain text, HTML, RTF, PDF, MS Word
Page Handling
  Unbinding
  Microfiche or microfilm
  Two most expensive parts
    Handling the paper
    OCR
Planning a Digitization Project
  Outsourcing
  Cost
    $1 to $2 for scanning and OCR
  Quality control
  Verification