Chapter Two
Preliminaries: Sorting out the ingredients
How to Build a Digital Library
Ian H. Witten and David Bainbridge

Planning a Digital Library
- Responsibilities
- Technology to be used
  - Greenstone software
- Metadata
  - Summary information
- Types of access
- Digitizing documents
  - Majority of the work

Responsibilities
- Legal Issues
  - Distributing information carries responsibilities
  - Copyright
- Social Issues
  - Respect customs of the community
  - Both source and use communities
- Ethical issues

Fundamental Questions
- What is the purpose of the library?
- What are the principles for including documents?
- When does one document differ from another?

Sources of Material
- Existing library to be converted to digital form
- An existing collection of material to be made available as a digital library
- Material already existing on the Web to be organized and presented via a portal

Sources of Material
- Ideology
- Converting an existing library
- Building a new collection
- Virtual libraries

Ideology
- Ideology: a clear conception of what you plan to achieve with the collection of information
- Ideology of a Collection:
  - Purpose
  - Objectives
  - Principles (guide what is to be included in the collection)

Introduction to Digital Library
- State the purpose of the collection
- Describe how the collection is organized

Document versus Work
- Work
  - The disembodied content of a message
  - Pure information
- Document
  - Traditional library: a physical object that embodies the work
  - Digital library: a particular electronic encoding of a work
- How are distinctions made between different manifestations of a single work?

Converting an Existing Library
- Digitizing an existing paper-based collection is the most expensive kind of project
- Consider whether it is worth the effort and expense

Advantages of Digital Libraries
- Easier to access remotely than conventional libraries
- Powerful search and browsing
- Easier to add additional services

Questions
- Will the digital library coexist with an existing physical one?
- What is the collection’s growth rate?
- How dynamic is the collection?
- Should you consider outsourcing the whole digital library operation?
- Could user needs be satisfied in alternative ways?

Prioritizing Materials
- Special collections and unique materials
- Rare books and manuscripts
- High-use items
- Research and teaching materials
- Low-use items

Criteria for Digital Conversion
- Intellectual content
- Scholarly value
- Desire to enhance access to information
- Funding available
- Educational value
- Classroom support
- Background reading
- Distance education
- Institutional
- Resource sharing
- Promote strengths of an institution
- Reduce handling of fragile originals
- Cost and space savings
- Copyright

Principles for Development
- Utility
- Local imperative
- Novelty
- Intertextuality
- Resources
- Commitment to the transition

Building a New Collection
- New material
  - The copyright holder may be the best one to create a digital collection
- Metadata
  - Where will it come from?

Virtual Libraries
- A portal to information that is in electronic form but located elsewhere on the Internet
- Source information is already available
- Some metadata is available

Virtual Libraries
- Select the content
  - Define a purpose or theme for the library
  - Seek and filter information
  - Focused Web crawling
- Obtain additional metadata
  - Aids in the organization of the collection
  - The higher the educational value of a resource, the more time should be taken in generating its description

Generating Metadata in a Virtual Library
- Automatically generated
  - URL
  - Author-supplied metadata
  - Keyword extraction
- Manual review
  - Edit and enrich the automatically generated metadata
- Intensive description by a human expert
  - Provides extensive metadata

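The "automatically generated" tier can be illustrated with a minimal sketch, assuming the resources are HTML pages whose authors filled in `<title>` and `<meta>` tags. The names `MetaExtractor` and `harvest`, the sample page, and the URL are hypothetical and chosen only for illustration. The sketch uses Python's standard `html.parser` module and is not how any particular digital library system does it.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = attrs.get("name")
            content = attrs.get("content")
            if name and content:
                # Author-supplied metadata, keyed by lowercased name.
                self.metadata[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data.strip()


def harvest(url, html):
    """Build a starting metadata record from the URL and the page itself."""
    parser = MetaExtractor()
    parser.feed(html)
    record = {"url": url}
    record.update(parser.metadata)
    return record


if __name__ == "__main__":
    # Hypothetical page used only to exercise the sketch.
    page = (
        "<html><head><title>Village Water Supply</title>"
        '<meta name="author" content="A. Writer">'
        '<meta name="keywords" content="water, wells">'
        "</head><body><p>Body text.</p></body></html>"
    )
    print(harvest("http://example.org/water.html", page))
```

A record produced this way is only a starting point: the manual review and expert description tiers would then edit and enrich it.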
Bibliographic Organization
- Objectives of a bibliographic system
- Bibliographic entities

Original Objectives of a Bibliographic System
- Finding
  - User seeks a known document when information such as author, title or subject is known
- Collocation
  - “To place together or in proper order”
  - Locating similar information by subject matter, author, etc.
- Choice
  - User must choose between similar documents
  - Bibliographically in terms of edition
  - Topically in terms of character

Current Objectives of a Bibliographic System
- Locate
  - Find entities in a file or database as the result of a search using attributes or relationships of the entities
- Identify
  - Confirm entity described in a record is the one sought
- Select
  - Verify that entity is what the user needs
- Acquire
  - Obtain access through purchase, loan or online access
- Navigate
  - Go through a bibliographic database
  - Find works related by generalization, association, aggregation
  - Find attributes related by equivalence, association and hierarchy

Documents in Digital Libraries
- Document
  - A particular electronic encoding of a work
  - Can be easily duplicated
  - Uncertain boundaries
- Digital libraries should present users with an image of stability and continuity, as though electronic documents were identifiable, discrete objects like physical ones

Bibliographic Entities
- Documents
- Works
  - Distinction between document and work
- Editions
  - Electronic documents use terms such as version, release and revision
- Authors
  - Authority control: standardized names for authors
- Titles
  - Attributes of works

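Authority control can be pictured as a lookup from the name variants found in documents to one standardized heading. The sketch below is a toy illustration only: the variant spellings and the `AUTHORITY` table are hypothetical, and real authority files are far larger and maintained by cataloguers.

```python
# Hypothetical authority file: normalized variant -> standardized heading.
AUTHORITY = {
    "ian h. witten": "Witten, Ian H.",
    "i. h. witten": "Witten, Ian H.",
    "witten, ian h.": "Witten, Ian H.",
    "david bainbridge": "Bainbridge, David",
    "bainbridge, david": "Bainbridge, David",
}


def normalize(name):
    """Lowercase the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def authorized_name(name):
    """Return the standardized heading, or None if the variant is unknown."""
    return AUTHORITY.get(normalize(name))


if __name__ == "__main__":
    for variant in ["Ian H.  Witten", "BAINBRIDGE, David", "A. Nother"]:
        print(variant, "->", authorized_name(variant))
```

Returning `None` for an unknown variant, rather than guessing, leaves the decision to a human cataloguer.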
Bibliographic Entities
- Subjects
  - Two approaches to automatically assign subject:
    - Key-phrase extraction
    - Key-phrase assignment
  - Literary and artistic works
    - Style, form, content, genre
- Library of Congress Subject Headings (LCSH)
  - Controlled vocabularies: 30,000 pages, 2,000,000 entries
  - Hierarchical relationship of broader and narrower topics
- Subject classifications
  - Traditional libraries have a linear arrangement
  - Digital collection can be rearranged at the click of a mouse

Modes of Access
- Web
- Terminal in physical library
- Standalone computer with CD-ROM or DVD
- Distributed System
- Restricting Access
  - Firewalls
  - Password protection
  - Watermarking

Digitizing Documents
- Digitization
  - The process of taking traditional library materials and converting them to electronic form
  - Allows storage and manipulation by a computer
  - The process is time-consuming and expensive

Stages of Digitization
- Scanning
  - Creates a digitized image of each page
  - Usually presented to the user
- Optical Character Recognition (OCR)
  - Creates a digital representation of the textual content of the pages
  - Necessary for full-text indexing
  - Allows searching

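Why OCR text is needed for full-text indexing can be seen in a minimal inverted-index sketch: each word maps to the pages whose recognized text contains it, something a page image alone cannot provide. The function names `build_index` and `search` and the sample page texts are hypothetical; production indexers add stemming, ranking and compression.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the set of page numbers containing it.

    `pages` is a dict of page number -> OCR text for that page.
    """
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_no)
    return index


def search(index, query):
    """Return the sorted page numbers that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical OCR output for three scanned pages.
    ocr_pages = {
        1: "Digital libraries store documents",
        2: "Scanning creates an image of each page",
        3: "OCR text makes scanned documents searchable",
    }
    idx = build_index(ocr_pages)
    print(search(idx, "documents"))
    print(search(idx, "scanned documents"))
```

In this arrangement the page image is still what the user sees, while the OCR text serves only to locate the page.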
Digitizing Documents
- Scanning
- Optical character recognition
- Interactive OCR
- Page handling
- Planning an image digitization project
- Inside an OCR shop
- An example project

Scanning
- Produces a digitized image of each page
- Resembles a digitized photograph

Decisions in Scanning
- Black-and-white, grayscale or color
- Resolution
  - Number of pixels per linear unit
- Bits per pixel
  - Monochrome display: 16 or 256 levels of gray
  - Color display: up to 24 or 32 bits per pixel
- Quality
  - Increases storage space and time to access

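The trade-off between quality and storage space follows directly from resolution and bits per pixel. The sketch below computes the uncompressed size of one scanned page. The page size (8.5 by 11 inches) and the 300 dots-per-inch resolution are assumptions chosen for illustration, not figures from the slides, and real files are normally compressed.

```python
def raw_image_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page.

    width_in, height_in: page dimensions in inches
    dpi: resolution in pixels per inch (same in both directions)
    bits_per_pixel: 1 for black-and-white, 8 for 256 gray levels, 24 for color
    """
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return (pixels * bits_per_pixel + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    # Assumed page: 8.5 x 11 inches scanned at 300 dpi.
    for label, bpp in [("black-and-white", 1), ("grayscale", 8), ("color", 24)]:
        size = raw_image_bytes(8.5, 11, 300, bpp)
        print(f"{label:16s} {bpp:2d} bits/pixel: {size / 1_000_000:.1f} MB")
```

Size grows linearly with bits per pixel and with the square of the resolution, so doubling the resolution quadruples the storage needed.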
Optical Character Recognition
- Produces a character-by-character representation of the document
- Transforms the scanned image into a digitized representation of the page content
- Manual cleanup is necessary
- Less efficient than manual keying when the accuracy rate drops below 95 percent

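Reading the 95 percent threshold as character-level accuracy, a short calculation shows how much manual cleanup it implies. The figure of 2,000 characters per page is an assumption made only for this example, and `expected_errors` is a hypothetical helper name.

```python
def expected_errors(chars_per_page, accuracy_percent):
    """Expected number of wrongly recognized characters on one page."""
    return chars_per_page * (100 - accuracy_percent) / 100


if __name__ == "__main__":
    chars = 2000  # assumed page length, not a figure from the slides
    for accuracy in (99, 95, 90):
        errors = expected_errors(chars, accuracy)
        print(f"{accuracy}% accuracy: about {errors:.0f} errors to correct per page")
```

Under that assumption, 95 percent accuracy leaves about 100 characters per page to find and fix by hand, which suggests why retyping can become the cheaper option below that level.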
Interactive OCR
- Optical character recognition should be done as an interactive process
- Acquisition
  - Input from scanner or read a file
- Cleanup
  - Filtering, deskewing and manual cleanup of unwanted areas
- Page analysis
  - Examine layout
- Recognition
  - The “OCR” part
- Checking
- Saving
  - Plain text, HTML, RTF, PDF, MS Word

Page Handling
- Unbinding
- Microfiche or microfilm
- Two most expensive parts
  - Handling the paper
  - OCR

Planning a Digitization Project
- Outsourcing
- Cost
  - $1 to $2 for scanning and OCR
- Quality control
- Verification