Lecture 12 Why metadata? CS 502 Computing Methods for Digital Libraries Cornell University – Computer Science Herbert Van de Sompel herbertv@cs.cornell.edu
Notes
• Carl Lagoze on Wednesday
• No lab on Friday, but Paul Ginsparg on 04/03
• XML Schema & XSLT: later
Content – Data – Metadata
• Content refers to digital library materials as information that is of interest to a user.
• Data emphasizes the bits and bytes to be processed by a computer.
• Metadata: data about data.
Metadata – focus on description/discovery
• data about data
• origins in library cataloguing and A&I (abstracting and indexing) databases
• now: an amplification of traditional bibliographic cataloguing practices in an electronic environment
• now: any data used to aid the identification, description and location of networked electronic resources
• actually, it is more than that
Metadata – broader
• descriptive: facilitating resource discovery and identification (e.g., a record in an OPAC system)
• administrative: facilitating resource management within a collection (e.g., a loan record in an OPAC system)
• structural: binding together the components of complex information objects (e.g., the series title in a record in an OPAC system)
Metadata – evolution
[Figure: metadata evolving from descriptive/discovery metadata for library objects to descriptive/discovery, administrative and structural metadata for library objects and networked resources]
Metadata
• Traditionally, metadata is stored separately from the objects that it describes.
• For digital objects, it is sometimes embedded in the objects (cf. KWF).
• Usually the metadata is a set of text fields.
• Textual metadata can be used to describe non-textual objects, e.g., software, images, music, …
Metadata – why?
• Some methods of information discovery search descriptive metadata about the objects.
• More generally, metadata enables digital library services, either explicitly (discovery metadata) or implicitly (terms and conditions).
• It helps to impose order on chaos.
• It enables automated discovery and manipulation of objects.
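Discovery by searching descriptive metadata can be sketched in a few lines. The records, their identifiers (`obj-1`, …) and the `search` function below are made up for illustration, and the matching is plain substring matching, which is much cruder than what a real search service would do. Note that the image and the software are found through their text fields only.

```python
# Hypothetical descriptive metadata records: plain text fields, kept
# separately from the (possibly non-textual) objects they describe.
records = [
    {"id": "obj-1", "title": "Campus strategies for libraries and electronic information", "type": "book"},
    {"id": "obj-2", "title": "Aerial photograph of a campus", "type": "image"},
    {"id": "obj-3", "title": "Library circulation software", "type": "software"},
]


def search(records, query):
    """Return the ids of records whose text fields contain every query term."""
    terms = query.lower().split()
    hits = []
    for record in records:
        # Concatenate all fields except the identifier into one searchable string.
        text = " ".join(str(v) for k, v in record.items() if k != "id").lower()
        if all(term in text for term in terms):
            hits.append(record["id"])
    return hits


print(search(records, "campus"))            # ['obj-1', 'obj-2']
print(search(records, "library software"))  # ['obj-3']
```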
Metadata – generation (traditional)
[Figure: diagram relating the object, cataloguing rules, reference data and the resulting metadata record]
Metadata – generation (traditional)
Advantage:
• Human expertise leads to high-quality catalogs and indexes
Disadvantages:
• Expensive ($50+ per record)
• Time-consuming
• Requires cumbersome cataloguing rules
• Slow to adapt to new formats and types of digital objects
Human cataloguing and indexing is too expensive to apply to more than a small proportion of digital objects => automatic generation of metadata
Metadata – roots (library cataloguing)
Anglo-American Cataloguing Rules (AACR2)
• rules for what goes into each field of a catalog record
MARC format
• an exchange format for catalog records
"MARC catalog"
• a catalog in MARC format, where the content of each field follows AACR2
Citation: a monograph -- book! Caroline R. Arms, editor, Campus strategies for libraries and electronic information. Bedford, MA: Digital Press, 1990.
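[Figure: MARC record with labels: MARC tag, MARC indicator, MARC field, MARC subfield, MARC subfield code]
[Figure: MARC record with labels: ISBN; title statement; imprint (location, publisher, year); collation; series title]
[Figure: MARC record in exchange format with labels: leader, directory, 001 field, field terminator]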
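How the leader, directory, fields and terminators fit together can be sketched as follows. This is a minimal illustration of the MARC exchange layout, not a full implementation: it assumes ASCII-only content (so characters and bytes coincide), the helper names `build_record` and `parse_record` are hypothetical, and the sample record is hand-built from the citation above with an invented control number and illustrative indicator values. It is not a real catalog record.

```python
FT = "\x1e"  # field terminator
RT = "\x1d"  # record terminator
SF = "\x1f"  # subfield delimiter (precedes each subfield code)


def build_record(fields):
    """Build a record from (tag, body) pairs; body excludes the field terminator."""
    directory, data, pos = "", "", 0
    for tag, body in fields:
        chunk = body + FT
        # Directory entry: 3-char tag, 4-digit field length, 5-digit start offset.
        directory += f"{tag}{len(chunk):04d}{pos:05d}"
        data += chunk
        pos += len(chunk)
    base = 24 + len(directory) + 1   # leader + directory + its terminator
    total = base + len(data) + 1     # plus the record terminator
    # 24-character leader: record length at 0-4, base address of data at 12-16.
    leader = f"{total:05d}nam  22{base:05d}   4500"
    return leader + directory + FT + data + RT


def parse_record(raw):
    """Split a record into its leader and a list of (tag, indicators, content)."""
    leader = raw[:24]
    base = int(leader[12:17])
    directory = raw[24:base - 1]
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        body = raw[base + start:base + start + length].rstrip(FT)
        if tag < "010":
            # Control fields (001-009) have no indicators or subfields.
            fields.append((tag, None, body))
        else:
            indicators = body[:2]
            subfields = [(s[0], s[1:]) for s in body[2:].split(SF)[1:]]
            fields.append((tag, indicators, subfields))
    return leader, fields


sample = build_record([
    ("001", "example-001"),  # invented control number
    ("245", "00" + SF + "aCampus strategies for libraries and electronic information /"
            + SF + "cCaroline R. Arms, editor."),
    ("260", "  " + SF + "aBedford, MA :" + SF + "bDigital Press," + SF + "c1990."),
])

leader, fields = parse_record(sample)
for tag, indicators, content in fields:
    print(tag, repr(indicators), content)
```

The directory is what makes variable-length and repeated fields possible: a reader finds each field by its offset and length rather than by a fixed position.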
MARC: the good news
A great achievement:
• Developed in the 1960s
• Magnetic tape exchange format for printing catalog records
• Even at the dawn of computing, it supported: mixed upper and lower case; variable-length fields and repeated fields; non-Roman scripts
• 100(?) million records with standard content and format
• Thousands of trained librarians (millions?)
MARC: the bad news
A great problem:
• Not designed for computer algorithms
• One record per item (poor links between records)
• Tied to traditional materials and traditional practices
• Not Unicode
• 100 million records at $50+/record
A classic legacy system!
Metadata – simplicity/complexity
Variety of metadata formats for description/discovery:
• basic, proprietary records used in global internet search services;
• simple attribute/value records such as the ROADS templates used in eLib subject services;
• unqualified Dublin Core (15 elements only);
• the more structured TEI and MARC formats;
• qualified Dublin Core;
• detailed formats such as CIMI and EAD, typically applied to archival material.
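To give a feel for the simple end of this range, the sketch below writes the monograph citation from the earlier slide as an unqualified Dublin Core record in XML. The element names in `DC_ELEMENTS` and the namespace URI are those of the Dublin Core element set, version 1.1; the wrapper element `record`, the helper `dc_record` and the mapping of the editor to `contributor` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Element names of the unqualified Dublin Core element set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dc_record(pairs):
    """Build a flat record from (element, value) pairs; elements may repeat."""
    root = ET.Element("record")  # hypothetical wrapper element
    for name, value in pairs:
        if name not in DC_ELEMENTS:
            raise ValueError(f"not an unqualified Dublin Core element: {name}")
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


record = dc_record([
    ("title", "Campus strategies for libraries and electronic information"),
    ("contributor", "Arms, Caroline R. (editor)"),
    ("publisher", "Digital Press"),
    ("date", "1990"),
    ("type", "Text"),
])
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Compared with the MARC record, there are no indicators, subfields or directory: every element is optional, repeatable and holds plain text.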
Metadata – one-size-fits-all/application profiles
There is an evolution from a “one size fits all” concept for metadata towards:
• the use of a specific format depending on the purpose;
• the co-existence of formats in relation to an object;
• combining metadata elements from various formats.
The choice of format can depend on:
• the functional purpose of the metadata: [description/discovery/location]; [administration]; [structuring]
• the level of detail required to fulfill the purpose
• the discipline/domain/audience of the objects that are described
• legacy issues
• interoperability requirements
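Combining elements from various formats can be pictured as a profile that lists which elements are borrowed from which scheme. In the sketch below, the `dc` elements come from Dublin Core, while the `admin` and `struct` schemes, their element names, the sample values and the `validate` function are all hypothetical and stand in for whatever administrative and structural formats a project would actually choose.

```python
# An application profile: for each scheme prefix, the elements borrowed from it.
PROFILE = {
    "dc": {"title", "creator", "date"},  # descriptive elements (Dublin Core)
    "admin": {"loanStatus"},             # hypothetical administrative scheme
    "struct": {"partOf"},                # hypothetical structural scheme
}


def validate(record, profile):
    """Return the qualified names in the record that the profile does not allow."""
    problems = []
    for qname in record:
        prefix, _, name = qname.partition(":")
        if name not in profile.get(prefix, set()):
            problems.append(qname)
    return problems


record = {
    "dc:title": "Campus strategies for libraries and electronic information",
    "dc:date": "1990",
    "admin:loanStatus": "available",     # made-up value
    "struct:partOf": "An example series",  # made-up value
}

print(validate(record, PROFILE))                                   # []
print(validate({"dc:title": "x", "onix:price": "10"}, PROFILE))    # ['onix:price']
```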
Metadata – interoperability
[Figure: an “Internet Commons” shared by metadata communities: Commerce, Home Pages, Geo, Library, Museums, Scientific Data, Whatever...]
Metadata – descriptive/other
There is an evolution towards the creation of standards for non-discovery-related metadata formats:
• Preservation metadata [NedLib, CEDARS, …] (see the OCLC overview document: http://www.oclc.org/digitalpreservation/presmeta_wp.pdf)
• “Data Dictionary for Technical Metadata for Digital Still Images” (http://www.niso.org/pdfs/DataDict.pdf)
• book e-commerce [ONIX]
• resource administration: Circulation Interchange Protocol (NCIP) Standard, see http://www.niso.org/drafts/Z3982v1.html
• Electronic resources (cf. Adam Chandler)