An introduction to metadata in digital projects
Jenn Riley, Metadata Librarian
L566, Fall 2006
10/17/06 L566 Fall
Topics we’ll cover
- Choosing descriptive metadata standards
- Choosing controlled vocabularies
- Using controlled vocabularies to enhance searching and browsing
- Wrapping it all up
Choosing descriptive metadata standards
Descriptive metadata
- Enables users to find relevant materials
- Used by many different knowledge domains
- Many potential representations
- Controlled by
  - Data structure standards
  - Data content standards
  - Syntax encoding schemes
  - Vocabulary encoding schemes
Some data structure standards
- Dublin Core (DC)
  - Unqualified (simple)
  - Qualified
- MAchine Readable Cataloging (MARC)
- MARC in XML (MARCXML)
- Metadata Object Description Schema (MODS)
How do I pick one? (1)
- Institution
  - Nature of holding institution
  - Resources available for metadata creation
  - What others in the community are doing
  - Formats supported by your delivery software
- The standard
  - Purpose
  - Structure
  - Context
  - History
How do I pick one? (2)
- Materials
  - Genre
  - Format
  - Likely audiences
  - What metadata already exists for these materials
- Project goals
  - Robustness needed for the given materials and users
  - Describing multiple versions
  - Mechanisms for providing relationships between records
  - Plan for interoperability, including repeatability of elements
- More information on handout
Dublin Core (DC)
- 15-element set
- National and international standard
  - 2001: Released as ANSI/NISO Z39.85
  - 2003: Released as ISO 15836
- Maintained by the Dublin Core Metadata Initiative (DCMI)
- Other players
  - DCMI Working Groups
  - DC Usage Board
DCMI mission
The Dublin Core Metadata Initiative provides simple standards to facilitate the finding, sharing and management of information. DCMI does this by:
- Developing and maintaining international standards for describing resources
- Supporting a worldwide community of users and developers
- Promoting widespread use of Dublin Core solutions
DC Principles
- “Core” across all knowledge domains
- No element required
- All elements repeatable
- 1:1 principle (each record describes exactly one resource)
DCMI Abstract Model
- Released in 2005
- “A reference model against which particular DC encoding guidelines can be compared”
- Heavily influenced by RDF thinking
- New XML and RDF encodings under development to conform to the abstract model
- Two schools of thought on its development
  - Clarifies model underlying the metadata standard
  - Overly complicates a standard intended to be simple
DC encodings
- HTML
- XML
- RDF
- [Spreadsheets]
- [Databases]
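As a rough illustration of the XML encoding, the following Python sketch builds a simple DC record. The `record` wrapper element, the function name and all field values are assumptions made for the example; only the `dc` namespace URI comes from the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_simple_dc(fields):
    """Build a simple DC record from (element, value) pairs.

    The <record> wrapper is a hypothetical container element;
    simple DC itself does not define one.
    """
    record = ET.Element("record")
    for element, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{element}")
        child.text = value
    return record


# Hypothetical values. Elements may repeat, and none is required.
record = build_simple_dc([
    ("title", "Street scene"),
    ("creator", "Example, Photographer"),
    ("subject", "Streets"),
    ("subject", "Automobiles"),
    ("date", "1941-06-01"),
])
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The repeated `subject` element shows the "all elements repeatable" principle in practice.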
Content/value standards for DC
- None required
- Some elements recommend a content or value standard as a best practice
  - Relation, Source, Subject, Type, Coverage, Date, Format, Language, Identifier
Some limitations of simple DC
- Can’t indicate a main title vs. other subordinate titles
- No method for specifying creator roles
- W3CDTF format can’t indicate date ranges or uncertainty
- Can’t by itself provide robust record relationships
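The date limitation can be seen with a small check. This sketch covers only the date-only W3CDTF forms (YYYY, YYYY-MM, YYYY-MM-DD) and leaves out the time-of-day forms; the function name and sample values are hypothetical.

```python
import re

# Date-only W3CDTF granularities: YYYY, YYYY-MM, YYYY-MM-DD.
W3CDTF_DATE = re.compile(
    r"^\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?$"
)


def is_w3cdtf_date(value):
    """Return True if value is one of the date-only W3CDTF forms."""
    return bool(W3CDTF_DATE.match(value))


# Single dates pass; ranges and uncertain dates have no valid form.
for sample in ["1941", "1941-06", "1941-06-15", "1938-1969", "ca. 1941", "1941?"]:
    print(sample, is_w3cdtf_date(sample))
```

A range or an approximate date has to be handled some other way, for example as free text or with a different encoding scheme.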
Good times to use DC
- Cross-collection searching
- Cross-domain discovery
- Metadata sharing
- Describing some types of simple resources
- Metadata creation by novices
Format | Examples | Record format | Field labels | Reliance on AACR | Common method of creation
DC | record | XML, RDF, (X)HTML | Text | None | By novices, by specialists, and by derivation
QDC | record, collection | | | |
MARC | record, collection | | | |
MARCXML | record | | | |
MODS | record, collection | | | |
Qualified Dublin Core (QDC)
- Adds some increased specificity to Unqualified Dublin Core
- Same governance structure as DC
- Same encodings as DC
- Same content/value standards as DC
- Listed in DCMI Metadata Terms
- Additional principles
  - Extensibility
  - Dumb-down principle
Types of DC qualifiers
- Additional elements
- Element refinements
- Encoding schemes
  - Vocabulary encoding schemes
  - Syntax encoding schemes
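Element refinements and the dumb-down principle can be sketched together: a consumer that does not understand a refinement treats the value as belonging to the element it refines. The refinement names in the table below are from DCMI Metadata Terms; the function name, the data layout and the sample values are assumptions for illustration.

```python
# A few element refinements and the simple DC element each one refines.
REFINES = {
    "alternative": "title",
    "created": "date",
    "issued": "date",
    "abstract": "description",
    "spatial": "coverage",
    "temporal": "coverage",
    "isPartOf": "relation",
    "extent": "format",
}


def dumb_down(qdc_fields):
    """Reduce (name, value) pairs from QDC to simple DC.

    Refinements fall back to the element they refine; names that are
    already simple DC elements pass through unchanged.
    """
    return [(REFINES.get(name, name), value) for name, value in qdc_fields]


qdc = [
    ("title", "Street scene"),
    ("alternative", "View of Main Street"),
    ("created", "1941-06"),
    ("spatial", "Chicago"),
]
print(dumb_down(qdc))
```

The output is less specific than the input (an alternative title is now just a title) but is still valid simple DC, which is the point of the principle.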
DC qualifier status
- Recommended
- Conforming
- Obsolete
- Registered
Limitations of QDC
- Widely misunderstood
- No method for specifying creator roles
- W3CDTF format can’t indicate date ranges or uncertainty
- Split across 3 XML schemas
Best times to use QDC
- More specificity needed than simple DC, but not a fundamentally different approach to description
- Want to share DC with others, but need a few extensions for your local environment
- Describing some types of simple resources
- Metadata creation by novices
Format | Examples | Record format | Field labels | Reliance on AACR | Common method of creation
DC | record | XML, RDF, (X)HTML | Text | None | By novices, by specialists, and by derivation
QDC | record, collection | XML, RDF, (X)HTML | Text | None | By novices, by specialists, and by derivation
MARC | record, collection | | | |
MARCXML | record | | | |
MODS | record, collection | | | |
MAchine Readable Cataloging (MARC)
- Format for the records in IUCAT, WorldCat and other library catalogs
- Used for library metadata since the 1960s
- Adopted as national standard in 1971
- Adopted as international standard in 1973
- Maintained by:
  - Network Development and MARC Standards Office at the Library of Congress
  - Standards and Support Office at the National Library of Canada
More about MARC
- Actually a family of MARC standards throughout the world
- U.S. & Canada use MARC21
- MARC Bibliographic is for descriptive metadata
- Structured as a binary interchange format
  - ANSI/NISO Z39.2
  - ISO 2709
- Field names
  - Numeric fields
  - Alphabetic subfields
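The numeric-field and alphabetic-subfield structure can be sketched with a parser for a common human-readable display of MARC fields (three-digit tag, two indicators, `$`-prefixed subfield codes). This is not the binary ISO 2709 interchange format; the display convention chosen here, the function name and the sample values are assumptions for illustration.

```python
def parse_marc_line(line):
    """Parse a display-form field such as '245 10 $a Title / $c Name.'.

    Returns the tag, the two indicator characters (a blank indicator
    stays a space) and a list of (subfield code, value) pairs.
    Values containing a literal '$' are not handled in this sketch.
    """
    tag = line[:3]
    head, *parts = line[3:].split("$")
    indicators = head[1:3]
    subfields = [(part[0], part[1:].strip()) for part in parts if part]
    return {"tag": tag, "indicators": indicators, "subfields": subfields}


field = parse_marc_line("245 10 $a Metadata basics / $c Jane Example.")
print(field["tag"], field["indicators"])
for code, value in field["subfields"]:
    print(code, value)
```

Note that the field carries no textual label: a reader has to know that 245 is the title statement, which is part of why MARC creation calls for trained staff.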
Content/value standards for MARC
- None required by the format itself
- But US record creation practice relies heavily on:
  - AACR2r
  - ISBD
  - LCNAF
  - LCSH
Limitations of MARC
- Use of all its potential is time-consuming
- OPACs don’t make full use of all possible data
- OPACs virtually the only systems to use MARC data
- Requires highly-trained staff to create
- Local practice differs greatly
Good times to use MARC
- Integration with other records in OPAC
- Resources are like those traditionally found in library catalogs
- Maximum compatibility with other libraries is needed
- Have expert catalogers for metadata creation
Format | Examples | Record format | Field labels | Reliance on AACR | Common method of creation
DC | record | XML, RDF, (X)HTML | Text | None | By novices, by specialists, and by derivation
QDC | record, collection | XML, RDF, (X)HTML | Text | None | By novices, by specialists, and by derivation
MARC | record, collection | ISO 2709 [ANSI Z39.2] | Numeric | Strong | By specialists
MARCXML | record | | | |
MODS | record, collection | | | |
MARC in XML (MARCXML)
- Copies the exact structure of MARC21 in an XML syntax
  - Numeric fields
  - Alphabetic subfields
- Implicit assumption that content/value standards are the same as in MARC
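A minimal sketch of how a MARC field carries over into MARCXML, using the `datafield` and `subfield` elements of the MARC21 slim schema. The helper name and sample values are hypothetical, and a complete record would also need a leader and control fields, which are omitted here.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("marc", MARC_NS)


def add_datafield(record, tag, ind1, ind2, subfields):
    """Append one data field; the numeric tag and subfield codes become attributes."""
    field = ET.SubElement(
        record, f"{{{MARC_NS}}}datafield", {"tag": tag, "ind1": ind1, "ind2": ind2}
    )
    for code, value in subfields:
        sub = ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": code})
        sub.text = value
    return field


record = ET.Element(f"{{{MARC_NS}}}record")
add_datafield(record, "245", "1", "0",
              [("a", "Metadata basics /"), ("c", "Jane Example.")])
marcxml = ET.tostring(record, encoding="unicode")
print(marcxml)
```

Even this single title field takes several nested elements, which gives a sense of how verbose the syntax is.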
Limitations of MARCXML
- Not appropriate for direct data entry
- Extremely verbose syntax
- Full content validation requires tools external to XML Schema conformance
Good times to use MARCXML
- As a transition format between a MARC record and another XML-encoded metadata format
- Materials lend themselves to library-type description
- Need more robustness than DC offers
- Want XML representation to store within larger digital object but need lossless conversion to MARC
Format | Examples | Record format | Field labels | Reliance on AACR | Common method of creation
DC | record | XML, RDF, (X)HTML | Text | None | By novices, by specialists, and by derivation
QDC | record, collection | XML, RDF, (X)HTML | Text | None | By novices, by specialists, and by derivation
MARC | record, collection | ISO 2709 [ANSI Z39.2] | Numeric | Strong | By specialists
MARCXML | record | XML | Numeric | Strong | By derivation
MODS | record, collection | | | |
Metadata Object Description Schema (MODS)
- Developed and managed by the Library of Congress Network Development and MARC Standards Office
- First released for trial use June 2002
- MODS 3.2 released June 2006
- “Schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications.”
Differences between MODS and MARC
- MODS is “MARC-like” but intended to be simpler
- Textual tag names
- Encoded in XML
- Some specific changes
  - Some regrouping of elements
  - Removes some elements
  - Adds some elements
Content/value standards for MODS
- Some elements indicate a given content/value standard should be used
- Generally follows MARC/AACR2/ISBD conventions
- But not all enforced by the MODS XML schema
- Authority attribute available on some elements
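To make the textual tag names and the authority attribute concrete, here is a small sketch that reads a MODS fragment with Python's standard library. The sample record and its values are hypothetical and far from complete; the element names and the namespace URI are those of MODS version 3.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical sample record, for illustration only.
SAMPLE = """
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Street scene</title></titleInfo>
  <name type="personal"><namePart>Example, Photographer</namePart></name>
  <subject authority="lcsh"><topic>Streets</topic></subject>
</mods>
"""

root = ET.fromstring(SAMPLE)
title = root.findtext("mods:titleInfo/mods:title", namespaces=MODS_NS)

# The authority attribute names the vocabulary a value was taken from.
subjects = [
    (subject.get("authority"), subject.findtext("mods:topic", namespaces=MODS_NS))
    for subject in root.findall("mods:subject", MODS_NS)
]
print(title, subjects)
```

The schema accepts the `authority` attribute but does not check that the topic really appears in the named vocabulary, which is the kind of convention it leaves unenforced.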
Limitations of MODS
- No lossless round-trip conversion from and to MARC
- Still largely implemented by library community only
- Some semantics of MARC lost
- Format still growing to meet the needs of the digital library community
Good times to use MODS
- Materials lend themselves to library-type description
- Want to reach both library and non-library audiences
- Need more robustness than DC offers
- Want XML representation to store within larger digital object
Format | Examples | Record format | Field labels | Reliance on AACR | Common method of creation
DC | record | XML, RDF, (X)HTML | Text | None | By novices, by specialists, and by derivation
QDC | record, collection | XML, RDF, (X)HTML | Text | None | By novices, by specialists, and by derivation
MARC | record, collection | ISO 2709 [ANSI Z39.2] | Numeric | Strong | By specialists
MARCXML | record | XML | Numeric | Strong | By derivation
MODS | record, collection | XML | Text | Implied | By specialists and by derivation
Picking a format
- Consider all options
- Match format to the types of discovery you want to support
- Your choice has to fit in your larger technological infrastructure
- Realize the constraints you’re operating under
  - Or, expand infrastructure!
- Don’t have to choose just one, can use several for different purposes
Mapping between metadata formats
- Also called “crosswalking”
- To create “views” of metadata for specific purposes
- Mapping from robust format to more general format is common
- Mapping from general format to more robust format is ineffective
Types of mapping logic
- Mapping the complete contents of one field to another
- Splitting multiple values in a single local field into multiple fields in the target schema
- Translating anomalous local practices into a more generally useful value
- Splitting data in one field into two or more fields
- Transforming data values
- Boilerplate values to include in output schema
Common mapping pitfalls
- Cramming in too much information
- Leaving in trailing punctuation
- Missing context of records
- Meaningless placeholder data
- ALWAYS remember the purpose of the metadata you are creating!
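Several of the mapping types and pitfalls above can be sketched in one small crosswalk from a local record to simple DC. The local field names, the placeholder list and the sample record are all hypothetical; a real crosswalk would be driven by the actual source and target schemas.

```python
import re

# Values that carry no information and should not be copied across.
PLACEHOLDERS = {"", "n.d.", "unknown", "none", "n/a"}


def clean(value):
    """Trim whitespace and ISBD-style trailing punctuation (' /', ':', ';', ',')."""
    return re.sub(r"[\s/:;,]+$", "", value.strip())


def crosswalk(local):
    """Map a local record (a dict) to a list of simple DC (element, value) pairs."""
    dc = []
    # Complete contents of one field mapped to another.
    if "title" in local:
        dc.append(("title", clean(local["title"])))
    # Several values in one local field split into repeated elements.
    for term in local.get("subjects", "").split(";"):
        if clean(term):
            dc.append(("subject", clean(term)))
    # Meaningless placeholder data is dropped rather than copied.
    date = clean(local.get("date", ""))
    if date.lower() not in PLACEHOLDERS:
        dc.append(("date", date))
    # Boilerplate value supplied for every record in the output.
    dc.append(("type", "StillImage"))
    return dc


local_record = {"title": "Street scene /", "subjects": "Streets; Automobiles", "date": "n.d."}
print(crosswalk(local_record))
```

Trailing periods are left alone in this sketch because they are often part of an abbreviation; whether to strip them is a local decision.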
No, really, which one do I pick?
- It depends. Sorry.
- Be as robust as you can afford
- Plan for future uses of the metadata you create
- Leverage existing expertise as much as possible
- Focus on content and value standards as much as possible
More information
- Dublin Core
  - DC Element Set version 1.1
  - DCMI Metadata Terms
- MODS
- MARC
- MARCXML
Break time!
Choosing controlled vocabularies
Some characteristics of CVs
- Also known as “vocabulary encoding schemes”
- Enumerated lists of all possible choices for a field value
- Often organized into a syndetic structure
- Usually intended to be human-readable
CVs in libraries
- Many library CVs grow constantly with catalogers contributing new terms
- Many library CVs use content standards to dictate the form of headings
- Fields that use CVs are said to be under “authority control”
Traditional uses of CVs in library catalog records
- Collocation
- Disambiguation
- Interoperability
- BROWSING! (Although this isn’t used much in libraries…)
Other considerations
- Human cataloging using CVs is expensive
- Developing and maintaining CVs is expensive
- Current library systems usually rely on the same string being present in all records rather than true relational structures linking records to CV terms
When a controlled vocabulary is useful
- User browsing of a small number of categories each with a large number of members
- When many different things have the same label
- When recall is a priority for a given access point
Some common fields using CVs
- Names
- Places
- “Subjects”
Names
- Seeking works by or about a certain individual is frequent
- Individuals are often known by many different names
- Many different individuals have the same name
- Name authority lists often create uniqueness by adding qualifiers
- Some example vocabularies:
  - Library of Congress Name Authority File (LCNAF)
  - Getty Union List of Artists’ Names (ULAN)
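A minimal sketch of how a name authority list supports collocation and disambiguation. The identifiers, headings and records are invented for the example and are not taken from LCNAF or ULAN; records here link to an authority identifier rather than repeating the heading string.

```python
# Hypothetical authority entries: one authorized heading plus variant forms.
AUTHORITY = {
    "name-001": {
        "heading": "Example, Jane, 1901-1980",
        "variants": ["Example, J. Q.", "Sample, Jane"],
    },
    "name-002": {
        # A different person with the same name, kept unique by dates.
        "heading": "Example, Jane, 1955-",
        "variants": [],
    },
}

# Hypothetical records that store the authority identifier.
RECORDS = [
    {"id": 1, "creator": "name-001"},
    {"id": 2, "creator": "name-002"},
    {"id": 3, "creator": "name-001"},
]


def build_index(authority):
    """Map every known form of a name (lower-cased) to its authority identifier."""
    index = {}
    for term_id, entry in authority.items():
        for form in [entry["heading"], *entry["variants"]]:
            index[form.lower()] = term_id
    return index


def works_by(name, index, records):
    """Collocate: return ids of all records linked to the person known by this name."""
    term_id = index.get(name.lower())
    return [r["id"] for r in records if term_id and r["creator"] == term_id]


INDEX = build_index(AUTHORITY)
print(works_by("Sample, Jane", INDEX, RECORDS))
print(works_by("Example, Jane, 1955-", INDEX, RECORDS))
```

A search on a variant form finds everything by the first person, while the dated heading keeps the second person's work separate.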
Places
- Common in libraries to control place names in subjects, but not publication places
- Many different places with the same name
- Often organized hierarchically
- Commonly used vocabularies:
  - Library of Congress Subject Headings (LCSH)
  - Getty Thesaurus of Geographic Names (TGN)
  - GEOnet Names Server
“Subjects”
- Libraries traditionally group topic, location, genre, form, time period and other related concepts all under “subject”
- Often organized into a rich syndetic structure
- General rule is to apply the most specific heading applicable
- Involves subjective judgment on the part of the individual assigning the heading
Deciding which fields to place under authority control
- Consider your budgetary constraints
- Learn about the functionalities possible in your system
- Identify appropriate vocabularies that meet defined needs
- Develop a clear plan for how the fields with controlled values will be used
Using controlled vocabularies to enhance searching and browsing
Case Study: Cushman Collection
- Funded with an Institute of Museum & Library Services (IMLS) grant
- ~15,000 color slides taken between [dates missing]
- Cushman provided a significant amount of description
- Additional metadata created to enhance genre, subject and geographic access
Metadata for the Cushman Collection
- Cushman’s description
- Dates
- Location
- Names
- TGM I - LC Thesaurus for Graphic Materials: Subject Terms
- TGM II - LC Thesaurus for Graphic Materials: Genre & Physical Characteristics
- TGN - Getty Thesaurus of Geographic Names
- We wanted to use this high-quality metadata to improve on past search systems
TGM I: Subject Terms Strengths and Weaknesses
- Strengths include:
  - Pre-defined relationships between concepts
  - Some lead-in vocabulary
- Weaknesses include:
  - Syndetic relationship lacking for new terms
  - Language not user-friendly
  - Not enough lead-in vocabulary
  - Form and number of top-level categories not useful for a browse structure
User studies performed
- Two types
  - Group walkthroughs of prototypes
  - Task scenario study
- Some functionality suggested by the studies
  - Refinement while searching
  - Search suggestions
  - Faceted browsing
  - Browsing on subject terms at all levels
  - CV interaction
Browsing Image Collections
- Research shows:
  - Browsing is exploratory (Bawden)
  - Guided, flexible browsing in context works (Flamenco and SI Art Image Browser projects)
- Our usability studies show:
  - Structure is important
  - Contents should be easily exposed
  - Flexible and combinatorial browsing is desired
  - Browsing cultivates searching
Searching Image Collections
- Research shows:
  - Using thesaurus structure helps searching (Greenberg)
  - Automatic expansion of synonyms and narrower terms
  - User-initiated expansion of broader and related terms
- Our usability studies show:
  - Referencing an A-Z list with no lead-in terms for searching is NOT helpful at all
  - Concerns about word choice
  - Iterative reformulation of queries in context is desired
Cushman Specifications: Browsing
- Date
- Genre
- Subjects (hierarchical)
  - Retrieval of all records with narrower terms
- Location (hierarchical)
- Combination of categories
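Combining browse categories can be sketched as a filter over records that carry one value per facet, with location stored as a path from broad to narrow. This is only an illustration of the idea, not the Cushman implementation; the records, facet names and genre terms are hypothetical.

```python
# Hypothetical records; location is a hierarchy from broadest to narrowest place.
RECORDS = [
    {"id": 1, "date": "1941", "genre": "Cityscapes",
     "location": ("United States", "Illinois", "Chicago")},
    {"id": 2, "date": "1941", "genre": "Portraits",
     "location": ("United States", "Illinois", "Chicago")},
    {"id": 3, "date": "1952", "genre": "Cityscapes",
     "location": ("United States", "Indiana", "Bloomington")},
]


def matches(value, selected):
    """A hierarchical value matches if the selection appears anywhere in its path."""
    if isinstance(value, tuple):
        return selected in value
    return value == selected


def browse(records, **selected):
    """Return ids of records that match every selected facet value."""
    return [
        r["id"] for r in records
        if all(matches(r[facet], value) for facet, value in selected.items())
    ]


print(browse(RECORDS, genre="Cityscapes"))
print(browse(RECORDS, genre="Cityscapes", location="Illinois"))
print(browse(RECORDS, location="United States", date="1941"))
```

Selecting a broad place such as a country retrieves everything beneath it, and each added category narrows the result set.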
Cushman Specifications: Searching
- Integrated search against BOTH “free-text” descriptions and thesaurus
- Mapping from lead-in vocabulary
- Retrieval of all records with narrower terms
- User-initiated broadening and narrowing
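The first three search specifications can be sketched as follows: a query is mapped from lead-in vocabulary to a preferred term, expanded to all narrower terms, and run against both subject terms and free-text descriptions. The tiny thesaurus, the records and the function names are hypothetical, and this is not the Cushman implementation.

```python
# Hypothetical thesaurus fragment: preferred terms with narrower and lead-in terms.
THESAURUS = {
    "Vehicles": {"narrower": ["Automobiles", "Buses"], "lead_in": ["Transport"]},
    "Automobiles": {"narrower": [], "lead_in": ["Cars"]},
    "Buses": {"narrower": [], "lead_in": []},
}

RECORDS = [
    {"id": 1, "description": "Parked cars on Main Street", "subjects": ["Automobiles"]},
    {"id": 2, "description": "City bus at a stop", "subjects": ["Buses"]},
    {"id": 3, "description": "Farm house", "subjects": ["Dwellings"]},
]


def preferred_term(query):
    """Map a query to a preferred term, directly or through lead-in vocabulary."""
    q = query.lower()
    for term, entry in THESAURUS.items():
        if q == term.lower() or q in (lead.lower() for lead in entry["lead_in"]):
            return term
    return None


def with_narrower(term):
    """Return the term plus all of its narrower terms, at every level."""
    terms = [term]
    for child in THESAURUS[term]["narrower"]:
        terms.extend(with_narrower(child))
    return terms


def search(query, records):
    """Search thesaurus-expanded subjects and free-text descriptions together."""
    term = preferred_term(query)
    expanded = set(with_narrower(term)) if term else set()
    q = query.lower()
    return [
        r["id"] for r in records
        if expanded & set(r["subjects"]) or q in r["description"].lower()
    ]


print(search("vehicles", RECORDS))
print(search("cars", RECORDS))
print(search("farm", RECORDS))
```

A search on a broad term retrieves records indexed only under its narrower terms, and a word with no thesaurus entry still matches through the description text.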
Wrapping it all up
What next?
- After choosing metadata standards and controlled vocabularies
  - Figure out where metadata creation fits in the overall workflow
  - Write metadata creation guidelines
  - Design and implement a metadata creation process
And there’s more
- Other types of metadata
  - Content markup
  - Technical metadata
  - Rights metadata
  - Preservation metadata
  - Structural metadata
- Specialized metadata standards
- When to create a local metadata format
In a grant proposal (1)
- Give specific information on all the decisions you’ve made
  - Metadata standards
  - Controlled vocabularies
  - Metadata creation workflow
  - Discovery functionality the metadata will support
- Describe what metadata already exists for these materials
In a grant proposal (2)
- Indicate who will do the metadata creation work
- Give reasonable cost estimates
- The more planning you do, the more likely you are to
  - Receive funding
  - Complete the project on schedule
  - Complete the project within your budget
That’s all for today!
- These presentation slides:
- Handout: