SLIDE 1IS FALL 2004 Lecture 18: Metadata & Controlled Vocabulary Introduction Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2004 SIMS 202: Information Organization and Retrieval
SLIDE 2IS FALL 2004 Lecture Contents Review –Lexical Relations –WordNet Organization of Information Metadata Dublin Core Controlled Vocabularies Discussion
SLIDE 4IS FALL 2004 Syntax The syntax of a language is to be understood as a set of rules which accounts for the distribution of word forms throughout the sentences of a language These rules codify permissible combinations of classes of word forms
SLIDE 5IS FALL 2004 Semantics Semantics is the study of linguistic meaning Two standard approaches to lexical semantics (cf. sentential semantics and logical semantics): –(1) compositional –(2) relational
SLIDE 6IS FALL 2004 Pragmatics Deals with the relation between signs or linguistic expressions and their users Deixis (literally “pointing out”) –E.g., “I’ll be back in an hour” depends upon the time of the utterance Conversational implicature –A: “Can you tell me the time?” –B: “Well, the milkman has come.” [I don’t know exactly, but perhaps you can deduce it from some extra information I give you.] Presupposition –“Are you still such a bad driver?” Speech acts –Constatives vs. performatives –E.g., “I second the motion.” Conversational structure –E.g., turn-taking rules
SLIDE 7IS FALL 2004 Lexical Relations Conceptual relations link concepts –Goal of Artificial Intelligence Lexical relations link words –Goal of Linguistics
SLIDE 8IS FALL 2004 Major Lexical Relations Synonymy Polysemy Metonymy Hyponymy/Hypernymy Meronymy/Holonymy Antonymy
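A few of the relations above can be sketched as plain data structures. This is a minimal illustration, not WordNet's actual data model or API: the words, the `HYPERNYM` and `SYNONYMS` tables, and both helper functions are made-up assumptions chosen only to show how hypernymy chains and synonym sets behave.

```python
# Toy lexicon: all words and links below are illustrative sample data.

# Hyponymy/hypernymy: each word points to its more general term ("is a kind of").
HYPERNYM = {
    "spoon": "cutlery",
    "cutlery": "tableware",
    "tableware": "artifact",
}

# Synonymy: words in the same set are treated as interchangeable.
SYNONYMS = [
    {"car", "automobile"},
    {"couch", "sofa"},
]


def hypernym_chain(word):
    """Follow hypernym links upward and return the increasingly general terms."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def are_synonyms(a, b):
    """True if both words appear together in at least one synonym set."""
    return any(a in group and b in group for group in SYNONYMS)


print(hypernym_chain("spoon"))
print(are_synonyms("car", "automobile"))
```

Storing only the immediate hypernym keeps the table small; the full chain is recovered by walking the links.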
SLIDE 9IS FALL 2004 WordNet Started in 1985 by George Miller, students, and colleagues at the Cognitive Science Laboratory, Princeton University –Miller also known as the author of the paper “The Magical Number Seven, Plus or Minus Two: Some Limits on our Capacity for Processing Information” (1956) Can be downloaded for free: –
SLIDE 10IS FALL 2004 Structure of WordNet
SLIDE 13IS FALL 2004 Lecture Contents Review –Lexical Relations –WordNet Organization of Information Metadata Dublin Core Controlled Vocabularies Discussion
SLIDE 14IS FALL 2004 Organization of Information Is there a basic human need to put things into some sort of order? –Much of natural language concerns categories of things rather than individual things –Why do we organize things and information? Why do spoons go in THAT drawer in the kitchen and not in a can in the garage? Why do your favorite books go on one shelf and not-so-favorite on another?
SLIDE 15IS FALL 2004 Why Organize Information? The main reason –So that you can find things more effectively I.e., effective retrieval is predicated on some sort of organization applied to information resources Historically there have been many institutions and tools devoted to information organization –Libraries –Museums –Archives –Indexes and catalogs, dictionaries, phone books, etc.
SLIDE 16IS FALL 2004 Why Organize Information? A question of scale –Using your own ad hoc set of categories and methods to organize your own collection of books or CDs seems to work fine… –What if your collection grew to 10 times the size? How would you organize it? 100 times? 1000 times? times?
SLIDE 17IS FALL 2004 What is Information Organization? Identifying the existence of all types of information-bearing entities as they are made available Identifying the works contained within those information-bearing entities or as parts of them Systematically pulling together these information-bearing entities into collections in libraries, archives, museums, Internet communications files and other such depositories From Hagler via Taylor, Chap. 1
SLIDE 18IS FALL 2004 What is Information Organization? Producing lists of these information-bearing entities prepared according to standard rules for citation Providing name, title, subject and other useful access to these information-bearing entities Providing the means of locating each information-bearing entity or a copy of it
SLIDE 19IS FALL 2004 Key Issues in This Course How to describe information resources or information-bearing objects in ways so that they may be effectively used by those who need to use them –Organizing How to find the appropriate information resources or information-bearing objects for someone’s (or your own) needs –Retrieving
SLIDE 20IS FALL 2004 Key Issues [Diagram labels: Creation; Utilization; Searching; Active; Inactive; Semi-Active; Retention/Mining; Disposition; Discard; Using; Creating; Authoring; Modifying; Organizing; Indexing; Storing; Retrieval; Distribution; Networking; Accessing; Filtering]
SLIDE 21IS FALL 2004 Organizing/Indexing Collecting and integrating information Affects data, information and metadata “Metadata” describes data and information –More on this shortly Organizing information –Types of organization? Indexing
SLIDE 22IS FALL 2004 Accessing/Filtering Using the organization created in the O/I stage to –Select desired (or relevant) information –Locate that information –Retrieve the information from its storage location (often via a network)
SLIDE 23IS FALL 2004 Structure of an IR System [Diagram labels: Interest profiles & Queries; Documents & data; Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language); Storage Line; Potentially Relevant Documents; Comparison/Matching; Store1: Profiles/Search requests; Store2: Document representations; Indexing (Descriptive and Subject); Formulating query in terms of descriptors; Storage of profiles; Storage of Documents; Information Storage and Retrieval System]
SLIDE 24IS FALL 2004 Lecture Contents Review –Lexical Relations –WordNet Organization of Information Metadata Dublin Core Controlled Vocabularies Discussion
SLIDE 25IS FALL 2004 Metadata Metadata is –“Data about Data” (database systems) –Information about Information First used (to the best we can discover) in 1978 (meta-data) Used for databases in (Meta-Data Base) –“a data base which itself contains the structural and semantic data of other data bases” »Thomas R. Cousins & Wayne D. Dominick, “The Management of Data Bases of Data Bases” ASIS Proceedings, 1978.
SLIDE 26IS FALL 2004 Metadata Structures and languages for the description of information resources and their elements (components or features) “Metadata is information on the organization of the data, the various data domains, and the relationship between them” (Baeza-Yates p. 142)
SLIDE 27IS FALL 2004 Metadata Often two main types of metadata are distinguished –Descriptive metadata Describes the information/data object and its properties May use a variety of descriptive formats and rules –Topical metadata Describes the topic or “aboutness” of an information/data object May include a variety of vocabularies for describing subjects, topics, categories, etc.
SLIDE 28IS FALL 2004 Types of Metadata Element names Element description Element representation Element coding Element semantics Element classification
SLIDE 29IS FALL 2004 Metadata Systems and Standards Naming and ID systems Bibliographic description –Texts Music Images and objects Numeric data Geospatial data Collections Video and motion pictures
SLIDE 30IS FALL 2004 The Same Item in Different Metadata Systems ISBD RFC 1807 TEI Header MARC Record Dublin Core (a bit later)
SLIDE 31IS FALL 2004 ISBD Punctuation Title Proper (GMD) = Parallel title : other title info / First statement of responsibility ; others. -- Edition information. -- Material. -- Place of Publication : Publisher Name, Date. -- Material designation and extent ; Dimensions of item. -- (Title of Series / Statement of responsibility). -- Notes. -- Standard numbers: terms of availability (qualifications).
SLIDE 32IS FALL 2004 Bibliographic Record Introduction to cataloging and classification / Bohdan S. Wynar. -- 8th ed. / Arlene G. Taylor. -- Englewood, Colo. : Libraries Unlimited, (Library science text series).
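The ISBD punctuation pattern can be sketched as a small formatter. This is a hedged illustration that covers only three areas (title and responsibility, edition, publication); the function name `isbd_description` and its parameters are assumptions made for this example, and the date 1992 is taken from the other records shown for the same book.

```python
def isbd_description(title, responsibility, edition=None,
                     edition_responsibility=None,
                     place=None, publisher=None, date=None):
    """Assemble a partial ISBD-style description (three areas only)."""
    # Title and statement of responsibility: "Title / Responsibility"
    areas = [f"{title} / {responsibility}"]

    # Edition area: "Edition / Responsibility for the edition"
    if edition:
        area = edition
        if edition_responsibility:
            area += f" / {edition_responsibility}"
        areas.append(area)

    # Publication area: "Place : Publisher, Date"
    if place and publisher:
        area = f"{place} : {publisher}"
        if date:
            area += f", {date}"
        areas.append(area)

    # Each area is closed with a full stop; areas are separated by " -- ".
    def close_area(text):
        return text if text.endswith(".") else text + "."

    return " -- ".join(close_area(a) for a in areas)


record = isbd_description(
    title="Introduction to cataloging and classification",
    responsibility="Bohdan S. Wynar",
    edition="8th ed.",
    edition_responsibility="Arlene G. Taylor",
    place="Englewood, Colo.",
    publisher="Libraries Unlimited",
    date="1992",
)
print(record)
```

The point of the prescribed punctuation is that a reader (or a program) can identify each area from the separators alone, without labels.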
SLIDE 33IS FALL 2004 RFC 1807
BIB-VERSION:: CS-TR-v2.1
ID:: UCB//
ENTRY:: September 9, 1997
TYPE:: BOOK
TITLE:: Introduction to cataloging and classification
AUTHOR:: Wynar, Bohdan S.
AUTHOR:: Taylor, Arlene G.
DATE:: 1992
PAGES:: 633
COPYRIGHT:: Libraries Unlimited, 1992
SERIES:: Library Science Text Series
END:: UCB//123456
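Because every field in this format is written as `FIELD:: value`, a record is easy to read by program. The sketch below is a simplified reader, not a complete RFC 1807 parser: it assumes one field per line, ignores continuation lines, and the name `parse_rfc1807` is made up for this example. The sample reuses fields from the record above, omitting the ID line, whose value is incomplete on the slide.

```python
SAMPLE = """\
BIB-VERSION:: CS-TR-v2.1
ENTRY:: September 9, 1997
TYPE:: BOOK
TITLE:: Introduction to cataloging and classification
AUTHOR:: Wynar, Bohdan S.
AUTHOR:: Taylor, Arlene G.
DATE:: 1992
PAGES:: 633
COPYRIGHT:: Libraries Unlimited, 1992
SERIES:: Library Science Text Series
END:: UCB//123456
"""


def parse_rfc1807(text):
    """Map each field name to the list of values given for it, in order."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at the first "::"; anything without it is skipped in this sketch.
        field, separator, value = line.partition("::")
        if not separator:
            continue
        record.setdefault(field.strip().upper(), []).append(value.strip())
    return record


parsed = parse_rfc1807(SAMPLE)
print(parsed["AUTHOR"])
```

Values are kept in lists because fields such as AUTHOR may repeat within one record.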
SLIDE 34IS FALL 2004 Minimal TEI Header Introduction to cataloging and classification Bohdan S. Wynar 8th edition by Arlene G. Taylor Libraries Unlimited Introduction to cataloging and classification / Bohdan S. Wynar. -- 8th ed. / Arlene G. Taylor. -- Englewood, Colo. : Libraries Unlimited, 1992.
SLIDE 35IS FALL 2004 MARC Record (Display) ID:DCLC B RTYP:c ST:p FRN: MS:c EL: AD: CC:9110 BLT:am DCF:a CSC: MOD: SNR: ATC: UD: CP:cou L:eng INT: GPC: BIO: FIC:0 CON:b PC:s PD:1992/ REP: CPI:0 FSI:0 ILC:a II:1 MMD: OR: POL: DM: RR: COL: EML: GEN: BSE: (cloth) (paper) 040 DLC$cDLC$dDLC Z693$b.W $ Wynar, Bohdan S Introduction to cataloging and classification /$cBohdan S. Wynar th ed. /$bArlene G. Taylor. 260 Englewood, Colo. :$bLibraries Unlimited,$c xvii, 633 p. :$bill. ;$c24 cm Library science text series 504 Includes bibliographical references (p ) and index Cataloging Subject cataloging Classification$xBooks Anglo-American cataloguing rules Taylor, Arlene G.,$d1941-
SLIDE 36IS FALL 2004 Lecture Contents Review –Lexical Relations –WordNet Organization of Information Metadata Dublin Core Controlled Vocabularies Discussion
SLIDE 37IS FALL 2004 Dublin Core Simple metadata for describing internet resources For “Document-Like Objects” 15 Elements (in base DC)
SLIDE 38IS FALL 2004 Dublin Core
TITLE: Introduction to cataloging and classification
CREATOR: Taylor, Arlene G.
OTHER CONTRIBUTOR: Wynar, Bohdan S.
DATE: 1992
FORMAT: BOOK
LANGUAGE: ENG
PAGES: 633
PUBLISHER: Libraries Unlimited
SUBJECT: Cataloging.
SUBJECT: subject cataloging.
SUBJECT: Classification -- Books
DESCRIPTION: Textbook on cataloging and classification
RESOURCE TYPE: text.monograph
RESOURCE IDENTIFIER: (ISBN)
SLIDE 39IS FALL 2004 Dublin Core Elements Title; Creator; Subject; Description; Publisher; Other Contributors; Date; Resource Type; Format; Resource Identifier; Source; Language; Relation; Coverage; Rights Management
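One practical use of a fixed element set is checking a record against it. The sketch below is an illustration under stated assumptions: the element names are the fifteen listed on this slide (uppercased), the sample pairs come from the example record on the previous slide (the resource identifier is left out because its value is not given), and `check_record` plus the singular/plural alias table are hypothetical helpers, not part of any Dublin Core tool.

```python
DC_ELEMENTS = {
    "TITLE", "CREATOR", "SUBJECT", "DESCRIPTION", "PUBLISHER",
    "OTHER CONTRIBUTORS", "DATE", "RESOURCE TYPE", "FORMAT",
    "RESOURCE IDENTIFIER", "SOURCE", "LANGUAGE", "RELATION",
    "COVERAGE", "RIGHTS MANAGEMENT",
}

# Assumption: the singular label used in the sample record means the same element.
ALIASES = {"OTHER CONTRIBUTOR": "OTHER CONTRIBUTORS"}

RECORD = [
    ("TITLE", "Introduction to cataloging and classification"),
    ("CREATOR", "Taylor, Arlene G."),
    ("OTHER CONTRIBUTOR", "Wynar, Bohdan S."),
    ("DATE", "1992"),
    ("FORMAT", "BOOK"),
    ("LANGUAGE", "ENG"),
    ("PAGES", "633"),
    ("PUBLISHER", "Libraries Unlimited"),
    ("SUBJECT", "Cataloging."),
    ("SUBJECT", "subject cataloging."),
    ("SUBJECT", "Classification -- Books"),
    ("DESCRIPTION", "Textbook on cataloging and classification"),
    ("RESOURCE TYPE", "text.monograph"),
]


def check_record(pairs):
    """Group values by element and collect labels outside the element set."""
    grouped, unknown = {}, []
    for label, value in pairs:
        element = ALIASES.get(label.upper(), label.upper())
        if element in DC_ELEMENTS:
            grouped.setdefault(element, []).append(value)
        else:
            unknown.append(label)
    return grouped, unknown


grouped, unknown = check_record(RECORD)
print(unknown)
print(grouped["SUBJECT"])
```

Run on the sample, the check reports PAGES as the one label that is not among the fifteen elements, and shows that SUBJECT is repeated three times.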
SLIDE 40IS FALL 2004 Mega-Metadata Standards METS - Metadata Encoding and Transmission Standard –Developed by the Digital Library Federation as an implementation strategy for preservation metadata –“XML document format for encoding metadata necessary for both management of digital library objects within a repository and exchange of such objects between repositories (or between repositories and their users)” –Provides a flexible mechanism for encoding descriptive, administrative, and structural metadata for a digital library object, and for expressing the complex links between these various forms of metadata
SLIDE 41IS FALL 2004 Metadata Resources Check the Links section from the class home page Best site is the “Digital Library: Metadata Resources” page from IFLA at For another good source of information on metadata standards see
SLIDE 42IS FALL 2004 Lecture Contents Review –Lexical Relations –WordNet Organization of Information Metadata Dublin Core Controlled Vocabularies (Introduction) Discussion
SLIDE 43IS FALL 2004 Controlled Vocabularies Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information That is, it is an attempt to provide a consistent set of descriptions for use in (or as) metadata
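Vocabulary control can be sketched as a lookup from the terms people actually use to one preferred term. Everything in this example is hypothetical: the `PREFERRED` table, its terms, and the `index_terms` helper are invented for illustration and are not drawn from any real subject heading list.

```python
# Hypothetical controlled vocabulary: entry term (lowercase) -> preferred term.
PREFERRED = {
    "automobiles": "Automobiles",
    "cars": "Automobiles",
    "motor cars": "Automobiles",
    "motion pictures": "Motion pictures",
    "movies": "Motion pictures",
    "films": "Motion pictures",
}


def index_terms(free_terms):
    """Return (sorted preferred terms, terms not covered by the vocabulary)."""
    preferred, uncontrolled = set(), []
    for term in free_terms:
        match = PREFERRED.get(term.strip().lower())
        if match is None:
            uncontrolled.append(term)
        else:
            preferred.add(match)
    return sorted(preferred), uncontrolled


print(index_terms(["Cars", "films", "Movies", "blimps"]))
```

Indexer and searcher both pass through the same table, so "films" and "Movies" end up under one heading, while terms outside the vocabulary are surfaced rather than silently dropped.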
SLIDE 44IS FALL 2004 Controlled Vocabularies Names and name authorities Gazetteers (geographic names) Code lists (e.g., LC language codes) Subject heading lists Classification schemes Thesauri
SLIDE 45IS FALL 2004 Control of Names Cutter’s (1876) objectives of bibliographic description –To enable a person to find a document of which The author, or The title, or The subject is known –To show what a library has By a given author On a given subject (and related subjects) In a given kind (or form) of literature. First serves access Second serves collocation
SLIDE 46IS FALL 2004 Problems with Names How many names should be associated with a document? Which of these should be the “main entry?” What form should each of the names take? What references should be made from other possible forms of names that haven’t been used?
SLIDE 47IS FALL 2004 The Problem Proliferation of the forms of names –Different names for the same person –Different people with the same names Examples –from Books in Print (semi-controlled but not consistent) –ERIC author index (not controlled)
SLIDE 48IS FALL 2004 Goethe …etc…
SLIDE 49IS FALL 2004 John Muir
SLIDE 50IS FALL 2004 Pauline Cochrane nee Atherton
SLIDE 52IS FALL 2004 Rules for Description AACR II and other sets of descriptive cataloging rules provide guidelines for: –Determining the number of name entries –Choosing a main entry –Deciding on the form of name to be used –Deciding when to make references
SLIDE 53IS FALL 2004 Authority Control Authority control is concerned with creation and maintenance of a set of terms that have been chosen as the standard representatives (also known as established) based on some set of rules If you have rules, why do you need to keep track of all of the headings? Can’t you just infer the headings from the rules?
SLIDE 54IS FALL 2004 Conditions of Authorship? Single person or single corporate entity Unknown or anonymous authors –Fictitiously ascribed works Shared responsibility Collections or editorially assembled works Works of mixed responsibility (e.g., translations) Related works
SLIDE 55IS FALL 2004 Choice of Name AACR II says that the predominant form of the name used in a particular author’s writings should be chosen as the form of name References should be made from the other forms of the name
SLIDE 56IS FALL 2004 Form of the Name When names appear in multiple forms, one form needs to be chosen Criteria for choice are: –Fullness (e.g., full names vs. initials only) –Language of the name –Spelling (choose predominant form) Entry element: –John Smith or Smith, John? –Mao Zedong or Zedong, Mao? (Mao Tse Tung?)
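The entry-element question can be made concrete with a small function. This is a deliberately naive sketch: it assumes a name is a simple sequence of space-separated parts, the `entry_form` function and its `family_name_first` flag are invented for this example, and the "entry element, rest of name" output format is this sketch's convention rather than a statement of any cataloging rule.

```python
def entry_form(name, family_name_first=False):
    """Put the family name first, followed by a comma and the remaining parts."""
    parts = name.split()
    if len(parts) < 2:
        return name  # single-part names have nothing to rearrange
    if family_name_first:
        # The name is already written family name first (e.g. "Mao Zedong").
        family, rest = parts[0], parts[1:]
    else:
        # The name is written given name first (e.g. "John Smith").
        family, rest = parts[-1], parts[:-1]
    return f"{family}, {' '.join(rest)}"


print(entry_form("John Smith"))
print(entry_form("Mao Zedong"))
print(entry_form("Mao Zedong", family_name_first=True))
```

Without the flag, the function treats "Zedong" as the entry element, which is exactly the mistake the slide warns about: the correct entry element cannot be inferred from word order alone.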
SLIDE 57IS FALL 2004 Name Authority Files ID:NAFL ST:p EL:n STH:a MS:c UIP:a TD: KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF: RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d Other Versions: earlier 040 DLC$cDLC$dDLC$dOCoLC 053 PR6005.R Creasey, John Cooke, M. E Cooke, Margaret,$d Cooper, Henry St. John,$d Credo,$d Fecamps, Elise Gill, Patrick,$d Hope, Brian,$d Hughes, Colin,$d Marsden, James Matheson, Rodney Ranger, Ken St. John, Henry,$d Wilde, Jimmy $wnnnc$aAshe, Gordon,$d Different names for the same person
SLIDE 58IS FALL 2004 Name Authority Files ID:NAFO ST:p EL:n STH:a MS:n UIP:a TD: KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF: RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d OCoLC$cOCoLC Marric, J. J.,$d $wnnnc$aCreasey, John 663 Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under$bCreasey, John 670 OCLC : His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J.J. Marric) 670 LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J.J. Marric) 670 Pseuds. and nicknames dict., c1987$b(Creasey, John, ; British author; pseud.: Marric, J. J.)
SLIDE 59IS FALL 2004 Name Authority Files ID:NAFL ST:p EL:n STH:a MS:c UIP:a TD: KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF: RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d Other Versions: earlier 040 DLC$cDLC$dDLC$dOCoLC Butler, William Vivian,$d Butler, W. V.$q(William Vivian),$d Marric, J. J.,$d His The durable desperadoes, His The young detective's handbook, c1981:$bt.p. (W.V. Butler) 670 His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J.J. Marric) Different people writing with the same name
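The two problems shown in these records (one person with many names, one name used by more than one person) can be sketched with a simple reference index. This is not a MARC authority format or a real system's design: the `AUTHORITY_FILE` structure and both functions are assumptions for illustration, and the names are a small subset taken from the records above.

```python
from collections import defaultdict

# Established heading -> other names that should refer to it (subset of the slides).
AUTHORITY_FILE = {
    "Creasey, John": ["Cooke, M. E.", "Gill, Patrick", "Ashe, Gordon", "Marric, J. J."],
    "Butler, William Vivian": ["Butler, W. V.", "Marric, J. J."],
}


def build_reference_index(authority_file):
    """Map every known form of name to the set of established headings it can mean."""
    index = defaultdict(set)
    for heading, variants in authority_file.items():
        index[heading].add(heading)
        for variant in variants:
            index[variant].add(heading)
    return index


def resolve(name, index):
    """Return the established headings for a name, or an empty list if unknown."""
    return sorted(index.get(name, ()))


index = build_reference_index(AUTHORITY_FILE)
print(resolve("Gill, Patrick", index))
print(resolve("Marric, J. J.", index))
```

A name that resolves to more than one heading, as "Marric, J. J." does here, is a case that rules alone cannot settle; it needs the recorded evidence kept in the authority file.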
SLIDE 60IS FALL 2004 The Haunting of Lauran Paine 1. Paine, Lauran. ALSO KNOWN AS: Carrel, Mark. Thompson, Russ. Andrews, A. A. Benton, Will. Bradford, Will. Bradley, Concho. Brennan, Will. Carter, Nevada. Allen, Clay. Almonte, Rosa. Armour, John. Cassady, Claude. Glendenning, Donn. Kelley, Ray. Kilgore, John. Martin, Tom. Slaughter, Jim. Standish, Buck. … Batchelor, Reg. Beck, Harry. Bedford, Kenneth. Bosworth, Frank. Bovee, Ruth. Cassidy, Claude. Custer, Clint. Dana, Amber. Dana, Richard. Davis, Audrey. Drexler, J. F. Duchesne, Antoinette. Fisher, Margot. Fleck, Betty. Frost, Joni. Gordon, Angela. Gorman, Beth. Hayden, Jay. Houston, Will. Howard, Troy. Ingersol, Jared. … Kelly, Ray. Ketchum, Jack. Liggett, Hunter. Lucas, J. K. Lyon, Buck. Morgan, Arlene. Morgan, Valerie. O'Connor, Clint. St. George, Arthur. Sharp, Helen. Thorn, Barbara. Archer, Dennis. Clark, Badger.
SLIDE 61IS FALL 2004 Some Interesting Ones…
SLIDE 62IS FALL 2004 Structure of an IR System [Diagram labels: Search Line; Interest profiles & Queries; Documents & data; Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language); Storage Line; Potentially Relevant Documents; Comparison/Matching; Store1: Profiles/Search requests; Store2: Document representations; Indexing (Descriptive and Subject); Formulating query in terms of descriptors; Storage of profiles; Storage of Documents; Information Storage and Retrieval System] Adapted from Soergel, p. 19
SLIDE 63IS FALL 2004 Uses of Controlled Vocabularies Library subject headings, classification, and authority files Commercial journal indexing services and databases Yahoo, and other web classification schemes Online and manual systems within organizations –SunSolve –MacArthur
SLIDE 64IS FALL 2004 Types of Indexing Languages Uncontrolled keyword indexing Indexing languages –Controlled, but not structured Thesauri –Controlled and structured Classification systems –Controlled, structured, and coded Faceted thesauri and classification systems Much more on these topics later…
SLIDE 65IS FALL 2004 Lecture Contents Review –Lexical Relations –WordNet Organization of Information Metadata Dublin Core Controlled Vocabularies Discussion
SLIDE 66IS FALL 2004 Discussion
SLIDE 67IS FALL 2004 Next Time Introduction to the Phone Project Readings/discussion –Information Architecture (Rosenfeld)