CS 430: Information Discovery


1 CS 430: Information Discovery
Lecture 14 Library Catalogs 2

2 Course Administration

3 Dublin Core
Dublin Core is an attempt to apply cataloguing methods to online materials, notably the Web.
History: The methods of full text indexing that were used by the early Web search engines, such as Lycos, would not scale up.
"... indexes are most useful in small collections within a given domain. As the scope of their coverage expands, indexes succumb to problems of large retrieval sets and problems of cross disciplinary semantic drift. Richer records, created by content experts, are necessary to improve search and retrieval." Weibel 1995

4 Dublin Core
Simple set of metadata elements for online information:
15 basic elements
intended for all types and genres of material
all elements optional
all elements repeatable
Developed by an international group chaired by Stuart Weibel since 1995. (Diane Hillmann and Carl Lagoze of Cornell have been very active in this group.)

5 Dublin Core elements
1. Title: The name given to the resource by the creator or publisher.
2. Creator: The person or organization primarily responsible for the intellectual content of the resource. For example, authors in the case of written documents, artists, photographers, or illustrators in the case of visual resources.
3. Subject: The topic of the resource. Typically, subject will be expressed as keywords or phrases that describe the subject or content of the resource. The use of controlled vocabularies and formal classification schemes is encouraged.

6 Dublin Core elements
4. Description: A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources.
5. Publisher: The entity responsible for making the resource available in its present form, such as a publishing house, a university department, or a corporate entity.
6. Contributor: A person or organization not specified in a creator element who has made significant intellectual contributions to the resource but whose contribution is secondary to any person or organization specified in a creator element (for example, editor, transcriber, and illustrator).

7 Dublin Core elements
7. Date: A date associated with the creation or availability of the resource.
8. Type: The category of the resource, such as home page, novel, poem, working paper, preprint, technical report, essay, dictionary.
9. Format: The data format of the resource, used to identify the software and possibly hardware that might be needed to display or operate the resource.
10. Identifier: A string or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs.

8 Dublin Core elements
11. Source: Information about a second resource from which the present resource is derived.
12. Language: The language of the intellectual content of the resource.
13. Relation: An identifier of a second resource and its relationship to the present resource. This element permits links between related resources and resource descriptions to be indicated. Examples include an edition of a work (IsVersionOf), or a chapter of a book (IsPartOf).

9 Dublin Core elements
14. Coverage: The spatial locations and temporal durations characteristic of the resource.
15. Rights: A rights management statement, an identifier that links to a rights management statement, or an identifier that links to a service providing information about rights management for the resource.

10

11 Dublin Core
publisher: OCLC
creator: Weibel, Stuart L.
creator: Miller, Eric J.
title: Dublin Core Reference Page
date:
format: text/html (MIME type)
language: en (English)
identifier:

12 Representations of Dublin Core: Meta Tags
<meta name="publisher" content="OCLC">
<meta name="creator" content="Weibel, Stuart L.">
<meta name="creator" content="Miller, Eric J.">
<meta name="title" content="Dublin Core Reference Page">
<meta name="date" content=" ">
<meta name="format" content="text/html">
<meta name="language" content="en">
<meta name="identifier" content="
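
The mapping from a Dublin Core record to meta tags can be sketched in a few lines of Python. This is only an illustration, not a tool from the lecture: the function name to_meta_tags and the dictionary layout are assumptions, and the date and identifier elements are left out because their values are not legible in the transcript. Each element maps to a list of values, which reflects the rule that every element is optional and repeatable.

```python
from html import escape

# Hypothetical in-memory record: element name -> list of values.
# Values are taken from the example slide; "date" and "identifier"
# are omitted because their values are missing from the transcript.
record = {
    "publisher": ["OCLC"],
    "creator": ["Weibel, Stuart L.", "Miller, Eric J."],
    "title": ["Dublin Core Reference Page"],
    "format": ["text/html"],
    "language": ["en"],
}


def to_meta_tags(record):
    """Return one <meta> tag string per (element, value) pair."""
    tags = []
    for element, values in record.items():
        for value in values:
            tags.append(
                '<meta name="{}" content="{}">'.format(
                    escape(element, quote=True),
                    escape(value, quote=True),
                )
            )
    return tags


if __name__ == "__main__":
    for tag in to_meta_tags(record):
        print(tag)
```

A repeated element such as creator simply produces two tags with the same name, which matches the slide above.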

13 Qualifiers
Element qualifier
Example: Date
DC.Date.Created 1997-11-01
DC.Date.Issued
DC.Date.Available /
DC.Date.Valid /

14 Qualifiers
Value qualifiers
Example: Subject
DC.Subject.DDC 509.123
DC.Subject.LCSH Digital libraries-United States

15 Dumbing Down Principle
"The theory behind this principle is that consumers of metadata should be able to strip off qualifiers and return to the base form of a property. ... this principle makes it possible for client applications to ignore qualifiers in the context of more coarse-grained, cross-domain searches." Lagoze 2001

16 Dumbing Down Principle
Qualified version:
DC.Date.Created
DC.Subject.LCSH Digital libraries-United States
Dumbed-down version:
DC.Date (a valid date)
DC.Subject Digital libraries-United States (a valid subject description)
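
Stripping qualifiers is mechanical, which a short Python sketch can show. It rests on assumptions that are not in the lecture: qualified names are dot-separated, an optional leading "DC" prefix is kept together with the base element, and the function names dumb_down and dumb_down_record are hypothetical. The date value is the one given for DC.Date.Created on the Qualifiers slide.

```python
def dumb_down(name):
    """Strip qualifiers from a dotted element name.

    'DC.Subject.LCSH' -> 'DC.Subject'
    'Publisher.place' -> 'Publisher'
    """
    parts = name.split(".")
    # Keep the "DC" prefix plus the base element, or just the base
    # element when no prefix is present.
    keep = 2 if parts[0] == "DC" else 1
    return ".".join(parts[:keep])


def dumb_down_record(pairs):
    """Apply dumb_down to every (name, value) pair; values are untouched."""
    return [(dumb_down(name), value) for name, value in pairs]


if __name__ == "__main__":
    qualified = [
        ("DC.Date.Created", "1997-11-01"),
        ("DC.Subject.LCSH", "Digital libraries-United States"),
    ]
    for name, value in dumb_down_record(qualified):
        print(name, value)
```

The code only renames elements. Whether the value is still valid for the unqualified element is a cataloguing decision that the record creator has to get right.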

17 Representations of Dublin Core: Text (with qualifiers)
See next two slides for an example of a Dublin Core record for a web site prepared by a professional cataloguer at the Library of Congress. Note that the record does not follow the principle of dumbing-down.

18

19

20 Old Midterm Examination
Dumbing-down failures:
Description.note Title from home page as viewed on Nov. 1, 2000.
Description Title from home page as viewed on Nov. 1, 2000.
which is not a description of the object
Publisher.place Nashville, Tenn. :
Publisher Nashville, Tenn. :
which is not the publisher of the object
Correct dumbing-down:
Subject.class.LCC E840.8.G65
Subject E840.8.G65
which is a subject code

21 What to Catalog: IFLA Model
Work. A work is the underlying abstraction, e.g.,
The Iliad
The Computer Science departmental web site
Beethoven's Fifth Symphony
Unix operating system
The 1996 U.S. census
This is roughly equivalent to the concept of "literary work" used in copyright law.

22 IFLA Model
Expression. A work is realized through an expression, e.g.,
The Iliad has oral expressions and written expressions.
A musical work has score and performance(s).
Software has source code and machine code.
Many works have only a single expression, e.g. a web page, or a book.

23 IFLA Model
Manifestation. An expression is given form in one or more manifestations, e.g.,
The text of The Iliad has been manifest in numerous manuscripts and printed books.
A musical performance can be distributed on CD, or broadcast on television.
Software is manifest as files, which may be stored or transmitted in any digital medium.

24 IFLA Model
Item. When many copies are made of a manifestation, each is a separate item, e.g.,
a specific copy of a book
computer file
[Works, expressions, manifestations and items are explored in CS 431, Architecture of Web Information Systems.]
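
The four levels of the IFLA model nest naturally, which can be sketched with Python dataclasses. The class and field names below are illustrative assumptions, not part of the IFLA model's own vocabulary, and the helper count_items is hypothetical. The example data reuses The Iliad from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    label: str  # one physical or digital copy


@dataclass
class Manifestation:
    form: str  # e.g. "printed book", "manuscript"
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    kind: str  # e.g. "oral", "written"
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    title: str  # the underlying abstraction
    expressions: List[Expression] = field(default_factory=list)


iliad = Work("The Iliad", [
    Expression("written", [
        Manifestation("printed book", [Item("a specific copy of a book")]),
        Manifestation("manuscript"),
    ]),
    Expression("oral"),
])


def count_items(work):
    """Count the items reachable from a work through all levels."""
    return sum(
        len(manifestation.items)
        for expression in work.expressions
        for manifestation in expression.manifestations
    )


if __name__ == "__main__":
    print(count_items(iliad))
```

A catalog record can be attached at any of the four levels, which is why deciding what to catalog comes before deciding how.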

25 Limits of Dublin Core and MARC: Complex Objects
Metadata records
Complete object
Sub-objects
Article within a journal
Page within a Web site
A thumbnail of another image
The March 28 final edition of a newspaper

26 Flat v. linked records
Flat record
All information about an item is held in a single Dublin Core record, including information about related items
convenient for access and preservation
information is repeated -- maintenance problem
Linked record
Related information is held in separate records with a link from the item record
less convenient for access and preservation
information is stored once
Compare with normal forms in relational databases
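
The trade-off can be made concrete with a small Python sketch. Everything here is an assumption made for illustration: the dictionary layout, the key names, the identifier "serial-1", the placeholder titles "Article A" and "Article B", and the helper resolve. Only the serial details (D-Lib Magazine, volume 6, issue 1) come from the example record used in this lecture.

```python
# Flat records: the serial information is repeated in every article record.
flat_records = [
    {"title": "Article A", "serial_name": "D-Lib Magazine", "volume": "6", "issue": "1"},
    {"title": "Article B", "serial_name": "D-Lib Magazine", "volume": "6", "issue": "1"},
]

# Linked records: the serial information is stored once and each article
# record carries only a link (here a hypothetical key) to it.
serials = {
    "serial-1": {"serial_name": "D-Lib Magazine", "volume": "6", "issue": "1"},
}
linked_records = [
    {"title": "Article A", "relation": "serial-1"},
    {"title": "Article B", "relation": "serial-1"},
]


def resolve(record, serials):
    """Follow the link and return the equivalent flat record."""
    merged = dict(record)           # copy so the linked record is unchanged
    link = merged.pop("relation")   # the link itself is not part of the flat form
    merged.update(serials[link])
    return merged


if __name__ == "__main__":
    for record in linked_records:
        print(resolve(record, serials))
```

Correcting the serial name requires one change in the linked form and one change per article in the flat form, which is the maintenance problem noted above. The flat form in turn needs no lookup at access time.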

27

28 Representations of Dublin Core: XML (with qualifiers)
<title>Digital Libraries and the Problem of Purpose</title>
<creator>David M. Levy</creator>
<publisher>Corporation for National Research Initiatives</publisher>
<date date-type = "publication">January 2000</date>
<type resource-type = "work">article</type>
<identifier uri-type = "DOI"> /january2000-levy</identifier>
<identifier uri-type = "URL">
<language>English</language>
<rights>Copyright (c) David M. Levy</rights>
to be continued

29 Dublin Core with flat record extension
Continuation of D-Lib Magazine record
<relation rel-type = "InSerial">
<serial-name>D-Lib Magazine</serial-name>
<issn> </issn>
<volume>6</volume>
<issue>1</issue>
</relation>
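
A record of this shape can be read with Python's standard xml.etree.ElementTree module. The sketch below makes several assumptions: the wrapping <record> root element is hypothetical (the slides show no root), only elements whose values survive in the transcript are included, and the helper flatten and its dotted naming of nested elements are illustrative choices.

```python
import xml.etree.ElementTree as ET

# Subset of the D-Lib Magazine record, wrapped in a hypothetical <record> root.
xml_text = """
<record>
  <title>Digital Libraries and the Problem of Purpose</title>
  <creator>David M. Levy</creator>
  <date date-type="publication">January 2000</date>
  <relation rel-type="InSerial">
    <serial-name>D-Lib Magazine</serial-name>
    <volume>6</volume>
    <issue>1</issue>
  </relation>
</record>
""".strip()

root = ET.fromstring(xml_text)


def flatten(root):
    """Return (name, value) pairs; nested elements get dotted names."""
    pairs = []
    for child in root:
        if len(child):  # the element has sub-elements, e.g. <relation>
            for sub in child:
                pairs.append((child.tag + "." + sub.tag, sub.text))
        else:
            pairs.append((child.tag, child.text))
    return pairs


if __name__ == "__main__":
    for name, value in flatten(root):
        print(name, "=", value)
    # Qualifiers are carried as attributes in this representation.
    print(root.find("date").get("date-type"))
```

The nested relation element is what makes this a flat record extension: the serial's details travel inside the article's own record instead of living in a separate, linked record.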

30 Limits of Dublin Core and MARC: Events
Version 1
Version 2
New material
Should Version 2 have its own record or should extra information be added to the Version 1 record? How are these represented in Dublin Core or MARC?

31 Using Catalog Data for Information Retrieval
The basic operation of information retrieval is to match the way that a user describes an information requirement (a query), against the way that items are described (an index). The success of conventional catalogs (e.g., MARC + Anglo-American Cataloguing Rules) or indexing services (e.g., Medline) comes from the use of precise language to describe items combined with trained and experienced users to formulate queries.

32 Why is Dublin Core not used to Index and Search the Web?
Technology: The methods used in early Infoseek, Lycos and Altavista have been greatly enhanced. (Note that these methods provide quite good precision at the expense of recall.)
Users: The typical user who searches the Web has limited training and does not understand catalogs.
Economics: The size of the Web makes human indexing of every important site impossible. The rate of change requires frequent re-indexing.

33 Dublin Core in Many Languages
See: Thomas Baker, Languages for Dublin Core, D-Lib Magazine December 1998,

