1 CS/INFO 430 Information Retrieval Lecture 16 Metadata 3.

Presentation transcript:

1 CS/INFO 430 Information Retrieval Lecture 16 Metadata 3

2 Course Administration Assignment 2 and Midterm Examinations Grades were sent out by yesterday. Assignment 3 Will be posted tomorrow.

3 Theoretical Problems in Metadata: What to Catalog. The IFLA Model: Work. A work is the underlying abstraction, e.g., The Iliad; the Computer Science departmental web site; Beethoven's Fifth Symphony; the Unix operating system; the 1996 U.S. census. This is roughly equivalent to the concept of "literary work" used in copyright law.

4 IFLA Model: Expression. A work is realized through an expression, e.g., The Iliad has oral expressions and written expressions. A musical work has a score and performance(s). Software has source code and machine code. Many works have only a single expression, e.g., a Web page or a book.

5 IFLA Model: Manifestation. An expression is given form in one or more manifestations, e.g., the text of The Iliad has been manifest in numerous manuscripts and printed books. A musical performance can be distributed on CD or broadcast on television. Software is manifest as files, which may be stored or transmitted in any digital medium.

6 IFLA Model: Item. When many copies are made of a manifestation, each is a separate item, e.g., a specific copy of a book or of a computer file. [Works, expressions, manifestations and items are explored in CS 431, Architecture of Web Information Systems.]
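
The four IFLA levels described on slides 3 to 6 can be pictured as a simple containment hierarchy. The following is a minimal sketch only: the class names, field names and the sample copies are illustrative assumptions, not part of the IFLA model's own notation or of the lecture.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """A single copy of a manifestation, e.g. one copy of a printed book."""
    location: str  # hypothetical field, used only for illustration


@dataclass
class Manifestation:
    """The form an expression is given (manuscript, printed book, CD, file)."""
    form: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    """A realization of a work (oral, written, score, performance)."""
    kind: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """The underlying abstraction, e.g. The Iliad."""
    title: str
    expressions: List[Expression] = field(default_factory=list)


def count_items(work: Work) -> int:
    """Count every individual copy reachable from a work."""
    return sum(
        len(m.items)
        for e in work.expressions
        for m in e.manifestations
    )


# Sample data: the copies listed here are invented for the example.
iliad = Work(
    title="The Iliad",
    expressions=[
        Expression(kind="oral"),
        Expression(
            kind="written",
            manifestations=[
                Manifestation(
                    form="printed book",
                    items=[Item("library copy 1"), Item("library copy 2")],
                ),
                Manifestation(form="manuscript", items=[Item("archive copy")]),
            ],
        ),
    ],
)

print(count_items(iliad))
```

A catalog designer still has to decide at which of these levels a metadata record is attached; the sketch does not settle that question.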

7 Theoretical Problems in Metadata: Events. Version 1, with new material, becomes Version 2. Should Version 2 have its own record, or should extra information be added to the Version 1 record? How are these represented in Dublin Core or MARC?

8 Theoretical Problems in Metadata: Complex Objects. Examples of complex objects: an article within a journal; a page within a Web site; a thumbnail of another image; the March 28 final edition of a newspaper. [Diagram labels: complete object, sub-objects, metadata records.]

9 Theoretical Problems in Metadata: Packaging Rules When an object consists of various parts, how should their interaction be described? Example: An object on the Web may consist of several html pages with images, applets, etc. Metadata Object Description Schema (MODS) MPEG 21

10 MPEG 21

11 Theoretical Problems in Metadata: Flat v. linked records
Flat record: All information about an item is held in a single record (e.g., a Dublin Core record), including information about related items.
- convenient for access and preservation
- information is repeated -- maintenance problem
Linked record: Related information is held in separate records with a link from the item record.
- less convenient for access and preservation
- information is stored once
Compare with normal forms in relational databases.
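
The trade-off between flat and linked records can be seen in a small sketch. The record layout, the `journal_id` key, the `resolve` helper and all sample values below are hypothetical and chosen only for illustration; they are not Dublin Core syntax.

```python
# Flat form: the journal details are copied into every article record.
flat_records = [
    {"title": "Article A", "journal": "Example Journal", "publisher": "Example Press"},
    {"title": "Article B", "journal": "Example Journal", "publisher": "Example Press"},
]

# Linked form: the journal details are stored once and each article
# record holds only a link (here a key) to them.
journals = {"j1": {"journal": "Example Journal", "publisher": "Example Press"}}
linked_records = [
    {"title": "Article A", "journal_id": "j1"},
    {"title": "Article B", "journal_id": "j1"},
]


def resolve(record, journals):
    """Follow the link to build the full view of a linked record."""
    full = {k: v for k, v in record.items() if k != "journal_id"}
    full.update(journals[record["journal_id"]])
    return full


# A change of publisher touches one place in the linked form ...
journals["j1"]["publisher"] = "New Press"

# ... but every copy in the flat form (the maintenance problem).
for r in flat_records:
    r["publisher"] = "New Press"

print([resolve(r, journals) for r in linked_records] == flat_records)
```

The flat records can be read or archived on their own, while the linked records need the extra `resolve` step, which is the access and preservation cost noted on the slide.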

12

13 Representations of Dublin Core: XML (with qualifiers). Element values: Digital Libraries and the Problem of Purpose; David M. Levy; Corporation for National Research Initiatives; January 2000; article; /january2000-levy; English; Copyright (c) David M. Levy. (to be continued)

14 Dublin Core with flat record extension Continuation of D-Lib Magazine record D-Lib Magazine
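
As a rough illustration of an XML representation, the sketch below serializes the values of the D-Lib Magazine record as unqualified Dublin Core elements. Several things here are assumptions: the `record` wrapper element, the omission of qualifiers, and the pairing of each value with an element name (taken from the manually created record on slides 33 and 34). The identifier is reproduced exactly as it appears in the transcript.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Field values as they appear in the transcript.
fields = [
    ("title", "Digital Libraries and the Problem of Purpose"),
    ("creator", "David M. Levy"),
    ("publisher", "Corporation for National Research Initiatives"),
    ("date", "January 2000"),
    ("type", "article"),
    ("identifier", "/january2000-levy"),
    ("language", "English"),
    ("rights", "Copyright (c) David M. Levy"),
]


def to_xml(fields):
    """Serialize (element, value) pairs as a flat Dublin Core record."""
    record = ET.Element("record")  # wrapper element name is an assumption
    for name, value in fields:
        element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(record, encoding="unicode")


xml_text = to_xml(fields)
print(xml_text)
```

Qualifiers (for example a scheme for the date or the identifier) would need extra attributes or elements; how the original slide encoded them is not recoverable from the transcript.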

15 Theoretical Problems in Metadata: Many Languages See: Thomas Baker, Languages for Dublin Core, D-Lib Magazine December 1998,

16 Automatic extraction of catalog data: Strategies
- Manual by trained cataloguers - high quality records, but expensive and time consuming
- Entirely automatic - fast, almost zero cost, but poor quality
- Automatic followed by human editing - cost and quality depend on the amount of editing
- Manual collection-level record, automatic item-level record - moderate quality, moderate cost

17 DC-dot DC-dot is a Dublin Core metadata editor for Web pages, created by Andy Powell at UKOLN DC-dot has two parts: (a) A skeleton Dublin Core record is created automatically from clues in the web page (b) A user interface is provided for cataloguers to edit the record

18

19 Automatic record for CS 430 home page DC-dot applied to continued on next slide

20 Automatic record for CS 430 home page (continued) DC-dot applied to

21 Observations on DC-dot applied to CS 430 home page
- DC.Title is a copy of the html <title> field
- DC.Publisher is the owner of the IP address where the page was stored
- DC.Subject is a list of headings and noun phrases presented for editing
- DC.Date is taken from the Last-Modified field in the http header
- DC.Type and DC.Format are taken from the MIME type of the http response
- DC.Identifier was supplied by the user as input
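
The observations above suggest how a skeleton record might be assembled from a page and its http headers. The sketch below is not DC-dot's implementation: the class and function names, the choice of h1 to h3 as "headings", and the sample page are assumptions. DC.Publisher and DC.Type are left out because they would need an IP-address ownership lookup and a type mapping that the slide does not describe.

```python
from html.parser import HTMLParser


class TitleAndHeadings(HTMLParser):
    """Collect the <title> text and the text of <h1>-<h3> headings."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "h2", "h3"):
            self._current = tag
            if tag != "title":
                self.headings.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current is not None:
            self.headings[-1] += data


def skeleton_record(url, headers, html):
    """Build a partial Dublin Core record from a page and its http headers."""
    parser = TitleAndHeadings()
    parser.feed(html)
    mime = headers.get("Content-Type", "").split(";")[0].strip()
    return {
        "DC.Title": parser.title.strip(),
        "DC.Subject": [h.strip() for h in parser.headings if h.strip()],
        "DC.Date": headers.get("Last-Modified", ""),
        "DC.Format": mime,
        "DC.Identifier": url,  # supplied by the user, as on the slide
    }


# Invented sample page and headers, for illustration only.
sample_html = (
    "<html><head><title>Example Course</title></head>"
    "<body><h1>Syllabus</h1><h2>Readings</h2></body></html>"
)
sample_headers = {
    "Content-Type": "text/html; charset=utf-8",
    "Last-Modified": "Mon, 01 Jan 2001 00:00:00 GMT",
}
record = skeleton_record("http://example.org/course/", sample_headers, sample_html)
print(record)
```

A page with no <title> element simply yields an empty DC.Title, which is one reason such records are presented to a cataloguer for editing.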

22

23 Observations on DC-dot applied to George W. Bush home page. The home page has several meta tags: [The page has no html <title>] <META NAME="KEYWORDS" CONTENT="George W. Bush, Bush, George Bush, President, republican, 2000 election and more

24 DC-dot applied to continued on next slide Automatic record for George W. Bush home page

25 DC-dot applied to Automatic record for George W. Bush home page (continued)

26 Collection-level metadata Several of the most difficult fields to extract automatically are the same across all pages in a web site. Therefore create a collection record manually and combine it with automatic extraction of other fields at item level. See: Jenkins and Inman
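
Combining a manually created collection-level record with an automatically extracted item-level record might look like the sketch below. The merge rule (collection values first, duplicates dropped) and the `field.qualifier` key convention are assumptions made for the example; the field values are taken from the D-Lib Magazine records shown on the following slides.

```python
def combine(collection_record, item_record):
    """Merge two records, keeping every distinct value for each field."""
    combined = {}
    for source in (collection_record, item_record):
        for field_name, values in source.items():
            target = combined.setdefault(field_name, [])
            for value in values:
                if value not in target:
                    target.append(value)
    return combined


# Created once, by hand, for the whole site.
collection_record = {
    "publisher": ["Corporation for National Research Initiatives"],
    "type": ["article"],
    "relation.serial-name": ["D-Lib Magazine"],
    "language": ["English"],
}

# Extracted automatically for one page.
item_record = {
    "title": ["Digital Libraries and the Problem of Purpose"],
    "publisher": ["Corporation for National Research Initiatives"],
    "type": ["Text"],
    "format": ["text/html"],
}

combined = combine(collection_record, item_record)
for field_name, values in combined.items():
    print(field_name, values)
```

Fields that are hard to extract automatically, such as language or the serial name, come from the collection record, while page-specific fields such as the title come from the automatic pass.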

27 Collection-level metadata Compare: (a) Metadata extracted automatically by DC-dot (b) Collection-level record (c) Combined item-level record (DC-dot plus collection-level) (d) Manual record

28

29 Metadata extracted automatically by DC-dot
D.C. Field   Qualifier   Content
title                    Digital Libraries and the Problem of Purpose
subject                  not included in this slide
publisher                Corporation for National Research Initiatives
date         W3CDTF
type         DCMIType    Text
format                   text/html
format       bytes
identifier

30 Collection-level record
D.C. Field   Qualifier     Content
publisher                  Corporation for National Research Initiatives
type                       article
type         resource      work
relation     rel-type      InSerial
relation     serial-name   D-Lib Magazine
relation     issn
language                   English
rights                     Permission is hereby given for the material in D-Lib Magazine to be used for...

31 Combined item-level record (DC-dot plus collection-level)
D.C. Field   Qualifier         Content
title                          Digital Libraries and the Problem of Purpose
publisher                  (*) Corporation for National Research Initiatives
date         W3CDTF
type                       (*) article
type         resource      (*) work
type         DCMIType          Text
format                         text/html
format       bytes
(*) indicates collection-level metadata
continued on next slide

32 Combined item-level record (DC-dot plus collection-level)
D.C. Field   Qualifier         Content
relation     rel-type      (*) InSerial
relation     serial-name   (*) D-Lib Magazine
relation     issn          (*)
language                   (*) English
rights                     (*) Permission is hereby given for the material in D-Lib Magazine to be used for...
identifier
(*) indicates collection-level metadata

33 Manually created record
D.C. Field   Qualifier         Content
title                          Digital Libraries and the Problem of Purpose
creator                    (+) David M. Levy
publisher                      Corporation for National Research Initiatives
date         publication       January 2000
type                           article
type         resource          work
(+) entry that is not in the automatically generated records
continued on next slide

34 Manually created record
D.C. Field   Qualifier         Content
relation     rel-type          InSerial
relation     serial-name       D-Lib Magazine
relation     issn
relation     volume        (+) 6
relation     issue         (+) 1
identifier   DOI           (+) /january2000-levy
identifier   URL
language                       English
rights                     (+) Copyright (c) David M. Levy
(+) entry that is not in the automatically generated records

35 Search Engine Spam
D-Lib Magazine: Web pages created for the user, with good quality control and no attempt to impress search engines. (The editor originally trained as a librarian.) The site lends itself to automatic indexing.
Political Web Sites (Bush and Gore): Web pages created for marketing, with little consistency, designed to impress search engines. (The editors are specialists in public relations.) The sites are difficult to index automatically.

36 Metatest Metatest is a research project led by Liz Liddy at Syracuse with participation from the Human Computer Interaction group at Cornell. The aim is to compare the effectiveness as perceived by the user of indexing based on: (a) Manually created Dublin Core (b) Automatically created Dublin Core (higher quality than DC-dot) (c) Full text indexing Preliminary results suggest remarkably little difference in effectiveness.

37 Why is Dublin Core not used to Index and Search the Web?
- Technology: The methods used in early Infoseek, Lycos and Altavista have been greatly enhanced. (Note that these methods provide quite good precision at the expense of low recall.)
- Users: The typical user who searches the Web has limited training and does not understand catalogs.
- Economics: The size of the Web makes human indexing of every important site impossible. The rate of change requires frequent re-indexing.

38 Why is Dublin Core not used to Index and Search the Web? For Web pages, information retrieval by automatic indexing of full text works at least as well as metadata-based methods, and is much, much cheaper. However, we will see later an effective example of automated extraction of metadata from video sequences (Informedia).