Metadata for the Web From Discovery to Description

1 Metadata for the Web From Discovery to Description
CS 502 – Carl Lagoze – Cornell University Cornell CS 502

2 The fifteen Dublin Core Elements
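Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights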

3 A Pidgin for Digital Tourists
Metadata is language.
Dublin Core is a small and simple language -- a pidgin -- for finding resources across domains.
Speakers of different languages naturally "pidginize" to communicate, e.g., tourists using simple phrases to order beer ("zwei Bier bitte", "dva pivo", "biru o san bai", ...).
We are all "tourists" on the global Internet.

4 What is the Dublin Core (1)?
A simple set of properties to support resource discovery on the web (fuzzy search buckets)?
A domain-independent view.

5 What is the Dublin Core (2)?
An extensible ontology for resource description?
Greater functionality and cost.

6 What is the Dublin Core (3)?
A cross-domain switchboard for interoperable metadata?
[Diagram: Dublin Core as the switchboard, with projections to application-specific metadata vocabularies: MARC, INDECS, IMS]

7 Dublin Core Qualifiers
From fuzzy buckets to more specific description.
Model of “graceful degradation”.
Support both simplicity and specificity.
Intra-domain and inter-domain semantics.

8 Varieties of qualifiers: Element Refinements
Make the meaning of an element narrower or more specific.
Narrowing implies an "is a" relationship:
  a "date created" is a "date"
  an "is part of relation" is a "relation"
If your software does not understand the qualifier, you can safely ignore it.

9 Varieties of Qualifiers: Value Encoding Schemes
Says that the value is:
  a term from a controlled vocabulary (e.g., Library of Congress Subject Headings), or
  a string formatted in a standard way (e.g., " " means May 3, not February 5).
Even if a scheme is not known by software, the value should be "appropriate" and usable for resource discovery.

10 A Grammar of Dublin Core
By design not as subtle as mother tongues, but easy to learn and extremely useful in practice.
Pidgins: small vocabularies (Dublin Core: fifteen special nouns and lots of optional adjectives).
Simple grammars: sentences (statements) follow a simple fixed pattern...

11 Example Dublin Core statements
Resource has Title 'Grammar of Dublin Core'.
Resource has Creator 'Tom Baker'.
Resource has Subject 'Metadata'.
Resource has Relation

12 Resource has property X
Implied subject: Resource
Implied verb: has
Property: one of 15 properties (DC:Creator, DC:Title, DC:Subject, DC:Date, ...)
X: property value (an appropriate literal)

13 Resource has property X
Implied subject: Resource
Implied verb: has
Property: one of 15 properties (DC:Creator, DC:Title, DC:Subject, DC:Date, ...)
X: property value (an appropriate literal)
Qualifiers (adjectives): [optional qualifier] [optional qualifier]

14 Resource has Subject "Languages -- Grammar" (qualifier: LCSH)
Resource has Date " " (qualifiers: ISO8601, Revised)

15 Dumb-Down Principle for Qualifiers
The fifteen elements should be usable and understandable with or without the qualifiers.
Qualifiers refine meaning (but may be harder to understand).
Nouns can stand on their own without adjectives.
If your software encounters an unfamiliar qualifier, look it up -- or just ignore it!
"Has a" relations break the model, e.g., a creator has a hair color.
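The dumb-down principle can be illustrated with a minimal Python sketch. The `Statement` type and the `dumb_down` function are hypothetical names introduced here for illustration; they are not part of any Dublin Core specification or library. The sketch assumes a statement carries at most one element refinement and one encoding scheme.

```python
from typing import NamedTuple, Optional

# The fifteen unqualified Dublin Core element names.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


class Statement(NamedTuple):
    """Hypothetical representation of 'Resource has <element> <value>'."""
    element: str                      # one of the fifteen elements (the noun)
    value: str                        # an appropriate literal
    refinement: Optional[str] = None  # element refinement, e.g. "Created"
    scheme: Optional[str] = None      # value encoding scheme, e.g. "LCSH"


def dumb_down(stmt: Statement) -> Statement:
    """Drop all qualifiers, keeping only the element and its value."""
    if stmt.element.lower() not in DC_ELEMENTS:
        raise ValueError(f"not a Dublin Core element: {stmt.element}")
    return Statement(stmt.element, stmt.value)


qualified = Statement("Subject", "Languages -- Grammar", scheme="LCSH")
simple = dumb_down(qualified)
print(simple.element, "=", simple.value)
```

A consumer that does not recognize "LCSH" can still index the value under Subject, which is the point of the principle: the noun stands on its own without its adjectives.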

16 Test for “good” qualifiers
Cover the qualifier and ask:
-- Does the statement still make sense?
-- Is it still correct?
Resource has Subject "Languages -- Grammar" (qualifier: LCSH)
Resource has Date " " (qualifiers: ISO8601, Revised)

17 “Incorrect” Qualification
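Resource has creator “Cornell University” (qualifier: affiliation)
Resource has subject “pre-schoolers” (qualifier: audience)
With the qualifier covered, each statement becomes incorrect: Cornell University is the creator's affiliation, not the creator, and pre-schoolers are the audience, not the subject.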

18 Open questions in this model
Are uncontrolled and unconstrained values really useful for discovery?
Is it possible for an organization (DCMI) to control the evolution of a language?
How can "simple discovery metadata" be combined with complex descriptions?
Is there a notion of graceful degradation?
Can DC serve as a lingua franca (mapping template) among more complex models?

19 Models for Deploying Metadata
Embedded in the resource
  Low deployment threshold
  Limited flexibility, limited model
Linked to from the resource
  Using XLink
  Is there only one source of metadata?
Independent resource referencing the resource
  Model of accessing the object through its surrogate
  Resource doesn’t ‘have’ metadata; metadata is just one resource annotating another

20 Syntax Alternatives: HTML
Advantages:
  Simple mechanism: META tags embedded in content
  Widely deployed tools and knowledge
Disadvantages:
  Limited structural richness (won’t support hierarchical, tree-structured data or entity distinctions)

21 Dublin Core in HTML
HTML constructs:
<link> to establish pseudo-namespace
<meta> for metadata statements
  name attribute for DC element (DC.element.ER, where ER is an element refinement)
  content attribute for element value
  scheme attribute for encoding scheme or controlled vocabulary
  lang attribute for language of element value

22 Dublin Core in HTML example
<link rel="schema.DC" href=" ">
<meta name="DC.Title" content="Business Unusual">
<meta name="DC.Title" lang="es" content="negocio inusual">
<meta name="DC.Creator" content="Carl Lagoze">
<meta name="DC.Subject" content="bibliographic control web cataloging">
<meta name="DC.Date.Created" scheme="W3CDTF" content=" ">
<meta name="DC.Format" content="text/html">
<meta name="DC.Identifier" content=" ">
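As a rough sketch of how software might read such tags, the Python below uses the standard library's `html.parser` to collect Dublin Core statements from `<meta>` elements. The class name `DublinCoreMetaParser` and the dictionary layout are assumptions made for illustration, and the date value in the sample is a made-up placeholder, not the value from the slide.

```python
from html.parser import HTMLParser


class DublinCoreMetaParser(HTMLParser):
    """Collect Dublin Core statements from <meta name="DC.*"> tags."""

    def __init__(self):
        super().__init__()
        self.statements = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)  # HTMLParser lowercases attribute names
        parts = (a.get("name") or "").split(".")
        # Expect "DC.<element>" or "DC.<element>.<refinement>".
        if len(parts) < 2 or parts[0].upper() != "DC":
            return
        self.statements.append({
            "element": parts[1],
            "refinement": parts[2] if len(parts) > 2 else None,
            "value": a.get("content"),
            "scheme": a.get("scheme"),
            "lang": a.get("lang"),
        })


# Sample input; the date below is a hypothetical placeholder value.
SAMPLE = """
<meta name="DC.Title" content="Business Unusual">
<meta name="DC.Title" lang="es" content="negocio inusual">
<meta name="DC.Creator" content="Carl Lagoze">
<meta name="DC.Date.Created" scheme="W3CDTF" content="2000-01-01">
"""

parser = DublinCoreMetaParser()
parser.feed(SAMPLE)
for stmt in parser.statements:
    print(stmt)
```

Because the refinement and scheme are kept in separate fields, a consumer that does not understand them can ignore both and still use the element and value.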

23 Unqualified Dublin Core in XML

24 Multi-entity nature of object description
[Diagram labels: photographer, camera type, software, computer artist]

25 Attribute/Value approaches to metadata…
The playwright of Hamlet was Shakespeare.
Hamlet (subject) has a (implied verb) creator (metadata noun) Shakespeare (literal); playwright is the metadata adjective.
[Diagram: R1 --dc:title--> “Hamlet”; R1 --dc:creator.playwright--> “Shakespeare”]

26 …run into problems for richer descriptions…
The playwright of Hamlet was Shakespeare, who was born in Stratford.
Hamlet has a creator birthplace Stratford.
[Diagram: R1 --dc:creator.playwright--> “Shakespeare”; R1 --dc:creator.birthplace--> “Stratford”]

27 …because of their failure to model entity distinctions
[Diagram: R1 --title--> “Hamlet”; R1 --creator--> R2; R2 --name--> “Shakespeare”; R2 --birthplace--> “Stratford”]
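The difference can be sketched in a few lines of Python. This is an illustrative reading of the diagram, assuming it shows two resources, R1 (the work) and R2 (the person); the triple layout and the helper names `values` and `creator_birthplaces` are hypothetical.

```python
# Each triple is (subject, property, value). R1 and R2 are resource
# identifiers; a value may be a literal or another resource's identifier.
triples = [
    ("R1", "title", "Hamlet"),
    ("R1", "creator", "R2"),
    ("R2", "name", "Shakespeare"),
    ("R2", "birthplace", "Stratford"),
]


def values(subject, prop):
    """Return every value recorded for (subject, prop)."""
    return [v for s, p, v in triples if s == subject and p == prop]


def creator_birthplaces(work):
    """Follow work -> creator -> birthplace across two entities."""
    return [
        place
        for creator in values(work, "creator")
        for place in values(creator, "birthplace")
    ]


print(creator_birthplaces("R1"))
print(values("R1", "birthplace"))
```

Because the birthplace hangs off R2 rather than R1, the model no longer suggests that Hamlet itself was born in Stratford, which is what the flat attribute/value form risks implying.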

28 Applying a Model-Centric Approach
Formally define common entities and relationships underlying multiple metadata vocabularies.
Describe them (and their inter-relationships) in a simple logical model.
Provide the framework for extending these common semantics to domain and application-specific metadata vocabularies.

29 Events are key to understanding resource complexity?
Events are implicit in most metadata formats (e.g., ‘date published’, ‘translator’).
Modeling implied events as first-class objects provides attachment points for common entities, e.g., agents, contexts (times & places), roles.
Clarifying attachment points facilitates understanding and querying “who was responsible for what when”.

30 ABC/Harmony Event-aware metadata ontology
Recognizing inherent lifecycle aspects of description (esp. of digital content).
Modeling incorporates time (events and situations) as first-class objects.
Supplies clear attachment points for agents, roles, existential properties.
Resource description as a “story-telling” activity.

31 Resource-centric Metadata
Title: Anna Karenina
Author: Leo Tolstoy
Illustrator: Orest Vereisky
Translator: Margaret Wettlin
Date Created: 1877
Date Translated: 1978
Description: Adultery & Depression
?
Birthplace: Moscow
Birthdate: 1828

32 [Diagram labels: “Orest Vereisky”, “Leo Tolstoy”, "Moscow", “1828”, “Margaret Wettlin”, “illustrator”, “author”, “translator”, “creation”, “1877”, “1978”, “translation”, “Russian”, “English”, “Anna Karenina”, “Tragic adultery and the search for meaningful love”]
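One way to read these labels is as an event-aware record in which creation and translation are first-class objects. The Python sketch below is an interpretation, not the ABC/Harmony model itself: the class names are hypothetical, and attaching "Moscow" and "1828" to Tolstoy, and the languages to the two events, are assumptions about the original diagram. The illustrator is left out because the transcript does not show which event that role belongs to.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Agent:
    name: str
    birthplace: Optional[str] = None  # assumed to belong to the agent
    birthdate: Optional[str] = None


@dataclass
class Event:
    kind: str                 # e.g. "creation", "translation"
    date: str
    agent: Agent
    role: str                 # the agent's role in this event
    language: Optional[str] = None  # assumed attachment point


@dataclass
class Resource:
    title: str
    description: str
    events: List[Event] = field(default_factory=list)

    def who_did_what_when(self):
        """Answer 'who was responsible for what when'."""
        return [(e.agent.name, e.role, e.kind, e.date) for e in self.events]


tolstoy = Agent("Leo Tolstoy", birthplace="Moscow", birthdate="1828")
wettlin = Agent("Margaret Wettlin")

book = Resource(
    title="Anna Karenina",
    description="Tragic adultery and the search for meaningful love",
    events=[
        Event("creation", "1877", tolstoy, "author", language="Russian"),
        Event("translation", "1978", wettlin, "translator", language="English"),
    ],
)

for row in book.who_did_what_when():
    print(row)
```

In this form the birthplace and birthdate have an unambiguous owner, which the flat resource-centric record could not express.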

33 Breaking the metadata bottleneck
Human vs. machine generation
Simple text scraping
  HTML tags as hint
Other structural methods
Natural language methods and machine learning
Contextual methods
  Google (text and image search)
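The simplest of these approaches, text scraping with HTML tags as hints, might look like the Python sketch below. It is an assumption-laden illustration rather than a described system: the choice of `<title>` and `<h1>` as hint tags, and the names `TitleScraper` and `guess_title`, are hypothetical.

```python
from html.parser import HTMLParser


class TitleScraper(HTMLParser):
    """Record the first text found inside each hint tag."""

    HINT_TAGS = ("title", "h1")  # assumed hints for a candidate title

    def __init__(self):
        super().__init__()
        self._current = None     # hint tag we are currently inside, if any
        self.candidates = {}     # tag name -> first text seen in that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.HINT_TAGS:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current and text:
            self.candidates.setdefault(self._current, text)


def guess_title(html):
    """Prefer <title>, fall back to the first <h1>, else None."""
    scraper = TitleScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.candidates.get("title") or scraper.candidates.get("h1")


page = "<html><head><title>Business Unusual</title></head><body><h1>Welcome</h1></body></html>"
print(guess_title(page))
```

The result is only a candidate value for a Title element; a scraper like this has no way to tell whether the page author used the tag meaningfully.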

34 Putting metadata in its place

35 Query engine architecture space

