O N T O P E D I A The Identity of Everything Subject Identity Steve Pepper INF5909, 2009-02-23.

O N T O P E D I A The Identity of Everything www.ontopedia.net Subject Identity Steve Pepper pepper.steve@gmail.com INF5909, 2009-02-23

Agenda
Merging in Topic Maps
The Importance of Identity
The Topic Maps Approach to Identity
The Identity Crisis of the Web
Published Subjects
(Subject-centric Computing)

Merging in Topic Maps
An Example of Knowledge Federation

Merging topic maps
Topic Maps can be merged automatically
  – Arbitrary topic maps can be merged into a single topic map
  – This cannot be done with databases or XML documents
Merging enables many advanced applications
  – Information integration across repositories
  – Sharing and reusing taxonomies
  – Automated content aggregation
  – Distributed knowledge management
  – Global knowledge federation
Merging is made possible by subject identity

Principles of merging
By definition: every topic represents exactly one subject
The goal: every subject is represented by exactly one topic
1. When two topic maps are merged, topics that represent the same subject should be merged into a single topic
2. When two topics are merged, the resulting topic has the union of the characteristics of the two original topics
[Diagram: a topic T with names, occurrences and association roles; a second topic (in another topic map) “about” the same subject; merging the two topics together yields a topic with the union of the original characteristics]
(Demo of merging in the Omnigator…)
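The second principle can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Topic` class, its field names, and the sample values are hypothetical and are not taken from the Topic Maps data model or from the Omnigator.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Hypothetical topic: each kind of characteristic is kept as a set."""
    names: set = field(default_factory=set)
    occurrences: set = field(default_factory=set)
    roles: set = field(default_factory=set)  # association roles the topic plays


def merge_topics(a: Topic, b: Topic) -> Topic:
    """Return a new topic holding the union of the characteristics of a and b."""
    return Topic(
        names=a.names | b.names,
        occurrences=a.occurrences | b.occurrences,
        roles=a.roles | b.roles,
    )


# Two topics "about" the same subject, coming from two different topic maps.
t1 = Topic(
    names={"Puccini"},
    occurrences={"http://example.org/bio"},      # hypothetical occurrence
    roles={("composed-by", "composer")},         # hypothetical association role
)
t2 = Topic(
    names={"Puccini", "Giacomo Puccini"},
    roles={("born-in", "person")},               # hypothetical association role
)

merged = merge_topics(t1, t2)
print(sorted(merged.names))
```

Using sets means that a characteristic present in both topics (here the name "Puccini") appears only once in the result, which is what "union" implies.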

The vision of seamless knowledge
Starting with ITU in 2001, Norway has seen an explosion in the number of portals that are based on Topic Maps
  – Today there are dozens, especially in the public sector
As the number of portals multiplies, the amount of overlap increases…
  – The potential for integration is … staggering
Take these three portals as an example:
  – forskning.no (Research Council web site aimed at young adults)
  – forbrukerportalen.no (Norwegian Consumer Association)
  – matportalen.no (Biosecurity portal of the Department of Agriculture)

Genetically modified food at forskning.no

Genetically modified food at Forbrukerrådet

Genetically modified foodstuffs at Matportalen

Three portals – one subject → one “virtual portal” with seamless navigation in all directions

The Importance of Identity

Identity and knowledge federation
Knowledge federation requires subject-based merging

The big challenge is knowing when we’re talking about the same thing
[Diagram: the computer domain and the real world]

Humans get by using names
But names are ambiguous (homonyms)
  – Humans disambiguate using (a) context and (b) negotiation
Many names have the same referent (synonyms)
  – Humans can generally handle this
  – Computers can’t – at least not without our help...
Computers need a simpler mechanism
  – Local identifiers (database keys, XML IDs, controlled vocabularies, code sets, etc.) work OK in closed systems
  – but not across systems or domains (e.g. the code “nor”)
  – Open and multilingual systems need global identifiers

Requirements on global identifiers
The mechanism as a whole should be
  – open and democratic: top-down solutions won’t work
  – scalable: the number of potential subjects is open-ended
  – easy to adopt: based on existing tools and methods
The identifiers themselves should be
  – easy for humans to use: locate, create, interpret, apply
      given a subject, find an identifier (if one exists)
      given a subject, create an identifier
      given an identifier, find out what subject it identifies
      given an identifier, attach it to the information in question
  – efficient for computers to use: comparison of identifiers
      lexical comparison is simplest
      avoid normalization, network access, other computation

Some proposed solutions
URL-based proposals
  For web documents
    – HTTP URIs (URLs)
    – address = identifier
  For resources in general
    – Source: SemWeb community
    – URIs for arbitrary “resources” (esp. classes and properties)
  Published Subjects
    – Source: Topic Maps community
    – Continuation of SemWeb practice
Non-URL-based proposals
  URN (RFC 1737)
    – Uniform Resource Names
  XRI (OASIS)
    – Extensible Resource Identifiers
  Domain specific
    – ISBN (books)
    – DOI (“digital objects”)
    – GUID & UUID
    – UPC & EAN
    – RFID
    – (what else is out there?)

The Topic Maps Approach to Identity
Direct identification (subject locators)
Indirect identification (subject identifiers)

Subjects and topics
Topics are surrogates, or “proxies” (inside the computer), for the ineffable subjects that you want to talk about, such as Puccini, love, these slides, or the second law of thermodynamics
[Diagram: a subject in the real world (referent) and a topic T in the computer domain (symbol)]

Topics and subjects
Topics represent subjects
  – By definition every topic represents exactly one subject
  – The goal when merging is to ensure that every subject is represented by exactly one topic (the collocation objective)
A subject can be anything you want
  – ISO 13250 definition: “A subject is any ‘thing’ whatsoever, whether or not it exists or has any other specific characteristics, about which anything whatsoever may be asserted by any means whatsoever.”
Some examples... (Puccini) (Lucca) (Tosca) (Madame Butterfly)

The identity of subjects
Topics exist in order to allow us to talk about subjects
  – The relationship between the two is sometimes called intentionality
We need to know exactly which subject a topic represents
  – That is, we need to establish its subject identity
  – The collocation objective depends on knowing when applications are talking about the same thing
[Diagram: Lucca, Tosca, Puccini, Madame Butterfly]

Subject locators
Sometimes the subject is an information resource (like these slides)
  – It exists somewhere within the computer system
  – It has a location and can be “addressed”, e.g. http://www.ontopedia.net/tutorials/tm-intro.ppt
  – The address of such an addressable subject can be used to unequivocally establish the subject’s identity
  – An address used in this way to identify a subject directly is called a subject locator
But most subjects are not information resources
  – Puccini, Tosca, love, subject-centric computing, …
  – They are outside the computer domain and cannot be addressed directly...
[Diagram: the subject (these slides) is linked to its topic by the subject locator http://www.ontopedia.net/tutorials/tm-intro.ppt]

Subject identifiers
The identity of most subjects can only be established indirectly
  – An information resource can provide an indication of the subject’s identity to a human
  – Such a resource is called a subject descriptor*
A subject descriptor has an address, even though the subject it indicates does not
  – Computers can use the address of the subject descriptor to establish identity
  – Such addresses are called subject identifiers
Subject descriptors and subject identifiers represent the two faces of the human-computer dichotomy
* also known as “subject indicator”
[Diagram: three domains (Life, the Universe and Everything; The Computer Domain; The Topic Map Domain). The subject Puccini is indicated by a subject descriptor (“Giacomo Puccini, Italian composer, b. Lucca 22nd Dec 1858, d. Brussels, 29th Nov 1924. Best known for his operas, of which Tosca is the most...”), whose address http://psi.ontopedia.net/Puccini is the subject identifier of the topic]

A dual mechanism
The subject is identified by a URL
  – The URL is called a subject identifier
  – Example: the topic Giacomo Puccini has the subject identifier http://psi.ontopedia.net/Puccini
The URL is the address of a web page
  – The web page describes the subject such that a human can know what subject is referred to
  – This web page is called a subject descriptor
  – Example: the subject descriptor at http://psi.ontopedia.net/Puccini reads “Giacomo Puccini. Italian composer, b. Lucca 22nd Dec 1858, d. Brussels, 29th Nov 1924. Best known for his operas, of which Tosca is one of the most popular and well-known.”
Humans use the descriptor
  – By inspecting the web page the person responsible for assigning the identifier can be sure that it does not refer to, say, Giacomo’s grandfather Domenico (who was also a composer of operas)
Machines use the identifier
  – The link is not resolved. Instead, simple lexical comparison is used. If the strings are identical, the subject is deemed to be the same and the topics are merged.
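The machine side of this mechanism can be sketched as follows. This is a minimal illustration, assuming topics are plain dictionaries with hypothetical `identifiers` and `names` keys; it is not the merging algorithm of any particular Topic Maps engine. Identifiers are compared as strings only and nothing is fetched from the network. The second identifier (on example.org) is invented for the example.

```python
def merge_topic_maps(*topic_maps: list) -> list:
    """Merge topics whose subject identifiers match by exact string comparison."""
    merged = []  # invariant: topics in this list have pairwise disjoint identifiers
    for topic in (t for tm in topic_maps for t in tm):
        current = {
            "identifiers": set(topic["identifiers"]),
            "names": set(topic["names"]),
        }
        remaining = []
        for existing in merged:
            if existing["identifiers"] & current["identifiers"]:
                # Same identifier string: deemed the same subject, so take the union.
                current["identifiers"] |= existing["identifiers"]
                current["names"] |= existing["names"]
            else:
                remaining.append(existing)
        remaining.append(current)
        merged = remaining
    return merged


map_a = [{"identifiers": {"http://psi.ontopedia.net/Puccini"}, "names": {"Puccini"}}]
map_b = [
    {"identifiers": {"http://psi.ontopedia.net/Puccini"}, "names": {"Giacomo Puccini"}},
    # Hypothetical identifier, used only for this example.
    {"identifiers": {"http://example.org/psi/Tosca"}, "names": {"Tosca"}},
]

result = merge_topic_maps(map_a, map_b)
puccini = next(t for t in result if "http://psi.ontopedia.net/Puccini" in t["identifiers"])
print(len(result), sorted(puccini["names"]))
```

Because the comparison is purely lexical, a differently cased or otherwise non-identical string would not trigger a merge in this sketch, which is the price paid for avoiding normalization and network access.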

Summary of the TM approach
Allows both direct and indirect identification of subjects
Direct identification is for information resources
  – “addressable subjects” only
  – subject locators (orig. subject addresses)
Indirect identification is for anything
  – both “addressable” and “non-addressable subjects”
  – subject identifiers and subject descriptors (orig. subject indicators)
There is also a construct called “item identifier”
  – used under the covers for mapping between syntax and internal representation

The Identity Crisis of the Web
Also known as the httpRange-14 issue

“Identity crisis”
Article on XML.com, September 2002, by Kendall Clark
  – http://www.xml.com/pub/a/2002/09/11/deviant.html
Based on a review of the work of the W3C’s Technical Architecture Group (TAG)
  – Architectural Principles of the World Wide Web
  – http://www.w3.org/TR/webarch/
Part of a larger discussion in the “Web community”
  – What do HTTP URIs identify? (Tim Berners-Lee)
  – Disambiguating RDF Identifiers (Sandro Hawke)
  – Four Uses of a URL (David Booth)
  – Web Proper Names (Harry Halpin & Henry S. Thompson)

The problem in a nutshell: What do URIs identify?
  – Sandro Hawke: “To date, RDF has not been clear about whether a URI like http://www.w3.org/Consortium identifies the W3C or a web page about the W3C. Throughout RDF, strings like http://www.w3.org/1999/02/22-rdf-syntax-ns#type are used with no consistent explanation of how they relate to the web.”
Why is this important? Because without clarity on this issue
  – The challenge of the Semantic Web cannot be solved
  – Web services cannot be implemented in a scalable manner
  – Ontologies and taxonomies will not be reusable
  – The goal of Global Knowledge Federation is unreachable
  – The problem of Infoglut will never go away

Introducing Eric Miller
Formerly of OCLC: Dublin Core, RDF
Later Technical Lead of the W3C’s Semantic Web Activity
“I see both RDF … as well as Topic Maps working toward enabling the Semantic Web”

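A simple example (1)
RDF Primer
  – http://www.w3.org/TR/2003/WD-rdf-primer-20030123/
Example 1: RDF/XML Describing Eric Miller
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">
  <contact:Person rdf:about="http://www.w3.org/People/EM/contact#me">
    <contact:fullName>Eric Miller</contact:fullName>
    <contact:mailbox rdf:resource="mailto:em@w3.org"/>
    <contact:personalTitle>Dr.</contact:personalTitle>
  </contact:Person>
</rdf:RDF>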

A simple example (2)
[Graph: the node http://www.w3.org/People/EM/contact#me, with the values Person, Eric Miller, Dr. and mailto:em@w3.org]

Resolving the URI
Clicking on this URL displays the following document:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://www.w3.org/2000/10/swap/pim/contact#"> Eric Miller, em@w3.org Eric Miller | Semantic Web Activity Lead W3C World Wide Web Consortium 614.763.1100
Now let’s add some DC metadata to this document:
  – dc:creation-date: April 2nd 2002

Encoding the metadata in RDF
Ex2: RDF/XML Describing the document about EM
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.0/"> Eric Miller 2002/06/04 Document about Eric Miller
[Diagram: http://www.w3.org/People/EM/contact#me appears both as the document (dc:creation-date April 2nd 2002, dc:creator Eric Miller) and as the Person (Eric Miller, Dr., mailto:em@w3.org)]

The cause of the problem
URIs are being used for two distinct purposes
  – To identify information resources
  – To identify the thing that an information resource describes or indicates
And we don’t know the difference!

Problem recognized in W3C
Architectural Principles of the World Wide Web:
2.2. Uses of URIs
  The two primary uses of URIs are (1) to compare identifiers and (2) to dereference a URI (that is, as identifiers and as addresses)
2.2.5. Consistent use of URIs
  It is confusing and costly when people use the same URI to refer to different resources (i.e., where there is some inconsistency in usage compared to the authoritative meaning of the resource). Suppose company A uses http://example.com/coolcompany to refer to CoolCompany's home page, while company B uses http://example.com/coolcompany to refer to CoolCompany. Company A then buys company B, but when they try to merge their databases, they cannot due to this inconsistent usage of the URI.
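The CoolCompany scenario can be made concrete with a small sketch. The two "databases" below are hypothetical dictionaries keyed by URI, and the `kind` field is an assumption introduced only to make the inconsistency visible; nothing here comes from the W3C document beyond the example URI itself.

```python
def find_inconsistent_uris(db_a: dict, db_b: dict) -> list:
    """Return URIs that both databases use, but for different kinds of thing."""
    shared = db_a.keys() & db_b.keys()
    return sorted(uri for uri in shared if db_a[uri]["kind"] != db_b[uri]["kind"])


# Company A uses the URI for the home page; company B uses it for the company.
company_a = {
    "http://example.com/coolcompany": {"kind": "web page", "title": "CoolCompany home page"},
}
company_b = {
    "http://example.com/coolcompany": {"kind": "company", "name": "CoolCompany"},
}

conflicts = find_inconsistent_uris(company_a, company_b)
print(conflicts)
```

In practice the difficulty is that real data rarely carries an explicit marker like `kind`, so a naive merge on the URI would silently combine statements about the page with statements about the company.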

Original solution (2003) was…
… ineffectual handwaving:
  2.2.5. Consistent use of URIs
  Good practice: Consistent URIs: Indiscriminate use of a URI undermines its value and interferes with people who rely on it.
In fairness, individuals in the Web and RDF communities have proposed solutions
  – Larry Masinter: tdb URN namespace (“Thing Described By”)
  – Sandro Hawke: Distinguish between “page mode” and “subject mode”
  – David Booth: Distinguish between “names”, “concepts”, “web locations,” and “documents”
Not taken seriously by the W3C
(There is also the hash/slash proposal)

How the situation came about
In the beginning the Web was a web of information resources
  – URIs (Uniform Resource Identifiers) were originally called UDIs (Uniform Document Identifiers)
  – The name was changed to avoid a narrow interpretation of “document”
  – But “resources” were still information resources
The most important kind of URI was the URL – the Uniform Resource Locator
  – A locator is the address of something (e.g., an information resource)
  – An address is a fairly robust way of identifying something
  – So URLs started to be regarded as identifiers
All of this worked fine until someone had the bright idea of using URLs to identify things that were not information resources...

Redefining “resource”
Imperceptibly, “resource” acquired a new meaning…
  – No longer just an information resource…
  – Came to mean anything whatsoever…
Practice codified in RFC 2396 in August 1998
  – “A resource can be anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., "today's weather report for Los Angeles"), and a collection of other resources. Not all resources are network "retrievable"; e.g., human beings, corporations, and bound books in a library can also be considered resources.” (RFC 2396)
This was a mistake
  – Because it obscures a fundamental ontological feature of the Web…
  – … that information resources have special significance

Information resources are special
They have locations within the system
  – A document has an address, a location
  – Any information resource has an address
The address can be used to identify the resource
But nothing else has an address
  – Eric Miller does not have a location within the computer system
This fundamental ontological fact is recognized in Topic Maps
  – Direct identification vs. indirect identification
It is not recognized in RDF, or in the Web Architecture in general

URIs as resource identifiers
[Diagram: subject locator]

URIs as arbitrary subject identifiers
[Diagram: subject descriptor]

httpRange-14: The TAG’s resolution
Agreed on 15 Jun 2005: The TAG provides advice to the community that they may mint "http" URIs for any resource provided that they follow this simple rule for the sake of removing ambiguity:
  – If an "http" resource responds to a GET request with a 2xx response, then the resource identified by that URI is an information resource;
  – If an "http" resource responds to a GET request with a 303 (See Other) response, then the resource identified by that URI could be any resource;
  – If an "http" resource responds to a GET request with a 4xx (error) response, then the nature of the resource is unknown.
This resolution (known as the “303 hack”) has not ended the debate...
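The three cases of the rule map directly onto a small function. The sketch below takes an already obtained HTTP status code as input and performs no network access; the function name and the returned labels are illustrative choices, and the final fallback for status codes the resolution does not mention is an assumption of this example.

```python
def classify_http_resource(status_code: int) -> str:
    """Apply the TAG's httpRange-14 rule to the status code of a GET response."""
    if 200 <= status_code <= 299:
        return "information resource"
    if status_code == 303:
        return "could be any resource"
    if 400 <= status_code <= 499:
        return "nature unknown"
    # The resolution quoted above says nothing about other status codes.
    return "not covered by the resolution"


for code in (200, 303, 404, 500):
    print(code, classify_http_resource(code))
```

Note that the rule only tells a client something after a GET request has been made, in contrast to the purely lexical comparison used for subject identifiers in Topic Maps.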

Published Subjects

Published Subjects
In order for identifiers to be reused, they must be made publicly available
  – A subject identifier that has been made available for use outside one particular application is called a published subject identifier (PSI)
  – Its descriptor is called a published subject descriptor (PSD)
Anyone can publish PSI sets
  – Adoption of PSI sets will be an evolutionary process based on trust
  – It will lead to greater and greater interoperability
  – between topic map applications, between Topic Maps and RDF, and across information and knowledge management in general
  – Check out http://psi.ontopedia.net (under development)

What is “Published Subjects”?
An extremely simple mechanism (or convention) for defining and sharing globally unique identifiers for arbitrary subjects
  – The identifier is an HTTP URI (i.e. a URL)
  – It’s called a published subject identifier (PSI)
It resolves to a web page
  – The contents of this page convey the identity of the subject in a form that is human-interpretable
  – This page is called a published subject descriptor (PSD)

The advantages of PSIs
URLs (HTTP URIs) are easier to use than, e.g., URNs
  – The resolution mechanism is now very widely supported
The PSI / PSD duality is simple and useful
  – Makes it possible for users to understand the publisher’s “intentionality”
Open and democratic
  – Anyone can create a PSI – no top-down supervision
Common sets of PSIs can emerge through consensus based on
  – Trust in the publisher (stability, longevity)
  – Degree of adoption in particular communities

A little terminology
Topic Maps standard (1999)
  – Public Subjects
  – Public Subject Descriptor
XTM 1.0 (2001)
  – Published Subject Indicator (PSI)
OASIS PubSubj TC (2003)
  – Published Subject Indicator (PSI)
  – Published Subject Identifier (PSID)
W3C Call for Action (2006)
  – Public Resource Identifier (PRI)
  – Public Resource Descriptor (PRD)
Current usage
  – PSI used as the abbreviation for the identifier
  – Confusion between identifier and indicator
My proposal
  – Published Subject Identifier (PSI)
  – Published Subject Descriptor (PSD)
Rationale
  – PSI is most often used for the identifier
  – The term “indicator” is a little too opaque
  – “Identifier” and “indicator” are too similar
  – One abbreviation for two different terms leads to confusion

Proposed definitions
Published Subjects
  – a paradigm for creating globally unique identifiers for arbitrary subjects
published subject
  – a subject for which a published subject identifier has been published
published subject identifier (PSI)
  – an HTTP URI that was created explicitly for the purpose of serving as the identifier for some subject
published subject descriptor (PSD)
  – an information resource to which a published subject identifier resolves and whose purpose is to convey to a human the identity of the subject thus identified, i.e. the intentionality of the publisher of the PSI

OASIS PubSubj TC (updated)
Requirements
  – A PSI must be a URI
  – A PSI must resolve to a PSD
  – A PSD must explicitly state its PSI
Recommendations
  – A PSD should provide human-readable metadata
  – A PSD may provide machine-readable metadata
  – Human-readable and machine-readable metadata should be consistent but need not be equivalent
  – A PSD should indicate its intended use as a PSD
  – A PSD should identify its publisher
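As a rough illustration, the requirements and some of the recommendations can be expressed as a checker. Everything in this sketch is an assumption made for the example: the `Descriptor` class is a hypothetical in-memory view of an already fetched PSD (not a structure defined by OASIS), and the URI test is deliberately crude, checking only that the string has a scheme.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class Descriptor:
    """Hypothetical view of a fetched PSD; field names are illustrative only."""
    stated_psis: frozenset          # PSIs the page explicitly declares for itself
    human_readable: str = ""        # text meant for a human reader
    declares_psd_use: bool = False  # whether it says it is intended as a PSD
    publisher: str = ""


def check_published_subject(psi: str, psd: Optional[Descriptor]):
    """Return (errors, warnings): errors for requirements, warnings for recommendations."""
    errors, warnings = [], []
    if not urlparse(psi).scheme:  # crude check: a URI needs at least a scheme
        errors.append("a PSI must be a URI")
    if psd is None:  # None stands for "the PSI did not resolve"
        errors.append("a PSI must resolve to a PSD")
        return errors, warnings
    if psi not in psd.stated_psis:
        errors.append("a PSD must explicitly state its PSI")
    if not psd.human_readable:
        warnings.append("a PSD should provide human-readable metadata")
    if not psd.declares_psd_use:
        warnings.append("a PSD should indicate its intended use as a PSD")
    if not psd.publisher:
        warnings.append("a PSD should identify its publisher")
    return errors, warnings


psd = Descriptor(
    stated_psis=frozenset({"http://psi.ontopedia.net/Puccini"}),
    human_readable="Giacomo Puccini, Italian composer",
    declares_psd_use=True,
)
errors, warnings = check_published_subject("http://psi.ontopedia.net/Puccini", psd)
print(errors, warnings)
```

The optional machine-readable metadata and the consistency recommendation are left out, since checking them would require assumptions about metadata formats that the slide does not specify.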

Frequently Asked Questions
What happens if two people create PSIs for the same subject?
  – This will happen, but it’s no catastrophe
  – Over time, stable sets of PSIs will emerge as de facto standards
  – In the interim, mapping between PSIs (or between PSI sets) is simple
  – With structured information, batch updates of identifiers are easy
How do I go about finding a PSI?
  – As of today there are no registries or lookup services
  – We envisage an open, distributed system based on, or similar to, UDDI
What if I disagree with assertions made by the publisher?
  – Doesn’t matter. You aren’t being asked to agree!
  – The assertions are only there to give you sufficient indication of the identity of the subject to be able to decide if it’s the same subject as the one you’re interested in.
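The claim that mapping between PSIs and batch-updating identifiers is easy can be illustrated with a short sketch. The dictionary-based topic representation and the example.org identifier are hypothetical; only http://psi.ontopedia.net/Puccini appears in the slides.

```python
def remap_identifiers(topics: list, mapping: dict) -> list:
    """Replace every identifier that has an entry in mapping; leave the others as they are."""
    return [
        {**topic, "identifiers": {mapping.get(i, i) for i in topic["identifiers"]}}
        for topic in topics
    ]


# Hypothetical mapping from one publisher's PSI to another publisher's PSI.
psi_mapping = {
    "http://example.org/psi/composers/puccini": "http://psi.ontopedia.net/Puccini",
}

topics = [
    {"names": {"Puccini"}, "identifiers": {"http://example.org/psi/composers/puccini"}},
    {"names": {"Tosca"}, "identifiers": {"http://example.org/psi/operas/tosca"}},
]

updated = remap_identifiers(topics, psi_mapping)
print(updated[0]["identifiers"])
```

After such a remapping, topics from different sources that now carry the same identifier string can be merged by ordinary lexical comparison.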

Discussion points
Should we only use HTTP URIs?
  – Only HTTP URIs have a widely supported resolution mechanism
What form should the URI take?
  – Readability, use of fragment identifiers, queries, etc.
Are Wikipedia URLs suitable?
  – If so, what about other sources, e.g. Ethnologue: http://www.ethnologue.com/show_language.asp?code=nsl
What information should a PSD contain?
  – Content of the descriptor itself, metadata
What kinds of discovery mechanism could be used?
  – Registries, search engines,...
What is the role of the PSI server?
  – In addition to publishing the PSD, what services might it offer?
Norwegian terminology
  – publisert tema, publisert temaidentifikator, publisert temadeskriptor?
