Presentation is loading. Please wait.

Presentation is loading. Please wait.

Cornell CS 431 Identifiers and Types CS431 – Architecture of Web Information Systems Carl Lagoze – Cornell University – Feb. 7 2005.

Similar presentations


Presentation on theme: "Cornell CS 431 Identifiers and Types CS431 – Architecture of Web Information Systems Carl Lagoze – Cornell University – Feb. 7 2005."— Presentation transcript:

1 Cornell CS 431 Identifiers and Types CS431 – Architecture of Web Information Systems Carl Lagoze – Cornell University – Feb. 7 2005

2 Cornell CS 431 Identity Change Persistence Paradox: reality contains things that persist and change over time –Heraclitus and Plato: can you step into the same river twice? –Ship of Theseus: over the years, the Athenians replaced each plank in the original ship of Theseus as it decayed, thereby keeping it in good repair. Eventually, there was not a single plank left of the original ship. So, did the Athenians still have one and the same ship that used to belong to Theseus

3 Cornell CS 431 Identity Change Persistence

4 Cornell CS 431 Identifiers Provide a key or handle linking abstract concepts to physical or perceptible entities Provide us with a necessary figment of persistence They are perhaps the one essential and common form of metadata Why bother? –Finding things –Referring to things (Citations) –Asserting ownership over things

5 Cornell CS 431 I have lots of identifiers Carl Jay Lagoze, Dad, Hey you 123-456-7890 (SSN) 1234-5678-1234-1234 (Visa Card) FZBMLH (US Airways locator on January 18 flight to San Diego)

6 Cornell CS 431 Identifier Issues Object granularity Identifier Context –Object atomicity –Part/whole relationships Location independence Global uniqueness Persistent across time Human vs. machine generation Machine resolution Administration (centralized vs. decentralized) Intrinsic semantics Type specificity

7 Cornell CS 431 Two common pre-digital identifiers ISBN (International Standard Book Number) –Uniquely identifies every monograph (book) –One ISBN for each format HP & SS hardback 0590353403 HP & SS softcover 059035342X –Number is semantically meaningful (components) –International administration (>150 countries) ISSN (International Standard Serial Number) –Uniquely identifies every serial (not issue or volume) –Semantically meaningless –International administration

8 Cornell CS 431 URI: Universal Resource Identifier Generic syntax for identifiers of resources Defined by RFC 2396RFC 2396 Syntax: :// ? –Scheme Defines semantics of remainder of URI ftp, gopher, http, mailto, news, telnet –Authority Authority governing namespace for remainder of URI Typically Internet-based server –Path Identification of data within scope of authority –Query String of information to be interpreted by authority

9 Cornell CS 431 Why is RFC 2396 so big? Character encodings Partial and relative URIs

10 Cornell CS 431 URL: Universal Resource Locator String representation of the location for a resource that is available via the Internet Use URI syntax Scheme has function of defining the access (protocol) method. Used by client to determine the protocol to “speak”. – http://an.org/index.html - open socket to an.org on port 80 and issue a GET for index.html –ftp://an.org/index.html - open socket to an.org on port 21, open ftp session, issue ftp get for index.html….

11 Cornell CS 431 URL Issues Persistence Location dependence Valid only at the item level –What about works, expressions, manifestations Multiple resolution –“get the one that is cheapest, most reliable, most recent, most appropriate for my hardware, etc.” Non-digital resources? Sub-parts

12 Cornell CS 431 URC – Uniform Resource Characteristic (Catalog) Failed but interesting effort –Multiple resolution –Describe resource by its characteristics Provide adequate bundled information about a resource (metadata) to create identification block for any given resource (including locations) –Exactly what are the common set of characteristics for describing different types of resources? –Where are these characteristics stored?

13 Cornell CS 431 Robust Hyperlinks Characteristic of document (metadata) is computed automatically via fingerprint of its content. “Lexical” signatures: The top n words of a document chosen for rarity, subject to heuristic filters to aid robustness. –“a TF-IDF-like” measure –Five or so words are sufficient Can be used to locate document (via search engine) after it is moved

14 Cornell CS 431 Robust Hyperlinks – Why does this work? Number of terms on Web is reportedly close to 10,000,000. If terms were distributed independently, the probability of 5 even moderately common terms occurring in more than one document is very small. –In fact, picking 3 terms restricted to those occurring in 100,000 documents works pretty well. –Many documents contain very infrequently used words. –There is lots of room for independence to be off, and to play with term selection for robustness, etc.. http://www.cs.berkeley.edu/~phelps/Robust/

15 Cornell CS 431 URN – Universal Resource Name “globally unique, persistent names” Independence from location and location methods ::= "urn:" ":" NID : namespace identifier NSS : namespace-specific string examples: urn:ISSN:1234-5678 urn:isbn:9044107642 urn:doi:10.1000/140

16 Cornell CS 431 Why isn’t DNS sufficient (parenthetical comment) Issue of semantic vs. non-semantic names Changing ownership Hierarchical legacy of DNS is sometimes inappropriate

17 Cornell CS 431 Handles: Names for Internet Resources Naming system for location-independent, persistent names One name, multiple resolutions http://www.handle.net The resource named by a Handle can be: A library item A collection of library items A catalog record A computer An e-mail address A public key for encryption etc., etc., etc.....

18 Cornell CS 431 / or hdl: / Examples 10.1234/1995.02.12.16.42.21;9 (date-time stamp) cornell.cs/cstr-94.45 (mnemonic name) loc/a43v-8940cgr(random string) Syntax of Handles

19 Cornell CS 431 Example of a Handle and its Data Used to Identify Two Locations URL loc.ndlp.amrlp/123456 http://www.loc.gov/..... Handle Data typeHandle data RAPloc/repository-1r4589

20 Cornell CS 431 Use of Handles in a Digital Library Repository Handle System Search System User interface

21 Cornell CS 431 Replication for Performance and Reliability Example: the Global Handle System Washington, DC Los Angeles, CA

22 Cornell CS 431 Proxies to Resolve Handles A Web browser can resolve Handles via a proxy. For example, the following URL can be used to resolve the Handle loc.ndlp.amrlp/3a16616:

23 Cornell CS 431 Proxy Resolution WWW browser HTTP server URL to Proxy URL Resource Handle System hdl.handle.net Proxy server

24 Cornell CS 431 DOI – Digital Object Identifier Technology and social infrastructure for naming Established by publishers for persistent naming of entities (articles, journals, conference proceedings) Cognizant of FRBR elements Underlying technology is handle system –“persistent” names Persistence is fortified by social underpinnings Rules for establishing registration agencies –Multiple resolution –Registration/mechanism has metadata associated with it doi:10.1000/186

25 Cornell CS 431 OCLC's Persistent URL (PURL) A PURL is a URL -> Is fully compatible with today's Internet browsers -> Users need no special software Has some of the desirable features of URNs Lacks some desirable features of URNs -> Resolves only to a URL -> Does not support multiple resolution Developed by OCLC Software openly available http://www.purl.org

26 Cornell CS 431 PURL Syntax A PURL is a URL. PURL resolvers use standard http redirects to return the actual URL. http://purl.oclc.org/keith/home protocolresolver addressname

27 Cornell CS 431 PURL Namespaces A PURL provides a local (not-global namespace) http://purl.oclc.org/keith/home is different from http://purl.stanford.edu/keith/home

28 Cornell CS 431 OCLC PURL Resolution WWW browser PURL server HTTP server PURL database PURL URL Resource

29 Cornell CS 431 Making links context sensitive Why? –“Appropriate item” differs for each user –Licensing locality –Some users may want a choice (abstract, full text, etc.) Conceptualize link as service rather than object targeted. OpenURL –Transports metadata about the work to… –A localized service that interprets the metadata and provides contextualized choices to the user.

30 Cornell CS 431 link source. user-specific resolution of metadata & identifiers into services reference OpenURL linking OpenURL linking server provision of OpenURL link destination link destination link destination link destination transportation of metadata & identifiers context-sensitive

31 Cornell CS 431 http://www.mysrv.org/menu? id=doi:10.111/12345& genre=article& aulast=Weibel&aufirst=Stu&ISSN=35345353 &year=2001&volume=14&issue=3&spage=44& pid=2829393& sid=OCLC:Inspec OpenURL 0.1 syntax

32 Cornell CS 431 Why haven’t URNs caught on beyond certain communities? Complexity of systems One size does not fit all - special purpose URN schemes have been successful, e.g., PubMed ID, Astrophysics BibCode No guarantee of persistence – longevity is an organizational not technical issue Requires well-regulated administrative systems Absence of “killer” applications – although reference linking is emerging

33 Cornell CS 431 Types: Not all data and content is the same Format or Genre –How you sense it –What you can do with it –E.G. – audio, video, map, book Type –What you need to process it –What is its bit layout Compression or encoding

34 Cornell CS 431 Multipurpose Internet Mail Extensions RFC 822 – define textual format of email messages RFC 2045-2049 – Extend textual email to allow –Character sets other than US-ASCII –Extensible set of non-ASCII types for message bodies –Definition of multi-part mail (attachments)

35 Cornell CS 431 MIME Types Two part type hierarchy –Top level type text audio video image application multipart –Examples text/plain image/gif application/postscript Extensions are handled by IANA

36 Cornell CS 431 MIME in HTTP (Content Negotiation) Accept in request-header –Accept: text/plain; q=0.5, text/html, text/x-dvi; q=0.8, text/xml text/plain and text/xml are preferred, then text/x-dvi, then text/html Content-Type in response-header –Content-Type: text/html

37 Cornell CS 431 MIME is too limited Two-level type depth is simplistic Multi-media documents “Documents” that have many types or views –FEDORA


Download ppt "Cornell CS 431 Identifiers and Types CS431 – Architecture of Web Information Systems Carl Lagoze – Cornell University – Feb. 7 2005."

Similar presentations


Ads by Google