Identifiers and Types CS431 – Architecture of Web Information Systems

Identifiers and Types CS431 – Architecture of Web Information Systems Carl Lagoze – Cornell University – Feb. 09, 2004 Cornell CS 431

Identity, Change, Persistence
• Paradox: reality contains things that both persist and change over time
• Heraclitus and Plato: can you step into the same river twice?
• Ship of Theseus: over the years, the Athenians replaced each plank in the original ship of Theseus as it decayed, thereby keeping it in good repair. Eventually, there was not a single plank left of the original ship. So, did the Athenians still have one and the same ship that used to belong to Theseus?

Identity, Change, Persistence (diagram)

Identifiers
• Provide a key or handle linking abstract concepts to physical or perceptible entities
• Provide us with a necessary figment of persistence
• They are perhaps the one essential and common form of metadata
• Why bother?
   -> Finding things
   -> Referring to things
   -> Asserting ownership over things

I have lots of identifiers
• Carl Jay Lagoze, Dad, Hey you
• 123-456-7890 (SSN)
• 1234-5678-1234-1234 (Visa Card)
• FZBMLH (US Airways locator on March 21 flight to San Diego)

Identifier Issues
• Object granularity
• Identifier context
• Object atomicity
• Part/whole relationships
• Location independence
• Global uniqueness
• Persistent across time
• Human vs. machine generation
• Machine resolution
• Administration (centralized vs. decentralized)
• Intrinsic semantics
• Type specificity

Two common pre-digital identifiers
• ISBN (International Standard Book Number)
   -> Uniquely identifies every monograph (book)
   -> One ISBN for each format
      HP & SS hardback 0590353403
      HP & SS softcover 059035342X
   -> Number is semantically meaningful (components)
   -> International administration (>150 countries)
• ISSN (International Standard Serial Number)
   -> Uniquely identifies every serial (not issue or volume)
   -> Semantically meaningless
   -> International administration
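
One of the components that makes an ISBN "semantically meaningful" is its final check digit. The sketch below is an illustration, not part of the original slides: it checks the ISBN-10 rule that the digits, weighted 10 down to 1, must sum to a multiple of 11, with `X` standing for 10 in the last position. The function name `isbn10_is_valid` is hypothetical.

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Return True if the string is a well-formed ISBN-10 with a correct check digit."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # 'X' stands for 10 and is allowed only as the check digit
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (10 - position) * value  # weights run 10, 9, ..., 1
    return total % 11 == 0


# The two ISBNs quoted on the slide.
hardback_ok = isbn10_is_valid("0590353403")
softcover_ok = isbn10_is_valid("059035342X")
```

Both slide examples pass this check, while changing any single digit makes it fail.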

URI: Uniform Resource Identifier
• Generic syntax for identifiers of resources
• Defined by RFC 2396
• Syntax: <scheme>://<authority><path>?<query>
• Scheme
   -> Defines semantics of remainder of URI
   -> ftp, gopher, http, mailto, news, telnet
• Authority
   -> Authority governing namespace for remainder of URI
   -> Typically Internet-based server
• Path
   -> Identification of data within scope of authority
• Query
   -> String of information to be interpreted by authority
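
To make the four components concrete, here is a minimal sketch that splits a URI using Python's standard `urllib.parse.urlsplit`. The host `an.org` appears elsewhere in the slides; the path and query in the example are made up for illustration, and the helper name `uri_components` is hypothetical.

```python
from urllib.parse import urlsplit


def uri_components(uri: str) -> dict:
    """Split a URI into the four parts named in the generic syntax."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,      # defines how the rest is interpreted
        "authority": parts.netloc,   # typically an Internet-based server
        "path": parts.path,          # data within the authority's scope
        "query": parts.query,        # interpreted by the authority
    }


# Hypothetical path and query, used only to show each component.
example = uri_components("http://an.org/search/index.html?term=handles")
```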

Why is RFC 2396 so big?
• Character encodings
• Partial and relative URIs

URL: Uniform Resource Locator
• String representation of the location of a resource that is available via the Internet
• Uses URI syntax
• The scheme defines the access (protocol) method; the client uses it to determine which protocol to “speak”
   -> http://an.org/index.html - open socket to an.org on port 80 and issue a GET for index.html
   -> ftp://an.org/index.html - open socket to an.org on port 21, open ftp session, issue ftp get for index.html
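
The scheme-to-protocol step can be sketched as a small lookup. This is an illustrative sketch only: the table covers just the two schemes used on the slide, and the function name `connection_plan` is hypothetical. It makes no network connection; it only reports where a client would connect.

```python
from urllib.parse import urlsplit

# Default ports for the two schemes shown on the slide.
DEFAULT_PORTS = {"http": 80, "ftp": 21}


def connection_plan(url: str) -> tuple:
    """Return (host, port, protocol) a client would use for this URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS:
        raise ValueError(f"no protocol handler sketched for scheme {scheme!r}")
    # An explicit port in the URL overrides the scheme's default.
    port = parts.port or DEFAULT_PORTS[scheme]
    return parts.hostname, port, scheme


http_plan = connection_plan("http://an.org/index.html")
ftp_plan = connection_plan("ftp://an.org/index.html")
```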

URL Issues
• Persistence
• Location dependence
• Valid only at the item level
   -> What about works, expressions, manifestations?
• Multiple resolution
   -> “get the one that is cheapest, most reliable, most recent, most appropriate for my hardware, etc.”
• Non-digital resources?
• Disconnection from the entity

URC – Uniform Resource Characteristic (Catalog)
• Failed but interesting effort
• Multiple resolution
• Describe a resource by its characteristics
• Provide adequate bundled information about a resource (metadata) to create an identification block for any given resource (including locations)
• Exactly what is the common set of characteristics for describing different types of resources?
• Where are these characteristics stored?

Robust Hyperlinks
• Characteristic of document (metadata) is computed automatically via a fingerprint of its content
• “Lexical” signatures: the top n words of a document chosen for rarity, subject to heuristic filters to aid robustness
   -> “a TF-IDF-like” measure
   -> Five or so words are sufficient
• Can be used to locate the document (via search engine) after it is moved
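
A lexical signature of this kind can be sketched in a few lines. The version below is a simplified illustration, not the Robust Hyperlinks implementation: it scores each word by term frequency times a smoothed inverse document frequency over a small local corpus, and it omits the heuristic filters the slide mentions. The names `tokenize` and `lexical_signature` and the sample corpus are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lower-case the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(document: str, corpus: list, n: int = 5) -> list:
    """Pick the n words of `document` with the highest TF-IDF-like score."""
    term_counts = Counter(tokenize(document))
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(set(tokenize(other)))  # count each word once per document

    def score(term: str) -> float:
        # Smoothed IDF: words found in every document score zero.
        idf = math.log((1 + len(corpus)) / (1 + doc_freq[term]))
        return term_counts[term] * idf

    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(term_counts, key=lambda t: (-score(t), t))[:n]


corpus = [
    "the handle system names digital objects",
    "the river changes but the name persists",
    "the ship of theseus had every plank replaced",
]
signature = lexical_signature(corpus[2], corpus, n=3)
```

The resulting words would then be submitted as a search-engine query to relocate the document.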

Robust Hyperlinks – Why does this work?
• Number of terms on Web is reportedly close to 10,000,000
• If terms were distributed independently, the probability of 5 even moderately common terms occurring in more than one document is very small
• In fact, picking 3 terms restricted to those occurring in 100,000 documents works pretty well
• Many documents contain very infrequently used words
• There is lots of room for independence to be off, and to play with term selection for robustness, etc.
• http://www.cs.berkeley.edu/~phelps/Robust/
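
The independence argument can be checked with back-of-the-envelope arithmetic. In the sketch below, the figure of 100,000 documents per term and the choice of 3 terms come from the slide; the total Web size of 3 billion documents is an assumption added here purely to make the calculation concrete. Under independence, the expected number of documents that contain all the chosen terms by chance is N · (df / N)^k.

```python
def expected_matches(num_documents: int, docs_per_term: int, num_terms: int) -> float:
    """Expected count of documents containing all terms, assuming independence."""
    probability_one_term = docs_per_term / num_documents
    return num_documents * probability_one_term ** num_terms


# ASSUMPTION: the total number of Web documents is not given in the slides.
ASSUMED_WEB_SIZE = 3_000_000_000

estimate = expected_matches(ASSUMED_WEB_SIZE, docs_per_term=100_000, num_terms=3)
```

With these inputs the estimate is far below one document, which is why a handful of moderately rare terms is usually enough to single out a page.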

URN – Uniform Resource Name
• “globally unique, persistent names”
• Independence from location and location methods
• <URN> ::= "urn:" <NID> ":" <NSS>
   -> NID: namespace identifier
   -> NSS: namespace-specific string
• Examples:
   -> urn:ISSN:1234-5678
   -> urn:isbn:9044107642
   -> urn:doi:10.1000/140
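
The grammar above can be turned into a small parser. This is a sketch under assumptions: the character rules for the NID (letters, digits, hyphen, at most 32 characters, case-insensitive) follow my reading of RFC 2141 and should be checked against the specification, and the NSS is accepted as any non-empty string. The name `parse_urn` is hypothetical.

```python
import re

# "urn:" <NID> ":" <NSS>; NID rules are an assumption based on RFC 2141.
URN_PATTERN = re.compile(
    r"^urn:(?P<nid>[a-z0-9][a-z0-9-]{0,31}):(?P<nss>.+)$",
    re.IGNORECASE,
)


def parse_urn(urn: str) -> tuple:
    """Split a URN into (namespace identifier, namespace-specific string)."""
    match = URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"not a URN: {urn!r}")
    # The NID is treated as case-insensitive, so normalize it to lower case.
    return match.group("nid").lower(), match.group("nss")


issn_urn = parse_urn("urn:ISSN:1234-5678")
doi_urn = parse_urn("urn:doi:10.1000/140")
```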

Why isn’t DNS sufficient? (parenthetical comment)
• Issue of semantic vs. non-semantic names
• Changing ownership
• Hierarchical legacy of DNS is sometimes inappropriate

Handles: Names for Internet Resources
• Naming system for location-independent, persistent names
• One name, multiple resolutions
• http://www.handle.net
The resource named by a Handle can be:
• A library item
• A collection of library items
• A catalog record
• A computer
• An e-mail address
• A public key for encryption
• etc.

Syntax of Handles
• <naming_authority>/<locally_unique_string> or
• hdl:<naming_authority>/<locally_unique_string>
Examples
• 10.1234/1995.02.12.16.42.21;9 (date-time stamp)
• cornell.cs/cstr-94.45 (mnemonic name)
• loc/a43v-8940cgr (random string)
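
A minimal sketch of splitting a Handle into its two parts, accepting either form of the syntax. The function name `parse_handle` is hypothetical, and the sketch splits on the first slash only, so any later slashes stay in the local string.

```python
def parse_handle(handle: str) -> tuple:
    """Split a Handle into (naming_authority, locally_unique_string)."""
    # The "hdl:" prefix is optional; strip it if present.
    text = handle[4:] if handle.lower().startswith("hdl:") else handle
    authority, separator, local_name = text.partition("/")
    if not separator or not authority or not local_name:
        raise ValueError(f"not a Handle: {handle!r}")
    return authority, local_name


mnemonic = parse_handle("hdl:cornell.cs/cstr-94.45")
timestamped = parse_handle("10.1234/1995.02.12.16.42.21;9")
```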

Example of a Handle and its Data Used to Identify Two Locations
Handle: loc.ndlp.amrlp/123456
Data type    Handle data
URL          http://www.loc.gov/.....
RAP          loc/repository-1r4589

Use of Handles in a Digital Library
(Diagram labels: Repository, User interface, Search System, Handle System)

Replication for Performance and Reliability
Example: the Global Handle System (map labels: Los Angeles, CA; Washington, DC)

Proxies to Resolve Handles
A Web browser can resolve Handles via a proxy. For example, the following URL can be used to resolve the Handle loc.ndlp.amrlp/3a16616:

Proxy Resolution
(Diagram labels: WWW browser, URL to Proxy, Proxy server, hdl.handle.net, Handle System, URL, HTTP server, Resource)
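
The browser-to-proxy step can be sketched as URL construction. This sketch rests on an assumption: that the proxy named in the diagram, `hdl.handle.net`, accepts the Handle as the path of an ordinary HTTP URL. The function name `proxy_url` is hypothetical, and no request is actually sent.

```python
from urllib.parse import quote

# Proxy host taken from the diagram label; the URL layout is an assumption.
PROXY_HOST = "hdl.handle.net"


def proxy_url(handle: str, proxy_host: str = PROXY_HOST) -> str:
    """Build the URL a browser would send to the proxy for this Handle."""
    # Percent-encode anything unsafe, but keep the authority/name slash.
    return f"http://{proxy_host}/{quote(handle, safe='/')}"


browser_request = proxy_url("loc.ndlp.amrlp/3a16616")
```

The proxy would then query the Handle System and redirect the browser to the URL stored for that Handle.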

DOI – Digital Object Identifier
• Technology and social infrastructure for naming
• Established by publishers for persistent naming of entities (articles, journals, conference proceedings)
• Cognizant of FRBR elements
• Underlying technology is the Handle System
• “Persistent” names
   -> Persistence is fortified by social underpinnings
   -> Rules for establishing registration agencies
• Multiple resolution
• Registration mechanism has metadata associated with it
• doi:10.1000/186

OCLC's Persistent URL (PURL)
• A PURL is a URL
   -> Is fully compatible with today's Internet browsers
   -> Users need no special software
• Has some of the desirable features of URNs
• Lacks some desirable features of URNs
   -> Resolves only to a URL
   -> Does not support multiple resolution
• Developed by OCLC
• Software openly available
• http://www.purl.org

PURL Syntax
• A PURL is a URL.
• PURL resolvers use standard HTTP redirects to return the actual URL.
• http://purl.oclc.org/keith/home
   -> protocol: http
   -> resolver address: purl.oclc.org
   -> name: /keith/home
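
Because a PURL is an ordinary URL, its three parts can be pulled apart with the standard URL parser. A minimal sketch, using the slide's own example; the helper name `split_purl` is hypothetical.

```python
from urllib.parse import urlsplit


def split_purl(purl: str) -> dict:
    """Split a PURL into protocol, resolver address, and name."""
    parts = urlsplit(purl)
    return {
        "protocol": parts.scheme,
        "resolver": parts.netloc,
        "name": parts.path,
    }


purl_parts = split_purl("http://purl.oclc.org/keith/home")
```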

PURL Namespaces
• A PURL provides a local (not global) namespace
• http://purl.oclc.org/keith/home is different from http://purl.stanford.edu/keith/home

OCLC PURL Resolution
(Diagram labels: WWW browser, PURL, PURL server, PURL database, URL, HTTP server, Resource)
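
The resolution flow in the diagram can be imitated with an in-memory table standing in for the PURL database. This is a toy sketch, not OCLC's software: the target URL is invented for illustration, the 302 status is one plausible choice of HTTP redirect, and the names `PURL_DATABASE` and `resolve_purl` are hypothetical. Keying the table on both resolver and name also shows why the same name under two resolvers denotes two different PURLs.

```python
from urllib.parse import urlsplit

# Stand-in for the PURL database; the target URL is hypothetical.
PURL_DATABASE = {
    ("purl.oclc.org", "/keith/home"): "http://example.org/keith/index.html",
}


def resolve_purl(purl: str, database: dict = PURL_DATABASE) -> tuple:
    """Return the (status, headers) a PURL server might send back."""
    parts = urlsplit(purl)
    target = database.get((parts.netloc, parts.path))
    if target is None:
        return 404, {}
    # A standard HTTP redirect tells the browser where the resource lives now.
    return 302, {"Location": target}


oclc_response = resolve_purl("http://purl.oclc.org/keith/home")
stanford_response = resolve_purl("http://purl.stanford.edu/keith/home")
```

When the resource moves, only the database entry changes; the PURL that readers have cited stays the same.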

Making links context sensitive
• Why?
   -> “Appropriate item” differs for each user
   -> Licensing
   -> Locality
   -> Some users may want a choice (abstract, full text, etc.)
• Conceptualize the link as a service rather than as the object targeted
• OpenURL
   -> Transports metadata about the work to…
   -> A localized service that interprets the metadata and provides contextualized choices to the user

OpenURL linking
(Diagram labels: link source; provision of OpenURL; OpenURL: transportation of metadata & identifiers; linking server: resolution of metadata & identifiers into services; context-sensitive, user-specific links; link destinations)

OpenURL 0.1 syntax
http://www.mysrv.org/menu?id=doi:10.111/12345&genre=article&aulast=Weibel&aufirst=Stu&ISSN=35345353&year=2001&volume=14&issue=3&spage=44&pid=2829393&sid=OCLC:Inspec
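
A linking server receives this URL and reads the metadata out of the query string. The sketch below shows that first step with Python's standard `urllib.parse`; it uses the slide's example URL unchanged, and the function name `openurl_description` is hypothetical. What the server then does with the fields (choosing services for the user) is outside the sketch.

```python
from urllib.parse import parse_qs, urlsplit

# The example from the slide, joined into a single URL.
OPENURL = (
    "http://www.mysrv.org/menu?id=doi:10.111/12345&genre=article"
    "&aulast=Weibel&aufirst=Stu&ISSN=35345353&year=2001&volume=14"
    "&issue=3&spage=44&pid=2829393&sid=OCLC:Inspec"
)


def openurl_description(url: str) -> tuple:
    """Return (linking server host, metadata fields) carried by an OpenURL."""
    parts = urlsplit(url)
    # parse_qs returns a list per key; each key occurs once here.
    fields = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.netloc, fields


linking_server, metadata = openurl_description(OPENURL)
```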

Why haven’t URNs caught on beyond certain communities?
• Complexity of systems
• One size does not fit all - special-purpose URN schemes have been successful, e.g., PubMed ID, Astrophysics BibCode
• No guarantee of persistence – longevity is an organizational, not a technical, issue
   -> Requires well-regulated administrative systems
• Absence of “killer” applications – although reference linking is emerging

Types: Not all data and content is the same
• Format or Genre
   -> How you sense it
   -> What you can do with it
   -> e.g., audio, video, map, book
• Type
   -> What you need to process it
   -> What is its bit layout
   -> Compression or encoding

Multipurpose Internet Mail Extensions (MIME)
• RFC 822 – defines the textual format of email messages
• RFC 2045-2049 – extend textual email to allow
   -> Character sets other than US-ASCII
   -> An extensible set of formats for non-textual message bodies
   -> Definition of multi-part mail (attachments)
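
The three extensions can be seen together in one message built with Python's standard `email` package. This is an illustrative sketch: the addresses, subject, and the six-byte stand-in for a GIF file are all made up, and the function name `build_message` is hypothetical.

```python
from email.message import EmailMessage


def build_message() -> EmailMessage:
    """Build a multi-part message with a text body and one attachment."""
    message = EmailMessage()
    message["From"] = "sender@example.org"      # hypothetical address
    message["To"] = "reader@example.org"        # hypothetical address
    message["Subject"] = "Identifiers and types"
    # A body containing a character outside US-ASCII.
    message.set_content("Plain-text body with a non-ASCII character: é")
    # A non-textual part; the bytes are a placeholder, not a real image.
    message.add_attachment(
        b"GIF89a", maintype="image", subtype="gif", filename="figure.gif"
    )
    return message


message = build_message()
part_types = [part.get_content_type() for part in message.iter_parts()]
```

Adding the attachment turns the message into a `multipart/mixed` container whose parts each carry their own content type.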

MIME Types
• Two-part type hierarchy
• Top-level types
   -> text
   -> audio
   -> video
   -> image
   -> application
   -> multipart
• Examples
   -> text/plain
   -> image/gif
   -> application/postscript
• Extensions are handled by IANA
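
The two-part hierarchy amounts to splitting on a single slash. A minimal sketch follows; the set of top-level types is just the list from the slide, not the full IANA registry, and the function name `split_mime_type` is hypothetical.

```python
# Top-level types as listed on the slide (not the complete registry).
TOP_LEVEL_TYPES = {"text", "audio", "video", "image", "application", "multipart"}


def split_mime_type(mime_type: str) -> tuple:
    """Split 'type/subtype' and check the top-level type against the slide's list."""
    top_level, separator, subtype = mime_type.strip().lower().partition("/")
    if not separator or not subtype:
        raise ValueError(f"expected type/subtype, got {mime_type!r}")
    if top_level not in TOP_LEVEL_TYPES:
        raise ValueError(f"unknown top-level type {top_level!r}")
    return top_level, subtype


gif_type = split_mime_type("image/gif")
postscript_type = split_mime_type("application/postscript")
```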

MIME in HTTP (Content Negotiation)
• Accept in request-header
   -> Accept: text/plain; q=0.5, text/html, text/x-dvi; q=0.8, text/xml
   -> text/html and text/xml are preferred (no q value means q=1), then text/x-dvi (q=0.8), then text/plain (q=0.5)
• Content-Type in response-header
   -> Content-Type: text/html
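
The preference order implied by an Accept header can be computed mechanically from its q values, where a missing q counts as 1. The sketch below is a simplified illustration: it ignores wildcards such as `*/*` and any parameters other than `q`, and the function name `rank_accept` is hypothetical.

```python
def rank_accept(header: str) -> list:
    """Order the media types in an Accept header from most to least preferred."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        media_type, *params = [piece.strip() for piece in item.split(";")]
        quality = 1.0  # a media type without a q parameter has q=1
        for param in params:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                quality = float(value)
        # Sort by descending quality, then by original position in the header.
        ranked.append((-quality, position, media_type))
    return [media_type for _, _, media_type in sorted(ranked)]


preference = rank_accept("text/plain; q=0.5, text/html, text/x-dvi; q=0.8, text/xml")
```

A server would walk this list and return the first representation it can produce, naming it in the Content-Type response header.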

MIME is too limited
• Two-level type depth is simplistic
• Multi-media documents
• “Documents” that have many types or views
   -> FEDORA