Old Dominion University Department of Computer Science

Presentation transcript:

Introduction to Digital Libraries Week 6: Early DLs and the Kahn/Wilensky Framework Old Dominion University Department of Computer Science CS 695 Fall 2003 Michael L. Nelson <mln@cs.odu.edu> 10/02/03

DL Architectural Review The purpose of this week’s lecture is to provide background and concepts in preparation for reviewing the architectures of various DLs. Assumptions: TCP/IP connectivity (no dial-up services, CD-ROMs, etc.); distribute “actual stuff” (report, software, etc.): no abstract servers, etc.

DL Architecture History Two main approaches: build special client and server (generally using Motif/X11, Tcl/Tk, etc.), and use TCP/IP as the transport protocol only pros: rich functionality cons: high development cost, client distribution problem observation: many of these projects spent more time building the interfaces, protocols, searching, etc. than populating their DL!

DL Architecture History Two main approaches (cont’d): use standard, higher level, orthogonal TCP/IP protocols: SMTP, FTP, Gopher, WAIS, http, etc. con: less functionality pros: less development cost, uses commonly available clients observation: this approach is now the most common

Early TCP/IP DLs Netlib http://www.netlib.org/ begun in 1985, distributing mathematical software via e-mail (SMTP) other access methods and protocols added (ftp, X11 client, http)

Netlib Accesses from: http://www.dlib.org/dlib/september95/netlib/09browne.html

Netlib Accesses from: http://www.netlib.org/utk/misc/counts.html

Netlib Accesses from: http://www.netlib.org/utk/misc/counts.html

Early TCP/IP DLs Physics pre-print server originally: http://xxx.lanl.gov/ now: http://www.arXiv.org/ now run from Cornell U, with support from LANL begun in 1991 as an e-mail service to exchange TeX source of pre-prints TeX (and LaTeX, etc.) is a text formatting environment popular in math, physics, CS, etc. ftp, http access added shortly

arXiv usage, 1/94 – 6/97 Red - Number of connections in each week Blue - Number of hosts connecting that week (divide by 10 for correct number) Green - Number of new hosts that week (divide by 10) from: http://arXiv.org/cgi-bin/show_weekly_graph

arXiv usage, 7/97 – 2/01 Red - Number of connections in each week Blue - Number of hosts connecting that week (divide by 10 for correct number) Green - Number of new hosts that week (divide by 10) from: http://arXiv.org/cgi-bin/show_weekly_graph

arXiv usage, 7/97 – 9/03 Red - Number of connections in each week Blue - Number of hosts connecting that week (divide by 10 for correct number) Green - Number of new hosts that week (divide by 10) from: http://arXiv.org/show_weekly_graph

Early TCP/IP DLs Anonymous FTP used by numerous computer science departments and related laboratories for the distribution of both tech reports and software ftp://techreports.larc.nasa.gov/ begun in late 1992 http access added in 1994 see TMs 4567 and 109162 from http://www.cs.odu.edu/~mln/pubs/all.html

Characteristics of Early TCP/IP DLs Useful could get the “thing” that you were looking for Constrained by transport protocol SMTP, FTP, etc. interface inherently “clunky” searching, formatting, sophisticated browsing, etc. difficult to implement Small scale would the same systems work well if the holdings went from 100’s or 1000’s to millions?

Early HTTP DLs Initial http implementations / conversions pretty much provided incremental steps in DL improvement: a “nice” ftp interface, maybe with better searching and browsing, but the nature of the DLs changed little. LTRS is an example of an http DL that is really: FTP+Searching(WAIS)+Browsing http://techreports.larc.nasa.gov/ltrs/

Early HTTP DLs But http is a very general transport protocol, and it is possible to build even higher level protocols on top of it Combine this with the more expressive WWW client, and there is a lot of potential Dienst (http://www.cs.cornell.edu/cdlrg/dienst/protocols/deprecated/Dienst%20protocol%20(4_1).htm), Cornell CS TR95-1514 builds an actual DL protocol on top of http 1994 -- the first to do so?

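DL Sophistication Over Time [Figure: sophistication (vertical axis) plotted against time (horizontal axis). Labels recovered from the figure: e-mail; ftp / gopher; http: LTRS, e-print, Netlib, etc.; http: Dienst; Kahn / Wilensky implementations; “?”; a “We Are Here” marker; and “Traditional IR, Databases, CD-ROMs, etc.”]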

A Framework for Distributed Digital Object Services More commonly known as the Kahn/Wilensky Framework (KWF) A high level document, not even detailed enough to be an architecture, that defines some of the key concepts and terms that form the basis for the next generation of DLs DLs beyond “make the ftp server look nice”

A Friendlier Intro to KWF Key points from Bill Arms’s paper (http://www.dlib.org/dlib/July95/07arms.html): The underlying architecture should be separate from the content stored in the library; Names and identifiers are the basic building block for the digital library; Digital library objects are more than collections of bits; The digital library object that is used is different from the stored object; Repositories must look after the information they hold; Users want intellectual works, not digital objects. Prelude to OAIS, digital preservation, etc. Well… maybe, or maybe not!

Key KWF Terms digital objects (DOs): a unit of exchange for the DL with a particular data structure and characteristics; repository: the place where DOs live; handles: a unique, persistent name for a DO

KWF [Figure, read as a flow:] An Originator makes a Digital Object, which consists of Data and a Handle (which comes from a handle generator). The Digital Object can go in a Repository, which is accessed by the Repository Access Protocol (RAP) and which registers the DO’s handle with a Handle Server, at which point the DO becomes a registered DO.

Digital Objects Digital object = data + key-metadata data is typed; core types include: bit-sequence / set-of-bit-sequences digital-object / set-of-digital-objects handle / set-of-handles other types can be defined, and registered with a global type registry definition and registration left undefined similar to MIME? key-metadata includes handle, possibly other metadata (left undefined in KWF)

Digital Objects Typed data; example from KWF: a DO subtype: computer-science-tech-report with metadata: author, institution, series, etc. Composite DOs: a DO with data of type digital-object; non-composite DOs are elemental DOs; composite DOs can be used to collect similar works together, e.g. a composite DO that contains a DO for each work of Shakespeare...
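The "data + key-metadata" structure from the two Digital Objects slides can be sketched in a few lines of Python. This is only an illustration, not anything KWF specifies: the class name, field names, and the string spellings of the core types are assumptions, and treating set-of-digital-objects as composite alongside digital-object is an extra assumption beyond what the slide states.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Core data types listed on the slide; the exact spellings are illustrative.
CORE_TYPES = {
    "bit-sequence", "set-of-bit-sequences",
    "digital-object", "set-of-digital-objects",
    "handle", "set-of-handles",
}


@dataclass
class DigitalObject:
    """Hypothetical model: digital object = data + key-metadata."""
    handle: str        # key-metadata always includes the handle
    data_type: str     # one of CORE_TYPES (no global type registry here)
    data: Any
    other_key_metadata: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Stand-in for a type-registry lookup, which KWF leaves undefined.
        if self.data_type not in CORE_TYPES:
            raise ValueError(f"unregistered data type: {self.data_type}")

    def is_composite(self) -> bool:
        # Assumption: a set of DOs counts as composite, like a single DO.
        return self.data_type in {"digital-object", "set-of-digital-objects"}


report = DigitalObject("NASA.LaRC/tm112871", "bit-sequence", b"report bytes")
collection = DigitalObject("ODU.CS/collection", "set-of-digital-objects", [report])
```

Here `report` is an elemental DO and `collection` is a composite DO; the handle `ODU.CS/collection` is made up for the example.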

A Digital Object figure 2 from http://www.dlib.org/dlib/July95/07arms.html

Changing Digital Objects Mutable DOs can be changed once placed in a repository; key-metadata cannot be changed -- the DO’s handle does not change! Immutable DOs cannot be changed once placed in a repository; however, they can be deleted.
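A minimal sketch of how a repository might enforce these rules, assuming an in-memory store keyed by handle. The class, method, and exception names are hypothetical; KWF does not define them.

```python
class ImmutableObjectError(Exception):
    """Raised when a caller tries to change an immutable DO."""


class Repository:
    """Toy in-memory repository; all names here are illustrative."""

    def __init__(self) -> None:
        self._store = {}  # handle -> {"data": ..., "mutable": bool}

    def deposit(self, handle, data, mutable=False):
        if handle in self._store:
            raise KeyError(f"handle already in use: {handle}")
        self._store[handle] = {"data": data, "mutable": mutable}

    def update(self, handle, data):
        record = self._store[handle]
        if not record["mutable"]:
            raise ImmutableObjectError(handle)
        # Only the data changes; the handle (key-metadata) stays the same.
        record["data"] = data

    def delete(self, handle):
        # Deletion is allowed for mutable and immutable DOs alike.
        del self._store[handle]

    def access(self, handle):
        return self._store[handle]["data"]
```

Note that `update` never takes a new handle as an argument, which is one simple way to guarantee that the handle does not change.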

Uniform Resource Identifiers [Figure:] URI (RFC 2396); URL (RFC 1738); URN (RFC 2141)

URIs & URNs registered URI schemes: http://www.iana.org/assignments/uri-schemes registered URN namespaces: http://www.iana.org/assignments/urn-namespaces

From RFC 2396 “A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URI that identify resources via a representation of their primary access mechanism (e.g., their network "location"), rather than identifying the resource by name or by some other attribute(s) of that resource. The term "Uniform Resource Name" (URN) refers to the subset of URI that are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable.”

URLs URLs are tightly coupled with the physical location of an object, and are thus more likely to be transient “Error 404 - File not found” Tricks to make URLs more durable: plan ahead when constructing web site structure use good DNS CNAMEs symbolic links on filesystems http server redirects

URNs But with all the tricks available, URLs are not suitable for archival use in DLs how long will this URL (a report in LTRS): http://techreports.larc.nasa.gov/ltrs/PDF/1997/tm/NASA-97-tm112871.pdf be good? how to handle mirroring, replication, etc.? “appropriate copy” problem… mnemonic: URL = IP address (128.82.5.5173) URN = IP name (blearg.cs.odu.edu)

Handles Handles can be thought of as a Uniform Resource Name (URN) implementation http://www.dlib.org/dlib/february96/02arms.html for historical comparison of efforts http://www.handle.net/ contains info about the handle system persistence location independence multiple instances Handles are of the general form: GlobalAuthority.LocalAuthority/LocallyUniqueString or, for example: NASA.LaRC/tm112871

NASA.LaRC/tm112871 “NASA” would be assigned from the global naming authority; “LaRC” would be created by whoever registered “NASA”, and the entire string “NASA.LaRC” would be registered; “tm112871” is a locally unique string generated by “LaRC”. ODU.CS/tm112871 is possible...
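The GlobalAuthority.LocalAuthority/LocallyUniqueString form can be split mechanically. The sketch below is an assumption-laden illustration, not the handle system's own parser: `Handle` and `parse_handle` are hypothetical names, and the validation is deliberately minimal.

```python
from typing import NamedTuple, Tuple


class Handle(NamedTuple):
    naming_authority: Tuple[str, ...]  # e.g. ("NASA", "LaRC")
    local_name: str                    # e.g. "tm112871"


def parse_handle(text: str) -> Handle:
    """Split a handle at the first '/' into naming authority and local name."""
    authority, sep, local_name = text.partition("/")
    if not sep or not authority or not local_name:
        raise ValueError(f"not of the form Authority/LocalName: {text!r}")
    parts = tuple(authority.split("."))
    if any(part == "" for part in parts):
        raise ValueError(f"empty naming-authority segment in {text!r}")
    return Handle(parts, local_name)


nasa = parse_handle("NASA.LaRC/tm112871")
odu = parse_handle("ODU.CS/tm112871")
```

The two handles share the local name `tm112871` but differ in naming authority, which is why both can exist without colliding.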

Handle Syntax In URL-type syntax: <a href="hdl:NASA.LaRC/tm112871"> “hdl” is a scheme; the handle is resolved into a URL by a locally defined handle server; see http://ftp.ics.uci.edu/pub/ietf/uri/ for a good list of schemes and naming projects. Using a proxy server: <a href="http://hdl.handle.net/NASA.LaRC/tm112871"> hdl.handle.net performs resolution. from: http://www.handle.net/draft-ietf-handle-system-01.html
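The relationship between the two forms is a simple rewrite: strip the `hdl:` scheme and prepend the proxy address. A minimal sketch, assuming the hdl.handle.net proxy from the slide; the function name is hypothetical and no network resolution is performed.

```python
from urllib.parse import quote

PROXY = "http://hdl.handle.net/"  # proxy named on the slide


def handle_to_proxy_url(uri: str, proxy: str = PROXY) -> str:
    """Rewrite 'hdl:<handle>' as a URL served by a handle proxy."""
    scheme, sep, handle = uri.partition(":")
    if sep != ":" or scheme.lower() != "hdl" or not handle:
        raise ValueError(f"not an hdl: URI: {uri!r}")
    # Percent-encode anything unsafe in a URL path, but keep '/' as-is.
    return proxy + quote(handle, safe="/")


url = handle_to_proxy_url("hdl:NASA.LaRC/tm112871")
```

A client without native `hdl:` support can follow the resulting `http:` link and let the proxy do the resolution.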

Handles Observation: isn’t the handle system just the Domain Name System (DNS) all over again? The need for URNs for just general WWW use is obvious; the need for them in DLs even more so...

Semantics in Names Two schools of thought: semantic clues in names, such as: NASA.LaRC/tm112871 www.larc.nasa.gov are: good: easy to parse, remember, map to real-world concepts, etc. bad: names are not for human consumption, are hurtful or restrictive in the long run, etc.

Purls Persistent URLs (Purls) examples: http://purl.net/, OCLC Maps stable URLs (registered in purl.net space) to transient URLs (e.g. cs.odu.edu/~user/ space) examples: http://www.purl.org/DC http://www.purl.org/NET/oai_explorer

DOIs Digital Object Identifier System (DOIs) http://www.doi.org/ no semantics in the names (well, that’s not always true…) driven by the publishing industry examples: doi:10.1045/september2002-rasmussen 10.1145/544220.544284 resolver: http://dx.doi.org/

Repositories “A network accessible storage system in which digital objects may be stored for possible subsequent access or retrieval” (KWF) A stored DO is a DO that resides in a repository A registered DO is a DO that the repository has registered with a handle server storing and registering can be the same or different processes

Repositories A repository keeps a properties record for each DO contains key-metadata and any other metadata the repository chooses to keep A repository of record (ROR) is the first repository that a DO is placed in ROR authorizes additional instances of the DO A dissemination is the result of an access service request

Repositories figure 3 from http://www.dlib.org/dlib/July95/07arms.html

Repository Access Protocol (RAP) “Protocol” may be misleading; it’s really just the skeleton for a protocol. RAP is designed to be simple; repositories themselves should be simple. KWF defines 3 basic operation classes: ACCESS_DO, DEPOSIT_DO, ACCESS_REF (this is the catch-all operation for all meta-services...)
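Because KWF gives only the skeleton, any concrete interface is guesswork. The sketch below shows one way the three operation classes might look as an abstract Python interface with a toy in-memory implementation. The method signatures, the `metadata` argument, and the `"list-handles"` meta-service are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class RepositoryAccess(ABC):
    """The three RAP operation classes; signatures are hypothetical."""

    @abstractmethod
    def access_do(self, handle: str) -> bytes: ...

    @abstractmethod
    def deposit_do(self, handle: str, data: bytes,
                   metadata: Dict[str, str]) -> None: ...

    @abstractmethod
    def access_ref(self, service: str) -> Any: ...


class InMemoryRepository(RepositoryAccess):
    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, Any]] = {}

    def access_do(self, handle: str) -> bytes:
        return self._records[handle]["data"]

    def deposit_do(self, handle: str, data: bytes,
                   metadata: Dict[str, str]) -> None:
        self._records[handle] = {"data": data, "metadata": dict(metadata)}

    def access_ref(self, service: str) -> Any:
        # Catch-all for meta-services; only one invented service is offered.
        if service == "list-handles":
            return sorted(self._records)
        raise ValueError(f"unknown meta-service: {service}")


rap = InMemoryRepository()
rap.deposit_do("NASA.LaRC/tm112871", b"report bytes", {"series": "tm"})
```

Keeping the abstract interface this small reflects the slide's point that repositories themselves should be simple; richer behavior would be layered on through `access_ref` or a higher-level protocol.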

RAP RAP is fleshed out more in Cornell CS 95-TR1540 Where KWF suggested that the operations would take “metadata”, “key-metadata”, and “digital object” as arguments, TR1540 splits some of those into separate operations RAP could be implemented as a subset of a more sophisticated protocol (Dienst, Z39.50, etc.) prelude to the Open Archives Initiative (OAI) metadata harvesting protocol

RAP

Terms and Conditions First lengthy discussion with respect to KWF in Cornell CS 95 TR-1593 Terms and Conditions (TC) can be arbitrarily complex, but generally consist of: permissions: read, write, etc. authentication - person, group, etc. payment 3rd party intervention (possibly in support of the above)

TC TC are attached to: each DO dissemination repository TC are a precondition for any operation on the above Repositories responsible for enforcing TC

Booch Diagram for TC [Figure: Booch class diagram. Recoverable labels: repository, digital object, dissemination, terms and conditions, and data, joined by 1-to-1 links plus one 1-to-N link.] Figure 1 from 95 TR-1593

Why Are TC Difficult? Wide open model -- “everyone can access and do everything” is much simpler How do you: inform user of TC? negotiate TC? enforce TC? esp. with respect to 3rd party enforcers specify TC?

Access Rules and TC Figure 1 from TR-1540

Access Rules and TC TR-1593 makes access_rules an instance of the class terms_and_conditions Defines KWF concepts in a Common Object Request Broker Architecture (CORBA) context CORBA is a standard/architecture/mechanism for object communication across heterogeneous everything... http://www.corba.org/

CORBA Implementation Messages passed to the Object Request Broker (ORB) by interceptors ca. 1995 -- any current projects would likely use SOAP Interceptors create: credential object security context object access decision object

CORBA Implementation 1. Client requests a dissemination 2. Interceptor creates a credential object to store the client’s privilege attributes (PA) 3. Client and server establish a security context 4. Access decision object (ADO) controls access to the DO 5. ADO looks at the DO’s control attributes (CA) and compares them to the client’s PAs 6. Negotiation (lots of icky details hidden here) 7. ADO grants or denies 8. Dissemination (or failure message) returned to client
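Steps 2, 5, 7, and 8 above can be sketched without any CORBA machinery. This is a deliberately simplified illustration, not the design in the Cornell technical report: modelling privilege and control attributes as sets of strings, and granting access when the required attributes are a subset of the client's, are assumptions; the security context (step 3) and negotiation (step 6) are omitted entirely.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Credential:
    """Step 2: holds the client's privilege attributes (PA)."""
    privilege_attributes: FrozenSet[str]


@dataclass(frozen=True)
class ControlAttributes:
    """The DO's control attributes (CA): what a client must hold."""
    required: FrozenSet[str]


def access_decision(cred: Credential, control: ControlAttributes) -> bool:
    # Steps 5 and 7: compare CA with PA; grant only if every CA is held.
    return control.required <= cred.privilege_attributes


def request_dissemination(cred: Credential, control: ControlAttributes,
                          data: bytes) -> bytes:
    # Step 8: return the dissemination, or fail with a message.
    if access_decision(cred, control):
        return data
    raise PermissionError("terms and conditions not satisfied")


staff = Credential(frozenset({"read", "nasa-staff"}))
public = Credential(frozenset({"read"}))
restricted = ControlAttributes(frozenset({"read", "nasa-staff"}))
```

With these made-up attributes, `staff` would receive the dissemination and `public` would get a `PermissionError`.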

“It’s so complex, it’s wonderful!” This is a long way from e-mail or anonymous ftp services... Things sure were simpler when everyone could read everything. But open and free communication is only a subset of DL applications; success and widespread adoption of DLs depend on the ability to model the more complex TC for various information

Are We There Yet? Grand and glorious projects often collapse under their own weight, fail to achieve critical mass, or generally under-achieve GOSIP, Ada, Z39.50, Hyper-G Simple and limited scope projects are more successful, achieve critical mass, etc. TCP/IP, C, OAI-PMH, WWW

Rough Consensus and Running Code Are KWF and its implementations: just what the DL community needs? not enough? too much? When the “perfect” DL architecture arrives, will we be too invested in our current DLs to transition? “We reject kings, presidents and voting. We believe in rough consensus and running code.” (IETF Credo, Dave Clark, 1992)