Generating Best Effort Preservation Metadata for Web Resources at Time of Dissemination Joan A. Smith & Michael L. Nelson Old Dominion University Department.

Presentation transcript:

Generating Best Effort Preservation Metadata for Web Resources at Time of Dissemination
Joan A. Smith & Michael L. Nelson
Old Dominion University, Department of Computer Science, Norfolk, VA
{jsmit,
JCDL 2007: Joint Conference on Digital Libraries 2007
Presented: 20 June 2007

20 June Slide # 2 What’s In A Web Page?

Slide # 3 A Simple Web Page: Behind the Scenes

Slide # 4 HTTP: Behind the Scenes

Non-Text Resource example:
Note the sparse metadata from the HTTP GET request
Binary content is not human-readable and does not even display properly in the terminal window

We really need more metadata for the digital archeologist of the future:
–Color map
–NISO information
–Base64 encoding of resource
–MD5 or other hash function
–Subject matter

And more metadata would help preserve the Jack and Jill document, too:
–Language
–Document summary/abstract
–Keyword extraction
–Lexical signature

    % telnet foo.edu 80
    Trying
    Connected to foo.edu.
    Escape character is '^]'.
    GET /jackJill.jpg HTTP/1.1
    Host: foo.edu

    HTTP/ OK
    Date: Mon, 11 Jun :49:25 GMT
    Server: Apache/ (Unix)
    Last-Modified: Mon, 29 Aug :01:40 GMT
    ETag: " e f924"
    Accept-Ranges: bytes
    Content-Length:
    Content-Type: image/jpeg

    ÿØÿà "#2s¡35Rq‘±³ÁÂ$%Ccruƒ“¢ÃÒÿÄ ëÖ.éhéQ)Ùè5ü­b»[g¨øx^zè ² "#2s¡35Rq‘±³ÁÂ$%Ccruƒ“¢ÃÒÿÄ ëÖ.éhéQ)Ùè5ü­b»[g¨øx^zè
    Connection closed by foreign host.
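To make the "sparse metadata" point concrete, here is a minimal sketch that parses the head of an HTTP response and reports which descriptive headers are absent. The sample response and the WANTED list are illustrative assumptions, not the values from the telnet session on the slide.

```python
def parse_http_head(raw: bytes):
    """Split a raw HTTP response into its status line and a header dict."""
    head, _, _body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers


# Illustrative response (assumed values), followed by four bytes of JPEG data.
SAMPLE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: Apache\r\n"
    b"Accept-Ranges: bytes\r\n"
    b"Content-Length: 4\r\n"
    b"Content-Type: image/jpeg\r\n"
    b"\r\n"
    b"\xff\xd8\xff\xe0"
)

# Headers an archivist might hope for (an assumed wish list).
WANTED = ["content-type", "content-length", "content-language", "content-md5"]

status, headers = parse_http_head(SAMPLE)
missing = [name for name in WANTED if name not in headers]
print(status)
print("missing:", missing)
```

Even when every wanted header is present, none of them carries a color map, a summary, or keywords, which is the gap the slide describes.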

Slide # 5 Preservation & Metadata
[Figure. Axes: "Resource Metadata Available" (Less to More) and "Probability of Preservation" (Low to High). Labels: "What I get from the HTTP/HTML"; "What I need to make an Archival Information Package (AIP)".]

Slide # 6 Post-Harvest Processing (at Ingest)
Harvest -> Analyze/Examine/Process -> Archive
Often a combination of manual and automated input

Slide # 7 Metadata Generation Utility Examples

Name          Description
Jhove         Analysis by type (img, audio, text)
Kea           Key phrase extraction
OTS           Open Text Summarizer
ExifTool      Image/video metadata extractor
PDFlib-pCOS   Extract PDF metadata
MP3-Tag       Extract audio file tags
Essence       Customized information extraction
GDFR          MIME++
MD5           Message Digest
File Magic    Uses content-identification bits of the file
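Several of these utilities have simple standard-library analogues. The sketch below is not any of the listed tools: it computes an MD5 digest and a Base64 encoding, and guesses the type from leading bytes in the spirit of File Magic. The SIGNATURES table and the function names are assumptions made for illustration.

```python
import base64
import hashlib

# A few well-known leading-byte signatures ("magic numbers").
SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
]


def sniff_type(data: bytes) -> str:
    """Guess a MIME type from the first bytes of the content."""
    for prefix, mime in SIGNATURES:
        if data.startswith(prefix):
            return mime
    return "application/octet-stream"


def describe(data: bytes) -> dict:
    """Collect a few best-effort metadata fields for a resource."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "base64": base64.b64encode(data).decode("ascii"),
        "sniffed_type": sniff_type(data),
        "size": len(data),
    }


# The first four bytes of a JFIF JPEG stand in for a real file here.
record = describe(b"\xff\xd8\xff\xe0")
print(record)
```

In practice the bytes would come from the file on disk, and the richer fields (NISO image metadata, key phrases, summaries) still need the dedicated tools in the table.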

Slide # 8 The Conscientious Webmaster
He who waits to do a great deal of good will never do anything. -- Samuel Johnson
Preservation is important…
But I’m soooo busy…
How to help???

Slide # 9 Configuring the Web-Server for Automatic Metadata
No impact to everyday users
–Regular “GET” => “regular” response
–OAI-PMH “GetRecord” => “crate” response
Standard Apache “Location” directive
mod_oai module configured with “plug-ins”
Scripts, utilities, etc. can vary by MIME type
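The idea that plug-ins can vary by MIME type can be sketched as a small dispatch table. This is not mod_oai's configuration syntax, which the slide does not show. The PLUGINS table and the commands_for function are hypothetical, and the two command lines are the ones quoted on the later mod_oai slide.

```python
from fnmatch import fnmatchcase

# Hypothetical plug-in table: MIME pattern -> argv templates.
PLUGINS = [
    ("*/*", [["/usr/bin/file", "{path}"]]),
    ("image/jpeg", [["/opt/jhove/jhove", "-m", "jpeg-hul", "{path}"]]),
]


def commands_for(mime_type: str, path: str) -> list:
    """Return the argv lists of every plug-in whose pattern matches."""
    commands = []
    for pattern, templates in PLUGINS:
        if fnmatchcase(mime_type, pattern):
            for template in templates:
                commands.append([arg.format(path=path) for arg in template])
    return commands


# Only builds the command lines; running them is left to the caller.
jpeg_cmds = commands_for("image/jpeg", "jackJill.jpg")
html_cmds = commands_for("text/html", "index.html")
print(jpeg_cmds)
print(html_cmds)
```

Keeping the commands as argv lists rather than shell strings avoids quoting problems if a caller later passes them to a process launcher.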

Slide # 10 Harvest with Metadata (at Dissemination)
Metadata Magic: Get the resource together with its metadata
[Figure labels: "Harvest"; "Pre-processed resource".]

Slide # 11 Automatic Metadata via mod_oai

    T18:21:46Z
    <request verb="GetRecord" identifier= metadataPrefix="crate">
    T04:09:07Z
    mime:image:jpeg
    image/jpeg
    encoding="base64"
    JVBERi0xLjQKMyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpI+hlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc

    "file magic"
    /usr/bin/file jackJill.jpg
    file-4.16
    JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26

    "jhove"
    /opt/jhove/jhove -m jpeg-hul
    Jhove (Rel. 1.1, )
    Date: :35:50 EDT
    RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
    ReportingModule: JPEG-hul, Rel. 1.2 ( )
    LastModified: :09:07 EST
    Size:
    Format: JPEG
    Version: 1.00
    Status: Well-Formed and valid
    SignatureMatches: JPEG-hul
    MIMEtype: image/jpeg
    Profile: JFIF
    JPEGMetadata:
      CompressionType: Huffman coding, Baseline DCT
      Images:
        Number: 1
        Image:
          NisoImageMetadata:
            MIMEType: image/jpeg
            ByteOrder: big-endian
            CompressionScheme: JPEG
            ColorSpace: YCbCr
            SamplingFrequencyUnit: inch
            XSamplingFrequency: 33
            YSamplingFrequency: 26
            ImageWidth: 172
            ImageLength: 146
            BitsPerSample: 8, 8, 8
            SamplesPerPixel: 3
          Scans: 1
          QuantizationTables:
            QuantizationTable:
              Precision: 8-bit
              DestinationIdentifier: 0
          Comments: LEAD Technologies Inc. V1.01
      ApplicationSegments: APP0
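A harvester would obtain a response like this with an OAI-PMH GetRecord request. The sketch below only builds the request URL. The base URL http://foo.edu/modoai and the use of the resource's own URL as the identifier are assumptions, since the slide elides the identifier value.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str, metadata_prefix: str = "crate") -> str:
    """Build an OAI-PMH GetRecord URL for one resource."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Hypothetical base URL and identifier, for illustration only.
url = get_record_url("http://foo.edu/modoai", "http://foo.edu/jackJill.jpg")
print(url)
```

The three query arguments (verb, identifier, metadataPrefix) are the ones visible in the request element on the slide; urlencode percent-encodes the identifier so it survives as a single argument.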

Slide # 12 Preservation & Metadata
[Figure. Axes: "Resource Metadata Available" (Less to More) and "Probability of Preservation" (Low to High). Labels: "HTTP/HTML"; "Automatic metadata utilities/CRATE"; "Archival Information Package (AIP)".]

Slide # 13 Automatic, Best-Effort Metadata
Unverified
–Utility results are not cross-checked
–Output of analyses directly into XML response
Undifferentiated
–No categorization of output
–Resource and metadata cohabit response
Automatic
–Generated at time of dissemination
–Integrates preservation functions with the web server
A simple, easy-to-implement option for improving preservation metadata for web resources
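The "unverified, undifferentiated" model can be sketched as a function that places the Base64-encoded resource and the raw utility output side by side in one XML document. The element and attribute names (crate, content, metadata, utility) are hypothetical and are not the actual CRATE schema.

```python
import base64
import xml.etree.ElementTree as ET


def build_crate(resource: bytes, mime_type: str, utility_outputs: dict) -> str:
    """Wrap a resource and raw utility output in one XML document."""
    crate = ET.Element("crate")  # hypothetical element names throughout
    content = ET.SubElement(
        crate, "content", {"type": mime_type, "encoding": "base64"}
    )
    content.text = base64.b64encode(resource).decode("ascii")
    for name, output in utility_outputs.items():
        # Output is copied as-is: no cross-checking, no categorization.
        meta = ET.SubElement(crate, "metadata", {"utility": name})
        meta.text = output
    return ET.tostring(crate, encoding="unicode")


xml_text = build_crate(
    b"\xff\xd8\xff\xe0",
    "image/jpeg",
    {"file magic": "JPEG image data, JFIF standard 1.00"},
)
print(xml_text)
```

Because the output is copied verbatim, two utilities that disagree about the same file would both appear in the response, and sorting that out is left to the archive at ingest.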

Slide # 14 Further Information
The mod_oai project home page:
IWAW 2007: “CRATE: A Simple Model for Self-Describing Web Resources”
Authors’ webs:
I Helped!