Generating Best Effort Preservation Metadata for Web Resources at Time of Dissemination
Joan A. Smith & Michael L. Nelson
Old Dominion University, Department of Computer Science, Norfolk, VA
{jsmit,
JCDL 2007
Presented: 20 June 2007
Joint Conference on Digital Libraries 2007
Slide # 2 What’s In A Web Page?
Slide # 3 A Simple Web Page: Behind the Scenes
Slide # 4 HTTP: Behind the Scenes

Non-Text Resource example:
– Note the sparse metadata from the HTTP GET request
– Binary content is not human-readable and does not even display properly in the terminal window

We really need more metadata for the digital archeologist of the future:
– Color map
– NISO information
– Base64 encoding of resource
– MD5 or other hash function
– Subject matter

And more metadata would help preserve the Jack and Jill document, too:
– Language
– Document summary/abstract
– Keyword extraction
– Lexical signature

    % telnet foo.edu 80
    Trying
    Connected to foo.edu.
    Escape character is '^]'.
    GET /jackJill.jpg HTTP/1.1
    Host: foo.edu

    HTTP/ OK
    Date: Mon, 11 Jun :49:25 GMT
    Server: Apache/ (Unix)
    Last-Modified: Mon, 29 Aug :01:40 GMT
    ETag: " e f924"
    Accept-Ranges: bytes
    Content-Length:
    Content-Type: image/jpeg

    ÿØÿà "#2s¡35Rq‘±³ÁÂ$%Ccruƒ“¢ÃÒÿÄ ëÖ.éhéQ)Ùè5üb»[g¨øx^zè ²
    Connection closed by foreign host.
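Two of the wished-for items above, the hash and the Base64 encoding, can be derived from the resource bytes alone. The following is a minimal Python sketch of that idea, not the method used by the authors; the function name `fixity_metadata` and the sample bytes are assumptions made for illustration.

```python
import base64
import hashlib


def fixity_metadata(data: bytes) -> dict:
    """Return size, MD5 digest, and Base64 text for a resource's raw bytes.

    Hypothetical helper for illustration only.
    """
    return {
        "size": len(data),
        "md5": hashlib.md5(data).hexdigest(),
        "base64": base64.b64encode(data).decode("ascii"),
    }


# Stand-in for image content: the JPEG start-of-image bytes plus filler.
sample = b"\xff\xd8\xff\xe0" + b"\x00" * 12
meta = fixity_metadata(sample)
print(meta["size"], meta["md5"], meta["base64"])
```

The Base64 text is safe to embed in an XML response, and the digest lets a future reader confirm that the decoded bytes match what was served.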
Slide # 5 Preservation & Metadata
[Figure: Probability of Preservation (Low to High) versus Resource Metadata Available (Less to More). Labels: "What I get from the HTTP/HTML"; "What I need to make an Archival Information Package (AIP)".]
Slide # 6 Post-Harvest Processing (at Ingest)
Harvest -> Analyze/Examine/Process -> Archive
Often a combination of manual and automated input
Slide # 7 Metadata Generation Utility Examples

Name          Description
Jhove         Analysis by type (img, audio, text)
Kea           Key phrase extraction
OTS           Open Text Summarizer
ExifTool      Image/video metadata extractor
PDFlib-pCOS   Extract PDF metadata
MP3-Tag       Extract audio file tags
Essence       Customized information extraction
GDFR          MIME++
MD5           Message Digest
File Magic    Uses content-identification bits of the file
Slide # 8 The Conscientious Webmaster
"He who waits to do a great deal of good will never do anything." -- Samuel Johnson
Preservation is important…
But I’m soooo busy…
How to help???
Slide # 9 Configuring the Web-Server for Automatic Metadata
No impact to everyday users
– Regular "GET" => "regular" response
– OAI-PMH "GetRecord" => "crate" response
Standard Apache "Location" directive
mod_oai module configured with "plug-ins"
Scripts, utilities, etc. can vary by MIME type
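The idea that plug-ins vary by MIME type can be sketched in Python as a small dispatch table. This is an illustration under stated assumptions, not mod_oai's configuration syntax. The table layout, the `commands_for` helper, the `extract-keywords` script, and the choice to append the file path as the last argument are all hypothetical. The `file` and `jhove` command lines follow the ones shown elsewhere in this deck.

```python
from fnmatch import fnmatchcase

# Hypothetical plug-in table: (MIME pattern, label, command prefix).
PLUGINS = [
    ("*/*", "file magic", ["/usr/bin/file"]),
    ("image/jpeg", "jhove", ["/opt/jhove/jhove", "-m", "jpeg-hul"]),
    # Hypothetical local script, standing in for any text utility.
    ("text/*", "keywords", ["/usr/local/bin/extract-keywords"]),
]


def commands_for(mime_type: str, path: str) -> list:
    """Return (label, argv) pairs for every plug-in whose pattern matches."""
    return [
        (label, argv + [path])  # assumption: path goes last on the command line
        for pattern, label, argv in PLUGINS
        if fnmatchcase(mime_type, pattern)
    ]


for label, argv in commands_for("image/jpeg", "jackJill.jpg"):
    print(label, argv)
```

A wildcard entry such as `*/*` runs for every resource, while narrower patterns add type-specific utilities on top of it.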
Slide # 10 Harvest with Metadata (at Dissemination)
Metadata Magic: Get the resource together with its metadata
[Figure labels: Harvest; Pre-processed resource]
Slide # 11 Automatic Metadata via mod_oai

    T18:21:46Z
    <request verb="GetRecord" identifier= metadataPrefix="crate">
    T04:09:07Z
    mime:image:jpeg
    image/jpeg
    encoding="base64"
    JVBERi0xLjQKMyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpI+hlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc

    "file magic"
    /usr/bin/file jackJill.jpg
    file-4.16
    JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26

    "jhove"
    /opt/jhove/jhove -m jpeg-hul
    Jhove (Rel. 1.1, )
     Date: :35:50 EDT
     RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
      ReportingModule: JPEG-hul, Rel. 1.2 ( )
      LastModified: :09:07 EST
      Size:
      Format: JPEG
      Version: 1.00
      Status: Well-Formed and valid
      SignatureMatches: JPEG-hul
      MIMEtype: image/jpeg
      Profile: JFIF
      JPEGMetadata:
       CompressionType: Huffman coding, Baseline DCT
       Images:
        Number: 1
        Image:
         NisoImageMetadata:
          MIMEType: image/jpeg
          ByteOrder: big-endian
          CompressionScheme: JPEG
          ColorSpace: YCbCr
          SamplingFrequencyUnit: inch
          XSamplingFrequency: 33
          YSamplingFrequency: 26
          ImageWidth: 172
          ImageLength: 146
          BitsPerSample: 8, 8, 8
          SamplesPerPixel: 3
         Scans: 1
         QuantizationTables:
          QuantizationTable:
           Precision: 8-bit
           DestinationIdentifier: 0
       Comments: LEAD Technologies Inc. V1.01
       ApplicationSegments: APP0
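The request shown on this slide is an OAI-PMH GetRecord call with `metadataPrefix` set to `crate`. A harvester could assemble such a request URL as in the Python sketch below. The base URL `http://foo.edu/modoai` and the use of the resource URL as the identifier are assumptions for illustration; only the three query parameter names come from OAI-PMH itself.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def get_record_url(base_url: str, identifier: str, metadata_prefix: str = "crate") -> str:
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Assumed base URL and identifier; substitute the values for a real server.
url = get_record_url("http://foo.edu/modoai", "http://foo.edu/jackJill.jpg")
params = parse_qs(urlsplit(url).query)
print(url)
print(params["verb"], params["metadataPrefix"])
```

`urlencode` percent-encodes the identifier, so a resource URL can be passed as a query value without breaking the request.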
Slide # 12 Preservation & Metadata
[Figure: Probability of Preservation (Low to High) versus Resource Metadata Available (Less to More). Labels: "HTTP/HTML"; "Automatic metadata utilities/CRATE"; "Archival Information Package (AIP)".]
Slide # 13 Automatic, Best-Effort Metadata
Unverified
– Utility results are not cross-checked
– Output of analyses directly into XML response
Undifferentiated
– No categorization of output
– Resource and metadata cohabit response
Automatic
– Generated at time of dissemination
– Integrates preservation functions with the web server
A simple, easy-to-implement option for improving preservation metadata for web resources
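The "unverified, undifferentiated" approach can be illustrated with a short Python sketch that places the encoded resource and the raw utility output side by side in one XML document. The element and attribute names (`crate`, `resource`, `metadata`, `utility`) and the function `build_crate` are hypothetical and are not claimed to match the actual CRATE schema.

```python
import xml.etree.ElementTree as ET


def build_crate(resource_b64: str, mime_type: str, utility_outputs: dict) -> str:
    """Wrap a Base64 resource and raw utility output in one XML document.

    Element names are hypothetical. Output is copied in as-is: it is
    neither cross-checked nor categorized.
    """
    crate = ET.Element("crate")
    resource = ET.SubElement(
        crate, "resource", {"mimeType": mime_type, "encoding": "base64"}
    )
    resource.text = resource_b64
    for name, output in utility_outputs.items():
        entry = ET.SubElement(crate, "metadata", {"utility": name})
        entry.text = output  # ElementTree escapes <, >, and & on serialization
    return ET.tostring(crate, encoding="unicode")


xml_text = build_crate(
    "/9j/4A==",  # stand-in Base64 payload
    "image/jpeg",
    {"file magic": "JPEG image data, JFIF standard 1.00", "jhove": "Status: Well-Formed and valid"},
)
print(xml_text)
```

Because nothing is interpreted, adding another utility only adds another `metadata` element; making sense of the output is left to whoever reads the archive later.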
Slide # 14 Further Information
The mod_oai project home page:
IWAW 2007: “CRATE: A Simple Model for Self-Describing Web Resources”
Authors’ webs:
I Helped!