Evaluating Ingest Success: Using the AIHT Michael L. Nelson, Joan A. Smith Department of Computer Science Old Dominion University Norfolk VA 23508 DCC.

1 Evaluating Ingest Success: Using the AIHT Michael L. Nelson, Joan A. Smith Department of Computer Science Old Dominion University Norfolk VA 23508 DCC / LUCAS Joint Workshop Liverpool UK Nov 30 - Dec 1, 2006

2 Outline AIHT Recap ODU Approach to AIHT AIHT Lessons Learned –not evaluation –not ingest Create a tool that would have made AIHT easy

3 Archive Ingest and Handling Test Summary: a tar file of the filesystem + database for http://911.gmu.edu/ (~57k files) provided to: –Johns Hopkins University –Harvard University –Old Dominion University –Stanford University Goal: –“ingest” the tar file –migrate one format to another (simulate the passage of time) –exchange contents with one other partner

4 Outline AIHT Recap ODU Approach to AIHT AIHT Lessons Learned –not evaluation –not ingest Create a tool that would have made AIHT easy

5 Fortress Model Five Easy Steps for Preservation: 1. Get a lot of $ 2. Buy a lot of disks, machines, tapes, etc. 3. Hire an army of staff 4. Load a small amount of data 5. “Look on my archive ye Mighty, and despair!” image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg

6 ODU’s Research Goals We’re in the CS department, not the library –Less infrastructure (bad) –More freedom (good) Interested in repository/object interaction –Long-range vision: repositories fade away; objects are responsible for their own preservation –Could we accomplish this with our “bucket” technology? Significant questions about archive granularity Transition to MPEG-21 Digital Item Declaration Language (DIDL) based buckets New models for digital preservation?

7 Buckets Buckets: self-contained, web-accessible objects –Grew out of research for serving NASA documents, esp. NACA Reports http://naca.larc.nasa.gov/ http://doi.acm.org/10.1145/374308.374342 – implicit assumptions: 1 bucket = 1 logical item (N physical items) Display is for human use Bucket contents are DOM-parsable

8 Bucket / MPEG-21 Model MPEG-21 DIDL Payload http://beatitude.cs.odu.edu:8080/bucket/ Bucket Infrastructure methods logs support libraries

9 MPEG-21 DIDL A generic, powerful complex object metadata format –Based on an abstract data model –Semantics separated from syntax i.e. the tags don’t mean anything -- a little disconcerting at first glance –Digital library use championed by LANL http://www.dlib.org/dlib/november03/bekaert/11bekaert.html http://www.dlib.org/dlib/february04/bekaert/02bekaert.html http://arxiv.org/abs/cs.DL/0502028

10 MPEG-21 DIDL Data Model How to encode Archive? 1 file = 1 DID 1 archive = 1 container 1 archive = 1 component 1 file = 1 component

11 1 File = 1 Component 8 file archive for demo purposes… http://www.cs.odu.edu/~mln/aiht/

12 Looking Inside the Archive

13 Looking at a Single File…

14 Design Decisions: File Storage Store each file as a –Big: each file is base64’d into the DIDL –Small: each file is ref’d from the DIDL to a directory Filename = MD5 hash of the original file name (not contents!) + a version number Example:
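The two storage choices and the naming rule on this slide can be sketched in Python. This is a minimal illustration, not the ODU implementation: the `<hash>.<version>` layout, the `files/` directory, and the function names are assumptions, and the use of a DIDL Component with a Resource child follows the "1 File = 1 Component" mapping from the earlier slide.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET

DIDL_NS = "urn:mpeg:mpeg21:2002:02-DIDL-NS"  # MPEG-21 DIDL namespace


def stored_name(original_name: str, version: int) -> str:
    """MD5 of the original file *name* (not its contents) plus a version number.

    The "<hash>.<version>" layout is an assumption made for this sketch.
    """
    digest = hashlib.md5(original_name.encode("utf-8")).hexdigest()
    return f"{digest}.{version}"


def component(original_name: str, data: bytes, mime: str,
              by_value: bool, version: int = 1) -> ET.Element:
    """Build one Component holding one file, by value or by reference."""
    comp = ET.Element(f"{{{DIDL_NS}}}Component")
    res = ET.SubElement(comp, f"{{{DIDL_NS}}}Resource", {"mimeType": mime})
    if by_value:
        # "Big": the file itself is base64-encoded into the DIDL
        res.set("encoding", "base64")
        res.text = base64.b64encode(data).decode("ascii")
    else:
        # "Small": the DIDL only points at a file kept in a directory
        # ("files/" is a hypothetical location)
        res.set("ref", "files/" + stored_name(original_name, version))
    return comp
```

Hashing the name rather than the contents means a migrated version of a file keeps the same hash and differs only in its version number.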

15 Design Decisions: Ingestion For every program/process to apply to a file, create a corresponding –Jhove –Unix “file” –Fred URI –MD5 of file contents Expandable, scriptable list of metadata extraction / analysis programs Ingestion is parallelized over a workstation cluster
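An expandable list of extraction programs applied to every file could look roughly like the sketch below. The two extractors are simple stand-ins (a real list would wrap tools such as Jhove or Unix "file"), the names `EXTRACTORS`, `describe`, and `ingest` are hypothetical, and a local thread pool substitutes for the workstation cluster mentioned on the slide.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def md5_of_contents(path: Path) -> str:
    """MD5 of the file contents (as opposed to the file name)."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def size_in_bytes(path: Path) -> str:
    """File size, used here as a trivial second extractor."""
    return str(path.stat().st_size)


# Expandable list of (label, callable) pairs; add an entry to add an analysis.
EXTRACTORS = [
    ("md5", md5_of_contents),
    ("size", size_in_bytes),
]


def describe(path: Path) -> dict:
    """Run every registered extractor on one file."""
    return {label: extract(path) for label, extract in EXTRACTORS}


def ingest(paths: list, workers: int = 4) -> dict:
    """Describe many files in parallel (threads stand in for a cluster)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(describe, paths)))
```

Each extractor's output is stored separately, so adding a new tool later does not disturb metadata already recorded.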

16 Example Output: MD5 perl/Digest::MD5 52217a1bcd2be7cf05f36066d4cdc9cf

17 Conversion: Linking Old to New If the previous version of the Resource was specified as: then the new version of the resource is specified as:

18 Harvard Ingest Harvard’s model was the most similar to our MPEG-21 model Ingesting from another archive is (roughly) the same as initial ingest –Save any metadata that was delivered in the original METS file as a We don’t trust it, but it might be useful for future forensics –Re-ingest in the normal way Our export is part of the bucket API: –http://beatitude.cs.odu.edu:8080/bucket/?method=get&id=didl External Metadata image/jpeg 6 6 1 Canon Canon EOS D30 2 540 360 8 8 8

19 Outline AIHT Recap ODU Approach to AIHT AIHT Lessons Learned –not evaluation –not ingest Create a tool that would have made AIHT easy

20 The DIP is the TMD* Using METS or MPEG-21, there is no need for a separate transfer metadata format METS & MPEG-21 can be the lumps of XML exchanged between OAI-PMH harvesters & repositories –http://www.dlib.org/dlib/december04/vandesompel/12vandesompel.html * Apologies to Marshall McLuhan Figure 1, Bekaert & Van de Sompel http://www.dlib.org/dlib/june05/bekaert/06bekaert.html

21 Validation is Subjective images from: http://facweb.cs.depaul.edu/sgrais/collage.htm Preservation metadata is like a David Hockney photo collage: each image is both true and incomplete, and while the result is not faithful, it does capture the “essence”

22 Alternate Models of Preservation Lazy Preservation –Let Google, IA et al. preserve your website Just-In-Time Preservation –Wait for it to disappear first, then find a “good enough” version Shared Infrastructure Preservation –Push your content to sites that might preserve it Web Server Enhanced Preservation –Use Apache modules to create archival-ready resources image from: http://www.proex.ufes.br/arsm/knots_interlaced.htm

23 Outline AIHT Recap ODU Approach to AIHT AIHT Lessons Learned –not evaluation –not ingest Create a tool that would have made AIHT easy

24 Web Site Preservation The counting problem How many pages are on that site? To save it you have to find it The representation problem What’s that page all about? Future use requires understanding

25 Limitations of HTTP There is no “Select *” in HTTP –Crawlers cannot request a list of all URLs for the site –Crawlers can only GET one resource at a time –HTTP cannot give a crawler a list of resources it has Cannot ask for only new resources –Conditional GET by datestamp or etag is limited –Cannot get a list of pages that have been deleted –Each resource must be requested, one at a time HTTP alone is insufficient to confidently enumerate a site’s resources

26 [Figure: current browser-accessible formats over time: XBM, GIF, PNG, JP2, PS, PDF] Slide from “Dynamic Web File Format Transformations with Grace”, IWAW 05

27 mod_oai solution Integrate OAI-PMH functionality into the web server itself… 1. Use mod_oai, an Apache 2.0 module that automatically answers OAI-PMH requests for an http server; written in C; respects values in .htaccess, httpd.conf 2. Install mod_oai on http://www.foo.edu/ 3. Define baseURL: http://www.foo.edu/modoai Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets) Using OAI-PMH: http://www.foo.edu/modoai?verb=ListRecords&metadataPrefix=oai_didl&from=2004-09-15&set=mime:video:mpeg Give me all resources… …and their preservation metadata, from site foo, dating from 9/15/2004 through today, that are MIME type video/mpeg
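A harvester might assemble such a request as in the sketch below. The parameter names `verb`, `metadataPrefix`, `from`, `until`, and `set` are standard OAI-PMH arguments; the helper name `list_records_url` is hypothetical, and the base URL is the placeholder from the slide.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix,
                     from_date=None, until=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL with optional selectors."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date    # only records changed on or after this date
    if until:
        params["until"] = until       # only records changed on or before this date
    if set_spec:
        params["set"] = set_spec      # e.g. a MIME-type set such as mime:video:mpeg
    return base_url + "?" + urlencode(params)


url = list_records_url("http://www.foo.edu/modoai", "oai_didl",
                       from_date="2004-09-15", set_spec="mime:video:mpeg")
```

Note that `urlencode` percent-encodes the colons in the set name, which a server decodes back to the same value.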

28 Addressing the Counting Problem: ListIdentifiers CRAWLER: issues a ListIdentifiers, finds URLs of updated resources does HTTP GET updates only can get URLs of resources with specified MIME types EXPAND mod_oai approach: Web log lists File system lists Configuration information
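On the crawler side, reading a ListIdentifiers response reduces to pulling the identifier and datestamp out of each header. The sketch below parses a hand-written sample response: the URLs and dates in `SAMPLE` are invented for illustration, while the namespace and element names come from the OAI-PMH 2.0 schema.

```python
import xml.etree.ElementTree as ET

OAI = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical, abbreviated response; a real one carries more elements.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/index.html</identifier>
      <datestamp>2004-09-16</datestamp>
    </header>
    <header>
      <identifier>http://www.foo.edu/clip.mpg</identifier>
      <datestamp>2004-10-02</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def identifiers(response_xml: str) -> list:
    """Return (identifier, datestamp) pairs from a ListIdentifiers response."""
    root = ET.fromstring(response_xml)
    return [
        (h.findtext("oai:identifier", namespaces=OAI),
         h.findtext("oai:datestamp", namespaces=OAI))
        for h in root.iterfind(".//oai:header", OAI)
    ]
```

The crawler can then issue ordinary HTTP GETs for just the returned URLs instead of re-crawling the whole site.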

29 Addressing the Representation Problem: ListRecords in DIDL Format CRAWLER: Makes a ListRecords query, Gets updates as MPEG-21 DIDL records (HTTP headers, resource By Value or By Reference) can get resources with specified MIME types EXPAND OAI-PMH approach: Add ability to incorporate other metadata output Build metadata-rich complex object response Encapsulate within existing OAI-PMH DIDL metadata format response

30 “Born Archival” It would be nice if HTML, PDF, MS Office, etc. applications encoded preservation oriented metadata –Stewart Brand - “born archival” http://www.rlg.org/en/page.php?Page_ID=75 That would be really nice… while we’re wishing…

31 CRATE: A Model for Web Resource Preservation Fits with OAIS Preservation Model Text-based protocol for long-term survivability Complex object format supported by HTTP via OAI-PMH Utilizes web-server to support preservation at dissemination

32 CRATE Example: PostScript File Metadata –Descriptive Summary Index words / term frequency –Administrative Copyright Item version information Last modified date –Technical Target application version MIME type Fred URI System compatibility Checksum Signature Resource –base64-encoded resource Information-rich preservation package
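The shape of such a package (metadata sections plus the base64-encoded resource) can be sketched as follows. The element names (`crate`, `metadata`, `technical`, and so on) are hypothetical and chosen only to mirror the slide's outline; they are not the actual CRATE schema, and only a few of the listed fields are shown.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET


def make_crate(data: bytes, mime_type: str, last_modified: str) -> str:
    """Wrap a resource and a little metadata in one text-based package."""
    crate = ET.Element("crate")
    meta = ET.SubElement(crate, "metadata")

    # Administrative metadata
    admin = ET.SubElement(meta, "administrative")
    ET.SubElement(admin, "lastModified").text = last_modified

    # Technical metadata
    tech = ET.SubElement(meta, "technical")
    ET.SubElement(tech, "mimeType").text = mime_type
    checksum = ET.SubElement(tech, "checksum", {"algorithm": "MD5"})
    checksum.text = hashlib.md5(data).hexdigest()

    # The resource itself, base64-encoded so the package stays plain text
    res = ET.SubElement(crate, "resource", {"encoding": "base64"})
    res.text = base64.b64encode(data).decode("ascii")

    return ET.tostring(crate, encoding="unicode")
```

Because everything is text, the package can travel inside an OAI-PMH response and be unpacked later with nothing more than an XML parser and a base64 decoder.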

33 CRATE and the OAIS Information Model CRATE contents: base64-encoded resource; metadata from plug-ins: Summary, index, MIME / GDFR, format analysis…; URL (for web access), OAI Identifier, baseURL (for harvester access) OAI-PMH GetRecord Request SIP: submitted by web server AIP: contains archival information DIP: transmitted to other repositories or to user for extraction image from: http://www.oclc.org/research/publications/archive/2000/lavoie/

34 Example CRATE Plug-Ins for mod_oai
Name        Description
Jhove       File format analysis
Kea         Keyphrase extraction
OTS         Open Text Summarizer
ExifTool    Image/video metadata extractor
Pdflib      Extract PDF metadata
MP3-Tag     Extract audio file tags
Essence     Customized information extraction
Plug-in design allows for any type of extraction tool to be included Flexible architecture elements: Tags | Argument-Name | Version | CDATA output Simple Apache configuration file modification to enable plug-in Plug-ins written by 3rd-party programmers Output from plug-ins is not validated!

35 Summary AIHT caused us to think differently about web resource preservation Goal of CRATE / mod_oai is to dynamically produce as much preservation metadata as possible at dissemination time –“Ingestion goes faster if you can disseminate ‘crated’ metadata” Validation is a separate, local process –“Dissemination goes faster if you don’t sweat validation” More info: www.modoai.org

