HATHI TRUST
A Shared Digital Repository
Use of PREMIS for Internet Archive AIPs
September 22, 2010

Overview
University of Michigan and University of California worked together to develop ingest processes for Internet Archive content
IA materials did not match previously developed standards for HathiTrust materials
Solutions were developed to transform IA materials into HathiTrust-compatible AIPs
Discuss our use of PREMIS events to document processes and transformations

HathiTrust Overview
Launched in 2008 by CIC and University of California system libraries to archive and share digital collections
Partnership is open to institutions worldwide
Currently:
Nearly 30 partners
6.6 million digital volumes
1.3 million public domain
247 terabytes

Event identifier: Internet Archive capture1
Event type: capture
Date/time: T19:50:13
Detail: Initial capture of item
Agent: AgentID Internet Archive (Executor)
Agent: tool scribe7.la.archive.org (image capture)

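To make the structure behind these event slides concrete, here is a minimal sketch of how the capture event above might be serialized as PREMIS XML. It assumes the PREMIS 2.x namespace and element names; the helper names `q` and `build_event` are hypothetical, and the date portion of the timestamp is missing from the slide, so the value is passed through exactly as shown there.

```python
import xml.etree.ElementTree as ET

# Assumption: PREMIS 2.x namespace; the slides do not state the version used.
PREMIS_NS = "info:lc/xmlschema/premis-v2"
ET.register_namespace("premis", PREMIS_NS)


def q(name):
    """Qualify an element name with the PREMIS namespace."""
    return f"{{{PREMIS_NS}}}{name}"


def build_event(id_type, id_value, event_type, date_time, detail, agents,
                outcome=None):
    """Build a PREMIS event element (hypothetical helper).

    agents is a list of (type, value, role) tuples, one per linked agent.
    """
    event = ET.Element(q("event"))
    ident = ET.SubElement(event, q("eventIdentifier"))
    ET.SubElement(ident, q("eventIdentifierType")).text = id_type
    ET.SubElement(ident, q("eventIdentifierValue")).text = id_value
    ET.SubElement(event, q("eventType")).text = event_type
    ET.SubElement(event, q("eventDateTime")).text = date_time
    ET.SubElement(event, q("eventDetail")).text = detail
    if outcome is not None:
        info = ET.SubElement(event, q("eventOutcomeInformation"))
        ET.SubElement(info, q("eventOutcome")).text = outcome
    for a_type, a_value, a_role in agents:
        link = ET.SubElement(event, q("linkingAgentIdentifier"))
        ET.SubElement(link, q("linkingAgentIdentifierType")).text = a_type
        ET.SubElement(link, q("linkingAgentIdentifierValue")).text = a_value
        ET.SubElement(link, q("linkingAgentRole")).text = a_role
    return event


capture = build_event(
    "Internet Archive", "capture1", "capture",
    "T19:50:13",  # date portion is missing in the slide; kept as shown
    "Initial capture of item",
    agents=[
        ("AgentID", "Internet Archive", "Executor"),
        ("tool", "scribe7.la.archive.org", "image capture"),
    ],
)
xml_text = ET.tostring(capture, encoding="unicode")
print(xml_text)
```

Each agent on a slide is read here as a type, value, role triple, which is why "AgentID UM Executor" and "tool md5sum software" follow the same three-part pattern on the later slides.
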
Event identifier: UM fixity check1
Event type: fixity check
Date/time: T16:34:02
Detail: Calculation of md5 hash values for downloaded IA files, comparison with pre-download md5 values
Outcome: warning
Outcome detail: files failed checksum validation arcanacaelestiah03swed_files.xml arcanacaelestiah03swed_meta.xml ….
…
Agent: AgentID UM (Executor)
Agent: tool md5sum (software)

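The fixity check described above can be sketched as follows. This is an illustration only, not the ingest script itself: the slides name md5sum as the tool, while this sketch uses Python's `hashlib`, and the function names and the choice to treat a missing file as a failure are assumptions. It returns the same kind of outcome the slide records, "pass" or "warning", plus the list of files that failed.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the md5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_check(directory, expected):
    """Compare files against pre-download checksums.

    expected maps file name to md5 hex digest. A file that is missing
    or whose digest differs is reported as failed (assumption).
    Returns (outcome, failed_names).
    """
    base = Path(directory)
    failed = sorted(
        name for name, checksum in expected.items()
        if not (base / name).is_file() or md5_of(base / name) != checksum
    )
    return ("warning" if failed else "pass"), failed


# Small demonstration with throwaway files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_bytes(b"hello")
    (Path(tmp) / "b.txt").write_bytes(b"world")
    expected = {
        "a.txt": hashlib.md5(b"hello").hexdigest(),
        "b.txt": hashlib.md5(b"changed").hexdigest(),  # deliberate mismatch
    }
    outcome, failed = fixity_check(tmp, expected)

print(outcome, failed)
```
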
Event identifier: UM package inspection1
Event type: package inspection
Date/time: T16:34:01
Detail: Inspection of IA download package for missing files
Outcome: pass
Agent: AgentID UM (Executor)
Agent: tool ingest_ia_volumes.pl (software)

Event identifier: UM mod1_image_header
Event type: image header modification
Date/time: T16:34:29
Detail: Image header modification to HathiTrust conventions
Agent: AgentID UM (Executor)
Agent: tool ingest_ia_volumes.pl (software)
…
Agent: tool exiftool (software)

Event identifier: UM mod2_file_rename
Event type: file rename
Date/time: T16:34:03
Detail: File renaming to HathiTrust conventions
Agent: AgentID UM (Executor)
Agent: tool ingest_ia_volumes.pl (software)

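As a rough illustration of the renaming step, the sketch below builds a rename plan from a list of downloaded file names. The slides do not spell out the target convention, so the eight-digit zero-padded sequence number used here is an assumption, as are the function name `rename_plan` and the sample file names.

```python
from pathlib import PurePath


def rename_plan(names, width=8):
    """Map each original name to a sequence-numbered name.

    Assumption: pages sort correctly by their original names and the
    target name is a zero-padded sequence number plus the original
    file extension.
    """
    plan = {}
    for seq, name in enumerate(sorted(names), start=1):
        plan[name] = f"{seq:0{width}d}{PurePath(name).suffix}"
    return plan


# Hypothetical input names, deliberately out of order.
plan = rename_plan(["item_0002.jp2", "item_0001.jp2"])
for old, new in plan.items():
    print(old, "->", new)
```

Computing the plan separately from applying it keeps the mapping available for recording in the event or in the METS file before any file is touched.
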
Event identifier: UM mod3_ocr_split
Event type: ocr split
Date/time: T16:34:05
Detail: Splitting of IA XML OCR into one plain text OCR file and one XML file (with coordinates) per page
Agent: AgentID UM (Executor)
Agent: tool ingest_ia_volumes.pl (software)

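The OCR split can be pictured with a small sketch. It assumes a simplified, hypothetical OCR layout with `page` and `word` elements and a `coords` attribute; the real IA OCR schema is not shown in the slides and may differ. For each page it yields the two outputs the event describes: plain text, and the page's XML with coordinates preserved.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a volume-level OCR file.
SAMPLE = """<book>
  <page><word coords="10,20,50,40">Arcana</word><word coords="60,20,120,40">Caelestia</word></page>
  <page><word coords="10,20,40,40">Vol</word></page>
</book>"""


def split_ocr(xml_text):
    """Return one (plain_text, page_xml) pair per page."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        # Plain text: the words of the page joined by single spaces.
        text = " ".join(word.text or "" for word in page.iter("word"))
        # Per-page XML: the page element serialized with its coordinates.
        page_xml = ET.tostring(page, encoding="unicode").strip()
        pages.append((text, page_xml))
    return pages


pages = split_ocr(SAMPLE)
for number, (text, page_xml) in enumerate(pages, start=1):
    print(number, text)
```

In a real pipeline each pair would be written to two files per page; the sketch keeps them in memory so the splitting logic stays visible.
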
Event identifier: UM mod4_ia_mets_creation
Event type: ia mets creation
Date/time: T16:34:30
Detail: Creation of IA METS file
Agent: AgentID UM (Executor)
Agent: tool ingest_ia_volumes.pl (software)

Event identifier: UM message digest calculation1
Event type: message digest calculation
Date/time: T16:34:30
Detail: Calculation of page-level md5 checksums
Agent: AgentID UM (Executor)
Agent: tool md5sum (software)

Event identifier: UM validation1
Event type: validation
Date/time: T16:34:30
Detail: IA METS validation
Agent: AgentID UM (Executor)
Agent: tool Xerces-C (software)

identifier uc2.ark:/13960/t2p55qw6d 1 file count 1584 page count 528
Event identifier: UM transformation1
Event type: transformation
Date/time: T16:34:30
Detail: Transformation of files for ingest: mod1-mod4 in IA METS
Agent: AgentID UM (Executor)
Agent: tool ingest_ia_volumes.pl (software)

Event identifier: UM page feature mapping1
Event type: page feature mapping
Date/time: T16:35:48
Detail: Map original page feature tags to HathiTrust
Agent: AgentID UM (Executor)
Agent: tool GROOVE (software)

Event identifier: UM fixity check1
Event type: fixity check
Date/time: T16:34:39
Detail: Validation of page-level md5 checksums
Outcome: pass
Agent: AgentID UM (Executor)
Agent: tool md5sum (software)

Event identifier: UM ingestion1
Event type: ingestion
Date/time: T16:35:48
Detail: Ingestion of object package into repository
Agent: AgentID UM (Executor)
Agent: tool GROOVE (software)

Event identifier: UM validation1
Event type: validation
Date/time: T16:35:18
Detail: Validation of object components
Agent: AgentID UM (Executor)
Agent: tool GROOVE (software)
Agent: tool jhove1.5 (software)