Preserving a Born-Digital Archive: The H-Net Lists Lisa M. Schmidt MATRIX: The Center.

1 Preserving a Born-Digital Archive: The H-Net E-Mail Lists Lisa M. Schmidt lisa.schmidt@matrix.msu.edu http://www.h-net.org/archive/ MATRIX: The Center for Humane Arts, Letters & Social Sciences Online Michigan State University November 16, 2009

2 Preserving the H-Net E-Mail Lists H-Net Background Original “Preservation” Practices Use of the Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC) Preservation Improvements

3 H-Net: Humanities and Social Sciences Online International consortium of scholars and teachers Oldest collection of born-digital and content-moderated arts, humanities, and social science material on the Internet Hosted by MATRIX

4 H-Net: Humanities and Social Sciences Online Valuable scholarly resource –More than 180 networks, or e-mail lists, with more than 130,000 unique subscribers –More than 5,000 posts per month –More than 230 “private” lists –230,000 message views in single week More than 1 million e-mail messages

5 MATRIX Digital humanities research center Devoted to the application of new technologies in teaching, research, and outreach Creates and maintains digital libraries of humanities and social science materials Provides training in computing and new teaching technologies Creates forums for the exchange of ideas and expertise

6 NHPRC Grant Conduct assessment of existing H-Net preservation policies and practices Apply OCLC/CRL TRAC checklist Develop and implement an improved long-term preservation plan Useful to those managing large collections of electronic records Research semantic clustering search techniques

7 How H-Net Works: Backup & Security –3 TB of data, including H-Net –Server rack kept in climate controlled, physically secured room –Daily incremental backups, weekly full –Full, “permanent” tape backups every four months

8 How H-Net Works H-Net runs on LISTSERV software Submission policies –Users must be list subscribers to post –Messages written in plain text –No attachments allowed on public lists

9 How H-Net Works: An Archival Perspective Appraisal/Acquisition/Accession –All approved messages permanently archived –Editors approve and post messages –Messages post from a few seconds up to several days after approval

10 How H-Net Works: An Archival Perspective Message Posting Process

11 How H-Net Works: An Archival Perspective Arrangement –Messages kept in flat text files called “notebooks” –Single notebook includes messages posted during seven-day time period, concatenated in original order

12 How H-Net Works: An Archival Perspective

13 Arrangement –Notebooks appear to be arranged in original order within each list directory

14 How H-Net Works: An Archival Perspective Description –Most descriptive metadata for messages automatically generated on creation/posting –“Author’s Subject” inserted by creator

15 How H-Net Works: An Archival Perspective Notebook File Naming Notebook description contained in filename –Period a: days 1-7 of month –Period b: days 8-14 –Period c: days 15-21 –Period d: days 22-28 –Period e: days 29-31 –Ex. “h-africa.log0802a”
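The naming scheme on this slide can be sketched in a few lines of Python. This is an illustration only, not H-Net's code: it assumes the four digits in the example are a two-digit year followed by a two-digit month, and that list names are lowercased in filenames. The helper names `period_letter` and `notebook_name` are hypothetical.

```python
from datetime import date

# Period letters and their day-of-month ranges, as listed on the slide.
PERIODS = [("a", 1, 7), ("b", 8, 14), ("c", 15, 21), ("d", 22, 28), ("e", 29, 31)]


def period_letter(day: int) -> str:
    """Return the period letter (a-e) covering the given day of the month."""
    for letter, first, last in PERIODS:
        if first <= day <= last:
            return letter
    raise ValueError(f"day of month out of range: {day}")


def notebook_name(list_name: str, posted: date) -> str:
    """Build a notebook filename such as 'h-africa.log0802a'.

    Assumption: the digits are YYMM and the list name is lowercased.
    """
    return f"{list_name.lower()}.log{posted:%y%m}{period_letter(posted.day)}"


print(notebook_name("h-africa", date(2008, 2, 3)))
```

Under those assumptions, a message posted to H-Africa on February 3, 2008 falls in period a and yields the filename shown on the slide.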

16 How H-Net Works: Message Retrieval BRS Database –Newest notebook messages parsed and copied every 24 hours –MD5 hashes created for each message –Available for full-text search MySQL Database Cache –Key metadata extracted, MD5 hashes created, written to database cache –Enables more efficient browsing
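As a rough sketch of the caching step described here, the Python below hashes one message with MD5 and pulls out a few header fields. The slides do not say which metadata fields are cached or exactly which bytes are hashed, so the field choice, the hashing of the full raw text, and the function name `message_record` are all assumptions made for illustration.

```python
import hashlib
from email import message_from_string


def message_record(raw_message: str) -> dict:
    """Return an MD5 hash plus a few header fields for one plain-text message.

    Assumption: the hash covers the whole raw message; the fields shown
    here are illustrative, not H-Net's actual cache schema.
    """
    msg = message_from_string(raw_message)
    return {
        "md5": hashlib.md5(raw_message.encode("utf-8")).hexdigest(),
        "from": msg.get("From", ""),
        "subject": msg.get("Subject", ""),
        "date": msg.get("Date", ""),
    }


sample = (
    "From: editor@example.org\n"
    "Subject: Call for papers\n"
    "Date: Mon, 16 Nov 2009 10:00:00 -0500\n"
    "\n"
    "Message body in plain text.\n"
)
print(message_record(sample))
```

A record like this could be written to a database row keyed by the hash, which matches the slide's point that the hashes serve message discovery rather than fixity.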

17 How H-Net Works: Message Retrieval Message Metadata Stored in MySQL Database

18 How H-Net Works: Message Retrieval http://h-net.msu.edu/cgi-bin/logbrowse.pl?trx=vx&list=H-Albion&month=0808&week=b&msg=w8utW6nKNO1FuY19vSK2mo&user=&pw=
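The retrieval URL can be taken apart and rebuilt with Python's standard library, which makes its structure easier to see. This is a sketch based only on the one example URL: the parameter meanings are inferred from their names (the role of `trx=vx` is not explained in the slides), and the helpers `parse_logbrowse_url` and `build_logbrowse_url` are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

EXAMPLE = (
    "http://h-net.msu.edu/cgi-bin/logbrowse.pl?trx=vx&list=H-Albion"
    "&month=0808&week=b&msg=w8utW6nKNO1FuY19vSK2mo&user=&pw="
)


def parse_logbrowse_url(url: str) -> dict:
    """Split a retrieval URL into its query parameters (blank values kept)."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return {key: values[0] for key, values in query.items()}


def build_logbrowse_url(list_name: str, month: str, week: str, msg: str) -> str:
    """Rebuild a URL of the same shape as the example.

    Assumption: 'trx=vx' and the empty 'user'/'pw' fields are copied
    unchanged from the example.
    """
    params = {"trx": "vx", "list": list_name, "month": month,
              "week": week, "msg": msg, "user": "", "pw": ""}
    return "http://h-net.msu.edu/cgi-bin/logbrowse.pl?" + urlencode(params)


print(parse_logbrowse_url(EXAMPLE))
```

The `month` and `week` values line up with the notebook naming scheme (four digits plus a period letter), and `msg` carries the per-message key used for lookup.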

19 How H-Net Works Message Ingest, Storage, and Retrieval Processes

20 Original “Preservation” Practices Backup, but only local—and no true archiving No normalization or migration strategy –Message/notebook content: No need Created and stored in plain text formats XML encoding only required with proprietary e-mail formats –Needed for attachments on private lists

21 Original “Preservation” Practices Authenticity –Informal check by author and/or editor on posting –Broken URL on message retrieval attempt –Cached metadata as PDI Reference, Content, Provenance Information MD5 hashes for message discovery, not fixity No Fixity Information for notebook files Policies –No documented preservation policies

22 Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC) TRAC 1.0 published in February 2007 For certification by third party or self assessment Three sections –A. Organizational Infrastructure –B. Digital Object Management –C. Technologies, Technical Infrastructure, & Security 84 audit criteria

23 Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC) Compare core audit criteria to local capabilities—“Gap Analysis,” illuminating areas requiring improvement Formulate strategies to narrow the gap and improve trustworthiness of repository

24 Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC) Example 1: Repository has formal succession plan –H-Net: No succession plan in place –Narrow the gap: Identify, negotiate with, and make preliminary plans with potential successor; document intent, describing what’s needed in successor

25 Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC) Example 2: Repository functions on well-supported operating systems and other core infrastructural software –H-Net: Servers run on Debian distribution of Linux –No gap!

26 Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC)

27 The TRAC Experience Thorough yet flexible, leaving room for interpretation, lots of options for supporting documentation/evidence Good snapshot of current state of repository Clarifies what’s needed to narrow the gap Great internal audit tool Useful for certification of a trusted digital repository

28 Preservation Improvements: Backup & Archival Storage Backup Long-term (“permanent”) backup tape sets stored offsite, put on 3-year retention schedule Reciprocal backup storage arrangement with ICPSR Archival Storage Annual copying to tape of H-Net data, databases, scripts Media refreshment every 5 years Future: Copy to alternative storage repository Future: Participation in distributed archival storage system

29 Preservation Improvements: Authenticity Fixity: Individual Messages (SIPs/AIPs) Shorten time window for generation of hashes Create database of SHA-256 hashes for fixity checks Validate message hashes on notebook completion Fixity: Notebook Files (AICs) Create SHA-256 message digests on completion of notebooks Calculate SHA-256 message digests for existing notebooks Create database of SHA-256 message digests for fixity checks Validate notebook hashes on weekly basis
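One way the notebook fixity workflow could look is sketched below in Python: compute a SHA-256 digest when a notebook is complete, store it in a database, and recompute on a schedule to detect changes. SQLite stands in for whatever database H-Net uses, and the table name `notebook_digest` and all function names are hypothetical.

```python
import hashlib
import sqlite3
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_db(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open the digest database (SQLite here is an assumption)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notebook_digest "
        "(path TEXT PRIMARY KEY, sha256 TEXT NOT NULL)"
    )
    return conn


def record_digest(conn: sqlite3.Connection, path: Path) -> None:
    """Store the digest of a completed notebook."""
    conn.execute(
        "INSERT OR REPLACE INTO notebook_digest VALUES (?, ?)",
        (str(path), sha256_of(path)),
    )
    conn.commit()


def failed_fixity(conn: sqlite3.Connection) -> list:
    """Return paths whose file is missing or whose digest no longer matches."""
    rows = conn.execute("SELECT path, sha256 FROM notebook_digest").fetchall()
    return [p for p, expected in rows
            if not Path(p).exists() or sha256_of(Path(p)) != expected]
```

Running `failed_fixity` from a weekly scheduled job would correspond to the slide's "validate notebook hashes on weekly basis"; an empty list means every stored notebook still matches its recorded digest.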

30 Preservation Improvements: Authenticity

31 Preservation Improvements: Attachments Found with < 0.01% of H-Net messages –MS Office, PDF, image files Provide constructed URLs, as with public lists Provide download links No file normalization or migration plan –Most files should open in viewers, later versions of applications –MATRIX will help users if problems arise

32 Preservation Improvements: Digital Preservation Policies Documented digital preservation policies and procedures for the H-Net e-mail lists –http://www.h-net.org/archive/doc.php Based on the Digital Preservation Policy Framework developed by Nancy McGovern of ICPSR –Digital Preservation Management Workshop/Tutorial –Roadmap to developing and documenting policies –Wealth of examples

33 Preservation Improvements: Narrowing the Gap Lather, rinse, repeat: New TRAC assessment Technical improvements Digital preservation policies

34 Conclusions Relevant to e-mail preservation discussion Applicable to preservation of LISTSERV-based and other e-mail lists Testbed for other preservation tools and systems Useful foundation for digital preservation planning at Michigan State

35 References –Digital Preservation Management Tutorial, http://www.icpsr.umich.edu/dpm/dpm-eng/eng_index.html –H-Net Archives Project, http://www.h-net.org/archive/ –H-Net: Humanities and Social Sciences Online, http://www.h-net.org –MATRIX: The Center for Humane Arts, Letters, and Social Sciences Online, http://www.matrix.msu.edu –OAIS Reference Model, http://public.ccsds.org/publications/archive/650x0b1.pdf –Trusted Digital Repositories: Attributes and Responsibilities, http://www.oclc.org/programs/ourwork/past/trustedrep/repositories.pdf –Trustworthy Repositories Audit & Certification: Criteria and Checklist, http://www.crl.edu/PDF/trac.pdf

