Collaborative Preservation of ETDs: The MetaArchive Cooperative and LOCKSS Gail McMillan gailmac@vt.edu Digital Library and Archives, Virginia Tech Canadian ETD and Open Repositories Workshop Atelier sur les TME et les dépôts à libre accès
I met a traveller from an antique land Who said:—Two vast and trunkless legs of stone Stand in the desert. Near them on the sand, Half sunk, a shatter'd visage lies, whose frown And wrinkled lip and sneer of cold command Tell that its sculptor well those passions read Which yet survive, stamp'd on these lifeless things, The hand that mock'd them and the heart that fed. And on the pedestal these words appear: "My name is Ozymandias, king of kings: Look on my works, ye mighty, and despair!" Nothing beside remains: round the decay Of that colossal wreck, boundless and bare, The lone and level sands stretch far away. 1818 P.B. Shelley Thank you very much for inviting me to Ottawa and for this opportunity to share my passion for ETDs and their preservation. [Thanks to Henry Gladney for using this poem [“Ozymandias” ] in “Digital Preservation Archiving and Copyright: Problem Description and Legislative Proposal.”] I really like this 1818 sonnet by Percy Bysshe Shelley for the irony it so sonorously portrays. “My name is Ozymandias, king of kings: Look on my works, ye mighty, and despair!" Nothing beside remains: round the decay Of that colossal wreck, boundless and bare, The lone and level sands stretch far away. Ozymandias was speaking of his many material artifacts of very durable materials. We, however, are nervous because of the relative newness of electronic works and concerned about the long-term viability of ETDs.
Digital Preservation Systematic management of digital works over an indefinite period of time Processes and activities that ensure the continued access to works in digital formats Requires ongoing attention--constant input of resources: effort, time, money IR ≠ Digital Preservation Backups ≠ Digital Preservation How can we preserve our ETD collections and protect them from neglect, from catastrophic events like fires and natural disasters like hurricanes, from hardware and software failures, or from human errors?* We can participate in pragmatic and affordable preservation strategies. We can consciously employ preservation techniques, that is, the systematic management of digital works over an indefinite period of time. This is a working definition of digital preservation. The MetaArchive Cooperative offers such a strategy. But before going into details, I’d like to clarify a few things. Institutional repositories are not preservation strategies per se. They are largely a centralized approach aimed at managing information flow within an institution. IRs typically do not attempt to securely cache prioritized content at multiple geographically dispersed sites. And a single IR is unlikely to have the capability to operate several geographically dispersed and securely maintained servers. * Media Failure, Hardware Failure, Software Failure, Network Failure, Natural Disaster, Operator Error, External Attack, Insider Attack, Economic Failure, Organization Failure
Backups ≠ Digital Preservation Backups are tactical measures Make copies to restore originals after a data-loss event. Typically stored in a single location Often nearby Collocated with the servers backed up Backups address short-term data loss with minimal investment of resources Backups are not the same as preservation. Backups are tactical measures: we make regular copies (often overwriting older copies) and keep them handy to restore following a data-loss event. Preservation is a strategic goal and long-term outcome, together with the means to achieve that goal. Backups are designed to address short-term data loss via minimal investment of money and staff time. Backups are better than nothing, but they are not a comprehensive solution to the problem of preserving information over time. • 30% of all archives have been backed up one time or not at all. Source: 2005 NEDCC Survey by Bishoff and Clareson [Northeast Document Conservation Center] Yes, many of our institutions are satisfied with backups, and VT was too, until we helped create a pragmatic preservation option through our work with the MetaArchive Cooperative.
Digital Preservation is Strategic Long-term, error-free storage for the entire time span the information is required. Realistically address issues in preserving information over time Affordable ongoing investment Geographically dispersed set of secure caches Multi-institutional collaboration through formal agreements DDPN: Distributed Digital Preservation Network • Digital preservation is strategic. Preserving information over long periods requires systematic attention, not benign neglect or short-term tactical actions. To realistically address the issues involved in preserving information over time, a true digital preservation program requires an affordable ongoing investment and a geographically dispersed set of secure caches. Therefore, we need multi-institutional collaboration through formal agreements. We can accomplish this with a DDPN: Distributed Digital Preservation Network. There is no need to reinvent this wheel, since open-source tools for operating a DDPN already exist thanks to Stanford University: LOCKSS = Lots of Copies Keep Stuff Safe
What the LOCKSS software does: Gathers web content into (6) caches as a journal issue is completed/published. Compares content among caches and detects/repairs discrepancies. Preserves the contents of each cache for posterity by never flushing it. Serves content to readers from the publisher, or, if necessary, from the cache. Through agreements with ejournal publishers and libraries (using just low-end equipment such as desktop computers), the LOCKSS software, in general, Collects Compares Preserves Serves When the MetaArchive group was thinking about digital preservation of our unique university resources such as ETDs, we felt that LOCKSS provided the right framework for us in many ways.
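The compare-and-repair idea can be pictured with a minimal Python sketch. It is an illustration only, not the LOCKSS implementation: the actual LOCKSS protocol is more elaborate, whereas this sketch assumes a plain majority vote over SHA-256 digests. The function name `audit_and_repair` and the cache names are hypothetical.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Return a SHA-256 fingerprint of one cached copy."""
    return hashlib.sha256(content).hexdigest()


def audit_and_repair(caches: dict[str, bytes]) -> list[str]:
    """Compare all copies, then overwrite dissenting copies with the majority copy.

    `caches` maps a (hypothetical) cache name to the bytes it holds.
    Returns the names of the caches that were repaired.
    Assumption: a strict majority of caches still holds the undamaged content.
    """
    votes = Counter(digest(content) for content in caches.values())
    winning_digest, count = votes.most_common(1)[0]
    if count <= len(caches) // 2:
        raise ValueError("no majority agreement; cannot decide which copy is good")

    # Pick any copy whose digest matches the majority as the repair source.
    good_copy = next(c for c in caches.values() if digest(c) == winning_digest)

    repaired = []
    for name, content in list(caches.items()):
        if digest(content) != winning_digest:
            caches[name] = good_copy  # repair the damaged copy in place
            repaired.append(name)
    return repaired
```

With six caches of which one has been damaged, a single call repairs the damaged copy, and a second call finds nothing left to fix.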
http://www.metaarchive.org Representatives of 6 research universities in the southeastern United States originally created the MetaArchive of Southern Digital Culture because we shared an increasing concern that the digital items that define our culture and history might be forever lost due to disaster, error, or neglect. These include the evidence of the research being done at our universities, like ETDs, but also the digitizing projects that we undertook to increase the visibility of our Special Collections materials and to prolong their analog lives by reducing the need for individuals to handle them. Even before we sought funding, we agreed that we would like to use existing open source software and that whatever we created would also be openly available.
Distributed Digital Preservation Network Secure Reduces the likelihood that any single cache will be compromised. Distributed Geographically Reduces the likelihood that loss of any single cache will lead to loss of the preserved information. A single organization is unlikely to have the capability to operate several geographically dispersed and securely maintained servers. Inter-institutional agreements will ensure commitment to act in concert over time. The best option is a Distributed Digital Preservation Network. • DDP mobilizes the efforts of multiple institutions. It entails a geographically dispersed set of secure caches of critical information. A true digital preservation program will require multi-institutional collaboration and at least some ongoing investment to realistically address the issues involved in preserving information over time. • Security reduces the likelihood that any single cache will be compromised. • Distribution reduces the likelihood that the loss of any single cache will lead to a loss of the preserved content. But even shared archiving will fail without a precoordinated DDP network in place. Lessons learned from the NDIIPP Archive Ingest and Handling Test (AIHT) and other shared archiving experiments include the realization that much of the cost of preserving digital material lies in coordinating the organizational and institutional imperatives of preservation, not in the technological costs of storage space. The MetaArchive group felt this was a good idea, very adaptable.
MetaArchive Cooperative: DDPN Library of Congress, supported since 2003 LOCKSS without public access Separated preservation from access Bit-level preservation Format agnostic Images, text, multimedia, datasets, program executables, etc. 16 members (150 collections): US, UK, Brazil The Library of Congress liked our plan to create a distributed digital preservation cooperative for digital dark archives. • Established in 2003 under the auspices of and with funding from the National Digital Information Infrastructure and Preservation Program (NDIIPP) of the US Library of Congress • A functioning DDP network using/building open source software • Organized as an incorporated nonprofit cooperative of libraries and other cultural memory organizations • Sustained by organizational membership fees, a cooperative agreement with the U.S. Library of Congress, and other sponsored funding A cooperative is an organization that consists of a group of individuals who have joined together to perform a function more efficiently than each individual could do alone. The purpose of a cooperative is not to make profits, but to improve each member's situation and the situation of the surrounding society. • Provides training and models for other groups to establish similar distributed digital preservation networks • Fosters broader awareness of digital preservation issues • Designed to address “in-the-trenches” needs of cultural memory organizations (CMOs) after environmental scans of other options We have grown from the original 6 members to 16 members, with more in the signature-gathering stage. So it’s a good idea, and there is a strategy. But is it really needed, was our intuition correct, and more specifically, is it needed by institutions with ETD initiatives?
http://www.ndltd.org I approached the NDLTD, since I’d been preaching at Board meetings and making conference presentations. Here you can see that the mission statement specifically says that it promotes preservation of ETDs, but we promoted the theory and did not have an actual strategy to offer. The NDLTD Board of Directors agreed that the MetaArchive’s DDPN was a good strategy. In 2006 the NDLTD formed a strategic alliance with the MetaArchive to ask the question: is it really needed, was our intuition correct, and more specifically, is it needed by institutions with ETD initiatives?
NDLTD/MetaArchive Alliance ETD Preservation Survey Dec. 2007-April 2008 95 institutions responded 80% have ETDs 27% have a preservation plan 92% interested in DDPN http://lumiere.lib.vt.edu/surveys/results/ We did what academics do: we gathered data. We used an online survey and learned that lots of institutions have ETDs but only a few had a preservation plan. Validation! There are many more results you may be interested in seeing at this URL. But where to go from here? A pilot project, of course.
Replace VT Library with ejournal publisher
NDLTD Preservation Guidelines http://scholar. lib. vt Hardware, software Metadata: Conspectus Database of Collections Organizing ETD collections—best practices and data wrangling Institutional Workflow Personnel Training Opportunities Documentation and Reports Retrieving from the ETD Archive Personnel includes: Staff Roles at Member Institutions Program Managers Data Wranglers Systems Administrators Librarians/Archivists
NDLTD Preservation Strategy: MetaArchive http://scholar. lib. vt Preliminaries Join, training, installation Collection readiness Manageable units, permission, define path Harvest/ingest/cache Continual comparisons, repairs Dark archive: PPDN M1 M3 M2 M5 M4
1. Preliminaries:
1. Join the NDLTD and the MetaArchive Cooperative (5% discount)
2. Attend workshop
Equipment set up, register IP w/MetaArchive, and software installation
2. Collection Readiness
Web accessible
Manageable units of ETDs (annual, monthly; <20 GB): Data Wranglers
LOCKSS manifest page = permission to harvest
Define path to units: plugin
Timed harvesting for ongoing preservation or Notify or collections to be ingested into the DDPN
LOCKSS continually compares content, “repairs” differences
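The "manageable units" step (annual or monthly units under 20 GB) can be sketched as a small planning routine. This is a hypothetical illustration of the data-wrangling idea, not a MetaArchive or LOCKSS tool: the function name `plan_units`, the input tuple layout, and the reading of 20 GB as 20 × 10^9 bytes are all assumptions.

```python
from collections import defaultdict

# Assumption: "20 GB" is read as 20 * 10**9 bytes.
LIMIT_BYTES = 20 * 10**9


def plan_units(etds, limit=LIMIT_BYTES):
    """Group ETDs into annual units, falling back to monthly units for large years.

    `etds` is an iterable of (year, month, size_in_bytes) tuples (hypothetical layout).
    Returns a dict mapping a unit label ("2007" or "2008-05") to its total size.
    """
    by_year = defaultdict(int)
    by_month = defaultdict(int)
    for year, month, size in etds:
        by_year[year] += size
        by_month[(year, month)] += size

    units = {}
    for year in sorted(by_year):
        if by_year[year] < limit:
            # The whole year fits under the limit: one annual unit.
            units[str(year)] = by_year[year]
        else:
            # The year is too large: split it into monthly units.
            for y, m in sorted(k for k in by_month if k[0] == year):
                units[f"{y}-{m:02d}"] = by_month[(y, m)]
    return units
```

A monthly unit could still exceed the limit at a very large institution; the sketch does not split further and leaves that case to the data wrangler.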
MetaArchive Membership Levels Preservation Members Fundamental activity: network node server Preserve their own and others' ETDs Sustaining Members Preservation member responsibilities Steering Committee, leadership, and technical development • Preservation Member Sites are responsible for the basic ongoing activity of preserving digital content. At a minimum, every preservation site must include responsible staff and a node server of the relevant preservation network. Preservation sites collectively comprise a preservation network. • Sustaining Member Sites are responsible for the steering committee of the Cooperative and for technical development of the computer systems that enable the preservation network. Obviously, development sites may also be preservation sites and contributing sites. Requirements for operating a node: Be able to bring up and maintain a Linux server over time Task local staff with both program management and systems administration duties, and preferably data wrangling as well Contribute content and monitor system functioning occasionally Sign membership agreement and pay membership dues Contents of each university (nodes M1 through M5) preserved at every other university Multiple, dispersed copies Not a backup-- nothing is overwritten All versions retained
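The difference between this arrangement and a backup (multiple dispersed copies, nothing overwritten, all versions retained) can be shown with a toy model. This is a sketch under stated assumptions, not how MetaArchive nodes are implemented: the `Node` class, the `preserve` function, and the in-memory storage are hypothetical.

```python
class Node:
    """A toy preservation node that keeps every version it has ever ingested."""

    def __init__(self, name):
        self.name = name
        # (owner, unit) -> list of versions, oldest first; entries are never removed.
        self.holdings = {}

    def ingest(self, owner, unit, content):
        versions = self.holdings.setdefault((owner, unit), [])
        # Append only when the content changed; older versions stay in place.
        if not versions or versions[-1] != content:
            versions.append(content)


def preserve(network, owner, unit, content):
    """Store one member's unit on every node in the network."""
    for node in network:
        node.ingest(owner, unit, content)


# Five nodes, labeled M1 through M5 as on the slide.
network = [Node(f"M{i}") for i in range(1, 6)]
preserve(network, "M1", "etd-2007", b"version 1")
preserve(network, "M1", "etd-2007", b"version 2")
```

After the two calls, each of the five nodes holds both versions of M1's unit, whereas a typical backup would keep only the latest copy.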
Technical & Organizational Solutions http://www. metaarchive Cooperative Charter Membership Agreement Technical Specifications MetaArchive Trusted Repository Audit Market Analysis Management Plan Operations Plan Risk Assessment Financial Plan Extension Harvest Plan Outreach Program Implementation Plan Data and Node Recovery Conspectus Database Conspectus Schema MetaArchive Higher capacity equipment than LOCKSS recommends—not desktop computers, but dedicated servers Charter is a formative agreement that lays out the conceptual roles and responsibilities of participants Membership agreement is between new members and MetaArchive Services Group nonprofit corporation Agreement to preserve content for specified period Pledge to not intentionally harm the network Training Coordination Technical support Web interfaces to monitor functionality Documentation Conspectus
The MetaArchive is a Cooperative. Not a vendor. A cooperative is an organization of individuals or institutions who have joined together to perform a function more efficiently than each individual could do alone. The purpose of a cooperative is not to make profits, but to improve each member's situation and the situation of the surrounding society. A collaborative association of cultural memory organizations with a nonprofit administration. Membership fees go to a central pool of support for members’ co-op activities. All hardware and software assets are owned by the members.
ETD Preservation Plan for it; don’t put off for another day Separate issues: access and preservation Collaboration: everybody works and everybody benefits http://www.metaarchive.org http://www.ndltd.org