The Library of Congress
LC Perspective: Preservation Partnerships
Martha Anderson
Program Officer, NDIIPP
Office of Strategic Initiatives
Library of Congress
April 2005

Born Digital “At-Risk” Web Sites

NDIIPP Strategic Direction
Take actions that are:
Catalytic
–Invest in existing strengths
Collaborative
–Engage partners in areas of mutual interest and expertise
Iterative
–Learn by doing
Strategic
–Broad spectrum of balanced short-term & long-term investments

Web of projects
[Diagram labels: LC Web Projects, UIUC, NARA, GPO, IIPC, NDIIPP, CDL, IA, AIHT, Preservation Partners, States Initiative]

Library of Congress Web Archiving Strategy
Collaborate with partners working on the same preservation issues
Develop collection strategies to leverage available resources
Learn by doing

Collaborate with partners working on the same preservation issues
Membership in the International Internet Preservation Consortium (IIPC)
Cooperative projects with NDIIPP Preservation Partners
–California Digital Library
–University of Illinois at Urbana-Champaign
Technical information sharing with other US government agencies
–Government Printing Office
–National Archives and Records Administration

Develop collection strategies to leverage available resources
–Collect thematically both by crawling and by acquiring collections gathered by others
Learn by doing
–Case studies and regular collection of theme-based collections
–Participate in tools development with IIPC
–Archive Ingest & Handling Project

Challenges of collecting from the Web
Characteristics of the resource: dynamic, deep, linked
Intellectual property laws and regulations
Tension of preservation vs. access goals
Degree of alignment with current collection policies for other media
Curation strategy
Tools for identification and selection
Tools for collection, curation, and archiving of large web collections

Average Web Collection
Begins with a theme or event
Usually does not include commercial sites
Starts with a list of about 200 URLs
Is crawled by a vendor
Yields about 1 TB of data per month
Is crawled once a week
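
The figures on this slide imply a rough per-crawl volume. The sketch below works through that arithmetic. It assumes decimal terabytes (1 TB = 1000 GB), an average month of 52/12 weeks, and that the 1 TB per month is spread evenly over the weekly crawls; none of these assumptions come from the slide, and the function names are illustrative only.

```python
# Figures taken from the slide.
TB_PER_MONTH = 1.0
SEED_URLS = 200
CRAWLS_PER_WEEK = 1

# Assumptions, not from the slide.
GB_PER_TB = 1000           # decimal terabytes
WEEKS_PER_MONTH = 52 / 12  # average month length in weeks


def gb_per_crawl(tb_per_month=TB_PER_MONTH, crawls_per_week=CRAWLS_PER_WEEK):
    """Estimate data gathered per crawl, assuming volume is spread evenly."""
    crawls_per_month = crawls_per_week * WEEKS_PER_MONTH
    return tb_per_month * GB_PER_TB / crawls_per_month


def gb_per_seed_per_crawl(seed_urls=SEED_URLS):
    """Estimate data gathered per seed URL in a single crawl."""
    return gb_per_crawl() / seed_urls


if __name__ == "__main__":
    print(f"{gb_per_crawl():.0f} GB per weekly crawl")                  # 231
    print(f"{gb_per_seed_per_crawl():.2f} GB per seed URL per crawl")  # 1.15
```

Under these assumptions an average collection gathers roughly 230 GB per weekly crawl, or a little over 1 GB per seed URL.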

Web Collections to date at LC
Event-based
–US National Elections: 2000, 2002, 2004
–War in Iraq
–September 11
Public Policy Topics
–Health Care
–Legislative Branch
–Terrorism
26 TB

Archive Ingest & Handling Test
AIHT is a first test of the proposed NDIIPP preservation architecture.
The test is conducted with a common data set.
–George Mason University 9/11 Archive
Phase I tests ingest and data handling in local systems.
Phase II tests export and import between institutions.
Phase III explores format migration.

GMU 9/11 Archive
Participants demonstrate capabilities
Participants exchange archive

Participants
Old Dominion University, Department of Computer Science
Stanford University Libraries & Academic Information Resources
The Johns Hopkins University, Sheridan Libraries
Harvard University Library

George Mason University 9/11 Archive: Breakdown by File Types
57,450+ files
12 GB
Originally stored in a Linux environment
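
A breakdown by file type like the one on this slide can be approximated by tallying file extensions across an archive directory. The following is a minimal sketch, not the method the AIHT participants used; the function names are hypothetical, and the extension is only a rough proxy for the actual format.

```python
from collections import Counter
from pathlib import Path


def breakdown_by_extension(root):
    """Count files and total bytes under root, grouped by lowercased extension."""
    counts = Counter()
    sizes = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            ext = path.suffix.lower() or "(none)"
            counts[ext] += 1
            sizes[ext] += path.stat().st_size
    return counts, sizes


def format_report(counts, sizes):
    """Return one tab-separated line per extension, most common first."""
    return "\n".join(
        f"{ext}\t{n}\t{sizes[ext]}" for ext, n in counts.most_common()
    )
```

Calling `breakdown_by_extension` on a local copy of an archive and passing the result to `format_report` gives a per-extension file count and byte total. Format identification that inspects file contents would be more reliable than extensions for preservation work.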

Goals of AIHT
Gain practical experience with multiple institutions
Document transfer and ingest processes for multiple systems
Determine next set of tasks for developing interfaces between layers and institutions

Status of AIHT
All phases completed.
–Imports focused on technical assessment of the archive and on developing tools to examine it
–Exports included METS and MPEG-21 DID objects
–Migrations included transforms to JPEG 2000 and TIFF, and some exploration of HTML to XML and AVI to MPEG
Full report expected by early summer.

For more information…
NDIIPP Technical Architecture version
International Internet Preservation Consortium
MINERVA: Mapping the INternet Electronic Resources Virtual Archive

Martha Anderson
NDIIPP Program Officer
Office of Strategic Initiatives
The Library of Congress
Washington, DC