Archiving the Web: why bother ? LA Times (March 2000)
Archiving the Web: why bother ? “Web sites are an increasingly important part of [an] institution’s digital assets and of [a] country’s information and cultural heritage.” (JISC – April 2002) “A lot of history is born digital. This should not be like early television where there is no record.” (Brewster Kahle – May 2002)
Archiving the Web: who bothers? Australia USA Nordic countries: Denmark, Finland, Sweden Other countries: UK, France, Japan Internet Archive –“Wayback Machine”
Three conferences: What’s next for Digital Deposit Libraries ? Darmstadt, September 2001 International Symposium on Web Archiving. Tokyo, January 2002 DPC Forum: Web-archiving. London, March 2002.
The Web Sites http://…ECDL2001.htm http://…eng.html http://…ion/webforum.html
Issues and Questions Legal Deposit of Digital Information ? –European Union Copyright Directive Copyright ? Open or closed access ? Selective or comprehensive ? When in the life cycle ? How often ? Capturing the experience –Dynamic web sites
Technical challenges Embedded external links and executable programs Persistent naming and date stamping Duplicate control Change in content over time Surface web vs Deep web
Australia (PANDORA Archive) – NLA As yet no legal deposit. Mandate for collecting C’wlth Government publications Selective –(Australian e-journals, organisational sites, government publications, ephemera) Accessible by public –Catalogued in the NBD
Australia (PANDORA Archive) ~1700 titles in the Archive (Nov. 2001) –Growth rate: 40 sites/month –Regathering: 35 sites/month ADRI (Australian Digital Resource Identifier) –Unique identification scheme –In-house resolving system
USA (Minerva) - Library of Congress (Mapping the Internet Electronic Resources Virtual Archive) Open access materials from the Web Changes in copyright law under discussion Selective inclusion Public access
LC/IA Pilot Project – “Election 2000” Joint pilot project Library of Congress and Internet Archive Objectives: –Library pilot : selection, collection and cataloguing web sites; build prototype access system –Internet Archive pilot: gain experience in harvesting and archiving sites Over 800 websites (150+ selected sites and major sites hyperlinked to/from those sites) 2-3 terabytes of data Archived daily August 2000 to January 2001
Denmark Royal Library, Copenhagen. Limited legal deposit of electronic publications –Static, not dynamic publications – finite units Access only from workstations at Royal Library and State and University Library Archiving static websites (monographs, periodicals) Server mirrored nightly to State and University Library, Århus
Denmark (Statistics) June archived 9000 net publications –31% monographs, 69% periodicals –67.5% public sector/university, 32.5% private sector publications Staff resources 0.5 technical; 0.8 librarian
Sweden (Royal Library) Take snapshots of Swedish Web several times/year –No selection - take everything –All www pages in Sweden, all articles in e-journals, all Swedish newspapers –Definition of, with Swedish address or telephone number –Archive only - no public access as yet.
Sweden (Software) Uses Whois to identify Swedish sites in domains Harvesting with COMBINE Robot software (Univ. of Lund) –Collects papers by automatically following hypertext links –Also collects pictures and sound –Fully automatic - no human intervention
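The link-following behaviour described above can be illustrated with a minimal sketch. This is not COMBINE itself: the `harvest` function, the `LinkExtractor` class, the in-memory `PAGES` dictionary and the `example.se` host are all hypothetical, and a real harvester would fetch over HTTP and add politeness delays.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect href/src targets, so pages, pictures and sound are all found."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def harvest(seed_urls, fetch, allowed_hosts):
    """Breadth-first crawl: fetch each URL once and queue every in-scope link."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    archive = {}
    while queue:
        url = queue.popleft()
        body = fetch(url)
        if body is None:  # nothing retrievable at this URL
            continue
        archive[url] = body
        parser = LinkExtractor(url)
        parser.feed(body)
        for link in parser.links:
            if urlparse(link).hostname in allowed_hosts and link not in seen:
                seen.add(link)
                queue.append(link)
    return archive


# Toy in-memory "web" standing in for real HTTP fetches (assumption).
PAGES = {
    "http://example.se/": '<a href="/a.html">A</a> <img src="logo.gif">'
                          ' <a href="http://example.com/">out</a>',
    "http://example.se/a.html": '<a href="/">home</a>',
    "http://example.se/logo.gif": "",
}
result = harvest(["http://example.se/"], PAGES.get, {"example.se"})
print(sorted(result))
```

The `allowed_hosts` check plays the role of the national scope decision: links leaving the selected hosts are seen but not followed.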
Swedish Archive (Kulturarw3) Everything associated with an object and metadata stored in one file as a multipart MIME object Name of the file: 33 character string with time stamp Sept 2001: 110 million files Gbytes of data from 97,000 web servers Stored on disk and magnetic tapes using Hierarchical Storage Management (HSM)
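As a rough illustration of storing an object and its metadata together in one multipart MIME file, the sketch below uses Python's standard `email` package. The header names (`X-Archived-URL`, `X-Fetched-At`) and the file-name layout are assumptions made for this example; the slide only states that the name is a 33-character string containing a time stamp.

```python
import hashlib
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def pack_object(url, fetched_at, http_headers, payload):
    """Bundle metadata and content into a single multipart MIME byte string."""
    container = MIMEMultipart()
    container["X-Archived-URL"] = url        # hypothetical header name
    container["X-Fetched-At"] = fetched_at   # hypothetical header name
    container.attach(MIMEText(http_headers, "plain", "utf-8"))  # part 1: metadata
    container.attach(MIMEApplication(payload))                  # part 2: the object
    return container.as_bytes()


def archive_name(url, fetched_at):
    """Hypothetical 33-character name: 14-digit time stamp, '-', 18 hex digits."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()[:18]
    return f"{fetched_at}-{digest}"


blob = pack_object(
    "http://example.se/",
    "20010915120000",
    "HTTP/1.0 200 OK\nContent-Type: text/html",
    b"<html>hej</html>",
)
name = archive_name("http://example.se/", "20010915120000")

# Reading the file back recovers both the metadata and the original bytes.
restored = message_from_bytes(blob)
print(name, restored["X-Archived-URL"])
```

Keeping everything in one file means a single tape or disk object is enough to reconstruct both the content and the circumstances of its capture.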
Swedish Archive (Kulturarw3) (2) Prior to July 2002: Limited legal deposit (fixed form e- documents) December 2001 : Data Inspection Board team confirms project is illegal. Project suspended July Amendments to Swedish copyright law. Gives Royal Library right to collect the Swedish web and to make the archive publicly available.
Finland - National Library Follows Swedish approach - domain initially Finnish Copyright Act under revision to permit harvesting web resources Uses harvesting software developed in Finland from NEDLIB specification Archive Metadata –Uses MD5 checksum for duplicate control, authentication and creation of a unique access key –Time stamped upon retrieval
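The three uses of the MD5 checksum listed above (duplicate control, authentication, access key) can be sketched as follows. This is an illustration only, not the NEDLIB harvester's code; the `Archive` class and its method names are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


class Archive:
    """Toy content-addressed store keyed by the MD5 checksum of each file."""

    def __init__(self):
        self.objects = {}   # access key (MD5 hex digest) -> content bytes
        self.captures = []  # (url, retrieval time stamp, access key)

    def store(self, url, content, retrieved_at=None):
        key = hashlib.md5(content).hexdigest()  # unique access key
        is_new = key not in self.objects        # duplicate control
        if is_new:
            self.objects[key] = content
        # Every retrieval is time stamped, even when the content is a duplicate.
        stamp = retrieved_at or datetime.now(timezone.utc)
        self.captures.append((url, stamp, key))
        return key, is_new

    def verify(self, key):
        """Authentication: recompute the checksum and compare it with the key."""
        return hashlib.md5(self.objects[key]).hexdigest() == key


archive = Archive()
key_a, new_a = archive.store("http://example.fi/logo.gif", b"GIF89a...")
key_b, new_b = archive.store("http://example.fi/copy/logo.gif", b"GIF89a...")
print(new_a, new_b, len(archive.objects), len(archive.captures))
```

The same bytes found at two URLs are stored once but recorded as two captures, which is how a checksum key keeps a harvest from ballooning with duplicates.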
Finland - Results of current Harvesting Round (1) Harvesting round –Commenced August completed in April 2002 –9.4 million files from 29 million locations (URLs) –Compressed data occupied 340 Gbytes of storage –Stored on a tape robot in national supercomputing centre –Hardware used: Sun E450 server
Finland - Results of current Harvesting Round (2) Finnish experience: “the NEDLIB harvester can deal with any national Web space (except perhaps the USA) with reasonably modest hardware, provided that there is sufficient storage space available somewhere”. (Juha Hakala, leader of the Finnish team)
Nordic Web Archive Joint project of Nordic national libraries Not dependent on what harvester is used –NEDLIB (Finland, Norway, Denmark), COMBINE (Sweden) Selected Norwegian search engine (FAST) Software –Convert documents from 100 different MIME types to HTML –Recognises most European languages Budget: 260,000 Euros (AUS $475,000)
“The homogeneous (surface) Web” (Denmark, Finland) 59.3% - Text/HTML; 37.9% - Image (GIF, JPEG, PNG); 1.7% - PDF; 1.1% - Other formats. 1.5 million HTML; 1 million GIF; 550,000 JPEG; 36,500 PDF; 11,800 plain text; 6,000 Word; 5,300 Java etc.
United Kingdom (1) British Library –“” experiment (commenced 2002) Select and capture 100 UK websites (2001 election, GM crops) selected sites for approval Revisit every three weeks Uses Bluesquirrel Web Whacker software Audit change, loss and links over time –Intention to scale up (2004 funding bid)
United Kingdom (2) UKOLN Research Project –Estimates of size domain: 3 million sites, 24 million pages Wellcome Library/JISC Archiving Study to find a solution to web archiving –The “medical web” –Consultancy awarded March Completion date October –Draft report August Final report to be disseminated to the community
Germany (Deutsche Bibliothek) –Experiments with targeted harvesting –Two incomplete snapshots 12/2000 and 02/2001
France (Bibliothèque nationale de France) –In 2001: two experiments with small numbers of sites (16,100), including music, video and multimedia. –Unsatisfactory results: Unexpected features Exceptionally large sites –Planning new feasibility study with 2 different robot providers –Change in legal deposit law proposed in June Not yet adopted by Parliament.
Japan National Diet Library WARP (Web Archiving Program) Initially selective Major changes in Japanese copyright law expected to permit more comprehensive collecting.
Internet Archive (1) Founded by Brewster Kahle in 1996. $15 million from sale of WAIS Non-profit organisation. –Sponsors include AT&T Research, Compaq, Xerox PARC, Quantum DLT, National Science Foundation. Archived web pages from 1996+, movies from 1903 to 1973 Site has archived over 10 billion pages (Oct. 2001) = more than 100 terabytes Growth rate : 10 terabytes/month
Internet Archive (2) Complete sweep of the Web every two months “Robot exclusions” - many newspapers, individuals, photographers Complete copy of Archive at Bibliotheca Alexandrina (April 2002) Duplicates in other continents proposed. “Best method of preservation is replication”. Copyright ? “May be a massive violation of copyright law”. (Lawrence Lessig, Stanford University expert on IP law in Cyberspace)
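The "robot exclusions" mentioned above are expressed through a site's robots.txt file. A minimal sketch of how a harvester might honour them, using Python's standard `urllib.robotparser`, is given below. The robots.txt content and the `may_archive` helper are made up for illustration, and the `ia_archiver` user-agent string is used here as an assumed crawler name.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt a newspaper might publish (invented for this sketch).
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_archive(robots_txt, user_agent, url):
    """Return True if the given crawler is allowed to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = may_archive(ROBOTS_TXT, "ia_archiver", "http://example.com/news.html")
allowed = may_archive(ROBOTS_TXT, "OtherBot", "http://example.com/news.html")
private = may_archive(ROBOTS_TXT, "OtherBot", "http://example.com/private/x.html")
print(blocked, allowed, private)
```

A site owner who excludes one named crawler in this way stays out of that archive while remaining visible to other robots.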
“Wayback machine” Front end to the Internet Archive collection of public web pages Includes most image files in the collection Launched October 2001 Fully available to public 20,000 users/day; up to 200 queries per second Not yet text searchable (URL search only) Financial sustainability ? (No advertising)
Conclusion “We’re not here to test laws. We’re trying to build a world we want to live in. The world without a library is a world without memory, and that would be tragic.” B. Kahle, October “On the Web, anyone can be a publisher; now there is a library for their work.” B. Kahle, May 2002