Archiving the Web: why bother ? LA Times (March 2000)

1 Archiving the Web: why bother ? LA Times (March 2000)

2 Archiving the Web: why bother ? “Web sites are an increasingly important part of [an] institution’s digital assets and of [a] country’s information and cultural heritage.” (JISC – April 2002) “A lot of history is born digital. This should not be like early television where there is no record.” (Brewster Kahle – May 2002)

3 Archiving the Web: who bothers? Australia USA Nordic countries: Denmark, Finland, Sweden Other countries: UK, France, Japan Internet Archive –“Wayback Machine”

4 Three conferences: What’s next for Digital Deposit Libraries ? Darmstadt, September 2001 International Symposium on Web Archiving. Tokyo, January 2002 DPC Forum: Web-archiving. London, March 2002.

5 The Web Sites http://www.bnf.fr/pages/infopro/dli_ECDL2001.htm http://www.ndl.go.jp/e/enews/sympo-eng.html http://www.jisc.ac.uk/dner/preservation/webforum.html

6 Issues and Questions Legal Deposit of Digital Information ? –European Union Copyright Directive Copyright ? Open or closed access ? Selective or comprehensive ? When in the life cycle ? How often ? Capturing the experience –Dynamic web sites

7 Technical challenges Embedded external links and executable programs Persistent naming and date stamping Duplicate control Change in content over time Surface web vs Deep web

8 Australia (PANDORA Archive) – NLA http://www.nla.gov.au/pandora As yet no legal deposit. Mandate for collecting Commonwealth Government publications Selective –(Australian e-journals, organisational sites, government publications, ephemera) Accessible by public –Catalogued in the NBD

9 Australia (PANDORA Archive) ~1700 titles in the Archive (Nov. 2001) –Growth rate: 40 sites/month –Regathering: 35 sites/month ADRI (Australian Digital Resource Identifier) –Unique identification scheme –In-house resolving system

10 USA (Minerva) - Library of Congress (Mapping the Internet Electronic Resources Virtual Archive) Open access materials from the Web Changes in copyright law under discussion Selective inclusion Public access

11 LC/IA Pilot Project – “Election 2000” Joint pilot project Library of Congress and Internet Archive Objectives: –Library pilot : selection, collection and cataloguing web sites; build prototype access system –Internet Archive pilot: gain experience in harvesting and archiving sites Over 800 websites (150+ selected sites and major sites hyperlinked to/from those sites) 2-3 terabytes of data Archived daily August 2000 to January 2001

12 Denmark http://www.netarchive.dk Royal Library, Copenhagen. Limited legal deposit of electronic publications –Static, not dynamic publications – finite units Access only from workstations at Royal Library and State and University Library Archiving static websites (monographs, periodicals) Server mirrored nightly to State and University Library, Århus

13 Denmark (Statistics) June 2001 - archived 9000 net publications –31% monographs, 69% periodicals –67.5% public sector/university, 32.5% private sector publications Staff resources 0.5 technical; 0.8 librarian

14 Sweden (Royal Library) Take snapshots of Swedish Web several times/year –No selection - take everything –All www pages in Sweden, all articles in e-journals, all Swedish newspapers –Definition of Sweden: .se, plus .com, .org, .net with Swedish address or telephone number –Archive only - no public access as yet.

15 Sweden (Software) Uses Whois to identify Swedish sites in non-.se domains Harvesting with COMBINE Robot software (Univ. of Lund) –Collects papers by automatically following hypertext links –Also collects pictures and sound –Fully automatic - no human intervention
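The scope rule on the two Sweden slides (everything under .se, plus .com, .org and .net sites registered to a Swedish address) can be sketched as a simple predicate. This is an illustration only, not the COMBINE harvester's code: the function name `in_swedish_scope` and the `registrant_country` callable, which stands in for a Whois lookup, are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Generic top-level domains that need a registrant check before inclusion.
GENERIC_TLDS = (".com", ".org", ".net")


def in_swedish_scope(url, registrant_country):
    """Return True if the URL falls inside the 'Swedish Web' as defined on the slide.

    registrant_country is a hypothetical callable (host -> country code or None)
    standing in for a Whois lookup; it is not a real Whois client.
    """
    host = (urlsplit(url).hostname or "").lower()
    if host.endswith(".se"):
        return True
    if host.endswith(GENERIC_TLDS):
        return registrant_country(host) == "SE"
    return False
```

Passing the lookup in as a parameter keeps the rule testable without network access; a real harvester would presumably cache Whois answers per host.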

16 Swedish Archive (Kulturarw3) http://www.kb.se/kw3 Everything associated with an object and metadata stored in one file as a multipart MIME object Name of the file: 33 character string with time stamp Sept 2001: 110 million files - 3000 Gbytes of data from 97,000 web servers Stored on disk and magnetic tapes using Hierarchical Storage Management (HSM)
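Storing an object together with its metadata in one multipart MIME file might look roughly like the sketch below. It uses Python's standard `email` package; the three-part layout (harvest metadata, HTTP headers, payload) and the names `pack` and `unpack` are assumptions for illustration, not the Kulturarw3 file format. The 33-character file name is not reproduced because the slide does not specify its scheme.

```python
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def pack(url, fetched_at, headers, body):
    """Bundle harvest metadata, HTTP headers and the payload into one MIME object."""
    bundle = MIMEMultipart("mixed")
    # Part 1: harvest metadata (hypothetical fields).
    bundle.attach(MIMEText(f"url: {url}\nfetched: {fetched_at}\n", "plain", "utf-8"))
    # Part 2: the HTTP response headers as text.
    bundle.attach(MIMEText(headers, "plain", "utf-8"))
    # Part 3: the payload, base64-encoded so arbitrary bytes survive.
    bundle.attach(MIMEApplication(body))
    return bundle.as_bytes()


def unpack(raw):
    """Split a packed object back into (metadata, headers, body)."""
    parts = message_from_bytes(raw).get_payload()
    meta = parts[0].get_payload(decode=True).decode("utf-8")
    headers = parts[1].get_payload(decode=True).decode("utf-8")
    body = parts[2].get_payload(decode=True)
    return meta, headers, body
```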

17 Swedish Archive (Kulturarw3) (2) Prior to July 2002: Limited legal deposit (fixed form e-documents) December 2001: Data Inspection Board team confirms project is illegal. Project suspended July 2002. Amendments to Swedish copyright law. Gives Royal Library right to collect the Swedish web and to make the archive publicly available.

18 Finland - National Library Follows Swedish approach - only .fi domain initially Finnish Copyright Act under revision to permit harvesting web resources Uses harvesting software developed in Finland from NEDLIB specification Archive Metadata –Uses MD5 checksum for duplicate control, authentication and to create a unique access key –Time stamped upon retrieval
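The use of an MD5 checksum for duplicate control, authentication and as an access key can be illustrated with a small content-addressed store. This is a sketch under stated assumptions, not the NEDLIB harvester's implementation: the class name `ArchiveStore`, its methods, and the in-memory dictionary are all hypothetical.

```python
import hashlib
from datetime import datetime, timezone


class ArchiveStore:
    """Toy store in which the MD5 digest of the content is the access key."""

    def __init__(self):
        # digest -> (content, list of (url, retrieval timestamp))
        self._objects = {}

    def add(self, url, content):
        """Store content once per distinct digest; return (key, is_new)."""
        key = hashlib.md5(content).hexdigest()
        stamp = datetime.now(timezone.utc).isoformat()  # time stamped upon retrieval
        is_new = key not in self._objects
        if is_new:
            self._objects[key] = (content, [])
        # Record every location and retrieval time, even for duplicate content.
        self._objects[key][1].append((url, stamp))
        return key, is_new

    def verify(self, key):
        """Authentication check: recompute the digest and compare it to the key."""
        content, _ = self._objects[key]
        return hashlib.md5(content).hexdigest() == key
```

Two URLs that return identical bytes share one stored copy, which may help explain how a harvest can visit far more locations than it stores files.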

19 Finland - Results of current Harvesting Round (1) Harvesting round 2001-2002 –Commenced August 2001 - completed in April 2002 –9.4 million files from 29 million locations (URLs) –Compressed data occupied 340 Gbytes of storage –Stored on a tape robot in national supercomputing centre –Hardware used: Sun E450 server

20 Finland - Results of current Harvesting Round (2) Finnish experience: “the NEDLIB harvester can deal with any national Web space (except perhaps the USA) with reasonably modest hardware, provided that there is sufficient storage space available somewhere”. (Juha Hakala, leader of the Finnish team)

21 Nordic Web Archive Joint project of Nordic national libraries Not dependent on what harvester is used –NEDLIB (Finland, Norway, Denmark), COMBINE (Sweden) Selected Norwegian search engine (FAST) Software –Convert documents from 100 different MIME types to HTML –Recognises most European languages Budget: 260,000 Euros (AUS $475,000)

22 “The homogeneous (surface) Web” 59.3% - Text/HTML; 37.9% - Image (GIF, JPEG, PNG); 1.7% - PDF; 1.1% - Other formats. 1.5 million HTML; 1 million GIF; 550,000 JPEG; 36,500 PDF; 11,800 plain text; 6,000 Word; 5,300 Java etc. Denmark / Finland

23 United Kingdom (1) British Library –“Domain.uk” experiment (commenced 2002) Select and capture 100 UK websites (2001 election, GM crops) Email selected sites for approval Revisit every three weeks Uses Blue Squirrel WebWhacker software Audit change, loss and links over time –Intention to scale up (2004 funding bid)

24 United Kingdom (2) UKOLN Research Project –Estimates of size of .uk domain: 3 million sites, 24 million pages Wellcome Library/JISC Archiving Study to find a solution to web archiving –The “medical web” –Consultancy awarded March 2002 - Completion date October 2002. –Draft report August 2002. Final report to be disseminated to the community

25 Germany (Deutsche Bibliothek) –Experiments with targeted harvesting –Two incomplete snapshots 12/2000 and 02/2001

26 France (Bibliothèque nationale de France) –In 2001: two experiments with small numbers of sites (16,100), including music, video and multimedia. –Unsatisfactory results: Unexpected features Exceptionally large sites –Planning new feasibility study with 2 different robot providers –Change in legal deposit law proposed in June 2001. Not yet adopted by Parliament.

27 Japan National Diet Library WARP (Web Archiving Program) Initially selective Major changes in Japanese copyright law expected to permit more comprehensive collecting.

28 Internet Archive (1) Founded by Brewster Kahle in 1996 - $15 million from sale of WAIS Non-profit organisation. –Sponsors include AT&T Research, Compaq, Xerox PARC, Quantum DLT, National Science Foundation. Archived web pages from 1996+, movies from 1903 to 1973 Site has archived over 10 billion pages (Oct. 2001) = more than 100 terabytes Growth rate : 10 terabytes/month

29 Internet Archive (2) Complete sweep of the Web every two months “Robot exclusions” - many newspapers, individuals, photographers Complete copy of Archive at Bibliotheca Alexandrina (April 2002) Duplicates in other continents proposed. “Best method of preservation is replication”. Copyright ? “May be a massive violation of copyright law”. (Lawrence Lessig, Stanford University expert on IP law in Cyberspace)
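The “robot exclusions” mentioned on this slide are expressed through a site's robots.txt file. A rough sketch of how a crawler might honour them, using Python's standard `urllib.robotparser`, follows. The robots.txt content and the helper name `allowed` are made up for the example, and the user-agent string `ia_archiver` is used on the assumption that it is the name site owners addressed to exclude the Internet Archive's crawler.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: exclude one archiving crawler entirely,
# and keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A site that publishes the first rule would be skipped by that crawler, which is one way a newspaper or photographer could opt out of being archived.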

30 “Wayback machine” - http://www.archive.org Front end to the Internet Archive collection of public web pages Includes most image files in the collection Launched October 2001 Fully available to public 20,000 users/day; up to 200 queries per second Not yet text searchable (URL search only) Financial sustainability ? (No advertising)

31 Conclusion “We’re not here to test laws. We’re trying to build a world we want to live in. The world without a library is a world without memory, and that would be tragic.” B. Kahle, October 2001. “On the Web, anyone can be a publisher; now there is a library for their work.” B. Kahle, May 2002

