Harvesting and archiving the Web
Nordunet2000
Juha Hakala, Helsinki University Library
Contents
Legal background
Functional issues
– harvesting
– archival
– indexing
Some projects
Legal background
National libraries store, in theory, all (printed) publications
– this work is based on (legal) deposit
Legal deposit acts are currently being extended to electronic materials
Copyright must be relaxed
– permission to copy documents (e.g. for preservation purposes) is essential
Finnish Act on Legal Deposit (proposed)
The national library is granted the right to harvest and archive freely available Web documents
Archived resources can only be accessed from dedicated workstations within the deposit libraries (n = 6)
Access to the references (index) will be free
– anyone can see what is archived
Functional issues
Libraries' traditional methods do not fit electronic materials well
– manual cataloguing cannot be done for millions of documents
New, automated means must be developed for the acquisition, storage and indexing of electronic publications
– this work is well under way
Harvesting
Harvester = automated tool for collecting Web documents
– given a number of URLs, the harvester will fetch the documents, check the hypertext links in these documents, and then proceed to a second harvesting round
– this goes on until all qualifying documents have been fetched; a minimal sketch of the loop follows below
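A minimal sketch of that harvesting loop in Python, under stated assumptions: the seed list, the qualifies() inclusion rule and the store() hook are illustrative placeholders, not code from any of the tools discussed here.

```python
# Minimal sketch of the harvesting loop: fetch, extract links, repeat.
# qualifies() and store() are illustrative placeholders, not real tool code.
from collections import deque
from urllib.parse import urljoin
import urllib.request
import re

LINK_RE = re.compile(rb'href="([^"]+)"', re.IGNORECASE)

def qualifies(url: str) -> bool:
    # Hypothetical inclusion rule, e.g. restrict the crawl to one domain.
    return url.startswith("http://www.example.fi/")

def store(url: str, body: bytes) -> None:
    pass  # archiving is sketched on the document storage slide below

def harvest(seeds: list) -> None:
    queue, seen = deque(seeds), set(seeds)
    while queue:  # round after round, until no qualifying document is left
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                body = resp.read()
        except OSError:
            continue  # broken links and bad servers are common; skip them
        store(url, body)
        for match in LINK_RE.finditer(body):
            link = urljoin(url, match.group(1).decode("ascii", "replace"))
            if qualifies(link) and link not in seen:
                seen.add(link)
                queue.append(link)
```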
Practical experiences
Building a robust and well-behaved tool is not easy
– bad quality of data & Web servers
– huge quantity of data: what works for 1,000 files may not work for a million
Performance optimisation is complicated
– if scheduling is done the easy way, after a while only the large servers are left (one possible remedy is sketched below)
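One way to avoid ending the crawl grinding through a few large servers is per-host scheduling: keep a separate queue per server and visit hosts round-robin, so no single large site monopolises the harvester. This sketch is an assumption about how such scheduling could look, not the design of any existing harvester.

```python
# Round-robin per-host scheduling: one queue per server, so a single
# large site cannot monopolise the harvest. Illustrative sketch only.
from collections import defaultdict, deque
from urllib.parse import urlsplit

class HostScheduler:
    def __init__(self):
        self.queues = defaultdict(deque)   # host -> pending URLs
        self.hosts = deque()               # hosts in round-robin order

    def put(self, url: str) -> None:
        host = urlsplit(url).netloc
        if not self.queues[host]:          # host had no pending work before
            self.hosts.append(host)
        self.queues[host].append(url)

    def get(self):
        if not self.hosts:
            return None
        host = self.hosts.popleft()
        queue = self.queues[host]
        url = queue.popleft()
        if queue:                          # host still has work: requeue it
            self.hosts.append(host)
        return url
```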
Archival
No off-the-shelf solutions for this!
Strange exclusion and scheduling rules must be implemented in the harvester
– e.g. always retrieve inline materials, and at once (see the sketch below)
– it may be hard to modify an existing application accordingly
Generate & store archive metadata
Document storage
Incremental versus full archiving
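The kind of rule the slide refers to might look like the following; the suffix list and the hooks are assumptions for illustration, since the real rules are application-specific.

```python
# Illustrative scheduling rule: inline material is fetched at once,
# together with the page that embeds it; ordinary links wait for the
# next harvesting round. Suffix list and hooks are assumptions.
from collections import deque

INLINE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css")

def is_inline(url: str) -> bool:
    return url.lower().endswith(INLINE_SUFFIXES)

def schedule(url: str, next_round: deque, fetch_now) -> None:
    if is_inline(url):
        fetch_now(url)          # always retrieve inline materials, at once
    else:
        next_round.append(url)  # everything else joins the normal queue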
Archive metadata – examples
MD5 checksum
– duplicate control of the archive
– authentication of archived materials
– unique access key (used as URN/NBN)
Document size & location (old & new)
Time stamp
– when was the document retrieved
– a single point in time, or a period
A sketch of one such record follows below
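A minimal sketch of such a record in Python; the field names and the urn:nbn string are assumptions, shown only to make the list above concrete.

```python
# One possible archive metadata record for a harvested document.
# Field names and the urn:nbn string are illustrative assumptions.
import hashlib
from datetime import datetime, timezone

def make_metadata(url: str, body: bytes, old_location: str, new_location: str) -> dict:
    digest = hashlib.md5(body).hexdigest()
    return {
        "md5": digest,                    # duplicate control & authentication
        "urn": f"urn:nbn:fi-{digest}",    # checksum doubles as a unique access key
        "size": len(body),                # document size
        "location_old": old_location,     # where it was on the Web
        "location_new": new_location,     # where it sits in the archive
        "retrieved": datetime.now(timezone.utc).isoformat(),  # time stamp
        "source_url": url,
    }
```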
Document storage
Store the documents into a database or into a (UNIX) file system
– elimination of duplicates (MD5 check)
– extract metadata from files
– pre-processing of files (tar & ZIP)
– send location information to the archive database (see the storage sketch below)
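A file-system variant of the storage step might look like the sketch below; the archive root, the layout by checksum prefix and the database hook are assumptions.

```python
# Sketch of storage into a (UNIX) file system: duplicates are eliminated
# by MD5 check, files are laid out by checksum prefix, and the location
# is sent to the archive database. Paths and hooks are assumptions.
import hashlib
import os

ARCHIVE_ROOT = "/archive"      # hypothetical mount point
known_checksums = set()        # in practice, a lookup in the archive DB

def record_location(url: str, digest: str, path: str) -> None:
    pass  # placeholder: send location information to the archive database

def store(url: str, body: bytes) -> None:
    digest = hashlib.md5(body).hexdigest()
    if digest in known_checksums:
        return                 # duplicate: already archived
    known_checksums.add(digest)
    path = os.path.join(ARCHIVE_ROOT, digest[:2], digest)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(body)
    record_location(url, digest, path)
```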
Indexing
Many off-the-shelf solutions exist, but they do not qualify as such
– enhancements needed for indexing archive metadata + adding it to the user interface
– changes to navigation (hyperlinks must point into the archive where applicable; see the sketch below)
– ability to cope with a very large number of documents and file types
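The navigation change can be pictured as a link-rewriting pass over each page served from the index; archive_lookup() below is a hypothetical resolver, not part of any named search engine.

```python
# Rewriting hyperlinks so that targets held in the archive are served
# from the archive instead of the live Web. archive_lookup() is a
# hypothetical resolver returning an archive URL or None.
import re

HREF_RE = re.compile(r'href="([^"]+)"')

def rewrite_links(html: str, archive_lookup) -> str:
    def swap(match):
        archived = archive_lookup(match.group(1))
        return f'href="{archived}"' if archived else match.group(0)
    return HREF_RE.sub(swap, html)
```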
Problems - 1
Not everything can be harvested
– databases, dynamic documents
– how to find all relevant data in .com, .net, etc.? co-operation with network providers
Not all functionality can be kept
– image maps
Indexing
– images and sound; weird text types
Problems - 2
Long-term preservation
– 97% of Web documents are text/html, text/plain, image/jpeg or image/gif
– the remaining 3% are a problem
– workstations accessing the archive must be equipped with a wide variety of tools, such as emulators
Projects - 1
Kulturarw3 (Royal Library of Sweden)
– started in 1996
– uses a modified version of the Combine harvester
– the Swedish Web has been harvested seven times
Projects - 2
NEDLIB (European national libraries)
– started in 1997; the aim is to develop tools for the deposit of all kinds of electronic materials
– developed a harvester customised for Web archiving
– version one was published in January
– after extensive tests, version two was released in September 2000
Projects - 3
Nordic Web Archive (Nordic national libraries)
– started in September 2000
– supported by Nordunet2
– will develop an index for Web archives built with the Kulturarw3 or NEDLIB tools
– co-operation with either FAST (Norway) or Index Data (Denmark)
– a virtual union catalogue of the Nordic Web
Experiences
The Web is small
– as of the spring measurement, the Swedish Web space contained some million files, but only 300 gigabytes
The Web is simple
– four document types comprise about 97% of all documents
– … but there are about 200 file types in all
A workstation will do for harvesting
– … but the search engine will need more power
Future
Several national libraries plan to harvest the Web, either selectively or essentially in full
Revision of legislation is under way in a few countries
Close co-operation between the libraries will continue in application development
Long-term preservation will emerge as a new work item