Published by Nathaniel MacDonald. Modified over 10 years ago.
1
Harvesting and archiving the Web
Nordunet2000, 28.-30.9.2000
Juha Hakala, Helsinki University Library
2
Contents
Legal background
Functional issues
  – harvesting
  – archival
  – indexing
Some projects
3
Legal background
National libraries store, in theory, all (printed) publications
  – this work is based on (legal) deposit
Legal deposit acts are currently being extended to electronic materials
Copyright must be relaxed
  – permission to copy documents (e.g. for preservation purposes) is essential
4
Finnish Act on Legal Deposit (proposed)
The national library is granted a right to harvest and archive freely available Web documents
Archived resources can only be accessed from dedicated workstations within the deposit libraries (n = 6)
Access to references (index) will be free
  – anyone can see what is archived
5
Functional issues
Libraries' traditional methods do not fit electronic materials well
  – manual cataloguing cannot be done for millions of documents
New, automated means must be developed for acquisition, storage and indexing of electronic publications
  – this work is well under way
6
Harvesting
Harvester = automated tool for collecting Web documents
  – given a number of URLs, the harvester fetches the documents, checks the hypertext links in these documents and then proceeds to a second harvesting round
  – this goes on until all qualifying documents have been fetched
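The round-by-round loop described on this slide can be sketched as follows. This is a minimal illustration, not the NEDLIB or Combine harvester: `harvest`, `fetch` and `qualifies` are hypothetical names, and the "Web" here is a small in-memory dictionary so that the example runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect href and src attribute values from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def harvest(seeds, fetch, qualifies):
    """Fetch the seeds, then follow links round by round.

    fetch(url) returns the document text or None (hypothetical callback).
    qualifies(url) decides whether a URL is in scope (hypothetical callback).
    """
    seen = set(seeds)
    current_round = list(seeds)
    archive = {}
    while current_round:
        next_round = []
        for url in current_round:
            body = fetch(url)
            if body is None:
                continue
            archive[url] = body
            parser = LinkExtractor()
            parser.feed(body)
            for link in parser.links:
                # Resolve relative links and drop #fragments.
                absolute = urldefrag(urljoin(url, link)).url
                if absolute not in seen and qualifies(absolute):
                    seen.add(absolute)
                    next_round.append(absolute)
        current_round = next_round
    return archive


# Stand-in for the Web: three in-scope documents and one out-of-scope link.
fake_web = {
    "http://example.fi/": '<a href="/a.html">a</a> <a href="http://other.com/x">x</a>',
    "http://example.fi/a.html": '<a href="/">home</a> <img src="pic.gif">',
    "http://example.fi/pic.gif": "",
}
result = harvest(
    ["http://example.fi/"],
    fake_web.get,
    lambda u: u.startswith("http://example.fi/"),
)
print(sorted(result))
```

A real harvester would add politeness delays, robots.txt handling and error recovery, which is where most of the difficulties on the next slide come from.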
7
Practical experiences
Building a robust and well-behaved tool is not easy
  – poor quality of data & Web servers
  – huge quantity of data: what works for 1,000 files may not work for a million
Performance optimisation is complicated
  – if scheduling is done in a simple way, after a while only large servers are left
8
Archival
No off-the-shelf solutions for this!
Unusual exclusion and scheduling rules must be implemented in the harvester
  – e.g. always retrieve inline materials, and at once
  – may be hard to modify an existing application
Generate & store archive metadata
Document storage
Incremental versus full archiving
9
Archive metadata - examples
MD5 checksum
  – duplicate control of the archive
  – authentication of archived materials
  – unique access key (used as URN/NBN)
Document size & location (old & new)
Time stamp
  – when was the document retrieved
  – single point in time, or a period
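As an illustration of the fields listed on this slide, a metadata record might look like the sketch below. The class and field names (`ArchiveRecord`, `source_url`, `archive_path`) are assumptions made for this example, and the way the checksum is turned into a URN/NBN is not shown because the slide does not specify it.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ArchiveRecord:
    md5: str                # checksum: duplicate control, authentication, access key
    size: int               # document size in bytes
    source_url: str         # old location (on the Web)
    archive_path: str       # new location (in the archive)
    retrieved_at: datetime  # time stamp of retrieval


def describe(content: bytes, source_url: str, archive_path: str,
             retrieved_at: Optional[datetime] = None) -> ArchiveRecord:
    """Build the metadata record for one harvested document."""
    if retrieved_at is None:
        retrieved_at = datetime.now(timezone.utc)
    return ArchiveRecord(
        md5=hashlib.md5(content).hexdigest(),
        size=len(content),
        source_url=source_url,
        archive_path=archive_path,
        retrieved_at=retrieved_at,
    )


record = describe(b"<html>hello</html>", "http://example.fi/", "store/0001")
print(record.md5, record.size, record.retrieved_at.isoformat())
```

Two documents with identical bytes get the same `md5`, which is what makes the checksum usable for duplicate control.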
10
Document storage
Store the documents in a database or in a (UNIX) file system
  – elimination of duplicates (MD5 check)
  – extract metadata from files
  – pre-processing of files (tar & ZIP)
  – send location information to the archive database
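The duplicate check and the location bookkeeping can be sketched with a file system store keyed by checksum. This is one possible layout under stated assumptions, not the one used by any of the projects: `store_document` is a hypothetical helper, and a plain dictionary stands in for the archive database.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def store_document(root: Path, content: bytes, source_url: str,
                   locations: Dict[str, List[str]]) -> Tuple[Path, bool]:
    """Write content under its MD5 digest unless an identical file exists.

    Returns the archive path and whether a new file was written.
    locations maps digest -> source URLs (stand-in for the archive database).
    """
    digest = hashlib.md5(content).hexdigest()
    # Spread files over subdirectories named after the first two hex digits.
    path = root / digest[:2] / digest
    is_new = not path.exists()
    if is_new:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)
    # Record where the document was found, even if it was a duplicate.
    locations.setdefault(digest, []).append(source_url)
    return path, is_new
```

Calling `store_document` twice with the same bytes from two different URLs writes one file and records both URLs against the same digest.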
11
Indexing
Many off-the-shelf solutions exist, but they do not qualify as such
  – enhancements needed for indexing archive metadata and adding it to the user interface
  – changes to navigation (hyperlinks must point into the archive if applicable)
  – ability to cope with a very large number of documents and file types
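The navigation change mentioned above (hyperlinks pointing into the archive where applicable) could work roughly as in this sketch. The function name and the `/archive?url=` address scheme are invented for illustration; the slide does not say how archive addresses are formed.

```python
from urllib.parse import quote, urljoin


def rewrite_link(page_url, href, archived_urls, archive_prefix="/archive?url="):
    """Point a link into the archive if its target was archived.

    archive_prefix is a hypothetical address scheme for the archive's
    user interface; links to unarchived targets are left pointing at
    the live Web.
    """
    absolute = urljoin(page_url, href)
    if absolute in archived_urls:
        return archive_prefix + quote(absolute, safe="")
    return absolute


archived = {"http://example.fi/a.html"}
print(rewrite_link("http://example.fi/", "a.html", archived))
print(rewrite_link("http://example.fi/", "http://other.com/x", archived))
```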
12
Problems - 1
Not everything can be harvested
  – databases, dynamic documents
  – how to find all relevant data in .com, .net, etc.? Co-operation with network providers
Not all functionality can be kept
  – image maps
Indexing
  – images and sound; weird text types
13
Problems - 2
Long-term preservation
  – 97% of Web documents are text/html, text/plain, image/jpeg or image/gif
  – the remaining 3% are a problem
  – workstations accessing the archive must be equipped with a wide variety of tools such as emulators
14
Projects - 1
Kulturarw3 (Royal Library of Sweden)
  – started in 1996
  – uses a modified version of the Combine harvester
  – Swedish Web has been harvested seven times
  – http://kulturarw3.kb.se/
15
Projects - 2
NEDLIB (European national libraries)
  – started in 1997; the aim is to develop tools for deposit of all kinds of electronic materials
  – developed a harvester customised for Web archiving
  – version one published in Jan. 2000
  – after extensive tests, version two released in September 2000
  – http://www.kb.nl/nedlib
16
Projects - 3
Nordic Web Archive (Nordic national libraries)
  – started in September 2000
  – supported by Nordunet2
  – will develop an index for Web archives built with Kulturarw3 or NEDLIB tools
  – co-operation with either FAST (Norway) or Index Data (Denmark)
  – virtual union catalogue of the Nordic Web
17
Experiences
The Web is small
  – in spring 1999 the Swedish web space contained 7.5 million files, but only 300 gigabytes
The Web is simple
  – four document types comprise about 97% of all documents
  – … but there are in all about 200 file types
A workstation will do for harvesting
  – … but the search engine will need more power
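The "small Web" figure can be checked with simple arithmetic. The calculation below assumes decimal gigabytes (10^9 bytes), which the slide does not specify.

```python
files = 7_500_000            # files in the Swedish web space, spring 1999
total_bytes = 300 * 10**9    # 300 gigabytes, assuming 1 GB = 10**9 bytes

average_bytes = total_bytes / files
print(average_bytes)         # 40000.0, i.e. about 40 kilobytes per file
```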
18
Future
Several national libraries have plans to harvest the Web, either selectively or basically everything
Revision of legislation is under way in a few countries
Close co-operation between the libraries will continue in application development
Long-term preservation will emerge as a new work item