Published by Shanna Kelly. Modified over 9 years ago.
1 BCS, Oxfordshire, 19 February 2004
WEB ARCHIVING: issues and challenges
Deborah Woodyard, Digital Preservation Coordinator
2 Where to start?
- Selection
  - Collection Development Policy
- Need to be able to find them again
  - Cataloguing issues
  - 404 Not Found
- Need to capture web sites
  - Who is responsible for capture?
  - Who is responsible for preservation/access?
- What does this mean?
  - Define a web site
  - Where are the boundaries?
    - Links
    - Content on other sites / other servers
    - Changes with time: significant change
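One way to make the "where are the boundaries" question concrete is a scope rule applied to every link found during capture. The sketch below is only an illustration, not a rule taken from the talk: it assumes a site is "everything on the seed's host, at or below the seed's directory". The function names and the example.org URLs are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def resolve_link(page_url: str, href: str) -> str:
    """Turn a relative or absolute href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


def seed_directory(seed_url: str) -> str:
    """Return the directory part of the seed's path, always ending in '/'."""
    path = urlsplit(seed_url).path
    return path.rsplit("/", 1)[0] + "/"


def in_scope(seed_url: str, candidate_url: str) -> bool:
    """Assumed boundary rule: same host, and path at or below the seed directory."""
    seed = urlsplit(seed_url)
    candidate = urlsplit(candidate_url)
    same_host = candidate.hostname == seed.hostname
    below_seed = candidate.path.startswith(seed_directory(seed_url))
    return same_host and below_seed


if __name__ == "__main__":
    seed = "http://www.example.org/projects/"  # hypothetical seed
    page = "http://www.example.org/projects/index.html"
    for href in ["report.html", "../about.html", "http://other.example.com/x"]:
        target = resolve_link(page, href)
        print(target, in_scope(seed, target))
```

A real policy would also have to decide what to do with embedded content served from other hosts (images, stylesheets), which this rule simply excludes.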
3 Technical issues - Capture software
- Taking ‘snapshots’
- Follow directory structure or links?
- Where to break links / replace broken links?
- Relative vs absolute linking
- No changes to code, for authenticity
- Preserve ‘original’ version, provide ‘access’ version
- Obey robots.txt exclusions
- Politeness: server load
- Quality control checking
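The robots.txt and politeness points can be illustrated with Python's standard `urllib.robotparser`. This is a minimal sketch, not the behaviour of any capture tool named in this talk: the robots.txt text, the `demo-archiver` user agent and the one-second default delay are all assumptions made for the example.

```python
import urllib.robotparser

# Hypothetical robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text: str) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def may_capture(parser: urllib.robotparser.RobotFileParser,
                user_agent: str, url: str) -> bool:
    """True if the exclusion rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def politeness_delay(parser: urllib.robotparser.RobotFileParser,
                     user_agent: str, default_seconds: float = 1.0) -> float:
    """Seconds to wait between requests: the site's Crawl-delay if given,
    otherwise an assumed default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_seconds


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    agent = "demo-archiver"  # hypothetical user agent string
    print(may_capture(parser, agent, "http://example.org/index.html"))
    print(may_capture(parser, agent, "http://example.org/private/notes.html"))
    print(politeness_delay(parser, agent))
```

A harvester would call `may_capture` before each request and sleep for `politeness_delay` seconds between requests to the same server.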
4 Technical issues - Web sites
- File types: HTML, GIF, JPEG, JavaScript, ASP, etc.
- Software plug-ins
  - permission
  - access
- Dynamic database-driven sites
  - producing static pages
  - producing pages on-the-fly
- Frequency of capture
- Extent of capture
  - volume
  - duplication
  - storage and access to partial sites
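The duplication point arises because repeated captures of a site mostly re-collect unchanged files. One common way to limit the volume, shown here only as a sketch and not as the approach of any project in this talk, is to store each distinct file body once and let every capture refer to it by content digest. The `CaptureStore` class and its in-memory dictionaries are hypothetical stand-ins for a real storage system.

```python
import hashlib


class CaptureStore:
    """Toy in-memory store: one copy per distinct content, many captures."""

    def __init__(self) -> None:
        self.blobs = {}     # digest -> file content (bytes)
        self.captures = {}  # (capture_id, url) -> digest

    def add(self, capture_id: str, url: str, data: bytes) -> bool:
        """Record a captured file. Returns True if the content was new,
        False if an identical copy was already stored."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data
        self.captures[(capture_id, url)] = digest
        return is_new

    def get(self, capture_id: str, url: str) -> bytes:
        """Return the content of a URL as it was in a given capture."""
        return self.blobs[self.captures[(capture_id, url)]]


if __name__ == "__main__":
    store = CaptureStore()
    url = "http://example.org/logo.gif"  # hypothetical URL
    print(store.add("capture-1", url, b"GIF89a..."))  # first copy is stored
    print(store.add("capture-2", url, b"GIF89a..."))  # unchanged, not stored again
    print(len(store.blobs))
```

Every capture stays individually retrievable, which matters for the "multiple captures" requirement, while unchanged files cost storage only once.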
5 Technical issues - storage and access
- Management and storage
  - high volume
  - multiple captures
  - long term, inc. storage system migration
  - disaster recovery
- Permanent naming
- Ensuring authenticity
  - trusted digital repository
  - checksums, signatures: long term
- Signifying access to archived version
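The checksum item can be sketched as a fixity check: record a digest for each file at ingest, then recompute and compare later, for instance after a storage system migration. The example assumes SHA-256 and keeps files in an in-memory dictionary; the file names and helper names are hypothetical, and digital signatures are not covered.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: Dict[str, bytes]) -> Dict[str, str]:
    """Record a checksum for each file at ingest time."""
    return {name: sha256_hex(data) for name, data in files.items()}


def verify(files: Dict[str, bytes], manifest: Dict[str, str]) -> List[str]:
    """Return the names of files that are missing or whose content
    no longer matches the recorded checksum."""
    failures = []
    for name, expected in manifest.items():
        data = files.get(name)
        if data is None or sha256_hex(data) != expected:
            failures.append(name)
    return sorted(failures)


if __name__ == "__main__":
    archived = {"index.html": b"<html>v1</html>", "logo.gif": b"GIF89a"}
    manifest = build_manifest(archived)

    # Simulate silent corruption of one file after a storage migration.
    migrated = dict(archived)
    migrated["logo.gif"] = b"GIF89b"
    print(verify(archived, manifest))  # nothing to report
    print(verify(migrated, manifest))  # reports the damaged file
```

A checksum only detects change. Showing that the manifest itself has not been altered is where signatures and a trusted repository come in.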
6 Technical issues - preservation
- Preserve bits
- Preserve intellectual object, plus ‘look & feel’
- Preserve functionality
- Technology changes
  - physical storage
  - hardware platform
  - operating systems
  - application software
  - HTML
7 Technical issues - preservation strategies
- Metadata for preservation
  - describe bits: how and where stored
  - describe how to interpret/use bits
  - describe the context for the bits
- Migration
  - in part / in whole
  - valid code?
  - keep all versions?
  - manage multiple versions
- Emulation
  - of software / OS / platform
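The three kinds of preservation metadata (how and where the bits are stored, how to interpret them, and their context) could be held in one record per archived file. The sketch below is not a published metadata schema: every field name and sample value is hypothetical and chosen only to show the three-part split.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class StorageInfo:
    """Describe the bits: how and where stored."""
    location: str
    size_bytes: int
    sha256: str


@dataclass
class InterpretationInfo:
    """Describe how to interpret/use the bits."""
    mime_type: str
    format_version: str
    rendering_software: List[str]


@dataclass
class ContextInfo:
    """Describe the context for the bits."""
    original_url: str
    capture_date: str
    capture_tool: str


@dataclass
class PreservationRecord:
    storage: StorageInfo
    interpretation: InterpretationInfo
    context: ContextInfo

    def to_json(self) -> str:
        """Serialise the whole record as JSON text."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


if __name__ == "__main__":
    record = PreservationRecord(
        storage=StorageInfo("store-a/0001/index.html", 2048, "placeholder-digest"),
        interpretation=InterpretationInfo("text/html", "HTML 4.01", ["any web browser"]),
        context=ContextInfo("http://example.org/", "2004-02-19", "HTTrack"),
    )
    print(record.to_json())
```

When a file is migrated, a new record can be written for the new version while the old one is kept, which gives a way to manage multiple versions.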
8 LEGAL DISCUSSION
- Minimise risk
- Capture non-commercial sites
- Preserve without providing access
- Embargo or limit access
- Document actions taken
- Maintain ability to remove access
9 Cost £££ ??
- to do it
- of not doing it
10 PROJECTS
General project types:
- Selective: narrow, high quality, low volume
- Comprehensive: broad, lower quality, high volume
- Combination: useful, high quality, high volume
11 PROJECTS
British Library involvement:
- Domain.UK: selective
- UK Web Archiving Consortium: selective
- International Internet Preservation Consortium (IIPC): comprehensive/combination
12 Project details
Domain.UK
- WebWhacker, HTTrack
- Regular captures of simple sites
- Staff PC (later networked drive), very small
- No access
UK WAC
- UK partners sharing one system
- PANDAS management, HTTrack, Oracle
- Manual selection, cataloguing and quality checking
- Web interface for capture and public access
13 Project details
IIPC
- Comprehensive automated selection
  - links in / links out
  - authority / hits
  - rare words
- Designing new crawler / harvester
- Developing technical architecture
- Deep web?
- Access challenging
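As a rough illustration of selection by "links in", the sketch below ranks candidate sites by how many other sites link to them. It is not the IIPC selection method, which the slide does not describe in detail: the link graph, the site names and the tie-breaking rule are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def inlink_counts(link_graph: Dict[str, List[str]]) -> Counter:
    """Count, for each site, how many distinct other sites link to it."""
    counts = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):   # count each source at most once per target
            if target != source:      # ignore self-links
                counts[target] += 1
    return counts


def rank_candidates(link_graph: Dict[str, List[str]], top_n: int) -> List[str]:
    """Return the top_n sites by in-link count, ties broken alphabetically."""
    counts = inlink_counts(link_graph)
    return sorted(counts, key=lambda site: (-counts[site], site))[:top_n]


if __name__ == "__main__":
    # Hypothetical link graph: site -> sites it links out to.
    graph = {
        "a.example": ["b.example", "c.example"],
        "b.example": ["c.example"],
        "c.example": ["a.example", "c.example"],
        "d.example": ["c.example", "b.example"],
    }
    print(rank_candidates(graph, 2))
```

In-link counts are only one signal. The slide also lists authority, hits and rare words, which would need usage data and text analysis beyond the link graph.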
14 FUTURE WORK
- Expand collection
- Collaborative projects, inc. automated capture and metadata generation
- Legal deposit instruments for web archiving
- Provide restricted access
15 USEFUL REFERENCES
- http://library.wellcome.ac.uk/projects/archiving_reports.shtml
  - Collecting and preserving the World Wide Web: A feasibility study undertaken for the JISC and Wellcome Trust. Michael Day, UKOLN, University of Bath. Version 1.0, 25 February 2003.
  - Legal issues relating to the archiving of Internet resources in the UK, EU, US and Australia. Andrew Charlesworth, University of Bristol, Centre for IT and Law. Version 1.0, 25 February 2003.
- 2nd ECDL workshop on Web archiving: http://bibnum.bnf.fr/ecdl/2002/index.html
- Digital Preservation Coalition: http://www.dpconline.org/