Published by Miguel Nichols. Modified over 11 years ago.
1
A survey of Web preservation initiatives
Michael Day
UKOLN, University of Bath
m.day@ukoln.ac.uk
7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2003), Trondheim, Norway, 17-22 August 2003
2
Presentation overview
The importance of the Web
Challenges:
– Technical, legal, and organisational challenges
Approaches to collection:
– Harvesting-based, selective, and deposit; combined approaches
Discussion:
– Collection and access policies, software, costs, long-term preservation
3
Importance of the Web
An all-pervasive communication medium:
In research:
– Scientists are 'increasingly reliant' on the Web for supporting research (Hendler, 2003)
Wider societal role:
– Personal communication, e-commerce, etc.
– "… the information source of first resort for millions of readers" (Lyman, 2002)
4
The UKOLN study
Feasibility study produced for:
– Joint Information Systems Committee (JISC)
– Wellcome Library
– A survey of initiatives
– Recommendations for the JISC and Wellcome Library
– Supplementary legal study (Charlesworth)
– Published February 2003
http://library.wellcome.ac.uk/projects/archiving_reports.shtml
5
Technical challenges (1)
Size of Web:
– Surface web > 50 Tb (2000) … and still growing
– The 'deep Web'
– Scale of task means that Web-archiving needs to be a collaborative activity
6
Technical challenges (2)
Dynamic nature of Web:
– Web pages disappear on average after 75 days
– Many leave no trace
Evolution of Web-based technologies:
– Increasing reliance on databases, scripts, plug-ins, etc.
– A 'moving target'
7
Legal challenges
Copyright
Content liability, e.g.:
– Defamation
– Data protection
In the UK:
– Selective approach would be the safest solution (unless law changes)
See: Charlesworth (2003) http://library.wellcome.ac.uk/projects/archiving_reports.shtml
8
Organisational challenges
Decentralised organisation:
– Web-archiving initiatives focus on defined sub-sets of the Web, e.g.:
  – National domain, subject, organisation type
– Need for co-operation between initiatives
Quality:
– Much on Web is low-quality (or worse)
– Is there a need to preserve all of this?
9
Initiatives (1)
The Internet Archive
– Largest initiative, running since 1996
– Co-operates on special collections and with other repositories
National Libraries:
– Pioneer archives in Sweden (Kulturarw3) and Australia (PANDORA)
– Now many, many more
– Changes to legal deposit legislation in some countries
10
Initiatives (2)
National archives:
– Focus on government Web-sites (however defined)
– Guidance for Web-site managers:
  – e.g., UK and Australia
– Snapshots:
  – e.g., USA and UK
Other:
– Universities and scholarly societies:
  – e.g., Archipol, Occasio archive, Political Communications Web Archiving (Cornell)
11
Approaches (1)
Automatic harvesting:
– Use of Web crawler technologies
– Crawler follows links and downloads content
– Pioneered by Internet Archive and Kulturarw3 project
– Also used for the gathering of the Finnish and Austrian Web
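The "follows links and downloads content" step can be illustrated with a minimal sketch. It is not the code of any harvester named in these slides. The `harvest` function, the `fetch` callback, and the three-page `SITE` dictionary are all hypothetical, and pages are held in memory so that nothing is fetched from the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest(seed, fetch, max_pages=100):
    """Breadth-first harvest starting at `seed`.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that maps a URL to its HTML text, or None if the page is unavailable.
    """
    archive = {}
    queue = deque([seed])
    seen = {seed}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page has disappeared; nothing to store
        archive[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return archive


# Hypothetical three-page site held in memory
SITE = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="b.html#top">B</a>',
    "http://example.org/a.html": '<a href="/">home</a>',
    "http://example.org/b.html": '<a href="/missing.html">gone</a>',
}

archive = harvest("http://example.org/", SITE.get)
```

A real harvester would also need politeness delays, robots.txt handling, and a scope rule; those are left out here to keep the link-following loop visible.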
12
Approaches (2)
Selective approaches:
– Selection of individual Web sites
– Negotiate rights with site owners
– Collection using gathering or mirroring software, ftp, or e-mail
– Pioneered in PANDORA project
– Experimented with by Library of Congress and British Library
Deposit approaches:
– Site owners/administrators deposit site in repositories
13
Approaches (3)
Combined approaches:
– Combine the advantages of the harvesting and selective approaches
– Pioneered by the Bibliothèque nationale de France
– Experimented with enhancements to the harvesting approach (e.g., noting the change frequency of sites, and their 'importance')
– Uses the selective approach for the 'deep Web'
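One simple way to note the change frequency of a site is to compare content digests of successive captures. The sketch below is an assumption made for illustration and is not the Bibliothèque nationale de France's method; the `change_rate` function and the sample captures are hypothetical.

```python
import hashlib


def change_rate(snapshots):
    """Fraction of consecutive captures whose content differs.

    `snapshots` is a list of page contents (str) in capture order.
    Returns 0.0 when there are fewer than two captures.
    """
    digests = [hashlib.sha256(s.encode("utf-8")).hexdigest() for s in snapshots]
    if len(digests) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(digests, digests[1:]) if a != b)
    return changes / (len(digests) - 1)


# Hypothetical captures of one page: unchanged once, then changed twice
captures = ["<p>v1</p>", "<p>v1</p>", "<p>v2</p>", "<p>v3</p>"]
rate = change_rate(captures)
```

A crawler could revisit pages with a high rate more often and stable pages less often.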
14
Collection policies
Dependent on technical approach chosen
– National domain ++ (for harvesting-based approaches)
– Collection guidelines (for selective approaches)
– Based on relevance, provenance, quality, etc.
– Frequency of capture
– Possible overlap with subject gateway initiatives, e.g. the Resource Discovery Network (RDN) in the UK
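A "national domain ++" policy can be read as a scope test: accept hosts under the national top-level domain, plus an explicit list of additional hosts registered elsewhere. The sketch below assumes that reading; the `in_scope` function and the example hosts are hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(url, national_tld, extra_hosts=frozenset()):
    """Return True if `url` belongs to the collection scope.

    A URL is in scope when its host ends with the national top-level
    domain (e.g. "se") or is listed explicitly in `extra_hosts`.
    """
    host = (urlsplit(url).hostname or "").lower()
    return host.endswith("." + national_tld) or host in extra_hosts


# Hypothetical policy: the .se domain plus one additional host
extras = frozenset({"www.example.com"})
decisions = [
    in_scope("http://www.example.se/page.html", "se", extras),
    in_scope("http://www.example.com/", "se", extras),
    in_scope("http://other.example.net/", "se", extras),
]
```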
15
Approximate size (2002)

Country     Initiative           Type   Size (Gb.)     No. Sites
USA         Internet Archive     H      >150,000.00
Sweden      Kulturarw3           H      4,500.00
France      BnF                  C      <1,000.00
Austria     AOLA                 H      448.00
Australia   PANDORA              S      405.00         3,300
Finland     HUL                  H      401.00
UK          Britain on the Web   S      0.03           100
USA         MINERVA              S      *              35

Type: H = harvesting, S = selective, C = combined
Source: Day (2003)
16
Access policies
Access policies differ:
– Internet Archive and the PANDORA archive make data available
  – e.g., the Wayback Machine
– Other collections effectively closed (for legal reasons or because experimental)
– Need for specialised Web indexes that can search and navigate large collections of Web material
  – e.g., Nordic Web Archive (NWA) Toolset
17
Software
Various software in use:
– Harvesting:
  – Adapted Combine harvester, NEDLIB harvester, Xyleme, Alexa
– Selective:
  – HTTrack (popular), etc.
  – PANDAS (PANDORA Digital Archiving System): helps with managing the process, adding metadata, etc.
18
Costs
Costs vary widely:
– Selective approach much more expensive (per Tb.) than bulk harvesting
– But resulting archives are more widely accessible
– Significant costs in undertaking rights clearance
19
Long-term preservation
Many initiatives until now mainly focused on the collection of resources:
– Need to consider the longer-term
– Descriptive and technical metadata
– Migration needs (e.g. for complex sites)
– Need for Web archiving initiatives to become trusted repositories
– Need to be embedded into the 'core activities' of their host organisation
20
Summing up
Much experimentation to date, but now moving into implementation phase
Co-operation and collaboration are important
Combined technical approaches offer best way forward
Legal challenges still problematic
Long-term preservation issues still to be explored in detail
21
Acknowledgements
UKOLN is funded by Resource: the Council for Museums, Archives and Libraries, the Joint Information Systems Committee (JISC) of the UK higher and further education funding councils, as well as by project funding from the JISC and the European Union. UKOLN also receives support from the University of Bath, where it is based.
http://www.ukoln.ac.uk/