Web archiving at the NLA
‘Archiving the music web’
Music Council of Australia Annual Assembly, 28 September 2009
Paul Koerbin, Manager Digital Archiving, National Library of Australia
1. Background – the what, why and how
2. What makes a valuable resource for archiving?
3. What can you do to help?
What is web archiving about and why do it?
Archiving = long-term preservation and access
Building collections
Building ‘documentary’ historical record
Creating artefacts from the web experience
Discovering what is produced online
An act of consciousness
What’s involved in web archiving?
At the NLA it’s:
Identifying, selecting, scoping
Seeking permission to collect and make accessible
Creating and recording metadata
– administrative, descriptive, preservation
Crawling/harvesting (including scheduling)
Processing for quality assurance (best effort)
Storing and maintaining the data
Planning and implementing preservation strategies
Preparing and rendering for public display
Providing access and discovery mechanisms
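The scoping and crawling/harvesting steps above can be illustrated with a small sketch. This is not the NLA's harvester or the PANDAS system; it is a minimal Python example, under the assumption that "scope" simply means "same host as the starting page". The class name `LinkCollector`, the function `in_scope_links`, and the `band.example` addresses are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, value))


def in_scope_links(html, base_url):
    """Return links that stay on the same host as base_url (assumed scope rule)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    return [link for link in collector.links if urlparse(link).netloc == host]


if __name__ == "__main__":
    page = (
        '<a href="/tour.html">Tour</a>'
        '<a href="https://other.example/x">Elsewhere</a>'
        '<a href="news/1.html">News</a>'
    )
    for link in in_scope_links(page, "https://band.example/index.html"):
        print(link)
```

A harvester of this kind can only reach pages it can discover through plain links, which is why simple, link-based site structure matters for capture.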
What is the NLA doing?
PANDORA Archive, 1996→
– PANDORA participants: NLA, state libraries (not Tas), NFSA, AWM, AIATSIS (and soon the NGA)
– Highly selective, small scale, ‘quality’ collection, open access
– PANDAS workflow management system, 2001→
Australian (.au) domain harvests
– Annual since 2005
– Internet Archive
– No access (yet)
Comparative statistics of NLA web collections
PANDORA (selective)
– Files: 73 million
– Size: 3.26 TB
.au Domain Harvests (all harvests)
– Files: 2.3 billion
– Size: 78.75 TB
Domain Harvest (per harvest)
– Unique files: 185 million; 596 million; 516 million; 1 billion
– Hosts crawled: 811,523; 1,046,038; 1,247,614; 3,038,658
– Size: 6.69 TB; …; …; 34.55 TB
Music in the PANDORA Archive
500+ titles available from the PANDORA public listing of music
– NFSA 33%
– NLA 30%
– Others 37%
Musicians, bands, orchestras, composers, organisations, festivals, blogs, instrument makers, magazines …
Plus 280 considered but not available
– 35% (no permission, rejected, yet to be selected)
What makes a valuable resource for archiving?
Content
– substantial, original
Provenance
‘Long-term research value’
Cultural or social significance and interest
– including events
Curatorial/expert suggestion (e.g. Music Australia)
Different collecting approaches based on ‘value’
Priorities, but never say never
How can you help? 10 tips:
1. Think about the issue of long-term access
– what is your intention?
2. Communicate interest and intentions
– with collecting institutions; let us know about your site
– respond to requests for permission
3. Organise and structure sites simply
– it’s all about links
4. Comply with standards
– limit use of proprietary technology if possible
5. Make it robot friendly
– indexing, discovery, capture
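As one way to check tip 5, a site owner can test whether their robots.txt would let a harvesting robot reach the pages they want collected. The sketch below uses Python's standard `urllib.robotparser`. The robots.txt content, the user-agent name `example-archive-bot`, and the `band.example` addresses are hypothetical, and the talk does not say which robot name the NLA's harvester uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example band site (not taken from the talk).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def can_harvest(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/gigs/2009.html", "/admin/login"):
        url = "https://band.example" + path
        allowed = can_harvest(ROBOTS_TXT, "example-archive-bot", url)
        print(path, "allowed" if allowed else "blocked")
```

A rule that blocks whole directories of content (for example images or audio) from all robots will usually also keep that content out of an archived copy.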
How can you help? 10 tips:
6. Keep contributors informed and involved
– make sure contributors understand and agree to long-term preservation and access from the beginning
7. Clear copyright, rights and contact information
– it helps to know what and who (oh, and trust us too)
8. Maintain content online as much as possible
– increases the chance of it being collected
9. Learn to love and live with your past
– archives are not the same as the ‘live’ web
– archived versions cannot be altered
10. Do your own backup, of course
PANDORA Australia’s Web Archive