Web Archiving at the National Library of Australia
Russell Latham, Senior Web Archivist, National Library of Australia
“The Web's ever-expanding size, the dynamic and ephemeral nature of its content, and how this is to be captured, stored and made accessible for the long-term are some of the key questions being addressed by electronic archiving programs.” PADI
What is web archiving?
  A web archive is not the same as the live web
  Brings a different value to web content
  Creating artefacts from the web
  Preserved snapshots, slices, gobbets of time
  Challenge of timeliness
  At certain times some things are more interesting and valuable
  Focus on the future and long-term access (preservation objective)
History: web archiving at the NLA
  April Fools' Day 1996: ‘Electronic Unit’ established
  May 1998: public access to PANDORA titles
  July 1998: first PANDORA ‘partner’ began participation; 10th participant joined in 2003
  June 2001: PANDAS v.1 released (web archiving workflow system developed by NLA)
  2002: Digital Archiving Branch. Our own identity at last! Began first trial of ‘mainstreaming’ web archiving in Serials and Govt Deposit sections
History: web archiving at the NLA
  August 2002: PANDAS v.2 released
  July 2003: joined IIPC
  2004: PANDORA added to UNESCO Australian Memory of the World Register
  July 2005: first .au domain harvest; subsequent harvests in 2006, 2007, 2008 & 2009
  December 2006: “Web Archiving and Digital Preservation Branch”
  July 2007: PANDAS v.3 released (at last!)
  2010: PANDORA search moved to Trove
  May 2010: proposal for whole-of-govt ‘opt-out’ arrangements through SIGB
PANDORA Participants
What we collect
  Selective approach
  Collaboration with PANDORA participating agencies
  Modest in size
  High quality, timely, high value collection, described and searchable
  Accessible to the public
Searching the collections
  Subjects / browse list
  Collections
  Agency based
  Trove – Archived Websites
  Trove – bibliographic
  Search engines
Collections
  National events: Iraq war, 2003; Asia Tsunami, 2004; Bali Bombing, 2002
  Political events: Elections; CHOGM; National Apology
  Topic based: Extreme sports; Seven Network
  Natural events: Floods; Cyclones; Bushfires
Subjects/Browsing
  When looking for non-specific resources
  When wishing to browse a topic area
Agency based
  Use the partners page
Collections
Election campaigns
1996 Federal Election; 1998 Federal Election; 2001 Federal Election; 2004 Federal Election; 2007 Federal Election; 2010 Federal Election
The Future
  Selective: timely, small, high value
  Bulk harvest collections: thematic
  Domain harvests: comprehensive
Australian web domain harvests
  Annual domain harvests
  Working with the Internet Archive
  Covers the .au top-level domain and a bit more …
  No public access
  Quantity over quality; content not assessed or described; opportunistic rather than timely
Comparative statistics
  PANDORA: Files: 115 million; Size: 5.03 TB
  Domain Harvests: Files: 3 billion; Size: 103 TB
  Domain Harvest, per harvest:
    Unique files: 185 million | 596 million | 516 million | 1 billion | 765 million
    Hosts crawled: 811,523 | 1,046,038 | 1,247,614 | 3,038,658 | 1,074,645
    Size (TB):
Current status