Web Archiving at the National Library of Australia National Library of Indonesia Staff 5 October 2010 Paul Koerbin Manager, Web Archiving National Library of Australia
Web Archiving at the NLA Background History Organisation Participants Approaches to web archiving PANDORA selective archiving Whole domain harvesting Skills and operational tasks Workflows and systems PANDAS
History: web archiving at the NLA April 1996: ‘Electronic Unit’ established Part of Acquisitions Branch 3 staff, 6 months to develop selection (scope) guidelines and identify resources September 1996: ‘Australian Serials and Electronic Unit’ established Technical services restructure, multi-tasking, matrix management October 1996: first titles harvested November 1996: PANDORA born as ‘proof of concept project’ As at June 1997, 30 titles harvested
History: web archiving at the NLA May 1998: public access to PANDORA titles July 1998: first PANDORA ‘partner’ began participation 11th participant joined in 2010 October 1998: first ‘Certified Agreement’ commenced in the Library Change to staffing classifications; professional librarian streams abolished June 2001: PANDAS v.1 released Web archiving workflow system developed by NLA 2002: Digital Archiving Branch Our own identity at last! Began first trial of ‘mainstreaming’ web archiving in Serials and Govt Deposit sections
History: web archiving at the NLA August 2002: PANDAS v.2 released July 2003: joined IIPC 2004: PANDORA added to UNESCO Australian Memory of the World Register July 2005: first .au domain harvest Subsequent harvests in 2006, 2007, 2008 & 2009 December 2006: ‘Web Archiving and Digital Preservation Branch’ July 2007: PANDAS v.3 released 2010: PANDORA search moved to Trove May 2010: Whole-of-govt ‘opt-out’ arrangements endorsed by SIGB
Web Archiving Section: Manager Web Archiving (base level executive); Team Leader (senior librarian); Team Member (APS5); Team Member (APS4); Team Member (APS4)
Organisation chart: DIVISION 1 – COLLECTIONS MANAGEMENT (Assistant Director-General; Division Support Unit). Branches shown: Web Archiving and Digital Preservation; Serials; Monographs; Digital Collections Management; Asian Collections; Preservation; Bibliographic Standards and Strategies. Other units shown: Digitisation; Australian Collection Develop’t; Special Materials; Cataloguing, Standards & Training; Web Archiving; Imaging Services; Jakarta Office; Overseas Collection Development Section; ILMS Section; Serials Section; Preservation Standards; Collections Preservation; Digital Preservation; Newspaper Digitisation Project; Australian Newspaper Plan; Acquisition & Access; RDA
PANDORA Participants 11 participants including the NLA All state and territory libraries (except Tasmania and ACT) Major heritage institutions National Film and Sound Archive Australian War Memorial Australian Institute of Aboriginal and Torres Strait Islander Studies National Gallery of Australia
PANDORA Participants Memorandum of Understanding Respective obligations (NLA and Agencies) Adherence to policy and procedures Curatorial and collection management (operational staff) Selection – participant guidelines Permissions Harvesting – scoping and quality checking Cataloguing Publishing – access through PANDORA
What is web archiving? A web archive is not the same as the live web Brings a different value to web content Creating artefacts from the web Preserved snapshots, slices, gobbets of time Challenge of timeliness At certain times some things are more interesting and valuable Focus on the future and long term access (preservation objective)
Approaches to web archiving? Selective (specific targets) websites single publications Domain Country domains (e.g. .au or .id) Sub-domains (e.g. .gov.au) Thematic Scoped around topics, events, forms of publishing Seed lists
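The difference between a country-domain scope and a sub-domain scope can be illustrated with a small host check. This is only a sketch: the function name `in_scope` and the convention that each suffix starts with a dot are assumptions made for illustration, not how any NLA harvester is configured.

```python
from urllib.parse import urlsplit


def in_scope(url: str, suffixes: tuple[str, ...]) -> bool:
    """Return True if the URL's host falls under one of the domain suffixes.

    Each suffix is assumed to start with a dot, e.g. ".au" or ".gov.au".
    """
    host = (urlsplit(url).hostname or "").lower()
    # Match either the bare domain itself or any host ending in the suffix.
    return any(host == s.lstrip(".") or host.endswith(s) for s in suffixes)


AU_DOMAIN = (".au",)          # whole country domain
GOV_SUBDOMAIN = (".gov.au",)  # one sub-domain of it
```

With these definitions, `in_scope("http://www.nla.gov.au/", GOV_SUBDOMAIN)` is true, while a `.com.au` host is in scope for `AU_DOMAIN` only.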
PANDORA - Australia’s Web Archive Selective approach – Australian content Collaboration with participating agencies No legal deposit Permissions based collecting Timely and scheduled collecting Quality checked Described and indexed (searchable) Accessible to the public Modest in size
Australian web domain harvests Annual domain harvests Working with the Internet Archive Covers .au top level domain and a bit more … No legal deposit Permissions not sought No public access (yet) Quantity over quality (no QA action) Full text indexed (searchable) not catalogued Opportunistic rather than timely
Comparative statistics
PANDORA: Files: 94 million; Size: 4.23 TB
Domain harvests (total): Files: 3 billion; Size: 103 TB
Domain harvests (per harvest, in slide order):
Unique files: 185 million | 596 million | 516 million | 1 billion | 765 million
Hosts crawled: 811,523 | 1,046,038 | 1,247,614 | 3,038,658 | 1,074,645
Size (TB): [not shown]
Skills and tasks Operational, Library’s ‘core business’ staff: Librarians, web curators, web archivists, cataloguers … by any other name Perform all associated tasks: Selection, permissions, acquisition (harvesting) processes, quality checking, cataloguing, publishing (resource discovery)
Operational skills and tasks Collection development Selection expertise in ‘new media’ Corporate objectives, priorities, resources Collection management Cataloguing: MARC, LCSH, Dewey PANDORA subjects Technical skills Scoping gather filters and settings Harvesting and code problem analysis and resolution (HTML, JavaScript, stylesheets) Understanding web technologies Experience and self-learning New technologies, Web 2.0, timely collecting, always new challenges
IT commitment and support All infrastructure maintained at NLA Systems and applications Storage of archival content Continuous development of systems from 3 version releases of PANDAS Technical support for applications and systems Expertise to assist with harvesting problems Support for domain harvests
Overview of PANDORA procedures PANDAS (PANDORA Digital Archiving System) Workflow management system HTTrack harvesting software Agencies (PANDORA participants) Users Administrators (PANDAS and Agency) Standard user Informational user ‘Worktrays’ manage individual and agency workflow
Overview of PANDORA workflows Some concepts: Titles The target entity: a single document, a website, and everything in-between Publishers Permissions Instances Each instance of an archived title Users (‘owners’) Belong to Agencies and own titles Manage workflow among different agencies/people
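The relationship between these concepts can be sketched as a small data model. All class and field names here are hypothetical and chosen for illustration only; they do not reflect the actual PANDAS schema.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Publisher:
    name: str  # the party from whom permission is sought


@dataclass
class Instance:
    gathered_on: date  # one archived snapshot of a title


@dataclass
class Title:
    name: str
    seed_url: str      # starting point for the harvest (hypothetical field)
    publisher: Publisher
    owner: str         # the user who owns the title
    agency: str        # the agency the owner belongs to
    instances: list[Instance] = field(default_factory=list)

    def add_instance(self, gathered_on: date) -> Instance:
        """Record a new archived instance of this title."""
        instance = Instance(gathered_on)
        self.instances.append(instance)
        return instance
```

A title harvested twice would carry two `Instance` entries, one per gather date.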
Worktrays - Selection Nominating titles Shared agency worktray Before selection decision is made Selection statuses: Selected Rejected Monitored
Worktrays - Permission Requesting publisher permission Licence under Copyright Act Copy, preserve and make accessible Manage and record publisher contact Record permission status Title level permission Publisher level permission (‘blanket’)
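The two permission levels suggest a simple lookup: check for a permission recorded against the title, then fall back to the publisher's blanket permission. The sketch below assumes that a title-level record takes precedence, which the slides do not state; the status values and names are also hypothetical.

```python
from enum import Enum
from typing import Optional


class Permission(Enum):
    # Hypothetical status values, for illustration only.
    REQUESTED = "requested"
    GRANTED = "granted"
    DENIED = "denied"


def effective_permission(
    title_level: Optional[Permission],
    publisher_blanket: Optional[Permission],
) -> Optional[Permission]:
    """Return the permission that applies to a title.

    Assumption: a title-level record, if present, overrides the
    publisher-level ('blanket') record.
    """
    return title_level if title_level is not None else publisher_blanket
```

Under this assumption, a title with no record of its own inherits the publisher's blanket status, and a title with neither has no permission recorded.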
Worktrays – Gather (Harvest) Set harvesting schedules Regular, specific days, gather now Define harvesting parameters Seed URLs, filters, gather settings View gathering titles Pause, view, modify, stop Statistics
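How include and exclude filters narrow a gather can be sketched with wildcard rules. The `+pattern` / `-pattern` notation and the "last matching rule wins" behaviour are assumptions made for this sketch, and the function `allowed` is hypothetical; consult the harvester's own documentation for its actual filter semantics.

```python
from fnmatch import fnmatchcase


def allowed(url: str, rules: list[str], default: bool = False) -> bool:
    """Decide whether a URL should be gathered.

    Each rule is '+pattern' (include) or '-pattern' (exclude), where '*'
    matches any run of characters. Assumption: the last matching rule wins.
    """
    verdict = default
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if fnmatchcase(url, pattern):
            verdict = (sign == "+")
    return verdict


# Hypothetical gather filters: keep one site, but skip its zip files.
rules = ["+www.example.gov.au/*", "-*.zip"]
```

Here `www.example.gov.au/report.pdf` is gathered, `www.example.gov.au/data.zip` is skipped, and anything on another host falls through to the default of not gathering.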
Worktrays - Preserve Manage quality checking process Not yet archived – working area Analyse harvested instance: Completeness No unwanted content Functionality Fix problems (or ‘refer to IT’) WebDAV, FTP and Samba access to files Decision on instance: Archive or Delete
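The three quality checks can be thought of as a checklist that must pass before an instance is archived. The following is a minimal sketch with hypothetical names; in practice the archive-or-delete decision rests with the curator, not with code.

```python
from dataclasses import dataclass


@dataclass
class QualityCheck:
    complete: bool             # all expected content was gathered
    no_unwanted_content: bool  # nothing outside the intended scope was gathered
    functional: bool           # the archived copy behaves as expected

    def outstanding(self) -> list[str]:
        """Names of the checks that have not passed."""
        checks = {
            "completeness": self.complete,
            "unwanted content": self.no_unwanted_content,
            "functionality": self.functional,
        }
        return [name for name, passed in checks.items() if not passed]


def next_step(check: QualityCheck) -> str:
    """Suggest the next action for an instance in the working area."""
    problems = check.outstanding()
    if not problems:
        return "archive"
    return "fix or refer to IT: " + ", ".join(problems)
```

An instance that passes all three checks is suggested for archiving; otherwise the failed checks are listed for follow-up.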
Worktrays - Publish Manages the public access to archived instances Set up Title Entry Pages Add notes Issues Copyright statements Browse listings
Worktrays - Catalogue Add ANBD number Automatically creates AGLS metadata for Title Entry Page
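Generating metadata for a Title Entry Page amounts to turning a catalogue record into HTML `meta` elements. The sketch below shows the general idea only: the function name, the element names in the usage example, and the record layout are assumptions, not the output PANDAS actually produces.

```python
from html import escape


def agls_meta_tags(record: dict[str, str]) -> str:
    """Render a metadata record as HTML <meta> elements, one per line.

    Keys are element names and values are their content; both are
    HTML-escaped so they are safe inside attribute values.
    """
    return "\n".join(
        f'<meta name="{escape(name, quote=True)}" '
        f'content="{escape(value, quote=True)}">'
        for name, value in record.items()
    )
```

For example, `agls_meta_tags({"DCTERMS.title": "Example Title"})` yields a single `meta` element carrying that title; the element name is illustrative.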
Administration Manage Agency information Add users Manage user access Run reports Agency statistics and totals Titles and instances selected, processed and archived for specified period New title instances archived Scheduled gathers