Published by Sandra Annabella McKenzie. Modified over 9 years ago.
1
Web Archiving at the National Library of Australia National Library of Indonesia Staff 5 October 2010 Paul Koerbin Manager, Web Archiving National Library of Australia
2
Web Archiving at the NLA. Background: history, organisation, participants. Approaches to web archiving: PANDORA selective archiving, whole domain harvesting. Skills and operational tasks. Workflows and systems: PANDAS.
3
History: web archiving at the NLA. April 1996: ‘Electronic Unit’ established as part of Acquisitions Branch; 3 staff, 6 months to develop selection (scope) guidelines and identify resources. September 1996: ‘Australian Serials and Electronic Unit’ established; technical services restructure, multi-tasking, matrix management. October 1996: first titles harvested. November 1996: PANDORA born as ‘proof of concept project’. As at June 1997, 30 titles harvested.
4
History: web archiving at the NLA. May 1998: public access to PANDORA titles. July 1998: first PANDORA ‘partner’ began participation (11th participant joined in 2010). October 1998: first ‘Certified Agreement’ commenced in the Library; change to staffing classifications, professional librarian streams abolished. June 2001: PANDAS v.1 released, a web archiving workflow system developed by NLA. 2002: Digital Archiving Branch (our own identity at last!); began first trial of ‘mainstreaming’ web archiving in Serials and Govt Deposit sections.
5
History: web archiving at the NLA. August 2002: PANDAS v.2 released. July 2003: joined IIPC. 2004: PANDORA added to UNESCO Australian Memory of the World Register. July 2005: first .au domain harvest; subsequent harvests in 2006, 2007, 2008 & 2009. December 2006: ‘Web Archiving and Digital Preservation Branch’. July 2007: PANDAS v.3 released. 2010: PANDORA search moved to Trove. May 2010: whole-of-govt ‘opt-out’ arrangements endorsed by SIGB.
6
Manager Web Archiving (base level executive) Team Leader (senior librarian) Web Archiving Section Team Member (APS5) Web Archiving Section Team Member (APS4) Web Archiving Section Team Member (APS4) Web Archiving Section
7
Digitisation DIVISION 1 – COLLECTIONS MANAGEMENT Australian Collection Develop’t Special Materials Cataloguing, Standards & Training WEB ARCHIVING AND DIGITAL PRESERVATION BRANCH Web Archiving Imaging Services Jakarta Office SERIALS BRANCH DIVISION SUPPORT UNIT Overseas Collection Development Section MONOGRAPHS BRANCH DIGITAL COLLECTIONS MANAGEMENT BRANCH ASIAN COLLECTIONS BRANCH ILMS Section Serials Section Preservation Standards Collections Preservation PRESERVATION BRANCH ASSISTANT DIRECTOR-GENERAL BIBLIOGRAPHIC STANDARDS AND STRATEGIES BRANCH Digital Preservation Newspaper Digitisation Project Australian Newspaper Plan Acquisition & Access RDA
8
PANDORA Participants 11 participants including the NLA All state and territory libraries (except Tasmania and ACT) Major heritage institutions National Film and Sound Archive Australian War Memorial Australian Institute of Aboriginal and Torres Strait Islander Studies National Gallery of Australia
9
PANDORA Participants Memorandum of Understanding Respective obligations (NLA and Agencies) Adherence to policy and procedures Curatorial and collection management (operational staff) Selection – participant guidelines Permissions Harvesting – scoping and quality checking Cataloguing Publishing – access through PANDORA
10
What is web archiving? A web archive is not the same as the live web Brings a different value to web content Creating artefacts from the web Preserved snapshots, slices, gobbets of time Challenge of timeliness At certain times some things are more interesting and valuable Focus on the future and long term access (preservation objective)
11
Approaches to web archiving? Selective (specific targets): websites, single publications. Domain: country domains (e.g. .au or .id), sub-domains (e.g. .gov.au). Thematic: scoped around topics, events, forms of publishing. Seed lists.
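The difference between domain scoping and selective scoping can be sketched in a few lines. This is only an illustration: the function names and the seed URL are hypothetical and are not taken from PANDAS or from any harvester.

```python
from urllib.parse import urlsplit


def host_in_domain(url: str, domain: str) -> bool:
    """Domain approach: True if the URL's host is the domain or a subdomain of it."""
    host = (urlsplit(url).hostname or "").lower()
    domain = domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


def in_selective_scope(url: str, seeds: list[str]) -> bool:
    """Selective approach: True only if the URL sits under one of the chosen seed URLs."""
    return any(url.startswith(seed) for seed in seeds)


# Hypothetical seed list for a single selected publication.
seeds = ["http://example.org.au/report/"]
```

A domain harvest would keep anything for which `host_in_domain(url, ".au")` holds, while a selective harvest keeps only what falls under the listed seeds.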
12
PANDORA - Australia’s Web Archive Selective approach – Australian content Collaboration with participating agencies No legal deposit Permissions based collecting Timely and scheduled collecting Quality checked Described and indexed (searchable) Accessible to the public Modest in size
13
Australian web domain harvests Annual domain harvests 2005-2009 Working with the Internet Archive Covers the .au top level domain and a bit more … No legal deposit Permissions not sought No public access (yet) Quantity over quality (no QA action) Full text indexed (searchable) not catalogued Opportunistic rather than timely
14
Comparative statistics

PANDORA: 94 million files, 4.23 TB

Domain harvest    2005          2006          2007          2008          2009
Unique files      185 million   596 million   516 million   1 billion     765 million
Hosts crawled     811,523       1,046,038     1,247,614     3,038,658     1,074,645
Size (TB)         6.69          19.04         18.47         34.55         24.29

Domain harvests (total): 3 billion files, 103 TB
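The totals on this slide can be checked against the per-year figures. The sketch below uses only the numbers quoted above; treating "1 billion" as exactly 1,000 million is an assumption, since the slide gives a rounded figure.

```python
# Per-year domain harvest figures as quoted on the slide.
harvests = {
    2005: {"files_millions": 185, "size_tb": 6.69},
    2006: {"files_millions": 596, "size_tb": 19.04},
    2007: {"files_millions": 516, "size_tb": 18.47},
    2008: {"files_millions": 1000, "size_tb": 34.55},  # "1 billion", assumed to be 1000 million
    2009: {"files_millions": 765, "size_tb": 24.29},
}

total_files_millions = sum(h["files_millions"] for h in harvests.values())
total_size_tb = round(sum(h["size_tb"] for h in harvests.values()), 2)

# How many times larger the domain harvests are than PANDORA (4.23 TB).
size_ratio = total_size_tb / 4.23
```

The sums come to about 3.06 billion files and 103.04 TB, consistent with the rounded totals of 3 billion files and 103 TB, and roughly 24 times the size of the selective PANDORA archive.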
15
Skills and tasks Operational, Library’s ‘core business’ staff: Librarians, web curators, web archivists, cataloguers … by any other name Perform all associated tasks: Selection, permissions, acquisition (harvesting) processes, quality checking, cataloguing, publishing (resource discovery)
16
Operational skills and tasks Collection development Selection expertise in ‘new media’ Corporate objectives, priorities, resources Collection management Cataloguing: MARC, LCSH, Dewey PANDORA subjects Technical skills Scoping gather filters and settings Harvesting and code problem analysis and resolution (HTML, JavaScript, stylesheets) Understanding web technologies Experience and self-learning New technologies, Web 2.0, timely collecting, always new challenges
17
IT commitment and support All infrastructure maintained at NLA Systems and applications Storage of archival content Continuous development of systems from 1997-2007 3 version releases of PANDAS Technical support for applications and systems Expertise to assist with harvesting problems Support for domain harvests
18
Overview of PANDORA procedures PANDAS (PANDORA Digital Archiving System) Workflow management system Httrack harvesting software Agencies (PANDORA participants) Users Administrators (PANDAS and Agency) Standard user Informational user ‘Worktrays’ manage individual and agency workflow
19
Overview of PANDORA workflows Some concepts: Titles The target entity: a single document, a website, and everything in-between Publishers Permissions Instances Each instance of an archived title Users (‘owners’) Belong to Agencies and own titles Manage workflow among different agencies/people
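One way to picture how these concepts relate is as a small data model. This is a minimal sketch based only on the terms on the slide; the class and field names are assumptions and do not reflect the actual PANDAS schema.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Agency:
    name: str


@dataclass
class User:
    name: str
    agency: Agency  # users belong to agencies


@dataclass
class Publisher:
    name: str


@dataclass
class Instance:
    gathered_on: date  # one archived snapshot of a title


@dataclass
class Title:
    name: str
    seed_url: str
    publisher: Publisher
    owner: User  # users own titles
    instances: list[Instance] = field(default_factory=list)

    def reassign(self, new_owner: User) -> None:
        """Hand the title to another user, possibly in a different agency."""
        self.owner = new_owner
```

A title accumulates one `Instance` per successful harvest, and reassigning the owner is what moves work between people or agencies.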
21
Worktrays - Selection Nominating titles Shared agency worktray Before selection decision is made Selection statuses: Selected Rejected Monitored
23
Worktrays - Permission Requesting publisher permission Licence under Copyright Act Copy, preserve and make accessible Manage and record publisher contact Record permission status Title level permission Publisher level permission (‘blanket’)
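The two permission levels can be modelled as a lookup with a fallback. In this sketch the status names and the rule that a title-level record takes precedence over a publisher's blanket permission are assumptions made for illustration; the slides do not state how PANDAS resolves them.

```python
from enum import Enum
from typing import Optional


class Permission(Enum):
    UNKNOWN = "unknown"
    REQUESTED = "requested"
    GRANTED = "granted"
    DENIED = "denied"


def effective_permission(
    title_level: Optional[Permission],
    publisher_blanket: Optional[Permission],
) -> Permission:
    """Use the title-level record if one exists, else the publisher's blanket record."""
    if title_level is not None:
        return title_level
    if publisher_blanket is not None:
        return publisher_blanket
    return Permission.UNKNOWN
```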
25
Worktrays – Gather (Harvest) Set harvesting schedules Regular, specific days, gather now Define harvesting parameters Seed URLs, filters, gather settings View gathering titles Pause, view, modify, stop Statistics
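Harvesting parameters of this kind (seed URLs plus filters) might be represented as follows. This is a hedged sketch: the class name, the shell-style wildcard patterns, and the "exclude wins" rule are assumptions, and the patterns are not HTTrack's own filter syntax.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class GatherSettings:
    seed_urls: list[str]
    include: list[str] = field(default_factory=list)  # wildcard patterns to keep
    exclude: list[str] = field(default_factory=list)  # wildcard patterns to drop

    def allows(self, url: str) -> bool:
        """A URL is gathered if it matches an include pattern and no exclude pattern."""
        if any(fnmatchcase(url, pattern) for pattern in self.exclude):
            return False
        return any(fnmatchcase(url, pattern) for pattern in self.include)


# Hypothetical settings for one title.
settings = GatherSettings(
    seed_urls=["http://example.org.au/"],
    include=["http://example.org.au/*"],
    exclude=["*.zip"],
)
```

With these settings, pages under the seed site are gathered, while `.zip` downloads and anything off-site are left out.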
27
Worktrays - Preserve Manage quality checking process Not yet archived – working area Analyse harvested instance: Completeness No unwanted content Functionality Fix problems (or ‘refer to IT’) WebDAV, FTP and Samba access to files Decision on instance: Archive or Delete
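The quality checks listed above can be thought of as a checklist that must be complete before an instance leaves the working area. The sketch below is illustrative only: the check keys and the function names are hypothetical, and the final archive-or-delete decision remains a curator's judgement rather than something computed.

```python
# The three checks named on the slide, used as hypothetical checklist keys.
CHECKS = ("completeness", "no unwanted content", "functionality")


def outstanding_checks(results: dict[str, bool]) -> list[str]:
    """Return the checks that have not been recorded as passed."""
    return [check for check in CHECKS if not results.get(check, False)]


def ready_to_archive(results: dict[str, bool]) -> bool:
    """An instance is ready to archive only when every check has passed."""
    return not outstanding_checks(results)
```

An instance with outstanding checks stays in the working area to be fixed (or referred to IT) until the curator archives or deletes it.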
29
Worktrays - Publish Manages the public access to archived instances Set up Title Entry Pages Add notes Issues Copyright statements Browse listings
31
Worktrays - Catalogue Add ANBD number Automatically creates AGLS metadata for Title Entry Page
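Generating metadata for a Title Entry Page could look roughly like this. The slides do not say which elements PANDAS emits; the sketch assumes the common convention of embedding Dublin Core terms in HTML `meta` tags, and the function name and the choice of three fields are hypothetical.

```python
from html import escape


def agls_meta_tags(title: str, publisher: str, identifier: str) -> str:
    """Render a few AGLS-style (Dublin Core based) meta tags for a Title Entry Page."""
    fields = {
        "DCTERMS.title": title,
        "DCTERMS.publisher": publisher,
        "DCTERMS.identifier": identifier,
    }
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in fields.items()
    )
```

Escaping the values keeps characters such as `&` or quotation marks in a title from breaking the generated HTML.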
32
Administration Manage Agency information Add users Manage user access Run reports Agency statistics and totals Titles and instances selected, process and archived for specified period New title instances archived Scheduled gathers
33
http://pandora.nla.gov.au