Web Archiving at the National Library of Australia National Library of Indonesia Staff 5 October 2010 Paul Koerbin Manager, Web Archiving National Library.

Presentation transcript:

Web Archiving at the National Library of Australia National Library of Indonesia Staff 5 October 2010 Paul Koerbin Manager, Web Archiving National Library of Australia

Web Archiving at the NLA
- Background
- History
- Organisation
- Participants
- Approaches to web archiving
- PANDORA selective archiving
- Whole domain harvesting
- Skills and operational tasks
- Workflows and systems
- PANDAS

History: web archiving at the NLA
- April 1996: ‘Electronic Unit’ established
  - Part of Acquisitions Branch
  - 3 staff, 6 months to develop selection (scope) guidelines and identify resources
- September 1996: ‘Australian Serials and Electronic Unit’ established
  - Technical services restructure, multi-tasking, matrix management
- October 1996: first titles harvested
- November 1996: PANDORA born as ‘proof of concept project’
- As at June 1997, 30 titles harvested

History: web archiving at the NLA
- May 1998: public access to PANDORA titles
- July 1998: first PANDORA ‘partner’ began participation
  - 11th participant joined in 2010
- October 1998: first ‘Certified Agreement’ commenced in the Library
  - Change to staffing classifications; professional librarian streams abolished
- June 2001: PANDAS v.1 released
  - Web archiving workflow system developed by NLA
- 2002: Digital Archiving Branch
  - Our own identity at last!
  - Began first trial of ‘mainstreaming’ web archiving in Serials and Govt Deposit sections

History: web archiving at the NLA
- August 2002: PANDAS v.2 released
- July 2003: joined IIPC
- 2004: PANDORA added to UNESCO Australian Memory of the World Register
- July 2005: first .au domain harvest
  - Subsequent harvests in 2006, 2007, 2008 & 2009
- December 2006: ‘Web Archiving and Digital Preservation Branch’
- July 2007: PANDAS v.3 released
- 2010: PANDORA search moved to Trove
- May 2010: Whole-of-govt ‘opt-out’ arrangements endorsed by SIGB

Web Archiving Section staffing
- Manager Web Archiving (base level executive)
- Team Leader (senior librarian), Web Archiving Section
- Team Member (APS5), Web Archiving Section
- Team Member (APS4), Web Archiving Section
- Team Member (APS4), Web Archiving Section

DIVISION 1 – COLLECTIONS MANAGEMENT (organisation chart)
- ASSISTANT DIRECTOR-GENERAL
- DIVISION SUPPORT UNIT
- Branches: WEB ARCHIVING AND DIGITAL PRESERVATION BRANCH; SERIALS BRANCH; MONOGRAPHS BRANCH; DIGITAL COLLECTIONS MANAGEMENT BRANCH; ASIAN COLLECTIONS BRANCH; PRESERVATION BRANCH; BIBLIOGRAPHIC STANDARDS AND STRATEGIES BRANCH
- Sections and units (in extracted order): Digitisation; Australian Collection Develop’t; Special Materials; Cataloguing, Standards & Training; Web Archiving; Imaging Services; Jakarta Office; Overseas Collection Development Section; ILMS Section; Serials Section; Preservation Standards; Collections Preservation; Digital Preservation; Newspaper Digitisation Project; Australian Newspaper Plan; Acquisition & Access; RDA

PANDORA Participants
- 11 participants including the NLA
- All state and territory libraries (except Tasmania and ACT)
- Major heritage institutions
  - National Film and Sound Archive
  - Australian War Memorial
  - Australian Institute of Aboriginal and Torres Strait Islander Studies
  - National Gallery of Australia

PANDORA Participants
- Memorandum of Understanding
  - Respective obligations (NLA and Agencies)
  - Adherence to policy and procedures
- Curatorial and collection management (operational staff)
  - Selection – participant guidelines
  - Permissions
  - Harvesting – scoping and quality checking
  - Cataloguing
  - Publishing – access through PANDORA

What is web archiving?
- A web archive is not the same as the live web
  - Brings a different value to web content
- Creating artefacts from the web
  - Preserved snapshots, slices, gobbets of time
- Challenge of timeliness
  - At certain times some things are more interesting and valuable
- Focus on the future and long term access (preservation objective)

Approaches to web archiving?
- Selective (specific targets)
  - websites
  - single publications
- Domain
  - Country domains (e.g. .au or .id)
  - Sub-domains (e.g. .gov.au)
- Thematic
  - Scoped around topics, events, forms of publishing
  - Seed lists
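
The difference between country-domain and sub-domain scoping can be illustrated with a small host-suffix check. This is only a sketch: the function name `in_scope` and the example URLs are hypothetical, and it is not how any particular harvester decides scope.

```python
from urllib.parse import urlsplit


def in_scope(url: str, suffixes: tuple) -> bool:
    """Return True if the URL's host falls under one of the domain suffixes.

    A suffix such as ".au" matches the host "au" itself and any host
    ending in ".au"; it does not match text in the URL path.
    """
    host = (urlsplit(url).hostname or "").lower()
    for suffix in suffixes:
        bare = suffix.lstrip(".").lower()
        if host == bare or host.endswith("." + bare):
            return True
    return False


# Hypothetical scopes for the two domain approaches named on the slide.
AU_DOMAIN = (".au",)          # country domain
GOV_AU_SUBDOMAIN = (".gov.au",)  # sub-domain

print(in_scope("http://www.example.gov.au/report.html", AU_DOMAIN))
print(in_scope("http://www.example.com.au/", GOV_AU_SUBDOMAIN))
```

A selective or thematic crawl would instead start from an explicit seed list rather than a suffix rule.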

PANDORA - Australia’s Web Archive
- Selective approach – Australian content
- Collaboration with participating agencies
- No legal deposit
- Permissions based collecting
- Timely and scheduled collecting
- Quality checked
- Described and indexed (searchable)
- Accessible to the public
- Modest in size

Australian web domain harvests
- Annual domain harvests
- Working with the Internet Archive
- Covers .au top level domain and a bit more …
- No legal deposit
- Permissions not sought
- No public access (yet)
- Quantity over quality (no QA action)
- Full text indexed (searchable), not catalogued
- Opportunistic rather than timely

Comparative statistics
- PANDORA: Files: 94 million; Size: 4.23 TB
- Domain Harvests (total): Files: 3 billion; Size: 103 TB
- Domain Harvest (five columns, in the order given on the slide):
  - Unique files: 185 million | 596 million | 516 million | 1 billion | 765 million
  - Hosts crawled: 811,523 | 1,046,038 | 1,247,614 | 3,038,658 | 1,074,645
  - Size TBs:

Skills and tasks
- Operational, Library’s ‘core business’ staff:
  - Librarians, web curators, web archivists, cataloguers … by any other name
- Perform all associated tasks:
  - Selection, permissions, acquisition (harvesting) processes, quality checking, cataloguing, publishing (resource discovery)

Operational skills and tasks
- Collection development
  - Selection expertise in ‘new media’
  - Corporate objectives, priorities, resources
- Collection management
  - Cataloguing: MARC, LCSH, Dewey
  - PANDORA subjects
- Technical skills
  - Scoping gather filters and settings
  - Harvesting and code problem analysis and resolution (HTML, JavaScript, stylesheets)
  - Understanding web technologies
- Experience and self-learning
  - New technologies, Web 2.0, timely collecting, always new challenges

IT commitment and support
- All infrastructure maintained at NLA
  - Systems and applications
  - Storage of archival content
- Continuous development of systems from
  - 3 version releases of PANDAS
- Technical support for applications and systems
  - Expertise to assist with harvesting problems
  - Support for domain harvests

Overview of PANDORA procedures
- PANDAS (PANDORA Digital Archiving System)
  - Workflow management system
  - HTTrack harvesting software
- Agencies (PANDORA participants)
- Users
  - Administrators (PANDAS and Agency)
  - Standard user
  - Informational user
- ‘Worktrays’ manage individual and agency workflow

Overview of PANDORA workflows
- Some concepts:
- Titles
  - The target entity: a single document, a website, and everything in-between
- Publishers
  - Permissions
- Instances
  - Each instance of an archived title
- Users (‘owners’)
  - Belong to Agencies and own titles
  - Manage workflow among different agencies/people
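
The relationships between these concepts can be sketched as a small data model. This is a minimal illustration, not the PANDAS schema: every class, field, and method name below is hypothetical, chosen only to mirror the terms on the slide.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Agency:
    name: str


@dataclass
class User:
    name: str
    agency: Agency  # users belong to agencies


@dataclass
class Publisher:
    name: str


@dataclass
class Instance:
    gathered_on: date  # one archived snapshot of a title


@dataclass
class Title:
    name: str
    seed_url: str
    publisher: Publisher
    owner: User  # users 'own' titles
    instances: List[Instance] = field(default_factory=list)

    def add_instance(self, gathered_on: date) -> Instance:
        """Record a new archived instance of this title."""
        instance = Instance(gathered_on)
        self.instances.append(instance)
        return instance

    def transfer(self, new_owner: User) -> None:
        """Hand the title to another user, possibly in another agency."""
        self.owner = new_owner
```

Modelling ownership on the title rather than on each instance reflects the slide's point that workflow is managed by passing titles between people and agencies.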

Worktrays - Selection
- Nominating titles
  - Shared agency worktray
  - Before selection decision is made
- Selection statuses:
  - Selected
  - Rejected
  - Monitored

Worktrays - Permission
- Requesting publisher permission
  - Licence under Copyright Act
  - Copy, preserve and make accessible
- Manage and record publisher contact
- Record permission status
  - Title level permission
  - Publisher level permission (‘blanket’)
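
One way to picture how title-level and publisher-level (‘blanket’) permissions might combine is a simple lookup with a fallback. The status names and the precedence rule (a title-level record overrides the publisher's blanket record) are assumptions made for this sketch; the slides do not say how PANDAS resolves the two.

```python
from enum import Enum
from typing import Optional


class Permission(Enum):
    # Hypothetical status values; the slide only says a status is recorded.
    UNKNOWN = "unknown"
    REQUESTED = "requested"
    GRANTED = "granted"
    DENIED = "denied"


def effective_permission(
    title_level: Optional[Permission],
    publisher_blanket: Optional[Permission],
) -> Permission:
    """Resolve the permission that applies to a title.

    Assumption: a title-level record, when present, takes precedence
    over the publisher's blanket record.
    """
    if title_level is not None:
        return title_level
    if publisher_blanket is not None:
        return publisher_blanket
    return Permission.UNKNOWN


def may_publish(title_level, publisher_blanket) -> bool:
    """Permissions-based collecting: publish only when permission is granted."""
    return effective_permission(title_level, publisher_blanket) is Permission.GRANTED
```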

Worktrays – Gather (Harvest)
- Set harvesting schedules
  - Regular, specific days, gather now
- Define harvesting parameters
  - Seed URLs, filters, gather settings
- View gathering titles
  - Pause, view, modify, stop
  - Statistics
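
The harvesting parameters listed above (seed URLs, filters, schedules) could be represented roughly as follows. This is a sketch under stated assumptions: `GatherSettings` and its fields are hypothetical and do not correspond to actual PANDAS or HTTrack options, and the filter logic is a simple prefix-plus-regex check.

```python
import re
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class GatherSettings:
    seed_urls: List[str]
    exclude_patterns: List[str] = field(default_factory=list)  # regex filters
    interval_days: int = 0  # 0 means a one-off ('gather now') harvest

    def allows(self, url: str) -> bool:
        """A URL is gathered if it sits under a seed and no filter excludes it."""
        if not any(url.startswith(seed) for seed in self.seed_urls):
            return False
        return not any(re.search(p, url) for p in self.exclude_patterns)

    def next_gather(self, last: date) -> Optional[date]:
        """Date of the next scheduled gather, or None for one-off harvests."""
        if self.interval_days <= 0:
            return None
        return last + timedelta(days=self.interval_days)


# Hypothetical title: a journal gathered every 30 days, skipping zip files.
settings = GatherSettings(
    seed_urls=["http://example.org.au/journal/"],
    exclude_patterns=[r"\.zip$", r"/calendar/"],
    interval_days=30,
)
print(settings.allows("http://example.org.au/journal/issue1.html"))
print(settings.next_gather(date(2010, 10, 5)))
```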

Worktrays - Preserve
- Manage quality checking process
  - Not yet archived – working area
- Analyse harvested instance:
  - Completeness
  - No unwanted content
  - Functionality
- Fix problems (or ‘refer to IT’)
  - WebDAV, FTP and Samba access to files
- Decision on instance: Archive or Delete

Worktrays - Publish
- Manages the public access to archived instances
- Set up Title Entry Pages
  - Add notes
  - Issues
  - Copyright statements
- Browse listings

Worktrays - Catalogue
- Add ANBD number
- Automatically creates AGLS metadata for Title Entry Page

Administration
- Manage Agency information
  - Add users
  - Manage user access
- Run reports
  - Agency statistics and totals
  - Titles and instances selected, processed and archived for specified period
  - New title instances archived
  - Scheduled gathers
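
A report such as "instances archived for a specified period" amounts to a filtered count per agency. The sketch below assumes a hypothetical input of `(agency_name, archived_on)` pairs; the real PANDAS reports and their data source are not described in the slides.

```python
from collections import Counter
from datetime import date
from typing import Iterable, Tuple


def archived_per_agency(
    records: Iterable[Tuple[str, date]],
    start: date,
    end: date,
) -> Counter:
    """Count archived instances per agency within [start, end], inclusive."""
    return Counter(
        agency for agency, archived_on in records if start <= archived_on <= end
    )


# Illustrative records only; agency names and dates are made up.
sample = [
    ("Agency A", date(2010, 1, 15)),
    ("Agency A", date(2010, 3, 2)),
    ("Agency B", date(2010, 2, 20)),
    ("Agency B", date(2009, 12, 31)),
]
report = archived_per_agency(sample, date(2010, 1, 1), date(2010, 3, 31))
print(dict(report))
```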
