PANDORA and Beyond: Managing Web Archiving at the National Library of Australia Digital Preservation Seminar National Library of Australia, 21 November.

Presentation transcript:

PANDORA and Beyond: Managing Web Archiving at the National Library of Australia Digital Preservation Seminar National Library of Australia, 21 November 2006 Paul Koerbin Manager Digital Archiving National Library of Australia

PANDORA and Beyond Context and background PANDORA – selective archiving PANDAS – a web archiving system Domain harvesting Now and beyond

PANDORA and Beyond – Context - Legislation National Library Act, 1960 Functions of the National Library –Maintain and develop a national collection of library material, including a comprehensive collection of library material relating to Australia and the Australian people –To make library material in the national collection available … in the national interest –‘Library material’ ~ books, periodicals, newspapers, manuscripts, films, sound recordings, musical scores, maps, plans, pictures, photographs, prints and other recorded material …

PANDORA and Beyond – Context - Legislation Copyright Act, 1968 – Sect 201 Delivery of library materials to the National Library –‘Library material’ ~ book, periodical, newspaper, pamphlet, sheet of letter-press, sheet of music, map, plan, chart or table, being a literary, dramatic, musical or artistic work or an edition of such a work … Enabling and supportive legislation does not address the collection of digital content Copyright Amendment (Digital Agenda) Act, 2000 –some support for digital preservation actions

PANDORA and Beyond – Context – Web Publishing World Wide Web: a new publishing medium, 1995→ Defining a publication for our purpose: A publication is information, regardless of its format or method of delivery, that is made available to the general public, or to an identified public, either free of charge or for a fee. Definition from: PANDORA Selection Guidelines Content rendered through a web browser – only as delivery mechanism (e.g. PDF) Databases – yes, but more problematic

PANDORA and Beyond – Context – Web Publishing Enormous growth and volume of material Everyone can be creators and publishers Virtually instantaneous publication Dynamic content and format Multiplicity of formats Technology dependent Hyperlinked and interconnected Highly accessible but hard to identify Ephemeral Interactivity, re-use, personalisation (web 2.0)

PANDORA and Beyond – Context – Some Objectives Fulfil the functions of the National Library Identify published content to collect Manage content for long term preservation –Integrity of the data streams –Maintain access to authentic content Provide persistent access to the content Incorporate collection and preservation of web content into routine Library processes Efficient and sustainable

PANDORA and Beyond – The PANDORA Archive PANDORA Archive 1996→ Began as proof-of-concept project Now a routine process within NLA Currently 10 participants – NLA, state libraries (not Tas), NFSA, AWM, AIATSIS Selective, content focused (bibliocentric) –simple documents to whole websites PANDAS workflow management system, 2001→

PANDORA and Beyond – PANDORA – Web Archiving What is web archiving? Identifying and selecting Seeking permission to collect and make accessible Recording metadata Crawling/harvesting (including scheduling) Processing for quality assurance (best effort) Storing and maintaining the data Preparing and rendering for public display Creating resource discovery metadata

PANDORA and Beyond – PANDAS PANDAS – PANDORA Digital Archiving System Web based workflow management system Developed specifically to manage the web archiving processes at the National Library of Australia Used by PANDORA’s participants located throughout Australia (mainland state libraries, AWM, NFSA, AIATSIS) Also used by UKWAC

PANDORA and Beyond – PANDAS Developed in-house at the NLA Replaced multiple non-integrated systems used between 1996 and 2001 Written in Java on Apple WebObjects application development platform Presentation, application, business and data layers Version 1 released June 2001 Version 2 released August 2002 Version 3 due early 2007

PANDORA and Beyond – PANDAS Record administrative metadata about titles selected (or considered) for archiving Schedule and initiate harvesting –but not a crawler; currently use HTTrack Manage quality assurance checking and problem fixing workflow Prepare and deliver archived copies for public display through the PANDORA home page –dynamically from PANDAS database Manage access restrictions Facilitate management reporting
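
The split described on this slide, where the workflow system records titles and schedules harvests while a separate crawler does the fetching, can be sketched roughly as follows. This is an illustration only, not the PANDAS data model: the `Title` class, its fields, and the helper functions are hypothetical. The only HTTrack detail relied on is its documented `-O` option for the output path, and the command is built but not executed.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Title:
    """Hypothetical administrative record for a title selected for archiving."""
    name: str
    seed_url: str
    harvest_interval_days: int
    last_harvest: Optional[date] = None

    def next_harvest(self) -> Optional[date]:
        """Date the next harvest falls due, or None if never harvested."""
        if self.last_harvest is None:
            return None
        return self.last_harvest + timedelta(days=self.harvest_interval_days)


def due_titles(titles: List[Title], today: date) -> List[Title]:
    """Titles never harvested, or whose next harvest date has arrived."""
    due = []
    for title in titles:
        when = title.next_harvest()
        if when is None or when <= today:
            due.append(title)
    return due


def httrack_command(title: Title, output_dir: str) -> List[str]:
    """Command line handing the crawl to HTTrack (built only, not run)."""
    return ["httrack", title.seed_url, "-O", output_dir]
```

Keeping the scheduler separate from the crawler call means the crawler could be swapped without touching the title records, which matches the slide's point that the workflow system is not itself a crawler.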

PANDORA and Beyond – Persistent URIs Running number generated by PANDAS Persistent URL applied to title entry page Logically extended to any resource in the Archive ed.html Citation generator on public interface
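
As a rough illustration of the scheme on this slide, the sketch below derives a persistent URL from a running number, extends it to a resource inside an archived instance, and builds a simple citation. The resolver address, the `pan-` prefix, the date segment, and all function names are hypothetical; the slide does not give the actual identifier syntax.

```python
from urllib.parse import quote

# Hypothetical resolver base; the real resolver address is not given in the slides.
RESOLVER_BASE = "https://resolver.example.org/"


def title_pid(running_number: int) -> str:
    """Persistent URL for a title entry page, built from a running number."""
    if running_number <= 0:
        raise ValueError("running number must be positive")
    return f"{RESOLVER_BASE}pan-{running_number}"


def resource_pid(running_number: int, instance_date: str, path: str) -> str:
    """Extend the title-level identifier to one resource in an archived instance."""
    safe_path = quote(path.lstrip("/"))  # percent-encode spaces etc., keep slashes
    return f"{title_pid(running_number)}/{instance_date}/{safe_path}"


def citation(title: str, running_number: int, archived: str) -> str:
    """Simple citation string of the kind a citation generator might emit."""
    return f"{title}. Archived {archived}. {title_pid(running_number)}"
```

Because every resource identifier is derived from the title's running number, the title entry page and everything beneath it stay addressable even if the underlying storage location changes.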

PANDORA and Beyond – PANDORA Statistics Indicative statistics as at October ,000+ titles 26,000+ archived instances million files* 1.2+ Terabytes data* * These figures are for the display copy only. Three preservation copies are actually maintained: a preservation master, an access master and a metadata master.

PANDORA and Beyond – Domain Harvesting Crawl conducted by the Internet Archive for the NLA 1st harvest June/July 2005 –4 weeks, 185m files, 6.69 TB 2nd harvest Aug/Sept 2006 –5 weeks, 516m files, 19.04 TB Harvest of the .au top level domain –plus non-.au hosts identified through geoIP lookup as being hosted in Australia Domain harvesting – obvious choice?
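
The scoping rule on this slide (hosts under .au, plus other hosts that a geolocation lookup places in Australia) can be sketched as a simple predicate. This is a minimal sketch under stated assumptions: the `in_scope` function and the lookup callable are hypothetical, and the slides do not describe how the actual crawl implemented the rule.

```python
from typing import Callable, Optional

# A lookup takes a host name and returns an ISO country code, or None if unknown.
GeoLookup = Callable[[str], Optional[str]]


def in_scope(host: str, geo_country: GeoLookup) -> bool:
    """True if the host is under .au or is geolocated to Australia."""
    host = host.lower().rstrip(".")
    # Rule 1: anything in the .au top level domain is in scope.
    if host == "au" or host.endswith(".au"):
        return True
    # Rule 2: other hosts are in scope only if the lookup says they are hosted in Australia.
    return geo_country(host) == "AU"


# Stand-in lookup table for illustration; a real crawl would query a geoIP database.
SAMPLE_LOCATIONS = {"example.com": "AU", "example.org": "US"}
sample_lookup: GeoLookup = SAMPLE_LOCATIONS.get
```

For example, `in_scope("www.nla.gov.au", sample_lookup)` is true by the domain rule alone, while `in_scope("example.com", sample_lookup)` is true only because the stand-in table places that host in Australia.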

Comparative statistics
PANDORA (c. 6% of 2006 DH): Files: 33 million; Size: 1.2 TB
DH MIME types: HTML: 67%; Image files: 28.5%; PDF files: 1.6%; MS Word files: 0.2%
Domain Harvest (DH)      2005           2006
Unique files             185,549,662    516,280,205
Hosts crawled            811,523        1,046,038
Size                     6.69 TB        19.04 TB
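
The "c. 6%" comparison can be checked directly from the figures on this slide. The short calculation below uses only those numbers; the variable and function names are illustrative, and the assignment of the two harvest columns to 2005 and 2006 follows the harvest dates given on the previous slide.

```python
# Figures taken from the comparative statistics slide.
pandora = {"files": 33_000_000, "size_tb": 1.2}
harvest = {
    2005: {"files": 185_549_662, "hosts": 811_523, "size_tb": 6.69},
    2006: {"files": 516_280_205, "hosts": 1_046_038, "size_tb": 19.04},
}


def percent(part: float, whole: float) -> float:
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


# PANDORA display copy as a share of the 2006 domain harvest.
files_share = percent(pandora["files"], harvest[2006]["files"])
size_share = percent(pandora["size_tb"], harvest[2006]["size_tb"])

# Growth of the domain harvest from 2005 to 2006.
files_growth = harvest[2006]["files"] / harvest[2005]["files"]
size_growth = harvest[2006]["size_tb"] / harvest[2005]["size_tb"]
hosts_growth = harvest[2006]["hosts"] / harvest[2005]["hosts"]

print(f"PANDORA share of 2006 DH: {files_share:.1f}% of files, {size_share:.1f}% of size")
print(f"2005 to 2006 growth: files x{files_growth:.2f}, size x{size_growth:.2f}, hosts x{hosts_growth:.2f}")
```

Both shares come out a little above 6%, consistent with the slide, while files and data volume grew far more between the two harvests than the number of hosts crawled.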

PANDORA and Beyond – Domain Harvesting – Pros and Cons Convergence of resources, technology, collaborations, and purpose in 2005 Some pros – –Retains linkages and context –Large scale – more bytes for the buck –Less selectively discriminate Some cons – –High dependence on the crawler technology –Domain and geo-location bias (.au, geoIP) –Limitations in timeliness, quality assurance, scoping, site complexity, deep web –Legal and access issues to resolve

PANDORA and Beyond – Now 10 years selective web archiving for PANDORA –publicly accessible web archive 2 years domain harvesting –large scale archival content PANDAS –production workflow system Tangible outcomes from pragmatic approach Doing (what we can) with limited resources Developing experience, knowledge and skill through practical engagement in the tasks

PANDORA and Beyond – Future Strategies Renewed focus on strategic thinking Collaborations, relationships, partnerships –International Internet Preservation Consortium (IIPC); Internet Archive –Open source tools, standards (IIPC) –Institutional and trusted repositories (universities and e-presses) –Government & academic sectors (APSR, ARROW) –‘Research information infrastructure’ services that support the discovery and management of research resources and research outputs by and for the current and future research community

PANDORA and Beyond – Future Strategies Preservation planning and infrastructure Sustainable resourcing and workflows Push for legislation for collecting in the digital age Understanding the territory –Personal web archiving (HanzoWeb); archive crawlers (Warrick); advanced bookmarking (spurl.net) Strategic use of selective and domain harvesting Architecture, systems and workflows for efficient management of and access to web archive collections

PANDORA Australia’s Web Archive