The IIPC Web Curator Tool: An Open Source Solution for Selective Web Harvesting. Steve Knight (The National Library of New Zealand), Philip Beresford and Arun Persad (The British Library).

Presentation transcript:

1 The IIPC Web Curator Tool: An Open Source Solution for Selective Web Harvesting
Steve Knight, The National Library of New Zealand
Philip Beresford and Arun Persad, The British Library

2 Today’s Topics
What is the Web Curator Tool?
How does it work?
Who built it?
Where can I get it?

3 What is the WCT?
The Web Curator Tool (WCT) is a tool for managing the selective web harvesting process.
It is designed for use in libraries by non-technical users.
Heritrix is used to download web material (but technical details are handled behind the scenes by system administrators).

4 What does it do?
The WCT supports:
– Harvest Authorisation: getting permission to harvest web material and make it available.
– Selection, scoping and scheduling: what will be harvested, how, and how often?
– Description: Dublin Core metadata.
– Harvesting: downloading the material at the appointed time with the Heritrix web harvester deployed on multiple machines.
– Quality Review: making sure the harvest worked as expected, and correcting simple harvest errors.
– Submitting the harvest results to a digital archive.

5 What is it NOT?
It is NOT a digital archive or document repository
– It is not appropriate for long-term storage
– It submits material to an external archive
It is NOT an access tool
– It does not provide public access to harvested material
– (But it does let you review your harvests)
– You should use Wayback or WERA as access tools

6 What is it NOT?
It is NOT a cataloguing system
– It does allow you to record external catalog numbers
– And it does allow you to describe harvests with Dublin Core metadata
It is NOT a document management system
– It does not store all your communications with publishers
– But it may initiate these communications
– And it does record the outcome of these communications

7 The Web Curator Tool Project
A joint project undertaken by the National Library of New Zealand and the British Library
Each partner provided around half the funding
Each partner contributed personnel to the project team

8 The Web Curator Tool Project
The project was initiated by NLNZ and BL under the auspices of the IIPC Content Management Working Group
– Including requirements from several IIPC members
– BL project team members visited New Zealand for design workshops in April 2006
– Part of the project’s success has been the speed of the design and development phases

9 Project Objectives
Design and build a Web Curator Tool that:
– meets the needs of the National Library of New Zealand
– meets the needs of the British Library
– is modular and can be extended to meet the needs of IIPC members and other organisations engaging in web harvesting
– manages permissions, selection, description, scoping, harvesting and quality review
– provides a consistent, managed approach allowing users with limited technical knowledge to easily capture web content for archival purposes.

10 Project Timeline
IIPC Functional Requirements: to June 2005
IIPC Use Cases: to October 2005
Project Definition, Solution Scope; Software Requirements Specification: November–December 2005
Procurement: January–February 2006
Detailed Design: March–April 2006
Implementation: May–July 2006
User Acceptance Testing: July–August 2006
Open Source Release: today

11 Technology
Implemented in Java
Runs in Apache Tomcat
Incorporates parts or all of
– Acegi Security System
– Apache Axis (SOAP data transfer)
– Apache Commons Logging
– Heritrix (version 1.8)
– Hibernate (database connectivity)
– Quartz (scheduling)
– Spring Application Framework
– Wayback

12 More technology
Platform:
– Tested on Solaris (version 9) and Red Hat Linux
– Developed on Windows
– Should work on any platform that supports Apache Tomcat
Database:
– A relational database is required
– Tested on Oracle and PostgreSQL
– Installation scripts provided for Oracle and PostgreSQL
– Should work with any database that Hibernate supports, including MySQL, Microsoft SQL Server, and about 20 others

13 How does it work?

16 Harvest Authorisations
WCT Harvest Authorisation is concerned with:
– Permission to harvest web material
– Permission to make web material accessible to users
– Any and all special conditions that apply

17 Harvest Authorisation
A record is kept of:
– What authorisations have been granted (or rejected)? And what special conditions apply?
– Who has authorised us? The government, via legislation. Publishers, creators and copyright holders.
– Where (i.e. what part of the web) does the authorisation apply?
– When does the authorisation start and end?
Communications…
– Can be initiated from within the tool
– Documentation should be stored in your document management system
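
As a rough illustration of the record described on this slide, the sketch below models one authorisation as a small Python data class and checks whether a URL is covered on a given date. It is not taken from the WCT code base (which is written in Java); the class name HarvestAuthorisation, its fields, the example publisher and URL, and the prefix-matching rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class HarvestAuthorisation:
    """Hypothetical record of one authorisation; not the WCT schema."""
    authorising_agent: str            # who: e.g. a publisher, or legislation
    url_prefixes: List[str]           # where: the parts of the web covered
    start: date                       # when the authorisation begins
    end: Optional[date] = None        # None means open-ended
    granted: bool = True              # False records a rejected request
    special_conditions: List[str] = field(default_factory=list)

    def covers(self, url: str, on: date) -> bool:
        """Return True if `url` may be harvested on the date `on`."""
        if not self.granted:
            return False
        if on < self.start or (self.end is not None and on > self.end):
            return False
        # Assumed rule: a URL is covered if it starts with a listed prefix.
        return any(url.startswith(prefix) for prefix in self.url_prefixes)


auth = HarvestAuthorisation(
    authorising_agent="Example Publisher",
    url_prefixes=["http://www.example.org/news/"],
    start=date(2006, 1, 1),
    end=date(2006, 12, 31),
    special_conditions=["no public access before 2007"],
)
```

Keeping the who, where, when and special conditions in one record means a single check can answer whether a given harvest is permitted.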

24 Targets
A Target is a portion of the web you want to harvest.
The Target is the “unit of selection”:
– If there is something you want to harvest and archive and describe, then it is a Target.
You can attach a Schedule to a Target to specify when (and how often) it will be harvested.
– But you can’t harvest until you have permission to harvest, and you can’t harvest until the selection is approved.
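
The two preconditions on this slide can be sketched as a simple guard. This is a conceptual Python model only, not WCT code; the Target class, its field names and the can_harvest method are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """Hypothetical stand-in for a WCT Target."""
    name: str
    seed_url: str
    has_permission: bool = False   # harvest authorisation obtained
    approved: bool = False         # selection approved

    def can_harvest(self) -> bool:
        # Both conditions from the slide must hold before a harvest may run.
        return self.has_permission and self.approved


target = Target(name="Example site", seed_url="http://www.example.org/")
blocked = target.can_harvest()          # neither condition met yet
target.has_permission = True
still_blocked = target.can_harvest()    # permission alone is not enough
target.approved = True
ready = target.can_harvest()            # both conditions met
```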

31 Target Instances
Target Instances represent individual harvests that
– Are scheduled to happen, or
– Are in progress, or
– Have finished.
Target Instances are created automatically for a Target when that Target is Approved.
– A Target Instance is created for each harvest that has been scheduled.
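
One way to picture "one Target Instance per scheduled harvest" is the sketch below, which expands a fixed-interval schedule into instances. It is an illustration under assumptions, not WCT's implementation: the TargetInstance class, the create_instances helper, the fixed interval and the state name "Scheduled" are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class TargetInstance:
    """Hypothetical record of a single harvest of a Target."""
    target_name: str
    scheduled_start: datetime
    state: str = "Scheduled"


def create_instances(target_name: str, first_start: datetime,
                     interval: timedelta, count: int) -> List[TargetInstance]:
    """Create one instance for each of the next `count` scheduled harvests."""
    return [TargetInstance(target_name, first_start + i * interval)
            for i in range(count)]


# A weekly schedule starting 1 September 2006 at 02:00, three harvests ahead.
instances = create_instances("Example site", datetime(2006, 9, 1, 2, 0),
                             timedelta(days=7), 3)
```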

32 Target Instances - the Queue
Scheduled Target Instances are put in a queue
When their scheduled start time arrives
1. The WCT allocates the harvest to one of the harvesters
2. The harvester invokes Heritrix and harvests the requested material
3. When the harvest is complete, the User is notified
Examining the Queue gives you a good idea of the current state of the system
– The WCT provides a quick view of the instances in the Queue, including Running, Paused, Queued, and Scheduled Instances
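
The queue behaviour on this slide can be sketched with a priority queue ordered by start time. The slide does not say how the WCT chooses a harvester, so the least-loaded rule here is an assumption, as are the function name run_due_instances and the instance and harvester names.

```python
import heapq
from datetime import datetime
from typing import Dict, List, Tuple


def run_due_instances(queue: List[Tuple[datetime, str]],
                      harvesters: Dict[str, int],
                      now: datetime) -> List[Tuple[str, str]]:
    """Pop every instance whose start time has arrived and allocate it.

    `queue` is a heap of (scheduled_start, instance_name) pairs.
    `harvesters` maps a harvester name to its count of running harvests.
    Picking the least-loaded harvester is an assumed policy.
    """
    allocations = []
    while queue and queue[0][0] <= now:
        _, instance = heapq.heappop(queue)
        chosen = min(harvesters, key=harvesters.get)
        harvesters[chosen] += 1
        allocations.append((instance, chosen))
    return allocations


queue: List[Tuple[datetime, str]] = []
heapq.heappush(queue, (datetime(2006, 9, 1, 3, 0), "instance-B"))
heapq.heappush(queue, (datetime(2006, 9, 1, 2, 0), "instance-A"))
heapq.heappush(queue, (datetime(2006, 9, 2, 2, 0), "instance-C"))

harvesters = {"harvester-1": 0, "harvester-2": 0}
allocations = run_due_instances(queue, harvesters, datetime(2006, 9, 1, 4, 0))
```

Instances whose start time has not yet arrived stay in the queue, which is what makes the queue a useful snapshot of pending work.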

33 Target Instances

34 Target Instances - the User’s view
When a harvest is complete, its Owner is notified
The Owner (or another User) then has to
1. Quality Review the harvest result to see if it was successful
   – Browse Tool: Browse the harvest result to ensure all the content is there
   – Prune Tool: Delete unwanted material from the harvest
2. Endorse or Reject the harvest
3. Submit the harvest to an Archive (if it has been endorsed)
The User view of the Target Instances shows all the instances that the user owns.
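
The review steps on this slide amount to a small state machine: a harvest is endorsed or rejected, and only an endorsed harvest can be submitted to the archive. The sketch below is an illustration only; the state names and the advance function are assumptions, not identifiers from the WCT.

```python
# Allowed transitions between hypothetical review states.
ALLOWED = {
    "Harvested": {"Endorsed", "Rejected"},
    "Endorsed": {"Archived"},
    "Rejected": set(),
    "Archived": set(),
}


def advance(state: str, new_state: str) -> str:
    """Return `new_state` if the transition is allowed, else raise ValueError."""
    if new_state not in ALLOWED.get(state, set()):
        raise ValueError(f"cannot move from {state} to {new_state}")
    return new_state


state = "Harvested"
state = advance(state, "Endorsed")   # quality review passed
state = advance(state, "Archived")   # submitted to the archive
```

Encoding the transitions as data makes the rule "submit only if endorsed" explicit: a rejected harvest has no path to the archived state.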

50 Immediate directions
Streamline the user interface and workflow
Develop additional quality review tools
Improve support for other platforms and databases
Integrate with Wayback/WERA for access

51 Future plans
Development strategy
– Distribute the tool more widely
– Incorporate feedback from wider user bases
– Form a governance group to determine development strategy
Tool development
– Fix bugs
– Develop new features
– Support new platforms and databases

52 Where can I get it?
Available from
– Source code
– Documentation: user and administrator guides, FAQ
– Mailing lists
Released under Apache License, Version 2.0

53 Questions?