Harvesting and archiving the Web
Nordunet2000
Juha Hakala, Helsinki University Library
Contents
Legal background
Functional issues
– harvesting
– archival
– indexing
Some projects
Legal background
National libraries store, in theory, all (printed) publications
– this work is based on (legal) deposit
Legal deposit acts are currently being extended to electronic materials
Copyright must be relaxed
– permission to copy documents (e.g. for preservation purposes) is essential
Finnish Act on Legal Deposit (proposed)
The national library is granted the right to harvest and archive freely available Web documents
Archived resources can only be accessed from dedicated workstations within the deposit libraries (n = 6)
Access to the references (index) will be free
– anyone can see what is archived
Functional issues
Libraries' traditional methods do not fit electronic materials well
– manual cataloguing cannot be done for millions of documents
New, automated means must be developed for the acquisition, storage and indexing of electronic publications
– this work is well under way
Harvesting
Harvester = automated tool for collecting Web documents
– given a number of URLs, the harvester will fetch the documents, check the hypertext links in these documents, and then proceed to a second harvesting round
– this goes on until all qualifying documents have been fetched; a minimal sketch of the loop follows below
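A minimal sketch of that harvesting loop in Python, under stated assumptions: the seed list, the qualifies() inclusion rule and the store() hook are illustrative placeholders, not code from any of the tools discussed here.

```python
# Minimal sketch of the harvesting loop: fetch, extract links, repeat.
# qualifies() and store() are illustrative placeholders, not real tool code.
from collections import deque
from urllib.parse import urljoin
import urllib.request
import re

LINK_RE = re.compile(rb'href="([^"]+)"', re.IGNORECASE)

def qualifies(url: str) -> bool:
    # Hypothetical inclusion rule, e.g. restrict the crawl to one domain.
    return url.startswith("http://www.example.fi/")

def store(url: str, body: bytes) -> None:
    pass  # archiving is sketched on the document storage slide below

def harvest(seeds: list) -> None:
    queue, seen = deque(seeds), set(seeds)
    while queue:  # round after round, until no qualifying document is left
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                body = resp.read()
        except OSError:
            continue  # broken links and bad servers are common; skip them
        store(url, body)
        for match in LINK_RE.finditer(body):
            link = urljoin(url, match.group(1).decode("ascii", "replace"))
            if qualifies(link) and link not in seen:
                seen.add(link)
                queue.append(link)
```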
Practical experiences
Building a robust and well-behaved tool is not easy
– bad quality of data & Web servers
– huge quantity of data: what works for 1,000 files may not work for a million
Performance optimisation is complicated
– if scheduling is done the easy way, after a while only the large servers are left (one possible remedy is sketched below)
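One way to avoid ending the crawl grinding through a few large servers is per-host scheduling: keep a separate queue per server and visit hosts round-robin, so no single large site monopolises the harvester. This sketch is an assumption about how such scheduling could look, not the design of any existing harvester.

```python
# Round-robin per-host scheduling: one queue per server, so a single
# large site cannot monopolise the harvest. Illustrative sketch only.
from collections import defaultdict, deque
from urllib.parse import urlsplit

class HostScheduler:
    def __init__(self):
        self.queues = defaultdict(deque)   # host -> pending URLs
        self.hosts = deque()               # hosts in round-robin order

    def put(self, url: str) -> None:
        host = urlsplit(url).netloc
        if not self.queues[host]:          # host had no pending work before
            self.hosts.append(host)
        self.queues[host].append(url)

    def get(self):
        if not self.hosts:
            return None
        host = self.hosts.popleft()
        queue = self.queues[host]
        url = queue.popleft()
        if queue:                          # host still has work: requeue it
            self.hosts.append(host)
        return url
```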
Archival
No off-the-shelf solutions for this!
Strange exclusion and scheduling rules must be implemented in the harvester
– e.g. always retrieve inline materials, and at once (see the sketch below)
– it may be hard to modify an existing application accordingly
Generate & store archive metadata
Document storage
Incremental versus full archiving
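The kind of rule the slide refers to might look like the following; the suffix list and the hooks are assumptions for illustration, since the real rules are application-specific.

```python
# Illustrative scheduling rule: inline material is fetched at once,
# together with the page that embeds it; ordinary links wait for the
# next harvesting round. Suffix list and hooks are assumptions.
from collections import deque

INLINE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css")

def is_inline(url: str) -> bool:
    return url.lower().endswith(INLINE_SUFFIXES)

def schedule(url: str, next_round: deque, fetch_now) -> None:
    if is_inline(url):
        fetch_now(url)          # always retrieve inline materials, at once
    else:
        next_round.append(url)  # everything else joins the normal queue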
Archive metadata – examples
MD5 checksum
– duplicate control of the archive
– authentication of archived materials
– unique access key (used as URN/NBN)
Document size & location (old & new)
Time stamp
– when was the document retrieved
– a single point in time, or a period
A sketch of one such record follows below
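A minimal sketch of such a record in Python; the field names and the urn:nbn string are assumptions, shown only to make the list above concrete.

```python
# One possible archive metadata record for a harvested document.
# Field names and the urn:nbn string are illustrative assumptions.
import hashlib
from datetime import datetime, timezone

def make_metadata(url: str, body: bytes, old_location: str, new_location: str) -> dict:
    digest = hashlib.md5(body).hexdigest()
    return {
        "md5": digest,                    # duplicate control & authentication
        "urn": f"urn:nbn:fi-{digest}",    # checksum doubles as a unique access key
        "size": len(body),                # document size
        "location_old": old_location,     # where it was on the Web
        "location_new": new_location,     # where it sits in the archive
        "retrieved": datetime.now(timezone.utc).isoformat(),  # time stamp
        "source_url": url,
    }
```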
Document storage
Store the documents into a database or into a (UNIX) file system
– elimination of duplicates (MD5 check)
– extract metadata from files
– pre-processing of files (tar & ZIP)
– send location information to the archive database (see the storage sketch below)
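A file-system variant of the storage step might look like the sketch below; the archive root, the layout by checksum prefix and the database hook are assumptions.

```python
# Sketch of storage into a (UNIX) file system: duplicates are eliminated
# by MD5 check, files are laid out by checksum prefix, and the location
# is sent to the archive database. Paths and hooks are assumptions.
import hashlib
import os

ARCHIVE_ROOT = "/archive"      # hypothetical mount point
known_checksums = set()        # in practice, a lookup in the archive DB

def record_location(url: str, digest: str, path: str) -> None:
    pass  # placeholder: send location information to the archive database

def store(url: str, body: bytes) -> None:
    digest = hashlib.md5(body).hexdigest()
    if digest in known_checksums:
        return                 # duplicate: already archived
    known_checksums.add(digest)
    path = os.path.join(ARCHIVE_ROOT, digest[:2], digest)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(body)
    record_location(url, digest, path)
```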
Indexing
Many off-the-shelf solutions exist, but they do not qualify as such
– enhancements needed for indexing archive metadata + adding it to the user interface
– changes to navigation (hyperlinks must point into the archive where applicable; see the sketch below)
– ability to cope with a very large number of documents and file types
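The navigation change can be pictured as a link-rewriting pass over each page served from the index; archive_lookup() below is a hypothetical resolver, not part of any named search engine.

```python
# Rewriting hyperlinks so that targets held in the archive are served
# from the archive instead of the live Web. archive_lookup() is a
# hypothetical resolver returning an archive URL or None.
import re

HREF_RE = re.compile(r'href="([^"]+)"')

def rewrite_links(html: str, archive_lookup) -> str:
    def swap(match):
        archived = archive_lookup(match.group(1))
        return f'href="{archived}"' if archived else match.group(0)
    return HREF_RE.sub(swap, html)
```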
Problems - 1
Not everything can be harvested
– databases, dynamic documents
– how to find all relevant data in .com, .net, etc.? co-operation with network providers
Not all functionality can be kept
– image maps
Indexing
– images and sound; weird text types
Problems - 2
Long-term preservation
– 97% of Web documents are text/html, text/plain, image/jpeg or image/gif
– the remaining 3% are a problem
– workstations accessing the archive must be equipped with a wide variety of tools, such as emulators
Projects - 1
Kulturarw3 (Royal Library of Sweden)
– started in 1996
– uses a modified version of the Combine harvester
– the Swedish Web has been harvested seven times
Projects - 2
NEDLIB (European national libraries)
– started in 1997; the aim is to develop tools for the deposit of all kinds of electronic materials
– developed a harvester customised for Web archiving
– version one was published in January
– after extensive tests, version two was released in September 2000
Projects - 3
Nordic Web Archive (Nordic national libraries)
– started in September 2000
– supported by Nordunet2
– will develop an index for Web archives built with the Kulturarw3 or NEDLIB tools
– co-operation with either FAST (Norway) or Index Data (Denmark)
– a virtual union catalogue of the Nordic Web
Experiences
The Web is small
– as of the spring measurement, the Swedish Web space contained some million files, but only 300 gigabytes
The Web is simple
– four document types comprise about 97% of all documents
– … but there are about 200 file types in all
A workstation will do for harvesting
– … but the search engine will need more power
Future
Several national libraries plan to harvest the Web, either selectively or essentially in full
Revision of legislation is under way in a few countries
Close co-operation between the libraries will continue in application development
Long-term preservation will emerge as a new work item