archiving of websites and born-digital documents in slovakia

Slides:



Advertisements
Similar presentations
What is HathiTrust and How Can it Make a Difference? Sourcing and Scaling brought to the collective collection.
Advertisements

OCLC Online Computer Library Center OCLC Cataloging Update Connexion client 1.50 & more OCLC CJK Users Group Annual Meeting San Francisco, CA April 8,
1 What is the Internet Archive We are a Digital Library Mission Statement: Universal access to human knowledge Founded in 1996 by Brewster Kahle in San.
Kentico CMS 5.5 R2 What’s New. Highlights Intranet Solution Document management package – WebDAV support – Project & task management – Document libraries.
BUILDING DIGITAL WEB ARCHIVES FOR FUTURE SCHOLARS Jani Stenvall
Mark J. Myers Electronic Records Archivist, KY Dept for Libraries and Archives (2001-May, 2014) Electronic Records Specialist, TX State Library and Archive.
Chapter 2. Slide 1 CULTURAL SUBJECT GATEWAYS CULTURAL SUBJECT GATEWAYS Subject Gateways  Started as links of lists  Continued as Web directories  Culminated.
CNRIS CNRIS 2.0 Challenges for a new generation of Research Information Systems.
Web Services Presentation. Site Management Console (SMC)
Depositing e-material to The National Library of Sweden.
1 Uppsala University Library Eva Müller Peter Hansson Stefan Andersson Uwe Klosa Electronic Publishing Centre Krister Östlund Waller project.
1 Persistent identifiers, long-term access and the DiVA preservation strategy Eva Müller Electronic Publishing Centre Uppsala University Library, Sweden.
1 Archiving Workflow between a Local Repository and the National Library Archive Experiences from the DiVA Project Eva Müller, Peter Hansson, Uwe Klosa,
NOBLE Digital Library. How does it work? The NOBLE Digital Library uses the DSpace platform. Image files and metadata are imported into DSpace using.
1 Archive-It Training University of Maryland July 12, 2007.
Bibliography in the Digital Age - IFLA Satellite Meeting Warsaw, 9 August Online materials published in Austria collecting, archiving and metadata.
World Bank, Africa Region, Africa Household Survey Databank - The World Bank - Africa.
WebArchiv Czech Web Archive IIPC 2007, Paris.
Implementing an Integrated Digital Asset Management System: FEDORA and OAIS in Context Paul Bevan DAMS Implementation Manager
1 XML as a preservation strategy Experiences with the DiVA document format Eva Müller, Uwe Klosa Electronic Publishing Centre Uppsala University Library,
Metadata, the CARARE Aggregation service and 3D ICONS Kate Fernie, MDR Partners, UK.
Ms. Irene Onyancha ISTD/Library & Information Management Services United Nations Economic Commission for Africa The Second Session of the Committee on.
ERIKA Eesti Ressursid Internetis Kataloogimine ja Arhiveerimine Estonian Resources in Internet, Indexing and Archiving.
University of North Texas Libraries Building Search Systems for Digital Library Collections Mark E. Phillips Texas Conference on Digital Libraries May.
CBSOR,Indian Statistical Institute 30th March 07, ISI,Kokata 1 Digital Repository support for Consortium Dr. Devika P. Madalli Documentation Research &
CyberCemetery Preserving At-Risk Government Web Content.
Implementation and partnership with industry: a case Koninklijke Bibliotheek & IBM Netherlands Hans Jansen Head Research & Development Koninklijke Bibliotheek.
Feb 24-27, 2004ICDL 2004, New Dehli Improving Federated Service for Non-cooperating Digital Libraries R. Shi, K. Maly, M. Zubair Department of Computer.
Sharing Digital Scores: Will the Open Archives Initiative Protocol for Metadata Harvesting Provide the Key? Constance Mayer, Harvard University Peter Munstedt,
ARIADNE is funded by the European Commission's Seventh Framework Programme Archiving and Repositories Holly Wright.
1 Dissemination on Internet Experience of the Statistical Office of the Slovak Republic Dissemination on Internet Experience of the Statistical Office.
1 « Luxembourg, 18 April 2007 « Virtual Library of Official Statistics « Dissemination Working Group.
Building Preservation Environments with Data Grid Technology Reagan W. Moore Presenter: Praveen Namburi.
Preservation Functionality in a Digital Archive Erik Oltmans Koninklijke Bibliotheek Raymond J. van Diessen IBM Business Consulting Services Hilde van.
5/29/2001Y. D. Wu & M. Liu1 Content Management for Digital Library May 29, 2001.
Strategies for archiving the Danish web space Bjarne Andersen Head of Digital Resources State and University Library, Aarhus
Discover ScholarSphere A repository service collaboration between the University Libraries and ITS.
Use cases for BnF broad crawls Annick Lorthios. 2 Step by step, the first in-house broad crawl The 2010 broad crawl has been performed in-house at the.
Data Management and Archival Storage Bojana Tasić FORS SEEDS Workshop I Belgrade, October.
Learning Resource Management and Development System
Architecture Review 10/11/2004
Information Architecture
Internet Made Easy! Make sure all your information is always up to date and instantly available to all your clients.
Using JSTOR May 2016.
GISELA & CHAIN Workshop Digital Cultural Heritage Network
DIAS & DIAS data release 2 years DIAS-GCI Cooperation Hiroko KINUTANI DIAS (Data Integration and Analysis System in Japan) , St. Petersburg.
CONTENT MANAGEMENT SYSTEM CSIR-NISCAIR, New Delhi
An Overview of Data-PASS Shared Catalog
An Introduction to Tessella and The Safety Deposit Box Platform
DIGITAL RESOURCES Webharvesting and e-Born Archiving
BnF - DLWEB - Umbra & Heritrix 3
Managing Copyrights in Invenio
DIGITAL RESOURCES Webharvesting and e-Born Archiving
VI-SEEM Data Repository
Building Search Systems for Digital Library Collections
László Drótos – Márton Németh National Széchényi Library Department of Electronic Library Services Web archiving Planning a new pilot project.
DIGITAL ARCHIVING PLATFORM Digital Resources Project
Introduction to D4Science
Latin American Government Documents Archive, LAGDA
Managing a Web Server and Files
Sharing of Eurostat predefined tables
DDP/DAP Design and Technology Overview
IDEALS at the University Of Illinois: A Case Study of Integration Between an IR and Library Discovery Systems Sarah L. Shreeves University of Illinois.
Sharing of Eurostat predefined tables
Objectives, activities, and results of the database Lituanistika
Márton Németh – László Drótos How to catalogue a web archive?
GISELA & CHAIN Workshop Digital Cultural Heritage Network
Development roadmap of Suomi.fi-services
Features Overview.
Contract Management Software 100% Cloud-Based ContraxAware provides you with a deep set of easy to use contract management features.
Presentation transcript:

archiving of websites and born-digital documents in slovakia Jana Matúšková, Peter Hausleitner, Andrej Bizík University Library in Bratislava 8th Colloquium of Library and Information Experts of the V4+ Countries 19.6.2019 TVORÍME VEDOMOSTNÚ SPOLOČNOSŤ

national project digital resources web archiving in Slovakia - first attempts 2005/2006 - ULIB participated in project Web Cultural Heritage there were no needed legislative, organisational and technical conditions for systematic harvesting and archiving of web pages and original electronic resources in Slovakia 2015 – University Library in Bratislava - national project Digital Resources – Web Harvesting and e-Born Content Archiving implementation: 01. 04. 2015 – 31. 12. 2015 (application development, ICT installation, pilot harvests)

digital resources – organisational support Digital Resources Project - Mission & Goal: to create the technological and organisational infrastructure for systematic and controlled web harvesting and born-digital archiving to archive Slovak websites and born-digital content (electronic monographs and electronic serials) as an integral part of the Slovak Cultural Heritage Organisational support: Deposit of Digital Resources (National ISSN Centre and Deposit of Digital Resources) – created in 2015 external capacities: support – Service Level Agreement

digital resources – organisational support 2019 – the fourth year of sustainability of the project Deposit of Digital Resources – staff: Head of the Department (manager) Curator I - Administrator of the Information System for Digital Resources Curator II - Web Archive Curator Curator III – Born-Digital Archive Curator Manager and part-time person for Born-Digital Resources

digital resources – technical infrastructure powerful HW infrastructure - Public and Internal Portal, 24 virtual servers system management is optimized for multiple harvesting processes (239 workers, each one has 15 Heritrix harvesters) there is an identical parallel testing environment, which enables test harvesting and problem analysing without interference of the production processes the system disposes with 800 TB storage application and system SW - open source technologies: Heritrix, DeDuplicator, OpenWayback, SOLR, JAVA, Invenio etc. support services: Backup, Monitoring, Communication, Antivirus...

access to the archive according to the actual Copyright Act 3 types of access: public, local and forbidden default: without public access a limited number of archived websites and electronic publications is available publicly (Open Access, Creative Commons and agreements with provider/publisher) local access to the archived content in the ULIB research room local access to the archived content in the ULB research room (26 PC)

archiving of electronic publications specific feature of our system – both websites and born-digital content (electronic serials and monographs with assigned ISSN/ISBN) 856 787 WARC files and 612 PDF files (14.6.2019) 144 serial titles, 59 monographs (14.6.2019) National ISSN Centre - cooperation from the beginning (2015) the Slovak  National  ISSN Database – the source of e-serials titles archiving is a preparation for the future Legal Deposit Act for born-digital content

internal portal - curator interface 2x test virtual platform 1x production platform divided into 4 parts Harvest administration System configuration Domain catalogue e-Born administration task overiew system status audit log cooperating subjects contracts reports SIPs to send domain overview and administration (harvest parameters, access, Conspectus, metadata, ...) domain import web archive facets harvest templates harvest overiew crawler traps XSLT templates (metadata extraction) e-Born title catalogue archive records details deposit overview e-Born metadata

web resources catalogue nic.sk - current number of registered .sk domains 394 462 (14.06.2019) number of domains in the DIR catalogue: 519 727 (12.06.2019) other domains of slovacical character – content, author or language: .eu (209), .com (109), .org (72) .info(35), .net(21), .cz (14), other (23) - added into the catalogue manually Domain status in the catalogue:

Table: Web harvest campaigns web harvest policy respecting robots.txt agreements => respecting limitations harvest types: selective, thematic, full-domain (complex) formats: generally no audio, video files, streams, social networks 2016 2017 2018 2019 done 2019 planned Selective 7 11 14 5 9 Thematic 4 2 6 1 Full-domain - ∑ 12 20 Table: Web harvest campaigns

harvest types – 1: selective harvest focus: Conspectus categories and contracted URLs contracted URLs - ignoring robots.txt, but respecting limitations in the agreement Conspectus - meta 072_7$x agreement number - meta 542__n 2 GB, 3 days/ domain Image: harvest settings for contracted domain

harvest types – 2: thematic harvest thematic (event) harvest 2 GB, 2 days/ domain elections (presidential, europarliamentary, regional), Olympic Games, IIHF Championship, ... respecting robots.txt archived Source (accessible in ULIB): https://search.webdepozit.sk/webarchiv/kniznica/20190401151552im_/https://m.smedata.sk/api-media/media/image/sme/9/41/4109789/4109789_600x400.jpeg?rev=3

harvest types – 3: full-domain harvest harvest of all domains in the catalogue once a year parameters/ one seed: max. size 400 MB 10 000 objects max. time 2 hrs.

public portal a lot of information for publishers including forms for publication submission/suggestion frontend for web archive and e-publications archive plenty of information about our activities made in Wordpress simple and advanced search https://webdepozit.sk

simple search search in all metadata fields filtered by Conspectus categories links to live websites and archived versions Search of the keyword „ulib.sk“ – links to live site, OW versions, possible export of harvest metadata

advanced search Basic metadata search by following MARC fields: All metadata Title (24500$a) Keyword (650#4$a) Conspectus category (072#7$x) URL of resource (85640$u) Subtitle (24616$i) Description (500##$a) SOLR complex expression search using operators Logical operators: and (AND) or (OR) not (NOT) filtering results by date of archiving

openwayback Results in OpenWayback Displaying of the archived website https://search.webdepozit.sk

e-born search full-text search results of e-Born archive search with title details title „Dejiny“

ltp archivation process diagram  Digital Resources LTP  Central Data Archive AIP ID WebDAV Temporary Storage  Save Network transfer Signature check Validation start Creation Saving in 3 locations SIP ID Keystore SSL  AIP Package SIP Package HA – ZIP / EB - ZIP Signature Mets.xml + Mets.sig Validation: Data consistence File validation PUID and PDF version check Antivirus check Signature verification MD5 checksum Transformation content Signature verification WARC.gz e-Born.pdf + Marc.xml Valid ltp archivation process diagram Non valid Update

thank you for your attention PhDr. Jana Matúšková & Ing. Peter Hausleitner University Library in Bratislava, Slovakia Deposit of Digital Resources (ddp@ulib.sk) TVORÍME VEDOMOSTNÚ SPOLOČNOSŤ