WebArchive – Archive of the Czech Web Mgr. Jan HUTAŘ.


Why did we start WebArchiv?
- the amount of documents published on the Internet is growing dramatically, while their average lifespan is 40 days; if the documents are not archived, a part of the national cultural heritage will disappear forever
- need to save and keep accessible the documents on the CZ web
- about 90% of documents on the web exist only in electronic form
- it is a trend around the world (Australia, Sweden, Internet Archive, etc.)
- NK ČR (the National Library of the Czech Republic) is supposed to do it, since it is a deposit library
- the main mission of the NK is to collect, catalogue and permanently preserve documents published in the territory and to make them available to the general public

The beginning
- launched in 2000 and run until 2002 as the R&D grant project "Registration, preservation and access of national electronic resources in the Internet" of the Ministry of Culture
- cooperation with the Moravian Library in Brno and the Institute of Computer Science at Masaryk University in Brno
- they are our "IT department" ;-)
- grant money only
- we are still going on!

Main Aims
- to implement the best solution for archiving the national web, i.e. bohemical online-born documents
- to prepare tools, methods and conditions for collecting, archiving and preserving web resources
- to provide long-term access to them
- large-scale automated harvesting of the entire national web and selective archiving are being carried out, including thematic "event-based" collections
- to solve current legal issues (the legal deposit legislation, the Copyright Act (CA)): the Legal Deposit Act doesn't cover online-born documents and, according to the Copyright Act, it is not possible to make archived data available to the public
- to set selection criteria for the selective approach / harvest
- to establish conditions for cooperation between libraries and publishers of electronic documents

Workflows
Prague:
- Resource selection
- Cataloguing for the National Bibliography (MARC21)
- Providing Dublin Core metadata for interested publishers
- Making archive access agreements with publishers
Brno:
- Running WebArchiv hardware
- Software localization, maintenance and development
- Pre-harvesting resource analysis
- Harvesting, indexing, access
Results so far:
- 4 harvesting rounds of the .cz domain (2001, 2002, 2004, 2006)
- 5 event-oriented harvests
- harvests of sites under agreements several times per year
- 5.4 TB archive with 136 million files

Selection Criteria
- the amount of documents on the Internet is very large; for the selective approach we need to find the ones with "research value"
Two approaches to acquisition (harvesting):
1. selective approach: only selected documents are harvested and archived, according to the selection criteria
2. complete harvest of the entire national domain, for example .cz; we only need to configure the harvester
- approaches differ from country to country
- the trend is to do both (Australia, Denmark)

Criteria – selective approach
- setting the selection criteria was very difficult
- we coordinated the "Web Cultural Heritage" project (within the EU Culture 2000 programme)
- Content
- Resource type
- Original form
- Access
- Format
- Domain
- National aspect

Criteria – selective approach
1. Contents: web resources of artistic or research value, news stories and feature articles, and resources produced by government and other offices. Promotional material of an individual or a corporation is omitted.
2. Resource type: serials, monographs, conference proceedings, research and other reports, academic works etc.
3. Original form: only resources originally published on the web, meaning they have no traditional/printed copy
4. Access: only freely accessible resources are collected

Criteria – selective approach
5. Format: resources available in formats that common web browsers can interpret without installing plug-ins are collected.
6. Domain: resources accessible on servers under the top-level domain .cz and on servers under other domains …
7. National aspect: resources selected according to "author's nationality", "national language", "country or nation as a subject"
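The seven criteria can be read as a filter applied to each candidate resource. The following is only a sketch under stated assumptions: the `Resource` record, its field names and the `ACCEPTED_TYPES` set are hypothetical and not part of the WebArchiv tooling, and the content and national-aspect judgements are assumed to be made by a curator beforehand.

```python
from dataclasses import dataclass

# Hypothetical labels for the resource types named in criterion 2.
ACCEPTED_TYPES = {"serial", "monograph", "proceedings", "report", "academic_work"}


@dataclass
class Resource:
    """Hypothetical description of a candidate web resource."""
    host: str
    resource_type: str
    has_research_value: bool   # criterion 1, judged by a curator
    is_promotional: bool       # criterion 1, promotional material is omitted
    born_online: bool          # criterion 3, no printed counterpart
    freely_accessible: bool    # criterion 4
    needs_plugin: bool         # criterion 5
    national_aspect: bool      # criterion 7, author, language or subject


def failed_criteria(r: Resource) -> list[str]:
    """Return the names of the criteria the resource fails (empty list = select it)."""
    failed = []
    if not r.has_research_value or r.is_promotional:
        failed.append("content")
    if r.resource_type not in ACCEPTED_TYPES:
        failed.append("resource type")
    if not r.born_online:
        failed.append("original form")
    if not r.freely_accessible:
        failed.append("access")
    if r.needs_plugin:
        failed.append("format")
    # Assumption: a host outside .cz qualifies only through its national aspect.
    if not (r.host.endswith(".cz") or r.national_aspect):
        failed.append("domain / national aspect")
    return failed
```

Returning the list of failed criteria, rather than a bare yes/no, keeps the reason for a rejection visible to the person doing the selection.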

What we have done…
- continuous testing of:
  - SW tools
  - applications for harvesting, archiving, indexing and accessing web pages
- only open source SW
- effort / push to change legislation
- international cooperation (R&D activities within IIPC, even before we became a member)
- we have opened part of our archive to the public (since autumn 2005)
- the whole archive should be open right now (local access only)

Harvest of the .cz domain
- 2001: first attempt at a whole-domain harvest of .cz, with 1 PC + tape robot; cz2001 includes over 3 million unique URLs (107 GB); not completed
- 2002: harvest interrupted by lack of data storage space and by floods; cz2002 includes 315.5 GB, over 10 million documents harvested from URLs
- 2003: no harvest
- 2004: March to October, 32.5 million files harvested from URLs = 1.2 TB
- September …: …nd harvest of .cz by Heritrix. Stopped: no data storage space. Limits: max docs/server, max. file size 100 MB
- all harvests executed by the NEDLIB harvester, deep links: HERITRIX
- from 2004 the new harvester HERITRIX

Registered domains in .cz

Harvests of the .cz domain at a glance
Columns: year | total downloaded documents | size, uncompressed [GB] | days of running | number of second-level domains | % of all registered domains
20013,015, ,32238% ,249, ,02269% ,141,5751, ,37875% 20059,336, ,7952% ,378,0193, ,88074%

Present state of the project
- a collection of selected resources (under agreement with the NK), about 350 servers, is harvested 4-6 times a year; each harvest adds around 10 GB of data
- it is still rising
- harvesting of "small" amounts of data is successful
- an analysis of the .cz domain was done
- servers suspected of being irrelevant (mail, mysql, etc.) were rejected, as were duplicates; the number of URLs decreased from 540 to 378 thousand
BUT …
- from 2004 we were not able to keep the harvest of the whole .cz domain running, because of a problem with Heritrix's memory usage
- a new release solved this issue
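The pre-harvest clean-up of the host list (rejecting service hosts such as mail and mysql, then removing duplicates) can be sketched as below. This is an illustration only: the slides name just "mail" and "mysql", so the other labels in `SUSPICIOUS_LABELS` and the rule that treats `www.` variants as duplicates are assumptions, and `filter_hosts` is a hypothetical helper rather than the analysis script the project used.

```python
# Leftmost host labels that usually name non-web services.
# Only "mail" and "mysql" come from the slides; the rest are assumed.
SUSPICIOUS_LABELS = {"mail", "mysql", "smtp", "pop3", "ftp"}


def filter_hosts(hosts):
    """Drop service hosts and duplicates, keeping first-seen order."""
    seen = set()
    kept = []
    for host in hosts:
        name = host.strip().lower().rstrip(".")
        if not name:
            continue
        first_label = name.split(".")[0]
        if first_label in SUSPICIOUS_LABELS:
            continue
        # Assumption: "www.example.cz" and "example.cz" serve the same site.
        key = name[4:] if name.startswith("www.") else name
        if key in seen:
            continue
        seen.add(key)
        kept.append(name)
    return kept
```

For example, `filter_hosts(["www.example.cz", "example.cz", "mail.example.cz"])` keeps only the first entry.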

Present state of the project
- the main standards are used (MARC21, DC, ISSN and URN)
- selected documents are catalogued in an ALEPH library system, which supports the Z39.50 and OAI-PMH protocols
- selected resources (with agreements) are harvested at least 4 times a year
- at present WebArchiv holds about 5.5 TB of data (uncompressed) ≈ 158 million documents
- at the end of 2007 all data will be moved to the new data repository
- in 2008 the project's archive should become part of the planned "National Digital Library" project at the National Library (together with Kramerius and Manuscriptorium)

Software changes
- 2004: development and support of the NEDLIB harvester was cancelled; we replaced it with Heritrix
- gradual changeover to SW developed by IIPC (International Internet Preservation Consortium)
- the NEDLIB archival file format was replaced by the ARC format (used by Heritrix)
- WARC format in the near future, then Wayback
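For orientation, each document stored in an ARC file is preceded by a one-line header. The sketch below assumes the ARC version 1 layout published by the Internet Archive (URL, IP address, 14-digit archive date, content type, length in bytes, separated by single spaces); the `ArcRecordHeader` type and `parse_arc_header` function are hypothetical names, and the sample values in the note are invented.

```python
from datetime import datetime
from typing import NamedTuple


class ArcRecordHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # size of the record body in bytes


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse one ARC version 1 URL-record header line."""
    fields = line.strip().split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    url, ip_address, date, content_type, length = fields
    return ArcRecordHeader(
        url,
        ip_address,
        datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type,
        int(length),
    )
```

A reader would call it on the header line, then read `length` bytes to get the archived response, e.g. `parse_arc_header("http://www.example.cz/ 192.0.2.1 20060915120000 text/html 2048")`.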

Harvester Heritrix – advantages
- system modularity, extensibility, continual development (v.1.8), very good and fast support from Internet Archive developers
- open source code and modularity allow third parties to cooperate on its development: good for us ;-)
- 2 parts: framework and add-on modules
- Framework: basic control over harvests, user interface, process management, harvest settings
- Modules: used for specific harvest implementation, setting up each harvest step by step

Harvester Heritrix - problems
- it is not possible to leave the whole harvesting process without expert supervision
- trap detection
- extraction of links from websites (Java)
- memory problems (whole-domain harvest): solved
- incremental harvest and change detection
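To illustrate why trap detection needs attention: a crawler trap typically produces endlessly deep or self-repeating URL paths. The heuristic below is a generic sketch, not Heritrix code; the function name `looks_like_trap` and both thresholds are assumptions chosen for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_trap(url: str, max_depth: int = 20, max_repeats: int = 3) -> bool:
    """Heuristically flag URLs whose path is very deep or repeats a segment."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    counts = Counter(segments)
    # A segment seen max_repeats times or more suggests a looping link structure.
    return any(n >= max_repeats for n in counts.values())
```

Such rules produce false positives and miss other trap types (for example endless calendars driven by query strings), which is one reason the harvest cannot run without expert supervision.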

SW for access
- from IIPC (IA)
- NutchWAX: full-text document indexing; an extension/superstructure over the Nutch search engine
- WERA (successor of the NWA tools): user interface for accessing documents on the web; it can deal with Czech diacritics (accents etc.: display them, search by them, sort)
- ARCWayback: builds an index over the whole archive and allows access into the archive by URL and time
- Wayback: only restricted on-site access from within the library is possible to all files in the archive

Nutch and NutchWAX
Nutch
- open source search engine, by IA
- builds on the Apache Lucene architecture
Nutch is able to:
- download and process millions of sites in a month, manage and control their index, and search this index 1000 times/second
NutchWAX
- superstructure over the Nutch search engine made for indexing documents archived by Heritrix
- a set of indexing and query plug-ins, which add some needed metadata to the index

WERA - WEb aRchive Access
- cooperation between IIPC, Internet Archive and NWA
- uses some parts from NWA
- very easy navigation, nice user interface (timeline with document versions over time)
- search hits in URL form are displayed very concisely; each hit has a link to the timeline to get a different version of the same URL
- possibility to search by URL address (like the Wayback Machine)
- archived documents and WERA are linked by the NutchWAX index

How does it actually work?
- harvest of documents by the Heritrix crawler; documents are saved to data storage in ARC format
- to make archived documents accessible we have to build an index plus an interface that displays search hits:
1. building the full-text index over the collection of selected resources, for searching by words: NutchWAX
2. building the global index to provide access to the whole archive: ARCWayback
- displaying documents from the archive: WERA and Wayback
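Access "by URL and time" boils down to an index from each URL to the sorted times at which it was captured, plus a nearest-neighbour lookup. The following is a toy sketch of that idea, not the ARCWayback implementation; the `CaptureIndex` class and its methods are hypothetical, and timestamps are assumed to be 14-digit `YYYYMMDDhhmmss` strings.

```python
from bisect import bisect_left, insort
from collections import defaultdict
from datetime import datetime

STAMP = "%Y%m%d%H%M%S"  # 14-digit capture timestamp


class CaptureIndex:
    """Toy index mapping a URL to the sorted times at which it was captured."""

    def __init__(self):
        self._captures = defaultdict(list)

    def add(self, url: str, timestamp: str) -> None:
        when = datetime.strptime(timestamp, STAMP)
        if when not in self._captures[url]:
            insort(self._captures[url], when)

    def closest(self, url: str, timestamp: str):
        """Return the capture timestamp nearest to the requested one, or None."""
        stamps = self._captures.get(url)
        if not stamps:
            return None
        when = datetime.strptime(timestamp, STAMP)
        pos = bisect_left(stamps, when)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = stamps[max(pos - 1, 0):pos + 1]
        best = min(candidates, key=lambda c: abs(c - when))
        return best.strftime(STAMP)
```

Keeping each list sorted makes the lookup a binary search, so a request for a URL "as of" some date only has to compare the two captures around that date.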

WebArchiv – Infrastructure A1 new crawl; A2 end crawl -> index; A3 update fulltext; A4 update host list

WERA - demonstration

Our future
- main aim: finish the 2006 harvest, then keep harvesting the whole .cz domain every year
- go on with the selective collection and increase the number of resources in it
- provide legal access to the whole archive, locally, according to the new CA (searching by URL and by the time of harvest)
- implementation of incremental harvesting (identification of changes in repeatedly harvested documents)
- harvesting of bohemical resources outside the .cz domain, with some language recognition tool
- adaptive incremental harvesting
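One simple way to identify changes in repeatedly harvested documents is to compare content digests between two harvest rounds. The sketch below assumes that approach; the slides do not say which method the project intends to use, and the `digest` and `classify` helpers are hypothetical.

```python
import hashlib


def digest(content: bytes) -> str:
    """SHA-1 hex digest of a document body."""
    return hashlib.sha1(content).hexdigest()


def classify(previous_digests: dict, current_pages: dict) -> dict:
    """Compare a new harvest ({url: bytes}) with the digests of the previous one."""
    result = {"new": [], "changed": [], "unchanged": [], "gone": []}
    for url, content in current_pages.items():
        old = previous_digests.get(url)
        if old is None:
            result["new"].append(url)
        elif old == digest(content):
            result["unchanged"].append(url)
        else:
            result["changed"].append(url)
    # URLs present last time but missing from the current harvest.
    result["gone"] = [u for u in previous_digests if u not in current_pages]
    return result
```

Only the "new" and "changed" documents would need to be stored again. An exact digest treats any byte-level difference (a rotating banner, a visit counter) as a change, which is why the identification of very similar documents is listed as a separate goal.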

Our future
- identification of duplicate (or rather very similar) documents
- incremental indexing: adding new documents to an existing index instead of building a new one every time
- full-text indexing of the whole archive
- selective harvesting on demand
- permanent linking into the archive
- access limitations set by the new copyright law
- OAI-PMH implementation on top of the registration database
- building METS structures on top of the archive
- integration of the archive into the proposed NDL 2007/08
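"Very similar" documents can be found by comparing sets of overlapping word sequences (shingles) with the Jaccard coefficient. This is a common textbook technique shown here only as a sketch; it is an assumption that something like it would be used, and the shingle size and the threshold are arbitrary example values.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a text (lower-cased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(text_a: str, text_b: str, threshold: float = 0.7) -> bool:
    """True if the two texts share at least `threshold` of their shingles."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold
```

Comparing every pair of documents this way does not scale to a whole-domain archive; at that size the shingle sets would have to be summarised first (for example with MinHash), which is outside this sketch.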

People
Librarians, project management:
- National Library: 3.5 FTE
IT management:
- Moravian Library: 1 part-time
IT:
- Masaryk University: 6 part-time

Useful links – in English ;-)
- WebArchiv homepage
- Petr Žabička Digital Cultural Heritage and the Cooperation of National Memory Institutes Archiving the Czech Web: Issues and Challenges
- this presentation
- Petr ŽABIČKA: WebArchiv, Czech Web Archive