Published by Stuart Powers. Modified over 9 years ago.
1
netarkivet RESAW seminar, Dec 2-3, 2013 Day 1
2
Who are we today
□Birgit N. Henriksen, head of digital preservation, KB
□Bjarne Andersen, head of digital preservation, SB
□Eld Zierau, developer and researcher, KB
□Ditte Laursen, curator and researcher, SB
□Henrik Smith-Sivertsen, researcher, KB
3
Organization
□a virtual center (SB/KB – IT development, IT operation, Collection department)
□steering committee
□daily manager
□editorial advisory board
4
Collection policy
□Legal deposit law 2005: “Materials made public via electronic communication network”
□Danish materials
■ Websites on the .dk TLD
■ Websites aimed at a Danish audience / written in Danish
■ Websites about Danish people (Hans Christian Andersen etc.)
■ More or less any site of interest to Denmark
5
Collection strategies
□4 strategies
■ 4 annual snapshots (KB)
□ensure the wide picture
■ Selective harvesting of 80 domains (SB)
□ensure frequently updated websites
■ Event harvesting of 2-3 national events per year (KB/SB)
□2013: Teachers’ lockout, International Melodi Grandprix, Danish local elections, Election of the pope (IIPC) …
■ Special harvests (KB/SB), e.g. wikileaks, kriseinfo.dk, nyalliance.dk …
6
Collection strategies
[Figure: the snapshot, selective, event and special strategies plotted by coverage and time]
7
Access
□The archive contains sensitive personal data, therefore the entire archive is considered sensitive
■ only researchers, including PhD students, can be granted access
□if the research involves sensitive personal data, the Data Protection Agency assesses the application
□if not, the library assesses the application
□the Copyright Act defines research as being from PhD level and up
□the Privacy Act defines research as something with a ‘scientific purpose’
□Netarkivet is working on wider access
■ for students and for the general public
■ small corpus
8
Use of the archive
□Only a handful of active researchers
■ no user-friendly way of accessing the archive
■ lack of knowledge about the archive
■ new kind of data source
□Research projects – examples
■ dr.dk’s history 1996-2006
■ the history of internet newspapers
■ the mediation of art in the network society
■ the digital music revolution – the case of Sys Bjerre
■ Danish parliamentary elections 2007-2011 …
9
Technical setup
□NetarchiveSuite (open source)
□44 servers, 260 running Java apps
□Wayback Machine
□Batch jobs
□Full-text indexing experiments
□ARC/WARC
10
Some numbers
□Total: 414 TB – 13 billion objects
■ Snapshots: 353 TB
■ Selective: 47 TB
■ Events: 13 TB
□One snapshot: approx. 30 TB (2006: 9 TB)
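As a rough reading of these figures, the arithmetic below derives each strategy's share of the archive and the average object size. It is a sketch only: it assumes decimal terabytes (10^12 bytes) and treats the rounded slide values as exact; the variable names are illustrative.

```python
TB = 10**12  # assumption: decimal terabytes

total_tb = 414
total_objects = 13 * 10**9
parts_tb = {"snapshots": 353, "selective": 47, "events": 13}

# The listed parts add up to 413 TB; the 1 TB gap to the stated total is
# presumably rounding or special harvests, which the slide does not itemize.
parts_sum_tb = sum(parts_tb.values())

# Share of the stated total per strategy, in percent.
shares_pct = {name: round(100 * tb / total_tb, 1) for name, tb in parts_tb.items()}

# Average size of one archived object, in kilobytes.
avg_object_kb = total_tb * TB / total_objects / 1000

# Growth of a single snapshot between 2006 (9 TB) and the current approx. 30 TB.
snapshot_growth = 30 / 9

if __name__ == "__main__":
    print(parts_sum_tb, shares_pct)
    print(round(avg_object_kb, 1), round(snapshot_growth, 1))
```

On these assumptions the snapshots account for roughly 85% of the stored volume, and the average object is on the order of 30 KB.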
11
Current challenges
□wider access
□better access (free text search)
□inclusion of older net collections
□collection of websites with restricted access
□advanced websites, e.g. with sound/video/live interaction (chat, virtual worlds …)
□electronic communication networks ≠ the web
□long-term preservation
□documentation
12
2013-2014
□Tools
■ search - free text indexes
■ harvesting - the use of Heritrix3 and Live Archiving proxy
□Infrastructure
■ web archives as part of a research infrastructure
■ access to archived material using Persistent Identifiers
□Archiving methods
■ capturing online games
■ automatic methods to locate relevant Danish web materials outside the Danish TLD (.dk)
13
Ongoing activities related to RESAW’s topics
□API improvement / so-called service layer
□corpus building
□documentation
□full-text search
□statistics
□legal aspects (e.g. broader access, data mining policy)
14
What will the RESAW project be in 10 years?
□a very strong partner to IIPC
□common infrastructure across borders (ERIC / ESFRI status)
□coordinated European collection building