Strategies for archiving the Danish web space Bjarne Andersen Head of Digital Resources State and University Library, Aarhus


Agenda
- New legal deposit law in Denmark
- Collection strategies
- NetarchiveSuite software package
- Snapshot harvesting
- Selective harvesting
- Event harvesting
- Challenges in snapshot harvesting
- Snapshot harvesting usefulness
- Future work

Legal deposit law 1
- Revision of the legal deposit law in [...]: legal deposit included static documents on the internet
- During [...] we found out that:
  - We were actually preserving the least interesting part
    - Many of the documents in that collection are also available in print
- A lot of work was done between [...]
  - Pilot projects run by the two national libraries
    - Testing different software / different strategies for archiving / storing web material
  - A governmental publication on "preserving the Danish digital cultural heritage" (2003)
  - A report to the Ministry of Culture (2004) outlining
    - Recommendations from the two national libraries on how to solve the "entire" problem
    - Issues to be covered by a new revision of the legal deposit law

Legal deposit law 2
- A new revision came into force on July 1st 2005
  - Allowing the two national libraries to automatically gather all Danish websites
  - Danish roughly defined as:
    - Websites on the .dk TLD
    - Websites aimed at a Danish audience / written in Danish
    - Websites about Danish people (Hans Christian Andersen)
    - More or less any site of interest to Denmark
  - We are by law granted access to all relevant data from the .dk TLD administrator
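The scope criteria above can be read as an ordered set of checks. The following is only a minimal sketch of that reading, not the rules NetarchiveSuite or the libraries actually apply: `SiteInfo`, its `language` field (the output of some language detector) and its `of_danish_interest` curator flag are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
from urllib.parse import urlsplit


@dataclass
class SiteInfo:
    """Hypothetical record describing one candidate website."""
    url: str
    language: Optional[str] = None    # assumed output of a language detector, e.g. "da"
    of_danish_interest: bool = False  # assumed flag set manually by a curator


def in_scope(site: SiteInfo) -> Tuple[bool, str]:
    """Return (decision, reason) following the rough definition of 'Danish'."""
    host = (urlsplit(site.url).hostname or "").lower().rstrip(".")
    if host == "dk" or host.endswith(".dk"):
        return True, "on the .dk TLD"
    if site.language == "da":
        return True, "written in Danish"
    if site.of_danish_interest:
        return True, "of interest to Denmark"
    return False, "no Danish criterion matched"
```

Only the first check can be automated from the URL alone; the other two depend on inputs (language detection, curator judgement) that this sketch simply assumes are available.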

Legal deposit law 3
- The law covers all publicly available material
  - Material that all Danish people in principle can gain access to
    - Material which requires action before usage (payment, registration...)
    - Pay-sites should hand out username / password upon request (for free)
- Other interesting parts
  - Combined strategy (snapshot, selective and event harvesting)
  - Robots.txt explicitly mentioned in the regulations of the law
    - A lot of the very interesting websites have very restrictive robots.txt files (we discovered around [...] robots.txt files)
    - During 6 snapshots of more than [...] web sites we had fewer than 50 complaints about robots.txt
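To illustrate what a "very restrictive" robots.txt means for a crawler, here is a small sketch using Python's standard `urllib.robotparser`. The slide does not spell out the exact harvesting policy, so the `obey_robots` switch, the `archive-bot` user agent and the example URL are assumptions made for illustration only.

```python
from urllib.robotparser import RobotFileParser

# A maximally restrictive robots.txt: every crawler is excluded from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def should_harvest(robots_txt: str, user_agent: str, url: str, obey_robots: bool) -> bool:
    """Hypothetical policy switch: an archive may be allowed to disregard robots.txt."""
    if not obey_robots:
        return True
    return allowed_by_robots(robots_txt, user_agent, url)
```

With `obey_robots=True` a site publishing the file above would not be archived at all, which is why the treatment of robots.txt matters for a legal deposit archive.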

Legal deposit law 4
- In the end led to funding of Netarchive.dk
- Virtual centre in cooperation between
  - The Royal Library, Copenhagen
  - The State & University Library, Aarhus
- Implementing a complete system
- Running the archiving on a daily basis
  - Currently with an annual budget of [...] euros
- Involving 15 people from the two libraries
  - 4.5 man-years of manpower

The 3 collection strategies
- Illustrated by coverage over time
- Amount of data collected so far
  - Snapshots: 61 TB (6 times)
  - Selective harvests: 9.5 TB (80 web sites)
  - Event harvests: 5.6 TB (9 events)

NetarchiveSuite software package
- We needed a curator tool ready by July 1st 2005
  - Requirement number 1: operated by librarians
- With the web interface librarians can:
  - Define harvests (all three types)
    - Based on quite simple settings + a number of different predefined Heritrix setups
  - Do quality control
    - Looking at harvest results (simple reports and statistics)
    - Browsing through harvested material
  - Automated pickup of missing URIs
- NetarchiveSuite was released as Open Source in July 2007
  - Currently used by a number of national libraries

Snapshot harvesting
- The .dk TLD currently holds > [...] active domains
  - We encountered around [...] Danish domains outside the .dk TLD
    - By extracting links from the entire .dk web space and checking the country code by IP number (GeoIP)
    - By doing Google searches on Danish localities (city names...)
- With 8 machines we can do
  - One complete snapshot (including deduplication) at 20 TB in 80 days
  - Deduplication saves around 30% of the storage space
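The idea behind deduplication between snapshots can be sketched as follows: a document is only stored again if its content digest differs from the one recorded for the same URL in the previous snapshot. This is a simplified illustration under assumed names (`store_snapshot`, `previous_index`), not the actual Heritrix or NetarchiveSuite deduplication code.

```python
import hashlib
from typing import Dict, Tuple


def digest(payload: bytes) -> str:
    """Content digest used to recognise unchanged documents."""
    return hashlib.sha1(payload).hexdigest()


def store_snapshot(
    fetched: Dict[str, bytes], previous_index: Dict[str, str]
) -> Tuple[Dict[str, str], int, int]:
    """Return (new index, bytes stored, bytes skipped as duplicates)."""
    new_index: Dict[str, str] = {}
    stored_bytes = 0
    skipped_bytes = 0
    for url, payload in fetched.items():
        d = digest(payload)
        new_index[url] = d
        if previous_index.get(url) == d:
            # Unchanged since the previous snapshot: keep only a reference.
            skipped_bytes += len(payload)
        else:
            stored_bytes += len(payload)
    return new_index, stored_bytes, skipped_bytes
```

As a side calculation on the figures above: 20 TB in 80 days is 0.25 TB per day, or roughly 31 GB per day per machine if the load is spread evenly over the 8 machines.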

Selective harvesting
- Archiving of 80 selected websites
  - News sites
  - "Typical" dynamic and heavily used sites representing civic society, the commercial sector and public authorities
  - Experimental and/or unique sites, documenting new ways of using the web (e.g. net art)
  - Harvested much more frequently
    - From weekly to several times per day

Event harvesting
- Combining the other two strategies
  - Taking a larger number of sites ([...])
  - On a more frequent basis (daily / weekly)
  - In a shorter period of time
- We have done 9 event harvests so far
  - Elections, different national events
- We have pre-defined some harvest definitions, especially for news sites (both local and national)
  - With one click we can start these if a sudden event should happen, to ensure collection of important sites from the very beginning
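A pre-defined harvest definition that can be started "with one click" might be modelled roughly as below. This is a hypothetical data structure for illustration: `HarvestDefinition`, `schedule`, the seed URLs and the dates are all invented here and do not reflect NetarchiveSuite's real data model.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Tuple


@dataclass(frozen=True)
class HarvestDefinition:
    """Hypothetical template prepared in advance by a curator."""
    name: str
    seeds: Tuple[str, ...]
    interval_days: int  # 1 = daily, 7 = weekly


def schedule(defn: HarvestDefinition, start: date, duration_days: int) -> List[date]:
    """Dates on which the harvest runs once the definition is activated."""
    return [
        start + timedelta(days=offset)
        for offset in range(0, duration_days, defn.interval_days)
    ]


# Example template: national news sites harvested daily (seed URLs are placeholders).
NEWS_DAILY = HarvestDefinition(
    name="national-news",
    seeds=("http://news-one.example.dk/", "http://news-two.example.dk/"),
    interval_days=1,
)
```

Activating the template for a sudden event then amounts to one call, for example `schedule(NEWS_DAILY, date(2008, 1, 1), 14)` for two weeks of daily harvests.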

Challenges in snapshot harvesting
- Number of domains is constantly growing
  - 2005: [...] domains, 480,000 active
  - 2008: [...] domains, [...] active
- Domains are growing bigger and bigger
  - Audio/video is getting more and more popular
  - Sites larger than 10 MB increased from [...] to [...]
  - Sites larger than 500 MB increased from [...] to [...]
- Web 2.0 makes harvesting difficult
  - Web material is inlined from other web sites, from all over the world
    - The border of a web site is disappearing
  - The web is becoming more and more dynamic: Flash / Ajax
- The amount of traps and spam grows constantly
  - In Denmark librarians manually inspect all websites larger than 1 GB
    - Currently over 3000 domains
    - They identify aliases and potential crawler traps
  - That task should be (semi-)automated
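One possible first step towards semi-automating that inspection is sketched below: flag domains above the size threshold, and group domains whose front pages are byte-identical as alias candidates. The `DomainStats` record, the `front_page_digest` field and the decimal reading of 1 GB are assumptions for illustration; the slides do not describe how such automation would actually work.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple

GB = 10**9  # assumed decimal gigabyte


class DomainStats(NamedTuple):
    """Hypothetical per-domain summary taken from a harvest report."""
    name: str
    harvested_bytes: int
    front_page_digest: str


def needs_review(domains: Iterable[DomainStats], threshold: int = GB) -> List[str]:
    """Domains larger than the threshold, which a librarian would inspect."""
    return sorted(d.name for d in domains if d.harvested_bytes > threshold)


def alias_groups(domains: Iterable[DomainStats]) -> List[List[str]]:
    """Groups of domains with identical front pages: candidates for being aliases."""
    by_digest: Dict[str, List[str]] = defaultdict(list)
    for d in domains:
        by_digest[d.front_page_digest].append(d.name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]
```

Identical front pages are only a hint, so the output is best treated as a shortlist for the librarians rather than a decision.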

Snapshot harvesting usefulness
- With snapshot harvesting a web archive ensures cultural heritage by
  - Archiving regular "pictures" of entire national parts of the internet
  - Archiving as much as possible in a quite cheap way
    - Netarchive.dk: storage space and 15 hours per week for librarians
- Snapshots are very useful for research in many different areas
  - Linguistics
  - Web technologies
  - File formats and their evolution
  - Web design
  - Genealogy / ancestor search
  - Web site history
  - And many, many more, to be defined in the future
- And of course useful for more ordinary users wanting
  - To find content that has disappeared from the live web ([...] days lifetime)
    - Getting more and more interesting over time
- Currently access to Netarchive.dk is limited to researchers

Future work
- Automating discovery of Danish web sites outside the .dk TLD
- Automated quality assurance for large crawls
- Automating filtering of web spam and traps
- Improving archiving of web 2.0
  - Dynamic web content
  - Streaming audio/video
- None of these problems are Danish
  - Let's solve them together
  - LIWA: European project working on most of these problems
- Danish challenges
  - Working for better access possibilities
    - On the system level: Wayback Machine / NutchWAX search
    - On the political level: change of law

Questions?