1
Workshop on Web Archiving
MODULE 2: EXISTING WEB ARCHIVES
Janne Nielsen, Asger Harlung, Ulrich Karstoft Have
2
Module 2: Existing Web Collections
Introduction to web archives
The Danish Netarkivet
Internet Archive
Library of Congress
Other (US) web archives
Ideas for NetLab workspace
3
Introduction to Web Archives
Focus on:
  The collection, including strategies
  Access
  Search
  Documentation
4
Netarkivet
  The collection, including strategies
  Access
  Search
  Documentation
5
Netarkivet is run by the State and University Library (Aarhus) and the Royal Library (National Library of Denmark, Copenhagen).
The Danish part of the Internet is defined as cultural heritage in the Legal Deposit Act (Act no of ), effective from June 1st, 2005.
The “Danish part of the Internet” = all Internet content in Danish or meant for Danes:
  the top level domain .dk
  danica (e.g. sites in Danish or addressing Danes on other domains such as .com, .eu, .nu, etc.)
.dk domain names: in July 2005, in January 2013
Dead .dk domains from July 2005 to January 2013:
2011: Roughly 222 TB; 6 m objects; most common file types are html, jpeg, gif and png
2013: Most common file types are html, jpeg, pdf and mp4 (video)
2014: On July 27 the data in Netarkivet amounted to 501 TB
2015: On November 15 the data comprised 654 TB
6
From http://netarkivet.dk/om-netarkivet
Netarkivet 2005
Strategies:
  Broad/bulk
  Selective
  Event
  Special
[Figure: diagram of harvesting strategies with the labels Broad, Selective, Event, Coverage and Time]
7
Netarkivet: Access
Access is restricted to:
  researchers (online)
  thesis students (on-site)
No one else can get access.
8
Netarkivet: Search options
Single URL search using the Wayback interface
11
Netarkivet: Search options
Single URL search using the Wayback interface
Free text search
NetLab is working on:
  multiple URL search
  file type search
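To illustrate what multiple URL search and file type search could look like in practice, here is a minimal sketch in Python. It assumes a hypothetical list of capture records with `url`, `timestamp` and `mime` fields; the record layout, the `search` function and the sample data are illustrative assumptions, not Netarkivet's actual index or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capture:
    """One archived version of a URL (hypothetical record layout)."""
    url: str
    timestamp: str  # 14-digit harvest time, YYYYMMDDhhmmss
    mime: str       # MIME type recorded at harvest time


def search(captures, urls=None, mime_prefixes=None):
    """Return captures matching any given URL and any given MIME prefix.

    A filter that is left as None is not applied.
    """
    wanted_urls = set(urls) if urls is not None else None
    prefixes = tuple(mime_prefixes) if mime_prefixes is not None else None
    hits = []
    for cap in captures:
        if wanted_urls is not None and cap.url not in wanted_urls:
            continue
        if prefixes is not None and not cap.mime.startswith(prefixes):
            continue
        hits.append(cap)
    return hits


# Made-up sample data for this sketch.
SAMPLE = [
    Capture("http://example.dk/", "20110301120000", "text/html"),
    Capture("http://example.dk/logo.png", "20110301120005", "image/png"),
    Capture("http://example.com/", "20130615080000", "text/html"),
    Capture("http://example.com/report.pdf", "20130615080010", "application/pdf"),
]

# Multiple URL search: all versions of two URLs in one query.
front_pages = search(SAMPLE, urls=["http://example.dk/", "http://example.com/"])

# File type search: every image, regardless of URL.
images = search(SAMPLE, mime_prefixes=["image/"])
```

Both filters can be combined in a single call, which is the kind of narrowing a researcher would need when building a corpus from many URLs.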
12
Netarkivet: Documentation
Manual documentation:
  At collection level (netarkivet.dk, Word document)
  Curators (wiki)
Automated documentation:
  Harvesting data (metadata)
  Crawl logs, but not accessible yet
13
Internet Archive
  The collection, including strategies
  Access
  Search
  Documentation
14
The Internet Archive:
  American non-profit, operating since 1996
  not based on national legislation
  in general based on cumulative archiving, following hyperlinks from what was already archived
  the world's largest collection of archived web content
  more than 491 billion web pages; collects approx. 1 billion pages per week
  quality is erratic: often only the top level(s) of a site are captured
  heterogeneous collection with no overall strategy, including donations…
15
Internet Archive: Access
Free online access for everyone
16
Internet Archive: Search
Search for individual URLs, displayed via Open Wayback interface
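A single URL search ends in a replay address. In the Internet Archive's public Wayback Machine that address follows the pattern `https://web.archive.org/web/<timestamp>/<original URL>`, where the timestamp is written as YYYYMMDDhhmmss. The sketch below only builds such an address; the function name `replay_url` is our own, and the example date is arbitrary. When no capture exists at exactly that time, the Wayback Machine normally redirects to the closest one.

```python
from datetime import datetime

WAYBACK_PREFIX = "https://web.archive.org/web"


def replay_url(original_url: str, when: datetime) -> str:
    """Build a Wayback Machine replay address for a URL at a given time."""
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14 digits: YYYYMMDDhhmmss
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


# Arbitrary example: ask for netarkivet.dk as it looked around mid-January 2013.
address = replay_url("http://netarkivet.dk/", datetime(2013, 1, 15, 12, 0, 0))
print(address)
```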
19
Internet Archive: Documentation
No accessible documentation for the URL except harvest time
General documentation about how the Internet Archive harvests (FAQ)
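Since the harvest time is the only per-URL documentation, it is worth knowing that it can be read directly from a Wayback-style replay address, where it appears as a 14-digit segment. The following is a minimal sketch under that assumption; `harvest_time` is a hypothetical helper name, and the example address is made up for illustration.

```python
import re
from datetime import datetime

# Matches the 14-digit timestamp segment of a Wayback-style replay address,
# optionally followed by a two-letter modifier such as "id_" or "im_".
_TIMESTAMP = re.compile(r"/web/(\d{14})(?:[a-z]{2}_)?/")


def harvest_time(replay_address: str) -> datetime:
    """Extract the harvest time from a replay address."""
    match = _TIMESTAMP.search(replay_address)
    if match is None:
        raise ValueError(f"no 14-digit timestamp found in {replay_address!r}")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


when = harvest_time("https://web.archive.org/web/20130115120000/http://netarkivet.dk/")
print(when.isoformat())
```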
20
Exercise in Web Archives
Open Internet Archive on
Find one or more websites in the Wayback Machine.
Move around on the website by clicking hyperlinks.
Are elements missing, or do you notice anything else?
If you have access to Netarkivet, you can choose to do the exercise in Netarkivet:
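When elements are missing from an archived page, the cause is often that embedded resources (images, stylesheets, scripts) were not harvested together with the HTML. As a rough aid for the exercise, the sketch below lists the embedded resources a page refers to, so each can be looked up in the archive. It works on an HTML string and makes no network requests; the tag list, the names and the sample page are assumptions chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags that pull content into the page, and the attribute holding the address.
# Note: "link" also covers non-embedded relations such as rel="canonical".
EMBED_ATTRS = {"img": "src", "script": "src", "link": "href",
               "iframe": "src", "source": "src"}


class EmbedCollector(HTMLParser):
    """Collect absolute URLs of resources embedded in an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = EMBED_ATTRS.get(tag)
        if wanted is None:
            return  # ordinary hyperlinks (<a href>) are not embedded content
        for name, value in attrs:
            if name == wanted and value:
                self.resources.append(urljoin(self.base_url, value))


def embedded_resources(html, base_url):
    collector = EmbedCollector(base_url)
    collector.feed(html)
    return collector.resources


# Made-up sample page.
PAGE = (
    '<html><head><link rel="stylesheet" href="/style.css"></head>'
    '<body><img src="logo.png"><a href="/about">About</a>'
    '<script src="http://cdn.example.org/app.js"></script></body></html>'
)
found = embedded_resources(PAGE, "http://example.dk/index.html")
```

Each address in `found` is a candidate for a missing element: if the archive holds no capture of it near the page's harvest time, the replayed page will show a gap.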
21
Funny observations?
22
Internet Archive
Archive-It: the Internet Archive's subscription web archiving service
  A number of collections from their partners, including event collections
  Full-text searchable
Archive-It Research Services (ARS): provides access to data sets extracted from collections (metadata, link graphs, named entities, other data).
23
Library of Congress
  The collection, including strategies
  Access
  Search
  Documentation
24
Library of Congress web archive:
  from 2000
  curated, topic-based and selective collections
  harvested by the Internet Archive (not Archive-It)
  763 TB
25
Library of Congress: Access
Free online access for everyone, via LoC Wayback
In many cases only a ‘flat’ image
26
Library of Congress: Search
Search for individual URLs, displayed via Open Wayback interface
Full-text search in metadata
27
Library of Congress: Documentation
Very well documented and curated
Documentation about each collection, and about each website
28
Other (US) Web Archives
29
Other Web Archives
IIPC Member Archives
List of Web archiving initiatives
Truman, G. (2016). Web Archiving Environmental Scan. Harvard Library Report.
30
Ideas for NetLab workspace
31
The Four Phases in Research
Phases: Corpus creation, Analysis, Dissemination, Storage
Steps: Search, Duplicates, Select, Isolate, Identify, Evaluate, Select/remove/combine
34
Ideas for NetLab Workspace
Challenges:
  Large amounts of data
  How to distinguish between the many versions?
  No visual representation
Needs:
  Different ways of filtering content
  Choosing and ‘bookmarking’ pages
  Isolation/extraction of corpus
  Flexible interface to present different metadata
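One possible way to approach the "many versions" challenge is to collapse consecutive captures whose content is identical, so that only the versions where something changed remain. The sketch below assumes captures are available as (timestamp, content) pairs and uses a SHA-1 hash as the content fingerprint; the data layout, the function name and the choice of hash are assumptions for illustration, not a description of how any of the archives above work.

```python
import hashlib


def distinct_versions(captures):
    """Keep only captures whose content differs from the previous capture.

    `captures` is an iterable of (timestamp, content_bytes) pairs, where the
    timestamp is a sortable string such as YYYYMMDDhhmmss.
    Returns a list of (timestamp, digest) pairs in chronological order.
    """
    kept = []
    last_digest = None
    for timestamp, content in sorted(captures, key=lambda c: c[0]):
        digest = hashlib.sha1(content).hexdigest()
        if digest != last_digest:
            kept.append((timestamp, digest))
            last_digest = digest
    return kept


# Made-up sample: four captures of one URL, two of them consecutive duplicates.
CAPTURES = [
    ("20110301120000", b"<p>version 1</p>"),
    ("20110601120000", b"<p>version 1</p>"),  # unchanged, will be dropped
    ("20120101120000", b"<p>version 2</p>"),
    ("20120601120000", b"<p>version 1</p>"),  # changed back, so it is kept
]
versions = distinct_versions(CAPTURES)
```

Only consecutive duplicates are dropped, so a page that changes and later reverts still shows up as three distinct points in time. Exact hashing also treats any byte-level difference, such as a changed date stamp in the footer, as a new version.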
35
Inspiration: LARM.fm
36
Inspiration: Trello
37
Inspiration: Papers 2
38
Ideas for NetLab Workspace