Published by Janis Joseph. Modified over 6 years ago.
1
Documentation as part of curation in web archiving.
Word documents, extended fields and wikis

My name is…. Working with Netarkivet from the very beginning in 2005 as a curator (half of my time). Furthermore: a broad range of dissemination activities in the audiovisual department of SB (among others Europeana Sounds).

Proposal IIPC GA

Individual presentations can be a maximum of 20 mins. A panel session can be a maximum of 60 minutes with 2 or more presentations on a topic. A discussion session should include one or more introductory statements followed by a moderated discussion. Workshops can be up to a half-day in length; please include details on the proposed structure, content, and target audience.

Documentation as part of curation in web archiving. From Word documents to a service layer for users

As the national Danish web archive, the Netarchive crawls and archives millions of URLs each year. The curators use NetarchiveSuite (NAS), a tool suited to curating both huge (broad) crawls and selective crawls (ongoing selective crawls and event crawls). In the selective crawls, URLs from the selected domains are crawled on many different schedules and at different levels. How do we keep track of all choices and decisions? The tool allows annotations of the crawls, but there is not enough space to document our rather differentiated information: why do we crawl URLs from a given domain, but not from another one?

We have established a workflow for the selective crawls, the URLs to crawl and how to crawl them. This rather complex information is relevant not only for the curators, who need to keep track of a given domain in the crawl workflow, but is also essential for users. The first documentation of the selective crawl history consisted of Word documents in a folder system that mirrored the workflow. A big improvement was the migration of the documentation from the folder system to a “Media wiki”.

But the “Media wiki” is a tool about 10 years old and, as the documentation grew, so did the challenge of keeping it well structured. The curators decided to migrate to a new, more sophisticated platform, atlassian.com. We use both the issue tracker, for the selective crawl workflow and backlogs, and the wiki, for supplementary internal and external documentation. It was essential for the choice that we had the possibility to develop a service layer for our users. This service layer picks out the information addressed to users and makes it accessible simultaneously with the wayback access to the archive.

NAS Workshop, Vienna, April 2017 Sabine
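The service layer idea can be pictured with a minimal sketch. It is not the Netarchive implementation: the record layout, the field names (`user_note`, `internal_note`), the function `user_facing_note` and the example domain are all assumptions made for illustration. The sketch only shows the principle of handing the user-addressed part of a domain's documentation to whoever is viewing an archived URL, while curator-only notes stay internal.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class DomainDoc:
    """Hypothetical documentation record for one selectively crawled domain."""
    domain: str
    user_note: str      # text addressed to archive users
    internal_note: str  # curator-only text, never exposed by the service layer


def user_facing_note(url: str, docs: dict[str, DomainDoc]) -> Optional[str]:
    """Return the user-addressed note for the domain of `url`, or None."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.example.dk and example.dk as the same domain
    doc = docs.get(host)
    return doc.user_note if doc else None


# Illustrative data only; example.dk is a placeholder domain.
docs = {
    "example.dk": DomainDoc(
        domain="example.dk",
        user_note="Crawled selectively; front page daily.",
        internal_note="Login details kept elsewhere.",
    )
}

note = user_facing_note("http://www.example.dk/news", docs)
```

A wayback-style viewer could call such a lookup with the URL being displayed and show the returned note next to the archived page.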
2
What about documentation?
How it all began
In-house developed tool
Curators had no influence on the look and feel and functionality of the tool
A tool for you
What about documentation?
3
Collection strategies
Broad crawls: 4 times a year
To reflect the Danish part of the internet over time
Selective crawls: af /2016ff ca. 200 domains
Event crawls (KB/SB)
E.g. the 2015 parliamentary elections, the European refugee crisis (events that boost the activity on social media and news media pages)
4
The curator tool NetarchiveSuite (NAS)
Great tool for huge/national web archives
Cannot meet the curators' needs for documentation (especially for selective crawls)
5
Workflow, selective crawls
6
Internal needs
Overview of the domains chosen for selective crawls
Info on why and how
History and state of the crawls: date of last analysis and last
Who is working on which domain
Overview of domains ever crawled selectively, still crawled selectively, and rejected for selective crawls, and for what reason
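The kind of overview described here can be sketched in a few lines. The records, the status names (`active`, `stopped`, `rejected`) and the function `overview_by_status` are hypothetical and chosen only to illustrate grouping domains by their state in the selective crawl workflow, together with the responsible curator and the reason.

```python
from collections import defaultdict

# Hypothetical records: (domain, status, curator, reason)
records = [
    ("example-news.dk", "active", "curator-a", "daily news coverage"),
    ("example-blog.dk", "rejected", "curator-b", "covered well enough by broad crawls"),
    ("example-party.dk", "stopped", "curator-a", "site no longer updated"),
]


def overview_by_status(rows):
    """Group domain records by workflow status, keeping curator and reason."""
    grouped = defaultdict(list)
    for domain, status, curator, reason in rows:
        grouped[status].append((domain, curator, reason))
    return dict(grouped)


overview = overview_by_status(records)
```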
7
External users' needs
Documentation along with the collections.
Selective crawls of which domains?
When (period)?
How (depth, frequency)?
Why or why not?
Full-text search does not suffice
Why can they find "this but not that"?
Service layer
8
Version 1: folder system (Windows)
A Word document for each domain to be crawled selectively
9
Version 2: folder system moves to a wiki
Powered by MediaWiki.org
10
Constraints
Curators only
No possibility to restrict access to parts of the wiki space
Sensitive data/information
Established for internal use
11
Version 3 (not implemented): Extended Fields
Created by ONB
Goal: to gather all documentation in NAS
Login information
Type of harvest
When (period)
How (depth, frequency)?
Why or why not?
etc.
NAS
Andreas did a great job making extended fields work, but in the end, when we implemented them in our test environment, something went wrong. We never dared implement them in our production environment.
12
Extended Fields
13
Solution/Version 4: Atlassian JIRA/Confluence
Flexible tool
"Netarchive Selective Crawls" created as a project (with issue tracker), designed individually according to the workflow
14
Solution/Version 4: Atlassian JIRA/Confluence
Information can be extracted to dissemination systems (service layer for users)
A more effective internal tool
From manual to automated registration of tokens, dates, activities…
Different overview displays; search and filtering are feasible
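As an illustration of the filtering mentioned above, the following sketch lists domains whose last analysis lies more than roughly six months back. It does not use any JIRA API; the dictionary of dates, the function `stale_domains` and the 183-day threshold are assumptions made for the example.

```python
from datetime import date, timedelta


def stale_domains(last_analysis, today, max_age_days=183):
    """Return domains whose last analysis is older than `max_age_days`."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(d for d, when in last_analysis.items() if when < cutoff)


# Hypothetical data: domain -> date of the last analysis
last_analysis = {
    "example-news.dk": date(2017, 3, 1),
    "example-blog.dk": date(2016, 6, 15),
}

overdue = stale_domains(last_analysis, today=date(2017, 4, 1))
```

In an issue tracker the same selection would normally be a saved filter rather than code; the sketch only makes the selection rule explicit.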
15
Solution/Version 4: Atlassian JIRA
16
Solution: Atlassian Confluence
Overall documentation
18
Links
Old workflow:
New workflow:
Various extraction options
More than 6 months ago:
Guide
Only for myself
IIPC GA 2016, Reykjavik
19
Links (2)
Various extraction options
More than 6 months ago:
Issue panel:
Guide