Documentation as part of curation in web archiving.

Presentation transcript:

Documentation as part of curation in web archiving. Word documents, extended fields and wikis

My name is…. Working with Netarkivet from the very beginning in 2005 as a curator (half of my time). Furthermore: a broad range of dissemination activities in the audiovisual department of SB (among others Europeana Sounds).

Proposal, IIPC GA. Individual presentations can be a maximum of 20 minutes. A panel session can be a maximum of 60 minutes with 2 or more presentations on a topic. A discussion session should include one or more introductory statements followed by a moderated discussion. Workshops can be up to a half-day in length; please include details on the proposed structure, content, and target audience.

Documentation as part of curation in web archiving. From Word documents to a service layer for users

As the national Danish web archive, the Netarchive crawls and archives millions of URLs each year. The curators use NetarchiveSuite (NAS), a tool suited to curating both huge (broad) crawls and selective crawls (ongoing selective crawls and event crawls). In the selective crawls, URLs from the selected domains are crawled on many different schedules and at different levels. How do we keep track of all choices and decisions? The tool allows annotations on the crawls, but there is not enough space to document our rather differentiated information: why do we crawl URLs from a given domain, but not from another one? We have established a workflow for the selective crawls that covers which URLs to crawl and how to crawl them. This rather complex information is not only relevant for the curators, who need to keep track of a given domain in the crawl workflow; it is also essential for users. The first documentation of the selective crawl history consisted of Word documents in a folder system that mirrored the workflow. A big improvement was the migration of the documentation from the folder system to a MediaWiki wiki.

But MediaWiki is a tool about 10 years old and, as the documentation grew, so did the challenge of keeping it well structured. The curators decided to migrate to a new, more sophisticated platform from atlassian.com. We use both JIRA (https://www.atlassian.com/software/jira) for the selective crawl workflow and backlogs, and the wiki Confluence (https://www.atlassian.com/software/confluence) for supplementary internal and external documentation. It was essential for the choice that we could develop a service layer for our users. This service layer picks out the information addressed to users and makes it accessible alongside the Wayback access to the archive.

NAS Workshop, Vienna, April 2017 Sabine

What about documentation?
How it all began
In-house developed tool
Curators had no influence on the look and feel and functionality of the tool
A tool for you
What about documentation?

Collection strategies
Broad crawls: 4 times a year, to reflect the Danish part of the internet over time
Selective crawls: 80-100 domains; from 2016 onwards, ca. 200 domains
Event crawls (KB/SB): e.g. the 2015 parliamentary elections, the European refugee crisis (events that boost the activity on social media and news media pages)

The curator tool NetarchiveSuite (NAS)
Great tool for huge/national web archives
Cannot meet the curators' needs for documentation (especially for selective crawls)

Workflow, selective crawls

Internal needs
Overview of the domains chosen for selective crawls
Info on why and how
History and state of the crawls: date of last analysis and last
Who is working on which domain
Overview of domains ever crawled selectively, still crawled selectively, and rejected for selective crawls, and for what reason

External users' needs
Documentation along with the collections
Selective crawls of which domains?
When (period)?
How (depth, frequency)?
Why or why not?
Full-text search does not suffice
Why can they find “this but not that”?
Service layer
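The internal and external needs listed above suggest one record per domain, of which only some fields are meant for users. The following is a minimal Python sketch of that idea. All field names and the choice of which fields count as public are assumptions for illustration; none of them are taken from NetarchiveSuite, JIRA or Confluence.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import Optional


# Hypothetical record for one domain in the selective crawl workflow.
# Field names are illustrative assumptions, not NAS or JIRA fields.
@dataclass
class SelectiveCrawlRecord:
    domain: str
    status: str                    # e.g. "crawled" or "rejected"
    reason: str                    # why or why not
    depth: str                     # how: crawl depth
    frequency: str                 # how: crawl frequency
    period_start: Optional[date]   # when: start of the crawl period
    period_end: Optional[date]     # when: end of the period, None if ongoing
    curator: str                   # internal: who is working on the domain
    login_info: str                # internal: sensitive information


# Fields assumed to be safe to show in a user-facing service layer.
PUBLIC_FIELDS = (
    "domain", "status", "reason", "depth",
    "frequency", "period_start", "period_end",
)


def public_view(record: SelectiveCrawlRecord) -> dict:
    """Return only the fields addressed to external users."""
    data = asdict(record)
    return {name: data[name] for name in PUBLIC_FIELDS}
```

Calling `public_view` on a record yields a dictionary without `curator` and `login_info`, so the same record could serve both the internal overview and the documentation shown to users.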

Version 1: folder system (Windows)
A Word document for each domain to be crawled selectively

Version 2: folder system moves to a wiki
Powered by MediaWiki.org

Constraints
Curators only
No possibility to restrict access to parts of the wiki space
Sensitive data/information
Established for internal use

Version 3 (not implemented): Extended Fields
Created by ONB
Goal: to gather all documentation in NAS
Login information
Type of harvest
When (period)?
How (depth, frequency)?
Why or why not?
etc.
Andreas did a great job making extended fields work in NAS, but when we finally implemented them in our test environment, something went wrong. We never dared to implement them in our production environment.

Extended Fields

Solution/Version 4: Atlassian JIRA/Confluence
Flexible tool
“Netarchive Selective Crawls” created as a project (with issue tracker)
Designed individually according to the workflow
https://www.atlassian.com/

Solution/Version 4: Atlassian JIRA/Confluence
Information can be extracted to dissemination systems (service layer for users)
A more effective internal tool
From manual to automated registration of tokens, dates, activities…
Different overview displays; search and filtering are feasible
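One filter of this kind would list the domains that have not been analysed for roughly six months. The Python sketch below shows the logic only. The function name, the input mapping from domain to date of last analysis, and the 183-day threshold are assumptions for illustration; in practice the dates would come from the issue tracker, which this sketch does not query.

```python
from datetime import date, timedelta


def overdue_domains(last_analysed, today, max_age_days=183):
    """Return domains whose last analysis is older than max_age_days.

    last_analysed maps a domain name to the date of its last analysis
    (hypothetical input). The result is sorted oldest first.
    """
    cutoff = today - timedelta(days=max_age_days)
    stale = [(domain, when) for domain, when in last_analysed.items() if when < cutoff]
    stale.sort(key=lambda item: item[1])
    return [domain for domain, _ in stale]
```

Sorting oldest first puts the domains that most need a curator's attention at the top of the overview.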

Solution/Version 4: Atlassian JIRA

Solution: Atlassian Confluence
Overall documentation

Links
Old workflow: https://netarkivet.statsbiblioteket.dk/netarkivet/index.php?title=Workflow_for_selektive_h%C3%B8stninger
New workflow: https://sbprojects.statsbiblioteket.dk/jira/secure/RapidBoard.jspa?rapidView=25&selectedIssue=NETSH-15
Various extraction options
More than 6 months ago: https://sbprojects.statsbiblioteket.dk/pages/viewpage.action?pageId=15994104
Guide: https://sbprojects.statsbiblioteket.dk/pages/viewpage.action?pageId=15994104
For myself only
IIPC GA 2016, Reykjavik

Links (2)
Various extraction options
More than 6 months ago: https://sbprojects.statsbiblioteket.dk/pages/viewpage.action?pageId=15994104
Issue panel: https://sbprojects.statsbiblioteket.dk/jira/browse/NETSH/?selectedTab=com.atlassian.jira.jira-projects-plugin:issues-panel
Guide: https://sbprojects.statsbiblioteket.dk/pages/viewpage.action?pageId=15994104