Web archive data and researchers’ needs: how might we meet them?

Presentation transcript:

Web archive data and researchers’ needs: how might we meet them?
Title slide notes: I want to explore, as a provocation, the issues we have around operating at scale with imperfect technologies: how we humans occasionally need to intervene to produce acceptable outcomes, both in producing a decent archive of each website and in serving our users well. But where is the line, and why, when and how should we think about revealing these interventions to users, past and future?
Tom Storrar, 7 September 2018

UK Government Web Archive
An open web archive of the UK Government web estate, 1996–present: nationalarchives.gov.uk/webarchive
Selective, domain-based (e.g. justice.gov.uk, iraqinquiry.org.uk) and social media archiving, with an emphasis on quality and completeness.
UKGWA is growing at ~20 TB per year, with many users online and growing research use.
Slide 1 notes: We are a selective web archive covering UK government, with no limit on depth within a target host. This means that even for what may appear to be a smaller domain of interest, crawls of single websites often run to over 1 million URLs and 100 GB; some sites, left to their own devices, would never finish or would crawl the entire web! Web archiving is imperfect: technical limitations are exposed by website design (POST, AJAX etc.), crawler traps (traditional settings are deployed to avoid these) and other problematic functionality such as filter and sort links. The space is more heterogeneous than you might imagine, with a diversity of technologies; it is the web, after all, so from a web archiving perspective we have the Good, the Bad and the Ugly. Poorly configured websites, e.g. extremely slow ones, still have to be archived, and we often don't have weeks to perform the capture. Most of the websites we capture fall into the Good category, but others (maybe 10%) require intervention beyond standard approaches.
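The interventions mentioned in the notes above usually take the form of scope rules that reject trap URLs (endless filter/sort permutations, calendars and the like). The Python sketch below is purely illustrative of that idea, with made-up patterns and a hypothetical in_scope helper; it is not the actual crawl configuration used for the UKGWA.

```python
import re

# Illustrative trap patterns only: typical culprits are query-string filter/sort
# permutations, date-based calendar pages and endlessly repeating path segments.
TRAP_PATTERNS = [
    re.compile(r"[?&](sort|order|filter)="),  # filter and sort links
    re.compile(r"/calendar/\d{4}/\d{2}/"),    # calendar pages that page forever
    re.compile(r"(/[^/?]+)\1{3,}"),           # the same path segment repeated many times
]

def in_scope(url: str) -> bool:
    """Reject URLs that match a known trap pattern; accept everything else."""
    return not any(p.search(url) for p in TRAP_PATTERNS)

if __name__ == "__main__":
    for url in [
        "https://example.gov.uk/publications?page=2",
        "https://example.gov.uk/publications?sort=date&filter=closed",
        "https://example.gov.uk/news/news/news/news/item",
    ]:
        print(in_scope(url), url)
```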

UK Government Web Archive – Research Use
We already support research into the collection:
Users can search the entire web archive (full text) from webarchive.nationalarchives.gov.uk/search/
Memento is implemented.
Computationally, the archive can be scraped to produce data for research, using tools such as HTTrack, Scrapy and other scrapers; under arrangement we can give access to raw ARC/WARC files.
Reuse conditions on most of the content are friendly!
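As a minimal sketch of the Memento support, the snippet below performs datetime negotiation with the `requests` library. The TimeGate path is an assumed placeholder, not the documented UKGWA endpoint, so check the service documentation before relying on it.

```python
import requests

TIMEGATE = "https://webarchive.nationalarchives.gov.uk/timegate/"  # assumed path, not the documented endpoint
target = "https://www.gov.uk/"

# Ask the TimeGate for the capture closest to a given datetime.
resp = requests.get(
    TIMEGATE + target,
    headers={"Accept-Datetime": "Fri, 07 Sep 2018 12:00:00 GMT"},
    allow_redirects=False,
)
print(resp.status_code)               # a TimeGate usually answers with a 302 redirect
print(resp.headers.get("Location"))   # URI of the memento closest to the requested datetime
print(resp.headers.get("Link"))       # relations to the original resource and its TimeMap
```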

Research using the web archive
To date this has involved text-based analysis, such as:
Semantic search and natural language processing
N-gram based research
Entity identification and searching through the concept of “generous interfaces”
CSV dataset searching methods

Researcher feedback
Varied, but with some common themes:
Most users need to know that research opportunities exist and must make a formal(-ish) application to get the data (WARC files).
There is a steep learning curve for becoming familiar with WARC files and handling them (e.g. their size when uncompressed).
Pre-processing is required to extract the most useful elements, e.g. URLs and text from valid pages.
Some understanding is required of how web archives work, their limitations and the tools available for research.

Options
To support research, it is possible to create text-only, simplified and manageable content derived from the original source data. We have a few options:
Metadata files, such as Web Archive Transformation (WAT) files
Plain text extracts from web archive data, such as WET files
Crawl logs
A web archive API
But what are they? What do they look like and how can they be used? Examples from commoncrawl.org follow.

Original resource – standard replay
Common Crawl: archived page

Metadata (e.g. WAT) files
Useful for:
Link and network analysis (what links to what?)
Temporal changes – sites × time
…
JSON data, typically only 20–30% of the size of the source WARC file.
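As an illustration of the link-analysis use case, the sketch below walks a WAT file with the warcio library and yields (page, outgoing link) pairs. The JSON key paths follow the Common Crawl WAT layout and the file name is hypothetical; other WAT producers may nest the metadata slightly differently.

```python
import json
from warcio.archiveiterator import ArchiveIterator

def iter_links(wat_path):
    """Yield (page URL, outgoing link URL) pairs from a WAT file."""
    with open(wat_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "metadata":     # WAT records are WARC 'metadata' records
                continue
            data = json.loads(record.content_stream().read())
            envelope = data.get("Envelope", {})
            page_url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
            html_meta = (envelope.get("Payload-Metadata", {})
                                 .get("HTTP-Response-Metadata", {})
                                 .get("HTML-Metadata", {}))
            for link in html_meta.get("Links", []):
                if "url" in link:
                    yield page_url, link["url"]

# Example: build an edge list for link/network analysis.
# for src, dst in iter_links("example.warc.wat.gz"):
#     print(src, "->", dst)
```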

Extracted plain text (e.g. WET) files
Text files with basic resource metadata (URL, date, MIME type). Useful for:
Entity extraction and linguistic analysis
N-grams
…
Typically only 5–15% of the size of the source WARC file.
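A comparable sketch for WET data, again with warcio: the text extracts are stored as WARC 'conversion' records, so counting n-grams (here bigrams) only needs a tokenizer. The file name is hypothetical and the whitespace tokenisation is deliberately naive.

```python
from collections import Counter
from warcio.archiveiterator import ArchiveIterator

def bigram_counts(wet_path, top=20):
    """Return the most common word bigrams across all text records in a WET file."""
    counts = Counter()
    with open(wet_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":   # WET text lives in 'conversion' records
                continue
            text = record.content_stream().read().decode("utf-8", errors="replace")
            tokens = text.lower().split()         # naive whitespace tokenisation
            counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(top)

# for pair, n in bigram_counts("example.warc.wet.gz"):
#     print(n, " ".join(pair))
```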

Crawl logs
Useful as a record of the “decisions” the crawler made during the archiving process, particularly relating to scope and why a resource has, or has not, been archived.
Can help clarify the boundaries between our web archive and those of others.
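A small sketch of how such a log might be summarised, assuming the Heritrix crawl.log layout (timestamp, fetch status, size, URL, discovery path, referrer, MIME type, and so on). The field positions are an assumption to verify against the logs actually produced by the crawler in use.

```python
from collections import Counter

def status_summary(log_path, sample_size=10):
    """Count fetch status codes and collect a few URLs the crawler declined to fetch."""
    statuses, not_fetched = Counter(), []
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 7:                   # skip malformed or truncated lines
                continue
            status, url = fields[1], fields[3]    # assumed Heritrix column order
            statuses[status] += 1
            if status.startswith("-") and len(not_fetched) < sample_size:
                not_fetched.append(url)           # negative codes: out of scope, robots.txt, errors
    return statuses, not_fetched

# counts, skipped = status_summary("crawl.log")
# print(counts.most_common(10), skipped)
```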

Pywb (replay) API
Presents limited summary metadata in CDX format.
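A sketch of querying such an API with `requests`, assuming a pywb-style CDX endpoint; the URL below is a placeholder rather than a documented UKGWA address. With output=json, pywb returns one JSON object per line, giving the capture's URL, timestamp, status and digest.

```python
import json
import requests

CDX_API = "https://webarchive.nationalarchives.gov.uk/cdx"  # assumed endpoint, check the service docs
params = {"url": "www.gov.uk/government/news*", "output": "json", "limit": 50}

resp = requests.get(CDX_API, params=params, timeout=30)
for line in resp.text.splitlines():
    if not line.strip():
        continue
    capture = json.loads(line)                    # one capture record per line
    print(capture.get("timestamp"), capture.get("status"), capture.get("url"))
```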

Example projects
Groups such as Archives Unleashed and researchers using data from Common Crawl, the Internet Archive and national libraries have done some interesting work using web archive data:
1: https://news.archivesunleashed.org/on-cloud-number-9-accessibility-usability-and-functionality-with-the-archives-unleashed-cloud-1dd19ec40873
2: https://wiki.digitalmethods.net/Dmi/Winter13SearchingTheArchive

Things to know and next steps
Data processing of any kind has costs associated with it, and we are interested in reducing the overheads of research into web archive data. Some existing knowledge, or an interest in gaining it, will be required to process the files, but by pre-processing the data we aim to lower the barrier! The first step will be to generate some sample WET and WAT files for particular crawls or groups of crawls.

Next steps
We’d love to hear from anyone who is interested in taking a look at a sample of UKGWA data that has been pre-processed as described here. Are you interested in getting your hands on some data? Let’s talk!

Thank you!
Questions or comments? webarchive@nationalarchives.gov.uk