Web archive data and researchers’ needs: how might we meet them?
Title slide notes: I want to explore, as a provocation, the issues we have around operating at scale with imperfect technologies: how we humans need to intervene to produce acceptable outcomes, both in producing a decent archive of each website and in serving our users well. But where is the line, and why, when and how should we think about revealing these interventions to users, past and future? Tom Storrar, 7 September 2018
UK Government Web Archive
An open web archive of the UK Government web estate, 1996 to present: nationalarchives.gov.uk/webarchive
Selective, domain-based (e.g. justice.gov.uk, iraqinquiry.org.uk) and social media archiving, with an emphasis on quality and completeness
UKGWA is growing at around 20TB per year
Many users online, and growing research use

Speaker notes: We are a selective web archive covering UK government, with no limit on depth within a target host. This means that even an apparently small domain of interest can produce crawls of a single website running to over 1 million URLs and 100GB; some sites, left to their own devices, would never finish or would crawl the entire web! Web archiving is imperfect: technical limitations are exposed by website design (POST, AJAX etc.), crawler traps (traditional settings are deployed to avoid these) and other problematic functionality (e.g. filter and sort links). The space is more heterogeneous than you may imagine, with a real diversity of technologies. It is the web, after all, so from a web archiving perspective we have the Good, the Bad and the Ugly. Poorly configured websites, for example extremely slow ones, still have to be archived, and we often don't have weeks to perform the capture process. Most of the websites we capture fall into the Good category, but others (maybe 10%) require intervention beyond standard approaches.
UK Government Web Archive – Research Use
We already support research into the collection:
Users can search the entire web archive (full text) from webarchive.nationalarchives.gov.uk/search/
Memento is implemented
Computationally, the archive can be scraped to produce data for research, using tools such as HTTrack, Scrapy and other scrapers; under arrangement we can give access to raw ARC/WARC files
Reuse conditions on most of the content are friendly!
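For the Memento support mentioned above, a minimal sketch of listing the captures of a page via a TimeMap might look like the following. The TIMEMAP_TEMPLATE URL is hypothetical (substitute the archive's real TimeMap endpoint), and the link-format parsing is deliberately simplified.

import re
import requests

# Hypothetical TimeMap URL template, for illustration only.
TIMEMAP_TEMPLATE = 'https://example-webarchive.test/timemap/link/{url}'

def list_mementos(target_url):
    """Return (memento_url, datetime) pairs parsed from a link-format TimeMap."""
    resp = requests.get(TIMEMAP_TEMPLATE.format(url=target_url), timeout=30)
    resp.raise_for_status()
    mementos = []
    # Each TimeMap entry looks like: <URI>; rel="memento"; datetime="..."
    for match in re.finditer(r'<([^>]+)>;([^<]*)', resp.text):
        uri, attrs = match.group(1), match.group(2)
        if 'rel="memento"' in attrs:
            dt = re.search(r'datetime="([^"]+)"', attrs)
            mementos.append((uri, dt.group(1) if dt else None))
    return mementos

if __name__ == '__main__':
    for uri, dt in list_mementos('https://www.justice.gov.uk/'):
        print(dt, uri)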
Research using the web archive
To date this has involved text-based analysis, such as:
Semantic search and natural language processing
N-gram based research
Entity identification and searching through the concept of "generous interfaces"
CSV dataset searching methods
Researcher feedback
Varied, but with some common themes:
Most users need to know that research opportunities exist and must make a formal(-ish) application to get the data (WARC files)
There is a steep learning curve in becoming familiar with WARC files and handling them (e.g. the size of WARC files when uncompressed)
Pre-processing is required to extract the most useful elements, e.g. URLs and text from valid pages
Some understanding is required of how web archives work, their limitations and the tools available for research
Options
To support research, it is possible to create text-only, simplified and manageable content derived from the original source data. We have a few options:
Metadata files, such as Web Archive Transformation (WAT) files
Plain text extracts from web archive data, such as WET files
Crawl logs
A web archive API
But what are they? What do they look like and how can they be used? The examples that follow are from commoncrawl.org…
Original resource – standard replay
[Screenshot: an archived page from Common Crawl, shown via standard replay]
Metadata (e.g. WAT) files
Useful for:
Link and network analysis (what links to what?)
Temporal changes (sites over time)
…
WAT files contain JSON data and are typically only 20-30% of the size of the source WARC file.
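To illustrate the link analysis use case, here is a minimal sketch of extracting outbound links from a WAT file with the warcio library. It assumes the Common Crawl-style WAT layout (Envelope -> Payload-Metadata -> HTTP-Response-Metadata -> HTML-Metadata -> Links); other WAT generators may use slightly different field names.

# Requires: pip install warcio
import json
from warcio.archiveiterator import ArchiveIterator

def iter_links(wat_path):
    """Yield (source_url, target_url) pairs from a (possibly gzipped) WAT file."""
    with open(wat_path, 'rb') as fh:
        for record in ArchiveIterator(fh):
            if record.rec_type != 'metadata':
                continue
            data = json.loads(record.content_stream().read())
            envelope = data.get('Envelope', {})
            source = envelope.get('WARC-Header-Metadata', {}).get('WARC-Target-URI')
            html_meta = (envelope.get('Payload-Metadata', {})
                                 .get('HTTP-Response-Metadata', {})
                                 .get('HTML-Metadata', {}))
            for link in html_meta.get('Links', []):
                target = link.get('url')
                if source and target:
                    yield source, target

# Example: count outbound links per page as a first step towards network analysis.
if __name__ == '__main__':
    from collections import Counter
    counts = Counter(src for src, _ in iter_links('example.warc.wat.gz'))
    for url, n in counts.most_common(10):
        print(n, url)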
Extracted Plain Text (e.g. WET) files
Text files with basic resource metadata (URL, date, MIME type). Useful for:
Entity extraction and linguistic analysis
N-grams
…
Typically only 5-15% of the size of the source WARC file.
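As a minimal sketch of working with WET data, the following reads the extracted plain text with the warcio library and builds a simple unigram frequency table (a starting point for n-gram work). It assumes WET files are WARC 'conversion' records, as produced by Common Crawl tooling.

# Requires: pip install warcio
import re
from collections import Counter
from warcio.archiveiterator import ArchiveIterator

def wet_texts(wet_path):
    """Yield (url, text) pairs from a (possibly gzipped) WET file."""
    with open(wet_path, 'rb') as fh:
        for record in ArchiveIterator(fh):
            if record.rec_type != 'conversion':
                continue
            url = record.rec_headers.get_header('WARC-Target-URI')
            text = record.content_stream().read().decode('utf-8', errors='replace')
            yield url, text

if __name__ == '__main__':
    counts = Counter()
    for _, text in wet_texts('example.warc.wet.gz'):
        counts.update(re.findall(r"[a-z']+", text.lower()))
    print(counts.most_common(20))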
Crawl logs
Useful as a record of the "decisions" the crawler made during the archiving process, particularly relating to scope and to why, or why not, a resource was archived. They can also help clarify the boundaries between our web archive and those of others. A minimal sketch of summarising such a log is shown below.
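The sketch assumes a Heritrix-style crawl.log with whitespace-delimited fields in the order timestamp, status, size, URL, discovery path, referrer, MIME type, and so on; other crawlers' log formats, and therefore the field positions used here, will differ.

from collections import Counter

def summarise_crawl_log(log_path):
    """Count fetch statuses and collect URIs that were not archived."""
    statuses = Counter()
    not_archived = []
    with open(log_path, encoding='utf-8', errors='replace') as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 7:
                continue  # skip malformed or truncated lines
            status, url = fields[1], fields[3]
            statuses[status] += 1
            # In Heritrix, negative status codes indicate a URI that was not
            # fetched (e.g. excluded by scope rules or robots.txt).
            if status.startswith('-'):
                not_archived.append(url)
    return statuses, not_archived

if __name__ == '__main__':
    statuses, skipped = summarise_crawl_log('crawl.log')
    print('Fetch status counts:', statuses.most_common())
    print('First few URIs not archived:', skipped[:5])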
Pywb (replay) API
Presents limited summary metadata in CDX format.
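A minimal sketch of querying a pywb CDX endpoint for capture metadata might look like this. The endpoint URL is hypothetical; the parameters (url, output=json, limit) follow the standard pywb CDX server options, which return one JSON object per matching capture.

# Requires: pip install requests
import json
import requests

CDX_ENDPOINT = 'https://example-webarchive.test/cdx'  # hypothetical endpoint URL

def list_captures(target_url, limit=25):
    """Return a list of capture records (dicts) for a target URL."""
    params = {
        'url': target_url,
        'output': 'json',  # one JSON object per line
        'limit': limit,
    }
    resp = requests.get(CDX_ENDPOINT, params=params, timeout=30)
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines() if line.strip()]

if __name__ == '__main__':
    for capture in list_captures('www.justice.gov.uk/'):
        print(capture.get('timestamp'), capture.get('status'), capture.get('url'))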
Example projects
Groups such as Archives Unleashed, and researchers using data from Common Crawl, the Internet Archive and national libraries, have done some interesting work using web archive data…
Things to know and next steps
Data processing of any kind has costs associated with it, and we are interested in reducing the overheads of research into web archive data. Some existing knowledge, or an interest in gaining it, will still be required to process the files, but by pre-processing the data we aim to lower the barrier! The first step will be to generate some sample WET and WAT files for particular crawls or groups of crawls.
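Purely to illustrate the shape of that pre-processing step, here is a crude sketch that derives a WET-style plain-text file from a WARC using warcio. Production WET generation uses dedicated tooling; the naive tag stripping below is an assumption-laden stand-in, not the method we will use.

# Requires: pip install warcio
import re
from io import BytesIO
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

TAG_RE = re.compile(rb'<[^>]+>')  # crude HTML tag stripper, illustration only

def warc_to_wet(warc_path, wet_path):
    """Write plain-text 'conversion' records for each HTML response in a WARC."""
    with open(warc_path, 'rb') as src, open(wet_path, 'wb') as dst:
        writer = WARCWriter(dst, gzip=True)
        for record in ArchiveIterator(src):
            if record.rec_type != 'response':
                continue
            ctype = record.http_headers.get_header('Content-Type', '') if record.http_headers else ''
            if 'html' not in ctype:
                continue
            uri = record.rec_headers.get_header('WARC-Target-URI')
            text = TAG_RE.sub(b' ', record.content_stream().read())
            wet_record = writer.create_warc_record(
                uri, 'conversion',
                payload=BytesIO(text),
                warc_content_type='text/plain')
            writer.write_record(wet_record)

if __name__ == '__main__':
    warc_to_wet('example.warc.gz', 'example.warc.wet.gz')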
Next steps …we’d love to hear from anyone who is interested in taking a look at a sample of UKGWA data that has been pre-processed as described here. Are you interested in getting your hands on some data? Let’s talk!
Thank you! Questions or comments? webarchive@nationalarchives.gov.uk