Www.bl.uk 1 Co-developing access to the UK Web Archive Helen Hockx-Yu Head of Web Archiving, British Library.

Slides:



Advertisements
Similar presentations
An Introduction To Heritrix
Advertisements

Web Archives and Large-Scale Data: Preliminary Techniques for Facilitating Research Nicholas Woodward Latin American Network Information Center
LIFE 3 LIFE 3 : Predicting Long Term Preservation Costs Brian Hole LIFE 3 Project Manager The British Library IFLA conference 27/02/10.
DIGITAL HUMANITIES SUMMER SCHOOL 2011 DIGITAL LIBRARY TECHNOLOGIES AND BEST PRACTICE, PART 1: DECONSTRUCTING DIGITAL LIBRARIES Christine Madsen R&D Project.
Providing collections, tools and services for digital humanities A national library perspective Clément Oury Head of Digital Legal Deposit Bibliothèque.
Google Refine Tutorial April, Sathishwaran.R - 10BM60079 Vijaya Prabhu - 10BM60097 Vinod Gupta School of Management, IIT Kharagpur This Tutorial.
BnF projects and priorities On the collection side – Perform broad and focused crawls with a maximum of 100TB – Set up the legal deposit of ebooks.
Connected Histories Sources for Building British History, Funded under the JISC eContent Capital Programme for 18 months Partners:  Prof. Tim.
Overview of PubWEST Patent and Trademark Depository Library Training Seminar April 2006.
Chapter 2. Slide 1 CULTURAL SUBJECT GATEWAYS CULTURAL SUBJECT GATEWAYS Subject Gateways  Started as links of lists  Continued as Web directories  Culminated.
Introducing Copac Copac is a national catalogue giving access to the merged catalogues of c.50 major libraries and collections in the UK and Ireland Copac.
Features and Uses of a Multilingual Full-Text Electronic Theses and Dissertations (ETDs) System Yin Zhang Kent State University Kyiho Lee, Bumjong You.
IAEA International Atomic Energy Agency INIS Collection Search: Introduction and main features INIS Training Seminar 7-11 October 2013, Vienna Domenico.
Apache Solr at The UK Web Archive Andy Jackson Web Archive Technical Lead.
1 Adaptive Management Portal April
Information Access Douglas W. Oard College of Information Studies and Institute for Advanced Computer Studies Design Understanding.
Key Skills – Library Research David Sowerbutts
Using Social Care Online: an overview Version 1.0 April 2015.
Big UK Domain Data for the Arts and Humanities Jane Winters (Professor of Digital History, Institute of Historical Research) IIPC General Assembly Open.
Enhance legal retrieval applications with an automatically induced knowledge base Ka Kan Lo.
Overview of Search Engines
Wiley Online Library. About Wiley Online Library Wiley Online Library hosts the world's broadest and deepest multidisciplinary collection of online resources.
Online the Library Michaelmas Term 2011 Trinity College Library Dublin 1 1.
1 WEB ARCHIVING IN THE BRITISH LIBRARY John Tuck Head of British Collections February 2004.
The capture and preservation of websites at the National Library of New Zealand Gillian Lee Alexander Turnbull Library.
Preserving webharvests at the National Library of New Zealand Te Puna Mātauranga o Aotearoa Peter McKinney Digital Preservation Policy Analyst National.
School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Sharing of Community Practice through Semantics: A Case Study in Academic.
LIFE 3 LIFE 3 : Predicting Long Term Preservation Costs Brian Hole LIFE 3 Project Manager The British Library KeepIt training course 05/02/10.
Web The Internet Archive. Agenda Brief Introduction to IA Web Archiving Collection Policies and Strategies Key Challenges (opportunities for.
Databases and Library Catalogs Global Index Medicus/Global Health Library PubMed Source Bibliographic Database: International Health and Disability.
Web of Knowledge Service for UK Education April 2007 An Overview Web of Knowledge Support Officer
Web Archives: Interacting with Scholars Helen Hockx-Yu Head of Web Archiving British Library 28 November 2013.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
Internet Information Retrieval Sun Wu. Course Goal To learn the basic concepts and techniques of internet search engines –How to use and evaluate search.
Opening access to UK doctoral theses: the EThOS E-Theses Service 13 August 2014 Sara Gould.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
DNER Architecture Andy Powell 6 March 2001 UKOLN, University of Bath UKOLN is funded by Resource: The Council for.
Webarchivering in het Audiovisuele Domein Web archiving in the audiovisual Domain Julia Vytopil- Nederlands Instituut voor Beeld en Geluid Netherlands.
4 1 SEARCHING THE WEB Using Search Engines and Directories Effectively New Perspectives on THE INTERNET.
Introducing Intute: Social Sciences Your Guide to the Best of the Web.
OULS WISER Humanities E- Books Hilla Wait Colin Cook Philosophy Faculty Library.
An OAI-Compliant Federated Physics Digital Library for the NSDL Department of Computer Science Old Dominion University, Norfolk, VA In Collaboration.
EThOS: Where have we got to and where do we go next? FIL Conference 29 June 2015 Sara Gould.
Faceted browsing for ACL Anthology Praveen Bysani.
SciFinder Demonstration Search Pt 2 Chemistry 137 – Spring 2013 Grace Baysinger Head Librarian & Bibliographer, Swain Chemistry & Chemical Engineering.
ACT : Legal Deposit Annotation and Curation Tool Peter Webster British Library
ETDs in the UK Progress and Challenges Maja Maricevic Head of Higher Education October
JISC/NSF PI Meeting, June Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer.
Search and Access Technologies for Large Scale Web Archives Joseph JaJa, Sangchul Song, and Mike Smorul Institute for Advanced Computer Studies Department.
A Semantic Knowledge Base for the UK Government Web Archive Tom Storrar & Claire Newing Applying records management processes principles to the open government.
IAEA International Atomic Energy Agency INIS Collection Search: Introduction and main features The Role of the International Nuclear Information System.
ELISQ Systems Demonstration Sagnik Ray Choudhury Doha -- May 2015.
WISER Science: SOLO (Search Oxford Libraries Online) Juliet Ralph and Isabel McMann Radcliffe Science Library.
Primo at the British Library Mandy Stewart. 2 About the British Library The British Library is the National Library of the UK It is a world-class.
Use cases for BnF broad crawls Annick Lorthios. 2 Step by step, the first in-house broad crawl The 2010 broad crawl has been performed in-house at the.
The Earth System Curator Metadata Infrastructure for Climate Modeling Rocky Dunlap Georgia Tech.
Searching the Web for academic information Ruth Stubbings.
TDG-Scholar 2.0 Nicolás Amador Griñolo Agustín Domínguez Alvera.
Archiving & Preserving Digital Content
Using Social Care Online: an overview
TRSS Terminology Registry Scoping Study
 Corpus Formation [CFT]  Web Pages Annotation [Web Annotator]  Web sites detection [NEACrawler]  Web pages collection [NEAC]  IE Remote.
Workshop on Web Archiving
Challenges and Opportunities of Archiving the UK Web
OUTLINE Basic ideas of traditional retrieval systems
PubMed Database Interface (Basic Course Module 4 Part A)
Web archive data and researchers’ needs: how might we meet them?
Magnet & /facet Zheng Liang
Web archives as a research subject
Presentation transcript:

1 Co-developing access to the UK Web Archive Helen Hockx-Yu Head of Web Archiving, British Library

2 Ten years of archiving the UK Web Archive Started web archiving in 2004, non-print Legal Deposit since April 2013 Three collections: over six billion resources and over 100TB compressed data Focus not just on content collection Proactive development of access and use, through close engagement with researchers – User survey – Content selection and curation – Brain-storming sessions and workshops to formulate research questions – Research projects

3 JISC UK Web Domain Dataset Funded by JISC to create a research collection of historical UK websites Collaboration between the Internet Archive, JISC and the British Library Copy of subset of the Internet Archive’s web collection that relates to the UK c.300 million resources, 60TB in total No local access – possible through the Internet Archive Can be used to generate secondary datasets

4 Co-design at every stage Research use case articulated Generic user requirements abstracted Requirements refined following feedback Iterative development cycles: Develop -> user testing -> feedback -> develop …

5 Use cases (generalised) Full-text/facet search -> individual resource Full-text/facet search -> analysis/visualisation Search -> corpus creation -> annotation/curation Corpus creation -> full-text search -> individual resource Corpus -> search -> analysis/visualisation [Derived datasets -> take-away] [Direct access to WARC/CDX -> take-away]

6 High-level requirements Query building Corpus formation and handling Annotation and curation In-corpus analysis Whole-dataset analysis

7 Prototype: Shine Full-text search, with proximity options, and to exclude specified text strings Apply and remove multiple facet filters to result sets – Content type, public suffix, domain, crawl year – Also available: postcode, links to public suffix, language, links domains Exclude single resources, or whole hosts from result sets Save a query Export basic query results, as CSV or similar Available at:

8 Advanced Search

9 Ngram Same search terms, different datasets Broadly similar trends Interesting to examine turning point Not useful without understanding of scope Visualisation not the end point

10 Pages mentioning “Gordon Brown” (2007)

11 Trends analysis

12 Access to data supporting trends

13 Next steps Inclusion of the full JISC dataset – seamless interface to all 3 components of UK Web Archive Better support for corpus creation (eg combination of existing corpus) Annotation and sharing of corpus (standard) analysis and visualisation of corpus Faceted search within user-define corpus (semantic) clustering of search results

14 Lessons learnt A learning process for both Not a choice between “big data” or “small data” “Macroscope” of the UK web history – “a single data point,.. both visualised at scale in the context of a billion other data points, and drilled down to its smallest compass” Context and paratext just as important User expectation / assumption Maximum transparency Scale remains a challenge