Challenges and Opportunities of Archiving the UK Web Helena Byrne Assistant Web Archivist @HBEE2015
Goals Capture and Preserve the UK web space Support access to the collection Enable research
What are we collecting? All of the UK Public Web Space 5-10 million hosts (websites) 2+ billion individual items a year Up to 80-100TB of data each year
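As a rough sense of scale, the figures above imply an average size per captured item. The sketch below is only a back-of-envelope calculation. It assumes decimal units (1 TB = 10^12 bytes) and exactly 2 billion items, the slide's lower bound, so the true average is likely somewhat smaller. The function name is made up for illustration.

```python
# Back-of-envelope estimate from the slide's figures.
# Assumptions (not from the talk): decimal units (1 TB = 10**12 bytes)
# and exactly 2 billion items a year, the slide's lower bound.
ITEMS_PER_YEAR = 2_000_000_000
TB = 10**12


def average_item_bytes(total_tb: int, items: int = ITEMS_PER_YEAR) -> float:
    """Average size of one captured item, in bytes."""
    return total_tb * TB / items


low = average_item_bytes(80)    # 80 TB a year
high = average_item_bytes(100)  # 100 TB a year
print(f"{low / 1000:.0f}-{high / 1000:.0f} KB per item on average")
```

Under these assumptions the average item comes out in the tens of kilobytes, which fits a collection dominated by HTML pages, images and stylesheets rather than video.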
What don’t we collect? Email Intranets Anything behind a user login Flash Most (but not all) video and audio content Very little Twitter or Facebook
Why are we collecting websites?
Big national organisations change
From the first British Library website, published in 1995, to the current 2017 website (screenshots: 1996, 2001, 2006, 2011, 2016, 2017)
Culture disappears
What We’ve Saved (2004-2015) Study done in 2016 on a sample of 1,000 websites from the Open UKWA, grading how much each website had changed.
Challenge 1 – Capturing the internet How often? Everything once a year (takes about 3 months) Selected sites more frequently (daily, weekly, monthly, quarterly, six-monthly) News and some other sites daily
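The crawl frequencies listed above can be pictured as a simple schedule. This is a minimal sketch, not the archive's actual scheduler. The mapping from frequency name to days, the approximate month lengths, and the function names are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Hypothetical mapping from the frequencies named on the slide to a
# number of days; month lengths are approximated for simplicity.
FREQUENCY_DAYS = {
    "daily": 1,
    "weekly": 7,
    "monthly": 30,
    "quarterly": 91,
    "six-monthly": 182,
    "annual": 365,
}


def next_crawl_due(last_crawl: date, frequency: str) -> date:
    """Date on which a site with this frequency should next be crawled."""
    return last_crawl + timedelta(days=FREQUENCY_DAYS[frequency])


def is_due(last_crawl: date, frequency: str, today: date) -> bool:
    """True if the site's next crawl date has been reached."""
    return today >= next_crawl_due(last_crawl, frequency)
```

For example, a news site last crawled on 1 January 2017 with a "daily" frequency is due again on 2 January, while a site on the "annual" schedule is not due until 1 January 2018.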
Challenge 2 – Capturing ‘everything’ ‘Everything’ is not everything Most sites capped at 500MB (not BBC) Database-driven websites very hard to collect Archived sites don’t always look how they should WordPress is really hard
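The per-site cap explains why 'everything' is not everything: once a site's budget is used up, the remaining resources are simply not collected. The sketch below illustrates the idea only. How the real crawler enforces the cap is not described in the talk, and the host name, the decimal megabyte, and the function name are assumptions.

```python
CAP_BYTES = 500 * 10**6  # 500 MB cap from the slide (decimal MB assumed)
UNCAPPED_HOSTS = {"www.bbc.co.uk"}  # hypothetical exemption list


def select_within_cap(host, resource_sizes, cap=CAP_BYTES):
    """Return the sizes of the resources kept before the cap is hit."""
    if host in UNCAPPED_HOSTS:
        return list(resource_sizes)  # exempt hosts keep everything
    kept, total = [], 0
    for size in resource_sizes:
        if total + size > cap:
            break  # budget exhausted: later resources are not collected
        kept.append(size)
        total += size
    return kept
```

With resources of 300 MB, 150 MB and 100 MB, a capped site keeps only the first two (450 MB in total), while an exempt host keeps all three.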
Challenge 3 - Access Licence required to display website publicly (approx 15,000 websites) Otherwise only in a reading room of a Legal Deposit Library One page at a time
Challenge 4 - Discovery How do you find something if you don’t know it’s there?
Challenge 4 - Discovery How do you find what you want when there are billions of potential results? Search can’t work like Google (Google know a LOT about you)
Challenge 5: Websites have no borders
The Future of Web Archiving www.webarchive.org.uk/shine Dataset obtained by JISC from the Internet Archive All .uk domains 1996-2013
Cats – Dogs – Birds
Magdalene - Queens' - St. Catharine's
Secondary Datasets JISC UK Web Domain Dataset (1996-2013): Format Profile, Geo-Index, Host-Level Links, Crawled URL Index, WATs (rich resource-level metadata, not released yet) UK Open (Selective) Web Archive: Website Classification Dataset Available as CC0 downloads: http://data.webarchive.org.uk/opendata/
Secondary Datasets: Composed of facts about the content But not ‘substitutable’ for the content Part of a long-standing tradition: The British Library’s bibliographic data has always been openly accessible Probably not copyrightable: Released as CC0 to avoid any ambiguity
Useful Links …. webarchive.org.uk/shine webarchive.org.uk/blog webarchive.org.uk/videos data.webarchive.org.uk/opendata