Challenges and Opportunities of Archiving the UK Web

Helena Byrne, Assistant Web Archivist (@HBEE2015)

Goals
Capture and preserve the UK web space
Support access to the collection
Enable research

What are we collecting?
All of the UK public web space
5-10 million hosts (websites)
2+ billion individual items a year
Up to TB of data each year

What don’t we collect?
Intranets
Anything behind a user login
Flash
Most (but not all) video and audio content
Very little Twitter or Facebook

Why are we collecting websites?

Big national organisations change

The British Library website, from the first version published in 1995 to the current 2017 website (screenshots: 1996, 2001, 2006, 2011, 2016, 2017).

Culture disappears

What We’ve Saved ( )
Study done in 2016 on a slice of 1,000 websites from the Open UKWA.
Grades how much each website has changed.

Challenge 1 – Capturing the internet
How often?
Everything once a year (takes about 3 months)
Selected sites more frequently (daily, weekly, monthly, quarterly, six-monthly)
News and some other sites daily
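The crawl frequencies on this slide can be pictured as a simple scheduling rule. The following is only a minimal sketch, not the archive's actual tooling: the names `CRAWL_INTERVAL_DAYS`, `is_due` and `due_sites` are hypothetical, and the day counts for monthly, quarterly and six-monthly crawls are rounded assumptions.

```python
from datetime import date, timedelta

# Hypothetical mapping from the frequencies named on the slide to
# approximate intervals in days (month lengths are rounded assumptions).
CRAWL_INTERVAL_DAYS = {
    "daily": 1,
    "weekly": 7,
    "monthly": 30,
    "quarterly": 91,
    "six-monthly": 182,
    "annual": 365,
}


def is_due(frequency: str, last_crawled: date, today: date) -> bool:
    """Return True if a site on this schedule should be crawled again."""
    interval = timedelta(days=CRAWL_INTERVAL_DAYS[frequency])
    return today - last_crawled >= interval


def due_sites(sites, today: date):
    """Return the hosts that are due, given (host, frequency, last_crawled) tuples."""
    return [host for host, freq, last in sites if is_due(freq, last, today)]
```

For example, a news site marked "daily" and last captured yesterday would be due today, while a site on the annual domain crawl and captured last month would not.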

Challenge 2 – Capturing ‘everything’
‘Everything’ is not everything
Most sites capped at 500 MB (not the BBC)
Database-driven websites are very hard to collect
Archived sites don’t always look how they should
WordPress is really hard
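To make the effect of a per-site cap concrete, here is a minimal sketch of how a size limit truncates a capture. It is an illustration under stated assumptions, not how the archive's crawler is implemented: `plan_capture` is a hypothetical helper, 500 MB is read as 500 × 1024 × 1024 bytes, and the sketch stops before the cap would be exceeded.

```python
# Assumed reading of the slide's "500 MB" cap, in bytes.
CAP_BYTES = 500 * 1024 * 1024


def plan_capture(resource_sizes, cap_bytes=CAP_BYTES, exempt=False):
    """Walk a site's resources in crawl order and stop at the size cap.

    resource_sizes: list of resource sizes in bytes, in crawl order.
    exempt: True for sites that are not capped (the slide names the BBC).
    Returns (kept, skipped, total_bytes_kept).
    """
    total = 0
    kept = 0
    for size in resource_sizes:
        # Stop as soon as the next resource would push the site over the cap.
        if not exempt and total + size > cap_bytes:
            break
        total += size
        kept += 1
    return kept, len(resource_sizes) - kept, total
```

Everything after the cut-off point is simply missing from the archived copy, which is one reason a large site can look incomplete when replayed.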

Challenge 3 - Access
Licence required to display a website publicly (approx. 15,000 websites)
Otherwise only in a reading room of a Legal Deposit Library
One page at a time

Challenge 4 - Discovery
How do you find something if you don’t know it’s there?
How do you find what you want when there are billions of potential results?
Search can’t work like Google (Google knows a LOT about you)

Challenge 5: Websites have no borders

The Future of Web Archiving
Dataset obtained by JISC from the Internet Archive
All .uk domains

Cats – Dogs – Birds

Magdalene - Queens' - St. Catharine's

Secondary Datasets
JISC UK Web Domain Dataset (1996-2013):
  Format Profile
  Geo-Index
  Host-Level Links
  Crawled URL Index
  WATs (rich resource-level metadata, not released yet)
UK Open (Selective) Web Archive:
  Website Classification Dataset
Available as CC0 downloads
Secondary datasets:
  Composed of facts about the content
  But not ‘substitutable’ for the content
Part of a long-standing tradition:
  The British Library’s bibliographic data has always been openly accessible
Probably not copyrightable:
  Released as CC0 to avoid any ambiguity

Useful Links
webarchive.org.uk/blog
webarchive.org.uk/videos
webarchive.org.uk/shine
data.webarchive.org.uk/opendata
