Published by Frederick Fletcher. Modified over 6 years ago.
1
Challenges and Opportunities of Archiving the UK Web
Helena Byrne, Assistant Web Archivist, @HBEE2015
2
Goals
Capture and preserve the UK web space
Support access to the collection
Enable research
3
What are we collecting?
All of the UK Public Web Space
5-10 million hosts (websites)
2+ billion individual items a year
Up to TB of data each year
4
What are we collecting?
5
What don’t we collect?
Intranets
Anything behind a user login
Flash
Most (but not all) video and audio content
Very little Twitter or Facebook
6
Why are we collecting websites?
7
Big national organisations change
8
The British Library website, from the first version published in 1995 to the current 2017 website.
Screenshots: 1996, 2001, 2006, 2011, 2016, 2017
9
Culture disappears
10
What We’ve Saved ( )
Study done in 2016 on a slice of 1,000 websites from the Open UKWA.
Grades how much each website has changed.
11
Challenge 1 – Capturing the internet
How often?
Everything once a year (takes about 3 months)
Selected sites more frequently (daily, weekly, monthly, quarterly, six-monthly)
News and some other sites daily
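As a minimal sketch of how the crawl frequencies above could be turned into a schedule: the names `CRAWL_INTERVAL_DAYS`, `next_crawl` and `is_due` are hypothetical, and the day counts are rough assumptions for illustration, not the archive's actual settings.

```python
from datetime import date, timedelta

# Frequencies named on the slide, mapped to approximate intervals in days.
# The day counts are assumptions chosen for illustration only.
CRAWL_INTERVAL_DAYS = {
    "daily": 1,
    "weekly": 7,
    "monthly": 30,
    "quarterly": 91,
    "six-monthly": 182,
    "annual": 365,
}


def next_crawl(last_crawl: date, frequency: str) -> date:
    """Return the earliest date on which a site should be crawled again."""
    return last_crawl + timedelta(days=CRAWL_INTERVAL_DAYS[frequency])


def is_due(last_crawl: date, frequency: str, today: date) -> bool:
    """True if the site's next scheduled crawl date has been reached."""
    return today >= next_crawl(last_crawl, frequency)
```

For example, a news site on the "daily" schedule last crawled on 1 January would be due again on 2 January, while a site covered only by the annual crawl would not be due until the following year.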
13
Challenge 2 – Capturing ‘everything’
‘Everything’ is not everything
Most sites capped at 500 MB (not the BBC)
Database-driven websites are very hard to collect
Archived sites don’t always look how they should
WordPress is really hard
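The per-site cap explains why a capture can be incomplete. The sketch below illustrates the idea only: `capped_crawl` and `CAP_BYTES` are hypothetical names, 500 MB is taken as 500 MiB, and a real crawler may enforce its limit differently.

```python
# 500 MB per-site cap, interpreted here as 500 MiB (an assumption).
CAP_BYTES = 500 * 1024 * 1024


def capped_crawl(resource_sizes, cap_bytes=CAP_BYTES):
    """Keep resources in crawl order until the next one would exceed the cap.

    resource_sizes: iterable of resource sizes in bytes, in crawl order.
    cap_bytes: the size limit, or None for an uncapped site.
    Returns (kept_sizes, total_bytes_kept).
    """
    kept = []
    total = 0
    for size in resource_sizes:
        if cap_bytes is not None and total + size > cap_bytes:
            break  # everything after this point is never captured
        kept.append(size)
        total += size
    return kept, total
```

Passing `cap_bytes=None` models an uncapped site, which is how the slide describes the BBC.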
16
Challenge 3 - Access
Licence required to display a website publicly (approx. 15,000 websites)
Otherwise only in a reading room of a Legal Deposit Library
One page at a time
19
Challenge 3 - Discovery
How do you find something if you don’t know it’s there?
20
Challenge 4 - Discovery
How do you find what you want when there are billions of potential results?
Search can’t work like Google (Google knows a LOT about you)
22
Challenge 5: Websites have no borders
23
The Future of Web Archiving
Dataset obtained by JISC from the Internet Archive
All .uk domains
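A rough sketch of what selecting "all .uk domains" from a list of URLs could look like. The function name `is_uk_domain` is hypothetical, and this is not how the dataset was actually extracted; it only shows a host-suffix test using Python's standard `urllib.parse`.

```python
from urllib.parse import urlsplit


def is_uk_domain(url: str) -> bool:
    """True if the URL's host falls under the .uk top-level domain."""
    host = urlsplit(url).hostname  # lower-cased by urlsplit; None if absent
    if host is None:
        return False
    host = host.rstrip(".")  # tolerate a trailing root dot
    return host == "uk" or host.endswith(".uk")
```

The test looks only at the host, so a URL such as `https://example.com/uk` does not count. A suffix test like this also misses UK sites on other top-level domains, which is one reason websites having no borders is a challenge.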
24
Cats – Dogs – Birds
25
Magdalene - Queens' - St. Catharine's
28
Secondary Datasets
JISC UK Web Domain Dataset (1996-2013): Format Profile; Geo-Index; Host-Level Links; Crawled URL Index; WATs (rich resource-level metadata, not released yet)
UK Open (Selective) Web Archive: Website Classification Dataset
Available as CC0 downloads:
Secondary Datasets: Composed of facts about the content, but not ‘substitutable’ for the content
Part of a long-standing tradition: The British Library’s bibliographic data has always been openly accessible
Probably not copyrightable: Released as CC0 to avoid any ambiguity
29
Useful Links
webarchive.org.uk/blog
webarchive.org.uk/videos
webarchive.org.uk/shine
data.webarchive.org.uk/opendata