Challenges and Opportunities of Archiving the UK Web

Presentation on theme: "Challenges and Opportunities of Archiving the UK Web"— Presentation transcript:

1 Challenges and Opportunities of Archiving the UK Web
Helena Byrne Assistant Web Archivist @HBEE2015

2 Goals
Capture and preserve the UK web space
Support access to the collection
Enable research

3 What are we collecting?
All of the UK Public Web Space
5-10 million hosts (websites)
2+ billion individual items a year
Up to TB of data each year

4 What are we collecting?

5 What don’t we collect?
Intranets
Anything behind a user login
Flash
Most (but not all) video and audio content
Very little Twitter or Facebook

6 Why are we collecting websites?

7 Big national organisations change

8 The British Library website, from the first version published in 1995 to the current 2017 website
Years shown: 1996, 2001, 2006, 2011, 2016, 2017

9 Culture disappears

10 What We’ve Saved ( )
Study done in 2016 on a sample of 1,000 websites from the Open UKWA, grading how much each website has changed.

11 Challenge 1 – Capturing the internet
How often?
Everything once a year (takes about 3 months)
Selected sites more frequently (daily, weekly, monthly, quarterly, six-monthly)
News and some other sites daily
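The crawl frequencies on this slide can be read as a simple schedule. The following is a minimal sketch of that idea, not the archive's actual crawler configuration: the `FREQUENCIES` table, the `"annual"` label, and the function names are all hypothetical.

```python
import calendar
from datetime import date, timedelta

# Frequencies named on the slide, mapped to (unit, amount).
# This table and its labels are illustrative assumptions.
FREQUENCIES = {
    "daily": ("days", 1),
    "weekly": ("days", 7),
    "monthly": ("months", 1),
    "quarterly": ("months", 3),
    "six-monthly": ("months", 6),
    "annual": ("months", 12),
}


def add_months(start: date, months: int) -> date:
    """Move forward by whole months, clamping to the last day of a short month."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def next_crawl(last_crawl: date, frequency: str) -> date:
    """Return the date on which a site with this frequency is next due."""
    unit, amount = FREQUENCIES[frequency]
    if unit == "days":
        return last_crawl + timedelta(days=amount)
    return add_months(last_crawl, amount)
```

For example, a site last crawled on 31 January 2017 with a monthly frequency would next be due on 28 February 2017, because the day is clamped to the end of the shorter month.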

12

13 Challenge 2 – Capturing ‘everything’
‘Everything’ is not everything
Most sites capped at 500 MB (not the BBC)
Database-driven websites are very hard to collect
Captured sites don’t always look how they should
WordPress is really hard
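The per-site cap can be pictured as a running byte budget for each host, with a short list of exempt hosts. This is only a sketch under stated assumptions: the class `CrawlBudget`, its fields, and the reading of "500 MB" as 500 × 1024 × 1024 bytes are hypothetical and are not taken from the archive's crawler.

```python
from dataclasses import dataclass, field

# Assumption: the slide's "500 MB" is read here as 500 * 1024 * 1024 bytes.
DEFAULT_CAP_BYTES = 500 * 1024 * 1024


@dataclass
class CrawlBudget:
    """Tracks bytes downloaded per host and enforces a per-host cap."""

    cap_bytes: int = DEFAULT_CAP_BYTES
    uncapped_hosts: set = field(default_factory=set)  # hosts exempt from the cap
    used: dict = field(default_factory=dict)          # host -> bytes recorded so far

    def allow(self, host: str, size_bytes: int) -> bool:
        """Record a download if it fits; return False if it would exceed the cap."""
        total = self.used.get(host, 0) + size_bytes
        if host not in self.uncapped_hosts and total > self.cap_bytes:
            return False  # rejected downloads are not counted against the host
        self.used[host] = total
        return True
```

A crawler loop would call `allow(host, size)` before storing each resource and skip the resource once it returns `False`, which is one reason a capped capture can be missing parts of a large site.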

14

15

16 Challenge 3 - Access
Licence required to display website publicly (approx 15,000 websites)
Otherwise only in a reading room of a Legal Deposit Library
One page at a time

17

18

19 Challenge 4 - Discovery
How do you find something if you don’t know it’s there?

20 Challenge 4 - Discovery
How do you find what you want when there are billions of potential results?
Search can’t work like Google (Google knows a LOT about you)

21

22 Challenge 5: Websites have no borders

23 The Future of Web Archiving
Dataset obtained by JISC from the Internet Archive
All .uk domains
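Selecting "all .uk domains" amounts to a test on each URL's hostname. The sketch below shows one way to express that test with Python's standard library; the function name `is_uk_host` is hypothetical, and it is not the method used to build the JISC dataset. As Challenge 5 points out, websites have no borders, so a top-level-domain test like this misses UK sites hosted under other domains.

```python
from urllib.parse import urlsplit


def is_uk_host(url: str) -> bool:
    """Return True if the URL's hostname falls under the .uk top-level domain."""
    host = urlsplit(url).hostname  # lower-cased by urlsplit; None if no host part
    if host is None:
        return False
    host = host.rstrip(".")  # tolerate a fully qualified name with a trailing dot
    return host == "uk" or host.endswith(".uk")
```

Note that the check looks only at the hostname, so a path such as `/page.uk` on a `.com` site does not count, and a URL without a scheme (for example `www.bl.uk`) is rejected because `urlsplit` finds no host in it.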

24 Cats – Dogs – Birds

25 Magdalene - Queens' - St. Catharine's

26

27

28 Secondary Datasets
JISC UK Web Domain Dataset (1996-2013):
Format Profile
Geo-Index
Host-Level Links
Crawled URL Index
WATs (rich resource-level metadata, not released yet)
UK Open (Selective) Web Archive:
Website Classification Dataset
Available as CC0 downloads:
Secondary Datasets:
Composed of facts about the content
But not ‘substitutable’ for the content
Part of a long-standing tradition:
The British Library’s bibliographic data has always been openly accessible
Probably not copyrightable:
Released as CC0 to avoid any ambiguity

29 Useful Links ….
webarchive.org.uk/blog
webarchive.org.uk/videos
webarchive.org.uk/shine
data.webarchive.org.uk/opendata

