Challenges and Opportunities of Archiving the UK Web


Presentation transcript:

Challenges and Opportunities of Archiving the UK Web. Helena Byrne, Assistant Web Archivist, @HBEE2015

Goals: capture and preserve the UK web space; support access to the collection; enable research.

What are we collecting? All of the UK public web space: 5-10 million hosts (websites), 2+ billion individual items a year, up to 80-100 TB of data each year.

What are we collecting?

What don’t we collect? Email; intranets; anything behind a user login; Flash; most (but not all) video and audio content. We also collect very little Twitter or Facebook.

Why are we collecting websites?

Big national organisations change

The British Library website, from the first version published in 1995 to the current 2017 website (screenshots: 1996, 2001, 2006, 2011, 2016, 2017).

Culture disappears

What We’ve Saved (2004-2015): a study done in 2016 on a sample of 1,000 websites from the Open UKWA, grading the changes to those websites.

Challenge 1 – Capturing the internet. How often? Everything once a year (takes about 3 months); selected sites more frequently (daily, weekly, monthly, quarterly, six-monthly); news and some other sites daily.

Challenge 2 – Capturing ‘everything’. ‘Everything’ is not everything: most sites are capped at 500 MB (not the BBC); database-driven websites are very hard to collect; captured sites don’t always look how they should; WordPress is really hard.

Challenge 3 - Access. A licence is required to display a website publicly (approx. 15,000 websites); otherwise it is available only in a reading room of a Legal Deposit Library, one page at a time.

Challenge 4 - Discovery. How do you find something if you don’t know it’s there?

Challenge 4 - Discovery. How do you find what you want when there are billions of potential results? Search can’t work like Google (Google knows a LOT about you).

Challenge 5: Websites have no borders

The Future of Web Archiving: www.webarchive.org.uk/shine. Dataset obtained by JISC from the Internet Archive: all .uk domains, 1996-2013.

Cats – Dogs – Birds

Magdalene - Queens' - St. Catharine's

Secondary Datasets
JISC UK Web Domain Dataset (1996-2013): Format Profile; Geo-Index; Host-Level Links; Crawled URL Index; WATs (rich resource-level metadata, not released yet).
UK Open (Selective) Web Archive: Website Classification Dataset.
Available as CC0 downloads: http://data.webarchive.org.uk/opendata/
Secondary datasets are composed of facts about the content, but are not ‘substitutable’ for the content.
Part of a long-standing tradition: the British Library’s bibliographic data has always been openly accessible.
Probably not copyrightable: released as CC0 to avoid any ambiguity.

Useful Links: webarchive.org.uk/shine, webarchive.org.uk/blog, webarchive.org.uk/videos, data.webarchive.org.uk/opendata