Why We Need Multiple Archives

Presentation transcript:

Why We Need Multiple Archives
Michael L. Nelson & Herbert Van de Sompel
@phonedude_mln @hvdsomp
Digital Preservation of Federal Information Summit, CNI Spring 2016
April 3, 2016

Two Common Misconceptions About Web Archiving

1. "Prior = old = obsolete = stale = bad." (Who cares? Not an interesting problem.)
2. "The Internet Archive has every copy of everything that has ever existed." (Who cares? Problem solved.)

Why Care About The Past?

From an anonymous WWW 2010 reviewer about our Memento paper (emphasis mine): "Is there any statistics to show that many or a good number of Web users would like to get obsolete data or resources?"

One answer: replay of contemporary pages >> summary pages.

http://www.slideshare.net/phonedude/why-careaboutthepast
http://www.nytimes.com/2013/06/19/books/seven-american-deaths-and-disasters-transcribes-the-news.html

timetravel.mementoweb.org: e.g., bbc.co.uk in six different archives…
http://timetravel.mementoweb.org/list/20140525002314/http://www.bbc.co.uk/
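
The aggregator also serves machine-readable TimeMaps. Below is a minimal Python sketch, assuming the public timetravel.mementoweb.org/timemap/link/ endpoint and its usual one-entry-per-line application/link-format responses (both assumptions about the live service), that lists which archive hosts report mementos for a URI:

    import urllib.request
    from urllib.parse import urlparse

    # Public Memento aggregator TimeMap endpoint (an assumption about the
    # service's URI layout; see the Time Travel API documentation).
    AGGREGATOR = "http://timetravel.mementoweb.org/timemap/link/"

    def archives_holding(uri):
        """Return the archive hostnames that report mementos for `uri`."""
        with urllib.request.urlopen(AGGREGATOR + uri) as resp:
            timemap = resp.read().decode("utf-8", errors="replace")
        hosts = set()
        for line in timemap.splitlines():
            # A memento entry looks like:
            # <http://web.archive.org/web/2014.../http://www.bbc.co.uk/>;
            #     rel="memento"; datetime="Sun, 25 May 2014 00:23:14 GMT",
            if 'memento"' in line and line.lstrip().startswith("<"):
                target = line.strip()[1:].split(">", 1)[0]
                hosts.add(urlparse(target).netloc)
        return hosts

    print(sorted(archives_holding("http://www.bbc.co.uk/")))

A single response naming web.archive.org, archive.today, and others is the machine-readable version of the multi-archive list page linked above.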

Seagal's Law: "A man with a watch knows what time it is. A man with two watches is never sure."

How do we resolve conflicting archives? Personalization, GeoIP, mobile vs. desktop, etc. mean that "the" page rarely exists, only "a" page.

Mat Kelly, Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, "A Method for Identifying Personalized Representations in Web Archives", D-Lib Magazine, 19(11/12), 2013. http://www.dlib.org/dlib/november13/kelly/11kelly.html

Historicity of Web Archives https://twitter.com/phonedude_mln/status/490171976389238784

Remember MH17? https://twitter.com/phonedude_mln/status/490171976389238784

Alex is now 404. Would multiple, independent archives have convinced him? https://twitter.com/quicknquiet

A single archive is vulnerable.
http://www.bbc.com/news/uk-politics-24924185
http://ws-dl.blogspot.com/2013/11/2013-11-21-conservative-party-speeches.html

"Houston, Tranquility Base here. The Eagle has landed."
http://ws-dl.blogspot.com/2013/03/2013-03-22-ntrs-web-archives-and-why-we.html
http://ws-dl.blogspot.com/2013/06/2013-06-18-ntrs-memento-and-handles.html

The Guardian: Google acknowledges some people want the "right to be forgotten".
http://www.theguardian.com/technology/2015/feb/19/google-acknowledges-some-people-want-right-to-be-forgotten

Economics working against archives

"In the paper world, in order to monetize their content the copyright owner had to maximize the number of copies of it. In the Web world, in order to monetize their content the copyright owner has to minimize the number of copies. Thus the fundamental economic motivation for Web content militates against its preservation in the ways that Herbert and I would like." --David Rosenthal

http://blog.dshr.org/2015/02/the-evanescent-web.html

Who pays for those extra archives? 1TB endowment = ~$4,700.
http://blog.dshr.org/2011/02/paying-for-long-term-storage.html
see also: http://blog.dshr.org/2011/01/memento-marketplace-for-archiving.html
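
The shape of that number comes from an endowment model: pay once now, and let investment returns cover future years' storage as prices fall. The sketch below is a toy geometric-series version; C, k, and r are illustrative assumptions, not Rosenthal's actual inputs.

    # Toy endowment model: year-t cost of storing 1 TB is C * (1 - k)**t,
    # discounted at an investment return r. All three values are assumptions
    # chosen for illustration only.
    C = 550.0   # assumed first-year cost of storing 1 TB, in dollars
    k = 0.10    # assumed annual decline in storage prices (Kryder rate)
    r = 0.02    # assumed annual real return on the invested endowment

    # endowment = sum over t >= 0 of C * ((1 - k) / (1 + r))**t,
    # a geometric series that converges when (1 - k) < (1 + r).
    ratio = (1 - k) / (1 + r)
    endowment = C / (1 - ratio)
    print(f"endowment for 1 TB, forever: ${endowment:,.0f}")

With these particular guesses the series sums to roughly $4,700, but the result is highly sensitive to k and r: storing "forever" is both expensive and uncertain.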

Archives aren't magic web sites
They're just web sites. If you used Mummify, you're now left with a bunch of defunct, shortened links like: https://mummify.it/XbmcMfE3

Don't Throw Away the Original URL – Use Robust Links!

<a href="http://www.w3.org/"
   data-versionurl="https://archive.today/r7cov"
   data-versiondate="2015-01-21">my robust link to the live web</a>

<a href="https://archive.today/r7cov"
   data-originalurl="http://www.w3.org/"
   data-versiondate="2015-01-21">my robust link to an archived version</a>

<!DOCTYPE html>
<html lang="en" itemscope itemtype="http://schema.org/WebPage"
      itemid="http://robustlinks.mementoweb.org/spec/">
<head>
  <meta charset="utf-8" />
  <meta itemprop="dateModified" content="2015-02-02">
  <meta itemprop="datePublished" content="2015-01-23">
  <title>Page Level Metadata Is The Least You Can Do</title>

More examples / scenarios at: http://robustlinks.mementoweb.org/spec/
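
No special server support is needed to consume these attributes; a client can do the fallback itself. Here is a minimal sketch using only the Python standard library (the RobustLinkParser class and dereference helper are names invented for this example) that tries the live href first and falls back to data-versionurl when the live web fails:

    import urllib.error
    import urllib.request
    from html.parser import HTMLParser

    class RobustLinkParser(HTMLParser):
        """Collect (href, data-versionurl, data-versiondate) from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                a = dict(attrs)
                if "data-versionurl" in a:
                    self.links.append(
                        (a.get("href"), a["data-versionurl"], a.get("data-versiondate"))
                    )

    def dereference(href, versionurl):
        """Try the live URL first; fall back to the archived copy on failure."""
        for candidate in (href, versionurl):
            try:
                with urllib.request.urlopen(candidate, timeout=10) as resp:
                    return candidate, resp.read()
            except (urllib.error.URLError, OSError):
                continue
        raise RuntimeError("neither live nor archived copy is reachable")

    html = ('<a href="http://www.w3.org/" '
            'data-versionurl="https://archive.today/r7cov" '
            'data-versiondate="2015-01-21">my robust link</a>')
    parser = RobustLinkParser()
    parser.feed(html)
    for href, versionurl, versiondate in parser.links:
        used, body = dereference(href, versionurl)
        print(f"fetched {used} ({len(body)} bytes, snapshot dated {versiondate})")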

Archiving your internal stuff: Transactional Archiving
Never miss an update; archive your site as it is being viewed by users.
https://mementoweb.github.io/SiteStory/
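
SiteStory itself pairs an Apache module with a separate archive server; the sketch below is not how SiteStory works internally, just a toy illustration of the transactional idea as Python WSGI middleware using the warcio library: every response the server actually sends is also written out as a WARC response record.

    import io
    from warcio.statusandheaders import StatusAndHeaders
    from warcio.warcwriter import WARCWriter

    class TransactionalArchiver:
        """Toy WSGI middleware: write every outgoing response to a WARC file."""

        def __init__(self, app, warc_path="transactions.warc.gz"):
            self.app = app
            self.writer = WARCWriter(open(warc_path, "ab"), gzip=True)

        def __call__(self, environ, start_response):
            captured = {}

            def capture(status, headers, exc_info=None):
                captured["status"], captured["headers"] = status, headers
                return start_response(status, headers, exc_info)

            # Run the wrapped app and buffer its body so we can archive it.
            body = b"".join(self.app(environ, capture))
            uri = "http://{}{}".format(
                environ.get("HTTP_HOST", "localhost"),
                environ.get("PATH_INFO", "/"),
            )
            record = self.writer.create_warc_record(
                uri,
                "response",
                payload=io.BytesIO(body),
                http_headers=StatusAndHeaders(
                    captured["status"], captured["headers"], protocol="HTTP/1.1"
                ),
            )
            self.writer.write_record(record)
            return [body]

Because each record is built from the response actually served, the archive captures exactly what users saw, updates included, which a periodic crawler can miss.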

Archiving your internal stuff: Heritrix & Wayback
Mementos of the Mitre intranet "MiiTube", complete with JavaScript leakage.
Crawling your intranet: http://www.dlib.org/dlib/january16/brunelle/01brunelle.html
Crawling JS "stuff" will take 5X more storage: http://arxiv.org/abs/1601.05142

JavaScript == the new deep web; use ResourceSync (AKA "Fancy SiteMaps") to make sure your URIs are exposed.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up" href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="resourcelist"
         at="2013-01-03T09:00:00Z"
         completed="2013-01-03T09:01:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           length="8876" type="text/html"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
    <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e sha-256:854f61290e2e197a11bc91063afce22e43f8ccc655237050ace766adc68dc784"
           length="14599" type="application/pdf"/>
  </url>
</urlset>

http://www.openarchives.org/rs/
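
A harvester walks such a resource list with any XML parser. Here is a minimal sketch with Python's standard library, assuming the document above has been saved as resourcelist.xml (a hypothetical filename):

    import xml.etree.ElementTree as ET

    NS = {
        "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
        "rs": "http://www.openarchives.org/rs/terms/",
    }

    tree = ET.parse("resourcelist.xml")  # the resource list shown above
    for url in tree.getroot().findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        md = url.find("rs:md", NS)
        # Each entry tells a crawler what to fetch, when it changed, and how
        # to verify the payload, so no URI stays hidden behind JavaScript.
        print(loc, lastmod, md.get("type"), md.get("length"), md.get("hash"))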

When all else fails, justify project with: “web archiving is Big Data”