Just-In-Time Recovery of Missing Web Pages

Similar presentations
CLEARSPACE Digital Document Archiving system INTRODUCTION Digital Document Archiving is the process of capturing paper documents through scanning and.

CrossRef Linking and Library Users “The vast majority of scholarly journals are now online, and there have been a number of studies of what features scholars.
Web Caching Schemes1 A Survey of Web Caching Schemes for the Internet Jia Wang.
Search Engines and their Public Interfaces: Which APIs are the Most Synchronized? Frank McCown and Michael L. Nelson Department of Computer Science, Old.
Storage Refresh Project Migration of Enterprise Leased Shares Websites Home Directory Service.
Distributed Data Stores – Facebook Presented by Ben Gooding University of Arkansas – April 21, 2015.
Synchronicity: Just-In-Time Discovery of Lost Web Pages NDIIPP Partners Meeting June 25, 2009 Martin Klein & Michael L. Nelson Department of Computer Science.
Cloud Computing for the Enterprise November 18th, This work is licensed under a Creative Commons.
Website Reconstruction using the Web Infrastructure Frank McCown Doctoral Consortium June.
DEEP SEARCH Application of Primo Deep Search between Northwestern and Vanderbilt ELUNA 2015 Michael North - Northwestern University Dale Poulter - Vanderbilt.
Web Characterization: What Does the Web Look Like?
Web Archiving Dr. Frank McCown Intro to Web Science Harding University This work is licensed under Creative Commons Attribution-NonCommercial 3.0Attribution-NonCommercial.
Cataloguing Electronic resources Prepared by the Cataloguing Team at Charles Sturt University.
Indo-US Workshop, June23-25, 2003 Building Digital Libraries for Communities using Kepler Framework M. Zubair Old Dominion University.
Use & Access 26 March Use “Proof of Concept” Model for General Libraries & IS faculty Model for General Libraries & IS faculty Test bed for DSpace.
CBSOR,Indian Statistical Institute 30th March 07, ISI,Kokata 1 Digital Repository support for Consortium Dr. Devika P. Madalli Documentation Research &
My Website Was Lost, But Now It’s Found Frank McCown CS 110 – Intro to Computer Science April 23, 2007.
DNER Architecture Andy Powell 6 March 2001 UKOLN, University of Bath UKOLN is funded by Resource: The Council for.
Freelib: A Self-sustainable Digital Library for Education Community Ashraf Amrou, Kurt Maly, Mohammad Zubair Computer Science Dept., Old Dominion University.
Archive Ingest and Handling Test: ODU’s Perspective Michael L. Nelson Department of Computer Science Old Dominion University
Repository Synchronization Using NNTP and SMTP Michael L. Nelson, Joan A. Smith, Martin Klein Old Dominion University Norfolk VA
CLASS Information Management Presented at NOAATECH Conference 2006 Presented by Pat Schafer (CLASS-WV Development Lead)
1 GRID Based Federated Digital Library K. Maly, M. Zubair, V. Chilukamarri, and P. Kothari Department of Computer Science Old Dominion University February,
World Wide Web “WWW”, "Web" or "W3".
Client-Side Preservation Techniques for ORE Aggregations Michael L. Nelson & Sudhir Koneru Old Dominion University, Norfolk VA OAI-ORE Specification Roll-Out.
Harvesting Social Knowledge from Folksonomies Harris Wu, Mohammad Zubair, Kurt Maly, Harvesting social knowledge from folksonomies, Proceedings of the.
Ph.D. Progress Report Frank McCown 4/14/05. Timeline Year 1 : Course work and Diagnostic Exam Year 2 : Course work and Candidacy Exam Year 3 : Write and.
Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007.
Your Digital Technology Briefcase My information…when and where I need it.
Factors Affecting Website Reconstruction from the Web Infrastructure Frank McCown, Norou Diawara, and Michael L. Nelson Old Dominion University Computer.
What is Web Information retrieval from web Search Engine Web Crawler Web crawler policies Conclusion How does a web crawler work Synchronization Algorithms.
The Availability and Persistence of Web References in D-Lib Magazine Frank McCown, Sheffan Chan, Michael L. Nelson and Johan Bollen Old Dominion University.
Lazy Preservation: Reconstructing Websites by Crawling the Crawlers Frank McCown, Joan A. Smith, Michael L. Nelson, & Johan Bollen Old Dominion University.
Evaluating Ingest Success: Using the AIHT Michael L. Nelson, Joan A. Smith Department of Computer Science Old Dominion University Norfolk VA DCC.
ADAPTIVE HYPERMEDIA Presented By:- Debraj Manna Raunak Pilani Gada Kekin Dhiraj.
Brass: A Queueing Manager for Warrick Frank McCown, Amine Benjelloun, and Michael L. Nelson Old Dominion University Computer Science Department Norfolk,
Delivers local and global resources in a single search The first, easy step toward the first cooperative library service on the Web WorldCat Local “quick.
1 Introduction to Digital Libraries Week 15: Web Infrastructure for Preservation Old Dominion University Department of Computer Science CS 751/851 Fall.
Transparent Format Migration of Preserved Web Content D. S. H. Rosenthal, T. Lipkis, T. S. Robertson, S. Morabito Lib Magazine, 11(1), 2005
Introduction to Digital Analytics Keith MacDonald Guest Presentation.
Jan 27, Digital Preservation Seminar1 Effective Page Refresh Policies for Web Crawlers Written By: Junghoo Cho & Hector Garcia-Molina Presenter:
CS 791-S04 Digital Preservation Seminar Presentation of: Arms, "Preservation of Scientific Serials: Three Current Examples", JEP, 5(2), 1999 and Nelson.
Databases vs the Internet Coconino Community College Revised August 2010.
Can’t Find Your 404s? Santa Fe Complex March 13, 2009 Martin Klein, Frank McCown, Joan Smith, Michael L. Nelson Department of Computer Science Old Dominion.
A Solution for Maintaining File Integrity within an Online Data Archive Dan Scholes PDS Geosciences Node Washington University 1.
Control Choices and Network Effects in Hypertext Systems
The How and Why of DOI Assigning DOI’s to IR content
An Overview of Data-PASS Shared Catalog
The Hosted Model Charl Roberts Good morning again,
The Online Smith Family Recipe Program
Software Documentation
PHP / MySQL Introduction
PNDS Architecture - an overview
Lazy Preservation, Warrick, and the Web Infrastructure
Making the Most of the Ellucian Support Center
Boost Website Performance with the Custom WordPress Plug-In Development
A Brief Introduction to the Internet
NASA Technical Report Server (NTRS) Project Overview April 2, 2003
Agreeing to Disagree: Search Engines and Their Public Interfaces
Information Retrieval
Correlation of Term Count and Document Frequency for Google N-Grams
Preservation of Digital Objects and Collections
Characterization of Search Engine Caches
If You Harvest arXiv.org, Will They Come?
PubMed Database Interface (Basic Course: Module 4 Part A)
Presentation transcript:

Just-In-Time Recovery of Missing Web Pages
Hypertext 2006, Odense, Denmark, August 25, 2006
Terry L. Harrison & Michael L. Nelson
Old Dominion University, Norfolk VA, USA

Preservation: Fortress Model
Five Easy Steps for Preservation:
Get a lot of $
Buy a lot of disks, machines, tapes, etc.
Hire an army of staff
Load a small amount of data
“Look upon my archive ye Mighty, and despair!”
image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg

Alternate Models of Preservation
Lazy Preservation: let Google, IA et al. preserve your website
Just-In-Time Preservation: find a “good enough” replacement web page
Shared Infrastructure Preservation: push your content to sites that might preserve it
Web Server Enhanced Preservation: use Apache modules to create archival-ready resources
image from: http://www.proex.ufes.br/arsm/knots_interlaced.htm

Outline
The 404 problem
Component technologies: web infrastructure, lexical signatures, OAI-PMH
Opal: architectural description, analysis

404 Problem
Kahle (97): average page lifetime 44 days
Koehler (99, 04): 67% of URLs lost in 4 years
Lawrence et al. (01): 23%-53% of URLs in CiteSeer papers invalid over a 5-year span (3% of invalid URLs “unfindable”)
Spinellis (03): 27% of URLs in CACM/Computer papers gone in 5 years
Chan et al. (03): 11-year half-life for URLs in D-Lib Magazine articles
Nelson & Allen (02): 3% of objects in a digital library gone in 1 year
Corpora: Lawrence used CiteSeer (recently authored papers); Spinellis used Communications of the ACM & IEEE Computer; Chan et al. used D-Lib Magazine.
Examples: ECDL 1999 (“good enough” page available), PSP 2003 (exact copy at new URL), Greynet 99 (unavailable at any URL?)

Web Infrastructure: Refreshing & Migrating

Lexical Signatures
“Robust Hyperlinks Cost Just Five Words Each”, Phelps & Wilensky (2000)
http://www.cs.odu.edu/~tharriso/?lex-sig=terry+harrison+thesis+jcdl+awarded
“Analysis of Lexical Signatures for Improving Information Presence on the World Wide Web”, Park et al. (2004)

Lexical Signature                                      Calculation Technique   Results from Google
2004+terry+digital+harrison+2003                       TF Based                110000
modoai+netpreserve.org+mod_oai+heretrix+xmlsolutions   IDF Based               1
terry+harrison+thesis+jcdl+awarded                     TFIDF Based             6
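A minimal sketch in Python of the selection step behind a signature like those above, assuming term and document frequencies are already in hand (the function name and inputs are illustrative, not Opal's actual code):

    import math

    # term_counts: {term: occurrences in this document}
    # doc_freqs:   {term: number of corpus documents containing the term}
    # corpus_size: total number of documents in the corpus
    def lexical_signature(term_counts, doc_freqs, corpus_size, k=5):
        def tfidf(term):
            df = max(doc_freqs.get(term, 1), 1)   # unseen terms get df = 1
            return term_counts[term] * math.log(corpus_size / df)
        # keep the k highest-weighted terms, joined in the lex-sig URL style
        top = sorted(term_counts, key=tfidf, reverse=True)[:k]
        return "+".join(top)

A TF-only signature would sort by term_counts alone; the table above shows why that performs poorly (110000 Google hits) while TFIDF narrows the results to 6.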

OAI-PMH
Data Providers / Repositories: “A repository is a network accessible server that can process the 6 OAI-PMH requests … A repository is managed by a data provider to expose metadata to harvesters.”
Service Providers / Harvesters: “A harvester is a client application that issues OAI-PMH requests. A harvester is operated by a service provider as a means of collecting metadata from repositories.”
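Every OAI-PMH request is a plain HTTP GET with a verb parameter, so a minimal harvester needs nothing beyond the standard library. A sketch against a hypothetical endpoint (the verbs and the oai_dc prefix come from the OAI-PMH spec; the error handling a real harvester needs is omitted):

    import urllib.parse
    import urllib.request

    BASE_URL = "http://opal.foo.edu/oai"  # hypothetical repository endpoint

    def list_records(resumption_token=None):
        # ListRecords is one of the six OAI-PMH verbs; a resumptionToken
        # continues a partial list from a previous response
        params = {"verb": "ListRecords"}
        if resumption_token:
            params["resumptionToken"] = resumption_token
        else:
            params["metadataPrefix"] = "oai_dc"
        url = BASE_URL + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8")  # raw XML for the caller to parse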

OAI-PMH Aggregators
An aggregator sits between data providers (repositories) and service providers (harvesters). Aggregators allow for:
scalability for OAI-PMH data providers
load balancing
community building
discovery

Observations
One reason the original Phelps & Wilensky vision was never realized is that it required a priori LS calculation.
Idea: use the Web Infrastructure to calculate LSs as they are needed.
Mass adoption of a system will occur only if it is really, really easy to do.
Idea: digital preservation systems should require only a small number of “heroes”.

Description & Use Cases
Allow many web servers to use a few Opal servers, which use the caches of the Web Infrastructure to generate Lexical Signatures of recently-404 URLs, in order to find either:
the same page at a new URL (example: a bookmarked colleague is now 404; cached info is not useful, similar pages are probably not useful)
a “good enough” replacement page (example: a bookmarked recipe is now 404; cached info is useful, similar pages are probably useful)

Opal Configuration: “Configure Two Things”
1. Edit httpd.conf
2. Add / edit the custom 404 page
(a sketch of both follows)
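A sketch of what the two edits might look like; the page name and Opal host are placeholders, not from the talk (ErrorDocument is Apache's standard custom-error directive):

    # httpd.conf: serve a custom page for 404s
    ErrorDocument 404 /opal404.html

    <!-- opal404.html: the pagetag forwards the user, and the lost URL,
         to an Opal server -->
    <script type="text/javascript">
      window.location = "http://opal.foo.edu/opal?url=" +
                        encodeURIComponent(window.location.href);
    </script>

With ErrorDocument, the browser's address bar still shows the missing URL, which is why the pagetag can pass it along.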

Opal High-Level Architecture
1. Interactive user requests URL X from www.bar.org
2. www.bar.org returns the custom 404 page
3. Pagetag redirects the user to the Opal server (opal.foo.edu)
4. Opal searches WI caches and creates an LS
5. Opal gives the user navigation options

Locating Caches
http://www.google.com/search?hl=en&ie=ISO-8859-1&q=http://www.cs.odu.edu/~tharriso
http://search.yahoo.com/search?fr=FP-pull-web-t&ei=UTF8&p=http://www.cs.odu.edu/~tharriso
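A sketch of building those lookups for an arbitrary lost URL (the query parameters are copied from the slide and may no longer match the engines' current interfaces):

    import urllib.parse

    def cache_query_urls(lost_url):
        # both engines accept the URL itself as the query term
        q = urllib.parse.quote_plus(lost_url)
        return [
            "http://www.google.com/search?hl=en&ie=ISO-8859-1&q=" + q,
            "http://search.yahoo.com/search?fr=FP-pull-web-t&ei=UTF8&p=" + q,
        ]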

Internet Archive

WI Caches
Search engine caches hold pages for the last 7-51 days*
IA caches forever, but:
may not ever crawl you
~12 month latency
no internal backups
* Frank McCown, Joan A. Smith, Michael L. Nelson, Johan Bollen, “Reconstructing Websites for the Lazy Webmaster”, arXiv cs.IR/0512069, 2005. http://arxiv.org/abs/cs.IR/0512069

Term Frequency × Inverse Document Frequency
Calculating Term Frequency is easy: frequency of the term in this document.
Calculating Document Frequency is hard: frequency of the term in all documents assumes knowledge of the entire corpus!
“Good terms” appear frequently in a single document and infrequently across all documents.
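The standard weighting combines the two (a textbook formulation; the talk does not spell out Opal's exact variant):

    \mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \times \log\frac{N}{\mathrm{df}(t)}

where tf(t,d) counts occurrences of term t in document d, df(t) counts the corpus documents containing t, and N is the corpus size.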

Scraping Google to Approximate DF
Frequency of a term across all documents: approximate with the search engine's reported hit count for that term.
How many documents? Approximate the corpus size with the search engine's reported index size.
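A sketch of the approximation, with the caveat the talk itself raises (is scraping valid? is it nice?). The result-count regex and the index-size constant are guesses for illustration, not Opal's code:

    import math
    import re
    import urllib.parse
    import urllib.request

    INDEX_SIZE = 8_000_000_000  # assumed size of the engine's index, circa 2005

    def approximate_idf(term):
        url = "http://www.google.com/search?q=" + urllib.parse.quote_plus(term)
        html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        match = re.search(r"of about ([\d,]+)", html)  # scrape "Results ... of about N"
        df = int(match.group(1).replace(",", "")) if match else 1
        return math.log(INDEX_SIZE / max(df, 1))       # DF scaled by corpus size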

GUI - Bootstrapping

GUI - Learned

GUI (cont) Example similarURL record:

<url:similarURL datestamp="2005-05-13" votes="1"
                simURL="http://www.cs.odu.edu/~tharriso/"
                baseURL="http://invivo_test.com">
<![CDATA[
  <p class=g>
  <a href="javascript:popUp('demo_dev.pl?method=vote&url=http://www.cs.odu.edu/~tharriso&match=http://www.cs.odu.edu/~tharriso/')">
  <b>Terry</b> <b>Harrison</b> Profile Page</a><br>
  <font size=-1>Burning Man Images Other Images (not really well sorted, sorry!) Email <b>Terry</b> <b>...</b><br>
  (May 2003), AR Zipf Fellowship <b>Awarded</b> to <b>Terry</b> <b>Harrison</b> - Press Release <b>...</b><br>
  <font color=#008000>www.cs.odu.edu/~tharriso/ - 12k - </font></font>
]]>
</url:similarURL>

Opal Server Databases
URL database:
404 URL → (LS, similarURL1, similarURL2, …, similarURLN)
similarURL → (URL, datestamp, votes, Opal server)
Term database:
term → (Opal server, source, datestamp, DF, corpus size, IDF)
Define each URL and Term as OAI-PMH Records and we can harvest what an Opal server has “learned”:
- can accommodate late arrivers (no “cold start” for them)
- pool the learning of multiple servers
- incentives to cooperate
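A sketch of those two stores as in-memory structures; the field names follow the slide, everything else is illustrative:

    from dataclasses import dataclass, field

    @dataclass
    class SimilarURL:            # similarURL -> (URL, datestamp, votes, Opal server)
        url: str
        datestamp: str
        votes: int
        opal_server: str

    @dataclass
    class URLRecord:             # 404 URL -> (LS, similarURL1 ... similarURLN)
        lexical_signature: str
        similar_urls: list = field(default_factory=list)

    @dataclass
    class TermRecord:            # term -> (Opal server, source, datestamp, DF, corpus size, IDF)
        opal_server: str
        source: str              # e.g. which search engine the DF came from
        datestamp: str
        df: int
        corpus_size: int
        idf: float

    url_db = {}    # keyed by the 404 URL
    term_db = {}   # keyed by term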

Opal Synchronization
Group 1: Opal A, Opal B, Opal C, Opal D exchange Terms and URLs
Group 2: Opal D, Opal D.1, Opal D.2, Opal D.3 exchange Terms and URLs
* Opal D aggregates D.1-D.3 to Group 1
* Opal D aggregates A-C to Group 2
Other architectures possible
Harvesting frequency determined by individual nodes
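A sketch of what pooling the learning could mean at the data level, once a node has harvested a peer's Term records via OAI-PMH (the keep-the-freshest conflict policy is my assumption, not specified in the talk):

    def merge_terms(local_terms, harvested_terms):
        # local_terms / harvested_terms: {term: record dict with a "datestamp" key}
        for term, record in harvested_terms.items():
            if term not in local_terms or record["datestamp"] > local_terms[term]["datestamp"]:
                local_terms[term] = record   # adopt the fresher observation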

Discovery via OAI-PMH

Connection Costs
Cost_cache = (WI × N) + R
  WI = number of web infrastructure caches
  N = connections for each WI
  R = connection to get a datestamp
  Cost_cache = 3×1 + 1 = 4
Cost_paths = Rc + T + Rl
  Rc = connections to get a cached copy
  T = connections required for each term
  Rl = connections to use the LS
  Cost_paths = 1 + T + 1
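With the slide's numbers, the break-even point works out as follows (my arithmetic, not stated in the talk): Cost_paths = 1 + T + 1 = T + 2, so following term paths beats checking the three caches (Cost_cache = 4) only when T < 2; a five-word LS costs 5 + 2 = 7 connections.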

Analysis - Cumulative Terms Learned (simulation over 1 million terms and 30000 documents; result averages after 100 iterations)

Analysis - Terms Learned Per Document (simulation over 1 million terms and 30000 documents; result averages after 100 iterations)

Load Estimation

Future Work
Testing on departmental server: hard to test in-the-small
Code optimizations: many short cuts taken for the demo system (Google & Yahoo APIs not used; screen scraping only)
Lexical Signatures: describe changes over time
IDF calculation metrics: is scraping Google valid? is it nice?
Learning new code: use OAI-PMH to update the system
OpenURL resolver: 404 URL = referent

Conclusions
Lexical signatures can be generated just-in-time from WI caches as pages disappear
Many web servers can be easily configured to use a single Opal server
Multiple Opal servers can harvest each other to learn Terms and URLs more quickly