
1 Website Reconstruction using the Web Infrastructure
Frank McCown
http://www.cs.odu.edu/~fmccown/
Doctoral Consortium, June 11, 2006

2

3 Web Infrastructure

4 HTTP 404

5

6 Cost of Preservation
[figure: preservation approaches plotted by publisher's cost (time, equipment, knowledge), low to high, against coverage of the Web, split into client-view and server-view – browser cache, Furl/Spurl, iPROXY, TTApache, InfoMonitor, filesystem backups, LOCKSS, web archives, SE caches, Hanzo:web]

7 Research Questions
- How much digital preservation of websites is afforded by lazy preservation?
- Can we reconstruct entire websites from the WI?
- What factors contribute to the success of website reconstruction?
- Can we predict how much of a lost website can be recovered?
- How can the WI be utilized to provide preservation of server-side components?

8 Prior Work
Is website reconstruction from the WI feasible?
- Web repositories: Google, MSN, Yahoo, Internet Archive
- Web-repository crawler: Warrick
- Reconstructed 24 websites
How long do search engines keep cached content after it is removed?

9 Timeline of SE Resource Acquisition and Release
- Vulnerable resource – not yet cached (t_ca is not defined)
- Replicated resource – available on web server and in SE cache (t_ca < current time < t_r)
- Endangered resource – removed from web server but still cached (t_r < current time < t_cr)
- Unrecoverable resource – missing from web server and cache (t_cr < current time)
Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites, D-Lib Magazine, 12(2), February 2006.
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.
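As a worked illustration of these four states, a minimal sketch (my own helper, not from the talk) that classifies a resource from the three timestamps; each may be None if the event has not yet occurred:

    from enum import Enum

    class State(Enum):
        VULNERABLE = "vulnerable"        # on server, not yet cached
        REPLICATED = "replicated"        # on server and in SE cache
        ENDANGERED = "endangered"        # gone from server, still cached
        UNRECOVERABLE = "unrecoverable"  # gone from server and cache

    def classify(now, t_ca=None, t_r=None, t_cr=None):
        """t_ca: time cached; t_r: time removed from the server;
        t_cr: time removed from the cache; None = has not happened."""
        if t_cr is not None and now > t_cr:
            return State.UNRECOVERABLE
        if t_r is not None and now > t_r:
            return State.ENDANGERED
        if t_ca is not None and now > t_ca:
            return State.REPLICATED
        return State.VULNERABLE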

10

11

12 How Much Did We Reconstruct?
[diagram: a "lost" website with resources A, B, C, D, E, F next to the reconstructed website with A, B', C', G, E – the link to D is missing and points to old resource G, and F can't be found]

13 Reconstruction Diagram
[chart: identical 50%, changed 33%, missing 17% of the original site; added 20%]
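A minimal sketch (my own illustration; the URL-to-content-hash data model is assumed, not from the talk) of how these four categories can be computed by comparing the lost site to its reconstruction:

    def reconstruction_stats(original, reconstructed):
        """original, reconstructed: dicts mapping URL -> content hash.
        All ratios are relative to the size of the original site."""
        shared = set(original) & set(reconstructed)
        identical = sum(1 for u in shared if original[u] == reconstructed[u])
        changed = len(shared) - identical
        missing = len(set(original) - set(reconstructed))
        added = len(set(reconstructed) - set(original))
        n = len(original)  # assumes a non-empty original site
        return {"identical": identical / n, "changed": changed / n,
                "missing": missing / n, "added": added / n}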

14 Results
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.

15 Warrick Milestones
- www2006.org – first lost website reconstructed (Nov 2005)
- DCkickball.org – first website someone else reconstructed without our help (late Jan 2006)
- www.iclnet.org – first website we reconstructed for someone else (mid Mar 2006)
- Internet Archive officially "blesses" Warrick (mid Mar 2006) [1]
[1] http://frankmccown.blogspot.com/2006/03/warrick-is-gaining-traction.html

16 Proposed Work
How lazy can we afford to be?
- Find factors influencing the success of website reconstruction from the WI
- Perform search engine cache characterization
- Inject server-side components into the WI for complete website reconstruction
Improving the Warrick crawler
- Evaluate different crawling policies
- Development of a web-repository API for inclusion in Warrick

17 Factors Influencing Website Recoverability from the WI
Previous study did not find a statistically significant relationship between recoverability and website size or PageRank
Methodology:
- Sample a large number of websites from dmoz.org
- Perform several reconstructions over time using the same policy
- Download the sites several times over time to capture change rates

18 Evaluation
Use statistical analysis to test for the following factors: size, makeup, path depth, PageRank, change rate
Create a predictive model – how much of my lost website do I expect to get back? (see the sketch below)
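A minimal sketch of what such a predictive model could look like; the features, training data, and choice of an off-the-shelf linear regression are all my own assumptions, not part of the proposal:

    from sklearn.linear_model import LinearRegression

    # One row per reconstructed site: [size, path_depth, pagerank, change_rate]
    X = [[113, 2, 5, 0.10],
         [ 40, 1, 3, 0.45],
         [780, 4, 6, 0.02]]      # made-up training sites
    y = [0.90, 0.61, 0.83]       # fraction of each site recovered (made up)

    model = LinearRegression().fit(X, y)
    print(model.predict([[200, 3, 4, 0.20]]))  # expected recovery, new site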

19 SE Cache Characterization
Web characterization is an active field
Search engine caches have never been characterized
Methodology:
- Randomly sample URLs from four popular search engines: Google, MSN, Yahoo, Ask
- Access the cached version if present
- Download the live version from the Web
- Examine HTTP headers and page content
- Attempt to access various resource types (PDF, Word, PS, etc.) in each SE cache

20 Evaluation
- Compute the ratio of indexed to cached resources
- Find the types, sizes, and ages of cached resources
- Do the HTTP Cache-Control directives 'no-cache' and 'no-store' stop resources from being cached?
- Compare different SE caches
- How prevalent is the use of the NOARCHIVE meta tag in keeping HTML pages from being cached?
- How much of the Web is cached by SEs? What is the overlap with the Internet Archive?
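As one concrete measurement, a minimal sketch (my own, using the requests library; the tag matching is deliberately simplified) that checks which anti-caching signals a live page sends, via Cache-Control headers or a NOARCHIVE robots meta tag:

    import re
    import requests  # any HTTP client would do

    def caching_opt_outs(url):
        """Return the anti-caching signals found on a live page."""
        resp = requests.get(url, timeout=10)
        cache_control = resp.headers.get("Cache-Control", "").lower()
        directives = {d for d in ("no-cache", "no-store") if d in cache_control}
        # Simplified check for <meta name="robots" content="...noarchive...">;
        # assumes the name attribute precedes the content attribute.
        noarchive = bool(re.search(
            r'<meta[^>]+name=["\']?robots["\']?[^>]*noarchive',
            resp.text, re.I))
        return {"cache_control": directives, "noarchive": noarchive}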

21 Marshall TR Server – running EPrints

22 We can recover the missing page and PDF, but what about the services?

23 Recovery of Web Server Components
Recovering the client-side representation is not enough to reconstruct a dynamically-produced website
How can we inject the server-side functionality into the WI?
Web repositories like HTML:
- Canonical versions are stored by all web repos
- It is text-based
- Comments can be inserted without changing the appearance of a page

24 Injection Techniques
- Inject an entire server file into HTML comments
- Divide a server file into parts and insert the parts into HTML comments
- Use erasure codes to break a server file into chunks and insert the chunks into the HTML comments of different pages (a sketch follows this list)
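A minimal sketch of the chunk-injection idea; the "warrick-chunk" marker and both helpers are my own inventions, and a real system would use a proper erasure code (e.g. Reed–Solomon) so that any r of the n chunks suffice – here the file is simply split and base64-encoded:

    import base64

    def make_chunks(file_bytes, n):
        """Split a server-side file into at most n base64 chunks."""
        step = -(-len(file_bytes) // n)  # ceiling division
        return [base64.b64encode(file_bytes[i:i + step]).decode()
                for i in range(0, len(file_bytes), step)]

    def inject(html, chunk, index, total, filename):
        """Hide one chunk in an HTML comment; invisible when rendered."""
        comment = ("<!-- warrick-chunk file=%s part=%d/%d\n%s\n-->"
                   % (filename, index, total, chunk))
        return html.replace("</body>", comment + "\n</body>")

Recovery would then scan the cached copies of each page for these comments and reassemble the file once enough chunks have been found.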

25 Recover Server File from WI

26 Evaluation
Find the most efficient values for n and r (chunks created / chunks required for recovery; see the calculation below)
Security:
- Develop a simple mechanism for selecting files that can be injected into the WI
- Address encryption issues
Reconstruct an EPrints website with a few hundred resources
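One way to frame the n-versus-r trade-off (my own back-of-the-envelope, assuming each chunk-carrying page is recovered independently with probability p): the file comes back whenever at least r of the n pages are found, i.e. P = sum over i = r..n of C(n,i) * p^i * (1-p)^(n-i).

    from math import comb

    def p_recover(n, r, p):
        """P(file recovered) if each of n chunk pages survives with
        probability p and any r chunks reconstruct the file."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(r, n + 1))

    # e.g. p_recover(10, 5, 0.7) ~= 0.95 -- spreading chunks helps a lot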

27 Recent Work
URL canonicalization
Crawling policies:
- Naive policy
- Knowledgeable policy
- Exhaustive policy
Reconstructed 24 websites with each policy
Found that the exhaustive and knowledgeable policies are significantly more efficient at recovering websites
Frank McCown and Michael L. Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006, to appear.

28 Warrick API
The API should provide a clear and flexible interface for web repositories (see the sketch below)
Goals:
- Shield Warrick from changes to the WI
- Facilitate inclusion of new web repositories
- Minimize implementation and maintenance costs
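A minimal sketch of what such a repository abstraction could look like; the names and method signatures are my own, not Warrick's actual API:

    from abc import ABC, abstractmethod
    from typing import Optional

    class WebRepository(ABC):
        """One WI member (SE cache or web archive) behind a common interface."""

        @abstractmethod
        def list_urls(self, site: str) -> list[str]:
            """Lister query: which URLs from this site does the repo hold?"""

        @abstractmethod
        def get_resource(self, url: str) -> Optional[bytes]:
            """Return the repo's stored (canonical) copy of url, or None."""

    # Adding a new repository means implementing these two methods;
    # Warrick itself never needs to know repo-specific query syntax.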

29 Evaluation
- Internet Archive has endorsed the use of Warrick
- Make Warrick available on SourceForge
- Measure community adoption and modification

30 Risks and Threats
- Time for enough resources to be cached
- Search engine caching behavior may change at any time
- Repository antagonism: spam, cloaking

31 Timetable
[timeline figure]

32 Summary
When this work is completed, I will have…
- demonstrated and evaluated the lazy preservation technique
- provided a reference implementation
- characterized SE caching behavior
- provided a layer of abstraction on top of SE behavior (API)
- explored how much we can store in the WI (server-side vs. client-side representations)

33 Thank You. Questions?

