Website Reconstruction using the Web Infrastructure
Frank McCown
Doctoral Consortium, June 11, 2006

Web Infrastructure

4 HTTP 404

6 Cost of Preservation
[Figure: preservation techniques plotted by publisher's cost (time, equipment, knowledge) against coverage of the Web, spanning client-view to server-view: browser cache, Furl/Spurl, Hanzo:web, SE caches, web archives, iPROXY, TTApache, InfoMonitor, filesystem backups, LOCKSS]

7 Research Questions
- How much digital preservation of websites is afforded by lazy preservation?
- Can we reconstruct entire websites from the WI?
- What factors contribute to the success of website reconstruction?
- Can we predict how much of a lost website can be recovered?
- How can the WI be utilized to provide preservation of server-side components?

8 Prior Work
- Is website reconstruction from the WI feasible?
  - Web repositories: Google, MSN, Yahoo, Internet Archive (G, M, Y, IA)
  - Web-repository crawler: Warrick
  - Reconstructed 24 websites
- How long do search engines keep cached content after it is removed?

9 Timeline of SE Resource Acquisition and Release
- Vulnerable resource: not yet cached (t_ca is not defined)
- Replicated resource: available on the web server and in the SE cache (t_ca < current time < t_r)
- Endangered resource: removed from the web server but still cached (t_r < current time < t_cr)
- Unrecoverable resource: missing from both the web server and the SE cache (t_ca < t_cr < current time)

Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites, D-Lib Magazine, 12(2), February 2006.
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/, 2005.
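These four states follow mechanically from the three timestamps. Below is a minimal sketch of the classification; the function and names are my own illustration, not something from the talk:

```python
from enum import Enum

class ResourceState(Enum):
    VULNERABLE = "vulnerable"        # live, but not yet cached
    REPLICATED = "replicated"        # live and cached
    ENDANGERED = "endangered"        # removed, but still cached
    UNRECOVERABLE = "unrecoverable"  # removed and evicted from cache

def classify(now, t_ca=None, t_r=None, t_cr=None):
    """Classify a resource from the time it was cached (t_ca),
    removed from the web server (t_r), and removed from the SE
    cache (t_cr); None means the event has not happened."""
    if t_ca is None or t_ca > now:
        return ResourceState.VULNERABLE
    if t_cr is not None and t_cr <= now:
        return ResourceState.UNRECOVERABLE
    if t_r is not None and t_r <= now:
        return ResourceState.ENDANGERED
    return ResourceState.REPLICATED
```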

12 How Much Did We Reconstruct?
[Diagram: a "lost" website with pages A, B, C, D, E, F beside its reconstruction containing A, B', C', G, E. The link to D is missing and instead points to old resource G, and F can't be found.]

13 Reconstruction Diagram
[Figure: breakdown of a reconstructed site — identical 50%, changed 33%, missing 17%, added 20%]
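For concreteness, the percentages in the diagram can be computed by comparing the lost site's resources with the reconstruction. This helper is an illustrative sketch; representing each site as a URL-to-content-hash mapping is my assumption, not Warrick's actual bookkeeping:

```python
def reconstruction_stats(original, reconstructed):
    """Summarize a reconstruction. Both arguments map URL -> content
    hash. 'changed' counts recovered resources whose content differs
    from the original; 'added' counts recovered resources that were
    not part of the lost site (e.g., old resource G above)."""
    identical = sum(1 for u, h in original.items()
                    if reconstructed.get(u) == h)
    changed = sum(1 for u in original
                  if u in reconstructed and reconstructed[u] != original[u])
    missing = len(original) - identical - changed
    added = sum(1 for u in reconstructed if u not in original)
    n = len(original)
    return {"identical": identical / n, "changed": changed / n,
            "missing": missing / n, "added": added / n}
```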

Results
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/, 2005.

15 Warrick Milestones
- www2006.org – first lost website reconstructed (Nov 2005)
- DCkickball.org – first website someone else reconstructed without our help (late Jan 2006)
- First website we reconstructed for someone else (mid Mar 2006)
- Internet Archive officially "blesses" Warrick (mid Mar 2006)

16 Proposed Work
- How lazy can we afford to be?
  - Find factors influencing the success of website reconstruction from the WI
  - Perform search engine cache characterization
  - Inject server-side components into the WI for complete website reconstruction
- Improving the Warrick crawler
  - Evaluate different crawling policies
  - Develop a web-repository API for inclusion in Warrick

17 Factors Influencing Website Recoverability from the WI
- A previous study did not find a statistically significant relationship between recoverability and website size or PageRank
- Methodology:
  - Sample a large number of websites from dmoz.org
  - Perform several reconstructions over time using the same policy
  - Download the sites several times over time to capture change rates

18 Evaluation
- Use statistical analysis to test the following factors: size, makeup, path depth, PageRank, change rate
- Create a predictive model – how much of my lost website can I expect to get back? (a sketch follows)
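One plausible form for the predictive model is a least-squares regression of recovered fraction on the factors above. The sketch below uses numpy with made-up illustrative numbers, since building and validating the real model is the proposed work:

```python
import numpy as np

# Hypothetical training data: one row per reconstructed website.
# Columns: size (pages), path depth, PageRank, change rate, HTML makeup.
X = np.array([[120, 3, 4.0, 0.10, 0.80],
              [450, 5, 2.0, 0.35, 0.55],
              [ 60, 2, 5.0, 0.05, 0.90],
              [800, 6, 3.0, 0.50, 0.40]])
y = np.array([0.85, 0.40, 0.95, 0.30])  # fraction of each site recovered

# Fit y ~ Xb + c by ordinary least squares (intercept via a ones column).
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_recovery(factors):
    """Predict the recoverable fraction of a lost website."""
    return float(np.append(factors, 1.0) @ coef)

print(predict_recovery([200, 4, 3.5, 0.20, 0.70]))
```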

19 SE Cache Characterization
- Web characterization is an active field
- Search engine caches have never been characterized
- Methodology:
  - Randomly sample URLs from four popular search engines: Google, MSN, Yahoo, Ask
  - Access the cached version if present
  - Download the live version from the Web
  - Examine HTTP headers and page content
  - Attempt to access various resource types (PDF, Word, PS, etc.) in each SE cache

20 Evaluation
- Compute the ratio of indexed to cached resources
- Find the types, sizes, and ages of cached resources
- Do the HTTP Cache-Control directives 'no-cache' and 'no-store' stop resources from being cached? (see the sketch below)
- Compare the different SE caches
- How prevalent is the use of NOARCHIVE meta tags to keep HTML pages from being cached?
- How much of the Web is cached by SEs? What is the overlap with the Internet Archive?
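A minimal sketch of how the caching signals for each sampled URL could be collected, assuming Python and the requests library (the study's actual tooling is not stated in the slides):

```python
import re
import requests

def caching_signals(url):
    """Fetch the live page and report what a search engine sees
    before deciding whether to cache it: the HTTP Cache-Control
    directives and any robots NOARCHIVE meta tag."""
    resp = requests.get(url, timeout=30)
    cache_control = resp.headers.get("Cache-Control", "").lower()
    noarchive = any(
        "robots" in tag and "noarchive" in tag
        for tag in re.findall(r"<meta[^>]*>", resp.text.lower())
    )
    return {"no-cache": "no-cache" in cache_control,
            "no-store": "no-store" in cache_control,
            "noarchive": noarchive}
```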

Marshall TR Server – running EPrints

We can recover the missing page and PDF, but what about the services?

23 Recovery of Web Server Components
- Recovering the client-side representation is not enough to reconstruct a dynamically produced website
- How can we inject the server-side functionality into the WI?
- Web repositories like HTML:
  - Canonical versions are stored by all web repos
  - It is text-based
  - Comments can be inserted without changing the appearance of the page

24 Injection Techniques
- Inject the entire server file into HTML comments
- Divide the server file into parts and insert the parts into HTML comments
- Use erasure codes to break a server file into chunks and insert the chunks into the HTML comments of different pages (a sketch of the chunking idea follows)
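To make the chunking idea concrete, here is a sketch using simple splitting rather than a true erasure code (an erasure code would let any r of the n chunks suffice). The comment marker and format are hypothetical, not an actual Warrick convention:

```python
import base64
import re

MARK = "SERVER-FILE"  # hypothetical marker
COMMENT = re.compile(rf"<!-- {MARK} (\S+) (\d+)/(\d+) ([A-Za-z0-9+/=]*) -->")

def make_chunks(name, data, n):
    """Base64-encode a server-side file and split it into n chunks,
    each wrapped in an HTML comment recording its position, so the
    chunks can be hidden in n different crawlable pages."""
    encoded = base64.b64encode(data).decode("ascii")
    step = -(-len(encoded) // n) or 1  # ceiling division
    return [f"<!-- {MARK} {name} {i}/{n} {encoded[i*step:(i+1)*step]} -->"
            for i in range(n)]

def recover(pages):
    """Reassemble injected files from comments in recovered pages."""
    parts = {}
    for html in pages:
        for name, i, _n, payload in COMMENT.findall(html):
            parts.setdefault(name, {})[int(i)] = payload
    return {name: base64.b64decode("".join(p[i] for i in sorted(p)))
            for name, p in parts.items()}

# Round trip: embed config.php across three pages, then recover it.
pages = make_chunks("config.php", b"<?php $db = 'secret'; ?>", 3)
assert recover(pages)["config.php"] == b"<?php $db = 'secret'; ?>"
```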

25 Recover Server File from WI

26 Evaluation
- Find the most efficient values for n and r (chunks created/recovered)
- Security:
  - Develop a simple mechanism for selecting files that can be injected into the WI
  - Address encryption issues
- Reconstruct an EPrints website with a few hundred resources

Recent Work
- URL canonicalization
- Crawling policies (compared in the sketch below):
  - Naïve policy
  - Knowledgeable policy
  - Exhaustive policy
- Reconstructed 24 websites with each policy
- Found that the exhaustive and knowledgeable policies are significantly more efficient at recovering websites

Frank McCown and Michael L. Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006, to appear.
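One plausible reading of how the three policies differ — in how repository queries are spent per URL — is sketched below; the repository interface is hypothetical, and the precise definitions are in the HYPERTEXT 2006 paper:

```python
def naive(url, repos):
    """Ask every repository in a fixed order; keep the first hit."""
    for repo in repos:
        copy = repo.get_cached(url)
        if copy:
            return copy
    return None

def knowledgeable(url, repos, holdings):
    """Use earlier lister-query results (holdings: repo -> URL set)
    to ask only repositories known to hold the URL."""
    for repo in repos:
        if url in holdings[repo]:
            return repo.get_cached(url)
    return None

def exhaustive(url, repos):
    """Ask every repository and keep the best copy, e.g. by format
    fidelity or crawl recency (scored by a hypothetical key)."""
    copies = [c for c in (r.get_cached(url) for r in repos) if c]
    return max(copies, key=lambda c: c.score, default=None)
```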

28 Warrick API
- The API should provide a clear and flexible interface for web repositories (sketched below)
- Goals:
  - Shield Warrick from changes to the WI
  - Facilitate the inclusion of new web repositories
  - Minimize implementation and maintenance costs
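A minimal sketch of what such an interface might look like; the class and method names are my own, not the actual Warrick API:

```python
from abc import ABC, abstractmethod

class WebRepository(ABC):
    """One adapter per repository (Google, MSN, Yahoo, IA, ...).
    Warrick's core codes against this interface only, so a repository
    changing its query URLs or result format means updating one
    adapter rather than the crawler itself."""

    @abstractmethod
    def list_urls(self, site):
        """Lister query: return the URLs this repository holds for a site."""

    @abstractmethod
    def get_cached(self, url):
        """Return (content, cache_date) for the URL, or None if not held."""

class InternetArchiveRepository(WebRepository):
    def list_urls(self, site):
        raise NotImplementedError  # e.g., query the Wayback Machine index

    def get_cached(self, url):
        raise NotImplementedError  # e.g., fetch the newest archived snapshot
```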

29 Evaluation
- The Internet Archive has endorsed the use of Warrick
- Make Warrick available on SourceForge
- Measure community adoption and modification

30 Risks and Threats
- Time required for enough resources to be cached
- Search engine caching behavior may change at any time
- Repository antagonism
  - Spam
  - Cloaking

Timetable
[Figure: project timeline]

32 Summary
When this work is completed, I will have…
- demonstrated and evaluated the lazy preservation technique
- provided a reference implementation
- characterized SE caching behavior
- provided a layer of abstraction on top of SE behavior (API)
- explored how much we can store in the WI (server-side vs. client-side representations)

33 Thank You Questions?