Factors Affecting Website Reconstruction from the Web Infrastructure Frank McCown, Norou Diawara, and Michael L. Nelson Old Dominion University Computer.

Presentation transcript:

Factors Affecting Website Reconstruction from the Web Infrastructure Frank McCown, Norou Diawara, and Michael L. Nelson Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007 Vancouver, BC June 20, 2007

2 Outline Web-repository crawling with Warrick How successful is a reconstruction? Reconstruction experiment Significant findings

3 Black hat: Virus image: Hard drive:

4 Crawling the Crawlers

7 Cached Image

8 Cached PDF: canonical version, MSN version, Yahoo version, Google version

9 Web Repository Characteristics
Type | MIME type | File ext | Google | Yahoo | Live | IA
HTML text | text/html | html | C | C | C | C
Plain text | text/plain | txt, ans | M | M | M | C
Graphics Interchange Format | image/gif | gif | M | M | M | C
Joint Photographic Experts Group | image/jpeg | jpg | M | M | M | C
Portable Network Graphics | image/png | png | M | M | M | C
Adobe Portable Document Format | application/pdf | pdf | M | M | M | C
JavaScript | application/javascript | js | M | M | | C
Microsoft Excel | application/vnd.ms-excel | xls | M | ~S | M | C
Microsoft PowerPoint | application/vnd.ms-powerpoint | ppt | M | M | M | C
Microsoft Word | application/msword | doc | M | M | M | C
PostScript | application/postscript | ps | M | ~S | | C
C: Canonical version is stored
M: Modified version is stored (modified images are thumbnails, all others are HTML conversions)
~S: Indexed but not stored
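The table lends itself to a simple lookup. The following is a minimal sketch, not Warrick's actual logic: it encodes only four rows of the table and assumes, purely for illustration, that a canonical copy (C) is preferred over a modified one (M) and that "indexed but not stored" (~S) entries cannot be used for recovery. The names `STORAGE`, `RANK`, and `usable_sources` are hypothetical.

```python
# Storage codes from the slide: C = canonical, M = modified,
# ~S = indexed but not stored.
REPOSITORIES = ("Google", "Yahoo", "Live", "IA")

# A subset of the table's rows, in REPOSITORIES column order.
STORAGE = {
    "text/html": ("C", "C", "C", "C"),
    "text/plain": ("M", "M", "M", "C"),
    "application/pdf": ("M", "M", "M", "C"),
    "application/vnd.ms-excel": ("M", "~S", "M", "C"),
}

# Assumed preference order (lower is better); ~S is deliberately absent.
RANK = {"C": 0, "M": 1}


def usable_sources(mime_type, available):
    """Return the repositories in `available` that hold a copy of this
    MIME type, canonical copies first, then modified copies."""
    codes = dict(zip(REPOSITORIES, STORAGE[mime_type]))
    candidates = [repo for repo in available if codes[repo] in RANK]
    # sorted() is stable, so ties keep the caller's ordering.
    return sorted(candidates, key=lambda repo: RANK[codes[repo]])


if __name__ == "__main__":
    print(usable_sources("application/vnd.ms-excel", ["Google", "Yahoo", "Live"]))
    print(usable_sources("application/pdf", ["Google", "IA"]))
```

For an Excel file this skips Yahoo, which indexes the file without storing it; for a PDF it ranks the Internet Archive's canonical copy ahead of Google's HTML conversion.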

10
McCown, et al., Brass: A Queueing Manager for Warrick, IWAW
McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM IEEE JCDL
McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT
McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM
Available at

12 How Much Did We Recover?
[Figure: a “Lost” web site (nodes A, B, C, D, E, F) beside the reconstructed web site (nodes A, B’, C’, G, E, F). Annotations: "Missing link to D; points to old resource G" and "F can’t be found".]
Four categories of recovered resources:
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G

13 Measuring the Difference
Apply a recovery vector (r_c, r_m, r_a) = (changed, missing, added) to each resource
Compute the difference vector for the website

14 Some Difference Vectors
D = (changed, missing, added)
(0,0,0) – Perfect recovery
(1,0,0) – All resources are recovered but changed
(0,1,0) – All resources are lost
(0,0,1) – All recovered resources are at new URIs
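The categories from slide 12 and the difference vector can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: resources are represented as a hypothetical mapping from URI to a content digest, and the normalization (changed and missing divided by the number of original resources, added divided by the number of recovered resources) is an assumption, since the slide's formula did not survive in the transcript.

```python
from fractions import Fraction


def classify(original, recovered):
    """Sort URIs into the four categories from slide 12.

    Both arguments map URI -> content digest (a hypothetical representation).
    """
    identical = {u for u in original if u in recovered and recovered[u] == original[u]}
    changed = {u for u in original if u in recovered and recovered[u] != original[u]}
    missing = set(original) - set(recovered)
    added = set(recovered) - set(original)
    return identical, changed, missing, added


def difference_vector(original, recovered):
    """Return (d_c, d_m, d_a) under the assumed normalization."""
    _, changed, missing, added = classify(original, recovered)
    n_orig, n_rec = len(original), len(recovered)
    d_c = Fraction(len(changed), n_orig) if n_orig else Fraction(0)
    d_m = Fraction(len(missing), n_orig) if n_orig else Fraction(0)
    d_a = Fraction(len(added), n_rec) if n_rec else Fraction(0)
    return d_c, d_m, d_a


# The slide 12 example: B and C changed, D and F missing, G added.
original = {"A": "a1", "B": "b1", "C": "c1", "D": "d1", "E": "e1", "F": "f1"}
recovered = {"A": "a1", "B": "b2", "C": "c2", "E": "e1", "G": "g1"}

if __name__ == "__main__":
    print(difference_vector(original, recovered))
    # (Fraction(1, 3), Fraction(1, 3), Fraction(1, 5))
```

With an empty reconstruction the sketch returns (0, 1, 0), matching the "all resources are lost" case on the slide.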

15 How Much Change is a Bad Thing? [Figure panels: Lost, Recovered]

16 How Much Change is a Bad Thing? [Figure panels: Lost, Recovered]

17 Assigning Penalties
Penalty adjustment (P_c, P_m, P_a)
Apply to each resource, or to the difference vector

18 Defining Success
success = 1 – d_m
Equivalent to percent of recovered resources
Scale: 0 (less successful) to 1 (more successful)
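As a worked instance, take the slide 12 example, where 2 of the 6 original resources (D and F) are missing, and assume that d_m is the fraction of original resources that could not be recovered:

```latex
\[
\text{success} = 1 - d_m = 1 - \frac{2}{6} = \frac{2}{3} \approx 0.67
\]
```

Under that assumption, about 67% of the example site counts as recovered, whether or not the recovered copies changed.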

19 Reconstruction Experiment 300 websites chosen randomly from Open Directory Project (dmoz.org) Crawled and reconstructed each website every week for 14 weeks Examined change rates, age, decay, growth, recoverability

20 Success of website recovery each week *On average, we recovered 61% of a website in any given week.

21 Recovery of Textual Resources

22 Recovery by TLD

23 Birth and Decay

24 Recovery of HTML Resources

25 Recovery by Age

26 Statistics for Repositories

27 Which Factors Are Significant?
External backlinks
Internal backlinks
Google’s PageRank
Hops from root page
Path depth
MIME type
Query string params
Age
Resource birth rate
TLD
Website size
Size of resources

28 Mild Correlations
Hops and
–website size (0.428)
–path depth (0.388)
Age and # of query params (-0.318)
External links and
–PageRank (0.339)
–Website size (0.301)
–Hops (0.320)
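The slides do not say which correlation coefficient was used. Assuming it is Pearson's r, the computation can be sketched as below. The `hops` and `path_depth` lists are made-up toy values for illustration only and are not data from the experiment.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and each variable's deviation norm.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


# Hypothetical toy values, one entry per resource (NOT experimental data).
hops = [0, 1, 1, 2, 3]
path_depth = [0, 1, 2, 2, 4]

if __name__ == "__main__":
    print(round(pearson(hops, path_depth), 3))
```

A coefficient near 0.3 to 0.4, as reported on the slide, indicates that the two variables tend to rise together but that one predicts the other only weakly.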

29 Regression Analysis No surprises: all variables are significant, but overall model only explains about half of the observations Three most significant variables: PageRank, hops and age (R-squared = )

30 Regression Parameter Estimates

31 Conclusions
Most of the sampled websites were relatively stable
–One third of the websites never lost a single resource
–Half of the websites never added any new resources
The typical website can expect to get back 61% of its resources if it were lost today (77% textual, 42% images and 32% other)
How to improve recovery from WI? Improve PageRank, decrease number of hops to resources, create stable URLs

32 Thank You Frank McCown Sorry, Dad… You lost me in the first two minutes.

33 Injecting Server Components into Crawlable Pages
[Figure labels: Erasure codes; HTML pages; Recover at least m blocks]
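The slide does not say which erasure code is used. As a minimal sketch of the "recover at least m blocks" idea, the example below uses the simplest possible scheme, a single XOR parity block (n = 3 blocks, any m = 2 suffice). The function names and the sample payload are hypothetical, and a practical system would likely use a stronger code with larger n and m.

```python
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes) -> List[bytes]:
    """Split data into two equal blocks and append one XOR parity block."""
    if len(data) % 2:
        data += b"\0"  # pad to an even length; decode() trims it again
    half = len(data) // 2
    first, second = data[:half], data[half:]
    return [first, second, xor_bytes(first, second)]


def decode(blocks: List[Optional[bytes]], length: int) -> bytes:
    """Rebuild the original data from any two of the three blocks.

    A lost block is passed as None; `length` is the original data length.
    """
    if sum(block is None for block in blocks) > 1:
        raise ValueError("need at least 2 of the 3 blocks")
    first, second, parity = blocks
    if first is None:
        first = xor_bytes(second, parity)
    elif second is None:
        second = xor_bytes(first, parity)
    return (first + second)[:length]


if __name__ == "__main__":
    payload = b"config=1;db=site"  # stand-in for a server-side component
    blocks = encode(payload)
    blocks[0] = None  # simulate one block that was never recovered
    print(decode(blocks, len(payload)) == payload)  # True
```

In the setting of the slide, each block would be embedded in a different crawlable HTML page, so the server component survives as long as enough of those pages can be recovered.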

34 [Figure labels: Database; Perl script; config; Static files (HTML files, PDFs, images, style sheets, JavaScript, etc.); Web Infrastructure; Web Server; Dynamic page. Legend: Recoverable, Not Recoverable]