A large-scale study of the evolution of Web pages. D. Fetterly, M. Manasse, M. Najork and L. Wiener. SPE Vol. 34 No. 2, pages 213-237, Feb. 2004. Apr. 18, 2006.

Presentation transcript:

A large-scale study of the evolution of Web pages
D. Fetterly, M. Manasse, M. Najork and L. Wiener
SPE Vol. 34 No. 2, pages 213-237, Feb. 2004
Apr. 18, 2006
So Jeong Han

2 Content
Results
Conclusions

3 Results(1) – Document Size Analysis
Fig 2. Document size (bytes) versus top-level domain.
HTTP status code of 200.
x: document size (2^(x-1) ~ 2^x bytes); x = 14 with a standard deviation of 1.

Domain   Description
all      4 ~ 32KB (66.5%)
.com     52.5% of all
.org     8% of all
.gov     1.1% of all
.edu     2 ~ 16KB (64.9%)
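The size axis of Fig 2 can be illustrated with a short sketch. It assumes the bucket x covers sizes from 2^(x-1) bytes (inclusive) up to 2^x bytes (exclusive); the slide does not state how the boundaries are handled, and the function name `size_bucket` is hypothetical.

```python
def size_bucket(num_bytes: int) -> int:
    """Return x such that 2**(x-1) <= num_bytes < 2**x (0 for an empty document).

    Boundary handling is an assumption; the slide only gives the range 2^(x-1) ~ 2^x.
    """
    if num_bytes < 0:
        raise ValueError("document size cannot be negative")
    # int.bit_length() is exactly the smallest x with num_bytes < 2**x.
    return num_bytes.bit_length()


if __name__ == "__main__":
    for size in (2048, 4096, 16384, 32767):
        print(size, "bytes -> bucket", size_bucket(size))
```

Under this convention the 4 ~ 32KB range quoted for all domains corresponds to buckets 13 through 15, that is, 14 plus or minus 1.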

4 Results(2) – Document Size Analysis
Fig 3. Document size (words) versus top-level domain.

5 Results(3) – Status Code Analysis
Fig 4. Distribution of HTTP status codes over crawl generations: for each crawl generation, the percentage of page retrievals resulting in each category of status code.
The y-axis starts at 85%.
A URL's lifetime is limited.
Objective: why do downloads fail?

Category    Description
200         Successful (OK)
3xx         Redirection (moved)
4xx         Client errors
other       500 status code (server error)
RobotExcl   Blocked
network     Network-related error (DNS, TCP)

Heisenberg effect
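The category table of Fig 4 can be read as a small classification rule. The sketch below is only an illustration: the function name `categorize`, its arguments, and the choice to map every remaining status code to "other" are assumptions, not the crawler's actual logic.

```python
from typing import Optional


def categorize(status: Optional[int], robot_excluded: bool = False) -> str:
    """Map one retrieval attempt to a Fig 4 category.

    status is the HTTP status code, or None when no HTTP response was
    received (DNS or TCP failure). Both conventions are assumptions.
    """
    if robot_excluded:
        return "RobotExcl"   # fetch blocked by robots exclusion
    if status is None:
        return "network"     # network-related error (DNS, TCP)
    if status == 200:
        return "200"
    if 300 <= status < 400:
        return "3xx"
    if 400 <= status < 500:
        return "4xx"
    return "other"           # e.g. 500 server errors


if __name__ == "__main__":
    for attempt in (200, 301, 404, 500, None):
        print(attempt, "->", categorize(attempt))
```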

6 Results(4) – TLD Analysis
Fig 5. Download success per top-level domain.
Pages in .jp, .de and .edu are available.
The decline in the curves bears out the limited lifetime of Web pages.

7 Results(5) – TLD Analysis
Fig 6. Last successfully downloaded Web page per TLD.
Shows the lifetime of URLs from different domains.
Pages in .cn expire sooner than average.
Figure labels: downloaded during the final crawl; unreachable after crawl 1.

8 Results(6) – Change Analysis
Fig 7. Distribution of change.
Fig 8. Scaled to show the low-percentage categories.
Cumulative percentage distribution.

9 Results(7)
Fig 9. Type of markup change.
The markup of 1,468,671 pages changed.
By observing link evolution of this type, a crawler could recognize session IDs and remove them, avoiding recrawling the same content.
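As an illustration of removing session IDs from links, here is a minimal sketch using Python's standard `urllib.parse`. The parameter names in `SESSION_PARAMS` are hypothetical examples; the slides do not say how session IDs would be recognized, and a real crawler would have to learn them from observed link evolution.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical query-parameter names treated as session IDs (an assumption).
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_id(url: str) -> str:
    """Return url with session-ID query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    print(strip_session_id("http://example.com/page?x=1&sid=abc123"))
```

Two URLs that differ only in their session ID then normalize to the same string, so the page behind them needs to be fetched only once.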

10 Results(8)
Fig 10. Type of markup change, normalized by hosts.
Were the results in Fig 9 influenced by URL or by host?
Each type of change is counted per host.
The share of changes attributed to URL queries shrinks relative to Fig 9 (48% -> 4%).
Such changes tend to appear many times on the same page: when session IDs are embedded in link URLs, they tend to appear in all relative links on the page.

11 Results(9)
Fig 11. Breakdown of tags that changed: changes that were additions or deletions of tags.

Tag   Description
!     Comment (23.1%)
A     90% of A are ‘HREF’ (48.1%)
IMG   52% of IMG are ‘SRC’

12 Results(10) – TLD, Change Analysis
Fig 12. Clustered rates of change by TLD.
Each bar is divided into six regions, one per change cluster.
Figure labels: complete change cluster; no change cluster.

Domain   Description
.com     Changes more frequently than .gov and .edu
.de      Higher rate and degree of change than others

The .de domain problem:
Pages are automatically generated (‘stuffing’).
Distinct host names are used as a front to a single server.
This tricks link-based ranking algorithms
and draws visitors to the adult Web site.

13 Results(11) – TLD, Change Analysis
Solution:
Resolve the symbolic host names of all the URLs in our data set.
Single out each IP address with more than 1,000 symbolic host names mapping to it.
Eliminated about 60%.
Fig 13. Clustered rates of change by TLD after excluding automatically generated keyword-spam documents.
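The filtering step can be sketched as a grouping of host names by IP address. The function name `suspicious_ips` and the dictionary input are assumptions made for illustration, and how the host names are resolved to addresses is left out.

```python
from collections import defaultdict
from typing import Dict, Set


def suspicious_ips(host_to_ip: Dict[str, str], threshold: int = 1000) -> Set[str]:
    """Return the IP addresses that more than `threshold` distinct host names map to."""
    hosts_per_ip = defaultdict(set)
    for host, ip in host_to_ip.items():
        hosts_per_ip[ip].add(host)
    return {ip for ip, hosts in hosts_per_ip.items() if len(hosts) > threshold}


if __name__ == "__main__":
    # Toy data with a tiny threshold so the effect is visible.
    mapping = {
        "a.example.de": "10.0.0.1",
        "b.example.de": "10.0.0.1",
        "c.example.de": "10.0.0.1",
        "www.example.edu": "10.0.0.2",
    }
    print(suspicious_ips(mapping, threshold=2))
```

Documents served from the flagged addresses would then be excluded before recomputing the change clusters.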

14 Results(12) – TLD, Change Analysis
Fig 14. Clustered rates of change by TLD, omitting the no change cluster, after excluding automatically generated keyword-spam documents.
Conclusion:
Adult content continues to skew our results.
The shingling technique might not be well adapted to writing systems such as Chinese or Kanji, which do not employ inter-word spacing.
The extent of change is quite consistent with other TLDs.

15 Results(13) – Change Analysis
Fig 15. Scaled to show the low-percentage categories, after excluding automatically generated keyword-spam documents.
Compared with: Fig 8, Fig 12.
Bucket 0: cut in half.
The right-hand side is monotonic.
Next, consider whether the length of pages impacts their rate of change.

16 Results(14)
Fig 16. Clustered rates of change by document size (bytes).
Document size is strongly related to rate of change.
Small documents are mostly to change
Large documents (32KB and above) change much more frequently than smaller ones (4KB and below).

17 Results(15)
Fig 17. Clustered rates of change by the number of words per document.
The sensitivity of our shingling technique depends on the number of words in a document.
The ‘all-or-nothing’ similarity metric gives a relatively coarse measure.
Large documents are more likely to change than smaller ones.
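To make the role of word count concrete, here is a generic word-level shingling sketch. It is not the feature-vector scheme used in the study: the shingle width `w`, whitespace tokenization, and the Jaccard-style `resemblance` function are all assumptions chosen for illustration.

```python
from typing import Set, Tuple


def shingles(text: str, w: int = 5) -> Set[Tuple[str, ...]]:
    """Return the set of w-word shingles of text (words split on whitespace)."""
    words = text.split()
    if not words:
        return set()
    if len(words) < w:
        return {tuple(words)}  # a very short document yields a single shingle
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a: str, b: str, w: int = 5) -> float:
    """Fraction of shingles shared by a and b (1.0 means no detected change)."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    old = "the quick brown fox jumps over the lazy dog"
    new = "the quick brown fox leaps over the lazy dog"
    print(resemblance(old, new, w=3))
```

A document with few words has few shingles, so a single edited word alters a large share of them; and because tokenization relies on whitespace, text without inter-word spacing collapses into very few tokens.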

18 Results(16)
Fig 18. Clustered rates of change by the number of words per document, omitting the no change cluster.

19 Results(17)
Fig 19. Clustered rates of change by top-level domain and number of words per document.
.com & .net: stronger effect for larger documents (than .gov, .edu)
-- Commercial Web sites: appearance of freshness
-- Educational & governmental Web sites: archival purpose

20 Results(18)
Fig 20. Distribution of the standard deviations of the rate of change in a given document over its lifetime.

21 Results(19)
Plate 1. Logarithmic histogram of intra-document changes over three successive weeks, showing the absolute number of changes.
Axes: the number of pre-images in a document unchanged from week n-1 to n, and the number unchanged from week n to n+1.
(85,85): Web pages don't change much over a 3-week interval; times higher than any other feature.

22 Results(20)
Plate 2. Logarithmic histogram of intra-document changes over three successive weeks, normalized to show the conditional probabilities of changes.
This indicates once again that past change is a strong predictor of future change.
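The normalization behind Plate 2 can be sketched as turning pair counts into conditional probabilities. The sketch assumes the histogram is keyed by (unchanged pre-images from week n-1 to n, unchanged pre-images from week n to n+1) and that each count is divided by the total for its first component; the direction of the normalization and the name `conditional_probabilities` are assumptions.

```python
from collections import defaultdict
from typing import Dict, Tuple


def conditional_probabilities(
    counts: Dict[Tuple[int, int], int]
) -> Dict[Tuple[int, int], float]:
    """Convert counts[(prev, nxt)] into estimates of P(nxt | prev)."""
    totals = defaultdict(int)
    for (prev, _nxt), n in counts.items():
        totals[prev] += n
    return {
        (prev, nxt): n / totals[prev]
        for (prev, nxt), n in counts.items()
        if totals[prev] > 0
    }


if __name__ == "__main__":
    # Toy counts, not data from the study.
    toy = {(84, 84): 3, (84, 80): 1, (10, 10): 2}
    for key, p in sorted(conditional_probabilities(toy).items()):
        print(key, round(p, 2))
```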

23 Conclusion(1)
Purpose: measuring the rate and degree of Web page change.
Method:
Crawled 151 million pages once a week for 11 weeks,
saving salient information about each downloaded document,
including a feature vector of the text without markup,
plus the full text of 0.1% of all downloaded pages.

24 Conclusion(2)
Conclusion (we found...):
Web pages change mostly in their markup or in trivial ways.
Relation with TLD: strong for the frequency of change of a document, weaker for the degree of change.
Document size relates to both frequency and degree of change (.com & .net): large documents change more often and more extensively.
Past change predicts future change -> implications for web crawlers.
‘German anomaly’: a fast-changing page is not necessarily worth recrawling.
Fin.