Published by Edwina Hutchinson. Modified over 9 years ago.
1
Evolution of Web from a Search Engine Perspective
Saket Singam
sks2141@columbia.edu
2
Introduction
Large and diverse growth of the Web => the search engine is becoming the “Killer Application”
Search engines typically “crawl” web pages in advance
Discussion
1) What’s new on the Web?
  New pages are created at a rate of 8% per week
  Only 20% of today's web pages are still accessible after 1 year
  New pages borrow content from existing pages: only 62% of the content in these pages is new
  After 1 year, 50% of the Web has new content
2) How much change?
  Once a page is created, it is likely to go through either a minor change or no change
3) Can we predict future changes?
  Frequency of changes
  Degree of change
3
Experimental Setup
Download of pages (over almost a year)
  Pages from 154 “popular” web sites
  Downloaded weekly in breadth-first order, starting from the root page of each web site, until all reachable pages or a maximum of 200,000 pages per site were downloaded
  Total number of pages in each weekly download = 3-5 million (avg 4.4 million)
  Size = 65 GB per week before compression
  Total of 3.3 TB of web history data and 4 TB of derived data (links, shingles)
  Table: “Fraction of pages included in this Experiment”
Selection of sites
  “Representative” as well as “interesting” samples of the Web
  About 5 top-ranked pages from a subset of the topical categories of the Google Directory
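The breadth-first download with a page cap described on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: `crawl_site`, `get_links` and the toy link graph are hypothetical names and data, and the real crawler would fetch pages over HTTP rather than read an in-memory dictionary.

```python
from collections import deque

def crawl_site(root, get_links, max_pages=200_000):
    """Return URLs reachable from `root` in breadth-first order.

    Stops when every reachable page has been visited or when
    `max_pages` pages have been collected, whichever comes first.
    `get_links` is a hypothetical callback returning a page's outlinks.
    """
    seen = {root}          # URLs already queued, to avoid revisiting
    order = []             # URLs in the order they were "downloaded"
    queue = deque([root])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Toy site (an assumption for illustration): URL -> outlinks.
toy_site = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/"],
    "/b": ["/c", "/d"],
    "/c": [],
    "/d": ["/a"],
}

def toy_links(url):
    return toy_site.get(url, [])

full_crawl = crawl_site("/", toy_links)
capped_crawl = crawl_site("/", toy_links, max_pages=3)
```

Breadth-first order visits pages close to the root first, so when the cap is reached the pages left out are the ones deepest in the site.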
4
What’s New on the Web? – Pages, Content and Links
Birth, death and replacement
  How many new pages are created, disappear and are replaced?
  Crawling in slow mode over a period of 39 weeks
  20% survival rate of web pages
Weekly birth rate of pages
  How many new pages are created per week?
  Identity: the URL of the page
  Average weekly birth rate is 8%
  About once every month, the number of new pages is higher than in the previous week
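One plausible way to compute these quantities from weekly URL snapshots is sketched below. It assumes that a page's identity is its URL and that the birth rate is the fraction of a week's URLs never seen in any earlier snapshot; the function names and the tiny snapshots are made up for illustration and are not the study's data.

```python
def weekly_birth_rates(snapshots):
    """For each week after the first, return the fraction of that
    week's URLs that did not appear in any earlier snapshot."""
    seen = set()
    rates = []
    for week, urls in enumerate(snapshots):
        urls = set(urls)
        if week > 0:
            new_urls = urls - seen
            rates.append(len(new_urls) / len(urls) if urls else 0.0)
        seen |= urls
    return rates

def survival_fraction(snapshots):
    """Fraction of first-week URLs still present in the last week."""
    first, last = set(snapshots[0]), set(snapshots[-1])
    return len(first & last) / len(first) if first else 0.0

# Hypothetical three-week history of one small site.
history = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "b", "e", "f", "g"},
]
rates = weekly_birth_rates(history)
survival = survival_fraction(history)
```

In this toy history one of four URLs is new in the second week and two of five in the third, while half of the original URLs survive to the last snapshot.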
5
Creation of New Content
How much new content is present?
  Shingling technique used
  w-shingle: a contiguous ordered subsequence of “w” words
  New shingles are created at a slower rate than new pages
  New shingles @ 5% per week => 62% of the content of new URLs is new
Link-Structure Evolution
  Search engines should efficiently capture the link structure
  Significantly dynamic structure
  New links are created @ 25% per week, compared to 8% for new pages and 5% for new content
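The shingling idea can be illustrated with a short sketch: split a page into words, take every run of w consecutive words as one shingle, and count how many shingles of a new page were never seen before. The tokenization (lowercasing and whitespace splitting), the default `w=4` and the example sentences are assumptions made for this illustration; the slides do not state the settings actually used.

```python
def shingles(text, w=4):
    """Return the set of w-shingles (tuples of w consecutive words)."""
    words = text.lower().split()
    if len(words) < w:
        # A page shorter than w words yields one short shingle, or none.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def new_shingle_fraction(old_texts, new_text, w=4):
    """Fraction of the new page's shingles absent from all old pages."""
    known = set()
    for text in old_texts:
        known |= shingles(text, w)
    fresh = shingles(new_text, w)
    if not fresh:
        return 0.0
    return len(fresh - known) / len(fresh)

old_pages = ["the quick brown fox jumps over the lazy dog"]
new_page = "the quick brown fox sleeps under the lazy dog"
fraction_new = new_shingle_fraction(old_pages, new_page, w=3)
```

Here 4 of the 7 three-word shingles in the new page are unseen, so a little over half of its content counts as new. The slide's 62% figure appears to follow the same logic at Web scale: 5% new shingles per week against 8% new pages per week gives 5/8 = 0.625.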
6
Changes in the Existing Pages
Change frequency distribution (presence of change)
  How often is a web page “altered”?
  Most pages change either very frequently or very infrequently
Degree of change (SEO)
  Metrics: TF.IDF, Word Distance
  Exact order of terms is ignored
  Minor changes such as advertisements, counters, etc. are detected but alter the content of a page only slightly
  Search engines can exploit this by re-downloading only revised pages
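As a rough illustration of order-insensitive change metrics, the sketch below computes a TF.IDF cosine distance and a bag-of-words word distance between two versions of a page. The exact formulas used in the study are not given on the slide, so both definitions here are assumptions: the IDF uses a common smoothed form, and the word distance is taken as 1 - 2 * (common words) / (m + n) for versions of m and n words.

```python
import math
from collections import Counter

def tfidf_cosine_distance(old_words, new_words, corpus):
    """1 - cosine similarity of TF.IDF vectors; word order is ignored.

    `corpus` is a list of word lists used only to estimate document
    frequencies (smoothed IDF, an assumed weighting).
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))

    def weights(words):
        tf = Counter(words)
        return {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
                for t, c in tf.items()}

    a, b = weights(old_words), weights(new_words)
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 0.0

def word_distance(old_words, new_words):
    """Assumed definition: 1 - 2 * |common words| / (m + n)."""
    common = sum((Counter(old_words) & Counter(new_words)).values())
    total = len(old_words) + len(new_words)
    return 1.0 - 2.0 * common / total if total else 0.0

# Hypothetical page whose only change is a visitor counter.
old_version = "breaking news visitor count 1041".split()
new_version = "breaking news visitor count 1042".split()
corpus = [old_version, new_version]
counter_change = tfidf_cosine_distance(old_version, new_version, corpus)
counter_word_change = word_distance(old_version, new_version)
```

Both distances are 0 for identical versions and approach 1 as the versions share fewer words, so a counter or advertisement update registers as a small but nonzero change.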
7
Predictability
Overall predictability
  Metrics:
  Group A (Red): top 80%
  Group B (Yellow): top 80-90%
  Group C (Green): top 90-95%
  Group D (Blue): remaining pages
  Why is this degree of predictability required?
Predictability - individual site
  Individual sites www.columbia.edu and www.eonline.com considered for study
8
Conclusion
Aspects of the evolving Web that are of particular interest for search engine design have been studied through this research over a period of 1 year
Existing pages are being removed “rapidly” from the Web and replaced by new ones, whereas the new pages tend to borrow their content from the existing ones
Pages that change significantly over time have a predictable degree of change
The link structure is evolving at a faster rate than most of the pages themselves
The effort is to maximize search quality by making effective use of available resources to incorporate the changes
9
Thank You
References:
-> B.E. Brewington and G. Cybenko. How dynamic is the Web? In Proceedings of the Ninth WWW Conference, Amsterdam, The Netherlands, 2000
-> S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh WWW Conference, Brisbane, Australia, 1998
-> D. Fetterly, M. Manasse, M. Najork and J.L. Wiener. A large-scale study of the evolution of web pages. In Proceedings of the Twelfth WWW Conference, Budapest, Hungary, 2003
-> B.H. Murray and A. Moore. Sizing the internet. White Paper, Cyveillance, Inc., 2000