1 What’s New on the Web? The Evolution of the Web from a Search Engine Perspective. A. Ntoulas, J. Cho, and C. Olston, the 13th International World Wide Web Conference, 2004. April 11, 2006, Jeonghye Sohn
2 Contents Introduction Experimental Setup What’s New on the Web? Changes in the Existing Pages Predictability of Degree of Change
3 What’s New on the Web? : The creation of new content(1) Out of all unique shingles that existed in the first week, how many of them still exist in the nth week? How many unique shingles in the nth week did not exist in the first week? Shingle : A contiguous subsequence contained in a document For instance, the 4-shingling of (a,rose,is,a,rose,is,a,rose) is the set { (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }
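The shingle definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming documents are already split into word tokens; the function name `shingles` and the whitespace tokenization are assumptions made for this example, not part of the original study.

```python
def shingles(tokens, k):
    """Return the set of unique contiguous k-token subsequences of `tokens`."""
    # Slide a window of width k over the token list; the set drops duplicates.
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


# The example from the slide: the 4-shingling of (a,rose,is,a,rose,is,a,rose).
words = "a rose is a rose is a rose".split()
result = shingles(words, 4)
# result holds three unique shingles:
# ('a','rose','is','a'), ('rose','is','a','rose'), ('is','a','rose','is')
```

The window yields five subsequences, but two of them repeat, so only three unique shingles remain, matching the set given on the slide.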
4 What’s New on the Web? : The creation of new content(2) Measuring the number of unique existing and newly appearing shingles : how much “new content” is being introduced every week.
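One way to picture this measurement is as a comparison of two sets of unique shingles, one per weekly snapshot. The sketch below is an assumption about how such counts could be computed, not the authors' implementation; the function name `shingle_turnover` and the placeholder shingle labels are hypothetical.

```python
def shingle_turnover(first_week, nth_week):
    """Compare two sets of unique shingles taken from two snapshots.

    Returns the fraction of first-week shingles that still exist in week n,
    and the fraction of week-n shingles that did not exist in the first week.
    """
    surviving = first_week & nth_week   # existed in week 1 and still exist
    new = nth_week - first_week         # appear in week n only
    return {
        "surviving_fraction": len(surviving) / len(first_week) if first_week else 0.0,
        "new_fraction": len(new) / len(nth_week) if nth_week else 0.0,
    }


# Toy data: string labels stand in for real shingles (hypothetical values).
week1 = {"s1", "s2", "s3", "s4"}
week5 = {"s3", "s4", "s5"}
stats = shingle_turnover(week1, week5)
# Half of the week-1 shingles survive; one third of the week-5 shingles are new.
```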
5 What’s New on the Web? : The creation of new content(3) On average: each week around 5% of the unique shingles were new. Each week roughly 8% of pages were new. At most 5%/8% ≈ 62% of the content of new URLs introduced each week is actually new.
6 What’s New on the Web? : Link-structure evolution(1) How much the overall link structure changes over time: how many of the links from the first snapshot existed in the subsequent snapshots, and how many of the links are newly created
7 What’s New on the Web? : Link-structure evolution(2) The link structure of the Web is significantly more dynamic than the pages and the content. Search engines may need to update link-based ranking metrics (such as PageRank).
8 Changes in the existing pages : Change frequency distribution Grouped pages by change interval and obtained the distribution. Most pages are concentrated near one of the two extremes: they change either very frequently or very infrequently.
9 Changes in the existing pages : Degree of change(1) Search engines are faced with a constrained optimization problem: maximize the accuracy of the local search repository and index, given a constrained amount of resources available for (re)downloading pages from the Web and incorporating them into the search index. Effective search engine crawlers ignore insignificant changes and devote resources to incorporating important changes.
10 Changes in the existing pages : Degree of change(2) The distribution of degree of change is measured using two metrics: TF.IDF cosine distance and word distance
11 Changes in the existing pages : Degree of change(3) TF.IDF Cosine Distance. TF: term frequency. DF: document frequency. IDF: inverse document frequency. Vector Space Model
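The vector space model named above can be illustrated with a small sketch. It assumes one common TF.IDF weighting (raw term count times log(N/DF)) and defines cosine distance as one minus cosine similarity; the slide does not state the exact weighting used in the study. The function names, the document-frequency table, and the collection size are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, num_docs):
    """Map each term to TF * IDF, with IDF = log(num_docs / DF) (assumed weighting)."""
    tf = Counter(tokens)
    return {
        term: count * math.log(num_docs / doc_freq[term])
        for term, count in tf.items()
        if doc_freq.get(term, 0) > 0   # skip terms with unknown document frequency
    }


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0   # convention chosen here for empty vectors
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical collection statistics: 100 documents, made-up document frequencies.
doc_freq = {"web": 50, "search": 40, "engine": 30, "crawl": 5, "index": 5}
old_version = "web search engine crawl".split()
new_version = "web search engine index".split()

v_old = tfidf_vector(old_version, doc_freq, num_docs=100)
v_new = tfidf_vector(new_version, doc_freq, num_docs=100)
distance = cosine_distance(v_old, v_new)
```

Because rare terms receive a high IDF weight, replacing one rare word ("crawl" with "index") moves the vectors apart more than replacing a common word would.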
12 Changes in the existing pages : Degree of change(4) A moderate fraction of changes induces a nontrivial word distance while having almost no impact on cosine distance
13 Changes in the existing pages : Degree and frequency of change(1) The content of the pages that change very frequently (at least once per week) is significantly altered with each change
14 Changes in the existing pages : Degree and frequency of change(2) The cumulative degree of change increases substantially over time