Discovering Changes on the Web What’s New on the Web? The Evolution of the Web from a Search Engine Perspective Alexandros Ntoulas Junghoo Cho Christopher.


1 Discovering Changes on the Web
What’s New on the Web? The Evolution of the Web from a Search Engine Perspective
Alexandros Ntoulas (UCLA Computer Science), Junghoo Cho (UCLA Computer Science), Christopher Olston (Carnegie Mellon University)
Presented by: Vasudha Bhat, Madhukesh Wali
CS 791 - Web Syndication Formats

2 01/06/2008 Overview:
- Introduction
- Experimental Detail
- Main Findings Of The Paper
- Web Statistics
- Changes On Web
- Predictability Of Change
- Conclusion
- References

3 The Evolution of the Web from a Search Engine Perspective
- Why a search engine?
- How do search engines work? (References)
  - Crawler-based search engines
  - Human-powered directories
There are many research papers on the evolution of the Web. (References)

4 How Search Engines Work (References)
Typically, search engines “crawl” Web pages in advance so that they can build local copies and/or indexes of the pages.
Figure reference: http://images.google.com/imgres?imgurl=http://www.searchtools.com/slides/SES01-sitesearch/index-search.gif&imgrefurl=http://www.searchtools.com/slides/SES01-sitesearch/sitesearch-02.html&h=363&w=656&sz=8&hl=en&start=8&um=1&tbnid=VG1cyWMFP1BV8M:&tbnh=76&tbn

5 Google as an example (Reference)
Figure reference: http://images.google.com/imgres?imgurl=http://static.howstuffworks.com/gif/search-engine-chart.gif&imgrefurl=http://computer.howstuffworks.com/search-engine1.htm&h=420&w=386&sz=32&hl=en&start=1&um=1&tbnid=yalqWdsm6aqQfM:&tbnh=125&tbnw=115&prev=/images%3Fq%25

6 Crawler-Based Search Engines (Reference)
“Spiders” / “crawlers” take a Web page's content and create key search words that enable online users to find the pages they're looking for.
Figure reference: http://images.google.com/imgres?imgurl=http://www.me.lv/jse/images/crawl.gif&imgrefurl=http://www.me.lv/jse/manualdownload.html&h=516&w=716&sz=21&hl=en&start=34&um=1&tbnid=QjQc6Gv02cXt7M:&tbnh=101&tbnw=140&prev=/images%3Fq%3Dcrawl%2Bbased%2Bsearch%2Bengine

7 What makes this paper unique?
- Link-structure evolution:
  - Link structure plays an important role in selecting the pages to return for search engine queries.
- New pages on the Web:
  - While a large fraction of existing pages change over time, a significant fraction of “changes” on the Web are due to new pages that are created over time. (Details)
- Search-centric change metric:
  - Relevance of a page to a query is measured using 1) the TF.IDF distance metric and 2) the number of new words introduced in each update.

8 Main findings of this paper: some interesting facts on the Web, circa 2002-2003
- New pages are created at a rate of 8% per week.
  - 4 billion pages in the current Web.
  - 320 million new pages every week.
  - 3.8 terabytes in size.
- Only 20% of the pages available today will still be accessible after one year.
- About 5% “new content” is introduced every week.
- About 50% of the content on the Web is new after a year.
- About 25% new links are created every week.

9 How much change?
- Half of the new pages created do not change over a year.
  - If there are any changes, they are minor.
- 80% of the links on the Web are replaced with new ones within a year.
- After one week, 70% of the changed pages show less than 5% difference from their initial version.
- Even after one year, less than 50% of the changed pages show less than 5% difference.

10 Can we predict future changes?
- The frequency of change
  - Number of times a page changed within a particular interval.
  - Example: three changes in a month.
- The degree of change
  - How much change a page went through within an interval.
  - Example: 30% difference under the TF.IDF metric in a week.

11 EXPERIMENTAL SETUP
154 “popular” Web sites were crawled every week from October 2002 until October 2003, 51 weeks in total.
- Selection of the sites
- Download of pages

12 Selection of the sites
- Representative
  - The sample should span various parts of the Web, covering a multitude of topics.
- Interesting
  - A reasonably large number of users should be interested in those sites.
- The five top-ranked pages from each topical category of the Google Directory satisfied both requirements. (Complete list of sites)

13 Download of pages
- Pages were downloaded every week over a period of almost one year.
- Maximum limit of 200,000 pages per site.
  - 4 Web sites out of 154 hit the limit.
- 3 to 5 million pages per week.
  - An average of 4.4 million pages.
- The size of each weekly snapshot was around 65 GB before compression.
- Total of 3.3 TB of Web history data + 4 TB of derived data, such as links, shingles, etc.

14 WHAT’S NEW each week? Weekly statistics
- How many new pages are created every week?
- How much new content is created?
- How many new links?

15 Weekly birth rate of pages
- Average weekly birth rate is about 8%.

16 Page Persistence
- Many Web sites use the end of a calendar month to introduce new pages.

17 Birth, death, and replacement
- Only 75% of the first-week pages were still available after one month of crawling (week 4).
- About 52% were available after 6 months of crawling (week 25).
- After almost a year (week 51), nearly 60% of the pages were new and only slightly more than 40% of the initial set were still available.
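
As a rough illustration of how persistence figures like these could be computed, the sketch below takes one set of URLs per weekly crawl and reports the fraction of first-week URLs still present in each week. The function name `survival_curve` and the toy URL sets are hypothetical; the slides do not describe the study's actual tooling.

```python
def survival_curve(weekly_urls):
    """Fraction of first-week URLs still present in each weekly snapshot.

    weekly_urls: list of sets, one set of URLs per weekly crawl (assumed input shape).
    """
    if not weekly_urls or not weekly_urls[0]:
        return []
    first_week = weekly_urls[0]
    return [len(first_week & week) / len(first_week) for week in weekly_urls]


# Toy data: four pages in week 1, some of which disappear later.
weeks = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "e", "f", "g"},
]
curve = survival_curve(weeks)
```

On real crawl data, reading this curve at weeks 4, 25, and 51 would correspond to the three percentages quoted on the slide.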

18 Not all the changes are the same
- The creation of new content:
  - To quantify the amount of new content being introduced, the shingling technique is used.
- Shingling technique: (Details)
  - A shingle is a contiguous ordered subsequence of words.
  - Adjacent words of the page are grouped to form shingles, wrapping at the end of the page.
  - By comparing the number of matching shingles, we can determine whether two documents are duplicates or not.
  - Comparing these document subsets allows the algorithms to calculate a percentage of overlap between two documents.

19 Shingling technique
Example sentence: “The more alternatives, the more difficult the choice.”
Word sequence: (The, more, alternatives, the, more, difficult, the, choice)
3-shingling: {(The, more, alternatives), (more, alternatives, the), (alternatives, the, more), (the, more, difficult), (more, difficult, the), (difficult, the, choice)}
[Figure: overlapping shingles]
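
The example sentence on this slide can be reproduced with a minimal sketch. It assumes words are split on whitespace, stripped of punctuation, and lower-cased (tokenization choices the slides do not specify), and it does not wrap around at the end of the page, which matches the six shingles listed for the example. `shingles` and `resemblance` are illustrative names; the overlap measure used here is the Jaccard ratio of the two shingle sets, one common way to express a percentage of overlap.

```python
def shingles(text, w=3):
    """Return the set of w-word shingles (as tuples) for a piece of text."""
    # Assumed tokenization: whitespace split, punctuation stripped, lower-cased.
    words = [token.strip(".,;:!?\"'()").lower() for token in text.split()]
    words = [word for word in words if word]
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(text_a, text_b, w=3):
    """Jaccard overlap of the two shingle sets, between 0.0 and 1.0."""
    set_a, set_b = shingles(text_a, w), shingles(text_b, w)
    if not set_a and not set_b:
        return 1.0  # two texts too short to shingle are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


original = "The more alternatives, the more difficult the choice."
edited = "The more alternatives, the more difficult the decision."
example_shingles = shingles(original)
overlap = resemblance(original, edited)
```

Changing only the last word replaces one of the six shingles, so five of the seven distinct shingles across both versions are shared.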

20 Shingle measurements
- Approximately 4.3 billion unique shingles per week.

21 Do You Know?
- It takes nine months for 50% of the pages to be replaced with new ones.
- More than 50% of the shingles are still available even after nearly one year.
- On average, around 5% of the unique shingles are new each week.
- Roughly 8% of pages each week are new.
- At most 5%/8% = 62% of the content of new URLs introduced each week is actually new.
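
The last bullet is a ratio of the two weekly rates. A small sketch of that arithmetic follows, under two simplifying assumptions that the slides do not state: every new shingle appears on a new page, and new pages carry about as many shingles as the average page. The function name `max_new_content_share` is hypothetical.

```python
def max_new_content_share(new_shingle_rate, new_page_rate):
    """Upper bound on the share of new-page content that is genuinely new.

    Assumes all new shingles land on new pages and that new pages are of
    average size; both are simplifications made for this illustration.
    """
    if new_page_rate <= 0:
        raise ValueError("new_page_rate must be positive")
    return min(1.0, new_shingle_rate / new_page_rate)


# Weekly rates quoted on the slide: 5% new shingles, 8% new pages.
bound = max_new_content_share(0.05, 0.08)
```

If some new shingles actually appear on existing pages, the true share on new pages is lower, which is why the slide says "at most".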

22 Link Structure details
- On average, about 25% new links are created every week, which is significantly larger than the 8% of new pages and 5% of new content.

23 Change - the Perception
- Frequency Distribution
- Degree of Change
- The different metrics
- Measuring Degree of change
- Distribution of cosine distances
- Correlating Degree and frequency of change
- Predicting Degree of change

24 Change - the Perception
- Is change a good thing?
  - From a Web user's perspective
  - From the perspective of a search engine
- Frequency of change
  - Measures how often the Web page changes.
- Degree of change
  - Measures how much of the Web page has changed.

25 Frequency Distribution
- Average change interval distribution
  - 50% of Web pages had an infinite value (no change was observed).
  - 15% changed on a weekly basis.
  - The remaining pages were distributed in a spectrum with a roughly U-shaped pattern.
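
A minimal sketch of how an average change interval could be derived from weekly snapshots of one page. It assumes the interval is the observation period divided by the number of detected changes, and that a page with no detected change gets an infinite value; the slides do not give the exact definition, and `average_change_interval` is a hypothetical name.

```python
import math


def average_change_interval(fingerprints):
    """Average number of weeks between changes for a single page.

    fingerprints: one content fingerprint (e.g. a checksum string) per weekly
    snapshot, in crawl order. Returns math.inf if no change was detected.
    """
    changes = sum(
        1 for previous, current in zip(fingerprints, fingerprints[1:])
        if previous != current
    )
    if changes == 0:
        return math.inf
    observed_weeks = len(fingerprints) - 1
    return observed_weeks / changes


never_changes = average_change_interval(["v1", "v1", "v1", "v1"])
changes_weekly = average_change_interval(["v1", "v2", "v3", "v4"])
changes_once = average_change_interval(["v1", "v1", "v2", "v2", "v2"])
```

With weekly crawls the smallest measurable interval is one week, so pages that change more often than that are indistinguishable from pages that change weekly.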

26 Frequency Distribution Inference
- The distribution is concentrated at the two extremes.
- This means that Web pages either change very often or change very infrequently.
Does frequency alone provide enough information for search engine cache updates?

27 Degree of Change
- The amount of change is as important as the frequency of change for a search engine.
- How much change is tolerable?
- Search engines face a “constrained optimization problem”: constrained resources to download and index Web pages vs. increased accuracy of the local search repository and index.

28 The different metrics
- TF.IDF cosine distance metric
  - This metric, along with other factors, is generally used by search engines to rank search results based on the relevance of the content.
- Word distance metric
  - Word distance is a measure of the amount of work needed to update the search index of a search engine.
- Neither metric accounts for the order of terms, while the shingles metric does.

29 Measuring Degree of change
TF.IDF cosine distance:
- Documents can be visualized as weighted multi-dimensional vectors.
- Each dimension corresponds to the weighted index of a search term.
- TF.IDF cosine distance measures the difference in the orientations of the documents in the vector space.
Word distance:
- Measures the difference in the count of words in the document's content that has changed.
TF.IDF is a more generic measure w.r.t. the collection; word distance concerns only the documents being compared.
[Figure: document vectors D1 and D1’]
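
The two metrics can be sketched as follows. Several details are assumptions made for illustration, not definitions taken from the slides: IDF is computed as log(N / df) from a hypothetical collection-wide document-frequency table, cosine distance is one minus cosine similarity, and word distance is one minus twice the number of shared word occurrences divided by the combined length of the two versions.

```python
import math
from collections import Counter


def tfidf_vector(words, doc_freq, n_docs):
    """Map each term to tf * idf, with idf = log(N / df) (assumed weighting)."""
    term_counts = Counter(words)
    # Terms missing from doc_freq are treated as appearing in one document.
    return {
        term: count * math.log(n_docs / doc_freq.get(term, 1))
        for term, count in term_counts.items()
    }


def cosine_distance(vec_a, vec_b):
    """1 - cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in vec_a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0 if norm_a == norm_b else 1.0
    return 1.0 - dot / (norm_a * norm_b)


def word_distance(words_a, words_b):
    """1 - 2 * shared word occurrences / total words (assumed formulation)."""
    shared = sum((Counter(words_a) & Counter(words_b)).values())
    total = len(words_a) + len(words_b)
    return 0.0 if total == 0 else 1.0 - 2.0 * shared / total


# Hypothetical collection statistics: 10 documents, per-term document counts.
doc_freq = {"web": 2, "crawler": 1, "index": 5}
old = tfidf_vector(["web", "crawler"], doc_freq, 10)
reordered = tfidf_vector(["crawler", "web"], doc_freq, 10)
unrelated = tfidf_vector(["index"], doc_freq, 10)
```

Both functions work on bags of words, so reordering the terms of a page leaves either distance at zero; that is the order-insensitivity the previous slide contrasts with shingles.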

30 Distribution of cosine distances
- The figure shows the distribution of cosine distance across all changes.
- The red region shows the change with reference to the previous range of cosine change.
- The blue bars show the cumulative distribution.

31 Cosine Distribution inference
- Most changes have a very low cosine distance value.
- 80% of all changes had a DCos value of less than 0.2 from their previous versions.
- More than half of the crawls during the experiment were induced by very small changes in the content.
- The low DCos values signify that pages continue to provide information on the same kind of content even after a significant number of changes.
Similar distribution results were obtained with the word distance metric.

32 Correlating Degree and frequency of change
- We saw that most changes have a very low degree of change.
- Hence, it is important for search engines to consider the degree of change before new crawls.
- If a correlation exists, then search engines can estimate the degree of change by measuring frequency.

33 Relationship between degree and frequency
[Figure: average degree of change vs. number of changes]
- Inference:
  - The highest average degree of change is found in pages that change either very frequently or very rarely.
- What does this mean?

34 Relationship between degree and frequency
[Figure: average change and cumulative change]
- Most of the frequently changing pages change the same portion of the page.
- Pages with a moderate frequency of change have a high cumulative degree of change.
- Search engine perspective: moderately changing Web sites need to be crawled more often.

35 Predicting Degree of change
- Why predict?
  - Ability to differentiate between minor and major changes.
  - “Pull-oriented nature of the Web”
The ability to differentiate change depends on the ability to predict.

36 Comparing cosine distance values over intervals of time
- Each dot corresponds to an individual page.
- Y = X means these pages changed by the same amount in both intervals.
- Pages are grouped in terms of their proximity to the diagonal.
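
One way to picture this grouping is sketched below: each page is a point (x, y) holding its cosine distance in two consecutive intervals, and pages are banded by how far y is from x. Measuring proximity as the absolute gap |y - x| and the band limits 0.05 and 0.2 are assumptions chosen for illustration; the slides do not say how the groups were defined.

```python
from collections import Counter


def diagonal_band(x, y, limits=(0.05, 0.2)):
    """Return 0 for points nearest the diagonal y = x, larger values farther away."""
    gap = abs(y - x)
    for band, limit in enumerate(limits):
        if gap <= limit:
            return band
    return len(limits)


def band_shares(points, limits=(0.05, 0.2)):
    """Fraction of pages falling in each band, keyed by band index."""
    if not points:
        return {}
    counts = Counter(diagonal_band(x, y, limits) for x, y in points)
    return {
        band: counts.get(band, 0) / len(points)
        for band in range(len(limits) + 1)
    }


# Toy (previous interval, next interval) cosine distances for four pages.
pages = [(0.10, 0.10), (0.10, 0.25), (0.00, 0.90), (0.30, 0.32)]
shares = band_shares(pages)
```

A large share in band 0 would indicate that the past degree of change is a good predictor of the next one for most pages.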

37 Comparing cosine distance values over intervals of time
- The distance from the diagonal for each group has increased over time.
- This means that the ability to predict change degrades over time.
- Some pages defy any kind of prediction.
- Conclusions are based on analysis of the popular Web sites considered.

38 Comparing cosine distance for individual sites
- The Columbia Web site has a higher degree of predictability than the eonline Web site.

39 Conclusion
- New pages are added to the Web at a very high rate, but much of their content is retained from old pages.
- Link structure changes at a higher rate than the pages.
- The changes in existing pages are very minor.
- The degree of change is highly predictable from the past degree of change for most pages.
- Frequency of change is not a good predictor of degree of change.

40 Questions?
- What is the main difference between the TF.IDF metric and the shingling technique?
- What will make search engines smarter?
- How does syndication help search engines?

41 References:
- Alexandros Ntoulas, Junghoo Cho, Christopher Olston. What’s New on the Web? The Evolution of the Web from a Search Engine Perspective.
- B. E. Brewington and G. Cybenko. How dynamic is the web? In Proceedings of the Ninth International World Wide Web Conference, Amsterdam, The Netherlands, May 2000.
- D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. A large-scale study of the evolution of web pages. In Proceedings of the Twelfth International World Wide Web Conference, Budapest, Hungary, May 2003.
- J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. In Proceedings of the Twenty-Sixth International Conference on Very Large Data Bases, pages 200–209, Cairo, Egypt, Sept. 2000.

42 Reference Links:
- http://searchenginewatch.com >> How search engines work, and the details.
- http://www.cs.rutgers.edu/~mlittman/courses/ml03/iCML03/papers/ramos.pdf >> Using TF-IDF to Determine Word Relevance in Document Queries
- http://webarchive.cs.ucla.edu >> The WebArchive Project, UCLA Computer Science Department. Complete list of sites.
- http://www.rorweb.com/smarter.htm >> Make the search engines smarter.

