
Discovering Changes on the Web
What’s New on the Web? The Evolution of the Web from a Search Engine Perspective
Alexandros Ntoulas (UCLA Computer Science), Junghoo Cho (UCLA Computer Science), Christopher Olston (Carnegie Mellon University)
Presented by: Vasudha Bhat, Madhukesh Wali
CS Web Syndication Formats

01/06/2008
Overview:
- Introduction
- Experimental Detail
- Main Findings of the Paper
- Web Statistics
- Changes on the Web
- Predictability of Change
- Conclusion
- References

The Evolution of the Web from a Search Engine Perspective
- Why a search engine?
- How do search engines work?
- Crawler-based search engines
- Human-powered directories
There are many research papers on the evolution of the Web.

How Search Engines Work
Search engines typically “crawl” Web pages in advance so that they can build local copies and/or indexes of the pages.

Google as an Example
[Figure]

Crawler-Based Search Engines
“Spiders” or “crawlers” take a Web page's content and create key search words that enable online users to find the pages they are looking for.

What makes this paper unique?
- Link-structure evolution: link structure plays an important role in selecting the pages to return for search engine queries.
- New pages on the Web: while a large fraction of existing pages change over time, a significant fraction of the “changes” on the Web are due to new pages that are created over time.
- Search-centric change metrics: change is measured with metrics that matter for the relevance of a page to a query: (1) the TF.IDF distance metric and (2) the number of new words introduced in each update.

Main findings of this paper: some interesting facts about the Web
- New pages are created at a rate of 8% per week.
- With 4 billion pages on the current Web, this means 320 million new pages every week, about 3.8 terabytes in size.
- Only 20% of the pages available today will still be accessible after one year.
- About 5% “new content” is introduced every week.
- After a year, about 50% of the content on the Web is new.
- About 25% new links are created every week.
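As a quick check of the arithmetic behind the weekly page count, taking the 4 billion page estimate and the 8% weekly rate as given:

```latex
\[
0.08 \times 4 \times 10^{9}\ \text{pages} = 3.2 \times 10^{8}\ \text{pages per week} \quad (320\ \text{million})
\]
\[
\frac{3.8 \times 10^{12}\ \text{bytes}}{3.2 \times 10^{8}\ \text{pages}} \approx 1.2 \times 10^{4}\ \text{bytes per page}
\]
```

The second line is only an implied figure (roughly 12 KB per new page, using decimal terabytes); the slides do not state an average page size.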

How much change?
- Half of the new pages created do not change over a year.
- When pages do change, the changes are usually minor.
- 80% of the links on the Web are replaced with new ones within a year.
- After one week, 70% of the changed pages show less than 5% difference from their initial version.
- Even after one year, fewer than 50% of the changed pages show less than 5% difference.

Can we predict future changes?
- The frequency of change: the number of times a page changed within a particular interval. Example: three changes in a month.
- The degree of change: how much change a page went through within an interval. Example: 30% difference under the TF.IDF metric in a week.

Experimental Setup
The pages of 154 “popular” Web sites were downloaded every week from October 2002 until October 2003, 51 weeks in total.
- Selection of the sites
- Download of pages

Selection of the sites
- Representative: the sample should span various parts of the Web, covering a multitude of topics.
- Interesting: a reasonably large number of users should be interested in those sites.
- Five top-ranked pages from each topical category of the Google Directory satisfied both requirements.
Complete list of sites: see the WebArchive Project entry under Reference Links.

Download of pages
- Pages were downloaded every week for almost one year.
- Maximum limit of 200,000 pages per site; 4 Web sites out of 154 hit the limit.
- 3 to 5 million pages per week, 4.4 million on average.
- Each weekly snapshot was around 65 GB before compression.
- Total of 3.3 TB of Web history data plus 4 TB of derived data (such as links and shingles).
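As a consistency check on the storage totals (a back-of-the-envelope calculation, assuming every one of the 51 weekly snapshots is close to the quoted 65 GB):

```latex
\[
51\ \text{weeks} \times 65\ \text{GB per week} = 3315\ \text{GB} \approx 3.3\ \text{TB}
\]
```

This agrees with the 3.3 TB of Web history data reported for the crawl.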

What’s New Each Week?
Weekly statistics:
- How many new pages are created every week?
- How much new content is created?
- How many new links?

Weekly birth rate of pages
- The average weekly birth rate is about 8%.

Page Persistence
- Many Web sites use the end of a calendar month to introduce new pages.

Birth, death, and replacement
- Only 75% of the first-week pages were still available after one month of crawling (week 4).
- About 52% were available after 6 months of crawling (week 25).
- After almost a year (week 51), nearly 60% of the pages were new and only slightly more than 40% of the initial set were still available.
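The persistence and birth figures above can be computed from weekly snapshots of URL sets. The following is a minimal sketch, not the authors' code: the function names and the toy data are hypothetical, and the birth rate is assumed to mean the share of a week's URLs that did not appear in any earlier snapshot.

```python
from typing import List, Set


def survival_fractions(snapshots: List[Set[str]]) -> List[float]:
    """Fraction of first-snapshot URLs still present in each snapshot."""
    if not snapshots or not snapshots[0]:
        raise ValueError("need a non-empty first snapshot")
    first = snapshots[0]
    return [len(first & snap) / len(first) for snap in snapshots]


def birth_rates(snapshots: List[Set[str]]) -> List[float]:
    """Share of each later snapshot made of URLs never seen before (assumed definition)."""
    seen = set(snapshots[0])
    rates = []
    for snap in snapshots[1:]:
        new_urls = snap - seen
        rates.append(len(new_urls) / len(snap) if snap else 0.0)
        seen |= snap
    return rates


# Toy data: three weekly crawls of a four-page site.
weeks = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "b", "e", "f"},
]
print(survival_fractions(weeks))
print(birth_rates(weeks))
```

On real crawl data each set would hold the URLs successfully downloaded in that week.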

Not all changes are the same
- The creation of new content: to quantify the amount of new content being introduced, a shingling technique is used.
- Shingling technique:
  - A shingle is a contiguous ordered subsequence of words.
  - Adjacent words of the page are grouped to form shingles, wrapping around at the end of the page.
  - By comparing the number of matching shingles we can determine whether two documents are duplicates or not.
  - Comparing these subsets of each document lets the algorithm calculate a percentage of overlap between two documents.

Shingling technique
Example sentence: “The more alternatives, the more difficult the choice.”
Word sequence: (The, more, alternatives, the, more, difficult, the, choice)
3-shingling: {(The, more, alternatives), (more, alternatives, the), (alternatives, the, more), (the, more, difficult), (more, difficult, the), (difficult, the, choice)}
[Figure: overlapping shingles]
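The 3-shingling of the example sentence can be reproduced with a few lines of code. This is a minimal sketch under stated assumptions: the tokenizer, the function names, and the use of Jaccard overlap as the "percentage of overlap" are illustrative choices, not taken from the paper. Wrapping at the end of the page is optional and off by default so that the output matches the slide's six shingles.

```python
import re
from typing import List, Sequence, Set, Tuple

Shingle = Tuple[str, ...]


def tokenize(text: str) -> List[str]:
    """Split text into word tokens, dropping punctuation (assumed tokenizer)."""
    return re.findall(r"[A-Za-z0-9']+", text)


def shingles(words: Sequence[str], w: int = 3, wrap: bool = False) -> Set[Shingle]:
    """Return the set of w-word shingles; wrap=True continues past the end to the start."""
    if w <= 0:
        raise ValueError("w must be positive")
    if len(words) < w:
        return {tuple(words)} if words else set()
    seq = list(words) + (list(words[: w - 1]) if wrap else [])
    return {tuple(seq[i:i + w]) for i in range(len(seq) - w + 1)}


def overlap(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard overlap of two shingle sets (one possible overlap measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


old = shingles(tokenize("The more alternatives, the more difficult the choice."))
new = shingles(tokenize("The more alternatives, the more difficult the decision."))
print(sorted(old))
print(overlap(old, new))
```

Changing the last word alters only the final shingle, so five of the seven distinct shingles across both versions are shared.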

Shingle measurements
- Approximately 4.3 billion unique shingles per week.

Do You Know?
- It takes nine months for 50% of the pages to be replaced with new ones.
- More than 50% of the shingles are still available even after nearly one year.
- On average, around 5% of the unique shingles each week are new.
- Roughly 8% of the pages each week are new.
- At most 5%/8% = 62% of the content of the new URLs introduced each week is actually new.
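The upper bound in the last bullet can be spelled out as follows. This is a sketch of the reasoning under two assumptions not stated on the slide: new pages are of average size, and every new shingle is credited to a new page (the best case for new pages). Here S denotes the number of unique shingles in a weekly snapshot.

```latex
\[
\frac{\text{new shingles}}{\text{shingles on new pages}}
\;\le\; \frac{0.05\,S}{0.08\,S} \;=\; 0.625
\]
```

The slide reports this value as 62%. Any new shingles that appear on existing pages push the true fraction below the bound.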

Link structure details
- On average, 25% new links are created every week, significantly more than the 8% new pages and 5% new content.

Change: the Perception
- Frequency distribution
- Degree of change
- The different metrics
- Measuring degree of change
- Distribution of cosine distances
- Correlating degree and frequency of change
- Predicting degree of change

Change: the Perception
- Is change a good thing?
  - From a Web user's perspective
  - From a search engine's perspective
- Frequency of change: a measure of how often a Web page changes.
- Degree of change: a measure of how much of a Web page has changed.

Frequency Distribution
Distribution of the average change interval:
- 50% of the Web pages had an infinite average change interval, that is, no change was observed.
- 15% changed on a weekly basis.
- The remaining pages were spread across the spectrum, giving a roughly U-shaped pattern.
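One way to obtain an average change interval from weekly crawls is to divide the observation span by the number of detected changes. The sketch below assumes that definition; the function name and the 7-day crawl spacing parameter are illustrative. A page with no detected change gets an infinite interval, which corresponds to the "infinite value" group above.

```python
import math
from typing import Sequence


def average_change_interval(changed: Sequence[bool], days_between_crawls: float = 7.0) -> float:
    """Average days between detected changes.

    changed[i] is True if the page differed between crawl i and crawl i + 1.
    Weekly crawling cannot see more than one change per week, so 7 days is
    the smallest interval this estimate can report.
    """
    n_changes = sum(changed)
    if n_changes == 0:
        return math.inf
    return len(changed) * days_between_crawls / n_changes


# 51 weekly snapshots give 50 week-to-week comparisons.
never = [False] * 50
weekly = [True] * 50
sometimes = [i % 10 == 0 for i in range(50)]  # 5 detected changes

print(average_change_interval(never))
print(average_change_interval(weekly))
print(average_change_interval(sometimes))
```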

Frequency Distribution: Inference
- The distribution is concentrated at the two extremes.
- This means that Web sites either change very often or change very infrequently.
Does frequency alone provide enough information for search engine cache updates?

Degree of Change
- For a search engine, the amount of change is as important as the frequency of change.
- How much change is tolerable?
- Search engines face a “constrained optimization problem”: limited resources for downloading and indexing Web pages versus the goal of keeping the local repository and index accurate.

The different metrics
- TF.IDF cosine distance metric: this metric, along with other factors, is generally used by search engines to rank search results by the relevance of their content.
- Word distance metric: word distance is a measure of the amount of work needed to update a search engine's index.
- Neither metric accounts for the order of terms, whereas the shingles metric does.

Measuring degree of change
TF.IDF cosine distance:
- A document can be visualized as a weighted multidimensional vector; each dimension corresponds to the weight of a search term.
- The TF.IDF cosine distance measures the difference in orientation between two document vectors in the vector space.
Word distance:
- Measures how many of the words in the document's content have changed.
Comparison:
- TF.IDF is a more generic measure, weighted with respect to the whole collection.
- Word distance concerns only the two documents being compared.
[Figure: document vectors D1 and D1']
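Both metrics can be sketched in a few lines. This is an illustration under explicit assumptions, not the paper's implementation: the TF.IDF weight is taken as raw term count times log(N / df), the word distance is taken as 1 - 2·|common words| / (m + n) for documents of m and n words, and the document-frequency table is made-up toy data.

```python
import math
from collections import Counter
from typing import Dict, List, Mapping


def tfidf_vector(words: List[str], doc_freq: Mapping[str, int], n_docs: int) -> Dict[str, float]:
    """Term weights tf * log(N / df); unseen terms are treated as df = 1 (assumption)."""
    counts = Counter(words)
    return {t: c * math.log(n_docs / max(1, doc_freq.get(t, 1))) for t, c in counts.items()}


def cosine_distance(u: Mapping[str, float], v: Mapping[str, float]) -> float:
    """1 - cos(angle between u and v); 0 means same orientation."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0 if norm_u == norm_v else 1.0
    return max(0.0, 1.0 - dot / (norm_u * norm_v))


def word_distance(old: List[str], new: List[str]) -> float:
    """1 - 2 * |common words| / (m + n), counting repeated words (assumed formula)."""
    if not old and not new:
        return 0.0
    common = sum((Counter(old) & Counter(new)).values())
    return 1.0 - 2.0 * common / (len(old) + len(new))


# Hypothetical collection statistics: 100 documents.
doc_freq = {"web": 90, "pages": 80, "change": 10, "over": 50, "time": 40, "very": 30, "little": 5}
d1 = "web pages change over time".split()
d1_new = "web pages change very little over time".split()

v1 = tfidf_vector(d1, doc_freq, 100)
v1_new = tfidf_vector(d1_new, doc_freq, 100)
print(cosine_distance(v1, v1_new))
print(word_distance(d1, d1_new))
```

The collection-wide document frequencies enter only the cosine distance, which is why TF.IDF depends on the whole collection while word distance depends only on the two versions being compared.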

Distribution of cosine distances
- The figure shows the distribution of cosine distance across all changes.
- The red region shows the change relative to the previous range of cosine distance.
- The blue bars show the cumulative distribution.

Cosine distribution: inference
- Most changes have a very low cosine distance value.
- 80% of all changes had a DCos value of less than 0.2 from their previous versions.
- More than half of the crawls during the experiment were triggered by very small changes in content.
- The low DCos values signify that pages continue to provide the same kind of content even after a significant number of changes.
Similar distribution results were obtained with the word distance metric.

Correlating degree and frequency of change
- We saw that most changes have a very low degree of change.
- Hence, it is important for search engines to consider the degree of change before starting new crawls.
- If a correlation exists, search engines can estimate the degree of change by measuring the frequency.

Relationship between degree and frequency
[Figure: average degree of change vs. number of changes]
- Inference: the highest average degree of change is found in pages that change either very frequently or very rarely.
- What does this mean?

Relationship between degree and frequency
[Figures: average change; cumulative change]
- Most frequently changing pages keep changing the same portion of the page.
- Pages with a moderate frequency of change have a high cumulative degree of change.
- Search engine perspective: moderately changing Web sites need to be crawled more often.

Predicting degree of change
- Why predict?
  - To be able to differentiate between minor and major changes.
  - “Pull-oriented nature of the Web”
The ability to differentiate between changes depends on the ability to predict them.

Comparing cosine distance values over intervals of time
- Each dot corresponds to an individual page.
- Points on the line y = x are pages that changed by the same amount in both intervals.
- Pages are grouped by their proximity to the diagonal.
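Grouping pages by proximity to the diagonal can be sketched as follows. Each point (x, y) holds a page's cosine distance in one interval and in the next; the perpendicular distance to the line y = x is |y - x| / √2. The band limits and the function names are hypothetical, chosen only for illustration, and are not the thresholds used in the paper.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def diagonal_distance(x: float, y: float) -> float:
    """Perpendicular distance from the point (x, y) to the line y = x."""
    return abs(y - x) / math.sqrt(2)


def group_by_predictability(
    points: Iterable[Tuple[float, float]],
    bands: Sequence[float] = (0.05, 0.1, 0.2),
) -> List[int]:
    """Count points per band of distance from the diagonal.

    counts[i] holds points within bands[i] (and outside the smaller bands);
    the last entry holds points farther away than every band.
    """
    counts = [0] * (len(bands) + 1)
    for x, y in points:
        d = diagonal_distance(x, y)
        for i, limit in enumerate(bands):
            if d <= limit:
                counts[i] += 1
                break
        else:
            counts[-1] += 1
    return counts


# Toy data: (change in the first interval, change in the following interval).
pages = [(0.1, 0.1), (0.2, 0.25), (0.1, 0.3), (0.0, 0.9)]
print(group_by_predictability(pages))
```

Pages in the innermost band changed by nearly the same amount in both intervals, so their past degree of change is a good predictor of their future degree of change.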

Comparing cosine distance values over intervals of time
- The distance from the diagonal for each group increases over time.
- This means that the ability to predict change degrades over time.
- Some pages defy any kind of prediction.
- These conclusions are based on the analysis of the popular Web sites considered.

Comparing cosine distance for individual sites
- The Columbia Web site has a higher degree of predictability than the eonline Web site.

Conclusion
- New pages are added to the Web at a very high rate, but much of their content is carried over from old pages.
- Link structure changes at a higher rate than the pages themselves.
- Changes in existing pages are very small.
- For most pages, the degree of change is highly predictable from the past degree of change.
- Frequency of change is not a good predictor of degree of change.

Questions?
- What is the main difference between the TF.IDF metric and the shingling technique?
- What will make search engines smarter?
- How does syndication help search engines?

References:
- Alexandros Ntoulas, Junghoo Cho, Christopher Olston. What’s New on the Web? The Evolution of the Web from a Search Engine Perspective.
- B. E. Brewington and G. Cybenko. How dynamic is the web? In Proceedings of the Ninth International World Wide Web Conference, Amsterdam, The Netherlands, May
- D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. A large-scale study of the evolution of web pages. In Proceedings of the Twelfth International World Wide Web Conference, Budapest, Hungary, May
- J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. In Proceedings of the Twenty-Sixth International Conference on Very Large Data Bases, pages 200–209, Cairo, Egypt, Sept

Reference Links:
- How search engines work, in detail.
- Using TF-IDF to Determine Word Relevance in Document Queries.
- The WebArchive Project, UCLA Computer Science Department. Complete list of sites.
- Making search engines smarter.