What’s New on the Web? The Evolution of the Web from a Search Engine Perspective. A. Ntoulas, J. Cho, and C. Olston, the 13th International World Wide Web Conference, 2004.

1 What’s New on the Web? The Evolution of the Web from a Search Engine Perspective
A. Ntoulas, J. Cho, and C. Olston, the 13th International World Wide Web Conference, 2004
April 11, 2006
Jeonghye Sohn

2 Contents
• Introduction
• Experimental Setup
• What’s New on the Web?
• Changes in the Existing Pages
• Predictability of Degree of Change

3 What’s New on the Web?: The creation of new content (1)
• Out of all unique shingles that existed in the first week, how many still exist in the nth week?
• How many unique shingles in the nth week did not exist in the first week?
Shingle: a contiguous subsequence of tokens contained in a document. For instance, the 4-shingling of (a, rose, is, a, rose, is, a, rose) is the set { (a, rose, is, a), (rose, is, a, rose), (is, a, rose, is) }.
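The two questions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `shingles` and `shingle_turnover` and the second example sentence are made up here, and tokens are assumed to be whitespace-separated words.

```python
def shingles(tokens, w=4):
    """Return the set of w-shingles (contiguous windows of w tokens)."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def shingle_turnover(first_week, nth_week):
    """Compare the unique shingles of two snapshots (both assumed non-empty).

    Returns (surviving, new):
      surviving = fraction of first-week shingles still present in week n
      new       = fraction of week-n shingles that did not exist in week 1
    """
    surviving = len(first_week & nth_week) / len(first_week)
    new = len(nth_week - first_week) / len(nth_week)
    return surviving, new


# The slide's example sentence, and a hypothetical later version of it.
week1 = shingles("a rose is a rose is a rose".split())
weekn = shingles("a rose is a flower is a rose".split())
surviving, new = shingle_turnover(week1, weekn)
```

Because `shingles` returns a set, the repeated windows in the example sentence collapse to the three unique shingles listed on the slide. In a crawl-scale setting the sets would be built over every page in a weekly snapshot rather than over one sentence.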

4 What’s New on the Web?: The creation of new content (2)
Measuring the number of unique existing and newly appearing shingles shows how much “new content” is being introduced every week.

5 What’s New on the Web?: The creation of new content (3)
On average:
• Each week around 5% of the unique shingles were new
• Each week roughly 8% of pages were new
So at most 5%/8% ≈ 62% of the content of new URLs introduced each week is actually new.
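The bound can be written out as a short calculation. The symbol S (the number of unique shingles in a weekly snapshot) is introduced here for illustration, and the step assumes that new pages carry roughly their proportional share of the content, about 8% of S. The bound is reached only if every new shingle sits on a new page, which is why it is an upper limit.

```latex
\[
\frac{\text{new shingles on new pages}}{\text{shingles on new pages}}
\;\le\; \frac{0.05\,S}{0.08\,S}
\;=\; \frac{5}{8}
\;=\; 0.625
\]
```

The slide reports this ratio as 62%.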

6 What’s New on the Web?: Link-structure evolution (1)
How much the overall link structure changes over time:
• How many of the links from the first snapshot still exist in the subsequent snapshots
• How many of the links are newly created
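The same survival and novelty measures can be applied to links. The sketch below assumes each link is represented as a (source URL, target URL) pair. The function name `link_turnover` and the example URLs are hypothetical.

```python
def link_turnover(first_links, later_links):
    """Compare the link sets of two snapshots (both assumed non-empty).

    Each link is a (source_url, target_url) pair.
    Returns (surviving, new):
      surviving = fraction of first-snapshot links still present later
      new       = fraction of later-snapshot links not in the first snapshot
    """
    first, later = set(first_links), set(later_links)
    surviving = len(first & later) / len(first)
    new = len(later - first) / len(later)
    return surviving, new


# Hypothetical snapshots of a tiny link graph.
first_snapshot = [
    ("a.example/1", "b.example/1"),
    ("a.example/1", "c.example/1"),
    ("b.example/1", "c.example/1"),
    ("c.example/1", "a.example/1"),
]
later_snapshot = [
    ("a.example/1", "b.example/1"),
    ("b.example/1", "c.example/1"),
    ("a.example/1", "d.example/1"),
    ("d.example/1", "a.example/1"),
]
link_surviving, link_new = link_turnover(first_snapshot, later_snapshot)
```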

7 What’s New on the Web?: Link-structure evolution (2)
• The link structure of the Web is significantly more dynamic than the pages and their content
• Search engines may therefore need to update link-based ranking metrics (such as PageRank) frequently

8 Changes in the existing pages: Change frequency distribution
• Pages were grouped by change interval to obtain the distribution
• Most pages are concentrated near one of the two extremes: they change either very frequently or very infrequently
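One way to group pages by change interval is sketched below. It assumes each page is represented by a list of weekly content checksums and estimates the average change interval as the number of observed weeks divided by the number of detected changes. The names `average_change_interval` and `interval_histogram`, the bucket labels, and the sample data are illustrative and not taken from the paper.

```python
import math
from collections import Counter


def average_change_interval(checksums):
    """Estimate the average number of weeks between changes of one page.

    checksums: one content hash per weekly snapshot, oldest first.
    Returns None when no change was observed during the period.
    """
    changes = sum(1 for prev, cur in zip(checksums, checksums[1:]) if prev != cur)
    if changes == 0:
        return None
    return (len(checksums) - 1) / changes


def interval_histogram(pages):
    """Count pages per change-interval bucket (interval rounded up to whole weeks)."""
    hist = Counter()
    for checksums in pages.values():
        interval = average_change_interval(checksums)
        bucket = "no change observed" if interval is None else math.ceil(interval)
        hist[bucket] += 1
    return hist


# Hypothetical five-week history of three pages.
pages = {
    "news.example/front": ["h1", "h2", "h3", "h4", "h5"],  # changes every week
    "docs.example/spec": ["x", "x", "x", "x", "x"],        # never changes
    "blog.example/post": ["p", "p", "q", "q", "q"],        # changes once
}
hist = interval_histogram(pages)
```

Weekly snapshots cannot resolve changes that happen more often than once per week, so every such page lands in the one-week bucket.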

9 Changes in the existing pages: Degree of change (1)
Search engines are faced with a constrained optimization problem:
• Maximize the accuracy of the local search repository and index
• Given a constrained amount of resources available for (re)downloading pages from the Web and incorporating them into the search index
Effective search engine crawlers:
• Ignore insignificant changes
• Devote resources to incorporating important changes

10 Changes in the existing pages: Degree of change (2)
The distribution of degree of change is measured using two metrics:
• TF.IDF Cosine Distance
• Word Distance

11 Changes in the existing pages: Degree of change (3)
TF.IDF Cosine Distance
• TF: term frequency
• DF: document frequency
• IDF: inverse document frequency
Based on the Vector Space Model
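A minimal sketch of a TF.IDF cosine distance between two versions of a page follows. The slide does not give the exact weighting, so the code assumes a common textbook variant, weight = tf × log(N / df). The function names, the document-frequency table, and the sample page versions are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, num_docs):
    """Map each term to tf * log(N / df); terms with unknown df are skipped.

    The weighting scheme is an assumption (a common textbook variant).
    """
    tf = Counter(tokens)
    return {
        term: count * math.log(num_docs / doc_freq[term])
        for term, count in tf.items()
        if doc_freq.get(term)
    }


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # convention chosen here for empty vectors
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical corpus statistics: "the" occurs in every document.
NUM_DOCS = 1000
DOC_FREQ = {"the": 1000, "web": 100, "crawler": 10, "index": 10}

old = tfidf_vector("the web crawler".split(), DOC_FREQ, NUM_DOCS)
minor = tfidf_vector("the the web crawler".split(), DOC_FREQ, NUM_DOCS)
major = tfidf_vector("the web index".split(), DOC_FREQ, NUM_DOCS)

d_minor = cosine_distance(old, minor)  # only a very common word changed
d_major = cosine_distance(old, major)  # a rare, informative word changed
```

In this example the extra "the" has zero IDF weight, so `d_minor` is essentially 0, while replacing the rare term "crawler" with "index" gives a much larger `d_major`. The metric is meant to downweight changes in common words.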

12 Changes in the existing pages: Degree of change (4)
A moderate fraction of changes induces a nontrivial word distance while having almost no impact on cosine distance.

13 Changes in the existing pages: Degree and frequency of change (1)
The content of the pages that change very frequently (at least once per week) is significantly altered with each change.

14 Changes in the existing pages: Degree and frequency of change (2)
The cumulative degree of change increases substantially over time.