Evolution of Web from a Search Engine Perspective Saket Singam


Introduction
• Large and diverse growth of the Web => search engines are becoming the “killer application”
• Search engines typically “crawl” Web pages in advance
• Discussion
  1) What’s new on the Web?
     • New pages are created at a rate of 8% per week
     • Only 20% of today’s Web pages are still accessible after 1 year
     • New pages borrow content from existing pages: 62% of the content in these pages is new; after 1 year, 50% of the content on the Web is new
  2) How much change?
     • Once a page is created, it is likely to go through either minor changes or no change
  3) Can we predict future changes?
     • Frequency of change
     • Degree of change

Experimental Setup
• Download of pages (over almost a year)
  • Pages from 154 “popular” Web sites
  • Downloaded weekly in breadth-first order, starting from the root page of each Web site, until all reachable pages or a maximum of 200,000 pages per site had been downloaded
  • Total number of pages per weekly download = 3-5 million (avg. 4.4 million)
  • Size = 65 GB per week before compression
  • Total of 3.3 TB of Web history data and 4 TB of derived data (links, shingles)
  • Table: “Fraction of pages included in this Experiment”
• Selection of sites
  • “Representative” as well as “interesting” samples of the Web
  • About 5 top-ranked pages from a subset of the topical categories of the Google Directory
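The weekly download described above (breadth-first from a site's root page, stopping at a page cap) can be sketched roughly as follows. This is only an illustration, not the crawler used in the study: `bfs_crawl`, `toy_links` and the in-memory `TOY_SITE` graph are hypothetical names, and a real crawler would fetch pages over HTTP and parse their links.

```python
from collections import deque

def bfs_crawl(root, get_links, max_pages=200_000):
    """Visit pages breadth-first from `root`, stopping once `max_pages`
    pages have been visited or no reachable page is left."""
    seen = {root}          # URLs already queued, so each page is visited once
    order = []             # pages in the order they were "downloaded"
    queue = deque([root])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Hypothetical toy site: each URL maps to the links found on that page.
TOY_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/"],
    "/b": ["/c", "/d"],
    "/c": [],
    "/d": ["/a"],
}

def toy_links(url):
    """Stand-in for fetching a page and extracting its links."""
    return TOY_SITE.get(url, [])

print(bfs_crawl("/", toy_links))               # ['/', '/a', '/b', '/c', '/d']
print(bfs_crawl("/", toy_links, max_pages=3))  # ['/', '/a', '/b']
```

Breadth-first order means that when the cap is hit, the pages closest to the root are the ones kept.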

What’s New on the Web? – Pages, Content and Links
• Birth, death and replacement
  • How many new pages are created, disappear and are replaced
  • Crawling in slow mode over a period of 39 weeks
  • 20% survival rate of Web pages
• Weekly birth rate of pages
  • How many new pages are created per week?
  • Identity: URL of the popular page
  • Average weekly birth rate is 8%
  • Once every month, the number of new pages is higher than in the previous weeks
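Given one set of URLs per weekly snapshot, the birth and survival figures above could be computed along these lines. This is a minimal sketch under assumptions: a page's identity is its URL, a page counts as "new" if its URL appeared in no earlier snapshot, and the birth rate is taken relative to the size of the current snapshot. The function names and the toy data are hypothetical.

```python
def weekly_birth_rates(snapshots):
    """For each week after the first, return the fraction of that week's
    URLs that were not seen in any earlier snapshot."""
    seen = set(snapshots[0])
    rates = []
    for urls in snapshots[1:]:
        new_urls = urls - seen
        rates.append(len(new_urls) / len(urls) if urls else 0.0)
        seen |= urls
    return rates

def survival_rate(first_urls, later_urls):
    """Fraction of the first snapshot's URLs still present later on."""
    if not first_urls:
        return 0.0
    return len(first_urls & later_urls) / len(first_urls)

# Hypothetical toy snapshots (one set of URLs per week).
weeks = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "b", "e", "f", "g"},
]

print(weekly_birth_rates(weeks))          # [0.25, 0.4]
print(survival_rate(weeks[0], weeks[2]))  # 0.5
```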

• Creation of new content
  • How much new content is present
  • Shingling technique used
  • w-shingle: a contiguous ordered subsequence of “w” words
  • New shingles are created at a slower rate than new pages
  • New shingles at 5% per week vs. new pages at 8% per week => 5/8 ≈ 62% of the content of new URLs is new
• Link-structure evolution
  • Search engines should efficiently capture the link structure
  • Significantly dynamic structure
  • New links are created at 25% per week, compared to 8% for new pages and 5% for new content
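To make the shingling idea concrete, here is a minimal sketch of w-shingles and of the fraction of a page's shingles that did not occur in earlier content. The whitespace tokenisation, the choice of w=3 in the example and the function names are assumptions for illustration; the slides do not say which settings the study used.

```python
def shingles(text, w=3):
    """Return the set of w-shingles: contiguous, ordered runs of w words."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def new_shingle_fraction(old_texts, new_text, w=3):
    """Fraction of the shingles in `new_text` not found in any old text."""
    known = set()
    for text in old_texts:
        known |= shingles(text, w)
    current = shingles(new_text, w)
    if not current:
        return 0.0
    return len(current - known) / len(current)

old_pages = ["the quick brown fox jumps"]
new_page = "the quick brown fox sleeps today"

print(sorted(shingles("the quick brown fox", w=3)))
# [('quick', 'brown', 'fox'), ('the', 'quick', 'brown')]
print(new_shingle_fraction(old_pages, new_page))  # 0.5
```

A page that copies most of its text from existing pages yields a low fraction even though its URL is new, which is how new content can grow more slowly than new pages.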

Changes in the Existing Pages
• Change frequency distribution (presence of change)
  • How often a Web page is “altered”
  • Most pages change either very frequently or very infrequently
• Degree of change (SEO)
  • Metric: TF.IDF word distance
  • Exact order of terms is ignored
  • Small edits such as advertisements, counters, etc. are detected as minor changes in the content of the pages
  • Search engines can exploit this by re-downloading only the revised pages
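One way to turn a TF.IDF word distance into code is the cosine distance between TF.IDF-weighted term vectors of two versions of a page. The sketch below is an assumption-laden illustration, not the exact metric from the study: the smoothed IDF formula, the lowercase whitespace tokenisation and all function names are choices made here. Because pages are treated as bags of words, term order has no effect on the result.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df

def tfidf_vector(text, doc_freq, n_docs):
    """Map each term to term frequency times a smoothed IDF (assumed form)."""
    counts = Counter(text.lower().split())
    return {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0)
        for term, tf in counts.items()
    }

def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors; 0 means identical direction."""
    if not u and not v:
        return 0.0
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    dot = sum(x * v.get(term, 0.0) for term, x in u.items())
    return 1.0 - dot / (norm_u * norm_v)

def change_degree(old_text, new_text, corpus):
    """TF.IDF cosine distance between two versions of a page."""
    df = document_frequencies(corpus)
    n = len(corpus)
    return cosine_distance(tfidf_vector(old_text, df, n),
                           tfidf_vector(new_text, df, n))

# Hypothetical page versions: a counter update versus a full rewrite.
old = "breaking news storm hits coast visitors 1041"
minor = "breaking news storm hits coast visitors 1042"
major = "sports final team wins cup visitors 1043"
corpus = [old, minor, major]

print(change_degree(old, minor, corpus) < change_degree(old, major, corpus))  # True
```

A crawler could compare such a distance against a threshold to decide whether a change is significant enough to justify re-downloading the page; the threshold itself would have to be tuned.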

Predictability
• Overall predictability
  • Metrics:
    Group A (Red): top 80%
    Group B (Yellow): top 80-90%
    Group C (Green): top 90-95%
    Group D (Blue): remaining pages
  • Why is this degree of predictability required?
• Predictability of individual sites
  • Individual sites – and considered for study

Conclusion
• Aspects of the evolving Web that are of particular interest for search engine design have been studied in this research over a period of 1 year
• Existing pages are being removed “rapidly” from the Web and replaced by new ones, whereas new pages tend to borrow content from existing ones
• Pages that change significantly over time have a predictable degree of change
• The link structure is evolving at a faster rate than most of the pages themselves
• The aim is to maximize search quality by making effective use of available resources to incorporate the changes

Thank You