1 Internet and Data Management Junghoo “John” Cho UCLA Computer Science.

Presentation transcript:

1 Internet and Data Management Junghoo “John” Cho UCLA Computer Science

2 Information Galore
Legacy database
Plain text files
Biblio server

3 Challenges: Too much information?
Discovery
Management
Overload
Access
…

4 Approaches
Central caching and indexing
–Google, Excite, AltaVista
Dynamic integration
–MySimon, BizRate

5 Central Caching and Indexing Central Index

6 Challenges
Page selection and download
–What page to download?
Page and index update
–How to update pages?
Page ranking
–What page is “important” or “relevant”?
Scalability

7 Dynamic Integration
[Diagram: a mediator on top of wrappers for Source 1, Source 2, ..., Source n]

8 Challenges
Heterogeneous sources
–Different data models: relational, object-oriented
–Different schemas and representations: “Keanu Reeves” or “Reeves, K.” etc.
Limited query capabilities
Mediator caching

9 Outline of This Talk
How can we keep pages fresh?
How does the Web change?
What do we mean by “fresh” pages?
How should we refresh pages?

10 Web Evolution Experiment
How often does a Web page change?
How long does a page stay on the Web?
How long does it take for 50% of the Web to change?
How do we model Web changes?

11 Experimental Setup
February 17 to June 24, 1999
270 sites visited (with permission)
–identified 400 sites with highest “PageRank”
–contacted administrators
720,000 pages collected
–3,000 pages from each site daily
–start at root, visit breadth first (get new & old pages)
–ran only 9pm - 6am, 10 seconds between site requests

12 Average Change Interval
[Figure: fraction of pages versus average change interval]

13 Change Interval – By Domain
[Figure: fraction of pages versus average change interval, by domain]

14 Modeling Web Evolution
Poisson process with rate $\lambda$
$T$ is the time to the next event
$f_T(t) = \lambda e^{-\lambda t}$ $(t > 0)$
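To make the Poisson model concrete, the following minimal sketch evaluates the density of the time to the next change and the probability that a page has changed within a given window. The function names and the example rate of one change per 10 days are illustrative assumptions, not part of the talk.

```python
import math


def change_interval_pdf(rate: float, t: float) -> float:
    """Density f_T(t) = rate * exp(-rate * t) of the time T to the next change."""
    if t <= 0:
        return 0.0
    return rate * math.exp(-rate * t)


def prob_changed_within(rate: float, t: float) -> float:
    """P(T <= t) = 1 - exp(-rate * t): chance the page changed within time t."""
    return 1.0 - math.exp(-rate * max(t, 0.0))


# Hypothetical page that changes once every 10 days on average.
rate = 1 / 10
print(prob_changed_within(rate, 1))   # chance of a change within one day
print(prob_changed_within(rate, 10))  # chance of a change within ten days
```

Under this model a page is never guaranteed to have changed: even after a full average change interval, the probability of at least one change is 1 - 1/e, roughly 63%.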

15 Change Interval of Pages
For pages that change every 10 days on average
[Figure: fraction of changes with a given interval versus interval in days, compared with the Poisson model]

16 Change Metrics: Freshness
–Freshness of element $e_i$ at time $t$ is
$F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$
–Freshness of the database $S$ at time $t$ is
$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$
(Assume “equal importance” of pages)
[Figure: elements $e_i$ on the web and their copies in the database]

17 Change Metrics: Age
–Age of element $e_i$ at time $t$ is
$A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$
–Age of the database $S$ at time $t$ is
$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$
(Assume “equal importance” of pages)
[Figure: elements $e_i$ on the web and their copies in the database]
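The two metrics can be computed directly from their definitions. The sketch below is illustrative only: the `Element` class and its `stale_since` field are hypothetical names, and it assumes the "modification time" is the time of the web-side change that made the local copy stale.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Element:
    # Time of the web-side change that made the local copy stale;
    # None while the local copy is still up-to-date (assumed representation).
    stale_since: Optional[float] = None


def freshness(e: Element, t: float) -> float:
    """F(e; t): 1 if the copy is up-to-date at time t, 0 otherwise."""
    return 1.0 if e.stale_since is None or e.stale_since > t else 0.0


def age(e: Element, t: float) -> float:
    """A(e; t): 0 if up-to-date, else time elapsed since the modification."""
    if e.stale_since is None or e.stale_since > t:
        return 0.0
    return t - e.stale_since


def database_freshness(db: Sequence[Element], t: float) -> float:
    """F(S; t): average freshness, all pages weighted equally."""
    return sum(freshness(e, t) for e in db) / len(db)


def database_age(db: Sequence[Element], t: float) -> float:
    """A(S; t): average age, all pages weighted equally."""
    return sum(age(e, t) for e in db) / len(db)


db = [Element(), Element(stale_since=3.0), Element(stale_since=8.0), Element()]
print(database_freshness(db, 10.0), database_age(db, 10.0))
```

Freshness only records whether a copy is stale, while age also captures how long it has been stale, which is why the two metrics can favor different refresh policies.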

18 Change Metrics
[Figure: $F(e_i)$ and $A(e_i)$ plotted over time, with update and refresh events marked]
Time averages:
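The time-average formulas are not reproduced in this transcript. One standard way to write them, given here as an assumption rather than as the slide's exact notation, averages each metric over an ever longer observation window:

```latex
\bar{F}(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(e_i; \tau)\, d\tau,
\qquad
\bar{A}(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t A(e_i; \tau)\, d\tau
```

The bar notation and the integration variable $\tau$ are introduced here for illustration. The same averaging applied to $F(S; t)$ and $A(S; t)$ gives database-level time averages.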

19 Trick Question
Two-page database
e1 changes daily
e2 changes once a week
Can visit one page per week
How should we visit pages?
–e1 e2 e1 e2 e1 e2 e1 e2 ... [uniform]
–e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 … [proportional]
–e1 e1 e1 e1 e1 e1 ...
–e2 e2 e2 e2 e2 e2 ...
–?
[Figure: e1 and e2 on the web, with their copies in the database]

20 Proportional Often Not Good!
Visit fast changing e1 → get 1/2 day of freshness
Visit slow changing e2 → get 1/2 week of freshness
Visiting e2 is a better deal!
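The comparison can be checked numerically. The sketch below assumes the Poisson change model and pages refreshed at regular intervals, in which case a page with change rate λ refreshed at frequency f has time-averaged freshness (1 - e^(-λ/f)) / (λ/f). The policy names and the dictionary layout are illustrative choices, not part of the talk.

```python
import math


def avg_freshness(change_rate: float, refresh_freq: float) -> float:
    """Time-averaged freshness of one page under Poisson changes and
    regular refreshes; a page that is never refreshed scores 0."""
    if refresh_freq <= 0:
        return 0.0
    r = change_rate / refresh_freq
    return (1 - math.exp(-r)) / r


# Two-page database: e1 changes daily, e2 weekly (rates per week).
CHANGE_RATES = {"e1": 7.0, "e2": 1.0}

# Each policy spends the same budget of one visit per week in total.
POLICIES = {
    "uniform": {"e1": 0.5, "e2": 0.5},
    "proportional": {"e1": 7 / 8, "e2": 1 / 8},
    "only e1": {"e1": 1.0, "e2": 0.0},
    "only e2": {"e1": 0.0, "e2": 1.0},
}


def database_freshness(policy: dict) -> float:
    """Average freshness over both pages, weighted equally."""
    scores = [avg_freshness(CHANGE_RATES[p], policy[p]) for p in CHANGE_RATES]
    return sum(scores) / len(scores)


for name, policy in POLICIES.items():
    print(f"{name:>12}: {database_freshness(policy):.3f}")
```

Under these assumptions the proportional policy scores below the uniform one, and spending the whole budget on the slowly changing page e2 scores highest of the four.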

21 Optimal Refresh Frequency
Problem: given change rates $\lambda_1, \lambda_2, \ldots, \lambda_N$ and an average refresh frequency $f$, find $f_1, f_2, \ldots, f_N$ that maximize the time-averaged freshness of the database
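For a very small database the optimization can be explored by brute force. The sketch below is not the analytical solution from the talk; it is a grid search over how to split a total refresh budget between two pages, again assuming Poisson changes and regularly spaced refreshes. The names `best_split`, `total_budget`, and `steps` are hypothetical.

```python
import math


def avg_freshness(change_rate: float, refresh_freq: float) -> float:
    """Time-averaged freshness of one page under Poisson changes and
    regular refreshes; a page that is never refreshed scores 0."""
    if refresh_freq <= 0:
        return 0.0
    r = change_rate / refresh_freq
    return (1 - math.exp(-r)) / r


def best_split(rate1: float, rate2: float, total_budget: float, steps: int = 1000):
    """Try every split f1 + f2 = total_budget on a grid and return
    (freshness, f1, f2) for the split with the highest average freshness."""
    best = None
    for k in range(steps + 1):
        f1 = total_budget * k / steps
        f2 = total_budget - f1
        score = (avg_freshness(rate1, f1) + avg_freshness(rate2, f2)) / 2
        if best is None or score > best[0]:
            best = (score, f1, f2)
    return best


# e1 changes 7 times per week, e2 once per week; one visit per week in total.
score, f1, f2 = best_split(7.0, 1.0, total_budget=1.0)
print(f"f1={f1:.3f}  f2={f2:.3f}  freshness={score:.3f}")
```

A grid search only scales to a handful of pages. It is used here because it makes no assumptions about the shape of the objective.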

22 Optimal Refresh Frequency
Shape of curve is the same in all cases
Holds for any change frequency distribution

23 Optimal Refresh for Age
Shape of curve is the same in all cases
Holds for any change frequency distribution

24 Comparing Policies
Based on statistics from the experiment and a revisit frequency of once every month

25 Not Every Page is Equal!
e1: accessed by users 20 times/day
e2: accessed by users 10 times/day
Some pages are “more important”
In general, $F(S) = w_1 F(e_1) + w_2 F(e_2)$

26 Weighted Freshness
[Figure: refresh frequency f, with curves for w = 1 and w = 2]

27 Change Frequency Estimation
How to estimate change frequency?
–Naïve estimator: X/T
–X: number of detected changes
–T: monitoring period
–2 changes in 10 days: 0.2 times/day
Incomplete change history
[Figure: timeline of daily page visits, page changes, and the changes actually detected]

28 Improved Estimator
Based on the Poisson model
–X: number of detected changes
–N: number of accesses
–f: access frequency
3 changes in 10 days: 0.36 times/day
→ Accounts for “missed” changes
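The estimator's formula is not reproduced in this transcript. A minimal sketch, assuming the form -f · ln(1 - X/N), follows from the Poisson model: each access detects a change with probability 1 - e^(-λ/f), and X/N estimates that probability. This form reproduces the quoted 0.36 times/day for 3 changes in 10 daily accesses, but the function names and the exact form are assumptions.

```python
import math


def naive_estimate(changes_detected: int, monitoring_period: float) -> float:
    """Naive estimator X/T: detected changes divided by the monitoring period."""
    return changes_detected / monitoring_period


def poisson_estimate(changes_detected: int, accesses: int, access_freq: float) -> float:
    """Assumed Poisson-based estimator: -f * ln(1 - X/N).

    Undefined when every access detected a change (X == N), because the
    data then give no upper bound on the change rate.
    """
    if changes_detected >= accesses:
        raise ValueError("estimate is undefined when every access detects a change")
    return -access_freq * math.log(1 - changes_detected / accesses)


# Ten daily accesses (f = 1 per day) with 3 detected changes.
print(naive_estimate(3, 10))         # naive estimate for the same data
print(poisson_estimate(3, 10, 1.0))  # Poisson-based estimate
```

The Poisson-based value is always at least as large as the naive one, because several changes between two visits are detected as a single change.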

29 Improvement Significant?
Application to a Web crawler
–Visit pages once every week for 5 weeks
–Estimate change frequency
–Adjust revisit frequency based on the estimate
»Uniform: do not adjust
»Naïve: based on the naïve estimator
»Ours: based on our improved estimator

30 Improvement from Our Estimator
Policy    Detected changes   Ratio to uniform
Uniform   2,147,             100%
Naïve     4,145,582          193%
Ours      4,892,116          228%
(9,200,000 visits in total)

31 WebArchive Project
Can we store the history of the Web?
–Web is ephemeral
–Study of the Web evolution
Challenges
–Update?
–Compression?
–New storage?
–Indexing?

32 Conclusion
Exciting area and many challenges ahead!
Thank you for your attention
For more information visit http://