CS246 Page Refresh

What is a Crawler?
- [Diagram: initial URLs seed the "to visit" queue; the crawler repeatedly gets the next URL, fetches the page from the Web, records it among the visited URLs, extracts new URLs from the page, and adds them back to the queue.]
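
To make the loop in the diagram concrete, here is a minimal crawler sketch in Python. It only illustrates the picture above, not the crawler used in the course or the papers; the regex link extraction, timeout, and page limit are simplifications, and politeness (robots.txt, per-site delays) is omitted entirely.

```python
# Minimal sketch of the crawl loop in the diagram above.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_urls, max_pages=100):
    to_visit = deque(seed_urls)          # "to visit" queue, initialized from seeds
    visited = set()                      # visited URLs
    pages = {}                           # downloaded web pages
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()         # get next URL
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                     # skip pages that fail to download
        pages[url] = html                # get page
        for link in re.findall(r'href="([^"#]+)"', html):   # extract URLs
            to_visit.append(urljoin(url, link))
    return pages
```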

Crawling Issues
- Load at the site: the crawler should be unobtrusive to visited sites
- Load at the crawler: download billions of Web pages in a short time
- Page selection: many pages, limited resources
- Page refresh: refresh pages incrementally, not in batch

Junghoo "John" Cho (UCLA Computer Science) Today’s Topic Page refresh How can we maintain “cached” pages “fresh”? Web search engines, data warehouse, etc. Refresh Source Copy Junghoo "John" Cho (UCLA Computer Science)

Other Caching Problems
- Disk buffers: disk page vs. memory page buffer
- Memory hierarchy: 1st-level cache, 2nd-level cache, …
- Is Web caching any different?

Junghoo "John" Cho (UCLA Computer Science) Main Difference Origination of changes Cache to source Source to cache Freshness requirement Perfect caching Stale caching Role of a cache Transient space: cache replacement policy Main data source for application Refresh delay Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Main Difference Limited refresh resources Many independent sources Network bandwidth Computational resources … Mainly pull model Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Ideas? How can we maintain pages “fresh”? Some pages change often, some pages do not. News archive Daily news article A set of pages change together Java manual pages Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Topics to Cover Freshness paper Sampling paper Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Freshness Paper What is the problem? What are the main ideas? What does it try to “optimize”? What is its goal function? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Change Metrics Freshness Freshness of element ei at time t is F( ei ; t ) = 1 if ei is up-to-date at time t 0 otherwise Freshness of the database S at time t is F( S ; t ) = F( ei ; t )  N 1 i=1 ei ... web database Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Change Metrics Age Age of element ei at time t is A( ei ; t ) = 0 if ei is up-to-date at time t t - (modification ei time) otherwise Age of the database S at time t is A( S ; t ) = A( ei ; t )  N 1 i=1 ei ... web database Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Change Metrics F(ei) F( S ) = lim F(S ; t ) dt  t 1 t F( ei ) = lim F(ei ; t ) dt Time averages: similar for age... 1 time A(ei) time update refresh Junghoo "John" Cho (UCLA Computer Science)

Discussions in the Paper
- Web evolution experiments
- Framework
  - Web change model
  - Freshness metrics
  - Refresh policies
- Analysis of various policies

Junghoo "John" Cho (UCLA Computer Science) Experimental Setup February 17 to June 24, 1999 270 sites visited (with permission) identified 400 sites with highest “page rank” contacted administrators 720,000 pages collected 3,000 pages from each site daily start at root, visit breadth first (get new & old pages) Junghoo "John" Cho (UCLA Computer Science)

Average Change Interval
- [Histogram: fraction of pages vs. average change interval]

Average Change Interval — By Domain
- [Histogram: fraction of pages vs. average change interval, broken down by domain]

Discussions in the Paper
- Web evolution experiments
- Framework
  - Web change model
  - Freshness metrics
  - Refresh policies
- Analysis of various policies

Junghoo "John" Cho (UCLA Computer Science) How It Started How to maintain pages up-to-date? Probability to change: Pi Only two refreshes Evaluation metric: Freshness Only uniform, semi-proportional policies No age metric, no Poisson model One suggestion on research Make the model as simple as possible At least in the beginning Junghoo "John" Cho (UCLA Computer Science)

Modeling Web Evolution
- Poisson process with rate λ: memoryless, independent changes
- Intuition: split time into very small intervals of length Δt and flip a coin in each interval with the same small probability p ≈ λ·Δt
- If T is the time to the next change event, then f_T(t) = λ e^{−λt} for t > 0 (exponential inter-change times)
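
A quick sketch showing that the two views of the model agree: flipping a coin with probability λ·Δt in every small interval produces the same long-run change rate as drawing exponential gaps with density λe^{−λt}. All numbers below are arbitrary.

```python
# Two equivalent views of the Poisson change model.
import random

lam = 0.1                        # changes per day (page changes every ~10 days)
random.seed(0)

# View 1: flip a coin with probability lam * dt in each tiny interval.
dt, horizon = 0.01, 10000.0
flip_changes = sum(random.random() < lam * dt for _ in range(int(horizon / dt)))

# View 2: draw exponential gaps between changes directly.
t, exp_changes = 0.0, 0
while t < horizon:
    t += random.expovariate(lam)     # T ~ Exponential(lam)
    exp_changes += 1

print(flip_changes / horizon, exp_changes / horizon, lam)   # all ≈ 0.1
```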

Change Interval of Pages
- [Plot: for pages that change every 10 days on average, fraction of changes with a given interval vs. interval in days, compared against the Poisson model prediction]

Questions on Experiments
- Is the Poisson model correct?
- Seems okay for pages that change between about once a week and once a month
- Questionable for other pages

Junghoo "John" Cho (UCLA Computer Science) Poisson Model Junghoo "John" Cho (UCLA Computer Science)

Discussions in the Paper
- Web evolution experiments
- Framework
  - Web change model
  - Freshness metrics
  - Refresh policies
- Analysis of various policies

Junghoo "John" Cho (UCLA Computer Science) Refresh Policies Synchronization frequency How often to refresh? Synchronization points When to refresh? Resource allocation How often each page? Refresh order In what order? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Refresh Order Fixed order Example: Explicit list of URLs to visit Random Order Example: Start from seed URLs & follow links Purely Random Example: Refresh pages on demand, as requested by user database web ei ei ... ... Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Freshness vs. Order r =  / f = average change frequency / average visit frequency Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Age vs. Order = Age / time to refresh all N elements r =  / f = average change frequency / average visit frequency Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Resource Allocation Two page database e1 changes daily e2 changes once a week Can visit pages once a week How should we visit pages? e1 e1 e1 e1 e1 e1 ... e2 e2 e2 e2 e2 e2 ... e1 e2 e1 e2 e1 e2 ... [uniform] e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional] ? e1 e1 e2 e2 web database Junghoo "John" Cho (UCLA Computer Science)

Proportional Often Not Good!
- Visit fast-changing e1 → get ~1/2 day of freshness
- Visit slow-changing e2 → get ~1/2 week of freshness
- Visiting e2 is a better deal!

Proportional vs. Uniform
- Uniform is always better than proportional (assuming every page changes)
- Proportional allocates too many resources to pages that change too often
- The proof is based on the concavity of the freshness/age curve
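
A rough simulation of the earlier two-page example supports this. Assuming the budget means one visit per page per week on average (two refreshes per week in total, which is my reading of the slide), the uniform schedule beats the schedule proportional to change rates. The event-driven simulation below is only a sketch.

```python
import random

def avg_freshness_sim(change_rate, refresh_interval, horizon=50000.0, seed=1):
    """Simulated time-averaged freshness of one page: changes arrive as a
    Poisson process; the copy is refreshed at fixed intervals."""
    rng = random.Random(seed)
    t, fresh_time, fresh = 0.0, 0.0, True
    next_change = rng.expovariate(change_rate)
    next_refresh = refresh_interval
    while t < horizon:
        t_next = min(next_change, next_refresh, horizon)
        if fresh:
            fresh_time += t_next - t        # accumulate time the copy was fresh
        t = t_next
        if t == next_change:
            fresh = False                   # source changed; copy is now stale
            next_change = t + rng.expovariate(change_rate)
        elif t == next_refresh:
            fresh = True                    # copy refreshed; fresh again
            next_refresh = t + refresh_interval
    return fresh_time / horizon

day, week = 1.0, 7.0
# Uniform: each page refreshed once per week (2 refreshes/week in total).
uniform = (avg_freshness_sim(1 / day, week) +
           avg_freshness_sim(1 / week, week)) / 2
# Proportional to change rates (7:1): e1 every 4 days, e2 every 28 days.
proportional = (avg_freshness_sim(1 / day, 4.0) +
                avg_freshness_sim(1 / week, 28.0)) / 2
print(round(uniform, 3), round(proportional, 3))   # uniform comes out ahead
```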

Optimal Refresh Frequency
- Problem: given change frequencies λ1, λ2, …, λN and the average refresh frequency f, find refresh frequencies f1, f2, …, fN that maximize the average freshness F̄(S), subject to (1/N) Σ_i f_i = f
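
A small numerical sketch of this optimization. It assumes the standard closed form for the time-averaged freshness of a Poisson-changing page refreshed at regular intervals, F̄(e_i) = (f_i/λ_i)(1 − e^{−λ_i/f_i}); the change rates below are hypothetical, and scipy's SLSQP stands in for the paper's analytical solution.

```python
import numpy as np
from scipy.optimize import minimize

def avg_freshness(f, lam):
    """Assumed closed form: time-averaged freshness of a page with Poisson
    change rate lam, refreshed every 1/f time units."""
    f = np.maximum(f, 1e-12)                 # avoid division by zero at f = 0
    return (f / lam) * (1.0 - np.exp(-lam / f))

def optimal_frequencies(lam, f_avg):
    """Maximize mean freshness subject to mean refresh frequency == f_avg."""
    n = len(lam)
    x0 = np.full(n, f_avg)                   # start from the uniform policy
    cons = {"type": "eq", "fun": lambda f: f.mean() - f_avg}
    res = minimize(lambda f: -avg_freshness(f, lam).mean(), x0,
                   bounds=[(0.0, None)] * n, constraints=cons)
    return res.x

lam = np.array([9.0, 6.0, 3.0, 1.0, 0.2])    # hypothetical changes per day
print(np.round(optimal_frequencies(lam, f_avg=2.0), 2))
# Pages that change much faster than the budget allows end up near 0,
# echoing the "proportional is not optimal" point above.
```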

Selecting Optimal Refresh Frequency
- [Plot: optimal refresh frequency as a function of change frequency]
- The shape of the curve is the same in all cases
- Holds for any change-frequency distribution

Optimal Refresh Frequency for Age
- [Plot: optimal refresh frequency vs. change frequency when minimizing age]
- The shape of the curve is the same in all cases
- Holds for any distribution

Junghoo "John" Cho (UCLA Computer Science) Comparing Policies Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Limitation How can we know the change frequencies of pages? Is the independent Poisson model valid? Does the paper have an algorithm? Can you implement the optimal policy? Is it easy to compute the optimal policy? Is every page equal? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Paper Writing Hint Lots of concrete examples Deliver the points much more concretely and intuitively Make the paper easy to read Hide technical details But emphasize the it is very difficult Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Topics to Cover Freshness paper Sampling paper Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Sampling Paper What is the main idea? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) How It Started First idea: Changes of pages may be correlated Initial approach: Find a set of correlated pages Sample a page and check for change Correlation model: P{C|S} – higher better Hidden Markov model Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) How It Evolved More Ideas: Why one page? Sample more pages! If 5 out of 10 pages changed, has sample “changed”? How to use sampling results? Proportional? Greedy? Is correlation necessary? Junghoo "John" Cho (UCLA Computer Science)

Is Correlation Necessary?
- Correlation is not necessary; random sampling is enough
- [Diagram: two sites where 4/5 and 1/5 of the sampled pages have changed]

Junghoo "John" Cho (UCLA Computer Science) Research? Scrap previous ideas and models No correlation, partition, hidden markov model New Idea: Purely sampling based Sample a small number of pages from each site Download more pages from the sites with more changed samples Research is like finding an exit in a maze Explore all different paths without knowing exit In the hindsight, everything is obvious… Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Challenge How can we write a paper using this simple idea? Anyone can come up with such a simple idea What are the issues? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Issues How to measure effectiveness? How to use sampling results? Proportional vs Greedy How many samples? Dynamic sample size adjustment? What if we have very limited resources? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Evaluation Metric Fixed download resource in each cycle Maximize the number of detected changes ChangeRatio: No of changed & downloaded pages No of downloaded pages Why not Freshness or Age? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Evaluation Metric Why not Freshness or Age? Difficult to measure in practice Sampling may not work very well for Freshness or Age metric Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Issues How to measure effectiveness? How to use sampling results? Proportional vs Greedy How many samples? Dynamic sample size adjustment? What if we have very limited resources? Junghoo "John" Cho (UCLA Computer Science)

Greedy vs. Proportional
- Greedy: download pages from the sites with the most detected changes
- Proportional: download pages proportionally to the detected changes
- Theorem: greedy is optimal if we use a fixed number of samples
- Is it too obvious?
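
A hedged sketch of the greedy use of sampling results (illustrative, not the paper's exact algorithm): sample a fixed number of pages from each site, estimate each site's change fraction ρ, then spend the remaining budget on the sites with the highest estimates. The site data below is toy ground truth.

```python
import random

def greedy_allocation(changed, sample_size, budget, seed=0):
    """changed: dict site -> list of bools, True if that page changed since
    the last crawl (toy ground truth). Returns pages to download per site."""
    rng = random.Random(seed)
    est = {}                                   # estimated change fraction ρ per site
    for site, flags in changed.items():
        sample = rng.sample(range(len(flags)), sample_size)
        est[site] = sum(flags[i] for i in sample) / sample_size
    remaining = budget - sample_size * len(changed)
    plan = {site: sample_size for site in changed}
    # Greedy: give the leftover budget to sites with the highest estimated ρ first.
    for site in sorted(est, key=est.get, reverse=True):
        extra = min(len(changed[site]) - sample_size, max(remaining, 0))
        plan[site] += extra
        remaining -= extra
    return est, plan

# Toy example: two 100-page sites where 80% and 20% of the pages changed.
sites = {"siteA": [True] * 80 + [False] * 20,
         "siteB": [True] * 20 + [False] * 80}
est, plan = greedy_allocation(sites, sample_size=10, budget=100)
print(est, plan)   # most of the budget should go to siteA
```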

Junghoo "John" Cho (UCLA Computer Science) Sample Size How many samples from each site? No clue in the beginning First analysis Two Web sites, 100 pages each Find optimal sample size Varying download resources Varying number of changes in each site Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Sample Size ChangeRatio (20, 80) (40, 60) Optimal sample size Sample size Junghoo "John" Cho (UCLA Computer Science)

Change Fraction Distribution
- ρ: fraction of pages changed in a site
- f(ρ): distribution of ρ values across sites
- [Plot: fraction of sites vs. ρ, with the threshold ρ_t marked]

Change Fraction Distribution
- [Figure omitted]

Junghoo "John" Cho (UCLA Computer Science) Optimal Sample Size N: no of pages in a site r: total download resources/total no of pages Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Research Advice Start with simple examples Helps you understand the problem and solution better Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Dynamic Sample Size? Do we need the same sample size for every site?  = 0,  = 0.45,  = 0.55,  = 1 Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Adaptive Sampling If the estimated r is high/low enough, make an early decision What does “high enough” mean? Confidence interval above threshold ( ) i ( ) i ( ) i So that is the basic intuition of our adaptive policy. In the adaptive policy, if the estimated rho value of a site is high enough, we decide to download pages from the site even early on, and if the estimated rho value is low enough, then we just stop downloading pages. And we formalize this notion of “low enough” or “high enough”, by using the confidence interval of estimated rho value. So after downloading a small number of samples, say 5 pages, from each Web site, we compute, say 95% confidence interval of the estimated rho value of each site. Then if the confidence interval of a site is strictly above the threshold rho_t like this, then it means that with 95% confidence we can say that the site will eventually be downloaded, so we decide download pages from the site at the point. Also if the confidence interval is strictly lower than the threshold, it means that with 95% confidence we can say that the site will be eventually discarded, so we decide to stop downloading pages from the site. Only when the confidence interval spans over the threshold value, we decide to sample more pages from the site before we make any decision. So roughly this is how our adaptive policy works. For more detailed description of our adaptive policy, please look at our paper.  t Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Issues How to measure effectiveness? How to use sampling results? Proportional vs Greedy How many samples? Dynamic sample size adjustment? What if we have very limited resources? Junghoo "John" Cho (UCLA Computer Science)

Low Download Resources
- What if we have very limited resources? We cannot sample enough pages
- Divide and conquer: subset sampling — sample only a small subset in each download cycle

Comparison of Policies
- [Bar chart: ChangeRatio for each policy — RR (round-robin), FRQ (frequency-based), PRP (proportional), GRD (greedy), ADP (adaptive)]
- Speaker notes: The chart shows the ChangeRatio, the fraction of changed pages among the downloaded pages, averaged over the 5 download cycles each policy ran. The greedy and proportional policies used sample size 10. The poor performance of round-robin is expected, since it blindly downloads pages in round-robin order, while the greedy and adaptive policies show a very significant improvement over both round-robin and the frequency-based policy. Note that this experiment was somewhat unfair to the frequency-based policy: it needs relatively long change histories to perform well, but with only 5 download cycles it could detect at most 5 changes per page, so it did not have enough data to tune itself to the pages' change frequencies. The long-term comparison of the frequency-based and sampling-based policies follows.

Junghoo "John" Cho (UCLA Computer Science) Frequency vs. Sampling ChangeRatio Frequency Greedy And this graph shows the results for this experiment. In this graph, the horizontal axis shows the download cycle and the vertical axis shows the changeratio. The red line here is the result of the frequency-based policy and the blue line here is the result of the greedy policy. From this result, we can clearly see that the Greedy policy performs much better than the frequency policy in the beginning. At least until the 100th download cycle, the greedy policy shows much better performance than the frequency based policy. Only after 100th download cycle, the frequency-based policy shows better performance than the greedy policy on average. From this result we can learn that if the change frequencies of the pages do not change over time, then in a very long term, frequency-based policy can perform better than the greedy policy. However, note that for a long period of time, in this case for 100 download cycles, in other words, for 100 months in this experiment, greedy policy performed better. Another interesting thing that we can observe is that the frequency-based policy shows some fluctuation in its performance. Essentially the dips in the graph is when the frequency policy redownload the pages that rarely change. In case of frequency-based policy, it cannot be sure whether a particular change will ever change, even if the page has never changed, so it has to periodically revisit the page to make sure that the page did not change. Of course, as time goes on, the frequency policy gains more confidence that the page does not change, so the interval to check those rarely changing pages increases over time. That is why the interval between the dips in the graph increases over time in this graph. Download Cycle Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Other Questions Would it work for Freshness? Mostly not… Download frequently changing pages more often Should we sample based on sites? Not really… Other sampling unit might be better Junghoo "John" Cho (UCLA Computer Science)

Summary (Page Refresh)
- Frequency-based approach: an intuitive policy may not work very well
- Sampling-based approach: sample-size issues, dynamic sampling, low download resources

Junghoo "John" Cho (UCLA Computer Science) Questions? Junghoo "John" Cho (UCLA Computer Science)