1 CS246 Page Refresh

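2 What is a Crawler?
[Diagram: initial urls → init → to-visit urls → get next url → get page (from the web) → extract urls → back to to-visit urls; the crawler also maintains visited urls and the collected web pages]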
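
The loop in the crawler diagram can be sketched as follows. This is a minimal illustration, not the course's crawler: `get_page` and `extract_urls` are hypothetical callables supplied by the caller, and the "web" in the usage lines is a toy dictionary rather than real HTTP.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set


def crawl(initial_urls: Iterable[str],
          get_page: Callable[[str], str],
          extract_urls: Callable[[str], List[str]],
          max_pages: int = 100) -> Dict[str, str]:
    """Breadth-first crawl: returns a mapping url -> page content."""
    to_visit = deque(initial_urls)      # "to visit urls" queue
    visited: Set[str] = set()           # "visited urls" set
    web_pages: Dict[str, str] = {}      # collected "web pages"
    while to_visit and len(web_pages) < max_pages:
        url = to_visit.popleft()        # "get next url"
        if url in visited:
            continue
        visited.add(url)
        page = get_page(url)            # "get page"
        web_pages[url] = page
        for link in extract_urls(page): # "extract urls"
            if link not in visited:
                to_visit.append(link)
    return web_pages


# Toy web (hypothetical): each page's content is a space-separated list of links.
TOY_WEB = {"a": "b c", "b": "c", "c": "a d", "d": ""}
pages = crawl(["a"], lambda u: TOY_WEB.get(u, ""), lambda p: p.split())
print(list(pages))
```

A real crawler would also need politeness delays and error handling, which relate to the "load at the site" issue on the next slide.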

3 Crawling Issues
Load at the site: the crawler should be unobtrusive to visited sites.
Load at the crawler: download billions of Web pages in a short time.
Page selection: many pages, limited resources.
Page refresh: refresh pages incrementally, not in batch.

4 Today’s Topic
Page refresh: how can we keep “cached” pages “fresh”? Relevant to Web search engines, data warehouses, etc.
[Diagram: Source → (refresh) → Copy]
Junghoo "John" Cho (UCLA Computer Science)

5 Other Caching Problems
Disk buffers: disk page, memory page buffer.
Memory hierarchy: 1st-level cache, 2nd-level cache, …
Is Web caching any different?

6 Main Difference
Origination of changes: cache to source, or source to cache.
Freshness requirement: perfect caching, or stale caching.
Role of a cache: transient space (cache replacement policy), or main data source for the application.
Refresh delay.

7 Main Difference
Limited refresh resources: many independent sources, network bandwidth, computational resources.
Mainly pull model.

8 Ideas?
How can we keep pages “fresh”?
Some pages change often, some pages do not: news archive vs. daily news article.
A set of pages may change together: Java manual pages.

9 Topics to Cover
Freshness paper
Sampling paper

10 Freshness Paper
What is the problem? What are the main ideas? What does it try to “optimize”? What is its goal function?

11 Change Metrics: Freshness
Freshness of element $e_i$ at time $t$:
$$F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$$
Freshness of the database $S$ at time $t$:
$$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$$
[Diagram: elements $e_i$ on the web and their copies in the database]

12 Change Metrics: Age
Age of element $e_i$ at time $t$:
$$A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$$
Age of the database $S$ at time $t$:
$$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$$
[Diagram: elements $e_i$ on the web and their copies in the database]

13 Change Metrics: Time Averages
$$\bar{F}(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(e_i; \tau)\, d\tau, \qquad \bar{F}(S) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(S; \tau)\, d\tau$$
Similar for age.
[Figure: $F(e_i)$ and $A(e_i)$ plotted over time, with update and refresh events marked]
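
As a concrete reading of the freshness and age definitions, here is a small sketch that evaluates both metrics for a toy database at one instant. The `Element` record and its field names are hypothetical, and the code assumes that "modification time" means the earliest source change not yet reflected in the local copy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    last_refresh: float        # time the local copy was last synchronized
    change_times: List[float]  # times at which the source page changed


def first_unseen_change(e: Element, t: float) -> Optional[float]:
    """Earliest source change in (last_refresh, t], or None if the copy is up to date."""
    unseen = [c for c in e.change_times if e.last_refresh < c <= t]
    return min(unseen) if unseen else None


def element_freshness(e: Element, t: float) -> int:
    return 1 if first_unseen_change(e, t) is None else 0


def element_age(e: Element, t: float) -> float:
    c = first_unseen_change(e, t)
    return 0.0 if c is None else t - c


def database_freshness(db: List[Element], t: float) -> float:
    return sum(element_freshness(e, t) for e in db) / len(db)


def database_age(db: List[Element], t: float) -> float:
    return sum(element_age(e, t) for e in db) / len(db)


db = [
    Element(last_refresh=5.0, change_times=[2.0, 7.0]),  # stale since t = 7
    Element(last_refresh=8.0, change_times=[1.0, 3.0]),  # up to date
    Element(last_refresh=0.0, change_times=[4.0, 6.0]),  # stale since t = 4
]
print(database_freshness(db, 10.0), database_age(db, 10.0))
```

The time averages on the slide would be obtained by averaging these instantaneous values over a long observation window.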

14 Discussions in the Paper
Web evolution experiments Framework Web change model Freshness metrics Refresh policies Analysis of various policies

15 Discussions in the Paper
Web evolution experiments Framework Web change model Freshness metrics Refresh policies Analysis of various policies

16 Experimental Setup
February 17 to June 24, 1999.
270 sites visited (with permission): identified the 400 sites with the highest “page rank” and contacted their administrators.
720,000 pages collected: 3,000 pages from each site daily; start at the root and visit breadth first (gets new and old pages).

17 Average Change Interval
[Figure: fraction of pages by average change interval]

18 Average Change Interval: By Domain
[Figure: fraction of pages by average change interval, broken down by domain]

19 Discussions in the Paper
Web evolution experiments Framework Web change model Freshness metrics Refresh policies Analysis of various policies

20 How It Started
How to keep pages up-to-date?
Probability of change: $P_i$. Only two refreshes. Evaluation metric: freshness. Only uniform and semi-proportional policies. No age metric, no Poisson model.
One suggestion on research: make the model as simple as possible, at least in the beginning.

21 Modeling Web Evolution
Poisson process with rate $\lambda$: memoryless, independent.
Split time into very small intervals and flip a coin at each interval with the same probability $p$: $\lambda \sim p$.
$T$ is the time to the next event: $f_T(t) = \lambda e^{-\lambda t}$ $(t > 0)$.
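
The coin-flipping view of the Poisson model can be checked with a short simulation. This is only an illustrative sketch: the rate, interval width, and step count are arbitrary choices, and the per-interval probability is taken as p = rate × dt, which is the usual small-interval approximation.

```python
import random
from typing import List


def simulate_change_intervals(rate: float, dt: float, steps: int,
                              seed: int = 0) -> List[float]:
    """Flip a coin with probability p = rate * dt in each small interval
    and record the time elapsed between successive 'change' events."""
    rng = random.Random(seed)
    p = rate * dt
    intervals = []
    elapsed = 0.0
    for _ in range(steps):
        elapsed += dt
        if rng.random() < p:       # a change happens in this interval
            intervals.append(elapsed)
            elapsed = 0.0
    return intervals


RATE = 0.1  # hypothetical: one change every 10 days on average
intervals = simulate_change_intervals(RATE, dt=0.05, steps=400_000)
mean_interval = sum(intervals) / len(intervals)
print(f"mean interval: {mean_interval:.2f} days, expected about {1 / RATE:.2f}")
```

If the model holds, the mean interval should come out close to 1/λ, and a histogram of `intervals` should look roughly exponential.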

22 Change Interval of Pages
[Figure: for pages that change every 10 days on average, fraction of changes with a given interval vs. interval in days, compared with the Poisson model]

23 Questions on Experiments
Is the Poisson model correct? It seems to be okay for pages that change between once a week and once a month. For other pages it is questionable.

24 Poisson Model

25 Discussions in the Paper
Web evolution experiments Framework Web change model Freshness metrics Refresh policies Analysis of various policies

26 Refresh Policies
Synchronization frequency: how often to refresh?
Synchronization points: when to refresh?
Resource allocation: how often to refresh each page?
Refresh order: in what order?

27 Refresh Order
Fixed order. Example: explicit list of URLs to visit.
Random order. Example: start from seed URLs and follow links.
Purely random. Example: refresh pages on demand, as requested by users.
[Diagram: elements $e_i$ on the web and their copies in the database]

28 Freshness vs. Order
$r = \lambda / f$ = average change frequency / average visit frequency
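
To see why the ratio r is the natural parameter, the following short derivation works out the time-averaged freshness of a single page under a fixed refresh order. It assumes the Poisson change model with rate λ and a refresh every I = 1/f time units; the symbols I and s are introduced here for illustration only.

```latex
% Assumptions: page e_i changes as a Poisson process with rate \lambda and is
% refreshed at fixed intervals I = 1/f; s is the time since the last refresh.
\begin{align*}
\Pr[\text{copy still fresh at } s] &= e^{-\lambda s} \\
\bar{F}(e_i) &= \frac{1}{I}\int_0^I e^{-\lambda s}\,ds
             = \frac{1 - e^{-\lambda I}}{\lambda I}
             = \frac{1 - e^{-r}}{r}, \qquad r = \frac{\lambda}{f}
\end{align*}
```

Under these assumptions the average freshness depends on λ and f only through r, and it decreases as r grows.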

29 Age vs. Order
Age is plotted as age / (time to refresh all $N$ elements).
$r = \lambda / f$ = average change frequency / average visit frequency

30 Resource Allocation
Two-page database: e1 changes daily, e2 changes once a week.
Can visit pages once a week
How should we visit pages?
e1 e1 e1 e1 e1 e1 ...
e2 e2 e2 e2 e2 e2 ...
e1 e2 e1 e2 e1 e [uniform]
e1 e1 e1 e1 e1 e1 e2 e1 e [proportional]
?
[Diagram: e1 and e2 on the web and their copies in the database]

31 Proportional Often Not Good!
Visit fast-changing e1 → get 1/2 day of freshness.
Visit slow-changing e2 → get 1/2 week of freshness.
Visiting e2 is a better deal!

32 Proportional vs Uniform
Uniform is always better than proportional, assuming every page changes.
Proportional allocates resources to pages that change too often.
The proof is based on the concavity of the freshness/age curve.
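
The uniform-versus-proportional claim can be tried numerically on a two-page database. The sketch below assumes the Poisson change model and a fixed refresh order, for which the time-averaged freshness of one page is (1 − e^(−r)) / r with r = λ / f. The total budget of one visit per day is a hypothetical choice, not a number from the slides.

```python
import math
from typing import List


def fixed_order_freshness(change_rate: float, visit_freq: float) -> float:
    """Time-averaged freshness of one page under the Poisson model
    when it is refreshed at a fixed frequency."""
    if visit_freq <= 0.0:
        return 0.0
    r = change_rate / visit_freq
    return (1.0 - math.exp(-r)) / r if r > 0.0 else 1.0


def average_freshness(rates: List[float], freqs: List[float]) -> float:
    return sum(fixed_order_freshness(l, f) for l, f in zip(rates, freqs)) / len(rates)


def uniform_allocation(rates: List[float], total: float) -> List[float]:
    return [total / len(rates)] * len(rates)


def proportional_allocation(rates: List[float], total: float) -> List[float]:
    s = sum(rates)
    return [total * l / s for l in rates]


RATES = [1.0, 1.0 / 7.0]     # e1 changes daily, e2 once a week (changes per day)
TOTAL_VISITS_PER_DAY = 1.0   # hypothetical refresh budget

uniform = average_freshness(RATES, uniform_allocation(RATES, TOTAL_VISITS_PER_DAY))
proportional = average_freshness(RATES, proportional_allocation(RATES, TOTAL_VISITS_PER_DAY))
print(f"uniform: {uniform:.3f}  proportional: {proportional:.3f}")
```

Under these assumptions the uniform allocation yields the higher average freshness, in line with the slide's statement.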

33 Optimal Refresh Frequency
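Problem: given $\lambda_1, \lambda_2, \ldots, \lambda_N$ and $f$, find $f_1, f_2, \ldots, f_N$ that maximize the time-averaged freshness of the database, $\bar{F}(S) = \frac{1}{N} \sum_{i=1}^{N} \bar{F}(e_i)$, subject to the average refresh frequency $\frac{1}{N} \sum_{i=1}^{N} f_i = f$.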
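
One way to approximate a solution to this allocation problem numerically is a greedy incremental search: hand out the refresh budget in small steps, always to the page whose freshness gains most from the next step. This is a sketch under stated assumptions, not the method from the paper: it uses the fixed-order Poisson freshness (1 − e^(−r)) / r, relies on that curve being concave in the visit frequency, and the step count is an arbitrary choice.

```python
import math
from typing import List


def freshness(change_rate: float, visit_freq: float) -> float:
    """Time-averaged freshness of one page (Poisson changes, fixed refresh order)."""
    if visit_freq <= 0.0:
        return 0.0
    r = change_rate / visit_freq
    return (1.0 - math.exp(-r)) / r if r > 0.0 else 1.0


def mean_freshness(rates: List[float], freqs: List[float]) -> float:
    return sum(freshness(l, f) for l, f in zip(rates, freqs)) / len(rates)


def allocate(rates: List[float], avg_freq: float, steps_per_page: int = 500) -> List[float]:
    """Greedy allocation of a total budget of avg_freq * N in equal small steps."""
    n = len(rates)
    delta = avg_freq / steps_per_page
    counts = [0] * n                       # number of steps given to each page

    def gain(i: int) -> float:
        return (freshness(rates[i], (counts[i] + 1) * delta)
                - freshness(rates[i], counts[i] * delta))

    for _ in range(steps_per_page * n):
        best = max(range(n), key=gain)     # page with the largest marginal gain
        counts[best] += 1
    return [c * delta for c in counts]


RATES = [1.0, 1.0 / 7.0]   # e1 changes daily, e2 once a week
AVG_FREQ = 0.5             # hypothetical average visits per page per day
optimized = allocate(RATES, AVG_FREQ)
uniform = [AVG_FREQ] * len(RATES)
print(optimized, mean_freshness(RATES, optimized), mean_freshness(RATES, uniform))
```

In this toy instance the fast-changing page ends up with more than its uniform share but less than its proportional share, which matches the intuition that neither simple policy is optimal.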

34 Selecting Optimal Refresh Frequency
Shape of curve is the same in all cases. Holds for any change frequency distribution.

35 Optimal Refresh Frequency for Age
Shape of curve is the same in all cases. Holds for any distribution.

36 Comparing Policies

37 Limitation
How can we know the change frequencies of pages? Is the independent Poisson model valid? Does the paper have an algorithm? Can you implement the optimal policy? Is it easy to compute the optimal policy? Is every page equal?

38 Paper Writing Hint
Lots of concrete examples: deliver the points much more concretely and intuitively.
Make the paper easy to read: hide technical details, but emphasize that it is very difficult.

39 Topics to Cover
Freshness paper
Sampling paper

40 Sampling Paper
What is the main idea?

41 How It Started
First idea: changes of pages may be correlated.
Initial approach: find a set of correlated pages; sample a page and check for change.
Correlation model: P{C|S} (higher is better); hidden Markov model.

42 How It Evolved
More ideas:
Why one page? Sample more pages! If 5 out of 10 pages changed, has the sample “changed”?
How to use sampling results? Proportional? Greedy?
Is correlation necessary?

43 Is Correlation Necessary?
Correlation is not necessary; random sampling alone is enough.
[Figure: random sampling example with fractions 4/5 and 1/5]

44 Research?
Scrap previous ideas and models: no correlation, partition, or hidden Markov model.
New idea: purely sampling based. Sample a small number of pages from each site, then download more pages from the sites with more changed samples.
Research is like finding an exit in a maze: explore all the different paths without knowing where the exit is. In hindsight, everything is obvious…

45 Challenge
How can we write a paper using this simple idea? Anyone can come up with such a simple idea. What are the issues?

46 Issues
How to measure effectiveness?
How to use sampling results? Proportional vs. Greedy
How many samples? Dynamic sample size adjustment?
What if we have very limited resources?

47 Evaluation Metric
Fixed download resource in each cycle; maximize the number of detected changes.
$$\text{ChangeRatio} = \frac{\text{No. of changed \& downloaded pages}}{\text{No. of downloaded pages}}$$
Why not Freshness or Age?

48 Evaluation Metric
Why not Freshness or Age? They are difficult to measure in practice, and sampling may not work very well for the Freshness or Age metric.

49 Issues
How to measure effectiveness?
How to use sampling results? Proportional vs. Greedy
How many samples? Dynamic sample size adjustment?
What if we have very limited resources?

50 Greedy vs Proportional
Greedy: download pages from the sites with the most changes.
Proportional: download pages in proportion to the detected changes.
Theorem: Greedy is optimal if we use a fixed number of samples.
Is it too obvious?
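
The two ways of using the sampling results can be compared on a toy input. The sketch below is an illustration only: the site names, sizes, sample counts, and download budget are hypothetical, the pages used as samples are not counted against the budget, and the ChangeRatio is estimated from the sample fractions rather than measured.

```python
from typing import Dict


def greedy_allocation(changed: Dict[str, int], sizes: Dict[str, int],
                      budget: int) -> Dict[str, int]:
    """Download whole sites in decreasing order of changed samples until the budget runs out."""
    alloc = {site: 0 for site in changed}
    remaining = budget
    for site in sorted(changed, key=lambda s: changed[s], reverse=True):
        take = min(sizes[site], remaining)
        alloc[site] = take
        remaining -= take
    return alloc


def proportional_allocation(changed: Dict[str, int], sizes: Dict[str, int],
                            budget: int) -> Dict[str, int]:
    """Split the budget in proportion to the number of changed samples per site."""
    total = sum(changed.values())
    if total == 0:
        return {site: 0 for site in changed}
    return {site: min(sizes[site], budget * c // total) for site, c in changed.items()}


def estimated_change_ratio(alloc: Dict[str, int], changed: Dict[str, int],
                           sample_size: int) -> float:
    """Expected fraction of downloaded pages that changed, using sample estimates."""
    downloaded = sum(alloc.values())
    if downloaded == 0:
        return 0.0
    expected_changed = sum(alloc[s] * changed[s] / sample_size for s in alloc)
    return expected_changed / downloaded


SAMPLE_SIZE = 10
CHANGED = {"a": 8, "b": 2, "c": 5}       # changed pages among 10 samples per site
SIZES = {"a": 100, "b": 100, "c": 100}   # pages per site
BUDGET = 150                             # pages we can download this cycle

greedy = greedy_allocation(CHANGED, SIZES, BUDGET)
proportional = proportional_allocation(CHANGED, SIZES, BUDGET)
print(greedy, estimated_change_ratio(greedy, CHANGED, SAMPLE_SIZE))
print(proportional, estimated_change_ratio(proportional, CHANGED, SAMPLE_SIZE))
```

Greedy concentrates the budget on the sites whose samples changed most, which is why it scores a higher estimated ChangeRatio on this input.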

51 Sample Size
How many samples from each site? No clue in the beginning.
First analysis: two Web sites, 100 pages each. Find the optimal sample size while varying the download resources and the number of changes in each site.

52 Sample Size
[Figure: ChangeRatio vs. sample size for the cases (20, 80) and (40, 60), with the optimal sample size marked]

53 Change Fraction Distribution
$\rho$: fraction of pages changed in a site. $f(\rho)$: distribution of $\rho$ values.
[Figure: fraction of sites $f(\rho)$ vs. $\rho$, with the threshold $\rho_t$ marked]

54 Change Fraction Distribution

55 Optimal Sample Size
$N$: number of pages in a site. $r$: total download resources / total number of pages.

56 Research Advice
Start with simple examples: they help you understand the problem and the solution better.

57 Dynamic Sample Size?
Do we need the same sample size for every site?
$\rho = 0$, $\rho = 0.45$, $\rho = 0.55$, $\rho = 1$

58 Adaptive Sampling
If the estimated $\rho$ is high/low enough, make an early decision.
What does “high enough” mean? Confidence interval above the threshold $\rho_t$.
[Figure: confidence intervals of the estimated $\rho$ for several sites, relative to the threshold $\rho_t$]
Speaker notes: So that is the basic intuition of our adaptive policy. In the adaptive policy, if the estimated rho value of a site is high enough, we decide to download pages from the site early on, and if the estimated rho value is low enough, we simply stop downloading pages. We formalize this notion of “low enough” or “high enough” by using the confidence interval of the estimated rho value. After downloading a small number of samples, say 5 pages, from each Web site, we compute, say, the 95% confidence interval of the estimated rho value of each site. If the confidence interval of a site is strictly above the threshold rho_t, then with 95% confidence we can say that the site will eventually be downloaded, so we decide to download pages from the site at that point. If the confidence interval is strictly below the threshold, then with 95% confidence we can say that the site will eventually be discarded, so we decide to stop downloading pages from the site. Only when the confidence interval spans the threshold value do we sample more pages from the site before making any decision. Roughly, this is how our adaptive policy works. For a more detailed description of our adaptive policy, please look at our paper.
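
The early-decision rule described in the notes can be sketched as follows. The slides do not say which confidence interval the paper uses, so this example assumes a Wilson score interval for a binomial proportion. The function names, the threshold of 0.5, and the sample counts are hypothetical.

```python
import math
from typing import Tuple

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def wilson_interval(changed: int, sampled: int, z: float = Z_95) -> Tuple[float, float]:
    """Wilson score interval for the fraction of changed pages in a site."""
    if sampled == 0:
        return (0.0, 1.0)
    p = changed / sampled
    denom = 1.0 + z * z / sampled
    center = (p + z * z / (2 * sampled)) / denom
    half = z * math.sqrt(p * (1 - p) / sampled + z * z / (4 * sampled * sampled)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def decide(changed: int, sampled: int, threshold: float) -> str:
    """Early decision for one site given its samples so far."""
    low, high = wilson_interval(changed, sampled)
    if low > threshold:
        return "download"      # interval strictly above the threshold
    if high < threshold:
        return "discard"       # interval strictly below the threshold
    return "sample more"       # interval spans the threshold


RHO_T = 0.5  # hypothetical threshold
for changed in (5, 3, 0):
    print(changed, "of 5 changed ->", decide(changed, 5, RHO_T))
```

With only 5 samples the interval is wide, so only sites whose samples all agree get an early decision; a mixed result leads to more sampling.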

59 Issues
How to measure effectiveness?
How to use sampling results? Proportional vs. Greedy
How many samples? Dynamic sample size adjustment?
What if we have very limited resources?

60 Low Download Resources
What if we have very limited resources? We cannot sample enough pages.
Divide and conquer: subset sampling. Sample only a small subset in each download cycle.

61 Comparison of Policies
[Figure: ChangeRatio of each policy]
Speaker notes: And this is the result that we got. In this graph, we show the ChangeRatio, the fraction of changed pages among the downloaded pages, averaged over the 5 download cycles that each policy ran. Here RR represents the round-robin policy, FRQ is the frequency-based policy, PRP is the proportional policy, GRD is the greedy policy, and ADP is the adaptive policy. For the greedy and proportional policies, we used sample size 10. The poor performance of the RR policy is expected because it blindly downloads pages in a round-robin manner. We can see that our greedy and adaptive policies show a very significant improvement over the round-robin and frequency-based policies. One thing to note is that this experiment was a little unfair to the frequency-based policy, because we had relatively short change histories of the pages. To perform well, the frequency-based policy requires relatively long change histories, but because it downloaded pages in 5 download cycles and detected changes at most 5 times, it did not have enough time and data to optimize itself for the change frequencies of the pages. We will shortly study the long-term performance of the frequency-based policy and the sampling-based policy.

62 Frequency vs. Sampling
[Figure: ChangeRatio vs. download cycle for the Frequency and Greedy policies]
Speaker notes: This graph shows the results of the experiment. The horizontal axis shows the download cycle and the vertical axis shows the ChangeRatio. The red line is the result of the frequency-based policy and the blue line is the result of the greedy policy. From this result, we can clearly see that the greedy policy performs much better than the frequency-based policy in the beginning, at least until the 100th download cycle. Only after the 100th download cycle does the frequency-based policy perform better than the greedy policy on average. From this we learn that if the change frequencies of the pages do not change over time, then in the very long term the frequency-based policy can perform better than the greedy policy. However, note that for a long period of time, in this case 100 download cycles, in other words 100 months in this experiment, the greedy policy performed better. Another interesting observation is that the frequency-based policy shows some fluctuation in its performance. Essentially, the dips in the graph are when the frequency-based policy redownloads the pages that rarely change. The frequency-based policy cannot be sure whether a particular page will ever change, even if the page has never changed so far, so it has to revisit the page periodically to make sure that it did not change. Of course, as time goes on, the frequency-based policy gains more confidence that the page does not change, so the interval between checks of those rarely changing pages increases over time. That is why the interval between the dips in the graph increases over time.

63 Other Questions
Would it work for Freshness? Mostly not… It downloads frequently changing pages more often.
Should we sample based on sites? Not really… Another sampling unit might be better.

64 Summary (Page Refresh)
Frequency-based approach: the intuitive policy may not work very well.
Sampling-based approach: sample size issue; dynamic sampling; low download resources.

65 Questions?

