1
Monitoring the dynamic Web to respond to Continuous Queries Sandeep Pandey Krithi Ramamritham Soumen Chakrabarti IIT Bombay www.cse.iitb.ac.in/laiir/
2
2 Motivation. Web pages change rapidly: 40% of commercial pages and 23% of all pages change per day (Sethuraman et al.). Current search engine users need to repeat queries (how often?) and diff the results against recent versions, or poll frequently updated collections (e.g., Google News).
3
3 Continuous Queries (CQ) Users register long-lived queries of interest Pages of interest may be added, modified, and deleted System continually updates responses Example applications Commuter updates: traffic and weather conditions Alerts on cricket scores, stock portfolios
4
4 Discrete vs. continuous queries. Discrete: query lives for an “instant”, one-shot answer; optimize corpus freshness at all times; objective penalizes delay from update to refresh; usually handled by bulk crawls with diverse periods. Continuous: queries have positive lifetime, many updates over time; updates must track changes closely; objective penalizes number or importance of missed updates; dynamic monitoring with more restrictive network resources.
5
5 Talk outline Introduction and motivation Previous approaches Our contributions Continuous Adaptive Monitoring (CAM) How to allocate limited polling resources among pages How to schedule poll instants Experiments Conclusion
6
6 Related work. CONQUER and WebCQ (Liu, Pu and Tang): query language and architecture for CQ; do not address monitoring for freshness optimization. NIAGARA (DeWitt and Naughton): query evaluation and optimization techniques; database query optimization setting. ChangeDetector (Boyapati et al.): fixed-priority polling for given set of pages. Freshness for discrete queries: Poisson updates (Cho and Garcia-Molina); quasi-deterministic and other distributions (Sethuraman, Wolf, Squillante, Yu).
7
7 Our contributions New statistical recency objective for CQs New monitoring framework to fit statistical models of page change behavior Recency optimization problem constrained by network resources Two-phase solution to optimization tailored to CQ search systems Resource allocation (knapsack) Poll scheduling (flow-shop)
8
8 Continuous Adaptive Monitoring. Planning horizon or “epoch”: time proceeds in discrete steps $\{j\}$ over the epoch. At each time step $j$, each page $i$ has probability $\rho_{i,j}$ of an update; this can capture predictable bursts and periodicity. $\sum_j \rho_{i,j}$ is the expected number of updates to page $i$ (its “change rate”). Decision variables $y_{ij}$: is page $i$ polled at time step $j$?
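The model on this slide can be made concrete with a small sketch. The probabilities, the poll decisions, and the helper names `expected_updates` and `polls_used` below are invented for illustration and are not taken from the talk.

```python
# Hypothetical per-step update probabilities rho[i][j] for two pages
# over a four-step epoch (values invented for illustration).
rho = [
    [0.9, 0.1, 0.9, 0.1],      # page 0: periodic bursts
    [0.05, 0.05, 0.05, 0.05],  # page 1: rarely changes
]

# Hypothetical poll decisions y[i][j]: 1 if page i is polled at step j.
y = [
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]


def expected_updates(rho_row):
    """Expected number of updates to one page over the epoch: sum_j rho[i][j]."""
    return sum(rho_row)


def polls_used(decisions):
    """Total number of polls the decision matrix spends in the epoch."""
    return sum(sum(row) for row in decisions)


for i, row in enumerate(rho):
    print(f"page {i}: expected updates = {expected_updates(row):.2f}")
print("polls used:", polls_used(y))
```

Here page 0 carries most of the expected change, so a poll plan that concentrates on its high-probability steps is the kind of plan the decision variables are meant to express.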
9
9 Profit, relevance and importance. Each registered query $q$ has a profit. Relevance $r_{iq}$ of page $i$ w.r.t. query $q$: we use cosine in TFIDF space as in IR; other measures (e.g. PageRank) may be integrated. Page $i$ has “importance” $W_i$, a function of the currently resident queries and their “profits” and of the relevance of page $i$ to each resident query.
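A minimal sketch of how relevance and importance might be computed. The cosine measure is the one named on the slide; the exact form of the importance function is not given in this transcript, so the profit-weighted sum of relevances used in `importance` is an assumption, and the sparse-dict vector representation is also hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def importance(page_vec, resident_queries):
    """ASSUMED form of W_i: sum over resident queries of profit * relevance.

    resident_queries is a list of (query_vector, profit) pairs.
    """
    return sum(profit * cosine(page_vec, q_vec) for q_vec, profit in resident_queries)


# Hypothetical TF-IDF-style weights for one page and two resident queries.
page = {"cricket": 1.0}
queries = [({"cricket": 1.0}, 2.0), ({"weather": 1.0}, 5.0)]
print(importance(page, queries))
```

Only the first query overlaps with the page, so only its profit contributes to the page's importance in this toy case.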
10
10 Returned Information Ratio. Update information reported for page $i$ is denoted $R_i$. Goal is to maximize importance-weighted updates reported, $\sum_i W_i R_i$, subject to the polling resource constraint. Returned info ratio (RIR) is (importance-weighted updates captured by system) / (total importance-weighted expected updates).
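The ratio can be sketched directly from its definition. The function name and the per-page input lists are hypothetical; how the captured updates per page are estimated is left outside this sketch.

```python
def returned_information_ratio(importance, captured, expected):
    """RIR = importance-weighted captured updates / importance-weighted expected updates.

    importance[i] is W_i, captured[i] is the update information reported for
    page i, and expected[i] is the expected number of updates to page i.
    """
    numerator = sum(w * c for w, c in zip(importance, captured))
    denominator = sum(w * e for w, e in zip(importance, expected))
    return numerator / denominator if denominator else 0.0


# Hypothetical numbers: two pages, the more important one is half captured.
print(returned_information_ratio([2.0, 1.0], [1.0, 0.0], [2.0, 2.0]))
```

A value of 1 would mean every importance-weighted expected update was reported; missing updates on important pages pulls the ratio down faster than missing them on unimportant ones.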
11
11 CAM system overview. Time proceeds in epochs. At the end of every epoch we re-evaluate relevance and update probabilities. For the next epoch, we select instants at which to poll each page (resource allocation) and schedule these instants subject to the resource constraint. Pipeline stages: determining relevant pages, parameter tracking, resource allocation, scheduling, monitoring.
12
12 CAM overview: Tracking phase. Relevance $r_{iq}$ changes with time, polled periodically. Modeling relevance change is nontrivial, e.g., snippet-level changes. Collect instants when page change was detected during the current epoch. Revise estimates of $\rho_{i,j}$ for use in the next epoch’s poll optimization.
13
13 Resource allocation. Existing policies. Uniform: resources (#polls) distributed uniformly among all pages irrespective of their change frequency. Proportional: #polls allocated to a page is proportional to the frequency with which it changes. For discrete queries, uniform is better than proportional for any inter-update distribution. CAM: solve a knapsack problem; better than uniform and proportional. In the CQ setting, proportional is better than uniform: evidence that CQ objective ≠ discrete objective.
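The two baseline policies can be sketched as follows. This covers only Uniform and Proportional; CAM's knapsack formulation is not spelled out in this transcript and is not reproduced here. The function names and the largest-remainder rounding are assumptions made to keep the poll counts integral.

```python
def uniform_allocation(change_rates, budget):
    """Spread the poll budget evenly over pages, ignoring change rates."""
    n = len(change_rates)
    base, extra = divmod(budget, n)
    # The first `extra` pages absorb the polls left over after even division.
    return [base + (1 if i < extra else 0) for i in range(n)]


def proportional_allocation(change_rates, budget):
    """Give each page a share of polls proportional to its change rate."""
    total = sum(change_rates)
    if total == 0:
        return uniform_allocation(change_rates, budget)
    shares = [budget * rate / total for rate in change_rates]
    polls = [int(share) for share in shares]
    # Assumed rounding rule: hand leftover polls to the largest fractional parts.
    leftover = budget - sum(polls)
    by_remainder = sorted(range(len(shares)),
                          key=lambda i: shares[i] - polls[i], reverse=True)
    for i in by_remainder[:leftover]:
        polls[i] += 1
    return polls


rates = [8.0, 1.0, 1.0]  # hypothetical expected updates per epoch
print(uniform_allocation(rates, 10))
print(proportional_allocation(rates, 10))
```

With these invented rates, Proportional concentrates polls on the fast-changing page while Uniform treats all three pages alike, which is the contrast the slide draws.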
14
14 Scheduling. Suppose our crawler can fetch M pages concurrently, and an epoch is T time steps long. Then we can fetch a total of C=MT pages during an epoch (ensured by resource allocation phase). But at each instant we cannot schedule more than M fetches. Want small planned-to-actual poll delays. May fail to schedule all poll jobs in an epoch. The scheduler starts from the tentative $y_{ij}$s.
15
15 A flow-shop problem. M “machines” available at any time. Each $y_{ij}$ which is equal to 1 is a “job”. Job $k$ is “released” at time step $r_k$ (= $j$). “Processing time” = crawl time = $t_k$. “Completion time” of job $k$ is $C_k$. Want to minimize “total flow”. NP-hard problem; we use the earliest-deadline heuristic.
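A rough sketch of a greedy scheduler in the spirit of this slide. The talk does not define the deadlines, so two simplifying assumptions are made: every crawl takes one time step, and a job's deadline is its release step plus a fixed slack, so earliest-deadline order coincides with earliest-release order. The function name `schedule_polls` and the job tuples are hypothetical.

```python
import heapq


def schedule_polls(jobs, machines, horizon):
    """Greedily place poll jobs on at most `machines` slots per time step.

    jobs: list of (release_step, page_id), one per tentative poll.
    Returns (scheduled, dropped), where scheduled holds
    (actual_step, release_step, page_id) and dropped holds unplaced jobs.
    """
    pending = sorted(jobs)
    ready = []       # min-heap of released jobs, earliest release first
    scheduled = []
    k = 0
    for step in range(horizon):
        # Release every job whose planned poll step has arrived.
        while k < len(pending) and pending[k][0] <= step:
            heapq.heappush(ready, pending[k])
            k += 1
        # Run at most `machines` jobs in this time step.
        for _ in range(min(machines, len(ready))):
            release, page = heapq.heappop(ready)
            scheduled.append((step, release, page))
    dropped = sorted(ready) + pending[k:]
    return scheduled, dropped


jobs = [(0, "a"), (0, "b"), (0, "c"), (1, "d")]
plan, missed = schedule_polls(jobs, machines=2, horizon=2)
print(plan, missed)
print("total delay:", sum(step - release for step, release, _ in plan))
```

With two machines and three jobs released at step 0, one job slips by a step, which is the planned-to-actual poll delay the scheduling phase tries to keep small.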
16
16 Experiments. Synthetic data. Change frequency distribution: a few pages change very often (Zipfian). Update probability distribution: a few $\rho_{i,j}$’s are large, most are small (Zipfian again). Page importance distribution: also Zipfian (Wolman, 1999). Real data: eight cricket score sites with a high update rate.
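One way such skewed synthetic inputs could be generated is sketched below. The exponent, the page count, and the total number of updates are hypothetical, since the transcript does not state the parameters used in the experiments.

```python
def zipf_weights(n, exponent=1.0):
    """Normalized Zipfian weights: the weight of rank k is proportional to 1 / k**exponent."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical setup: 5 pages sharing 20 expected updates per epoch.
total_updates = 20.0
change_rates = [total_updates * w for w in zipf_weights(5)]
print([round(rate, 2) for rate in change_rates])
```

The same helper could be reused for page importance, which the slide also describes as Zipfian.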
17
17 CAM > Proportional > Uniform Uniform update and importance distrib. Plot RIR against ratio of resources to expected changes RIR for CAM is >3 times better Proportional is better than uniform in the CQ setting Intuition from “minimum total stale duration” does not apply to CQ
18
18 Resource allocation. Sort pages by increasing change rate and place them in ten equally populated bins (10=fastest). Uniform spends the same resource for each bin. Proportional wastes fewer resources on slow-changing bins, but is not aggressive enough. CAM invests more aggressively in fast-changing bins, achieving the greatest RIR.
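The binning used for this comparison can be sketched as follows. The function name and the sample rates are hypothetical; bins are zero-indexed in the code, so index 9 corresponds to bin 10 on the slide.

```python
def equal_population_bins(change_rates, n_bins=10):
    """Sort pages by increasing change rate and split them into equally populated bins.

    Returns a list of bins, each a list of page indices; the last bin is the fastest.
    """
    order = sorted(range(len(change_rates)), key=lambda i: change_rates[i])
    bins = [[] for _ in range(n_bins)]
    for position, page in enumerate(order):
        bins[position * n_bins // len(order)].append(page)
    return bins


def polls_per_bin(bins, polls):
    """Total polls each bin receives under a given per-page allocation."""
    return [sum(polls[page] for page in members) for members in bins]


# Hypothetical data: 20 pages, page 0 changes fastest.
rates = [float(20 - i) for i in range(20)]
bins = equal_population_bins(rates)
print(bins[9])                         # the two fastest pages
print(polls_per_bin(bins, [1] * 20))   # a uniform policy spends the same on each bin
```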
19
19 Skew-handling and adaptation. Fixed monitoring/change ratio; vary skew in update probability distribution. CAM’s gains increase with skew. CAM improves over initial epochs. Change distribution estimates stabilize within a few epochs.
20
20 Experiments on real pages Eight sites with dynamic cricket match information In fact, Zipfian updates Adversarial setup: monitor/change < 1 CAM close to best possible For M/C=2, CAM updates on 80% of the information changed
21
21 Conclusion. Continuous queries are inherently different from discrete queries. Approach used in CAM: identify relevant pages; track the pages as they change; characterize page change behavior; decide when to monitor the pages in future. The CAM approach performs better than the naïve approaches.
22
22 References. J. Cho, H. Garcia-Molina. Synchronizing a database to improve freshness. ACM SIGMOD, 2000. J. Cho, H. Garcia-Molina. Estimating frequency of change. Technical Report, 2000. J. Sethuraman, J. L. Wolf, M. S. Squillante, P. S. Yu. Optimal crawling strategies for Web search engines. World Wide Web, 2002.
23
23 References S. Pandey, K. Ramamritham, S. Chakrabarti. Monitoring the dynamic Web to respond to Continual Queries. World Wide Web, 2003. S. Pandey, K. Ramamritham, S. Chakrabarti, S. Garg, A. Vyas. Web-CAM: Monitoring the dynamic Web to respond to Continual Queries. Submitted to 29th VLDB conference, 2003.
24
24 Future Research Possibilities Maintaining Inverted Index current Account for new entries to the Web Identifying the changes relevant to the query Measuring query-specific change behaviour of a page Reusing the page change statistics for other related queries
25
25 Skewed update probability distribution CAM still performs much better than others In fact CAM exploits the skewed nature of distribution and performs even better than the uniform setting
26
26 Adaptive nature of CAM. No difference in allocation of resources in the Uniform and Proportional strategies. CAM considers the probability distribution while allocating resources. Lower-frequency bins also get resources now, because some of their update instants have high probability.