Monitoring the dynamic Web to respond to Continuous Queries
Sandeep Pandey, Krithi Ramamritham, Soumen Chakrabarti
IIT Bombay
www.cse.iitb.ac.in/laiir/



2 Motivation
- Web pages change rapidly: 40% of commercial pages and 23% of all pages change per day (Sethuraman et al.)
- Current search engine users must either:
  - repeat queries (how often?) and diff the results against recent versions, or
  - poll frequently updated collections (e.g., Google News)

3 Continuous Queries (CQ)
- Users register long-lived queries of interest
- Pages of interest may be added, modified, and deleted
- System continually updates responses
- Example applications:
  - Commuter updates: traffic and weather conditions
  - Alerts on cricket scores, stock portfolios

4 Discrete vs. continuous queries
- Discrete queries:
  - Query lives for an “instant”, one-shot answer
  - Optimize corpus freshness at all times
  - Objective penalizes delay from update to refresh
  - Usually handled by bulk crawls with diverse periods
- Continuous queries:
  - Queries have positive lifetime, many updates over time
  - Updates must track changes closely
  - Objective penalizes number or importance of missed updates
  - Dynamic monitoring with more restrictive network resources

5 Talk outline
- Introduction and motivation
- Previous approaches
- Our contributions
  - Continuous Adaptive Monitoring (CAM)
  - How to allocate limited polling resources among pages
  - How to schedule poll instants
- Experiments
- Conclusion

6 Related work
- CONQUER and WebCQ (Liu, Pu and Tang)
  - Query language and architecture for CQ
  - Do not address monitoring for freshness optimization
- NIAGARA (DeWitt and Naughton)
  - Query evaluation and optimization techniques
  - Database query optimization setting
- ChangeDetector (Boyapati et al.)
  - Fixed-priority polling for a given set of pages
- Freshness for discrete queries
  - Poisson updates (Cho and Garcia-Molina)
  - Quasi-deterministic and other distributions (Sethuraman, Wolf, Squillante, Yu)

7 Our contributions
- New statistical recency objective for CQs
- New monitoring framework to fit statistical models of page change behavior
- Recency optimization problem constrained by network resources
- Two-phase solution to the optimization, tailored to CQ search systems
  - Resource allocation (knapsack)
  - Poll scheduling (flow-shop)

8 Continuous Adaptive Monitoring
- Planning horizon or “epoch”
- Time proceeds in discrete steps {j} over the epoch
- At each time step j, each page i has probability $\rho_{i,j}$ of an update
  - Can capture predictable bursts and periodicity
- $\sum_j \rho_{i,j} = \lambda_i$, the expected number of updates to page i (its “change rate”)
- Decision variables $y_{i,j}$: is page i polled at time step j?
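The change-rate identity on this slide follows from linearity of expectation: the expected number of updates in an epoch is the sum of the per-step update probabilities. A minimal Python sketch, where the two probability rows and the poll plan are made-up illustrative values rather than data from the talk:

```python
# rho[i][j]: assumed probability that page i is updated at time step j.
# The numbers are illustrative only.
rho = [
    [0.1, 0.1, 0.1, 0.1],  # page 0: steady, low update probability
    [0.0, 0.9, 0.0, 0.9],  # page 1: periodic bursts at steps 1 and 3
]

def expected_updates(rho_i):
    """Expected number of updates to one page over the epoch."""
    return sum(rho_i)

change_rates = [expected_updates(row) for row in rho]

# y[i][j] = 1 if page i is polled at time step j, else 0 (a hypothetical plan).
y = [
    [0, 0, 0, 1],
    [0, 1, 0, 1],
]
polls_used = sum(sum(row) for row in y)
```

The per-step model is what lets a bursty page such as page 1 be polled only at its likely update instants instead of at a fixed period.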

9 Profit, relevance and importance
- Each registered query q has an associated profit
- Relevance $r_{i,q}$ of page i w.r.t. query q
  - We use cosine similarity in TF-IDF space, as in IR
  - Other measures (e.g., PageRank) may be integrated
- Page i has “importance” $W_i$, a function of:
  - Currently resident queries and their “profits”
  - Relevance of page i to each resident query
- Importance: [equation not preserved in the transcript]
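A small Python sketch of how relevance and importance could be computed. The cosine-in-TF-IDF relevance follows the slide; the importance formula is not recoverable from the transcript, so the profit-weighted sum of relevances used here is an assumption, as are the IDF variant, the toy pages, the queries, and the `profit` values.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a sparse TF-IDF vector (term -> weight).

    Uses idf = log(n / df) + 1, one common variant (an assumption here).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * (math.log(n / df[term]) + 1.0)
         for term, count in Counter(doc).items()}
        for doc in docs
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical pages, resident queries, and per-query profits.
pages = [["cricket", "score", "india"], ["stock", "price", "index"]]
queries = {"q1": ["cricket", "score"], "q2": ["stock", "price"]}
profit = {"q1": 2.0, "q2": 1.0}

names = list(queries)
vectors = tfidf_vectors(pages + [queries[q] for q in names])
page_vecs = vectors[:len(pages)]
query_vecs = dict(zip(names, vectors[len(pages):]))

# relevance[i][q]: cosine relevance of page i to query q.
relevance = [{q: cosine(pv, query_vecs[q]) for q in names} for pv in page_vecs]

# Assumed importance: profit-weighted sum of relevances over resident queries.
importance = [sum(profit[q] * r[q] for q in names) for r in relevance]
```

With these inputs the cricket page ends up twice as important as the stock page, purely because its matching query carries twice the profit.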

10 Returned Information Ratio
- Update information reported for page i is $R_i$ [equation not preserved in the transcript]
- Goal is to maximize the importance-weighted updates reported, $\sum_i W_i R_i$, subject to the polling resource constraint
- Returned information ratio (RIR) is
  (importance-weighted updates captured by the system) / (total importance-weighted expected updates)
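The ratio can be sketched in Python under one simplifying assumption that the slide does not confirm: a poll of page i at step j captures exactly the update expected at that step, so the captured information for page i is the sum of `rho[i][j] * y[i][j]`. All input values below are illustrative.

```python
def returned_information_ratio(W, rho, y):
    """RIR = importance-weighted captured updates / importance-weighted expected updates.

    Assumes a poll at step j captures only the update expected at step j.
    """
    captured = sum(
        W[i] * sum(p * polled for p, polled in zip(rho[i], y[i]))
        for i in range(len(W))
    )
    total = sum(W[i] * sum(rho[i]) for i in range(len(W)))
    return captured / total if total else 0.0

W = [2.0, 1.0]                    # page importances (illustrative)
rho = [[0.5, 0.5], [0.2, 0.8]]    # update probabilities per step
y = [[1, 0], [0, 1]]              # one poll per page
rir = returned_information_ratio(W, rho, y)
```

Here the captured mass is 2.0 * 0.5 + 1.0 * 0.8 = 1.8 out of a possible 3.0, so the RIR is 0.6.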

11 CAM system overview
- Time proceeds in epochs
- At the end of every epoch we re-evaluate:
  - Relevance
  - Update probabilities
- For the next epoch:
  - We select instants at which to poll each page (resource allocation)
  - We schedule these instants subject to the resource constraint
(Diagram stages: Determining relevant pages, Parameter tracking, Resource allocation, Scheduling, Monitoring)

12 CAM overview: Tracking phase
- Relevance $r_{i,q}$ changes with time, so pages are polled periodically
  - Modeling relevance change is nontrivial, e.g., snippet-level changes
- Collect instants when a page change was detected during the current epoch
- Revise estimates of $\rho_{i,j}$ for use in the next epoch’s poll optimization
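The slide does not say how the estimates are revised. One plausible rule, shown here purely as an assumption, is exponential smoothing: blend the previous estimate for each time step with an indicator of whether a change was detected at that step in the epoch just finished. The function name and the smoothing factor `alpha` are hypothetical.

```python
def revise_estimates(rho_prev, change_steps, alpha=0.3):
    """Blend previous per-step update probabilities with this epoch's observations.

    rho_prev     -- previous estimates for one page, one value per time step
    change_steps -- time steps at which a change was detected this epoch
    alpha        -- weight given to the newest epoch (hypothetical parameter)
    """
    observed = set(change_steps)
    return [
        (1.0 - alpha) * p + alpha * (1.0 if j in observed else 0.0)
        for j, p in enumerate(rho_prev)
    ]

# One page, three time steps, a change detected only at step 1.
rho_next = revise_estimates([0.5, 0.5, 0.5], change_steps=[1], alpha=0.5)
```

A larger `alpha` adapts faster to shifts in a page's behavior but makes the estimates noisier from epoch to epoch.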

13 Resource allocation
- Existing policies
  - Uniform: resources (#polls) are distributed uniformly among all pages, irrespective of their change frequency
  - Proportional: #polls allocated to a page is proportional to the frequency with which it changes
- For discrete queries, uniform is better than proportional for any inter-update distribution
- CAM: solve a knapsack problem
  - Better than uniform and proportional
  - Proportional is better than uniform
  - Evidence that CQ objective ≠ discrete objective
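A rough Python sketch of the knapsack idea, under assumptions the slide does not state: every poll costs one unit, and polling page i at step j is worth `W[i] * rho[i][j]`. With unit costs the knapsack reduces to picking the highest-valued (page, step) pairs within the budget, so a simple top-k selection is exact for this simplified version. The authors' actual formulation may differ.

```python
import heapq

def allocate_polls(W, rho, budget):
    """Choose up to `budget` (page, step) polls maximizing sum of W[i] * rho[i][j].

    Returns tentative decision variables y[i][j] in {0, 1}.
    """
    candidates = [
        (W[i] * p, i, j)
        for i, row in enumerate(rho)
        for j, p in enumerate(row)
    ]
    y = [[0] * len(row) for row in rho]
    for _, i, j in heapq.nlargest(budget, candidates):
        y[i][j] = 1
    return y

W = [1.0, 1.0]
rho = [
    [0.9, 0.1, 0.9],  # bursty page
    [0.2, 0.2, 0.2],  # steady page
]
y = allocate_polls(W, rho, budget=2)
```

Both polls go to the bursty page at its two likely update steps, whereas a uniform policy would give each page one poll.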

14 Scheduling
- Suppose our crawler can fetch M pages concurrently, and an epoch is T time steps long
- Then we can fetch a total of C = MT pages during an epoch
  - Ensured by the resource allocation phase
- But at each instant we cannot schedule more than M fetches
  - Want small planned-to-actual poll delays
  - May fail to schedule all poll jobs in an epoch
(Diagram label: tentative $y_{i,j}$s)

15 A flow-shop problem
- M “machines” available at any time
- Each $y_{i,j}$ that is equal to 1 is a “job”
- Job k is “released” at time step $r_k$ (= j)
- “Processing time” = crawl time = $t_k$
- “Completion time” of job k is $C_k$
- Want to minimize “total flow”, $\sum_k (C_k - r_k)$
- NP-hard problem
  - We use an earliest-deadline heuristic
(Figure axes: Time, Job)
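The slide does not define the deadlines used by the heuristic, so the Python sketch below substitutes a simpler greedy rule and should be read as a stand-in: at each time step, up to M released jobs are started in earliest-release-first order. It also assumes every crawl takes exactly one time step, and that jobs not started within the T steps of the epoch are dropped.

```python
import heapq

def schedule_polls(release_times, M, T):
    """Greedy earliest-release-first scheduling on M machines over T time steps.

    Returns (start, dropped): start maps job index -> start step,
    dropped lists jobs that could not be scheduled within the epoch.
    """
    pending = sorted((r, k) for k, r in enumerate(release_times))
    ready = []   # min-heap of (release time, job index)
    start = {}
    idx = 0
    for t in range(T):
        # Move every job released by step t into the ready heap.
        while idx < len(pending) and pending[idx][0] <= t:
            heapq.heappush(ready, pending[idx])
            idx += 1
        # Start at most M ready jobs at this step.
        for _ in range(min(M, len(ready))):
            _, k = heapq.heappop(ready)
            start[k] = t
    dropped = [k for k in range(len(release_times)) if k not in start]
    return start, dropped

def total_flow(start, release_times):
    """Sum of (completion - release), with unit processing time per job."""
    return sum(start[k] + 1 - release_times[k] for k in start)

release = [0, 0, 0, 1, 3]
start, dropped = schedule_polls(release, M=2, T=4)
flow = total_flow(start, release)
```

In this run three jobs are released at step 0 but only two machines exist, so one job slips to step 1; that delay is the planned-to-actual gap the objective penalizes.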

16 Experiments
- Synthetic data
  - Change frequency distribution: a few pages change very often (Zipfian)
  - Update probability distribution: a few $\rho_{i,j}$’s are large, most are small (Zipfian again)
  - Page importance distribution: also Zipfian (Wolman, 1999)
- Real data
  - Eight cricket score sites
  - High update rate FIXME
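A minimal Python sketch of how Zipfian synthetic inputs of this kind might be generated. The exponent `s`, the page count, the total of 50 expected updates, and the random seed are illustrative assumptions, not the settings used in the talk.

```python
import random

def zipf_weights(n, s=1.0):
    """Normalized Zipfian weights: the weight of rank k is proportional to 1 / k**s."""
    raw = [1.0 / rank ** s for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]

n_pages = 100
rng = random.Random(0)  # fixed seed so the sketch is reproducible

# A few pages change very often: spread 50 expected updates Zipf-style.
change_rates = [50.0 * w for w in zipf_weights(n_pages)]

# Page importance is also Zipfian, shuffled so it is independent of change rate.
importance = zipf_weights(n_pages)
rng.shuffle(importance)
```

Shuffling one of the two lists is a design choice: without it the most frequently changing page would always be the most important one as well.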

17 CAM > Proportional > Uniform
- Uniform update and importance distributions
- Plot RIR against the ratio of resources to expected changes
- RIR for CAM is more than 3 times better
- Proportional is better than uniform in the CQ setting
  - Intuition from “minimum total stale duration” does not apply to CQ

18 Resource allocation
- Sort pages by increasing change rate
- Place in ten equally populated bins (10 = fastest)
- Uniform spends the same resource on each bin
- Proportional wastes fewer resources on slow-changing bins, but is not aggressive enough
- CAM invests more aggressively in fast-changing bins, achieving the greatest RIR
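The binning analysis can be sketched in Python for the two baseline policies. The page count, change rates, and budget are illustrative, the helper names are hypothetical, and fractional poll counts are left unrounded for simplicity.

```python
def bin_pages(change_rates, n_bins=10):
    """Sort page indices by increasing change rate and split into equal bins.

    Assumes the number of pages is divisible by n_bins.
    """
    order = sorted(range(len(change_rates)), key=lambda i: change_rates[i])
    size = len(order) // n_bins
    return [order[b * size:(b + 1) * size] for b in range(n_bins)]

def polls_per_bin(bins, change_rates, budget, policy):
    """Polls each bin receives under the 'uniform' or 'proportional' policy."""
    n_pages = sum(len(b) for b in bins)
    total_rate = sum(change_rates)
    shares = []
    for b in bins:
        if policy == "uniform":
            shares.append(budget * len(b) / n_pages)
        else:  # proportional to the bin's total change rate
            shares.append(budget * sum(change_rates[i] for i in b) / total_rate)
    return shares

# 20 hypothetical pages with change rates 1..20 and a budget of 210 polls.
change_rates = [float(k) for k in range(1, 21)]
bins = bin_pages(change_rates)
uniform = polls_per_bin(bins, change_rates, 210, "uniform")
proportional = polls_per_bin(bins, change_rates, 210, "proportional")
```

Under these inputs every bin gets 21 polls with the uniform policy, while the proportional policy ranges from 3 polls for the slowest bin to 39 for the fastest.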

19 Skew-handling and adaptation
- Fixed monitoring/change ratio
- Vary skew in the update probability distribution
- CAM’s gains increase with skew
- CAM improves over initial epochs
- Change distribution estimates stabilize within a few epochs
(Figure axis label: RIR)

20 Experiments on real pages
- Eight sites with dynamic cricket match information
  - In fact, Zipfian updates
- Adversarial setup: monitor/change < 1
  - CAM close to best possible
- For a monitor/change ratio (M/C) of 2, CAM reports updates on 80% of the changed information

21 Conclusion
- Continuous queries are inherently different from discrete queries
- Approach used in CAM
  - Identify relevant pages
  - Track the pages as they change
  - Characterize page change behavior
  - Decide when to monitor the pages in future
- CAM performs better than the naïve approaches (uniform and proportional)

22 References
- J. Cho, H. Garcia-Molina. Synchronizing a database to improve freshness. ACM SIGMOD.
- J. Cho, H. Garcia-Molina. Estimating frequency of change. Technical Report.
- J. Sethuraman, J. L. Wolf, M. S. Squillante, P. S. Yu. Optimal crawling strategies for Web search engines. World Wide Web, 2002.

23 References
- S. Pandey, K. Ramamritham, S. Chakrabarti. Monitoring the dynamic Web to respond to Continuous Queries. World Wide Web.
- S. Pandey, K. Ramamritham, S. Chakrabarti, S. Garg, A. Vyas. Web-CAM: Monitoring the dynamic Web to respond to Continual Queries. Submitted to 29th VLDB conference, 2003.

24 Future Research Possibilities
- Keeping the inverted index current
- Accounting for new entries to the Web
- Identifying the changes relevant to the query
- Measuring query-specific change behaviour of a page
- Reusing the page change statistics for other related queries

25 Skewed update probability distribution
- CAM still performs much better than the others
- In fact, CAM exploits the skewed nature of the distribution and performs even better than in the uniform setting

26 Adaptive nature of CAM
- No difference in the allocation of resources under the Uniform and Proportional strategies
- CAM considers the probability distribution while allocating resources
- Lower-frequency bins now also receive resources, because some of their update instants have high probability