WIC: A General-Purpose Algorithm for Monitoring Web Information Sources Sandeep Pandey (speaker) Kedar Dhamdhere Christopher Olston Carnegie Mellon University.

Presentation transcript:

1 WIC: A General-Purpose Algorithm for Monitoring Web Information Sources
Sandeep Pandey (speaker), Kedar Dhamdhere, Christopher Olston
Carnegie Mellon University

2 Dynamic Information on the Web (Databases @Carnegie Mellon)
- Bulletin boards
- Online auctions
- News
- Weather
- Roadway conditions, sports scores, etc.

3 Online Shopping, Auctions

4 Stock Market

5 Continuous Query Systems
- Process information from dynamic Web sources automatically
  - e.g., CONQUER [Liu et al. WWW 1999]
  - Niagara [Naughton et al. SIGMOD 2000]
  - WebCQ [Liu et al. CIKM 2000]

6 Past Research on CQ Systems
- Focus on language design, query processing
- Assume “push” model of information access
  - Information shows up at doorstep
- Web sources are “pull” oriented
  - Must explicitly download Web pages, check for changes, submit changes to CQ engine

7 Converting Pull -> Push
[Diagram: auction sites and sports sites (pull) connected through WIC to the CQ engine (push)]

8 Converting Pull -> Push
- Topic has received little attention
- So far only heuristics with no formal guarantees
  - Periodic polling of sources
    - Not scalable
  - CAM [Pandey et al. WWW’03]
  - Gal et al. [JACM 2001]:
    - Take into account predicted change behavior
    - Create monitoring schedule in advance

9 A good first step, but …
- No formal guarantees
- Suits narrow range of applications

10 Example Application Scenarios
- Append-only, timeliness not critical: maintaining a searchable resume database
- Complete overwrite, timeliness not critical: collecting “front-page” news stories for archival
- Append-only, timeliness is critical: capturing new Internet security bulletins for automatic dissemination within an organization
- Complete overwrite, timeliness is critical: reacting in real-time to stock market fluctuations, online auction bids

11 Outline
- Introduction
- Problem statement
- WIC: Web Information Collector
- Formal results:
  - WIC is a 2-approximation
- Experimental results:
  - Timeliness-completeness tradeoff

12 Model of Pull-Oriented Sources
- Proposed by Wolf et al. [WWW 2002]
- Set of Web pages of interest P_1 … P_n
- Importance weight associated with each page
- Time is divided into discrete time instants
- Change: An update posted on a Web page
- Known probability π_ij that page P_i will change at time T_j
  - We do not address the problem of estimating change probabilities
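
The model on this slide can be pictured as a small data structure. The following is a minimal sketch only: the class name `Page`, its field names, the helper `expected_changes`, and the sample numbers are assumptions made for illustration and are not taken from the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """One Web page of interest, P_i, in the discrete-time model (hypothetical type)."""
    name: str
    weight: float             # importance weight associated with the page
    change_prob: List[float]  # change_prob[j] is pi_ij, the probability of a change at T_j


def expected_changes(page: Page) -> float:
    """Expected number of changes posted on the page over the whole horizon.

    By linearity of expectation this is the sum of the per-instant probabilities.
    """
    return sum(page.change_prob)


# Illustrative values only.
pages = [
    Page("P1", weight=1.0, change_prob=[0.4, 1.0, 0.3, 0.4]),
    Page("P2", weight=2.0, change_prob=[0.3, 0.9, 0.2, 0.3]),
]
```

The probabilities are treated as given inputs, matching the slide's statement that estimating them is out of scope.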

13 Our Model
[Figure: timeline showing the change probability of pages P1, P2 and P3 at each time instant]

14 Modeling the Change Characteristics
- Append-only, timeliness not critical: resume database
- Complete overwrite, timeliness not critical: news stories archival
- Append-only, timeliness is critical: security bulletins
- Complete overwrite, timeliness is critical: online auction bids

15 Modeling the Change Characteristics
- The probability that a change to page P_i at time T_j remains available at time T_k ≥ T_j
- Case 1: changes overwrite old info.
- Case 2: append-only
- Also: sliding window, others …
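
The two cases on this slide can be sketched as functions. This is an illustration under a stated assumption, not the formula from the talk (the slide's own formula did not survive in the transcript): changes at different time instants are assumed independent, so an overwritten change survives only if no later change occurs. The function names borrow the word "life" used later in the talk and are otherwise hypothetical.

```python
from typing import Sequence


def life_overwrite(change_prob: Sequence[float], j: int, k: int) -> float:
    """Probability that a change at T_j is still available at T_k (k >= j)
    when every new change overwrites the old information.

    Assumption: changes at different instants are independent, so the change
    survives exactly when no change occurs at T_{j+1} .. T_k.
    """
    if k < j:
        raise ValueError("k must not precede j")
    survive = 1.0
    for later in range(j + 1, k + 1):
        survive *= 1.0 - change_prob[later]
    return survive


def life_append_only(change_prob: Sequence[float], j: int, k: int) -> float:
    """Append-only pages never erase a change, so it is always still available."""
    if k < j:
        raise ValueError("k must not precede j")
    return 1.0
```

A sliding-window page would sit between the two: the change stays available for a bounded number of later changes and is then erased.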

16 Web Monitoring Requirements
- Append-only, timeliness not critical: resume database
- Complete overwrite, timeliness not critical: news stories archival
- Append-only, timeliness is critical: security bulletins
- Complete overwrite, timeliness is critical: online auction bids

17 Conflicting Requirements
- Completeness: maximize number of changes captured
- Timeliness: minimize delay in capturing changes
- Limited resources
  - Up to C pages can be monitored per time instant
- When resources are not plentiful, the two objectives can be at odds with each other

18 Timeliness-Completeness tradeoff
- Resource constraint: C=1
[Figure: change probabilities per time instant]
P1 (append-only): 0.4 1.0 0.3 0.4 0.6 0.1 0.3 0.9 0.4 0.6 0.2 1.0
P2 (overwrite): 0.3 0.9 0.2 0.3 0.5 0.0 0.2 0.8 0.3 0.5 0.1 0.9

19 Only Timeliness
- Objective: Changes must be captured with zero delay
[Figure: change probabilities per time instant]
P1 (append-only): 0.4 1.0 0.3 0.4 0.6 0.1 0.3 0.9 0.4 0.6 0.2 1.0
P2 (overwrite): 0.3 0.9 0.2 0.3 0.5 0.0 0.2 0.8 0.3 0.5 0.1 0.9

20 Only Completeness
- Objective: Maximize the number of changes captured
[Figure: change probabilities per time instant]
P1 (append-only): 0.4 1.0 0.3 0.4 0.6 0.1 0.3 0.9 0.4 0.6 0.2 1.0
P2 (overwrite): 0.3 0.9 0.2 0.3 0.5 0.0 0.2 0.8 0.3 0.5 0.1 0.9

21 Controlling the Tradeoff
- Urgency: importance of information captured as a function of delay in capturing
[Figure: example urgency functions]
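
A few urgency functions can be sketched as follows. These are illustrative guesses at the shapes such functions might take, not the curves from the lost figure. The first two correspond to the "only timeliness" and "only completeness" objectives of the previous slides; the third is an exponential decay with a rate parameter `r`, whose exact form in the paper may differ. All function names are hypothetical.

```python
import math


def urgency_zero_delay_only(delay: int) -> float:
    """Steepest possible curve: a change counts only if captured with zero delay."""
    return 1.0 if delay == 0 else 0.0


def urgency_constant(delay: int) -> float:
    """Flat curve: a captured change counts fully regardless of delay."""
    return 1.0


def urgency_exponential(delay: int, r: float) -> float:
    """Assumed exponential decay: larger r gives a steeper curve (timeliness matters more)."""
    return math.exp(-r * delay)
```

With `r = 0` the exponential curve reduces to the flat one, and as `r` grows it approaches the zero-delay-only curve.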

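22 Web Monitoring Requirements
- Append-only, timeliness not critical (gradual urgency curve): resume database
- Complete overwrite, timeliness not critical (gradual urgency curve): news stories archival
- Append-only, timeliness is critical (steep urgency curve): security bulletins
- Complete overwrite, timeliness is critical (steep urgency curve): online auction bids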

23 Web Monitoring Objective
- Maximize Utility
  - Utility = Expected number of changes captured, weighted by delay according to urgency function
- Each monitoring action takes unit amount of resource
- Resource constraint: amount of resource per time unit is constrained

24 Our Solution
- Web Information Collector (WIC)
- 2-approximation for all scenarios
  - Total utility accrued at least half that accrued by optimal monitoring schedule
- Finds optimal solution in the following special case:
  - Timeliness is critical, changes overwrite

25 Web Information Collector (WIC)
- Online, greedy strategy
- At each time instant, download page(s) with highest utility
- Utility combines:
  - Probability that a change has occurred
  - Probability that change has not been erased
  - Delay in capturing change (weighted according to urgency function)
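
The greedy strategy described on this slide can be sketched roughly as below. This is a simplified illustration, not the authors' implementation: the way the three ingredients are multiplied together, the independence assumption inside `overwrite_life`, and all function and variable names are assumptions. The sketch also recomputes each utility from scratch, so it does not reproduce the running time claimed in the talk.

```python
from typing import Callable, List, Sequence

Life = Callable[[Sequence[float], int, int], float]
Urgency = Callable[[int], float]


def download_utility(weight: float, change_prob: Sequence[float],
                     last_download: int, now: int,
                     life: Life, urgency: Urgency) -> float:
    """Expected utility of downloading one page at time `now`.

    Sums, over every instant since the last download, the probability that a
    change occurred, times the probability it has not been erased, times the
    urgency weight for the delay in capturing it.
    """
    total = 0.0
    for j in range(last_download + 1, now + 1):
        total += change_prob[j] * life(change_prob, j, now) * urgency(now - j)
    return weight * total


def greedy_schedule(weights: Sequence[float],
                    change_probs: Sequence[Sequence[float]],
                    capacity: int, life: Life, urgency: Urgency) -> List[List[int]]:
    """At each time instant, pick the `capacity` pages with the highest utility."""
    n = len(weights)
    horizon = len(change_probs[0])
    last = [-1] * n  # time of the most recent download of each page
    schedule: List[List[int]] = []
    for now in range(horizon):
        ranked = sorted(
            range(n),
            key=lambda i: download_utility(weights[i], change_probs[i],
                                           last[i], now, life, urgency),
            reverse=True,
        )
        chosen = ranked[:capacity]
        for i in chosen:
            last[i] = now
        schedule.append(chosen)
    return schedule


def overwrite_life(change_prob: Sequence[float], j: int, k: int) -> float:
    """Assumed overwrite model: the change survives if no later change occurs."""
    survive = 1.0
    for later in range(j + 1, k + 1):
        survive *= 1.0 - change_prob[later]
    return survive


def zero_delay_urgency(delay: int) -> float:
    """Timeliness is critical: only zero-delay captures count."""
    return 1.0 if delay == 0 else 0.0


# Hypothetical two-page example with room for one download per instant (C = 1).
weights = [1.0, 1.0]
change_probs = [[0.4, 1.0, 0.3], [0.3, 0.9, 0.5]]
schedule = greedy_schedule(weights, change_probs, 1, overwrite_life, zero_delay_urgency)
print(schedule)  # [[0], [0], [1]]
```

Because the choice at each instant uses only information available up to that instant, the sketch is online in the sense used on the slide.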

26 WIC continued
- Running time:
  - O(# pages) per time instant under most settings of life and urgency
- WIC is an online algorithm
  - Forecasting can be done at last minute

27 Proof of 2-Approximation
- See our paper

28 Experiments
[Table axes: timeliness (not critical / is critical) × change type (append-only / complete overwrite)]
- Data: 7550 auction pages
- Exponentially decaying urgency function parameterized by r

29 Experimental Results in Paper
- Sensitivity to error in prediction
  - Not unduly sensitive
- Comparison against prior approach (CAM)
  - Up to 80% improvement
  - Handles more applications
- Timeliness-Completeness tradeoff

30 Timeliness-Completeness tradeoff
[Figure: plot with ends labeled “favor completeness” and “favor timeliness”]

31 Summary
- Pull->push
- Can’t have it all
  - Choose a combination of timeliness and completeness
- Our solution: WIC
  - Handles many applications
  - Formal guarantee: 2-approximation
  - Online algorithm

32 Urgency Parameter Controls Timeliness-Completeness Tradeoff
- Best curve to use depends on application
- Application 1: Agent to monitor and bid in online auctions on behalf of many customers
  - Use steep curve (timeliness is critical)
- Application 2: Program to maintain database of large number of online resumes
  - Use gradual curve (timeliness less critical)

33 Experiments
- Determine exact change occurrence times
- Add noise to simulate prediction inaccuracy:
  - False positives
  - False negatives
  - Gaussian spreading
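
One way such noise could be injected is sketched below. The slide names only the three kinds of noise, so every detail here is an assumption: the function name `add_noise`, the per-instant flip rates, and the choice to spread each predicted change over neighboring instants with a normalized Gaussian kernel.

```python
import math
import random
from typing import List, Sequence


def add_noise(changes: Sequence[int], fp_rate: float, fn_rate: float,
              sigma: float, rng: random.Random) -> List[float]:
    """Turn exact 0/1 change occurrences into noisy change predictions.

    fp_rate: chance that an instant without a change is predicted as one.
    fn_rate: chance that an actual change is dropped from the prediction.
    sigma:   width of the Gaussian spreading, in time instants (0 disables it).
    """
    n = len(changes)
    noisy = list(changes)
    for t in range(n):
        if changes[t] == 1 and rng.random() < fn_rate:
            noisy[t] = 0  # false negative
        elif changes[t] == 0 and rng.random() < fp_rate:
            noisy[t] = 1  # false positive
    if sigma <= 0:
        return [float(x) for x in noisy]
    # Gaussian spreading: smear each predicted change over nearby instants,
    # normalizing so each change still contributes total probability mass 1.
    spread = [0.0] * n
    for t, x in enumerate(noisy):
        if x == 0:
            continue
        kernel = [math.exp(-((s - t) ** 2) / (2 * sigma ** 2)) for s in range(n)]
        total = sum(kernel)
        for s in range(n):
            spread[s] += kernel[s] / total
    return [min(1.0, p) for p in spread]
```

Passing a seeded `random.Random` instance keeps a simulated run repeatable, which helps when comparing schedules under different noise levels.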

