1
Online Detection of Change in Data Streams
Shai Ben-David
School of Computer Science, U. Waterloo
2
Some Change Detection Tasks
Quality control – factory products are regularly tested and scored. Can we detect when the distribution of scores changes?
Real estate prices – following the selling prices of houses in Kitchener–Waterloo (K-W), can we tell when market trends change?
3
Problem Formalization
Data points are generated sequentially and independently by some underlying probability distribution. Viewing the generated stream of data points, we wish to detect when the underlying data-generating distribution changes (and how it changes).
4
Detection in Sensor Networks
We consider large-scale networks of sensors. Each sensor makes a local binary decision about the monitored physical phenomenon: RED/GREEN. An observer collects a random sample of the sensors' readings.
5
Change Detection in Sensor Networks
[Figure: sensor readings from a first and a second data collection]
Is there a change in the underlying data-generating distribution? If a change has been detected, what exactly has changed?
6
Similar Issues in Other Disciplines
Ecology – tracing the distribution of species over geographical locations.
Public health – tracing the spread of various diseases.
Census data analysis.
7
Our Basic Paradigm
Compare two sliding windows over the data stream.
[Figure: two windows S_1 and S_2 sliding along the time axis]
This reduces the change-detection problem to the "two-sample" problem: given two samples S_1 and S_2, generated by distributions P_1 and P_2, infer from S_1 and S_2 whether P_1 = P_2.
8
Meta-Algorithm for Online Change Detection
9
Explanation
Note that the meta-algorithm actually runs k independent algorithms in parallel – one for each triplet (m_{1,i}, m_{2,i}, α_i). Each keeps a baseline window X_i, containing the m_{1,i} points following the last detected change point c_0, and a second window Y_i, containing the most recent m_{2,i} points in the stream. We declare CHANGE whenever d(X_i, Y_i) > α_i; at such a point we reset c_0 and X_i. The different α_i's reflect different levels of change sensitivity. The m_i's are computed from the α_i's using the theory outlined below.
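Below is a minimal sketch of the meta-algorithm as just described – assuming a caller-supplied two-sample distance d (e.g. the empirical d_F defined later); the parameter names are illustrative, and the slides themselves do not fix an implementation.

```python
from collections import deque

class ChangeDetector:
    """One instance of the meta-algorithm for a single triplet
    (m1, m2, alpha): a baseline window X of the m1 points after the
    last detected change c_0, a sliding window Y of the most recent
    m2 points, and a CHANGE declaration when d(X, Y) > alpha."""

    def __init__(self, m1, m2, alpha, distance):
        self.m1, self.m2, self.alpha = m1, m2, alpha
        self.distance = distance        # two-sample distance, e.g. empirical d_F
        self.baseline = []              # window X_i
        self.recent = deque(maxlen=m2)  # window Y_i

    def observe(self, x) -> bool:
        """Feed one stream point; return True iff CHANGE is declared."""
        self.recent.append(x)
        if len(self.baseline) < self.m1:
            self.baseline.append(x)     # still filling X_i after last change c_0
            return False
        if len(self.recent) < self.m2:
            return False
        if self.distance(self.baseline, list(self.recent)) > self.alpha:
            self.baseline = []          # reset c_0 and X_i
            self.recent.clear()         # also restart Y_i on post-change data
            return True
        return False

def run_parallel(stream, triplets, distance):
    """The meta-algorithm: k independent detectors, one per
    (m1, m2, alpha) triplet, run over the same stream."""
    detectors = [ChangeDetector(m1, m2, a, distance) for (m1, m2, a) in triplets]
    for t, x in enumerate(stream):
        for det in detectors:
            if det.observe(x):
                print(f"CHANGE declared at time {t} (alpha={det.alpha})")
```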
10
Statistical Requirements
We wish to support our statistical tests with formal, finite-sample-size guarantees for:
– Controlling the rate of false positives ('false alarms').
– Controlling the rate of false negatives ('missed detections').
– The reliability of the change description.
11
Previous Work on the Two-Sample Problem
Mostly within the context of parametric statistics (assuming the underlying distributions come from a known family of 'nice' distributions).
Previous applications were not concerned with memory and computation-time limitations.
Performance guarantees are asymptotic – they apply only in the limit as sample sizes go to infinity.
Previous work focused on detection only – we wish to also describe the change.
12
The Need for a Probability-Distance Measure
False-positive guarantees are straightforward: "If S_1 and S_2 are samples of the same distribution, then the probability that the test declares CHANGE is small."
False-negative guarantees are more delicate. The naive requirement – "if S_1 and S_2 come from different distributions then, w.h.p., declare CHANGE" – is infeasible: arbitrarily close yet distinct distributions cannot be distinguished from finite samples. One therefore needs to quantify the difference, requiring "d(P_1, P_2) > ε" for some distance measure d.
13
Inadequacy of Common Measures
The L_1 norm (or 'total variation') is too sensitive: for every sample-based test and every sample size m, there are P_1, P_2 such that L_1(P_1, P_2) > 1/4 but the test fails to detect the change from m-samples.
The L_p norms for p > 1 are too insensitive.
14
A New Measure of Distance
Given a family F of domain subsets, we define
    d_F(P_1, P_2) = sup_{A ∈ F} |P_1(A) − P_2(A)|.
Note that this is a pseudo-metric over probability distributions. Intuitively, F is chosen as a family of sets that the user cares about; d_F measures the largest change in probability over sets in F.
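A direct sketch of the empirical version of this definition, assuming the family F is supplied as a finite list of membership predicates (the names and representation are illustrative, not from the slides):

```python
def empirical_weight(sample, in_A):
    """S(A) = fraction of the sample's points falling in the set A."""
    return sum(1 for x in sample if in_A(x)) / len(sample)

def d_F(s1, s2, family):
    """max over A in F of |S1(A) - S2(A)|, for a finite family of
    sets given as membership predicates."""
    return max(abs(empirical_weight(s1, A) - empirical_weight(s2, A))
               for A in family)

# Example: F = a few one-dimensional intervals, as predicates.
family = [lambda x, a=a, b=b: a <= x <= b
          for (a, b) in [(0, 1), (1, 2), (0, 2)]]
print(d_F([0.1, 0.5, 1.5], [1.2, 1.8, 1.9], family))  # -> 0.666...
```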
15
Major Merits of the d_F Distance
If F is the family of disks or rectangles, d_F captures an intuitive notion of 'localized change'.
If the family F has finite VC-dimension, then one gets finite-sample-size guarantees against false negatives (w.r.t. d_F).
16
Background: VC-Dimension
The Vapnik–Chervonenkis dimension (VC-dim) is a parameter that measures the 'combinatorial complexity' of a family of sets. For algebraically defined families it is roughly the number of parameters needed to define a set in the family: so VC-dim(planar disks) = 3 and VC-dim(axis-aligned rectangles) = 4.
17
VC-Based Guarantees
Let P_1, P_2 be any probability distributions over some domain set X, and let F be a family of subsets of X of finite VC-dimension d. For every 0 < ε < 1, if S_1, S_2 are i.i.d. samples of size m each, drawn from P_1, P_2 (respectively), then
    Pr[ |d_F(P_1, P_2) − d_F(S_1, S_2)| > ε ] ≤ c_1 · (2m)^d · e^(−c_2·m·ε²)
for absolute constants c_1, c_2 (the standard Vapnik–Chervonenkis uniform-convergence bound, applied to both samples).
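This bound is what lets the meta-algorithm compute the window sizes m_i from the thresholds α_i. A sketch of that computation, with the constants c_1 = 4 and c_2 = 1/8 assumed purely for illustration (the slides do not fix them):

```python
import math

def required_sample_size(d: int, eps: float, delta: float,
                         c1: float = 4.0, c2: float = 0.125) -> int:
    """Smallest m with c1 * (2m)^d * exp(-c2 * m * eps^2) <= delta.

    The bound rises polynomially then falls exponentially in m, so a
    doubling search followed by bisection finds the crossing point.
    The constants c1, c2 are illustrative assumptions.
    """
    def bound(m: int) -> float:
        return c1 * (2 * m) ** d * math.exp(-c2 * m * eps * eps)

    m = 1
    while bound(m) > delta:       # double until the guarantee holds
        m *= 2
        if m > 10**12:
            raise ValueError("no feasible m in search range")
    lo, hi = m // 2, m            # bound(lo) > delta >= bound(hi)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if bound(mid) <= delta:
            hi = mid
        else:
            lo = mid
    return hi

# e.g. VC-dim 3 (disks), distance threshold 0.1, confidence 95%:
print(required_sample_size(d=3, eps=0.1, delta=0.05))
```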
18
VC-Based Guarantees (2)
In particular, when P_1 = P_2 (no change), we get
    Pr[ d_F(S_1, S_2) > ε ] ≤ c_1 · (2m)^d · e^(−c_2·m·ε²),
which bounds the false-positive rate. Here S_i(A) is the empirical measure
    S_i(A) = |S_i ∩ A| / |S_i|,
and d_F(S_1, S_2) = sup_{A ∈ F} |S_1(A) − S_2(A)| is the induced sample distance.
19
A Relativized Discrepancy
To focus on small-weight subsets, we define a variation of the d_F distance that normalizes the discrepancy |S_1(A) − S_2(A)| by (a function of) the empirical weight of A, so that a small absolute change over a low-probability set still registers as significant.
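The slide does not display the exact normalization; the snippet below uses one common relativization (dividing by the square root of the pooled weight's variance term) purely as a stand-in, not necessarily the paper's definition:

```python
import math

def relativized_discrepancy(w1: float, w2: float) -> float:
    """Illustrative relativized discrepancy for a single set A.

    w1, w2 are the empirical weights S1(A), S2(A). The sqrt
    normalization below amplifies discrepancies on small-weight
    sets; the paper's exact formula may differ (assumption).
    """
    pooled = (w1 + w2) / 2.0
    if pooled == 0.0 or pooled == 1.0:
        return 0.0
    return abs(w1 - w2) / math.sqrt(pooled * (1.0 - pooled))
```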
20
Statistical Guarantees for the Relativized Discrepancy
Let P_1, P_2 be any probability distributions over some domain set X, and let F be a family of subsets of X of finite VC-dimension d. For every 0 < ε < 1, if S_1, S_2 are i.i.d. samples of size m each, drawn from P_1, P_2 (respectively), then the relativized discrepancy concentrates around its distribution value, with an exponential finite-sample bound analogous to the d_F guarantee above.
21
Algorithms for Computing d_F(S_1, S_2)
We developed several basic algorithms that take a pair of samples S_1, S_2 as input and output the sets A in F that exhibit maximal empirical discrepancy. Our focus is the computational complexity of the algorithms as a function of the input sample sizes.
22
Algorithms – The Basic Ideas (1)
We say that a collection H of subsets is F-complete w.r.t. a sample S if for every A in F there exists a set B in H such that A ∩ S = B ∩ S. It follows that if H is F-complete w.r.t. S_1 ∪ S_2, then
    d_F(S_1, S_2) = max_{B ∈ H} |S_1(B) − S_2(B)|,
and the supremum over the (possibly infinite) family F reduces to a finite search over H.
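As a concrete one-dimensional illustration (intervals are this write-up's choice, not one of the paper's planar families): intervals with both endpoints at sample points form an F-complete collection for F = closed intervals on the line, so d_F is computable by a finite enumeration.

```python
def d_F_intervals(s1, s2):
    """Exact empirical d_F for F = closed intervals on the line.

    Intervals with both endpoints in S1 U S2 are F-complete w.r.t.
    the pooled sample, so enumerating them suffices. A naive
    O(|S|^3) sketch for clarity, not an optimized algorithm.
    """
    pts = sorted(set(s1) | set(s2))
    best, best_iv = 0.0, None
    for i, a in enumerate(pts):
        for b in pts[i:]:  # includes single-point intervals [a, a]
            w1 = sum(1 for x in s1 if a <= x <= b) / len(s1)
            w2 = sum(1 for x in s2 if a <= x <= b) / len(s2)
            if abs(w1 - w2) > best:
                best, best_iv = abs(w1 - w2), (a, b)
    return best, best_iv  # the maximal discrepancy and a witness interval
```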
23
Algorithms – The Basic Ideas (2)
The next step is to find finite collections of subsets that are F-complete for some natural families F. For example, for the family of planar disks,
    H = { D(s_1, s_2, s_3) : s_1, s_2, s_3 ∈ S },
where D(s_1, s_2, s_3) is the disk whose boundary is defined by this triple of points.
24
Running Times of Our Algorithms
For real-valued data we designed a data structure and an algorithm that require O(m_{1,i} + m_{2,i}) time at every initialization and O(log(m_{1,i} + m_{2,i})) time for the incremental update triggered by each newly arriving data point.
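The slide does not show the data structure itself; as a simpler stand-in (not the authors' structure), note that for F = one-sided intervals (−∞, t] the empirical d_F is exactly the two-sample Kolmogorov–Smirnov statistic, computable in O((m_1 + m_2) log(m_1 + m_2)) time by a sort-and-sweep:

```python
def d_F_one_sided(s1, s2):
    """d_F for F = {(-inf, t]}: the two-sample KS statistic.

    Sort the pooled sample once, then sweep it while tracking the
    gap between the two empirical CDFs.
    """
    events = sorted([(x, 1) for x in s1] + [(x, 2) for x in s2])
    n1, n2 = len(s1), len(s2)
    c1 = c2 = 0
    best = 0.0
    i = 0
    while i < len(events):
        x = events[i][0]
        # consume all points equal to x before comparing the CDFs
        while i < len(events) and events[i][0] == x:
            if events[i][1] == 1:
                c1 += 1
            else:
                c2 += 1
            i += 1
        best = max(best, abs(c1 / n1 - c2 / n2))
    return best
```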
25
Running Times of Our Algorithms (2)
For two-dimensional data points, we consider two basic families F of sets in the plane: axis-aligned rectangles and planar disks.
For rectangles we get computational complexity O(|S|^3).
For disks we get an exhaustive algorithm that runs in time O(|S|^4), and an approximation algorithm of complexity O(|S|^2 log |S|).
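A sketch of the exhaustive disk algorithm's structure, using the F-complete collection of disks through point triples from the previous slide (degenerate disks through one or two points are omitted for brevity, so this is an illustration rather than a complete implementation): O(|S|^3) candidate disks times an O(|S|) membership count gives the O(|S|^4) bound on the slide.

```python
import math
from itertools import combinations

def circumdisk(p, q, r):
    """Center and radius of the disk through three points; None if collinear."""
    (ax, ay), (bx, by), (cx, cy) = p, q, r
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    ux = ((ax*ax + ay*ay) * (by - cy) + (bx*bx + by*by) * (cy - ay)
          + (cx*cx + cy*cy) * (ay - by)) / d
    uy = ((ax*ax + ay*ay) * (cx - bx) + (bx*bx + by*by) * (ax - cx)
          + (cx*cx + cy*cy) * (bx - ax)) / d
    return (ux, uy), math.hypot(ax - ux, ay - uy)

def d_F_disks(s1, s2):
    """Exhaustive empirical d_F for F = planar disks, O(|S|^4) time.

    s1, s2 are lists of (x, y) tuples.
    """
    pooled = s1 + s2
    best = 0.0
    eps = 1e-9  # tolerance so boundary points count as inside
    for p, q, r in combinations(pooled, 3):
        disk = circumdisk(p, q, r)
        if disk is None:
            continue
        (ux, uy), rad = disk
        w1 = sum(1 for (x, y) in s1
                 if math.hypot(x - ux, y - uy) <= rad + eps) / len(s1)
        w2 = sum(1 for (x, y) in s2
                 if math.hypot(x - ux, y - uy) <= rad + eps) / len(s2)
        best = max(best, abs(w1 - w2))
    return best
```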
26
Summary
We defined notions of spatial distance between probability distributions – changes that are detectable within local geometric regions (say, circles).
We applied Vapnik–Chervonenkis theory to derive confidence guarantees.
We developed efficient detection and estimation algorithms.
27
Novelty of Our Approach
Non-parametric statistics: we make no prior assumptions about the underlying distribution.
We provide performance guarantees for manageable (finite) sample sizes.
We develop computationally efficient algorithms for change detection and change estimation.