Online Detection of Change in Data Streams Shai Ben-David School of Computer Science U. Waterloo.

2 Some Change Detection Tasks  Quality control – Factory products are regularly tested and scored. Can we detect when the distribution of scores changes?  Real estate prices – Following the selling prices of houses in K-W, can we tell when market trends change?

3 Problem Formalization  Data points are generated sequentially and independently by some underlying probability distribution.  Viewing the generated stream of data points, we wish to detect when the underlying data generating distribution changes (and how it changes).

4 Detection in Sensor Networks  We consider large-scale networks of sensors.  Each sensor makes local binary decisions about the monitored physical phenomena: RED/GREEN.  An observer collects a random sample of the sensors' readings.

5 Change Detection in Sensor Networks [Figure: first data collection vs. second data collection] Is there a change in the underlying data-generating distribution? If a change has been detected, what exactly has changed?

6 Similar Issues in Other Disciplines  Ecology – Tracing the distribution of species over geographical locations.  Public Health – Tracing spread of various diseases.  Census data analysis.

7 Our basic paradigm Compare two sliding windows over the data stream. [Figure: windows S_1 and S_2 along the time axis] This reduces the change detection problem to the “two samples” problem: given two samples S_1, S_2, generated by distributions P_1, P_2, infer from S_1, S_2 whether P_1 = P_2.

8 Meta-Algorithm for Online Change Detection

9 Explanation  Note that the meta-algorithm is actually running k independent algorithms in parallel, one for each triplet (m_{1,i}, m_{2,i}, α_i).  Each keeps a baseline window X_i, containing the m_{1,i} points following the last detected change, c_0, and a second window, Y_i, containing the most recent m_{2,i} points in the stream.  We declare CHANGE whenever d(X_i, Y_i) > α_i.  At such a point we reset c_0 and X_i.  The different α_i's reflect different levels of change sensitivity. The m_i's are computed from the α_i's using the theory outlined below.
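
The window bookkeeping described on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' meta-algorithm: the names `Detector`, `ks_distance` and `detect_changes` are hypothetical, the distance d is taken to be the largest gap between the two empirical CDFs, and each detector resets its own windows independently after declaring a change.

```python
from collections import deque

def ks_distance(xs, ys):
    """Largest gap between the two empirical CDFs (an assumed choice of d)."""
    best = 0.0
    for t in sorted(set(xs) | set(ys)):
        fx = sum(1 for x in xs if x <= t) / len(xs)
        fy = sum(1 for y in ys if y <= t) / len(ys)
        best = max(best, abs(fx - fy))
    return best

class Detector:
    """One detector per triplet (m1, m2, alpha); hypothetical helper class."""

    def __init__(self, m1, m2, alpha, distance=ks_distance):
        self.m1, self.m2, self.alpha = m1, m2, alpha
        self.distance = distance
        self.baseline = []              # X_i: first m1 points after c_0
        self.recent = deque(maxlen=m2)  # Y_i: most recent m2 points

    def update(self, point):
        """Feed one point; return True if CHANGE is declared."""
        if len(self.baseline) < self.m1:
            self.baseline.append(point)  # still filling the baseline window
            return False
        self.recent.append(point)        # slides once m2 points are held
        if len(self.recent) < self.m2:
            return False
        if self.distance(self.baseline, list(self.recent)) > self.alpha:
            # Reset c_0 and X_i: the baseline restarts after this point.
            self.baseline = []
            self.recent.clear()
            return True
        return False

def detect_changes(stream, triplets):
    """Run one detector per triplet; return (time, detector index) alarms."""
    detectors = [Detector(*t) for t in triplets]
    alarms = []
    for t, point in enumerate(stream):
        for i, det in enumerate(detectors):
            if det.update(point):
                alarms.append((t, i))
    return alarms
```

For example, `detect_changes([0, 1, 2, 3, 4] * 4 + [10, 11, 12, 13, 14] * 4, [(5, 5, 0.5)])` raises a single alarm shortly after the stream switches regimes at index 20.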

10 Statistical requirements We wish to support our statistical tests with formal, finite-sample-size guarantees on:  The rate of false positives (`false alarms').  The rate of false negatives (`missed detections').  The reliability of the change description.

11 Previous Work on the Two-Sample Problem Mostly within the context of parametric statistics. (Assuming the underlying distributions come from a known family of ‘nice’ distributions) Previous applications not concerned with memory and computation time limitations. Performance guarantees are asymptotic – apply only in the limit when sample sizes go to infinity. Previous focus on detection only – we wish to also describe the change.

12 The Need for Probability-Distance Measure  False Positives guarantees are straightforward: “If S_1, S_2 are samples of the same distribution, then the probability that the test will declare `CHANGE' is small.”  False Negatives guarantees are more delicate: “If S_1, S_2 come from different distributions then, w.h.p., declare `CHANGE'.” This is infeasible.  One needs to quantify the difference: “d(P_1, P_2) > ε”.

13 Inadequacy of common Measures  The L_1 norm (or `total variation') is too sensitive: for every sample-based test and every m, there are P_1, P_2 s.t. L_1(P_1, P_2) > ¼ but the test fails to detect change from m-samples.  The L_p norms for p > 1 are too insensitive.

14 A New Measure of Distance Given a family F of domain subsets, we define d_F(P_1, P_2) = sup_{A ∈ F} |P_1(A) - P_2(A)|. Note that this is a pseudo-metric over probability distributions. Intuitively, F is chosen as a family of sets that the user cares about, and d_F measures the largest change in probability over sets in F.
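
A small numeric illustration may help here. The sketch below reads d_F as the largest absolute difference in probability over the sets in F, which is how the slide describes it, and evaluates it for finite discrete distributions. The function name `d_F` and the toy distributions are assumptions made for the example.

```python
def d_F(p1, p2, family):
    """Largest |P1(A) - P2(A)| over the sets A in `family`.

    p1, p2: dicts mapping outcomes to probabilities (toy discrete case).
    family: iterable of sets of outcomes, playing the role of F.
    """
    def prob(p, a):
        return sum(p.get(x, 0.0) for x in a)

    return max(abs(prob(p1, a) - prob(p2, a)) for a in family)

# Two different distributions on {a, b, c, d}.
p1 = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
p2 = {"a": 0.5, "b": 0.0, "c": 0.25, "d": 0.25}

coarse = [{"a", "b"}, {"c", "d"}]  # cannot tell p1 and p2 apart
fine = [{"a"}, {"a", "b"}]         # sees the mass moved onto "a"
```

With these inputs, `d_F(p1, p2, coarse)` is 0 even though p1 and p2 differ, which is why d_F is only a pseudo-metric, while `d_F(p1, p2, fine)` is 0.25.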

15 Major Merits of the F-distance  If F is the family of disks or rectangles, d_F captures an intuitive notion of `localized change'.  If the family of sets, F, has a finite VC-dimension, then one gets finite sample-size guarantees against false negatives (w.r.t. d_F).

16 Background: VC-Dimension The Vapnik-Chervonenkis dimension (VC-dim) is a parameter that measures the `combinatorial complexity' of a family of sets. For algebraically defined families it is roughly the number of parameters needed to define a set in the family. So, VC-dim(Planar Disks) = 3 and VC-dim(Axis Aligned Rectangles) = 4.
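
To make `combinatorial complexity' concrete, here is a brute-force shattering check for a simpler family than the ones on the slide: closed intervals on the real line, chosen here only because the check fits in a few lines. A point set is shattered when every one of its subsets can be cut out by some set in the family, and the VC-dimension is the size of the largest shattered set. The function names are hypothetical.

```python
def interval_traces(points):
    """All subsets of `points` (distinct reals) cut out by a closed interval."""
    pts = sorted(points)
    traces = {frozenset()}  # an interval that misses every point
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            # An interval picks up exactly a contiguous run of sorted points.
            traces.add(frozenset(pts[i:j + 1]))
    return traces

def is_shattered(points):
    """True if intervals realise all 2^n subsets of the n given points."""
    return len(interval_traces(points)) == 2 ** len(points)
```

Here `is_shattered([1, 2])` holds but `is_shattered([1, 2, 3])` does not, because no interval contains 1 and 3 while excluding 2. Intervals therefore have VC-dimension 2, matching the two parameters (the endpoints) needed to define one.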

17 VC-Based Guarantees Let P_1, P_2 be any probability distributions over some domain set X, and let F be a family of subsets of X of finite VC-dimension d. For every 0 < ε < 1, if S_1, S_2 are i.i.d. samples of size m each, drawn from P_1, P_2 (respectively) then,

18 VC-Based Guarantees (2) In particular, we get a guarantee stated in terms of the empirical measure S_i(A), the fraction of the points of S_i that fall in A.

19 A Relativized Discrepancy To focus on small-weight subsets, we define a variation of the d_F distance

20 Statistical Guarantees for the Relativized Discrepancy Let P_1, P_2 be any probability distributions over some domain set X, and let F be a family of subsets of X of finite VC-dimension d. For every 0 < ε < 1, if S_1, S_2 are i.i.d. samples of size m each, drawn from P_1, P_2 (respectively) then,

21 Algorithms for computing d_F(S_1, S_2) We developed several basic algorithms that take a pair of samples S_1, S_2 as input and output the sets A in F that exhibit maximal empirical discrepancy. Our focus is the computational complexity of the algorithms as a function of the input sample sizes.
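
As a toy instance of this input/output contract, the sketch below takes F to be the closed intervals on the real line and searches exhaustively for an interval with maximal empirical discrepancy. It is a naive cubic-time search written for clarity, not one of the algorithms referred to on this slide, and the function name is hypothetical. Only intervals with endpoints at sample points are tried, since any interval containing at least one sample point contains the same sample points as some interval of that form.

```python
def max_discrepancy_interval(s1, s2):
    """Return (gap, (lo, hi)) for a closed interval maximising the gap
    between the fractions of s1 and of s2 that fall inside it."""
    pts = sorted(set(s1) | set(s2))  # candidate endpoints
    best_gap, best_interval = 0.0, None
    for i, lo in enumerate(pts):
        for hi in pts[i:]:
            f1 = sum(lo <= x <= hi for x in s1) / len(s1)  # S_1(A)
            f2 = sum(lo <= x <= hi for x in s2) / len(s2)  # S_2(A)
            gap = abs(f1 - f2)
            if gap > best_gap:  # keep the first interval reaching the max
                best_gap, best_interval = gap, (lo, hi)
    return best_gap, best_interval
```

For `s1 = [1, 2, 3, 4]` and `s2 = [1, 2, 7, 8]` this returns `(0.5, (1, 4))`: the interval [1, 4] holds all of s1 but only half of s2. Other intervals, such as [3, 4], reach the same gap; the search reports the first one it finds.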

22 Algorithms – The Basic Ideas (1) We say that a collection H of subsets is F-complete w.r.t. a sample S if for every A in F there exists a set B in H such that A ∩ S = B ∩ S. It follows that, if H is F-complete w.r.t. S_1 ∪ S_2, then the empirical discrepancies between S_1 and S_2 over F can be computed by searching over H.

23 Algorithms – The Basic Ideas (2) The next step is to find finite collections of subsets that are F-complete for some natural families F. For example, for planar disks one can take H = { D(s_1, s_2, s_3) : s_1, s_2, s_3 in S }, where D(s_1, s_2, s_3) is the disk whose boundary is defined by this triple of points.

24 Running times of our algorithms For real-valued data we designed a data structure and an algorithm that requires O(m_{1,i} + m_{2,i}) time at every initiation and O(log(m_{1,i} + m_{2,i})) time for incremental updates for every newly arriving data point.

25 Running times of our Algorithms (2) For two-dimensional data points, we consider two basic families F of sets in the plane: Axis Aligned Rectangles and Planar Disks. For Rectangles we get computational complexity O(|S|^3). For Disks we get an exhaustive algorithm that runs in time O(|S|^4) and an approximation algorithm of complexity O(|S|^2 log|S|).

26 Summary  We defined notions of spatial distance between probability distributions: changes that are detectable within local geometric regions (say, circles).  We applied Vapnik-Chervonenkis theory to derive confidence guarantees.  We developed efficient detection and estimation algorithms.

27 Novelty of Our Approach  Non-parametric statistics. (We make no prior assumptions about the underlying distribution)  We provide performance guarantees for manageable (finite) sample sizes.  We develop computationally efficient algorithms for change detection and change estimation.

