1
Statistical Change Detection for Multi-Dimensional Data
Presented by Xiuyao Song. Authors: Xiuyao Song, Mingxi Wu, Chris Jermaine, Sanjay Ranka. Good afternoon, everyone. Welcome to my talk.
2
Motivation example: antibiotic resistance pattern
Culturing a specimen of E. coli bacteria and then testing its resistance to multiple antibiotic drugs. A typical record could look like this (R: resistant, S: susceptible, U: undetermined): drug1 = R, drug2 = S, drug3 = U, drug4 = R. We have a baseline data set and a recently observed data set. Question: does E. coli show a different resistance pattern recently? If a change is detected, it might be caused by the presence of new E. coli strains, and we will raise an alarm for further investigation. We consider the problem of building a statistical test for detecting a change of distribution in multi-dimensional data. Such a test would have numerous applications in a variety of disciplines. One example is detecting a change in antibiotic resistance patterns. Assume we are culturing a specimen of E. coli bacteria and testing its resistance to multiple antibiotic drugs. A typical resistance record could be a vector like this: resistant to drug1, susceptible to drug2, undetermined for drug3, and resistant to drug4. We have a baseline data set and a recently observed data set. A question of interest is: does E. coli show a different resistance pattern than what we have observed before? We need a distributional change detection method to answer this question. If a change is detected, it might be caused by the presence of new E. coli strains, and in this case we will raise an alarm for further investigation.
3
Problem definition. Question: FS = FS'? FS and FS' are unknown distributions.
Data set S (baseline data) and data set S' (recently observed data) are randomly sampled from a multi-dimensional space. At a high level, the problem is this: given two multi-dimensional data sets S and S', assume they are randomly sampled from two unknown underlying distributions, respectively. The question we aim to answer is: are the two distributions the same, i.e., does FS = FS' hold?
4
Related work. For uni-dimensional data there are many existing tests, such as the K-S test and the chi-square test. Only two tests detect a generic distributional change in multi-dimensional space: the kdq-tree by Dasu et al., which relies on a discretization scheme and suffers from the curse of dimensionality, and cross-match by Rosenbaum, which is computationally expensive due to its maximum matching algorithm. For uni-dimensional data, many tests exist, such as the K-S test and the chi-square test. However, little attention has been paid to multi-dimensional data. In fact, we have found only two tests that detect a generic distributional change in multi-dimensional space. One is the kdq-tree test; it relies on discretizing the data space, which tends to suffer from the curse of dimensionality. The other is the cross-match test; this test is computationally expensive since it includes a maximum matching algorithm.
5
Hypothesis test framework
Data set S and data set S'. Null hypothesis H0: FS = FS'. If H0 holds, δ is a sample from the null distribution ∆; if δ is in the tail of ∆, it is highly likely that H0 does not hold. Both the kdq-tree and cross-match tests, as well as the test we propose, are based on the hypothesis testing framework. The null hypothesis claims that the two underlying distributions are the same. Then a test statistic δ is defined and the corresponding null distribution is derived. If the null hypothesis holds, δ should be a sample from the null distribution. If δ is far out in the tails of the null distribution, then it is highly likely that the null hypothesis does not hold.
6
Density test high-level overview
Data set S and data set S'. Randomly partition S into S1 and S2. Step 1: build the Gaussian kernel density estimate KS1 of S1. Step 2: define and calculate the test statistic δ. Step 3: derive the null distribution. Step 4: calculate the critical value and make a decision. Here is the density test we propose. Before we conduct the test, we randomly partition the baseline data set S into two parts, S1 and S2. Our test consists of four steps. Step 1: we apply the Gaussian kernel density estimation technique to infer the distribution of S1. Step 2: we define and calculate the test statistic δ; δ is a function of S2, S', and the density estimate of S1. Step 3: we derive the null distribution. Step 4: given a significance level α, we calculate the critical value τ and make a decision to reject or not reject the null hypothesis.
7
Step 1: Kernel Density Estimate (KDE) --bandwidth selection
Plug-in bandwidth: asymptotically efficient, but not accurate. Data-driven bandwidth: converges better to the true distribution. Step 1 is the kernel density estimate. The idea is to put a Gaussian kernel on each data point and then sum the bumps up to obtain the density function. The most important choice in KDE is the bandwidth of each kernel, and there are two basic approaches. A plug-in bandwidth assumes one fixed bandwidth for all the kernels; this approach is asymptotically efficient, but it is not accurate in multi-dimensional space. KDE with data-driven bandwidths, on the other hand, can converge better to the true distribution. As you will see later, the null distribution of our density test is always a normal distribution centered at zero, no matter what kind of density estimate we use, so the correctness of the density test can always be guaranteed; it does not depend on any specific bandwidth selection method. But the power of the test increases when the density estimate is close to the original distribution. For this reason, we go with a data-driven bandwidth approach. Correctness of the density test can always be guaranteed; the power of the test increases when the estimate is accurate.
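As a minimal illustration (ours, not from the talk) of the plug-in option, the sketch below builds a fixed-bandwidth Gaussian KDE on a toy 4-dimensional sample; scipy's gaussian_kde applies Scott's factor n**(-1/(d+4)) to every kernel by default, which is exactly the fixed-bandwidth behavior described above. The data-driven alternative we actually use is sketched on the next slide.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Toy 4-dimensional baseline sample (hypothetical data, 500 points).
rng = np.random.default_rng(0)
S1 = rng.normal(size=(500, 4))

# Plug-in (Scott's) bandwidth: one fixed smoothing factor shared by every kernel.
kde_plugin = gaussian_kde(S1.T)            # scipy expects shape (d, n)
print("Scott factor:", kde_plugin.factor)  # 500 ** (-1 / (4 + 4))
print("density at the origin:", kde_plugin(np.zeros((4, 1)))[0])
```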
8
Choose bandwidth by MLE/EM (maximum likelihood estimation / Expectation Maximization)
[Figure: kernels placed on the data points and the pseudo-LLH objective function.] In our density test, we use maximum likelihood estimation to determine the optimal bandwidth for each kernel. But if we maximize the log-likelihood function directly, we run into a problem. Referring to the figure: since each kernel is centered on one data point, if the bandwidth of that kernel converges to zero, the likelihood of that data point under its own kernel goes to infinity. As one term becomes infinite, the whole log-likelihood becomes infinite, so maximizing the log-likelihood directly would always shrink the optimal bandwidths to zero, which does not make sense. To fix this problem, we add the constraint that the likelihood of a data point under its own kernel is not counted (it is set to zero). Combining the log-likelihood with this constraint, we obtain a pseudo-log-likelihood objective function, which we maximize with the Expectation Maximization method.
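Below is a minimal sketch of that idea in Python; it is our own illustration and not necessarily the authors' exact update equations. It fits one isotropic bandwidth per kernel with EM-style fixed-point updates on the leave-one-out pseudo-log-likelihood, excluding each point's own kernel so the bandwidths cannot collapse to zero.

```python
import numpy as np

def em_bandwidths(X, iters=30, eps=1e-12):
    """Per-kernel isotropic bandwidths maximizing a leave-one-out pseudo-log-likelihood
    via EM-style updates (an illustrative sketch, not the paper's exact algorithm)."""
    n, d = X.shape
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    h2 = np.full(n, np.median(D2[D2 > 0]))                # crude initial squared bandwidths
    for _ in range(iters):
        # E-step: responsibility of kernel j for point i, with the self-term removed
        # (this is the constraint that prevents bandwidths from shrinking to zero).
        log_k = -0.5 * D2 / h2[None, :] - 0.5 * d * np.log(2 * np.pi * h2)[None, :]
        np.fill_diagonal(log_k, -np.inf)
        log_k -= log_k.max(axis=1, keepdims=True)
        r = np.exp(log_k)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: each kernel's variance is the responsibility-weighted mean squared
        # distance of the points it explains, averaged over the d dimensions.
        h2 = np.maximum((r * D2).sum(axis=0) / (d * r.sum(axis=0) + eps), 1e-8)
    return np.sqrt(h2)                                     # one bandwidth per kernel
```

Calling em_bandwidths(S1) gives one bandwidth per baseline point; a mixture of Gaussians centered on the points of S1 with these bandwidths is then the data-driven estimate KS1.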
9
Effectiveness of EM bandwidth
Samples from the real distribution; samples from KDE with Scott's plug-in bandwidth; samples from KDE with our EM bandwidth. These figures show how closely the density estimate converges to the real distribution. The first one shows the real distribution, the second the estimate with Scott's plug-in bandwidth, and the third the estimate with our EM bandwidth. Clearly, the EM bandwidth depicts the original distribution better.
10
Step 2: define and calculate the test statistic δ
Data set S and data set S'; S is randomly partitioned into S1 and S2, and KS1 is the kernel density estimate of S1. The first term of δ is small if S' differs from S1 and large otherwise; the second term is always large. Now go to Step 2. We define δ by this formula, where LLH means the log-likelihood function and KS1 is the density estimate of S1. Since S2 and S1 come from the same distribution, the second term always takes a large value. For the first term, if S' and S1 have different distributions, the value will be small; otherwise, the value should be comparable to the second term. That means that when S and S' have different distributions, δ takes a very small value and falls into the lower tail of the null distribution. So our density test is a one-sided, lower-tail test.
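Since the δ formula on the slide was an image and did not survive the transcription, the exact normalization below is our assumption; the sketch simply follows the narration (log-likelihood of S' under KS1 versus log-likelihood of S2 under KS1), with scipy's gaussian_kde standing in for KS1.

```python
import numpy as np
from scipy.stats import gaussian_kde

def delta_statistic(kde, S2, S_prime):
    """delta = mean log-likelihood of S' under K_S1
             - mean log-likelihood of S2 under K_S1   (normalization is our assumption)."""
    llh_prime = np.log(kde(S_prime.T)).mean()  # small when S' differs from S1
    llh_s2 = np.log(kde(S2.T)).mean()          # large, since S2 shares S1's distribution
    return llh_prime - llh_s2                  # very negative values fall in the lower tail

# Example usage with arrays of shape (n, d):
#   kde = gaussian_kde(S1.T); delta = delta_statistic(kde, S2, S_prime)
```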
11
Step 3: derive the null distribution: ∆ ~ normal, by the Central Limit Theorem
Step 3 is deriving the null distribution. The random variable ∆ of the null distribution consists of two independent terms, each of which can be viewed as a sum of i.i.d. random variables Tk with the distribution induced by FS. According to the Central Limit Theorem, each term has a normal distribution; therefore ∆ has a normal distribution. After some derivation, we find that the expectation of ∆ is zero and its variance is proportional to σ². Notice that FS, the distribution of the Tk, is unknown, so σ² needs to be estimated.
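The equation on this slide is missing from the transcript; one reconstruction consistent with the narration (two independent sums of i.i.d. terms, zero mean under H0, variance proportional to σ²) is the following, where the normalization is again our assumption:

```latex
\begin{aligned}
\Delta &= \frac{1}{|S'|}\sum_{k=1}^{|S'|} T'_k \;-\; \frac{1}{|S_2|}\sum_{k=1}^{|S_2|} T_k,
  & T_k &= \log K_{S_1}(X_k),\; X_k \sim F_S, \\[4pt]
\Delta &\;\dot\sim\; \mathcal{N}\!\left(0,\; \sigma^2\Big(\tfrac{1}{|S'|} + \tfrac{1}{|S_2|}\Big)\right),
  & \sigma^2 &= \operatorname{Var}\!\big[\log K_{S_1}(X)\big].
\end{aligned}
```

Under H0 both groups of T's are i.i.d. draws from the same distribution, so E[∆] = 0 and only σ² has to be estimated.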
12
Estimating σ²: if σ² is underestimated, an additional type I error will be introduced.
Use the bootstrap percentile method: for i = 1 to 4000, draw a bootstrap sample R from S2 (with replacement) and compute an estimate of σ² from R; then take an upper percentile of the 4000 estimates. Before estimating σ², we should be aware that the estimation might introduce some type I error. Referring to the figure: if σ² is underestimated, then the variance of the null distribution is underestimated, and a normal sample from the true null distribution might look abnormal under the estimated null distribution. In that case we would incorrectly declare a change and thus introduce a type I error. Fortunately, we can use the bootstrap percentile method to bound this type I error. In the following algorithm, we use bootstrapping to find an upper confidence limit on the true σ²; the probability of underestimating σ² is bounded by β.
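A sketch of that loop, under our assumption that each bootstrap iteration resamples the per-point log-likelihoods of S2 under KS1 and records their variance (the slide itself only shows the loop skeleton); the (1 - β) percentile of the 4000 bootstrap variances then serves as the upper confidence limit on σ².

```python
import numpy as np

def sigma2_upper_limit(loglik_S2, beta=0.04, B=4000, seed=0):
    """Bootstrap-percentile upper confidence limit on sigma^2.
    loglik_S2: per-point log-likelihoods of S2 under the density estimate of S1 (assumed)."""
    rng = np.random.default_rng(seed)
    n = len(loglik_S2)
    boot_var = np.empty(B)
    for i in range(B):
        R = rng.choice(loglik_S2, size=n, replace=True)  # bootstrap sample R from S2
        boot_var[i] = R.var()
    # The (1 - beta) percentile bounds the probability of underestimating sigma^2 by ~beta.
    return np.quantile(boot_var, 1 - beta), boot_var
```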
13
Step 4: calculate critical value and make a decision
τ is calculated from its definition, where Var[∆] is related to the estimated σ². If δ < τ, reject H0. Total type I error ≤ α + β. [Figure: estimated null distribution ∆ with the lower-tail critical region.] The last step of the density test is to calculate the critical value τ and make a decision. Let α be the significance level of the basic hypothesis test and β be the type I error due to estimating the null distribution. τ is calculated by its definition, and we reject the null hypothesis if δ is less than τ. Note that the total type I error of our density test is bounded by α + β. We choose α and β so that the critical value of the test is maximized, because with a larger critical value we are more likely to reject the null hypothesis, so the power of the test is maximized. At this point we have finished all four steps of the density test, and the test is ready to be used. Choose α and β to maximize τ.
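A sketch of this last step under the same assumptions as the earlier sketches (lower-tail test, Var[∆] = σ²(1/|S'| + 1/|S2|)): τ is the α-quantile of the estimated null distribution with σ² replaced by its (1 - β) bootstrap upper limit, and the budget α + β is split so that τ is as large as possible.

```python
import numpy as np
from scipy.stats import norm

def critical_value(boot_var, n2, n_prime, alpha, beta):
    """tau: alpha-quantile of N(0, Var[Delta]), with sigma^2 replaced by its
    (1 - beta) bootstrap upper confidence limit (the form of Var[Delta] is assumed)."""
    sigma2 = np.quantile(boot_var, 1 - beta)
    var_delta = sigma2 * (1.0 / n2 + 1.0 / n_prime)
    return norm.ppf(alpha) * np.sqrt(var_delta)   # negative: lower-tail cutoff

def best_split(boot_var, n2, n_prime, total_error=0.08, grid=99):
    """Split alpha + beta = total_error so that tau (and hence the power) is maximized."""
    alphas = np.linspace(total_error / (grid + 1), total_error, num=grid, endpoint=False)
    taus = [critical_value(boot_var, n2, n_prime, a, total_error - a) for a in alphas]
    best = int(np.argmax(taus))
    return alphas[best], total_error - alphas[best], taus[best]
```

The decision is then "reject H0 if δ < τ", with the total type I error bounded by α + β.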
14
Density test – all 4 steps
Recap of the four steps: randomly partition the baseline data set S into S1 and S2. Step 1: build the Gaussian kernel density estimate KS1 of S1. Step 2: define and calculate the test statistic δ from S2, S', and KS1. Step 3: derive the null distribution. Step 4: calculate the critical value τ and make a decision to reject or not reject the null hypothesis.
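Tying the pieces together, here is a minimal end-to-end composition of the illustrative helpers sketched on the previous slides (delta_statistic, sigma2_upper_limit, best_split); a fixed-bandwidth gaussian_kde stands in for the EM-tuned estimate of Step 1, and S and S' are assumed to be numpy arrays of shape (n, d).

```python
import numpy as np
from scipy.stats import gaussian_kde

def density_test_sketch(S, S_prime, total_error=0.08, seed=0):
    """Returns True when a distributional change is declared (our composition, not the authors' code)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(S))
    S1, S2 = S[idx[:len(S) // 2]], S[idx[len(S) // 2:]]             # random partition of the baseline
    kde = gaussian_kde(S1.T)                                        # Step 1: density estimate of S1
    delta = delta_statistic(kde, S2, S_prime)                       # Step 2: test statistic
    _, boot_var = sigma2_upper_limit(np.log(kde(S2.T)), B=2000)     # Step 3: bootstrap for sigma^2
    _, _, tau = best_split(boot_var, len(S2), len(S_prime), total_error)  # Step 4: critical value
    return delta < tau                                              # reject H0 -> change detected
```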
15
Run density test in 2 directions
The test is not symmetric, so a 2-way test may increase the power. E.g.: a change from FS to FS' may be hard to detect in one direction but easy in the other. In order to increase the detection power, you might want to run the test in two directions. This is because δ is not defined symmetrically with respect to S and S'. As in this example, a change in one direction might be much easier to detect than the change in the other direction.
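Because δ is asymmetric in S and S', an equally simple (and again illustrative) wrapper runs the sketch above in both directions and declares a change if either direction rejects:

```python
def two_way_test(S, S_prime, **kwargs):
    # A change that is hard to detect in one direction may be easy in the other.
    return density_test_sketch(S, S_prime, **kwargs) or density_test_sketch(S_prime, S, **kwargs)
```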
16
False positive

Data consists of a low-D group and a high-D group. The user-given p value is 8%.

false positive (%)   density   kdq-tree   cross-match
Low-D                5         8
High-D               6         1          NA

Now we compare our density test with the other two methods, kdq-tree and cross-match. We divide the experiment data into a low-dimensional group and a high-dimensional group, and use a larger data size for the high-dimensional data. Assuming the user-given p value is 8%, this table gives the false positive rate in percent. You can see that the false positive rates of all the methods are bounded by the given p value. Note that we do not have a result for cross-match on the high-dimensional data due to its limited scalability.
17
false negative on low-D group
[Figure: false negative rate for each type of change, low-D group.] To test the false negatives, we create 5 types of distributional change. This is the result on the low-dimensional data. You can see that our density test has the lowest false negative rate on all the changes.
18
false negative on high-D group
[Figure: false negative rate for each type of change, high-D group.] On the high-dimensional data, the density test again performs best on all changes.
19
Scalability. This figure compares scalability: we plot the running time versus the data set size. You can see that the cross-match method cannot comfortably handle large data sizes; kdq-tree has the best scalability, and the density test is in the middle. It is worth mentioning that the density test has a one-time cost for constructing the kernel density estimate, which occupies 84% of the running time. Once the estimate is constructed, it can be reused, and the test itself runs in a small fraction of the time. The density test has an amortizable time cost (one-time cost: 84%).
20
Conclusion

Our density test:
- can correctly bound the type I error
- is the most powerful on all 5 types of change
- can easily scale to large data sets and has an amortizable time cost

In conclusion, those are the key properties of our density test. For any further questions, you are welcome to talk to me at the poster session; my poster number is 15. Poster session (#15)
21
Thanks for your attention!