ADAPTIVE EVENT DETECTION USING TIME-VARYING POISSON PROCESSES KDD '06, University of California, Irvine
ABSTRACT Time series of count data reflect the aggregated behavior of individual people: largely periodic, with bursty periods of unusual behavior. This paper uses statistical estimation techniques and a time-varying Poisson process model to detect such periods by unsupervised learning. On two data sets with ground truth (freeway traffic data and building access data), the model performs better than a non-probabilistic, threshold-based technique.
CONTENT Introduction; Related work; Data set characteristics; A baseline model and its limitations; Probabilistic modeling; Learning and inference; Adaptive event detection; Estimating event attendance; Conclusion
INTRODUCTION
Focus on time-series data where time is discrete and N(t) is a measurement of the number of individuals or objects recorded over the time-interval [t-1, t].
DEFINITION OF EVENT Events: sustained (bursty) periods of anomalous behavior; elsewhere the term sometimes refers to individual measurements. Here, an event is a large-scale activity that is unusual relative to normal patterns, such as a large meeting in a building, a malicious attack on a Web server, or a traffic accident on a freeway. Chicken-and-egg problem: detecting events requires some knowledge of what constitutes normal behavior, but the historical data consist of both normal and anomalous (event) data mixed together.
Goal: define a model of uncertainty (how unusual is the measurement?) that additionally incorporates a notion of event persistence. Learn a model that reflects the bimodal nature of such data, namely normal traffic patterns to which additional counts caused by aperiodic events are occasionally added.
RELATED WORK
Techniques: Markov models; likelihood-based methods; a combination of Poisson models and Bayesian estimation methods; infinite automaton. Common goal: detect novel and unusual data points or segments in time series.
DATA SET CHARACTERISTICS
BUILDING DATA
FREEWAY TRAFFIC DATA
Holiday data should be removed before modeling, because behavior on holidays differs from the normal weekly pattern.
A BASELINE MODEL AND ITS LIMITATIONS
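Threshold test based on a Poisson model: estimate the Poisson rate $\lambda$ for a particular time and day by averaging the observed counts at the same time on similar days; this average is the maximum likelihood estimate. An event is declared when the observed count $N$ is sufficiently improbable under the estimated Poisson model and $\lambda < N$.
Limitations: the test is adequate when events cause a large increase in the counts. It fails when facing the chicken-and-egg problem, because events present in the historical data distort the estimate of normal behavior. The choice of threshold trades missed events against false alarms.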
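The baseline test can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the tail threshold `eps`, and the toy counts are all assumptions introduced here.

```python
import math

def poisson_pmf(n, lam):
    """P(N = n) under a Poisson distribution with rate lam > 0."""
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))

def baseline_detect(history, observed, eps=1e-3):
    """Threshold test: flag `observed` if it is improbable and above the rate.

    history  -- counts seen at the same time slot on similar days
    eps      -- hypothetical probability threshold (an assumption)
    """
    lam = sum(history) / len(history)   # maximum likelihood estimate of the rate
    is_event = poisson_pmf(observed, lam) < eps and observed > lam
    return lam, is_event

history = [10, 12, 9, 11, 8]            # toy counts, for illustration only
lam, flag_high = baseline_detect(history, 30)
_, flag_normal = baseline_detect(history, 11)
print(lam, flag_high, flag_normal)
```

Note that `history` here is assumed to be free of events; if it is not, the average is pulled upward, which is the chicken-and-egg problem in miniature.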
PROBABILISTIC MODELING
Model: $N(t) = N_0(t) + N_E(t)$. Normal behavior: $N_0(t)$. Event-caused: $N_E(t)$.
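MODELING PERIODIC COUNT DATA Poisson distribution: $N_0(t) \sim \mathrm{Poisson}(\lambda(t))$ with $\lambda(t) = \lambda_0\,\delta_{d(t)}\,\eta_{d(t),h(t)}$. $d(t)$ indicates the weekday on which time $t$ falls; $h(t)$ indicates the interval of the day in which time $t$ falls; $\delta$ and $\eta$ are the day and time-of-day effects.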
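A small sketch may help make the rate decomposition concrete. It assumes the rate factorizes as $\lambda(t) = \lambda_0\,\delta_{d(t)}\,\eta_{d(t),h(t)}$, that $t = 0$ is the first interval of weekday 0, and that the effects are normalized so that $\sum_j \delta_j = 7$ and $\sum_i \eta_{j,i} = D$ (with $D$ intervals per day). The function name and all numbers are illustrative assumptions.

```python
def rate(t, lam0, delta, eta, D):
    """lambda(t) = lam0 * delta[d(t)] * eta[d(t)][h(t)].

    Assumes t counts intervals from the start of weekday 0,
    with D intervals per day (indexing convention is an assumption).
    """
    d = (t // D) % 7          # d(t): weekday on which t falls
    h = t % D                 # h(t): interval of the day in which t falls
    return lam0 * delta[d] * eta[d][h]

D = 4                                               # toy value: 4 intervals per day
lam0 = 10.0                                         # average rate
delta = [1.2, 1.2, 1.2, 1.2, 1.2, 0.5, 0.5]         # day effects, sum to 7
eta = [[0.4, 1.6, 1.6, 0.4] for _ in range(7)]      # time-of-day effects, each row sums to D

week = [rate(t, lam0, delta, eta, D) for t in range(7 * D)]
mean_rate = sum(week) / len(week)
print(mean_rate)
```

Under this normalization the weekly mean of $\lambda(t)$ equals $\lambda_0$, so $\lambda_0$ reads as the average rate while $\delta$ and $\eta$ only redistribute it across days and intervals.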
MODELING PERIODIC COUNT DATA The effect of $\delta_{d(t)}$
MODELING PERIODIC COUNT DATA The effect of $\eta_{d(t),h(t)}$
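MODELING RARE, PERSISTENT EVENTS Use a binary process $z(t)$ to indicate the presence ($z(t) = 1$) or absence ($z(t) = 0$) of an event. Transition probability matrix, with rows indexed by the current state: $M_z = \begin{pmatrix} 1 - z_0 & z_0 \\ z_1 & 1 - z_1 \end{pmatrix}$. The length of each period between events is geometric with expected value $1/z_0$; the length of each event is geometric with expected value $1/z_1$. Priors are placed on $z_0$ and $z_1$.
$N_E(t)$: zero when $z(t) = 0$; when $z(t) = 1$, Poisson with rate $\gamma(t)$, where $\gamma(t)$ is independent at each time $t$.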
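The claim about expected lengths can be checked by simulation. The sketch below assumes $z_0$ is the probability of moving from the no-event state to the event state and $z_1$ the probability of moving back; the helper names and parameter values are assumptions chosen for illustration.

```python
import random

def simulate_z(T, z0, z1, seed=0):
    """Simulate the two-state chain: z0 = P(0 -> 1), z1 = P(1 -> 0)."""
    rng = random.Random(seed)
    z, path = 0, []
    for _ in range(T):
        path.append(z)
        p_switch = z0 if z == 0 else z1
        if rng.random() < p_switch:
            z = 1 - z
    return path

def mean_run_length(path, state):
    """Average length of consecutive runs of `state` in `path`."""
    runs, length = [], 0
    for z in path:
        if z == state:
            length += 1
        elif length:
            runs.append(length)
            length = 0
    if length:
        runs.append(length)   # final run may be cut short by the end of the path
    return sum(runs) / len(runs)

path = simulate_z(200_000, z0=0.05, z1=0.25)
print(mean_run_length(path, 0), 1 / 0.05)   # gap between events vs. 1/z0
print(mean_run_length(path, 1), 1 / 0.25)   # event duration vs. 1/z1
```

Small $z_0$ makes events rare and small $z_1$ makes them persistent, which is how the chain encodes both properties named in the slide title.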
Markov-modulated Poisson model
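Putting the pieces together, the generative process can be simulated as below. This is a sketch under stated assumptions: the event rate $\gamma(t)$ is drawn from a Gamma distribution (shape and scale values are hypothetical), the normal rate is passed in as a precomputed list, and the Poisson sampler is Knuth's multiplication method, suitable only for modest rates.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for modest rates."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def simulate_counts(lam, z0, z1, shape, scale, seed=0):
    """Return (z, N0, NE, N) per time step, with N = N0 + NE.

    lam          -- list of normal rates lambda(t), one per time step
    shape, scale -- assumed Gamma parameters for the event rate gamma(t)
    """
    rng = random.Random(seed)
    z, rows = 0, []
    for rate_t in lam:
        n0 = sample_poisson(rate_t, rng)              # normal counts
        if z == 1:
            gamma_t = rng.gammavariate(shape, scale)  # fresh, independent draw each t
            ne = sample_poisson(gamma_t, rng)         # event counts
        else:
            ne = 0
        rows.append((z, n0, ne, n0 + ne))
        if rng.random() < (z0 if z == 0 else z1):     # Markov transition
            z = 1 - z
    return rows

rows = simulate_counts([5.0] * 500, z0=0.05, z1=0.25, shape=5.0, scale=4.0)
print(rows[:5])
```

Only the last element of each tuple would be observed in practice; the first three are the hidden variables that inference has to recover.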
LEARNING AND INFERENCE
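MCMC: Markov chain Monte Carlo methods. The basic idea of Monte Carlo methods: to solve a problem, construct a suitable probabilistic model or random process whose parameters (such as the probability of an event or the expected value of a random variable) equal the solution to the problem; then repeatedly draw random samples from the model or process, analyze the results statistically, and compute the parameter of interest to obtain an approximate solution. Hidden variables: $\{z(t), N_0(t), N_E(t)\}$
SAMPLING THE HIDDEN VARIABLES GIVEN PARAMETERS Likelihood functions $p(N(t) \mid z(t))$ are used to sample $z(t)$. If $z(t) = 0$, then $N_0(t) = N(t)$ and $N_E(t) = 0$. If $z(t) = 1$, $N_0(t)$ is sampled and $N_E(t) = N(t) - N_0(t)$.
SAMPLING THE PARAMETERS GIVEN THE COMPLETE DATA Assume an integral number of weeks: $T = 7 \cdot D \cdot W$, with $D$ intervals per day and $W$ weeks. The complete-data likelihood for the periodic component involves only $\lambda_0$, $\delta$ and $\eta$, and depends on the data only through sufficient statistics.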
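As a toy instance of the Monte Carlo idea (plain Monte Carlo, not MCMC), the sketch below estimates a Poisson tail probability by repeated sampling and compares it with the exact value. The rate, cutoff, number of draws, and function names are illustrative assumptions.

```python
import math
import random

def sample_poisson(lam, rng):
    """Count arrivals of a rate-lam Poisson process in a unit interval."""
    t, k = rng.expovariate(lam), 0
    while t <= 1.0:
        k += 1
        t += rng.expovariate(lam)
    return k

def mc_tail(lam, k, draws=100_000, seed=0):
    """Monte Carlo estimate of P(N >= k) for N ~ Poisson(lam)."""
    rng = random.Random(seed)
    hits = sum(sample_poisson(lam, rng) >= k for _ in range(draws))
    return hits / draws

def exact_tail(lam, k):
    """Exact P(N >= k), computed from the Poisson probability mass function."""
    below = sum(math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))
                for n in range(k))
    return 1.0 - below

estimate = mc_tail(10.0, 15)
exact = exact_tail(10.0, 15)
print(estimate, exact)
```

Here the quantity of interest is a probability, the random process is the Poisson sampler, and the statistical analysis is simply the fraction of draws that land in the tail.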
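One standard way to draw the whole sequence $z(t)$ given per-step likelihoods is forward filtering followed by backward sampling. The sketch below is an assumption about how this step could be done, not a transcription of the authors' code: it takes the likelihood values $p(N(t) \mid z(t))$ as precomputed inputs and starts the chain from its stationary distribution.

```python
import random

def sample_z(lik, z0, z1, rng):
    """Draw z(1..T) from its posterior given lik[t] = (p(N|z=0), p(N|z=1))."""
    M = [[1 - z0, z0], [z1, 1 - z1]]              # M[i][j] = P(z(t+1)=j | z(t)=i)
    pred = [z1 / (z0 + z1), z0 / (z0 + z1)]       # stationary start (an assumption)
    alphas = []
    for l in lik:                                 # forward pass
        a = [pred[j] * l[j] for j in (0, 1)]
        s = a[0] + a[1]
        a = [a[0] / s, a[1] / s]                  # filtered p(z(t) | N(1..t))
        alphas.append(a)
        pred = [a[0] * M[0][j] + a[1] * M[1][j] for j in (0, 1)]
    T = len(lik)
    z = [0] * T
    z[-1] = 1 if rng.random() < alphas[-1][1] else 0
    for t in range(T - 2, -1, -1):                # backward pass
        w = [alphas[t][i] * M[i][z[t + 1]] for i in (0, 1)]
        z[t] = 1 if rng.random() < w[1] / (w[0] + w[1]) else 0
    return z

# Toy likelihoods: three clearly normal steps, three clearly anomalous, two normal.
lik = [(1.0, 1e-12)] * 3 + [(1e-12, 1.0)] * 3 + [(1.0, 1e-12)] * 2
z = sample_z(lik, z0=0.05, z1=0.25, rng=random.Random(0))
print(z)
```

With real data the likelihoods are far less decisive, and the transition probabilities are what keep isolated high counts from being labeled as events on their own.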
Posterior distributions
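As one example of such a posterior, suppose $\lambda_0$ has a Gamma prior with shape $a$ and rate $b$, and the effects are normalized so that $\sum_t \delta_{d(t)}\eta_{d(t),h(t)} = T$ over an integral number of weeks. Then the complete-data likelihood is proportional to $\lambda_0^{S} e^{-\lambda_0 T}$ with $S = \sum_t N_0(t)$, and the posterior is Gamma with shape $a + S$ and rate $b + T$. The prior form, its default values, and the function names below are assumptions made for this sketch.

```python
import random

def lambda0_posterior(n0, a=1.0, b=1.0):
    """Gamma posterior (shape, rate) for lam0 given normal counts n0.

    Assumes a Gamma(a, b) prior (b is a rate) and the normalization
    sum_t delta * eta = T; both are assumptions of this sketch.
    """
    S = sum(n0)          # sufficient statistic: total normal count
    T = len(n0)          # number of time steps
    return a + S, b + T

def sample_lambda0(n0, rng, a=1.0, b=1.0):
    """Draw lam0 from its conditional posterior (one Gibbs step)."""
    shape, rate_param = lambda0_posterior(n0, a, b)
    return rng.gammavariate(shape, 1.0 / rate_param)   # gammavariate takes a scale

n0 = [9, 11, 10, 12, 8, 10, 10]      # toy data: one week, D = 1 interval per day
shape, rate_param = lambda0_posterior(n0)
rng = random.Random(0)
draws = [sample_lambda0(n0, rng) for _ in range(2000)]
print(shape / rate_param, sum(draws) / len(draws))
```

The data enter only through $S$ and $T$, which is what makes this step cheap inside a sampler.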
ADAPTIVE EVENT DETECTION
ESTIMATING EVENT ATTENDANCE
CONCLUSION
Described a framework for building a probabilistic model of time-varying counting processes in which we observe a superposition of time-varying but regular (periodic) processes and aperiodic processes. Applied this model to two different time series of counts, each spanning several months. Described how the parameters of the model may be estimated using MCMC sampling.