
On nonparametric estimation of Hawkes models for earthquakes and DRC Monkeypox
Frederic Paik Schoenberg, UCLA Statistics
Collaborators: Joshua Gordon, Ryan Harrigan. Also thanks to: SCEC, NCEC, & WHO datasets.

1. Background and motivation.
2. Hawkes models and ETAS.
3. Nonparametric Marsan and Lengliné (2008) estimator.
4. Proposed analytic solution.
5. Computation time and performance comparison.
6. Application to Loma Prieta earthquake data.
7. Application to DRC Monkeypox data.
8. Concluding remarks.

1. Background and motivation.
* History of numerous models for earthquake forecasting, mostly failures (elastic rebound, water levels, radon levels, animal signals, quiescence, electromagnetic signals, characteristic earthquakes, AMR, Coulomb stress change, etc.).
* Skepticism among many in the seismological community toward all probabilistic forecasts.
* Different models can have similar fit and very different implications for forecasts. (E.g. Pareto vs. tapered Pareto for seismic moments: fitting these by MLE to 3765 shallow worldwide events with M ≥ 5.8 from 1977-2000, the Pareto says there should be an event of M ≥ 10.0 every 10^2 years, the tapered Pareto every 10^436 years. The fitted Pareto predicts an event with M ≥ 12 every 10,500 years, the tapered Pareto every 10^43,400 years.)
* Model evaluation techniques and forecasting experiments like RELM/CSEP, to discriminate among competing models and improve them, are very important.
* We also need non-parametric alternatives to these models.

1. Background and motivation. [Figure: Kagan and Schoenberg (2001).] * We also need non-parametric alternatives to these models.


2. Hawkes models and ETAS.
Probabilistic models for point processes typically involve modeling the conditional rate λ(t,x,y,m) = the expected rate of accumulation of points at time t, location (x,y), and magnitude m, given the history of all previous events. λ is defined as a non-negative predictable process such that ∫(dN − λ dt) is a martingale.
Hawkes (1971) modeled λ(t) = μ + K Σ_{i: t_i < t} g(t − t_i). Here μ ≥ 0 is the background rate, K is the productivity with 0 ≤ K ≤ 1, and g is the triggering density, satisfying ∫ g(u) du = 1.
Ogata (1988) proposed the Epidemic-Type Aftershock Sequence (ETAS) model, a Hawkes model where λ(t) = μ + K Σ_{i: t_i < t} g(t − t_i; m_i), with g(u; m_i) = (u + c)^{−p} exp{a(m_i − M_0)}.
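As a concrete reading of this definition, here is a minimal sketch (in Python; the exponential choice of g and all names are illustrative, not from the talk) evaluating the conditional intensity at a time t:

```python
import numpy as np

def hawkes_intensity(t, history, mu, K, g):
    """lambda(t) = mu + K * sum of g(t - t_i) over past events t_i < t."""
    history = np.asarray(history)
    return mu + K * np.sum(g(t - history[history < t]))

# an illustrative triggering density integrating to one: g(u) = a * exp(-a*u), u > 0
g_exp = lambda u, a=1.0: a * np.exp(-a * u)
rate = hawkes_intensity(2.5, [0.3, 1.1, 2.0], mu=0.2, K=0.5, g=g_exp)
```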

Temporal aftershock activity is described by the modified Omori law, K/(u + c)^p. [Figure from Marsan and Lengliné (2008).]

2. Hawkes models and ETAS.
Ogata (1998) extended this to a space-time-magnitude model where λ(t,x,m) = μ(x) + K Σ_{i: t_i < t} g(t − t_i, x − x_i; m_i), with e.g. g(u, x; m_i) = (u + c)^{−p} exp{a(m_i − M_0)} (||x||^2 + d)^{−q}.
The parameters of Hawkes models are typically fit by maximizing the log-likelihood, log L(θ) = Σ_i log λ(t_i) − ∫ λ(t) dt.
But can we estimate the triggering function g non-parametrically?
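For the purely temporal ETAS kernel above, that log-likelihood can be written out directly. A sketch (function and parameter names are illustrative; the closed-form Omori integral assumes p ≠ 1):

```python
import numpy as np

def etas_loglik(times, mags, T, mu, K, c, p, a, M0):
    """log L = sum_i log lambda(t_i) - integral_0^T lambda(t) dt, for
    lambda(t) = mu + K * sum_{t_i < t} (t - t_i + c)^(-p) * exp(a*(m_i - M0))."""
    times, mags = np.asarray(times), np.asarray(mags)
    ll = 0.0
    for i in range(len(times)):
        lags = times[i] - times[:i]
        lam = mu + K * np.sum((lags + c) ** (-p) * np.exp(a * (mags[:i] - M0)))
        ll += np.log(lam)
    # each Omori term integrates in closed form over (t_i, T], assuming p != 1
    tails = ((T - times + c) ** (1 - p) - c ** (1 - p)) / (1 - p)
    return ll - mu * T - K * np.sum(np.exp(a * (mags - M0)) * tails)
```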

3. Nonparametric Marsan and Lengliné (2008) estimator.
Marsan and Lengliné (2008) assume g is a step function and estimate the steps β_k as parameters. Setting the partial derivatives of this log-likelihood with respect to the steps β_k to zero yields Σ_i #{j: t_i − t_j ∈ U_k} / λ(t_i) = n |U_k|, where |U_k| is the width of step k, for k = 1, 2, ..., p. This is a system of p equations in p unknowns. However, the equations are nonlinear: they depend on 1/λ(t_i). Gradient descent methods are far too slow for large p.
Marsan and Lengliné (2008) instead find approximate maximum likelihood estimates using the E-M method for point processes. You pick initial values of the parameters; given those, you know the probability that event i triggered event j. Using these, you can weight each pair of points by its probability, re-estimate the parameters, and repeat until convergence. This method works well but is iterative and time-consuming; a sketch of one such pass follows.
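A rough sketch of one EM pass for a temporal Hawkes model with step-function g, treating μ, K, and a common step width U as known for simplicity (these simplifications, and all names, are ours, not from the paper):

```python
import numpy as np

def em_update(times, mu, K, beta, U):
    """One EM pass: the E-step computes the probability that each earlier event
    triggered event i; the M-step re-estimates the step heights from those weights."""
    times, beta = np.asarray(times), np.asarray(beta, dtype=float)
    n, p = len(times), len(beta)
    new_beta = np.zeros(p)
    for i in range(1, n):
        lags = times[i] - times[:i]
        k = (lags // U).astype(int)                  # step index of each lag
        g = np.where(k < p, beta[np.minimum(k, p - 1)], 0.0)
        lam_i = mu + K * g.sum()                     # lambda(t_i) under current fit
        w = K * g / lam_i                            # triggering probabilities
        np.add.at(new_beta, np.minimum(k, p - 1), w)
    return new_beta / (K * n * U)                    # renormalized step heights
```

Iterating em_update until the step heights stabilize gives the approximate MLE; the many repeated O(n^2) passes are what make the method time-consuming.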

3. Nonparametric Marsan and Lengliné (2008) estimator.
The last step of Marsan and Lengliné essentially uses a histogram estimator. Others have used slightly different approaches for smoothing. Lewis and Mohler (2011) use maximum penalized likelihood. Bacry et al. (2012) use the Laplace transform of the covariance function. Adelfio and Chiodi (2015) use a semi-parametric method where the background rate μ is estimated nonparametrically and the triggering function g parametrically. There are also standard non-parametric methods for smoothing point patterns generally, using splines, kernels, or wavelets (Brillinger 1997). Marsan and Lengliné's method is different: it estimates g, not the overall rate.

4. Proposed analytic solution.
Set p = n (p = the number of steps in the step function g, and n = the number of observed points). Setting the derivatives of the log-likelihood to zero, we have the p equations Σ_i #{j: t_i − t_j ∈ U_k} / λ(t_i) = n |U_k|, for k = 1, 2, ..., p, which are p linear equations in terms of 1/λ(t_i), for i = 1, 2, ..., n. (!) So, if p = n, then we can use these equations to solve for 1/λ(t_i); if we know 1/λ(t_i), then we know λ(t_i); and if we know λ(t_i), then we can solve for the β_i, because by the definition of a Hawkes process λ(t_i) = μ + K Σ_{j: t_j < t_i} g(t_i − t_j), which yields n linear equations in the p unknowns β_1, β_2, ..., β_p when g is a step function.

4. Proposed analytic solution.
We can write the resulting estimator in very condensed form. Let λ = {λ(t_1), λ(t_2), ..., λ(t_n)}. Suppose the steps of g have equal widths, |U_1| = |U_2| = ...; call this common width U. Let A_ij = the number of points t_k such that t_j − t_k ∈ U_i, for i, j in {1, 2, ..., p}.
Then the log-likelihood derivative equations can be rewritten 0 = KA(1/λ) − Kb, where b = nU·1, with 1 = {1, 1, ..., 1} and 1/λ taken elementwise. This has solution 1/λ = A^{−1}b, if A is invertible. Similarly, the Hawkes equation can be rewritten λ = μ + KA^T β, whose solution is β̂ = (KA^T)^{−1}(λ − μ).
Combining these two formulas yields the estimates β̂ = (KA^T)^{−1}(1/(A^{−1}b) − μ), with the reciprocal again elementwise. This is very simple, trivial to program, and rapid to compute.
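A minimal sketch of that computation under the stated assumptions (equal-width steps, μ and K known; all names are ours). Note that A built this way is frequently singular in practice, e.g. its first column here is zero, which is exactly the issue the next slide takes up:

```python
import numpy as np

def analytic_triggering_estimate(times, mu, K):
    """Sketch of the closed-form estimator: solve 1/lambda = A^{-1} b,
    then beta-hat = (K A^T)^{-1} (lambda - mu)."""
    times = np.asarray(times)
    n = len(times)
    U = (times[-1] - times[0]) / n                # one (assumed) common step width
    A = np.zeros((n, n))                          # A[i, j] = #{k : t_j - t_k in U_i}
    for j in range(1, n):
        idx = ((times[j] - times[:j]) // U).astype(int)
        for i in idx[idx < n]:
            A[i, j] += 1
    b = n * U * np.ones(n)
    lam = 1.0 / np.linalg.solve(A, b)             # lambda(t_i), elementwise reciprocal
    return np.linalg.solve(K * A.T, lam - mu)     # step heights beta-hat

# np.linalg.solve raises LinAlgError when A is singular; see the next slide.
```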

4. Proposed analytic solution. There are problems, however.
1. Estimating p = n steps! High variance. However, if we can assume g is smooth, then we can smooth our estimates for stability.
2. We need to estimate K and μ too. We can use Marsan and Lengliné's E-M approach for these, or take derivatives of the log-likelihood with respect to them as well.
3. What about spatially-varying steps and unequally sized steps for g? No problem: the estimation generalizes in a completely obvious way.
4. A can be singular. We may need better solutions for this. I let u_j = t_j − t_{j−1}, sorted the u_j values, and then used [u_(1), u_(2)), etc. as my bins, so each row and column of A would have at least one non-zero entry (see the sketch below). If A is still singular, adding a few random 1's into A often helps.
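A sketch of that binning trick, reusing the `times` array from the earlier sketches (the handling of the final, open-ended bin is our assumption):

```python
import numpy as np

gaps = np.sort(np.diff(times))     # sorted interevent times u_(1) <= ... <= u_(n-1)
# bins [u_(k), u_(k+1)): the k-th smallest gap lands exactly on the k-th left
# edge, so each row and column of A picks up at least one nonzero entry
edges = np.append(gaps, np.inf)
```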


Note: take the simple case of a dataset where point i is only influenced by point i − 1. This is basically a renewal process, and we are just estimating a renewal density. Here A = I, K = 1, and we get the density estimator 1/{n(x_i − x_{i−1})}.


5. Computation time and performance comparison. Proof of concept: examples with exponential, truncated normal, uniform, and Pareto triggering functions g.

5. Computation time and performance comparison. Triangles = Marsan and Lengliné (2008) method. Circles = analytic method.

6. Application to Loma Prieta earthquake data. The Loma Prieta earthquake was Mw 6.9 on Oct 17, 1989. As an illustration, we estimate g on its 5566 aftershocks with M ≥ 3 within 15 months.

6. Application to Loma Prieta earthquake data. [Image: SCEC.ORG]
Estimated triggering function for 5567 Loma Prieta M ≥ 3 events, 10/16/1989 to 1/17/1990. The solid curve is the analytic method and the dashed curve is Marsan and Lengliné (2008). Dotted curves are the analytic estimates +/− 1 or 2 SEs (light grey and dark grey, respectively). SEs were computed using the SD of analytic estimates in 100 simulations of Hawkes processes with triggering functions sampled from the solid curve; a sketch of this simulation follows.
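One way to carry out those simulations is via the branching (cluster) representation of a Hawkes process. A sketch, assuming K < 1 and a fitted step function described by hypothetical `beta` (step heights) and finite `edges` (step boundaries):

```python
import numpy as np

def simulate_hawkes(T, mu, K, beta, edges, rng=np.random.default_rng(0)):
    """Simulate a Hawkes process on [0, T]: Poisson(mu*T) immigrants, each
    event spawning Poisson(K) offspring with lags drawn from the fitted
    step-function triggering density g."""
    beta, edges = np.asarray(beta), np.asarray(edges)
    probs = beta * np.diff(edges)
    probs = probs / probs.sum()                   # probability mass of each step
    events = list(rng.uniform(0.0, T, rng.poisson(mu * T)))
    queue = list(events)
    while queue:
        parent = queue.pop()
        for _ in range(rng.poisson(K)):           # offspring of this event
            k = rng.choice(len(probs), p=probs)   # pick a step of g
            child = parent + rng.uniform(edges[k], edges[k + 1])
            if child < T:
                events.append(child)
                queue.append(child)
    return np.array(sorted(events))
```

Re-estimating g on each of 100 such simulated catalogs and taking the per-step standard deviation gives the plotted SEs.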


7. Application to DRC Monkeypox data. 566 investigated cases reported by WHO with incident date in 2005 or 2006. Times within each day were randomized uniformly.


8. Concluding remarks.
The idea is to let p = n, set the derivatives of the log-likelihood to zero, solve for 1/λ_i and therefore obtain λ_i, and then solve for β.
a. We find major computation time savings from this method. For datasets of only 100-300 points the savings are negligible. However, for 5,000 points, the Marsan and Lengliné (2008) algorithm with 100 iterations takes about 7 hours, whereas the analytic method here takes 1.3 minutes. This speed facilitates computations like simulation-based confidence intervals.
b. How far can this go? It extends very readily to space-time-magnitude models and to estimation of μ. Would this work for other types of models too? What are the limits on this method?
c. What about when A is singular? More work is needed.
d. What is the meaning of confidence intervals in this context?