Bayesian and Least Squares fitting:
Problem: given data (d) and a model (m) with adjustable parameters (x), what are the best values and uncertainties for the parameters?
Given Bayes’ Theorem: prob(m|d,I) ∝ prob(d|m,I) × prob(m|I)
+ Gaussian data uncertainties + flat priors
Then log( prob(m|d) ) ≈ constant − χ²_data / 2
and maximizing log( prob(m|d) ) is equivalent to minimizing chi-squared (i.e., least squares)
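A short worked version of this step, as a sketch assuming independent Gaussian data uncertainties σ_i and flat priors (consistent with the slide above):

```latex
\begin{aligned}
\mathrm{prob}(m\,|\,d,I) &\propto \mathrm{prob}(d\,|\,m,I)\,\mathrm{prob}(m\,|\,I)
  \propto \prod_i \frac{1}{\sigma_i\sqrt{2\pi}}\,
  \exp\!\left[-\frac{(d_i-m_i)^2}{2\sigma_i^2}\right] \\
\log \mathrm{prob}(m\,|\,d,I) &= \mathrm{constant}
  - \frac{1}{2}\sum_i \frac{(d_i-m_i)^2}{\sigma_i^2}
  = \mathrm{constant} - \frac{\chi^2_{\rm data}}{2},
\end{aligned}
```

so the parameter values that maximize the posterior are exactly those that minimize χ²_data.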
and x_fitted = x|_0 + Δx, where the correction Δx solves r = P Δx
Weighted least-squares:
Least squares equations: r = P Δx, where P_ij = ∂m_i/∂x_j
Each datum’s equation: r_i = Σ_j P_ij Δx_j
Dividing both sides of each equation by that datum’s uncertainty σ_i, i.e., r_i → r_i/σ_i and P_ij → P_ij/σ_i for each j, gives the variance-weighted solution.
Including priors in least-squares:
Least squares equations: r = d − m = P Δx, where P_ij = ∂m_i/∂x_j
Each datum’s equation: r_i = Σ_j P_ij Δx_j
The weighted data (residuals) need not be homogeneous: r = ( d − m )/σ can be composed of N “normal” data and some “prior-like data”.
Possible prior-like datum: x_k = v_k ± σ_k (for the k-th parameter)
Then r_N+1 = ( v_k − x_k )/σ_k and P_N+1,j = 1/σ_k for j = k, 0 for j ≠ k
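A minimal numpy sketch of these two slides: variance-weighted least squares with a prior folded in as one extra “datum” row. The quadratic model, the data, and the prior values are illustrative stand-ins, not from the source.

```python
import numpy as np

# Illustrative data for the linear model m = x1 + x2*t + x3*t^2
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 20)
d = 1.0 + 0.5 * t + 0.1 * t**2 + rng.normal(0.0, 0.3, t.size)
sigma = np.full(t.size, 0.3)                       # data uncertainties

P = np.column_stack([np.ones_like(t), t, t**2])    # P_ij = dm_i/dx_j
x0 = np.zeros(3)                                   # starting parameter values
r = d - P @ x0                                     # residuals d - m(x0)

# Variance weighting: divide each equation by its datum uncertainty
Pw = P / sigma[:, None]
rw = r / sigma

# Prior-like datum on parameter k: x_k = v_k +/- s_k
k, v_k, s_k = 2, 0.1, 0.05
prior_row = np.zeros(3)
prior_row[k] = 1.0 / s_k                           # P_{N+1,j} = 1/s_k for j = k
Pw = np.vstack([Pw, prior_row])
rw = np.append(rw, (v_k - x0[k]) / s_k)            # r_{N+1} = (v_k - x_k)/s_k

# Solve the normal equations: dx = (P^T P)^-1 P^T r
dx = np.linalg.solve(Pw.T @ Pw, Pw.T @ rw)
print(x0 + dx)
```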
Non-linear models:
Least squares equations: r = P Δx, where P_ij = ∂m_i/∂x_j, have the solution Δx = (PᵀP)⁻¹ Pᵀ r
If the partial derivatives of the model are independent of the parameters, the first-order Taylor expansion is exact and applying the parameter corrections Δx gives the final answer. Example linear problem: m = x_1 + x_2 t + x_3 t²
If not, you have a non-linear problem and the 2nd- and higher-order terms in the Taylor expansion can be important until Δx → 0, so iteration is required. Example non-linear problem: m = sin( x t )
How to calculate partial derivatives (P_ij):
Analytic formulae (if the model can be expressed analytically)
Numerical evaluation: “wiggle” the parameters one at a time: x^w = x, except for the j-th parameter, x_j^w = x_j + Δx
Partial derivative of the i-th datum with respect to parameter j: P_ij = ( m_i(x^w) − m_i(x) ) / ( x_j^w − x_j )
NB: choose Δx small enough to avoid 2nd-order errors, but large enough to avoid numerical inaccuracies. Always use 64-bit computations!
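A minimal sketch combining the two slides above: numerical (“wiggle”) partial derivatives plus iterated least squares for the non-linear example m = sin( x t ). The data, noise level, and step-size choice are illustrative assumptions.

```python
import numpy as np

# Illustrative data for the non-linear model m = sin(x * t)
def model(x, t):
    return np.sin(x[0] * t)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 50)
d = model(np.array([1.3]), t) + rng.normal(0.0, 0.05, t.size)
sigma = np.full(t.size, 0.05)

x = np.array([1.0])                              # starting guess
for iteration in range(10):
    r = (d - model(x, t)) / sigma                # weighted residuals
    # Numerical partials: P_ij = (m_i(x_wiggled) - m_i(x)) / dx_j
    P = np.zeros((t.size, x.size))
    for j in range(x.size):
        dx_j = 1e-6 * max(abs(x[j]), 1.0)        # small, but not too small
        xw = x.copy()
        xw[j] += dx_j
        P[:, j] = (model(xw, t) - model(x, t)) / dx_j / sigma
    dx = np.linalg.solve(P.T @ P, P.T @ r)       # dx = (P^T P)^-1 P^T r
    x = x + dx
    if np.all(np.abs(dx) < 1e-8):                # iterate until dx -> 0
        break
print(x)
```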
Can do very complicated modeling:
Example problem: model the pulsating photosphere of Mira variables (see Reid & Goldston 2002, ApJ, 568, 931)
Data: observed flux, S(t, λ), at radio, IR and optical wavelengths
Model: assume power-law temperature, T(r,t), and density, ρ(r,t); calculate the opacity sources (ionization equilibrium, H2 formation, …); numerically integrate the radiative transfer along ray paths through the atmosphere for many impact parameters and wavelengths; parameters include T_0 and ρ_0 at radius r_0
Even though the model is complicated and not analytic, one can easily calculate the partials numerically and solve for the best parameter values.
Modeling Mira Variables:
Visual: Δm_V ~ 8 mag, roughly a factor of 1000 in flux; variable formation of TiO clouds at ~2 R_* with top T ~ 1400 K
IR: seeing the pulsating stellar surface
Radio: H free-free opacity at ~2 R_*
Iteration and parameter adjustment:
Least squares equations: r = P Δx, where P_ij = ∂m_i/∂x_j, have the solution Δx = (PᵀP)⁻¹ Pᵀ r
It is often better to make parameter adjustments slowly, so for the (k+1)-th iteration set x|_k+1 = x|_k + λ Δx|_k, where 0 < λ < 1
NB: this is equivalent to scaling the partial derivatives (by 1/λ). So, if one iterates enough, one only needs to get the sign of the partial derivatives correct!
Evaluating Fits:
Least squares equations: r = P Δx, where P_ij = ∂m_i/∂x_j, have the solution Δx = (PᵀP)⁻¹ Pᵀ r
Always carefully examine the final residuals (r):
Plot them
Look for >3σ values
Look for non-random behavior
Always look at the parameter correlations:
correlation coefficient: ρ_jk = D_jk / sqrt( D_jj D_kk ), where D = (PᵀP)⁻¹
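A minimal numpy sketch of the correlation-coefficient check; the weighted design matrix P here is an illustrative stand-in.

```python
import numpy as np

# Stand-in weighted design matrix (partials already divided by the data sigmas)
P = np.column_stack([np.ones(20), np.linspace(0.0, 10.0, 20)]) / 0.3

D = np.linalg.inv(P.T @ P)            # parameter covariance matrix
sig = np.sqrt(np.diag(D))             # parameter uncertainties
rho = D / np.outer(sig, sig)          # rho_jk = D_jk / sqrt(D_jj * D_kk)
print(rho)
```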
[Figure: √⟨cos² ωt⟩ illustration; √(1/2) ≈ 0.7, √(1/3) ≈ 0.6, √(1/4) = 0.5]
Bayesian vs Least Squares Fitting: Least Squares fitting: Seeks the best parameter values and their uncertainties Bayesian fitting: Seeks the posteriori probability distribution for parameters
Bayesian Fitting:
Bayesian: what is the posterior probability distribution for the parameters?
Answer: evaluate prob(m|d,I) ∝ prob(d|m,I) × prob(m|I)
If the data and parameter priors have Gaussian distributions, log( prob(m|d) ) ≈ constant − χ²_data/2 − χ²_param_priors/2
“Simply” evaluate this for all (reasonable) parameter values. But this can be computationally challenging: e.g., a modest problem with only 10 parameters, evaluated on a coarse grid of 100 values each, requires 100^10 = 10^20 model calculations!
Markov chain Monte Carlo (McMC) methods:
Instead of a complete exploration of parameter space, avoid regions with low probability and wander about quasi-randomly over the high-probability regions:
“Monte Carlo”: random trials (like the roulette wheel in Monte Carlo casinos)
“Markov chain”: the (k+1)-th trial parameter values are “close to” the k-th values
McMC using the Metropolis-Hastings (M-H) algorithm:
1. Given the k-th model (i.e., values for all parameters in the model), generate the (k+1)-th model by small random changes: x_j|k+1 = x_j|k + β g σ_j, where β is an “acceptance fraction” parameter, g is a Gaussian random number (mean = 0, standard deviation = 1), and σ_j is the width of the posteriori probability distribution of parameter x_j
2. Evaluate the probability ratio: R = prob(m|d)|k+1 / prob(m|d)|k
3. Draw a random number, U, uniformly distributed from 0 → 1
4. If R > U, “accept” and store the (k+1)-th parameter values; else “replace” the (k+1)-th values with a copy of the k-th values and store them (NB: this yields many duplicate models)
The stored parameter values from the M-H algorithm give the posteriori probability distribution of the parameters!
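A minimal sketch of steps 1–4 for the earlier one-parameter example m = sin( x t ); the data, β, and σ_x values are illustrative assumptions.

```python
import numpy as np

# Illustrative data for a one-parameter model m = sin(x * t)
rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 50)
d = np.sin(1.3 * t) + rng.normal(0.0, 0.05, t.size)
sigma_d = 0.05

def log_prob(x):
    # log prob(m|d) ~ constant - chi^2/2 (flat prior; constant dropped)
    return -0.5 * np.sum(((d - np.sin(x * t)) / sigma_d) ** 2)

beta, sigma_x = 1.0, 0.01          # acceptance parameter and guessed posterior width
x = 1.25                           # near-optimum starting value
stored = []
lp = log_prob(x)
for k in range(20000):
    x_trial = x + beta * rng.normal() * sigma_x     # step 1: small random change
    lp_trial = log_prob(x_trial)
    R = np.exp(lp_trial - lp)                       # step 2: probability ratio
    U = rng.uniform()                               # step 3: uniform random number
    if R > U:                                       # step 4: accept ...
        x, lp = x_trial, lp_trial
    stored.append(x)                                # ... else store a duplicate of the k-th model
stored = np.array(stored)
print(stored.mean(), stored.std())
```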
Metropolis-Hastings (M-H) details:
M-H McMC parameter adjustments: x_j|k+1 = x_j|k + β g σ_j
β determines the “acceptance fraction” (start near 1/√N), g is a Gaussian random number (mean = 0, standard deviation = 1), and σ_j is the “sigma” of the posteriori probability distribution of x_j
The M-H “acceptance fraction” should be about 23% for problems with many parameters and about 50% for few parameters; iteratively adjust β to achieve this; decreasing β increases the acceptance rate.
Since one doesn’t know the parameter posteriori uncertainties, σ_j, at the start, one needs to do trial solutions and iteratively adjust them.
When exploring the PDF of the parameters with the M-H algorithm, one should start with near-optimum parameter values, so discard the early “burn-in” trials.
M-H McMC flow:
Enter the data (d) and initial guesses for the parameters (x, σ_prior, σ_posteriori)
Start “burn-in” & “acceptance-fraction” adjustment loops (e.g., ~10 loops)
  start a McMC loop (with, e.g., ~10^5 trials)
    make a new model: x_j|k+1 = x_j|k + β g σ_j,posteriori
    calculate the (k+1)-th log( prob(m|d) ) ≈ constant − χ²_data/2 − χ²_param_priors/2
    calculate the Metropolis ratio: R = exp( log(prob_k+1) − log(prob_k) )
    if R > U_k+1, accept and store the model; if R < U_k+1, replace it with the k-th model and store that
  end McMC loop
  estimate & update σ_posteriori and adjust β for the desired acceptance fraction
End “burn-in” loops
Start the “real” McMC exploration with the latest parameter values, using the final σ_posteriori and β to determine the parameter step sizes
Use a large number of trials (e.g., ~10^6)
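A minimal runnable sketch of this flow for the same one-parameter example, with loop lengths scaled down so it runs quickly; the acceptance-rate floor and all numerical values are illustrative assumptions, not from the source.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 5.0, 50)
d = np.sin(1.3 * t) + rng.normal(0.0, 0.05, t.size)
sigma_d = 0.05

def log_prob(x):
    return -0.5 * np.sum(((d - np.sin(x * t)) / sigma_d) ** 2)   # flat prior

def mcmc(x, beta, sigma_post, n_trials):
    stored, n_accept = [], 0
    lp = log_prob(x)
    for k in range(n_trials):
        x_trial = x + beta * rng.normal() * sigma_post
        lp_trial = log_prob(x_trial)
        if np.exp(lp_trial - lp) > rng.uniform():    # Metropolis ratio vs U
            x, lp = x_trial, lp_trial
            n_accept += 1
        stored.append(x)                             # replaced models stored as duplicates
    return np.array(stored), x, n_accept / n_trials

x, beta, sigma_post = 1.25, 1.0, 0.01
A_desired = 0.5                                      # ~50% for a few parameters
for burn in range(10):                               # burn-in / adjustment loops
    stored, x, A = mcmc(x, beta, sigma_post, 2000)
    sigma_post = max(stored.std(), 1e-6)             # update posterior-width estimate
    beta *= max(A, 0.05) / A_desired                 # beta_{n+1} = (A_n / A_desired) beta_n
                                                     # (floor keeps beta from collapsing)
stored, x, A = mcmc(x, beta, sigma_post, 50000)      # "real" exploration
print(stored.mean(), stored.std(), A)
```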
Estimation of the posteriori σ:
Make a histogram of the trial parameter values (it must cover the full range)
Start bin loop:
  check where the cumulative number of trials crosses “−1σ” (15.9%)
  check where it crosses “+1σ” (84.1%)
End bin loop
Estimates of the (Gaussian) posteriori σ: | p_val(+1σ) − p_val(−1σ) | / 2 and | p_val(+2σ) − p_val(−2σ) | / 4
Relatively robust to non-Gaussian PDFs
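A minimal sketch of this estimate using cumulative percentiles of the stored trials; the trial values here are an illustrative stand-in.

```python
import numpy as np

# Stand-in for the stored M-H trial values of a single parameter
rng = np.random.default_rng(2)
trials = rng.normal(3.0, 0.2, 100000)

# Parameter values where the cumulative distribution crosses "-1 sigma" (15.9%)
# and "+1 sigma" (84.1%)
p_lo, p_hi = np.percentile(trials, [15.9, 84.1])
sigma_1 = 0.5 * (p_hi - p_lo)

# Cross-check with the "+/-2 sigma" crossings (2.3% and 97.7%)
p_lo2, p_hi2 = np.percentile(trials, [2.3, 97.7])
sigma_2 = 0.25 * (p_hi2 - p_lo2)
print(sigma_1, sigma_2)
```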
Adjusting the Acceptance Rate Parameter (β):
Metropolis trial acceptance rule for the (k+1)-th trial model:
if R > U_k+1, accept and store the model; if R < U_k+1, replace it with the k-th model and store that
For the n-th set of M-H McMC trials, count the cumulative numbers of accepted (N_a) and replaced (N_r) models. Acceptance rate: A = N_a / (N_a + N_r)
For the (n+1)-th set of trials, set β_n+1 ≈ ( A_n / A_desired ) β_n
“Non-least-squares” Bayesian fitting:
Sivia gives 2 examples where the data uncertainties are not known Gaussians, hence least squares is non-optimal:
1) prob(σ) = σ_0/σ² for σ ≥ σ_0 (0 otherwise), where σ is the error on a datum, which typically is close to σ_0 (the minimum error) but can occasionally be much larger.
2) prob(ε) = (1−γ) G(ε; σ) + γ G(ε; βσ), a two-Gaussian mixture where γ is the fraction of “bad” data whose uncertainties are a factor β larger.
“Error tolerant” Bayesian fitting:
Sivia’s “conservative formulation”: the data uncertainties are given by prob(σ) = σ_0/σ² for σ ≥ σ_0 (0 otherwise), where σ is the error on a datum, which typically is close to σ_0 (the minimum error) but can occasionally be much larger.
Marginalizing over σ gives
prob(d|m, σ_0) = ∫ prob(d|m, σ) prob(σ) dσ = ∫ [1/(σ√(2π))] exp[−(d−m)²/2σ²] (σ_0/σ²) dσ = [1/(σ_0√(2π))] ( 1 − exp(−R²/2) ) / R², where R = (d−m)/σ_0
Thus, one maximizes ∑_i log[ ( 1 − exp(−R_i²/2) ) / R_i² ], instead of maximizing ∑_i log[ exp(−R_i²/2) ] = ∑_i −R_i²/2 (i.e., least squares).
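A minimal numpy sketch of this error-tolerant log-likelihood; the function name and the small-R guard are my own additions.

```python
import numpy as np

def error_tolerant_loglike(d, m, sigma0):
    """Sum of log[ (1 - exp(-R^2/2)) / R^2 ] terms, plus the 1/(sigma0 sqrt(2 pi)) normalization."""
    R2 = ((d - m) / sigma0) ** 2
    # (1 - exp(-R^2/2)) / R^2 tends to 1/2 as R -> 0; guard the 0/0 case
    frac = np.where(R2 > 1e-12,
                    (1.0 - np.exp(-0.5 * R2)) / np.where(R2 > 1e-12, R2, 1.0),
                    0.5)
    return np.sum(np.log(frac) - np.log(sigma0 * np.sqrt(2.0 * np.pi)))
```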
Data PDFs: Gaussian pdf has sharper peak, giving more accurate parameter estimates (provided all data are good). Error tolerant pdf doesn’t have a large penalty for a wild point, so it will not “care much” about some wild data.
Error tolerant fitting example:
Goal: determine the motions of 100s of maser spots
Data: maps (positions) at 12 epochs
Method: find all spots at nearly the same position over all epochs; then fit for linear motion
Problem: some “extra” spots appear near those selected for fitting (e.g., R > 10). There is too much data to plot, examine and excise by hand.
Error tolerant fitting example: Error tolerant fitting output with no “human intervention”
The “good-and-bad” data Bayesian fitting:
Box & Tiao’s (1968) data uncertainties come in two “flavors”, given by prob(ε) = (1−γ) G(ε; σ) + γ G(ε; βσ), where γ is the fraction of “bad” data, whose uncertainties are a factor β larger.
Marginalizing over the two flavors for Gaussian errors gives
prob(d|m, σ) = ∫ prob(d|m, σ′) prob(σ′) dσ′ = [1/(σ√(2π))] [ (γ/β) exp(−R²/2β²) + (1−γ) exp(−R²/2) ], where R = (d−m)/σ
Thus, one maximizes constant + ∑_i log[ (γ/β) exp(−R_i²/2β²) + (1−γ) exp(−R_i²/2) ], which for no bad data (γ = 0) recovers least squares. But one must estimate 2 extra parameters: γ and β.
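A minimal sketch of this two-Gaussian (“good-and-bad”) log-likelihood; the symbol names gamma and beta follow the reconstruction above and are assumptions.

```python
import numpy as np

def good_bad_loglike(d, m, sigma, gamma, beta):
    """Log-likelihood for a mixture of 'good' data (width sigma) and a
    fraction gamma of 'bad' data (width beta*sigma)."""
    R = (d - m) / sigma
    good = (1.0 - gamma) * np.exp(-0.5 * R**2)
    bad = (gamma / beta) * np.exp(-0.5 * (R / beta) ** 2)
    return np.sum(np.log(good + bad) - np.log(sigma * np.sqrt(2.0 * np.pi)))
```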
Estimation of parameter PDFs:
Bayesian fitting result: a histogram of the M-H trial parameter values (the PDF)
This “integrates” over all values of all other parameters and is the parameter estimate “marginalized” over all other parameters.
Parameter correlations: e.g., plot all trial values of x_i versus x_j
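A minimal sketch (assuming matplotlib is available) of a marginalized histogram and a pairwise correlation plot from stored McMC trials; the samples array is an illustrative stand-in.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for an (n_trials, n_params) array of stored M-H parameter values
rng = np.random.default_rng(3)
samples = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], 10000)

# Marginalized PDF of parameter 0: histogram its column
plt.hist(samples[:, 0], bins=50, density=True)
plt.xlabel("x_0")
plt.ylabel("marginalized PDF")

# Parameter correlations: scatter x_i versus x_j, plus the correlation matrix
plt.figure()
plt.plot(samples[:, 0], samples[:, 1], ",")
plt.xlabel("x_0")
plt.ylabel("x_1")
print(np.corrcoef(samples, rowvar=False))
plt.show()
```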