Lecture 2: Parameter Estimation and Evaluation of Support

Presentation transcript:

Lecture 2: Parameter Estimation and Evaluation of Support (C. Canham)

Parameter Estimation

"The problem of estimation is of more central importance (than hypothesis testing)... for in almost all situations we know that the effect whose significance we are measuring is perfectly real, however small; what is at issue is its magnitude." (Edwards 1992, p. 2)

"An insignificant result, far from telling us that the effect is non-existent, merely warns us that the sample was not large enough to reveal it." (Edwards 1992, p. 2)

Parameter Estimation

Finding maximum likelihood estimates (MLEs):
- Local optimization (optim): gradient methods, simplex (Nelder-Mead)
- Global optimization: simulated annealing (anneal), genetic algorithms (rgenoud)

Evaluating the strength of evidence ("support") for different parameter estimates:
- Support intervals: asymptotic support intervals, simultaneous support intervals
- The shape of likelihood surfaces around MLEs

Parameter estimation: finding peaks on likelihood "surfaces"...

The variation in likelihood across different sets of parameter values defines a likelihood "surface". The goal of parameter estimation is to find the peak of that surface (optimization).

Local vs Global Optimization

"Fast" local optimization methods: a large family of methods, widely used for nonlinear regression in commercial software packages.

"Brute force" global optimization methods: grid search, genetic algorithms, simulated annealing.

Local optimization methods are widely used. Their major limitation arises when there are multiple local optima. They also generally have trouble with discontinuities in the likelihood surface, and some have problems dealing with constraints on parameter values.

(Figure: a likelihood surface with both a global optimum and a local optimum.)

Local Optimization – Gradient Methods

Derivative-based (Newton-Raphson) methods. General approach: vary the parameter estimates systematically and search for zero slope in the first derivative of the likelihood function, using numerical methods to estimate the derivative and checking the second derivative to make sure the point is a maximum, not a minimum.

(Figure: likelihood surface.)
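As a minimal sketch (not the lecture's own example): a gradient-based local search in R's optim(), using a made-up normal model with the standard deviation estimated on the log scale so the search stays unconstrained. BFGS is a quasi-Newton relative of the Newton-Raphson idea described above.

  # Hedged sketch: maximize a toy normal log-likelihood with a quasi-Newton method.
  # The data and parameter names are hypothetical.
  set.seed(1)
  y <- rnorm(50, mean = 10, sd = 2)                     # made-up data

  negloglik <- function(par) {
    mu    <- par[1]
    sigma <- exp(par[2])                                # log-sd keeps sigma positive
    -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))   # optim() minimizes, so negate
  }

  fit <- optim(c(5, 0), negloglik, method = "BFGS")
  c(mean = fit$par[1], sd = exp(fit$par[2]))            # MLEs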

Local Optimization – No Gradient

The simplex (Nelder-Mead) method:
- Much simpler to program
- Does not require calculation or estimation of a derivative
- No general theoretical proof that it works (but lots of happy practitioners...)
- Implemented as method = "Nelder-Mead" in the "optim" function in R (see the sketch below)
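A corresponding sketch with the simplex method named on the slide, using the same toy model (repeated here so the snippet stands alone):

  # Hedged sketch: derivative-free simplex search on a made-up normal model.
  set.seed(1)
  y   <- rnorm(50, 10, 2)                                 # made-up data
  nll <- function(p) -sum(dnorm(y, p[1], exp(p[2]), log = TRUE))
  fit_nm <- optim(c(5, 0), nll, method = "Nelder-Mead")   # Nelder-Mead simplex
  c(mean = fit_nm$par[1], sd = exp(fit_nm$par[2]))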

Global Optimization

"Virtually nothing is known about finding global extrema in general." "There are tantalizing hints that so-called 'annealing methods' may lead to important progress on global (optimization)..."

Quotes from Press et al. (1986), Numerical Recipes.

Global Optimization – Grid Searches

The simplest form of optimization (and rarely used in practice): systematically evaluate the likelihood at a grid of points across parameter space. Can be useful for visualizing the broad features of a likelihood surface.
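A minimal grid-search sketch in R (toy model; the grid ranges are arbitrary), mainly useful for plotting the surface rather than for finding precise MLEs:

  set.seed(1)
  y <- rnorm(50, 10, 2)                            # made-up data
  grid <- expand.grid(mu = seq(8, 12, by = 0.1), sd = seq(1, 4, by = 0.1))
  grid$logLik <- apply(grid, 1, function(p) sum(dnorm(y, p["mu"], p["sd"], log = TRUE)))
  grid[which.max(grid$logLik), ]                   # best grid point (only a coarse estimate)
  # contour() or image() on the reshaped logLik column gives a quick picture
  # of the broad features of the surface.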

Global Optimization – Genetic Algorithms

Based on a fairly literal analogy with evolution:
- Start with a reasonably large "population" of parameter sets
- Calculate the "fitness" (likelihood) of each individual set of parameters
- Create the next generation of parameter sets based on the fitness of the "parents" and various rules for recombination of subsets of parameters (genes)
- Let the population evolve until fitness reaches a maximum asymptote

Implemented in the genoud() function of the "rgenoud" package in R: cool, but slow for large datasets with large numbers of parameters.
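A hedged usage sketch, assuming the rgenoud package is installed; the toy likelihood, bounds, and population size are all made up for illustration:

  library(rgenoud)
  set.seed(1)
  y  <- rnorm(50, 10, 2)                            # made-up data
  ll <- function(p) sum(dnorm(y, p[1], p[2], log = TRUE))
  fit_ga <- genoud(ll, nvars = 2, max = TRUE,       # maximize the log-likelihood
                   Domains = cbind(c(0, 0.1), c(20, 10)),  # lower/upper bounds per parameter
                   pop.size = 200, print.level = 0)
  fit_ga$par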

Global Optimization – Simulated Annealing

Analogy with the physical process of annealing:
- Start the process at a high "temperature"
- Gradually reduce the temperature according to an annealing schedule
- Always accept uphill moves (i.e. an increase in likelihood)
- Accept downhill moves according to the Metropolis algorithm, with probability p = exp(-Δlh / t), where Δlh is the magnitude of the decline in likelihood and t is the current temperature
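A small R sketch of that acceptance rule (the function and variable names are assumptions; the formula matches the code shown in Step 3 below):

  # Uphill moves are always accepted; downhill moves are accepted with
  # probability exp((new_ll - cur_ll) / t), which shrinks as t falls.
  accept_move <- function(cur_ll, new_ll, t) {
    if (new_ll >= cur_ll) TRUE else runif(1) < exp((new_ll - cur_ll) / t)
  }
  accept_move(-105, -106, t = 3)    # a small downhill move at a high temperature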

Effect of temperature (t)

(Figure only; not reproduced in the transcript.)

Simulated Annealing in practice...

A version with automatic adjustment of the search range (step size)...

(Figure: the search range (step size) around the current value, bounded by the lower and upper bounds.)

References:
Goffe, W. L., G. D. Ferrier, and J. Rogers. 1994. Global optimization of statistical functions with simulated annealing. Journal of Econometrics 60:65-99.
Corana et al. 1987. Minimizing multimodal functions of continuous variables with the simulated annealing algorithm. ACM Transactions on Mathematical Software 13:262-280.

Constraints – setting limits for the search...

The first step in any search is to consider the constraints on the range of parameter values within which you are willing to search:
- Biological limits: values that make no sense biologically (be careful...)
- Algebraic limits: values for which the model is undefined (e.g. dividing by zero)

Often there will be ranges that make no biological sense, but there may also be cases where specific parameter values make no algebraic sense (e.g. a value of zero for a term that is in the denominator). Bottom line: global optimization methods let you cast your net widely, at the cost of computer time. A bounded local search is sketched below.
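For local searches, simple box constraints can be imposed directly. A hedged sketch with optim()'s "L-BFGS-B" method (toy model; the bounds are arbitrary and just keep the standard deviation strictly positive):

  set.seed(1)
  y   <- rnorm(50, 10, 2)                          # made-up data
  nll <- function(p) -sum(dnorm(y, p[1], p[2], log = TRUE))
  optim(c(5, 1), nll, method = "L-BFGS-B",
        lower = c(-Inf, 1e-6),                     # sd must stay above zero
        upper = c( Inf,  Inf))$par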

Simulated Annealing – Initialization

Set the annealing schedule (typical values in parentheses):
- Initial temperature, t (3.0)
- Rate of reduction in temperature, rt (0.95)
- Interval between drops in temperature, nt (100)
- Interval between changes in range, ns (20)

Set the parameter values:
- Initial values (x)
- Upper and lower bounds (lb, ub)
- Initial range (vm)

Simulated Annealing – Step 1

Pick a new set of parameter values (by varying just one parameter). In the code below, vm is the search range, lb is the lower bound, and ub is the upper bound.

  begin {a single iteration}
    {copy the current parameter array (x) to a temporary holder (xp) for this iteration}
    xp := x;
    {choose a new value for the parameter in use (puse)}
    xp[puse] := x[puse] + ((random*2 - 1)*vm[puse]);
    {check whether the new value is out of bounds}
    if xp[puse] < lb[puse] then
      xp[puse] := x[puse] - (random * (x[puse] - lb[puse]));
    if xp[puse] > ub[puse] then
      xp[puse] := x[puse] + (random * (ub[puse] - x[puse]));

Simulated Annealing – Step 2

Call the likelihood function with the new parameter values, and accept the step if it leads uphill (or at least stays level):

    {call the likelihood function with the new set of parameter values}
    likeli(xp, fp);   {fp = new likelihood}
    {accept the new values if likelihood increases or at least stays the same}
    if (fp >= f) then
    begin
      x := xp;
      f := fp;
      nacp[puse] := nacp[puse] + 1;
      {if this is a new maximum, update the maximum likelihood}
      if (fp > fopt) then
      begin
        xopt := xp;
        fopt := fp;
        opteval := eval;
        BestFit;   {update display of maximum r}
      end;
    end

Simulated Annealing – Step 3

Otherwise, use the Metropolis criterion to decide whether to accept the downhill step:

    else
    begin
      {use the Metropolis criterion to determine whether to accept a downhill move}
      try
        {fp < f, so the line below is a shortcut for exp(-1.0 * abs(f - fp) / t)}
        p := exp((fp - f)/t);   {t = current temperature}
      except
        on EUnderflow do p := 0;
      end;
      pp := random;
      if pp < p then
      begin
        x := xp;
        f := fp;
        nacp[puse] := nacp[puse] + 1;
      end;
    end;

Simulated Annealing – Step 4

Periodically adjust the range (vm) within which new steps are chosen. ns is typically ~20, and C controls the adjustment of vm (the references suggest setting it to 2.0). This part is strictly ad hoc...

    {after nused * ns cycles, adjust vm so that about half of evaluations are accepted}
    if eval mod (nused*ns) = 0 then
    begin
      for i := 0 to npmax do
        if xvary[i] then
        begin
          ratio := nacp[i]/ns;
          if ratio > 0.6 then
            vm[i] := vm[i]*(1.0 + c[i]*((ratio - 0.6)/0.4))
          else if ratio < 0.4 then
            vm[i] := vm[i]/(1.0 + c[i]*((0.4 - ratio)/0.4));
          if vm[i] > (ub[i] - lb[i]) then
            vm[i] := ub[i] - lb[i];
        end;
      {reset nacp[i]}
      for i := 0 to npmax do
        nacp[i] := 0;
    end;

Effect of C on Adjusting Range...

Goffe et al. recommend setting C = 2. I have always followed this advice...

Simulated Annealing Code – Final Step

Reduce the "temperature" according to the annealing schedule (rt = the fractional reduction in temperature at each drop):

    {after nused * ns * nt cycles, reduce temperature t}
    if eval mod (nused*ns*nt) = 0 then
    begin
      t := rt * t;
      {store current maximum lhood in history list}
      lhist[eval div (nused*ns*nt)].iter := eval;
      lhist[eval div (nused*ns*nt)].lhood := fopt;
    end;

The annealing schedule is determined by three things:
(1) the initial temperature (t): I typically use 3;
(2) how often you drop the temperature: I rarely vary this from ns = 20 and nt = 100 (a very slow annealing);
(3) how much you drop the temperature: I typically use rt = 0.9 for a reasonably "slow" annealing. To speed things up I might drop it to 0.85, but I rarely go below that.

NOTE: Goffe et al. restart the search at the previous MLE estimates each time the temperature drops... (I don't.)
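As a side note for R users (not part of the lecture's own code): base R's optim() includes a simple simulated-annealing variant, method = "SANN", whose control arguments temp, tmax, and maxit loosely parallel the schedule above. A hedged sketch on the toy model:

  set.seed(1)
  y   <- rnorm(50, 10, 2)                          # made-up data
  nll <- function(p) -sum(dnorm(y, p[1], exp(p[2]), log = TRUE))
  fit_sann <- optim(c(5, 0), nll, method = "SANN",
                    control = list(maxit = 20000, temp = 3, tmax = 100))
  c(mean = fit_sann$par[1], sd = exp(fit_sann$par[2]))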

How many iterations?...

(Figure: two examples. Logistic regression of windthrow susceptibility, 188 parameters: 5 million iterations is not enough! Red maple leaf litterfall, 6 parameters: 500,000 iterations is way more than necessary!)

The result on the left is much more typical: the algorithm usually "converges" very quickly. Is there any objective definition of "convergence"? Not really, but there are various ad hoc definitions based on the magnitude of the change in maximum likelihood over a defined number of iterations. What would constitute convergence?...

Optimization – Summary

There are no hard and fast rules for any optimization: be willing to explore alternative options. Be wary of the initial values used in local optimization when the model is at all complicated. How about a hybrid approach? Start with simulated annealing, then switch to a local optimization...

Evaluating the strength of evidence for the MLE

Now that you have an MLE, how should you evaluate it? (Hint: think about the shape of the likelihood function, not just the MLE.)

Strength of evidence for particular parameter estimates – "Support"

Log-likelihood = "support" (Edwards 1992). Likelihood provides an objective measure of the strength of evidence for different parameter estimates...

Fisher's "Score" and "Information"

"Score" (a function): the first derivative (slope) of the log-likelihood function. So S(θ) = 0 at the maximum likelihood estimate of θ.

"Information" (a number): -1 times the second derivative (curvature) of the log-likelihood function, evaluated at the MLE. It is a measure of how steeply the likelihood drops off as you move away from the MLE. In general, the inverse of the information approximates the variance of the parameter estimate...
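In symbols (a reconstruction of the slide's definitions, writing $\ell(\theta)$ for the log-likelihood):

\[
S(\theta) = \frac{d\,\ell(\theta)}{d\theta}, \qquad
I(\hat{\theta}) = -\left.\frac{d^{2}\ell(\theta)}{d\theta^{2}}\right|_{\theta = \hat{\theta}}, \qquad
\operatorname{Var}(\hat{\theta}) \approx \frac{1}{I(\hat{\theta})}.
\]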

Profile Likelihood

Evaluate support (information) for a range of values of a given parameter by treating all other parameters as "nuisance" parameters and holding them at their MLEs...

(Figure: likelihood surface over Parameter 1 and Parameter 2.)
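A minimal R sketch of a profile in the sense used on this slide (toy normal model; only the mean is profiled, with the standard deviation held at its MLE):

  set.seed(1)
  y       <- rnorm(50, 10, 2)                       # made-up data
  sd_mle  <- sqrt(mean((y - mean(y))^2))            # MLE of sd (n in the denominator)
  mu_grid <- seq(mean(y) - 2, mean(y) + 2, by = 0.05)
  prof_ll <- sapply(mu_grid, function(m) sum(dnorm(y, m, sd_mle, log = TRUE)))
  plot(mu_grid, prof_ll, type = "l",
       xlab = "mean", ylab = "log-likelihood")      # the profile curve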

Asymptotic vs. Simultaneous M-Unit Support Limits

Asymptotic support limits (based on the profile likelihood): hold all other parameters at their MLE values, and systematically vary the remaining parameter until the log-likelihood declines by a chosen amount (m). What should m be? Two is a good number, and is roughly analogous to a 95% confidence interval.
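Continuing the profile sketch above, a 2-unit asymptotic support interval for the mean is just the range of values whose log-likelihood stays within 2 units of the maximum:

  keep <- prof_ll >= max(prof_ll) - 2               # within 2 units of support
  range(mu_grid[keep])                              # lower and upper 2-unit limits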

Asymptotic vs. Simultaneous M-Unit Support Limits

Simultaneous support limits (resampling method): draw a very large number of random sets of parameters and calculate the log-likelihood of each. The m-unit simultaneous support limits for parameter xi are the upper and lower values of xi among all sets whose support is within m units of the maximum. In practice, this can require an enormous number of iterations if there are more than a few parameters.

Connection to the likelihood ratio test: twice the difference in log-likelihoods is distributed as a chi-squared statistic with degrees of freedom equal to the difference in the number of parameters between the two models. For asymptotic support intervals, use the critical value with 1 degree of freedom, because there is just one fitted value (the parameter of interest). The p = 0.05 value for a chi-squared with 1 df is 3.84, so the critical difference in log-likelihood for one parameter is 1.92...
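The chi-squared numbers quoted above can be checked directly in R:

  qchisq(0.95, df = 1)        # 3.84: critical value for 2 * (difference in log-likelihood)
  qchisq(0.95, df = 1) / 2    # 1.92: the corresponding critical drop in log-likelihood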

Asymptotic vs. Simultaneous Support Limits

(Figure: a hypothetical likelihood surface for two parameters, marking the 2-unit drop in support and showing the asymptotic and simultaneous 2-unit support limits for Parameter 1.)

In general, the asymptotic limits will almost always be narrower than the simultaneous limits. On the other hand, I find the asymptotic limits more intuitively informative: given that we know what the MLE values are for the other parameters, why shouldn't we use them to express our strength of support for a parameter of interest?...

Other measures of strength of evidence for different parameter estimates

Edwards (1992, Chapter 5) describes various measures of the "shape" of the likelihood surface in the vicinity of the MLE: how pointed is the peak?...

Bootstrap methods

Bootstrap methods can be used to estimate the variances of parameter estimates. In simple terms: generate many replicates of the dataset by sampling with replacement (bootstraps), estimate the parameters for each of those datasets, and use the variance of the parameter estimates as a bootstrap estimate of the variance.
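A minimal bootstrap sketch on the toy normal model (the number of replicates and all names are arbitrary):

  set.seed(1)
  y   <- rnorm(50, 10, 2)                                   # made-up data
  nll <- function(p, dat) -sum(dnorm(dat, p[1], exp(p[2]), log = TRUE))
  boot_est <- replicate(500, {
    yb  <- sample(y, replace = TRUE)                        # resample with replacement
    fit <- optim(c(mean(yb), log(sd(yb))), nll, dat = yb)   # refit to the bootstrap sample
    c(mean = fit$par[1], sd = exp(fit$par[2]))
  })
  apply(boot_est, 1, var)                                   # bootstrap variance estimates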

Evaluating Support for Parameter Estimates: A Frequentist Approach

Traditional confidence intervals and standard errors of the parameter estimates can be generated from the Hessian matrix:
- Hessian = the matrix of second partial derivatives of the likelihood function with respect to the parameters, evaluated at the maximum likelihood estimates
- Also called the "information matrix" by Fisher
- Provides a measure of the steepness of the likelihood surface in the region of the optimum
- Can be generated in R using optim and fdHess

Illustrate on the board... The square roots of the diagonals of the inverse of the negative of the Hessian are the standard errors, and the 95% CI is the estimate +/- 1.96 * S.E.

Example from R

The negative of the Hessian matrix (when maximizing a log-likelihood) is a numerical approximation of Fisher's information matrix (the matrix of second partial derivatives of the log-likelihood function), evaluated at the point of the maximum likelihood estimates. Thus it is a measure of the steepness of the drop in the likelihood surface as you move away from the MLE.

> res$hessian
           a          b       sd
a   -150.182  -2758.360   -0.201
b  -2758.360 -67984.416   -5.925
sd    -0.202     -5.926 -299.422

(Sample output from an analysis that estimates two parameters and a variance term.)

More from R

Now invert ("solve" in R parlance) the negative of the Hessian matrix to get the matrix of parameter variances and covariances:

> solve(-1*res$hessian)
              a             b            sd
a   2.613229e-02 -1.060277e-03  3.370998e-06
b  -1.060277e-03  5.772835e-05 -4.278866e-07
sd  3.370998e-06 -4.278866e-07  3.339775e-03

The square roots of the diagonals of the inverted negative Hessian are the standard errors*:

> sqrt(diag(solve(-1*res$hessian)))
       a        b       sd
  0.1616 0.007597  0.05779

(*and 1.96 * S.E. is a 95% C.I....)
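Assuming res also carries the MLEs in res$par (as it would if it came from optim(); that name is an assumption, not shown on the slide), the corresponding 95% confidence intervals could be assembled as:

  se <- sqrt(diag(solve(-1 * res$hessian)))                 # standard errors, as above
  cbind(estimate = res$par,
        lower    = res$par - 1.96 * se,
        upper    = res$par + 1.96 * se)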