Published by Gwen McLaughlin. Modified over 9 years ago.
1
Shashi M. Kanbur, SUNY Oswego. Workshop on Best Practices in Astro-Statistics, IUCAA, January 2016.
2
Collaborators and Funding
Collaborators: HP Singh, R. Gupta, C. Ngeow, L. Macri, A. Bhardwaj, S. Das, R. Kundu, S. Deb, A. Nanthakumar.
Funding: NSF, IUSSTF, IUCAA, Delhi University, SUNY Oswego.
Websites: http://www.oswego.edu/~kanbur/iucaa2016/ and http://www.oswego.edu/~kanbur/DU2014
3
Linear Regression
A very common type of model in science. Given data $(x_i, y_i)$, $i = 1, \ldots, N$, the model is $y_i = a + b x_i + \varepsilon_i$, where $x_i$ and $y_i$ are the independent and dependent variables, $a$ and $b$ are the intercept and slope, and $\varepsilon_i$ is the error. The usual error model is $\varepsilon_i \sim N(0, \sigma^2)$. We are interested in testing hypotheses on the slope $b$.
4
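Linear Regression
The least squares estimates of the slope and intercept are $\hat{b} = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / S_{xx}$ and $\hat{a} = \bar{y} - \hat{b}\bar{x}$, with standard errors $\mathrm{SE}(\hat{b}) = s / \sqrt{S_{xx}}$ and $\mathrm{SE}(\hat{a}) = s \sqrt{1/N + \bar{x}^2 / S_{xx}}$. Here $S_{xx} = \sum_i (x_i - \bar{x})^2$ and $s^2 = \mathrm{RSS}/(N-2)$, with RSS the residual sum of squares.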
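The least squares estimates and their standard errors can be computed directly from the textbook formulas. The following is a minimal sketch using only the Python standard library; the function name `ols_fit` and the toy data are hypothetical and are not taken from the talk.

```python
import math

def ols_fit(x, y):
    """Ordinary least squares fit of y = a + b*x, with standard errors."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx                      # slope estimate
    a = y_mean - b * x_mean            # intercept estimate
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)                 # residual variance estimate
    se_b = math.sqrt(s2 / sxx)
    se_a = math.sqrt(s2 * (1.0 / n + x_mean ** 2 / sxx))
    return a, b, se_a, se_b, rss

# Hypothetical toy data, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]
a, b, se_a, se_b, rss = ols_fit(x, y)
print(f"a = {a:.3f} +/- {se_a:.3f}, b = {b:.3f} +/- {se_b:.3f}")
```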
5
Linear Regression
We are interested in testing whether the following model is better: $H_0: b = b_0$ versus $H_A: b = b_1$ for $x \le x_0$ and $b = b_2$ for $x > x_0$. That is, there is a change of slope at the break point $x_0$. We can fit regression lines to the data on each side of the break point, giving slope estimates $\hat{b}_1$ and $\hat{b}_2$.
6
Linear Regression
The standard way to “check” this is to look at the intervals $\hat{b}_1 \pm m\,\mathrm{SE}(\hat{b}_1)$ and $\hat{b}_2 \pm m\,\mathrm{SE}(\hat{b}_2)$ and see whether they overlap. This essentially puts confidence intervals around the slope estimates. Depending on the choice of $m$, the probability that the true slope lies in such an interval is $1-\alpha$, so the probability of an error is $\alpha$.
7
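Linear Regression
Let A = {“short period” slope is wrong} and B = {“long period” slope is wrong}, each with probability $\alpha$. In comparing the long and short period slopes, the probability of at least one mistake is $P(A \cup B) = P(A) + P(B) - P(A \cap B) = 2\alpha - \alpha^2$, assuming A and B are independent. If $0 < \alpha < 1$, then $2\alpha - \alpha^2 > \alpha$; for example, $\alpha = 0.05$ gives $0.0975$. So if we carry out statistical tests at significance level $\alpha$, the tests outlined in this talk have a smaller chance of making an error than the interval comparison.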
8
F Test
Perhaps the simplest way to test for nonlinearity is to use the F test: $F = \frac{(\mathrm{RSS}_R - \mathrm{RSS}_F)/(\nu_R - \nu_F)}{\mathrm{RSS}_F/\nu_F}$. Refer this statistic to the theoretical $F(\nu_R - \nu_F, \nu_F)$ distribution, where the subscripts R and F stand for the reduced and full models respectively, RSS is the residual sum of squares, and $\nu$ is the number of degrees of freedom. Assumptions to check: normality, absence of heteroskedasticity, and IID observations.
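As an illustration of the F test for a change of slope, the sketch below takes the reduced model to be a single line ($\nu_R = N - 2$) and the full model to be two independent lines on either side of a break point ($\nu_F = N - 4$). That choice of full model, the function names, and the toy data are assumptions made for the example, using only the Python standard library.

```python
def rss_line(x, y):
    """Residual sum of squares of the least squares line through (x, y)."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    sxx = sum((xi - xm) ** 2 for xi in x)
    b = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sxx
    a = ym - b * xm
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

def f_statistic(x, y, x0):
    """F statistic for one line (reduced) vs. two lines split at x0 (full)."""
    left = [(xi, yi) for xi, yi in zip(x, y) if xi <= x0]
    right = [(xi, yi) for xi, yi in zip(x, y) if xi > x0]
    rss_r = rss_line(x, y)
    rss_f = rss_line(*zip(*left)) + rss_line(*zip(*right))
    nu_r = len(x) - 2   # one intercept and one slope
    nu_f = len(x) - 4   # two intercepts and two slopes
    f = ((rss_r - rss_f) / (nu_r - nu_f)) / (rss_f / nu_f)
    return f, nu_r - nu_f, nu_f

# Hypothetical data: slope 2 up to x = 4, slope 0.5 afterwards, small scatter.
noise = [0.1, -0.1, 0.05, -0.05, 0.0]
x = [float(i) for i in range(10)]
y = [(2.0 * xi if xi <= 4 else 8.0 + 0.5 * (xi - 4)) + noise[i % 5]
     for i, xi in enumerate(x)]
f, dfn, dfd = f_statistic(x, y, x0=4.0)
print(f"F = {f:.1f} on ({dfn}, {dfd}) degrees of freedom")
```

If SciPy is available, the corresponding p-value can be obtained from the upper tail of the F distribution, for example with `scipy.stats.f.sf(f, dfn, dfd)`.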
9
Normality/Heteroskedasticity
Start from the data $(X_i, Y_i)$ with fitted values $Y^f_i$ and residuals $\varepsilon_i$, so that $Y_i = Y^f_i + \varepsilon_i$. Permute the residuals without replacement (the bootstrap samples with replacement): $\varepsilon^n_i = \varepsilon_j$ and $Y^n_i = Y^f_i + \varepsilon^n_i$. With $(X_i, Y^n_i)$ compute the F statistic, and repeat to obtain a set of values $F_i$. Find the proportion of the $F_i$ that are greater than the observed value of F. For heteroskedasticity, plot the residuals against the independent variable and try a transformation, perhaps log.
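The permutation procedure can be sketched as follows. This is an illustrative implementation under stated assumptions: the fitted values come from the single-line (reduced) model, the F statistic compares one line against two lines split at a break point `x0`, and all names and data are hypothetical.

```python
import random

def fit_line(x, y):
    """Least squares intercept and slope."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    sxx = sum((xi - xm) ** 2 for xi in x)
    b = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sxx
    return ym - b * xm, b

def rss(x, y):
    a, b = fit_line(x, y)
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

def f_stat(x, y, x0):
    """F statistic: one line (reduced) vs. two lines split at x0 (full)."""
    lo = [(xi, yi) for xi, yi in zip(x, y) if xi <= x0]
    hi = [(xi, yi) for xi, yi in zip(x, y) if xi > x0]
    rss_f = rss(*zip(*lo)) + rss(*zip(*hi))
    return ((rss(x, y) - rss_f) / 2) / (rss_f / (len(x) - 4))

def permutation_pvalue(x, y, x0, n_perm=1000, seed=0):
    """Fraction of permuted F values greater than the observed F."""
    rng = random.Random(seed)
    a, b = fit_line(x, y)
    fitted = [a + b * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    f_obs = f_stat(x, y, x0)
    count = 0
    for _ in range(n_perm):
        shuffled = resid[:]
        rng.shuffle(shuffled)          # permutation: without replacement
        y_new = [fi + ei for fi, ei in zip(fitted, shuffled)]
        if f_stat(x, y_new, x0) > f_obs:
            count += 1
    return f_obs, count / n_perm

# Hypothetical data with a change of slope at x = 4.
noise = [0.1, -0.1, 0.05, -0.05, 0.0]
x = [float(i) for i in range(10)]
y = [(2.0 * xi if xi <= 4 else 8.0 + 0.5 * (xi - 4)) + noise[i % 5]
     for i, xi in enumerate(x)]
f_obs, p_value = permutation_pvalue(x, y, x0=4.0)
print(f"observed F = {f_obs:.1f}, permutation p-value = {p_value:.3f}")
```

Because the reference distribution is built from the data themselves, this version does not rely on the errors being normal.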
10
Testing for Normality
Given data $(X_i, Y_i)$, $i = 1, \ldots, N$, form the empirical distribution function $F_n(u) = \#\{Y_i \le u\}/N$ and compare its quantiles with those expected from a normal distribution. If the data are from a normal distribution, the q-q plot should be close to a straight line.
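The points of a normal q-q plot can be computed without any plotting library. In the sketch below, the plotting positions $(i - 0.5)/N$ are one common convention and are an assumption here, as is the use of the correlation between the two sets of quantiles as a rough measure of straightness; the function names and simulated data are hypothetical.

```python
from statistics import NormalDist, mean

def qq_points(values):
    """Theoretical normal quantiles and sorted sample values for a q-q plot."""
    n = len(values)
    sample = sorted(values)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5)/n: an assumed, commonly used convention.
    theo = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theo, sample

def qq_correlation(values):
    """Correlation of the q-q points; close to 1 for normal-looking data."""
    theo, sample = qq_points(values)
    tm, sm = mean(theo), mean(sample)
    num = sum((t - tm) * (s - sm) for t, s in zip(theo, sample))
    den = (sum((t - tm) ** 2 for t in theo)
           * sum((s - sm) ** 2 for s in sample)) ** 0.5
    return num / den

# Simulated normal data, for illustration only.
data = NormalDist(mu=0.0, sigma=1.0).samples(200, seed=1)
r = qq_correlation(data)
print(f"q-q correlation = {r:.4f}")
```

Plotting `theo` against `sample` gives the q-q plot itself; a visibly curved pattern suggests a departure from normality.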
11
Random Walk Methods
Order the independent variable: $x_1 < x_2 < \ldots < x_N$. If $r_k$ is the $k$th residual from a linear regression, then form the partial sums $C(j) = \sum_{k=1}^{j} r_k$. If the data are consistent with a single linear regression, then the $C(j)$ are a simple random walk. Our test statistic, $R$, is the vertical range of the $C(j)$: $R = \max_j C(j) - \min_j C(j)$.
12
Random Walk Methods
If the partial sums are a random walk, R will be small. Permute the $r_k$ to randomize the residuals, then recompute R. Repeat this procedure for a large number (~10000) of permutations. The significance statistic is the fraction of the permuted R statistics that are greater than the observed value of R: this is the significance level under the null hypothesis of linearity. This is a non-parametric test and does not depend on normality of the errors.
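A minimal sketch of the random walk test, assuming the statistic R is the maximum minus the minimum of the partial sums of the ordered residuals. The function names and the toy data are hypothetical, and the example uses fewer permutations than the ~10000 suggested, purely to keep it fast.

```python
import random
from itertools import accumulate

def fit_line(x, y):
    """Least squares intercept and slope."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    sxx = sum((xi - xm) ** 2 for xi in x)
    b = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sxx
    return ym - b * xm, b

def range_stat(residuals):
    """Vertical range of the partial sums C(j) of the residuals."""
    c = list(accumulate(residuals))
    return max(c) - min(c)

def random_walk_test(x, y, n_perm=10000, seed=0):
    """Observed R and the fraction of permuted R values greater than it."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    a, b = fit_line(xs, ys)
    resid = [yi - a - b * xi for xi, yi in zip(xs, ys)]
    r_obs = range_stat(resid)
    rng = random.Random(seed)
    count = 0
    for _ in range(n_perm):
        perm = resid[:]
        rng.shuffle(perm)
        if range_stat(perm) > r_obs:
            count += 1
    return r_obs, count / n_perm

# Hypothetical data: slope 2 below x = 10, slope 0.5 above, small scatter.
noise = [0.05, -0.05, 0.02, -0.02, 0.0]
x = [float(i) for i in range(20)]
y = [(2.0 * xi if xi < 10 else 20.0 + 0.5 * (xi - 10)) + noise[i % 5]
     for i, xi in enumerate(x)]
r_obs, p_value = random_walk_test(x, y, n_perm=2000)
print(f"R = {r_obs:.2f}, p-value = {p_value:.4f}")
```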
13
Testimator
Testimator = Test Estimator. Sort the data in order of increasing independent variable. Divide the sample into N1 non-overlapping, and hence completely independent, subsets. Each subset has n data points, and any remaining data points are included in the last subset. We fit a linear regression to the first subset and determine an initial slope estimate, $\beta'$.
14
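Testimator
This initial estimate of the slope becomes $\beta_0$ for the next subset, under the null hypothesis that the slope of the second subset is equal to the slope of the first subset. We calculate the t-statistic $t_{obs} = (\hat{\beta} - \beta_0)/\mathrm{SE}(\hat{\beta})$, where $\hat{\beta}$ is the least squares slope of the current subset.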
15
Testimator
Since there will be $n_g = n - 1$ hypothesis tests, the critical t value will be of Bonferroni type, $t_c = t_{\alpha/2n_g,\,\nu}$, where $\nu$ is the number of data points in each subset. Once we know the observed and critical values of the t-statistic, we determine $k = |t_{obs}|/t_c$, which is the probability that the initial testimator guess is true. If $k < 1$, the null hypothesis is accepted and we derive the new testimator slope for the next subset from the previously determined $\beta$'s: $\beta_w = k\hat{\beta} + (1-k)\beta_0$.
16
Testimator
This value of the testimator is taken as $\beta_0$ for the next subset. This process of hypothesis testing is repeated $n_g$ times or until $k > 1$, which suggests rejection of the null hypothesis: that is, the data are more consistent with a non-linear relation.
17
The Extra-Galactic Distance Scale
The distance modulus is $\mu = m - M$, so $\mu = m - (a + b \log P)$. In a calibrating galaxy, observe Cepheids and determine $M = a + b \log P$. In a target galaxy, observe Cepheids $m_i$, $i = 1, \ldots, N$, so $\mu_i = m_i - (a + b \log P_i)$. This can be written as $y = Lq$, where $y = (m_1, m_2, \ldots, m_N)$, $q = (a, b, \mu_1, \mu_2, \ldots, \mu_N)$ is the vector of unknowns and $L$ is an $N \times (N+2)$ matrix containing 1's and $\log P_i$'s.
18
The Extra-Galactic Distance Scale
This is a vector equation for the q's and is easily solvable using the General Linear Model interface in R. Minimizing $\chi^2 = (y - Lq)^T C^{-1} (y - Lq)$, where $C$ is the covariance matrix of the measurement errors, yields the weighted least squares estimate, which is the MLE for $q$ when the errors are normally distributed: $q' = (L^T C^{-1} L)^{-1} L^T C^{-1} y$. The covariance matrix of $q'$ is $(L^T C^{-1} L)^{-1}$, and the standard errors of the parameters are the square roots of its diagonal elements. If you formulate your statistical data analysis problems in this General Linear Model formalism, it's very easy to solve in R along with a full error analysis.
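The estimator $q' = (L^T C^{-1} L)^{-1} L^T C^{-1} y$ can be illustrated with a few lines of code. The talk uses R for this; the sketch below is a plain Python stand-in that assumes a diagonal error matrix $C = \mathrm{diag}(\sigma_i^2)$ and fits only the calibration step $M = a + b \log P$. The function names and the numerical values are hypothetical.

```python
import math

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def weighted_glm(L, y, sigma):
    """Weighted least squares for y = L q with C = diag(sigma^2)."""
    p = len(L[0])
    w = [1.0 / s ** 2 for s in sigma]          # diagonal of C^-1
    ltcl = [[sum(wi * row[j] * row[k] for wi, row in zip(w, L))
             for k in range(p)] for j in range(p)]
    ltcy = [sum(wi * row[j] * yi for wi, row, yi in zip(w, L, y))
            for j in range(p)]
    cov = invert(ltcl)                          # covariance matrix of q'
    q = [sum(cov[j][k] * ltcy[k] for k in range(p)) for j in range(p)]
    se = [math.sqrt(cov[j][j]) for j in range(p)]
    return q, se

# Hypothetical calibrating-galaxy Cepheids lying exactly on M = -1.4 - 2.8 logP.
log_p = [0.5, 0.8, 1.0, 1.3, 1.6]
M = [-1.4 - 2.8 * lp for lp in log_p]
sigma = [0.05] * len(log_p)
L = [[1.0, lp] for lp in log_p]                 # columns: 1 and logP
q, se = weighted_glm(L, M, sigma)
print(f"a = {q[0]:.3f} +/- {se[0]:.3f}, b = {q[1]:.3f} +/- {se[1]:.3f}")
```

The same function accepts wider design matrices, provided $L^T C^{-1} L$ is invertible.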
19
The Extra-Galactic Distance Scale and Bayes
Bayesian GLM formalism applied to the estimate of $H_0$.
20
Segmented Lines and the Davies Test
The model is $Y = a_S + b_S X + \psi(X)\,\Delta a\,(X - X_b)$, where $\Delta a = a_L - a_S$ and $\psi(X) = 0$ for $X < X_b$, $\psi(X) = 1$ for $X \ge X_b$. This assumes a continuous transition between the two linear models. A more general situation, perhaps with a discontinuity, is $Y = a_S + b_S X + \psi(X)[\Delta a\,(X - X_b) - \gamma]$, where $\gamma$ represents the magnitude of the gap.
21
Segmented Lines
Choose an initial break point $X_b'$ and then fit the other parameters in the equation. Estimate a new break point, $X_b'' = X_b' + \gamma/\Delta a$. Repeat until $\gamma \approx 0$.
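The break point iteration can be sketched as follows, assuming that at each step the four parameters $(a_S, b_S, \Delta a, \gamma)$ are fitted by ordinary least squares with the current break point held fixed. The function names, the stopping tolerance, the starting guess, and the noise-free toy data are all hypothetical.

```python
def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination."""
    n = len(b)
    aug = [list(row) + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def fit_break(x, y, xb, tol=1e-6, max_iter=50):
    """Iterate X_b'' = X_b' + gamma / delta_a until gamma is close to zero."""
    for _ in range(max_iter):
        rows = []
        for xi in x:
            psi = 1.0 if xi >= xb else 0.0
            # Columns multiply a_S, b_S, delta_a and gamma respectively.
            rows.append([1.0, xi, psi * (xi - xb), -psi])
        ata = [[sum(r[j] * r[k] for r in rows) for k in range(4)]
               for j in range(4)]
        aty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(4)]
        a_s, b_s, delta_a, gamma = solve(ata, aty)
        if abs(gamma) < tol or delta_a == 0.0:
            break
        xb = xb + gamma / delta_a
    return xb, a_s, b_s, delta_a

# Hypothetical noise-free data: slope 2 below x = 5, slope 0.5 above.
x = [0.5 * i for i in range(21)]
y = [1.0 + 2.0 * xi if xi < 5 else 11.0 + 0.5 * (xi - 5) for xi in x]
xb, a_s, b_s, delta_a = fit_break(x, y, xb=4.8)   # 4.8 is a starting guess
print(f"break at x = {xb:.3f}, slope change = {delta_a:.3f}")
```

With noisy data the iteration may need a sensible starting guess and is not guaranteed to converge.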
22
Cepheid PL Relations
23
Cepheid PC Relations
24
Multiphase PL Relations
25
Multiwavelength PL Relations
26
Galactic PL Relations
27
ExtraGalactic PL Relations