Introduction to Astrostatistics Eric Feigelson Dept. of Astronomy & Astrophysics Center for Astrostatistics Penn State University Summer.

Slides:



Advertisements
Similar presentations
Regression Eric Feigelson Lecture and R tutorial Arcetri Observatory April 2014.
Advertisements

Chapter 6 Sampling and Sampling Distributions
Probability and Statistics Basic concepts II (from a physicist point of view) Benoit CLEMENT – Université J. Fourier / LPSC
CmpE 104 SOFTWARE STATISTICAL TOOLS & METHODS MEASURING & ESTIMATING SOFTWARE SIZE AND RESOURCE & SCHEDULE ESTIMATING.
Lwando Kondlo Supervisor: Prof. Chris Koen University of the Western Cape 12/3/2008 SKA SA Postgraduate Bursary Conference Estimation of the parameters.
Sampling: Final and Initial Sample Size Determination
Codes for astrostatistics: StatCodes & VOStat Eric Feigelson Penn State.
Maximum likelihood (ML) and likelihood ratio (LR) test
Constraining Astronomical Populations with Truncated Data Sets Brandon C. Kelly (CfA, Hubble Fellow, 6/11/2015Brandon C. Kelly,
Resampling techniques Why resampling? Jacknife Cross-validation Bootstrap Examples of application of bootstrap.
Curve-Fitting Regression
Chapter 7 Sampling and Sampling Distributions
Evaluating Hypotheses
IEEM 3201 One and Two-Sample Estimation Problems.
2008 Chingchun 1 Bootstrap Chingchun Huang ( 黃敬群 ) Vision Lab, NCTU.
July 3, A36 Theory of Statistics Course within the Master’s program in Statistics and Data mining Fall semester 2011.
Maximum likelihood (ML)
Regression Eric Feigelson. Classical regression model ``The expectation (mean) of the dependent (response) variable Y for a given value of the independent.
The Paradigm of Econometrics Based on Greene’s Note 1.
Overview of astrostatistics Eric Feigelson (Astro & Astrophys) & Jogesh Babu (Stat) Penn State University.
Overview G. Jogesh Babu. Probability theory Probability is all about flip of a coin Conditional probability & Bayes theorem (Bayesian analysis) Expectation,
Topic 5 Statistical inference: point and interval estimate
Prof. Dr. S. K. Bhattacharjee Department of Statistics University of Rajshahi.
Lecture 12 Statistical Inference (Estimation) Point and Interval estimation By Aziza Munir.
1 Institute of Engineering Mechanics Leopold-Franzens University Innsbruck, Austria, EU H.J. Pradlwarter and G.I. Schuëller Confidence.
Review of Chapters 1- 5 We review some important themes from the first 5 chapters 1.Introduction Statistics- Set of methods for collecting/analyzing data.
1 Nonparametric Methods III Henry Horng-Shing Lu Institute of Statistics National Chiao Tung University
Statistical Decision Making. Almost all problems in statistics can be formulated as a problem of making a decision. That is given some data observed from.
An Empirical Likelihood Ratio Based Goodness-of-Fit Test for Two-parameter Weibull Distributions Presented by: Ms. Ratchadaporn Meksena Student ID:
CS433: Modeling and Simulation Dr. Anis Koubâa Al-Imam Mohammad bin Saud University 15 October 2010 Lecture 05: Statistical Analysis Tools.
Introduction Osborn. Daubert is a benchmark!!!: Daubert (1993)- Judges are the “gatekeepers” of scientific evidence. Must determine if the science is.
Chapter 20 For Explaining Psychological Statistics, 4th ed. by B. Cohen 1 These tests can be used when all of the data from a study has been measured on.
SUPA Advanced Data Analysis Course, Jan 6th – 7th 2009 Advanced Data Analysis for the Physical Sciences Dr Martin Hendry Dept of Physics and Astronomy.
G. Cowan Lectures on Statistical Data Analysis Lecture 1 page 1 Lectures on Statistical Data Analysis London Postgraduate Lectures on Particle Physics;
บทบาทของนักสถิติต่อภาคธุรกิจ และอุตสาหกรรม. Scientific method refers to a body of techniques for investigating phenomena, acquiring new knowledge, or.
Academic Research Academic Research Dr Kishor Bhanushali M
Overview G. Jogesh Babu. Overview of Astrostatistics A brief description of modern astronomy & astrophysics. Many statistical concepts have their roots.
Reasoning in Psychology Using Statistics Psychology
IMPORTANCE OF STATISTICS MR.CHITHRAVEL.V ASST.PROFESSOR ACN.
Point Estimation of Parameters and Sampling Distributions Outlines:  Sampling Distributions and the central limit theorem  Point estimation  Methods.
1 Introduction to Statistics − Day 3 Glen Cowan Lecture 1 Probability Random variables, probability densities, etc. Brief catalogue of probability densities.
Machine Learning 5. Parametric Methods.
Copyright © Cengage Learning. All rights reserved. 9 Inferences Based on Two Samples.
Stats Term Test 4 Solutions. c) d) An alternative solution is to use the probability mass function and.
Introduction to statistics Definitions Why is statistics important?
Week 21 Order Statistics The order statistics of a set of random variables X 1, X 2,…, X n are the same random variables arranged in increasing order.
Statistical Methods. 2 Concepts and Notations Sample unit – the basic landscape unit at which we wish to establish the presence/absence of the species.
Commentary on Chris Genovese’s “Nonparametric inference and the Dark Energy equation of state” Eric Feigelson (Penn State) SCMA IV.
Parameter Estimation. Statistics Probability specified inferred Steam engine pump “prediction” “estimation”
Computacion Inteligente Least-Square Methods for System Identification.
Evaluating Hypotheses. Outline Empirically evaluating the accuracy of hypotheses is fundamental to machine learning – How well does this estimate its.
Density Estimation in R Ha Le and Nikolaos Sarafianos COSC 7362 – Advanced Machine Learning Professor: Dr. Christoph F. Eick 1.
BPS - 5th Ed. Chapter 231 Inference for Regression.
Week 21 Statistical Model A statistical model for some data is a set of distributions, one of which corresponds to the true unknown distribution that produced.
Chapter 22 Inferential Data Analysis: Part 2 PowerPoint presentation developed by: Jennifer L. Bellamy & Sarah E. Bledsoe.
Chapter 6 Sampling and Sampling Distributions
Statistical Decision Making. Almost all problems in statistics can be formulated as a problem of making a decision. That is given some data observed from.
Ch 1. Introduction Pattern Recognition and Machine Learning, C. M. Bishop, Updated by J.-H. Eom (2 nd round revision) Summarized by K.-I.
Appendix I A Refresher on some Statistical Terms and Tests.
Fundamentals of Data Analysis Lecture 11 Methods of parametric estimation.
Overview G. Jogesh Babu. R Programming environment Introduction to R programming language R is an integrated suite of software facilities for data manipulation,
Canadian Bioinformatics Workshops
Astrostatistics: Past, Present & Future
Overview G. Jogesh Babu.
Reasoning in Psychology Using Statistics
Bootstrap for Goodness of Fit
Ch13 Empirical Methods.
Parametric Methods Berlin Chen, 2005 References:
Applied Statistics and Probability for Engineers
Presentation transcript:

Introduction to Astrostatistics Eric Feigelson Dept. of Astronomy & Astrophysics Center for Astrostatistics Penn State University Summer School in Statistics for Astronomers June 2013

Outline Role of statistics in astronomy History of astrostatistics Needs and status of astrostatistics today Prospects of astrostatistics Appendix: Vocabulary & fields of modern statistics

What is astronomy? Astronomy (astro = star, nomen = name in Greek) is the observational study of matter beyond Earth – planets in the Solar System, stars in the Milky Way Galaxy, galaxies in the Universe, and diffuse matter between these concentrations. Astrophysics (astro = star, physis = nature) is the study of the intrinsic nature of astronomical bodies and the processes by which they interact and evolve. This is an indirect, inferential intellectual effort based on the assumption that gravity, electromagnetism, quantum mechanics, plasma physics, chemistry, and so forth – apply universally to distant cosmic phenomena.

What is statistics? (No consensus !!) Statistics characterizes and generalizes data –“The first task of a statistician is cross-examination of data” (R. A. Fisher, 1949) –“Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data” (Wikipedia, ) –“[Statistics is] the study of algorithms for data analysis” (R. Beran) –“A statistical inference carries us from observations to conclusions about the populations sampled” (D. R. Cox, 1958)

Does statistics relate to scientific models? The pessimists … “There is no need for these hypotheses to be true, or even to be at all like the truth; rather … they should yield calculations which agree with observations” (Osiander’s Preface to Copernicus’ De Revolutionibus, quoted by C. R. Rao) “`Essentially, all models are wrong, but some are useful.' (Box & Draper 1987) The optimist … “The goal of science is to unlock nature’s secrets. … Our understanding comes through the development of theoretical models which are capable of explaining the existing observations as well as making testable predictions. … Fortunately, a variety of sophisticated mathematical and computational approaches have been developed to help us through this interface, these go under the general heading of statistical inference.” (P. C. Gregory, 2005)

My personal conclusions (X-ray astronomer with 25 yrs statistical experience) The application of statistics can reliably quantify information embedded in scientific data and help adjudicate theoretical models. But this is not a straightforward, mechanical enterprise. It requires careful statement of the problem, model formulation, choice of statistical method(s), calculation of statistical quantities, and judicious evaluation of the result. Astronomers often do not adequately pursue each of these steps. Modern statistics is vast in its scope and methodology. It is difficult to find what may be useful (jargon problem!), and there are usually several ways to proceed. Some issues are debated among statisticians, or have no known solution. Many statistical procedures are based on mathematical proofs which determine the applicability of established results; it is easy to ignore these limits and emerge with unreliable results. It is perilous to violate mathematical truths! It can be difficult to interpret the meaning of a statistical result with respect to the scientific goal. P-values are not necessarily useful … we are scientists first! Statistics is only a tool towards understanding nature from incomplete information. We should be knowledgeable in our use of statistics and judicious in its interpretation.

Astronomy & statistics: A glorious past For most of western history, the astronomers were the statisticians! Ancient Greeks – 18 th century What is the best estimate of the length of a year from discrepant data? Middle of range (Hipparcos) Observe only once! (medieval) Mean (Galileo, Brahe, Simpson) Median (today?) 19 th century Discrepant observations of planets/moons/comets used to estimate orbits using Newtonian celestial mechanics Legendre, Laplace & Gauss develop least-squares regression and normal error theory (c ) Prominent astronomers contribute to least-squares theory (c )

The lost century of astrostatistics…. In the late-19th and 20th centuries, statistics moved towards human sciences (demography, economics, psychology, medicine, politics) and industrial applications (agriculture, mining, manufacturing). During this time, astronomy recognized the power of Modern physics: electromagnetism, thermodynamics, quantum mechanics, relativity. Astronomy & physics were closely wedded into astrophysics. Thus, astronomers and statisticians substantially broke contact; e.g. the curriculum of astronomers heavily involved physics but little statistics. Statisticians today know little modern astronomy.

The state of astrostatistics today (not good!) The typical astronomical study uses: –Fourier transform for temporal analysis (Fourier 1807) –Least squares regression for model fitting (Legendre 1805, Pearson 1901) –Kolmogorov-Smirnov goodness-of-fit test (Kolmogorov, 1933) –Principal components analysis for tables (Hotelling 1936) Even traditional methods are often misused: –Six unweighted bivariate least squares fits are used interchangeably with wrong confidence intervals Feigelson & Babu ApJ 1992 –Use of the likelihood ratio test for comparing two models is often inconsistent with asymptotic statistical theory Protassov et al. ApJ 2002 –K-S goodness-of-fit probabilities are inapplicable when the model is derived from the data Babu & Feigelson ADASS 2006

Advertisement …. Modern Statistical Methods for Astronomy with R Applications E. D. Feigelson & G. J. Babu, Cambridge Univ Press, August 2012 Text is based on this Summer School but more comprehensive

Feigelson in Advances in Machine Learning and Data Mining for Astronomy M. Way et al. (eds.) 2012 Example of inadequate use of modern methodology

An analogy ….. Astrostatistics and Chairs Homemade chair by amateur Minimum chi-square regression The Eames chair Maximum likelihood regression with BIC model selection Modern utilitarian & ecological chair Anderson-Darling nonparametric model validation with bootstrap confidence intervals Astronomers must learn principles of furniture design, ergonomics, selection of materials, joinery, finishing, etc

Statistical needs in astronomy today Are the available stars/galaxies/sources an unbiased sample of the vast underlying population? When should these objects be divided into 2/3/… classes? What is the intrinsic relationship between two properties of a class (especially with confounding variables)? Can we answer such questions in the presence of observations with measurement errors & flux limits?

Are the available stars/galaxies/sources an unbiased sample of the vast underlying population? Sampling When should these objects be divided into 2/3/… classes? Multivariate classification What is the intrinsic relationship between two properties of a class (especially with confounding variables)? Multivariate regression Can we answer such questions in the presence of observations with measurement errors & flux limits? Censoring, truncation & measurement errors (cf. talks by Chad Schafer and Brandon Kelly) Statistical needs in astronomy today

When is a blip in a spectrum, image or datastream a real signal? Statistical inference How do we model the vast range of variable objects (extrasolar planets, BH accretion, GRBs, …)? Time series analysis How do we model the 2-6-dimensional points (galaxies in the Universe, photons in a detector)? Spatial point processes & image processing How do we model continuous structures (cosmic microwave background fluctuations, interstellar medium)? Density estimation, regression

A new imperative: Large-scale surveys, megadatasets & the Virtual Observatory Huge, uniform, multivariate databases are emerging from specialized survey projects & telescopes: – object photometric catalogs from USNO, 2MASS & SDSS – galaxy redshift catalogs from 2dF & SDSS – source radio/infrared/X-ray catalogs – Spectral databases: 10 5 SDSS quasars, 10 4 stellar radial velocities, 10 3 Spitzer protoplanetary disks, 10 8 LAMOST spectra, …, … – Huge image databases, growing datacubes (EVLA/ALMA, IFUs) – Planned LSST will generate ~10 Pby video, ~10 10 object catalogs The Virtual Observatory is an international effort underway to federate these distributed on-line astronomical databases. Powerful statistical tools are needed to derive scientific insights from extracted VO datasets

Software Astronomers urgently need broad, reliable statistical software. Historically, commercial stat packages have dominated, and astronomers have not purchased them (largest: SAS). Recently, the first major public-domain statistical software package has emerged: R ( Similar to IDL, R (and its add-on packages in CRAN) provide a huge range of built-in statistical functionalities.

Statistics: Some basic definitions Statistical inference –Seeking quantitative insight & interpretation of a dataset Hypothesis testing –To what confidence is a dataset consistent with a previously stated hypothesis? Estimation –Seeking the quantitative characteristics of a functional model designed to explain a dataset. An estimator seeks to approximate the unknown parameters based on the data Probability distribution –A parametric functional family describing the behavior of a parent distribution of a dataset (e.g. Gaussian = normal) Nonparametric statistics –Inference based directly on the dataset without parametric models Independent & identically distributed (iid) data point –A sample of similarly but independently acquired quantitative measurements.

Some basic definitions (cont.) Frequentist statistics – Suite of classical inference methods based on simple probability distributions. Hypotheses are fixed while data vary. Bayesian statistics – Inference methods based on Bayes’ Theorem based on likelihoods and prior distributions. Data are fixed while hypotheses vary. L 1 and L 2 methods – 19th century methods for estimation based on minimizing the absolute or squared deviations between a sample and a model Maximum likelihood methods – 20th century methods for parametric estimation based on the likelihood that a dataset fits the model (often like L 2 ) Gibbs sampling, Metropolis-Hastings algorithm, Markov chain Monte Carlo, … – New computational methods useful for integrations over hypothesis space in Bayesian statistics

Some basic definitions (cont.) Robust (nonparametric) methods – Statistical procedures that are insensitive to data outliers or distributions Model selection & validation – Procedures for estimating the goodness-of-fit and choice of parametric model. (Nested vs. non-nested models, model misspecification) Statistical power, efficiency & bias – Mathematical evaluation of the effectiveness of a statistical procedure to achieve its desired goals Two-sample & k-sample tests – Statistical tests giving probabilities that k samples are drawn from the same parent sample Independent & identically distributed (i.i.d.) data point –A sample of similarly but independently acquired quantitative measurements. Heteroscedasticity – A failure of i.i.d. due to differently weighted data points, common in astronomy due to measurement errors with known variances

Some fields of applied statistics Multivariate analysis –Establishing the structure of a table of rows & columns –Analysis of variance, regression, principal component analysis, discriminant analysis, factor analysis Multivariate classification –Dividing a multivariate dataset into distinct classes Correlation & regression –Establishing the relationships between variables in a sample Time series analysis –Studying data measured along a time-like axis Spatial analysis –Studying point or continuous processes in 2-3-dimensions Survival analysis –Studying data subject to censoring (e.g. upper limits) Data mining –Studying structures in mega-datasets Biometrics, econometrics, psychometrics, chemometrics, quality assurance, geostatistics, astrostatistics, …, …