Introduction to Astrostatistics and R Eric Feigelson Penn State University ESO/Garching November 2014.

Slides:



Advertisements
Similar presentations
Regression Eric Feigelson Lecture and R tutorial Arcetri Observatory April 2014.
Advertisements

Programming Paradigms and languages
Introduction to Astrostatistics Eric Feigelson Dept. of Astronomy & Astrophysics Center for Astrostatistics Penn State University Summer.
Actuarial Modeling in R CAS Predictive Modeling Seminar Las Vegas October, 2007 Glenn Meyers, FCAS, MAAA Jim Guszcza, FCAS, MAAA.
CP411 Computer Graphics, Wilfrid Laurier University Introduction # 1 Welcome to CP411 Computer Graphics 2012 Instructor: Dr. Hongbing Fan Introduction.
Teaching Courses in Scientific Computing 30 September 2010 Roger Bielefeld Director, Advanced Research Computing.
R Mohammed Wahaj. What is R R is a programming language which is geared towards using a statistical approach and graphics Statisticians and data miners.
Codes for astrostatistics: StatCodes & VOStat Eric Feigelson Penn State.
Department of Mathematics and Computer Science
Decision Making: An Introduction 1. 2 Decision Making Decision Making is a process of choosing among two or more alternative courses of action for the.
Ann Arbor ASA ‘Up and Running’ Series: SPSS Prepared by volunteers of the Ann Arbor Chapter of the American Statistical Association, in cooperation with.
A Web service for Distributed Covariance Computation on Astronomy Catalogs Presented by Haimonti Dutta CMSC 691D.
Engineering Data Analysis & Modeling Practical Solutions to Practical Problems Dr. James McNames Biomedical Signal Processing Laboratory Electrical & Computer.
NSF DMS VOStat - HEAD 2004 Ashish Mahabal VOStat Arming Astronomers with Advanced Statistics Caltech: A. Mahabal, M. Graham,
Who am I and what am I doing here? Allan Tucker A brief introduction to my research
RESEARCH HUB AT THE UNIVERSITY LIBRARIES PENN STATE UNIVERSITY TOUR OF STATISTICAL PACKAGES.
Regression Eric Feigelson. Classical regression model ``The expectation (mean) of the dependent (response) variable Y for a given value of the independent.
What is R Muhammad Omer. What is R  R is the programing language software for statistical computing and data analysis  The R language is extensively.
Overview of astrostatistics Eric Feigelson (Astro & Astrophys) & Jogesh Babu (Stat) Penn State University.
Machine Learning Usman Roshan Dept. of Computer Science NJIT.
Part 1: Introduction 1-1/22 Econometrics I Professor William Greene Stern School of Business Department of Economics.
Data Mining Techniques
SENG521 (Fall SENG 521 Software Reliability & Testing Software Reliability Tools (Part 8a) Department of Electrical & Computer.
What is R By: Wase Siddiqui. Introduction R is a programming language which is used for statistical computing and graphics. “R is a language and environment.
N Tropy: A Framework for Analyzing Massive Astrophysical Datasets Harnessing the Power of Parallel Grid Resources for Astrophysical Data Analysis Jeffrey.
Data Mining Chun-Hung Chou
Overview G. Jogesh Babu. Probability theory Probability is all about flip of a coin Conditional probability & Bayes theorem (Bayesian analysis) Expectation,
ROOT: A Data Mining Tool from CERN Arun Tripathi and Ravi Kumar 2008 CAS Ratemaking Seminar on Ratemaking 17 March 2008 Cambridge, Massachusetts.
Anomaly detection with Bayesian networks Website: John Sandiford.
SPSS Presented by Chabalala Chabalala Lebohang Kompi Balone Ndaba.
Bayesian networks Classification, segmentation, time series prediction and more. Website: Twitter:
A PowerPoint Presentation Package to Accompany Applied Statistics in Business & Economics 2 nd ed. by David P. Doane and Lori E. Seward Prepared by Lloyd.
Piotr Wolski Introduction to R. Topics What is R? Sample session How to install R? Minimum you have to know to work in R Data objects in R and how to.
1 Statistical Distribution Fitting Dr. Jason Merrick.
1 Computer Programming (ECGD2102 ) Using MATLAB Instructor: Eng. Eman Al.Swaity Lecture (1): Introduction.
Mater Research Support Centre Statistics Course 2006 Introductory Talk.
Using Software in Teaching Statistics Damon Berridge, Centre for Applied Statistics, Dept of Mathematics & Statistics ESRC NCRM.
Chapter 1 Defining Social Studies. Chapter 1: Defining Social Studies Thinking Ahead What do you associate with or think of when you hear the words social.
Introduction to Science Informatics Lecture 1. What Is Science? a dependence on external verification; an expectation of reproducible results; a focus.
Empirical Research Methods in Computer Science Lecture 7 November 30, 2005 Noah Smith.
Multilevel Modeling Software Wayne Osgood Crime, Law & Justice Program Department of Sociology.
Introducing model-data fusion to graduate students in ecology Topics of discussion: The impact of NEON on ecology What are the desired outcomes from a.
Robust GW summary statistics & robust GW regression are used to investigate a freshwater acidification data set. Results show that data relationships can.
Introduction to Astrostatistics and R Eric Feigelson Penn State University Statistical Challenges in 21 st Century Cosmology IAU Symp 306 May 2014.
BOĞAZİÇİ UNIVERSITY DEPARTMENT OF MANAGEMENT INFORMATION SYSTEMS MATLAB AS A DATA MINING ENVIRONMENT.
Edinburgh e-Science MSc Bob Mann Institute for Astronomy & NeSC University of Edinburgh.
Overview G. Jogesh Babu. Overview of Astrostatistics A brief description of modern astronomy & astrophysics. Many statistical concepts have their roots.
August 2003 At A Glance The IRC is a platform independent, extensible, and adaptive framework that provides robust, interactive, and distributed control.
1 Statistics & R, TiP, 2011/12 Neural Networks  Technique for discrimination & regression problems  More mathematical theoretical foundation  Works.
Statistics in WR: Lecture 1 Key Themes – Knowledge discovery in hydrology – Introduction to probability and statistics – Definition of random variables.
Stats Term Test 4 Solutions. c) d) An alternative solution is to use the probability mass function and.
Spatial Point Processes Eric Feigelson Institut d’Astrophysique April 2014.
Statistics and Probability Theory Lecture 01 Fasih ur Rehman.
Commentary on Chris Genovese’s “Nonparametric inference and the Dark Energy equation of state” Eric Feigelson (Penn State) SCMA IV.
1 Seattle University Master’s of Science in Business Analytics Key skills, learning outcomes, and a sample of jobs to apply for, or aim to qualify for,
Overview G. Jogesh Babu. R Programming environment Introduction to R programming language R is an integrated suite of software facilities for data manipulation,
A quick guide to other statistical software
CST 1101 Problem Solving Using Computers
Astrostatistics: Past, Present & Future
JMP Discovery Summit 2016 Janet Alvarado
Course Review Questions will not be all on one topic, i.e. questions may have parts covering more than one area.
MATLAB Distributed, and Other Toolboxes
Introduction to R Programming with AzureML
Analyzing and Interpreting Quantitative Data
Overview G. Jogesh Babu.
Adventures in teaching and learning data analysis with R
R Programming.
Today’s Beginner Workshop
CSc4730/6730 Scientific Visualization
Helping a friend out Guidelines for better software
Presentation transcript:

Introduction to Astrostatistics and R Eric Feigelson Penn State University ESO/Garching November 2014

Download R from r-project.org Create a directory e.g. /Users/myself/Desktop/Rdir Download datasets into this directory from astro.psu.edu/users/edf/ESO_Nov2014 Enter R (double-click for MacOS, or R in linux to open R console. Put on right side of your screen. Open 2_Getting_started.R from PSU Web site on left side of screen so we can cut-and-paste into the console. Preparation for R tutorials

Outline of the school Mon 2:00 - 3:001. Introduction to astrostatistics and R [lecture] 3:00 - 4:152. Getting started with R [practicum] 4:15 - 5:303. Density estimation & local regression [lect+prac] Tue10:45 -12:004. Fundamentals of statistical inference [lect] 2:00 - 3:005. Regression [lect + prac] 3:00 - 4:006. Spatial point processes [lect+prac] Wed4:00 - 5:007. Time series analysis [lect] 9:00 -10:308. Data mining: clustering/classification [lect+prac] 11:00 -12:009. Towards good statistical practices in astronomical studies [lect]

My credentials Professor of Astronomy & Astrophysics and of Statistics, Penn State Assoc Director, Center for Astrostatistics, Penn State Scientific Editor (methodology), Astrophysical Journal Chair, IAU Working Group in Astrostatistics & Astroinformatics Councils, Intl Astrostatistics Assn, AAS/WGAA, LSST/ISSC Co-editor, Astrostatistics & Astroinformatics Portal ( Lead author, MSMA textbook (2012 PROSE Award) also an X-ray astronomer who studies star formation

Outline Introduction to astrostatistics –Role of statistics in science –History of astrostatistics –Status of astrostatistics today Introduction to R –History of statistical computing –The R language & CRAN packages

What is astronomy? Astronomy is the observational study of matter beyond Earth: planets in the Solar System, stars in the Milky Way Galaxy, galaxies in the Universe, and diffuse matter between these concentrations. Astrophysics is the study of the intrinsic nature of astronomical bodies and the processes by which they interact and evolve. This is an indirect, inferential intellectual effort based on the assumption that physics – gravity, electromagnetism, quantum mechanics, etc – apply universally to distant cosmic phenomena.

What is statistics? (No consensus !!) –“… briefly, and in its most concrete form, the object of statistical methods is the reduction of data” (R. A. Fisher, 1922) –“Statistics is the mathematical body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data.” (Wikipedia, ) –“Statistics is the study of the collection, analysis, interpretation, presentation and organization of data.” (Wikipedia, ) –“A statistical inference carries us from observations to conclusions about the populations sampled” (D. R. Cox, 1958)

Does statistics relate to scientific models? The pessimists … “Essentially, all models are wrong, but some are useful.” (Box & Draper 1987) “There is no need for these hypotheses to be true, or even to be at all like the truth; rather … they should yield calculations which agree with observations” (Osiander’s Preface to Copernicus’ De Revolutionibus, quoted by C. R. Rao) "The object [of statistical inference] is to provide ideas and methods for the critical analysis and, as far as feasible, the interpretation of empirical data... The extremely challenging issues of scientific inference may be regarded as those of synthesising very different kinds of conclusions if possible into a coherent whole or theory... The use, if any, in the process of simple quantitative notions of probability and their numerical assessment is unclear." (D. R. Cox, 2006)

The positivists … “The goal of science is to unlock nature’s secrets. … Our understanding comes through the development of theoretical models which are capable of explaining the existing observations as well as making testable predictions. … “Fortunately, a variety of sophisticated mathematical and computational approaches have been developed to help us through this interface, these go under the general heading of statistical inference.” (P. C. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, 2005)

Recommended steps in the statistical analysis of scientific data

Modern statistics is vast in its scope and methodology. It is difficult to find what may be useful (jargon problem!), and there are usually several ways to proceed. Very confusing. Some statistical procedures are based on mathematical proofs which determine the applicability of established results. It is perilous to violate mathematical truths! Some issues are debated among statisticians, or have no known solution. Scientific inferences should not depend on arbitrary choices in methodology & variable scale. Prefer nonparametric & scale- invariant methods. Try multiple methods. It can be difficult to interpret the meaning of a statistical result with respect to the scientific goal. Statistics is only a tool towards understanding nature from incomplete information. We should be knowledgeable in our use of statistics and judicious in its interpretation

Astronomy & Statistics: A glorious past For most of western history, the astronomers were the statisticians! Ancient Greeks to 18 th century Best estimate of the length of a year from discrepant data? Middle of range: Hipparcos (4 th century B.C.) Observe only once! (medieval) Mean: Brahe (16 th c), Galileo (17 th c), Simpson (18 th c) Median (20 th c) 19 th century Discrepant observations of planets/moons/comets used to estimate orbital parameters using Newtonian celestial mechanics Legendre, Laplace & Gauss develop least-squares regression and normal error theory (c ) Prominent astronomers contribute to least-squares theory (c )

The lost century of astrostatistics…. In the late-19th and 20th centuries, statistics moved towards human sciences (demography, economics, psychology, medicine, politics) and industrial applications (agriculture, mining, manufacturing). During this time, astronomy recognized the power of modern physics: electromagnetism, thermodynamics, quantum mechanics, relativity. Astronomy & physics were wedded into astrophysics. Thus, astronomers and statisticians substantially broke contact; e.g. the curriculum of astronomers heavily involved physics but little statistics. Statisticians today know little modern astronomy.

The state of astrostatistics today (not good!) Many astronomical studies are confined to a narrow suite of familiar statistical methods: –Fourier transform for temporal analysis (Fourier 1807) –Least squares regression for model fits (Legendre 1805, Pearson 1901) –Kolmogorov-Smirnov goodness-of-fit test (Kolmogorov, 1933) –Principal components analysis for tables (Hotelling 1936) Even traditional methods are often misused: final lecture on Friday

Under-utilized methodology: modeling (MLE, EM Algorithm, BIC, bootstrap) multivariate classification (LDA, SVM, CART, RFs) time series (autoregressive models, state space models) spatial point processes (Ripley’s K, kriging) nondetections (survival analysis) image analysis (computer vision methods, False Detection Rate) statistical computing (R) Advertisement …. Modern Statistical Methods for Astronomy with R Applications E. D. Feigelson & G. J. Babu, Cambridge Univ Press, 2012

Cosmology Statistics Galaxy clustering Spatial point processes, clustering Galaxy morphology Regression, mixture models Galaxy luminosity fn Gamma distribution Power law relationships Pareto distribution Weak lensing morphology Geostatistics, density estimation Strong lensing morphology Shape statistics Strong lensing timing Time series with lag Faint source detection False Discovery Rate Multiepoch survey lightcurves Multivariate classification CMB spatial analysis Markov fields, ICA, etc  CDM parameters Bayesian inference & model selection Comparing data & simulation under development An astrostatistics lexicon …

A new imperative: Large-scale surveys & megadatasets Huge imaging, spectroscopic & multivariate datasets are emerging from specialized survey projects & telescopes: – object photometric catalogs from 2MASS, SDSS, VISTA, … – galaxy redshift catalogs from SDSS, LAMOST, … – source radio/infrared/X-ray catalogs (WISE, eROSITA) – Spectral-image datacubes (VLA, ALMA, IFUs) – object x 10 2 epochs (3D) surveys (PTF, CRTS, SNF, VVV, Pan-STARRS, Stripe 82, DES, …, LSST) The Virtual Observatory is an international effort to federate many distributed on-line astronomical databases. Powerful statistical tools are needed to derive scientific insights from TBy-PBy-EBy databases

Recent resurgence in astrostatistics Improved access to statistical software. R/CRAN public-domain statistical software environment with thousands of functions. Increasing capability in Python. Papers in astronomical literature doubled to ~500/yr in past decade (“Methods: statistical” papers in NASA-Smithsonian Astrophysics Data System) Short training courses ( Penn State, India, Brazil, Spain, Greece, China, Italy, France ) Cross-disciplinary research collaborations (Harvard/ICHASC, Carnegie-Mellon, Penn State, NASA-Ames/Stanford, CEA-Saclay/Stanford, Cornell, UC-Berkeley, Michigan, Imperial College London, LSST Statistics & Informatics Science Collaboration, …) Cross-disciplinary conferences (Statistical Challenges in Modern Astronomy, Astronomical Data Analysis , PhysStat, SAMSI 2012, Astroinformatics 2012) Scholarly society working groups and a new integrated Web portal asaip.psu.edu serving: Int’l Stat Institute’s Int’l Astrostatistical Assn, Int’l Astro Union Working Group, Amer Astro Soc Working Group, Amer Stat Assn Interest Group, LSST Science Collaboration, ACM SIGAstro?)

Textbooks Bayesian Logical Data Analysis for the Physical Sciences: A Comparative Approach with Mathematica Support, Gregory, 2005 Practical Statistics for Astronomers, Wall & Jenkins, 2 nd ed 2012 Modern Statistical Methods for Astronomy with R Applications, Feigelson & Babu, 2012 Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data, Ivecic, Connolly, VanderPlas & Gray, 2014

Join a Working Group and the Astrostatistics and Astroinformatics Portal Recent papers, meetings, jobs, blogs, courses, forums, …

Prelude to R …. A brief history of statistical computing 1960s – c2003: Statistical analysis developed by academic statisticians, but implementation relegated to commercial companies (SAS, BMDP, Statistica, Stata, Minitab, etc). 1980s: John Chambers (ATT, USA)) develops S system, C-like command line interface. 1990s: Ross Ihaka & Robert Gentleman (Univ Auckland NZ) mimic S in an open source system, R. R Core Development Team expands, GNU GPL release. Early-2000s: Comprehensive R Analysis Network (CRAN) for user-provided specialized packages grows exponentially. Important packages incorporated into base-R.

Growth of CRAN contributed packages Sept 2014: ~5900 packages >120,000 functions See The Popularity of Data Analysis Software, R. A. Muenchen,

The R statistical computing environment R integrates data manipulation, graphics and extensive statistical analysis. Uniform documentation and coding standards. But quality control is limited. Fully programmable C-like language, similar to IDL. Specializes in vector/matrix inputs. Easy download from for Windows, Mac or linux. On-the-fly installation of CRAN packages. ~6000 user-provided add-on CRAN packages, tens of thousands of statistical functions

Many resources: R help files (3500p for base R), CRAN Task Views and vignette files, on-line tutorials, >140 books, >400 blogs, Use R! conferences, galleries, blogs, companies, The R Journal & J. Stat. Software, etc. Principal steps for using R in astronomical research: –Knowing what you want[education, consulting, thought] –Finding what you want[Google, Rseek, crantastic, …] –Writing R scripts[R Help files, books] –Understanding what you find [education, consulting, thought]

Some functionalities of base R arithmetic & linear algebra bootstrap resampling empirical distribution tests exploratory data analysis generalized linear modeling graphics robust statistics linear programming local and ridge regression max likelihood estimation multivariate analysis multivariate clustering neural networks smoothing spatial point processes statistical distributions statistical tests survival analysis time series analysis

Selected methods in Comprehensive R Archive Network (CRAN) Bayesian computation & MCMC, classification & regression trees, genetic algorithms, geostatistical modeling, hidden Markov models, irregular time series, kernel-based machine learning, least-angle & lasso regression, likelihood ratios, map projections, mixture models & model- based clustering, nonlinear least squares, multidimensional analysis, multimodality test, multivariate time series, multivariate outlier detection, neural networks, non-linear time series analysis, nonparametric multiple comparisons, omnibus tests for normality, orientation data, parallel coordinates plots, partial least squares, periodic autoregression analysis, principal curve fits, projection pursuit, quantile regression, random fields, Random Forest classification, ridge regression, robust regression, Self-Organizing Maps, shape analysis, space-time ecological analysis, spatial analyisis & kriging, spline regressions, tessellations, three-dimensional visualization, wavelet toolbox

CRAN Task Views ( CRAN Task Views provide brief overviews of CRAN packages by topic & functionality. Maintained be expert volunteers. Partial list: Bayesian ~110 packages Chem/Phys ~70 packages (incl. 17 for astronomy) Cluster/Mixture~90 packages Graphics ~40 packages HighPerfComp~80 packages Machine Learning~70 packages Medical imaging~25 packages Robust~40 packages Spatial~130 packages Survival~170 packages TimeSeries~170 packages

Some features of R o Designed for individual use on workstation, exploring data interactively with advanced methodology and graphics. But it can be used for automated pipeline analysis. Very similar experience to IDL. o Extensive graphics based on SVG, RGTK2, JGD, and other GUIs. Add-on graphics packages: ggplot2, lattice, rgl, rggobi, … o Interfaces: BUGS, C, C++, Fortran, Java, Perl, Python, Xlisp, XML –This is very important for astronomers. R scripts can ingest subroutines from these languages. Packages exist for two-way communication for C, Fortran, Python & Ruby: you can ingest R functions in your legacy codes or vice versa. o I/O: ASCII, binary, bitmap, cgi, FITS, ftp, gzip, HTML, SOAP, URL o Math packages (LAPACK, PVM), GUIs (Rstudio, Rattle), … …

Since c.2005, R has been the world’s premier public-domain statistical computing package Data scientists recommend both Python and R … usage of both is growing rapidly

A vision of astrostatistics in 2025 … Astronomy graduate curriculum has 1 year of statistical and computational methodology Some astronomers have M.S. in statistics and computer science Astrostatistics and astroinformatics is a well-funded, cross- disciplinary research field involving a few percent of astronomers (cf. astrochemists) pushing the frontiers of methodology. Astronomers regularly use many methods coded in R. Statistical Challenges in Modern Astronomy meetings are held annually with ~400 participants