Statistics 1: tests and linear models

How to get started? Exploring data graphically: scatterplot, histogram, boxplot
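As a minimal sketch of this graphical exploration in R, using the built-in cars dataset as example data (any data frame of your own would work the same way):

```r
# Explore the built-in cars dataset graphically
data(cars)
plot(cars$speed, cars$dist,
     xlab = "Speed", ylab = "Stopping distance")  # scatterplot
hist(cars$dist)     # histogram of the response variable
boxplot(cars$dist)  # boxplot: quick check for outliers
```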

Important things to check
– Are all the variables in the correct format?
– Do there seem to be outliers? Could they be mistakes in data coding?
Initial structure of the analyses
– What is the response variable?
– What are the explanatory variables?
Explore patterns visually
– Correlations?
– Differences between groups?

Summary statistics
summary(data), summary(x)
mean(x), median(x)
range(x)
var(x), sd(x)
min(x), max(x)
quantile(x, p)
tapply(), table()
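For example, on a small made-up vector (the values and the grouping vector g are invented for illustration):

```r
# Summary statistics on a small example vector
x <- c(2, 5, 7, 1, 9, 4)
summary(x)          # min, quartiles, median, mean, max at once
mean(x)             # 4.666667
median(x)           # 4.5
range(x)            # 1 9
var(x); sd(x)       # variance and standard deviation
quantile(x, 0.9)    # 90% quantile
# tapply() and table() work with grouped/categorical data:
g <- c("a", "a", "b", "b", "b", "b")
table(g)            # counts per category: a = 2, b = 4
tapply(x, g, mean)  # group means of x by g
```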

Tests
Test for normality
– Shapiro–Wilk test: shapiro.test()
– QQ plot: qqnorm(), qqline()
Homogeneity of variance
– var.test() (for two groups)
– bartlett.test() (for several groups)
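A sketch of these checks on simulated data (the rnorm draws and group sizes are just illustrative):

```r
set.seed(1)
x <- rnorm(30)             # simulated normal data
shapiro.test(x)            # Shapiro-Wilk: large p -> no evidence against normality
qqnorm(x); qqline(x)       # QQ plot with reference line
# homogeneity of variance:
y <- rnorm(30, sd = 2)
var.test(x, y)             # F test for two groups
g <- factor(rep(c("a", "b", "c"), each = 10))
bartlett.test(x ~ g)       # several groups
```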

Tests for differences in means
Student's t-test: t.test()
– One- or two-sample test
Testing whether a sample mean differs from e.g. 0
Testing whether the sample means of two groups differ
– Paired / non-paired
Are pairs of measurements associated?
– Variance homogeneous / non-homogeneous
– Assumes normally distributed data
Wilcoxon's test: wilcox.test()
– Normality not required
– Paired / non-paired
DEMO 1
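The variants above can be sketched like this on simulated data (the means 5 and 6 are arbitrary):

```r
set.seed(2)
a <- rnorm(20, mean = 5)
b <- rnorm(20, mean = 6)
t.test(a, mu = 0)                # one-sample: does the mean of a differ from 0?
t.test(a, b)                     # two-sample, Welch (variances not assumed equal)
t.test(a, b, var.equal = TRUE)   # classical two-sample t-test
t.test(a, b, paired = TRUE)      # paired version (pairs must correspond)
wilcox.test(a, b)                # Wilcoxon: normality not required
```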

Correlation
cor(x,y) calculates the correlation coefficient between two numeric variables
– close to 0: no correlation
– close to 1 (or -1): strong positive (or negative) correlation
Is the correlation significant?
– cor.test(y,x)
– Note: check also graphically!
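A small sketch with simulated data, where y is constructed to depend on x:

```r
set.seed(3)
x <- rnorm(50)
y <- 2 * x + rnorm(50)   # y built to depend linearly on x
cor(x, y)                # strong positive correlation, near 1
cor.test(x, y)           # significance test, with p-value and 95% CI
plot(x, y)               # and always inspect the pattern graphically
```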

Confidence intervals and standard errors
Typical ways of describing uncertainty in a parameter value (e.g. a mean)
– Standard error: the SE of the mean is sqrt(var(xx)/n)
– Confidence interval (95%)
The range within which the value lies with probability 95%
Normal approximation: mean(xx) ± 1.96*SE, so the 95% CI for mean(xx) is [mean(xx) - 1.96*SE(xx), mean(xx) + 1.96*SE(xx)]
If the data are not normally distributed, bootstrapping can be helpful
– Suppose we have measured age at death for 100 rats; a 95% CI for the mean age at death can be derived by
1. Take a sample of 100 rats with replacement from the original data
2. Calculate the mean
3. Repeat steps 1 & 2 many times, always recording the mean
4. The 2.5% and 97.5% quantiles of the recorded means give the 95% CI for the mean
EXERCISE TOMORROW!
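The rat example above can be sketched as follows; the age data are simulated here as a stand-in, and 2000 resamples is an arbitrary but typical choice:

```r
set.seed(4)
age <- rexp(100, rate = 1/24)    # stand-in for 100 measured ages at death
# steps 1-3: resample with replacement, record the mean, repeat many times
boot_means <- replicate(2000, mean(sample(age, replace = TRUE)))
# step 4: the 2.5% and 97.5% quantiles give the 95% CI
quantile(boot_means, c(0.025, 0.975))
# compare with the normal approximation:
mean(age) + c(-1.96, 1.96) * sd(age) / sqrt(length(age))
```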

Linear model and regression
Models the response variable through additive effects of explanatory variables
– E.g. how does the stopping distance of a car depend on its speed?
– Or how does the weight of an animal depend on its length?

The formula
Y = a + b1*x1 + … + bn*xn + ε
– Y: the response variable
– a: the intercept
– x1, …, xn: the explanatory variables
– ε: normally distributed error term, i.e. 'random noise'
Regression, ANOVA or ANCOVA?

How to interpret the model
Intercept:
– Baseline value for Y
– The value Y is expected to take if all the predictors are 0
– If one or more of the predictors are factors, this is the value predicted for the reference levels of those factors
Coefficients bn:
– If xn is a numeric variable, then increasing xn by one unit changes the value of Y by bn
– If xn is a factor, then bn takes a different value for each factor level, so that Y changes by the bn corresponding to the level of xn
Note: the reference level of x is included in the intercept

Fitting the model in R
lm(y ~ x, data = mydata), where mydata is your data frame
Formula syntax:
y ~ x        intercept + the effect of x
y ~ x - 1    no intercept
y ~ x + z    multiple regression with main effects
y ~ x * z    multiple regression with main effects and interactions
Exploring the model: summary(), anova(), plot(model)
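A minimal fit on the built-in cars dataset, as a sketch of the workflow:

```r
# Simple linear regression: stopping distance as a function of speed
fit <- lm(dist ~ speed, data = cars)
summary(fit)   # coefficients, standard errors, t-tests, R-squared
anova(fit)     # ANOVA table for the model
coef(fit)      # intercept and slope
# formula variant without an intercept:
lm(dist ~ speed - 1, data = cars)
```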

plot() command for lm objects
Produces four diagnostic figures:
1. Residuals against fitted values
2. QQ plot of the residuals
3. Scale–location plot (standardized residuals against fitted values)
4. Residuals against leverage: identifies influential outliers
Residuals should be normally distributed and show no systematic trends. If not OK, then:
-> transformation of the response: sqrt(), log(), …
-> transformations of the explanatory variables
-> should a generalized linear model be used?
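A sketch of the diagnostics, again using the cars fit; the sqrt() transformation at the end is just one possible remedy to try:

```r
fit <- lm(dist ~ speed, data = cars)
par(mfrow = c(2, 2))   # show the four diagnostic plots together
plot(fit)
# if the residuals look problematic, try a transformed response:
fit2 <- lm(sqrt(dist) ~ speed, data = cars)
```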

How to predict?
Y = a + b1*x1 + … + bn*xn
– the expected value of Y, given the values of the predictors and the estimated model parameters
In R: the predict() function.
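For example, predicting from the cars fit at two new speeds (10 and 20 are arbitrary values):

```r
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = c(10, 20))          # predictor values of interest
predict(fit, newdata = new)                   # expected stopping distances
predict(fit, newdata = new, interval = "confidence")  # with 95% CI
```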

Briefly about model selection
The aim: the simplest adequate model
– Few parameters preferred over many
– Main effects preferred over interactions
– Untransformed variables preferred over transformed
– But the model should not be oversimplified
Simplifying a model
– Are the effects of the explanatory variables significant?
– Does deletion of a term increase the residual variation significantly?
Model selection tools:
– anova(): tests the difference in residual variation between alternative models
– step(): stepwise model selection based on AIC values
DEMO 2
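The comparison of a full and a reduced model can be sketched on simulated data, where z deliberately has no true effect:

```r
set.seed(5)
d <- data.frame(x = rnorm(40), z = rnorm(40))
d$y <- 1 + 2 * d$x + rnorm(40)      # z does not influence y
full    <- lm(y ~ x * z, data = d)  # main effects + interaction
reduced <- lm(y ~ x + z, data = d)  # main effects only
anova(reduced, full)                # does dropping the interaction hurt?
step(full)                          # stepwise simplification by AIC
```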