Statistical Data Analysis 2010/2011 M. de Gunst Lecture 10.


Topics
- Summarizing data
- Investigating distributions
- Bootstrap
- Robust methods
- Nonparametric tests
- Analysis of categorical data
- Multiple linear regression (continued)

Multiple linear regression (Reader: Chapter 8)

Relationship between one response variable and one or more explanatory variables.

Last time:
- Statistical model
- Parameter estimation
- Selection of explanatory variables (coefficient of determination, F- and t-tests)
- Model quality: global methods and diagnostics (plots)

This week: further investigation of model quality
- Deviating observation points: outlier, leverage point (potential), influence point
  - plots, numerical measures and tests: test for outliers, hat matrix, Cook's distance
- Explanatory variables that are themselves linearly related (collinearity)
  - plots and numerical measures: variance inflation factors, condition indices, variance decomposition

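Statistical model

Multiple linear regression model:
$Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + e_i$, $i = 1, \dots, n$,
with $e_1, \dots, e_n$ independent and normally distributed, $e_i \sim N(0, \sigma^2)$.

Issues:
1) estimate $\beta$ and $\sigma^2$
2) select explanatory variables
3) assess model quality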
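As an illustration of issue 1, the following is a minimal sketch of least-squares estimation with NumPy. The data are synthetic and the names (`X`, `y`, `beta_hat`, `sigma2_hat`) are chosen here for illustration; they do not come from the lecture or the Reader.

```python
import numpy as np

# Synthetic data (an assumption for illustration): n points, p explanatory variables.
rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept column + p variables
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Least-squares estimate of beta.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals and the usual unbiased estimate of sigma^2 with n - p - 1 degrees of freedom.
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p - 1)

print("beta_hat:", beta_hat)
print("sigma2_hat:", sigma2_hat)
```

Using `lstsq` instead of forming the inverse of `X.T @ X` explicitly is a design choice that keeps the computation stable when the columns of `X` are nearly collinear.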

3) Assessment of model quality: deviating points

Consider observation point $(y_i, x_{i1}, \dots, x_{ip})$.

Types of deviating observation points:
- deviating response: outlier
- deviating explanatory variable: potential or leverage point
- if the point has influence: influence point

How to detect:
- outlier: test for outliers
- leverage point: hat matrix
- influence point: Cook's distance

Example: outlier

Forbes' data: boiling temperature for different pressures.
- A small deviation in the response may have a large effect
- Generally easy to detect in plots

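3) Assessment of model quality: outliers

Outlier: deviating response.

How to detect?
- Make plots. Which ones?
- If possible outliers are detected, do a formal test.

Idea: if the k-th point is an outlier, then it fits the regression model up to a shift $\delta$, i.e. it fits the mean shift outlier model
$Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + e_i$ for $i \neq k$,
$Y_k = \beta_0 + \beta_1 x_{k1} + \dots + \beta_p x_{kp} + \delta + e_k$
for sufficiently large $|\delta|$, or in matrix notation
$Y = X\beta + \delta u + e$, with $u$ s.t. $u_k = 1$ and $u_i = 0$ for $i \neq k$.

When is the k-th point an outlier in terms of $\delta$? How to test?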

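3) Assessment of model quality: outliers

Outlier: deviating response.

If the k-th point is an outlier, then it fits the mean shift outlier model
$Y = X\beta + \delta u + e$, with $u$ s.t. $u_k = 1$ and $u_i = 0$ for $i \neq k$,
for sufficiently large $|\delta|$.

When is the k-th point an outlier in terms of $\delta$?
If $\delta$ is significantly different from 0, then the k-th point is an outlier.

Test for outlier:
- $H_0$: $\delta = 0$, $\beta$ arbitrary
- $H_1$: $\delta \neq 0$, $\beta$ arbitrary (note: one-sided in the Reader)
- Test statistic: the t-statistic for $\delta$ in the mean shift outlier model, $T = \hat\delta / \widehat{\mathrm{se}}(\hat\delta) \sim t_{n-p-2}$ under $H_0$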
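The test can be sketched numerically by fitting the mean shift outlier model directly: add an indicator column for the k-th point and take the t-statistic of its coefficient. This is a minimal sketch on synthetic data; the function name `outlier_t_statistic` and the data are hypothetical, and the degrees of freedom assume a design matrix with an intercept and p explanatory variables.

```python
import numpy as np

def outlier_t_statistic(X, y, k):
    """t-statistic for delta in the mean shift outlier model for point k.

    X is the n x (p+1) design matrix including the intercept column.
    Returns the statistic and its degrees of freedom (n - p - 2).
    """
    n = X.shape[0]
    u = np.zeros(n)
    u[k] = 1.0                              # indicator of the k-th point
    X_ext = np.column_stack([X, u])         # extended design matrix [X, u]
    coef, *_ = np.linalg.lstsq(X_ext, y, rcond=None)
    resid = y - X_ext @ coef
    df = n - X_ext.shape[1]
    s2 = resid @ resid / df                 # variance estimate in the extended model
    cov = s2 * np.linalg.inv(X_ext.T @ X_ext)
    delta_hat = coef[-1]
    return delta_hat / np.sqrt(cov[-1, -1]), df

# Synthetic data (assumption): point k = 5 gets an artificial shift in the response.
rng = np.random.default_rng(1)
n, p, k = 30, 2, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)
y[k] += 3.0

t_k, df = outlier_t_statistic(X, y, k)
print("t statistic:", t_k, "degrees of freedom:", df)
```

The statistic would then be compared with a quantile of the t-distribution with `df` degrees of freedom. When every point is tested in turn, some multiple-testing correction is usually advisable.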

Example: leverage point

Huber's data:
- A small deviation in an explanatory variable may have a large effect
- Often difficult to detect in plots: the point lies on the edge of the range of values, and its residual is often not large

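3) Assessment of model quality: leverage points

Potential or leverage point: deviating explanatory variable.

How to detect? With the hat matrix
$H = X (X^T X)^{-1} X^T$, which stems from $\hat{Y} = X \hat\beta = H Y$.

Properties of H: $H^T = H$ and $H^2 = H$, so $h_{ii} = \sum_j h_{ij}^2$ and $0 \le h_{ii} \le 1$; if $h_{ii}$ is large, then the other $h_{ij}$ are small.

We see $\hat{y}_i = h_{ii} y_i + \sum_{j \neq i} h_{ij} y_j$ and $\mathrm{Var}(y_i - \hat{y}_i) = \sigma^2 (1 - h_{ii})$.

Hence, if $h_{ii}$ is large, then the i-th point has potential influence.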
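A minimal sketch of how the diagonal of the hat matrix might be computed and inspected. The data are synthetic and the first point is deliberately placed far from the others in the explanatory variable; none of this comes from Huber's data.

```python
import numpy as np

# Synthetic data (assumption): one explanatory variable, first point far from the rest.
rng = np.random.default_rng(2)
n = 20
x = rng.normal(size=n)
x[0] = 8.0
X = np.column_stack([np.ones(n), x])   # intercept column + explanatory variable

# Hat matrix H = X (X^T X)^{-1} X^T, computed with a linear solve instead of an inverse.
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                          # leverages h_ii

print("largest leverage at index", int(h.argmax()), "with h_ii =", h.max())
print("sum of leverages (equals the number of columns of X):", h.sum())
```

Since the leverages sum to the number of columns of `X`, their average is (p+1)/n. A common rule of thumb, not taken from the lecture, flags points whose leverage is more than two or three times this average.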

3) Assessment of model quality: influence points

Influence point: a point that has influence on the fit.

How to detect?
- Check whether the point is an outlier or a leverage point.
- If yes, fit the model with and without this point.
- If the results are very different, the point is an influence point.

Measure based on the difference between the estimated betas: Cook's distance for the i-th point,
$D_i = \dfrac{(\hat\beta_{(i)} - \hat\beta)^T X^T X (\hat\beta_{(i)} - \hat\beta)}{(p+1)\,\hat\sigma^2}$,
where $\hat\beta_{(i)}$ is the parameter estimate without the i-th point.

If $D_i$ is larger than 1 (roughly), then the i-th point is an influence point.

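3) Assessment of model quality: influence points

Measure of influence based on the difference between the estimated betas: Cook's distance for the i-th point,
$D_i = \dfrac{(\hat\beta_{(i)} - \hat\beta)^T X^T X (\hat\beta_{(i)} - \hat\beta)}{(p+1)\,\hat\sigma^2}$,
where $\hat\beta_{(i)}$ is the parameter estimate without the i-th point.

If $D_i$ is larger than 1 (roughly), then the i-th point is an influence point.

Explanation:
- The set $\{ b : (b - \hat\beta)^T X^T X (b - \hat\beta) \le (p+1)\,\hat\sigma^2\, F_{p+1,\,n-p-1;\,1-\alpha} \}$ is a confidence region with confidence $1 - \alpha$ for the parameter vector $\beta$.
- Thus $(b - \hat\beta)^T X^T X (b - \hat\beta) / ((p+1)\,\hat\sigma^2)$ defines a measure of distance of $b$ from $\hat\beta$.
- For choices of $\alpha$ around 0.5, the values of $b$ outside this set lie "far away" from $\hat\beta$.
- For choices of $\alpha$ around 0.5, the boundary of the set, $F_{p+1,\,n-p-1;\,1-\alpha}$, has a value around 1.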
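The definition can be turned into code by refitting the model without each point in turn. The sketch below does exactly that on synthetic data; the function name `cooks_distances` and the data are hypothetical, and the divisor uses the number of columns of the design matrix (p + 1 with an intercept).

```python
import numpy as np

def cooks_distances(X, y):
    """Cook's distance for every point, computed by leave-one-out refitting."""
    n, q = X.shape                       # q = p + 1 columns, including the intercept
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - q)         # estimate of sigma^2 from the full fit
    XtX = X.T @ X
    D = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)  # estimate without point i
        diff = beta_i - beta_hat
        D[i] = diff @ XtX @ diff / (q * s2)
    return D

# Synthetic data (assumption): point 0 is a leverage point with a deviating response.
rng = np.random.default_rng(3)
n = 20
x = rng.normal(size=n)
x[0] = 8.0
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[0] -= 10.0

D = cooks_distances(X, y)
print("points with D_i > 1:", np.flatnonzero(D > 1))
```

Refitting n times is the most literal reading of the definition. An equivalent closed form, $D_i = e_i^2 h_{ii} / ((p+1)\,\hat\sigma^2 (1 - h_{ii})^2)$ with residual $e_i$ and leverage $h_{ii}$, avoids the loop.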

Example: influence points

Cook's distances for different data sets.

3) Assessment of model quality: collinearity

Collinearity: explanatory variables that are themselves linearly related.
Numerical measures: variance inflation factors, condition indices, variance decomposition.

When is it a problem?
- If the variance of one or more estimators is large, then the estimate(s) are not reliable.

How to detect?
- Known methods: scatter plots, correlation coefficients (between pairs of variables), coefficient of determination of $X_j$ on the others (= squared multiple linear correlation coefficient between $X_j$ and the others)
- Plus several new numerical measures

3) Assessment of model quality: collinearity

The columns $X_1, \dots, X_q$ of a matrix X are exactly collinear if $c_1 X_1 + \dots + c_q X_q = 0$ for some constants $c_1, \dots, c_q$ not all equal to 0.

- If there are one or more collinearities in a (general) matrix X, then rank(X) is not maximal and $(X^T X)^{-1}$ does not exist.
- With approximate collinearities, $(X^T X)^{-1}$ is difficult to compute.

In the design matrix X, one or more (approximate) collinearities can exist between its columns. In that case $\hat\beta = (X^T X)^{-1} X^T Y$ is difficult to compute and/or one or more $\mathrm{Var}(\hat\beta_j)$ may be large.
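A small numerical illustration of exact collinearity, using synthetic columns chosen here for the purpose: when one column is a linear combination of others, the rank of the design matrix drops and its smallest singular value is zero up to rounding error.

```python
import numpy as np

# Synthetic design matrix (assumption): x3 is an exact linear combination of x1 and x2.
rng = np.random.default_rng(4)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2.0 * x1 - x2                      # exact collinearity: 2*x1 - x2 - x3 = 0
X = np.column_stack([np.ones(n), x1, x2, x3])

rank = np.linalg.matrix_rank(X)
smallest_singular_value = np.linalg.svd(X, compute_uv=False).min()

print("columns:", X.shape[1], "rank:", rank)
print("smallest singular value:", smallest_singular_value)
```

With only approximate collinearity the rank stays maximal, but the smallest singular value is tiny relative to the largest, which is what makes the inverse of `X.T @ X` numerically delicate.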

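3) Assessment of model quality: collinearity

How to detect collinearity:
- scatter plots, correlation coefficients (between pairs of variables), coefficient of determination $R_j^2$ of $X_j$ on all others (= squared multiple linear correlation coefficient between $X_j$ and all others)
- 4 new numerical measures

i) Variance inflation factors
$\mathrm{VIF}_j = \dfrac{1}{1 - R_j^2}$,
because $\mathrm{Var}(\hat\beta_j) = \dfrac{\sigma^2}{(1 - R_j^2) \sum_i (x_{ij} - \bar{x}_j)^2}$.

$\mathrm{VIF}_j$ is the amount of increase in the variance of $\hat\beta_j$ due to the relationship between $X_j$ and all others. If $\mathrm{VIF}_j$ is large, then the estimate $\hat\beta_j$ is unreliable.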
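The variance inflation factors can be sketched by regressing each explanatory variable on all the others and converting the resulting coefficient of determination. The function name `variance_inflation_factors` and the synthetic data below are illustrative assumptions.

```python
import numpy as np

def variance_inflation_factors(Z):
    """VIF_j = 1 / (1 - R_j^2) for each column of Z.

    Z is the n x p matrix of explanatory variables without the intercept column;
    R_j^2 comes from regressing column j on an intercept and all other columns.
    """
    n, p = Z.shape
    vifs = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(Z, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, Z[:, j], rcond=None)
        resid = Z[:, j] - others @ coef
        total = np.sum((Z[:, j] - Z[:, j].mean()) ** 2)
        r2 = 1.0 - resid @ resid / total
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic data (assumption): z2 is nearly a copy of z1, z3 is unrelated.
rng = np.random.default_rng(5)
n = 100
z1 = rng.normal(size=n)
z2 = z1 + 0.1 * rng.normal(size=n)
z3 = rng.normal(size=n)
Z = np.column_stack([z1, z2, z3])

vifs = variance_inflation_factors(Z)
print("VIFs:", vifs)
```

The same values appear on the diagonal of the inverse of the correlation matrix of the explanatory variables, which gives a quick cross-check.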

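3) Assessment of model quality: collinearity

How to detect collinearity:

ii) Condition number (read in Reader)

iii) Condition indices
Makes use of the singular value decomposition $X = U D V^T$, with $U^T U = V^T V = I$ and $D = \mathrm{diagonal}(\mu_1, \dots, \mu_{p+1})$, where the $\mu_k \ge 0$ are the singular values of X.

k-th condition index: $\kappa_k = \mu_{\max} / \mu_k$.

If $\mu_k$ is small, thus $\kappa_k$ large, this indicates collinearity, because then $X v_k = \mu_k u_k \approx 0$, where $v_k$ and $u_k$ are the k-th columns of V and U. If $v_{jk}$ is not too small, then $X_j$ is involved in the collinearity.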

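3) Assessment of model quality: collinearity

How to detect collinearity:

iv) Variance decomposition proportions
$\pi_{kj} = \dfrac{v_{jk}^2 / \mu_k^2}{\sum_l v_{jl}^2 / \mu_l^2}$,
because (from the s.v.d.) $\mathrm{Var}(\hat\beta_j) = \sigma^2 \sum_k v_{jk}^2 / \mu_k^2$.

- If $\kappa_k$ is large, then investigate which terms are involved via the $\pi_{kj}$.
- Write the $\pi_{kj}$ in a matrix and look in the row of a large $\kappa_k$ (= small $\mu_k$) for the $\pi_{kj}$ that are close to 1.
- The corresponding $X_j$ are involved in the collinearity.
- Easier to see than with method (iii).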
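Methods (iii) and (iv) can be sketched together from a single singular value decomposition. In this sketch the columns of the design matrix are first scaled to unit length, a common convention that is an assumption here and may differ from the Reader. The function name `collinearity_diagnostics` and the data are hypothetical.

```python
import numpy as np

def collinearity_diagnostics(X):
    """Condition indices and variance decomposition proportions of X.

    Returns (cond_idx, props), where props[k, j] is the proportion of
    Var(beta_hat_j) associated with the k-th singular value.
    """
    Xs = X / np.linalg.norm(X, axis=0)          # scale columns to unit length (assumption)
    U, mu, Vt = np.linalg.svd(Xs, full_matrices=False)
    cond_idx = mu.max() / mu                    # k-th condition index
    phi = (Vt.T ** 2) / mu ** 2                 # phi[j, k] = v_jk^2 / mu_k^2
    props = (phi / phi.sum(axis=1, keepdims=True)).T
    return cond_idx, props

# Synthetic data (assumption): z2 is nearly a copy of z1, z3 is unrelated.
rng = np.random.default_rng(6)
n = 100
z1 = rng.normal(size=n)
z2 = z1 + 0.05 * rng.normal(size=n)
z3 = rng.normal(size=n)
X = np.column_stack([np.ones(n), z1, z2, z3])

cond_idx, props = collinearity_diagnostics(X)
k = int(np.argmax(cond_idx))                    # row with the largest condition index
print("condition indices:", np.round(cond_idx, 1))
print("proportions in row of largest index:", np.round(props[k], 3))
```

In the row belonging to the largest condition index, the columns whose proportions are close to 1 point to the variables involved in the collinearity, here the two nearly identical ones.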

3) Assessment of model quality: collinearity

Solutions for collinearity: no general guideline exists.

Sometimes:
- leave out one or more explanatory variables
- scale explanatory variables
- center explanatory variables
(note: a variable may lose its meaning)

Always:
- try to find an explanation; this may lead to the right choice

3) Assessment of model quality: example

Now: example with body fat data (different document).