Modeling in R Sanna Härkönen.


Model fitting: simple linear model

Important measures:
- Correlation r
- Coefficient of determination R2
- p-values
- Residuals (examining their distribution)

PEARSON CORRELATION r
- Measures the linear relationship between two variables
- Even when the correlation is low, there can be a strong non-linear relationship between the variables
- Can be positive or negative depending on the relationship (ranges from -1 to 1)
- Equation: r = Σ(x_i - x̄)(y_i - ȳ) / sqrt( Σ(x_i - x̄)² · Σ(y_i - ȳ)² )
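A minimal sketch of computing the Pearson correlation in R, both with the built-in cor() and directly from the definition (the diameter and height vectors here are made up for illustration):

```r
# Hypothetical vectors: tree diameter (d) and height (h)
d <- c(10, 14, 18, 22, 26, 30)
h <- c(8, 11, 15, 16, 20, 23)

# Pearson correlation with the built-in cor()
cor(d, h)

# The same value computed directly from the definition
r <- sum((d - mean(d)) * (h - mean(h))) /
  sqrt(sum((d - mean(d))^2) * sum((h - mean(h))^2))
r
```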

EXAMPLE (Anscombe's quartet): SAME CORRELATION (0.816), BUT DIFFERENT RELATIONSHIPS. A linear fit is appropriate for only one of the four datasets: http://en.wikipedia.org/wiki/File:Anscombe%27s_quartet_3.svg
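Anscombe's quartet ships with R as the data frame `anscombe`, so the point is easy to verify yourself: all four x-y pairs have nearly the same correlation (~0.816), yet their scatter plots look completely different:

```r
# Correlations of the four x-y pairs in Anscombe's quartet
sapply(1:4, function(i)
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))

# Plot all four pairs side by side
op <- par(mfrow = c(2, 2))
for (i in 1:4)
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
par(op)
```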

REGRESSION ANALYSIS
- Examines the relationships between variables
- Dependent variable: the variable that is explained by the independent variable(s)
- Coefficient of determination R2 = r², where r is the correlation
- For example, if D is expressed as a function of H, then D is the dependent variable and H is the independent variable.

SIMPLE LINEAR REGRESSION
- Fits a linear regression line between two variables: y = β0 + β1*x + ε (y is the dependent (= response) variable, x is the independent (= predictor) variable, β0 is the intercept, β1 is the slope and ε is the random error)
- Method: least squares regression, where the regression line is fitted so that the sum of squared residuals, Σ(measured y - modeled y)², is minimized
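The least squares estimates can be sketched by hand and compared with lm(); the x and y values below are made up for illustration:

```r
# Small made-up data set
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Closed-form least squares estimates of slope (b1) and intercept (b0)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)

# lm() minimizes the same sum of squared residuals,
# so it returns the same coefficients
coef(lm(y ~ x))
```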

INTERPRETATION: r and R2

|r|    R2     Relationship
0.0    0.00   non-significant
0.4    0.16   moderate
0.6    0.36   remarkable
0.8    0.64   strong
1.0    1.00   (perfect)

Examples: if H explains ~14% of the variation in D, the fit is poor; if H explains ~90% of the variation in D, the fit is very good.

FITTING A SIMPLE LINEAR MODEL
1. Import the data into R (command read.csv())
2. Examine summary statistics of your variables (summary() command in R)
3. Examine the relationships between the variables by plotting them (plot() command in R)
4. If you see a linear relationship between the dependent variable and the explanatory variables -> you can fit a linear model. If the relationship is not linear, you can first try to linearize it by transforming the variable(s) (e.g. logarithm, exponential, ...) and then apply linear regression to the transformed values.
5. Fit the linear model in R: command lm(y ~ x), where y is the dependent and x the independent variable
6. Examine the results of the regression (significance of variables, R2 etc.) using the summary() command
7. Examine the residuals
[Figures: a non-linear relationship of X and Y; a linear relationship of X and exp(Y)]
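The steps above can be sketched end to end. Since no data file accompanies this transcript, the block below generates a synthetic stand-in data set (the column names D and TOTAL_VOLUME are hypothetical); in practice step 1 would be a read.csv() call:

```r
# Synthetic stand-in data (in practice: a <- read.csv("your_file.csv"))
set.seed(1)
a <- data.frame(D = runif(50, 10, 40))               # diameter, hypothetical
a$TOTAL_VOLUME <- 0.05 * a$D^2 + rnorm(50, sd = 5)   # volume, hypothetical

summary(a)                   # step 2: summary statistics
plot(a$D, a$TOTAL_VOLUME)    # step 3: does the relationship look linear?

mod <- lm(TOTAL_VOLUME ~ D, data = a)  # step 5: fit the linear model
summary(mod)                 # step 6: significance, R2, etc.
```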

Summary statistics of dataset "a": summary(a)

Plotting plot(a$D, a$TOTAL_VOLUME)

plot(a$BA, a$TOTAL_VOLUME). Is there a need to linearize the relationship?

R example: BUILDING LINEAR MODEL in R Building linear model for basal area (BA1) as a function of height (H1)
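A minimal sketch of this model, assuming a data frame `a` with columns H1 (height) and BA1 (basal area); since the course data is not included here, the values are simulated:

```r
# Hypothetical data: height H1 and basal area BA1
set.seed(2)
a <- data.frame(H1 = runif(40, 5, 30))
a$BA1 <- 0.13 + 0.12 * a$H1 + rnorm(40, sd = 0.3)

# Basal area as a function of height
ba_mod <- lm(BA1 ~ H1, data = a)
summary(ba_mod)
```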

RESULTS OF REGRESSION ANALYSIS in R
- Summary statistics of the residuals (= original_y - modeled_y)
- Intercept and slope of the model -> Y = 0.126937 + 0.117584*X
- Standard errors of the estimates
- t-values (estimate/SE) and their p-values: indicate whether a variable is significant at a given significance level
- Degrees of freedom: sample size minus the number of estimated parameters in the model
- F-test value and its p-value: indicate whether the independent variables in the model are capable of explaining the dependent variable
- R-squared: R2
- Adjusted R-squared: takes into account the number of variables in the model; it is used when comparing regression models with different numbers of variables
- How to interpret a p-value: <0.01 very significant (1% level); <0.05 significant (5% level); >0.05 not significant
- Residual standard error: sqrt(sum((mod_y - orig_y)^2) / (n - 2))
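The quantities listed above can also be extracted programmatically from a summary.lm object rather than read off the printed output; the fitted model below uses made-up data for illustration:

```r
# Fit a simple model on synthetic data
set.seed(5)
x <- 1:20
y <- 0.13 + 0.12 * x + rnorm(20, sd = 0.2)
s <- summary(lm(y ~ x))

coef(s)           # estimates, standard errors, t-values, p-values
s$r.squared       # R2
s$adj.r.squared   # adjusted R2
s$sigma           # residual standard error
s$df[2]           # residual degrees of freedom (here n - 2 = 18)
```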

Residuals: important to check after model fitting. Residuals = measured Y - modeled Y.

Interpreting Residual Plots [figure: panels of residuals plotted against a variable]
- Residuals should look like this (random scatter, no pattern)
- Variable transformation required
- Outliers
- Non-constant variance and outliers
- Variable Xj should be included in the model
[1] from: VANCLAY, J. 1994. "Modelling Forest Growth and Yield: Applications to Mixed Tropical Forests". CAB International. (Blas Mola's slides)

Residuals: Y_measured - Y_modeled
If the model is good, the residuals should:
- be homoscedastic, i.e. show no trend with x
- follow a normal distribution
The R command plot(your_model) (the plot.lm method) can be used for examining residuals:
- Upper figure: the residuals should be equally distributed around the zero line. In the example figure, however, there seems to be a decreasing trend in the residuals -> not good.
- Lower figure (normal Q-Q plot): all the residuals would lie on the straight line if they followed a normal distribution. In the example figure they do not seem to follow a normal distribution completely.
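These residual checks can be sketched as follows; the data here is simulated so that the model is correctly specified and the diagnostics should look healthy:

```r
# Simulated data where a linear model is appropriate
set.seed(3)
x <- runif(60, 0, 10)
y <- 2 + 0.5 * x + rnorm(60)
mod <- lm(y ~ x)

res <- resid(mod)                      # measured y - modeled y
plot(fitted(mod), res)                 # should show no trend...
abline(h = 0)                          # ...around the zero line
qqnorm(res); qqline(res)               # near the line -> roughly normal

plot(mod)                              # standard lm diagnostic panels
```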

EXAMPLE

Exercises in GROUPS: which is the best model? Which is the worst? WHY?

R examples
- Multiple regression: lm(volume ~ height + diameter + basal_area) (note that variable names in R cannot contain spaces)
- Using dummy variables for categorical predictors (e.g. species, forest type, etc.): lm(volume ~ height + factor(tree_species))
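A runnable sketch of the dummy-variable case, using simulated data with hypothetical column names (VOLUME, H, FOREST_TYPE):

```r
# Simulated data with a categorical forest type
set.seed(4)
a <- data.frame(H = runif(60, 5, 30),
                FOREST_TYPE = sample(1:3, 60, replace = TRUE))
a$VOLUME <- 10 + 2 * a$H + 5 * (a$FOREST_TYPE == 2) + rnorm(60)

# factor() makes R create a 0/1 indicator for each forest type
# except the first, which serves as the baseline category
mod <- lm(VOLUME ~ H + factor(FOREST_TYPE), data = a)
summary(mod)
```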

Total volume as function of H

TOTAL VOLUME as function of H and BA

Total volume as a function of H, BA and forest type (dummy)
Interpretation of the output when a dummy variable is used: forest types 1-7 are present. Forest type 1 is the "base" category (it has no indicator term). If the forest type is 2 -> the indicator factor(a$FOREST_TYPE)2 equals 1, so its coefficient estimate 13.097745 is added to the prediction; the indicators for all other forest types are then 0. The other forest types work in the same way.

Interpret these R summaries of the model fits. Write down the equations (y = a + b*x) of both models. Which model is better? Are the intercept and slope significant in both models? Are both models capable of estimating the desired variable? What else would you need to check when assessing model goodness?