Models to Represent the Relationships Between Variables (Regression)

Models to Represent the Relationships Between Variables (Regression)
Learning Objectives
Develop a model to estimate an output variable from input variables.
Select from a variety of modeling approaches in developing a model.
Quantify the uncertainty in model predictions.
Use models to provide forecasts or predictions for inputs different from any previously observed.

Readings
Kottegoda and Rosso, Chapter 6
Helsel and Hirsch, Chapters 9 and 11
Hastie, Tibshirani and Friedman, Chapters 1-2
Matlab Statistics Toolbox User's Guide, Chapter 4

Regression
The use of mathematical functions to model and investigate the dependence of one variable, say Y, called the response variable, on one or more other observed variables, say X, known as the explanatory variables.
Do not search for cause-and-effect relationships without prior knowledge: regression describes association, not causation.
Regression is an iterative process: formulate, fit, evaluate, validate.

A Rose by any other name...
Explanatory variable: independent variable, x-value, predictor, input, regressor.
Response variable: dependent variable, y-value, predictand, output.

The modeling process
Data gathering and exploratory data analysis.
Conceptual model development (hypothesis formulation).
Applying various forms of models to see which relationships work and which do not: parameter estimation, diagnostic testing, interpretation of results.

Conceptual model of system to guide analysis
[Diagram: natural drivers (climate states such as ENSO, PDO, NAO; other climate variables such as temperature and humidity; rainfall) and management actions (groundwater pumping, surface water withdrawals, surface water releases from storage) jointly drive groundwater level and streamflow.]

Conceptual Model
[Diagram: solar radiation, precipitation, air humidity, and air temperature drive mountain snowpack, evaporation, soil moisture and groundwater, salinity, and streamflow, which together determine GSL level, volume, and area.]

Bear River Basin Macro-Hydrology
Streamflow response to basin and annual average forcing.
[Plots with LOWESS smooths (R defaults): streamflow Q/A (mm) versus precipitation (mm) and versus temperature (°C); runoff ratios of 0.18 and 0.10 are marked.]

Annual Evaporation Loss
[Plot with LOWESS smooth (R defaults): E/A (m) versus area (m²).]
Salinity decreases as volume increases; E increases as salinity decreases.

Evaporation vs Salinity
[Plot with LOWESS smooth (R defaults): E/A (m) versus salinity C = 3.5 x 10^12 kg / Volume (g/l).]
Salinity, estimated from total load and volume, is related to the decrease in E/A as lake volume decreases and C increases.

Evaporation vs Temperature (Annual)
[Plot with LOWESS smooth (R defaults): E/A (m) versus temperature (°C).]

Conclusions
[Diagram, annotating the conceptual model with directions of influence: precipitation supplies mountain snowpack, soil moisture and groundwater, and streamflow, which contribute to GSL level, volume, and area; air temperature increases evaporation; salinity (C = L/V) reduces evaporation; area is the dominant control on total evaporative loss.]

Considerations in Model Selection
Choice of complexity in the functional relationship: theoretically infinite choice of type of functional relationship; classes of functional relationships.
Interplay between bias, variance, and model complexity.
Generality/transferability: prediction capability on independent test data.

Model Selection Choices Example: Complexity, Generality, Transferability
R commands:
# Data setup
par(pch=20,cex=1.5)
x=rnorm(10)
y=x+rnorm(10,0,.2)
# Plot 1: scatterplot of the data
plot(x,y)
# Plot 2: scatterplot with the true line y = x overlaid
plot(x,y)
abline(0,1)
# Plot 3: interpolating line through the points, ordered by x
plot(x[order(x)],y[order(x)],type="o")

Interpolation

Functional Fit

How do we quantify the fit to the data?
Residual (e_i): difference between the fit f(x_i) and the observed value y_i, that is, e_i = f(x_i) - y_i.
Residual Sum of Squared Error: RSS = Σ_i e_i² = Σ_i (f(x_i) - y_i)².
[Figure: an interpolator passes through every point, so RSS = 0; a functional fit leaves RSS > 0.]
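To make these definitions concrete, both quantities can be computed for a fitted line in a few lines of R. A minimal sketch, reusing the simulated x and y from the earlier slide and taking the least-squares line as the fitted function f (an illustrative assumption, not the slide's figure):
# fit a least-squares line to serve as the function f
fit = lm(y ~ x)
# residuals as defined above: e_i = f(x_i) - y_i
e = fitted(fit) - y
# residual sum of squared error
RSS = sum(e^2)
RSS   # > 0 for a functional fit; an exact interpolator would give 0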

Interpolation or function fitting?
Which has the smallest fitting error? Is this a valid measure?
Each is useful for its own purpose. Selection may hinge on considerations outside the data, such as the nature and purpose of the model and understanding of the process it represents.

Another Example
R commands:
# Use the same x values
y2=cos(x*2)+rnorm(10,0,.2)
plot(x,y2)
# Overlay the true cyclical function
xx=seq(-2,2,.2)
yy=cos(xx*2)
lines(xx,yy)
# Interpolating line through the points
plot(x,y2,ylim=c(-1,1),xlim=c(-2,2))
lines(x[order(x)],y2[order(x)])

Interpolation

Functional Fit - Linear

Is a linear approximation appropriate? Which is better?

The actual functional relationship (random noise added to a cyclical function).

Another example of two approaches to prediction
Linear model fit by least squares
Nearest neighbor
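A brief sketch of what the two predictors look like in R, using the simulated x and y2 from the slides above; the small knn_predict helper is hypothetical, written here to avoid extra packages:
# global linear fit by least squares
linfit = lm(y2 ~ x)
# locally constant fit: average the y values of the k nearest x's
knn_predict = function(x0, xtrain, ytrain, k) {
  idx = order(abs(xtrain - x0))[1:k]
  mean(ytrain[idx])
}
xg = seq(-2, 2, 0.1)
yhat_lin = predict(linfit, data.frame(x = xg))
yhat_knn = sapply(xg, knn_predict, xtrain = x, ytrain = y2, k = 3)
plot(x, y2)
lines(xg, yhat_lin)        # global linear prediction
lines(xg, yhat_knn, lty=2) # local nearest-neighbor prediction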

General function fitting

General function fitting – Independent data samples
[Table: independent data vectors, each with inputs x1, x2, x3, ... and output y.]
Example: linear regression, y = a x + b + ε, where ε is the random error term.

Statistical decision theory
X: inputs, p-dimensional, real valued.
Y: real-valued output variable.
Joint distribution Pr(X,Y).
Seek a function f(X) for predicting Y given X.
Loss function to penalize errors in prediction, e.g.:
L(Y, f(X)) = (Y - f(X))², squared error
L(Y, f(X)) = |Y - f(X)|, absolute error

Criterion for choosing f
Minimize expected loss, e.g. E[L] = E[(Y - f(X))²], which gives f(x) = E[Y | X = x].
This conditional expectation is known as the regression function. It is the best prediction of Y at any point X = x, when "best" is measured by average squared error.
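The step from the expected-loss criterion to the conditional expectation can be written out; a short derivation of this standard result (the conditioning step is the only move needed):
\[
\mathrm{E}\big[(Y - f(X))^2\big] = \mathrm{E}_X\,\mathrm{E}_{Y|X}\big[(Y - f(X))^2 \mid X\big]
\]
so it suffices to minimize pointwise in c = f(x):
\[
f(x) = \arg\min_c \mathrm{E}_{Y|X}\big[(Y - c)^2 \mid X = x\big] = \mathrm{E}[Y \mid X = x],
\]
since for fixed x the inner expectation is quadratic in c and is minimized at the conditional mean.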

Basis for nearest neighbor method
Expectation approximated by averaging over sample data.
Conditioning at a point relaxed to conditioning on some region close to the target point.
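In symbols, these two relaxations give the standard kNN regression estimate (as in Hastie et al.):
\[
\hat f(x) = \mathrm{Ave}\big( y_i \mid x_i \in N_k(x) \big)
\]
where N_k(x) is the neighborhood containing the k training points closest to x.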

Basis for linear regression
Model based: assumes a model f(x) = a + b x.
Plug f(X) into the expected loss: E[L] = E[(Y - a - b X)²].
Solve theoretically for the a and b that minimize this.
This does not condition on X; rather, it uses (assumed) knowledge of the functional relationship to pool over values of X.
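Carrying out that minimization gives the familiar closed form, which can be checked against R's built-in fit; a minimal sketch using the simulated x and y2 from the earlier slides:
# closed-form least-squares estimates: b = Cov(X,Y)/Var(X), a = mean(Y) - b*mean(X)
b = cov(x, y2) / var(x)
a = mean(y2) - b * mean(x)
c(a, b)
coef(lm(y2 ~ x))   # the same two numbers from lm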

Comparison of assumptions
Linear model fit by least squares: assumes f(x) is well approximated by a global linear function.
k nearest neighbor: assumes f(x) is well approximated by a locally constant function.

Mean((y - ŷ)²) = 0.0459   Mean((f(x) - ŷ)²) = 0.00605

k=20   Mean((y - ŷ)²) = 0.0408   Mean((f(x) - ŷ)²) = 0.00262

k=60   Mean((y - ŷ)²) = 0.0661   Mean((f(x) - ŷ)²) = 0.0221

50 sets of samples generated. For each, ŷ was calculated at specific x0 values for the linear fit and the kNN fit.
MSE = Variance + Bias²
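A sketch of that experiment in R for the linear fit; the true function, noise level, and sample size are assumptions chosen to match the earlier simulated example:
f = function(x) cos(2 * x)           # assumed true function
x0 = 0.5                             # a specific x0 value
yhat = replicate(50, {               # 50 sets of samples
  x = rnorm(10)
  y = f(x) + rnorm(10, 0, 0.2)
  predict(lm(y ~ x), data.frame(x = x0))   # linear-fit prediction at x0
})
bias2 = (mean(yhat) - f(x0))^2       # squared bias at x0
v = var(yhat)                        # variance at x0
mse = mean((yhat - f(x0))^2)         # approximately v + bias2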

Dashed lines from linear regression

Dashed lines from linear regression

Simple Linear Regression Model Kottegoda and Rosso page 343
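The model on that page is the usual simple linear regression, Y = a + b X + ε with independent normal errors; a minimal R sketch of fitting and summarizing it (the data here are simulated for illustration, not the slide's data):
set.seed(1)
X = rnorm(30)
Y = 2 + 0.5 * X + rnorm(30, 0, 0.3)   # hypothetical data
model = lm(Y ~ X)
summary(model)   # intercept and slope estimates, standard errors, R-squared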

Regression is performed to:
learn something about the relationship between variables;
remove a portion of the variation in one variable (a portion that may not be of interest) in order to gain a better understanding of some other, more interesting, portion;
estimate or predict values of one variable based on knowledge of another variable.
Helsel and Hirsch page 222

Regression Assumptions Helsel and Hirsch page 225

Regression Diagnostics - Residuals Kottegoda and Rosso page 350

Regression Diagnostics - Antecedent Residual Kottegoda and Rosso page 350

Regression Diagnostics - Test residuals for normality Kottegoda and Rosso page 351

Regression Diagnostics - Residual versus explanatory variable Kottegoda and Rosso page 351

Regression Diagnostics - Residual versus predicted response variable Helsel and Hirsch page 232

Regression Diagnostics - Residual versus predicted response variable Helsel and Hirsch page 232
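All of the diagnostics on the last several slides come from the residuals of one fitted object; a sketch continuing from the model and X in the simple linear regression example above (the base-graphics layout is an assumption):
r = resid(model)
par(mfrow = c(2, 2))
plot(fitted(model), r); abline(h = 0)   # residual vs predicted response
plot(X, r); abline(h = 0)               # residual vs explanatory variable
qqnorm(r); qqline(r)                    # test residuals for normality
plot(r[-length(r)], r[-1])              # residual vs antecedent (lag-1) residual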

Quantile-Quantile Plots
[Panels: QQ-plot for raw flows; QQ-plot for log-transformed flows.]
A transformation is needed to normalize the data.
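A sketch of the two panels in R, with flows standing in for a vector of observed streamflows (a placeholder name, not data reproduced from the slides):
par(mfrow = c(1, 2))
qqnorm(flows, main = "QQ-plot for Raw Flows"); qqline(flows)
qqnorm(log(flows), main = "QQ-plot for Log-Transformed Flows"); qqline(log(flows))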

Bulging Rule For Transformations
Up, θ > 1 (x², etc.)
Down, θ < 1 (log x, 1/x, √x, etc.)
Helsel and Hirsch page 229

Box-Cox Transformation
z = (x^λ - 1)/λ,  λ ≠ 0
z = ln(x),  λ = 0
Kottegoda and Rosso page 381
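The transformation is easy to code directly; a minimal sketch in R (note the limit as λ approaches 0 is ln(x), which is why the two cases join up):
boxcox_z = function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}
boxcox_z(c(1, 10, 100), -0.14)   # e.g. the lambda found on the next slide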

Box-Cox Normality Plot for Monthly September Flows on the Alafia R., Using PPCC
The best λ is close to 0: λ = -0.14.
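The normality plot can be reproduced by scanning λ and computing the probability plot correlation coefficient (PPCC) at each value; a sketch reusing the boxcox_z function above, with flows again a placeholder vector of September flows:
lambdas = seq(-2, 2, by = 0.01)
ppcc = sapply(lambdas, function(lam) {
  z = sort(boxcox_z(flows, lam))
  cor(z, qnorm(ppoints(length(z))))   # correlation of ordered data with normal quantiles
})
lambdas[which.max(ppcc)]   # maximizing lambda; about -0.14 per the slide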