Multivariate Data: Descriptive Techniques

Multivariate Data

Descriptive techniques for multivariate data: in most research situations, data is collected on more than one variable (usually many variables).

Graphical Techniques: the scatter plot and the two-dimensional histogram.

The Scatter Plot. For two variables X and Y we will have a measurement of each variable on each case: $(x_i, y_i)$, where $x_i$ = the value of X for case i and $y_i$ = the value of Y for case i.

To construct a scatter plot we plot the point $(x_i, y_i)$ for each case on the X–Y plane.
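As an illustration, a minimal sketch of constructing a scatter plot in Python (the data here is hypothetical, not Data Set #3; matplotlib is assumed to be available):

```python
# Plot the point (x_i, y_i) for each case on the X-Y plane.
import matplotlib.pyplot as plt

x = [84, 90, 100, 105, 110, 118]   # hypothetical X values, one per case
y = [80, 88, 95, 102, 115, 120]    # hypothetical Y values, one per case

plt.scatter(x, y)          # one point per case
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Scatter plot")
plt.show()
```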

Data Set #3. The following table gives data on Verbal IQ, Math IQ, Initial Reading Achievement Score, and Final Reading Achievement Score for 23 students who have recently completed a reading improvement program. [Table columns: Student, Verbal IQ, Math IQ, Initial Reading Achievement, Final Reading Achievement; the numerical values did not survive transcription.]

[Scatter plot of the data; one point, (84, 80), is labeled.]

Some Scatter Patterns

Circular: no relationship between X and Y; unable to predict Y from X.

Ellipsoidal: positive relationship between X and Y. Increases in X correspond to increases in Y (but not always). The major axis of the ellipse has positive slope.

Example: Verbal IQ, Math IQ.

Some More Patterns

Ellipsoidal (thinner ellipse): stronger positive relationship between X and Y. Increases in X correspond to increases in Y (more frequently). The major axis of the ellipse has positive slope; the minor axis of the ellipse is much smaller.

Increased strength in the positive relationship between X and Y: increases in X correspond to increases in Y (almost always). The minor axis of the ellipse is extremely small relative to the major axis of the ellipse.

Perfect positive relationship between X and Y: Y is perfectly predictable from X. The data falls exactly along a straight line with positive slope.

Ellipsoidal: negative relationship between X and Y. Increases in X correspond to decreases in Y (but not always). The major axis of the ellipse has negative slope.

The strength of the relationship can increase until changes in Y can be perfectly predicted from X

Some Non-Linear Patterns

In a linear pattern, Y increases with respect to X at a constant rate. In a non-linear pattern, the rate at which Y increases with respect to X is variable.

Growth Patterns

Growth patterns frequently follow a sigmoid curve: growth at the start is slow, it then speeds up, and it slows down again as it reaches its limiting size.
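One common functional form for such a sigmoid (an illustrative assumption; the slides do not name a specific curve) is the logistic growth curve
$$y(x) = \frac{L}{1 + e^{-k(x - x_0)}},$$
where L is the limiting size, k controls how quickly growth speeds up, and $x_0$ is the point of fastest growth: growth is slow for small x, fastest near $x_0$, and levels off toward L.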

Review the scatter plot

Some Scatter Patterns

Non-Linear Patterns

Measures of strength of a relationship (correlation): Pearson’s correlation coefficient (r); Spearman’s rank correlation coefficient (rho, $\rho$).

Assume that we have collected data on two variables X and Y. Let $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$ denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).

From this data we can compute summary statistics for each variable: the means $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.

The standard deviations $s_x = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$ and $s_y = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}}$.

These statistics give information about each variable separately, but give no information about the relationship between the two variables.

Consider the statistics: $S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2$, $S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2$, and $S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$.

The first two statistics, $S_{xx}$ and $S_{yy}$, are used to measure variability in each variable; they are used to compute the sample standard deviations $s_x = \sqrt{S_{xx}/(n-1)}$ and $s_y = \sqrt{S_{yy}/(n-1)}$.

The third statistic, $S_{xy}$, is used to measure correlation. If two variables are positively related, the sign of $(x_i - \bar{x})$ will agree with the sign of $(y_i - \bar{y})$ for most cases.

When $(x_i - \bar{x})$ is positive, $(y_i - \bar{y})$ will be positive: when $x_i$ is above its mean, $y_i$ will be above its mean. When $(x_i - \bar{x})$ is negative, $(y_i - \bar{y})$ will be negative: when $x_i$ is below its mean, $y_i$ will be below its mean. The product $(x_i - \bar{x})(y_i - \bar{y})$ will be positive for most cases.

This implies that the statistic $S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$ will be positive: most of the terms in this sum will be positive.

On the other hand, if two variables are negatively related, the sign of $(x_i - \bar{x})$ will be opposite in sign to $(y_i - \bar{y})$.

When $(x_i - \bar{x})$ is positive, $(y_i - \bar{y})$ will be negative: when $x_i$ is above its mean, $y_i$ will be below its mean. When $(x_i - \bar{x})$ is negative, $(y_i - \bar{y})$ will be positive: when $x_i$ is below its mean, $y_i$ will be above its mean. The product $(x_i - \bar{x})(y_i - \bar{y})$ will be negative for most cases.

Again this implies that the statistic $S_{xy}$ will be negative: most of the terms in this sum will be negative.

Pearson’s correlation coefficient is defined as: $r = \dfrac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$.

The denominator, $\sqrt{S_{xx}\,S_{yy}}$, is always positive.

The numerator, $S_{xy}$, is positive if there is a positive relationship between X and Y, and negative if there is a negative relationship between X and Y. This property carries over to Pearson’s correlation coefficient r.

Properties of Pearson’s correlation coefficient r:
1. The value of r is always between –1 and +1.
2. If the relationship between X and Y is positive, then r will be positive.
3. If the relationship between X and Y is negative, then r will be negative.
4. If there is no relationship between X and Y, then r will be zero.
5. The value of r will be +1 if the points $(x_i, y_i)$ lie on a straight line with positive slope.
6. The value of r will be –1 if the points $(x_i, y_i)$ lie on a straight line with negative slope.

[A series of scatter plots illustrating correlations of r = 1, 0.95, 0.7, 0.4, 0, –0.4, –0.7, –0.8, –0.95, and –1.]

Computing formulae for the statistics: $S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$, $S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$, and $S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$. To compute these, first compute the sums $\sum x_i$, $\sum y_i$, $\sum x_i^2$, $\sum y_i^2$, and $\sum x_i y_i$; then apply the formulae above.
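A small sketch of these computing formulae in Python (illustrative; the function and variable names are our own):

```python
# Pearson's r via the computing formulae:
#   S_xx = sum(x_i^2) - (sum x_i)^2 / n, and similarly for S_yy and S_xy.
import math

def pearson_r(x, y):
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)

print(pearson_r([1, 2, 3, 4], [2, 3, 5, 6]))  # about 0.99: strong positive
```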

Example: Verbal IQ, Math IQ.

Data Set #3 (shown again): Verbal IQ, Math IQ, Initial Reading Achievement, and Final Reading Achievement for the 23 students in the reading improvement program. [Table values did not survive transcription.]

[Computation of the sums, and hence of $S_{xx}$, $S_{yy}$, and $S_{xy}$, for the Verbal IQ and Math IQ data; the numerical values did not survive transcription.]

Thus Pearson’s correlation coefficient is: $r = \dfrac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$.

The resulting r is positive: Verbal IQ and Math IQ are positively correlated. If Verbal IQ is above (below) the mean, then for most cases Math IQ will also be above (below) the mean.

Is the improvement in reading achievement (RA) related to either Verbal IQ or Math IQ? Improvement in RA = Final RA – Initial RA.

The Data: [the slide reports the correlation between Math IQ and RA Improvement and the correlation between Verbal IQ and RA Improvement; the numerical values did not survive transcription.]

Scatterplot: Math IQ vs RA Improvement

Scatterplot: Verbal IQ vs RA Improvement

Spearman’s rank correlation coefficient $\rho$ (rho)

Spearman’s rank correlation coefficient $\rho$ (rho) is computed as follows:
1. Arrange the observations on X in increasing order and assign them the ranks 1, 2, 3, …, n.
2. Arrange the observations on Y in increasing order and assign them the ranks 1, 2, 3, …, n.
3. For any case i, let $(x_i, y_i)$ denote the observations on X and Y, and let $(r_i, s_i)$ denote the ranks on X and Y.

If the variables X and Y are strongly positively correlated, the ranks on X should generally agree with the ranks on Y (the largest X should be the largest Y, the smallest X should be the smallest Y). If the variables X and Y are strongly negatively correlated, the ranks on X should be in the reverse order to the ranks on Y (the largest X should be the smallest Y, the smallest X should be the largest Y). If the variables X and Y are uncorrelated, the ranks on X should be randomly distributed with respect to the ranks on Y.

Spearman’s rank correlation coefficient is defined as follows: for each case let $d_i = r_i - s_i$ = the difference in the two ranks. Then Spearman’s rank correlation coefficient ($\rho$) is defined as: $\rho = 1 - \dfrac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$.
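A minimal sketch of this formula in Python, assuming no ties in the data (the rank-assignment description above is for the tie-free case; the helper names are our own):

```python
# Spearman's rho via rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).
def ranks(values):
    # Assign rank 1 to the smallest value, rank n to the largest (no ties).
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    n = len(x)
    r, s = ranks(x), ranks(y)
    d_sq = sum((ri - si) ** 2 for ri, si in zip(r, s))
    return 1 - 6 * d_sq / (n * (n * n - 1))

print(spearman_rho([2, 5, 1, 8, 7], [1.5, 6, 2.5, 9, 6.5]))  # 0.9
```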

Properties of Spearman’s rank correlation coefficient $\rho$:
1. The value of $\rho$ is always between –1 and +1.
2. If the relationship between X and Y is positive, then $\rho$ will be positive.
3. If the relationship between X and Y is negative, then $\rho$ will be negative.
4. If there is no relationship between X and Y, then $\rho$ will be zero.
5. The value of $\rho$ will be +1 if the ranks of X completely agree with the ranks of Y.
6. The value of $\rho$ will be –1 if the ranks of X are in reverse order to the ranks of Y.

Example: [a small table lists observations $x_i$ and $y_i$; ranking the X’s and the Y’s gives the ranks $r_i$ and $s_i$, and computing the differences in ranks gives the $d_i$. The numerical values did not survive transcription.]

Computing Pearson’s correlation coefficient, r, for the same problem:

To compute r, first compute the sums $\sum x_i$, $\sum y_i$, $\sum x_i^2$, $\sum y_i^2$, and $\sum x_i y_i$. Then compute $S_{xx}$, $S_{yy}$, and $S_{xy}$ from the computing formulae.

and $r = \dfrac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$. Compare this value with Spearman’s $\rho$ computed above.

Comments on Spearman’s rank correlation coefficient $\rho$ and Pearson’s correlation coefficient r:
1. The value of $\rho$ can also be computed by applying Pearson’s formula to the ranks, i.e. $\rho = \dfrac{S_{rs}}{\sqrt{S_{rr}\,S_{ss}}}$ computed from the pairs $(r_i, s_i)$.
2. Spearman’s $\rho$ is Pearson’s r computed from the ranks.

3.Spearman’s  is less sensitive to extreme observations. (outliers) 4.The value of Pearson’s r is much more sensitive to extreme outliers. This is similar to the comparison between the median and the mean, the standard deviation and the pseudo-standard deviation. The mean and standard deviation are more sensitive to outliers than the median and pseudo- standard deviation.

Scatter plots

Some Scatter Patterns

Non-Linear Patterns

Measuring correlation: 1. Pearson’s correlation coefficient r. 2. Spearman’s rank correlation coefficient $\rho$.

Simple Linear Regression Fitting straight lines to data

The Least Squares Line (the Regression Line). When data is correlated, it falls roughly about a straight line.

In this situation one wants to find the equation of the straight line through the data that yields the best fit. The equation of any straight line is of the form Y = a + bX, where b = the slope of the line and a = the intercept of the line.

[Diagram of a line: the intercept a is where the line crosses the Y axis, and the slope is $b = \dfrac{\text{Rise}}{\text{Run}} = \dfrac{y_2 - y_1}{x_2 - x_1}$ for any two points $(x_1, y_1)$ and $(x_2, y_2)$ on the line.]

a is the value of Y when X is zero. b is the rate that Y increases per unit increase in X. For a straight line this rate is constant; for non-linear curves the rate that Y increases per unit increase in X varies with X.

Linear

Non-linear

Example: In the following example both blood pressure and age were measured for each female subject. Subjects were grouped into age classes and the median blood pressure measurement was computed for each age class. The data are summarized below: [Table columns: Age Class, Midpoint Age (X), Median BP (Y); the numerical values did not survive transcription.]

[Graph of the blood pressure data.]

Interpretation of the slope and intercept:
1. Intercept – the value of Y at X = 0: the predicted blood pressure of a newborn (65.1). This interpretation remains valid only if linearity is true down to X = 0.
2. Slope – the rate of increase in Y per unit increase in X: blood pressure increases 1.38 units each year.
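As a quick worked use of these values (our own example, reusing the intercept 65.1 and slope 1.38 given above): the predicted median blood pressure at age 40 is $\hat{Y} = 65.1 + 1.38 \times 40 = 65.1 + 55.2 = 120.3$.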

The Least Squares Line Fitting the best straight line to “linear” data

Reasons for fitting a straight line to data:
1. It provides a precise description of the relationship between Y and X.
2. The interpretation of the parameters of the line (slope and intercept) leads to an improved understanding of the phenomenon that is under study.
3. The equation of the line is useful for prediction of the dependent variable (Y) from the independent variable (X).

Assume that we have collected data on two variables X and Y. Let $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$ denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).

Let Y = a + bX denote an arbitrary equation of a straight line, where a and b are known values. This equation can be used to predict, for each value of X, the value of Y. For example, if $X = x_i$ (as for the i-th case), then the predicted value of Y is $\hat{y}_i = a + b x_i$.

For example, if Y = a + bX (for specific numerical values of a and b) is the equation of the straight line, and if $X = x_i = 20$ (for the i-th case), then the predicted value of Y is $\hat{y}_i = a + b(20)$.

If the actual value of Y is $y_i = 70.0$ for case i, then the difference $r_i = y_i - \hat{y}_i$ is the error in the prediction for case i; $r_i$ is also called the residual for case i.

The residual $r_i = y_i - \hat{y}_i$ can be computed for each case in the sample. The residual sum of squares, $RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n}\left(y_i - a - b x_i\right)^2$, is a measure of the “goodness of fit” of the line Y = a + bX to the data.

[Diagram: the line Y = a + bX plotted with four data points $(x_1, y_1), \ldots, (x_4, y_4)$ and their residuals $r_1, r_2, r_3, r_4$ shown as vertical distances to the line.]

The optimal choice of a and b will result in the residual sum of squares attaining a minimum. If this is the case, then the line Y = a + bX is called the least squares line.

The equation for the least squares line. Let $S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2$, $S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2$, and $S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$.

Computing formulae: $S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$, $S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$, and $S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$.

Then the slope of the least squares line can be shown to be: $b = \dfrac{S_{xy}}{S_{xx}}$,

and the intercept of the least squares line can be shown to be: $a = \bar{y} - b\,\bar{x}$.
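A minimal sketch of these two formulas in Python (illustrative; the helper name is our own):

```python
# Least squares slope b = S_xy / S_xx and intercept a = ybar - b * xbar.
def least_squares_line(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = s_xy / s_xx       # slope
    a = ybar - b * xbar   # intercept
    return a, b
```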

The following data show the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men in 1950.

TABLE: Per capita consumption of cigarettes per month ($X_i$) in n = 11 countries in 1930, and the death rates, $Y_i$ (per 100,000), from lung cancer for men in 1950.

Country (i)      X_i    Y_i
Australia         48     18
Canada            50     15
Denmark           38     17
Finland          110     35
Great Britain    110     46
Holland           49     24
Iceland           23      6
Norway            25      9
Sweden            30     11
Switzerland       51     25
USA              130     20

Fitting the Least Squares Line

First compute the following three quantities (from the sums $\sum x_i = 664$, $\sum y_i = 226$, $\sum x_i^2 = 54404$, $\sum y_i^2 = 6018$, and $\sum x_i y_i = 16914$): $S_{xx} = 54404 - \frac{664^2}{11} = 14322.55$, $S_{yy} = 6018 - \frac{226^2}{11} = 1374.73$, and $S_{xy} = 16914 - \frac{(664)(226)}{11} = 3271.82$.

Computing the estimates of slope and intercept: $b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{3271.82}{14322.55} = 0.228$ and $a = \bar{y} - b\,\bar{x} = 20.55 - (0.228)(60.36) = 6.756$.

Y = 6.756 + (0.228)X
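This fit can be reproduced with numpy as a check (numpy assumed available; `np.polyfit` with degree 1 fits an ordinary least squares line):

```python
import numpy as np

# Cigarette data from the table above.
x = np.array([48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130])  # 1930 consumption
y = np.array([18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20])       # 1950 death rate

b, a = np.polyfit(x, y, 1)  # highest-degree coefficient (the slope) comes first
print(a, b)                 # approximately 6.756 and 0.228
```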

Interpretation of the slope and intercept:
1. Intercept – the value of Y at X = 0: the predicted death rate from lung cancer (6.756) for men in 1950 in countries with no smoking in 1930 (X = 0).
2. Slope – the rate of increase in Y per unit increase in X: the death rate from lung cancer for men in 1950 increases 0.228 units for each increase of 1 cigarette in per capita consumption in 1930.

Example (revisited): both blood pressure and age were measured for each female subject; subjects were grouped into age classes and the median blood pressure was computed for each age class (the same data summarized earlier; table columns: Age Class, Midpoint Age (X), Median BP (Y)).

Fitting the Least Squares Line

First compute the three quantities $S_{xx}$, $S_{yy}$, and $S_{xy}$ for the blood pressure data. [The slide’s numerical values did not survive transcription.]

Computing the estimates of slope and intercept gives the fitted line Y = 65.1 + 1.38X (the intercept and slope interpreted earlier).

[Graph of the fitted line over the blood pressure data.]

Relationship between correlation and linear regression:
1. Pearson’s correlation coefficient $r = \dfrac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$ takes values between –1 and +1.

2. The least squares line Y = a + bX minimises the residual sum of squares $\sum_{i=1}^{n}(y_i - a - b x_i)^2$, the sum of squares that measures the variability in Y that is unexplained by X. This can also be denoted by $SS_{\text{unexplained}}$.

Some other sums of squares: $SS_{\text{total}} = \sum_{i=1}^{n}(y_i - \bar{y})^2$, the sum of squares that measures the total variability in Y (ignoring X).

$SS_{\text{explained}} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$, the sum of squares that measures the variability in Y that is explained by X.

It can be shown that $SS_{\text{total}} = SS_{\text{explained}} + SS_{\text{unexplained}}$: (total variability in Y) = (variability in Y explained by X) + (variability in Y unexplained by X).

It can also be shown that $r^2 = \dfrac{SS_{\text{explained}}}{SS_{\text{total}}}$ = the proportion of variability in Y explained by X. $r^2$ is called the coefficient of determination.

Further, $1 - r^2 = \dfrac{SS_{\text{unexplained}}}{SS_{\text{total}}}$ = the proportion of variability in Y that is unexplained by X.

Example. TABLE: Per capita consumption of cigarettes per month ($X_i$) in n = 11 countries in 1930, and the death rates, $Y_i$ (per 100,000), from lung cancer for men in 1950 (the same table as above).

Fitting the least squares line: first compute the three quantities $S_{xx} = 14322.55$, $S_{yy} = 1374.73$, and $S_{xy} = 3271.82$ (as above).

Computing the estimates of slope and intercept: $b = \dfrac{3271.82}{14322.55} = 0.228$ and $a = 20.55 - (0.228)(60.36) = 6.756$.

Computing r and $r^2$: $r = \dfrac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \dfrac{3271.82}{\sqrt{(14322.55)(1374.73)}} = 0.737$, so $r^2 = 0.544$. About 54% of the variability in Y (the death rate due to lung cancer in 1950) is explained by X (per capita cigarette smoking in 1930).
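A numerical check of the sum-of-squares decomposition and of $r^2$ on this data (a sketch reusing numpy and the cigarette data above):

```python
import numpy as np

x = np.array([48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130])
y = np.array([18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20])

b, a = np.polyfit(x, y, 1)   # least squares fit
y_hat = a + b * x            # predicted values

ss_total = np.sum((y - y.mean()) ** 2)
ss_explained = np.sum((y_hat - y.mean()) ** 2)
ss_unexplained = np.sum((y - y_hat) ** 2)

print(np.isclose(ss_total, ss_explained + ss_unexplained))  # True
print(ss_explained / ss_total)                              # r^2, about 0.544
```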

Y = 6.756 + (0.228)X

Comments: Correlation will be +1 or –1 if the data lies on a straight line. Correlation can be zero or close to zero if the data is either not related, or, in some situations, non-linearly related.

Example: [the slide shows a data set whose scatter plot has a clear non-linear pattern yet a correlation near zero; the values did not survive transcription.]

One should be careful in interpreting zero correlation. It does not necessarily imply that Y is not related to X; it could happen that Y is non-linearly related to X. One should plot Y vs X before concluding that Y is not related to X (see the sketch below).
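A small numerical illustration of this point (hypothetical symmetric data; numpy assumed):

```python
import numpy as np

# Y is perfectly determined by X, but the relationship is non-linear.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(r)  # 0.0: zero correlation, even though Y = X^2 exactly
```
Plotting y against x immediately reveals the parabola that r alone misses.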