Understanding Your Data Set Statistics are used to describe data sets Gives us a metric in place of a graph What are some types of statistics used to describe.

Presentation transcript:

Understanding Your Data Set: Statistics are used to describe data sets. They give us a metric in place of a graph. What are some types of statistics used to describe data sets? –Average, range, variance, standard deviation, coefficient of variation, standard error

Table 1. Total length (cm) and average length of spotted gar collected from a local farm pond and from a local lake. [Table columns: Number, Pond length, Lake length, with the average in the last row; values not preserved in this transcript.]

Are the two samples equal? –What about 47.2 and 47.3? If we sampled all of the gar in each water body, would the average be different? –How different? Would the lake fish average still be larger?

Range: Simply the distance between the smallest and largest value. Figure 1. Range of spotted gar length (cm) collected from a pond and a lake. The dashed line represents the overlap in range.

Does the difference in average length (47.2 vs. 68.2) seem to be as large as before?

Variance: An index of variability used to describe the dispersion among the measures of a population sample. To calculate it, we need the distance between each sample point and the sample mean.

Figure 2. Mean length (cm) of each spotted gar collected from the pond. The horizontal solid line represents the sample mean length.

We can easily put this new data set into a spreadsheet table. [Table columns: #, Length, Mean, Difference; values not preserved; Sum = 0.] By adding up all of the differences, we can get a number that is a reflection of how scattered the data points are. –The closer to the mean each number is, the smaller the total difference. After adding up all of the differences, we get zero. –This is true of all calculations like this. What can we do to get rid of the negative values?

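Sum of Squares: [Table columns: #, Length, Mean, Difference, Difference squared; values and their sum not preserved.] Squaring each difference removes the negative values. Now the sum is a number we can use! This value is called the SUM OF SQUARES.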
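The two steps above can be sketched in a few lines of Python. This is only an illustration with made-up lengths (the gar measurements are not preserved in this transcript): the raw differences from the mean cancel out, while the squared differences add up to the sum of squares.

```python
# Hypothetical lengths (cm); NOT the gar data from the slides.
lengths = [40, 45, 50, 55, 60]

mean = sum(lengths) / len(lengths)
differences = [x - mean for x in lengths]

# Positive and negative differences cancel, so this total carries no information.
sum_of_differences = sum(differences)

# Squaring removes the signs, so this total reflects how scattered the points are.
sum_of_squares = sum(d ** 2 for d in differences)

print("Sum of differences:", sum_of_differences)
print("Sum of squares (SOS):", sum_of_squares)
```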

Back to Variance: Sum of Squares (SOS) will continue to increase as we increase our sample size. –A sample of 100 replicates that are not highly variable could have a higher SOS than a sample of 10 replicates that are highly variable. To account for sample size, we need to divide SOS by the number of samples minus one (n-1). –We’ll get to the reason for (n-1) instead of n later.

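Calculate Variance (σ²): $\sigma^2 = S^2 = \frac{\sum (X_i - X_m)^2}{n - 1}$, where the numerator is the SOS and the denominator (n – 1) is the degrees of freedom. Variance for Pond = S² = SOS / 9 (the numeric values are not preserved in this transcript).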
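As a rough sketch of the formula above, the function below divides the sum of squares by the degrees of freedom. The data are made up for illustration, and `sample_variance` is a hypothetical helper name, not something from the slides.

```python
import statistics

# Hypothetical lengths (cm); NOT the gar data from the slides.
lengths = [40, 45, 50, 55, 60]


def sample_variance(values):
    """Sum of squares (SOS) divided by the degrees of freedom (n - 1)."""
    mean = sum(values) / len(values)
    sos = sum((x - mean) ** 2 for x in values)
    return sos / (len(values) - 1)


variance = sample_variance(lengths)
print("Variance:", variance)

# The standard library uses the same n - 1 definition.
print("statistics.variance:", statistics.variance(lengths))
```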

More on Variance Variance tends to increase as the sample mean increases –For our sample, the largest difference between any point and the mean was 30.8 cm. Imagine measuring a plot of cypress trees. How large of a difference would you expect (if measured in cm)? The variance for the lake sample =

Standard Deviation Calculated as the square root of the variance. –Variance is not a linear distance (we had to square it). Think about the difference in shape of a meter stick versus a square meter. By taking the square root of the variance, we return our index of variability to something that can be placed on a number line.

Calculate SD For our gar sample, the Variance was The square root of = –Reported with the mean as: 47.2 ± (mean ± SD). Standard Deviation is often abbreviated as σ (sigma) or as SD. SD is a unit of measurement that describes the scatter of our data set. –Also increases with the mean

Standard Error Calculated as: SE = σ / √(n) –Indicates how close we are to estimating the true population mean –For our pond ex: SE = / √10 = –Reported with the mean as 47.2 ± (mean ± SE). –Based on the formula, the SE decreases as sample size increases. Why is this not a mathematical artifact, but a true reflection of the population we are studying?
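A minimal sketch of the SD and SE calculations, assuming the same kind of made-up length data (the pond values are not preserved in this transcript). It reports the mean both ways, as the slides describe.

```python
import math
import statistics

# Hypothetical lengths (cm); NOT the gar data from the slides.
lengths = [40, 45, 50, 55, 60]

mean = statistics.mean(lengths)
sd = statistics.stdev(lengths)      # square root of the sample variance
se = sd / math.sqrt(len(lengths))   # SE = SD / sqrt(n)

print(f"{mean:.1f} ± {sd:.2f} (mean ± SD)")
print(f"{mean:.1f} ± {se:.2f} (mean ± SE)")
```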

Sample Size The number of individuals within a population you measure/observe. –Usually impossible to measure the entire population As sample size increases, we get closer to the true population mean. –Remember, when we take a sample we assume it is representative of the population.

Effect of Increasing Sample Size: I measured the length of 100 gar, calculated SD and SE for the first 10, then included the next 10, and so on until all 100 individuals were included.

Sample Size

SD = Square root of the variance, where $Var = \frac{\sum (X_i - X_m)^2}{n - 1}$

SE = SD / √(n) [Figure: x-axis is sample size.]
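The procedure described above (SD and SE for the first 10 fish, then the first 20, and so on) can be imitated with simulated data. This is only a sketch: the 100 lengths are random draws, and the mean of 50 cm and SD of 15 cm used to generate them are assumptions, not values from the slides.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Simulated lengths (cm) for 100 fish; the generating mean and SD are made up.
lengths = [rng.gauss(50, 15) for _ in range(100)]

rows = []
for n in range(10, 101, 10):
    subset = lengths[:n]              # first n fish measured
    sd = statistics.stdev(subset)
    se = sd / math.sqrt(n)
    rows.append((n, sd, se))
    print(f"n={n:3d}  SD={sd:6.2f}  SE={se:5.2f}")
```

In a run like this, SD tends to settle near a stable value while SE keeps shrinking, because SE divides by √n.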

What is a population? Population: a data set representing the entire entity of interest. Sample: a data set representing a portion of a population.

Population mean – the true mean for that population: a single number. Sample mean – the estimated population mean: a range of values (estimate ± 95% confidence interval).

As our sample size increases, we sample more and more of the population. Eventually, we will have sampled the entire population and our sample distribution will be the population distribution.

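Mean: $\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$. Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{N - 1}$. Standard Deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N - 1}}$. Standard Error: $SE = \frac{s}{\sqrt{N}}$. Go to Excel. Mean = 169/6 = 28.17. Range = 25 – 32. SOS = 40.83. Variance = 40.83 / 5 = 8.17. Std. Dev. = √(40.83/5) = 2.86. Std. Err. = 2.86 / √6 = 1.17.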
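The worked numbers on this slide can be checked with a short script. The six data values themselves are not in the transcript, so the list below is an assumption: one set of values chosen to match the slide's total (169), range (25 to 32), and sum of squares (40.83). `describe` is a hypothetical helper name.

```python
import math


def describe(values):
    """Return the summary statistics used on the slide."""
    n = len(values)
    mean = sum(values) / n
    sos = sum((x - mean) ** 2 for x in values)   # sum of squares
    variance = sos / (n - 1)                     # SOS / degrees of freedom
    sd = math.sqrt(variance)
    se = sd / math.sqrt(n)
    return {
        "mean": mean,
        "range": (min(values), max(values)),
        "sos": sos,
        "variance": variance,
        "sd": sd,
        "se": se,
    }


# ASSUMED data: chosen only to be consistent with the slide's totals.
values = [25, 26, 26, 30, 30, 32]
summary = describe(values)
for name, value in summary.items():
    print(name, value)
```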

MEAN ± CONFIDENCE INTERVAL When a population is sampled, a mean value is determined and serves as the point-estimate for that population. However, we cannot expect our estimate to be the exact mean value for the population. Instead of relying on a single point-estimate, we estimate a range of values, centered around the point-estimate, that probably includes the true population mean. That range of values is called the confidence interval.

Confidence Interval: consists of two numbers (high and low) computed from a sample that identify the range for an interval estimate of a parameter. With a 95% confidence interval, there is a 5% chance that our interval does not include the true population mean. $\bar{y} \pm t_{\alpha = 0.05} \left( \frac{\sigma}{\sqrt{n}} \right)$ [Numeric example: only the value 30.45 is preserved in this transcript.]
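A hedged sketch of the interval calculation follows. The data are assumed (six values with a mean of 169/6 and an SE of about 1.17), and the critical values are typed in from standard tables rather than computed: 2.571 is the two-tailed 0.05 value of t for 5 degrees of freedom, and 1.96 is the large-sample value. `confidence_interval` is a hypothetical helper name.

```python
import math
import statistics


def confidence_interval(values, critical_value):
    """Return (low, high) for mean ± critical_value * SD / sqrt(n)."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    half_width = critical_value * se
    return mean - half_width, mean + half_width


# ASSUMED data; the slide's own values are not preserved.
values = [25, 26, 26, 30, 30, 32]

low_t, high_t = confidence_interval(values, 2.571)  # t table, df = 5
low_z, high_z = confidence_interval(values, 1.96)   # large-sample multiplier

print(f"t-based 95% CI: {low_t:.2f} to {high_t:.2f}")
print(f"1.96-based interval: {low_z:.2f} to {high_z:.2f}")
```

With these assumed values the 1.96 multiplier gives an upper limit of 30.45, which is the one number that survives from the slide's example. That the slide used 1.96 is an inference, not something the transcript states. For a sample this small, the t-based interval is the wider and more cautious of the two.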

Hypothesis Testing: Null versus Alternative Hypothesis. Briefly: –Null Hypothesis: Two means are not different. –Alternative Hypothesis: Two means are not similar. A test statistic is used to get a P-value, which is compared with a predetermined probability (usually 0.05) to reject or fail to reject the null hypothesis. If P < 0.05, then there is a significant difference. If P > 0.05, then there is NO significant difference.

Are Two Populations The Same? Boudreaux: ‘My pond is better than your lake, cher’! Alphonse: ‘Mais non! I’ve got much bigger fish in my lake’! How can the truth be determined?

Two Sample t-test Simple comparison of a specific attribute between two populations If the attributes between the two populations are equal, then the difference between the two should be zero This is the underlying principle of a t-test If P-value > 0.05 the means are not significantly different; If P < 0.05 the means are significantly different
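A minimal sketch of the idea, assuming made-up pond and lake samples and the equal-variance (pooled) form of the test. The slides do not say which form of the t-test was used, so that choice is an assumption. Instead of computing a P-value, the statistic is compared with the tabled critical value for 8 degrees of freedom. `pooled_t_statistic` is a hypothetical helper name.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """t statistic for two independent samples, assuming equal variances."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / (n_a + n_b - 2)
    se_diff = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    # If the two populations are equal, this difference should be near zero.
    return (statistics.mean(a) - statistics.mean(b)) / se_diff


# Hypothetical lengths (cm); NOT the gar data from the slides.
pond = [40, 45, 50, 55, 60]
lake = [55, 60, 65, 70, 75]

t = pooled_t_statistic(pond, lake)
T_CRITICAL = 2.306  # two-tailed, alpha = 0.05, df = 8 (from a t table)
significant = abs(t) > T_CRITICAL  # equivalent to P < 0.05

print(f"t = {t:.2f}, significant at 0.05: {significant}")
```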

Analysis of Variance Can compare two or more means Compares means to determine if the population distributions are not similar Uses means and confidence intervals much like a t-test Test statistic used is called an F statistic (F-test), which is used to get the P value If P-value > 0.05 the means are not significantly different; If P< 0.05 the means are significantly different Post-hoc test separates the non-similar ones

Analysis of Variance Compares means to determine if the population distributions are not similar Uses means and confidence intervals much like a t-test Test statistic used is called an F statistic (F-test)

Normal Distribution Most characteristics follow a normal distribution –For example: height, length, speed, etc. One of the assumptions of the ANOVA test is that the sample data is ‘normally distributed.’

Sample Distribution Approaches Normal Distribution With Sample Size


ANOVA – Analysis of Variance: Calculate a SOS based on the overall mean (Total SOS).

[Table columns: Trtmnt (Pond or Lake), Replicate, Length, Overall Mean, SOS Total, with 10 pond and 10 lake replicates; values not preserved in this transcript.] This provides a measure of the overall variance (Total SOS).

Calculate a SOS for each treatment, based on that treatment's mean (Treatment or Error SOS).

[Table columns: Trtmnt (Pond or Lake), Replicate, Length, Trtmnt Mean, SOS Error, with 10 pond and 10 lake replicates; values not preserved in this transcript.] This provides a measure of the reduction of variance by measuring each treatment separately (Treatment or Error SOS). What happens to Error SOS when the variability within each treatment decreases?

Calculate a SOS for each predicted value vs. the overall mean (Model SOS)

[Table columns: Trtmnt (Pond or Lake), Replicate, Length, Trtmnt Mean, Overall Mean, SOS Model, with 10 pond and 10 lake replicates; values not preserved in this transcript.] This provides a measure of the distance between the mean values (Model SOS). What happens to Model SOS when the two means are close together? What if the means are equal?

Detecting a Difference Between Treatments Model SOS gives us an index on how far apart the two means are from each other. – Bigger Model SOS = farther apart Error SOS gives us an index of how scattered the data is for each treatment. –More variability = larger Error SOS = more possible overlap between treatments

Magic of the F-test: The ratio of Model SOS to Error SOS (each first divided by its degrees of freedom, then Model divided by Error) gives us an overall index (the F statistic) used to indicate the relative ‘distance’ and ‘overlap’ between two means. –A large Model SOS and small Error SOS = a large F statistic. Why does this indicate a significant difference? –A small Model SOS and a large Error SOS = a small F statistic. Why does this indicate no significant difference? Based on sample size (through the degrees of freedom), each F statistic has an associated P-value, which is compared with the alpha level (usually 0.05). –P < 0.05 (large F statistic): there is a significant difference between the means. –P ≥ 0.05 (small F statistic): there is NO significant difference.
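The three sums of squares and the F statistic can be sketched as follows, assuming two made-up treatments of five fish each (not the slide data). Each SOS is divided by its degrees of freedom before the ratio is taken, and the result is compared with a tabled critical value instead of computing a P-value. `anova_sos` is a hypothetical helper name.

```python
def anova_sos(groups):
    """Return (total, error, model) sums of squares for a one-way layout."""
    all_values = [x for group in groups for x in group]
    overall_mean = sum(all_values) / len(all_values)
    total = sum((x - overall_mean) ** 2 for x in all_values)
    error = 0.0
    model = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        # Scatter of each observation around its own treatment mean.
        error += sum((x - group_mean) ** 2 for x in group)
        # Distance of the treatment mean from the overall mean, once per fish.
        model += len(group) * (group_mean - overall_mean) ** 2
    return total, error, model


# Hypothetical lengths (cm); NOT the gar data from the slides.
pond = [40, 45, 50, 55, 60]
lake = [55, 60, 65, 70, 75]

total_sos, error_sos, model_sos = anova_sos([pond, lake])
df_model = 2 - 1    # number of treatments minus one
df_error = 10 - 2   # number of observations minus number of treatments
f_stat = (model_sos / df_model) / (error_sos / df_error)

F_CRITICAL = 5.32   # alpha = 0.05, df = (1, 8), from an F table
print(f"Total={total_sos}, Error={error_sos}, Model={model_sos}")
print(f"F = {f_stat:.2f}, significant at 0.05: {f_stat > F_CRITICAL}")
```

Two things are worth noticing in the output: Model SOS plus Error SOS equals Total SOS, and with only two treatments the F statistic is the square of the pooled two-sample t statistic for the same data.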

Showing Results [Figure: results plot with groups labeled A, B, A.]

Regression: For the purposes of this class: –Does Y depend on X? –Does a change in X cause a change in Y? –Can Y be predicted from X? Y = mX + b. [Figure legend: predicted values, overall mean, actual values.]

When analyzing a regression-type data set, the first step is to plot the data. [Table of X and Y values and scatter plot not preserved in this transcript.] The next step is to determine the line that ‘best fits’ these points. It appears this line would be sloped upward and linear (straight).

The line of best fit is the sample regression of Y on X, and its position is fixed by two results: 1) The regression line passes through the point (X avg, Y avg): (55, 138). 2) Its slope is at the rate of “m” units of Y per unit of X, where m = regression coefficient (slope; y = mx + b). For this data set the slope (rise/run) is 1.24: Y = 1.24(X) + b, where the numeric Y-intercept is not preserved in this transcript.
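The two results above are enough to fit the line. The sketch below uses the usual least-squares slope and then places the line so it passes through (X avg, Y avg). The X and Y values are made up for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for Y = slope * X + intercept."""
    x_avg = sum(xs) / len(xs)
    y_avg = sum(ys) / len(ys)
    sxy = sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
    sxx = sum((x - x_avg) ** 2 for x in xs)
    slope = sxy / sxx                    # units of Y per unit of X
    intercept = y_avg - slope * x_avg    # forces the line through the averages
    return slope, intercept


# Hypothetical data; NOT the data set plotted on the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
x_avg = sum(xs) / len(xs)
y_avg = sum(ys) / len(ys)

print(f"Y = {slope:.2f}(X) + {intercept:.2f}")
print("Predicted Y at X avg:", slope * x_avg + intercept, "vs Y avg:", y_avg)
```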

Testing the Regression Line for Significance An F-test is used based on Model, Error, and Total SOS. –Very similar to ANOVA Basically, we are testing if the regression line has a significantly different slope than a line formed by using just Y_avg. –If there is no difference, then that means that Y does not change as X changes (stays around the average value) To begin, we must first find the regression line that has the smallest Error SOS.

The regression line should pass through the overall average with a slope that has the smallest Error SOS (Error SOS = the distance between each point and predicted line: gives an index of the variability of the data points around the predicted line). The overall average is the pivot point. [Figure axes: independent value, dependent value.]

For each X, we can predict Y: Y = 1.24(X) + b (the intercept value is not preserved in this transcript). [Table columns: X, Y_Actual, Y_Pred, SOS Error; values not preserved.] Error SOS is calculated as the sum of $(Y_{Actual} - Y_{Predicted})^2$. This gives us an index of how scattered the actual observations are around the predicted line. The more scattered the points, the larger the Error SOS will be. This is like analysis of variance, except we are using the predicted line instead of the mean value.

Total SOS: Calculated as the sum of $(Y - Y_{avg})^2$. Gives us an index of how scattered our data set is around the overall Y average. [Figure: data points around the overall Y average; regression line not shown.]

[Table columns: X, Y_Actual, Y Average, SOS Total; values not preserved in this transcript.] Total SOS gives us an index of how scattered the data points are around the overall average. This is calculated the same way as for a single treatment in ANOVA. What happens to Total SOS when all of the points are close to the overall average? What happens when the points form a non-horizontal linear trend?

Model SOS: Calculated as the sum of $(Y_{Predicted} - Y_{avg})^2$. Gives us an index of how far all of the predicted values are from the overall average. [Figure: distance between predicted Y and overall mean.]

Model SOS: Gives us an index of how far away the predicted values are from the overall average value. What happens to Model SOS when all of the predicted values are close to the average value? [Table columns: X, Y_Pred, Y Average, SOS Model; values not preserved in this transcript.]

All Together Now!! [Table columns: X, Y_Actual, Y_Pred, SOS Error, Y_Avg, SOS Total, SOS Model; values not preserved in this transcript.] SOS Error = $\sum (Y_{Actual} - Y_{Pred})^2$. SOS Total = $\sum (Y_{Actual} - Y_{Avg})^2$. SOS Model = $\sum (Y_{Pred} - Y_{Avg})^2$.
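The three formulas can be computed side by side. This is a sketch with made-up X and Y values (the slide's table is not preserved). The line is fitted by least squares inside the block so the example stands on its own.

```python
# Hypothetical data; NOT the data set from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least-squares line through (X avg, Y avg).
x_avg = sum(xs) / len(xs)
y_avg = sum(ys) / len(ys)
slope = (sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
         / sum((x - x_avg) ** 2 for x in xs))
intercept = y_avg - slope * x_avg
y_pred = [slope * x + intercept for x in xs]

# The three sums of squares from the slide.
sos_error = sum((y - p) ** 2 for y, p in zip(ys, y_pred))  # actual vs predicted
sos_total = sum((y - y_avg) ** 2 for y in ys)              # actual vs average
sos_model = sum((p - y_avg) ** 2 for p in y_pred)          # predicted vs average

print(f"SOS Error = {sos_error:.2f}")
print(f"SOS Total = {sos_total:.2f}")
print(f"SOS Model = {sos_model:.2f}")
```

For a least-squares line, SOS Model plus SOS Error equals SOS Total, which is a quick way to check the arithmetic in a spreadsheet.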

Using SOS to Assess Regression Line Model SOS gives us an index on how ‘different’ the predicted values are from the average values. – Bigger Model SOS = more different –Tells us how different a sloped line is from a line made up only of Y_avg. –Remember, the regression line will pass through the overall average point. Error SOS gives us an index of how different the predicted values are from the actual values –More variability = larger Error SOS = large distance between predicted and actual values

Magic of the F-test: The ratio of Model SOS to Error SOS (each first divided by its degrees of freedom, then Model divided by Error) gives us an overall index (the F statistic) used to indicate the relative ‘difference’ between the regression line and a line with slope of zero (all values = Y_avg). –A large Model SOS and small Error SOS = a large F statistic. Why does this indicate a significant difference? –A small Model SOS and a large Error SOS = a small F statistic. Why does this indicate no significant difference? Based on sample size (through the degrees of freedom), each F statistic has an associated P-value, which is compared with the alpha level (usually 0.05). –P < 0.05 (large F statistic): there is a significant difference between the regression line and the Y_avg line. –P ≥ 0.05 (small F statistic): there is NO significant difference between the regression line and the Y_avg line.

$F = \frac{\text{Mean Model SOS}}{\text{Mean Error SOS}}$. Basically, this is an index that tells us how different the regression line is from Y_avg, and the scatter of the data around the predicted values. [Figure axes: independent value, dependent value.]
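A small sketch of this ratio for a simple (one X variable) regression, where the model has 1 degree of freedom and the error has n – 2. The SOS values passed in are made-up illustrative numbers, `regression_f` is a hypothetical helper name, and the critical value is typed in from an F table.

```python
def regression_f(model_sos, error_sos, n):
    """F = mean Model SOS / mean Error SOS for a one-predictor regression."""
    df_model = 1        # one slope is estimated
    df_error = n - 2    # n points minus slope and intercept
    mean_model_sos = model_sos / df_model
    mean_error_sos = error_sos / df_error
    return mean_model_sos / mean_error_sos


# Hypothetical values for a data set with 5 points.
f_stat = regression_f(model_sos=3.6, error_sos=2.4, n=5)

F_CRITICAL = 10.13  # alpha = 0.05, df = (1, 3), from an F table
print(f"F = {f_stat:.2f}, significant at 0.05: {f_stat > F_CRITICAL}")
```

Here the F statistic falls below the critical value, so with only five points this sloped line would not be judged significantly different from the Y_avg line.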

Correlation (r): Another measure of the mutual linear relationship between two variables. ‘r’ is a pure number without units or dimensions. ‘r’ is always between –1 and 1. Positive values indicate that y increases when x does and negative values indicate that y decreases when x increases. –What does r = 0 mean? ‘r’ is a measure of intensity of association observed between x and y. –‘r’ does not predict – only describes associations between variables

r > 0 r < 0 r = 0 r is also called Pearson’s correlation coefficient.

R-square: If we square r, we get rid of the negative sign (if r is negative) and we get an index of how close the data points are to the regression line. It allows us to decide how much confidence we have in making a prediction based on our model. It is calculated as Model SOS / Total SOS.
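As a rough check that squaring r gives the same number as Model SOS / Total SOS, the sketch below computes both for a small made-up data set. `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient from deviations about the means."""
    x_avg = sum(xs) / len(xs)
    y_avg = sum(ys) / len(ys)
    sxy = sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
    sxx = sum((x - x_avg) ** 2 for x in xs)
    syy = sum((y - y_avg) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data; NOT the data set from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)

# Model SOS / Total SOS for the least-squares line through the same points.
x_avg = sum(xs) / len(xs)
y_avg = sum(ys) / len(ys)
slope = (sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
         / sum((x - x_avg) ** 2 for x in xs))
intercept = y_avg - slope * x_avg
model_sos = sum((slope * x + intercept - y_avg) ** 2 for x in xs)
total_sos = sum((y - y_avg) ** 2 for y in ys)
r_square = model_sos / total_sos

print(f"r = {r:.3f}, r squared = {r ** 2:.3f}, Model/Total = {r_square:.3f}")
```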

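$r^2$ = Model SOS / Total SOS. [Figure legend: Model SOS, Total SOS.]

$r^2$ = Model SOS / Total SOS (numerator / denominator). A small numerator and a big denominator give a small $R^2$. [Figure legend: Model SOS, Total SOS; the $R^2$ value shown is not preserved in this transcript.]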

R-square and Prediction Confidence

Finally…….. If we have a significant relationship (based on the p-value), we can use the r-square value to judge how sure we are in making a prediction.