Introductory Statistics for Laboratorians Dealing with High Throughput Data Sets. Centers for Disease Control
Regression/Prediction. In this case we can use X to predict Y with complete accuracy. The equation is Y = 2X + 1. For any X we can compute Y. Error is not a factor.
In this case we can't fit a nice straight line to the data. If we repeat the experiment we will get somewhat different results. Random error is a factor.
If in this case we know that the relationship should be a straight line, and we believe the deviation from a line is due to error, we can fit a line to the data. There are many possible lines to choose from. How do we select the best line?
Principles of Prediction. If we don't know anything about the person being measured, our best bet is always to predict the mean. If we know the person is average on X, we should predict they will be average on Y. We want to select the line that produces the least error. Example data: X scale, Mean = 3.0; Y scale, Mean = 7.3.
Development Sample
– A sample in which both X and Y are known
– Used to develop an equation that can be used to compute a predicted Y from X
– Used to compute the Standard Curve
Unknown Sample
– Use the equation developed above to predict Y for people or samples for whom we know X but not Y
Example data: X scale, Mean = 3.0; Y scale, Mean = 7.3.
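The two-sample workflow above can be sketched in a few lines of Python. This is a minimal illustration, not the slide's own computation: the development data are hypothetical, and `fit_line` and `predict` are names chosen for this sketch. It fits the least-squares line on the development sample and then applies it to an unknown for which only X is available.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical development sample (both X and Y known); NOT the slide's data.
dev_x = [1, 2, 3, 4, 5]
dev_y = [3, 6, 7, 9, 11]

slope, intercept = fit_line(dev_x, dev_y)

def predict(x):
    """Predicted Y for a given X, using the line fitted on the development sample."""
    return slope * x + intercept

# Unknown sample: X is known, Y is not.
unknown_x = 3.5
predicted_y = predict(unknown_x)

print("slope =", slope, "intercept =", intercept)
print("predicted Y for X = 3.5:", predicted_y)
print("predicted Y at the mean of X:", predict(mean(dev_x)), "mean of Y:", mean(dev_y))
```

The last line illustrates the principle stated earlier: a least-squares line always passes through the point (mean of X, mean of Y), so a case that is average on X is predicted to be average on Y.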
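Regression. [Figure: one observation, Freddie Bruflot (X = 5, Y = 10), plotted against the regression line. Labels: Actual Y, Predicted Y, Mean of Y. The Total deviation (Actual Y minus Mean of Y) is split into a Regression part (Predicted Y minus Mean of Y) and a Residual, or Error, part (Actual Y minus Predicted Y). A value of .75 is marked on the figure.]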
Computation of SSE. Table columns: X, Y, Y predicted, Residual (error), Residual squared. Mean of X = 3; Mean of Y = 7.3. The last column is added up to give the Sum of Squared Residuals (SSE).
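The table's columns can be reproduced with a short Python sketch. The data below are hypothetical (the slide's values are not used), and the helper name `sse` is an assumption of this sketch. It prints one row per observation, sums the squared residuals, checks that the total sum of squares splits into a regression part plus a residual part, and compares the least-squares line with two other candidate lines.

```python
# Hypothetical development sample; NOT the data behind the slide's table.
xs = [1, 2, 3, 4, 5]
ys = [3, 6, 7, 9, 11]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Least-squares slope and intercept for these points.
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

def sse(b, a):
    """Sum of squared residuals for the candidate line Y = b*X + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

# One row per observation: X, Y, predicted Y, residual, residual squared.
for x, y in zip(xs, ys):
    y_hat = slope * x + intercept
    print(f"{x:>3} {y:>3} {y_hat:6.2f} {y - y_hat:6.2f} {(y - y_hat) ** 2:6.2f}")

sse_best = sse(slope, intercept)                              # residual (error) part
sst = sum((y - y_bar) ** 2 for y in ys)                       # total: actual Y vs mean of Y
ssr = sum((slope * x + intercept - y_bar) ** 2 for x in xs)   # regression part

print("SSE =", round(sse_best, 4))
print("Total =", round(sst, 4), "Regression + Residual =", round(ssr + sse_best, 4))

# Other candidate lines give a larger SSE than the least-squares line.
sse_flat = sse(0.0, y_bar)     # ignore X and always predict the mean of Y
sse_steeper = sse(2.0, 1.5)    # same intercept, slightly steeper slope
print("SSE (always predict mean) =", round(sse_flat, 4))
print("SSE (steeper line) =", round(sse_steeper, 4))
```

Selecting the line with the smallest SSE is what "the line that produces the least error" means in practice; always predicting the mean is the fallback when X carries no information.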
Correlation/Regression Example. High levels of a particular factor in blood samples (call it BF-Costly) are known to be highly predictive of cervical cancer. Measuring this specific factor is so expensive and time-consuming that it is impractical. The following data were obtained on the relationship between a second, easily measured blood factor (call it BF-Cheap) and BF-Costly.
Data table columns: BF-Cheap (Mean = 71.13) and BF-Costly (Mean = 72.60).
Here is a scatterplot of the data showing the relationship between BF-Cheap and BF-Costly. It looks like they are correlated. If the correlation is significant, this might be worth pursuing. Null hypothesis: the correlation is zero. Alpha = .05.
Test Significance of the Correlation. If the true correlation were zero, the probability of observing a correlation this strong would be .0000724 (7.24E-05). This is far below alpha, so we can reject the null hypothesis. It may well be worth developing a prediction equation that can be used to predict BF-Costly from BF-Cheap. Correlations table, BF-Cheap with BF-Costly: Pearson Correlation, Sig. (2-tailed) = 7.24E-05, N = 15. **Correlation is significant at the 0.01 level (2-tailed).
Regression Analysis. Null hypothesis: the slope of the regression line is zero. Alpha = .05. The probability (Sig. in the ANOVA table) is on the order of E-05, well below alpha, so we can reject the null hypothesis. What is the equation? Is it any good? ANOVA table: columns are Sum of Squares, df, Mean Square, F, Sig.; rows are Regression, Residual, Total. Predictors: (Constant), BF-Cheap. Dependent Variable: BF-Costly.
The equation we are looking for will be of the form: Predicted BF-Costly = slope * BF-Cheap + y-intercept. The Y-intercept is called "(Constant)" in the Coefficients table and is 27.6. The slope (the number you multiply BF-Cheap by) is .63. Both of these are significant. The equation is: Predicted BF-Costly = .63(BF-Cheap) + 27.6. Coefficients table: columns are Unstandardized Coefficients (B, Std. Error), Standardized Coefficients (Beta), t, Sig.; rows are (Constant) and BF-Cheap. Dependent Variable: BF-Costly.
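As a small sketch of how the prediction equation would be applied, the Python below plugs the slope (.63) and Y-intercept (27.6) stated on this slide into a function. The function name `predict_bf_costly` is an assumption of this sketch, and the coefficients are the rounded values shown, so results are approximate.

```python
SLOPE = 0.63       # unstandardized coefficient B for BF-Cheap, as stated on the slide
INTERCEPT = 27.6   # the "(Constant)" row, as stated on the slide

def predict_bf_costly(bf_cheap):
    """Predicted BF-Costly = slope * BF-Cheap + y-intercept."""
    return SLOPE * bf_cheap + INTERCEPT

# Sanity check: a sample at the BF-Cheap mean (71.13) should be predicted
# close to the BF-Costly mean (72.60); the small gap comes from rounding
# the coefficients to two or three digits.
at_mean = predict_bf_costly(71.13)
print("Predicted BF-Costly at the BF-Cheap mean:", round(at_mean, 2))
```

Checking the prediction at the mean of X is a quick way to catch a dropped intercept or a sign error in a hand-copied equation.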
How good is the equation? What kind of accuracy can we expect? R Square (.715) is the proportion of the total variance in BF-Costly accounted for by the equation. About 71.5% of the variance is accounted for, which means about 28.5% is not. Model Summary table: columns are R, R Square, Adjusted R Square, Std. Error of the Estimate. Predictors: (Constant), BF-Cheap. Dependent Variable: BF-Costly.
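With a single predictor, R Square, the correlation, and the significance test are tied together, which the following sketch illustrates. It starts from the two figures given in the slides (R Square = .715 and N = 15) and uses the standard relations t = r * sqrt(N - 2) / sqrt(1 - r^2) and F = t^2. The p-value is obtained by numerically integrating the t density with Simpson's rule, a choice made here only to stay within the standard library; the function names are assumptions of this sketch. Because R Square is rounded, the p-value should come out in the neighborhood of the reported 7.24E-05 rather than matching it exactly.

```python
import math

r_squared = 0.715   # R Square from the Model Summary
n = 15              # N from the Correlations table
df = n - 2          # degrees of freedom for a one-predictor regression

# With one predictor, R Square is the squared Pearson correlation.
# The positive root is taken because the slope (.63) is positive.
r = math.sqrt(r_squared)

t = r * math.sqrt(df) / math.sqrt(1 - r_squared)   # t statistic for the slope
f = t ** 2                                          # F statistic in the ANOVA table

def t_pdf(x, dof):
    """Density of Student's t distribution with dof degrees of freedom."""
    c = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return c * (1 + x * x / dof) ** (-(dof + 1) / 2)

def two_tailed_p(t_value, dof, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|]; steps must be even."""
    t_abs = abs(t_value)
    h = t_abs / steps
    total = t_pdf(0.0, dof) + t_pdf(t_abs, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    area = total * h / 3          # probability mass between 0 and |t|
    return 1 - 2 * area           # mass in both tails

p = two_tailed_p(t, df)
print("r =", round(r, 3), "t =", round(t, 2), "F =", round(f, 2))
print("two-tailed p =", p)
print("variance not accounted for =", round(1 - r_squared, 3))
```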
Using Linear Regression to Develop a Standard Curve for Real Time PCR. Develop a standard curve from samples with known concentration.
– Y is the concentration
– X is the CP
Both X and Y are known. The relationship between concentration and output is not linear; it is an S-curve. The relationship between log(concentration) and output, however, is linear.
Standard Sample Data. Table columns: CP, Concentration, Log Concentration.
CP and Concentration are highly correlated, and the correlation is negative: higher concentrations give lower CP values. Now fit a regression line to these data and get the equation of the line.
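The effect of the log transformation on the correlation can be illustrated with a hypothetical tenfold dilution series. The CP values below are invented for this sketch and are not the standard sample data from the slides; `pearson_r` is a helper written here for illustration.

```python
import math

# Hypothetical tenfold dilution series; NOT the slide's standard sample data.
concentration = [1e7, 1e6, 1e5, 1e4, 1e3]
cp = [12.1, 15.4, 18.9, 22.2, 25.6]

# Base-10 log of each known concentration.
log_concentration = [math.log10(c) for c in concentration]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r_raw = pearson_r(cp, concentration)
r_log = pearson_r(cp, log_concentration)

print("r, CP vs concentration:     ", round(r_raw, 3))
print("r, CP vs log(concentration):", round(r_log, 3))
```

In this made-up series both correlations are negative, but the one with log concentration is much closer to -1, which is why the regression line is fitted to Log Concentration rather than to the raw concentration.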
Fit Regression Line. This is highly significant, of course. The Y-intercept is 8.11 and the slope is -.22. The equation is: Predicted Concentration = antilog(-.22(CP) + 8.11). Coefficients table: columns are Unstandardized Coefficients (B, Std. Error), Standardized Coefficients (Beta), t, Sig.; rows are (Constant), with Sig. on the order of E-14, and CP, with Sig. on the order of E-11. Dependent Variable: Log Concentration.
Standard Curve. For an unknown with a CP of 18.48, the Log Predicted Concentration would be -.22(18.48) + 8.11 = 4.04. The antilog of 4.04 is 1.10E4.
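The standard-curve calculation for an unknown can be written as a short Python sketch using the Y-intercept (8.11) and slope (-.22) reported for the fitted line. The function names are assumptions of this sketch, and the antilog is taken as base 10, which is what the slide's arithmetic implies (the antilog of 4.04 is about 1.10E4).

```python
INTERCEPT = 8.11   # Y-intercept of the fitted line (log10 concentration at CP = 0)
SLOPE = -0.22      # change in log10 concentration per unit of CP

def log10_concentration(cp):
    """Predicted log10 concentration for a given CP."""
    return SLOPE * cp + INTERCEPT

def concentration(cp):
    """Predicted concentration: base-10 antilog of the predicted log value."""
    return 10 ** log10_concentration(cp)

unknown_cp = 18.48
log_pred = log10_concentration(unknown_cp)
print("log10 predicted concentration:", round(log_pred, 2))
print("predicted concentration:", f"{concentration(unknown_cp):.2E}")

# Rounding the log to 4.04 before taking the antilog, as on the slide,
# gives a slightly smaller figure than carrying full precision through.
print("antilog of rounded log:", f"{10 ** round(log_pred, 2):.2E}")
```

Because the antilog magnifies small differences in the exponent, it is worth carrying extra decimal places in the log value until the final step.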