Statistical Methods For Engineers ChE 477 (UO Lab) Brigham Young University
Error of Measured Variable
Some definitions: x̄ = sample mean, s = sample standard deviation, μ = exact (population) mean, σ = exact (population) standard deviation.
As the sample becomes larger: x̄ → μ, s → σ, and the t chart → z chart. Not valid if bias exists (i.e., the calibration is off).
Questions: Several measurements are obtained for a single variable (e.g., T). What is the true value? How confident are you? Is the value different on different days?
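The convergence of x̄ to μ and s to σ can be demonstrated with a quick simulation; this is a sketch assuming a standard normal population with μ = 0 and σ = 1 (the seed and sample size are arbitrary choices for illustration):

```python
import random
import statistics

random.seed(0)  # reproducible demonstration
# Population with exact mean mu = 0 and exact standard deviation sigma = 1
sample = [random.gauss(0.0, 1.0) for _ in range(100_000)]

xbar = statistics.mean(sample)   # sample mean: approaches mu as n grows
s = statistics.stdev(sample)     # sample sd:   approaches sigma as n grows
```

With 100,000 points, x̄ and s land very close to 0 and 1; rerunning with a small n (say 5) shows much larger scatter, which is exactly why the confidence-interval machinery below is needed.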
t-test in Excel
The one-tailed t-test function in Excel is: =T.INV(α, r). Remember to put in α/2 for two-tailed tests (i.e., 0.025 for a 95% confidence interval).
The two-tailed t-test function in Excel is: =T.INV.2T(α, r), where α is the probability (e.g., 0.05 for a 95% confidence interval with the two-tailed test) and r is the degrees of freedom.
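The same critical values can be cross-checked outside Excel. As a sketch using only the Python standard library, the t pdf can be integrated numerically and inverted by bisection; the function name t_inv_2t is my own, chosen to mirror Excel's =T.INV.2T:

```python
import math

def t_pdf(x, df):
    # Student's t probability density function
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1.0 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=4000):
    # P(T <= x) for x >= 0, by trapezoid integration of the pdf from 0 to x
    h = x / steps
    area = 0.5 * h * (t_pdf(0.0, df) + t_pdf(x, df))
    area += h * sum(t_pdf(i * h, df) for i in range(1, steps))
    return 0.5 + area

def t_inv_2t(alpha, df):
    # Two-tailed critical value, mirroring Excel's =T.INV.2T(alpha, df)
    target = 1.0 - alpha / 2.0
    lo, hi = 0.0, 100.0
    for _ in range(60):            # bisection on the monotone cdf
        mid = 0.5 * (lo + hi)
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, t_inv_2t(0.05, 4) reproduces the familiar tabulated value 2.776 for a 95% confidence interval with 4 degrees of freedom.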
T-test example
μ = exact mean; 40.9 is the sample mean.
40.9 ± 2.4: 90% confident μ is somewhere in this range
40.9 ± 3.0: 95% confident μ is somewhere in this range
40.9 ± 4.6: 99% confident μ is somewhere in this range
The one-tailed probability α/2 is 0.05 for the first case, 0.025 for the second, and 0.005 for the third. What is α for each case?
Histogram Approximates a Probability Density Function (pdf)
All Statistical Info is in the pdf
Probabilities are determined by integration. Moments (means, variances, etc.) are also obtained by integrating the pdf. Most likely outcomes are determined from the peak values (modes) of the pdf.
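Both ideas can be made concrete with a minimal sketch using Python's standard library; the mean 40.9 and standard deviation 1.5 below are hypothetical values chosen for illustration, not from the slides:

```python
from statistics import NormalDist

# Probabilities come from integrating the pdf; for a normal pdf the
# integral (the cdf) is available in closed form.
dist = NormalDist(mu=40.9, sigma=1.5)       # hypothetical mean and sd
p = dist.cdf(43.9) - dist.cdf(37.9)         # P(mu - 2*sigma < X < mu + 2*sigma)

# The first moment (the mean) is the integral of x * pdf(x); approximate
# it with the trapezoid rule over mu +/- 8 sigma.
a, b, n = 40.9 - 12.0, 40.9 + 12.0, 20000
h = (b - a) / n
xs = [a + i * h for i in range(n + 1)]
g = [xi * dist.pdf(xi) for xi in xs]
mean = h * (sum(g) - 0.5 * (g[0] + g[-1]))  # recovers mu to high accuracy
```

The two-sigma probability comes out near the familiar 95.45%, and the numerically integrated first moment matches μ, illustrating that every summary statistic is recoverable from the pdf alone.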
Student’s t Distribution
Used to compute confidence intervals according to μ = x̄ ± t·s/√n. Assumes the mean and variance are estimated by the sample values x̄ and s.
Typical Numbers
Two-tailed analysis. Population mean and variance unknown; estimation of the population mean only. Calculated for a 95% confidence interval. Based on the number of data points, not the degrees of freedom.
Conversion of SD to CI Example
Five data points with sample mean and standard deviation of 714 and 108, respectively. The estimated population mean and 95% confidence interval is:
μ = 714 ± t(0.025, 4)·108/√5 = 714 ± 2.776·108/√5 = 714 ± 134
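The arithmetic on this slide takes only a few lines; the t value 2.776 is the tabulated t(α/2 = 0.025, df = 4):

```python
import math

# Slide's numbers: n = 5, sample mean 714, sample standard deviation 108
n, xbar, s = 5, 714.0, 108.0
t = 2.776                          # t(0.025, 4) from a t table
half_width = t * s / math.sqrt(n)  # ~134: the 95% CI is 714 +/- 134
```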
General Confidence Interval
Degrees of freedom generally = n − p, where n is the number of data points and p is the number of parameters. The confidence interval for a parameter aᵢ is given by aᵢ ± t(α/2, n − p)·s(aᵢ), where s(aᵢ) is the standard error of the parameter.
Linear Fit Confidence Interval
For the intercept: a ± t(α/2, n − 2)·s·√( Σxᵢ² / (n·Σ(xᵢ − x̄)²) )
For the slope: b ± t(α/2, n − 2)·s / √( Σ(xᵢ − x̄)² )
where s is the standard deviation of the residuals.
Sum of the squares of the differences between the data and the model: SSE = Σ(yᵢ − ŷᵢ)²
An Example
Assume you collect the seven data points shown below, which represent the measured relationship between temperature and a signal (current) from a sensor. You want to know how to determine the temperature from the current.

Current/A   Temperature/°C
8.22524     2.5
16.0571     5
21.6508     7.5
26.621      10
27.7787     12.5
38.0298     15
39.9741     17.5
First Plot the Data
Fit Data and Determine Residuals
Determine Model Parameters
Residuals are an easy and accurate means of determining whether the model is appropriate and of estimating the overall variation (standard deviation) of the data. The average of the residuals should always be zero. These formulas apply only to a linear regression; similar formulas apply to any polynomial, and approximate formulas apply to any equation.
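The fitting steps above can be sketched for the example data using only the Python standard library. The last temperature (17.5 °C) is assumed here, continuing the table's 2.5-degree spacing, and the t value 2.571 is the tabulated t(0.025, 5):

```python
import math

# Example data: sensor current (independent) vs. temperature (dependent)
current = [8.22524, 16.0571, 21.6508, 26.621, 27.7787, 38.0298, 39.9741]
temp    = [2.5, 5.0, 7.5, 10.0, 12.5, 15.0, 17.5]   # last point assumed

n = len(current)
xb = sum(current) / n
yb = sum(temp) / n
sxx = sum((x - xb) ** 2 for x in current)
sxy = sum((x - xb) * (y - yb) for x, y in zip(current, temp))
slope = sxy / sxx
intercept = yb - slope * xb

# Residuals: data minus model; their average is ~0 for a least-squares fit
resid = [y - (intercept + slope * x) for x, y in zip(current, temp)]
s = math.sqrt(sum(r * r for r in resid) / (n - 2))  # residual standard deviation

# Standard errors and 95% confidence intervals (df = n - 2 = 5, t = 2.571)
t = 2.571
se_slope = s / math.sqrt(sxx)
se_intercept = s * math.sqrt(sum(x * x for x in current) / (n * sxx))
ci_slope = (slope - t * se_slope, slope + t * se_slope)
ci_intercept = (intercept - t * se_intercept, intercept + t * se_intercept)
```

The fit comes out near T ≈ 0.469·I − 1.94, and checking that the residuals sum to zero is a quick sanity test on any least-squares implementation.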
Determine Confidence Interval
Two typical datasets
Straight-line Regression

Dataset 1:
            Estimate    Std Error   t-Statistic   P-Value
intercept   0.241001    1.733e-4    139.041       3.6504e-10
slope       -3.214e-4   5.525e-6    -58.1739      2.8398e-8

Dataset 2:
            Estimate    Std Error   t-Statistic   P-Value
intercept   0.239396    3.3021e-3   72.4977       9.13934e-14
slope       -3.264e-4   1.0284e-5   -31.7349      1.50394e-10
Prediction Bands 95% confidence interval for the correct line
Linear vs. Nonlinear Models Linear and nonlinear refer to the coefficients, not the forms of the independent variable. The derivative of a linear model with respect to a parameter does not depend on any parameters. The derivative of a nonlinear model with respect to a parameter depends on one or more of the parameters.
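The derivative test can be made concrete; the model forms and function names below are illustrative choices, not from the slides:

```python
import math

# Linear-in-parameters model: y = a + b*x
#   dy/da = 1 and dy/db = x: neither derivative contains a or b.
def dy_db_linear(x, a, b):
    return x  # independent of both parameters -> linear model

# Nonlinear model: y = a * exp(b*x) (an Arrhenius-like form, for illustration)
#   dy/da = exp(b*x) and dy/db = a*x*exp(b*x): both depend on parameters.
def dy_db_exponential(x, a, b):
    return a * x * math.exp(b * x)  # depends on a and b -> nonlinear model
```

Changing the parameter values leaves the linear model's derivative untouched but changes the nonlinear model's, which is exactly the distinction the definition draws.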
Joint Confidence Region
(Figure: joint confidence regions comparing the linearized result, the correct (unknown) result, and the nonlinear result.)
Extension
Graphical Summary
The linear and nonlinear analyses are compared to the original data both as k vs. T and as ln(k) vs. 1/T. As seen in the upper graph, the linearized analysis fits the low-temperature data well, at the cost of a poorer fit to the high-temperature results. The nonlinear analysis does a more uniform job of distributing the lack of fit. As seen in the lower graph, the linearized analysis evenly distributes errors in log space.
Parameter Estimates
Best estimate of parameters for a given set of data.
Linear Equations
- Explicit equations
- Requires no initial guess
- Depends only on measured values of the dependent and independent variables
- Does not depend on values of any other parameters
Nonlinear Equations
- Implicit equations
- Requires an initial guess
- Convergence often difficult
- Depends on the data and on the parameters
Parameter Estimates
The nonlinear estimate (blue) is closer to the correct value (black) than the linearized estimate (red). The blue line represents the parameters' 95% confidence region. It is possible that the linear analysis could be closer to the correct answer for some random set of data, but this would be fortuitous.
For Parameter Estimates
In all cases, linear and nonlinear, fit what you measure (more specifically, the data that have normally distributed errors) rather than some transformation of them. Any nonlinear transformation (anything other than adding or multiplying by a constant) changes the error distribution and invalidates much of the statistical theory behind the analysis. Standard packages are widely available for linear equations. Nonlinear analyses should be done on the raw data (or data with normally distributed errors) and will require iteration, which Excel and other programs can handle.
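One common workflow consistent with this advice is to linearize only to get an initial guess, then iterate on the raw measurements. The sketch below does this for an Arrhenius-type model k = A·exp(−E/T) using a hand-rolled Gauss-Newton loop; the data, parameter values (A = 100, E = 2000 K, with E lumping the activation energy over R), and temperatures are illustrative assumptions, not from the slides:

```python
import math

# Hypothetical, noiseless rate data generated from k = A*exp(-E/T)
T = [300.0, 350.0, 400.0, 450.0, 500.0]
k = [100.0 * math.exp(-2000.0 / t) for t in T]

# Step 1: linearize ONLY for the initial guess:
# ln k = ln A - E*(1/T), fit by ordinary least squares
x = [1.0 / t for t in T]
y = [math.log(v) for v in k]
n = len(x)
xb, yb = sum(x) / n, sum(y) / n
slope = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
         / sum((xi - xb) ** 2 for xi in x))
E = -slope
A = math.exp(yb - slope * xb)

# Step 2: Gauss-Newton iteration on the raw k values (fit what you measure)
for _ in range(20):
    r = [A * math.exp(-E / t) - v for t, v in zip(T, k)]  # residuals
    Ja = [math.exp(-E / t) for t in T]                    # d(model)/dA
    Je = [-A * math.exp(-E / t) / t for t in T]           # d(model)/dE
    # solve the 2x2 normal equations (J^T J) delta = -J^T r
    saa = sum(j * j for j in Ja)
    see = sum(j * j for j in Je)
    sae = sum(a * e for a, e in zip(Ja, Je))
    ba = -sum(a * ri for a, ri in zip(Ja, r))
    be = -sum(e * ri for e, ri in zip(Je, r))
    det = saa * see - sae * sae
    dA = (ba * see - be * sae) / det
    dE = (saa * be - sae * ba) / det
    A, E = A + dA, E + dE
    if abs(dA) + abs(dE) < 1e-12:  # converged
        break
```

With noiseless data both steps recover A and E essentially exactly; with real, noisy k measurements the nonlinear step is the one that minimizes the sum of squares of what was actually measured, while the log-space fit silently reweights the errors.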
Recommendations
Minimize the sum of squares of the differences between the measurements and the model, written in terms of what you measured. DO NOT linearize the model, i.e., make it look something like a straight-line model. Confidence intervals for individual parameters can be misleading; joint/simultaneous confidence regions are much more reliable. The propagation-of-error formula grossly overestimates error.