1 Modeling Prediction Intervals using Monte Carlo Simulation Software
James R. Black
Qing Qing Wu
17 Feb 2016
2016 ICEAA Professional Development & Training Workshop
2 Presenter Bios
James “Jay” Black has 12 years of cost estimating experience and currently works as a senior operations research analyst for the Administration for Children and Families within the U.S. Department of Health and Human Services. In this role, he supports the Grants Center of Excellence software suite used to administer 1,200 grant programs in eight Federal departments. Jay has a Master's in Systems Engineering from Johns Hopkins University and holds a current CCE/A certification.
Qing Qing “Q” Wu is a cost analyst for the Cost Effectiveness Branch at the Naval Surface Warfare Center Carderock Division. She supports the Weapon Systems Division of the Naval Sea Systems Command 05C Cost Engineering & Industrial Analysis Division. She has a Bachelor’s degree in Mathematics from the Macaulay Honors College at The City College of New York.
3 Presentation Summary
References/Acknowledgements:
- 2014 ICEAA Workshop presentation prepared by Dr. Christian Smart (MDA) and Marc Greenberg (NASA)
- Joint Agency Cost Schedule Risk and Uncertainty Handbook (CSRUH, Feb 2014)
Abstract:
- The use of a prediction interval (PI) is a simple method of quantifying risk and uncertainty for a Cost Estimating Relationship (CER) derived from an Ordinary Least Squares (OLS) regression
- Yet few cost estimators implement PIs in their estimates, despite their frequent use of CERs
This presentation provides a step-by-step tutorial for modeling a PI for an example CER using Monte Carlo simulation software and identifies the beneficial impact on the coefficient of variation (CV)
4 Cost Estimating Relationships (CERs)
Definition: A Cost Estimating Relationship (CER) is a mathematical expression of cost as a function of one or more independent variables
CERs are often developed using regression analysis to fit an equation to a data set
Examples of equations used for CERs include:
- Linear CER: \( y = a + bx \)
- Nonlinear CERs: \( y = ax^b \), \( y = ab^x \), \( y = a + bx^c \)
where y = Cost and x = Technical Parameter
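As a minimal sketch, the four CER forms listed on this slide can be written as plain Python functions. The function names and every coefficient value in the usage lines are hypothetical placeholders chosen for illustration; none of them come from this brief.

```python
# Sketch of the four CER forms; y = cost, x = technical parameter.
# Function names and all coefficient values are hypothetical.

def linear_cer(a: float, b: float, x: float) -> float:
    """Linear CER: y = a + b*x."""
    return a + b * x


def power_cer(a: float, b: float, x: float) -> float:
    """Nonlinear CER: y = a * x**b."""
    return a * x ** b


def exponential_cer(a: float, b: float, x: float) -> float:
    """Nonlinear CER: y = a * b**x."""
    return a * b ** x


def offset_power_cer(a: float, b: float, c: float, x: float) -> float:
    """Nonlinear CER: y = a + b * x**c."""
    return a + b * x ** c


weight_lbs = 5000.0  # hypothetical technical parameter
point_estimate = linear_cer(a=1500.0, b=0.125, x=weight_lbs)
print(point_estimate)
```

Each function returns a single point estimate; the rest of the brief is about the uncertainty around that point.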
5 Modeling Uncertainty
- CERs do not perfectly fit the historical data on which they are based
- This results in an underlying uncertainty distribution about an estimate
  - The outcome of a CER represents only one point on that uncertainty distribution (typically the mean or median)
- This brief models this uncertainty
6 Modeling Uncertainty (cont.)
Model uncertainty is variation about the dependent variable, i.e., cost
- For a linear CER (often used to create weight-based estimates): \( Y = a + bX + \varepsilon \)
- For a nonlinear CER (often used to model learning curves): \( Y = aX^b \cdot \varepsilon \)
where \( \varepsilon \) represents the error between the estimated cost and the actual cost Y; the estimate uncertainty is captured by the Prediction Interval
7 Example: Modeling Uncertainty for a Linear CER
For example, consider a linear CER: \( Y = a + bX + \varepsilon \)
Using Monte Carlo simulation software (e.g., Oracle Crystal Ball), define a distribution for the error term \( \varepsilon \):
- \( \varepsilon \) = normal(mean = 0, std dev = prediction error)
- OR
- \( \varepsilon \) = student-t(midpoint = 0, scale = prediction error, degrees of freedom)
OK, so how do you define prediction error?
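The two error-term choices above can be compared outside a commercial tool. The following is a minimal sketch using only the Python standard library. The `student_t_draw` helper, the prediction error of 700, and the 7 degrees of freedom are illustrative assumptions, and the sampling method (a standard normal divided by the square root of a scaled chi-square) is a textbook construction, not necessarily what any particular Monte Carlo package does internally.

```python
import math
import random
import statistics


def student_t_draw(rng: random.Random, scale: float, dof: int) -> float:
    """One draw from a Student-t with midpoint 0, the given scale and degrees of freedom."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(dof / 2.0, 2.0)  # chi-square with `dof` degrees of freedom
    return scale * z / math.sqrt(chi2 / dof)


rng = random.Random(2016)   # fixed seed so the sketch is repeatable
prediction_error = 700.0    # hypothetical value
dof = 7                     # hypothetical: n - 2 with n = 9 observations
trials = 20000

normal_draws = [rng.gauss(0.0, prediction_error) for _ in range(trials)]
t_draws = [student_t_draw(rng, prediction_error, dof) for _ in range(trials)]

sd_normal = statistics.pstdev(normal_draws)
sd_t = statistics.pstdev(t_draws)
print(round(sd_normal), round(sd_t))
```

With the same scale, the Student-t draws are more spread out than the normal draws because of the heavier tails, and the difference grows as the degrees of freedom shrink.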
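8 Prediction Interval Equation
\[ \hat{y} \pm t_{\alpha/2,\,n-2} \cdot SE \cdot \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \]
where
- \( \hat{y} \) = Calculated Value from Regression Line
- \( t_{\alpha/2,\,n-2} \) = t Critical Value (T.INV.2T function in Excel)
- \( SE \) = Standard Error of the Estimate (STEYX function in Excel)
- \( n \) = number of observations
- \( \bar{x} \) = average of X
- \( \sum_{i=1}^{n} (x_i - \bar{x})^2 \) = sum of squared deviations of X from its mean (DEVSQ function in Excel)
Prediction Error = \( SE \cdot \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \)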
9 Evaluating the Prediction Error
Example Dataset
Development $ (BY12$M)   Weight (lbs.)
$1,000                   1,000
$2,000                   3,000
$1,600                   2,500
$1,…                     …
$2,000                   3,500
$3,500                   9,000
$5,000                   30,000
$4,000                   10,000
$1,600                   4,000
Set Up the Inputs
Evaluate Prediction Error
Prediction Error: =SEE*(SQRT(((n+1)/n)+(((X-Avg)^2)/Devsq)))
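The spreadsheet inputs (SEE, n, Avg, Devsq) and the prediction error formula can be reproduced in a few lines of Python. This is a sketch under a stated assumption: it uses only the eight fully legible rows of the example dataset, because one row is truncated in the source, so its numbers will not match the nine-point regression in the brief. The variable names mirror the Excel names and are otherwise arbitrary.

```python
import math

# ASSUMPTION: eight legible (cost, weight) rows only; the truncated row is left out,
# so these results differ from the nine-point regression used in the brief.
cost = [1000, 2000, 1600, 2000, 3500, 5000, 4000, 1600]          # Development $, BY12$M
weight = [1000, 3000, 2500, 3500, 9000, 30000, 10000, 4000]      # lbs.

n = len(weight)                                   # COUNT
avg = sum(weight) / n                             # AVERAGE of X
mean_cost = sum(cost) / n
devsq = sum((x - avg) ** 2 for x in weight)       # DEVSQ of X

# Ordinary least squares slope and intercept (SLOPE and INTERCEPT in Excel).
sxy = sum((x - avg) * (y - mean_cost) for x, y in zip(weight, cost))
slope = sxy / devsq
intercept = mean_cost - slope * avg

# Standard error of the estimate (STEYX): sqrt(SSE / (n - 2)).
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(weight, cost))
see = math.sqrt(sse / (n - 2))


def prediction_error(x: float) -> float:
    """SEE * SQRT((n+1)/n + (X-Avg)^2 / Devsq), the spreadsheet formula above."""
    return see * math.sqrt((n + 1) / n + (x - avg) ** 2 / devsq)


pe_5000 = prediction_error(5000)
print(round(slope, 4), round(intercept, 1), round(see, 1), round(pe_5000, 1))
```

The prediction error is smallest when X equals the average weight and grows as the estimate moves away from the center of the data.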
10 Example: Modeling Uncertainty for a Linear CER
Example Dataset: the same nine Development $ (BY12$M) and Weight (lbs.) data points as on slide 9
OLS Regression* (Excel trendline): Y = … x + …
Define two distributions and look at the resulting effect on CV:
- On the independent variable: x = triangular(4000, 5000, 7000)
- On the error term: student-t(midpoint = 0, scale = prediction error, degrees of freedom)
* Note: the Excel trendline is used for presentation brevity; when running a regression on your own, make sure you consider the t- and F-statistics, adjusted R^2, and other fit measures
11 Status Quo Example: Risk Only on Independent Variable
Deciles CV 5.2%
Deciles (90% down to 10%): $2,… $2,… $2,… $2,… $2,… $2,… $2,… $2,… $1,…
CV 7.0%
Risk only on Independent Variable (Weight):
- Only with weighted (triangular) distribution
- Low CV of 0.07
Basis and Values of Risk Parameters
Risk Parameter   Min                     Most Likely               Max
Weight Dist      Weight Low (10%) 4000   Weight Most Likely 5000   Weight High (90%) 7000
Note: Regression of the original dataset had an R² = …
12 Prediction Interval Example: Risk on Independent Var. & Error Term
Deciles (90% down to 10%): $3,… $2,… $2,… $2,… $2,… $1,… $1,… $1,… $1,…
CV 39.9%
Risk on Independent Variable and Error Term:
- Weighted (triangular) distribution and PI (Student-t) distribution
- High CV of 0.40
Student-t Distribution Parameters: Midpoint = 0, Scale = … (Prediction Error), Degrees of Freedom = 7 (n-2)
Basis and Values of Risk Parameters
Risk Parameter   Min                     Most Likely               Max
PI Dist          …                       …                         …
Weight Dist      Weight Low (10%) 4000   Weight Most Likely 5000   Weight High (90%) 7000
Note: Regression of the original dataset had an R² = …
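The comparison between the two cases (risk on weight only versus risk on weight plus the error term) can be sketched as a small standard-library Monte Carlo run. Several things here are assumptions rather than the brief's setup: the regression is fit to the eight legible data rows only, 4000 and 7000 are treated as the absolute minimum and maximum of the triangular distribution rather than its 10% and 90% points, the prediction error is evaluated once at the most likely weight of 5000, and the Student-t draw uses a textbook construction. The resulting CVs therefore illustrate the effect and are not expected to match the 7.0% and 39.9% reported on the slides.

```python
import math
import random
import statistics

# ASSUMPTION: eight legible (cost, weight) rows; the truncated row is omitted.
cost = [1000, 2000, 1600, 2000, 3500, 5000, 4000, 1600]
weight = [1000, 3000, 2500, 3500, 9000, 30000, 10000, 4000]

n = len(weight)
avg = statistics.fmean(weight)
mean_cost = statistics.fmean(cost)
devsq = sum((x - avg) ** 2 for x in weight)
slope = sum((x - avg) * (y - mean_cost) for x, y in zip(weight, cost)) / devsq
intercept = mean_cost - slope * avg
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(weight, cost))
see = math.sqrt(sse / (n - 2))

x_likely = 5000.0
pred_err = see * math.sqrt((n + 1) / n + (x_likely - avg) ** 2 / devsq)
dof = n - 2


def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.fmean(values)


rng = random.Random(12)   # fixed seed for repeatability
trials = 20000
x_only, x_and_error = [], []
for _ in range(trials):
    # random.triangular takes (low, high, mode).
    x = rng.triangular(4000.0, 7000.0, 5000.0)
    y = intercept + slope * x
    # Student-t error term: midpoint 0, scale = prediction error, dof = n - 2.
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(dof / 2.0, 2.0)
    e = pred_err * z / math.sqrt(chi2 / dof)
    x_only.append(y)
    x_and_error.append(y + e)

cv_x_only = coefficient_of_variation(x_only)
cv_x_and_error = coefficient_of_variation(x_and_error)
print(f"CV, risk on weight only: {cv_x_only:.1%}")
print(f"CV, risk on weight and error term: {cv_x_and_error:.1%}")
```

Even in this simplified setup, adding the error term moves the CV from a few percent to roughly the same order as the second case on the slides, which is the point of the comparison.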
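13 Summary
- Implementing risk on the error term using the prediction interval is not difficult
- Even for regressions with reasonable fit statistics, implementing risk on the error term can produce desirable CVs: in the linear example, the CV rose from 7.0% (risk on weight only) to 39.9% (risk on weight and the error term)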
14 BACKUP
15 Example: Linear
Example: X = 5000
x      Prediction Error                              e   y                    y + e
5000   =SEE*(SQRT(((n+1)/n)+(((X-Avg)^2)/Devsq)))    0   =Slope*X+Intercept   =y+e
Student-t Distribution (on e): Midpoint = 0, Scale = Prediction Error, Degrees of Freedom = n-k-1
Triangular Distribution (on x): 10% = 4,000, Likeliest = 5,000, 90% = 7,000
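16 Student-t Distribution Explained
Inputs to the Student-t distribution:
- Midpoint: 0
- Scale: Prediction Error
- Deg. Freedom: n-k-1, where n is the number of observations and k is the number of independent variables (n-2 for the single-variable CERs in this brief)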
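To show how the Scale input relates to the spread of the simulated error term, the following states a standard property of a scaled Student-t variable. The symbols \( s \) (scale), \( \nu \) (degrees of freedom), and \( T \) are introduced here for illustration only and do not appear in the brief.

```latex
e = s \cdot T, \qquad T \sim t_{\nu}, \qquad \nu = n - k - 1
\operatorname{E}[e] = 0 \quad (\nu > 1), \qquad
\operatorname{SD}[e] = s \sqrt{\frac{\nu}{\nu - 2}} \quad (\nu > 2)
```

For example, with \( \nu = 7 \) (the linear example, n = 9 and k = 1), \( \sqrt{7/5} \approx 1.18 \), so the standard deviation of the simulated error term is about 18% larger than the prediction error entered as the scale.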
17 Prediction Interval Equation
The prediction interval depends on:
- the standard error of the CER
- the CER sample size (i.e., the number of data points used to derive the CER)
- the desired confidence level
- the distance from the center of the CER’s independent variables to the location of the independent variable of the point being estimated
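18 Generating the S-Curve from the Prediction Interval
The S-curve can be generated by varying the critical value of the t distribution for the prediction interval equation, holding the CER input(s) constant:
\[ \text{Cost}_p = \hat{y} + t_{p,\,n-2} \cdot \text{Prediction Error} \]
where \( t_{p,\,n-2} \) is the t value at cumulative probability p and
Prediction Error = \( SE \cdot \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \)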
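The S-curve construction described above can be sketched without a simulation by evaluating the t distribution at a set of cumulative probabilities. The sketch below uses only the standard library, so it computes the t quantile numerically (Simpson's rule for the CDF, bisection for the inverse); in a spreadsheet the same value would come from a left-tailed inverse-t function. The regression value of 2200, the prediction error of 740, and the 7 degrees of freedom are hypothetical inputs chosen for illustration.

```python
import math


def t_pdf(x: float, dof: int) -> float:
    """Student-t probability density."""
    c = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return c * (1 + x * x / dof) ** (-(dof + 1) / 2)


def t_cdf(x: float, dof: int, steps: int = 1000) -> float:
    """Student-t CDF via Simpson's rule on [0, |x|], using symmetry about 0."""
    if x == 0:
        return 0.5
    upper = abs(x)
    h = upper / steps
    total = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def t_inv(p: float, dof: int) -> float:
    """Left-tailed inverse CDF by bisection; adequate for 0.001 < p < 0.999."""
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


y_hat = 2200.0     # hypothetical value from the regression line
pred_err = 740.0   # hypothetical prediction error at the chosen X
dof = 7            # hypothetical: n - 2 with n = 9

# One (cumulative probability, cost) point per decile, CER input held constant.
s_curve = [(p / 10, y_hat + t_inv(p / 10, dof) * pred_err) for p in range(1, 10)]
for prob, value in s_curve:
    print(f"{prob:.0%}  {value:,.0f}")
```

The 50% point falls on the regression value itself, and the curve is symmetric about it because the t distribution is symmetric.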
19 Example: Non-Linear
Dataset: the same nine Development $ (BY12$M) and Weight (lbs.) data points as on slide 9, with added columns ln(Dev $) and ln(Weight)
Inputs
Input       Excel formula                   Value
n           =COUNT(ln(x))                   9
Slope       =SLOPE(ln(y), ln(x))            0.50
Intercept   =INTERCEPT(ln(y), ln(x))        3.47
SEE         =STEYX(ln(y), ln(x))            0.15
Avg         =AVERAGE(ln(x))                 8.29
Devsq       =DEVSQ(ln(x))                   10.03
Prediction Error: =SEE*(SQRT(((n+1)/n)+(((X-Avg)^2)/Devsq)))
20 Example: Non-Linear
Example: x = 12000
x       ln(x)    Prediction Error                              e   y                    y with e   Anti-log of Y
12000   =LN(x)   =SEE*(SQRT(((n+1)/n)+(((X-Avg)^2)/Devsq)))    0   =Slope*X+Intercept   =y+e       =EXP(y+e)
In the Prediction Error and y formulas, X is ln(x), because the regression is run in log space
Student-t Distribution: Midpoint = 0, Scale = Prediction Error, Degrees of Freedom = n-k-1
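The log-space row above can be traced step by step in Python using the rounded input values shown for this example (n = 9, Slope = 0.50, Intercept = 3.47, SEE = 0.15, Avg = 8.29, Devsq = 10.03) at x = 12000. This is a sketch: because the inputs are rounded to two decimals, the results are approximate, and the helper name `cost_with_error` is made up for illustration.

```python
import math

# Rounded regression inputs as displayed for the non-linear example.
n = 9
slope = 0.50
intercept = 3.47
see = 0.15
avg = 8.29       # average of ln(x)
devsq = 10.03    # sum of squared deviations of ln(x)

x = 12000
ln_x = math.log(x)                                   # =LN(x)

# Prediction error in log space: X in the spreadsheet formula is ln(x).
pred_err = see * math.sqrt((n + 1) / n + (ln_x - avg) ** 2 / devsq)

y_log = slope * ln_x + intercept                     # =Slope*X+Intercept


def cost_with_error(e: float) -> float:
    """Anti-log of (y + e): converts the log-space estimate back to dollars."""
    return math.exp(y_log + e)


point_cost = cost_with_error(0.0)   # e = 0 reproduces the static row of the table
print(round(ln_x, 3), round(pred_err, 3), round(y_log, 3), round(point_cost))
```

The computed prediction error is close to the 0.16 scale used for the Student-t distribution in this example. In the simulation, e is replaced by a Student-t draw, so the error enters the dollar estimate multiplicatively through the exponential.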
21 Prediction Interval Example: Non-Linear
Risk on Error Term:
- PI (Student-t) distribution
- CV of 0.20
Student-t Distribution Parameters: Midpoint = 0, Scale = 0.16 (Prediction Error), Degrees of Freedom = 7 (n-2)
Basis and Values of Risk Parameters
Risk Parameter   Min   Most Likely   Max
PI Dist          …     …             …
Note: Regression of the original dataset had an R² = …
22 Example 2: Non-Linear Learning Curve Example
Dataset
x (Lot midpoint)   y (Unit Cost)   ln(x)   ln(y)
1.87               $1,…            …       …
…                  $…              …       …
…                  $…              …       …
…                  $…              …       …
…                  $…              …       …
…                  $750            …       …
Inputs
Input       Excel formula                   Value
n           =COUNT(ln(x))                   6
Slope       =SLOPE(ln(y), ln(x))            -0.13
Intercept   =INTERCEPT(ln(y), ln(x))        7.13
SEE         =STEYX(ln(y), ln(x))            0.05
Avg         =AVERAGE(ln(x))                 2.71
Devsq       =DEVSQ(ln(x))                   9.53
Prediction Error: =SEE*(SQRT(((n+1)/n)+(((X-Avg)^2)/Devsq)))
23 Example 2: Non-Linear Learning Curve Example
Example: x = 20
x    ln(x)    Prediction Error                              e   y                    y with e   Anti-log of Y
20   =LN(x)   =SEE*(SQRT(((n+1)/n)+(((X-Avg)^2)/Devsq)))    0   =Slope*X+Intercept   =y+e       =EXP(y+e)
In the Prediction Error and y formulas, X is ln(x), because the regression is run in log space
Student-t Distribution: Midpoint = 0, Scale = Prediction Error, Degrees of Freedom = n-k-1
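The learning curve example can be run end to end as a small simulation: draw the error term in log space, add it to the log-space estimate, and take the anti-log. The sketch below uses the rounded inputs shown for this example (n = 6, Slope = -0.13, Intercept = 7.13, SEE = 0.05, Avg = 2.71, Devsq = 9.53) at x = 20. The Student-t sampling method, the seed, and the trial count are assumptions made for illustration, so the resulting CV is only expected to be in the neighborhood of the value reported for this example.

```python
import math
import random
import statistics

# Rounded regression inputs as displayed for the learning curve example.
n = 6
slope = -0.13
intercept = 7.13
see = 0.05
avg = 2.71      # average of ln(x)
devsq = 9.53    # sum of squared deviations of ln(x)

x = 20                       # lot midpoint being estimated
ln_x = math.log(x)
pred_err = see * math.sqrt((n + 1) / n + (ln_x - avg) ** 2 / devsq)
y_log = slope * ln_x + intercept
dof = n - 2                  # n - k - 1 with one independent variable

rng = random.Random(20)      # fixed seed for repeatability
trials = 20000
unit_costs = []
for _ in range(trials):
    # Student-t error in log space: midpoint 0, scale = prediction error.
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(dof / 2.0, 2.0)
    e = pred_err * z / math.sqrt(chi2 / dof)
    unit_costs.append(math.exp(y_log + e))   # anti-log of (y + e)

cv = statistics.pstdev(unit_costs) / statistics.fmean(unit_costs)
print(round(pred_err, 3), round(math.exp(y_log)), f"{cv:.1%}")
```

With only 4 degrees of freedom the Student-t has noticeably heavy tails, so the simulated CV is larger than the prediction error alone would suggest.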
24 Prediction Interval Example 2: Non-Linear
Risk on Error Term:
- PI (Student-t) distribution
- CV of 0.09
Student-t Distribution Parameters: Midpoint = 0, Scale = 0.06 (Prediction Error), Degrees of Freedom = 4 (n-2)
Basis and Values of Risk Parameters
Risk Parameter   Min   Most Likely   Max
PI Dist          …     …             …
Note: Regression of the original dataset had an R² = …
25 Linear Regression Example Fit Statistics
26 Nonlinear Regression Example 2 Fit Statistics