Design of Experiments and Data Analysis


Graduate Seminar, Fall 2006, Hal Carter
These slides are available at www.ececs.uc.edu/~hcarter/presentations/experimental_design.ppt

Bibliography
1. Julian Faraway, "Practical Regression and Anova using R," July 2002. Available at cran.us.r-project.org/other-docs.html
2. "The R Project for Statistical Computing." Available at www.r-project.org; software available for Linux, Windows, and OS X.
3. Raj Jain, "The Art of Computer Systems Performance Analysis," Wiley, 1991.
4. "An Introduction to R." Available at www.cran.us.r-project.org/manuals.html

Agenda
- Analyze and Display Data
  - Simple Statistical Analysis
  - Comparing Results
  - Determining O(n)
- Design Your Experiments
  - 2^k Designs
  - Including Replications
  - Full Factorial Designs

A System
[Block diagram: a system with system inputs and system outputs; factors act on the system, and responses are measured at the outputs.]

Experimental Research
- Define System
  - Define the system outputs first
  - Then define the system inputs
  - Finally, define the behavior (i.e., the transfer function)
- Identify Factors and Levels
  - Identify the system parameters that vary (many)
  - Reduce the parameters to the important factors (few)
  - Identify the values (i.e., levels) for each factor
- Identify Response(s)
  - Identify the time or space effects of interest
- Design Experiments
  - Identify the factor-level experiments

Create and Execute System; Analyze Data
- Define Workload
  - Workloads are the inputs applied to the system
  - A workload can be a factor (but often isn't)
- Create System
  - Create the system so it can be executed, as a:
    - Real prototype
    - Simulation model
    - Set of empirical equations
- Execute System
  - Execute the system for each factor-level binding
  - Collect and archive the response data
- Analyze & Display Data
  - Analyze the data according to the experiment design
  - Evaluate the raw and analyzed data for errors
  - Display the raw and analyzed data to draw conclusions

Some Examples

Epitaxial growth
- New method using a non-linear temperature profile
- What is the system?
- Responses: total time, quality of layer, total energy required, maximum layer thickness
- Factors: temperature profile, oxygen density, initial temperature, ambient temperature

Analog simulation
- Which of three solvers is best?
- What is the system?
- Responses: fastest simulation time, most accurate result, most robust to the types of circuits being simulated
- Factors: solver, type of circuit model, matrix data structure

Simple Models of Data
Evaluation of a new wireless network protocol.
- System: wireless network with the new protocol
- Workload: 10 messages applied at a single source, each message identically configured
- Experiment output: roundtrip latency per message (ms), stored in data file "latency.dat"
- Summary statistics: mean 19.6 ms, variance 10.71 ms^2, standard deviation 3.27 ms

% R
> data=read.table("latency.dat",header=T)
> data
   Latency
1       22
2       23
3       19
4       18
5       15
6       20
7       26
8       17
9       19
10      17
> attach(data)
> mean(Latency)
[1] 19.6
> var(Latency)
[1] 10.71111
> sd(Latency)
[1] 3.272783
> Index=c(1:10)
> plot(Index,Latency,pch=19,cex.lab=1.5)

Verify Model Preconditions
- Check for a normal distribution
  - Use a quantile-quantile plot
  - The pattern should adhere consistently to the ideal quantile-quantile line
- Check randomness
  - Use a plot of the residuals around the mean
  - The residuals should appear random

> # Plot residuals to assess randomness
> Residuals=Latency-mean(Latency)
> plot(Index,Residuals,pch=19,cex.lab=1.5)
> abline(0,0)
> # Plot quantile-quantile plot to assess if residuals
> # normally distributed
> qqnorm(Latency, pch=19,cex.lab=1.5)
> qqline(Latency)
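A numerical normality check can complement the visual Q-Q assessment; a minimal optional addition (shapiro.test is part of base R, but this step is not in the original analysis):

> # Optional: Shapiro-Wilk normality test on the residuals;
> # a large p-value gives no evidence against normality
> shapiro.test(Residuals)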

Confidence Intervals
Sample mean vs. population mean:
- For n >= 30 samples: CI = mean ± z_(1-α/2) · s/sqrt(n)
- For n < 30 samples: CI = mean ± t_(1-α/2; n-1) · s/sqrt(n)

> mean(Latency) - qt(0.975,9)*sd(Latency)/sqrt(10)
[1] 17.25879
> mean(Latency) + qt(0.975,9)*sd(Latency)/sqrt(10)
[1] 21.94121

For the latency data, n = 10 and α = 0.05, so the 95% CI is (17.26, 21.94).

Reference: Raj Jain, "The Art of Computer Systems Performance Analysis," Wiley, 1991.
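The same interval can be read directly from R's one-sample t-test; a quick cross-check (not in the original slides):

> # t.test reports the same t-based 95% CI as the formula above
> t.test(Latency, conf.level=0.95)$conf.int
[1] 17.25879 21.94121
attr(,"conf.level")
[1] 0.95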

Scatter and Line Plots
Resistance profile of a doped silicon epitaxial layer; we expect resistance to increase linearly with depth.

Data file "xyscatter.dat":
Depth  Resistance
1       1.689015
2       4.486722
3       7.915209
4       6.362388
5      11.830739
6      12.329104
7      14.011396
8      17.600094
9      19.022146
10     21.513802

> data=read.table("xyscatter.dat", header=T)
> attach(data)
> model = lm(Resistance ~ Depth)
> summary(model)
Residuals:
     Min       1Q   Median       3Q      Max
-2.11330 -0.40679  0.05759  0.51211  1.57310
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.05863    0.76366  -0.077     0.94
Depth        2.13358    0.12308  17.336 1.25e-07 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 1.118 on 8 degrees of freedom
Multiple R-Squared: 0.9741, Adjusted R-squared: 0.9708
F-statistic: 300.5 on 1 and 8 DF,  p-value: 1.249e-07
> plot(Depth,Resistance,main="Epi Layer Resistance",xlab="Depth,
+ microns",ylab="Resistance, Mohms",pch=19,
+ cex.main=1.5,cex.axis=1.5,cex.lab=1.5)
> abline(-0.05863, 2.13358)
> error=Resistance-(-0.05863+2.13358*Depth)
> plot(Depth,error,main="Residual Plot",xlab="Depth, micron",
+ ylab="Error, Mohms",cex.main=1.5,cex.axis=1.5,pch=19,cex.lab=1.5)
> abline(0,0)

Linear Regression Statistics
> model = lm(Resistance ~ Depth)
> summary(model)
Residuals:
     Min       1Q   Median       3Q      Max
-2.11330 -0.40679  0.05759  0.51211  1.57310
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.05863    0.76366  -0.077     0.94
Depth        2.13358    0.12308  17.336 1.25e-07 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 1.118 on 8 degrees of freedom
Multiple R-Squared: 0.9741, Adjusted R-squared: 0.9708
F-statistic: 300.5 on 1 and 8 DF,  p-value: 1.249e-07
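Confidence intervals for the fitted coefficients can be obtained directly with confint (standard R; shown here as a supplement to the slide):

> # 95% CIs for intercept and slope, t distribution with 8 df;
> # roughly (-1.82, 1.70) for the intercept and (1.85, 2.42) for Depth
> confint(model, level=0.95)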

Validating Residuals
The errors are only marginally normally distributed, because of the "tails" at either end of the quantile-quantile plot.

> qqnorm(error, pch=19,cex.lab=1.5,cex.axis=1.5,cex.main=1.5)
> qqline(error)

Comparing Two Sets of Data
Example: Consider two different wireless access points. Which one is faster?
- Inputs: the same set of 10 messages communicated through both access points
- Response (usecs): Latency1, Latency2 (below)
- Approach: take the difference of the paired data and determine the CI of the difference. If the CI straddles zero, we cannot tell which access point is faster.

> data=read.table("compare.dat", header=T)
> data
   Latency1 Latency2
1        22       19
2        23       20
3        19       24
4        18       20
5        15       14
6        20       18
7        26       21
8        17       17
9        19       17
10       17       18
> attach(data)
> diff=Latency1-Latency2
> mean(diff)-qt(0.975,9)*sd(diff)/sqrt(10)
[1] -1.273301
> mean(diff)+qt(0.975,9)*sd(diff)/sqrt(10)
[1] 2.873301

CI95% = (-1.27, 2.87) usecs. The confidence interval straddles zero, so we cannot determine which access point is faster with 95% confidence.
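The same comparison can be made in one call with a paired t-test; a cross-check on the difference-of-pairs approach above (t.test with paired=TRUE is standard R, not part of the original slide):

> # Paired t-test on the two latency columns; its 95% CI
> # reproduces the (-1.27, 2.87) interval computed above
> t.test(Latency1, Latency2, paired=TRUE)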

Plots with Error Bars
Execution time of the SuperLU linear system solution (Ax = b) on a parallel computer.
- For each p, the problem was run multiple times with the same matrix size but different values
- The mean and CI for each p give the curve and its error intervals

# Load Hmisc library
> library("Hmisc")
# Read data from file
> data <- read.table("demo.data", header=T)
# Display the data on screen
> data
      x      y delta
1   0.1   10.0   0.8
2   0.2   18.6   1.0
3   0.3   38.4   1.5
4   0.4   74.0   3.0
5   0.5  135.0   5.0
6   0.6  227.1  10.0
7   0.7  356.0  20.0
8   0.8  522.0  50.0
9   0.9  751.4  60.0
10  1.0 1010.5  80.0
> attach(data)
# Plot dashed line curve on screen with error bars
> errbar(x, y, y-delta, y+delta, xlab="Number of Processors, p",
+ ylab="Execution Time, msecs")
> lines(x, y, type="l", lty=2)

How to Determine O(n)
> model = lm(t ~ poly(p,4))
> summary(model)
Call:
lm(formula = t ~ poly(p, 4))
Residuals:
      1       2       3       4       5       6       7       8       9
-0.4072  0.7790  0.5840 -1.3090 -0.9755  0.8501  2.6749 -3.1528  0.9564
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 236.9444     0.7908 299.636 7.44e-10 ***
poly(p, 4)1 679.5924     2.3723 286.467 8.91e-10 ***
poly(p, 4)2 268.3677     2.3723 113.124 3.66e-08 ***
poly(p, 4)3  42.8772     2.3723  18.074 5.51e-05 ***
poly(p, 4)4   2.4249     2.3723   1.022    0.364
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 2.372 on 4 degrees of freedom
Multiple R-Squared: 1, Adjusted R-squared: 0.9999
F-statistic: 2.38e+04 on 4 and 4 DF,  p-value: 5.297e-09

The linear, quadratic, and cubic terms are highly significant, but the quartic term is not (p = 0.364); the execution time t therefore grows as O(p^3).
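One way to automate this reading: fit polynomials of increasing order and stop when the added term is no longer significant. A sketch under the assumption that the vectors p and t from the slide are loaded (anova on nested lm fits is standard R):

> # Does the quartic term improve on the cubic model?
> m3 <- lm(t ~ poly(p,3))
> m4 <- lm(t ~ poly(p,4))
> anova(m3, m4)   # F-test p-value 0.364: no, so t grows as O(p^3)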

R^2 – Coefficient of Determination
- The total variation around the mean is SST = ∑(y_i − mean(y))^2 = ∑y_i^2 − n·mean(y)^2 = SSY − SS0
- The variation around the model (the error) is SSE = ∑e_i^2
- The variation explained by the model is SSR = SST − SSE
- R^2 = SSR/SST = (SST − SSE)/SST
- R^2 measures how good the model is; the closer R^2 is to 1, the better
- Example: let SST = 1499 and SSE = 97. Then R^2 = (1499 − 97)/1499 = 93.5%
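These sums of squares are easy to compute from a fitted model; a minimal sketch using the epitaxial-layer regression from the earlier slide (model and Resistance as defined there):

> y <- Resistance
> SST <- sum((y - mean(y))^2)      # total variation around the mean
> SSE <- sum(residuals(model)^2)   # variation left unexplained
> (SST - SSE)/SST                  # R^2; matches Multiple R-Squared 0.9741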

Using the t-test
Consider the following data ("sleep.R"), from /usr/lib/R/library/base/data/:

   extra group
1    0.7     1
2   -1.6     1
3   -0.2     1
4   -1.2     1
5   -0.1     1
6    3.4     1
7    3.7     1
8    0.8     1
9    0.0     1
10   2.0     1
11   1.9     2
12   0.8     2
13   1.1     2
14   0.1     2
15  -0.1     2
16   4.4     2
17   5.5     2
18   1.6     2
19   4.6     2
20   3.4     2

Contents of file "sleep.R":
"sleep" <- structure(list(extra = c(0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7,
0.8, 0, 2, 1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4),
group = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2), .Label = c("1", "2"), class = "factor")),
.Names = c("extra", "group"), row.names = c("1", "2", "3", "4", "5",
"6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17",
"18", "19", "20"), class = "data.frame")

To read it into an R session:
> source("sleep.R")
> sleep
   extra group
1    0.7     1
2   -1.6     1
3   -0.2     1
4   -1.2     1
5   -0.1     1
6    3.4     1
7    3.7     1
8    0.8     1
9    0.0     1
<more data>

From "An Introduction to R", http://www.R-project.org

t.test Result
> t.test(extra ~ group, data = sleep)

        Welch Two Sample t-test

data:  extra by group
t = -1.8608, df = 17.776, p-value = 0.0794
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.3654832  0.2054832
sample estimates:
mean of x mean of y
     0.75      2.33

> data(sleep)
> plot(extra ~ group, data = sleep)
> ## The traditional interface gives the same result
> ## as the formula interface above:
> with(sleep, t.test(extra[group == 1], extra[group == 2]))

The p-value is the smallest significance level at which the null hypothesis can be rejected. Here p-value = 0.0794, so the difference can be declared nonzero only at confidence levels below about 92%.

2^k Factorial Design
Model (for k = 2): y = q0 + qA·xA + qB·xB + qAB·xA·xB
SST = total variation around the mean = ∑(y_i − mean(y))^2 = SSA + SSB + SSAB,
where SSA = 2^2·qA^2 (and similarly for SSB and SSAB)
Note: var(y) = SST/(n − 1)
Fraction of variation explained by A = SSA/SST
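For k = 2 the effects follow directly from a sign table; a minimal sketch with hypothetical responses y (the numbers are made up for illustration; the sign-table method follows Jain):

> # Hypothetical 2^2 example; rows in standard order:
> # (A-,B-), (A+,B-), (A-,B+), (A+,B+)
> y   <- c(15, 45, 25, 75)    # hypothetical responses
> xA  <- c(-1, 1, -1, 1)
> xB  <- c(-1, -1, 1, 1)
> q0  <- mean(y)              # 40
> qA  <- sum(xA*y)/4          # 20
> qB  <- sum(xB*y)/4          # 10
> qAB <- sum(xA*xB*y)/4       # 5
> SSA <- 4*qA^2; SSB <- 4*qB^2; SSAB <- 4*qAB^2
> 100*SSA/(SSA+SSB+SSAB)      # percent of variation explained by A: 76.2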

2^k Design: Cache Example
Are all factors needed? If a factor has little effect on the variability of the output, why study it further?
Method:
a. Evaluate the variation due to each factor using only two levels per factor
b. Interactions must be considered as well (an interaction is the effect of one factor depending on the levels of another)

Cache experiment (workload: address traces; response: cache misses).

Factor levels:
Factor               Levels
Line Length (L)      32, 512 words
No. Sections (K)     4, 16 sections
Control Method (C)   multiplexed, linear

Experiment design:
  L    K    C
  32    4   mux
 512    4   mux
  32   16   mux
 512   16   mux
  32    4   lin
 512    4   lin
  32   16   lin
 512   16   lin

Encoded experiment design:
  L   K   C
 -1  -1  -1
  1  -1  -1
 -1   1  -1
  1   1  -1
 -1  -1   1
  1  -1   1
 -1   1   1
  1   1   1

> data=read.table("2k.data", header=T)
> data
    L  K   C Misses
1  32  4 mux     14
2 512  4 mux     22
3  32 16 mux     10
4 512 16 mux     34
5  32  4 lin     46
6 512  4 lin     58
7  32 16 lin     50
8 512 16 lin     86
> attach(data)
> L=factor(L)
> K=factor(K)
> C=factor(C)
> analysis=aov(Misses~L*K*C)

2^k Design: Analyze Results (Sign Table)
Responses:
  L   K   C  Misses
 -1  -1  -1      14
  1  -1  -1      22
 -1   1  -1      10
  1   1  -1      34
 -1  -1   1      46
  1  -1   1      58
 -1   1   1      50
  1   1   1      86

Sign table:
  I   L   K   C  LK  LC  KC  LKC  Misses
  1  -1  -1  -1   1   1   1   -1      14
  1   1  -1  -1  -1  -1   1    1      22
  1  -1   1  -1  -1   1  -1    1      10
  1   1   1  -1   1  -1  -1   -1      34
  1  -1  -1   1   1  -1  -1    1      46
  1   1  -1   1  -1   1  -1   -1      58
  1  -1   1   1  -1  -1   1   -1      50
  1   1   1   1   1   1   1    1      86

Effects: q_i = (1/2^k)·∑(sign_i × Response_i), giving
q0 = 40, qL = 10, qK = 5, qC = 20, qLK = 5, qLC = 2, qKC = 3, qLKC = 1

SSL = 2^3·qL^2 = 800
SST = SSL+SSK+SSC+SSLK+SSLC+SSKC+SSLKC = 800+200+3200+200+32+72+8 = 4512
%variation(L) = SSL/SST = 800/4512 = 17.7%

> analysis
Call:
   aov(formula = Misses ~ L * K * C)

Terms:
                  L   K    C L:K L:C K:C L:K:C
Sum of Squares  800 200 3200 200  32  72     8
Deg. of Freedom   1   1    1   1   1   1     1

Estimated effects may be unbalanced
> summary(analysis)
      Df Sum Sq Mean Sq
L      1    800     800
K      1    200     200
C      1   3200    3200
L:K    1    200     200
L:C    1     32      32
K:C    1     72      72
L:K:C  1      8       8
> SSx=c(800,200,3200,200,32,72,8)
> SST=sum(SSx)
> Percent.Variation=100*SSx/SST
> Percent.Variation
[1] 17.7304965  4.4326241 70.9219858  4.4326241  0.7092199  1.5957447  0.1773050
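The manual sign-table arithmetic above can be reproduced in R directly; a sketch using the Misses values from this slide:

> # Build the sign table in standard order and compute effects
> Misses <- c(14, 22, 10, 34, 46, 58, 50, 86)
> L <- rep(c(-1,1), 4)
> K <- rep(c(-1,-1,1,1), 2)
> C <- rep(c(-1,1), each=4)
> signs <- cbind(I=1, L, K, C, LK=L*K, LC=L*C, KC=K*C, LKC=L*K*C)
> q <- colSums(signs*Misses)/8   # 40 10 5 20 5 2 3 1, as above
> SS <- 8*q[-1]^2                # 800 200 3200 200 32 72 8
> 100*SS/sum(SS)                 # percent variation per effect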

Full Factorial Design
Model: y_ij = m + a_j + b_i + e_ij
Effects are computed such that ∑a_j = 0 and ∑b_i = 0:
  m   = mean(y..)
  a_j = mean(y.j) − m
  b_i = mean(yi.) − m
Experimental errors: SSE = ∑e_ij^2
Allocation of variation (a levels of factor A, b levels of factor B):
  SS0 = a·b·m^2
  SSA = b·∑a_j^2
  SSB = a·∑b_i^2
  SSY = ∑y_ij^2 = SS0 + SSA + SSB + SSE
so the total variation around the mean is SST = SSY − SS0 = SSA + SSB + SSE
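A sketch of these computations on a small response matrix (the numbers are hypothetical and happen to be exactly additive, so the residuals come out zero):

> # Hypothetical 2x3 responses: rows = levels of B, columns = levels of A
> y <- matrix(c(10, 14, 12,
+               16, 20, 18), nrow=2, byrow=TRUE)
> m  <- mean(y)                    # grand mean
> aj <- colMeans(y) - m            # column (A) effects, sum to 0
> bi <- rowMeans(y) - m            # row (B) effects, sum to 0
> e  <- y - m - outer(bi, aj, "+") # residuals e_ij
> SSE <- sum(e^2); SS0 <- length(y)*m^2
> SSA <- nrow(y)*sum(aj^2); SSB <- ncol(y)*sum(bi^2)
> sum(y^2) - (SS0 + SSA + SSB + SSE)   # ~0: SSY decomposes as claimed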

Full-Factorial Design Example
Determination of the speed of light (the Morley experiments).
- Factors: Experiment No. (Expt) and Run No. (Run)
- Levels: Expt – 5 experiments; Run – 20 repeated runs per experiment

Data file "morley.tab" (from /usr/lib/R/library/base/data):
    Expt Run Speed
001    1   1   850
002    1   2   740
003    1   3   900
004    1   4  1070
<more data>
019    1  19   960
020    1  20   960
021    2   1   960
022    2   2   940
023    2   3   960
<more data>
096    5  16   940
097    5  17   950
098    5  18   800
099    5  19   810
100    5  20   870

> mm <- read.table("morley.tab")
> mm
    Expt Run Speed
1      1   1   850
2      1   2   740
3      1   3   900
4      1   4  1070
5      1   5   930
<95 more lines>
> attach(mm)
> Expt <- factor(Expt)   # Experiment is a factor with levels 1, 2, ..., 5
> Run <- factor(Run)     # Run is a factor with levels 1, 2, ..., 20
> # Plot a boxplot of each factor
> plot(Expt, Speed, main="Speed of Light Data (units)", xlab="Experiment No.")
> fm <- aov(Speed~Run+Expt, data=mm)   # Determine ANOVA
> summary(fm)                          # Display ANOVA of factors
            Df Sum Sq Mean Sq F value   Pr(>F)
Run         19 113344    5965  1.1053 0.363209
Expt         4  94514   23629  4.3781 0.003071 **
Residuals   76 410166    5397
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Box Plots of Factors
[Boxplots of Speed grouped by factor, as produced by the plot() call on the previous slide.]

Two-Factor Full Factorial
> fm <- aov(Speed~Run+Expt, data=mm)   # Determine ANOVA
> summary(fm)                          # Display ANOVA of factors
            Df Sum Sq Mean Sq F value   Pr(>F)
Run         19 113344    5965  1.1053 0.363209
Expt         4  94514   23629  4.3781 0.003071 **
Residuals   76 410166    5397
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Conclusion: the variation across runs is acceptably small (p = 0.36), but the variation across experiments is significant (p = 0.003).