EIPB 698D Lecture 5. Raul Cruz-Cano, Spring 2013. Midterm Comments: PROC MEANS vs. PROC SURVEYMEANS. For non-parametric tests: Kruskal-Wallis.

Proc Reg

The REG procedure is one of many regression procedures in the SAS System.

   PROC REG <options>;
      MODEL dependents = <regressors> </ options>;
      BY variables;
      OUTPUT <OUT=SAS-data-set> keyword=names;

   /* Read the raw blood data */
   data blood;
      infile 'F:\blood.txt';
      input subjectID $ gender $ bloodtype $ age_group $ RBC WBC cholesterol;
   run;

   /* Create 0/1 dummy variables for the categorical predictors */
   data blood1;
      set blood;
      if gender='Female' then sex=1; else sex=0;
      if bloodtype='A' then typeA=1; else typeA=0;
      if bloodtype='B' then typeB=1; else typeB=0;
      if bloodtype='AB' then typeAB=1; else typeAB=0;
      if age_group='Old' then Age_old=1; else Age_old=0;
   run;

   /* Regress cholesterol on the dummies and the two blood counts */
   proc reg data=blood1;
      model cholesterol = sex typeA typeB typeAB Age_old RBC WBC;
   run;

Proc reg output

                        Analysis of Variance
                             Sum of         Mean
   Source             DF    Squares       Square   F Value   Pr > F
   Model               7      41237   5891.02895      2.54   0.0140
   Error             655    1521839   2323.41811
   Corrected Total   662    1563076

DF - These are the degrees of freedom associated with the sources of variance.
(1) The total variance has N-1 degrees of freedom (663-1=662).
(2) The model degrees of freedom equal the number of estimated parameters minus 1 (P-1). Counting the intercept, there are 8 parameters (the intercept plus 7 predictors), so the model has 8-1=7 degrees of freedom.
(3) The Error (residual) degrees of freedom is the DF total minus the DF model: 662-7=655.

Proc reg output

Sum of Squares - These are the sums of squares associated with the three sources of variance in the Analysis of Variance table: total, model, and residual (error).
– SSTotal: the total variability around the mean, $\sum (Y - \bar{Y})^2$.
– SSResidual: the sum of squared errors in prediction, $\sum (Y - \hat{Y})^2$.
– SSModel: the improvement in prediction from using the predicted value of Y instead of the mean of Y, that is, the squared differences between the predicted value of Y and the mean of Y, $\sum (\hat{Y} - \bar{Y})^2$.

Note that SSTotal = SSModel + SSResidual. SSModel / SSTotal is equal to the value of R-Square, the proportion of the variance explained by the independent variables.

Proc reg output

Mean Square - These are the Mean Squares in the Analysis of Variance table: each Sum of Squares divided by its respective DF. They are needed to compute the F ratio, the Mean Square Model divided by the Mean Square Residual, which tests the significance of the predictors in the model.

Proc reg output

F Value and Pr > F - The F value is the Mean Square Model divided by the Mean Square Residual. The F value and its p-value answer the question "Do the independent variables predict the dependent variable?". The p-value is compared to your alpha level (typically 0.05) and, if it is smaller, you can conclude "Yes, the independent variables reliably predict the dependent variable". Note that this is an overall significance test: it assesses whether the group of independent variables, used together, reliably predicts the dependent variable. It does not address the ability of any particular independent variable to predict the dependent variable.
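The relationships among the columns of the Analysis of Variance table can be checked by hand. The following is a minimal sketch, not part of the original slides, that recomputes the mean squares, the F value, and its p-value from the printed sums of squares and degrees of freedom. The variable names are illustrative; because the printed sums of squares are rounded, the results should agree with the table only up to rounding.

```sas
/* Recompute Mean Square, F Value, and Pr > F from the ANOVA table. */
data _null_;
   ss_model = 41237;    df_model = 7;    /* copied from the Model row */
   ss_error = 1521839;  df_error = 655;  /* copied from the Error row */

   ms_model = ss_model / df_model;       /* Mean Square = SS / DF     */
   ms_error = ss_error / df_error;
   f_value  = ms_model / ms_error;       /* F = MS Model / MS Error   */

   /* Upper-tail probability of an F(df_model, df_error) distribution */
   p_value  = 1 - probf(f_value, df_model, df_error);

   put ms_model= ms_error= f_value= p_value=;
run;
```

The values are written to the SAS log, where they can be compared with the Mean Square, F Value, and Pr > F columns.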

Proc reg output

   Root MSE          48.20185    R-Square   0.0264
   Dependent Mean   201.69683    Adj R-Sq   0.0160
   Coeff Var         23.89817

Root MSE - Root MSE is the standard deviation of the error term, and is the square root of the Mean Square Residual (or Error).

Proc reg output

Dependent Mean - This is the mean of the dependent variable.

Coeff Var - This is the coefficient of variation, a unit-less measure of variation in the data. It is the Root MSE divided by the mean of the dependent variable, multiplied by 100: 100*(48.2/201.69) = 23.90.

R-Square - This is how much of the variability in the dependent variable is explained by the model.
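The summary statistics above all follow from numbers already printed in the output. As a hedged illustration (the variable names are made up for this sketch, and the adjusted R-Square line uses the usual textbook formula with n observations and p predictors), they can be recomputed as follows:

```sas
/* Recompute Root MSE, Coeff Var, R-Square, and Adj R-Sq by hand. */
data _null_;
   ms_error = 2323.41811;   /* Mean Square Error from the ANOVA table */
   dep_mean = 201.69683;    /* Dependent Mean                         */
   ss_model = 41237;        /* Sum of Squares, Model                  */
   ss_total = 1563076;      /* Sum of Squares, Corrected Total        */
   n = 663;                 /* observations (total DF + 1)            */
   p = 7;                   /* predictors, excluding the intercept    */

   root_mse  = sqrt(ms_error);
   coeff_var = 100 * root_mse / dep_mean;
   r_square  = ss_model / ss_total;
   adj_rsq   = 1 - (1 - r_square) * (n - 1) / (n - p - 1);

   put root_mse= coeff_var= r_square= adj_rsq=;
run;
```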

Proc reg output

                        Parameter Estimates
                     Parameter     Standard
   Variable    DF     Estimate        Error   t Value   Pr > |t|
   Intercept    1    187.91927     17.45409     10.77     <.0001
   sex          1      1.48640      3.79640      0.39     0.6955
   typeA        1      0.74839      4.01841      0.19     0.8523
   typeB        1     10.14482      6.97339      1.45     0.1462
   typeAB       1    -19.90314     10.45833     -1.90     0.0575
   Age_old      1    -11.61798      3.85823     -3.01     0.0027
   RBC          1      0.00264      0.00191      1.38     0.1676
   WBC          1      0.20512      1.88816      0.11     0.9135

t Value and Pr > |t| - These columns provide the t value and two-tailed p-value used in testing the null hypothesis that the coefficient/parameter is 0.
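Each t value is the parameter estimate divided by its standard error, and the two-tailed p-value comes from a t distribution with the error degrees of freedom. A small sketch for the Age_old row, assuming 655 error degrees of freedom as in the Analysis of Variance table (variable names are illustrative):

```sas
/* Reproduce the t Value and Pr > |t| for the Age_old coefficient. */
data _null_;
   estimate = -11.61798;   /* Parameter Estimate for Age_old */
   std_err  = 3.85823;     /* Standard Error for Age_old     */
   df_error = 655;         /* Error DF from the ANOVA table  */

   t_value = estimate / std_err;
   /* Two-tailed p-value: twice the upper-tail area beyond |t| */
   p_value = 2 * (1 - probt(abs(t_value), df_error));

   put t_value= p_value=;
run;
```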

Another (better?) approach for weighted data

Experimental design data have all the properties that we learned about in statistics classes:
– The observations are independent.
– The observations are identically distributed, with some known error distribution.
– There is an underlying assumption that the data come to us as a finite number of observations from a conceptually infinite population.
– Simple random sampling without replacement for the sample data.

Sample survey data:
– Come from a finite target population.
– Do not have independent errors.
– Do not come from a conceptually infinite population.
– May cover many small sub-populations, so we do not expect the errors to be identically distributed.

Household Component of the Medical Expenditure Panel Survey (MEPS HC)

The MEPS HC is a nationally representative survey of the U.S. civilian noninstitutionalized population. It collects medical expenditure data together with information on demographic characteristics, access to health care, health insurance coverage, income, and employment. MEPS is cosponsored by the Agency for Healthcare Research and Quality (AHRQ) and the National Center for Health Statistics (NCHS). For the comparisons reported here we used the MEPS 2005 Full Year Consolidated Data File (HC-097). This is a public use file available for download from the MEPS web site (http://www.meps.ahrq.gov).

Transforming from SAS transport (SSP) format to SAS Dataset (SAS7BDAT)

The MEPS is not a simple random sample. Its design includes:
– Stratification
– Clustering
– Multiple stages of selection
– Disproportionate sampling

The MEPS public use files (such as HC-097) include variables for generating weighted national estimates and for use of the Taylor method for variance estimation. These variables are:
– person-level weight (PERWT05F on HC-097)
– stratum (VARSTR on HC-097)
– cluster/PSU (VARPSU on HC-097)

The stratum and cluster variables are needed for even better estimates of the CI.

   LIBNAME PUFLIB 'C:\';
   FILENAME IN1 'C:\H97.SSP';
   PROC XCOPY IN=IN1 OUT=PUFLIB IMPORT;
   RUN;

H97.SAS7BDAT occupies 408MB vs. 257MB for H97.SSP vs. 14MB for H97.ZIP.

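PROC SURVEYREG Simple Example

Predict total income based on sex:

   /* mylib is the libref for the folder that holds the SAS7BDAT file */
   PROC SURVEYREG DATA=mylib.H97;
      strata VARSTR;           /* stratum variable               */
      cluster VARPSU;          /* cluster/PSU variable           */
      model TTLP05X = SEX;     /* total income regressed on sex  */
      weight PERWT05F;         /* person-level weight            */
   RUN;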
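The same contrast between a weighted procedure and a survey procedure applies to PROC MEANS vs. PROC SURVEYMEANS. The sketch below is not from the original slides; it assumes the same `mylib.H97` data set and the MEPS design variables named above, and estimates mean total income both ways. Both procedures use the person-level weight, so the point estimates should agree, but only PROC SURVEYMEANS uses the strata and clusters when computing the standard error and confidence limits.

```sas
/* Weighted mean that ignores the sample design */
proc means data=mylib.H97 mean stderr clm;
   var TTLP05X;
   weight PERWT05F;
run;

/* Design-based mean: same weight, plus strata and clusters */
proc surveymeans data=mylib.H97 mean stderr clm;
   strata VARSTR;
   cluster VARPSU;
   weight PERWT05F;
   var TTLP05X;
run;
```

Comparing the two sets of standard errors and confidence limits shows how much the design information changes the stated precision.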

Logistic regression

For binary response models, the response, Y, of an individual or an experimental unit can take on one of two possible values, denoted for convenience by 1 and 0 (for example, Y=1 if a disease is present, otherwise Y=0). Suppose x is a vector of explanatory variables and $\pi = P(Y=1 \mid x)$ is the response probability to be modeled. The logistic regression model has the form

$\operatorname{logit}(\pi) = \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x$
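To make the logit transformation concrete, here is a small sketch (the data set name `logit_demo` and the probabilities are arbitrary choices for illustration) that converts a few probabilities to log odds and back again with the inverse transformation $\pi = 1 / (1 + e^{-\text{logit}})$:

```sas
/* Logit and inverse logit for a few example probabilities */
data logit_demo;
   do p = 0.1, 0.5, 0.9;
      odds   = p / (1 - p);             /* odds of the event         */
      logit  = log(odds);               /* log odds (natural log)    */
      p_back = 1 / (1 + exp(-logit));   /* inverse logit recovers p  */
      output;
   end;
run;

proc print data=logit_demo;
run;
```

A probability of 0.5 corresponds to odds of 1 and a logit of 0; probabilities below 0.5 give negative logits and probabilities above 0.5 give positive ones.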

Proc logistic

The following statements are available in PROC LOGISTIC:

   PROC LOGISTIC <options>;
      BY variables;
      CLASS variable <(options)> </ options>;
      MODEL response = <effects> </ options>;
      MODEL events/trials = <effects> </ options>;
      OUTPUT <OUT=SAS-data-set> <keyword=name> </ options>;

The PROC LOGISTIC and MODEL statements are required; only one MODEL statement can be specified. The CLASS statement (if used) must precede the MODEL statement.

High school data

The data were collected on 200 high school students, with measurements on various tests, including science, math, reading, and social studies. The response variable is high writing test score (high_write), where a writing score greater than or equal to 60 is considered high and a score below 60 is considered low. We explore its relationship with gender (female), reading test score (read), and science test score (science).

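High school data

   /* Create the binary response: 1 if write >= 60, else 0 */
   data new;
      set d.hsb2;
      if write>=60 then high_write=1; else high_write=0;
      keep ID female math read science write high_write;
   run;

   /* DESCENDING makes SAS model the probability that high_write=1 */
   proc logistic data=new descending;
      model high_write = female read science;
   run;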

Logistic output

                Model Information
   Data Set                      WORK.NEW
   Response Variable             high_write
   Number of Response Levels     2
   Model                         binary logit
   Optimization Technique        Fisher's scoring

   Number of Observations Read   200
   Number of Observations Used   200

Data Set - This is the data set used in this procedure.

Model - This is the type of regression model that was fit to our data. The terms logit and logistic are interchangeable.

Logistic output

            Response Profile
   Ordered    high_         Total
     Value    write     Frequency
         1    1                53
         2    0               147

   Probability modeled is high_write=1.

Ordered Value - This refers to how SAS orders the levels of the dependent variable. Because we specified the descending option, SAS treats the levels in descending order (high to low), so that when the regression coefficients are estimated, a positive coefficient corresponds to a positive relationship with high write status. By default SAS models the lower level.

Probability modeled is high_write=1 - This is a note telling us which level of the response variable we are modeling.

Logistic output

   Model Convergence Status
   Convergence criterion (GCONV=1E-8) satisfied.

            Model Fit Statistics
                                 Intercept
                  Intercept            and
   Criterion           Only     Covariates
   AIC              233.289        168.236
   SC               236.587        181.430
   -2 Log L         231.289        160.236

Model Convergence Status - This describes whether the maximum-likelihood algorithm has converged and which convergence criterion is used to assess convergence.

Intercept Only - The model with no predictors, just the intercept term.

Intercept and Covariates - The fitted model.

AIC, SC, -2 Log L - These are various measures used to assess the model fit. Smaller values indicate a better fit.

Logistic output

   Testing Global Null Hypothesis: BETA=0
   Test                Chi-Square    DF    Pr > ChiSq
   Likelihood Ratio       71.0525     3        <.0001
   Score                  58.6092     3        <.0001
   Wald                   39.8751     3        <.0001

These are three asymptotically equivalent Chi-Square tests. They test the null hypothesis that all of the predictors' regression coefficients are equal to zero in the model. With p < 0.0001, we reject H0 and conclude that at least one of the predictors' regression coefficients is not equal to zero.

Logistic output

          Analysis of Maximum Likelihood Estimates
                              Standard        Wald
   Parameter   DF   Estimate     Error  Chi-Square   Pr > ChiSq
   Intercept    1   -12.7772    1.9759     41.8176       <.0001
   female       1     1.4825    0.4474     10.9799       0.0009
   read         1     0.1035    0.0258     16.1467       <.0001
   science      1     0.0948    0.0305      9.6883       0.0019

Here are the parameter estimates along with their p-values. Based on the estimates, our model is

log[ p / (1-p) ] = -12.78 + 1.48*female + 0.10*read + 0.09*science.
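The fitted equation can be turned into a predicted probability by applying the inverse logit to the linear predictor. The sketch below does this for a hypothetical student, a female with a reading score of 60 and a science score of 60; these input values are made up for illustration and do not come from the slides.

```sas
/* Predicted probability of a high writing score for one hypothetical student */
data _null_;
   female  = 1;     /* hypothetical: female student */
   read    = 60;    /* hypothetical reading score   */
   science = 60;    /* hypothetical science score   */

   /* Linear predictor (log odds) from the estimated coefficients */
   log_odds = -12.7772 + 1.4825*female + 0.1035*read + 0.0948*science;

   /* Inverse logit converts log odds to a probability */
   p_high = 1 / (1 + exp(-log_odds));

   put log_odds= p_high=;
run;
```

For these inputs the log odds are slightly above zero, so the predicted probability of a high writing score is somewhat above one half.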

Logistic output

          Odds Ratio Estimates
                Point          95% Wald
   Effect    Estimate    Confidence Limits
   female       4.404     1.832     10.584
   read         1.109     1.054      1.167
   science      1.099     1.036      1.167

The odds ratio is obtained by exponentiating the Estimate, exp[Estimate]. We can interpret the odds ratio as follows: for a one-unit increase in the predictor variable, the odds of a positive outcome are multiplied by the odds ratio, given that the other variables in the model are held constant. For example, each additional point on the reading test multiplies the odds of a high writing score by 1.109.

Logistic output

If the 95% confidence interval for an odds ratio does not cover 1, the effect is statistically significant at the 0.05 level. In the Odds Ratio Estimates table, none of the three intervals covers 1.
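The odds ratio and its Wald confidence limits can be recovered from the estimate and standard error in the Analysis of Maximum Likelihood Estimates table. A short sketch for the female effect, assuming the usual Wald construction exp(estimate ± z × standard error) with z the 97.5th percentile of the standard normal distribution (variable names are illustrative):

```sas
/* Odds ratio and 95% Wald confidence limits for the female effect */
data _null_;
   estimate = 1.4825;       /* coefficient for female    */
   std_err  = 0.4474;       /* its standard error        */
   z        = probit(0.975);  /* standard normal quantile, about 1.96 */

   odds_ratio = exp(estimate);
   lower      = exp(estimate - z * std_err);
   upper      = exp(estimate + z * std_err);

   put odds_ratio= lower= upper=;
run;
```

The results should match the female row of the Odds Ratio Estimates table (4.404, with limits 1.832 and 10.584) up to rounding.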

Weighted Example

Just as with linear regression, logistic regression allows you to look at the effect of multiple predictors on an outcome. Consider the following example: 15- and 16-year-old adolescents were asked if they have ever had sexual intercourse.
– The outcome of interest is intercourse.
– The predictors are race (white and black) and gender (male and female).

Example from Agresti, A. Categorical Data Analysis, 2nd ed. 2002.

Here is a table of the data:

                      Intercourse
   Race     Gender     Yes     No
   White    Male        43    134
   White    Female      26    149
   Black    Male        29     23
   Black    Female      22     36

Raul Cruz-Cano, HLTH653 Spring 2013

Data Set Intercourse

   /* One row per cell of the table; count is the number of adolescents */
   DATA intercourse;
      INPUT white male intercourse count;
      DATALINES;
   1 1 1 43
   1 1 0 134
   1 0 1 26
   1 0 0 149
   0 1 1 29
   0 1 0 23
   0 0 1 22
   0 0 0 36
   ;
   RUN;

SAS:

– "descending" models the probability that intercourse = 1 (yes) rather than = 0 (no).
– "rsquare" requests the $R^2$ value from SAS; it is interpreted the same way as the $R^2$ from linear regression.
– "lackfit" requests the Hosmer and Lemeshow Goodness-of-Fit Test. This tells you if the model you have created is a good fit for the data.

   PROC LOGISTIC DATA = intercourse descending;
      weight count;
      MODEL intercourse = white male / rsquare lackfit;
   RUN;

SAS Output: $R^2$

Interpreting the $R^2$ value

The $R^2$ value is 0.9907. This means that 99.07% of the variability in our outcome (intercourse) is explained by including gender and race in our model.
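One hedged side check, not part of the original slides: the RSQUARE option in PROC LOGISTIC reports a generalized $R^2$ that is computed from the model likelihoods and the number of observations used. Assuming that definition, the WEIGHT statement leaves the eight table rows as eight observations, whereas a FREQ statement would treat each row as `count` separate adolescents. Refitting the same model with FREQ is a simple way to see how sensitive the reported $R^2$ is to that choice.

```sas
/* Same model as above, but each row is counted as `count` observations */
PROC LOGISTIC DATA = intercourse descending;
   freq count;
   MODEL intercourse = white male / rsquare lackfit;
RUN;
```

The parameter estimates should be the same under either statement; the number of observations used, and any statistic that depends on it, may differ.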

PROC LOGISTIC Output

The odds of having had intercourse are 1.911 times as high for males as for females, adjusting for race.

Hosmer and Lemeshow GOF Test

H-L GOF Test

The Hosmer and Lemeshow Goodness-of-Fit Test tests the hypotheses:

   H0: the model is a good fit, vs.
   Ha: the model is NOT a good fit

With this test, we want to FAIL to reject the null hypothesis, because that means our model is a good fit (this is different from most of the hypothesis testing you have seen). Look for a p-value > 0.10 in the H-L GOF test. This indicates the model is a good fit. In this case, the p-value = 0.2419, so we do NOT reject the null hypothesis, and we conclude the model is a good fit.

Model Selection in SAS

Can be applied to both Linear and Logistic Models.

Often, if you have multiple predictors and interactions in your model, SAS can systematically select significant predictors using forward selection, backwards selection, or stepwise selection.

In forward selection, SAS starts with no predictors in the model. It then selects the predictor with the smallest p-value and adds it to the model. It then selects another predictor from the remaining variables with the smallest p-value and adds it to the model. It continues doing this until no more predictors have p-values less than 0.05.

In backwards selection, SAS starts with all of the predictors in the model and eliminates the non-significant predictors one at a time, refitting the model between each elimination. It stops once all the predictors remaining in the model are statistically significant.

Forward Selection in SAS

We will let SAS select a model for us out of the three predictors: white, male, white*male. Type the following code into SAS:

   PROC LOGISTIC DATA = intercourse descending;
      weight count;
      MODEL intercourse = white male white*male / selection = forward lackfit;
   RUN;
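For comparison, the backwards selection procedure described earlier can be requested with the same candidate predictors. This is a sketch rather than code from the slides; it sets SLSTAY=0.05 explicitly so that the significance level required to stay in the model is visible in the code.

```sas
/* Backward elimination: start with all three effects, drop non-significant ones */
PROC LOGISTIC DATA = intercourse descending;
   weight count;
   MODEL intercourse = white male white*male
         / selection = backward slstay = 0.05 lackfit;
RUN;
```

Forward and backward selection do not always end at the same model, so it is worth comparing the two final models.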

Output from Forward Selection: “white” is added to the model

“male” is added to the model

No more predictors are found to be statistically significant

The Final Model:

Hosmer and Lemeshow GOF Test: The model is a good fit

SAS Weighted vs. Survey Procedures

A random sample of 300 students is drawn from each of the four classes: freshman, sophomore, junior, and senior.

   proc format;
      value Design 1='A' 2='B' 3='C';
      value Rating 1='dislike very much' 2='dislike' 3='neutral'
                   4='like' 5='like very much';
      value Class 1='Freshman' 2='Sophomore' 3='Junior' 4='Senior';
   run;

   /* Population size of each class */
   data Enrollment;
      format Class Class.;
      input Class _TOTAL_;
      datalines;
   1 3734
   2 3565
   3 3903
   4 4196
   ;
   run;

   /* One count per Class x Design x Rating cell; each DO loop needs its own END */
   data WebSurvey;
      format Class Class. Design Design. Rating Rating.;
      do Class=1 to 4;
         do Design=1 to 3;
            do Rating=1 to 5;
               input Count @@;
               output;
            end;
         end;
      end;
      datalines;
   10 34 35 16 15  8 21 23 26 22  5 10 24 30 21
    1 14 25 23 37 11 14 20 34 21 16 19 30 23 12
   19 12 26 18 25 11 14 24 33 18 10 18 32 23 17
    8 15 35 30 12 15 22 34  9 20  2 34 30 18 16
   ;
   run;

   /* Sampling weight = class enrollment / 300 sampled students */
   data WebSurvey;
      set WebSurvey;
      if Class=1 then Weight=3734/300;
      if Class=2 then Weight=3565/300;
      if Class=3 then Weight=3903/300;
      if Class=4 then Weight=4196/300;
   run;
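The weights above are the class enrollments divided by the 300 students sampled per class. As an alternative sketch (the output data set name `WebSurveyW` is hypothetical), the same weights can be computed by merging the survey counts with the Enrollment data set instead of typing the totals twice. Both data sets are already in Class order, which the BY statement requires.

```sas
/* Compute the sampling weight from the enrollment totals */
data WebSurveyW;
   merge WebSurvey Enrollment;
   by Class;
   Weight = _TOTAL_ / 300;   /* population size / sample size per class */
run;
```

This keeps the enrollment figures in one place, so a change to the Enrollment data set carries through to the weights.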

PROC Logistic

   proc logistic data=WebSurvey;
      freq Count;
      class Design;
      model Rating (ref='neutral') = Design;
      weight Weight;
   run;

PROC surveylogistic

If you want "better" results:

   proc surveylogistic data=WebSurvey total=Enrollment;
      freq Count;
      class Design;
      model Rating (ref='neutral') = Design;
      stratum Class;
      weight Weight;
   run;

For the Ratings for Design B vs. Design C, compare:
1. The point estimate
2. The 95% confidence interval
