1
Logistic Regression for binary outcomes
2
In Linear Regression, Y is continuous
In Logistic Regression, Y is binary (0, 1) and the average of Y is P. We can't use linear regression because:
- Y can't be linearly related to the Xs
- Y does NOT have a Gaussian (normal) distribution around its "mean" P
We need a "linearizing" transformation and a non-Gaussian error model.
3
Since 0 ≤ P ≤ 1, we might use odds = P/(1 − P). Odds have no "ceiling" but have a "floor" of zero. So we use the logit transformation: ln(P/(1 − P)) = ln(odds) = logit(P). The logit has neither a floor nor a ceiling.
Model: logit = ln(P/(1 − P)) = β0 + β1X1 + β2X2 + … + βkXk
or odds = e^(β0 + β1X1 + β2X2 + … + βkXk) = e^logit
4
Since P = odds/(1 + odds) and odds = e^logit,
P = e^logit/(1 + e^logit) = 1/(1 + e^−logit)
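A minimal NumPy sketch of these two transformations (the function names here are just for illustration):

```python
import numpy as np

def logit(p):
    """ln(P / (1 - P)): maps a probability in (0, 1) onto the whole real line."""
    return np.log(p / (1 - p))

def inv_logit(x):
    """P = e^logit / (1 + e^logit) = 1 / (1 + e^(-logit))."""
    return 1 / (1 + np.exp(-x))

p = np.array([0.1, 0.5, 0.9])
print(logit(p))             # [-2.197  0.     2.197]
print(inv_logit(logit(p)))  # recovers [0.1 0.5 0.9]
```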
5
If ln(odds)= β0+ β1X1 + β2X2+…+βkXk
then odds = (e^β0)(e^β1X1)(e^β2X2)…(e^βkXk), or odds = (base odds) × OR1 × OR2 × … × ORk.
The model is multiplicative on the odds scale.
(The base odds are the odds when all Xs = 0; ORi = odds ratio for the ith X.)
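A small numeric check of the multiplicative-odds form, using made-up coefficients and covariate values:

```python
import numpy as np

# Hypothetical coefficients and covariate values, for illustration only.
b0, b1, b2 = -1.0, 0.4, 0.7
x1, x2 = 2.0, 1.0

base_odds = np.exp(b0)                        # odds when all Xs = 0
factor1, factor2 = np.exp(b1 * x1), np.exp(b2 * x2)

# Additive on the logit scale == multiplicative on the odds scale:
print(np.exp(b0 + b1 * x1 + b2 * x2))         # 1.6487...
print(base_odds * factor1 * factor2)          # same value
```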
6
Interpreting β coefficients
Example: Dichotomous X. X = 0 for males, X = 1 for females.
logit(P) = β0 + β1X
M: X = 0, logit(Pm) = β0
F: X = 1, logit(Pf) = β0 + β1
logit(Pf) − logit(Pm) = β1, so log(OR) = β1 and e^β1 = OR
7
Example: P is proportion with disease
logit(P) = β0 + β1 age + β2 sex, where "sex" is coded 0 for M, 1 for F.
The OR for F vs M for disease is e^β2 if both are the same age.
e^β1 is the multiplicative increase in the odds of disease for a one-year increase in age.
(e^β1)^k = e^kβ1 is the OR for a k-year difference in age between two groups with the same gender.
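A short sketch of these interpretations, with hypothetical coefficients (not the ones fitted in the examples that follow):

```python
import math

# Hypothetical fitted coefficients, for illustration only.
b_age, b_sex = 0.05, 0.35   # log odds per year of age; F (1) vs M (0)

or_f_vs_m   = math.exp(b_sex)        # OR for F vs M at the same age
or_per_year = math.exp(b_age)        # OR for a one-year increase in age
k = 10
or_per_k    = math.exp(k * b_age)    # OR for a k-year difference, same gender

print(or_f_vs_m, or_per_year, or_per_k)
print(or_per_year ** k)              # identical to or_per_k
```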
8
Example: P is the proportion with an MI. Predictors: age in years,
htn = hypertension (1 = yes, 0 = no), smoke = smoking (1 = yes, 0 = no)
Logit(P) = β0 + β1 age + β2 htn + β3 smoke
Q: We want the OR for a 40-year-old with hypertension vs an otherwise identical 30-year-old without hypertension.
A: β0 + 40β1 + β2 + β3 smoke − (β0 + 30β1 + β3 smoke) = 10β1 + β2 = log OR, so OR = e^(10β1 + β2).
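A quick sketch of that calculation, using hypothetical values for β1 and β2:

```python
import math

# Hypothetical coefficients for the MI model, for illustration only.
b_age, b_htn = 0.06, 0.8

# 40-year-old with hypertension vs otherwise identical 30-year-old without it:
# the intercept and smoking terms cancel, leaving log OR = 10*b_age + b_htn.
log_or = 10 * b_age + b_htn
print(math.exp(log_or))   # OR = e^(10*b_age + b_htn) ≈ 4.06
```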
9
Interactions P is proportion with CHD
S: 1 = smoking, 0 = non. D: 1 = drinking, 0 = non.
Logit(P) = β0 + β1S + β2D + β3SD
Referent category is S = 0, D = 0.

S D   odds                OR
0 0   e^β0                OR00 = 1 = e^β0 / e^β0
1 0   e^(β0+β1)           OR10 = e^β1
0 1   e^(β0+β2)           OR01 = e^β2
1 1   e^(β0+β1+β2+β3)     OR11 = e^(β1+β2+β3)

When will OR11 = OR10 × OR01? If and only if β3 = 0.
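A numeric illustration of the interaction point, again with hypothetical coefficients:

```python
import math

# Hypothetical coefficients, for illustration only.
b0, b1, b2, b3 = -2.0, 0.5, 0.7, 0.3   # intercept, smoking, drinking, interaction

or10 = math.exp(b1)              # smoker, non-drinker vs referent
or01 = math.exp(b2)              # drinker, non-smoker vs referent
or11 = math.exp(b1 + b2 + b3)    # smoker and drinker vs referent

print(or11, or10 * or01)         # equal only if b3 = 0
```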
10
Interpretation example
Potential predictors (13) of in-hospital infection mortality (yes or no). Crabtree, et al., JAMA, 8 Dec 1999, No 22:
- Gender (female or male)
- Age in years
- APACHE score (0-129)
- Diabetes (y/n)
- Renal insufficiency / hemodialysis (y/n)
- Intubation / mechanical ventilation (y/n)
- Malignancy (y/n)
- Steroid therapy (y/n)
- Transfusions (y/n)
- Organ transplant (y/n)
- WBC count
- Max temperature (degrees)
- Days from admission to treatment (> 7 days)
11
Factors Associated With Mortality for All Infections
Characteristic        Odds Ratio (95% CI)    p value
Incr APACHE score     ( )                    <.001
Transfusion (y/n)     ( )                    <.001
Increasing age        ( )                    <.001
Malignancy            ( )                    <.001
Max temperature       ( )                    <.001
Adm to treat > 7 d    ( )
Female (y/n)          ( )
*APACHE = Acute Physiology & Chronic Health Evaluation Score
12
Diabetes complications -Descriptive stats
Table of obese by diabetes complication:

              diabetes complication
obese         no (0)   yes (1)   Total   % yes
no (0)        56       28        84      28/84 = 33%
yes (1)       20       41        61      41/61 = 67%
% obese       26%      59%

RR = 2.0, OR = 4.1, p < 0.001

Fasting glucose ("fast glu"), mg/dl: n, min, median, mean, max
No complication
Complication
p =

Steady state glucose ("steady glu"), mg/dl: n, min, median, mean, max
No complication
Complication
p =
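As a check, the RR and OR quoted here can be reproduced directly from the 2×2 counts on this slide, for example in Python:

```python
# Arithmetic behind the 2x2 table above (counts taken from this slide).
no_obese_no_comp, no_obese_comp = 56, 28      # obese = no row
obese_no_comp,   obese_comp     = 20, 41      # obese = yes row

p_comp_obese    = obese_comp / (obese_comp + obese_no_comp)          # 41/61 = 0.67
p_comp_nonobese = no_obese_comp / (no_obese_comp + no_obese_no_comp) # 28/84 = 0.33

rr = p_comp_obese / p_comp_nonobese                                  # relative risk ≈ 2.0
odds_ratio = (obese_comp * no_obese_no_comp) / (obese_no_comp * no_obese_comp)  # ≈ 4.1

print(rr, odds_ratio)
```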
13
Diabetes complication
Parameter     DF   beta    SE(b)   Chi-Square   p
Intercept                                       <.0001
obese              0.328
Fast glu           0.108
Steady glu         0.023                        <.0001

Log odds of diabetes complication = β0 + 0.328 obese + 0.108 fast glu + 0.023 steady glu
14
Statistical sig of the βs
Linear regression: t = b/SE → p value
Logistic regression: χ² = (b/SE)² → p value
To get a CI for the OR, first form the (95%) CI for β on the log scale: (b − 1.96 SE, b + 1.96 SE).
Then take antilogs of each end: e^(b − 1.96 SE), e^(b + 1.96 SE).
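A small sketch of the Wald test and the CI-for-the-OR recipe, using a made-up estimate and standard error:

```python
import math

# Hypothetical coefficient and standard error, for illustration only.
b, se = 0.5, 0.2

wald_chi_sq = (b / se) ** 2                  # compare to chi-square with 1 df
lo, hi = b - 1.96 * se, b + 1.96 * se        # 95% CI for beta on the log scale
or_ci = (math.exp(lo), math.exp(hi))         # antilog each end -> CI for the OR

print(wald_chi_sq, math.exp(b), or_ci)
```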
15
Diabetes complications
Odds Ratio Estimates

Effect       Point Estimate     95% Wald Confidence Limits
obese        e^0.328 = 1.39
Fast glu     e^0.108 = 1.11
Steady glu   e^0.023 = 1.02
16
Model fit: Linear vs Logistic regression (k variables, n observations)

Variation   df      sum of squares or deviance
Model       k       G
Error       n−k     D
Total       n       T  (fixed)

Yi = ith observation, Ŷi = prediction for the ith observation

statistic            Linear regr     Logistic regr
D/(n−k)              Residual SDe    Mean deviance
Σ(Yi − Ŷi)²/Ŷi       --              Hosmer-Lemeshow χ²
Corr(Y, Ŷ)²          R²              Cox-Snell R²
G/T                  --              Pseudo R²
17
Good regression models have large G and small D
Good regression models have large G and small D. For logistic regression, D/(n-k), the mean deviance, should be near 1.0. There are two versions of the R2 for logistic regression.
18
Goodness of fit: Deviance

Deviance in logistic regression is like SS in linear regression.

            df     −2 log L   p value
Model (G)          117        < 0.001
Error (D)   141    83.46
Total (T)          201

Mean deviance = 83.46/141 = 0.59 (want mean deviance to be ≤ 1)
R²pseudo = G/total = 117/201 = 0.58, R²CS = 0.554
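The arithmetic behind these fit statistics, using the (rounded) G and D values quoted on this slide:

```python
# Arithmetic behind the fit statistics on this slide (values as quoted, rounded).
G, D = 117.0, 83.46        # model and error deviance
T = G + D                  # total deviance, about 201 (G/total = 117/201 on the slide)
n_minus_k = 141

mean_deviance = D / n_minus_k     # 0.59 -- want this to be <= 1
pseudo_r2 = G / T                 # about 0.58

print(mean_deviance, pseudo_r2)
```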
19
Goodness of fit: Hosmer-Lemeshow chi-square

Compare observed vs model-predicted (expected) frequencies by decile of predicted probability:

decile   total   obs y   exp y   obs no   exp no
…

chi-square = 9.89, df = 7, p = 0.19
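A sketch of the decile-based Hosmer-Lemeshow computation on toy data (this simplified version uses equal-sized groups and the plain Pearson form of the statistic):

```python
import numpy as np

def hosmer_lemeshow(y, p, n_groups=10):
    """Sort by predicted P, split into groups (deciles), and compare observed
    vs expected yes/no counts with a Pearson-type chi-square statistic."""
    order = np.argsort(p)
    y, p = np.asarray(y, float)[order], np.asarray(p, float)[order]
    chi_sq = 0.0
    for idx in np.array_split(np.arange(len(y)), n_groups):
        obs_yes, exp_yes = y[idx].sum(), p[idx].sum()
        obs_no, exp_no = len(idx) - obs_yes, len(idx) - exp_yes
        chi_sq += (obs_yes - exp_yes) ** 2 / exp_yes + (obs_no - exp_no) ** 2 / exp_no
    return chi_sq   # compare to chi-square with (n_groups - 2) df

# Toy data, for illustration only.
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, 200)
y = (rng.uniform(size=200) < p).astype(int)
print(hosmer_lemeshow(y, p))
```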
20
Goodness of fit vs R²
What is the interpretation when goodness of fit is acceptable but R² is poor?
Do we need to include interactions or transform the X variables in the model?
Do we need to obtain more X variables?
21
Sensitivity & Specificity
               True pos   True neg
Classify pos   a          b
Classify neg   c          d
total          a+c        b+d

Sensitivity = a/(a+c), false neg = c/(a+c)
Specificity = d/(b+d), false pos = b/(b+d)
Accuracy = W × sensitivity + (1 − W) × specificity
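The same quantities computed from hypothetical 2×2 counts:

```python
# Hypothetical classification counts, laid out as in the table above.
a, b = 40, 10   # classified positive: true pos (a), true neg (b)
c, d = 8, 60    # classified negative: true pos (c), true neg (d)

sensitivity = a / (a + c)     # false negative fraction = c / (a + c)
specificity = d / (b + d)     # false positive fraction = b / (b + d)

w = 0.5                       # equal weight on the two kinds of accuracy
accuracy = w * sensitivity + (1 - w) * specificity
print(sensitivity, specificity, accuracy)
```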
22
Predict positive if P > Pc Predict negative if P < Pc
Any good classification rule, including a logistic model, should have high sensitivity and specificity. In logistic regression, we choose a cutpoint Pc: predict positive if P > Pc, predict negative if P < Pc.
23
Diabetes complication
logit(Pi) = β0 + 0.328 obese + 0.108 fast glu + 0.023 steady glu
Pi = 1/(1 + exp(−logit))
Compute Pi for all observations, and find the value of Pi (call it P0) that maximizes accuracy = 0.5 × sensitivity + 0.5 × specificity.
This is an ROC analysis using the logit (or Pi).
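A sketch of that cutpoint search on toy data (predicted probabilities and outcomes generated at random, just to show the mechanics):

```python
import numpy as np

def best_cutpoint(y, p):
    """Scan candidate cutpoints P0 and return the one maximizing
    accuracy = 0.5 * sensitivity + 0.5 * specificity (an ROC-style search)."""
    y, p = np.asarray(y), np.asarray(p)
    best_pc, best_acc = None, -1.0
    for pc in np.unique(p):
        pred = (p > pc).astype(int)
        sens = np.mean(pred[y == 1])          # fraction of true 1s predicted 1
        spec = np.mean(1 - pred[y == 0])      # fraction of true 0s predicted 0
        acc = 0.5 * sens + 0.5 * spec
        if acc > best_acc:
            best_pc, best_acc = pc, acc
    return best_pc, best_acc

# Toy data, for illustration only.
rng = np.random.default_rng(1)
p = rng.uniform(size=300)
y = (rng.uniform(size=300) < p).astype(int)
print(best_cutpoint(y, p))
```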
24
ROC for logistic model
25
Diabetes model accuracy
Logit cutpoint = 0.447, P0 = e^0.447/(1 + e^0.447) = 0.61

           True comp   True no comp
Pred yes   55          11
Pred no    14          65
total      69          76

Sens = 55/69 = 79.7%, Spec = 65/76 = 85.5%
Accuracy = (79.7% + 85.5%)/2 = 82.6%
26
C statistic (report this)
n0 = number of negatives (Y=0), n1 = number of positives (Y=1). Make all n0 × n1 pairs (1, 0).
A pair is concordant if the predicted P for the Y=1 member > predicted P for the Y=0 member.
A pair is discordant if the predicted P for the Y=1 member < predicted P for the Y=0 member.
C = (num concordant + 0.5 × num ties) / (n0 × n1)
C = 0.949 for the diabetes complication model.
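A direct, pair-by-pair implementation of the C statistic on a tiny toy example (a fitted model would supply the predicted probabilities):

```python
import numpy as np

def c_statistic(y, p):
    """C = (concordant pairs + 0.5 * tied pairs) / (n0 * n1), over all pairs
    formed from one Y=1 observation and one Y=0 observation."""
    y, p = np.asarray(y), np.asarray(p)
    p1, p0 = p[y == 1], p[y == 0]
    concordant = ties = 0
    for pi in p1:
        concordant += np.sum(pi > p0)
        ties += np.sum(pi == p0)
    return (concordant + 0.5 * ties) / (len(p1) * len(p0))

# Toy data, for illustration only.
y = [0, 0, 0, 1, 1, 1]
p = [0.2, 0.4, 0.6, 0.5, 0.7, 0.9]
print(c_statistic(y, p))   # 8 concordant, 1 discordant, 0 ties -> 0.889
```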
27
Logistic model is also a discriminant model (LDA)
Histograms of logit scores for each group
28
Poisson Regression: Y is a small non-negative count (0, 1, 2, …). Model:
ln(mean Y) = β0 + β1X1 + β2X2 + … + βkXk
so mean Y = exp(β0 + β1X1 + β2X2 + … + βkXk)
d(mean Y)/dXi = βi × (mean Y), so βi = [d(mean Y)/dXi] / (mean Y)
100βi is the approximate percent change in mean Y per unit change in Xi
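A one-line illustration of that interpretation, with a hypothetical coefficient; 100βi is the approximate percent change, e^βi − 1 the exact one:

```python
import math

# Hypothetical Poisson regression coefficient, for illustration only.
b1 = 0.04   # change in ln(mean Y) per unit change in X1

print(100 * b1)                   # approximate percent change: 4% per unit of X1
print(100 * (math.exp(b1) - 1))   # exact multiplicative change: about 4.08%
```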
29
End
30
Equation for logit = log odds = depression "score"
logit = −1.8259 + 0.8332 female + 0.3578 chron ill + β3 income
odds of depression = e^logit, risk = odds/(1 + odds)
Coding: Female: 0 for M, 1 for F. Chron ill: 0 for no, 1 for yes. Income in 1000s.
31
Example: Depression (y/n)
Model for depression

term        coeff = β    SE    p value
Intercept   −1.8259
female       0.8332
chron ill    0.3578
income

Female and chron ill are binary; income is in 1000s.
32
ORs

term        coeff = β    OR = e^β
Intercept   −1.8259      ---
female       0.8332      2.301
chron ill    0.3578      1.430
income