EHS 655 Lecture 11: Transformations, inferential statistics (t-test, ANOVA)
What we’ll cover today
- Transformations
- Inferential statistics: t-test, ANOVA
- Review of midterm report requirements
TRANSFORMING VARIABLES
Many inferential statistical methods assume data are normally distributed
- t-test
- ANOVA
- Linear regression
However, many exposure distributions are positive and right-skewed
One solution: log-transform the data
Y_i = ln(x_i)
where
- Y_i is the log-transformed data point
- x_i is the original data point
- ln is the natural logarithmic function
The natural log (ln) transform of a lognormally distributed variable has the properties of a normal distribution, i.e., bell-shaped and symmetric
Described by the geometric mean (GM) and geometric standard deviation (GSD)
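A minimal Stata sketch of the transformation, assuming a hypothetical exposure variable named exposure:
  gen ln_exposure = ln(exposure)        // natural log transform, Y_i = ln(x_i)
  gen log10_exposure = log10(exposure)  // base-10 alternative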
Log transformation
[Figure: exposure distributions, original and log-transformed (Rappaport and Kupper, 2008)]
Evaluating lognormal distribution
- Quantile-quantile plots: untransformed vs. log-transformed
- Stata: qnorm varname1
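A minimal sketch of the Q-Q check in Stata, again with hypothetical variable names:
  qnorm exposure       // normal Q-Q plot of the untransformed values
  qnorm ln_exposure    // Q-Q plot of the log-transformed values; points near the line suggest lognormality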
Interpreting log-transformed estimates
- Geometric mean: antilog of the arithmetic mean of the log-transformed exposures
- Geometric standard deviation: antilog of the arithmetic SD of the log-transformed exposures
Stata: can use either pair of functions for transforming and back-transforming
- ln() or log(), back-transformed with exp() … OR …
- log10(), back-transformed with 10^(log10 value)
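A minimal Stata sketch of computing GM and GSD under these definitions (variable name hypothetical):
  summarize ln_exposure              // arithmetic mean and SD on the log scale
  display "GM  = " exp(r(mean))      // geometric mean = antilog of the mean
  display "GSD = " exp(r(sd))        // geometric standard deviation = antilog of the SD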
Caution about transformation
- Back-transformed mean ≠ original variable mean; the GM isn’t easily interpreted
- Proper to run statistical tests on the transformed values, but often report means on the untransformed scale as well
“If it ain’t broke, don’t fix it.”
- Transformation is unnecessary if the distribution is more or less symmetrical with few outliers and the variances are reasonably homogeneous
- Transformation may be useful for markedly skewed data or heterogeneous variances
INFERENTIAL STATISTICS
- Descriptive measures computed for populations are called parameters
- Inferential statistics use sample data to draw conclusions about populations
We’ll focus on two inferential approaches today
- t-test
- ANOVA
t-test
t-test
- Detects differences between means of (normally distributed) samples; a significant t-statistic means the means differ
- Student’s (unpaired) t-test
  - Tests the hypothesis that the means of two samples are equal; the null is H0: μ1 = μ2
  - Stata: ttest varname1, by(groupvar)
- Paired-sample t-test
  - Tests whether two measurements on the same individuals are equal
  - Stata: ttest varname1 == varname2
Things we can do with a t-test (see the Stata sketch below)
- Single-sample t-test: test whether the mean of a group differs from a reference value
- Unpaired t-test: identify differences in mean exposures between two groups
- Paired-sample t-test: identify differences in exposure before and after an intervention in the same group of subjects
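A minimal Stata sketch of the three tests, with hypothetical variable and group names:
  ttest exposure == 85                       // single-sample: compare the group mean to a reference value (e.g., 85 dBA)
  ttest exposure, by(trade)                  // unpaired: compare mean exposure between two groups defined by trade
  ttest exposure_before == exposure_after    // paired: compare two measurements on the same workers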
Interpreting a single-sample t-test in Stata
Interpreting a t-test between groups in Stata
Interpreting a paired t-test in Stata
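In each case, Stata’s ttest output reports the group means, the t statistic, degrees of freedom, and p-values for three alternative hypotheses; the middle, two-sided p-value (Pr(|T| > |t|)) is the one usually reported when testing whether the means simply differ.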
ANOVA (ANalysis Of Variance)
- Technique for assessing how categorical independent variables affect a continuous dependent variable
- Like a t-test generalized to three or more means
- Tells us whether the means from k groups are the same or not
- Null hypothesis: H0: μ1 = μ2 = … = μk
Things we can do with ANOVA
- Identify differences in mean exposures between more than two groups
- Evaluate the relationship of within-worker variance to between-worker variance within an exposure group
  - Within-worker > between-worker = good exposure grouping
  - Within-worker < between-worker = poor exposure grouping
ANOVA assumptions
- Continuous dependent variable
- Independent variable with 2+ categorical groups
- Observations independent of each other
- Errors normally distributed
- Variances the same for all groups
ANOVA is fairly robust to violations of these assumptions, but the data should not depart too far from them
ANOVA illustrated
Generic ANOVA components
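A standard sketch of the generic one-way ANOVA table, assuming k groups and N total observations:
  Source           Sum of squares   df      Mean square            F
  Between groups   SS_between       k − 1   SS_between / (k − 1)   MS_between / MS_within
  Within groups    SS_within        N − k   SS_within / (N − k)
  Total            SS_total         N − 1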
ANOVA – F-test
- Compares the variability in exposure accounted for by the predictor variable vs. the error variability
- Error variability (mean squared error) measures the inherent randomness of the observations
- Large differences between groups relative to the error variability = large F = significant test
F-statistic
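The standard definition, consistent with the generic table above:
  F = MS_between / MS_within = [SS_between / (k − 1)] / [SS_within / (N − k)]
Under the null hypothesis, F follows an F distribution with (k − 1, N − k) degrees of freedom; the p-value is the probability of an F at least this large.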
Stata ANOVA output
- Stata: oneway responsevar groupvar
- Larger F (for given degrees of freedom) = smaller p-value = stronger evidence against the null
Stata ANOVA output
- Stata: anova responsevar groupvar
- Note the different output: now we also get R², adjusted R², root MSE, etc.
- More in the regression lecture
Stata ANOVA output
- Stata: oneway responsevar groupvar, tabulate
- The tabulate option gives summary results by group
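A minimal sketch putting these together, with hypothetical variable names:
  oneway exposure trade, tabulate    // ANOVA table plus mean/SD of exposure by trade; check F and Prob > F
  anova exposure trade               // same test via the estimation command; also reports R-squared and Root MSE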
Why use ANOVA instead of a t-test?
- We could do t-tests for all pairs of predictor variable categories, but this is not a good idea
  - As the number of exposure groups grows, so does the number of pairwise comparisons
  - Each comparison introduces additional risk of a false-positive (Type I) error
- ANOVA summarizes all of the data in one statistic (F) and gives a single p-value for the null hypothesis
What if I want to know which groups are different?
- Multiple (pairwise) comparisons are possible
- After fitting the model with the anova command, use this postestimation command
- Stata: pwcompare groupvar, effects sort mcompare(tukey)
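A minimal sketch of the workflow, with hypothetical names (pwcompare is a postestimation command, so it follows an estimation command such as anova):
  anova exposure trade
  pwcompare trade, effects sort mcompare(tukey)   // Tukey-adjusted pairwise differences between trades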
Multiple comparison ANOVA output
Measure of agreement between a categorical and a continuous variable
- Stata: loneway responsevar groupvar
- Intraclass correlation coefficient = measure of agreement, interpreted on a scale similar to Cohen’s kappa
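A minimal sketch, with hypothetical names:
  loneway exposure trade    // one-way ANOVA plus the intraclass correlation (returned as r(rho)) with its confidence interval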
ANOVA in action
- Enough with words already. Let’s see how ANOVA actually works: http://web.utah.edu/stat/introstats/anovaflash.html
- Stata ANOVA commands: oneway responsevar groupvar
- Option (to get more detailed output by group): oneway responsevar groupvar, tabulate means standard
Resources
- Choosing statistical tests: http://www.ats.ucla.edu/stat/spss/whatstat/default.htm
- Stata annotated output from various tests: http://www.ats.ucla.edu/stat/AnnotatedOutput/
Review of midterm report
Example of noise exposure calculation requiring transformation
- Noise exposures (in dBA) can be described across individuals arithmetically
  - In other words, to estimate a group mean for individuals in, say, the same trade, compute the arithmetic mean
- Estimating the average noise exposure within an individual (in dBA) amounts to computing a dose, which requires a temporary transformation:
LEQ_i = 10 log10 [ (1/N) (10^(TWA1/10) + 10^(TWA2/10) + … + 10^(TWAN/10)) ]
where N is the total number of TWAs used to estimate the average LEQ for person i
- How to operationalize this in Stata? Note the temporary transformation (see the sketch below)
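A minimal Stata sketch of that calculation, assuming hypothetical variables twa (one TWA per row, in dBA) and id (person identifier):
  gen p = 10^(twa/10)                  // temporary transformation out of the dB scale
  bysort id: egen mean_p = mean(p)     // arithmetic mean of the transformed values within each person
  gen leq = 10*log10(mean_p)           // back-transform to dBA: the person's average LEQ
  drop p mean_p                        // the transformation was only temporary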