EHS 655 Lecture 4: Descriptive statistics, censored data
What we’ll cover today
- Descriptive analysis and visualization
  - Distribution
  - Central tendency
  - Dispersion
- Censored data
- Stata – basic commands
DESCRIPTIVE ANALYSIS
Before we can make inferences from data, we must thoroughly examine the variables
- Catch mistakes
- Look for patterns
- Find violations of statistical assumptions
- Generate hypotheses
- Avoid headaches later
Scope of dataset/analysis Univariate Measurements on one variable per subject Bivariate Measurements on two variables per subject Multivariate Measurements on many variables per subject Today’s focus
UNIVARIATE ANALYSES
Characteristics of a single variable, typically:
- Distribution (frequency distribution)
- Central tendency (mean, median, mode)
- Dispersion (range, quartiles, absolute deviation, variance, standard deviation)
Distribution: categorical (table) Stata: “tab varname”
Distribution: categorical (ordinal) Stata: “graph bar (percent), over(varname)”
Distribution – quantitative (ratio) histogram Stata: “histogram varname, freq” (add “normal” to superimpose normal curve) http://www.inchem.org/documents/ehc/ehc/ehc214.htm
Distribution: cumulative distribution
Stata: “cumul varname, gen(newvar)”, then “line newvar varname, sort”
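The distribution commands on the preceding slides can be tried together. This is a minimal sketch that assumes the `auto` example dataset shipped with Stata as a stand-in for course data; the variable `cdf_price` is a name chosen here for illustration.

```stata
* Load an example dataset shipped with Stata (stand-in for course data)
sysuse auto, clear

* Categorical/ordinal variable: frequency table and percent bar chart
tabulate rep78
graph bar (percent), over(rep78)

* Quantitative variable: histogram of frequencies with a normal curve overlaid
histogram price, frequency normal

* Empirical cumulative distribution: generate it, then plot it
cumul price, generate(cdf_price)
line cdf_price price, sort
```

The `sort` option on `line` matters: without it, the points are connected in dataset order rather than in order of `price`.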
Distribution: exceedance fraction http://depts.washington.edu/occnoise/content/generaltradesIDweb.pdf
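One way to estimate an exceedance fraction empirically is as the proportion of measurements above a chosen limit. The sketch below assumes a small made-up set of noise levels and a hypothetical limit of 85 dBA; neither comes from the linked document.

```stata
* Hypothetical noise measurements in dBA (made-up values for illustration)
clear
input level_dba
78
82
84
86
88
91
95
83
end

* Indicator: 1 if the measurement exceeds the assumed 85 dBA limit, else 0
generate byte exceeds = level_dba > 85 if !missing(level_dba)

* The mean of a 0/1 indicator is the proportion of 1s
summarize exceeds
display "Exceedance fraction = " %4.2f r(mean)
```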
Central tendency: mean, median, mode
Use: identify the “center” around which data are distributed
- Mean: best for symmetric, non-skewed distributions
- Median: best for skewed distributions or data with outliers
- Mode: most frequent value; a dataset may be bimodal or may lack a mode
Examples of central tendency (figure panels): symmetrical, unimodal; symmetrical, bimodal; positively skewed, unimodal; negatively skewed, unimodal
When to use mean, median, mode
Stata: “tabstat varname, stat(mean median)”
Note: tabstat does not report the mode

Type of variable              Best measure of central tendency
Nominal                       Mode
Ordinal                       Median
Interval/ratio (not skewed)   Mean
Interval/ratio (skewed)       Median
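The choices in the table above can be tried as follows. This sketch assumes the `auto` example dataset shipped with Stata; it uses the `mode()` function of `egen` as one possible way to obtain the mode, and `mode_rep78` is a name chosen here for illustration.

```stata
sysuse auto, clear

* Mean and median of a ratio variable; a large gap between them suggests skew
tabstat price, statistics(mean median)

* Mode of an ordinal variable; minmode picks the lowest value if modes tie
egen mode_rep78 = mode(rep78), minmode

* Every observation now holds the same value: the mode of rep78
tabulate mode_rep78
```

Without `minmode` (or `maxmode`), `egen` returns missing when more than one value ties for the mode.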
Dispersion
Measures that identify the spread of data (i.e., how far measurements are from the “center”)
- Range
- Quartiles
- Standard deviation (SD)
- Variance
- Coefficient of variation
Stata: “sum varname” provides n, mean, SD, minimum, and maximum
or Stata: “sum varname, detail” adds percentiles (including quartiles), variance, skewness, and kurtosis
Stata: “tabstat varname, stat(mean sd median range iqr cv)”
Dispersion: range
Simplest measure of dispersion
Range = maximum − minimum
Dispersion: quartiles
3 points that divide the data set into 4 equal groups
- 1st quartile (Q1): upper bound of the lowest 25% of data = 25th percentile
- 2nd quartile (Q2): splits the data set in half = 50th percentile (median)
- 3rd quartile (Q3): lower bound of the highest 25% of data = 75th percentile
- Interquartile range (IQR) = Q3 − Q1
Dispersion: boxplot Stata: “graph box varname1, over(varname2)” http://www.inchem.org/documents/ehc/ehc/ehc214.htm
Dispersion: standard deviation
- Measures variation of data around the mean
- Does not systematically depend on the number of measurements (n)
- Expressed in same units as data
- Commonly used in exposure analysis
Variance
- Square of the standard deviation
- Squaring the deviations from the mean eliminates negative values
- Unit is the square of the measurement unit (!)
- Values farther from the mean contribute more to the variance
- Commonly used in exposure analysis
Coefficient of variation (CV)
- Normalized measure of dispersion: CV = SD / mean
- Dimensionless, often expressed as %
- Allows comparison of datasets with different units or means
- Unlike σ, cannot be used to construct confidence intervals around the mean
- As the mean approaches 0, the CV approaches infinity
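The dispersion measures above can be computed from the results that `summarize, detail` stores in `r()`. A minimal sketch, assuming the `auto` example dataset shipped with Stata:

```stata
sysuse auto, clear

* Detailed summary; results are stored in r() for later use
summarize price, detail

* Range, IQR, and CV computed from the stored results
display "Range = " (r(max) - r(min))
display "IQR   = " (r(p75) - r(p25))
display "CV    = " %5.1f (100 * r(sd) / r(mean)) "%"

* The same statistics in one table (tabstat reports cv as a ratio, not a percent)
tabstat price, statistics(n mean sd variance range iqr cv)
```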
BIVARIATE ANALYSES
Allow us to begin to explore relationships between variables
- Scatter plot
- Correlation
- Cross-tabulation
Bivariate: scatter plot Stata: “scatter varname1 varname2” http://www.inchem.org/documents/ehc/ehc/ehc214.htm
Bivariate: Pearson correlation
r (Pearson’s correlation coefficient) measures the strength and direction of the linear relationship between two variables, ranging from −1 to +1
Assumptions:
- Both variables interval or ratio data
- Both variables normally distributed
- Absence of outliers
- Linear relationship
- Homoscedasticity
Stata: “pwcorr varname1 varname2, sig”
Bivariate: correlation
Bivariate: Spearman correlation
Spearman’s rank correlation coefficient (rs or ρ)
- Pearson correlation coefficient between ranked variables
- Raw scores Xi, Yi converted to ranks xi, yi
- Nonparametric (no distributional assumptions)
Assumptions:
- Ordinal, interval, or ratio data
- Monotonic relationship
Stata: “spearman varname1 varname2, stats(rho p)”
Bivariate: Spearman vs. Pearson correlation coefficient examples
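To compare the two coefficients on the same pair of variables, it helps to look at the scatter plot first and then run both commands. A minimal sketch, assuming the `auto` example dataset shipped with Stata:

```stata
sysuse auto, clear

* Look at the relationship before summarizing it with a single number
scatter price mpg

* Pearson correlation with its p-value
pwcorr price mpg, sig

* Spearman rank correlation with its p-value
spearman price mpg, stats(rho p)
```

When the relationship is monotonic but curved, or when outliers are present, the two coefficients can differ noticeably.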
Bivariate: Cross-tabulation
Stata: “tab varname1 varname2”
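A two-way table is often easier to read with percentages added. A minimal sketch, assuming the `auto` example dataset shipped with Stata:

```stata
sysuse auto, clear

* Two-way table of counts
tabulate foreign rep78

* Same table with row percentages under each count
tabulate foreign rep78, row
```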
CENSORED DATA
- Uncensored/complete: value of each sample unit is observed/known (default assumption)
- Left censored: value known only to be below a limit (data < limit)
- Interval censored: value known only to lie between a minimum and a maximum value
- Right censored: value known only to be above a limit (data > limit)
Exercise
Come up with one example of exposure data where you might find each type of censoring
- Right censoring
- Left censoring
- Interval censoring
Common approach #1 to dealing with censored data
- Assign all censored data a value of ½ the limit of detection (LOD/2)
- Assumes data are uniformly distributed below the LOD
Common approach #2 to dealing with censored data
Hornung and Reed (1990): assign all censored data LOD/√2
- More accurate than LOD/2 when data are normally or lognormally distributed
- Acceptable when less than about 50% of data are censored and variability is low to moderate
- Less accurate than LOD/2 for highly skewed data
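Both substitution approaches can be applied side by side to see how much the choice changes the summary statistics. This sketch assumes made-up concentrations, a hypothetical LOD of 0.5, and an indicator named `censored` marking results reported as below the LOD. The LOD/√2 value is the substitution commonly attributed to Hornung and Reed (1990).

```stata
* Hypothetical concentrations; censored == 1 marks results reported as <LOD
clear
input conc censored
0.5 1
0.5 1
0.8 0
1.2 0
2.5 0
4.1 0
end

* Assumed limit of detection for this illustration
scalar lod = 0.5

* Approach #1: substitute LOD/2 for censored results
generate conc_half = cond(censored == 1, scalar(lod) / 2, conc)

* Approach #2: substitute LOD/sqrt(2) for censored results
generate conc_sqrt2 = cond(censored == 1, scalar(lod) / sqrt(2), conc)

* Compare summary statistics under the two substitutions
tabstat conc_half conc_sqrt2, statistics(n mean median sd)
```

Keeping the original `conc` and the `censored` flag untouched makes it possible to redo the substitution later with a different rule.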
On to Stata
Basic data manipulation commands
- Define label/name for a variable (“label variable”)
- Create value labels (“label define”)
- Assign value labels to a variable (“label values”)
- Rename a variable (“rename”)
- Generate a new variable (“generate”)
- Replace the contents of an existing variable (“replace”)
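The six commands above might be combined as in the following sketch. It assumes the `auto` example dataset shipped with Stata. The names `miles_per_gallon`, `heavy`, and `heavy_lbl`, and the 3,000 lb cutoff, are chosen here for illustration.

```stata
sysuse auto, clear

* Rename a variable and give it a descriptive variable label
rename mpg miles_per_gallon
label variable miles_per_gallon "Fuel economy (miles per gallon)"

* Generate a new indicator variable, then change its contents with replace
generate byte heavy = 0
replace heavy = 1 if weight > 3000 & !missing(weight)

* Define a set of value labels and attach it to the new variable
label define heavy_lbl 0 "3,000 lb or less" 1 "More than 3,000 lb"
label values heavy heavy_lbl

* The table now shows the labels instead of 0 and 1
tabulate heavy
```

The `!missing(weight)` condition guards against a Stata convention worth remembering: missing values are treated as larger than any number, so `weight > 3000` alone would be true for missing weights.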
On to Stata Anyone have to use the “Break” button?