Published by Eileen Crawford. Modified over 7 years ago.
1
EHS 655 Lecture 4: Descriptive statistics, censored data
2
What we’ll cover today
Descriptive analysis and visualization
  Distribution
  Central tendency
  Dispersion
Censored data
Stata – basic commands
3
DESCRIPTIVE ANALYSIS
Before we can make inferences from data, we must thoroughly examine the variables
  Catch mistakes
  Look for patterns
  Find violations of statistical assumptions
  Generate hypotheses
  Avoid headaches later
4
Scope of dataset/analysis
Univariate: measurements on one variable per subject
Bivariate: measurements on two variables per subject
Multivariate: measurements on many variables per subject
Today’s focus
5
UNIVARIATE ANALYSES
Characteristics of a single variable, typically:
  Distribution (frequency distribution)
  Central tendency (mean, median, mode)
  Dispersion (range, quartiles, absolute deviation, variance, standard deviation)
6
Distribution: categorical (table)
Stata: “tab varname”
7
Distribution: categorical (ordinal)
Stata: “graph bar (percent), over(varname)”
8
Distribution – quantitative (ratio) histogram
Stata: “histogram varname, freq” (add “normal” to superimpose normal curve)
9
Distribution: cumulative distribution
Stata: “cumul varname, gen(newvar)” followed by “line newvar varname, sort”
10
Distribution: exceedance fraction
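The exceedance fraction can be read as the complement of the cumulative distribution at a chosen limit: the share of measurements above that limit. A minimal Python sketch follows (not the Stata used in this lecture); the function name, the noise values, and the 85 dBA limit are hypothetical and chosen only for illustration.

```python
def exceedance_fraction(values, limit):
    """Share of measurements strictly greater than `limit`."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(v > limit for v in values) / len(values)


# Hypothetical noise measurements (dBA) and a hypothetical limit of 85 dBA.
noise_dba = [78, 82, 85, 86, 88, 90, 79, 84, 91, 83]
fraction = exceedance_fraction(noise_dba, limit=85)
print(f"Exceedance fraction: {fraction:.0%}")
```

The comparison is strict, so a measurement exactly at the limit does not count as an exceedance; whether that is the right convention depends on how the limit is defined.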
11
Central tendency: mean, median, mode
Use: identify “center” around which data are distributed
Mean: best for symmetric, non-skewed distributions
Median: best for skewed distributions or data with outliers
Mode: dataset may be bimodal, or may lack a mode
12
Examples of central tendency
Symmetrical, unimodal
Symmetrical, bimodal
Positively skewed, unimodal
Negatively skewed, unimodal
13
When to use mean, median, mode
Stata: “tabstat varname, stat(mean median)”
Note: tabstat does not report the mode; “egen newvar = mode(varname)” is one way to generate it
Type of variable              Best measure of central tendency
Nominal                       Mode
Ordinal                       Median
Interval/ratio (not skewed)   Mean
Interval/ratio (skewed)       Median
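To see why the median is preferred for skewed data, the three measures can be compared on a small sample. This is a Python sketch using only the standard library, not the lecture's Stata; the exposure values are made up, with one deliberately large outlier.

```python
import statistics

# Hypothetical, positively skewed exposure values (one large outlier).
exposures = [1, 2, 2, 3, 4, 5, 40]

mean_value = statistics.mean(exposures)      # pulled upward by the outlier
median_value = statistics.median(exposures)  # middle value, robust to the outlier
modes = statistics.multimode(exposures)      # a list, since data may be bimodal

print(f"mean={mean_value:.2f} median={median_value} mode(s)={modes}")
```

Here the single outlier drags the mean well above most of the observations, while the median stays near the bulk of the data. `multimode` returns every most-frequent value, so a bimodal sample yields two entries.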
14
Dispersion
Measures which identify spread of data (i.e., how far measurements are from “center”)
  Range
  Quartiles
  Standard deviation (SD)
  Variance
  Coefficient of variation
Stata: “sum varname” provides n, mean, SD, min and max (the endpoints of the range)
Stata: “sum varname, detail” adds percentiles (including quartiles) and variance
Stata: “tabstat varname, stat(mean sd median range iqr cv)”
15
Dispersion: range
Simplest measure of dispersion
Range = maximum - minimum
16
Dispersion: quartiles
3 points that divide data set into 4 equal groups
1st quartile (Q1) marks lowest 25% of data = 25th percentile
2nd quartile (Q2) splits data set in half = 50th percentile (median)
3rd quartile (Q3) marks highest 25% of data = 75th percentile
Interquartile range (IQR) = upper quartile – lower quartile (Q3 – Q1)
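The quartile definitions above can be checked on a small sample. The sketch below is Python rather than Stata, and the data are hypothetical. Software packages interpolate quantiles in slightly different ways, so the `method="inclusive"` setting used here is an assumption and other tools may give slightly different Q1 and Q3 for the same data.

```python
import statistics

# Hypothetical measurements.
values = [2, 4, 4, 5, 7, 9, 10, 12]

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1                            # interquartile range
data_range = max(values) - min(values)   # maximum - minimum

print(f"Q1={q1} Q2={q2} Q3={q3} IQR={iqr} range={data_range}")
```

Q2 coincides with the median, and the IQR describes the spread of the middle half of the data, which makes it less sensitive to extreme values than the range.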
17
Dispersion: boxplot Stata: “graph box varname1, over(varname2)”
18
Dispersion: standard deviation
Variation of data
Not dependent on n (not affected by number of measurements)
Expressed in same units as data
Commonly used in exposure analysis
19
Variance
Square of standard deviation
Squaring deviations from the mean eliminates negative values
Unit is square of measurement unit (!)
Values farther from mean contribute more to variance
Commonly used in exposure analysis
20
Coefficient of variation
Normalized measure of dispersion: CV = SD / mean
Dimensionless, often expressed as %
Allows comparison of datasets with different units or means
Unlike σ, cannot be used to construct confidence intervals around mean
As mean approaches 0, CV approaches infinity
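A short Python sketch (standard library only, not the lecture's Stata) ties the three measures together: variance is the square of the SD, and the CV divides the SD by the mean. The sample values and the helper name `coefficient_of_variation` are hypothetical, and the sample (n - 1) formulas are assumed.

```python
import statistics


def coefficient_of_variation(values):
    """Sample SD divided by the mean; undefined when the mean is 0."""
    mean_value = statistics.mean(values)
    if mean_value == 0:
        raise ValueError("CV is undefined when the mean is 0")
    return statistics.stdev(values) / mean_value


# Hypothetical measurements in some unit, e.g. mg.
values = [2, 4, 4, 4, 5, 5, 7, 9]

sd = statistics.stdev(values)           # same unit as the data
variance = statistics.variance(values)  # squared unit
cv = coefficient_of_variation(values)   # dimensionless

# Re-expressing the same data in another unit (x1000) leaves the CV unchanged.
cv_rescaled = coefficient_of_variation([v * 1000 for v in values])

print(f"SD={sd:.3f} variance={variance:.3f} CV={cv:.1%}")
```

Because the CV is unchanged by rescaling, it can be compared across datasets measured in different units, whereas SD and variance cannot.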
21
BIVARIATE ANALYSES
Allow us to begin to explore relationships between variables
  Scatter plot
  Correlation
  Cross-tabulation
22
Bivariate: scatter plot
Stata: “scatter varname1 varname2”
23
Bivariate: Pearson correlation
r (Pearson’s correlation coefficient) measures the strength and direction of the linear relationship between two variables (ranges from -1 to +1)
Assumptions:
  Both variables interval or ratio data
  Both variables normally distributed
  Absence of outliers
  Linear relationship
  Homoscedasticity
Stata: “pwcorr varname1 varname2, sig”
24
Bivariate: correlation
25
Bivariate: Spearman correlation
Spearman’s rank correlation coefficient (rs or ρ)
Pearson correlation coefficient between ranked variables
Raw scores Xi, Yi converted to ranks xi, yi
Nonparametric (no distributional assumptions)
Assumptions:
  Ordinal, interval, or ratio data
  Monotonic relationship
Stata: “spearman varname1 varname2, stats(rho p)”
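The statement that Spearman's coefficient is Pearson's coefficient computed on ranks can be illustrated directly. The sketch below is plain Python, not Stata; the helper names and the data are hypothetical, and ties are assumed to receive the average of the ranks they span.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: y increases with x, but not linearly (y = x cubed).
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

print(f"Pearson r = {pearson(x, y):.3f}, Spearman rho = {spearman(x, y):.3f}")
```

For a relationship that is perfectly monotonic but curved, Spearman's rho reaches 1 while Pearson's r falls short of 1, which is why the monotonic assumption replaces the linear one.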
26
Bivariate: Spearman vs. Pearson correlation coefficient examples
27
Bivariate: Cross-tabulation
Stata: “tab varname1 varname2”
28
CENSORED DATA
Uncensored/complete: value of each sample unit observed/known (default assumption)
Left censored: data < min value (e.g., below the limit of detection)
Interval censored: data between min and max value
Right censored: data > max value
29
Exercise
Come up with one example of exposure data where you might find each type of censoring
  Right censoring
  Left censoring
  Interval censoring
30
Common approach #1 to dealing with censored data
Assign all censored data a value of ½ LOD (limit of detection)
Assumes data uniformly distributed below LOD
31
Common approach #2 to dealing with censored data
Hornung and Reed (1990): assign all censored data a value of LOD/√2
More accurate than LOD/2 when data normally or lognormally distributed
Okay if <~50% of data censored and variability is low to moderate
Less accurate than LOD/2 for highly skewed data
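The two substitution approaches differ only in the fill-in value. The Python sketch below assumes approach #2 refers to the LOD/√2 substitution proposed by Hornung and Reed (1990). The function name, the use of `None` to mark a result below the limit of detection, the sample values, and the LOD of 0.5 are all hypothetical.

```python
from math import sqrt


def substitute_nondetects(measurements, lod, method="half"):
    """Replace non-detects (marked as None) with a single fill-in value.

    method="half"  -> LOD/2        (approach #1)
    method="sqrt2" -> LOD/sqrt(2)  (approach #2, assumed Hornung and Reed)
    """
    if method == "half":
        fill = lod / 2
    elif method == "sqrt2":
        fill = lod / sqrt(2)
    else:
        raise ValueError(f"unknown method: {method!r}")
    return [fill if m is None else m for m in measurements]


# Hypothetical samples; None marks a result below the LOD.
samples = [None, 0.8, None, 1.5, 2.3]
lod = 0.5

half = substitute_nondetects(samples, lod, method="half")
root2 = substitute_nondetects(samples, lod, method="sqrt2")

print("LOD/2:      ", half)
print("LOD/sqrt(2):", [round(v, 3) for v in root2])
```

Since LOD/√2 is larger than LOD/2, the second approach always gives a slightly higher mean for the same data; which is closer to the truth depends on the shape of the distribution below the LOD.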
32
On to Stata
Basic data manipulation commands
  Define label/name for a variable (“label variable”)
  Create labels (“label define”)
  Assign labels to variable (“label values”)
  Rename variable (“rename”)
  Generate a new variable (“generate”)
  Replace contents of an existing variable (“replace”)
33
On to Stata Anyone have to use the “Break” button?