Biostatistics in Practice Peter D. Christenson Biostatistician Session 2: Summarization of Quantitative Information.


1 Biostatistics in Practice Peter D. Christenson Biostatistician http://gcrc.humc.edu/Biostat Session 2: Summarization of Quantitative Information

2 Readings for Session 2 from StatisticalPractice.com Units of Analysis Look at the data Summary statistics Location and spread Correlation Normal distribution Confidence intervals

3 Units of Analysis Go over this entire reading. The author states that some students are “more similar” to each other than are other students, or that some students are “independent”. What does this mean? “Independent” really refers to the measurement that is made, not to the “units”. If knowing the value for one student does not change the likelihood of another student’s value, then the students are independent for this measurement. Would students from the same class likely be independent on height? How about on knowing what a case-control study is?

4 Look at the Data: I Statistical methods depend on the “form” of a set of data, which can be assessed with some common useful graphics:

Graph Name    Y-axis          X-axis
Histogram     Count           Category
Scatterplot   Continuous      Continuous
Dot Plot      Continuous      Category
Box Plot      Percentiles     Category
Line Plot     Mean or value   Category

5 Look at the Data: II What do we look for?
Histograms: Ideal: Symmetric, bell-shaped. Skewness? Multiple peaks? Many values at, say, 0, and bell-shaped otherwise? Outliers?
Scatterplots: Ideal: Narrow ellipse. Outliers? Funnel-shaped? Gap with no values for one or both variables?

6 Summary Statistics: I
Location: Mean for symmetric data. Median for skewed data. Geometric mean for some skewed data (see next slide).
Spread (standard deviation=SD): Standard by convention, but its values are non-intuitive. SD of what? E.g., SD of individuals, or of group means. Fundamental, critical measure for most statistical methods.
See graphs in reading for how mean and SD change if units of measurement change, e.g., nmoles to mg.
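A small sketch of the units point, not taken from the reading: when a change of units is a simple multiplication, the mean and the SD are both multiplied by the same factor, so their ratio is unchanged. The values and the conversion factor below are hypothetical, chosen only for illustration.

```python
import statistics

# Hypothetical measurements in the original units (e.g., nmoles).
values_nmol = [12.0, 15.0, 18.0, 21.0, 24.0]

# Hypothetical multiplicative conversion factor to the new units (e.g., mg).
factor = 0.5
values_mg = [v * factor for v in values_nmol]

mean_nmol = statistics.mean(values_nmol)
sd_nmol = statistics.stdev(values_nmol)   # sample SD (divides by N-1)
mean_mg = statistics.mean(values_mg)
sd_mg = statistics.stdev(values_mg)

print(f"original units: mean={mean_nmol:.2f}, SD={sd_nmol:.2f}")
print(f"new units:      mean={mean_mg:.2f}, SD={sd_mg:.2f}")
print(f"SD/mean is unchanged: {sd_nmol / mean_nmol:.4f} vs {sd_mg / mean_mg:.4f}")
```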

7 Summary Statistics: II Rule of Thumb: For bell-shaped distributions of data (“normally” distributed):
~ 68% of values are within mean ±1 SD
~ 95% of values are within mean ±2 SD
~ 99.7% of values are within mean ±3 SD
Geometric means (see next slide): Used for some skewed data.
1. Take logs of individual values.
2. Find, say, mean ±2 SD → mean (low, up) of the logged values.
3. Find antilogs of mean, low, up. Call them GM, low2, up2 (back on original scale).
4. GM is the “geometric mean”. The interval (low2, up2) is skewed about GM (corresponds to graph).
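The rule-of-thumb percentages can be checked against the exact normal distribution. This is a minimal sketch using only the error function from the standard library; the helper name `fraction_within` is made up for this example.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution lying within mean ± k SD."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within mean ±{k} SD: {100 * fraction_within(k):.1f}%")
# prints 68.3%, 95.4%, 99.7%
```

Strictly, 95% corresponds to about ±1.96 SD rather than ±2 SD, which is why the slides say "roughly".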

8 Geometric Means These are histograms rotated 90º, and box plots. Note how the log transformation gives a symmetric distribution.
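The four geometric-mean steps from the previous slide can be sketched as follows. The data are hypothetical, natural logs are an assumption (any base gives the same GM), and the names GM, low2, up2 follow the slide.

```python
import math
import statistics

def geometric_summary(values, k=2):
    """Return (GM, low2, up2): antilogs of mean and mean ± k*SD of the logs."""
    logs = [math.log(v) for v in values]      # step 1: take logs
    m = statistics.mean(logs)                 # step 2: mean of logged values
    s = statistics.stdev(logs)                #         and their SD
    gm = math.exp(m)                          # step 3: antilog of the mean
    low2 = math.exp(m - k * s)                #         antilog of low
    up2 = math.exp(m + k * s)                 #         antilog of up
    return gm, low2, up2                      # step 4: GM and skewed interval

# Hypothetical right-skewed data (all values must be positive to take logs).
data = [1, 2, 4, 8, 16]
gm, low2, up2 = geometric_summary(data)
print(f"GM={gm:.2f}, interval=({low2:.2f}, {up2:.2f})")
print(f"distance below GM: {gm - low2:.2f}, above GM: {up2 - gm:.2f}")
```

The interval is symmetric on the log scale, so on the original scale it reaches further above GM than below it.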

9 Summary Statistics: III (Correlation) Always look at scatterplot. See graphs in readings for values ranging from -1 (perfectly inverse relation) to +1 (perfectly direct). Zero=no relation. Measures linear association. Very sensitive to outliers. Specific to the ranges of the two variables. Typically, cannot extrapolate to populations with other ranges. Subgroups may not have the same correlation; in fact, they could have the opposite association (ecological fallacy). Special correlations are used for non-symmetric data. Measures association, not causation.
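To illustrate "very sensitive to outliers", here is a sketch that computes the usual (Pearson) correlation by hand, with and without one extreme point. The data are invented for the example and `pearson_r` is a helper written here, not a function from the readings.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data with a clear direct (positive) relation.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]
r_clean = pearson_r(x, y)

# Same data plus a single outlying point far from the others.
r_outlier = pearson_r(x + [20], y + [0])

print(f"without outlier: r = {r_clean:.2f}")
print(f"with one outlier: r = {r_outlier:.2f}")
```

In this made-up case a single point is enough to change the sign of the correlation, which is one reason to always look at the scatterplot first.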

10 Confidence Intervals: I See beginning of reading for the goal of confidence intervals. CIs are not about individuals, but rather about populations, i.e., groups of individuals. A mean from a sample estimates the mean of the entire population. 95% CI for the mean is a range of values we're 95% sure contains the unknown mean. Reading example: N=40 non-smokers. Vitamin C mean±2SD is 90±2*35 = 20 to 160 = “normal range”. Our estimate of the unknown mean for all non-smokers is 90, but how confident are we about that estimate? Need a ±range for it that we are 95% confident contains the unknown mean.

11 Confidence Intervals: II Can calculate a CI for any unknown parameter. Typical 95% CI for a mean is roughly: mean ± 2SD/√N. Larger SD → wider CI. Larger N → narrower CI. More confidence → wider CI. For reading example, about 90 ± 2*35/√40 = 79 to 101. I am being sloppy with terminology. The first “mean” above (underlined on the slide, in “CI for a mean”) is the always-to-be-unknown mean for the population (everyone). The other mean, before ±, is the mean that is calculated from the sample of N, and estimates the unknown mean. Note explicit use of N; correct unit of analysis is critical. What if we measured vitamin C on 10 days for each subject?
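The rough CI formula can be tried on the reading example (mean 90, SD 35, N=40). This is a sketch of the slide's approximation only; an exact interval would use a t multiplier instead of 2, and the function name is made up here.

```python
import math

def approx_ci_95(sample_mean, sd, n):
    """Rough 95% CI for a population mean: sample mean ± 2*SD/sqrt(N)."""
    half_width = 2 * sd / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

low, up = approx_ci_95(90, 35, 40)
print(f"N=40:  {low:.1f} to {up:.1f}")    # 78.9 to 101.1

# Larger N gives a narrower CI; a larger SD gives a wider one.
low_n, up_n = approx_ci_95(90, 35, 160)
low_sd, up_sd = approx_ci_95(90, 70, 40)
print(f"N=160: {low_n:.1f} to {up_n:.1f}")
print(f"SD=70: {low_sd:.1f} to {up_sd:.1f}")
```

N here is the number of independent units. If each of the 40 subjects were measured on 10 days, plugging in N=400 would treat repeated measurements on the same person as independent and would likely give an interval that is too narrow.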

12 Confidence vs. Prediction Intervals Typical 95% CI for a mean is roughly: mean ± 2SD/√N. Recall that this CI is the range of values we're 95% sure contains the unknown mean for “everyone”. What about (normal) ranges for individuals? This is often called a prediction interval (PI). 95% of individuals fall in a 95% PI; equivalently, there is a 95% chance that an individual falls in a 95% PI. Typical 95% PI for an individual is roughly: mean ± 2SD. With large N (often N>30 is used as a rule of thumb), do not need a bell-shaped data distribution for the CI, but that shape IS needed for the PI, regardless of N. Otherwise, we use percentiles for normal ranges.
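A sketch contrasting the two intervals on the reading example, followed by a percentile-based normal range for data that are not bell-shaped. The skewed data are simulated (log-normal, fixed seed) purely for illustration, and the function names are hypothetical.

```python
import math
import random
import statistics

def approx_ci_95(sample_mean, sd, n):
    """Rough 95% CI for the population mean."""
    half_width = 2 * sd / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

def approx_pi_95(sample_mean, sd):
    """Rough 95% PI (normal range) for an individual; assumes bell-shaped data."""
    return sample_mean - 2 * sd, sample_mean + 2 * sd

# Reading example: N=40 non-smokers, vitamin C mean 90, SD 35.
ci = approx_ci_95(90, 35, 40)
pi = approx_pi_95(90, 35)
print(f"CI for the mean:   {ci[0]:.1f} to {ci[1]:.1f}")
print(f"PI for individual: {pi[0]:.0f} to {pi[1]:.0f}")

# Simulated skewed data: mean ± 2SD is not a sensible normal range here.
rng = random.Random(0)
skewed = [rng.lognormvariate(0, 1) for _ in range(500)]
naive_low = statistics.mean(skewed) - 2 * statistics.stdev(skewed)

# Percentile range instead: 2.5th and 97.5th percentiles.
cuts = statistics.quantiles(skewed, n=40)   # 39 cut points, 2.5% apart
pct_low, pct_high = cuts[0], cuts[-1]
print(f"mean - 2SD for skewed data: {naive_low:.2f}")
print(f"percentile range:           {pct_low:.2f} to {pct_high:.2f}")
```

The PI does not shrink as N grows, because it describes the spread of individuals, while the CI describes uncertainty about the mean.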

