Biostatistics in Practice Peter D. Christenson Biostatistician LABioMed.org /Biostat Session 2: Summarization of Quantitative Information.

Presentation transcript:

Biostatistics in Practice Peter D. Christenson Biostatistician LABioMed.org /Biostat Session 2: Summarization of Quantitative Information

Readings for Session 2 from StatisticalPractice.com
Topics: Units of analysis (“experimental units”). Look at the data. Summary statistics: typical values and their variability; correlation. Normal distribution. Confidence intervals.
Notes: Units: e.g., mouse or litter; not, e.g., mg/ml. RAW data, preferably. Summary: the particular method depends on the structure in the raw data. Bell curve: often “natural”. Want ranges (for what?).

Units of Analysis Go over this entire reading at StatisticalPractice.com. The author states that some students are “more similar” to each other than are other students, or some students are “independent”. What does this mean? “Independent” really refers to the measurement that is made, not the units such as students or classes or schools. If knowledge of the value for a student does not change the likelihood of another student’s value, given class means, then the students are independent for this measurement. Would students from the same class likely be independent on height? How about on knowing some academic fact, such as what a case-control study is?

Example: Units and Independence
Ten mice receive treatment A, each is bled, and each blood sample is divided into 3 aliquots. The same is done for 10 mice on treatment B.
1. A serum hormone is measured in the 60 aliquots and compared between A and B. The unit is a mouse; the per-mouse means of the 3 aliquots are independent, so N=10+10. The aliquots from one mouse are not independent.
2. One of the 30 A aliquots is further divided into 25 parts, and 5 different in vitro challenges are each applied to 5 of the parts. The same is done for a single B aliquot. For the challenge experiment, each part is a unit, their values are independent, and N=25+25. For comparing A and B, there are only N=1+1 units, the two mice.
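A minimal sketch of the first case, where the mouse is the unit: the aliquots are collapsed to one mean per mouse before any comparison. The hormone values and mouse names below are hypothetical, and only two mice per treatment are shown to keep it short.

```python
from statistics import mean

# Hypothetical hormone values: 3 aliquots per mouse (illustrative numbers only).
aliquots = {
    "A": {"mouse1": [5.1, 5.3, 5.2], "mouse2": [6.0, 5.8, 6.1]},
    "B": {"mouse3": [7.2, 7.0, 7.1], "mouse4": [6.5, 6.7, 6.6]},
}

def per_mouse_means(group):
    """Collapse each mouse's aliquots to one value, so the mouse is the unit."""
    return {mouse: mean(values) for mouse, values in group.items()}

# One value per mouse: these are the independent observations.
unit_values = {trt: per_mouse_means(group) for trt, group in aliquots.items()}

# N for the A-vs-B comparison counts mice, not aliquots.
n_units = {trt: len(vals) for trt, vals in unit_values.items()}
n_aliquots = {trt: sum(len(v) for v in group.values())
              for trt, group in aliquots.items()}

print(n_units)     # number of mice per treatment
print(n_aliquots)  # number of aliquots per treatment (not the N to use)
```

Using the aliquot count as N would overstate the amount of independent information.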

Look at the Data
Statistical methods depend on the “form” of a set of data, which can be assessed with some common useful graphics:

Graph Name     Y-axis           X-axis
Histogram      Count or %       Category
Scatterplot    Continuous       Continuous
Dot Plot       Continuous       Category
Box Plot       Percentiles      Category
Line Plot      Mean or value    Category

Examples on following slides are from StatisticalPractice.com

Data Graphical Displays
[Figures: Histogram (summarized*); Scatter plot (raw data).]
*The raw data version of a histogram is a stem-leaf plot. We will see one later.

Data Graphical Displays
[Figures: Dot plot (raw data); Box plot (summarized).]

Data Graphical Displays
Line or Profile Plot: summarized. The “antennae” can represent various ranges.
[Figure: line plot; x-axis is Week.]

Look at the Data, Continued
What do we look for?
Histograms: Ideal: Symmetric, bell-shaped. Potential Problems: Skewness. Multiple peaks. Many values at, say, 0, and bell-shaped otherwise. Outliers.
Scatter plots: Ideal: Football-shaped; ellipse. Potential Problems: Outliers. Funnel-shaped. Gap with no values for one or both variables.

Example Histogram: OK for Default* Analyses
Symmetric. One peak. Roughly bell-shaped. No outliers.
*Software default: typical mean, SD, confidence intervals.

Histograms: Not OK for Default Analyses
Skewed: Need to transform intensity to another scale, e.g., Log(intensity).
Multi-Peak: Need to summarize with percentiles, not mean.

Histograms: Not OK for Default Analyses
Truncated Values: Need to use percentiles for most analyses. [Figure annotation: LLOQ; undetectable in 28 samples (<LLOQ).]
Outliers: Need to use median, not mean, and percentiles.

Example Scatter Plot: OK for Typical Analyses

Scatter Plot: Not OK for Typical Analyses
Gap and Outlier: Consider analyzing subgroups.
Funnel-Shaped: Could transform y-value to another scale, e.g., logarithm.
Sources: Ott, Amer J Obstet Gyn 2005;192: Ferber et al, Amer J Obstet Gyn 2004;190:

Summary Statistics: I
Typical Values (“Location”): Mean for symmetric data. Median for skewed data. Geometric mean for some skewed data (see later slide).
Variation in Values (“Spread”; standard deviation = SD): Standard by convention; non-intuitive values. SD =~ average deviation of values from their mean. SD of what? E.g., SD of individuals, or of group means. Fundamental, critical measure for most statistical methods.
See graphs in reading for how mean and SD change if units of measurement change, e.g., nmoles to mg:
Mean(a + b*X) = a + b*Mean(X)
SD(a + b*X) = |b|*SD(X)
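The rescaling rules can be checked numerically. This is a small sketch with hypothetical values and arbitrary constants a and b; a negative b is chosen on purpose to show that the SD scales by the absolute value of b, since an SD cannot be negative.

```python
from statistics import mean, stdev

def rescale(values, a, b):
    """Apply the linear change of units a + b*x to every value."""
    return [a + b * x for x in values]

# Hypothetical measurements and arbitrary rescaling constants.
x = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
a, b = 10.0, -3.0

y = rescale(x, a, b)

# Mean follows the same linear rule; SD ignores the shift a and the sign of b.
print(mean(y), a + b * mean(x))
print(stdev(y), abs(b) * stdev(x))
```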

Examples: Mean and SD
[Figure: histograms A and B.]
A: Mean = 60.6 min. SD = 9.6 min.
B: Mean = 15.1 min. SD = 2.8 min.
Note that the entire range of data in A is about 6 SDs wide; this 6-SD width is the source of the “Six Sigma” name used in business quality control.

Examples: Mean and SD
Skewed: Mean = 1.0 min. SD = 1.1 min.
Multi-Peak: Mean = 70.3 min. SD = 22.3 min.

Summary Statistics: II
Rule of Thumb: For bell-shaped distributions of data (“normally” distributed):
~ 68% of values are within mean ±1 SD
~ 95% of values are within mean ±2 SD
~ 99.7% of values are within mean ±3 SD
Geometric means (see next slide): Used for some skewed data.
1. Take logs of individual values.
2. Find, say, mean ±2 SD → mean (low, up) of the logged values.
3. Find antilogs of mean, low, up. Call them GM, low2, up2 (back on original scale).
4. GM is the “geometric mean”. The interval (low2, up2) is skewed about GM (corresponds to graph).
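The four geometric-mean steps can be sketched as follows. The input values are hypothetical right-skewed measurements, not data from the slides.

```python
from math import exp, log
from statistics import mean, stdev

# Hypothetical right-skewed measurements (illustrative values only).
values = [12.0, 35.0, 60.0, 95.0, 110.0, 240.0, 800.0]

# Step 1: take logs of the individual values (natural log here).
logs = [log(v) for v in values]

# Step 2: mean and mean +/- 2 SD on the log scale.
m, s = mean(logs), stdev(logs)
low, up = m - 2 * s, m + 2 * s

# Step 3: antilogs bring everything back to the original scale.
gm, low2, up2 = exp(m), exp(low), exp(up)

# Step 4: GM is the geometric mean; (low2, up2) is not symmetric about it.
print(gm, low2, up2)
print(up2 - gm, gm - low2)  # the upper arm is longer than the lower arm
```

The interval is symmetric on the log scale, so on the original scale it is symmetric as a ratio (up2/GM equals GM/low2) rather than as a difference.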

Geometric Means
These are flipped histograms rotated 90º, and box plots. Any base for the log transformation gives a symmetric distribution. [Ln used here; log 10 gives the same GM and bounds.]
GM = exp(4.633) =~ 102.8
low2 = exp(4.633 - 2*1.09) =~ 11.6
up2 = exp(4.633 + 2*1.09) =~ 909.6
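A short check of the arithmetic on this slide, assuming that 4.633 and 1.09 are the mean and SD of the natural-log values and that the bounds follow the mean ± 2 SD recipe. It also illustrates the bracketed remark that log 10 gives the same GM and bounds: the log10 summaries are obtained here by dividing the natural-log summaries by ln(10).

```python
from math import exp, log

# Summaries read from the slide (assumed to be on the natural-log scale).
mean_ln, sd_ln = 4.633, 1.09

# Back-transform with the natural antilog.
gm = exp(mean_ln)
low2 = exp(mean_ln - 2 * sd_ln)
up2 = exp(mean_ln + 2 * sd_ln)

# The same summaries expressed on the log10 scale.
ln10 = log(10)
mean_l10, sd_l10 = mean_ln / ln10, sd_ln / ln10

# Back-transform with the base-10 antilog; the results should match.
gm_10 = 10 ** mean_l10
low2_10 = 10 ** (mean_l10 - 2 * sd_l10)
up2_10 = 10 ** (mean_l10 + 2 * sd_l10)

print(gm, low2, up2)
print(gm_10, low2_10, up2_10)
```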

Summary Statistics: III (Correlation)
We will examine calculation details later. With 2 continuous measures, always look at the scatterplot.
See graphs in readings for values ranging from -1 (perfectly inverse relation) to +1 (perfectly direct). Zero = no linear relation.
Measures linear association.
Very sensitive to outliers.
Specific to the ranges of the two variables. Typically, cannot extrapolate to populations with other ranges.
Subgroups may not have the same correlation as the whole group; in fact, they could have the opposite association (ecological fallacy).
Special correlations are used for non-symmetric data.
Measures association, not causation.

Correlation Depends on Ranges of X and Y
[Figure: scatter plots A and B.]
Graph B contains only the graph A points in the ellipse. Correlation is reduced in graph B. Thus: correlations for the same quantities X and Y may be quite different in different study populations.
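The effect of a restricted range can be illustrated with simulated data. This is only a sketch: the data are generated, not taken from the slide, the underlying correlation of 0.8 is an assumption, and the restriction is applied to x alone (a band, not the ellipse shown in the figure).

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)   # fixed seed so the run is repeatable
rho = 0.8                # assumed correlation in the full population

# Simulate correlated (x, y) pairs for the "graph A" population.
xs, ys = [], []
for _ in range(5000):
    x = rng.gauss(0, 1)
    y = rho * x + sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    xs.append(x)
    ys.append(y)

# "Graph B": keep only the points whose x lies in a narrow band.
subset = [(x, y) for x, y in zip(xs, ys) if -0.5 < x < 0.5]
sub_x, sub_y = zip(*subset)

r_all = pearson_r(xs, ys)
r_sub = pearson_r(sub_x, sub_y)
print(r_all, r_sub)  # the restricted-range correlation is the smaller one
```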

Correlation and Measurement Precision
[Figure: scatter plots A and B, annotated with r=0.]
A lack of correlation for the subpopulation with 5<x<6 may be due to inability to measure x and y well. Again, lack of evidence is not evidence of “lack” (of association in this setting).

Confidence Intervals: I See beginning of reading for the goal of confidence intervals. CIs are not about individuals, but rather about populations, i.e., groups of individuals. A mean from a sample estimates the mean of the entire population. 95% CI for the mean is a range of values we're 95% sure contains the unknown mean. Reading example: N=40 non-smokers. Vitamin C mean±2SD is 90±2*35 = 20 to 160 = “normal range”. Our estimate of the unknown mean for all non-smokers is 90, but how confident are we about that estimate? Need a ±range for it that we are 95% confident contains the unknown mean.

Confidence Intervals: II
Can calculate a CI for any unknown parameter. Typical 95% CI for a mean is roughly: mean ± 2SD/√N.
Larger SD → wider CI. Larger N → narrower CI. More confidence → wider CI.
For the reading example, CI =~ 90 ± 2*35/√40 = 90 ± 11, or about 79 to 101.
I am being sloppy with terminology. The mean in “CI for a mean” (underlined on the slide) is the always-to-be-unknown mean for the population (everyone). The other mean, before the ±, is the mean that is calculated from the sample of N, denoted X-bar, and it estimates the unknown mean, denoted μ.
Note explicit use of N; correct unit of analysis is critical. What if we measured vitamin C on 10 days for each subject?
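A sketch of the rough CI formula, applied to the reading example (mean 90, SD 35, N=40). The function name `approx_ci` is made up for illustration. Instead of the rounded multiplier 2, it takes the multiplier from the normal distribution (about 1.96 for 95%), which is an assumption that suits moderately large N.

```python
from math import sqrt
from statistics import NormalDist

def approx_ci(mean, sd, n, confidence=0.95):
    """Approximate CI for a mean: mean +/- z * SD / sqrt(N)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * sd / sqrt(n)
    return mean - half_width, mean + half_width

def width(interval):
    return interval[1] - interval[0]

# Reading example: N=40 non-smokers, vitamin C mean 90, SD 35.
base = approx_ci(90, 35, 40)
print(base)

# The three qualitative rules from the slide.
print(width(approx_ci(90, 70, 40)) > width(base))                   # larger SD
print(width(approx_ci(90, 35, 160)) < width(base))                  # larger N
print(width(approx_ci(90, 35, 40, confidence=0.99)) > width(base))  # more confidence
```

Because N enters through √N, quadrupling the number of independent units halves the width of the interval.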

Confidence vs. Prediction Intervals Typical 95% CI for a mean is roughly: mean ± 2SD/√N. Recall that this CI is the range of values we're 95% sure contains the unknown mean for “everyone”. What about (normal) ranges for individuals? This is often called a prediction interval (PI) = “normal” range = reference range. 95% of individuals fall in a 95% PI. 95% chance that an individual falls in a 95% PI. Typical 95% PI for an individual is roughly: mean ± 2SD. With large N (how large? often N>30 is used), do not need bell-shaped data distribution for the CI, but that shape IS needed for the PI, regardless of N. Otherwise, we use percentiles for normal ranges.
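The last point, using percentiles for normal ranges when the data are not bell-shaped, can be sketched with simulated skewed data. The lognormal parameters and sample size are arbitrary assumptions chosen only to produce a strongly right-skewed, all-positive sample.

```python
import random
from statistics import mean, stdev, quantiles

rng = random.Random(1)  # fixed seed so the run is repeatable

# Simulated right-skewed, strictly positive measurements (not real data).
values = [rng.lognormvariate(4.6, 1.1) for _ in range(1000)]

# Bell-curve style range: mean +/- 2 SD.
m, s = mean(values), stdev(values)
normal_range = (m - 2 * s, m + 2 * s)

# Percentile range: cut points every 2.5%, keeping the first and last,
# i.e. the 2.5th and 97.5th percentiles.
cuts = quantiles(values, n=40)
percentile_range = (cuts[0], cuts[-1])

# Share of the sample that each range actually covers.
def coverage(lo, hi):
    return sum(lo <= v <= hi for v in values) / len(values)

print(normal_range, coverage(*normal_range))
print(percentile_range, coverage(*percentile_range))
```

For data this skewed, mean - 2SD falls below zero even though every value is positive, while the percentile range stays within the observed data and covers the central 95% of the sample.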

CI and PI for the Antibody Example
GM = exp(4.633) =~ 102.8
PI: low2 = exp(4.633 - 2*1.09) =~ 11.6; up2 = exp(4.633 + 2*1.09) =~ 909.6
So, there is 95% assurance that an individual is between 11.6 and 909.6, the PI.
CI: lower = exp(4.633 - 2*1.09/√394) =~ 92.1; upper = exp(4.633 + 2*1.09/√394) =~ 114.8
So, there is 95% assurance that the population mean is between 92.1 and 114.8, the CI.
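The two intervals differ only in whether the SD is divided by √N before back-transforming. A sketch using the slide's numbers, assuming 4.633 and 1.09 are the mean and SD of the natural-log antibody values and N=394:

```python
from math import exp, sqrt

# Summaries from the slide (natural-log scale) and the sample size.
mean_ln, sd_ln, n = 4.633, 1.09, 394

# Standard error of the mean on the log scale.
se_ln = sd_ln / sqrt(n)

# CI uses the standard error; PI uses the SD of individuals.
ci = (exp(mean_ln - 2 * se_ln), exp(mean_ln + 2 * se_ln))
pi = (exp(mean_ln - 2 * sd_ln), exp(mean_ln + 2 * sd_ln))

print(ci)  # range for the population value
print(pi)  # range for an individual
```

The CI is far narrower than the PI because it describes uncertainty about a population summary, and that uncertainty shrinks with N, whereas the spread among individuals does not.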