Biostatistics in Practice Peter D. Christenson Biostatistician LABioMed.org /Biostat Session 2: Summarization of Quantitative Information.

Presentation transcript:

Topics for this Session: Experimental Units; Independence of Measurements; Graphs: Summarizing Results; Graphs: Aids for Analysis; Summary Measures; Confidence Intervals; Prediction Intervals

Most Practical from this Session: Geometric Means; Confidence Intervals; Reference Ranges; Justify Methods from Graphs

Experimental Units _____ Independence of Measurements

Statistical Independence Experimental units are the smallest independent entities for addressing a scientific question in an analysis of an experiment. “Independent” refers to the measurement that is made and the question, not the units. Definition: If knowledge of the value for a unit does not provide information about another unit’s value, given other factors (and the overall mean) in the analysis of the experiment, then the units are independent for this measurement. There may be a hierarchy of units.

Importance of Independence Many basic statistical methods require that measurements are independent for the analysis to be valid. Other methods can incorporate the lack of independence. There can be some subjectivity regarding independence. Statistical methods use models. Models can be wrong.

Example: Units and Independence. Ten mice receive treatment A, each is bled, and each blood sample is divided into 3 aliquots. The same is done for 10 mice on treatment B. 1. A serum hormone is measured in the 60 aliquots and compared between A and B. The aliquots from a mouse are not independent. The unit is a mouse. Summary statistics computed from each mouse's 3 aliquots (e.g., maximum or mean) are independent across mice. N=10 and 10, not 30 and 30.
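
A minimal Python sketch of the point above, assuming made-up hormone values (the data, the layout, and the helper name `per_mouse_means` are illustrative and not from the study). It collapses each mouse's 3 aliquots to one summary value, so each treatment group contributes 10 independent values rather than 30.

```python
from statistics import mean

def per_mouse_means(aliquots_by_mouse):
    """Collapse each mouse's aliquot values to one summary value (the mean)."""
    return [mean(values) for values in aliquots_by_mouse]

# Hypothetical data: 10 mice per treatment, 3 aliquots per mouse.
group_a = [[10.0 + m + 0.1 * k for k in range(3)] for m in range(10)]
group_b = [[12.0 + m + 0.1 * k for k in range(3)] for m in range(10)]

summary_a = per_mouse_means(group_a)
summary_b = per_mouse_means(group_b)

# The comparison of A and B uses 10 values per group, not 30.
print(len(summary_a), len(summary_b))
```

Any other per-mouse summary (for example the maximum) could be substituted for the mean; the point is only that the mouse, not the aliquot, is the unit passed on to the analysis.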

Example, Continued. 2. One of the 30 A aliquots is further divided into 25 parts, and each of 5 different in vitro challenges is applied to a random set of 5 of the parts. The same is done for a single B aliquot. For this challenge experiment, each part is a unit and the challenge-response values are independent, so N=25 per aliquot (5 parts per challenge). For comparing A and B, there are only N=1+1 experimental units, the two mice.

Experimental Units in Case Study

There is a nested hierarchy of several "levels" of data: Schools, children within the schools, and diets received by every child. What would you use for the "N" for this study? Which outcomes do you intuitively think are correlated (in common language)? Results from one child's three diets? Results from children in the same school? Schools?

Experimental Units in Case Study: N = number of children. Results from one child's three diets cannot be modeled as independent. Results from children in the same school also could be “correlated” (dependent). They can be modeled as independent if the effect of school is included in the analysis: knowing one child’s score and the school mean gives no information on another child’s score.

Units and Analysis in the Case Study: N = number of children. Analysis: This method is a complex generalization of methods we discuss in Session 3. For any method, though, you need to inform the software of the correct experimental units. For some experiments, it is obvious and implicit.

Graphs: Summarizing Results

Common Graphical Summaries
Graph Name      Y-axis           X-axis
Histogram       Count or %       Category
Scatterplot     Continuous       Continuous
Dot Plot        Continuous       Category
Box Plot        Percentiles      Category
Line Plot       Mean or value    Category
Kaplan-Meier    Probability      Time
Many of the examples are from StatisticalPractice.com

Data Graphical Displays: Histogram (summarized*) and Scatter plot (raw data). *The raw-data version of a histogram is a stem-leaf plot. We will see one later.

Data Graphical Displays: Dot Plot (raw data) and Box Plot (summarized).

Data Graphical Displays: Line or Profile Plot (summarized). Bars can represent various types of ranges.

Data Graphical Displays: Kaplan-Meier Plot. Probability of surviving 5 years is 0.35. This is not necessarily 35% of subjects.

Graphs: Aids for Analysis

Graphical Aids for Analysis: Most statistical analyses involve modeling. Parametric methods (t-test, ANOVA, χ²) have stronger requirements than non-parametric (rank-based) methods. Every method is based on data satisfying certain requirements. Many of these requirements can be assessed with some useful common graphics.

Look at the Data for Analysis Requirements What do we look for? In Histograms (one variable): Ideal: Symmetric, bell-shaped. Potential Problems: Skewness. Multiple peaks. Many values at, say, 0, and bell-shaped otherwise. Outliers.

Example Histogram: OK for Typical* Analyses Symmetric. One peak. Roughly bell-shaped. No outliers. *Typical: mean, SD, confidence intervals, to be discussed in later slides.

Histograms: Not OK for Typical Analyses. Skewed: Need to transform intensity to another scale, e.g., log(intensity). Multi-peak: Need to summarize with percentiles, not mean.

Histograms: Not OK for Typical Analyses. Truncated values (undetectable in 28 samples, <LLOQ): Need to use percentiles for most analyses. Outliers: Need to use median, not mean, and percentiles.

Look at the Data for Analysis Requirements What do we look for? In Scatter Plots (two variables): Ideal: Football-shaped; ellipse. Potential Problems: Outliers. Funnel-shaped. Gap with no values for one or both variables.

Example Scatter Plot: OK for Typical Analyses

Scatter Plot: Not OK for Typical Analyses. Gap and outlier: Consider analyzing subgroups. Funnel-shaped: Should transform y-value to another scale, e.g., logarithm. Sources: Ott, Amer J Obstet Gyn 2005;192: Ferber et al, Amer J Obstet Gyn 2004;190:

Summary Measures

Common Summary Measures: Mean and SD or SEM; Geometric Mean; Z-Scores; Correlation; Survival Probability; Risks, Odds, and Hazards

Summary Statistics: One Variable Data Reduction to a few summary measures. Basic: Need Typical Value and Variability of Values Typical Values (“Location”): Mean for symmetric data. Median for skewed data. Geometric mean for some skewed data - details in later slides.

Summary Statistics: Variation in Values. Standard Deviation, SD ≈ 1.25 × (average |deviation| of values from their mean) for bell-shaped data. It is the standard by convention, but its values are non-intuitive. SD of what? E.g., SD of individuals, or of group means. Fundamental, critical measure for most statistical methods.
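
The 1.25 factor can be checked numerically. The sketch below simulates bell-shaped data and compares the SD with the average absolute deviation. The mean of 60, SD of 10, sample size, and seed are arbitrary choices, not values from the slides.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable
values = [random.gauss(60.0, 10.0) for _ in range(20_000)]

center = mean(values)
avg_abs_dev = mean(abs(v - center) for v in values)  # average |deviation| from the mean
sd = stdev(values)

# For normally distributed data the ratio is close to sqrt(pi/2), about 1.25.
ratio = sd / avg_abs_dev
print(round(ratio, 2))
```

For skewed or heavy-tailed data the ratio can differ noticeably from 1.25, which is one reason the approximation is tied to bell-shaped distributions.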

Examples: Mean and SD. A: Mean = 60.6 min, SD = 9.6 min. B: Mean = 15.1, SD = 2.8. Note that the entire range of data in A is about 6 SDs wide; this is the source of the “Six Sigma” process used in quality control and business.

Examples: Mean and SD. Skewed: Mean = 1.0 min, SD = 1.1 min. Multi-peak: Mean = 70.3, SD = 22.3.

Summary Statistics: Rule of Thumb. For bell-shaped distributions of data (“normally” distributed):
~68% of values are within mean ±1 SD
~95% of values are within mean ±2 SD, the “(Normal) Reference Range”
~99.7% of values are within mean ±3 SD
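
These percentages follow from the normal distribution itself. As a small check, the sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `fraction_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def fraction_within(k):
    """Fraction of a normal distribution lying within mean ± k SD."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * fraction_within(k), 1))
```

The exact 95% multiplier is about 1.96 rather than 2, so ±2 SD covers slightly more than 95% of a normal distribution.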

Summary Statistics: Geometric Means. Commonly used for skewed data.
1. Take logs of individual values.
2. Find, say, mean ±2 SD → mean and (low, up) of the logged values.
3. Find antilogs of mean, low, up. Call them GM, low2, up2 (back on original scale).
4. GM is the “geometric mean”. The interval (low2, up2) is skewed about GM (corresponds to graph). [See next slide]
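
The steps above can be sketched in a few lines of Python. The data values and the function name `geometric_summary` are hypothetical. Natural logs are used here, although any base gives the same back-transformed results.

```python
import math
from statistics import mean, stdev

def geometric_summary(values, k=2.0):
    """Return (GM, low2, up2): antilogs of mean and mean ± k*SD of the logged values."""
    logs = [math.log(v) for v in values]   # step 1: take logs (values must be > 0)
    m, s = mean(logs), stdev(logs)         # step 2: mean and SD on the log scale
    low, up = m - k * s, m + k * s
    return math.exp(m), math.exp(low), math.exp(up)  # step 3: back to original scale

# Hypothetical right-skewed measurements.
data = [5, 8, 12, 20, 35, 60, 110, 250, 900]
gm, low2, up2 = geometric_summary(data)
print(round(gm, 1), round(low2, 1), round(up2, 1))
```

Because the interval is symmetric on the log scale, it is symmetric in ratio terms on the original scale: up2/GM equals GM/low2, which is why it looks skewed about GM.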

Geometric Means. These are flipped histograms rotated 90°, with box plots. Any log base can be used.
GM = exp(4.633) ≈ 102.8
low2 = exp(4.633 - 2*1.09) ≈ 11.6
upp2 = exp(4.633 + 2*1.09) ≈ 909.6

Confidence Intervals. Reference ranges - or Prediction Intervals - are for individuals. They contain the values for 95% of individuals. Confidence intervals (CI) are for a summary measure (parameter) for an entire population. A CI contains the (still unknown) summary measure for “everyone” with 95% certainty.

Z-Score = (Measure - Mean)/SD. Original scale: Mean = 60.6 min, SD = 9.6 min. Z-Score = (Time - 60.6)/9.6, which has Mean = 0 and SD = 1. Standardize a measure to have mean=0 and SD=1. Z-scores make different measures comparable.
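
A small Python sketch of the standardization, using the mean and SD quoted on the slide. The three times are illustrative inputs, not data from the study.

```python
def z_score(measure, mean, sd):
    """Standardize a measurement: its distance from the mean in SD units."""
    return (measure - mean) / sd

# Mean = 60.6 min and SD = 9.6 min are from the slide; the times are made up.
for time_min in (51.0, 60.6, 79.8):
    print(time_min, round(z_score(time_min, 60.6, 9.6), 2))
```

A time equal to the mean maps to 0, and a time two SDs above the mean maps to 2, regardless of the original units.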

Outcome Measure in Case Study: GHA = Global Hyperactivity Aggregate. For each child at each time:
Z1 = Z-Score for ADHD from Teachers
Z2 = Z-Score for WWP from Parents
Z3 = Z-Score for ADHD in Classroom
Z4 = Z-Score for Conner on Computer
All have higher values ↔ more hyperactive. Z’s make each measure scaled similarly. GHA = Mean of Z1, Z2, Z3, Z4

Confidence Interval for Population Mean. The 95% reference range (or prediction interval, or “normal range” if subjects are normal) is sample mean ± 2(SD). The 95% confidence interval (CI) for the (true, but unknown) mean for the entire population is sample mean ± 2(SD/√N). SD/√N is called the “Std Error of the Mean” (SEM).
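
To make the contrast concrete, the sketch below computes both intervals from one hypothetical sample with the ±2 rule of thumb. The sample values and the function name `reference_range_and_ci` are assumptions for illustration only.

```python
import math
from statistics import mean, stdev

def reference_range_and_ci(values, multiplier=2.0):
    """Return ((ref_low, ref_high), (ci_low, ci_high)) using mean ± multiplier * SD or SEM."""
    m, sd = mean(values), stdev(values)
    sem = sd / math.sqrt(len(values))  # standard error of the mean
    ref = (m - multiplier * sd, m + multiplier * sd)    # for individuals
    ci = (m - multiplier * sem, m + multiplier * sem)   # for the population mean
    return ref, ci

# Hypothetical, roughly bell-shaped sample of 16 values.
sample = [52, 55, 57, 58, 60, 60, 61, 62, 63, 65, 66, 68, 70, 71, 59, 61]
ref, ci = reference_range_and_ci(sample)
print(ref, ci)
```

The CI is narrower than the reference range by a factor of √N (here 4), and it keeps shrinking as N grows, whereas the reference range does not.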

Confidence Interval: More Details. The confidence interval (CI) for the (true, but unknown) mean for the entire population is:
95%, N=100: sample mean ± 1.98(SD/√N)
95%, N=30: sample mean ± 2.05(SD/√N)
90%, N=100: sample mean ± 1.66(SD/√N)
99%, N=100: sample mean ± 2.63(SD/√N)
If N is small (N<30?), the data need to be normally distributed (bell-shaped). Otherwise, skewness is OK. This is not true for the PI, where percentiles are needed.

Confidence Interval: Case Study. Confidence interval (Table 2): sample mean ± 1.99(1.04/√73) = sample mean ± 0.24, with upper limit 0.10. Prediction interval: sample mean ± 1.99(1.04) = sample mean ± 2.07. Adjusted CI close to

CI for the Antibody Example
PI: GM = exp(4.633) ≈ 102.8; low2 = exp(4.633 - 2*1.09) = 11.6; upp2 = exp(4.633 + 2*1.09) = 909.6. So, there is 95% assurance that an individual is between 11.6 and 909.6, the PI.
CI: GM = exp(4.633) ≈ 102.8; low2 = exp(4.633 - 2*1.09/√394) = 92.1; upp2 = exp(4.633 + 2*1.09/√394) = 114.8. So, there is 95% certainty that the population (geometric) mean is between 92.1 and 114.8, the CI.
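
The arithmetic behind these limits can be reproduced from the three summary numbers on the slide (log-scale mean 4.633, log-scale SD 1.09, N = 394). The multiplier of 2 is assumed from the ±2 SD rule used earlier in the session, and the helper name is illustrative.

```python
import math

LOG_MEAN = 4.633  # mean of the logged values (from the slide)
LOG_SD = 1.09     # SD of the logged values (from the slide)
N = 394           # sample size (from the slide)

def back_transformed_interval(log_mean, half_width):
    """Antilog of log_mean ± half_width, returned on the original scale."""
    return math.exp(log_mean - half_width), math.exp(log_mean + half_width)

gm = math.exp(LOG_MEAN)
# Prediction interval: ±2 SD on the log scale (assumed multiplier of 2).
pi_low, pi_high = back_transformed_interval(LOG_MEAN, 2 * LOG_SD)
# Confidence interval: ±2 SEM on the log scale.
ci_low, ci_high = back_transformed_interval(LOG_MEAN, 2 * LOG_SD / math.sqrt(N))

print(round(gm, 1), round(pi_low, 1), round(pi_high, 1), round(ci_low, 1), round(ci_high, 1))
```

The rounded results agree with the 11.6, 909.6, 92.1, and 114.8 quoted on the slide, which supports reading both intervals as ±2 on the log scale.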

Summary Statistics: Two Variables (Correlation). Always look at the scatterplot. Correlation, r, ranges from -1 (perfect inverse relation) to +1 (perfect direct relation). Zero = no linear relation. It is specific to the ranges of the two variables; typically, you cannot extrapolate to populations with other ranges. It measures association, not causation. We will examine details in Session 5.

Correlation Depends on Range of Data. Graph B contains only the points from graph A that are in the ellipse. Correlation is reduced in graph B. Thus: correlation between two quantities may be quite different in different study populations.
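
A simulated illustration of this effect follows. As a simplification, the sketch restricts the x range instead of selecting points inside an ellipse, and all data are randomly generated with arbitrary settings. None of it comes from the graphs on the slide.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)  # fixed seed so the sketch is repeatable
x_all = [random.uniform(0, 10) for _ in range(2000)]
y_all = [x + random.gauss(0, 2) for x in x_all]  # same underlying relation everywhere

# Keep only the subjects whose x value falls in a narrow range.
restricted = [(x, y) for x, y in zip(x_all, y_all) if 4 <= x <= 6]
x_sub = [x for x, _ in restricted]
y_sub = [y for _, y in restricted]

r_full = pearson_r(x_all, y_all)
r_sub = pearson_r(x_sub, y_sub)
print(round(r_full, 2), round(r_sub, 2))
```

The relation between x and y is identical in both sets; only the spread of x differs, and that alone lowers r in the restricted set.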

Correlation and Measurement Precision. A lack of correlation for the subpopulation with 5<x<6 may be due to inability to measure x and y well. Lack of evidence of association is not evidence of lack of association. (Figure: panels A and B; r=0 for the subpopulation.)

Summary Statistics: Survival Probability. Example: 100 subjects start a study. Nine subjects drop out at 2 years and 7 drop out at 4 years; 20, 20, and 17 die in the intervals 0-2, 2-4, and 4-5 years. Then the 0-2 year interval has 80/100 surviving, the 2-4 interval has 51/71 surviving, and the 4-5 interval has 27/44 surviving. So, the 5-year survival probability is (80/100)(51/71)(27/44) ≈ 0.35. We don’t know the vital status of 16 subjects at 5 years. The actual method uses finer subdivisions than 0-2, 2-4, 4-5 years, with exact death times.
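
The interval-by-interval product in this example can be written out as a short Python sketch. The counts are those given in the example, and the function name `survival_probability` is illustrative. This is the coarse three-interval version, not a full Kaplan-Meier estimator with exact death times.

```python
def survival_probability(intervals):
    """Multiply interval survival fractions; each item is (at_risk_at_start, deaths)."""
    prob = 1.0
    for at_risk, deaths in intervals:
        prob *= (at_risk - deaths) / at_risk
    return prob

# Counts from the example: 100 start; 9 drop out at 2 years and 7 at 4 years.
intervals = [
    (100, 20),           # 0-2 yr: 80/100 survive
    (100 - 20 - 9, 20),  # 2-4 yr: 71 at risk, 51/71 survive
    (71 - 20 - 7, 17),   # 4-5 yr: 44 at risk, 27/44 survive
]
five_year = survival_probability(intervals)
print(round(five_year, 2))
```

Dropouts leave the "at risk" count at the start of the next interval without being counted as deaths, which is why the estimate differs from simply dividing the 27 known survivors by 100.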

Summary Statistics: Relative Likelihood of an Event. Compare groups A and B on mortality.
Relative Risk = Prob_A[Death] / Prob_B[Death], where Prob[Death] ≈ deaths per 100 persons.
Odds Ratio = Odds_A[Death] / Odds_B[Death], where Odds = Prob[Death] / Prob[Survival].
Hazard Ratio ≈ I_A[Death] / I_B[Death], where I = incidence = deaths per 100 person-days.
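
As a worked illustration of the three ratios, the sketch below uses made-up counts for groups A and B. All numbers and function names are hypothetical, and the incidence ratio is only the rough approximation to the hazard ratio given above.

```python
def risk(deaths, persons):
    """Probability of death: deaths per person."""
    return deaths / persons

def odds(deaths, persons):
    """Odds of death: Prob[Death] / Prob[Survival]."""
    p = risk(deaths, persons)
    return p / (1.0 - p)

def incidence(deaths, person_days):
    """Deaths per person-day of follow-up (the per-100 scaling cancels in a ratio)."""
    return deaths / person_days

# Hypothetical counts for the two groups.
a_deaths, a_persons, a_person_days = 20, 100, 30_000
b_deaths, b_persons, b_person_days = 10, 100, 33_000

relative_risk = risk(a_deaths, a_persons) / risk(b_deaths, b_persons)
odds_ratio = odds(a_deaths, a_persons) / odds(b_deaths, b_persons)
hazard_ratio_approx = incidence(a_deaths, a_person_days) / incidence(b_deaths, b_person_days)

print(round(relative_risk, 2), round(odds_ratio, 2), round(hazard_ratio_approx, 2))
```

With these counts the three ratios differ (2.0, 2.25, and 2.2), which shows that they answer related but not identical questions about the same deaths.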