Today’s Topics: 1. Introduction 2. Course Information and Schedule 3. Study Design 4. Looking at Data. Introduction to the Practice of Statistics, Ch. 1, 2.5, 3.2. MBP1010 – Jan. 5, 2010
Meaning from Data (1) How can we describe and draw meaning from a collection of data? (2) How can we infer information about the whole population when we know data from only some of the population (a sample)?
- science of understanding data and making decisions in the face of variability and uncertainty - statistics is NOT a field of mathematics
Statistical Thinking - humans are good at recognizing patterns, and there is a real danger of over-interpreting patterns that are merely due to the play of chance (false leads) - role of statistics: to reject chance as an explanation so that we can have reasonable assurance that the patterns seen are worthy of interpretation
Statistical Thinking - explore data prior to analysis - think about context and design - reasoning behind standard statistical methods Interpretation/Conclusions
Course Overview 1. Looking at data 2. Concepts of statistical inference and hypothesis testing 3. Specific statistical tests - 1 and 2 sample tests for continuous and categorical data - correlation, regression and ANOVA 4. Other topics - e.g. survival analysis, logistic regression 5. Bioinformatics
Course Information and Schedule Tutorials: Thursdays 2 to 3:30 pm OCI R Tutorials: Thurs Jan 7 and 14 (Part 1 and part 2) Lectures: Tuesdays 1 to 3 pm 620 University, 7-709
Study Design - Can what we eat influence our risk of cancer? The case of dietary fat and breast cancer. Posted on website: New York Times article, "Searching for clarity: A primer on medical studies"
What should we do next?
Observational Studies - An observational study observes individuals and measures variables of interest but does not attempt to influence the responses.
Case/control and cohort studies common in cancer research (epidemiology) - outcome is binary: cancer/ no cancer Observational studies often examine factors associated with continuous outcome variables - eg association of body weight or diet with hormone levels - calcium intake and blood pressure
Case Control Study [diagram: individuals marked X and 0; labelled Exposure, e.g. diet]
Cohort Study [diagram: individuals marked X and 0; labelled Exposure, e.g. diet, and Cancer (yes/no)]
Relative Risk (RR): compare risk of disease in those with highest versus lowest intake - RR = 1.0: no association - RR = 1.4: 1.4 times the risk, 40% higher risk - RR < 1.0: lower risk
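The relative risk comparison can be sketched in a few lines of Python. This is a minimal illustration, not the method of any study cited here: the function name `relative_risk` and the counts are hypothetical, chosen only so that the result matches the "40% higher risk" reading of RR = 1.4.

```python
def relative_risk(cases_high, total_high, cases_low, total_low):
    """Risk in the highest-intake group divided by risk in the lowest-intake group."""
    risk_high = cases_high / total_high
    risk_low = cases_low / total_low
    return risk_high / risk_low

# Hypothetical cohort counts (assumed for illustration, not from any cited study).
rr = relative_risk(cases_high=70, total_high=1000, cases_low=50, total_low=1000)

# Percent change in risk relative to the lowest-intake group.
percent_change = (rr - 1.0) * 100
print(f"RR = {rr:.2f} ({percent_change:+.0f}% risk versus lowest intake)")
```

An RR of 1.0 gives a 0% change (no association), and an RR below 1.0 gives a negative percent change (lower risk).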
a. Total Fat: Odds Ratio or Relative Risk [forest plot; studies listed on the axis]
Case Control: Challier (1998), DeStefani (1998), Ewertz (1990), Franceschi (1996), Graham (1982), Graham (1991), Hirohata (1985), Hirohata (1987) (Caucasian), Hirohata (1987) (Japanese), Ingram (1991), Katsouyanni (1988), Katsouyanni (1994), Landa (1994), Lee (1991), Levi (1993), Mannisto (1999), Martin-Moreno (1994), Miller (1978), Núñez (1996), Potischman (1998), Pryor (1989), Richardson (1991), Rohan (1988), Shun-Zhang (1990), Toniolo (1989), Trichopoulou (1995), van't Veer (1990, 1991), Wakai (2000), Witte (1997), Yuan (1995), Zaridze (1991); Case Control Summary
Cohort: Gaard (1995), Graham (1992), Holmes (1999), Howe (1991), Jones (1987), Knekt (1990), Kushi (1992), Thiébaut (2001), Toniolo (1994), van den Brandt (1993), Velie (2000), Wolk (1998); Cohort Summary
All Studies Summary
Bingham (2003), Cho (2003)
Interpretation Suppose we find that women who eat a low fat diet tend to have lower risk of breast cancer. Can we conclude that the fat in the diet is responsible for the lower risk of breast cancer? No. Other factors may be responsible for the association with dietary fat (confounding)
Problem of Confounding - Suppose A is associated with B. This may be because: - A causes B - B causes A - X is associated with both A and B (X need not be a cause of either A or B)
Problem of Confounding - In our dietary fat example: - women who eat more dietary fat may differ from those who eat less fat (e.g. weight, exercise, other dietary factors) - these factors may influence the risk of breast cancer
Trying to control for confounding - measure potential confounders eg. measure weight and physical activity -“control” for possible confounders in analysis - but…what about confounding with variables we don’t know exist or can’t measure?
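A small simulation may make confounding and "controlling" for it concrete. This is a sketch under stated assumptions: a single measured confounder X (think of body weight) drives both the exposure A and the outcome B, A has no effect on B, and all relationships are linear. The helper names `pearson` and `residuals` are illustrative only.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def residuals(ys, xs):
    """What is left of y after removing a least-squares straight-line fit on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(1)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]   # confounder (assumed measured)
a = [xi + rng.gauss(0, 1) for xi in x]    # exposure: depends on X only
b = [xi + rng.gauss(0, 1) for xi in x]    # outcome: depends on X only, not on A

crude = pearson(a, b)                                  # association ignoring X
adjusted = pearson(residuals(a, x), residuals(b, x))   # association after adjusting for X

print(f"crude correlation of A and B:    {crude:.2f}")
print(f"correlation after adjusting for X: {adjusted:.2f}")
```

The crude association is clearly positive even though A does not cause B, and it largely disappears after adjustment. The adjustment only works here because X was measured, which is exactly the limitation raised above for unknown or unmeasurable confounders.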
Observational Studies - An observational study observes individuals and measures variables of interest but does not attempt to influence the responses. - An association between a variable and a response variable, even if it is very strong, is not good evidence of a cause-and-effect link between the variables. - Correlation is not causation
Randomized Experiments - impose treatment and observe response - subjects/animals randomly assigned to treatments and control - randomization should result in groups that are similar with respect to any possible confounding variables - difference in outcome must be due to treatment (OR the play of chance in random assignment)
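Random assignment itself is simple to carry out. The following is a minimal sketch, assuming two equal-sized groups; the function name `randomize`, the animal labels and the seed are hypothetical.

```python
import random
from collections import Counter

def randomize(subjects, groups=("treatment", "control"), seed=None):
    """Shuffle the subjects, then deal them out to the groups in turn."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    return {s: groups[i % len(groups)] for i, s in enumerate(shuffled)}

# Hypothetical experiment with 20 animals.
animals = [f"rat{i:02d}" for i in range(1, 21)]
assignment = randomize(animals, seed=2010)

group_sizes = Counter(assignment.values())
print(group_sizes)
```

Because chance alone decides who gets which treatment, any characteristic of the animals, measured or not, is expected to be balanced across groups on average.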
Basic principles of experimental design 1.Formulate question/goal in advance 2. Comparison/control 3. Replication 4. Randomization 5. Stratification (or blocking) 6. Factorial experiments
Replication
Randomized Design - Dietary fat and mammary tumors in Sprague-Dawley rats (n=30 per diet group). Jackson et al., Nutr. Cancer, 1998
Stratification - Suppose that some measurements will be made in males and females AND you anticipate a difference in responses between males and females - Randomize within males and females separately - any systematic difference by sex is removed - this is sometimes called “blocking” - Take account of the difference between males and females in the analysis - helps control variability
Randomization and stratification - If you can (and want to), fix a variable, e.g., study only men or women or a single strain of animal - If you don’t fix a variable, stratify on it, e.g., randomize treatment separately within men and women - If you can neither fix nor stratify a variable, randomize to treatment.
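Stratified (blocked) randomization can be sketched as randomizing separately inside each stratum. This is an illustration only: the function name `stratified_randomize`, the subject records and the seed are assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_randomize(subjects, stratum_of, groups=("treatment", "control"), seed=None):
    """Randomize to groups separately within each stratum (block)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in subjects:
        strata[stratum_of(s)].append(s)
    assignment = {}
    for stratum in sorted(strata):
        members = strata[stratum]
        rng.shuffle(members)                 # random order within the stratum
        for i, s in enumerate(members):
            assignment[s] = groups[i % len(groups)]
    return assignment

# Hypothetical subjects: (sex, id) pairs, six females and six males.
subjects = [("F", i) for i in range(6)] + [("M", i) for i in range(6)]
assignment = stratified_randomize(subjects, stratum_of=lambda s: s[0], seed=1)

# Each sex contributes equally to both arms.
balance = Counter((sex, group) for (sex, _), group in assignment.items())
print(balance)
```

Compared with simple randomization, this guarantees balance on the stratifying variable instead of leaving it to chance.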
Factorial Experiment - Dietary fat and fiber and mammary tumors in Sprague-Dawley rats (n=30)
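In a factorial experiment every level of one factor is combined with every level of the other. A minimal sketch, assuming two levels per factor (the slide does not state the levels actually used in the rat study):

```python
from itertools import product

# Assumed levels, for illustration only.
fat_levels = ("low fat", "high fat")
fiber_levels = ("low fiber", "high fiber")

# Every fat level crossed with every fiber level: 2 x 2 = 4 diet groups.
diets = list(product(fat_levels, fiber_levels))
for fat, fiber in diets:
    print(f"{fat} + {fiber}")
```

Crossing the factors lets the same animals inform both the fat comparison and the fiber comparison, and shows whether the effect of one depends on the level of the other.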
Randomized Clinical Trials in Humans - Dietary Fat and Breast Cancer. Diet and Breast Cancer Prevention Study: 4793 high risk women followed for 7-17 years (not yet published). Women’s Health Initiative (US): 48,835 postmenopausal women followed for 8-12 years, reported in 2006.
Eligible Subjects Identified (> 50% density) -> Prerandomization Assessment -> Intervention (n=2,343) or Control (n=2,350) -> Annual Visits: demo/anthro data, diet records, non-fasting serum -> Follow up until Dec 2005 (7-17 years per subject): breast cancer incidence
Women’s Health Initiative - Postmenopausal women (50-79 years of age) - n=48,835; follow-up 8-12 years - randomized 40:60 intervention and control - group dietary counselling - follow up for breast cancer
Kaplan-Meier Estimates of the Cumulative Hazard for Invasive Breast Cancer [figure]. Prentice, R. L. et al. JAMA 2006;295:
Randomized Clinical Trials in Humans - Practical issues: - long (particularly for cancer outcomes!) - expensive - limited in “treatment” options
Randomized Clinical Trials in Humans - Other issues: - highly selected subjects - selection criteria and motivation - subject/investigator blinding - subjects drop out - compliance?
Main Points - primary interest is causal relationships between variables - observational studies show associations only - randomized studies best for causation but are not without challenges - totality of evidence important
What’s in the dataset? What are the observations (individuals)? E.g. people, animals, cells, countries. How many observations are in the dataset? How many observations should there be? Are the observations independent? - repeated in an individual?
What’s in the dataset? What are the variables? What is their exact definition? How were they measured? What are the units of measurement? What type of variables?
Main Types of Variables Categorical: - include nominal and dichotomous variables - qualitative difference between values - eg sex (male/female), smoker/non smoker Continuous: - quantitative - equal distance between each value - eg blood pressure, age, dietary fat Ordinal variables can be ordered but they do not have specific numeric values, eg scales, ratings
Continuous Variables
Stem and Leaf Plots - displays distribution of small/moderate amounts of data - includes the actual numerical values - Stem (all but last digit), Leaf (last digit) - Example data: blood pressure in 21 patients [stem plot; surviving entries: 9 : 8 and a final stem with leaves 0 6]
Stem and Leaf Plot, Blood Pressure Data - 1. Write the stems (9, 10, 11, 12, 13) 2. Add leaves 3. Order leaves [stem plots for each step; surviving entries: 9 : 8 and a final stem with leaves 0 6]
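The three steps (stems, add leaves, order leaves) can be sketched as follows. The function name `stem_and_leaf` is illustrative, and the values are hypothetical: most of the original 21 blood pressure readings did not survive in these notes, so they are not reproduced here. The sketch assumes non-negative whole numbers.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem plot lines: stem = all but the last digit, leaf = last digit."""
    leaves = defaultdict(list)
    for v in sorted(values):            # sorting first leaves each row in order
        leaves[v // 10].append(v % 10)
    lines = []
    # Include every stem between the smallest and largest, even if it has no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem:>3} : {row}")
    return lines

# Hypothetical blood pressure values (not the course data set).
bp = [112, 98, 130, 105, 121, 118, 102, 136, 110, 127, 113, 107, 124, 115]
plot = stem_and_leaf(bp)
print("\n".join(plot))
```

Unlike a histogram, each printed row still shows the individual data values.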
Frequency Histograms - like a stem plot but leaves (individual data points) are not distinguished - usually plotted horizontally - How to make a histogram? 1. Divide data into classes of equal width. 2. Count the number in each class. 3. Plot bars with heights proportional to number or percent of data points in each interval.
Similarity of Histogram and Stem Leaf Plot - Blood Pressure Data: n = 21 measurements [figure: histogram beside the stem plot]
Effect of Using Different Intervals Blood Pressure Data: n= 21 measurements
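The histogram recipe, and the effect of changing the interval width, can be sketched without any plotting library. The function name `histogram_counts`, the class boundaries and the values are hypothetical; the original blood pressure readings are not reproduced.

```python
def histogram_counts(values, start, width, n_classes):
    """Count values falling in n_classes equal-width classes beginning at start.

    Class k covers start + k*width up to, but not including, start + (k+1)*width.
    """
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

# Hypothetical blood pressure values (not the course data set).
bp = [112, 98, 130, 105, 121, 118, 102, 136, 110, 127, 113, 107, 124, 115]

narrow = histogram_counts(bp, start=90, width=10, n_classes=5)
wide = histogram_counts(bp, start=90, width=20, n_classes=3)

for label, width, counts in (("width 10", 10, narrow), ("width 20", 20, wide)):
    print(label)
    for k, c in enumerate(counts):
        lower = 90 + k * width
        print(f"  {lower:>3}+ {'#' * c}")   # bar length proportional to the count
```

The same data look more or less detailed depending on the interval width, which is why it is worth trying more than one.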
Describing Distributions with Numbers
Blood Pressure Data: n = 21 measurements - mean = 2395/21 = 114 - median = the 11th of the 21 ordered observations
Mean versus Median - skewed data [stem plot with stems 0 to 6; highest observation 62] - Mean = 16.7, Median = 11 - Remove highest observation (62): mean = 14.1, median = 10
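The sensitivity of the mean, and the resistance of the median, to one extreme value can be checked directly. The values below are hypothetical (the leaves of the original stem plot are mostly lost), so the numbers differ from those on the slide; only the pattern is the point.

```python
from statistics import mean, median

# Hypothetical right-skewed data with one large value (not the slide's data).
values = [2, 5, 7, 9, 10, 11, 12, 15, 20, 62]
without_max = sorted(values)[:-1]      # drop the highest observation

print(f"all data:        mean = {mean(values):.1f}, median = {median(values)}")
print(f"without highest: mean = {mean(without_max):.1f}, median = {median(without_max)}")
```

Dropping the single largest value moves the mean by several units but barely changes the median.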
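BP data; n = 10 [figure labelled Min, Q1, Median, Q3, Max]
[boxplot labelled 75% quantile, Median, 25% quantile, IQR, 1.5 x IQR] Everything more than 1.5 x IQR above the 75% quantile or below the 25% quantile is considered an outlier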
Measures of Spread - range of data set: largest - smallest value - interquartile range (IQR): 3rd minus 1st quartile - sample variance and standard deviation
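These measures of spread are all available from Python's standard library. A minimal sketch with hypothetical values; note that quartiles can be computed by several slightly different conventions, and `statistics.quantiles` with its default method is only one of them, so other software may report slightly different quartiles.

```python
from statistics import quantiles, stdev, variance

# Hypothetical data, chosen so the arithmetic is easy to follow (mean = 5).
values = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(values) - min(values)     # largest minus smallest value
q1, q2, q3 = quantiles(values, n=4)        # quartiles (default "exclusive" method)
iqr = q3 - q1                              # 3rd minus 1st quartile
s2 = variance(values)                      # sample variance (divides by n - 1)
s = stdev(values)                          # sample standard deviation

print(f"range = {data_range}, IQR = {iqr}")
print(f"variance = {s2:.3f}, standard deviation = {s:.3f}")
```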
Deviation from the Mean
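For reference, the standard definitions built on deviations from the mean are written out below. The notation ($x_i$ for the observations, $\bar{x}$ for the sample mean, $s^2$ and $s$ for the sample variance and standard deviation) is the usual textbook notation and is assumed here, since the slide's own figure did not survive.

```latex
\[
  \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i ,
  \qquad
  \sum_{i=1}^{n} (x_i - \bar{x}) = 0 ,
\]
\[
  s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 ,
  \qquad
  s = \sqrt{s^2} .
\]
```

Because the deviations always sum to zero, they are squared before averaging; taking the square root returns the result to the original units of measurement.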
Extreme Observations or Outliers - rule of thumb: 1.5 x IQR for potential outliers - observations that stand apart from the overall pattern (not just extreme values) - do not automatically delete outliers; try to explain them: - an error in measurement or in recording data - an unusual occurrence - describe outliers, what you do with them and what their effect is
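The 1.5 x IQR rule of thumb can be sketched as a small function that flags, but does not delete, potential outliers. The function name `flag_outliers` and the values are hypothetical, and the quartiles come from `statistics.quantiles` with its default method, which is one of several conventions.

```python
from statistics import quantiles

def flag_outliers(values, k=1.5):
    """Return (lower fence, upper fence, flagged values) using the k x IQR rule."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return lower, upper, flagged

# Hypothetical measurements with one value standing apart from the rest.
values = [7.1, 8.0, 8.4, 9.2, 9.9, 10.3, 11.0, 11.5, 12.1, 19.8]
lower, upper, flagged = flag_outliers(values)

print(f"fences: {lower:.2f} to {upper:.2f}")
print(f"potential outliers: {flagged}")
```

A flagged value is a prompt to check the measurement and the data entry, and to run the analysis with and without it, not a licence to drop it.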
Energy expenditure in 29 women measured by doubly labelled water (MJ per day). 1.5 x IQR = 1.5 x 3.5 = 5.25 MJ, measured from the quartile at 11.46 MJ.
What did we do about the outlier? - checked recording/calculations/data entry - unusual occurrence? - biologically plausible? - re-measured laboratory samples - analysis with and without outlier - described all of the above in the paper
Data Relationships
Dietary fat intake in the intervention and control groups (n=150 intervention and 187 control) [figure: % Dietary Fat by group, Intervention and Control]
Dot Plot
How to Display Data Badly H Wainer (1984) How to display data badly. American Statistician 38(2): posted at website -use of Microsoft Excel and Powerpoint has resulted in remarkable advances in the field (of poor data display)
General principles - The aim of good data graphics: Display data accurately and clearly. Some rules for displaying data badly: - Display as little information as possible. - Obscure what you do show (with chart junk). - Use pseudo-3d and color gratuitously. - Make a pie chart (preferably in color and 3d). - Use a poorly chosen scale.
Pay attention to scale! Same data, different scale
Displaying data well Be accurate and clear. Let the data speak. – Show as much information as possible, taking care not to obscure the message. Science not sales. – Avoid unnecessary frills — esp. gratuitous 3d. In tables, every digit should be meaningful.
Further reading – Data Display ER Tufte (1983) The visual display of quantitative Information. (1990) Envisioning information. (1997) Visual explanations. WS Cleveland (1993) Visualizing data. Hobart Press. WS Cleveland (1994) The elements of graphing data. CRC Press.