Statistics in Biomedical Research RISE Program 2012 Los Angeles Biomedical Research Institute at Harbor-UCLA Medical Center January 19, 2012 Peter D. Christenson.


Statistics in Biomedical Research We will not cover all of these slides in class. Skipped slides are side points. Complete slides available at:

This and other biostat talks posted

Scientific Decision Making Setting: Two groups of animals: one gets a new molecule you have created, the other group doesn't. Measure the relevant outcome in all animals. How do we decide if the molecule has an effect?

Balancing Risk: Business You have a vending machine business. The machines have a dollar bill reader. You can set the reader to be loose or strict. Set too strict → rejects valid bills, lose customers. Set too loose → accepts bogus bills, lose $. Need to balance both errors in some way.

Balancing Risk: U.S. Legal System Need to decide guilty or innocent. Jury or judge measures the degree of guilt. Setting the required degree high → frees suspects who are guilty. Setting it low → jails suspects who are innocent. Need to balance both errors in some way. Civil case: a lower degree is needed than in a criminal case.

Balancing Risk: Personal Investments have expected returns of 1% to 20%. As expected returns ↑, volatility ↑, chance of loss ↑. You choose your degree of risk. Setting degree high → chances of losing ↑. Setting degree low → chances of missing big $ ↑. Need to balance both errors in some way.

Balancing Risk: Scientific Research Perform an experiment on 20 mice to measure an effect. Only 20 mice may not be representative. Need to decide whether the observed effect is real or random. You choose a minimal degree of effect, call it Min. If effect > Min, then claim the effect is real. Setting the degree high → chance* of missing a real effect ↑. Setting the degree low → chance** of a false-positive result ↑. Need to balance both errors in some way. What would you want the chances for * and for ** to be?

Scientific Decision Making Setting: Two groups, one gets drug A, one gets placebo (B). Measure outcome. Subjects may respond very differently. How do we decide if the drug has “an effect”? Perhaps: Say yes if the mean outcome of those receiving drug is greater than the mean of the others? Or twice as great? Or the worst responder on drug was better than the best on placebo? Other?

Meaning or Randomness?

This is the goal of science in general. The role of statistics is to give an objective way to make those decisions.

Meaning or Randomness? Scientific inference: Perform experiment. Make a decision. Is it correct? Quantify chances that our decision is correct or not. Other areas of life: Suspect guilty? Nobel laureate's opinion? Make a decision. Is it correct? Usually cannot quantify.

Specialness of Scientific Research Scientific method: Assume the opposite of what we think. Design the experiment so that our opinions cannot influence the outcome. Say exactly how we will make a conclusion, i.e., how we will make a decision from the experiment. Tie our hands behind our back. Do the experiment. Make the decision. Find the chances (from calculations, not opinion) that we are wrong. Experimental conclusions are not expert opinion.

Decision Making We first discuss using a medical device to make decisions about a patient. These decisions could be right or wrong. We then make an analogy to using an experiment to make decisions about a scientific question. These decisions could be right or wrong.

Decision Making The next nine slides will make an analogy to how conclusions or decisions from experiments are made. The numbers are made-up. Mammograms are really better than this.

Decision Making: Diagnosis [Figure: mammogram spot darkness rated on a 0-10 scale, from 0 = definitely not cancer to 10 = definitely cancer.] How is the decision made for intermediate darkness? A particular woman with cancer may not have a 10. Another woman without cancer may not have a 0.

Decision Making: Overlap Intermediate darkness ↔ considerable overlap of CA and non-CA: [Figure: overlapping distributions of mammogram spot darkness for true non-CA patients and true CA patients; vertical axis = number of women; area under each curve = probability.]

Decision Making: Diagnosis Suppose a study found the mammogram rating (0-10) for 1000 women who definitely have cancer by biopsy (truth). Proportion of the 1000 women with spot darkness above each cutoff (these give the sensitivities on the next slide):

  ≥0: 1000/1000   >2: 990/1000   >4: 900/1000   >6: 600/1000   >8: 100/1000   >10: 0/1000

Use what cutoff?

Decision Making: Sensitivity

  Cutoff for Spot Darkness    Mammogram Sensitivity
  ≥0                          100%
  >2                           99%
  >4                           90%
  >6                           60%
  >8                           10%
  >10                           0%

Sensitivity = chances of correctly detecting disease. Why not just choose a low cutoff and detect almost everyone with disease?

Decision Making Continued Suppose a study found the mammogram rating (0-10) for 1000 women who definitely do NOT have cancer by biopsy (truth). Proportion of the 1000 women with spot darkness at or below each cutoff (these give the specificities on the next slide):

  <0: 0/1000   ≤2: 350/1000   ≤4: 700/1000   ≤6: 900/1000   ≤8: 950/1000   ≤10: 1000/1000

Use what cutoff?

Decision Making: Specificity

  Cutoff for Spot Darkness    Mammogram Specificity
  <0                            0%
  ≤2                           35%
  ≤4                           70%
  ≤6                           90%
  ≤8                           95%
  ≤10                         100%

Specificity = chances of correctly NOT detecting disease.

Decision Making: Tradeoff

  Cutoff    Sensitivity    Specificity
   0          100%            0%
   2           99%           35%
   4           90%           70%
   6           60%           90%
   8           10%           95%
  10            0%          100%

Choice of cutoff depends on whether the diagnosis is a screening or a final one. For example, cutoff = 6: call disease in 60% of women with it and in 10% of women without it.
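As a concrete check of the arithmetic, the short sketch below recomputes these sensitivities and specificities from counts of women above each cutoff. The counts are the illustrative ones implied by the two preceding slides (hypothetical data, not a real study).

```python
# Sensitivity and specificity at each cutoff, from the illustrative counts
# on the two preceding slides (hypothetical data, not a real study).
cutoffs = [0, 2, 4, 6, 8, 10]
ca_above = [1000, 990, 900, 600, 100, 0]    # of 1000 true CA patients, darkness above cutoff
nonca_above = [1000, 650, 300, 100, 50, 0]  # of 1000 true non-CA patients, darkness above cutoff

for c, ca, non in zip(cutoffs, ca_above, nonca_above):
    sensitivity = ca / 1000            # chance of correctly detecting disease when present
    specificity = (1000 - non) / 1000  # chance of correctly NOT detecting disease when absent
    print(f"cutoff {c:2d}: sensitivity {sensitivity:4.0%}, specificity {specificity:4.0%}")
```

Running it reproduces the table above, line by line.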

Graphical Representation of Tradeoffs Make the decision: if spot darkness > 6, decide CA; if spot darkness ≤ 6, decide not CA. [Figure: overlapping distributions of spot darkness for true non-CA and true CA patients, with the cutoff at 6; the area of the non-CA curve below the cutoff (\\\) = specificity = 90%, and the area of the CA curve above the cutoff (///) = sensitivity = 60%.]

Tradeoffs From a Stricter Cutoff [Figure: the same two distributions with the cutoff moved higher: specificity rises to 95%, but sensitivity falls to 10%.]

Decision Making for Diagnosis: Summary As sensitivity increases, specificity decreases, and vice versa. For a given diagnostic test, we cannot increase both sensitivity and specificity together. We now develop sensitivity and specificity for testing or deciding scientific claims. Analogy: True disease ↔ true claim, real effect. Decide disease ↔ decide the effect is real. But, unlike diagnosis, an experiment can increase both sensitivity and specificity together (by enlarging the study, as shown later).

Decision Making End of analogy. Back to our original problem in experiments.

Scientific Decision Making Setting: Two groups, one gets drug A, one gets placebo (B). Measure outcome. Subjects may respond very differently. How do we decide if the drug has “an effect”? Perhaps: Say yes if the mean outcome of those receiving drug is greater than the mean of the others? Or twice as great? Or the worst responder on drug was better than the best on placebo? Other?

Scientific Decision Making Setting: Two groups, one gets drug A, one gets placebo (B). Measure outcome. How do we decide if the drug has an effect? Perhaps: Say yes if the mean of those receiving drug is greater than the mean of the placebo group? Other decision rules? Let’s just try an arbitrary decision rule: Let Δ = Group A Mean minus Group B Mean Decide that A is effective if Δ>2. [Not just Δ>0.]

Eventual Graphical Representation Make the decision: if Δ > 2, then decide effective; if Δ ≤ 2, then decide not. Δ = Group A mean minus Group B mean. [Figure: two overlapping curves for Δ, one under “true no effect” (A = B) and one under “true effect” (A ≈ B + 2.2), with the cutoff at 2; the areas on either side of the cutoff correspond to roughly 90% and 60%.] 1. Where do these curves come from? 2. What are the consequences of using cutoff = 2?

Question 2 First 2. What are the consequences of using cutoff = 2? Answer: If the effect is real (A ≠ B), there is a 60% chance of deciding so [actually, if in particular A is 2.2 more than B]. This is the experiment's sensitivity, more often called power. If the effect is not real (A = B), there is a 90% chance of correctly deciding that. This is the experiment's specificity. More often, 100% minus the specificity is called the level of significance.
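Both of these chances can also be estimated by simulating the experiment many times. The sketch below assumes a per-subject SD and a group size that are not stated on the slides (chosen only for illustration), so its simulated power and specificity will not exactly match the 60% and 90% in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_group = 20     # assumed number of subjects per group (not given on the slides)
sd = 3.0             # assumed subject-to-subject SD (not given on the slides)
true_effect = 2.2    # "true effect" scenario on the slide: A exceeds B by 2.2
cutoff = 2.0         # decision rule: claim an effect if (mean A - mean B) > 2
n_sim = 100_000      # number of simulated experiments

def chance_of_claiming_effect(delta):
    """Fraction of simulated experiments with mean(A) - mean(B) above the cutoff."""
    a_means = rng.normal(delta, sd, size=(n_sim, n_per_group)).mean(axis=1)
    b_means = rng.normal(0.0, sd, size=(n_sim, n_per_group)).mean(axis=1)
    return np.mean(a_means - b_means > cutoff)

power = chance_of_claiming_effect(true_effect)   # sensitivity (power)
alpha = chance_of_claiming_effect(0.0)           # 1 - specificity (false-positive chance)
print(f"power ~ {power:.0%}, specificity ~ {1 - alpha:.0%}")
```

Changing the assumed SD, group size, or cutoff and re-running shows how the two chances trade off against each other.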

Question 2 Continued What if cutoff = 1 was used instead? If the effect is real (Δ = A − B = 2.2), there is about an 85% chance of deciding so: sensitivity ↑ (from 60%). If the effect is not real (Δ = A − B = 0), there is only about a 60% chance of correctly deciding so: specificity ↓ (from about 90%).

Typical Choice of Cutoff Δ = Group A Mean minus Group B Mean. Require specificity to be 95%. This means there is only a 5% chance of wrongly declaring an effect. → Need overwhelming evidence, beyond a reasonable (5%) doubt, to make a claim. [Figure: with this cutoff, specificity = 95% but power is only ~45%.]
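For completeness, if the no-effect curve for Δ is taken to be normal, the cutoff giving 95% specificity can be written down directly. This formula is not shown on the slides, so treat it as a sketch of the usual one-sided calculation (a two-sided convention would use 1.96 in place of 1.645).

```latex
% Cutoff c giving 95% specificity (5% one-sided false-positive chance),
% assuming the no-effect curve for \Delta is normal with mean 0:
\[
  \Pr(\Delta > c \mid \text{no effect}) = 0.05
  \quad\Longrightarrow\quad
  c = 1.645 \times \mathrm{SE}(\Delta),
  \qquad
  \mathrm{SE}(\Delta) = \mathrm{SD}\,\sqrt{\tfrac{1}{N_A} + \tfrac{1}{N_B}} .
\]
```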

Strength of the Scientific Method Scientists (and their journals and the FDA) require overwhelming evidence, beyond a reasonable (5%) doubt, not just a “preponderance of the evidence,” which would be specificity = 50%. So much stronger than expert opinion. [Figure: ~45% power, with only a 5% chance of a false-positive claim.] How can we increase power above this 45%, but maintain the chances of a false-positive conclusion at ≤5%? Are we just stuck knowing that many true conjectures will be thrown away as collateral damage to this rigor?

How to Increase Power How can we increase power above this 45%, but maintain the chances of a false positive conclusion at ≤5%? Are we just stuck with knowing that many true conjectures will be thrown away as collateral damage to this rigor? To answer this, we need to go into how the curves are made: So, we take a detour for the next 9 slides to show this.

Short Answer – Skip Next 8 Slides The curves are for the means of the groups, not individuals. The spread of a curve depends on the natural variability in subject response (SD) and the # of subjects (N). As N ↑, the spread ↓. Mean is less extreme with more subjects. [Figure: the same pair of curves for a smaller N (wide, overlapping) and a larger N (narrow, well separated).]

Back to Question 1 1. Where do the curves in the last figure come from? Answer: You specify three quantities: (1) where their peaks are (the experiment's detectable difference), and how wide they are (which is determined by (2) natural variation and (3) the # of subjects or animals or tissue samples, N). Those specifications give a unique set of “bell-shaped” curves. How?

A “Law of Large Numbers” Suppose individuals have values ranging from Lo to Hi, but the % with any particular value could be anything, say: [Figure: an arbitrary distribution of individual values (N = 1) between Lo and Hi; vertical axis = probability.] You choose a sample of 2 of these individuals, and find their average. What value do you expect the average to have?

A “Law of Large Numbers” In both cases, values near the center will be more likely: [Figure: distribution of the average of N = 2 individuals, between Lo and Hi; vertical axis = probability.] Now choose a sample of 4 of these individuals, and find their average. What value do you expect the average to have?

A “Law of Large Numbers” In both cases, values near the center will be more likely: [Figure: distribution of the average of N = 4 individuals, between Lo and Hi; vertical axis = probability.] Now choose a sample of 10 of these individuals, and find their average. What value do you expect the average to have?

A “Law of Large Numbers” In both cases, values near the center will be more likely: [Figure: distribution of the average of N = 10 individuals, between Lo and Hi; vertical axis = probability.] Now choose a sample of 50 of these individuals, and find their average. What value do you expect the average to have?

A “Law of Large Numbers” In both cases, values near the center will be more likely: [Figure: distribution of the average of N = 50 individuals, between Lo and Hi; vertical axis = probability.] A remarkable fact is that not only is the mean of the sample expected to be close to the mean of “everyone” if N is large enough, but we also know the exact probabilities of how close, and the shape of the curve.

Summary: Law of Large Numbers [Figure: distribution of individual values (N = 1), with spread SD, compared with the much narrower distribution of the value of the mean of N subjects (large N).] SD(Mean) = SD/√N. SD is about 1/6 of the total range. SD ≈ 1.25 × average deviation from the center.
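A quick simulation (a sketch, not part of the original slides) illustrates the SD(Mean) = SD/√N rule: even for a non-bell-shaped individual distribution, the spread of the sample mean shrinks like SD/√N.

```python
import numpy as np

rng = np.random.default_rng(1)
sd_individual = 1.0 / np.sqrt(12)   # SD of a single Uniform(0, 1) value

for n in (1, 2, 4, 10, 50):
    # 100,000 samples of size n from a flat (non-bell-shaped) uniform distribution
    means = rng.uniform(0, 1, size=(100_000, n)).mean(axis=1)
    print(f"N = {n:2d}:  observed SD of the mean = {means.std():.4f}  "
          f"(theory SD/sqrt(N) = {sd_individual / np.sqrt(n):.4f})")
```

The observed and theoretical spreads agree closely at every N, and a histogram of the means looks increasingly bell-shaped as N grows.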

Law of Large Numbers: Another View [Figure: the same distributions, rescaled.] You can make the range of possible values for a mean as small as you like by choosing a large enough sample. Also, the shape will always be a bell curve if the sample is large enough.

Scientific Decision Making So, where are we? We can now answer the basic dilemma we raised. Repeat earlier slide:

Strength of the Scientific Method Scientists (and their journals and the FDA) require overwhelming evidence, beyond a reasonable (5%) doubt, not just a “preponderance of the evidence,” which would be specificity = 50%. Similar to a US court of law. So much stronger than expert opinion. [Figure: with N = 50, ~45% power and only a 5% chance of a false-positive claim.] How can we increase power, but maintain the chances of a false-positive conclusion at ≤5%? Are we just stuck knowing that many true conjectures will be thrown away as collateral damage to this rigor?

Scientific Decision Making So, the answer is that by choosing N large enough, the mean has to fall in a small range. That narrows the curves, which in turn increases the chance that we will find the effect in our study, i.e., its power. The next slide shows this.

Fix the maximum chance of a false-positive claim at 5%. [Figure: three panels, each with specificity held at 95%: N = 50 gives 45% power, N = 75 gives 74% power, and N = 88 gives 80% power.] Find the N that gives the power you want.

Putting it All Together In many experiments, five factors are inter-related. Specifying four of these determines the fifth: 1. Study size, N. 2. Power; usually 80% to 90% is used. 3. Acceptable false-positive chance, usually 5%. 4. Magnitude of the effect to be detected (Δ). 5. Heterogeneity among subjects or units (SD). The next 2 slides show how these factors are typically examined, and easy software to do the calculations.

Quote from An LA BioMed Protocol Thus, with a total of the planned 80 subjects, we are 80% sure to detect (p<0.05) group differences if treatments actually differ by at least 5.2 mm Hg in MAP change, or by a mean 0.34 change in number of vasopressors.

Software for Previous Slide Pilot data: SD=8.19 for ΔMAP in 36 subjects. For p-value<0.05, power=80%, N=40/group, the detectable Δ of 5.2 in the previous table is found as:
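The 5.2 mm Hg detectable difference can be reproduced approximately by hand with the standard two-group formula. The sketch below is hypothetical code (not the pictured software) using a normal approximation, which gives about 5.1; the exact t-based calculation done by the software gives the 5.2 quoted in the protocol.

```python
from math import sqrt
from scipy.stats import norm

sd = 8.19        # pilot SD of the change in MAP (from the slide)
n_per_group = 40 # planned group size
alpha = 0.05     # two-sided false-positive chance
power = 0.80     # desired power

# Detectable difference: delta = (z_{1-alpha/2} + z_{power}) * SD * sqrt(2/n)
z_alpha = norm.ppf(1 - alpha / 2)   # 1.96
z_power = norm.ppf(power)           # 0.84
delta = (z_alpha + z_power) * sd * sqrt(2 / n_per_group)
print(f"detectable difference ~ {delta:.1f} mm Hg")
# prints ~5.1 (normal approximation); the exact t-based value is ~5.2
```

The same formula rearranged gives N for a specified detectable difference, which is how the five factors on the "Putting it All Together" slide are linked.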

Study Size: May Not Be Based on Power Precision refers to how well a measure is estimated. Margin of error = the ± value (half-width) of the 95% confidence interval (sorry – not discussed here). Smaller margin of error ←→ greater precision. To achieve a specified margin of error, solve the CI formula for N. Polls: N ≈ 1000 → margin of error on % ≈ 1/√N ≈ 3%. Pilot studies, Phase I, some Phase II: power not relevant; may have a goal of obtaining an SD for future studies.
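The poll example is just the usual confidence-interval formula for a proportion, at its widest when the true proportion is 50%; the arithmetic behind the ≈3% figure is sketched below.

```latex
% 95% margin of error for a percentage estimated from N respondents,
% at its largest when the true proportion p = 0.5:
\[
  \text{margin of error} \;=\; 1.96\sqrt{\frac{p(1-p)}{N}}
  \;\le\; \frac{1.96 \times 0.5}{\sqrt{N}} \;\approx\; \frac{1}{\sqrt{N}},
  \qquad
  N = 1000 \;\Rightarrow\; \approx 3\%.
\]
% Solving for N at a desired margin m: N \approx (1.96 \times 0.5 / m)^2,
% e.g. m = 3\% gives N \approx 1{,}070.
```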

Study Design Considerations Statistical Components of Protocols Target population / source of subjects. Quantification of aims, hypotheses. Case definitions, endpoints quantified. Randomization plan, if one will be used. Masking, if used. Study size: screen, enroll, complete. Use of data from non-completers. Justification of study size (power, precision, other). Methods of analysis. Mid-study analyses.

Resources, Software, and References

Professional Statistics Software [Screenshot: code/syntax entry window, stored accessible data, and package output.] Comprehensive, but steep learning curve: SAS, SPSS, Stata.

Microsoft Excel for Statistics Primarily for descriptive statistics. Limited output.

Typical Statistics Software Package [Screenshot: data in a spreadsheet; methods selected from menus; output appears after menu selection.] Cost: $100 – $500.

Free Statistics Software: Mystat

Free Study Size Software

This and other biostat talks posted

Recommended Textbook: Making Inference Design issues Biases How to read papers Meta-analyses Dropouts Non-mathematical Many examples

Thank You Nils Simonson, in Furberg & Furberg, Evaluating Clinical Research

Outline Meaning or randomness? Decisions, truth and errors. Sensitivity and specificity. Laws of large numbers. Experiment size and study power. Study design considerations. Resources, software, and references.