Type 1 Error and Power Calculation for Association Analysis Pak Sham & Shaun Purcell Advanced Workshop Boulder, CO, 2005.

Slides:



Advertisements
Similar presentations
Planning breeding programs for impact
Advertisements

Association Tests for Rare Variants Using Sequence Data
Genetic research designs in the real world Vishwajit L Nimgaonkar MD, PhD University of Pittsburgh
Meta-analysis for GWAS BST775 Fall DEMO Replication Criteria for a successful GWAS P
Multiple Comparisons Measures of LD Jess Paulus, ScD January 29, 2013.
Basics of Linkage Analysis
Linkage Analysis: An Introduction Pak Sham Twin Workshop 2001.
Association Mapping David Evans. Outline Definitions / Terminology What is (genetic) association? How do we test for association? When to use association.
Ingredients for a successful genome-wide association studies: A statistical view Scott Weiss and Christoph Lange Channing Laboratory Pulmonary and Critical.
Multiple Testing, Permutation, False Discovery Benjamin Neale Pak Sham Shaun Purcell.
Power in QTL linkage: single and multilocus analysis Shaun Purcell 1,2 & Pak Sham 1 1 SGDP, IoP, London, UK 2 Whitehead Institute, MIT, Cambridge, MA,
More Powerful Genome-wide Association Methods for Case-control Data Robert C. Elston, PhD Case Western Reserve University Cleveland Ohio.
Sample size computations Petter Mostad
Differentially expressed genes
Association analysis Shaun Purcell Boulder Twin Workshop 2004.
Single nucleotide polymorphisms Usman Roshan. SNPs DNA sequence variations that occur when a single nucleotide is altered. Must be present in at least.
PSY 1950 Confidence and Power December, Requisite Quote “The picturing of data allows us to be sensitive not only to the multiple hypotheses that.
Chi-square test Pearson's chi-square (χ 2 ) test is the best-known of several chi-square tests. It is mostly used to assess the tests of goodness of fit.
Review for Exam 2 Some important themes from Chapters 6-9 Chap. 6. Significance Tests Chap. 7: Comparing Two Groups Chap. 8: Contingency Tables (Categorical.
Statistical Power Calculations Boulder, 2007 Manuel AR Ferreira Massachusetts General Hospital Harvard Medical School Boston.
Shaun Purcell & Pak Sham Advanced Workshop Boulder, CO, 2003
Chi-Squared Test.
Lecture 5: Segregation Analysis I Date: 9/10/02  Counting number of genotypes, mating types  Segregation analysis: dominant, codominant, estimating segregation.
Essential Statistics in Biology: Getting the Numbers Right
IAP workshop, Ghent, Sept. 18 th, 2008 Mixed model analysis to discover cis- regulatory haplotypes in A. Thaliana Fanghong Zhang*, Stijn Vansteelandt*,
Introduction to Linkage Analysis Pak Sham Twin Workshop 2003.
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.
Lecture 19: Association Studies II Date: 10/29/02  Finish case-control  TDT  Relative Risk.
Experimental Design and Data Structure Supplement to Lecture 8 Fall
Quantitative Genetics. Continuous phenotypic variation within populations- not discrete characters Phenotypic variation due to both genetic and environmental.
Quantitative Genetics
Power of linkage analysis Egmond, 2006 Manuel AR Ferreira Massachusetts General Hospital Harvard Medical School Boston.
Jianfeng Xu, M.D., Dr.PH Professor of Public Health and Cancer Biology Director, Program for Genetic and Molecular Epidemiology of Cancer Associate Director,
Regression-Based Linkage Analysis of General Pedigrees Pak Sham, Shaun Purcell, Stacey Cherny, Gonçalo Abecasis.
Tutorial #10 by Ma’ayan Fishelson. Classical Method of Linkage Analysis The classical method was parametric linkage analysis  the Lod-score method. This.
1 B-b B-B B-b b-b Lecture 2 - Segregation Analysis 1/15/04 Biomath 207B / Biostat 237 / HG 207B.
Lecture 3: Statistics Review I Date: 9/3/02  Distributions  Likelihood  Hypothesis tests.
Population structure at QTL d A B C D E Q F G H a b c d e q f g h The population content at a quantitative trait locus (backcross, RIL, DH). Can be deduced.
A Transmission/disequilibrium Test for Ordinal Traits in Nuclear Families and a Unified Approach for Association Studies Heping Zhang, Xueqin Wang and.
Errors in Genetic Data Gonçalo Abecasis. Errors in Genetic Data Pedigree Errors Genotyping Errors Phenotyping Errors.
Epistasis / Multi-locus Modelling Shaun Purcell, Pak Sham SGDP, IoP, London, UK.
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
Association mapping for mendelian, and complex disorders January 16Bafna, BfB.
Practical With Merlin Gonçalo Abecasis. MERLIN Website Reference FAQ Source.
Linkage Disequilibrium Mapping of Complex Binary Diseases Two types of complex traits Quantitative traits–continuous variation Dichotomous traits–discontinuous.
Lecture 22: Quantitative Traits II
Lecture 23: Quantitative Traits III Date: 11/12/02  Single locus backcross regression  Single locus backcross likelihood  F2 – regression, likelihood,
Powerful Regression-based Quantitative Trait Linkage Analysis of General Pedigrees Pak Sham, Shaun Purcell, Stacey Cherny, Gonçalo Abecasis.
Linkage. Announcements Problem set 1 is available for download. Due April 14. class videos are available from a link on the schedule web page, and at.
A Quantitative Overview to Gene Expression Profiling in Animal Genetics Armidale Animal Breeding Summer Course, UNE, Feb Analysis of (cDNA) Microarray.
Hypothesis Testing. Statistical Inference – dealing with parameter and model uncertainty  Confidence Intervals (credible intervals)  Hypothesis Tests.
Efficient calculation of empirical p- values for genome wide linkage through weighted mixtures Sarah E Medland, Eric J Schmitt, Bradley T Webb, Po-Hsiu.
Hypothesis Testing and Statistical Significance
Boulder 2008 Benjamin Neale I have the power and multiple testing.
Increasing Power in Association Studies by using Linkage Disequilibrium Structure and Molecular Function as Prior Information Eleazar Eskin UCLA.
Association Mapping in Families Gonçalo Abecasis University of Oxford.
Power and Meta-Analysis Dr Geraldine M. Clarke Wellcome Trust Advanced Courses; Genomic Epidemiology in Africa, 21 st – 26 th June 2015 Africa Centre for.
Lecture 17: Model-Free Linkage Analysis Date: 10/17/02  IBD and IBS  IBD and linkage  Fully Informative Sib Pair Analysis  Sib Pair Analysis with Missing.
Power in QTL linkage analysis
Regression Models for Linkage: Merlin Regress
upstream vs. ORF binding and gene expression?
Genome Wide Association Studies using SNP
Hypothesis Testing: Hypotheses
I Have the Power in QTL linkage: single and multilocus analysis
Regression-based linkage analysis
Linkage in Selected Samples
Power to detect QTL Association
Statistical Power and Meta-analysis
Lecture 9: QTL Mapping II: Outbred Populations
Power Calculation for QTL Association
Presentation transcript:

Type 1 Error and Power Calculation for Association Analysis Pak Sham & Shaun Purcell Advanced Workshop Boulder, CO, 2005

Statistical Tests Standard test theory Type 1: Rejecting the null hypothesis when it is true (  ). Type 2: Not rejecting the null hypothesis when it is false (  ). Fix  (e.g. genome wide  of 0.05 for linkage). Optimise 1-  Gold standard: REPLICATION

Problem: Low Replication Rate Hirschhorn et al. 2002: Reviewed 166 putative single allelic association with 2 or more replication attempts: 6 reliably replicated (≥75% positive replications) 97 with at least 1 replication 63 with no subsequent replications Other such surveys have similar findings (Ioannidis 2003; Ioannidis et al. 2003; Lohmueller et al. 2003)

Reasons for Non-Replication The original finding is false positive Systematic bias (e.g. artefacts, confounding) Chance (type 1 error) The attempted replication is false negative Systematic bias (e.g. artifacts, confounding) Heterogeneity (population, phenotypic) Chance (inadequate power)

Type 1 Error Rate vs False Positive Rate Type 1 error rate = probability of significant result when there is no association False positive rate = probability of no association among significant results

Why so many false positives? Multiple testing Multiple studies Multiple phenotypes Multiple polymorphisms Multiple test statistics Not setting a sufficiently small critical p-value Inadequate Power Small sample size Small effect size  High false positive rate

Both error rates affect false positive rate 1000 Tests H0 H SNS S 990  10(1-  ) 1-  S  S

Multiple testing correction Bonferroni correction: Probability of a type 1 error among k independent tests each with type 1 error rate of   * = 1-(1-  ) k  k  Permutation Procedures Permute case-control status, obtain empirical distribution of maximum test statistic under null hypothesis

False Discovery Rate (FDR) Under H0: P-values should be distributed uniformly between 0 and 1. Under H1: P-values should be distributed near 0. Observed distribution of P-values is a mixture of these two distributions. FDR method finds a cut-off P-value, such that results with smaller P-values will likely (e.g. 95%) to belong to the H1 distribution.

False Discovery Rate (FDR) Ranked P-valueFDRRankFDR*Rank / / / / / / /70.05

Multi-stage strategies All SNPs SNSSample 1 Top ranking SNPs S NS Positive SNPs Sample 2

Meta-Analysis Combine results from multiple published studies to: enhance power obtain more accurate effect size estimates assess evidence for publication bias assess evidence for heterogeneity explore predictors of effect size

QuantitativeThresholdDiscrete Variance components TDT Case-control TDT High Low A n 1 n 2 a n 3 n 4 Aff UnAff A n 1 n 2 a n 3 n 4 Tr UnTr A n 1 n 2 a n 3 n 4 Tr UnTr A n 1 n 2 a n 3 n 4

Discrete trait calculation pFrequency of high-risk allele KPrevalence of disease R AA Genotypic relative risk for AA genotype R Aa Genotypic relative risk for Aa genotype N, ,  Sample size, Type I & II error rate

Risk is P(D|G) g AA = R AA g aa g Aa = R Aa g aa K = p 2 g AA + 2pq g Aa + q 2 g aa g aa = K / ( p 2 R AA + 2pq R Aa + q 2 ) Odds ratios (e.g. for AA genotype) = g AA / (1- g AA ) g aa / (1- g aa )

Need to calculate P(G|D) Expected proportion d of genotypes in cases d AA = g AA p 2 / (g AA p 2 + g Aa 2pq + g aa q 2 ) d Aa = g Aa 2pq / (g AA p 2 + g Aa 2pq + g aa q 2 ) d aa = g aa q 2 / (g AA p 2 + g Aa 2pq + g aa q 2 ) Expected number of A alleles for cases 2N Case ( d AA + d Aa / 2 ) Expected proportion c of genotypes in controls c AA = (1-g AA ) p 2 / ( (1-g AA ) p 2 + (1-g Aa ) 2pq + (1-g aa ) q 2 )

Full contingency table “A” allele“a” allele Case2N Case ( d AA + d Aa / 2 ) 2N Case ( d aa + d Aa / 2 ) Control2N Control ( c AA + c Aa / 2 )2N Control ( c aa + c Aa / 2 )

Incomplete LD Effect of incomplete LD between QTL and marker Aa Mpm 1 + δqm 1 - δ mpm 2 – δqm 2 + δ δ = D’ × D MAX D MAX = min{pm 2, qm 1 } Note that linkage disequilibrium will depend on both D’ and QTL & marker allele frequencies

Incomplete LD Consider genotypic risks at marker: P(D|MM) = [ (pm 1 + δ) 2 P(D|AA) + 2(pm 1 + δ)(qm 1 - δ) P(D|Aa) + (qm 1 - δ) 2 P(D|aa) ] / m 1 2 Calculation proceeds as before, but at the marker AM/AM AM/aM or aM/AM aM/aM AAMM AaMM aaMM Haplo. Geno. MM

Fulker association model sibship genotypic mean deviation from sibship genotypic mean The genotypic score (1,0,-1) for sibling i is decomposed into between and within components:

NCPs of B and W tests Approximation for between test Approximation for within test Sham et al (2000) AJHG 66

GPC Usual URL for GPC Purcell S, Cherny SS, Sham PC. (2003) Genetic Power Calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics, 19(1):149-50

Exercise 1: Candidate gene case-control study Disease prevalence 2% Multiplicative model genotype risk ratio Aa = 2 genotype risk ratio AA = 4 Frequency of high risk disease allele = 0.05 Frequency of associated marker allele = 0.1 Linkage disequilibrium D-Prime = 0.8 Sample size: 500 cases, 500 controls Type 1 error rate: 0.01 Calculate Parker allele frequencies in cases and controls NCP, Power

Exercise 2 For a discrete trait TDT study Assumptions same models as in Exercise 1 Sample size: 500 parent-offspring trios Type 1 error rate: 0.01 Calculate: Ratio of transmission of marker alleles from heterozygous parents NCP, Power

Exercise 3: Candidate gene TDT study of a threshold trait 200 affected offspring trios “Affection” = scoring > 2 SD above mean Candidate allele, frequency 0.05, assumed additive Type 1 error rate: 0.01 Desired power: 0.8 What is the minimum detectable QTL variance?

Exercise 4: An association study of a quantitative trait QTL additive variance 0.05, no dominance QTL allele frequency 0.1 Marker allele frequency 0.2 D-Prime 0.8 Sib correlation: 0.4 Type 1 error rate = Sample: 500 sib-pairs Find NCP and power for between-sibship, within-sibship and overall association tests. What is the impact of adding 100 sibships of size 3 on the NCP and power of the overall association test?

Exercise 5: Using GPC for case-control design Disease prevalence: 0.02 Assume multiplicative model genotype risk ratio Aa = 2 genotype risk ratio AA = 4 Frequency of high risk allele = 0.05 Frequency of marker allele = 0.05, D-prime =1 Find the type 1 error rates that correspond to 80% power 500 cases, 500 controls 1000 cases, 1000 controls 2000 cases, 2000 controls.

Exploring power of association using GPC Linkage versus association difference in required sample sizes for specific QTL size TDT versus case-control difference in efficiency? Quantitative versus binary traits loss of power from artificial dichotomisation?

Linkage versus association QTL linkage: 500 sib pairs, r=0.5 QTL association: 1000 individuals

Case-control versus TDT p = 0.1; RAA = RAa = 2

Quantitative versus discrete K=0.5K=0.2K=0.05 To investigate: use threshold-based association Fixed QTL effect (additive, 5%, p=0.5) 500 individuals For prevalence K Group 1 has N and T Group 2 has N and T

Quantitative versus discrete KT (SD)

Quantitative versus discrete

Incomplete LD what is the impact of D’ values less than 1? does allele frequency affect the power of the test? (using discrete case-control calculator) Family-based VC association: between and within tests what is the impact of sibship size? sibling correlation? (using QTL VC association calculator)

Incomplete LD Case-control for discrete traits DiseaseK = 0.1 QTLR AA = R Aa = 2p = 0.05 Marker1m = 0.05 D’ = { 1, 0.8, 0.6, 0.4, 0.2, 0} Marker2m = 0.25 D’ = { 1, 0.8, 0.6, 0.4, 0.2, 0} Sample250 cases, 250 controls

Incomplete LD Genotypic risk at marker1 (left) and marker2 (right) as a function of D’

Incomplete LD Expected likelihood ratio test as a function of D’

Family-based association Sibship type 1200 individuals, 600 pairs, 400 trios, 300 quads Sibling correlation r = 0.2, 0.5, 0.8 QTL (diallelic, equal allele frequency) 2%, 10% of trait variance

Between-sibship association

Within-sibship association

Total association