Ingredients for a successful genome-wide association studies: A statistical view Scott Weiss and Christoph Lange Channing Laboratory Pulmonary and Critical.


1 Ingredients for a successful genome-wide association studies: A statistical view
Scott Weiss and Christoph Lange
Channing Laboratory, Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital, Boston, Massachusetts
Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts

2 Overview:
–What are genome-wide association studies?
–What are the statistical requirements for a successful genome-wide association study?
–Sufficient sample sizes
–LD coverage
–Genotype quality
–Design of genome-wide association studies / Handling of the multiple testing problem

3 The human genome
–22 autosomes plus the sex chromosomes
–~30,000-50,000 genes
–~8,000,000 SNPs
How can we find disease genes?

4 The human genome
How can we find disease genes? Genotyping all loci is not possible (not yet!) => Utilization of 2 concepts:
1.) Linkage disequilibrium (LD): correlation of alleles at two loci
2.) Genetic association: a particular form of a DNA polymorphism occurs more frequently in subjects with a phenotype of interest

5 Genetic Association
DSL: disease susceptibility locus
DSL -> Disease Phenotype: test for genetic association between the phenotype and the DSL
Marker <-> DSL: LD / correlation
Marker -> Disease Phenotype: test for association between phenotype and marker locus

6 Genome-wide association study
Definition: Association analysis performed with a panel of polymorphic markers adequately spaced to capture most of the linkage disequilibrium information in the entire genome in the study population.
Usually: 100,000 SNPs or more
Human Genome -> Disease Phenotype? => Test for association
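The per-marker "test for association" can be illustrated with a minimal sketch. It assumes a simple allelic case/control test (Pearson chi-square on a 2x2 table of allele counts, 1 degree of freedom); the slides do not say which test statistic is used, and the function name and counts below are hypothetical.

```python
import math

def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square (1 df) and p-value for a 2x2 table of allele counts.

    case_a / case_b: counts of allele A / allele a among cases
    ctrl_a / ctrl_b: counts of allele A / allele a among controls
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    row_case = case_a + case_b
    row_ctrl = ctrl_a + ctrl_b
    col_a = case_a + ctrl_a
    col_b = case_b + ctrl_b
    chi2 = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / (
        row_case * row_ctrl * col_a * col_b
    )
    # For 1 degree of freedom the chi-square tail probability equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: 1,000 cases and 1,000 controls, i.e. 2,000 alleles per group.
chi2, p = allelic_chi_square(case_a=700, case_b=1300, ctrl_a=600, ctrl_b=1400)
print(f"chi-square = {chi2:.2f}, p-value = {p:.2e}")
```

In a genome-wide setting this test would be repeated for every marker, which is what creates the multiple testing problem discussed later in the talk.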

7 What are the statistical requirements for a successful genome-wide association study?
–Sufficient sample sizes
–LD coverage
–Genotyping quality
–Design of genome-wide association studies / Handling of the multiple testing problem

8 Sample size requirements:
DSL: disease susceptibility locus
DSL -> Disease Phenotype: test for genetic association between the phenotype and the DSL
Marker <-> DSL: LD / correlation
Marker -> Disease Phenotype: test for association between phenotype and marker locus
Sufficient statistical power is needed to detect the association

9 Example for required sample sizes
Required sample sizes to achieve 80% power in a case/control study for a significance level of 10^-7

Allele freq   Odds ratio 1.25   Odds ratio 1.5   Odds ratio 1.75
0.1           8,859             2,608            1,350
0.2           5,283             1,616            869
0.3           4,281             1,342            727
0.4           3,886             1,301            750
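A rough sketch of how such sample sizes can be approximated is given below. It assumes a two-sided allelic test of two proportions with a normal approximation and equal numbers of cases and controls. The slide does not state its genetic model or formula, so this sketch is not expected to reproduce the table exactly; `cases_needed` and its parameters are hypothetical names.

```python
import math
from statistics import NormalDist

def cases_needed(control_allele_freq, odds_ratio, alpha=1e-7, power=0.80):
    """Approximate number of cases (equal to the number of controls).

    Assumption: allelic two-proportion z-test, each subject contributes 2 alleles.
    """
    p0 = control_allele_freq
    # Allele frequency in cases implied by the allelic odds ratio.
    odds_cases = odds_ratio * p0 / (1.0 - p0)
    p1 = odds_cases / (1.0 + odds_cases)

    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided threshold
    z_beta = NormalDist().inv_cdf(power)

    p_bar = (p0 + p1) / 2.0
    numerator = (
        z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
        + z_beta * math.sqrt(p0 * (1.0 - p0) + p1 * (1.0 - p1))
    ) ** 2
    alleles_per_group = numerator / (p1 - p0) ** 2
    return math.ceil(alleles_per_group / 2.0)

for odds_ratio in (1.25, 1.5, 1.75):
    print(odds_ratio, cases_needed(0.1, odds_ratio))
```

The qualitative pattern matches the table: at a fixed allele frequency, the required sample size drops sharply as the odds ratio grows.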

10 What are the statistical requirements for a successful genome-wide association study?
–Sufficient sample sizes
–LD coverage
–Genotyping quality
–Design of genome-wide association studies / Handling of the multiple testing problem

11 Linkage disequilibrium (LD):
DSL: disease susceptibility locus
DSL -> Disease Phenotype: test for genetic association between the phenotype and the DSL
Marker <-> DSL: LD / correlation
Marker -> Disease Phenotype: test for association between phenotype and marker locus
The set of markers has to contain a marker that is “sufficiently” correlated with the DSL so that the genetic association at the DSL is also visible at the marker locus

12 Measures of genetic correlation between markers

Name                  Measure       Formula
Lewontin’s D’         D’            D_AB / D_max
Hill & Weir (1994)    r^2 or Δ^2    D_AB^2 / {p_A p_B (1-p_A)(1-p_B)}
Levin (1953)          δ             D_AB / {p_B p_ab}
Yule’s Q (1900)       Q, y          D_AB / {p_AB p_ab + p_Ab p_aB}
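As an illustration of the first two measures, the sketch below computes D, D’ and r^2 from the four haplotype frequencies of two biallelic loci. It assumes both loci are polymorphic and that the four frequencies sum to 1, and it reports D’ as an absolute value; the function name and example frequencies are hypothetical.

```python
def ld_measures(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, D', r^2) for two biallelic loci given haplotype frequencies."""
    p_A = p_AB + p_Ab          # frequency of allele A at the first locus
    p_B = p_AB + p_aB          # frequency of allele B at the second locus
    d = p_AB - p_A * p_B       # D_AB: deviation from independence

    # D_max depends on the sign of D.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    r2 = d * d / (p_A * p_B * (1 - p_A) * (1 - p_B))
    return d, d_prime, r2

# Hypothetical haplotype frequencies for AB, Ab, aB, ab.
d, d_prime, r2 = ld_measures(0.4, 0.1, 0.1, 0.4)
print(f"D = {d:.3f}, D' = {d_prime:.2f}, r^2 = {r2:.2f}")
```

Note that D’ can equal 1 while r^2 is well below 1 when the two loci have different allele frequencies, which is one reason r^2 is the measure used for marker selection on the next slide.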

13 The interpretation of r^2
r^2 N is the “effective sample size”: if a marker M and a causal gene G are in LD, then a study with N cases and controls that measures M (but not G) will have the same power to detect an association as a study with r^2 N cases and controls that directly measured G.
Goal: The markers that are genotyped should be selected so that they have high r^2 values (preferably at least 80%) with the markers that are not genotyped.
Good SNP selection will be key for the success of GWAs.
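The effective-sample-size rule amounts to simple arithmetic, sketched below. The function names are hypothetical; the example reuses the 2,608 figure from the sample-size table (allele frequency 0.1, odds ratio 1.5) and asks how many subjects are needed when only a marker with r^2 = 0.8 is genotyped.

```python
import math

def effective_sample_size(n_genotyped, r2):
    """Sample size at the causal locus that gives the same power as n at the marker."""
    return r2 * n_genotyped

def required_sample_size(n_direct, r2):
    """Sample size needed at the marker to match n_direct at the causal locus."""
    return math.ceil(n_direct / r2)

print(effective_sample_size(1500, 0.8))   # subjects "worth" of information at the DSL
print(required_sample_size(2608, 0.8))    # inflation by a factor of 1 / r^2
```

At r^2 = 0.8 the required sample size grows by 25%, which is why the slide asks for r^2 values of at least 80%.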

14 SNP Selection for GWA Studies
Really a challenge for industry development, not an investigator’s laboratory
However, need to select a panel with adequate LD coverage for the study population
Assessment of Illumina Sentrix HumanHap300 BeadChip (R. Lazarus)
–Studied LD coverage of ENCODE regions: ten 500 kb regions that were completely sequenced in HapMap in 60 CEPH parents
–Assessed LD coverage of 6226 common ENCODE-region SNPs (MAF > 0.1)
–Found maximum r^2 of each ENCODE SNP with a SNP on the HumanHap300 panel
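The coverage assessment described above can be summarized as the fraction of reference SNPs whose best proxy on the panel reaches a given r^2. The sketch below assumes the maximum r^2 per reference SNP has already been computed; the function name, the 0.8 threshold default, and the input values are hypothetical and are not results from the HumanHap300 assessment.

```python
def ld_coverage(max_r2_values, threshold=0.8):
    """Fraction of reference SNPs tagged by a panel SNP with r^2 >= threshold."""
    covered = sum(1 for r2 in max_r2_values if r2 >= threshold)
    return covered / len(max_r2_values)

# Hypothetical maximum r^2 of four reference SNPs with any SNP on the panel.
example_max_r2 = [1.0, 0.9, 0.85, 0.5]
print(ld_coverage(example_max_r2))
```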

16 Genotyping quality (QC):
DSL: disease susceptibility locus
DSL -> Disease Phenotype: test for genetic association between the phenotype and the DSL
Marker <-> DSL: LD / correlation
Marker -> Disease Phenotype: test for association between phenotype and marker locus
The genotype quality has to be sufficient so that the genetic association at the DSL is also visible at the marker loci that are in LD with the DSL.

17 For example, the dependence of the power of a GWA on the call rate
Scenario:
–Case/control study: 1,500 cases & controls
–Odds ratio: 1.5
–Overall significance level: 5%
–Adjustment for multiple comparisons: Bonferroni 5%/500,000 = 10^-7
=> Power as a function of allele frequency and call rates
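The Bonferroni adjustment in this scenario is a single division, sketched here with hypothetical helper names.

```python
import math

def bonferroni_threshold(overall_alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at overall_alpha."""
    return overall_alpha / n_tests

def is_significant(p_value, overall_alpha, n_tests):
    """True if a single p-value survives the Bonferroni correction."""
    return p_value <= bonferroni_threshold(overall_alpha, n_tests)

per_snp_alpha = bonferroni_threshold(0.05, 500_000)
print(per_snp_alpha)                          # per-SNP threshold of about 1e-7
print(is_significant(3e-8, 0.05, 500_000))    # a p-value below the threshold
print(is_significant(1e-5, 0.05, 500_000))    # a p-value above the threshold
```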

18 Power levels and avg number of false positives:
Avg call rate by genotype: 100%, 100%, 100%

Allele freq   Power   Avg # false positives
0.10          27%     0.16
0.20          71%     0.28
0.30          91%     0.26
0.40          93%     0.18

19 Power levels and avg number of false positives:
Avg call rate by genotype: 99%, 99%, 99%

Allele freq   Power   Avg # false positives
0.10          25%     902.36
0.20          67%     900.07
0.30          82%     907.72
0.40          89%     908.12

20 Power levels and avg number of false positives:
Avg call rate by genotype: 98%, 98%, 98%

Allele freq   Power   Avg # false positives
0.10          24%     2211.46
0.20          64%     2205.91
0.30          81%     2204.21
0.40          88%     2197.55

21 Power levels and avg number of false positives:
Avg call rate by genotype: 99%, 95%, 99%

Allele freq   Power   Avg # false positives
0.10          26%     3835.94
0.20          67%     3845.24
0.30          84%     3840.75
0.40          88%     3836.39

22 For example, the dependence of the power of a GWA on the call rate
Conclusion:
–Call rate has a moderate effect on power (for nearly perfect call rates)
–Call rate has a large effect on the number of false positives (for nearly perfect call rates)
–Situation even worse for multi-stage designs!
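The slides do not describe how the false positives were simulated. As one possible mechanism, the sketch below shows how genotype-specific call rates shift the allele frequency observed among called genotypes. It assumes Hardy-Weinberg proportions and call rates given in the order (AA, Aa, aa); if such a shift differed between cases and controls, it could mimic an association. The function name and setup are hypothetical.

```python
def observed_allele_frequency(p, call_rates):
    """Expected frequency of allele A among successfully called genotypes.

    p: true frequency of allele A (Hardy-Weinberg proportions assumed)
    call_rates: (rate_AA, rate_Aa, rate_aa), each between 0 and 1
    """
    q = 1.0 - p
    rate_AA, rate_Aa, rate_aa = call_rates
    called_AA = p * p * rate_AA
    called_Aa = 2.0 * p * q * rate_Aa
    called_aa = q * q * rate_aa
    total_called = called_AA + called_Aa + called_aa
    # Each AA genotype carries two A alleles, each Aa genotype one.
    return (called_AA + 0.5 * called_Aa) / total_called

print(observed_allele_frequency(0.3, (0.99, 0.99, 0.99)))  # uniform call rates: no shift
print(observed_allele_frequency(0.3, (0.99, 0.95, 0.99)))  # heterozygote dropout: shifted
```

Uniform call rates leave the expected allele frequency unchanged, while a lower heterozygote call rate pulls the observed minor allele frequency downward.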

23 Genotyping quality (QC):
DSL: disease susceptibility locus
DSL -> Disease Phenotype: test for genetic association between the phenotype and the DSL
Marker <-> DSL: LD / correlation
Marker -> Disease Phenotype: test for association between phenotype and marker locus
The genotype quality has to be sufficient so that the false positive rate does not dilute the “real” signals

24 Design of genome-wide association studies / Handling of the multiple testing problem

25 “Using the same data set for screening and testing”: An approach for family-based designs
Balance false-negatives with false-positives
We don’t want to test all SNPs
–“You break it, you buy it”
–Genomic screening and testing using the same data set
Test the “promising” SNPs
Ignore the “less-promising” SNPs

26 PBAT
PBAT* screening approach
–Family-based studies, quantitative traits
–Addresses multiple comparisons
–Screen and test using the same dataset
*Van Steen K, McQueen MB, Herbert A et al. (2005). Genomic screening and replication using the same data set in family-based association testing. Nat Genet 37:683-691.

27 PBAT: Screening
Step 1. Screen
–Use ‘between-family’ information E(X|S) to estimate the strength of the genetic association
–Based on the estimate, calculate the conditional power
–Select top N SNPs on the basis of power

28 PBAT: Testing
Step 2. Test
–Use ‘within-family’ information: FBAT statistic (independent of ‘between-family’ info)
–Adjust for N tests (not 500K!)

29 The 3 steps of the screening technique (Nature Genetics (2005)):
Trait and markers: SNP 1, SNP 2, SNP 3, SNP 4, SNP 5, SNP 6, with E(X1|P), E(X2|P), E(X3|P), E(X4|P), E(X5|P), E(X6|P)
Step 1: Replace X by E(X) and estimate power/effect size: 15%, 89%, 35%, 23%, 15%
Step 2: Select combination with maximal power: 85%
Step 3: Replace E(X) by X and compute the FBAT test statistic for SNP 2 and the trait. P-value for FBAT statistic: 0.5%
This p-value does not need to be adjusted for multiple comparisons!
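The screen-then-test logic can be sketched as follows. This is only an outline of the selection and adjustment steps, not the PBAT implementation: it assumes the conditional power estimates (from between-family information) and the FBAT p-values (from within-family information) have already been computed elsewhere, and all names and numbers are hypothetical.

```python
def screen_and_test(snps, top_n, overall_alpha=0.05):
    """Select the top_n SNPs by estimated power, then test only those.

    snps: mapping of SNP name -> (conditional power estimate, FBAT p-value)
    Returns a mapping of each selected SNP -> True if significant after
    Bonferroni adjustment for top_n tests (not for all screened SNPs).
    """
    # Step 1: rank by screening power, which does not use the test statistic.
    ranked = sorted(snps, key=lambda name: snps[name][0], reverse=True)
    selected = ranked[:top_n]

    # Step 2: test the selected SNPs, adjusting only for the top_n tests.
    threshold = overall_alpha / top_n
    return {name: snps[name][1] <= threshold for name in selected}

# Hypothetical screening results: (power estimate, FBAT p-value).
example_snps = {
    "SNP1": (0.15, 0.20),
    "SNP2": (0.89, 0.005),
    "SNP3": (0.35, 0.04),
    "SNP4": (0.23, 0.001),
}
print(screen_and_test(example_snps, top_n=2))
```

In this example SNP4 is never tested despite its small p-value, because it was not selected in the screening step; that is the price paid for the much lighter multiple-testing adjustment.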

31 PBAT
Software implementation
–Family-based studies
–Quantitative traits & dichotomous traits
–Single marker, haplotype, multi-marker
–Time-to-onset, multivariate data, time-series data
–Professional version distributed by Golden Helix…

32 Golden Helix Software for Illumina Whole Genome Analysis
Golden Helix is Harvard’s PBAT commercialization partner
–Easy-to-use, user-friendly graphical interface
–Professional PBAT training and consulting
–Rapid customer support
“Accelerating the Quest for Significance”
–Powerful methods for both family and unrelated individuals
–Run on hundreds of processors with distributed computing
–Illumina data import directly supported
–“I was able to do in 3 days what it has taken our lab 2 years to try and do with [other] collaborations.” – Golden Helix customer
www.goldenhelix.com

