Confounding from Cryptic Relatedness in Association Studies Benjamin F. Voight (work jointly with JK Pritchard)



Importance
Case/control association tests are becoming increasingly popular for identifying genes that contribute to human disease.
These tests are susceptible to false positives if the underlying statistical assumptions are violated, i.e. independence among all sampled alleles used in the test for association.
It is well appreciated that population structure results in false positives (Knowler et al., 1988; Lander and Schork, 1994).
Methods exist which correct for this effect (Devlin and Roeder, 1999; Pritchard and Rosenberg, 1999; Pritchard et al., 2000).

Your (favorite) Population
Obtain a sample of affected cases from the population.
Cases are not independent draws from the population allele frequencies.
Problem: the relatedness is cryptic, so the investigator does not know about the relationships in advance.

Importance
Devlin and Roeder (1999) have argued that if one is doing a genetic association study, then surely one must believe that the trait of interest has a genetic basis that is at least (partially) shared among affected individuals.
Given that cases share a set of risk factors by descent, they are presumably more related to one another than to random controls.
These authors presented numerical examples which suggested that this effect may be an important factor in practice.
However, these examples were artificially constructed, and not modeled on any population-based process.
There are few empirical data to suggest whether cryptic relatedness negatively impacts association studies. In a founder population, non-independence resulting from relatedness does matter (Newman et al., 2001).

Goals
Determine whether, or when, cryptic relatedness is likely to be a problem for general applications.
Develop a formal model for cryptic relatedness in a population genetics framework.
In a founder population, estimate the inflation factor due to (cryptic) relatedness, and compare to analytical results.
Avoid staring at “x” in front of a chalkboard.

Modeling Definitions
m affected individuals and m random controls, sampled in the current generation.
Pairs of chromosomes coalesce in a previous generation t = 1, 2, … with the usual probabilities.
All samples are typed at a single bi-allelic locus, unlinked to disease, with alleles B and b, at frequencies p and (1-p) in the population.
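The "usual probabilities" are not spelled out on the slide. As a minimal sketch, assume the standard result for a constant-size Wright-Fisher population of N diploid individuals (2N chromosomes): a given pair of chromosomes first finds a common ancestor exactly t generations back with a geometric probability. The function name and the value of N are illustrative only.

```python
def pairwise_coalescence_prob(t: int, n_diploid: int) -> float:
    """P(two chromosomes first coalesce exactly t generations ago).

    Assumption (not stated on the slide): constant-size Wright-Fisher
    population of n_diploid individuals, i.e. 2 * n_diploid chromosomes.
    """
    if t < 1:
        raise ValueError("t must be a positive number of generations")
    p = 1.0 / (2 * n_diploid)        # chance of coalescing in any single generation
    return p * (1.0 - p) ** (t - 1)  # t-1 generations without coalescence, then one with


if __name__ == "__main__":
    N = 10_000  # hypothetical population size
    for t in (1, 2, 10):
        print(t, pairwise_coalescence_prob(t, N))
```

For any realistic N these probabilities are tiny at small t, which is why excess sharing among cases has to come from the recent generations.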

Definitions
K_p: population prevalence of disease.
K_t: probability that a relative of type t of an affected proband is also affected.
λ_t: recurrence risk ratio, K_t / K_p (Risch, 1990).
G_i(a): indicator (0 or 1) for the B allele on homologous chromosome a for the i-th case (with a ∈ {1, 2} for diploid individuals).
H_j(a): as above, but for the j-th random control.

Define a test statistic, T, which measures the difference in allele counts between cases and controls (slightly modified from Devlin and Roeder, 1999).
Under the null hypothesis of no association between the marker and phenotype, an allele is of type B with probability p, independently for all alleles in the sample; this determines the null variance, Var[T].
If cryptic relatedness exists in the sample, then the variance of the test statistic, call this Var*[T], may exceed the variance under the null. We measure the deviation from the null variance using the “inflation factor”, the ratio of Var*[T] to Var[T].
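The slide's equations did not survive in this transcript. The sketch below assumes that T is the number of B alleles among the cases minus the number among the controls, which is consistent with the verbal description but is not taken from the slide. With m cases and m controls there are 4m sampled alleles, so if they are independent Bernoulli(p) draws the null variance is 4mp(1-p). All function names are hypothetical.

```python
import random


def allele_count_difference(cases, controls):
    """T = (B alleles in cases) - (B alleles in controls).

    cases / controls: lists of (a1, a2) pairs of 0/1 indicators, one pair per
    diploid individual. This form of T is an assumption of the sketch.
    """
    return sum(a1 + a2 for a1, a2 in cases) - sum(a1 + a2 for a1, a2 in controls)


def null_variance(m, p):
    """Var[T] when all 4m sampled alleles are independent Bernoulli(p) draws."""
    return 4 * m * p * (1 - p)


def inflation_factor(observed_variance, m, p):
    """Ratio of the variance of T actually observed to its null variance."""
    return observed_variance / null_variance(m, p)


def simulate_null_variance(m, p, reps, seed=0):
    """Monte Carlo estimate of Var[T] when every allele really is independent."""
    rng = random.Random(seed)

    def draw_sample():
        return [(int(rng.random() < p), int(rng.random() < p)) for _ in range(m)]

    ts = [allele_count_difference(draw_sample(), draw_sample()) for _ in range(reps)]
    mean = sum(ts) / reps
    return sum((t - mean) ** 2 for t in ts) / (reps - 1)


if __name__ == "__main__":
    m, p = 50, 0.3
    v = simulate_null_variance(m, p, reps=2000)
    # With independent alleles the ratio should be close to 1.
    print(inflation_factor(v, m, p))
```

Related cases would add positive covariance between their alleles, pushing the observed variance, and hence the ratio, above 1.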

Table columns: inflation factor, nominal Type-I error rate, fold-error rate. Recoverable fold-error entries: ~3.55 and ~6.88.
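To see how an inflated variance turns into a fold increase in the false-positive rate, here is a rough sketch. It assumes a two-sided test on a standardized statistic that is treated as standard normal while its true variance equals the inflation factor. That setup is an assumption of this example, the function names are hypothetical, and no attempt is made to reproduce the table's entries.

```python
from statistics import NormalDist


def actual_type1_error(inflation: float, alpha: float) -> float:
    """Two-sided rejection rate under the null when the standardized statistic
    has true variance `inflation` but is compared against a standard normal."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)      # nominal critical value
    # Dividing by sqrt(inflation) rescales the threshold to the true spread.
    return 2 * (1 - std_normal.cdf(z_crit / inflation ** 0.5))


def fold_error_rate(inflation: float, alpha: float) -> float:
    """Actual Type-I error rate divided by the nominal rate."""
    return actual_type1_error(inflation, alpha) / alpha


if __name__ == "__main__":
    for inflation in (1.0, 1.1, 1.5):
        for alpha in (0.05, 0.001):
            print(inflation, alpha, fold_error_rate(inflation, alpha))
```

The same inflation factor costs more, in fold terms, at stringent nominal levels than at lenient ones.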

Recall that we want the variance of our test statistic, T, under a model of cryptic relatedness. Use the following non-dodgy assumptions:
1. Draws of alleles from the population are simple Bernoulli trials. (Variance terms)
2. Controls are a random sample from the population. (Covariance terms with the H_j's are 0)
3. Allow the possibility that cases and controls depart from Hardy-Weinberg proportions by some factor, call this F. (Covariance terms for alleles in the same individual)
4. For the mutational model:
   a. Suppose the mutation process is the same for cases and random controls.
   b. Conditional on a case and random chromosome having a very recent coalescent time (on the order of 1-10 generations), assume that the chance that the alleles are in different states is ≈ 0.

Then after … JKP attempts desperately to keep me honest. Me, after many hours of intensive thought processing Smoke from my brain

Var*[T] can be simplified to an expression containing a covariance term for pairs of cases i and i´, where i ≠ i´.
And now, we evaluate the covariance term under a model of cryptic relatedness. This covariance term is fairly complicated, but it is related to the probability that allele copies a and a´ from individuals i and i´ coalesce at time t, conditional on individuals i and i´ both being affected (with i ≠ i´).
So what’s this probability?

Depends on the population model (not on phenotype)
Depends on the genetic model
Apply some Bayesian Trickery: … and after some plug and play we finally get:
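The "Bayesian trickery" presumably refers to Bayes' theorem applied to the coalescence time. The identity below is only that theorem written out, not the lost equation from the slide. The notation is introduced here for illustration: T_c is the generation in which the two allele copies coalesce, and A_i is the event that individual i is affected.

```latex
P(T_c = t \mid A_i, A_{i'})
  = \frac{P(A_i, A_{i'} \mid T_c = t)}{P(A_i, A_{i'})} \; P(T_c = t)
```

In this form, P(T_c = t) is the piece that depends only on the population model, while the ratio in front of it is the piece that depends on the genetic model.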

Under an additive model
Handy relationship between any of the λ_r's and the sibling recurrence risk ratio, a single parameter under an additive model (Risch, 1990), where φ_r is the kinship coefficient for type-r relatives, which is ¼ for r = 1, and decays by ½ for each increment to r.
Using this relationship we can simplify.
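The relationship itself was an equation that did not survive. A small sketch, assuming it takes the form λ_r - 1 = 4 φ_r (λ_s - 1), which agrees with the halving pattern for additive models in Risch (1990) and with the kinship values quoted on the slide. The names φ_r and λ_s (sibling recurrence risk ratio) and both function names are labels chosen for this example.

```python
def kinship_coefficient(r: int) -> float:
    """Kinship coefficient for type-r relatives: 1/4 at r = 1, halved per increment of r."""
    if r < 1:
        raise ValueError("r must be a positive integer")
    return 0.25 * 0.5 ** (r - 1)


def recurrence_risk_ratio(r: int, lambda_sib: float) -> float:
    """lambda_r under an additive model.

    Assumed form (reconstructed, not copied from the slide):
        lambda_r - 1 = 4 * phi_r * (lambda_sib - 1)
    """
    return 1.0 + 4.0 * kinship_coefficient(r) * (lambda_sib - 1.0)


if __name__ == "__main__":
    lambda_sib = 3.0  # hypothetical sibling recurrence risk ratio
    for r in range(1, 6):
        print(r, kinship_coefficient(r), recurrence_risk_ratio(r, lambda_sib))
```

Because the excess risk halves with every step of relatedness, distant relatives contribute very little, and one parameter is enough to describe the whole genetic model.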

Simulations
Use Wright-Fisher forward simulation to assess analytical results:
Simulate 1,000 bi-allelic unlinked loci forward in time 4N generations, with mutation parameter θ = 4Nμ = 1. (†)
Choose a single locus with the desired disease allele frequency, and assign phenotypes to all members of the population under an additive genetic model.
Select m cases and m random controls, and use all non-disease loci to infer the inflation factor based on the mean of all tests.
(†) Because WF simulations are notoriously slow, we speed them up by simulating a smaller population with a proportionally higher mutation rate, and then rescale the population size and mutation rate to the desired levels.
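The drift-and-mutation step of such a simulation can be sketched as below. This is a deliberately reduced illustration, not the simulation used for the talk: it tracks only allele frequencies at independent loci, assumes symmetric mutation between B and b, starts every locus fixed for b, and does not keep the shared genealogy or assign phenotypes. All names and parameter values are hypothetical.

```python
import random


def wright_fisher_locus(n_diploid, mu, generations, rng, start_freq=0.0):
    """Forward-simulate one bi-allelic locus; return the final frequency of allele B.

    Each generation, every one of the 2N offspring chromosomes copies a parent
    chromosome drawn at random, then switches state (B <-> b) with probability
    mu. Symmetric mutation is an assumption of this sketch.
    """
    n_chrom = 2 * n_diploid
    freq = start_freq
    for _ in range(generations):
        # Probability that an offspring chromosome carries B after mutation.
        p_b = freq * (1 - mu) + (1 - freq) * mu
        count = sum(1 for _ in range(n_chrom) if rng.random() < p_b)
        freq = count / n_chrom
    return freq


def simulate_unlinked_loci(n_loci, n_diploid, theta, seed=0):
    """Simulate n_loci independent loci for 4N generations with theta = 4 * N * mu."""
    rng = random.Random(seed)
    mu = theta / (4 * n_diploid)
    generations = 4 * n_diploid
    return [wright_fisher_locus(n_diploid, mu, generations, rng) for _ in range(n_loci)]


if __name__ == "__main__":
    # A small N keeps the run fast; holding theta fixed while shrinking N
    # mirrors the rescaling idea in the footnote.
    print(simulate_unlinked_loci(n_loci=10, n_diploid=25, theta=1.0))
```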

Simulation Results
The 95% central interval about the mean was at least 0.001 in each case.

“Tautological” Hutterite Analysis
Quick note on the Hutterites: a 13,000-member pedigree where the genealogy is known, with ~800 members phenotyped/genotyped at many markers across the genome.
Target (for each phenotype):
a. Estimate coalescent probabilities for cases and random controls based on the genealogy, using “allele-walking” simulations.
b. Calculate the inflation factor for each phenotype, and compare to the analytic prediction.
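The talk does not describe how the "allele-walking" simulations were implemented. One way such a walk could look, as a sketch under stated assumptions: pick a random allele copy in each of two individuals, trace each copy back through the known genealogy one generation at a time, and record the generation at which both lineages land on the same copy in the same ancestor. The sketch assumes discrete generations with both individuals in the same generation, so ancestors reached along paths of unequal length are ignored. The pedigree format and function names are hypothetical.

```python
import random


def walk_to_coalescence(pedigree, ind_a, ind_b, rng):
    """Walk one random allele copy from each individual up the pedigree.

    pedigree maps individual -> (father, mother); founders are absent from it.
    Returns the number of generations back at which the two lineages land on
    the same allele copy of the same ancestor, or None if a lineage reaches a
    founder first.
    """
    # State = (individual, copy); copy 0 was inherited from the father, 1 from the mother.
    a = (ind_a, rng.randrange(2))
    b = (ind_b, rng.randrange(2))
    t = 0
    while a[0] in pedigree and b[0] in pedigree:
        t += 1
        # Step to the transmitting parent, then pick which of that parent's
        # two copies was passed down.
        a = (pedigree[a[0]][a[1]], rng.randrange(2))
        b = (pedigree[b[0]][b[1]], rng.randrange(2))
        if a == b:
            return t
    return None


def coalescence_probabilities(pedigree, ind_a, ind_b, reps, seed=0):
    """Monte Carlo estimate of P(coalescence at generation t) for one pair."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(reps):
        t = walk_to_coalescence(pedigree, ind_a, ind_b, rng)
        if t is not None:
            counts[t] = counts.get(t, 0) + 1
    return {t: c / reps for t, c in sorted(counts.items())}


if __name__ == "__main__":
    # Toy pedigree: two full siblings with unrelated founder parents.
    toy = {"sib1": ("dad", "mom"), "sib2": ("dad", "mom")}
    print(coalescence_probabilities(toy, "sib1", "sib2", reps=20000))
```

For full siblings the estimate at t = 1 should come out near ¼, the sibling kinship coefficient, which gives a quick sanity check before running anything on a large genealogy.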

Note increased probabilities in cases over random controls for recent coalescent times

Hutterite Analysis
Quick note on the Hutterites: a 13,000-member pedigree where the genealogy is known, with ~800 members phenotyped/genotyped at many markers across the genome.
Target (for each phenotype):
a. Estimate coalescent probabilities for cases and random controls based on the genealogy, using “allele-walking” simulations.
b. Calculate the inflation factor for each phenotype, and compare to the analytic prediction.

Empirical inflation factors in a Founder Population
The inbreeding coefficient (F) was estimated at 0.048 and was included in the calculation.

Summary
We modeled cryptic relatedness using population-based processes. Surprisingly, these expressions are functions of directly observable parameters (population size, sample size, and the genetic model parameterized by λ_r).
Our analytical results indicate that increased false positives due to cryptic relatedness will usually be negligible for outbred populations.
We applied our technique to a founder population as an example. For six different phenotypes we found evidence for inflation, which matched analytic predictions.

Acknowledgements
JK Pritchard and NJ Cox (thesis advisors)
Carole Ober (access to the empirical data)
$/£: NIH, NIH/NIGMS Genetics Training Grant
In the bar at the conference during the week
Fine, name that tune: from memory, recite the first 1677 words of Kingman’s 1982 paper and I’ll get the next round.