Published by Rodney Lambert. Modified over 9 years ago.
1
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
FastANOVA: an Efficient Algorithm for Genome-Wide Association Study
Xiang Zhang, Fei Zou, Wei Wang
University of North Carolina at Chapel Hill
Speaker: Xiang Zhang
2
Genotype-phenotype association study
Goal: find the genetic factors that cause phenotypic differences.
[Figures: mouse genome; phenotype variation. Image sources: http://www.bcgsc.ca, http://www.jax.org/]
3
Genotype-phenotype association study
Single Nucleotide Polymorphism (SNP)
- Mutation of a single nucleotide (A, C, T, G)
- The most abundant source of genotypic variation
- SNPs serve as genetic markers of locations in the genome
- High-throughput genotyping: thousands to millions of SNPs
Example: sequences from five individuals around two SNP positions, labeled Chrom1 bp 3,568,717 and Chrom6 bp 120,323,342:
…… A A A C G …… A A T C C ……
…… A A A C G …… A A T C G ……
…… A A T C G …… A A T C C ……
…… A A T C G …… A A T C G ……
…… A A T C G …… A A T C C ……
4
Genotype-phenotype association study
Genotype
- SNPs can be represented as binary values {0, 1} (e.g. inbred mouse strains)
Quantitative phenotypes
- Body weight, blood pressure, tumor size, cancer susceptibility, ……
Question
- Which SNPs are the most highly associated with the phenotype?

SNPs                     Phenotype value
…… 0 0 0 1 0 1 ……        8
…… 0 0 0 0 0 0 ……        7
…… 0 1 1 0 0 1 ……        12
…… 0 1 0 0 1 0 ……        11
…… 0 1 0 1 0 1 ……        9
…… 0 1 0 0 0 0 ……        13
…… 1 0 1 1 1 1 ……        6
…… 1 0 0 0 1 0 ……        4
…… 1 1 1 1 1 1 ……        2
…… 1 0 0 1 0 0 ……        5
…… 1 0 0 1 0 1 ……        0
…… 1 0 1 1 0 0 ……        3
5
A simple example: single marker association study
- Partition the individuals into groups according to their genotype at one SNP
- Perform a statistical test (t-test or ANOVA) on the phenotype values of the groups
- Repeat for each SNP
[Table: the same 12 individuals, six SNPs, and phenotype values as on slide 4]
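
The three steps above can be sketched in a few lines of Python. This is only an illustration on the 12-individual example from the slides; the helper names `f_statistic` and `single_marker_scan` are hypothetical, and the F-statistic is the standard one-way ANOVA form rather than anything taken from the authors' code.

```python
# Example data from the slides: 12 individuals, 6 binary SNPs, one phenotype.
ROWS = ["000101", "000000", "011001", "010010", "010101", "010000",
        "101111", "100010", "111111", "100100", "100101", "101100"]
PHENOTYPE = [8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]


def f_statistic(groups):
    """Standard one-way ANOVA F-statistic for a list of value lists."""
    groups = [g for g in groups if g]
    values = [v for g in groups for v in g]
    m, k = len(values), len(groups)
    if k < 2 or m <= k:
        return 0.0
    grand_mean = sum(values) / m
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = ss_total - ss_between
    if ss_within <= 0:
        return float("inf")  # groups separate the values perfectly
    return (ss_between / (k - 1)) / (ss_within / (m - k))


def single_marker_scan(rows, phenotype):
    """Return one F value per SNP (column), grouping individuals by genotype."""
    scores = []
    for j in range(len(rows[0])):
        groups = {}
        for row, value in zip(rows, phenotype):
            groups.setdefault(row[j], []).append(value)
        scores.append(f_statistic(list(groups.values())))
    return scores


scores = single_marker_scan(ROWS, PHENOTYPE)
best = max(range(len(scores)), key=scores.__getitem__)
print("best SNP:", best + 1, "F =", round(scores[best], 2))
```

On this toy data the first SNP scores highest, which matches the visible pattern in the table: individuals with 0 in the first column have the larger phenotype values.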
6
Two-locus association mapping
- Many phenotypes are complex traits
  - They are due to the joint effect of multiple genes
  - A single-marker approach may not suffice
- Consider SNP-SNP interactions
  - Four possible genotype combinations for each SNP-pair: 00, 01, 10, 11
  - Split the mice into four groups according to their genotype at each SNP-pair
  - Perform a statistical test for each SNP-pair
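
As a small illustration of the grouping step, the sketch below splits the phenotype values of the example individuals by the joint genotype of the first two SNPs from the example table. The function name `split_by_pair` is hypothetical.

```python
from collections import defaultdict


def split_by_pair(xi, xj, phenotype):
    """Group phenotype values by the joint genotype (xi, xj) of each individual."""
    groups = defaultdict(list)
    for a, b, value in zip(xi, xj, phenotype):
        groups[(a, b)].append(value)
    return dict(groups)


# First two SNP columns and phenotype values of the slide example.
X1 = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
X2 = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
PHENOTYPE = [8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]

groups = split_by_pair(X1, X2, PHENOTYPE)
for genotype in sorted(groups):
    print(genotype, groups[genotype])
```

Note that a group can be small or even missing for some SNP-pairs, so any test applied afterwards has to cope with fewer than four non-empty groups.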
7
Statistical issue
- Multiple testing problem
  - Performing n independent tests, each with Type I error α, gives a family-wise error rate of 1 − (1 − α)^n
- Example
  - Performing 20 tests with Type I error = 0.05 gives a family-wise error rate of 0.64
  - That is a 64% probability of getting at least one spurious result
- Solution
  - Permutation test
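
The example figure can be checked with a one-line computation. The sketch assumes the tests are independent, which is the assumption behind the 1 − (1 − α)^n formula; the function name is hypothetical.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


fwer = family_wise_error_rate(0.05, 20)
print(round(fwer, 2))
```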
8
Permutation test
- Generate K permutations of the phenotype values
- For each permutation, find the maximum test value over all SNP-pairs
- Given Type I error α, the critical value F_α is the αK-th largest value among the K maximum values
- SNP-pairs whose test values are greater than F_α are significant
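
The selection of the critical value from the K per-permutation maxima might look like the sketch below. The function name is hypothetical, and rounding αK to the nearest integer (with a minimum rank of 1) is an assumption, since the slide does not say how a non-integer αK is handled.

```python
def critical_value(max_values, alpha):
    """Return the (alpha * K)-th largest of the K per-permutation maxima."""
    k = len(max_values)
    rank = max(1, round(alpha * k))  # assumption: round, and never below rank 1
    return sorted(max_values, reverse=True)[rank - 1]


# Toy input: pretend the 100 permutations produced maxima 1, 2, ..., 100.
toy_maxima = list(range(1, 101))
f_alpha = critical_value(toy_maxima, 0.05)
print(f_alpha)
```

With 100 toy maxima and α = 0.05, the rank is 5, so the function returns the fifth largest value.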
9
Genome-wide association study
- What is GWA?
  - Simple idea: search for associations across the whole genome
  - Hard to implement
- Enormous search space: with 10,000 SNPs and 1,000 permutations, the number of SNP-pair tests is about 5 × 10^10
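
The size of the search space follows from counting unordered SNP-pairs and multiplying by the number of permutations, as this short check shows.

```python
import math

n_snps = 10_000
n_permutations = 1_000

pairs = math.comb(n_snps, 2)           # unordered SNP-pairs
total_tests = pairs * n_permutations   # one test per pair per permutation

print(pairs, total_tests)
```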
10
Preliminary: ANOVA test and F-statistic
- ANOVA test
  - Determines whether the group means are significantly different
  - Partitions the total sum of squares (SS_T) into the between-group sum of squares (SS_B) and the within-group sum of squares (SS_W): SS_T = SS_B + SS_W
- F-statistic
  - SNPs {X_1, X_2, …, X_N}, a quantitative phenotype Y
  - Single SNP test: F(X_i, Y)
  - SNP-pair test: F(X_i X_j, Y)
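
The formulas on this slide are not preserved in the transcript. For reference, the standard one-way ANOVA definitions are given below; the symbols G (number of non-empty groups), n_g (size of group g), and the bar notation for means are introduced here for illustration and are not taken from the slides.

```latex
\begin{aligned}
SS_T &= \sum_{m=1}^{M} (y_m - \bar{y})^2, \\
SS_B &= \sum_{g=1}^{G} n_g \, (\bar{y}_g - \bar{y})^2, \\
SS_W &= SS_T - SS_B, \\
F &= \frac{SS_B / (G - 1)}{SS_W / (M - G)}.
\end{aligned}
```

For a single SNP, G is at most 2; for a SNP-pair, G is at most 4.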
11
Problem Formalization
- Dataset: M individuals, N SNPs {X_1, X_2, …, X_N}, a quantitative phenotype Y, and its K permutations {Y_1, Y_2, …, Y_K}
- Maximum ANOVA test (F-statistic) value of permutation Y_k:
  F_{Y_k} = max {F(X_i X_j, Y_k) | 1 ≤ i < j ≤ N}
- Problem 1: Given the Type I error threshold α, find the critical value F_α, which is the αK-th largest value among {F_{Y_k} | 1 ≤ k ≤ K}
- Problem 2: Given the threshold F_α, find all significant SNP-pairs, i.e. those with F(X_i X_j, Y) ≥ F_α
12
Brute force approach
- Problem 1: Permutation test to find the critical value
  - For permutation Y_k, test all SNP-pairs to find the maximum test value F_{Y_k}
  - Repeat for all permutations
  - Report the αK-th largest value in {F_{Y_k} | 1 ≤ k ≤ K}
- Problem 2: Finding significant SNP-pairs
  - For phenotype Y, test all SNP-pairs and report the SNP-pairs whose test values are at or above F_α
- Problem 1 is more demanding because of the large number of permutations
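
A minimal sketch of the brute-force procedure on the 12-individual example follows. All function names are hypothetical, the F-statistic is the standard one-way ANOVA form, and the rank rounding for αK is an assumption. The permutations are drawn with a fixed seed so that the run is repeatable.

```python
import random
from itertools import combinations

ROWS = ["000101", "000000", "011001", "010010", "010101", "010000",
        "101111", "100010", "111111", "100100", "100101", "101100"]
COLUMNS = [[int(r[j]) for r in ROWS] for j in range(6)]
PHENOTYPE = [8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]


def f_statistic(groups):
    """Standard one-way ANOVA F-statistic for a list of value lists."""
    groups = [g for g in groups if g]
    values = [v for g in groups for v in g]
    m, k = len(values), len(groups)
    if k < 2 or m <= k:
        return 0.0
    mean = sum(values) / m
    ss_total = sum((v - mean) ** 2 for v in values)
    ss_between = sum(len(g) * (sum(g) / len(g) - mean) ** 2 for g in groups)
    ss_within = ss_total - ss_between
    if ss_within <= 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (m - k))


def pair_f(xi, xj, y):
    """F(X_i X_j, Y): group individuals by their joint genotype, then test."""
    groups = {}
    for a, b, value in zip(xi, xj, y):
        groups.setdefault((a, b), []).append(value)
    return f_statistic(list(groups.values()))


def brute_force(columns, y, n_perm, alpha, seed=0):
    rng = random.Random(seed)
    pairs = list(combinations(range(len(columns)), 2))
    # Problem 1: maximum test value of every permutation, then the critical value.
    maxima = []
    for _ in range(n_perm):
        perm = y[:]
        rng.shuffle(perm)
        maxima.append(max(pair_f(columns[i], columns[j], perm) for i, j in pairs))
    rank = max(1, round(alpha * n_perm))  # assumption: round alpha*K, minimum 1
    f_alpha = sorted(maxima, reverse=True)[rank - 1]
    # Problem 2: test every pair against the original phenotype.
    significant = [(i, j) for i, j in pairs
                   if pair_f(columns[i], columns[j], y) >= f_alpha]
    return f_alpha, significant


f_alpha, significant = brute_force(COLUMNS, PHENOTYPE, n_perm=200, alpha=0.05)
print(f_alpha, significant)
```

Every permutation repeats all N(N − 1)/2 pair tests, which is the cost the rest of the talk aims to avoid.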
13
Overview of FastANOVA
- Goal: scale large permutation tests to genome-wide data
- Question: do we have to perform ANOVA tests for every SNP-pair and repeat them for all permutations?
- Idea:
  - Develop an upper bound to filter out SNP-pairs that have no chance of becoming significant (all nodes are on the same level of the search tree, so there is no sub-tree pruning; how?)
  - Compute the upper bound efficiently: calculate it for a group of SNP-pairs together (possible?)
  - Identify redundant computations in the permutation tests (reuse computations; how?)
14
The upper bound
- For any SNP-pair (X_i X_j), the following two conditions are equivalent:
  F(X_i X_j, Y) ≥ F_α  ⇔  SS_B(X_i X_j, Y) ≥ θ
- θ is fixed for a given F_α
- A bound on SS_B needs to be greater than θ for (X_i X_j) to be significant
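
The slide does not preserve the expression for θ. One way to see why such a threshold exists is to solve the standard F-statistic for SS_B. The derivation below is a sketch under the assumptions that the number of non-empty groups G is fixed and that SS_W > 0; the symbol G is introduced here and is not from the slides.

```latex
\begin{aligned}
F = \frac{SS_B / (G - 1)}{(SS_T - SS_B) / (M - G)} \ge F_\alpha
&\iff SS_B \, (M - G) \ge F_\alpha \, (G - 1) \, (SS_T - SS_B) \\
&\iff SS_B \ge \theta, \qquad
\theta = \frac{F_\alpha \, (G - 1) \, SS_T}{(M - G) + (G - 1) \, F_\alpha}.
\end{aligned}
```

Because SS_T does not depend on the SNP-pair, θ under these assumptions depends only on F_α, the phenotype, M, and G.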
15
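The upper bound
- Given X_i, X_j, and Y, the bound on SS_B consists of a constant term and two terms f(n_a) and f(n_b) (the formula itself is not preserved in this transcript)
- f(n_a) and f(n_b) depend only on the genotype of X_j

Example genotypes of the 12 individuals:
X_i  X_j
0    0
0    0
0    1
0    1
0    1
0    1
1    0
1    0
1    1
1    0
1    0
1    0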
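
Since the slide's own formula is missing, the following is only one way to obtain a bound of the shape "constant + f(n_a) + f(n_b)"; it is not claimed to be the authors' exact expression. Assume Y_a and Y_b are the phenotype values of the individuals with X_i = 0 and X_i = 1, with sizes m_a, m_b and sums T_a, T_b, and that n_a (n_b) is the size of the smaller of the two subgroups into which X_j splits Y_a (Y_b). L_a(n) and U_a(n) denote the sums of the n smallest and n largest values in Y_a. All of these symbols are introduced here for illustration.

```latex
\begin{aligned}
SS_B(X_i X_j, Y) &= SS_B(X_i, Y) + SS_B(X_j, Y_a) + SS_B(X_j, Y_b) \\
&\le SS_B(X_i, Y) + f(n_a) + f(n_b), \\
f(n_a) &= \max_{T' \in \{L_a(n_a),\, U_a(n_a)\}}
\left[ \frac{T'^2}{n_a} + \frac{(T_a - T')^2}{m_a - n_a} \right] - \frac{T_a^2}{m_a},
\qquad f(0) = 0.
\end{aligned}
```

The inequality holds because the bracketed expression is convex in T', so over all subgroups of size n_a it is largest when the subgroup consists of the n_a smallest or the n_a largest values. The term f(n_b) is defined in the same way on Y_b. The first term does not depend on X_j, and the other two depend on X_j only through n_a and n_b.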
16
Applying the upper bound
- For a given X_i, let AP = {(X_i X_j) | i+1 ≤ j ≤ N}
- Index the SNP-pairs in AP in the 2D space of (n_a, n_b)

Example data (12 individuals):
X_1 X_2 X_3 X_4 X_5 X_6
0   0   0   1   0   1
0   0   0   0   0   0
0   1   1   0   0   1
0   1   0   0   1   0
0   1   0   1   0   1
0   1   0   0   0   0
1   0   1   1   1   1
1   0   0   0   1   0
1   1   1   1   1   1
1   0   0   1   0   0
1   0   0   1   0   1
1   0   1   1   0   0

Index for X_1:
(n_a, n_b)   SNP-pairs
(2,1)        (X_1 X_2), (X_1 X_4)
(1,3)        (X_1 X_3), (X_1 X_5)
(3,3)        (X_1 X_6)
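
The index for X_1 can be rebuilt from the example data with the sketch below. The slide does not define n_a and n_b in this transcript; the code assumes that n_a is the size of the smaller X_j subgroup among individuals with X_i = 0, and n_b the same among individuals with X_i = 1. That reading is inferred from the example, where it reproduces the three entries (2,1), (1,3), and (3,3). The function name `build_index` is hypothetical.

```python
from collections import defaultdict

ROWS = ["000101", "000000", "011001", "010010", "010101", "010000",
        "101111", "100010", "111111", "100100", "100101", "101100"]
COLUMNS = [[int(r[j]) for r in ROWS] for j in range(6)]


def build_index(columns, i):
    """Map (n_a, n_b) to the list of j > i whose pair (X_i X_j) falls in that entry."""
    xi = columns[i]
    size_a, size_b = xi.count(0), xi.count(1)
    index = defaultdict(list)
    for j in range(i + 1, len(columns)):
        xj = columns[j]
        ones_a = sum(1 for a, b in zip(xi, xj) if a == 0 and b == 1)
        ones_b = sum(1 for a, b in zip(xi, xj) if a == 1 and b == 1)
        # Assumption: n is the size of the smaller of the two X_j subgroups.
        n_a = min(ones_a, size_a - ones_a)
        n_b = min(ones_b, size_b - ones_b)
        index[(n_a, n_b)].append(j)
    return dict(index)


index_x1 = build_index(COLUMNS, 0)
for entry, js in index_x1.items():
    print(entry, ["X1 X%d" % (j + 1) for j in js])
```

The index is computed from genotypes only, so it can be built once and reused for every phenotype permutation.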
17
Key properties
- Maximum possible size: [formula missing]
- Many SNP-pairs share the same entry
- All SNP-pairs in the same entry have the same upper bound value, since they share f(n_a) and f(n_b)
- The indexing structure does not depend on the phenotype permutations
18
Schema of FastANOVA (for permutation test)
- For each X_i, index the SNP-pairs {(X_i X_j) | i+1 ≤ j ≤ N} in the 2D space of (n_a, n_b)
- For each permutation, find the candidate SNP-pairs by accessing the indexing structure
  - Candidates are SNP-pairs whose upper bounds are above the threshold
  - The dynamic threshold is the maximum test value found so far
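
The schema above can be sketched end to end on the example data. This is an illustrative sketch, not the authors' implementation: the definition of (n_a, n_b), the bound in `split_bound` (largest between-group sum of squares reachable with a subgroup of the given size), and all function names are assumptions made here. The pruned search returns the same maximum as testing every pair, which is the property the sketch is meant to show.

```python
import random
from collections import defaultdict

ROWS = ["000101", "000000", "011001", "010010", "010101", "010000",
        "101111", "100010", "111111", "100100", "100101", "101100"]
COLUMNS = [[int(r[j]) for r in ROWS] for j in range(6)]
PHENOTYPE = [8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]


def ss_between(groups, total, m):
    return sum(sum(g) ** 2 / len(g) for g in groups if g) - total ** 2 / m


def f_from_ssb(ssb, ss_total, m, g):
    """F-statistic from SS_B, for g non-empty groups and m individuals."""
    if g < 2 or m <= g:
        return 0.0
    ssw = ss_total - ssb
    if ssw <= 1e-12:
        return float("inf")
    return (ssb / (g - 1)) / (ssw / (m - g))


def split_bound(values, n):
    """Upper bound on the SS_B of splitting `values` into parts of size n and the rest."""
    if n == 0:
        return 0.0
    s, total, size = sorted(values), sum(values), len(values)
    return max(t * t / n + (total - t) ** 2 / (size - n) - total ** 2 / size
               for t in (sum(s[:n]), sum(s[-n:])))


def build_index(columns, i):
    xi, index = columns[i], defaultdict(list)
    for j in range(i + 1, len(columns)):
        ones_a = sum(1 for a, b in zip(xi, columns[j]) if a == 0 and b == 1)
        ones_b = sum(1 for a, b in zip(xi, columns[j]) if a == 1 and b == 1)
        key = (min(ones_a, xi.count(0) - ones_a), min(ones_b, xi.count(1) - ones_b))
        index[key].append(j)
    return dict(index)


def exact_f(xi, xj, y, ss_total):
    groups = defaultdict(list)
    for a, b, v in zip(xi, xj, y):
        groups[(a, b)].append(v)
    ssb = ss_between(list(groups.values()), sum(y), len(y))
    return f_from_ssb(ssb, ss_total, len(y), len(groups))


def max_f_pruned(columns, indexes, y):
    """Maximum pair F value for phenotype y; also counts the pairs actually tested."""
    m, total = len(y), sum(y)
    ss_total = sum(v * v for v in y) - total ** 2 / m
    best, tested = 0.0, 0
    for i, index in enumerate(indexes):
        ya = [v for a, v in zip(columns[i], y) if a == 0]
        yb = [v for a, v in zip(columns[i], y) if a == 1]
        ssb_i = ss_between([ya, yb], total, m)  # constant for this X_i and y
        for (n_a, n_b), js in index.items():
            g = sum(0 if not part else (1 if n == 0 else 2)
                    for part, n in ((ya, n_a), (yb, n_b)))
            ssb_max = ssb_i + split_bound(ya, n_a) + split_bound(yb, n_b)
            if f_from_ssb(ssb_max, ss_total, m, g) <= best:
                continue  # no pair in this entry can beat the dynamic threshold
            for j in js:
                tested += 1
                best = max(best, exact_f(columns[i], columns[j], y, ss_total))
    return best, tested


def permutation_maxima(columns, y, n_perm, seed=0):
    rng = random.Random(seed)
    indexes = [build_index(columns, i) for i in range(len(columns))]  # built once
    maxima = []
    for _ in range(n_perm):
        perm = y[:]
        rng.shuffle(perm)
        maxima.append(max_f_pruned(columns, indexes, perm)[0])
    return maxima


maxima = permutation_maxima(COLUMNS, PHENOTYPE, n_perm=100)
print(max(maxima))
```

The index is built once, outside the permutation loop, and only the bound and the surviving candidates are recomputed for each permuted phenotype.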
19
Complexity of FastANOVA
- Time complexity
  - FastANOVA: O(N^2 M + K N M^2 + C M)
  - Brute force: O(K N^2 M)
- Space complexity
  - O((N + K) M)
- Notation: N = # SNPs, M = # individuals, K = # permutations, C = # candidates
- M << N
20
Brute force vs. FastANOVA
- FastANOVA is two orders of magnitude faster than the brute-force alternative
- #SNPs = 44k, #individuals = 26, phenotype: metabolism (water intake)
- SNP and phenotype data are available at http://www.jax.org
21
Pruning power of the bound
22
Runtime of each component
One-time cost
23
Future work
- Association study involving more than two SNPs
  - Computationally much more demanding
  - Three loci vs. two loci: the cost grows by a factor on the order of the number of SNPs
- Association study for the heterozygous case
  - SNPs are encoded as ternary variables {0, 1, 2}
24
Thank you! Questions?