The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL FastANOVA: an Efficient Algorithm for Genome-Wide Association Study Xiang Zhang Fei Zou Wei Wang University.

Presentation transcript:

FastANOVA: an Efficient Algorithm for Genome-Wide Association Study
Xiang Zhang, Fei Zou, Wei Wang
University of North Carolina at Chapel Hill
Speaker: Xiang Zhang

Genotype-phenotype association study
- Goal: finding the genetic factors that cause phenotypic differences
[Figure: mouse genome and phenotype variation]

Genotype-phenotype association study
Single Nucleotide Polymorphism (SNP)
- Mutation of a single nucleotide (A, C, T, G)
- The most abundant source of genotypic variation
- Serve as genetic markers of locations in the genome
- High-throughput genotyping: thousands to millions of SNPs
[Figure: aligned sequence fragments from several individuals, with SNP positions labeled Chrom1 bp 3,568,717 and Chrom6 bp 120,323,342]

Genotype-phenotype association study
Genotype
- SNPs can be represented as binary values {0,1} (e.g. inbred mouse strains)
Quantitative phenotypes
- Body weight, blood pressure, tumor size, cancer susceptibility, ...
Question
- Which SNPs are the most highly associated with the phenotype?
[Table: one row per individual, with SNP genotype columns (values not preserved) and a phenotype value column: 8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]

A simple example: single marker association study
- Partition individuals into groups according to the genotype of a SNP
- Perform a statistical test (t-test or ANOVA)
- Repeat for each SNP
[Table: SNP genotypes (values not preserved) and phenotype values 8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]
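The three steps above can be sketched in a few lines. This is a minimal illustration, not the speakers' code: the phenotype values are the ones shown on the slide, but the genotype matrix `snps` is hypothetical (the slide's genotype columns were not preserved), and `f_statistic` is an illustrative helper implementing the standard one-way ANOVA F-statistic.

```python
import numpy as np

def f_statistic(y, labels):
    """One-way ANOVA F-statistic of phenotype y grouped by genotype labels."""
    groups = [y[labels == v] for v in np.unique(labels)]
    m, k = len(y), len(groups)
    grand_mean = y.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (m - k))

# Phenotype values from the slide; genotypes below are HYPOTHETICAL toy data.
phenotype = np.array([8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3], dtype=float)
snps = np.array([
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],  # SNP 0
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],  # SNP 1
    [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0],  # SNP 2
]).T  # rows = individuals, columns = SNPs

# Partition by each SNP's genotype, test, and repeat for every SNP.
scores = [f_statistic(phenotype, snps[:, j]) for j in range(snps.shape[1])]
best = int(np.argmax(scores))
```

With this toy matrix, SNP 0 separates the high and low phenotype values and therefore gets the largest F value.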

Two-locus association mapping
Many phenotypes are complex traits
- Due to the joint effect of multiple genes
- A single-marker approach may not suffice
Consider SNP-SNP interactions
- Four possible genotype combinations for each SNP-pair: 00, 01, 10, 11
- Split the mice into four groups according to the genotype of each SNP-pair
- Perform a statistical test for each SNP-pair
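A small sketch of the SNP-pair test may help here. It assumes binary genotypes and encodes the four combinations 00, 01, 10, 11 as the labels 0 to 3; the two genotype vectors `snp_a` and `snp_b` are hypothetical and were chosen so that neither SNP alone is informative while the pair is.

```python
import numpy as np

def f_statistic(y, labels):
    """One-way ANOVA F-statistic of phenotype y grouped by integer labels."""
    groups = [y[labels == v] for v in np.unique(labels)]
    m, k = len(y), len(groups)
    grand_mean = y.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (m - k))

phenotype = np.array([8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3], dtype=float)
# HYPOTHETICAL genotypes: the phenotype is high only when the two SNPs differ.
snp_a = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0])
snp_b = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0])

pair_labels = 2 * snp_a + snp_b          # 0..3 encode genotypes 00, 01, 10, 11
single_a = f_statistic(phenotype, snp_a)  # single-marker test for SNP a
single_b = f_statistic(phenotype, snp_b)  # single-marker test for SNP b
pair_f = f_statistic(phenotype, pair_labels)  # four-group SNP-pair test
```

On this toy data both single-marker F values are below 1, while the pair test is much larger, which is the situation the slide describes.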

Statistical issue
Multiple test problem
- When $n$ independent tests are each performed with Type I error $\alpha$, the family-wise error rate is $1 - (1 - \alpha)^n$
Example
- Performing 20 tests with Type I error = 0.05 gives a family-wise error rate of $1 - 0.95^{20} \approx 0.64$
- That is a 64% probability of getting at least one spurious result
Solution
- Permutation test
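The slide's example can be checked numerically. The sketch below assumes the tests are independent, which is the setting in which the family-wise error rate equals 1 - (1 - alpha)^n; the function name is illustrative.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

fwer_20 = family_wise_error_rate(0.05, 20)  # about 0.64, as on the slide
```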

Permutation test
- Generate $K$ permutations of the phenotype values
- For each permutation, find the maximum test value
- Given Type I error $\alpha$, the critical value $F_\alpha$ is the $\alpha K$-th largest value among the $K$ maximum values
- SNP-pairs whose test values are greater than $F_\alpha$ are significant
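Picking the critical value from the K per-permutation maxima is a simple order-statistic lookup. The sketch below assumes the maxima have already been computed; how to round alpha*K when it is not an integer is not stated on the slide, so the rounding used here is an assumption.

```python
def critical_value(max_values, alpha):
    """Return the (alpha*K)-th largest of the K per-permutation maxima."""
    k = len(max_values)
    rank = max(1, round(alpha * k))  # ASSUMPTION: round to the nearest rank, at least 1
    return sorted(max_values, reverse=True)[rank - 1]

# Toy example: 100 hypothetical maxima 1..100 and alpha = 0.05 select the 5th largest.
threshold = critical_value(list(range(1, 101)), 0.05)
```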

Genome-wide association study
What is GWA?
- Simple idea: search for associations across the whole genome
Hard to implement
- Enormous search space: with 10,000 SNPs and 1,000 permutations, the number of SNP-pair tests is about $5 \times 10^{10}$
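The size of the search space quoted above follows from counting unordered SNP-pairs and multiplying by the number of permutations. A quick check, using only the two figures given on the slide:

```python
from math import comb

n_snps, n_permutations = 10_000, 1_000
n_pairs = comb(n_snps, 2)               # unordered SNP-pairs: 49,995,000
n_tests = n_pairs * n_permutations      # 49,995,000,000, i.e. about 5e10
```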

Preliminary: ANOVA test and F-statistic
ANOVA test
- Determines whether the group means are significantly different
- Partitions the total sum of squares into the between-group and within-group sums of squares: $SS_T = SS_B + SS_W$
F-statistic
- SNPs $\{X_1, X_2, \ldots, X_N\}$ and a quantitative phenotype $Y$
- Single SNP test: $F(X_i, Y)$
- SNP-pair test: $F(X_i X_j, Y)$
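For reference, the standard one-way ANOVA definitions behind these quantities are written out below. The transcript does not preserve the slide's own formulas, so the symbols $y_m$ (phenotype of individual $m$), $\bar{y}$ (overall mean), $g$ (number of groups), $n_q$ and $\bar{y}_q$ (size and mean of group $q$) are notation assumed here; $M$ is the number of individuals, as elsewhere in the talk.

```latex
\[
\begin{aligned}
SS_T &= \sum_{m=1}^{M} (y_m - \bar{y})^2, \\
SS_B &= \sum_{q=1}^{g} n_q\, (\bar{y}_q - \bar{y})^2, \\
SS_W &= \sum_{q=1}^{g} \sum_{m \in \text{group } q} (y_m - \bar{y}_q)^2, \\
F &= \frac{SS_B / (g - 1)}{SS_W / (M - g)} .
\end{aligned}
\]
```

A single binary SNP gives $g = 2$ groups, and a SNP-pair gives up to $g = 4$.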

Problem Formalization
- Dataset: $M$ individuals, $N$ SNPs $\{X_1, X_2, \ldots, X_N\}$, a quantitative phenotype $Y$, and its $K$ permutations $\{Y_1, Y_2, \ldots, Y_K\}$
- Maximum ANOVA test (F-statistic) value of permutation $Y_k$: $F_{Y_k} = \max\{F(X_i X_j, Y_k) \mid 1 \le i < j \le N\}$
- Problem 1: Given Type I error threshold $\alpha$, find the critical value $F_\alpha$, which is the $\alpha K$-th largest value among $\{F_{Y_k} \mid 1 \le k \le K\}$
- Problem 2: Given the threshold $F_\alpha$, find all significant SNP-pairs $(X_i X_j)$ such that $F(X_i X_j, Y) \ge F_\alpha$

Brute force approach
Problem 1: Permutation test to find the critical value
- For permutation $Y_k$, test all SNP-pairs to find the maximum test value $F_{Y_k}$
- Repeat for all permutations
- Report the $\alpha K$-th largest value in $\{F_{Y_k} \mid 1 \le k \le K\}$
Problem 2: Finding significant SNP-pairs
- For phenotype $Y$, test all SNP-pairs and report the SNP-pairs whose test values are above $F_\alpha$
Problem 1 is more demanding because of the large number of permutations.
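The brute force procedure for Problem 1 can be sketched as follows. This is an illustrative baseline under stated assumptions, not the speakers' implementation: genotypes are binary, permutations are drawn with NumPy's random generator, the rounding of alpha*K is an assumption, and the demo data at the bottom is random toy data.

```python
import itertools
import numpy as np

def f_statistic(y, labels):
    """One-way ANOVA F-statistic; returns 0.0 when fewer than two groups exist."""
    groups = [y[labels == v] for v in np.unique(labels)]
    m, k = len(y), len(groups)
    if k < 2 or m <= k:
        return 0.0
    grand_mean = y.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return np.inf if ss_within == 0 else (ss_between / (k - 1)) / (ss_within / (m - k))

def max_pair_statistic(snps, y):
    """Maximum SNP-pair F value: tests every pair (i, j) with i < j."""
    n = snps.shape[1]
    return max(f_statistic(y, 2 * snps[:, i] + snps[:, j])
               for i, j in itertools.combinations(range(n), 2))

def brute_force_critical_value(snps, y, n_perms, alpha, seed=0):
    """Permute the phenotype n_perms times and return the (alpha*K)-th largest maximum."""
    rng = np.random.default_rng(seed)
    maxima = sorted((max_pair_statistic(snps, rng.permutation(y))
                     for _ in range(n_perms)), reverse=True)
    rank = max(1, round(alpha * n_perms))  # ASSUMPTION: nearest rank, at least 1
    return maxima[rank - 1]

# HYPOTHETICAL toy data: 20 individuals, 6 binary SNPs, random phenotype.
demo_rng = np.random.default_rng(42)
demo_snps = demo_rng.integers(0, 2, size=(20, 6))
demo_y = demo_rng.normal(size=20)
demo_threshold = brute_force_critical_value(demo_snps, demo_y, n_perms=40, alpha=0.1)
```

Every permutation re-tests all N(N-1)/2 pairs, which is the cost that grows out of reach at genome scale.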

Overview of FastANOVA
Goal: scale large permutation tests to the genome-wide setting
Question: do we have to perform ANOVA tests for every SNP-pair and repeat them for all permutations?
Idea:
- Develop an upper bound to filter out SNP-pairs that have no chance of becoming significant (all nodes are on the same level of the search tree, so there is no sub-tree pruning; how?)
- Compute the upper bound efficiently: calculate the upper bound for a group of SNP-pairs together (possible?)
- Identify redundant computations in the permutation tests (reuse computations; how?)

The upper bound
- For any SNP-pair $(X_i X_j)$, the conditions $F(X_i X_j, Y) \ge F_\alpha$ and $SS_B(X_i X_j, Y) \ge \theta$ are equivalent
- $\theta$ is fixed for a given $F_\alpha$
- The bound on $SS_B$ needs to be greater than $\theta$ for $(X_i X_j)$ to be significant
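The equivalence can be derived from the standard one-way ANOVA F-statistic together with $SS_W = SS_T - SS_B$. The slide does not show the expression for $\theta$, so the derivation below is a reconstruction under that standard definition, with $g$ denoting the number of non-empty genotype groups (a symbol assumed here) and assuming $M > g$ and $SS_T > SS_B$.

```latex
\[
\begin{aligned}
F = \frac{SS_B/(g-1)}{(SS_T - SS_B)/(M-g)} \ge F_\alpha
&\iff (M-g)\, SS_B \ge (g-1)\, F_\alpha\, (SS_T - SS_B) \\
&\iff SS_B \ge \theta,
\qquad \theta = \frac{(g-1)\, F_\alpha}{(M-g) + (g-1)\, F_\alpha}\; SS_T .
\end{aligned}
\]
```

Since $SS_T$ depends only on the multiset of phenotype values, it is unchanged by permuting $Y$, so under this derivation $\theta$ depends only on $F_\alpha$, $M$, and $g$ (with $g = 4$ when all four genotype combinations occur).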

The upper bound
[Formula not preserved. Surviving labels: a constant term, $f(n_a)$, and $f(n_b)$; annotations "Given $X_i$, $X_j$, and $Y$" and "Only depend on the genotype of $X_j$"]

Applying the upper bound
- For a given $X_i$, let $AP = \{(X_i X_j) \mid i+1 \le j \le N\}$
- Index the SNP-pairs in $AP$ in the 2D space of $(n_a, n_b)$
[Figure: genotype table for $X_1$ to $X_6$ (values not preserved) and the resulting index. Surviving labels: entries (2,1), (1,3), (3,3); SNP-pairs $(X_1 X_2)$, $(X_1 X_4)$, $(X_1 X_3)$, $(X_1 X_5)$, $(X_1 X_6)$]
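The indexing step can be sketched as a dictionary keyed by $(n_a, n_b)$. The transcript does not define $n_a$ and $n_b$; the sketch assumes they count the individuals with $X_j = 1$ inside the $X_i = 0$ group and the $X_i = 1$ group respectively. The function name and the toy genotypes are hypothetical.

```python
from collections import defaultdict
import numpy as np

def build_index(snps, i):
    """Group the SNP-pairs (X_i, X_j), j > i, by the key (n_a, n_b)."""
    in_a = snps[:, i] == 0   # ASSUMED group a: individuals with X_i = 0
    in_b = snps[:, i] == 1   # ASSUMED group b: individuals with X_i = 1
    index = defaultdict(list)
    for j in range(i + 1, snps.shape[1]):
        n_a = int(snps[in_a, j].sum())  # carriers of X_j = 1 within group a
        n_b = int(snps[in_b, j].sum())  # carriers of X_j = 1 within group b
        index[(n_a, n_b)].append(j)
    return index

# HYPOTHETICAL toy genotypes: 6 individuals (rows), 4 SNPs (columns).
toy_snps = np.array([
    [0, 0, 0, 1, 1, 1],  # X_0
    [1, 1, 0, 1, 0, 0],  # X_1 -> (2, 1)
    [0, 1, 1, 0, 0, 1],  # X_2 -> (2, 1)
    [1, 0, 0, 1, 1, 1],  # X_3 -> (1, 3)
]).T
toy_index = build_index(toy_snps, 0)
```

The index is built from genotypes only, so it can be reused unchanged for every phenotype permutation.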

Key properties
- Maximum possible size: [formula not preserved]
- Many SNP-pairs share the same entry
- All SNP-pairs in the same entry have the same upper bound
- The indexing structure does not depend on the phenotype permutations
[Figure label: same upper bound value, $f(n_a)$, $f(n_b)$]

Schema of FastANOVA (for permutation test)
- For each $X_i$, index the SNP-pairs $\{(X_i X_j) \mid i+1 \le j \le N\}$ in the 2D space of $(n_a, n_b)$
- For each permutation, find the candidate SNP-pairs by accessing the indexing structure
  - Candidates are SNP-pairs whose upper bounds are above the threshold
  - The dynamic threshold is the maximum test value found so far
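The filter-then-verify loop with a dynamic threshold can be sketched as below for a single phenotype vector. The authors' bound formula is not preserved in the transcript, so this sketch substitutes a bound derived from a convexity argument: for a fixed subgroup size, the sum of squared subgroup totals divided by subgroup sizes is maximized by taking the smallest or the largest values. The meaning of $(n_a, n_b)$ is assumed as the count of $X_j = 1$ carriers within the $X_i = 0$ and $X_i = 1$ groups. All function names are hypothetical.

```python
from collections import defaultdict
import numpy as np

def f_from_ssb(ss_b, ss_t, m, g):
    """F value for a given between-group sum of squares (increasing in ss_b)."""
    if g < 2 or m <= g:
        return 0.0
    ss_w = ss_t - ss_b
    return np.inf if ss_w <= 1e-12 else (ss_b / (g - 1)) / (ss_w / (m - g))

def exact_ssb(y, labels):
    """SS_B = sum over groups of (group total)^2 / (group size) - (total)^2 / M."""
    return sum(y[labels == v].sum() ** 2 / np.sum(labels == v)
               for v in np.unique(labels)) - y.sum() ** 2 / len(y)

def split_bound(sorted_vals, n1):
    """Upper bound on s1^2/n1 + s0^2/n0 over all size-n1 subsets (convexity)."""
    n, total = len(sorted_vals), sorted_vals.sum()
    if n == 0:
        return 0.0
    if n1 == 0 or n1 == n:
        return total ** 2 / n
    extremes = (sorted_vals[:n1].sum(), sorted_vals[-n1:].sum())
    return max(s ** 2 / n1 + (total - s) ** 2 / (n - n1) for s in extremes)

def max_f_with_pruning(snps, y):
    """Maximum SNP-pair F value, skipping index entries whose bound cannot win."""
    m, n = snps.shape
    ss_t = ((y - y.mean()) ** 2).sum()
    correction = y.sum() ** 2 / m
    best, n_exact = 0.0, 0
    for i in range(n - 1):
        in_b = snps[:, i] == 1
        ya, yb = np.sort(y[~in_b]), np.sort(y[in_b])
        index = defaultdict(list)  # (n_a, n_b) -> list of j
        for j in range(i + 1, n):
            index[(int(snps[~in_b, j].sum()), int(snps[in_b, j].sum()))].append(j)
        for (na, nb), js in index.items():
            g = sum(s > 0 for s in (len(ya) - na, na, len(yb) - nb, nb))
            ub = split_bound(ya, na) + split_bound(yb, nb) - correction
            if f_from_ssb(ub, ss_t, m, g) <= best:
                continue  # bound is not above the dynamic threshold: prune entry
            for j in js:
                n_exact += 1
                ssb = exact_ssb(y, 2 * snps[:, i] + snps[:, j])
                best = max(best, f_from_ssb(ssb, ss_t, m, g))
    return best, n_exact
```

In a permutation test the same loop would run once per permuted phenotype; only the sorted phenotype values per group change, not the index keys.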

Complexity of FastANOVA
Time complexity
- FastANOVA: $O(N^2 M + K N M^2 + C M)$
- Brute force: $O(K N^2 M)$
Space complexity
- $O((N + K) M)$
Notation: $N$ = # SNPs, $M$ = # individuals, $K$ = # permutations, $C$ = # candidates, with $M \ll N$

Brute force vs. FastANOVA
- FastANOVA is two orders of magnitude faster than the brute force alternative
- Setting: #SNPs = 44k, #individuals = 26, phenotype: metabolism (water intake)
- SNP and phenotype data available at [URL not preserved]

Pruning power of the bound

Runtime of each component
[Figure label: one-time cost]

Future work
Association study involving more than two SNPs
- Computationally much more demanding
- Three loci vs. two loci: the difference is on the order of the number of SNPs
Association study for the heterozygous case
- SNPs are encoded as ternary variables {0, 1, 2}

Thank You! Questions?