BGRS 2006
SEARCH FOR MULTI-SNP DISEASE ASSOCIATION
D. Brinza, A. Perelygin, M. Brinton and A. Zelikovsky
Georgia State University, Atlanta, GA, USA


Exhaustive Search (ES):
In order to find a multi-SNP combination (MSC) whose frequency distribution has a p-value below 0.05, ES checks all one-SNP, two-SNP, ..., m-SNP combinations.
Runtime is O(n 3^m), which makes a complete search infeasible even for a small number of SNPs m.
We restrict the search to combinations of 1, 2, 3, 4 or 5 SNPs.
Searching level: the number of SNPs that participate in an MSC.

Indexed Exhaustive Search (IES):
Exhaustive search on the indexed dataset obtained by extracting k index SNPs with the MLR-based tagging method (MLR: multiple linear regression; He and Zelikovsky, 2006).
The tradeoff between the number of index SNPs and the quality of reconstruction requires choosing the maximum number of index SNPs that ES can handle in a reasonable computational time.
IES can perform a complete search on larger datasets.
For a genome-wide study the number of tags cannot be reduced to 5-10, so IES cannot perform a complete search there.

Combinatorial Search (CS):
Like ES, CS checks one-SNP, two-SNP, ..., m-SNP combinations, but only the disease-closed ones.
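The exhaustive search described above can be sketched roughly as follows. This is an illustration, not the authors' implementation: the function names, the 0/1 encoding of disease status, and the choice of a one-sided binomial tail against the overall disease rate are all assumptions (the poster states only that the unadjusted p-value is computed by binomial distribution).

```python
from itertools import combinations, product
from math import comb


def binomial_tail(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def count_candidates(n_snps, max_level):
    """Number of (SNP subset, genotype values) pairs up to max_level SNPs."""
    return sum(comb(n_snps, k) * 3**k for k in range(1, max_level + 1))


def exhaustive_search(genotypes, status, max_level):
    """Yield (snps, values, cases, controls, p) for every non-empty MSC.

    genotypes: list of tuples with SNP values from {0, 1, 2}
    status:    list of 0/1 flags, 1 = disease (hypothetical encoding)
    """
    n_snps = len(genotypes[0])
    base_rate = sum(status) / len(status)  # assumed null disease rate
    for level in range(1, max_level + 1):
        for snps in combinations(range(n_snps), level):
            for values in product((0, 1, 2), repeat=level):
                matched = [
                    s
                    for g, s in zip(genotypes, status)
                    if all(g[i] == v for i, v in zip(snps, values))
                ]
                if not matched:
                    continue  # no individual carries this combination
                cases = sum(matched)
                p = binomial_tail(cases, len(matched), base_rate)
                yield snps, values, cases, len(matched) - cases, p
```

The candidate count grows quickly with the searching level, which is why the search is capped at a few SNPs and why indexing to a small set of tag SNPs helps.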
Genetic epidemiology:
Searching for genetic risk factors for diseases.
Monogenic diseases: a mutated gene is entirely responsible for the disease. These diseases are typically rare in the population (< 0.1%), and practically all cases are already reported.
Complex diseases: affected by the interaction of multiple genes.
The significance of a risk factor is usually measured by Risk Rate or Odds Ratio. We measure significance by the p-value of the set of genotypes defined by the risk factor.

Human Genome and SNP:
Length of the human genome ≈ 3 × 10^9 base pairs.
Difference between any two people ≈ 0.1% of the genome.
Total number of single nucleotide polymorphisms (SNPs) ≈ 3 × 10^6.
SNP: a single nucleotide site where two or more different nucleotides occur in a large percentage of the population. 0 = wild type/major (frequency) allele, 1 = mutation/minor (frequency) allele.
International HapMap project: SNP maps are constructed across the human genome with a density of about one SNP per thousand nucleotides. HapMap tries to identify 1 million tag SNPs providing almost as much mapping information as the entire 10 million SNPs. Unfortunately, not as much is known about SNP combinations. The initial HapMap budget was 100 million dollars. To date, around 1.5 million SNPs have been typed. Most of the data are trios.
High-throughput genotyping technology: Affymetrix GeneChip for gene genotyping (500k microarray chip).

Disease association analysis:
Disease association analysis searches for SNPs or multi-SNP combinations whose frequency among disease individuals is considerably higher than among nondisease individuals.
Analysis of variation in suspected genes in disease and nondisease individuals is aimed at identifying SNPs with considerably higher frequencies among the disease individuals than among the nondisease individuals.
Most searches are done on a SNP-by-SNP basis. Recently, two-SNP analysis has shown promising results (Marchini et al., 2005). Multi-SNP analyses are expected to find even stronger disease associations.
Common diseases can be caused by combinations of several unlinked gene (SNP) variations. We address the computational challenge of searching for such multi-gene causal combinations.
The number of multi-SNP combinations is infeasibly high (3^100 for 100 SNPs). How can associated multi-SNP combinations be found without checking all of them?

Statistical significance:
A multi-SNP combination (MSC) defines a set of disease and nondisease individuals.
An MSC is considered statistically significant if the distribution of disease and nondisease individuals in that set has p-value < 0.05.
Many reported findings are not reproducible on different populations. It is believed that this happens because the p-values are not adjusted for multiple testing.

Our contributions:
A novel combinatorial method for finding disease-associated multi-SNP combinations was developed.
Multi-SNP combinations significantly associated with diseases were found.
For Crohn's disease data (Daly et al., 2001), a few associated multi-SNP combinations with multiple-testing-adjusted p < 0.05 were found, while no single SNP or pair of SNPs showed significant association.
For a dataset for an autoimmune disorder (Ueda et al., 2003), a few previously unknown associated multi-SNP combinations were found.
For tick-borne encephalitis virus-induced disease, a multi-SNP combination within a group of genes showing a high degree of linkage disequilibrium was found to be significantly associated with the severity of the disease.

Disease-closure:
The disease-closure of a multi-SNP combination C is the multi-SNP combination C' with the maximum number of SNPs that covers the same set of disease individuals and the minimum number of nondisease individuals.
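One way to read the disease-closure definition is sketched below, as an interpretation rather than the authors' code. It fixes every SNP on which all disease individuals matched by C agree: adding those SNPs cannot drop a matched disease individual, and every added constraint can only remove nondisease individuals. The names `match_counts` and `disease_closure`, the dict representation of an MSC, and the 0/1 status encoding are assumptions made for illustration.

```python
def matches(genotype, msc):
    """True if the genotype carries every (snp index -> value) pair of the MSC."""
    return all(genotype[i] == v for i, v in msc.items())


def match_counts(genotypes, status, msc):
    """Return (disease, nondisease) counts of individuals matched by the MSC."""
    matched = [s for g, s in zip(genotypes, status) if matches(g, msc)]
    cases = sum(matched)
    return cases, len(matched) - cases


def disease_closure(genotypes, status, msc):
    """Extend msc with every SNP on which all matched disease individuals agree.

    status uses 1 = disease, 0 = nondisease (hypothetical encoding).
    """
    cases = [g for g, s in zip(genotypes, status) if s == 1 and matches(g, msc)]
    if not cases:
        return dict(msc)  # nothing to close over
    closed = {}
    for snp in range(len(cases[0])):
        values = {g[snp] for g in cases}
        if len(values) == 1:  # all matched disease individuals share this value
            closed[snp] = values.pop()
    return closed
```

In a small example with three disease and two nondisease individuals matched by a one-SNP combination, closing the combination keeps the three disease individuals and can drop nondisease ones, which is how a closed combination may reach significance at a lower searching level.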
Disease-Associated Multi-SNP Combinations Search:
Given: a population of n genotypes (or haplotypes), each containing values of m SNPs from {0, 1, 2}, and a disease status (disease or nondisease).
Find: all multi-SNP combinations whose frequency distribution has a multiple-testing-adjusted p-value below 0.05.

Proposed searching methods:
Combinatorial Search (CS): Disease-closure allows statistically significant MSCs to be found at an earlier stage of searching. Trivial MSCs and MSCs that coincide after disease-closure are avoided, which significantly speeds up the search. CS is faster than ES and finds more significant associations at the early stages of searching, but it is still slow for genome-wide studies. Searching level: the number of SNPs that define an MSC before disease-closure.
Indexed Combinatorial Search (ICS): Combinatorial search on the indexed dataset obtained by extracting k index SNPs with the MLR-based tagging method. ICS can perform a complete search on larger datasets.
[Figure: an MSC selecting 4 sick and 1 healthy individuals from a table of sick and healthy genotypes, followed by a significance check.]

Data Sets:
Crohn's disease: 387 genotypes with 103 SNPs derived from the 616 kb region of human chromosome 5q31; 144 disease genotypes and 243 nondisease genotypes (Daly et al., 2001).
Autoimmune disorder: 1024 genotypes with 108 SNPs containing the genes CD28, CTLA4 and ICOS; 378 disease genotypes and 646 nondisease genotypes (Ueda et al., 2003).
Tick-borne encephalitis: 75 genotypes with 41 SNPs containing the genes TLR3, PKR, OAS1, OAS2 and OAS3; 21 disease genotypes and 54 nondisease genotypes (Barkash et al., 2006).

Results/comparison of searching methods:
The relative qualities of the searching methods are compared using the number of statistically significant multi-SNP combinations found. The statistical significance was adjusted for multiple testing, and the adjusted 0.05 threshold is shown (third column). The 4th, 5th and 6th columns give the frequencies of the best multi-SNP combination among the disease and nondisease populations and the unadjusted p-value, respectively.

Discussion:
Comparing the indexed counterparts with ES and CS shows that indexing is quite successful: the indexed searches found the same multi-SNP combinations as the non-indexed searches but were much faster, and the multiple-testing-adjusted 0.05 threshold was higher and easier to meet.
The comparison of CS with ES favors CS. For the Crohn's disease data (Daly et al., 2001), ES is unsuccessful on the first and second search levels, while CS finds several statistically significant multi-SNP combinations. Similarly, for the tick-borne encephalitis virus-induced disease data, CS and ICS(20) found a significant association on the first level, while no association was found by ES or IES(20). For the autoimmune disorder data (Ueda et al., 2003), CS found many more statistically significant multi-SNP combinations than ES.
We conclude that the proposed indexing approach and the combinatorial search method are very promising techniques for finding statistically significant disease-associated multi-SNP combinations and for disease susceptibility prediction.

Multiple-testing adjustment:
If the reported SNP is found among 100 SNPs, then the probability that the SNP is associated with a disease by mere chance becomes 100 times larger (Bonferroni).
Bonferroni is too crude (e.g., for 3-SNP combinations among 100 SNPs, p < 0.05 × 10^-6). We adjust the resulting p-values via randomization.
Unadjusted p-value: the probability of the case/control distribution in the set defined by an MSC, computed by the binomial distribution.
Multiple-testing-adjusted p-value (randomization): Randomly permute the disease status of the population to generate 1000 instances. Apply the searching methods to each instance to get MSCs. Compute the probability of finding MSCs that are more significant, that is, have a smaller unadjusted p-value, than the observed one.
In our search we report only MSCs with adjusted p-value < 0.05.
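The randomization procedure for the multiple-testing-adjusted p-value can be sketched as below. This is a minimal illustration under stated assumptions: `best_unadjusted_p` is a hypothetical stand-in that searches only one-SNP combinations, the unadjusted p-value is taken as a one-sided binomial tail against the overall disease rate, and a permuted instance counts as a hit when its best p-value is no larger than the observed one. Any of the searching methods (ES, IES, CS, ICS) could be passed in as `search` instead.

```python
import random
from math import comb


def binomial_tail(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def best_unadjusted_p(genotypes, status):
    """Stand-in search: smallest unadjusted p-value over one-SNP combinations."""
    base_rate = sum(status) / len(status)
    best = 1.0
    for snp in range(len(genotypes[0])):
        for value in (0, 1, 2):
            matched = [s for g, s in zip(genotypes, status) if g[snp] == value]
            if matched:
                p = binomial_tail(sum(matched), len(matched), base_rate)
                best = min(best, p)
    return best


def adjusted_p(genotypes, status, search=best_unadjusted_p, instances=1000, seed=0):
    """Fraction of status permutations whose best p-value is <= the observed one."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    observed = search(genotypes, status)
    shuffled = list(status)
    hits = 0
    for _ in range(instances):
        rng.shuffle(shuffled)  # permute disease status, keep genotypes fixed
        if search(genotypes, shuffled) <= observed:
            hits += 1
    return hits / instances
```

Because the whole search is rerun on every permuted instance, the adjustment accounts for the number of combinations actually examined, which is why indexed searches over fewer SNPs face a less strict threshold.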