Case(Control)-Free Multi-SNP Combinations in Case-Control Studies
Dumitru Brinza and Alexander Zelikovsky

Presentation transcript:

Combinatorial Search (CS) for Disease Association
CS checks all one-SNP, two-SNP, ..., m-SNP case-closed multi-SNP combinations (MSCs).
The case-closure of an MSC C is an MSC C' with the maximum number of SNPs that covers the same set of cases and the minimum number of controls.
Case-closure allows statistically significant MSCs to be found at an earlier stage of the search.
Trivial MSCs and MSCs that coincide after case-closure are skipped, which significantly speeds up the search.
CS is faster than exhaustive search.
CS finds more significant associations at an early stage of the search.
CS is still slow for genome-wide studies.

SNP and Disease
SNP: single nucleotide polymorphism, a site where two or more different nucleotides occur in a large percentage of the population.
0 = wild type/major (frequency) allele
1 = mutation/minor (frequency) allele
2 = heterozygous allele

Searching for genetic risk factors for diseases
Monogenic diseases: a mutated gene is entirely responsible for the disease.
Complex diseases: affected by the interaction of multiple genes.
The significance of a risk factor is usually measured by the risk rate or the odds ratio.
We measure significance by the p-value of the set of genotypes defined by the risk factor.

A multi-SNP combination (MSC) defines a set of case and control individuals.
An MSC is considered statistically significant if the distribution of cases and controls in that set has p-value < 0.05.
Many reported findings are not reproducible on different populations.
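The definitions of an MSC and of case-closure given above can be illustrated with a minimal sketch. The encoding is an assumption made for illustration only: an MSC is a dictionary from SNP index to required value, and the names `matches`, `cluster`, and `case_closure` are hypothetical, not taken from the poster. The closure is built by adding every SNP on which all selected cases agree, which keeps the same cases and can only remove controls.

```python
from typing import Dict, List, Sequence

Genotype = Sequence[int]   # one value from {0, 1, 2} per SNP
MSC = Dict[int, int]       # assumed encoding: SNP index -> required value


def matches(genotype: Genotype, msc: MSC) -> bool:
    """True if the genotype carries every (SNP, value) pair of the MSC."""
    return all(genotype[snp] == value for snp, value in msc.items())


def cluster(genotypes: Sequence[Genotype], msc: MSC) -> List[int]:
    """Indices of the individuals (cases and controls) selected by the MSC."""
    return [k for k, g in enumerate(genotypes) if matches(g, msc)]


def case_closure(genotypes, is_case, msc) -> MSC:
    """Add every SNP on which all selected cases agree."""
    cases = [genotypes[k] for k in cluster(genotypes, msc) if is_case[k]]
    if not cases:
        return dict(msc)  # no cases selected: nothing to close over
    closed: MSC = {}
    for snp in range(len(cases[0])):
        values = {g[snp] for g in cases}
        if len(values) == 1:
            closed[snp] = values.pop()
    return closed


# Toy data: 5 individuals, 3 SNPs (made up for illustration).
genotypes = [[0, 1, 2], [0, 1, 0], [0, 2, 2], [1, 1, 2], [0, 0, 2]]
is_case = [True, True, False, False, False]

msc = {1: 1}
closed = case_closure(genotypes, is_case, msc)
print(cluster(genotypes, msc))     # individuals selected by the MSC
print(closed)                      # its case-closure
print(cluster(genotypes, closed))  # individuals selected after closure
```

On this toy data the MSC "SNP 1 = 1" selects both cases and one control. Its case-closure additionally fixes SNP 0 = 0, which keeps both cases and drops the control.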
It is believed that this lack of reproducibility arises because the p-values are not adjusted for multiple testing.

Disease association analysis
Analysis of variation in suspected genes in case and control individuals aims to identify SNPs with considerably higher frequencies among the case individuals than among the control individuals.
Most searches are done on a SNP-by-SNP basis.
Recently, two-SNP analysis has shown promising results (Marchini et al., 2005).
Multi-SNP analyses are expected to find even stronger disease associations.
Common diseases can be caused by combinations of variations in several unlinked genes (SNPs).
We address the computational challenge of searching for such multi-gene causal combinations.
The number of multi-SNP combinations is infeasibly high (3^100 for 100 SNPs). How can associated multi-SNP combinations be found without checking all of them?
Disease association analysis searches for SNPs or multi-SNP combinations whose frequency among cases is considerably higher than among controls.

Disease-Associated Multi-SNP Combinations Search
Given: a population of n genotypes (or haplotypes), each containing values of m SNPs from {0,1,2}, and disease status (case or control).
Find: all multi-SNP combinations with multiple-testing-adjusted p-value of the frequency distribution below 0.05.

[Poster panel titles: Our contributions; Results for Disease Susceptibility Prediction; Results/comparison of searching methods; Data Sets; Maximum Case(Control)-Free Cluster Problem]
[Figure: an example MSC (x x 1 x x 2 x x x) that selects 4 sick individuals and 1 healthy individual, whose significance is then checked]

Statistical significance
If the reported SNP is found among 100 SNPs, then the probability that the SNP is associated with a disease by mere chance becomes 100 times larger (Bonferroni).
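The size of the search space and the effect of a Bonferroni correction can be checked with a few lines of arithmetic. This is only an illustrative sketch. The function name `bonferroni_threshold` is hypothetical, and counting each SNP triple once per assignment of three values is an assumption about how combinations are enumerated.

```python
from math import comb

ALPHA = 0.05      # significance level used throughout the text
NUM_SNPS = 100


def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / num_tests


# Search-space size quoted in the text: 3^100 for 100 SNPs.
search_space = 3 ** NUM_SNPS

# Single-SNP scan: 100 tests.
single_snp = bonferroni_threshold(ALPHA, NUM_SNPS)

# 3-SNP scan: unordered triples of SNPs, then (assumption) 3 values per SNP.
triples = comb(NUM_SNPS, 3)
valued_triples = triples * 3 ** 3

print(f"search space: about {float(search_space):.2e} combinations")
print(f"single-SNP threshold: {single_snp:g}")
print(f"3-SNP triples: {triples}, threshold: {bonferroni_threshold(ALPHA, triples):.2e}")
print(f"valued 3-SNP combinations: {valued_triples}, "
      f"threshold: {bonferroni_threshold(ALPHA, valued_triples):.2e}")
```

Even for 3-SNP combinations the per-test threshold is several orders of magnitude below 0.05, and the full search space is far too large to enumerate.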
Bonferroni is too crude (e.g., for 3-SNP combinations among 100 SNPs, p < 0.05 × 10^-6), so we adjust the resulting p-values via randomization.
Unadjusted p-value: the probability of the case/control distribution in the set defined by an MSC, computed from the binomial distribution.
Multiple-testing-adjusted p-value (randomization):
1. Randomly permute the disease status of the population to generate instances.
2. Apply the search methods to each instance to get MSCs.
3. Compute the fraction of instances whose MSCs are at least as significant as the observed one (unadjusted p-value no higher than the observed p-value).
In our search we report only MSCs with adjusted p-value < 0.05.

Clustering-based Model-Fitting Algorithm for Disease Susceptibility Prediction
For the given training dataset and tested genotype, consider two cases:
  the tested genotype is added to the training dataset as sick;
  the tested genotype is added to the training dataset as healthy.
For both cases, obtain a clustering by applying CGS to find:
  the most disease-associated MSC (defines a set of sick genotypes);
  the most disease-resistant MSC (defines a set of healthy genotypes).
Remove from the original dataset whichever of the two sets is larger.
Repeat this procedure until all genotypes are removed.
Predict the susceptibility of the tested genotype according to the case with the lower clustering entropy.

Disease Susceptibility Prediction Problem
Given a sample population S (a training set) and one more individual t ∉ S with known SNPs but unknown disease status (the testing individual), find (predict) the unknown disease status.

Disease Clustering Problem
Given a population sample S, find a partition P of S into clusters S = S_1 ∪ ... ∪ S_k, with disease status 0 or 1 assigned to each cluster S_i, minimizing entropy(P) for a given bound on the number of individuals who are assigned incorrect status in clusters of the partition P, error(P) < [symbol lost]·|P|.

Maximum Case(Control)-Free Cluster Problem
Find a maximum-size cluster C containing only cases or only controls.

Complimentary Greedy Search (CGS)
1. Find the SNP and allele value that remove a set of genotypes with the highest ratio of controls to cases.
2. Add the SNP to the resulting MSC.
3. Repeat steps 1-2 until all controls are removed. The resulting MSC defines a subset of sick genotypes.
4. Adjust the p-value of the resulting MSC for multiple testing.

[Table: Comparison of three methods for searching for the disease-associated and disease-resistant multi-SNP combinations with the largest PPV]
[Table: Leave-one-out cross-validation results]

A novel combinatorial method for finding disease-associated multi-SNP combinations was developed.
Multi-SNP combinations significantly associated with diseases were found.
For Crohn's disease data (Daly et al., 2001), a few associated multi-SNP combinations with multiple-testing-adjusted p < 0.05 were found, while no single SNP or pair of SNPs showed significant association.
For a dataset for an autoimmune disorder (Ueda et al., 2003), a few previously unknown associated multi-SNP combinations were found.
For tick-borne encephalitis virus-induced disease, a multi-SNP combination within a group of genes showing a high degree of linkage disequilibrium was found to be significantly associated with the severity of the disease.
Model-fitting disease susceptibility prediction methods based on the developed search methods were proposed.

Data sets
[3] Crohn's disease: 387 genotypes with 103 SNPs derived from the 616 kb region of human chromosome 5q31; 144 disease genotypes and 243 nondisease genotypes (Daly et al., 2001).
[10] Autoimmune disorder: 1024 genotypes with 108 SNPs containing genes CD28, CTLA4 and ICOS; 378 disease genotypes and 646 nondisease genotypes (Ueda et al., 2003).
[4] Tick-borne encephalitis: 75 genotypes with 41 SNPs containing genes TLR3, PKR, OAS1, OAS2 and OAS3; 21 disease genotypes and 54 nondisease genotypes (Barkash et al., 2006).
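The Complimentary Greedy Search steps and the randomization adjustment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The controls-to-cases ratio is smoothed with +1 in the denominator to avoid division by zero. A candidate that would remove every remaining case is skipped. The unadjusted p-value is taken to be a one-sided binomial tail with the overall case fraction as the success probability. All function names are hypothetical.

```python
import random
from math import comb


def complimentary_greedy_search(genotypes, is_case):
    """Greedily add (SNP, value) pairs until no controls remain selected."""
    remaining = set(range(len(genotypes)))
    msc = {}
    while any(not is_case[k] for k in remaining):
        cases_left = sum(is_case[k] for k in remaining)
        best = None  # (ratio, snp, value, removed individuals)
        for snp in range(len(genotypes[0])):
            if snp in msc:
                continue
            for value in (0, 1, 2):
                removed = {k for k in remaining if genotypes[k][snp] != value}
                controls = sum(not is_case[k] for k in removed)
                cases = len(removed) - controls
                if controls == 0 or cases == cases_left:
                    continue  # removes no control, or would lose every case
                ratio = controls / (cases + 1)  # +1 is a smoothing assumption
                if best is None or ratio > best[0]:
                    best = (ratio, snp, value, removed)
        if best is None:
            break  # the remaining controls cannot be separated
        _, snp, value, removed = best
        msc[snp] = value
        remaining -= removed
    return msc, remaining


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def cgs_p_value(genotypes, is_case):
    """Unadjusted p-value of the cluster found by the greedy search."""
    _, kept = complimentary_greedy_search(genotypes, is_case)
    cases_kept = sum(is_case[k] for k in kept)
    base_rate = sum(is_case) / len(is_case)
    return binomial_tail(cases_kept, len(kept), base_rate)


def adjusted_p_value(genotypes, is_case, num_permutations=200, seed=0):
    """Fraction of label permutations that are at least as significant."""
    rng = random.Random(seed)
    observed = cgs_p_value(genotypes, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(num_permutations):
        rng.shuffle(labels)
        if cgs_p_value(genotypes, labels) <= observed:
            hits += 1
    return hits / num_permutations


# Toy data: 3 cases followed by 4 controls, 3 SNPs (made up for illustration).
genotypes = [[1, 2, 0], [1, 2, 1], [1, 2, 2],
             [0, 2, 0], [1, 0, 1], [0, 1, 2], [1, 1, 0]]
is_case = [True, True, True, False, False, False, False]

found_msc, found_cluster = complimentary_greedy_search(genotypes, is_case)
print(found_msc, sorted(found_cluster))
print(cgs_p_value(genotypes, is_case))
print(adjusted_p_value(genotypes, is_case))
```

On the toy data the search first fixes SNP 1 = 2, which removes three controls and no cases, and then SNP 0 = 1, which removes the last control. Because the same search is rerun on every permuted instance, the adjusted p-value accounts for the selection performed by the search itself.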
[Poster panel title: Quality measure]

Combinatorial search is able to find statistically significant multi-gene interactions in data where no significant association was detected before.
Complimentary greedy search can be used in susceptibility prediction.
Optimization approach to prediction.
The new susceptibility prediction is 8% higher than the best previously known.
MLR-tagging efficiently reduces the datasets, making it possible to find associated multi-SNP combinations and predict susceptibility.

[Figure: Comparison of 5 prediction methods on the [4] data on all SNPs. The area under the CSP's ROC curve is 0.87 vs. 0.52 under the SVM's curve.]