Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer Science

Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer Science

2 Outline
SNPs, Haplotypes and Genotypes
Disease Association Search
  Genome-wide association search challenges
  Problem formulation
  Exhaustive & Combinatorial Search
  Optimization formulation & complimentary greedy search
Predicting susceptibility to complex diseases
  Problem formulation/cross-validation
  Previous methods: SVM, RF, LP
  Optimum clustering and prediction via model-fitting
Conclusions

3 SNP, Haplotypes, Genotypes
Length of human genome ≈ 3 × 10^9
Number of single nucleotide polymorphisms (SNPs) ≈ 1 × 10^7
SNPs are mostly biallelic, e.g., A ↔ C
Minor allele frequency should be considerable, e.g., > 0.1%
Difference between ALL people ≈ 0.25% (between any 2 ≈ 0.1%)
Diploid = two different copies of each chromosome
Haplotype = description of a single copy (expensive)
  example encoding: 0 is for major, 1 is for minor allele
Genotype = description of the mixed two copies
  example encoding: 0 = 00, 1 = 11, 2 = 01
International HapMap project
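The 0/1/2 genotype encoding above can be illustrated with a minimal sketch. The function name and the sample haplotypes are hypothetical and only serve the illustration; haplotypes are assumed to be strings over '0' (major allele) and '1' (minor allele).

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two haplotypes (strings of '0'/'1') into a genotype string.

    Encoding from the slide: 0 = both copies major (00),
    1 = both copies minor (11), 2 = heterozygous (01 or 10).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    # Equal alleles keep their symbol; differing alleles become '2'.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# Hypothetical haplotype pair, for illustration only.
print(genotype_from_haplotypes("0110", "0101"))  # prints 0122
```

Note that the mapping loses information: a genotype with k heterozygous sites is consistent with 2^(k-1) haplotype pairs, which is why haplotypes are described as expensive to obtain.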

4 Challenges of Disease Association
Monogenic disease
  A mutated gene is entirely responsible for the disease.
  Typically rare in population: < 0.1%.
Complex disease
  Interaction of multiple genes
    2-SNP interaction analysis for a genome-wide scan with 1 million SNPs (3 kb coverage) requires one test for every pair of SNPs
  Multiple independent causes
    Each cause explains < 10-20% of cases
  Common: > 0.1%.
    In New York City, 12% of the population has Type 2 diabetes
Multiple testing adjustment
  Reason for non-reproducible findings
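To make the scale of the 2-SNP scan concrete, the number of unordered SNP pairs can be computed directly. This is plain arithmetic on the 1 million SNPs quoted above; the significance level `alpha` is an assumed value that does not come from the slides.

```python
import math

n_snps = 1_000_000               # genome-wide scan size quoted on the slide
n_pairs = math.comb(n_snps, 2)   # unordered SNP pairs, one test per pair
alpha = 0.05                     # assumed significance level (not from the slides)

print(n_pairs)                   # prints 499999500000
print(alpha / n_pairs)           # per-test threshold after a Bonferroni-style split
```

With roughly 5 × 10^11 tests, an individual test has to reach an extremely small p-value before it survives adjustment, which is one reason findings are hard to reproduce.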

5 Disease Association Search Problem
Sample population S of individual genotypes
  Diseased genotypes: D
  Non-diseased genotypes: H
Risk/resistance factor = multi-SNP combination (MSC)
  a subset of SNP-columns of S
  the values of these SNPs: 0, 1, or 2
Cluster C = subset of S with an MSC; d(C) = number of diseased, h(C) = number of non-diseased individuals in C
PROBLEM: Find all MSCs significantly associated with the disease
Figure: genotype matrix with SNP columns and a disease status column, rows split into D and H
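The definitions of MSC, cluster, d(C) and h(C) can be sketched as follows. The representation is an assumption made for illustration: genotypes are tuples over {0, 1, 2}, status is 1 for diseased and 0 for non-diseased, and an MSC is a dict mapping SNP index to required value. The sample data is made up.

```python
def cluster(genotypes, msc):
    """Indices of the genotypes that carry the MSC ({snp_index: value})."""
    return [i for i, g in enumerate(genotypes)
            if all(g[j] == v for j, v in msc.items())]


def d(c, status):
    """Number of diseased individuals in cluster c."""
    return sum(1 for i in c if status[i] == 1)


def h(c, status):
    """Number of non-diseased individuals in cluster c."""
    return sum(1 for i in c if status[i] == 0)


# Hypothetical sample: 4 individuals, 3 SNPs.
genotypes = [(0, 1, 2), (0, 1, 0), (1, 1, 2), (0, 2, 2)]
status = [1, 1, 0, 0]

c = cluster(genotypes, {0: 0, 1: 1})   # SNP 0 has value 0 and SNP 1 has value 1
print(c, d(c, status), h(c, status))   # prints [0, 1] 2 0
```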

6 Significance of Risk/Resistance Factors
Measured by
  Relative risk (RR)
  Odds ratio (OR)
  Their p-values
Unadjusted p-value: probability of the case/control distribution among those exposed to the risk factor, computed by binomial distribution
Multiple-testing adjustment:
  Bonferroni: easy to compute, overly conservative
  Randomization: computationally expensive, more accurate
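A minimal sketch of these measures follows, using the textbook definitions of relative risk and odds ratio. The exact form of the binomial test is an assumption: under the null hypothesis each exposed individual is taken to be a case with probability |D| / |S|, and the p-value is the upper tail at the observed number of exposed cases. All function and parameter names are hypothetical.

```python
import math


def relative_risk(d_c, h_c, n_d, n_h):
    """Disease rate among the exposed divided by the rate among the unexposed."""
    exposed = d_c / (d_c + h_c)
    unexposed = (n_d - d_c) / ((n_d - d_c) + (n_h - h_c))
    return exposed / unexposed


def odds_ratio(d_c, h_c, n_d, n_h):
    """Odds of disease among the exposed divided by odds among the unexposed."""
    return (d_c * (n_h - h_c)) / (h_c * (n_d - d_c))


def binomial_p_value(d_c, h_c, n_d, n_h):
    """P[X >= d_c] for X ~ Binomial(d_c + h_c, n_d / (n_d + n_h)) (assumed null)."""
    n = d_c + h_c
    p = n_d / (n_d + n_h)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(d_c, n + 1))


def bonferroni(p_value, n_tests):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical cluster: 2 cases and 2 controls exposed, sample of 4 cases and 8 controls.
print(relative_risk(2, 2, 4, 8), odds_ratio(2, 2, 4, 8))  # prints 2.0 3.0
```

The ratios are undefined when a denominator is zero (for example a cluster with no controls), so a real implementation would need a convention for those cases. The randomization adjustment mentioned above is not shown.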

7 Exhaustive & Combinatorial Search
Exhaustive search is infeasible
  a sample with n genotypes and m SNPs requires O(n 3^m) time
Combinatorial search
  Definition: the disease-closure of a multi-SNP combination C is the multi-SNP combination C’ with the maximum number of SNPs that covers the same set of diseased individuals and the minimum number of non-diseased individuals.
  Searches only closed clusters
    Closure of cluster C = C’: d(C’) = d(C) and h(C’) is minimized
  Avoids checking of trivial MSCs
    Small d(C) implies not looking in subclusters
  Finds associated MSCs faster, but is still too slow
Tagging: compress S by extracting the most informative SNPs
  restore other SNPs from tag SNPs
  multiple regression method
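One way to read the disease-closure definition is sketched below: extend the MSC with every SNP on which all diseased members of the cluster agree. This keeps the diseased set unchanged, uses as many SNPs as possible, and can only shrink the non-diseased set. The data layout (tuples over {0, 1, 2}, status 1 = diseased, MSC as a dict) and the function name are assumptions for illustration, not the authors' implementation.

```python
def disease_closure(genotypes, status, msc):
    """Return the closed MSC ({snp_index: value}) for the given MSC."""
    members = [i for i, g in enumerate(genotypes)
               if all(g[j] == v for j, v in msc.items())]
    diseased = [i for i in members if status[i] == 1]
    closed = dict(msc)
    if not diseased:
        return closed  # no diseased individuals: closure left undefined here
    reference = genotypes[diseased[0]]
    for j in range(len(reference)):
        # Add SNP j only if every diseased member shares the same value.
        if all(genotypes[i][j] == reference[j] for i in diseased):
            closed[j] = reference[j]
    return closed


# Hypothetical sample: the MSC {SNP 0 = 0} covers two cases and one control.
genotypes = [(0, 1, 2), (0, 1, 0), (0, 2, 2), (1, 1, 2)]
status = [1, 1, 0, 0]
print(disease_closure(genotypes, status, {0: 0}))  # prints {0: 0, 1: 1}
```

In this example the closure adds SNP 1 = 1, which drops the control (0, 2, 2) while keeping both cases.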

8 MLR Tagging
Stepwise Index SNP Algorithm:
  Choose as a tag the SNP which best predicts all other SNPs
  Choose the next one which, together with the first, best predicts all other SNPs
  and so on.
Prediction method is based on Multiple Linear Regression (MLR)
So far it beats other methods (STAMPA) in quality

9 Data Sets
Crohn's disease (Daly et al.): inflammatory bowel disease (IBD)
  Location: 5q31
  Number of SNPs: 103
  Population size: 387 (case: 144, control: 243)
Autoimmune disorders (Ueda et al.):
  Location: containing genes CD28, CTLA4 and ICOS
  Number of SNPs: 108
  Population size: 1024 (case: 378, control: 646)
Tick-borne encephalitis dataset of (Barkash et al.):
  Location: containing genes TLR3, PKR, OAS1, OAS2, and OAS3
  Number of SNPs: 41
  Population size: 75 (case: 21, control: 54)

10 Disease association search results
IES(30): exhaustive search on 30 index SNPs chosen by the MLR-based tagging method
ICS(30): combinatorial search on 30 index SNPs chosen by the MLR-based tagging method

11 Disease Association Search
Optimum Association Search Problem: find the MSC that is the most associated with the disease
  Measure: positive predictive value (PPV)
  = find a (non-)diseased-free cluster of maximum size
Bad news: generalization of max independent set
  NP-complete and cannot be well approximated
Hope: sample S is not arbitrary

12 Complimentary Greedy Algorithm
Algorithm
  Start with C = S (resp. the MSC is empty)
  Repeat until h(C) = 0 (C is non-diseased-free)
    Find the 1-SNP combination (1-SC) s maximizing (h(C) - h(C ∩ {s})) / (d(C) - d(C ∩ {s}))
      = minimize payment with diseased for removal of non-diseased
    Add s to the SNPs of C’s MSC
Analogy: finding an independent set by greedily removing highest-degree vertices
Extremely fast but inaccurate
Can be used in susceptibility prediction
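The greedy loop can be sketched as follows for the disease-associated case. Several details are assumptions added for the sketch and are not stated on the slide: a step that removes no diseased individuals is given an infinite ratio, candidates that remove no non-diseased individuals or leave no diseased individuals are skipped, ties go to the first candidate, and the loop stops early if no candidate remains. Genotypes are tuples over {0, 1, 2} and status is 1 for diseased, 0 for non-diseased.

```python
def complimentary_greedy(genotypes, status):
    """Greedily grow an MSC until its cluster has no non-diseased members."""
    n_snps = len(genotypes[0])
    c = list(range(len(genotypes)))      # start with C = S
    msc = {}                             # start with an empty MSC

    def count(indices, label):
        return sum(1 for i in indices if status[i] == label)

    while count(c, 0) > 0:
        best, best_ratio = None, -1.0
        for j in range(n_snps):
            if j in msc:
                continue
            for v in (0, 1, 2):
                sub = [i for i in c if genotypes[i][j] == v]   # C ∩ {s}
                removed_h = count(c, 0) - count(sub, 0)
                removed_d = count(c, 1) - count(sub, 1)
                # Assumed guards: the step must remove a control and keep a case.
                if removed_h == 0 or count(sub, 1) == 0:
                    continue
                ratio = removed_h / removed_d if removed_d else float("inf")
                if ratio > best_ratio:
                    best, best_ratio = (j, v, sub), ratio
        if best is None:
            break                        # no SNP value separates the rest
        j, v, c = best
        msc[j] = v
    return msc, c


# Hypothetical sample: 2 cases followed by 3 controls, 2 SNPs.
genotypes = [(1, 0), (1, 1), (0, 0), (0, 1), (1, 2)]
status = [1, 1, 0, 0, 0]
print(complimentary_greedy(genotypes, status))  # prints ({0: 1, 1: 0}, [0])
```

In the example the first step fixes SNP 0 = 1, which removes two controls at no cost in cases; the second step has to give up one case to remove the last control, which illustrates why the method is fast but can be inaccurate.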

13 Most disease-associated & disease-resistant MSC
Comparison of three methods for searching the disease-associated and disease-resistant multi-SNP combinations with the largest PPV. The starred values refer to results of the runtime-constrained exhaustive search.

14 Genetic Susceptibility Prediction
Given: genotypes of diseased and non-diseased individuals; genotype of a testing person.
Find: the disease status of the testing person.
Table: Genotype and Disease Status columns; rows for healthy, sick, and the testing genotype g_t with unknown status s(g_t)

15 Cross-validation
Leave-one-out test: the disease status of each genotype in the data set is predicted while the rest of the data is regarded as the training set.
Leave-many-out test: repeatedly pick a random 2/3 of the population as the training set and predict the other 1/3.
Figure: genotypes with real and predicted disease status; accuracy = 80%
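The leave-one-out test can be sketched independently of the prediction method. The nearest-neighbour predictor below is only a hypothetical stand-in so that the loop has something to call; it is not one of the methods compared in this presentation. Genotypes are tuples over {0, 1, 2}, held in Python lists.

```python
def nearest_neighbor_predict(train_genotypes, train_status, test_genotype):
    """Stand-in predictor: status of the closest training genotype (Hamming)."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    best = min(range(len(train_genotypes)),
               key=lambda i: hamming(train_genotypes[i], test_genotype))
    return train_status[best]


def leave_one_out_accuracy(genotypes, status, predict):
    """Predict each genotype from all the others and return the accuracy."""
    correct = 0
    for i in range(len(genotypes)):
        train_g = genotypes[:i] + genotypes[i + 1:]   # everything except i
        train_s = status[:i] + status[i + 1:]
        if predict(train_g, train_s, genotypes[i]) == status[i]:
            correct += 1
    return correct / len(genotypes)


# Hypothetical sample, for illustration only.
genotypes = [(0, 0, 0), (0, 0, 1), (2, 2, 2), (2, 2, 1)]
status = [0, 0, 1, 1]
print(leave_one_out_accuracy(genotypes, status, nearest_neighbor_predict))  # prints 1.0
```

The leave-many-out test follows the same pattern, with a random 2/3 and 1/3 split repeated several times instead of a single held-out genotype.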

16 Quality Measures of Prediction (confusion table)
Sensitivity: the ability to correctly detect disease. sensitivity = TP/(TP+FN)
Specificity: the ability to avoid calling normal as disease. specificity = TN/(FP+TN)
Accuracy = (TP+TN)/(TP+FP+FN+TN)
Risk Rate: measurements for risk factors.

Prediction    Disease +              Disease -
Test +        True Positive (TP)     False Positive (FP)
Test -        False Negative (FN)    True Negative (TN)
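The three formulas can be checked on a small made-up example. The sketch assumes statuses are encoded as 1 for diseased and 0 for non-diseased, and that both classes occur in the real labels so no denominator is zero.

```python
def quality_measures(real, predicted):
    """Sensitivity, specificity and accuracy from real vs. predicted status."""
    pairs = list(zip(real, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical labels: TP = 1, FN = 1, TN = 2, FP = 1.
measures = quality_measures([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
print(measures)
```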

17 Prediction Methods
Support vector machine
Random forest
LP-based prediction

18 Prediction via Clustering
Drawback of the prediction problem formulation = need of cross-validation ⇒ no optimization
Clustering
  P = partition into clusters defined by MSCs
  Minimize the number of errors
  subject to bounded information entropy -∑ (S_i/S) log(S_i/S)
Model-fitting prediction
  Set status of testing genotype to diseased; find number of errors
  Set status of testing genotype to non-diseased; find number of errors
  Predict the status that implies the smaller number of errors
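A simplified sketch of model-fitting prediction follows. It departs from the formulation above in ways that should be read as assumptions: the partition is fixed in advance by the values of a chosen list of SNPs instead of being optimized, the number of errors of a cluster is the size of its minority class, the logarithm is base 2, and ties are resolved as non-diseased. All names are hypothetical.

```python
import math
from collections import defaultdict


def partition_entropy(sizes):
    """Information entropy -sum (S_i/S) log2(S_i/S) of the cluster sizes."""
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes if s)


def clustering_errors(genotypes, status, snps):
    """Errors when each cluster (keyed by its values on snps) takes its majority status."""
    counts = defaultdict(lambda: [0, 0])   # key -> [non-diseased, diseased]
    for g, s in zip(genotypes, status):
        counts[tuple(g[j] for j in snps)][s] += 1
    return sum(min(pair) for pair in counts.values())


def model_fitting_predict(genotypes, status, test_genotype, snps):
    """Try both statuses for the test genotype and keep the one with fewer errors."""
    errors = {}
    for guess in (1, 0):
        errors[guess] = clustering_errors(
            genotypes + [test_genotype], status + [guess], snps)
    return 1 if errors[1] < errors[0] else 0


# Hypothetical sample partitioned by SNP 0 only.
genotypes = [(0, 0), (0, 1), (1, 0), (1, 1)]
status = [0, 0, 1, 1]
print(model_fitting_predict(genotypes, status, (1, 2), snps=[0]))  # prints 1
print(partition_entropy([2, 2]))                                   # prints 1.0
```

With a fixed partition the rule reduces to a majority vote inside the test genotype's cluster; the point of the optimization formulation is that the partition itself is chosen to minimize errors under the entropy bound.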

19 Leave-1-out cross-validation results
Leave-one-out cross-validation results for combinatorial search-based prediction (CSP) and complimentary greedy search-based prediction (CGSP) are given for the cases when 20, 30, or all SNPs are chosen as informative SNPs.

20 ROC curve
Comparison of 5 prediction methods on (Barkash et al., 2006) data on all SNPs. Area under the CSP’s curve is 0.81 vs 0.52 under the SVM’s curve.

21 Conclusions
Combinatorial search is able to find statistically significant multi-gene interactions for data where no significant association was detected before
Complimentary greedy search can be used in susceptibility prediction
Optimization approach to prediction
The new susceptibility prediction outperforms the best previously known method by 8%
MLR tagging efficiently reduces the datasets, making it possible to find associated multi-SNP combinations and predict susceptibility

22 International Symposium on Bioinformatics Research and Applications
May 6-9, 2007, Georgia State University, Atlanta, Georgia
ISBRA provides a forum for the exchange of ideas and results among researchers, developers, and practitioners working on all aspects of bioinformatics and computational biology and their applications
Submissions must not exceed 12 pages in Springer LNCS style
The proceedings of ISBRA 2007 will be published in LNBI
Important Dates
  Submission Deadline: December 20, 2006
  Notification of Acceptance: January 31, 2007
  Final Version Submission: February 21, 2007
Symposium Organizers
  General Chairs: Dan Gusfield (University of California, Davis) and Yi Pan (Georgia State University)
  Program Chairs: Ion Mandoiu (University of Connecticut) and Alexander Zelikovsky (Georgia State University)

23 History of ISBRA
ISBRA is the successor of the International Workshop on Bioinformatics Research and Applications (IWBRA), held in conjunction with the International Conference on Computational Science on
  May 22-25, 2005 in Atlanta, GA
  May 28-31, 2006 in Reading, UK
The two editions of IWBRA have enjoyed great success, with special issues devoted to full versions of selected papers in Springer LNCS Transactions on Computational Systems Biology and IEEE/ACM Transactions on Computational Biology and Bioinformatics

24 Support Vector Machine (SVM) Algorithm
Learning Task
  Given: genotypes of patients and healthy persons.
  Compute: a model distinguishing whether a person has the disease.
Classification Task
  Given: genotype of a new patient and a learned model.
  Determine: whether the patient has the disease or not.
Figure panels: Linear SVM, Non-Linear SVM

25 Random Forest Algorithm
Random Forests grow many classification trees.
To classify a new object from an input vector, pass the input vector down each tree in the forest.
Each tree gives a classification, and we say the tree “votes” for that class.
The forest chooses the classification having the most votes (over all the trees in the forest).
Stages: growing trees, split selection, and prediction.
Randomization: random sub-sample of training data, random splitter selection.
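The voting step can be sketched on its own. The three "trees" below are hypothetical one-split rules written by hand; growing real trees from random sub-samples with random splitter selection is not shown.

```python
from collections import Counter


def forest_predict(trees, genotype):
    """Each tree votes for a class; the forest returns the most voted class."""
    votes = Counter(tree(genotype) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-in trees: each tests a single SNP of the genotype.
trees = [
    lambda g: 1 if g[0] == 2 else 0,
    lambda g: 1 if g[1] >= 1 else 0,
    lambda g: 1 if g[2] == 1 else 0,
]

print(forest_predict(trees, (2, 0, 1)))  # prints 1 (votes: 1, 0, 1)
```

Using an odd number of trees avoids tied votes in the two-class case.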

26 LP-based Prediction Algorithm
Model:
  Certain haplotypes are susceptible to the disease while others are resistant to the disease.
  The genotype susceptibility is assumed to be a sum of susceptibilities of its two haplotypes.
  Assign a positive weight to susceptible haplotypes and a negative weight to resistant haplotypes such that for any control genotype the sum of weights of its haplotypes is negative and for any case genotype it is positive.
For each vertex-haplotype h_i assign the weight p_i such that for any genotype-edge e_{i,j} = (h_i, h_j) the sum p_i + p_j has the sign of s(e_{i,j}), where s(e_{i,j}) ∈ {-1, 1} is the disease status of genotype e_{i,j}.
The sum of absolute values of genotype weights is maximized.