Discrete Algorithms for Disease Association Search and Susceptibility Prediction. Dumitru Brinza, Department of Computer Science, Georgia State University. UCSD, November 29, 2006.

Presentation transcript:

Discrete Algorithms for Disease Association Search and Susceptibility Prediction
Dumitru Brinza, Department of Computer Science, Georgia State University
UCSD, November 29, 2006

Outline
- SNPs, haplotypes and genotypes
- Disease association analysis
- Multiple-testing adjustment
- MLR indexing for data compression
- Optimum data clustering
- Predicting susceptibility to complex diseases
- Conclusions

SNPs, Haplotypes, Genotypes
- Human genome: all the genetic material in the chromosomes, about 3×10^9 base pairs long
- Any two people differ in about 0.1% of the genome
- SNP (single nucleotide polymorphism): a site where two or more different nucleotides occur in a large percentage of the population
- Diploid: two different copies of each chromosome
- Haplotype: description of a single copy (expensive to obtain), encoded with 0 for the major allele and 1 for the minor allele
- Genotype: description of the two copies mixed together, encoded as 0 = 00, 1 = 11, 2 = 01
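The genotype encoding above can be illustrated with a minimal sketch. The function name and the string representation of haplotypes are assumptions made for illustration; only the 0 = 00, 1 = 11, 2 = 01 mapping comes from the slide.

```python
def genotype_from_haplotypes(h1, h2):
    """Mix two haplotypes (strings over '0'/'1') into one genotype string.

    Encoding from the slide: 0 = both copies major (00),
    1 = both copies minor (11), 2 = the copies differ (01).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    genotype = []
    for a, b in zip(h1, h2):
        if a == b:
            genotype.append(a)    # homozygous site keeps the allele code
        else:
            genotype.append("2")  # heterozygous site
    return "".join(genotype)


# Hypothetical pair of haplotypes over four SNPs.
example_genotype = genotype_from_haplotypes("0110", "0101")
```

Note that the mapping is not invertible: a genotype with several sites coded 2 is consistent with more than one pair of haplotypes, which is why haplotypes are the more expensive description.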

Types of Diseases
Monogenic disease
- A mutated gene is entirely responsible for the disease
- The mutation breaks a pathway that has no other compensatory pathway
- Typically rare in the population: below 0.1%
Complex disease
- Interaction of multiple genes
  - One mutation does not cause the disease
  - Breakage of all compensatory pathways causes the disease
  - Hard to analyze: a 2-gene interaction analysis for a genome-wide scan with 1 million SNPs requires C(10^6, 2) ≈ 5×10^11 pairwise tests
- Multiple independent causes
  - There are different causes, and each cause can be the result of an interaction of several genes
  - Each cause explains a certain percentage of cases
- Common diseases are complex: above 0.1%. In New York City, 12% of the population has type 2 diabetes
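The size of a pairwise genome-wide scan can be checked with a short calculation. This is only an arithmetic sketch; the helper name `pairwise_tests` is hypothetical.

```python
from math import comb


def pairwise_tests(num_snps):
    """Number of unordered SNP pairs, i.e. 2-SNP interaction tests."""
    return comb(num_snps, 2)


# One million SNPs, as in the genome-wide scan mentioned on the slide.
genome_wide_pairs = pairwise_tests(1_000_000)
```

For one million SNPs this is roughly 5×10^11 pairs, before even considering the allele values each pair can take.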

Case/Control Study
- Given: a population of n genotypes, each containing the values of m SNPs and a disease status
- Disease association analysis searches for a risk (resistance) factor whose frequency among case (control) individuals is considerably higher than among control (case) individuals
[Figure: case genotypes and control genotypes as rows over SNP columns, with a disease status column]

Risk/Resistance Factors
- Risk/resistance factor = one SNP with a fixed allele value
- Example: the third SNP with fixed allele value 1 is a risk factor, because its frequency among case individuals is higher than among control individuals; it is present in 5 cases : 1 control
- We generalize the risk/resistance factor to a multi-SNP combination
[Figure: genotype table with status labels control, case, case, case, case, case, control]

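Multi-SNP Extension
- Multi-SNP combination (MSC): a subset of the SNP columns of S (the set of SNPs) with fixed values of these SNPs (0, 1, or 2)
- Example: the MSC x x 1 x x 2 x x x, where x marks a SNP that is not fixed, is present in 4 cases : 1 control
- Check the significance of each MSC
[Figure: genotype table with status labels control, case, case, case, case, case, control]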
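Counting how many cases and controls carry an MSC can be sketched as follows. The representation of an MSC as a dictionary from SNP index to fixed value, the status coding (1 = case, 0 = control), and the toy data are all assumptions made for illustration, not data from the study.

```python
def matches(genotype, msc):
    """True if the genotype has the required value at every SNP fixed by the MSC."""
    return all(genotype[snp] == value for snp, value in msc.items())


def count_cases_controls(genotypes, statuses, msc):
    """Return (cases, controls) among the genotypes that carry the MSC."""
    cases = controls = 0
    for genotype, status in zip(genotypes, statuses):
        if matches(genotype, msc):
            if status == 1:
                cases += 1
            else:
                controls += 1
    return cases, controls


# Hypothetical toy sample: 7 individuals, 6 SNPs, values in {0, 1, 2}.
toy_genotypes = [
    [0, 1, 1, 0, 2, 2],  # control
    [0, 0, 1, 1, 0, 2],  # case
    [1, 0, 1, 0, 0, 2],  # case
    [0, 2, 1, 0, 1, 2],  # case
    [2, 0, 1, 0, 0, 2],  # case
    [0, 0, 0, 0, 0, 2],  # case
    [0, 0, 1, 0, 0, 0],  # control
]
toy_statuses = [0, 1, 1, 1, 1, 1, 0]

# MSC "x x 1 x x 2": third SNP fixed to 1, sixth SNP fixed to 2.
toy_msc = {2: 1, 5: 2}
toy_counts = count_cases_controls(toy_genotypes, toy_statuses, toy_msc)
```

On this toy sample the MSC is carried by 4 cases and 1 control, mirroring the ratio quoted on the slide.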

Significance of Risk/Resistance Factors
- Measured by the p-value: the probability that the case/control distribution among individuals exposed to the risk factor happened by chance
- Computed using the binomial distribution
- Searching for risk factors among many SNPs requires a multiple-testing adjustment of the p-value
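One way to compute such a binomial p-value is sketched below. The slide does not spell out the exact test, so this sketch assumes a one-sided test whose null hypothesis is that each exposed individual is a case with probability equal to the overall case fraction; the function names are hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def risk_factor_p_value(exposed_cases, exposed_controls, total_cases, total_controls):
    """Unadjusted p-value of a risk factor (assumed one-sided binomial test).

    Null hypothesis (an assumption of this sketch): every exposed individual
    is a case with probability total_cases / (total_cases + total_controls).
    """
    exposed = exposed_cases + exposed_controls
    case_fraction = total_cases / (total_cases + total_controls)
    return binomial_upper_tail(exposed_cases, exposed, case_fraction)


# Hypothetical study: a factor seen in 40 cases and 10 controls,
# in a sample of 200 cases and 200 controls.
example_p = risk_factor_p_value(40, 10, 200, 200)
```

For a resistance factor the same calculation applies with the roles of cases and controls swapped.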

Multiple-Testing Adjustment
Bonferroni
- Easy to compute
- Overly conservative
- If the reported SNP is found among 100 SNPs, then the probability that the SNP is associated with the disease by mere chance becomes 100 times larger
Randomization
- Randomly permute the disease status of the population to generate samples
- Apply the search method to each sample and get its MSCs
- Count the number of MSCs that have a smaller unadjusted p-value than the observed p-value
- If this count is below 500, the observed MSC is significant (500 corresponds to the 0.05 threshold if 10,000 permuted samples are generated)
- Computationally expensive
- More accurate
In our search we report only MSCs with adjusted p-value below 0.05
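Both adjustments can be sketched in a few lines. The names below are hypothetical, and `search_best_p` stands in for whatever search method is being adjusted: it is assumed to take a list of disease statuses and return the smallest unadjusted p-value the search finds under that labeling.

```python
import random


def bonferroni(p_value, num_tests):
    """Bonferroni adjustment: multiply by the number of tests, capped at 1."""
    return min(1.0, p_value * num_tests)


def permutation_adjusted_p(observed_p, statuses, search_best_p,
                           num_permutations=1000, seed=0):
    """Randomization adjustment as outlined on the slide.

    Shuffles the disease statuses, reruns the search on each shuffled
    labeling, and returns the fraction of permutations whose best
    unadjusted p-value is smaller than the observed one.
    """
    rng = random.Random(seed)
    shuffled = list(statuses)
    hits = 0
    for _ in range(num_permutations):
        rng.shuffle(shuffled)
        if observed_p > search_best_p(shuffled):
            hits += 1
    return hits / num_permutations
```

The cost of the randomization approach is visible in the loop: the whole search runs once per permutation, which is why a fast search method matters.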

Disease Association Search
Problem formulation
- Given: case/control study data consisting of n genotypes, each containing the values of m SNPs and a disease status
- Find: all risk/resistance factors (MSCs) with a multiple-testing-adjusted p-value below 0.05

Searching Approaches
Exhaustive search (ES)
- Computationally infeasible
- Searching for a 3-SNP MSC on a sample with n genotypes and m SNPs requires O(n 3m)
Case-closure
- The case-closure of an MSC C is an MSC C' with the maximum number of SNPs with fixed values that is present in the same set of cases and in the minimum number of controls
- Efficient way to find the case-closure: extend the MSC with those SNPs that have a common value in all of its cases
- Example: the MSC x x 1 x x 2 x x x is present in 2 cases : 2 controls; its case-closure MSC' x x 1 x x 2 x 0 x is present in 2 cases : 1 control
- Cluster C = the subset of genotypes that share the same MSC
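The case-closure rule (extend the MSC with every SNP on which all of its cases agree) can be sketched as follows. The dictionary representation of an MSC, the status coding (1 = case, 0 = control), and the toy data are assumptions made for illustration.

```python
def matches(genotype, msc):
    """True if the genotype has the required value at every SNP fixed by the MSC."""
    return all(genotype[snp] == value for snp, value in msc.items())


def case_closure(genotypes, statuses, msc):
    """Extend an MSC with every SNP that has one common value in all its cases."""
    cases = [g for g, s in zip(genotypes, statuses) if s == 1 and matches(g, msc)]
    closed = dict(msc)
    if not cases:
        return closed  # no case carries the MSC, nothing to extend
    for snp in range(len(cases[0])):
        values = {g[snp] for g in cases}
        if len(values) == 1:
            closed[snp] = values.pop()
    return closed


# Hypothetical toy sample: 5 individuals, 9 SNPs.
closure_genotypes = [
    [0, 1, 1, 2, 0, 2, 1, 0, 2],  # control
    [1, 0, 1, 0, 2, 2, 0, 0, 1],  # case
    [0, 2, 1, 1, 0, 2, 2, 0, 0],  # case
    [0, 0, 0, 0, 0, 2, 0, 1, 0],  # case
    [2, 1, 1, 0, 1, 2, 0, 1, 1],  # control
]
closure_statuses = [0, 1, 1, 1, 0]

# MSC "x x 1 x x 2 x x x" is carried by 2 cases and 2 controls here.
closure_msc = {2: 1, 5: 2}
closed_msc = case_closure(closure_genotypes, closure_statuses, closure_msc)
```

On this toy sample the closure fixes the eighth SNP to 0. The set of cases is unchanged, while one of the two controls no longer carries the combination, as in the slide's example.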

Combinatorial Search
Combinatorial search method (CS)
- Searches only among case-closed MSCs
- Avoids checking clusters with a small number of cases
- Finds significant MSCs faster than ES
- Still too slow for large data
- Further speedup by reducing the number of SNPs
Indexing: compress S by extracting the most informative SNPs
- Uses a multiple regression method

Indexing
Problem formulation
- Given the full pattern of all SNPs in a sample
- Find the minimum number of index SNPs that allow reconstruction of the complete genotype of each individual
Computation methods
- Step 1: find the index SNPs (SNP positions) in the sample, using the index SNPs selection algorithm
- Step 2: reconstruct the complete genotype (values 0, 1, 2), using the SNP prediction algorithm

MLR Indexing
SNP prediction algorithm
- Based on multiple linear regression (MLR)
Index SNPs selection algorithm
- Choose as the first index SNP the SNP that best predicts all other SNPs
- Choose as the next index SNP the one that, together with the first, best predicts all other SNPs, and so on

5 Data Sets
Crohn's disease (Daly et al): inflammatory bowel disease (IBD)
- Location: 5q31
- Number of SNPs: 103
- Population size: 387 (144 cases, 243 controls)
Autoimmune disorders (Ueda et al)
- Location: region containing the genes CD28, CTLA4 and ICOS
- Number of SNPs: 108
- Population size: 1024 (378 cases, 646 controls)
Tick-borne encephalitis (Barkash et al)
- Location: region containing the genes TLR3, PKR, OAS1, OAS2 and OAS3
- Number of SNPs: 41
- Population size: 75 (21 cases, 54 controls)
Lung cancer (Dragani et al)
- Number of SNPs: 141
- Population size: 500 (260 cases, 240 controls)
Rheumatoid arthritis (GAW15)
- Number of SNPs: 2300
- Population size: 920 (460 cases, 460 controls)

Results of Disease Association Search
Indexed versus original
- More statistically significant MSCs are found on the indexed data than on the non-indexed data
CS versus ES
- Over all datasets, CS finds no fewer MSCs than ES
- For some datasets ES could not find any significant MSC in a reasonable amount of time, while CS did
Conclusion
- The proposed indexing approach and the CS method are very promising techniques
- CS is still slow
- Alternatively, we can search not for all MSCs but for the best MSC

The Most Associated MSC
Optimum association search problem
- Given: case/control study data
- Find: the MSC that is most associated with the disease, i.e. the MSC present in a control-free cluster of maximum size
Complexity
- Generalization of maximum independent set
- NP-complete and cannot be well approximated
Hope
- The sample S is not arbitrary
- Biological structure
Cluster C = the subset of genotypes that share the same MSC

Complimentary Greedy Search (CGS)
Intuition: the greedy algorithm for finding a maximum independent set by removing highest-degree vertices
Algorithm:
1. Start with the empty MSC, which is present in all genotypes
2. Find the SNP with an allele value whose addition removes a set of genotypes with the highest ratio of controls over cases (max(controls/cases))
3. Add the SNP to the resulting MSC
4. Repeat 2-3 until all controls are removed
5. Output the resulting MSC
6. Adjust the p-value of the resulting MSC for multiple testing
Extremely fast but inaccurate
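Steps 1-5 of the algorithm can be sketched as below. Several details are assumptions of this sketch rather than statements from the slides: the status coding (1 = case, 0 = control), the rule that a step must keep at least one case, treating a step that removes controls but no cases as an infinite ratio, and breaking ties by the number of controls removed. Step 6 (multiple-testing adjustment) is left out.

```python
def complimentary_greedy_search(genotypes, statuses):
    """Greedy search for an MSC whose cluster contains no controls.

    Returns (msc, remaining), where msc maps SNP index to fixed value and
    remaining lists the indices of genotypes that still carry the MSC.
    """
    remaining = list(range(len(genotypes)))
    num_snps = len(genotypes[0])
    msc = {}
    while any(statuses[i] == 0 for i in remaining):
        remaining_cases = sum(1 for i in remaining if statuses[i] == 1)
        best = None  # (key, snp, value)
        for snp in range(num_snps):
            if snp in msc:
                continue
            for value in (0, 1, 2):
                removed = [i for i in remaining if genotypes[i][snp] != value]
                removed_controls = sum(1 for i in removed if statuses[i] == 0)
                removed_cases = len(removed) - removed_controls
                # Assumption: skip steps that remove no control or every case.
                if removed_controls == 0 or removed_cases == remaining_cases:
                    continue
                if removed_cases:
                    ratio = removed_controls / removed_cases
                else:
                    ratio = float("inf")
                key = (ratio, removed_controls)  # assumed tie-break
                if best is None or key > best[0]:
                    best = (key, snp, value)
        if best is None:
            break  # no admissible step can remove another control
        _, snp, value = best
        msc[snp] = value
        remaining = [i for i in remaining if genotypes[i][snp] == value]
    return msc, remaining


# Hypothetical toy sample: 5 individuals, 3 SNPs.
cgs_genotypes = [
    [0, 1, 0],  # control
    [1, 1, 2],  # case
    [1, 1, 0],  # case
    [1, 0, 1],  # control
    [1, 1, 1],  # case
]
cgs_statuses = [0, 1, 1, 0, 1]
cgs_msc, cgs_cluster = complimentary_greedy_search(cgs_genotypes, cgs_statuses)
```

Each pass through the loop removes at least one control, so the search ends after at most as many steps as there are controls, which is consistent with the slide's remark that the method is extremely fast.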

CGS Results
- CGS finds MSCs with non-trivially high association on real data
- In a reasonable amount of time, CGS finds more significant MSCs on the full dataset than CS does on the indexed dataset

Future Work
CGS is fast, so it can be used as a basic operation in case/control data analysis
- Cover the data with clusters corresponding to MSCs found by CGS and analyze the SNPs that belong to many MSCs
- Build a classifier (prediction) based on the MSCs found by CGS
We plan to randomize CGS using simulated annealing to find more significant MSCs with a smaller number of SNPs

Genetic Susceptibility Prediction
Problem formulation
- Given: case/control study data S and the genotype g_t of a testing individual t
- Predict: the disease status of the testing individual
[Figure: case genotypes and control genotypes over SNP columns with a disease status column; the status of the testing genotype g_t is marked "?"]

Cross-Validation
Leave-one-out test
- The disease status of each genotype in the data set is predicted while the rest of the data is used as the training set
Leave-many-out test
- Repeatedly pick 2/3 of the population at random as the training set and predict the other 1/3
[Figure: real versus predicted disease status per genotype; accuracy = 80%]
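The leave-one-out test can be sketched as a loop around any prediction method. The `predict` callable is a hypothetical interface, and the nearest-neighbor predictor below is only a stand-in used to make the sketch runnable; it is not one of the prediction methods discussed in the talk.

```python
def leave_one_out_accuracy(genotypes, statuses, predict):
    """Predict each genotype from all the others and return the accuracy.

    predict(train_genotypes, train_statuses, genotype) returns a status.
    """
    correct = 0
    for i in range(len(genotypes)):
        train_genotypes = genotypes[:i] + genotypes[i + 1:]
        train_statuses = statuses[:i] + statuses[i + 1:]
        if predict(train_genotypes, train_statuses, genotypes[i]) == statuses[i]:
            correct += 1
    return correct / len(genotypes)


def nearest_neighbor_predict(train_genotypes, train_statuses, genotype):
    """Stand-in predictor: status of the training genotype at minimum Hamming distance."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    nearest = min(range(len(train_genotypes)),
                  key=lambda j: hamming(train_genotypes[j], genotype))
    return train_statuses[nearest]


# Hypothetical toy sample: two controls (0) and two cases (1).
loo_genotypes = [[0, 0, 0], [0, 0, 1], [2, 2, 2], [2, 2, 1]]
loo_statuses = [0, 0, 1, 1]
loo_accuracy = leave_one_out_accuracy(loo_genotypes, loo_statuses,
                                      nearest_neighbor_predict)
```

A leave-many-out test follows the same pattern, with a random 2/3 versus 1/3 split repeated several times in place of the single held-out genotype.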

Quality Measures of Prediction
- Sensitivity: the ability to correctly detect cases; sensitivity = TP/(TP+FN)
- Specificity: the ability to avoid calling a control a case; specificity = TN/(FP+TN)
- Accuracy = (TP+TN)/(TP+FP+FN+TN)
- Risk rate: measurement for risk factors
Confusion table:
                     Original case          Original control
Predicted case       True positive (TP)     False positive (FP)
Predicted control    False negative (FN)    True negative (TN)
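The three formulas above translate directly into code. The function names and the status coding (1 = case, 0 = control) are assumptions of this sketch, which also assumes that the data contains at least one case and one control so that no denominator is zero.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN) with 1 = case and 0 = control."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn


def quality_measures(actual, predicted):
    """Sensitivity, specificity and accuracy as defined on the slide."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical real and predicted statuses for five individuals.
example_measures = quality_measures([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```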

Prediction Methods
- Support vector machine
- Random forest
- LP-based prediction
Drawback of the prediction problem formulation: the need for cross-validation; no optimization

Optimum Clustering Problem
- Given: case/control study data represented by a population sample S
- Find: a partition P of S into clusters, S = S_1 ∪ ... ∪ S_k, with disease status 0 or 1 assigned to each cluster S_i, minimizing entropy(P) assuming 0 errors
- Clustering P = partition into clusters defined by MSCs

From Clustering to Prediction
Intuition
- If the tested genotype is predicted correctly, then the optimum clustering will have smaller entropy
Model-fitting prediction algorithm
- Set the status of the testing genotype to diseased
- Add it to the training dataset
- Find the optimum clustering of the dataset
- Set the status of the testing genotype to non-diseased
- Add it to the training dataset
- Find the optimum clustering of the dataset
- Predict the status according to the case with the smaller entropy

Results of Prediction Methods: Leave-One-Out Cross-Validation
Leave-one-out cross-validation results for combinatorial search-based prediction (CSP) and complimentary greedy search-based prediction (CGSP) are given for 20, 30, or all SNPs chosen as informative SNPs.

ROC Curve
Comparison of 5 prediction methods on the (Barkash et al., 2006) data on all SNPs. The area under the CSP curve is 0.81 versus 0.52 under the SVM curve.

Conclusions
- Combinatorial search is able to find statistically significant multi-gene interactions in data where no significant association was detected before
- Complimentary greedy search can be used in susceptibility prediction
- Optimization approach to prediction
- The new susceptibility prediction is 15% higher than the best previously known result
- MLR tagging efficiently reduces the datasets, making it possible to find associated multi-SNP combinations and predict susceptibility

Thank You!
- Poster #14: Case(Control)-Free Multi-SNP Combinations in Case-Control Studies, Algorithmic Biology 2006
- Paper: Combinatorial Methods for Disease Association Search and Susceptibility Prediction, WABI 2006