Discrete Algorithms for Disease Association Search and Susceptibility Prediction
  Dumitru Brinza
  Department of Computer Science, Georgia State University
  UCSD, November 29, 2006
Outline
  SNPs, Haplotypes and Genotypes
  Disease Association Analysis
  Multiple-testing adjustment
  MLR indexing for data compression
  Optimum data clustering
  Predicting susceptibility to complex diseases
  Conclusions
SNPs, Haplotypes, Genotypes
  Human genome: all the genetic material in the chromosomes, about 3×10^9 base pairs long
  Any two people differ in about 0.1% of the genome
  SNP (single nucleotide polymorphism): a site where two or more different nucleotides occur in a large percentage of the population
  Diploid: two different copies of each chromosome
  Haplotype: description of a single copy (expensive to obtain); 0 stands for the major allele, 1 for the minor allele
  Genotype: description of the two copies mixed together; 0 = 00, 1 = 11, 2 = 01
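The genotype encoding above (0 = 00, 1 = 11, 2 = 01) can be illustrated with a minimal sketch. The function name and the example haplotypes are made up for illustration; they are not taken from the talk.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two haplotypes (sequences of 0/1 alleles) into one genotype.

    Encoding from the slide: 0 = both major (00), 1 = both minor (11),
    2 = one of each (01 or 10).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    genotype = []
    for a, b in zip(h1, h2):
        if a == b:
            genotype.append(a)   # 00 -> 0, 11 -> 1
        else:
            genotype.append(2)   # 01 or 10 -> 2
    return genotype


# Hypothetical pair of haplotypes over four SNPs.
example = genotype_from_haplotypes([0, 1, 0, 1], [0, 1, 1, 0])
print(example)  # [0, 1, 2, 2]
```

Note that the mapping is not invertible: a genotype with several 2s is consistent with more than one pair of haplotypes.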
Types of Diseases
  Monogenic disease
    A mutated gene is entirely responsible for the disease
    The mutation breaks a pathway and there is no other compensatory pathway
    Typically rare in the population: < 0.1%
  Complex disease
    Interaction of multiple genes
      One mutation does not cause the disease
      Breakage of all compensatory pathways causes the disease
      Hard to analyze: a 2-gene interaction analysis for a genome-wide scan with 1 million SNPs requires about 5×10^11 pairwise tests
    Multiple independent causes
      There are different causes, and each of them can be the result of an interaction of several genes
      Each cause explains a certain percentage of cases
    Common diseases are complex: > 0.1%. In New York City, 12% of the population has Type 2 diabetes
Case/Control study
  Given: a population of n genotypes, each containing values of m SNPs and a disease status
  Disease association analysis searches for a risk (resistance) factor whose frequency among case (control) individuals is considerably higher than among control (case) individuals
  [Figure: SNP matrix of case genotypes and control genotypes with a disease status column]
Risk/Resistance factors
  Risk/resistance factor = one SNP with a fixed allele value
  Example: the third SNP with fixed allele value 1 is a risk factor, because its frequency among case individuals is higher than among control individuals (present in 5 cases : 1 control)
  We generalize the risk/resistance factor to a multi-SNP combination
  [Figure: genotype matrix with rows labeled control, case, case, case, case, case, control]
Multi-SNP extension
  Multi-SNP combination (MSC): a subset of SNP columns of S (the set of SNPs) with fixed values of these SNPs (0, 1, or 2)
  Example: the MSC x x 1 x x 2 x x x (x marks an SNP that is not fixed) is present in 4 cases : 1 control; check its significance
  [Figure: genotype matrix with rows labeled control, case, case, case, case, case, control]
Significance of Risk/Resistance Factors
  Measured by p-value: the probability that the case/control distribution among individuals exposed to the risk factor happened by chance
  Computed from the binomial distribution
  Searching for risk factors among many SNPs requires multiple-testing adjustment of the p-value
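A minimal sketch of how such an unadjusted p-value might be computed. The slides only say that the binomial distribution is used; the null model here (each exposed individual is a case with probability equal to the overall case fraction, one-sided upper tail) is an assumption, and the toy genotypes and all function names are invented for illustration.

```python
from math import comb


def matches(msc, genotype):
    """msc maps SNP index -> required value (0, 1, or 2)."""
    return all(genotype[i] == v for i, v in msc.items())


def exposure_counts(msc, genotypes, status):
    """Number of cases and controls in which the MSC is present."""
    cases = sum(1 for g, s in zip(genotypes, status) if s == 1 and matches(msc, g))
    controls = sum(1 for g, s in zip(genotypes, status) if s == 0 and matches(msc, g))
    return cases, controls


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def risk_p_value(msc, genotypes, status):
    """Assumed null model: an exposed individual is a case with the
    overall case frequency; the p-value is the upper binomial tail."""
    cases, controls = exposure_counts(msc, genotypes, status)
    exposed = cases + controls
    if exposed == 0:
        return 1.0
    p_case = sum(status) / len(status)
    return binomial_tail(cases, exposed, p_case)


# Toy data (invented): 7 genotypes over 3 SNPs, status 1 = case, 0 = control.
genotypes = [
    [0, 1, 1], [0, 0, 1], [2, 1, 1], [1, 0, 1],
    [0, 2, 1], [0, 1, 1], [1, 0, 0],
]
status = [0, 1, 1, 1, 1, 1, 0]
msc = {2: 1}  # third SNP fixed to value 1

print(exposure_counts(msc, genotypes, status))  # (5, 1)
print(risk_p_value(msc, genotypes, status))
```

On a sample this small the p-value is far from significant; the point is only the shape of the computation.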
Multiple-testing adjustment
  Bonferroni
    Easy to compute
    Overly conservative
    If the reported SNP is found among 100 SNPs, then the probability that the SNP appears associated with the disease by mere chance becomes 100 times larger
  Randomization
    Randomly permute the disease status of the population to generate samples
    Apply the search method to each sample and collect the MSCs found
    Count the number of MSCs that have a smaller unadjusted p-value than the observed p-value
    If this number is < 500, then the observed MSC is significant
    Computationally expensive
    More accurate
  In our search we report only MSCs with adjusted p-value < 0.05
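Both adjustments can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the talk: the search is replaced by a simple one-SNP stand-in, ties are counted with <= (slightly more conservative than the strict comparison on the slide), and the number of permutations, the toy data, and all names are invented.

```python
import random
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def best_single_snp_p_value(genotypes, status):
    """Stand-in search: smallest unadjusted p-value over one-SNP risk factors."""
    p_case = sum(status) / len(status)
    best = 1.0
    for snp in range(len(genotypes[0])):
        for value in (0, 1, 2):
            exposed = [s for g, s in zip(genotypes, status) if g[snp] == value]
            if exposed:
                best = min(best, binomial_tail(sum(exposed), len(exposed), p_case))
    return best


def bonferroni(p_value, num_tests):
    """Multiply by the number of tests, capped at 1."""
    return min(1.0, p_value * num_tests)


def permutation_adjusted_p_value(genotypes, status, search,
                                 num_permutations=1000, seed=0):
    """Fraction of status permutations whose best p-value is at least
    as small as the one observed on the real labels."""
    rng = random.Random(seed)
    observed = search(genotypes, status)
    shuffled = list(status)
    at_least_as_good = 0
    for _ in range(num_permutations):
        rng.shuffle(shuffled)  # permute disease status, keep genotypes fixed
        if search(genotypes, shuffled) <= observed:
            at_least_as_good += 1
    return at_least_as_good / num_permutations


# Toy data (invented): status 1 = case, 0 = control.
genotypes = [
    [0, 1, 1], [0, 0, 1], [2, 1, 1], [1, 0, 1],
    [0, 2, 1], [0, 1, 1], [1, 0, 0],
]
status = [0, 1, 1, 1, 1, 1, 0]

adjusted = permutation_adjusted_p_value(
    genotypes, status, best_single_snp_p_value, num_permutations=200, seed=1)
print(adjusted)
print(bonferroni(0.0001, 100))
```

The cost of randomization is visible in the loop: the whole search is rerun once per permutation, which is why it is far more expensive than Bonferroni.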
Disease Association Search
  Problem formulation
    Given: case/control study data consisting of n genotypes, each containing values of m SNPs and a disease status
    Find: all risk/resistance factors (MSCs) with multiple-testing adjusted p-value below 0.05
Searching Approaches
  Exhaustive search (ES)
    Computationally infeasible: searching for a 3-SNP MSC on a sample with n genotypes and m SNPs requires O(n 3^m)
  Cluster C = subset of genotypes which share the same MSC
  Case-closure of an MSC C is an MSC C' with the maximum number of SNPs with fixed values that is present in the same set of cases and in the minimum number of controls
  Efficient way of finding the case-closure: extend the MSC with those SNPs that have a common value in all of its cases
  Example
    MSC x x 1 x x 2 x x x: present in 2 cases : 2 controls
    Case-closure MSC' x x 1 x x 2 x 0 x: present in 2 cases : 1 control
  [Figure: genotype matrices with rows labeled control, case, case, case, control, before and after case-closure]
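The case-closure rule (extend the MSC with every SNP that takes one common value in all of its cases) can be sketched as below. The data are invented so that closing the MSC drops one control, in the spirit of the slide's example; function names are hypothetical.

```python
def matches(msc, genotype):
    """msc maps SNP index -> required value (0, 1, or 2)."""
    return all(genotype[i] == v for i, v in msc.items())


def case_closure(msc, genotypes, status):
    """Fix every additional SNP whose value is shared by all cases
    in which the MSC is present (status 1 = case, 0 = control)."""
    cases = [g for g, s in zip(genotypes, status) if s == 1 and matches(msc, g)]
    closed = dict(msc)
    if not cases:
        return closed
    for snp in range(len(cases[0])):
        if snp in closed:
            continue
        values = {g[snp] for g in cases}
        if len(values) == 1:          # all cases agree on this SNP
            closed[snp] = values.pop()
    return closed


def count_controls(msc, genotypes, status):
    return sum(1 for g, s in zip(genotypes, status) if s == 0 and matches(msc, g))


# Toy data (invented): 5 genotypes over 4 SNPs.
genotypes = [
    [1, 2, 1, 0],  # control
    [1, 2, 0, 1],  # case
    [1, 2, 0, 2],  # case
    [0, 2, 0, 1],  # case (MSC below is not present here)
    [1, 2, 0, 0],  # control
]
status = [0, 1, 1, 1, 0]

msc = {0: 1, 1: 2}
closed = case_closure(msc, genotypes, status)
print(closed)                                     # {0: 1, 1: 2, 2: 0}
print(count_controls(msc, genotypes, status))     # 2
print(count_controls(closed, genotypes, status))  # 1
```

The set of cases is unchanged by construction, so closing an MSC can only remove controls from its cluster.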
Combinatorial Search
  Combinatorial Search Method (CS):
    Searches only among case-closed MSCs
    Avoids checking clusters with a small number of cases
    Finds significant MSCs faster than ES
  Still too slow for large data
  Further speedup by reducing the number of SNPs
    Indexing: compress S by extracting the most informative SNPs
    Use a multiple regression method
Indexing
  Problem formulation
    Given: the full pattern of all SNPs in a sample
    Find: the minimum number of index SNPs that will allow the reconstruction of the complete genotype for each individual
  Computation methods
    Step 1 (Index SNPs Selection Algorithm): find the index SNPs (SNP positions) in the sample
    Step 2 (SNP Prediction Algorithm): from the index SNP values (0, 1, 2), reconstruct the complete genotype
MLR Indexing
  SNP Prediction Algorithm
    Based on Multiple Linear Regression (MLR)
  Index SNPs Selection Algorithm
    Choose as the first index SNP the SNP which best predicts all other SNPs
    Choose as the next index SNP the one which, together with the first, best predicts all other SNPs, and so on
5 Data Sets
  Crohn's disease (Daly et al.): inflammatory bowel disease (IBD)
    Location: 5q31
    Number of SNPs: 103
    Population size: 387 (case: 144, control: 243)
  Autoimmune disorders (Ueda et al.)
    Location: containing genes CD28, CTLA4 and ICOS
    Number of SNPs: 108
    Population size: 1024 (case: 378, control: 646)
  Tick-borne encephalitis (Barkash et al.)
    Location: containing genes TLR3, PKR, OAS1, OAS2, and OAS3
    Number of SNPs: 41
    Population size: 75 (case: 21, control: 54)
  Lung cancer (Dragani et al.)
    Number of SNPs: 141
    Population size: 500 (case: 260, control: 240)
  Rheumatoid arthritis (GAW15)
    Number of SNPs: 2300
    Population size: 920 (case: 460, control: 460)
Results of Disease Association Search
  Indexed versus original
    More statistically significant MSCs are found on the indexed data than on the non-indexed data
  CS versus ES
    Over all datasets, CS finds no fewer MSCs than ES
    For some datasets ES could not find any significant MSC in a reasonable amount of time, while CS could
  Conclusion
    The proposed indexing approach and the CS method are very promising techniques
  CS is still slow
    Alternatively, we can search not for all MSCs but for the best MSC
The most associated MSC
  Optimum Association Search Problem
    Given: case/control study data
    Find: the MSC that is the most associated with the disease = the MSC which is present in a control-free cluster of maximum size
    (Cluster C = subset of genotypes which share the same MSC)
  Complexity
    Generalization of maximum independent set
    NP-complete and cannot be well approximated
  Hope
    Sample S is not arbitrary
    Biological structure
Complimentary Greedy Search (CGS)
  Intuition: the greedy algorithm for finding a maximum independent set by removing highest-degree vertices
  Algorithm:
    1. Start with the empty MSC, which is present in all genotypes
    2. Find the SNP with an allele value whose fixing removes the set of genotypes with the highest ratio of controls over cases (max(controls/cases))
    3. Add the SNP to the resulting MSC
    4. Repeat 2-3 until all controls are removed
    5. Output the resulting MSC
    6. Adjust the p-value of the resulting MSC for multiple testing
  Extremely fast but inaccurate
  [Figure: cases and controls]
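Steps 1-5 can be sketched as follows. Several details are assumptions that the slide does not spell out: a candidate must remove at least one control and keep at least one case, a candidate that removes no cases is treated as having an infinite ratio, ties are broken by the number of controls removed, and the loop stops early if no candidate qualifies. The multiple-testing adjustment of step 6 is left out, and the toy data are invented.

```python
def complimentary_greedy_search(genotypes, status, values=(0, 1, 2)):
    """Greedy sketch: repeatedly fix the (SNP, value) pair whose removed
    genotypes have the highest controls/cases ratio (1 = case, 0 = control).
    Returns the MSC and the (genotype, status) pairs still in its cluster."""
    remaining = list(zip(genotypes, status))
    msc = {}
    num_snps = len(genotypes[0])
    while any(s == 0 for _, s in remaining):
        best = None
        for snp in range(num_snps):
            if snp in msc:
                continue
            for value in values:
                kept = [(g, s) for g, s in remaining if g[snp] == value]
                removed = [s for g, s in remaining if g[snp] != value]
                removed_controls = removed.count(0)
                removed_cases = removed.count(1)
                # Assumption: require progress and a non-empty set of cases.
                if removed_controls == 0 or not any(s == 1 for _, s in kept):
                    continue
                ratio = (removed_controls / removed_cases
                         if removed_cases else float("inf"))
                key = (ratio, removed_controls)
                if best is None or key > best[0]:
                    best = (key, snp, value, kept)
        if best is None:
            break  # some control cannot be separated from the cases
        _, snp, value, remaining = best
        msc[snp] = value
    return msc, remaining


# Toy data (invented): 5 genotypes over 4 SNPs.
genotypes = [
    [1, 2, 1, 0],  # control
    [1, 2, 0, 1],  # case
    [1, 2, 0, 2],  # case
    [0, 2, 0, 1],  # case
    [1, 2, 0, 0],  # control
]
status = [0, 1, 1, 1, 0]

msc, cluster = complimentary_greedy_search(genotypes, status)
print(msc)           # {2: 0, 3: 1}
print(len(cluster))  # 2 genotypes, both cases
```

On this toy sample the first step removes one control at no cost in cases, and the second step gives up one case to remove the last control, which shows why the greedy result is not guaranteed to be the largest control-free cluster.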
CGS Results
  CGS finds MSCs with non-trivially high association on real data
  In a reasonable amount of time, CGS finds more significant MSCs on the full dataset than CS finds on the indexed dataset
Future Work
  CGS is fast, so it can be used as a basic operation in case/control data analysis
    Cover the data with clusters corresponding to MSCs found by CGS and analyze the SNPs which belong to many MSCs
    Build a classifier (prediction) based on the MSCs found by CGS
  We plan to randomize CGS using simulated annealing to find more significant MSCs with a smaller number of SNPs
Genetic Susceptibility Prediction
  Problem formulation
    Given: case/control study data S and the genotype of a testing individual t
    Predict: the disease status of the testing individual
  [Figure: SNP matrix of case genotypes and control genotypes with a disease status column, plus the testing genotype g_t with unknown status "?"]
Cross-validation
  Leave-one-out test
    The disease status of each genotype in the data set is predicted while the rest of the data is regarded as the training set
  Leave-many-out test
    Repeatedly pick 2/3 of the population at random as the training set and predict the other 1/3
  [Figure: genotypes with real and predicted disease status; accuracy = 80%]
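The leave-one-out test can be sketched with any predictor plugged in. The nearest-neighbour predictor below is only a stand-in chosen to keep the example short; it is not one of the prediction methods discussed in the talk, and the data and names are invented.

```python
def hamming(a, b):
    """Number of SNPs at which two genotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor_predict(train_genotypes, train_status, test_genotype):
    """Stand-in predictor: status of the closest training genotype."""
    closest = min(range(len(train_genotypes)),
                  key=lambda i: hamming(train_genotypes[i], test_genotype))
    return train_status[closest]


def leave_one_out(genotypes, status, predict):
    """Predict each genotype's status from all the other genotypes."""
    predictions = []
    for i in range(len(genotypes)):
        train_genotypes = genotypes[:i] + genotypes[i + 1:]
        train_status = status[:i] + status[i + 1:]
        predictions.append(predict(train_genotypes, train_status, genotypes[i]))
    return predictions


# Toy data (invented): status 1 = case, 0 = control.
genotypes = [[0, 0], [0, 0], [1, 1], [1, 1]]
status = [0, 0, 1, 1]

predicted = leave_one_out(genotypes, status, nearest_neighbor_predict)
accuracy = sum(p == s for p, s in zip(predicted, status)) / len(status)
print(predicted)  # [0, 0, 1, 1]
print(accuracy)   # 1.0
```

A leave-many-out test would follow the same pattern, with a random 2/3 of the indices used for training in each repetition instead of all but one.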
Quality Measures of Prediction
  Sensitivity: the ability to correctly detect cases
    sensitivity = TP/(TP+FN)
  Specificity: the ability to avoid calling a control a case
    specificity = TN/(FP+TN)
  Accuracy = (TP+TN)/(TP+FP+FN+TN)
  Risk rate: measurement for risk factors
  Confusion table:
                         Original case         Original control
    Predicted case       True Positive (TP)    False Positive (FP)
    Predicted control    False Negative (FN)   True Negative (TN)
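The three formulas above translate directly into code. The function names and the example labels are made up for illustration; returning NaN when a denominator is zero is a choice made here, not something stated in the talk.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN) with 1 = case and 0 = control."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn


def ratio(numerator, denominator):
    """Guarded division: undefined measures are reported as NaN."""
    return numerator / denominator if denominator else float("nan")


def quality_measures(actual, predicted):
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return {
        "sensitivity": ratio(tp, tp + fn),            # TP/(TP+FN)
        "specificity": ratio(tn, fp + tn),            # TN/(FP+TN)
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


# Hypothetical labels for five individuals.
actual = [1, 1, 1, 0, 0]
predicted = [1, 1, 0, 0, 0]

measures = quality_measures(actual, predicted)
print(confusion_counts(actual, predicted))  # (2, 0, 1, 2)
print(measures["accuracy"])                 # 0.8
```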
Prediction Methods
  Support vector machine
  Random forest
  LP-based prediction
  Drawback of this prediction problem formulation: it requires cross-validation and involves no optimization
Optimum Clustering Problem
  Given: case/control study data represented by a population sample S
  Find: a partition P of S into clusters S = S_1 ∪ ... ∪ S_k, with disease status 0 or 1 assigned to each cluster S_i, minimizing entropy(P) assuming 0 errors
  Clustering P = partition into clusters defined by MSCs
From Clustering to Prediction
  Intuition
    If the tested genotype is predicted correctly, then the optimum clustering will have smaller entropy
  Model-fitting prediction algorithm
    Set the status of the testing genotype to diseased
      Add it to the training dataset
      Find the optimum clustering of the dataset
    Set the status of the testing genotype to non-diseased
      Add it to the training dataset
      Find the optimum clustering of the dataset
    Predict the status according to the case with smaller entropy
Results of Prediction Methods: Leave-One-Out Cross-Validation
  Leave-one-out cross-validation results for combinatorial search-based prediction (CSP) and complimentary greedy search-based prediction (CGSP) are given when 20, 30, or all SNPs are chosen as informative SNPs
ROC curve
  Comparison of 5 prediction methods on the (Barkash et al., 2006) data, using all SNPs
  The area under CSP's curve is 0.81 vs. 0.52 under SVM's curve
Conclusions
  Combinatorial search is able to find statistically significant multi-gene interactions in data where no significant association was detected before
  Complimentary greedy search can be used in susceptibility prediction
  Optimization approach to prediction
  The new susceptibility prediction accuracy is 15% higher than the best previously known
  MLR-tagging efficiently reduces the datasets, making it possible to find associated multi-SNP combinations and predict susceptibility
Thank You!
  Poster #14: Case(Control)-Free Multi-SNP Combinations in Case-Control Studies, Algorithmic Biology 2006
  Paper: Combinatorial Methods for Disease Association Search and Susceptibility Prediction, WABI 2006