Discrete Algorithms for Disease Association Search and Susceptibility Prediction
Dumitru Brinza, Department of Computer Science, Georgia State University
UCSD, November 29, 2006

Outline
- SNPs, haplotypes and genotypes
- Disease association analysis
- Multiple-testing adjustment
- MLR indexing for data compression
- Optimum data clustering
- Predicting susceptibility to complex diseases
- Conclusions

SNPs, Haplotypes, Genotypes
- Human genome: all the genetic material in the chromosomes, about 3×10^9 base pairs long.
- Any two people differ in about 0.1% of the genome.
- SNP (single nucleotide polymorphism): a site where two or more different nucleotides occur in a large percentage of the population.
- Diploid: two different copies of each chromosome.
- Haplotype: description of a single copy (expensive to obtain). Example: 00110101 (0 is the major allele, 1 is the minor allele).
- Genotype: description of the two copies mixed together. Example: 01122110 (0 = 00, 1 = 11, 2 = 01).
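The genotype encoding above (0 = 00, 1 = 11, 2 = 01) can be illustrated with a minimal sketch that mixes two haplotypes into one genotype. The function name and the two example haplotypes are hypothetical and chosen only for illustration.

```python
def genotype_from_haplotypes(hap_a: str, hap_b: str) -> str:
    """Combine two haplotypes (strings over 0/1) into a genotype string."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same SNPs")
    codes = []
    for a, b in zip(hap_a, hap_b):
        if a != b:
            codes.append("2")   # heterozygous site (01)
        else:
            codes.append(a)     # homozygous site: 00 -> 0, 11 -> 1
    return "".join(codes)


print(genotype_from_haplotypes("00110101", "01100111"))  # prints 02120121
```

The mapping is many-to-one: the genotype records which sites are heterozygous but not which copy carries the minor allele, which is why haplotypes are harder to obtain than genotypes.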

Types of Diseases
Monogenic disease
- A mutated gene is entirely responsible for the disease.
- The mutation breaks a pathway and there is no other compensatory pathway.
- Typically rare in the population: < 0.1%.
Complex disease
- Interaction of multiple genes:
  - One mutation does not cause the disease.
  - Breakage of all compensatory pathways causes the disease.
  - Hard to analyze: a 2-gene interaction analysis for a genome-wide scan with 1 million SNPs requires about 10^12 pairwise tests.
- Multiple independent causes:
  - There are different causes, and each can be the result of an interaction of several genes.
  - Each cause explains a certain percentage of cases.
- Common diseases are complex: > 0.1%. In New York City, 12% of the population has type 2 diabetes.

Case/Control Study
Given: a population of n genotypes, each containing the values of m SNPs and a disease status.
[Figure: example matrix of eight 16-SNP genotypes (rows), split into case genotypes and control genotypes, with SNPs as columns and a disease-status column.]
Disease association analysis searches for a risk (resistance) factor whose frequency among case (control) individuals is considerably higher than among control (case) individuals.

Risk/Resistance Factors
Risk/resistance factor = one SNP with a fixed allele value.

SNP:  1 2 3 4 5 6 7 8 9   status
      0 1 1 0 1 2 0 0 2   control
      0 1 1 0 1 2 1 0 2   case
      0 1 1 1 0 2 0 0 1   case
      0 0 1 0 0 0 0 2 1   case
      0 1 1 1 1 2 0 0 1   case
      0 0 1 0 1 2 1 0 2   case
      0 1 0 0 1 1 0 0 2   control

The third SNP with fixed allele value 1 is a risk factor: its frequency among case individuals is higher than among control individuals (present in 5 cases : 1 control).
We generalize the risk/resistance factor to a multi-SNP combination.

Multi-SNP Extension
Multi-SNP combination (MSC):
- a subset of SNP columns of S (the set of SNPs)
- with fixed values of these SNPs: 0, 1, or 2

SNP:  1 2 3 4 5 6 7 8 9   status
      0 1 1 0 1 2 0 0 2   control
      0 1 1 0 1 2 1 0 2   case
      0 1 1 1 0 2 0 0 1   case
      0 0 1 0 0 0 0 2 1   case
      0 1 1 1 1 2 0 0 1   case
      0 0 1 0 1 2 1 0 2   case
      0 1 0 0 1 1 0 0 2   control
MSC:  x x 1 x x 2 x x x

The MSC is present in 4 cases : 1 control. Check its significance.
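Counting how many cases and controls carry an MSC can be sketched as follows, using the seven genotypes listed above. The representation is an assumption made for illustration: genotypes are strings, and an MSC is a dict mapping a 0-based SNP position to its fixed value.

```python
GENOTYPES = [
    "011012002", "011012102", "011102001", "001000021",
    "011112001", "001012102", "010011002",
]
STATUSES = ["control", "case", "case", "case", "case", "case", "control"]


def has_msc(genotype, msc):
    """True if the genotype carries every fixed SNP value of the MSC."""
    return all(genotype[pos] == value for pos, value in msc.items())


def msc_counts(genotypes, statuses, msc):
    """Return (cases, controls) in which the MSC is present."""
    cases = controls = 0
    for genotype, status in zip(genotypes, statuses):
        if has_msc(genotype, msc):
            if status == "case":
                cases += 1
            else:
                controls += 1
    return cases, controls


print(msc_counts(GENOTYPES, STATUSES, {2: "1"}))          # third SNP = 1
print(msc_counts(GENOTYPES, STATUSES, {2: "1", 5: "2"}))  # x x 1 x x 2 x x x
```

The two calls reproduce the counts stated on the slides: 5 cases : 1 control for the single SNP, and 4 cases : 1 control for the two-SNP combination.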

Significance of Risk/Resistance Factors
- Measured by the p-value: the probability that the case/control distribution among individuals exposed to the risk factor happened by chance.
- Computed with the binomial distribution.
- Searching for risk factors among many SNPs requires a multiple-testing adjustment of the p-value.
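One way to compute such a p-value is sketched below. The slide says only that the binomial distribution is used; the specific null model here is an assumption: each exposed individual is a case with probability equal to the overall case fraction, and the p-value is the one-sided upper tail. The function name is hypothetical.

```python
from math import comb


def binomial_tail_p(cases_exposed, exposed, case_fraction):
    """P(X >= cases_exposed) for X ~ Binomial(exposed, case_fraction).

    Assumed null model: exposure is unrelated to disease, so each exposed
    individual is a case with probability case_fraction.
    """
    return sum(
        comb(exposed, k) * case_fraction**k * (1 - case_fraction) ** (exposed - k)
        for k in range(cases_exposed, exposed + 1)
    )


# An MSC present in 4 cases and 1 control, in a sample with 5 cases out of 7.
p = binomial_tail_p(cases_exposed=4, exposed=5, case_fraction=5 / 7)
print(round(p, 3))  # prints 0.558
```

With only seven genotypes the unadjusted p-value is far from significant, which illustrates why realistic sample sizes and a multiple-testing adjustment both matter.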

Multiple-Testing Adjustment
Bonferroni
- Easy to compute.
- Overly conservative.
- If the reported SNP is found among 100 SNPs, then the probability that the SNP is associated with the disease by mere chance becomes 100 times larger.
Randomization
- Randomly permute the disease status of the population to generate 10,000 samples.
- Apply the search method to each sample and get MSCs.
- Count the number of MSCs that have a smaller unadjusted p-value than the observed p-value.
- If this number is < 500 (that is, below 5% of the 10,000 samples), the observed MSC is significant.
- Computationally expensive.
- More accurate.
In our search we report only MSCs with adjusted p-value < 0.05.
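The randomization procedure can be sketched as a small permutation loop. This is an illustration under assumptions, not the authors' implementation: `best_p_for` is a hypothetical callable that runs the chosen search method on permuted disease statuses and returns the smallest unadjusted p-value it finds, and the fixed seed is only there to make the sketch reproducible.

```python
import random


def randomization_adjusted_p(statuses, observed_p, best_p_for,
                             n_perm=10000, seed=0):
    """Fraction of permuted samples whose best MSC beats the observed p-value."""
    rng = random.Random(seed)
    labels = list(statuses)          # copy, so the caller's list is untouched
    smaller = 0
    for _ in range(n_perm):
        rng.shuffle(labels)          # random reassignment of disease status
        if best_p_for(labels) < observed_p:
            smaller += 1
    return smaller / n_perm


# Stub search methods, used only to show the calling convention.
never_better = lambda labels: 1.0
always_better = lambda labels: 0.0
print(randomization_adjusted_p([1, 1, 0, 0], 0.01, never_better, n_perm=100))
print(randomization_adjusted_p([1, 1, 0, 0], 0.50, always_better, n_perm=100))
```

With 10,000 permutations, an adjusted value below 0.05 corresponds to the "fewer than 500" rule above.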

Disease Association Search
Problem formulation
- Given: case/control study data consisting of n genotypes, each containing the values of m SNPs and a disease status.
- Find: all risk/resistance factors (MSCs) with multiple-testing-adjusted p-value below 0.05.

Searching Approaches
Exhaustive search (ES)
- Computationally infeasible.
- Searching for 3-SNP MSCs on a sample with n genotypes and m SNPs requires O(n 3m ).
Cluster C = the subset of genotypes that share the same MSC.
Case-closure of an MSC C is an MSC C' with the maximum number of SNPs with fixed values that is present in the same set of cases and in the minimum number of controls.
Efficient way of finding the case-closure: extend the MSC with those SNPs that have common values in all cases.

SNP:  1 2 3 4 5 6 7 8 9   status
      0 1 1 0 1 2 0 0 2   control
      0 1 1 0 1 2 1 0 2   case
      2 0 1 1 0 2 0 0 1   case
      0 0 1 0 0 0 0 2 1   case
      0 1 1 0 1 2 0 0 2   control
MSC:  x x 1 x x 2 x x x

The MSC is present in 2 cases : 2 controls.

Case-closure:

SNP:  1 2 3 4 5 6 7 8 9   status
      0 2 1 0 1 2 0 1 2   control
      0 1 1 0 1 2 1 0 2   case
      2 0 1 1 0 2 0 0 1   case
      0 0 1 0 0 0 0 2 1   case
      0 1 1 0 1 2 0 0 2   control
MSC': x x 1 x x 2 x 0 x

The MSC' is present in 2 cases : 1 control.
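The case-closure rule (extend the MSC with every SNP that has a common value in all cases carrying it) can be sketched as follows, using the five genotypes of the case-closure example. The data representation (genotype strings, an MSC as a dict from 0-based SNP position to value) and the function names are assumptions for illustration.

```python
GENOTYPES = ["021012012", "011012102", "201102001", "001000021", "011012002"]
STATUSES = ["control", "case", "case", "case", "control"]


def has_msc(genotype, msc):
    return all(genotype[pos] == value for pos, value in msc.items())


def case_closure(genotypes, statuses, msc):
    """Extend msc with every SNP whose value is shared by all cases carrying it."""
    cases = [g for g, s in zip(genotypes, statuses)
             if s == "case" and has_msc(g, msc)]
    closed = dict(msc)
    if not cases:
        return closed
    for pos in range(len(cases[0])):
        values = {g[pos] for g in cases}
        if len(values) == 1:            # all cases agree at this SNP
            closed[pos] = values.pop()
    return closed


def counts(genotypes, statuses, msc):
    """Return (cases, controls) in which the MSC is present."""
    present = [s for g, s in zip(genotypes, statuses) if has_msc(g, msc)]
    return present.count("case"), present.count("control")


msc = {2: "1", 5: "2"}
closed = case_closure(GENOTYPES, STATUSES, msc)
print(closed, counts(GENOTYPES, STATUSES, msc), counts(GENOTYPES, STATUSES, closed))
```

The closure adds the eighth SNP with value 0, which keeps both cases and drops one of the two controls, matching the 2 : 2 and 2 : 1 counts in the example.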

Combinatorial Search
Combinatorial Search Method (CS):
- Searches only among case-closed MSCs.
- Avoids checking clusters with a small number of cases.
- Finds significant MSCs faster than ES.
- Still too slow for large data.
- Further speedup is possible by reducing the number of SNPs.
Indexing: compress S by extracting the most informative SNPs.
- Uses the multiple regression method.

Indexing
Problem formulation
- Given: the full pattern of all SNPs in a sample.
- Find: the minimum number of index SNPs that will allow the reconstruction of the complete genotype for each individual.
Computation methods
- Step 1 (index SNP selection algorithm): find the index SNP positions in the sample.
- Step 2 (SNP prediction algorithm): from the index SNP values (0, 1, 2), reconstruct the complete genotype.

MLR Indexing
SNP prediction algorithm
- Based on multiple linear regression (MLR).
Index SNP selection algorithm
- Choose as the first index SNP the SNP that best predicts all other SNPs.
- Choose as the next index SNP the one that, together with the first, best predicts all other SNPs, and so on.

5 Data Sets
- Crohn's disease (Daly et al.), an inflammatory bowel disease (IBD). Location: 5q31. SNPs: 103. Population size: 387 (144 cases, 243 controls).
- Autoimmune disorders (Ueda et al.). Location: region containing the genes CD28, CTLA4 and ICOS. SNPs: 108. Population size: 1024 (378 cases, 646 controls).
- Tick-borne encephalitis (Barkash et al.). Location: region containing the genes TLR3, PKR, OAS1, OAS2 and OAS3. SNPs: 41. Population size: 75 (21 cases, 54 controls).
- Lung cancer (Dragani et al.). SNPs: 141. Population size: 500 (260 cases, 240 controls).
- Rheumatoid arthritis (GAW15). SNPs: 2300. Population size: 920 (460 cases, 460 controls).

Results of Disease Association Search
Indexed versus original
- More statistically significant MSCs are found on the indexed data than on the non-indexed data.
CS versus ES
- Over all datasets, CS finds no fewer MSCs than ES.
- For some datasets ES could not find any significant MSC in a reasonable amount of time, while CS did.
Conclusion
- The proposed indexing approach and the CS method are very promising techniques.
- CS is still slow.
- Alternatively, we can search not for all MSCs but for the best MSC.

The Most Associated MSC
Optimum Association Search Problem
- Given: case/control study data.
- Find: the MSC that is most associated with the disease, that is, the MSC present in a control-free cluster of maximum size.
- Cluster C = the subset of genotypes that share the same MSC.
Complexity
- Generalization of maximum independent set.
- NP-complete and cannot be well approximated.
Hope
- The sample S is not arbitrary.
- It has biological structure.

Complimentary Greedy Search (CGS)
Intuition: the greedy algorithm for finding a maximum independent set by removing highest-degree vertices.
Algorithm:
1. Start with the empty MSC, which is present in all genotypes.
2. Find the SNP with an allele value whose addition removes the set of genotypes with the highest ratio of controls to cases (max(controls/cases)).
3. Add that SNP to the resulting MSC.
4. Repeat steps 2-3 until all controls are removed.
5. Output the resulting MSC.
6. Adjust the p-value of the resulting MSC for multiple testing.
Extremely fast but inaccurate.
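Steps 1 to 5 can be sketched as below on a seven-genotype toy sample. Several details are assumptions that the slide does not specify: statuses are coded 1 = case and 0 = control, a candidate that removes controls but no cases gets an infinite ratio, ties are broken by the number of controls removed, candidates that would leave no cases are skipped, and the loop stops early if no candidate removes a control. The function name is hypothetical, and the p-value adjustment of step 6 is left out.

```python
GENOTYPES = [
    "011012002", "011012102", "011102001", "001000021",
    "011112001", "001012102", "010011002",
]
STATUSES = [0, 1, 1, 1, 1, 1, 0]   # 1 = case, 0 = control


def complimentary_greedy_search(genotypes, statuses):
    """Return (msc, cluster): the greedy MSC and the indices still carrying it."""
    msc = {}
    cluster = list(range(len(genotypes)))
    while any(statuses[i] == 0 for i in cluster):
        best = None                                   # (score, snp, value, kept)
        for snp in range(len(genotypes[0])):
            if snp in msc:
                continue
            for value in sorted({genotypes[i][snp] for i in cluster}):
                kept = [i for i in cluster if genotypes[i][snp] == value]
                removed = [i for i in cluster if genotypes[i][snp] != value]
                rem_controls = sum(1 for i in removed if statuses[i] == 0)
                rem_cases = len(removed) - rem_controls
                kept_cases = sum(statuses[i] for i in kept)
                if rem_controls == 0 or kept_cases == 0:
                    continue                          # no progress, or no cases left
                ratio = float("inf") if rem_cases == 0 else rem_controls / rem_cases
                score = (ratio, rem_controls)
                if best is None or score > best[0]:
                    best = (score, snp, value, kept)
        if best is None:
            break                                     # remaining controls cannot be removed
        _, snp, value, cluster = best
        msc[snp] = value
    return msc, cluster


print(complimentary_greedy_search(GENOTYPES, STATUSES))
```

On this sample the sketch first fixes the third SNP to 1 (removing one control and no cases) and then the ninth SNP to 1, ending with a control-free cluster of three cases.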

CGS Results
- CGS finds MSCs with non-trivially high association on real data.
- CGS finds more significant MSCs on the full dataset than CS finds on the indexed dataset in a reasonable amount of time.

Future Work
CGS is fast, so it can be used as a basic operation in case/control data analysis:
- Cover the data with clusters corresponding to MSCs found by CGS and analyze SNPs that belong to many MSCs.
- Build a classifier (predictor) based on MSCs found by CGS.
We plan to randomize CGS using simulated annealing to find more significant MSCs with a smaller number of SNPs.

Genetic Susceptibility Prediction
Problem formulation
- Given: case/control study data S and the genotype of a testing individual t.
- Predict: the disease status of the testing individual.
[Figure: matrix of case genotypes and control genotypes (rows) over SNPs (columns) with a disease-status column; the testing genotype g_t = 0110211101211201 has unknown status (?).]

Cross-validation
Leave-one-out test
- The disease status of each genotype in the data set is predicted while the rest of the data is regarded as the training set.
Leave-many-out test
- Repeatedly pick a random 2/3 of the population as the training set and predict the other 1/3.
[Figure: table of genotypes with real and predicted disease status; in the example, accuracy = 80%.]
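The leave-one-out test can be sketched as a loop that works with any prediction method. Everything named here is hypothetical: `predict` stands for whichever method is being evaluated, and the majority-vote predictor and the five-genotype sample are placeholders chosen only to keep the example short (statuses coded 1 = case, 0 = control).

```python
def leave_one_out_accuracy(genotypes, statuses, predict):
    """Predict each genotype from all the others and return the hit rate."""
    correct = 0
    for i in range(len(genotypes)):
        train_g = genotypes[:i] + genotypes[i + 1:]
        train_s = statuses[:i] + statuses[i + 1:]
        if predict(train_g, train_s, genotypes[i]) == statuses[i]:
            correct += 1
    return correct / len(genotypes)


def majority_predictor(train_g, train_s, test_g):
    """Placeholder method: ignore the genotype, predict the majority status."""
    return 1 if 2 * sum(train_s) >= len(train_s) else 0


genotypes = ["0101", "0220", "0200", "0020", "1101"]
statuses = [1, 1, 1, 1, 0]
print(leave_one_out_accuracy(genotypes, statuses, majority_predictor))  # prints 0.8
```

The four cases are predicted correctly and the single control is not, which gives 4/5 = 80% accuracy for this placeholder method.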

Quality Measures of Prediction
- Sensitivity: the ability to correctly detect cases. sensitivity = TP/(TP+FN)
- Specificity: the ability to avoid calling a control a case. specificity = TN/(FP+TN)
- Accuracy = (TP+TN)/(TP+FP+FN+TN)
- Risk rate: measurement for risk factors.

Confusion table:
                      Original case          Original control
Predicted case        True Positive (TP)     False Positive (FP)
Predicted control     False Negative (FN)    True Negative (TN)
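The three formulas can be checked on a small made-up example. The function name and the status coding (1 = case, 0 = control) are assumptions, and the sketch assumes the sample contains at least one case and one control so that no denominator is zero.

```python
def prediction_quality(real, predicted):
    """Sensitivity, specificity and accuracy from real and predicted statuses."""
    pairs = list(zip(real, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)   # cases called cases
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)   # controls called cases
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)   # cases called controls
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)   # controls called controls
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


real = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
print(prediction_quality(real, predicted))
```

Here TP = 3, FN = 1, TN = 3 and FP = 1, so all three measures equal 0.75.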

Prediction Methods
- Support vector machine
- Random forest
- LP-based prediction
Drawback of the prediction problem formulation: it needs cross-validation and involves no optimization.

Optimum Clustering Problem
- Given: case/control study data represented by a population sample S.
- Find: a partition P of S into clusters, S = S_1 ∪ ... ∪ S_k, with disease status 0 or 1 assigned to each cluster S_i, minimizing entropy(P) assuming 0 errors.
Clustering P = partition into clusters defined by MSCs.

From Clustering to Prediction
Intuition
- If the tested genotype is predicted correctly, then the optimum clustering will have smaller entropy.
Model-fitting prediction algorithm
- Set the status of the testing genotype to diseased, add it to the training dataset, and find the optimum clustering of the dataset.
- Set the status of the testing genotype to non-diseased, add it to the training dataset, and find the optimum clustering of the dataset.
- Predict the status according to the case with smaller entropy.
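The model-fitting idea can be sketched with the clustering step passed in as a callable. This is a toy illustration under loud assumptions: the entropy is taken to be the Shannon entropy (base 2) of the cluster-size distribution, which the slides do not spell out; `toy_clustering_entropy` is a stand-in that puts identical genotypes with the same status in one cluster and is not the MSC-based optimum clustering; and ties are resolved toward non-diseased.

```python
from collections import Counter
from math import log2


def partition_entropy(cluster_sizes):
    """Assumed measure: Shannon entropy of the cluster-size distribution."""
    sizes = list(cluster_sizes)
    total = sum(sizes)
    return -sum((s / total) * log2(s / total) for s in sizes)


def toy_clustering_entropy(genotypes, statuses):
    """Stand-in clustering: one cluster per distinct (genotype, status) pair."""
    return partition_entropy(Counter(zip(genotypes, statuses)).values())


def model_fitting_predict(train_g, train_s, test_g, clustering_entropy):
    """Try both statuses for test_g and keep the one with smaller entropy."""
    h_case = clustering_entropy(train_g + [test_g], train_s + [1])
    h_control = clustering_entropy(train_g + [test_g], train_s + [0])
    return 1 if h_case < h_control else 0


train_g = ["0110", "0110", "2001", "2001"]
train_s = [1, 1, 0, 0]                      # 1 = diseased, 0 = non-diseased
print(model_fitting_predict(train_g, train_s, "0110", toy_clustering_entropy))
print(model_fitting_predict(train_g, train_s, "2001", toy_clustering_entropy))
```

A test genotype that matches existing cases joins their cluster when labeled diseased but forms a new singleton cluster when labeled non-diseased, so the diseased labeling yields the smaller entropy.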

Results of Prediction Methods: Leave-One-Out Cross-Validation
Leave-one-out cross-validation results for combinatorial search-based prediction (CSP) and complimentary greedy search-based prediction (CGSP) are given when 20, 30, or all SNPs are chosen as informative SNPs.

ROC Curve
Comparison of 5 prediction methods on the (Barkash et al., 2006) data on all SNPs. The area under CSP's curve is 0.81 vs. 0.52 under SVM's curve.

Conclusions
- Combinatorial search is able to find statistically significant multi-gene interactions for data where no significant association was detected before.
- Complimentary greedy search can be used in susceptibility prediction.
- Optimization approach to prediction.
- The new susceptibility prediction is 15% higher than the best previously known.
- MLR-tagging efficiently reduces the datasets, making it possible to find associated multi-SNP combinations and predict susceptibility.

Thank You!
- Poster #14: Case(Control)-Free Multi-SNP Combinations in Case-Control Studies (Algorithmic Biology 2006)
- Paper: Combinatorial Methods for Disease Association Search and Susceptibility Prediction (WABI 2006)
