Case(Control)-Free Multi-SNP Combinations in Case-Control Studies
Dumitru Brinza and Alexander Zelikovsky


SNP and Disease:
A SNP (single nucleotide polymorphism) is a site where two or more different nucleotides occur in a large percentage of the population.
0 = wild type/major (frequency) allele
1 = mutation/minor (frequency) allele
2 = heterozygous allele
Searching for genetic risk factors for diseases:
Monogenic diseases: a mutated gene is entirely responsible for the disease.
Complex diseases: affected by the interaction of multiple genes.

Disease association analysis:
Analysis of variation in suspected genes in case and control individuals aims to identify SNPs with considerably higher frequencies among cases than among controls.
Most searches are done on a SNP-by-SNP basis.
Recent two-SNP analysis shows promising results (Marchini et al., 2005).
Multi-SNP analyses are expected to find even stronger disease associations.
Common diseases can be caused by combinations of variations in several unlinked genes (SNPs).
We address the computational challenge of searching for such multi-gene causal combinations.
The number of multi-SNP combinations is infeasibly high (3^100 for 100 SNPs).
How can associated multi-SNP combinations be found without checking all of them?
Disease association analysis searches for SNPs or multi-SNP combinations whose frequency among cases is considerably higher than among controls.

Disease-Associated Multi-SNP Combinations Search:
Given: a population of n genotypes (or haplotypes), each containing the values of m SNPs from {0,1,2} and a disease status (case or control).
Find: all multi-SNP combinations whose multiple-testing-adjusted p-value of the frequency distribution is below 0.05.

Statistical significance:
The significance of a risk factor is usually measured by Risk Rate or Odds Ratio.
We measure significance by the p-value of the set of genotypes defined by the risk factor.
A multi-SNP combination (MSC) defines a set of case and control individuals.
An MSC is considered statistically significant if the distribution of cases and controls in that set has p-value < 0.05.
Many reported findings are not reproducible on different populations.
It is believed that this happens because the p-values are not adjusted for multiple testing.
If the reported SNP is found among 100 SNPs, then the probability that the SNP is associated with a disease by mere chance becomes 100 times larger (Bonferroni).

Combinatorial Search (CS) for Disease-Association:
CS checks all one-SNP, two-SNP, ..., m-SNP case-closed MSCs.
The case-closure of an MSC C is the MSC C' with the maximum number of SNPs that matches the same set of cases and the minimum number of controls.
Case-closure allows statistically significant MSCs to be found at an earlier stage of the search.
Trivial MSCs and MSCs that coincide after case-closure are skipped, which significantly speeds up the search.
Faster than exhaustive search.
Finds more significant associations at an early stage of the search.
Still slow for genome-wide studies.

[Panel titles whose bodies are not captured at this point: Our contributions; Results for Disease Susceptibility Prediction; Results/comparison of searching methods; Data Sets]

Maximum Case(Control)-Free Cluster Problem:
0 1 1 0 1 2 1 0 2  sick
0 1 1 1 0 2 0 0 1  sick
0 0 1 0 0 0 0 2 1  sick
0 1 1 1 1 2 0 0 1  sick
0 0 1 0 1 2 1 0 2  sick
0 1 0 0 1 1 0 0 2  healthy
0 1 1 0 1 2 0 0 2  healthy
x x 1 x x 2 x x x  MSC
The MSC matches 4 sick : 1 healthy; check significance.
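To make the MSC notation concrete, the following Python sketch (an illustration, not the authors' code) encodes the example genotype rows from the poster, counts the cases and controls selected by the MSC `x x 1 x x 2 x x x`, and computes a case-closure under one reading of the definition: keep every SNP value shared by all matched cases. The names `GENOTYPES`, `matches`, `count_status`, and `case_closure` are hypothetical, SNP positions are 0-based, and treating the healthy row `0 1 1 0 1 2 0 0 2` as part of the same table is an assumption made so that the counts agree with the stated 4 sick : 1 healthy.

```python
from typing import Dict, List, Sequence, Tuple

MSC = Dict[int, int]  # 0-based SNP position -> required value in {0, 1, 2}

# Example rows from the poster; the last healthy row is assumed to belong
# to the same table (it appears separately in the transcript).
GENOTYPES: List[Tuple[List[int], str]] = [
    ([0, 1, 1, 0, 1, 2, 1, 0, 2], "sick"),
    ([0, 1, 1, 1, 0, 2, 0, 0, 1], "sick"),
    ([0, 0, 1, 0, 0, 0, 0, 2, 1], "sick"),
    ([0, 1, 1, 1, 1, 2, 0, 0, 1], "sick"),
    ([0, 0, 1, 0, 1, 2, 1, 0, 2], "sick"),
    ([0, 1, 0, 0, 1, 1, 0, 0, 2], "healthy"),
    ([0, 1, 1, 0, 1, 2, 0, 0, 2], "healthy"),
]


def matches(genotype: Sequence[int], msc: MSC) -> bool:
    """True if the genotype has the required value at every SNP fixed by the MSC."""
    return all(genotype[snp] == value for snp, value in msc.items())


def count_status(msc: MSC, data: List[Tuple[List[int], str]]) -> Tuple[int, int]:
    """Return (number of sick, number of healthy) genotypes matched by the MSC."""
    sick = sum(1 for g, status in data if status == "sick" and matches(g, msc))
    healthy = sum(1 for g, status in data if status == "healthy" and matches(g, msc))
    return sick, healthy


def case_closure(msc: MSC, data: List[Tuple[List[int], str]]) -> MSC:
    """Assumed reading of case-closure: fix every SNP on which all matched cases agree.

    Adding constraints shared by all matched cases keeps the case set unchanged
    and can only remove controls, so the result has the most SNPs and the
    fewest controls among MSCs with the same cases.
    """
    cases = [g for g, status in data if status == "sick" and matches(g, msc)]
    if not cases:
        return dict(msc)  # nothing to close over
    first = cases[0]
    return {
        snp: first[snp]
        for snp in range(len(first))
        if all(g[snp] == first[snp] for g in cases)
    }
```

With these rows, `count_status({2: 1, 5: 2}, GENOTYPES)` returns `(4, 1)`, which agrees with the poster's count, and `case_closure` adds the constraints at 0-based positions 0 and 7 (both equal to 0) without changing the matched cases.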
Bonferroni is too crude (e.g., for 3-SNP combinations among 100 SNPs it requires p < 0.05×10^-6), so we adjust the resulting p-values via randomization.
Unadjusted p-value: the probability of the case/control distribution in the set defined by the MSC, computed from the binomial distribution.
Multiple-testing adjusted p-value (randomization):
Randomly permute the disease status of the population to generate 10000 instances.
Apply the searching methods to each instance to get MSCs.
Compute the fraction of instances whose MSCs are at least as significant as the observed one (unadjusted p-value no higher than the observed p-value).
In our search we report only MSCs with adjusted p-value < 0.05.

[3] Crohn's disease: 387 genotypes with 103 SNPs derived from the 616 kb region of human chromosome 5q31; 144 disease genotypes and 243 nondisease genotypes (Daly et al., 2001).
[10] Autoimmune disorder: 1024 genotypes with 108 SNPs covering the genes CD28, CTLA4 and ICOS; 378 disease genotypes and 646 nondisease genotypes (Ueda et al., 2003).
[4] Tick-borne encephalitis: 75 genotypes with 41 SNPs covering the genes TLR3, PKR, OAS1, OAS2 and OAS3; 21 disease genotypes and 54 nondisease genotypes (Barkash et al., 2006).

Disease Susceptibility Prediction Problem:
Given a sample population S (a training set) and one more individual t ∉ S with known SNPs but unknown disease status (the testing individual), predict the unknown disease status.

Disease Clustering Problem:
Given a population sample S, find a partition P of S into clusters S = S_1 ∪ ... ∪ S_k, with disease status 0 or 1 assigned to each cluster S_i, minimizing entropy(P) for a given bound on the number of individuals who are assigned an incorrect status in the clusters of P: error(P) < [coefficient lost in extraction] * |P|.

Clustering-based Model-Fitting Algorithm for Disease Susceptibility Prediction:
For the given training dataset and tested genotype, consider two cases:
the tested genotype is added to the training dataset as sick;
the tested genotype is added to the training dataset as healthy.
For both cases, obtain a clustering by applying CGS to find:
the most disease-associated MSC (defines a set of sick genotypes);
the most disease-resistant MSC (defines a set of healthy genotypes).
Remove from the dataset whichever of the two sets is larger.
Repeat this procedure until all genotypes are removed.
Predict the susceptibility of the tested genotype according to the case that has the lower clustering entropy.

[Table not captured] Comparison of three methods for searching for the disease-associated and disease-resistant multi-SNP combinations with the largest PPV.
[Table not captured] Leave-one-out cross validation results.

A novel combinatorial method for finding disease-associated multi-SNP combinations was developed.
Multi-SNP combinations significantly associated with diseases were found.
For Crohn's disease data (Daly et al., 2001), a few associated multi-SNP combinations with multiple-testing-adjusted p < 0.05 were found, while no single SNP or pair of SNPs showed significant association.
For an autoimmune disorder dataset (Ueda et al., 2003), a few previously unknown associated multi-SNP combinations were found.
For tick-borne encephalitis virus-induced disease, a multi-SNP combination within a group of genes showing a high degree of linkage disequilibrium was found to be significantly associated with the severity of the disease.
Model-fitting disease susceptibility prediction methods based on the developed search methods were proposed.

Find a maximum-size cluster C containing only cases or only controls.
Complimentary Greedy Search (CGS):
1. Find the SNP and allele value whose selection removes the set of genotypes with the highest ratio of controls to cases.
2. Add the SNP to the resulting MSC.
3. Repeat steps 1-2 until all controls are removed. The resulting MSC defines a subset of sick genotypes.
4. Adjust the p-value of the resulting MSC for multiple testing.
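The four CGS steps, together with the randomization-based adjustment, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tie-breaking rule (prefer more removed controls), the restriction to allele values seen in a surviving case, the early stop when no constraint removes a control, the one-sided binomial tail used as the unadjusted p-value, and the default of 1000 permutations (the poster uses 10000) are all choices made here. The names `cgs`, `unadjusted_p_value`, `adjusted_p_value`, `GENOTYPES`, and `IS_CASE` are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

MSC = Dict[int, int]  # 0-based SNP position -> required value in {0, 1, 2}

# Toy data: the seven example genotypes from the poster (5 sick, 2 healthy).
GENOTYPES = [
    [0, 1, 1, 0, 1, 2, 1, 0, 2], [0, 1, 1, 1, 0, 2, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 2, 1], [0, 1, 1, 1, 1, 2, 0, 0, 1],
    [0, 0, 1, 0, 1, 2, 1, 0, 2], [0, 1, 0, 0, 1, 1, 0, 0, 2],
    [0, 1, 1, 0, 1, 2, 0, 0, 2],
]
IS_CASE = [True, True, True, True, True, False, False]


def cgs(genotypes: Sequence[Sequence[int]], is_case: Sequence[bool]) -> Tuple[MSC, List[int]]:
    """Steps 1-3: greedily add SNP=value constraints until no control is matched."""
    msc: MSC = {}
    alive = list(range(len(genotypes)))  # indices still matched by the MSC
    while any(not is_case[i] for i in alive):
        best = None  # (controls/cases ratio, controls removed, snp, value)
        for snp in range(len(genotypes[0])):
            if snp in msc:
                continue
            # Only values seen in a surviving case, so at least one case remains.
            for value in sorted({genotypes[i][snp] for i in alive if is_case[i]}):
                removed = [i for i in alive if genotypes[i][snp] != value]
                controls = sum(1 for i in removed if not is_case[i])
                cases = len(removed) - controls
                if controls == 0:
                    continue  # this constraint makes no progress
                ratio = math.inf if cases == 0 else controls / cases
                if best is None or (ratio, controls) > best[:2]:
                    best = (ratio, controls, snp, value)
        if best is None:
            break  # a control cannot be separated from the remaining cases
        _, _, snp, value = best
        msc[snp] = value
        alive = [i for i in alive if genotypes[i][snp] == value]
    return msc, alive


def unadjusted_p_value(cluster: Sequence[int], is_case: Sequence[bool]) -> float:
    """Assumed form: binomial tail P(X >= k) for k cases among the n matched genotypes."""
    n = len(cluster)
    k = sum(1 for i in cluster if is_case[i])
    p = sum(is_case) / len(is_case)  # overall fraction of cases
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def adjusted_p_value(genotypes, is_case, n_permutations: int = 1000, seed: int = 0):
    """Step 4: fraction of status permutations whose CGS result is at least as significant."""
    _, cluster = cgs(genotypes, is_case)
    observed = unadjusted_p_value(cluster, is_case)
    rng = random.Random(seed)
    status = list(is_case)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(status)  # permute disease status, keep genotypes fixed
        _, perm_cluster = cgs(genotypes, status)
        if unadjusted_p_value(perm_cluster, status) <= observed:
            hits += 1
    return observed, hits / n_permutations
```

On the toy table, `cgs(GENOTYPES, IS_CASE)` selects the constraints `{2: 1, 8: 1}`, which match three sick genotypes and no healthy ones. With only seven genotypes the adjusted p-value carries no statistical meaning; the sketch is meant only to show how the search and the permutation loop fit together.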
Quality measure: [panel body not captured in the transcript]
Combinatorial search is able to find statistically significant multi-gene interactions in data where no significant association was detected before.
Complimentary greedy search can be used in susceptibility prediction.
Optimization approach to prediction: the new susceptibility prediction is 8% higher than the best previously known.
MLR-tagging efficiently reduces the datasets, making it possible to find associated multi-SNP combinations and predict susceptibility.
Comparison of 5 prediction methods on [4] data on all SNPs: the area under the CSP's ROC curve is 0.87 vs 0.52 under the SVM's curve.

