BGRS 2006
SEARCH FOR MULTI-SNP DISEASE ASSOCIATION
D. Brinza, A. Perelygin, M. Brinton and A. Zelikovsky
Georgia State University, Atlanta, GA, USA

Exhaustive Search (ES): To find a multi-SNP combination (MSC) whose frequency distribution has a p-value below 0.05, ES checks all one-SNP, two-SNP, ..., m-SNP combinations.
Runtime is $O(n \cdot 3^m)$, which makes a complete search infeasible even for small numbers of SNPs m.
We restrict the search to combinations of 1, 2, 3, 4 and 5 SNPs.
Searching level: the number of SNPs that participate in the MSC.

Indexed Exhaustive Search (IES): Exhaustive search on the indexed dataset obtained by extracting k index SNPs with the MLR-based tagging method.
MLR: multiple linear regression based tagging method (He and Zelikovsky, 2006).
The tradeoff between the number of index SNPs and the quality of reconstruction requires choosing the maximum number of index SNPs that ES can handle in reasonable computational time.
Can perform a complete search on larger datasets.
For a genome-wide study the number of tags cannot be reduced to 5-10, so IES will not be able to perform a complete search.

Combinatorial Search (CS): Similar to ES, but checks all one-SNP, two-SNP, ..., m-SNP disease-closed combinations.
The disease-closure of a multi-SNP combination C is the multi-SNP combination C' with the maximum number of SNPs that selects the same set of disease individuals as C and the minimum number of nondisease individuals.

Genetic epidemiology
Searching for genetic risk factors for diseases.
Monogenic diseases: a mutated gene is entirely responsible for the disease; typically rare in the population (< 0.1%); practically all cases are already reported.
Complex diseases: affected by the interaction of multiple genes.
The significance of a risk factor is usually measured by Risk Rate or Odds Ratio.
We measure significance by the p-value of the set of genotypes defined by the risk factor.

Human Genome and SNP
Length of the human genome: $3 \times 10^9$ base pairs.
Difference between any two people: 0.1% of the genome.
Total number of single nucleotide polymorphisms (SNPs): $3 \times 10^6$.
SNP: a single nucleotide site where two or more different nucleotides occur in a large percentage of the population.
0 = wild type/major (frequency) allele
1 = mutation/minor (frequency) allele
International HapMap project: SNP maps are constructed across the human genome with a density of about one SNP per thousand nucleotides.
HapMap tries to identify 1 million tag SNPs providing almost as much mapping information as the entire 10 million SNPs.
Unfortunately, not as much is known about SNP combinations.
The initial HapMap budget was 100 million dollars.
To date, around 1.5 million SNPs have been typed.
Most of the data are trios.
High-throughput genotyping technology: Affymetrix GeneChip for gene genotyping (500k microarray chip).

A multi-SNP combination (MSC) defines a set of disease and nondisease individuals.
An MSC is considered statistically significant if the distribution of disease and nondisease individuals in that set has a p-value < 0.05.
Reported findings are frequently not reproducible in different populations.
It is believed that this happens because the p-values are not adjusted for multiple testing.
Statistical significance

Disease association analysis
Analysis of variation in suspected genes in disease and nondisease individuals aims to identify SNPs with considerably higher frequencies among the disease individuals than among the nondisease individuals.
Most searches are done on a SNP-by-SNP basis.
Recently, two-SNP analysis has shown promising results (Marchini et al., 2005).
Multi-SNP analyses are expected to find even stronger disease associations.
Common diseases can be caused by combinations of several unlinked gene (SNP) variations.
We address the computational challenge of searching for such multi-gene causal combinations.
The number of multi-SNP combinations is infeasibly high ($3^{100}$ for 100 SNPs).
How can associated multi-SNP combinations be found without checking them all?
Disease association analysis searches for SNPs or multi-SNP combinations whose frequency among disease individuals is considerably higher than among nondisease individuals.

Our contributions
A novel combinatorial method for finding disease-associated multi-SNP combinations was developed.
Multi-SNP combinations significantly associated with diseases were found.
For Crohn's disease data (Daly et al., 2001), a few associated multi-SNP combinations with multiple-testing-adjusted p < 0.05 were found, while no single SNP or pair of SNPs showed significant association.
For an autoimmune disorder dataset (Ueda et al., 2003), a few previously unknown associated multi-SNP combinations were found.
For tick-borne encephalitis virus-induced disease, a multi-SNP combination significantly associated with the severity of the disease was found within a group of genes showing a high degree of linkage disequilibrium.
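To make the size of the search space concrete, the short Python sketch below counts the combinations at each searching level. It assumes that a combination at level k fixes k of the m SNPs to one of the three values 0, 1, 2; this counting convention, and the function names `combinations_at_level` and `combinations_up_to`, are illustrative assumptions and are not taken from the poster.

```python
from math import comb


def combinations_at_level(m, level, values_per_snp=3):
    """Number of combinations that fix exactly `level` of m SNPs to a value.

    Assumption: each chosen SNP may take any of `values_per_snp` values.
    """
    return comb(m, level) * values_per_snp ** level


def combinations_up_to(m, max_level, values_per_snp=3):
    """Total number of combinations over searching levels 1..max_level."""
    return sum(combinations_at_level(m, k, values_per_snp)
               for k in range(1, max_level + 1))
```

Under this assumption, 100 SNPs already give 4,365,900 three-SNP combinations, which illustrates why both an exhaustive search and a per-combination significance correction become demanding within the first few levels.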
Disease-Associated Multi-SNP Combinations Search
Given: a population of n genotypes (or haplotypes), each containing the values of m SNPs from {0,1,2}, together with the disease status (disease or nondisease) of each.
Find: all multi-SNP combinations whose multiple-testing-adjusted p-value of the frequency distribution is below 0.05.

Results/comparison of searching methods
The relative qualities of the searching methods are compared using the number of statistically significant multi-SNP combinations found. Statistical significance was adjusted for multiple testing, and the adjusted 0.05 threshold is shown (third column). The 4th, 5th and 6th columns give the frequencies of the best multi-SNP combination among the disease and nondisease populations and the unadjusted p-value, respectively.

Data Sets
Crohn's disease: 387 genotypes with 103 SNPs derived from the 616 kb region of human chromosome 5q31; 144 disease genotypes and 243 nondisease genotypes (Daly et al., 2001).
Autoimmune disorder: 1024 genotypes with 108 SNPs covering the genes CD28, CTLA4 and ICOS; 378 disease genotypes and 646 nondisease genotypes (Ueda et al., 2003).
Tick-borne encephalitis: 75 genotypes with 41 SNPs covering the genes TLR3, PKR, OAS1, OAS2 and OAS3; 21 disease genotypes and 54 nondisease genotypes (Barkash et al., 2006).

Proposed searching methods

Discussion
Comparing the indexed counterparts with ES and CS shows that indexing is quite successful. Indeed, the indexed searches found the same multi-SNP combinations as the non-indexed searches but were much faster, and the multiple-testing-adjusted 0.05 threshold was higher and easier to meet. The comparison of CS with ES favors the former. Indeed, for the Crohn's disease data (Daly et al., 2001), ES is unsuccessful on the first and second search levels, while CS finds several statistically significant multi-SNP combinations.
Similarly, for the tick-borne encephalitis virus-induced disease data, CS and ICS(20) found a significant association on the first level, while no association was found by ES or IES(20). For the autoimmune disorder data (Ueda et al., 2003), CS found many more statistically significant multi-SNP combinations than ES. We conclude that the proposed indexing approach and the combinatorial search method are very promising techniques for finding statistically significant disease-associated multi-SNP combinations and for disease susceptibility prediction.

Figure: an example MSC (pattern x x 1 x x 2 x x x) selects 4 sick individuals and 1 healthy individual; the significance of this split is then checked.

If the reported SNP is found among 100 SNPs, the probability that the SNP is associated with a disease by mere chance becomes 100 times larger (Bonferroni).
Bonferroni is too crude (e.g., for 3-SNP combinations among 100 SNPs it requires $p < 0.05 \times 10^{-6}$).
We adjust the resulting p-values via randomization.
Unadjusted p-value: the probability of the case/control distribution in the set defined by the MSC, computed using the binomial distribution.
Multiple-testing-adjusted p-value (randomization):
Randomly permute the disease status of the population to generate 1000 instances.
Apply the searching methods to each instance to get MSCs.
Compute the probability that an instance yields an MSC at least as significant as the observed one (unadjusted p-value no higher than the observed p-value).
In our search we report only MSCs with adjusted p-value < 0.05.

Disease-closure allows statistically significant MSCs to be found at an earlier stage of the search.
Trivial MSCs and MSCs that coincide after disease-closure are avoided, which significantly speeds up the search.
Faster than ES.
Finds more significant associations at an early stage of the search.
Still slow for genome-wide studies.
Searching level: the number of SNPs that define the MSC before disease-closure.

Indexed Combinatorial Search (ICS): Combinatorial search on the indexed dataset obtained by extracting k index SNPs with the MLR-based tagging method.
Can perform a complete search on larger datasets.
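The combinatorial search with disease-closure and the randomization-based adjustment described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The unadjusted p-value is taken to be the upper binomial tail with the overall disease frequency as success probability. The disease-closure is built by adding every SNP on which all selected disease individuals agree. Disease status is encoded as 1 (disease) or 0 (nondisease). An MSC is a dict mapping a SNP index to the required value. All function names are hypothetical.

```python
import random
from itertools import combinations, product
from math import comb


def matches(genotype, msc):
    """True if the genotype carries every SNP value required by the MSC."""
    return all(genotype[j] == v for j, v in msc.items())


def unadjusted_p(genotypes, status, msc):
    """Assumed form: P(X >= observed disease count) under a binomial model."""
    picked = [s for g, s in zip(genotypes, status) if matches(g, msc)]
    if not picked:
        return 1.0
    rate = sum(status) / len(status)  # overall disease frequency
    n, d = len(picked), sum(picked)
    return sum(comb(n, i) * rate**i * (1 - rate)**(n - i)
               for i in range(d, n + 1))


def disease_closure(genotypes, status, msc):
    """Add every SNP on which all disease individuals matching the MSC agree."""
    cases = [g for g, s in zip(genotypes, status) if s == 1 and matches(g, msc)]
    if not cases:
        return dict(msc)
    closed = {}
    for j in range(len(cases[0])):
        values = {g[j] for g in cases}
        if len(values) == 1:
            closed[j] = values.pop()
    return closed


def best_closed_msc(genotypes, status, level):
    """Check all MSCs with `level` SNPs, closing each one before scoring it."""
    best_p, best = 1.0, None
    seen = set()
    for snps in combinations(range(len(genotypes[0])), level):
        for values in product((0, 1, 2), repeat=level):
            closed = disease_closure(genotypes, status, dict(zip(snps, values)))
            key = tuple(sorted(closed.items()))
            if key in seen:  # skip MSCs that coincide after disease-closure
                continue
            seen.add(key)
            p = unadjusted_p(genotypes, status, closed)
            if p < best_p:
                best_p, best = p, closed
    return best_p, best


def adjusted_p(genotypes, status, level, n_perm=1000, seed=0):
    """Fraction of permuted instances whose best MSC is at least as significant."""
    rng = random.Random(seed)
    observed_p = best_closed_msc(genotypes, status, level)[0]
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random permutation of the disease status
        if best_closed_msc(genotypes, shuffled, level)[0] <= observed_p:
            hits += 1
    return hits / n_perm
```

On a toy population where disease occurs only when SNP 0 has value 2 and SNP 1 has value 1, `best_closed_msc(genotypes, status, 1)` already returns the two-SNP combination at searching level 1, because the closure of the one-SNP combination adds the second SNP. That is the sense in which closure can surface associations at an earlier level. The sketch reruns the whole search for every permutation, so it is only practical for small numbers of SNPs.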