CANDID: A candidate gene identification tool Part 2 Janna Hutz March 26, 2007
Review
Literature
–Well-characterized genes
Protein domains
–All genes
Cross-species conservation
–All genes
Today’s agenda Expression levels Linkage data Association data CANDID performance measures
Candidate lists vs. single candidates
Candidate lists
–Complex trait or disease
–Disease with known heterogeneity
Single candidates
–Mendelian trait
–New disease
–Disease with clear, well-defined pathology
Candidate lists vs. single candidates Microarray SNP typing Sequencing Immunocytochemistry Knockout model ACT[A/G]GGA
Example 4 Goiter - thyroid gland problem Iodine deficiency Genetic causes
Example 4 Iodine is not supplied Iodine is present, but is not added to the molecule Which gene is mutated?
Expression data We know what tissue our gene is expressed in (thyroid). How can we use this knowledge to help identify the candidate? Wouldn’t it be nice if we had an expression database?
Expression databases Our ideal expression database would have: –Expression data for the same genes across many different tissues –As many tissues as possible –As many genes as possible –Good documentation Gene Atlas
Genomics Institute of the Novartis Research Foundation
79 human tissues (160 samples)
2 arrays
–Affymetrix HG-U133A
–GNF1H (custom)
17,809 genes
Measure of gene expression Our thyroid gene: –Gene that is brightest on the thyroid array? –Gene that is brightest on the thyroid array, compared to all the other arrays.
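One way to read "brightest on the thyroid array, compared to all the other arrays" is as a relative score: a gene's signal in the tissue of interest divided by its highest signal in any tissue, so a gene scores 1 when the chosen tissue is where it is most strongly expressed. The sketch below assumes that reading. The function name, tissue labels, and intensity values are made up for illustration and are not CANDID's code or Gene Atlas data.

```python
def relative_expression_score(signal_by_tissue, tissue):
    """Signal in `tissue` divided by the gene's highest signal in any tissue.

    Assumed scoring rule for illustration only, not CANDID's confirmed formula.
    """
    peak = max(signal_by_tissue.values())
    if peak <= 0:
        return 0.0
    return signal_by_tissue[tissue] / peak


# Hypothetical array intensities for two made-up genes.
thyroid_specific = {"thyroid": 900.0, "liver": 30.0, "brain": 45.0}
housekeeping = {"thyroid": 500.0, "liver": 1000.0, "brain": 800.0}

scores = {
    "thyroid_specific": relative_expression_score(thyroid_specific, "thyroid"),
    "housekeeping": relative_expression_score(housekeeping, "thyroid"),
}
print(scores)
```

Under this reading, a gene that is merely bright everywhere does not outrank a gene whose signal peaks in the thyroid.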
Measures of gene expression Run CANDID, specifying that we’re interested in the thyroid. User name: workshop Password: perl (We’ll need a tissue code for that.)
Example 4 - Results Our favorite genes: TP53 - rank is… –16314th KRAS - rank is… –5229th What genes are ranked most highly?
Example 4 - Results 192 genes with expression score of 1 The TOP gene is actually responsible for the phenotype described earlier –Its expression score = 1
Prior evidence I’m not interested in examining all of the genes in the genome - just some of them. Linkage and association
Linkage CANDID can: –Weight regions with higher LOD scores –Limit analysis to certain regions –How does it do this?
Linkage scoring 3172
linkage score = (gene’s LOD score) / (maximum genome-wide LOD score)
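The two phrases on this slide read as a ratio: the gene's LOD score over the maximum genome-wide LOD score. A minimal sketch under that assumption follows. The gene names and LOD values are invented, and clipping negative LOD scores to 0 is an extra assumption the slides do not state.

```python
def linkage_score(gene_lod, max_genome_lod):
    """Scale a gene's LOD score by the largest LOD score in the genome scan."""
    if max_genome_lod <= 0:
        return 0.0
    # Assumption: negative LOD scores contribute nothing rather than a negative score.
    return max(gene_lod, 0.0) / max_genome_lod


# Hypothetical per-gene LOD scores from one genome scan.
lods = {"GENE_A": 3.2, "GENE_B": 1.6, "GENE_C": -0.4}
top_lod = max(lods.values())
linkage_scores = {gene: linkage_score(lod, top_lod) for gene, lod in lods.items()}
print(linkage_scores)
```

Dividing by the genome-wide maximum keeps the score between 0 and 1, so the gene under the highest peak always scores 1.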
Linkage files How does CANDID get this linkage information? CANDID takes two kinds of files –Unformatted output from GENEHUNTER and MERLIN –Custom linkage files
Custom linkage files
Simple format
Line 1 of the file must contain the word “custom” somewhere
Subsequent lines: Chromosome (tab) cM (tab) LOD score
But how do I get cM positions?
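The custom format described here is simple enough to write and check by script. The sketch below is a hypothetical helper pair, not part of CANDID: it writes a header line containing "custom", then one tab-separated row per position, and reads the file back. The chromosome, cM, and LOD values are placeholders.

```python
import io


def write_custom_linkage(rows, handle):
    """Write (chromosome, cM, LOD) rows in the custom linkage layout."""
    handle.write("custom\n")  # line 1 only has to contain the word "custom"
    for chrom, cm, lod in rows:
        handle.write(f"{chrom}\t{cm}\t{lod}\n")


def read_custom_linkage(handle):
    """Read the layout back, checking the header line first."""
    header = handle.readline()
    if "custom" not in header:
        raise ValueError("line 1 must contain the word 'custom'")
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        chrom, cm, lod = line.split("\t")
        rows.append((chrom, float(cm), float(lod)))
    return rows


# Placeholder positions and LOD scores, for illustration only.
buffer = io.StringIO()
write_custom_linkage([("13", 40.0, 2.5), ("13", 45.5, 3.1)], buffer)
linkage_text = buffer.getvalue()
parsed = read_custom_linkage(io.StringIO(linkage_text))
print(linkage_text)
```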
Mapmaker
Inputs file as: Chromosome (tab) basepair (tab) LOD score
Outputs new file in the format: Chromosome (tab) cM (tab) LOD score
Will be available on the CANDID website soon
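The slides do not say how Mapmaker turns basepair positions into cM. One common approach, shown here purely as an assumption, is linear interpolation between markers whose physical and genetic positions are both known. The marker map below is a toy example, and `bp_to_cm` is a hypothetical name.

```python
from bisect import bisect_right


def bp_to_cm(position_bp, marker_map):
    """Linearly interpolate a cM position from sorted (bp, cM) marker pairs.

    Positions outside the map are clamped to the nearest end marker.
    """
    bps = [bp for bp, _ in marker_map]
    if position_bp <= bps[0]:
        return marker_map[0][1]
    if position_bp >= bps[-1]:
        return marker_map[-1][1]
    i = bisect_right(bps, position_bp)
    (bp0, cm0), (bp1, cm1) = marker_map[i - 1], marker_map[i]
    return cm0 + (cm1 - cm0) * (position_bp - bp0) / (bp1 - bp0)


# Toy marker map for one chromosome: (basepair position, cM position).
toy_map = [(1_000_000, 0.0), (3_000_000, 4.0), (5_000_000, 6.0)]
converted = bp_to_cm(2_000_000, toy_map)
print(converted)
```

Interpolation is only as good as the marker map behind it, since recombination rates vary along a chromosome.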
Example 5 Deletion on chromosome 13 between cM and cM. pancreatic cancer
Creating a custom linkage file Example: custom
Running CANDID 1.Try running CANDID using only the linkage criterion. 2.Now, run CANDID with the linkage criterion and literature criterion (your choice of keywords) Linkage weight = 1000 Literature weight = 1
Results From OMIM: “Individuals with mutations in the BRCA2 gene, which predisposes to breast and ovarian carcinoma, have an increased risk of pancreatic cancer; germline mutations in BRCA2 are the most common inherited alteration identified in familial pancreatic cancer.”
But linkage is so last season…
Association Increasing numbers of association studies Increasing numbers of SNPs in each study Can CANDID use this information, too?
Association Database –dbSNP million human SNPs –Includes HapMap SNPs –Most comprehensive –Each snp has a number prefixed with “rs”
Association How does CANDID accept association data? Custom file format - each line is: rs# (tab) p-value
Association scoring For each gene, take the best p-value for that gene’s SNPs Subtract that p-value from 1 Unless you test SNPs in every gene, this can be kind of unfair…
Association scoring Tested 10 genes Gene 9 has a best p-value of 0.8 (bad) Gene X was not tested Should Gene 9 get a higher overall score than Gene X?
p-value threshold User defines a p-value threshold Let’s say it’s 0.1. Any SNPs with p-values above 0.1 are not considered. Now Gene 9 and Gene X have the same score (0).
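The scoring rule and threshold from the last three slides can be sketched as follows: parse "rs# (tab) p-value" lines, drop SNPs above the threshold, keep each gene's best remaining p-value, and score the gene as 1 minus that value (0 if nothing survives). The SNP identifiers and the SNP-to-gene lookup are invented placeholders; how CANDID actually maps SNPs to genes is not covered in the slides.

```python
def parse_association_lines(lines):
    """Parse 'rs#<tab>p-value' lines into {snp_id: p_value}."""
    pvalues = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        snp_id, p_text = line.split("\t")
        pvalues[snp_id] = float(p_text)
    return pvalues


def association_scores(snp_pvalues, snp_to_gene, genes, threshold=0.1):
    """Score each gene as 1 - (best p-value at or below the threshold), else 0."""
    best = {}
    for snp_id, p in snp_pvalues.items():
        if p > threshold:
            continue  # SNPs above the threshold are not considered
        gene = snp_to_gene.get(snp_id)
        if gene is None:
            continue
        if gene not in best or p < best[gene]:
            best[gene] = p
    return {g: (1.0 - best[g] if g in best else 0.0) for g in genes}


# Placeholder SNP IDs and an invented SNP-to-gene lookup.
lines = ["rs_example_1\t0.01", "rs_example_2\t0.05", "rs_example_3\t0.8"]
snp_to_gene = {
    "rs_example_1": "Gene 1",
    "rs_example_2": "Gene 1",
    "rs_example_3": "Gene 9",
}
genes = ["Gene 1", "Gene 9", "Gene X"]

pvalues = parse_association_lines(lines)
with_threshold = association_scores(pvalues, snp_to_gene, genes, threshold=0.1)
no_threshold = association_scores(pvalues, snp_to_gene, genes, threshold=1.0)
print(with_threshold)
print(no_threshold)
```

With the threshold at 0.1, Gene 9 (best p-value 0.8) and the untested Gene X both score 0. With no effective threshold, Gene 9 would outscore Gene X only because it was tested.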
Example 6 Age-related Eye Disease Study Macular degeneration
Example 6 Make custom association file rs rs rs Run CANDID with this association file
Results rs rs rs } CFH } SLC25A46
So just how well does this work anyway?
Preliminary evidence Online Mendelian Inheritance in Man 154 diseases linked to chromosome 1 Literature, domains - chose keywords Conservation Expression - chose tissue codes
Ideal weights Tested all combinations of weights in those 4 categories –Possible weights: (0, 0.1, …, 0.9, 1) Which weight combination was the best, across all 154 diseases?
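Testing every combination of 11 possible weights in 4 categories means 11^4 = 14,641 runs per disease. The sketch below enumerates that grid and picks the combination with the lowest average rank. The `toy_average_rank` function is a made-up stand-in for running CANDID on all 154 diseases and averaging the rank of each known disease gene; its formula has no connection to the real results.

```python
from itertools import product

CATEGORIES = ("literature", "domains", "conservation", "expression")
WEIGHT_VALUES = [i / 10 for i in range(11)]  # 0, 0.1, ..., 0.9, 1


def weight_grid():
    """Yield every combination of weights as a {category: weight} dict."""
    for combo in product(WEIGHT_VALUES, repeat=len(CATEGORIES)):
        yield dict(zip(CATEGORIES, combo))


def best_weights(average_rank):
    """Return the combination with the lowest average rank (lower is better)."""
    return min(weight_grid(), key=average_rank)


def toy_average_rank(weights):
    # Made-up stand-in for a real evaluation across all diseases.
    others = weights["domains"] + weights["conservation"] + weights["expression"]
    return 500 - 400 * weights["literature"] + 50 * others


grid_size = sum(1 for _ in weight_grid())
best = best_weights(toy_average_rank)
print(grid_size, best)
```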
Top 10 weight combinations
1. Literature = 1, everything else = 0
2. Literature = 0.9, everything else = 0
3. Literature = 0.8, everything else = 0
4. Literature = 0.7, everything else = 0
5. …
10. Literature = 0.1, everything else = 0
Literature = 1, domains = 0.1
More specifics
Literature only: average ranking = 425
–Rank 425 of 38697 = 98.9th percentile
–44/154 genes ranked #1 for at least one set of weights
Chromosome 1: average ranking = 22
–Rank 22 of 2280 = 99th percentile
–84/154 genes ranked #1 for at least one set of weights
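The percentiles on this slide follow from treating rank 1 as best, so the percentile is 100 × (1 − rank / total). A quick check of the arithmetic, with `rank_percentile` as a hypothetical helper name:

```python
def rank_percentile(rank, total):
    """Percentile of a rank within `total` candidates, where rank 1 is best."""
    return 100.0 * (1.0 - rank / total)


genome_wide = rank_percentile(425, 38697)
chromosome_1 = rank_percentile(22, 2280)
print(f"genome-wide: {genome_wide:.1f}, chromosome 1: {chromosome_1:.1f}")
```

Both figures round to the values quoted on the slide: about 98.9 genome-wide and about 99.0 when candidates are limited to chromosome 1.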
Analysis of results They make a lot of sense. Genes in OMIM are, by definition, well-characterized. Many diseases are rare, with particular names or keywords that would only appear in papers about the disease genes.
Next steps Separate OMIM analysis into simple and complex traits –Get new ideal weights See how well these ideal weights do in ranking candidates from chromosome 2.
Next steps CANDID’s databases were last compiled in November Find publications that have come out since then. How well does CANDID do in ranking those genes?
Next steps Many new whole-genome studies and microarray studies implicate lists of candidates. If CANDID analyzes those phenotypes, how significant is the overlap of CANDID’s top genes and those papers’ top genes?
Next steps Any other suggestions? Any interesting data you have?
Any questions?
Acknowledgments Mike Province Howard McLeod Aldi Kraja Ingrid Borecki Qunyuan Zhang Ryan Christensen John Martin