CANDID: A candidate gene identification tool Janna Hutz March 19, 2007.


1 CANDID: A candidate gene identification tool Janna Hutz jehutz@artsci.wustl.edu March 19, 2007

2 Candidate genes Positional –Linkage evidence –Deletion syndrome –Loss of heterozygosity –Disease-related amplification –Association Biological –Pathways –Phenotypic characteristics ACT[A/G]GGA

3 A case study: acd

4 [Figure: genetic map showing acd relative to the markers Os, Es-1, D8Mit5, D8Mit79, and D8Mit13; positions shown on the map: 0, 25, ~31, 38.7, 51, 67, and ~82 cM]

5 A case study: acd 1/145 3/145 0/145 1/145

6 Which gene is acd?

7 Prioritization tools Endocrinologist/Geneticist Ensembl RT-PCR Sequencing BINGO! …two years later.

8 How can we improve this?

9 Improve our tools Clinician –Has memorized information about many disorders; can name some relevant genes –Gets his/her information from… PubMed

10 How do we use PubMed to analyze our candidates? –Enter our phenotypic keywords into PubMed. Read the papers that come up in the results. Make a list of genes. –Do PubMed searches for all the candidates. Read the papers that come up in the results. Rate the candidates. Better: Don’t do it yourself…

11 PubMed Each publication has a PubMed ID. Each gene has a Gene ID. Wouldn’t it be nice if we could link Gene IDs and PubMed IDs? –ftp://ftp.ncbi.nlm.nih.gov/gene/DATA –gene2pubmed.gz –TaxonomyID; GeneID; PubMedID
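The link between Gene IDs and PubMed IDs can be sketched in a few lines of Python. This is only an illustration, not CANDID's own code: it assumes gene2pubmed.gz is tab-separated with the three columns named on the slide and a header line starting with `#`. The function names and the sample IDs are made up for the example.

```python
import gzip
from collections import defaultdict

def parse_gene2pubmed(lines, tax_id="9606"):
    """Map each GeneID to the set of PubMed IDs linked to it for one taxon.

    Assumes tab-separated columns: TaxonomyID, GeneID, PubMedID.
    """
    gene_to_pmids = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        tax, gene_id, pmid = line.rstrip("\n").split("\t")[:3]
        if tax == tax_id:
            gene_to_pmids[gene_id].add(pmid)
    return dict(gene_to_pmids)

def load_gene2pubmed(path, tax_id="9606"):
    """Read a local copy of gene2pubmed.gz (the path is supplied by the caller)."""
    with gzip.open(path, "rt") as handle:
        return parse_gene2pubmed(handle, tax_id)

# Made-up rows, only to show the shape of the data.
sample = [
    "#tax_id\tGeneID\tPubMed_ID\n",
    "9606\t101\t5001\n",
    "9606\t101\t5002\n",
    "10090\t202\t5003\n",
]
links = parse_gene2pubmed(sample)
```

Filtering on the taxonomy ID (9606 is human) keeps publications about the same gene symbol in other species out of the results.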

12 Who makes that file? (1) From http://www.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html: Links between Gene and PubMed are the result of the following: 1. Manual curation within NCBI. Part of the process of generating a REVIEWED RefSeq is an analysis of the current literature. Papers that are seminal in defining the gene, its sequence, and its function are added to the record at that time. Alert users point out gaps or errors in papers associated with a Gene record. These messages are reviewed and implemented as required.

13 Who makes that file? (2) 2. Integration of information from other public databases. Gene integrates gene-citation from resources external to NCBI such as model organism-specific databases, Gene Ontology (GO), groups curating interactions, and sequence databases. The assumption in using these sources is that they report citations specific to a gene in a known species. Gene does not process citations from OMIM automatically, because many of the citations in OMIM refer to studies of genes in species other than human.

14 Example 1 pancreatic cancer sequence candidates $

15 Help Sally. Use CANDID’s literature criterion http://dsgweb.wustl.edu/llfs/secure_html/hutz/index.html User: workshop Password: perl031907

16

17 Help Sally. Look for genes that are involved with pancreatic cancer. What are some keywords we can use?

18

19 A measure of relevancy Find relevant publications Is Gene X linked to these publications? How many publications match? What percent of Gene X’s publications match?

20 By the numbers… Literature scores run from 0 to 1. The score is… (number of the gene’s publications that match) / (number of the gene’s publications)
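The ratio on this slide is short enough to write out directly. A minimal sketch, assuming publications are identified by PubMed ID and that a gene with no publications scores 0 (that edge case is an assumption, not something the slide states):

```python
def literature_score(gene_pmids, matching_pmids):
    """Fraction of a gene's publications that match the keyword search."""
    gene_pmids = set(gene_pmids)
    if not gene_pmids:
        return 0.0  # assumption: no publications means no literature evidence
    matched = gene_pmids & set(matching_pmids)
    return len(matched) / len(gene_pmids)

# Made-up IDs: the gene has 4 publications, 2 of which match the search.
score = literature_score({"1", "2", "3", "4"}, {"2", "4", "9"})
```

Because the denominator is the gene's own publication count, a heavily studied gene needs proportionally more matching papers to reach the same score as a sparsely studied one.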

21 Matching Every publication has a “Text Words” field that includes, when available, … –Title –Abstract –Other abstract –MeSH terms –MeSH subheadings –Publication types –Substance names –Personal name as subject –MEDLINE secondary source –Other terms
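To make the idea of matching against a combined text field concrete, here is a rough sketch. It is not how PubMed or CANDID implements the search: the record layout, the field names, and the case-insensitive substring test are all assumptions chosen for illustration.

```python
# Hypothetical field names standing in for a few of the "Text Words" components.
TEXT_FIELDS = ("title", "abstract", "mesh_terms", "substance_names", "other_terms")

def text_words(record):
    """Join whichever text fields are present into one lowercase string."""
    parts = []
    for field in TEXT_FIELDS:
        value = record.get(field)
        if not value:
            continue  # fields are included only "when available"
        if isinstance(value, str):
            parts.append(value)
        else:
            parts.extend(value)  # lists of terms, e.g. MeSH terms
    return " ".join(parts).lower()

def matches(record, keywords):
    """True if any keyword occurs anywhere in the record's combined text."""
    text = text_words(record)
    return any(keyword.lower() in text for keyword in keywords)

# Made-up record with no abstract, to show that missing fields are tolerated.
record = {"title": "Pancreatic neoplasms in mice", "mesh_terms": ["Adenocarcinoma"]}
hit = matches(record, ["pancreatic", "pancreas"])
```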

22 Summary

23 Results

24 Exporting to Excel Output file is a comma-separated file. Download it, and change the .output extension to .csv. If Excel doesn’t open it automatically when you click on it, paste the data into a new sheet and use the Text Import Wizard to separate the columns.

25 Drawbacks What if a gene isn’t associated with any publications? –It’s not important –It’s not yet characterized

26 What about those genes?

27 Analyzing the “other genes” We don’t have literature data. We don’t have expression data. All we have is a sequence.

28 Fun with sequences DNA –Cross-species conservation RNA (cDNA) –Cross-species conservation –Protein sequence prediction Protein conservation Protein domain prediction

29 Protein domains InterPro Conserved Domain Database (NCBI) Wouldn’t it be nice if we could link Gene IDs and protein domains? InterPro ftp://ftp.ncbi.nlm.nih.gov/gene/DATA

30 Who makes those links? From http://www.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html Links between Entrez Gene and Conserved Domain Database (CDD) are calculated from the domains annotated by the CDD group on Reference Sequence proteins.

31 How can we use this? The CDD domains have descriptions. These descriptions can be searched… 1. CANDID finds domains containing our keywords. 2. If a gene has one of those domains, it gets a score of 1. …just like when we searched PubMed!
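The two steps above can be sketched as follows. This is a hedged illustration rather than CANDID's implementation: the domain IDs and descriptions are invented, matching is a plain case-insensitive substring test, and a score of 0 for genes without a matching domain is an assumption (the slide only states the score of 1).

```python
def matching_domains(domain_descriptions, keywords):
    """Step 1: collect the domain IDs whose description contains any keyword."""
    lowered = [keyword.lower() for keyword in keywords]
    return {
        domain_id
        for domain_id, description in domain_descriptions.items()
        if any(keyword in description.lower() for keyword in lowered)
    }

def domain_score(gene_domains, hits):
    """Step 2: score 1 if the gene carries at least one matching domain, else 0."""
    return 1 if set(gene_domains) & set(hits) else 0

# Made-up domain table and gene annotations.
descriptions = {
    "dom1": "Serine protease, trypsin-like",
    "dom2": "Zinc finger, C2H2 type",
}
hits = matching_domains(descriptions, ["protease"])
score_a = domain_score(["dom1", "dom2"], hits)
score_b = domain_score(["dom2"], hits)
```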

32 How far back does our gene go? Is our gene in mammals? Fish? Bacteria?

33 More sequence fun Many measures of conservation –Nucleotide similarity (percentage, pairwise) –Amino acid similarity (percentage, pairwise) –etc., etc.

34 HomoloGene Gets sequences Uses amino acid AND nucleotide similarity measures Plus lots more math, equals… A label that answers our question

35 Labels used in CANDID Homo sapiens Primates (chimp, gorilla) Rodents (rat, mouse) Eutherian mammals (dog, cow, cat) Amniota (chicken) Insects (mosquito, bee) Bilateria (C. elegans) Fungi Eukaryotes HIGHER SCORE
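One way to turn these labels into a number is to rank them. The sketch below assumes that the list runs from lowest to highest score as printed on the slide, so that a gene conserved as far back as "Eukaryotes" scores highest, and it assumes evenly spaced scores between 0 and 1. Both the direction and the spacing are assumptions; the slide gives only the order and the words "HIGHER SCORE".

```python
# Labels in the order shown on the slide.
LABELS = [
    "Homo sapiens",
    "Primates",
    "Rodents",
    "Eutherian mammals",
    "Amniota",
    "Insects",
    "Bilateria",
    "Fungi",
    "Eukaryotes",
]

def conservation_score(label):
    """Map a conservation label to an evenly spaced score in [0, 1].

    Assumption: labels later in the list (broader conservation) score higher.
    """
    rank = LABELS.index(label)  # raises ValueError for an unknown label
    return rank / (len(LABELS) - 1)

human_only = conservation_score("Homo sapiens")
ancient = conservation_score("Eukaryotes")
```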

36 Example 2 pancreas: tumor tissue pancreas: normal tissue custom microarray Known and unknown genes

37 Array candidates Let’s increase the number of CANDID results we got in Example 1…

38 Weighting system Prioritize genes of known or unknown function Modify weights for each category Well-characterized genes: higher literature weight Uncharacterized genes: higher domains, conservation weights
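A weighting system like this could be expressed as a weighted average of the per-category scores. The slide does not give CANDID's combination rule, so the formula, the category names, and the weights below are assumptions for illustration only.

```python
def combined_score(scores, weights):
    """Weighted average of category scores; missing categories count as 0."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0  # assumption: no weights means nothing to rank on
    weighted = sum(weights[cat] * scores.get(cat, 0.0) for cat in weights)
    return weighted / total_weight

# Made-up scores for one gene.
scores = {"literature": 0.5, "domains": 1.0, "conservation": 0.25}

# Favor literature for well-characterized genes...
known_weights = {"literature": 2, "domains": 1, "conservation": 1}
# ...and domains and conservation for uncharacterized ones.
unknown_weights = {"literature": 0, "domains": 1, "conservation": 1}

known = combined_score(scores, known_weights)
unknown = combined_score(scores, unknown_weights)
```

Dividing by the total weight keeps the combined score between 0 and 1 as long as each category score is, so rankings made under different weight settings stay on the same scale.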

39 Example 3 Make up your own example! Use literature, domains, and/or conservation criteria.

40 Next week Expression data Linkage data Association data CANDID’s efficiency Anything else?

41

