CANDID: A candidate gene identification tool Janna Hutz March 19, 2007.



Candidate genes
Positional
–Linkage evidence
–Deletion syndrome
–Loss of heterozygosity
–Disease-related amplification
–Association
Biological
–Pathways
–Phenotypic characteristics
ACT[A/G]GGA

A case study: acd

[Figure: genetic map of the acd region. Labels: 0 cM, ~82 cM, acd, Os, Es-1, ~31 cM, D8Mit5, D8Mit79, D8Mit13, 25 cM, 38.7 cM, 67 cM, acd, 51 cM]

A case study: acd
[Figure labels: 1/145, 3/145, 0/145, 1/145]

Which gene is acd?

Prioritization tools
Endocrinologist/Geneticist
Ensembl
RT-PCR
Sequencing
BINGO! …two years later.

How can we improve this?

Improve our tools
Clinician
–Has memorized information about many disorders; can name some relevant genes
–Gets his/her information from… PubMed

How do we use PubMed to analyze our candidates?
–Enter our phenotypic keywords into PubMed. Read the papers that come up in the results. Make a list of genes.
–Do PubMed searches for all the candidates. Read the papers that come up in the results. Rate the candidates.
Better: Don’t do it yourself…

PubMed
Each publication has a PubMed ID
Each gene has a Gene ID
Wouldn’t it be nice if we could link Gene IDs and PubMed IDs?
–ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
–gene2pubmed.gz
–TaxonomyID; GeneID; PubMedID
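The link file above can be read with a few lines of Python. This is only a sketch: it assumes the file is tab-separated with the three columns in the order listed on the slide, and that comment lines start with "#". The function names and the sample IDs are made up for illustration.

```python
import csv
import gzip
import io
from collections import defaultdict


def load_gene2pubmed(handle, tax_id=None):
    """Map each GeneID to the set of PubMed IDs linked to it.

    `handle` is any text file object with tab-separated lines of
    TaxonomyID, GeneID, PubMedID (assumed layout); lines starting
    with '#' are skipped.
    """
    gene_to_pmids = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        tax, gene_id, pmid = row[0], row[1], row[2]
        # Optionally keep only one species.
        if tax_id is not None and tax != str(tax_id):
            continue
        gene_to_pmids[gene_id].add(pmid)
    return dict(gene_to_pmids)


def load_gene2pubmed_gz(path, tax_id=None):
    """Same as load_gene2pubmed, but reads a local gzip-compressed copy."""
    with gzip.open(path, "rt", newline="") as handle:
        return load_gene2pubmed(handle, tax_id)


# Tiny in-memory sample with made-up IDs, standing in for the real download.
sample = io.StringIO(
    "#tax_id\tGeneID\tPubMed_ID\n"
    "9606\t1001\t111\n"
    "9606\t1001\t222\n"
    "10090\t2002\t333\n"
)
links = load_gene2pubmed(sample, tax_id=9606)
print(links)
```

Keeping the PubMed IDs in a set per gene makes the later "does this gene share publications with my keyword search?" question a simple set intersection.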

Who makes that file? (1)
From NCBI: Links between Gene and PubMed are the result of the following:
1. Manual curation within NCBI. Part of the process of generating a REVIEWED RefSeq is an analysis of the current literature. Papers that are seminal in defining the gene, its sequence, and its function are added to the record at that time. Alert users point out gaps or errors in papers associated with a Gene record. These messages are reviewed and implemented as required.

Who makes that file? (2)
2. Integration of information from other public databases. Gene integrates gene-citation links from resources external to NCBI, such as model organism-specific databases, Gene Ontology (GO), groups curating interactions, and sequence databases. The assumption in using these sources is that they report citations specific to a gene in a known species. Gene does not process citations from OMIM automatically, because many of the citations in OMIM refer to studies of genes in species other than human.

Example 1
[Figure labels: pancreatic cancer; sequence; candidates; $]

Help Sally.
Use CANDID’s literature criterion
User: workshop
Password: perl031907

Help Sally. Look for genes that are involved with pancreatic cancer. What are some keywords we can use?

A measure of relevancy
Find relevant publications
Is Gene X linked to these publications?
How many publications match?
What percent of Gene X’s publications match?

By the numbers…
Literature scores run from 0 to 1.
\[ \frac{\text{Number of gene's publications that match}}{\text{Number of gene's publications}} \]
The score is…
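As a rough illustration of the ratio on this slide, the sketch below computes the fraction of a gene's publications that also appear in the keyword search results. The slide does not give CANDID's full scoring formula, so this is only the fraction itself; the function name and the PubMed IDs are made up.

```python
def literature_fraction(gene_pmids, relevant_pmids):
    """Fraction of a gene's publications that match the keyword search.

    Returns a value between 0 and 1. A gene with no publications gets 0
    (an assumption made here, not something the slide specifies).
    """
    gene_pmids = set(gene_pmids)
    if not gene_pmids:
        return 0.0
    matches = gene_pmids & set(relevant_pmids)
    return len(matches) / len(gene_pmids)


# Gene X has four publications; two of them are in the relevant set.
score = literature_fraction({"111", "222", "333", "444"}, {"222", "444", "999"})
print(score)  # 0.5
```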

Matching
Every publication has a “Text Words” field that includes, when available, …
–Title
–Abstract
–Other abstract
–MeSH terms
–MeSH subheadings
–Publication types
–Substance names
–Personal name as subject
–MEDLINE secondary source
–Other terms
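In PubMed's own search syntax, the Text Words field is selected with the [tw] tag. The sketch below builds such a query from a list of keywords. Whether CANDID queries PubMed this way is not stated in the slides, and the function name and the choice to join terms with OR are assumptions for illustration.

```python
def text_word_query(keywords):
    """Build a PubMed query that looks for any keyword in Text Words."""
    terms = []
    for kw in keywords:
        kw = kw.strip()
        if not kw:
            continue
        # Quote multi-word keywords so they are searched as a phrase.
        if " " in kw:
            kw = f'"{kw}"'
        terms.append(f"{kw}[tw]")
    return " OR ".join(terms)


query = text_word_query(["pancreatic cancer", "adenocarcinoma"])
print(query)  # "pancreatic cancer"[tw] OR adenocarcinoma[tw]
```

The resulting string can be pasted into the PubMed search box to see which publications a given keyword set would pull in.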

Summary

Results

Exporting to Excel
The output file is a comma-separated file.
Download it, and change the .output extension to .csv.
If Excel doesn’t open it automatically when you click on it, paste the data into a new sheet and use the Text Import Wizard to separate the columns.
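The rename step can also be scripted. The sketch below copies a downloaded .output file to a .csv file next to it and reads it back to confirm the columns split on commas. The file name, column names, and values are invented for the demo; the real output columns are not listed in the slides.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def output_to_csv(path):
    """Copy e.g. results.output to results.csv and return the new path."""
    src = Path(path)
    dst = src.with_suffix(".csv")
    shutil.copyfile(src, dst)
    return dst


# Demo with a throwaway file standing in for a downloaded result.
workdir = Path(tempfile.mkdtemp())
demo = workdir / "results.output"
demo.write_text("gene,score\nGENE_A,0.5\n", encoding="utf-8")

csv_path = output_to_csv(demo)
with csv_path.open(newline="", encoding="utf-8") as handle:
    rows = list(csv.reader(handle))
print(csv_path.name, rows)
```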

Drawbacks
What if a gene isn’t associated with any publications?
–It’s not important
–It’s not yet characterized

What about those genes?

Analyzing the “other genes”
We don’t have literature data.
We don’t have expression data.
All we have is a sequence.

Fun with sequences
DNA
–Cross-species conservation
RNA (cDNA)
–Cross-species conservation
–Protein sequence prediction
Protein conservation
Protein domain prediction

Protein domains
InterPro
Conserved Domain Database (NCBI)
Wouldn’t it be nice if we could link Gene IDs and protein domains?
InterPro
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA

Who makes those links?
From NCBI: Links between Entrez Gene and Conserved Domain Database (CDD) are calculated from the domains annotated by the CDD group on Reference Sequence proteins.

How can we use this?
The CDD domains have descriptions. These descriptions can be searched…
1. CANDID finds domains containing our keywords.
2. If a gene has one of those domains, it gets a score of 1.
…just like when we searched PubMed!
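The two steps above can be sketched as follows. This is an illustration, not CANDID's actual code: the data structures, the case-insensitive substring match, the score of 0 for genes without a matching domain, and the domain IDs and descriptions are all assumptions.

```python
def matching_domains(domain_descriptions, keywords):
    """Step 1: find domains whose description contains any keyword."""
    lowered = [k.lower() for k in keywords]
    return {
        domain_id
        for domain_id, description in domain_descriptions.items()
        if any(k in description.lower() for k in lowered)
    }


def domain_score(gene_domains, hits):
    """Step 2: score 1 if the gene has any matching domain, else 0."""
    return 1 if set(gene_domains) & hits else 0


# Made-up domain IDs and descriptions for illustration only.
descriptions = {
    "dom1": "Serine/threonine protein kinase, catalytic domain",
    "dom2": "Zinc finger, C2H2 type",
}
hits = matching_domains(descriptions, ["kinase"])
print(domain_score(["dom1", "dom2"], hits))  # 1
print(domain_score(["dom2"], hits))          # 0
```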

How far back does our gene go? Is our gene in mammals? Fish? Bacteria?

More sequence fun
Many measures of conservation
–Nucleotide similarity (percentage, pairwise)
–Amino acid similarity (percentage, pairwise)
–etc., etc.

HomoloGene
Gets sequences
Uses amino acid AND nucleotide similarity measures
Plus lots more math, equals…
A label that answers our question

Labels used in CANDID
Homo sapiens
Primates (chimp, gorilla)
Rodents (rat, mouse)
Eutherian mammals (dog, cow, cat)
Amniota (chicken)
Insects (mosquito, bee)
Bilateria (C. elegans)
Fungi
Eukaryotes
HIGHER SCORE

Example 2
[Figure labels: pancreas: tumor tissue; pancreas: normal tissue; custom microarray; known and unknown genes]

Array candidates Let’s increase the number of CANDID results we got in Example 1…

Weighting system
Prioritize genes of known or unknown function
Modify weights for each category
Well-characterized genes: higher literature weight
Uncharacterized genes: higher domains, conservation weights
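One way to picture the weighting system is a weighted average of the per-criterion scores. The slides do not say how CANDID combines weights and scores, so the formula, the function name, and the numbers below are assumptions chosen only to show the effect of shifting weight between criteria.

```python
def combined_score(scores, weights):
    """Weighted average of criterion scores (assumed combination rule).

    `scores` and `weights` are dicts keyed by criterion name; a criterion
    with no weight given counts as weight 0.
    """
    total_weight = sum(weights.get(name, 0.0) for name in scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(scores[name] * weights.get(name, 0.0) for name in scores)
    return weighted / total_weight


# Made-up scores for one gene on the three criteria from the workshop.
scores = {"literature": 0.5, "domains": 1.0, "conservation": 0.25}

# Favor well-characterized genes: higher literature weight.
known = combined_score(scores, {"literature": 2, "domains": 1, "conservation": 1})
# Favor uncharacterized genes: higher domains and conservation weights.
unknown = combined_score(scores, {"literature": 0, "domains": 2, "conservation": 2})
print(known, unknown)  # 0.5625 0.625
```

With these numbers the same gene ranks higher under the second weighting, because its domain score carries more of the total.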

Example 3 Make up your own example! Use literature, domains, and/or conservation criteria.

Next week
Expression data
Linkage data
Association data
CANDID’s efficiency
Anything else?