Databases (“knowledge bases”) used in genome analysis
Michael Y. Galperin
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
Growth in genome sequencing
Working Draft Sequence gaps
J. Smith: a very common name
Structure: a very common term
Glutamine amidotransferase: a less common term, but not a very good descriptor
A different professor Janet Smith Another Janet Smith in the news
Glutamine for sale
Tools of the trade for the “armchair scientist”
Databases:
- PubMed and other NCBI databases
- Biochemical databases
- Protein domain databases
- Structural databases
- Genome comparison databases
Tools:
- CDD / COGs
- VAST / FSSP
Types of databases
Archival or Primary Data:
- Text: PubMed
- DNA Sequence: GenBank
- Protein Sequence: Entrez Proteins, TREMBL
- Protein Structures: PDB
Curated or Processed Data:
- DNA Sequences: RefSeq, LocusLink, OMIM
- Protein Sequences: SWISS-PROT, PIR
- Protein Structures: SCOP, CATH, MMDB
- Genomes: Entrez Genomes, COGs
Nucleic Acids Research: Database Issue each January 1, with articles on ~100 different databases
http://www.ncbi.nlm.nih.gov
The National Center for Biotechnology Information (NCBI)
Created in 1988 as a part of the National Library of Medicine, National Institutes of Health
- Establish public databases
- Research in computational biology
- Develop software tools for sequence analysis
- Disseminate biomedical information
Tools: BLAST (1990), Entrez (1992)
GenBank (1992)
Free MEDLINE (PubMed, 1997)
Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, Taxonomy, GeneMap, SAGE, LocusLink, RefSeq
What is GenBank?
Archival nucleotide sequence database
Sample slogans: “Easy deposits, unlimited withdrawals, high interest”, “All bases covered”, “Billions and billions served”
Data are shared nightly among three collaborating databases:
- GenBank at NCBI (Bethesda, Maryland, USA)
- DNA Database of Japan (DDBJ) at NIG (Mishima, Japan)
- European Molecular Biology Laboratory Database (EMBL) at EBI (Hinxton, UK)
Some guiding principles of working with GenBank
- GenBank is a nucleotide-centric view of the information space
- GenBank is a repository of all publicly available sequences
- In GenBank, records are grouped for various reasons
- Data in GenBank is only as good as what you put in
NCBI databases and their links
[Diagram. Databases: Article Abstracts (Medline), Nucleotide Sequences, Protein Sequences, 3-D Structure (MMDB), Genomes, Taxonomy. Link types: Word Weight, BLAST, VAST, Phylogeny]
Entrez: An integrated search and retrieval system
PubMed book links
GenBank Record
[Annotated sample record. Labeled fields: Locus Name, Accession Number, gi Number, Medline ID, GenPept ID, Protein Sequence, Nucleotide Sequence. The rest of the protein and nucleotide sequences is deleted for brevity]
Archival databases are unreliable
- Misinterpreted experimental results
- Annotations based on low similarity
  gi|1968785: cDNA 5' end similar to similar to arrest-defective protein isolog (H. sapiens)
  gi|6522905: very hypothetical protein (S. pombe)
- Biologically senseless annotations
  Deinococcus: head morphogenesis protein
  Arabidopsis: separation anxiety protein-like
  Yersinia: automembrane protein H
  H. pylori: brute force protein
  S. cerevisiae: inside intron 7
- Propagated mistakes of sequence comparison (e.g. ABC1/ABC)
Advanced Neighbors: BLink
BLink
A protein sequence motif is a descriptor of a protein family
Glutamine amidotransferase class I:
[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]
[C is the active site residue]
Glutamine amidotransferase class II:
<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]
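To make the motif notation concrete, here is a minimal sketch that translates the two patterns on this slide into Python regular expressions and scans a sequence with them. The function name `prosite_to_regex` and the test peptides are hypothetical and made up for illustration. The translation covers only the syntax used here (bracketed residue sets, `x` wildcards, repeat counts, `<` and `>` anchors, and `{...}` exclusions); it is not a full parser for the pattern language.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string."""
    pattern = pattern.strip().rstrip(".")
    # '<' anchors the pattern at the N-terminus, '>' at the C-terminus.
    start = "^" if pattern.startswith("<") else ""
    end = "$" if pattern.endswith(">") else ""
    body = pattern.lstrip("<").rstrip(">")
    parts = []
    for element in body.split("-"):
        element = element.strip()
        if not element:
            continue
        # Separate the residue spec from an optional repeat such as (3) or (0,11).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":                       # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return start + "".join(parts) + end

# The two patterns as written on the slide.
CLASS_I = "[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]"
CLASS_II = "<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]"

class_i_re = re.compile(prosite_to_regex(CLASS_I))
class_ii_re = re.compile(prosite_to_regex(CLASS_II))

# Hypothetical peptides built by hand to satisfy each pattern.
hit_i = class_i_re.search("AAPVLGICLGHQLLAA")
hit_ii = class_ii_re.search("MCGIVGAA")
```

A match object reports where the motif lies, so `hit_i.start() + 5` gives the position of the active-site cysteine in the class I example. Short patterns like these match by chance fairly often, so a hit is a hint to follow up, not proof of family membership.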
purF gene neighbors
Searching MMDB
Principles of structural alignment
Dali: http://www.ebi.ac.uk/dali/
Looks for the minimal RMSD between Cα atoms. Calculates Cα-Cα distance matrices, then identifies the longest alignable segments.
VAST (Vector Alignment Search Tool): http://www.ncbi.nlm.nih.gov/Structure/
Looks for pairs of secondary structure elements (α-helices, β-strands) that have similar orientation and connectivity.
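The two quantities named for Dali can be illustrated with a small sketch. It computes a Cα-Cα distance matrix and the RMSD between two coordinate sets that are assumed to be already superimposed and paired residue by residue. The function names and toy coordinates are hypothetical; this is neither Dali's nor VAST's actual algorithm, only the basic arithmetic behind the terms.

```python
import math

def distance_matrix(ca_coords):
    """Pairwise distances between C-alpha atoms given as (x, y, z) tuples."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long, pre-aligned
    coordinate lists. No superposition is performed here."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    squared = [math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))

# Toy three-residue "structure" and a copy translated by 10 units along x.
fragment = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
shifted = [(x + 10.0, y, z) for (x, y, z) in fragment]

dm_original = distance_matrix(fragment)
dm_shifted = distance_matrix(shifted)
```

The distance matrix of the shifted copy is identical to the original even though every coordinate changed. This invariance is why internal distances can be compared without first superimposing the structures, whereas RMSD depends on how the two coordinate sets are placed relative to each other.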
Dali alignment of Tyr phosphatase
VAST Structure Neighbors
Structure Summary BLAST neighbors VAST neighbors Cn3D viewer
Cn3D : Displaying Structures Chloroquine
Structure Neighbors
Use of structural alignments Chloroquine NADH
Online Mendelian Inheritance in Man A catalog of human genes and genetic disorders
OMIM record for Presenilin 1 (PSEN1)
- Contents
- Additional info in OMIM
- Each record provides a state-of-the-art summary of current knowledge
- Associated LocusLink record
- External resources
- Extensive references to literature
OMIM Search Results by Titles
Query: alzheimer AND presenilin 1
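The Boolean query on this slide can also be assembled programmatically. The sketch below builds the query string and a search URL for NCBI's E-utilities `esearch` endpoint. The slides show only the web interface, so the use of E-utilities, the database name `omim`, and the helper names are assumptions made for illustration. The code only constructs the URL and does not contact any server.

```python
from urllib.parse import urlencode

# Assumed endpoint: the E-utilities search service (not mentioned in the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_entrez_query(terms, operator="AND"):
    """Join search terms with an Entrez Boolean operator (AND, OR, NOT)."""
    return f" {operator} ".join(term.strip() for term in terms)

def build_esearch_url(db, terms, operator="AND"):
    """Return a search URL for the given Entrez database and terms."""
    query = build_entrez_query(terms, operator)
    return EUTILS_ESEARCH + "?" + urlencode({"db": db, "term": query})

# The query from the slide, aimed at the (assumed) 'omim' database name.
omim_query = build_entrez_query(["alzheimer", "presenilin 1"])
omim_url = build_esearch_url("omim", ["alzheimer", "presenilin 1"])
```

Swapping `AND` for `OR` broadens the search to records that mention either term, which is usually the first thing to try when a conjunctive query returns nothing.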
Entrez Genome: Gene Location View of chromosome 14 Multiple Maps STSs, ESTs, etc. Gene Name
Integrated View of Chromosome 7 Entrez Genomes Map Viewer Chromosome 7 GenBank Map Contig Map STS Map Multiple Maps STSs, ESTs, etc.
Entrez Genome: Gene Location View of chromosome 14 Gene Name
Entrez Genome: Gene Location Entrez Genomes Map Viewer Chromosome 14 Cytogenetic map Location of PSEN1 and surrounding genes
LocusLink
LocusLink
- Curated resource: central hub of information for human, mouse, rat, zebrafish, and fruit fly loci
- Multiple organisms
- Text querying (e.g. alzheimer)
- Alphabetical listings
- Fields shown: Approved symbol, Stable Locus ID, Description, Genome Position, External Links
LocusLink RefSeq GenBank OMIM UniGene dbSNP
LocusLink: LocusID 5663 PSEN1
National Center for Biotechnology Information Directed by Dr. David J. Lipman http://www.ncbi.nlm.nih.gov