Published by Rodney Fisher. Modified over 9 years ago.
1
Databases (“knowledge bases”) used in genome analysis
Michael Y. Galperin National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, Maryland, USA
2
Growth in genome sequencing
3
Working Draft Sequence
gaps
4
J. Smith - a very common name
Structure - a very common term
Glutamine amidotransferase - a less common term, but not a very good descriptor
5
A different professor Janet Smith
Another Janet Smith in the news
6
Glutamine for sale
7
Tools of trade for the “armchair scientist”
Databases:
PubMed and other NCBI databases
Biochemical databases
Protein domain databases
Structural databases
Genome comparison databases
Tools:
CDD / COGs
VAST / FSSP
8
Types of databases
Archival or Primary Data
Text: PubMed
DNA Sequence: GenBank
Protein Sequence: Entrez Proteins, TREMBL
Protein Structures: PDB
Curated or Processed Data
DNA Sequences: RefSeq, LocusLink, OMIM
Protein Sequences: SWISS-PROT, PIR
Protein Structures: SCOP, CATH, MMDB
Genomes: Entrez Genomes, COGs
Nucleic Acids Research: Database Issue each January 1, with articles on ~100 different databases
10
The National Center for Biotechnology Information (NCBI)
Created as a part of the National Library of Medicine, National Institutes of Health, in 1988
Establish public databases
Research in computational biology
Develop software tools for sequence analysis
Disseminate biomedical information
Tools: BLAST (1990), Entrez (1992)
GenBank (1992)
Free MEDLINE (PubMed, 1997)
Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, Taxonomy, GeneMap, SAGE, LocusLink, RefSeq
11
What is GenBank?
Archival nucleotide sequence database
Sample slogans: “Easy deposits, unlimited withdrawals, high interest”, “All bases covered”, “Billions and billions served”
Data are shared nightly among three collaborating databases:
GenBank at NCBI - Bethesda, Maryland, USA
DNA Database of Japan (DDBJ) at NIG - Mishima, Japan
European Molecular Biology Laboratory Database (EMBL) at EBI - Hinxton, UK
12
Some guiding principles of working with GenBank
GenBank is a nucleotide-centric view of the information space
GenBank is a repository of all publicly available sequences
In GenBank, records are grouped for various reasons
Data in GenBank are only as good as what you put in
13
NCBI databases and their links
[Diagram of the NCBI databases and their links. Labels: Word Weight, VAST, BLAST, Phylogeny, Article Abstracts (Medline), 3-D Structure (MMDB), Taxonomy, Genomes, Nucleotide Sequences, Protein Sequences]
14
Entrez: An integrated search and retrieval system
20
PubMed book links
21
GenBank Record
[Annotated record with callouts: Locus Name, Accession Number, gi Number, Medline ID, GenPept ID, Protein Sequence, Nucleotide Sequence. The rest of the protein and nucleotide sequences is deleted for brevity.]
24
Archival databases are unreliable
Misinterpreted experimental results
Annotations based on low similarity:
gi| cDNA 5' end similar to similar to arrest-defective protein isolog (H. sapiens)
gi| very hypothetical protein (S. pombe)
Biologically senseless annotations:
Deinococcus: head morphogenesis protein
Arabidopsis: separation anxiety protein-like
Yersinia: automembrane protein H
H. pylori: brute force protein
S. cerevisiae: inside intron 7
Propagated mistakes of sequence comparison (e.g. ABC1/ABC)
25
Advanced Neighbors: BLink
26
BLink
38
Protein sequence motif is a descriptor of a protein family
Glutamine amidotransferase class I
[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]
[C is the active site residue]
Glutamine amidotransferase class II
<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]
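The two motifs above are written in PROSITE-style pattern notation: `x` stands for any residue, square brackets list the allowed residues, `(n,m)` is a repeat range, and `<` anchors the pattern to the N-terminus. A minimal sketch of how such a pattern could be turned into a regular expression and scanned against a sequence follows. The function name `prosite_to_regex` and the two test sequences are hypothetical and made up for illustration; they are not real proteins, and this is not the scanning code used by any database mentioned in the slides.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper covering only the common syntax elements:
    x, [..], {..}, repeat counts (n) or (n,m), and the < / > anchors.
    """
    pattern = pattern.strip().rstrip(".")
    prefix = suffix = ""
    if pattern.startswith("<"):          # '<' anchors to the N-terminus
        prefix, pattern = "^", pattern[1:]
    if pattern.endswith(">"):            # '>' anchors to the C-terminus
        suffix, pattern = "$", pattern[:-1]

    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        # Separate the residue specification from an optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return prefix + "".join(parts) + suffix


# Patterns copied from the slide.
CLASS_I = "[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]"
CLASS_II = "<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]"

class_i_regex = prosite_to_regex(CLASS_I)
class_ii_regex = prosite_to_regex(CLASS_II)

# Made-up toy sequences, constructed only to exercise the patterns.
toy_class_i = "AAPLLGLCLGAQALKK"
toy_class_ii = "MCGIVGAA"

hit_i = re.search(class_i_regex, toy_class_i)
hit_ii = re.search(class_ii_regex, toy_class_ii)
print(class_ii_regex, bool(hit_i), bool(hit_ii))
```

Because the class II pattern is anchored with `<`, it can only match when the cysteine lies within the first 12 residues, whereas the class I pattern may match anywhere in the sequence.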
47
purF gene neighbors
48
Searching MMDB
49
Principles of structural alignment
Dali: looks for minimal RMSD between Cα atoms. It calculates Cα-Cα distance matrices, then identifies the longest alignable segments.
VAST (Vector Alignment Search Tool): looks for pairs of secondary structure elements (α-helices, β-strands) that have similar orientation and connectivity.
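To make the two quantities named above concrete, here is a minimal sketch of an intra-structure Cα-Cα distance matrix and of RMSD between two already-paired sets of Cα coordinates. It is not the Dali or VAST algorithm. The coordinates are made up (three atoms spaced 3.8 Å apart), and the function names are hypothetical.

```python
import math


def distance_matrix(coords):
    """Return the matrix of pairwise distances within one structure."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length, already-paired coordinate lists.

    No superposition is performed here; the value depends on how the
    two structures are placed in space.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def matrix_difference(coords_a, coords_b):
    """Mean absolute difference between the two distance matrices."""
    dm_a, dm_b = distance_matrix(coords_a), distance_matrix(coords_b)
    n = len(dm_a)
    return sum(abs(dm_a[i][j] - dm_b[i][j])
               for i in range(n) for j in range(n)) / (n * n)


# Made-up toy coordinates: structure B is structure A shifted by (1, 1, 1).
structure_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
structure_b = [(x + 1.0, y + 1.0, z + 1.0) for x, y, z in structure_a]

print(rmsd(structure_a, structure_b))               # nonzero without superposition
print(matrix_difference(structure_a, structure_b))  # essentially zero
```

The toy case shows why distance matrices are convenient for comparison: they do not change when a structure is translated or rotated, so two copies of the same fold can be recognized without first superimposing them.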
50
Dali alignment of Tyr phosphatase
51
VAST Structure Neighbors
52
Structure Summary BLAST neighbors VAST neighbors Cn3D viewer
53
Cn3D : Displaying Structures
Chloroquine
54
Structure Neighbors
55
Use of structural alignments
Chloroquine NADH
56
Online Mendelian Inheritance in Man
A catalog of human genes and genetic disorders
57
OMIM record for Presenilin 1 (PSEN1)
Contents
Additional info in OMIM
Each record provides a state-of-the-art summary of current knowledge
Associated LocusLink record
External resources
Extensive references to literature
58
OMIM Search Results by Titles
alzheimer AND presenilin 1
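The Boolean query above is typed into the OMIM search form on the slide. As a sketch of how the same term could be submitted programmatically, the snippet below builds a request URL for NCBI's E-utilities `esearch` endpoint with `db=omim`. Using E-utilities here is an assumption for illustration, not something shown in the slides, and the helper name `build_esearch_url` is hypothetical. The code only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (an assumed access route; the slides
# show only the web search form).
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database: str, term: str, retmax: int = 20) -> str:
    """Return an esearch URL for the given Entrez database and query term."""
    params = {"db": database, "term": term, "retmax": retmax}
    return ESEARCH_BASE + "?" + urlencode(params)


# Query string copied from the slide.
omim_url = build_esearch_url("omim", "alzheimer AND presenilin 1")
print(omim_url)
```

Fetching this URL would return a list of matching record identifiers rather than the formatted title list shown on the slide.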
59
Entrez Genome: Gene Location
View of chromosome 14 Multiple Maps STSs, ESTs, etc. Gene Name
60
Integrated View of Chromosome 7
Entrez Genomes Map Viewer Chromosome 7 GenBank Map Contig Map STS Map Multiple Maps STSs, ESTs, etc.
61
Entrez Genome: Gene Location
View of chromosome 14 Gene Name
62
Entrez Genome: Gene Location
Entrez Genomes Map Viewer Chromosome 14 Cytogenetic map Location of PSEN1 and surrounding genes
63
LocusLink
64
LocusLink
Query: alzheimer
Curated resource
Central hub of information for human, mouse, rat, zebrafish, and fruit fly loci
Multiple organisms
Text querying
Alphabetical listings
Approved symbol
Stable Locus ID
Description
Genome position
External links
65
LocusLink RefSeq GenBank OMIM UniGene dbSNP
66
LocusLink: LocusID 5663 PSEN1
67
National Center for Biotechnology Information
Directed by Dr. David J. Lipman