Databases (“knowledge bases”) used in genome analysis

Name: Databases (“knowledge bases”) used in genome analysis
Uploaded: 2018-01-10T21:33:54+00:00
Duration: PTM12S51
Channel: Rodney Fisher
Description: Databases (“knowledge bases”) used in genome analysis

Databases (“knowledge bases”) used in genome analysis
Michael Y. Galperin, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA

Growth in genome sequencing

Working Draft Sequence
gaps

J. Smith - a very common name
Structure - a very common term
Glutamine amidotransferase - a less common term, but not a very good descriptor

A different professor Janet Smith
Another Janet Smith in the news

Glutamine for sale

Tools of trade for the “armchair scientist”
Databases: PubMed and other NCBI databases, biochemical databases, protein domain databases, structural databases, genome comparison databases
Tools: CDD / COGs, VAST / FSSP

Types of databases
Archival or Primary Data
Text: PubMed
DNA Sequence: GenBank
Protein Sequence: Entrez Proteins, TREMBL
Protein Structures: PDB
Curated or Processed Data
DNA Sequences: RefSeq, LocusLink, OMIM
Protein Sequences: SWISS-PROT, PIR
Protein Structures: SCOP, CATH, MMDB
Genomes: Entrez Genomes, COGs
Nucleic Acids Research: Database Issue each January 1, with articles on ~100 different databases

The National Center for Biotechnology Information (NCBI)
Created as a part of the National Library of Medicine, National Institutes of Health, in 1988
Establish public databases
Research in computational biology
Develop software tools for sequence analysis
Disseminate biomedical information
Tools: BLAST (1990), Entrez (1992)
GenBank (1992)
Free MEDLINE (PubMed, 1997)
Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, Taxonomy, GeneMap, SAGE, LocusLink, RefSeq

What is GenBank?
Archival nucleotide sequence database
Sample slogans: “Easy deposits, unlimited withdrawals, high interest”, “All bases covered”, “Billions and billions served”
Data are shared nightly among three collaborating databases:
GenBank at NCBI - Bethesda, Maryland, USA
DNA Database of Japan (DDBJ) at NIG - Mishima, Japan
European Molecular Biology Laboratory Database (EMBL) at EBI - Hinxton, UK

Some guiding principles of working with GenBank
GenBank is a nucleotide-centric view of the information space
GenBank is a repository of all publicly available sequences
In GenBank, records are grouped for various reasons
Data in GenBank are only as good as what you put in

NCBI databases and their links
[Diagram: linked NCBI databases (Medline article abstracts, nucleotide sequences, protein sequences, 3-D structures in MMDB, genomes, taxonomy) with link types labeled Word Weight, BLAST, VAST, and Phylogeny]

Entrez: An integrated search and retrieval system

PubMed book links

GenBank Record
[Figure: annotated GenBank record with callouts for Locus Name, Accession Number, gi Number, Medline ID, GenPept ID, Protein Sequence, and Nucleotide Sequence; the rest of the protein and nucleotide sequences is deleted for brevity]

Archival databases are unreliable
Misinterpreted experimental results
Annotations based on low similarity:
gi| cDNA 5' end similar to similar to arrest-defective protein isolog (H. sapiens)
gi| very hypothetical protein (S. pombe)
Biologically senseless annotations:
Deinococcus: head morphogenesis protein
Arabidopsis: separation anxiety protein-like
Yersinia: automembrane protein H
H. pylori: brute force protein
S. cerevisiae: inside intron 7
Propagated mistakes of sequence comparison (e.g. ABC1/ABC)

Advanced Neighbors: BLink

Protein sequence motif is a descriptor of a protein family
Glutamine amidotransferase class I
[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA] [C is the active site residue]
Glutamine amidotransferase class II
<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]
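The two patterns on this slide use PROSITE-style syntax: elements separated by "-", "x" for any residue, brackets for allowed residues, "(n,m)" for repeats, and "<" for the N-terminus. As a rough illustration of how such a motif can be scanned against a sequence, the sketch below translates that syntax into a Python regular expression. The helper name prosite_to_regex and the toy sequences are made up for this example; they are not from the talk and are not real protein sequences.

```python
import re

# Patterns copied from the slide (PROSITE-style syntax).
CLASS_I = "[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]"
CLASS_II = "<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]"

# One pattern element: [ABC], {ABC}, a single residue, or x,
# optionally followed by a repeat count such as (3) or (0,11).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string.

    Hypothetical helper written for this example only.
    """
    parts = []
    for element in pattern.strip().split("-"):
        element = element.strip()
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + repr(element))
        core, low, high = match.groups()
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{"):       # forbidden residues
            core = "[^" + core[1:-1] + "]"
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Toy sequences invented for the example (not real proteins).
toy_class_i = "AAA" + "PLIGICLGAQAL" + "KK"
toy_class_ii = "MCGIVGKK"

hit_i = re.search(prosite_to_regex(CLASS_I), toy_class_i)
hit_ii = re.search(prosite_to_regex(CLASS_II), toy_class_ii)
print(prosite_to_regex(CLASS_II))
print(hit_i.start(), hit_i.group())
print(hit_ii.group())
```

Because the class II pattern is anchored with "<", the same residues placed more than 11 positions from the start of a sequence would not be reported as a hit.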

purF gene neighbors

Searching MMDB

Principles of structural alignment
Dali: looks for minimal RMSD between Cα atoms. Calculates Cα-Cα distance matrices, then identifies the longest alignable segments
VAST (Vector Alignment Search Tool): looks for pairs of secondary structure elements (α-helices, β-strands) that have similar orientation and connectivity
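To make the two quantities named for Dali concrete, here is a minimal sketch of a Cα-Cα distance matrix and an RMSD calculation. It is not Dali's actual algorithm; it only shows why distance matrices are convenient: they do not change when a structure is moved, whereas RMSD is only meaningful after the two structures are superimposed. The coordinates are invented toy values, the function names are hypothetical, and math.dist requires Python 3.8 or later.

```python
import math


def distance_matrix(coords):
    """All pairwise distances between atoms of one structure."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def rmsd(coords_a, coords_b):
    """RMSD between equal-length, already paired coordinate lists.

    No superposition is performed here; the caller is assumed to have
    aligned the two structures beforehand.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def max_matrix_difference(m1, m2):
    """Largest absolute element-wise difference between two matrices."""
    return max(abs(x - y) for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))


# Toy C-alpha trace (invented values, in angstroms) and a shifted copy of it.
ca_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (7.6, 3.8, 0.0)]
shift = (10.0, -5.0, 2.0)
ca_b = [tuple(c + s for c, s in zip(atom, shift)) for atom in ca_a]

# The distance matrices agree although the copy was moved ...
print(max_matrix_difference(distance_matrix(ca_a), distance_matrix(ca_b)))
# ... while RMSD without superposition reports a large deviation.
print(rmsd(ca_a, ca_b))
```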

Dali alignment of Tyr phosphatase

VAST Structure Neighbors

Structure Summary BLAST neighbors VAST neighbors Cn3D viewer

Cn3D : Displaying Structures
Chloroquine

Structure Neighbors

Use of structural alignments
Chloroquine NADH

Online Mendelian Inheritance in Man
A catalog of human genes and genetic disorders

OMIM record for Presenilin 1 (PSEN1)
Contents
Additional info in OMIM
Each record provides a state-of-the-art summary of current knowledge
Associated LocusLink record
External resources
Extensive references to literature

OMIM Search Results by Titles
alzheimer AND presenilin 1
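The slide shows this Boolean query typed into the OMIM web search form. As an assumption that goes beyond the talk, the same query text could be sent through NCBI's E-utilities, the programmatic interface to Entrez. The sketch below only builds the request URL and does not contact the server; the helper name entrez_search_url is hypothetical.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_search_url(database, query, max_results=20):
    """Build an ESearch URL for a Boolean Entrez query (no network access)."""
    params = {"db": database, "term": query, "retmax": max_results}
    return ESEARCH + "?" + urlencode(params)


url = entrez_search_url("omim", "alzheimer AND presenilin 1")
print(url)
```

Fetching the resulting URL would be a separate step and is left out so that the example stays runnable offline.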

Entrez Genome: Gene Location
View of chromosome 14
Multiple Maps: STSs, ESTs, etc.
Gene Name

Integrated View of Chromosome 7
Entrez Genomes Map Viewer: Chromosome 7
GenBank Map
Contig Map
STS Map
Multiple Maps: STSs, ESTs, etc.

View of chromosome 14
Gene Name

Entrez Genomes Map Viewer: Chromosome 14
Cytogenetic map
Location of PSEN1 and surrounding genes

LocusLink

LocusLink
Curated resource: central hub of information for human, mouse, rat, zebrafish, and fruit fly loci
Multiple organisms
Text querying (query shown: alzheimer)
Alphabetical listings
Record fields: approved symbol, stable Locus ID, description, genome position, external links

LocusLink: RefSeq, GenBank, OMIM, UniGene, dbSNP

LocusLink: LocusID 5663 PSEN1

National Center for Biotechnology Information
Directed by Dr. David J. Lipman
