Databases (“knowledge bases”) used in genome analysis

Databases (“knowledge bases”) used in genome analysis
Michael Y. Galperin
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA

Growth in genome sequencing

Working Draft Sequence gaps

J. Smith - a very common name
Structure - a very common term
Glutamine amidotransferase - a less common term, but not a very good descriptor

A different professor Janet Smith
Another Janet Smith in the news

Glutamine for sale

Tools of the trade for the “armchair scientist”
Databases:
  PubMed and other NCBI databases
  Biochemical databases
  Protein domain databases
  Structural databases
  Genome comparison databases
Tools:
  CDD / COGs
  VAST / FSSP

Types of databases
Archival or Primary Data
  Text: PubMed
  DNA Sequence: GenBank
  Protein Sequence: Entrez Proteins, TREMBL
  Protein Structures: PDB
Curated or Processed Data
  DNA Sequences: RefSeq, LocusLink, OMIM
  Protein Sequences: SWISS-PROT, PIR
  Protein Structures: SCOP, CATH, MMDB
  Genomes: Entrez Genomes, COGs
Nucleic Acids Research: Database Issue each January 1
  Articles on ~100 different databases

http://www.ncbi.nlm.nih.gov

The National Center for Biotechnology Information (NCBI)
Created as a part of the National Library of Medicine, National Institutes of Health, in 1988
  Establish public databases
  Research in computational biology
  Develop software tools for sequence analysis
  Disseminate biomedical information
Tools: BLAST (1990), Entrez (1992)
GenBank (1992)
Free MEDLINE (PubMed, 1997)
Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, Taxonomy, GeneMap, SAGE, LocusLink, RefSeq

What is GenBank?
Archival nucleotide sequence database
Sample slogans: “Easy deposits, unlimited withdrawals, high interest”, “All bases covered”, “Billions and billions served”
Data are shared nightly among three collaborating databases:
  GenBank at NCBI - Bethesda, Maryland, USA
  DNA Data Bank of Japan (DDBJ) at NIG - Mishima, Japan
  European Molecular Biology Laboratory Database (EMBL) at EBI - Hinxton, UK

Some guiding principles of working with GenBank
  GenBank is a nucleotide-centric view of the information space
  GenBank is a repository of all publicly available sequences
  In GenBank, records are grouped for various reasons
  Data in GenBank are only as good as what you put in

NCBI databases and their links
[Diagram labels: Article Abstracts (Medline), Nucleotide Sequences, Protein Sequences, 3-D Structure (MMDB), Genomes, Taxonomy; links: Word Weight, BLAST, VAST, Phylogeny]

Entrez: An integrated search and retrieval system

PubMed book links

GenBank Record
[Annotated record; labels: Locus Name, Accession Number, gi Number, Medline ID, GenPept ID, Protein Sequence, Nucleotide Sequence; the rest of the protein and nucleotide sequences deleted for brevity]

Archival databases are unreliable
Misinterpreted experimental results
Annotations based on low similarity
  gi|1968785 - cDNA 5' end similar to similar to arrest-defective protein isolog (H. sapiens)
  gi|6522905 - very hypothetical protein (S. pombe)
Biologically senseless annotations
  Deinococcus: head morphogenesis protein
  Arabidopsis: separation anxiety protein-like
  Yersinia: automembrane protein H
  H. pylori: brute force protein
  S. cerevisiae: inside intron 7
Propagated mistakes of sequence comparison (e.g. ABC1/ABC)

Advanced Neighbors: BLink

BLink

Protein sequence motif is a descriptor of a protein family
Glutamine amidotransferase class I
  [PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]
  [C is the active site residue]
Glutamine amidotransferase class II
  <x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]
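The two patterns above are written in PROSITE-style syntax, and one way to try them on a sequence is to translate them into regular expressions. The sketch below is a minimal translator, not PROSITE's own tooling: the function name `prosite_to_regex` and the test peptides are made up for illustration, and it handles only the syntax used on this slide plus `{...}` exclusions and a trailing `>` anchor.

```python
import re

CLASS_I = "[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]"
CLASS_II = "<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]"

# One pattern element: a core (x, a residue, [..] or {..}) followed by an
# optional repeat count such as (3) or (0,11).
_ELEMENT = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        prefix = suffix = ""
        if element.startswith("<"):  # '<' anchors the pattern at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # '>' anchors the pattern at the C-terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":  # 'x' stands for any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        if low is not None:  # repeat count: (n) or (n,m)
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


class_i = re.compile(prosite_to_regex(CLASS_I))
class_ii = re.compile(prosite_to_regex(CLASS_II))

# Toy peptides written to fit the patterns; they are not real protein sequences.
print(class_i.pattern, bool(class_i.search("MKPLLGICLGAQALAE")))
print(class_ii.pattern, bool(class_ii.search("MCGIVGAA")))
```

The `<` anchor matters for class II: the catalytic cysteine has to fall within the first twelve residues, so the same motif deeper in a sequence is not reported.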

purF gene neighbors

Searching MMDB

Principles of structural alignment
Dali: http://www.ebi.ac.uk/dali/
  Looks for minimal RMSD between Cα atoms. Calculates Cα-Cα distance matrices, then identifies the longest alignable segments
VAST (Vector Alignment Search Tool): http://www.ncbi.nlm.nih.gov/Structure/
  Looks for pairs of secondary structure elements (α-helices, β-strands) that have similar orientation and connectivity
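To make the two quantities named for Dali concrete, here is a small sketch that computes a Cα-Cα distance matrix and an RMSD for toy coordinates. It is not the Dali algorithm (there is no alignment search and no superposition step), and the function names and the three-point "structure" are invented for illustration. It shows why distance matrices are convenient for comparison: they are unchanged when a structure is moved as a rigid body, whereas RMSD depends on how the two structures are placed relative to each other.

```python
import math


def distance_matrix(coords):
    """Return the matrix of pairwise distances between points (e.g. C-alpha atoms)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long, already paired point sets."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures must have the same number of points")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Toy "structure" of three points and a copy shifted by 1 unit along x.
structure = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in structure]

# The internal distances are identical for both copies...
print(distance_matrix(structure) == distance_matrix(shifted))
# ...but the RMSD is non-zero until the copies are superimposed.
print(rmsd(structure, shifted))
```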

Dali alignment of Tyr phosphatase

VAST Structure Neighbors

Structure Summary
[Screenshot labels: BLAST neighbors, VAST neighbors, Cn3D viewer]

Cn3D: Displaying Structures
Chloroquine

Structure Neighbors

Use of structural alignments
Chloroquine, NADH

Online Mendelian Inheritance in Man (OMIM)
A catalog of human genes and genetic disorders

OMIM record for Presenilin 1 (PSEN1)
Contents
Additional info in OMIM
Each record provides a state-of-the-art summary of current knowledge
  Associated LocusLink record
  External resources
  Extensive references to literature

OMIM Search Results by Titles
Query: alzheimer AND presenilin 1
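The same Boolean query can also be sent to Entrez programmatically. The sketch below only builds the request URL and does not contact NCBI. It assumes the E-utilities `esearch` endpoint and the `omim` database name, neither of which appears in the slides, so both should be checked against current NCBI documentation before use. The helper name `entrez_search_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL of the NCBI E-utilities search service (not from the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_search_url(db: str, term: str) -> str:
    """Build an esearch URL for a database name and a Boolean query string."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


# The query shown on the slide, with no field restriction applied.
url = entrez_search_url("omim", "alzheimer AND presenilin 1")
print(url)
```

Keeping URL construction separate from the network call makes the query easy to inspect before anything is sent.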

Entrez Genome: Gene Location
View of chromosome 14
[Screenshot labels: Multiple Maps; STSs, ESTs, etc.; Gene Name]

Integrated View of Chromosome 7
Entrez Genomes Map Viewer, Chromosome 7
[Screenshot labels: GenBank Map, Contig Map, STS Map; Multiple Maps; STSs, ESTs, etc.]

Entrez Genome: Gene Location
View of chromosome 14
[Screenshot label: Gene Name]

Entrez Genome: Gene Location
Entrez Genomes Map Viewer, Chromosome 14
Cytogenetic map
Location of PSEN1 and surrounding genes

LocusLink

LocusLink
Curated Resource
Central hub of information for human, mouse, rat, zebrafish, and fruit fly loci
Multiple Organisms
Text querying (e.g. alzheimer)
Alphabetical listings
[Screenshot labels: Approved symbol, Stable Locus ID, Description, Genome Position, External Links]

LocusLink
[Diagram labels: RefSeq, GenBank, OMIM, UniGene, dbSNP]

LocusLink: LocusID 5663 PSEN1

National Center for Biotechnology Information
Directed by Dr. David J. Lipman
http://www.ncbi.nlm.nih.gov