Introduction to Bioinformatics Databases

Central dogma of molecular biology: DNA → RNA → protein → phenotype. A main focus of bioinformatics is to study molecular sequence data to gain insight into a broad range of biological problems.

With the use of bioinformatics we can learn about the variation that occurs between species, and we can deduce the evolutionary history of life on Earth. After Pace NR (1997) Science 276:734. Page 6

Growth of GenBank, December 1982 to June 2006 [figure: base pairs of DNA (billions) and sequences (millions)]

Growth of the International Nucleotide Sequence Database Collaboration [figure: base pairs of DNA (billions) contributed by GenBank, EMBL, and DDBJ]

Central dogma of molecular biology: DNA → RNA → protein
Central dogma of bioinformatics and genomics: genome → transcriptome → proteome

DNA (genomic DNA databases) → RNA (cDNA, ESTs, UniGene) → protein (protein sequence databases) → phenotype. Fig. 2.2 Page 20

There are three major public DNA databases: GenBank, EMBL, and DDBJ. The underlying raw DNA sequences are identical. Page 16

There are three major public DNA databases: GenBank, housed at NCBI (National Center for Biotechnology Information); EMBL, housed at EBI (European Bioinformatics Institute); and DDBJ, housed in Japan. Page 16

>300,000 species are represented in GenBank Table 2-1

Taxonomy nodes at NCBI

The most sequenced organisms in GenBank
Homo sapiens               10.7 billion bases
Mus musculus               6.5b
Rattus norvegicus          5.6b
Danio rerio                1.7b
Zea mays                   1.4b
Oryza sativa               0.8b
Drosophila melanogaster    0.7b
Gallus gallus              0.5b
Arabidopsis thaliana       0.5b
Updated GenBank release Table 2-2 Page 18

The most sequenced organisms in GenBank
Homo sapiens               11.2 billion bases
Mus musculus               7.5b
Rattus norvegicus          5.7b
Danio rerio                2.1b
Bos taurus                 1.9b
Zea mays                   1.4b
Oryza sativa (japonica)    1.2b
Xenopus tropicalis         0.9b
Canis familiaris           0.8b
Drosophila melanogaster    0.7b
Updated GenBank release Table 2-2 Page 18

The most sequenced organisms in GenBank
Homo sapiens                     12.3 billion bases
Mus musculus                     8.0b
Rattus norvegicus                5.7b
Bos taurus                       3.5b
Danio rerio                      2.5b
Zea mays                         1.8b
Oryza sativa (japonica)          1.5b
Strongylocentrotus purpuratus    1.2b
Sus scrofa                       1.0b
Xenopus tropicalis               1.0b
Updated GenBank release Table 2-2 Page 18

National Center for Biotechnology Information (NCBI) Page 24

Types of data in GenBank: DNA level; RNA level (cDNA); protein sequences; …

Fig. 2.5 Page 25

PubMed is… the National Library of Medicine's search service; 16 million citations in MEDLINE; links to participating online journals; a PubMed tutorial (via “Education” on the side bar). Page 24

Entrez integrates… the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes Page 24

Entrez is a search and retrieval system that integrates NCBI databases Page 24

BLAST is… the Basic Local Alignment Search Tool; NCBI's sequence similarity search tool; supports analysis of DNA and protein databases; 100,000 searches per day. Page 25

OMIM is… Online Mendelian Inheritance in Man; a catalog of human genes and genetic disorders; edited by Dr. Victor McKusick and others at JHU. Page 25

Books is… searchable resource of on-line books Page 26

TaxBrowser is… a browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses); taxonomy information such as genetic codes; molecular data on extinct organisms. Page 26

Structure site includes… the Molecular Modelling Database (MMDB); biopolymer structures obtained from the Protein Data Bank (PDB); Cn3D (a 3D-structure viewer); the Vector Alignment Search Tool (VAST). Page 26

Accessing information on molecular sequences Page 26

Accession numbers are labels for sequences. NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data. Page 26

What is an accession number? An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
DNA
  X02775       GenBank genomic DNA sequence
  NT_030059    Genomic contig
  Rs           dbSNP (single nucleotide polymorphism)
RNA
  N            An expressed sequence tag (1 of 170)
  NM_006744    RefSeq DNA sequence (from a transcript)
protein
  NP_007635    RefSeq protein
  AAC02945     GenBank protein
  Q28369       SwissProt protein
  1KT7         Protein Data Bank structure record
Page 27

Four ways to access DNA and protein sequences
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System (separate from NCBI)
Page 27

4 ways to access protein and DNA sequences
[1] Entrez Gene with RefSeq
Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_007635). Page 27
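
Once a RefSeq accession is known, the record can also be requested programmatically. The sketch below only builds the request URLs and does not contact NCBI. It assumes the NCBI E-utilities `efetch` endpoint and its `db`, `id`, `rettype`, and `retmode` parameters, none of which are mentioned in the slides; the helper name `refseq_fasta_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities efetch endpoint (not named in the slides).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def refseq_fasta_url(accession: str, db: str) -> str:
    """Build an efetch URL that asks for one record in FASTA format."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


# The two RefSeq accessions for RBP4 given on the slide.
mrna_url = refseq_fasta_url("NM_006744", db="nuccore")
protein_url = refseq_fasta_url("NP_007635", db="protein")
print(mrna_url)
print(protein_url)
```

Pasting either URL into a browser, or passing it to an HTTP client, would be the next step; that part is left out so the example runs offline.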

From the NCBI home page, type “rbp4” and hit “Go”. revised Fig. 2.7 Page 29

By applying limits, there are now just two entries

Entrez Gene (top of page). Note that links to many other RBP4 database entries are available. revised Fig. 2.8 Page 30

Entrez Gene (middle of page)

Entrez Gene (bottom of page)

Fig. 2.9 Page 32

FASTA format Fig Page 32
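
A FASTA record is a header line beginning with `>` followed by one or more lines of sequence. As a minimal sketch of reading that layout, the function below collects each record into a dictionary. The function name `parse_fasta` and the two example records are made up for illustration and are not real RBP4 data.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Return {header: sequence} for every record in a FASTA-formatted string."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
    return records


# Hypothetical records, used only to exercise the parser.
example = """>example_protein hypothetical record
MKWVWALLLL
AALGSG
>example_dna another hypothetical record
ATGAAGTGGG
"""

parsed = parse_fasta(example)
for name, seq in parsed.items():
    print(name, len(seq))
```

For real work an established parser (for example the one in Biopython) is usually a better choice; the sketch is only meant to show how little structure the format has.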

NCBI’s important RefSeq project: best representative sequences
RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence. RefSeq identifiers include the following formats:
Complete genome        NC_######
Complete chromosome    NC_######
Genomic contig         NT_######
mRNA (DNA format)      NM_######    e.g. NM_
Protein                NP_######    e.g. NP_
Page 29-30

NCBI’s RefSeq project: accession for genomic, mRNA, protein sequences
Accession    Molecule    Note
AP_          Protein     Protein products; alternate
NC_          Genomic     Complete genomic molecules
NG_          Genomic     Incomplete genomic regions
NM_          mRNA        Transcript products; mRNA
NM_          mRNA        Transcript products; 9-digit
NP_          Protein     Protein products;
NP_          Protein     Protein products; 9-digit
NR_          RNA         Non-coding transcripts
NT_          Genomic     Genomic assemblies
NW_          Genomic     Genomic assemblies
NZ_ABCD      Genomic     Whole genome shotgun data
XM_          mRNA        Transcript products
XP_          Protein     Protein products
XR_          RNA         Transcript products
YP_          Protein     Protein products
ZP_          Protein     Protein products
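
Because the two-letter prefix determines the molecule type, the RefSeq prefixes listed above can be turned into a simple lookup. The following is a minimal sketch; the function name `refseq_molecule` is hypothetical, and the mapping only covers the prefixes that appear on this slide.

```python
# Prefix -> molecule type, copied from the RefSeq prefixes on the slide.
REFSEQ_MOLECULE = {
    "AP": "Protein", "NC": "Genomic", "NG": "Genomic", "NM": "mRNA",
    "NP": "Protein", "NR": "RNA", "NT": "Genomic", "NW": "Genomic",
    "NZ": "Genomic", "XM": "mRNA", "XP": "Protein", "XR": "RNA",
    "YP": "Protein", "ZP": "Protein",
}


def refseq_molecule(accession: str) -> str:
    """Return the molecule type for a RefSeq-style accession, or 'unknown'."""
    prefix, sep, _rest = accession.partition("_")
    if not sep:
        return "unknown"  # no underscore, so not a RefSeq-style accession
    return REFSEQ_MOLECULE.get(prefix.upper(), "unknown")


# Accessions for RBP4 taken from the earlier slides.
for acc in ["NM_006744", "NP_007635", "NT_030059", "X02775"]:
    print(acc, refseq_molecule(acc))
```

The GenBank accession X02775 has no underscore and therefore falls through to "unknown", which is a quick way to tell RefSeq records from other database entries.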

Ensembl to access protein and DNA sequences. Try Ensembl for a premier human genome web browser. Ensembl is a joint scientific project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute. Its aim is to provide a centralised resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates. We will encounter Ensembl as we study the human genome, BLAST, and other topics.

click human

Species in Ensembl [figure: species tree with groups labeled fishes, birds, reptiles, mammals, placentals, monotremes, marsupials, other birds, paleognaths, passerines, crocodiles, turtles, lizards, amphibians, teleosts, sharks, rays, Latimeria, bichir/Polypterus, lungfishes, agnathans, non-vertebrates]

enter RBP4

Four ways to access DNA and protein sequences
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System (separate from NCBI)
Page 33

ExPASy to access protein and DNA sequences. ExPASy sequence retrieval system (ExPASy = Expert Protein Analysis System). Visit Page 33

Fig Page 33

Example of how to access sequence data: HIV-1 pol. There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol. Page 34

Searching for HIV-1 pol: Following the “genome” link yields a manageable three results

Example of how to access sequence data: HIV-1 pol
For the Entrez query: hiv-1 pol there are about 40,000 nucleotide or protein records (and >100,000 records for a search for “hiv-1”), but these can be reduced in two easy steps:
--specify the organism, e.g. hiv-1[organism]
--limit the output to RefSeq!
Page 34
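
The two narrowing steps can be expressed as a single Entrez query string. The sketch below composes that string and a search URL without sending any request. The `[organism]` field tag comes from the slide; the `srcdb_refseq[prop]` filter and the E-utilities `esearch` endpoint are assumptions not given in the slides, and `build_term` is a hypothetical helper name.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumption: NCBI E-utilities esearch endpoint (not named in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(keywords: str, organism: Optional[str] = None,
               refseq_only: bool = False) -> str:
    """Combine keywords with optional organism and RefSeq restrictions."""
    parts = [keywords]
    if organism:
        parts.append(f"{organism}[organism]")  # step 1: specify the organism
    if refseq_only:
        parts.append("srcdb_refseq[prop]")     # step 2: assumed RefSeq filter
    return " AND ".join(parts)


broad = build_term("hiv-1 pol")
narrow = build_term("pol", organism="hiv-1", refseq_only=True)
search_url = f"{ESEARCH_URL}?{urlencode({'db': 'nuccore', 'term': narrow})}"
print(broad)
print(narrow)
print(search_url)
```

Keeping the restrictions as separate arguments makes it easy to compare result counts with and without each step, which is the comparison the slide is making.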

Over 100,000 nucleotide entries for HIV-1, but only 1 RefSeq

Examples of how to access sequence data: histone
query for “histone”        # results
protein records            21847
RefSeq entries             7544
RefSeq (limit to human)    1108
NOT deacetylase            697
At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein.

Access to Biomedical Literature Page 35

PubMed at NCBI to find literature information

PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries. It has >14 million records dating back to Page 35

PubMed search strategies
Try the tutorial (“education” on the left sidebar)
Use boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease
Try using “limits”
Try “Links” to find Entrez information and external resources
Obtain articles on-line via Welch Medical Library (and download pdf files):
Page 35

lipocalin AND disease (60 results)
lipocalin OR disease (1,650,000 results)
lipocalin NOT disease (530 results)
[figure: Venn diagrams for 1 AND 2, 1 OR 2, 1 NOT 2] Fig Page 34 8/04
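
The three Boolean operators correspond to set intersection, union, and difference over the citations matched by each term. The sketch below illustrates this with Python sets; the citation IDs are invented purely for the example and are not PubMed records.

```python
# Hypothetical citation IDs, chosen only to illustrate the operators.
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

both = lipocalin & disease        # lipocalin AND disease
either = lipocalin | disease      # lipocalin OR disease
only_first = lipocalin - disease  # lipocalin NOT disease

print("AND:", sorted(both))
print("OR: ", sorted(either))
print("NOT:", sorted(only_first))

# Every lipocalin citation is either also about disease or not,
# so the AND and NOT counts add up to the lipocalin total.
assert len(both) + len(only_first) == len(lipocalin)
```

Applied to the counts on the slide, the same identity suggests roughly 60 + 530 = 590 lipocalin citations in total at the time of the search.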