Lawrence Hunter, Ph.D., Director, Computational Bioscience Program, University of Colorado School of Medicine

Presentation transcript:

Molecular Biology Databases

Tour of the major molecular biology databases
A database is an indexed collection of information.
There is a tremendous amount of information about biomolecules in publicly available databases.
Today, we will just look at some of the main databases and what kind of information they contain.

Data about Databases
Nucleic Acids Research publishes an annual database issue.
The issue lists 1170 editorially selected databases (link on course web site).
Small excerpt from the A's:
  – AARSDB: Aminoacyl-tRNA synthetase sequences
  – ABCdb: ABC transporters
  – AceDB: C. elegans, S. pombe, and human sequences and genomic information
  – ACTIVITY: Functional DNA/RNA site activity
  – ALFRED: Allele frequencies and DNA polymorphisms

Located Sequence Features
Indexing relevant data isn't always easy
  – Naming schemes are always in flux, vary across communities, and are often controversial.
  – Descriptions of phenotypes are very difficult to standardize (even many clinical ones).
Genome sequences provide a clear reference
  – A "located sequence feature" (place on a chromosome) is unambiguous and biologically meaningful.
  – Closely related to the molecular concept of a gene.

What can be discovered about a gene by a database search?
Best to have specific informational goals:
  – Evolutionary information: homologous genes, taxonomic distributions, allele frequencies, synteny, etc.
  – Genomic information: chromosomal location, introns, UTRs, regulatory regions, shared domains, etc.
  – Structural information: associated protein structures, fold types, structural domains
  – Expression information: expression specific to particular tissues, developmental stages, phenotypes, diseases, etc.
  – Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases

Using a database
How to get information out of a database:
  – Summaries: how many entries, average or extreme values, rates of change, most recent entries, etc.
  – Browsing: getting a sense of the kind and quality of information available, e.g. checking familiar records
  – Search: looking for specific, predefined information
"Key" to searching a database:
  – Must somehow identify the element(s) of the database that are of interest:
      – Gene name, symbol, location or other identifying information.
      – Sequences of genes, mRNAs, proteins, etc.
      – A cross-reference from another database, or a database-generated ID.
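The three ways of getting information out (summary, browse, search by key) can be sketched with a tiny in-memory database. This is only an illustration: the table layout, the function names, and the three toy records are assumptions made for the example, not the schema of any real molecular biology database.

```python
import sqlite3

# Toy records for illustration only; the table layout is hypothetical.
RECORDS = [
    ("ADH1A", "alcohol dehydrogenase 1A", "4"),
    ("ADH1B", "alcohol dehydrogenase 1B", "4"),
    ("PPARD", "peroxisome proliferator activated receptor delta", "6"),
]


def build_db():
    """Create an in-memory table keyed (and therefore indexed) by gene symbol."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene (symbol TEXT PRIMARY KEY, name TEXT, chromosome TEXT)"
    )
    conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", RECORDS)
    return conn


def summary(conn):
    """Summary: how many entries does the database hold?"""
    return conn.execute("SELECT COUNT(*) FROM gene").fetchone()[0]


def browse(conn, limit=2):
    """Browsing: look at the first few records to judge what is there."""
    cur = conn.execute(
        "SELECT symbol, name FROM gene ORDER BY symbol LIMIT ?", (limit,)
    )
    return cur.fetchall()


def search(conn, symbol):
    """Search: retrieve one record by its key (here, the gene symbol)."""
    cur = conn.execute(
        "SELECT symbol, name, chromosome FROM gene WHERE symbol = ?", (symbol,)
    )
    return cur.fetchone()


conn = build_db()
print(summary(conn))
print(browse(conn))
print(search(conn, "PPARD"))
```

The search only works because the caller already holds a key (the symbol); a query for a symbol that is not in the table returns nothing, which is the practical reason naming problems matter so much.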

Searching for information about genes and their products
Gene and gene product databases are often organized by sequence
  – Genomic sequence encodes all traits of an organism.
  – Gene products are uniquely described by their sequences.
  – Similar sequences among biomolecules indicate both similar function and an evolutionary relationship.
Macromolecular sequences provide biologically meaningful keys for searching databases.

Searching sequence databases
Starting from a sequence alone, find information about it.
Many kinds & sources of input sequences
  – Genomic, expressed, protein (amino acid vs. nucleic acid)
  – Complete or fragmentary sequences
Goal is to retrieve a set of similar sequences.
  – Exact matches are rare, and not always interesting.
  – Both small differences (mutations) and large (not required for function) within "similar" sequences can be biologically important.
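To make "similar, not identical" concrete, here is a minimal sketch of a local alignment score computed by dynamic programming (the Smith-Waterman recurrence with a linear gap penalty). It is not how BLAST itself searches, since BLAST uses a faster heuristic; the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions chosen for the illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b.

    Smith-Waterman with a linear gap penalty; only the score is returned,
    not the alignment itself. Scoring parameters are illustrative.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: a local alignment may start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# An exact match, a single substitution, and unrelated sequences.
print(local_alignment_score("ACGTACGT", "ACGTACGT"))
print(local_alignment_score("ACGTACGT", "ACGTTCGT"))
print(local_alignment_score("AAAA", "TTTT"))
```

A single substitution lowers the score only slightly, so a mutated sequence is still retrieved, while unrelated sequences score zero.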

What might we want to know about a sequence?
Is this sequence similar to any known genes? How close is the best match? Significance?
What do we know about that gene?
  – Genomic (chromosomal location, allelic information, regulatory regions, etc.)
  – Structural (known structure? structural domains? etc.)
  – Functional (molecular, cellular & disease)
Evolutionary information:
  – Is this gene found in other organisms?
  – What is its taxonomic tree?

NCBI and Entrez
One of the most useful and comprehensive database collections is the NCBI, part of the National Library of Medicine.
  – Home to GenBank, PubMed & many other familiar DBs.
NCBI provides interesting summaries, browsers, and search tools.
Entrez is their database search interface.
Can search on gene names, chromosomal location, diseases, articles, keywords...
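Entrez can also be queried programmatically through NCBI's E-utilities. The sketch below only builds an ESearch request URL and does not contact the server; the helper names and the example query are assumptions made for illustration, and the NCBI documentation should be checked for current parameters and usage limits before sending real requests.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Base address of the NCBI E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch URL for an Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def query_params(url):
    """Decode the query string again, e.g. to check what will be sent."""
    return {key: values[0] for key, values in parse_qs(urlparse(url).query).items()}


# Hypothetical example query: a gene name restricted to one organism.
url = esearch_url("gene", "ADH1A[Gene Name] AND Homo sapiens[Organism]")
print(url)
print(query_params(url))
```

Keeping URL construction separate from the network call makes the query easy to inspect, and the same function works for other Entrez databases by changing `db`.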

BLAST: Searching with a sequence
The goal is to find other sequences that are more similar to the query than would be expected by chance (and therefore are likely homologous).
Can start with a nucleotide or amino acid sequence, and search for either (or both).
Many options
  – E.g. ignore low-information (repetitive) sequence, set significance critical value
  – Defaults are not always appropriate: READ THE NCBI EDUCATION PAGES!
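"More similar than expected by chance" is usually quantified with an expect value. A minimal sketch, assuming the Karlin-Altschul form E = K·m·n·e^(−λS), where S is the raw alignment score, m the query length and n the database length: the default λ and K below are illustrative values only (real ones depend on the scoring matrix and gap costs), and the function names and the threshold are hypothetical.

```python
import math


def bit_score(raw_score, lam=0.318, k=0.134):
    """Normalize a raw score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expected_hits(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def is_significant(raw_score, query_len, db_len, threshold=1e-3):
    """Apply a significance cutoff (the threshold is an arbitrary example)."""
    return expected_hits(raw_score, query_len, db_len) < threshold


# The same raw score becomes less convincing as the database grows.
for db_len in (10**6, 10**8, 10**10):
    print(db_len, expected_hits(100, 375, db_len))
```

Because the expected count scales with the size of the search space, a hit that looks significant against a small database may not be against a large one, which is one reason the defaults deserve a second look.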

Main BLAST page

A demonstration sequence atgcacttgagcagggaagaaatccacaaggactcaccagtctcctggtctgcagagaagacagaatcaacatgagcacagcaggaaaagtaa tcaaatgcaaagcagctgtgctatgggagttaaagaaacccttttccattgaggaggtggaggttgcacctcctaaggcccatgaagttcgtatt aagatggtggctgtaggaatctgtggcacagatgaccacgtggttagtggtaccatggtgaccccacttcctgtgattttaggccatgaggcagc cggcatcgtggagagtgttggagaaggggtgactacagtcaaaccaggtgataaagtcatcccactcgctattcctcagtgtggaaaatgcaga atttgtaaaaacccggagagcaactactgcttgaaaaacgatgtaagcaatcctcaggggaccctgcaggatggcaccagcaggttcacctgc aggaggaagcccatccaccacttccttggcatcagcaccttctcacagtacacagtggtggatgaaaatgcagtagccaaaattgatgcagcct cgcctctagagaaagtctgtctcattggctgtggattttcaactggttatgggtctgcagtcaatgttgccaaggtcaccccaggctctacctgtg ctgtgtttggcctgggaggggtcggcctatctgctattatgggctgtaaagcagctggggcagccagaatcattgcggtggacatcaacaaggac aaatttgcaaaggccaaagagttgggtgccactgaatgcatcaaccctcaagactacaagaaacccatccaggaggtgctaaaggaaatgact gatggaggtgtggatttttcatttgaagtcatcggtcggcttgacaccatgatggcttccctgttatgttgtcatgaggcatgtggcacaagtgtca tcgtaggggtacctcctgattcccaaaacctctcaatgaaccctatgctgctactgactggacgtacctggaagggagctattcttggtggcttta aaagtaaagaatgtgtcccaaaacttgtggctgattttatggctaagaagttttcattggatgcattaataacccatgttttaccttttgaaaaaat aaatgaaggatttgacctgcttcactctgggaaaagtatccgtaccattctgatgttttgagacaatacagatgttttcccttgtggcagtcttcag cctcctctaccctacatgatctggagcaacagctgggaaatatcattaattctgctcatcacagattttatcaataaattacatttgggggctttc caaagaaatggaaattgatgtaaaattatttttcaagcaaatgtttaaaatccaaatgagaactaaataaagtgttgaacatcagctggggaattg aagccaataaaccttccttcttaaccatt

Major choices:
  – Translation
  – Database
  – Filters
  – Restrictions
  – Matrix
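The translation choice matters because a nucleotide query can be read in more than one frame. As a rough sketch, the code below translates the first 93 bases of the demonstration sequence in the three forward frames using the standard genetic code; the function names are assumptions for the example, and reverse-strand frames are left out for brevity.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# First 93 bases of the demonstration sequence.
DEMO_PREFIX = (
    "atgcacttgagcagggaagaaatccacaaggactcaccagtctcctggtc"
    "tgcagagaagacagaatcaacatgagcacagcaggaaaagtaa"
)


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def forward_frames(dna):
    """Translations of the three forward reading frames (offsets 0, 1, 2)."""
    return [translate(dna[offset:]) for offset in range(3)]


for offset, protein in enumerate(forward_frames(DEMO_PREFIX)):
    print(offset, protein)
```

The three frames give entirely different peptides from the same bases, so a translated search has to try each frame rather than assume the query starts at a codon boundary.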

Formatted BLAST output

Close hit: Macaque ADH alpha

Distant hit: L-threonine 3-dehydrogenase from a thermophilic bacterium

Parameters

Click on:

Taxonomy report (link from “Results of BLAST” page)

What did we just do?
Identify loci (genes) associated with the sequence.
  – Input was human Alcohol Dehydrogenase 1A.
For each particular "hit", we can look at that sequence and its alignment in more detail.
See similar sequences, and the organisms in which they are found.
But there's much more that can be found on these genes, even just inside NCBI…

BLink: Precomputed BLAST

Conserved domains

NCBI version of KEGG & EcoCyc

More from Entrez Gene

And more…

PubMed

Gene Expression

Detailed expression information

Genome map view

OMIM

NCBI is not all there is...
Links to non-NCBI databases (see also "Link Out")
  – Reactome for pathways (also KEGG)
  – HGNC for nomenclature
  – HPRD protein information
  – Regulatory / binding site DBs (e.g. CREB; some not linked)
  – iHOP (Information Hyperlinked over Proteins)
Other important gene/protein resources not linked:
  – UniProt (most carefully annotated)
  – PDB (main macromolecular structure repository)
  – UCSC (best genome viewer & many useful 'tracks')
  – DIP / MINT (protein-protein interactions)
  – More: InterPro, MetaCyc, Enzyme, etc.

Gene Names (not easy!)

Protein reference db

Take home messages
There are a lot of molecular biology databases, containing a lot of valuable information.
Not even the best databases have everything (or the best of everything).
These databases are moderately well cross-linked, and there are "linker" databases.
Sequence is a good identifier, maybe even better than gene name!

Homework
Pick a favorite gene (or, if you don't know any, how about looking up one of my favorites, PPAR-Delta) and gather information about it from at least five distinct resources.
Readings:
  – Nucleic Acids Research online Molecular Biology Database Collection in 2009. Nucl. Acids Res : D1-D4 doi: /nar/gkn942. Also, browse some of the entries themselves.
  – NCBI tutorial, Entrez: Making use of its power.