Informatics for Molecular Biologists Ansuman Chattopadhyay,PhD Head, Molecular Biology Information Service Falk Library, Health Sciences Library System.


Informatics for Molecular Biologists Ansuman Chattopadhyay,PhD Head, Molecular Biology Information Service Falk Library, Health Sciences Library System University of Pittsburgh

Molecular Biology Information Service Falk Library of Health Sciences Health Sciences Library System University of Pittsburgh 200 Scaife Hall Desoto and Terrace Streets Pittsburgh, PA 15261

Topics Searching tools –Internet –PubMed NCBI developed bioinformatics tools –Entrez Gene Structure visualization tools –Cn3D Genome Browsers –UCSC genome browsers –NCBI Map viewer

Information search space Biomedical literature databases Molecular databases Organism whole genome sequences

Literature database NCBI PubMed –contains over 15 million citations dating back to the mid-1950s Search: “apoptosis”: 130,476 “breast cancer”: 160,055 “p53”: 42,418

Molecular databases

Organisms whole genome sequences

Internet for Biologists Google vs Clusty –Google: search results returned as a single ranked list –Clusty: search results categorized into topical clusters. Vivísimo's clustering technology creates topical categories on the fly from the search results, using terms in the title, snippet, and any other available textual description in the search results themselves

Google vs Clusty Search example: Pittsburgh –Google –Clusty

Clusters help you see your search results by topic, so you can zero in on exactly what you’re looking for or discover unexpected relationships between items.

Search examples for Clusty SNP BLAST Lupus

Web 2.0 Website bookmark and tagging tool –Del.icio.us a social bookmarking web service for storing, sharing, and discovering web bookmarks.

Web 2.0 Connotea

Medline searching tool PubMed vs ClusterMed Search examples: macular degeneration, cell cycle, p53

Molecular databases DNA Sequence Databases and Analysis Tools Enzymes and Pathways Gene Mutations, Genetic Variations and Diseases Genomics Databases and Analysis Tools Immunological Databases and Tools Microarray, SAGE, and other Gene Expression Organelle Databases Other Databases and Tools (Literature Mining, Lab Protocols, Medical Topics, and others) Plant Databases Protein Sequence Databases and Analysis Tools Proteomics Resources RNA Databases and Analysis Tools Structure Databases and Analysis Tools

HSLS OBRC

Types of databases –By level of curation: Archival –GenBank, GenPept, ssSNP Curated –RefSeq, SwissProt, RefSNP

Types of databases –Archival data repository of information redundant; might have many sequence records for the same gene, each from a different lab submitters maintain editorial control over their records: what goes in is what comes out no controlled vocabulary variation in annotation of biological features Example: GenBank record

GenBank archival database of nucleotide sequences from >130,000 organisms records annotated with coding region (CDS) features also include amino acid translations each record represents the work of a single lab redundant; can have many sequence records for a single gene

International Nucleotide Sequence Database Collaboration

Types of databases

RefSeq Curated data –non-redundant; one record for each gene, or each splice variant –each record is intended to present an encapsulation of the current understanding of a gene or protein, similar to a review article –records contain value-added information that has been added by one or more experts

RefSeq Database of reference sequences Curated Non-redundant; one record for each gene, or each splice variant, from each organism represented A representative GenBank record is used as the source for a RefSeq record Value-added information is added by one or more experts Each record is intended to present an encapsulation of the current understanding of a gene or protein, similar to a review article Variety of accession number prefixes (NM_, NP_, etc.) and status codes (provisional, reviewed, etc.). More about those in later slides. The RefSeq database includes genomic DNA, mRNA, and protein sequences, so it organizes information according to the model of the central dogma of biology
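The accession prefixes mentioned above can be read off programmatically. The sketch below is illustrative only: the slide names just NM_ and NP_, so the other prefixes in the table, and the helper name `refseq_molecule_type`, are additions made for this example rather than part of the original deck.

```python
# Commonly used RefSeq accession prefixes and the molecule type they denote.
# Only NM_ and NP_ appear on the slide; the rest are added for illustration.
REFSEQ_PREFIXES = {
    "NC_": "genomic (complete molecule, e.g. a chromosome)",
    "NG_": "genomic region",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "mRNA (predicted model)",
    "XP_": "protein (predicted model)",
}


def refseq_molecule_type(accession):
    """Return the molecule type for a RefSeq accession, or None if the
    prefix is not a RefSeq prefix listed in the table above."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


# NP_247556 is the accession used later in the BLAST exercise.
print(refseq_molecule_type("NP_247556"))
```

An accession without an underscore-style prefix, such as a typical GenBank accession, returns None, which is a quick way to tell archival from curated records in a mixed list.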

RefSeq

Searching GenBank Find messenger RNA sequence for Human epidermal growth factor (EGF) gene.

Databases developers NCBI EBI

Neighbors and Hard Links Genomes Taxonomy PubMed abstracts Nucleotide sequences Protein sequences 3-D Structure Word weight VAST BLAST Phylogeny Source NCBI

NCBI Tools

Entrez Gene NCBI’s database for gene-centric information focuses on genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense sequence analysis –Total Taxa: 4246; Total Genes: 284, ,000 organisms in the nucleotide sequence database (GenBank)

Entrez gene each record represents a single gene from a given organism Gene record includes: –a unique identifier or GeneID assigned by NCBI –a preferred symbol –and any one or more of: –sequence information –map information –official nomenclature from an authority list –alternate gene symbols –summary of gene/protein function –published references that provide additional information on function –expression –homology data –and more

SNP Genomic Sequence Exon-Intron Structure Expression Profile Interacting Partners 3D Structure mRNA Sequence Chromosomal Localization Disease Amino acid Sequence Homologous Sequences Gene / Protein

Searching Entrez Gene

Entrez gene Find: gene symbols and aliases sequences: genomic, mRNA, protein intron-exon architecture genomic context: neighboring and antisense genes Interacting partners associated gene ontology terms: function, cellular component and biological process

Entrez Gene record Query: BRCA1 Search Tips: –Query text box: BRCA1 –Limits: To limit your search to a specific field, select “Gene name” from the drop-down menu. Limit by taxonomy: select “Homo sapiens” Name and aliases Chromosomal location
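The same field-limited search can be written as a query string. This is a minimal sketch, assuming the `[Gene Name]` and `[Organism]` field tags and the public E-utilities ESearch endpoint; the slide itself only describes the web form, and the helper names `gene_query` and `esearch_url` are hypothetical. The code only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# E-utilities ESearch endpoint (an assumption: not mentioned on the slide).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_query(symbol, organism):
    """Build an Entrez query limited to the gene-name and organism fields."""
    return f"{symbol}[Gene Name] AND {organism}[Organism]"


def esearch_url(term, db="gene"):
    """Return an ESearch URL for the given query term and database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


term = gene_query("BRCA1", "Homo sapiens")
url = esearch_url(term)
print(term)
print(url)
```

Pasting the printed term into the Entrez Gene search box should be equivalent to setting the two limits by hand.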

Source: NCBI

Entrez Gene: sequences and genomic context Sequences: mRNA, Genomic, Protein mRNA Seq Protein Seq Genomic Seq

Transcription and alternative splicing Alternative splicing:

Entrez Gene: intron-exon architectures Tips: Change Display to “Gene Table” from “Summary”

Genomic Seq mRNA Seq Protein Seq

Gene Ontology –Controlled vocabulary tagging Function Biological Processes Cellular Component

Entrez Gene : Gene Ontology

Homologous sequences

Entrez Gene: Homologous sequence Tips: change Display settings from “Summary” to “Alignment score” to “Multiple Alignment”

Single nucleotide polymorphisms Single nucleotide polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered. For example, a SNP might change the DNA sequence AAGGCTAA to ATGGCTAA
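The example on this slide can be checked with a few lines of code. The sketch below compares two equal-length sequences position by position and reports each substitution; the function name `snp_positions` is made up for illustration, and the approach assumes the sequences are already aligned with no insertions or deletions.

```python
def snp_positions(reference, variant):
    """Return (position, ref_base, variant_base) for every mismatch.

    Positions are 1-based. Both sequences must have the same length,
    because this simple comparison does not handle indels.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must be the same length")
    return [
        (i + 1, r, v)
        for i, (r, v) in enumerate(zip(reference, variant))
        if r != v
    ]


# The slide's example: a single A-to-T change at the second base.
print(snp_positions("AAGGCTAA", "ATGGCTAA"))  # [(2, 'A', 'T')]
```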

SNPs

Coding SNPs

Entrez Gene: SNPs

Protein Info: HPRD

Entrez Gene: Links

Entrez Gene: Linkout

Seq to Entrez gene: UCSC BLAT Query Seq: SGLTPEEFMLVYKFARKHHITLTNLITEE

BLAT to Entrez Gene CLICK

Find chromosomal location of your gene of interest. How many exons have been reported for your gene? What are its neighboring genes ? Query sequence: IHYNYMCNSSCMGGMNRRPILTII Hands-On Exercise Question

Exercise: Find the protein sequence for rat leptin. BLAT this sequence vs. the human genome to find the human homolog. Look for SNPs in the coding region of this gene—are there any?

Sequence alignment Pairwise alignment Multiple alignment

Pairwise alignment Global –Needleman-Wunsch (1970) Local –Smith-Waterman (1981) –Lipman and Pearson / FASTA (1985) –Basic Local Alignment Search Tool (BLAST: 1990)
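The difference between global and local alignment is easiest to see in the dynamic-programming recurrences. The sketch below computes only the optimal score (no traceback) for each method. The scoring scheme (match +1, mismatch -1, linear gap penalty -1) is an assumption chosen for illustration; real tools use substitution matrices and affine gap penalties.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch style score: both sequences aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]   # first row: leading gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                          # first column: leading gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman style score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, so an alignment can restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(global_score("AAGGCTAA", "ATGGCTAA"))
print(local_score("TTTGGCTAAAA", "CCGGCTACC"))
```

The only structural differences are the initialization (gap penalties versus zeros), the floor at zero, and where the answer is read from (the last cell versus the best cell anywhere in the table).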

BLAST To find homologous sequences for a sequence of interest by searching sequence databases: Nucleotide: TTGGATTATTTGGGGATAATAATGAAGATAGCAA TTATCTCAGGGAAAGGAGGAGTAGGAAAATCTTC TA TTTCAACATCCTTAGCTAAGCTGTTTTCAAAAG AGTTTAATATTGTAGCATTAGATTGTGATGTTGAT Protein: MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDER EIKKRDIFSLLLGVAGLNKSVEEFE NELKNKLTEEAKNKMENIK KELEDVGFKVKDIIVVGIPHEEIVKIAEDEGVDIIIMGSHGKTNLKEILLG

BLAST To Find statistically significant matches, based on sequence similarity, to a protein or nucleotide sequence of interest. Obtain information on inferred function of the gene or protein. Find conserved domains in your sequence of interest that are common to many sequences. Compare two known sequences for similarity.

What you can do with BLAST Find homologous sequences in all combinations (DNA/Protein) of query and database. –DNA vs DNA –DNA translation vs Protein –Protein vs Protein –Protein vs DNA translation –DNA translation vs DNA translation
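Each query/database combination listed above corresponds to one of the standard BLAST programs. The lookup below is a small sketch of that mapping; the program names (blastn, blastx, blastp, tblastn, tblastx) are the standard ones but do not appear on the slide, and the type labels and the function name `pick_blast_program` are invented for this example.

```python
# (query type, database type) -> BLAST program.
# "dna_translated" means the nucleotide sequence is translated in all six
# reading frames before comparison.
BLAST_PROGRAMS = {
    ("dna", "dna"): "blastn",
    ("dna_translated", "protein"): "blastx",
    ("protein", "protein"): "blastp",
    ("protein", "dna_translated"): "tblastn",
    ("dna_translated", "dna_translated"): "tblastx",
}


def pick_blast_program(query_type, database_type):
    """Return the BLAST program for a query/database combination."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"no BLAST program compares {query_type} against {database_type}"
        ) from None


# A protein query searched against a translated nucleotide database.
print(pick_blast_program("protein", "dna_translated"))
```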

BLAST exercise Find homologous sequences for uncharacterized archaebacterial protein, NP_247556, from Methanococcus jannaschii

BLAST search Descriptions of hits: sorted by E value; sequence description; link to Entrez. The cutoff on the number of hits displayed (100) overrides the E value cutoff (10)

BLAST search Orthologs from closely related species will have the highest scores and lowest E values –Often E = to Closely related homologs with highly conserved function and structure will have high scores –Often E = to Distantly related homologs may be hard to identify –Less than E = 10^-4

Protein domains Wikipedia SH2: Src homology 2 domains; signal transduction, involved in recognition of phosphorylated tyrosine (pTyr). SH2 domains typically bind pTyr-containing ligands via two surface pockets, a pTyr and hydrophobic binding pocket, allowing proteins with SH2 domains to localize to tyrosine phosphorylated sites.

Searching CDD CDD SEARCH Query sequence:

BLink BLink displays the graphical output of pre-computed blastp results against the protein non-redundant (nr) database. This graphical output includes: –Alignment of up to 200 BLAST hits on the query sequence –Best Hits to each organism –List of known protein domains in the query sequence –Filter hits by selecting the BLAST cutoff score –Distribution of hits by taxonomic grouping –Display of similar sequences with known 3D structure –Filter hits by database and/or by taxonomic grouping –Display a taxonomic tree of all organisms with similar sequences Access: link out from NCBI protein records Link to TP53 BLink:

Protein structure

Protein data bank (PDB) international database of 3-D biological macromolecular structures accepts direct submissions of structure data maintained by a nonprofit organization, the Research Collaboratory for Structural Bioinformatics (RCSB), associated with Rutgers University, San Diego Supercomputer Center, and the Biotechnology Division of the National Institute of Standards and Technology contains molecular structures of proteins and nucleic acids, primarily structures experimentally-derived by X-Ray crystallography and NMR also includes some theoretical models, though they are not encouraged.

3D structure viewing software NCBI Cn3D FirstGlance in Jmol: a simple tool for macromolecular visualization. The Cn3D home page includes a link in the blue sidebar for instructions on installing Cn3D, which is available for PC, Mac, and Unix.

Cn3D View the 3-dimensional structure for 1TUP and practice using some of the Cn3D features that allow you to: –spin the structure using your mouse –use the control+left mouse button combination to zoom in and out of the structure –use the shift+left mouse button combination to move the structure across the viewing window –use the Style menu to render the structure in different ways (e.g., worms, space fill, ball and stick,...) –use the Style menu to color the structure in different ways (e.g., secondary structure, domain,...) –use Style/Edit Global Style to label every 20th amino acid

What is it? A genome browser is a computer program that helps to display gene maps, browse chromosomes, and align genes or gene models with ESTs, contigs, etc.

Genome Sequence Project Time Line 1976 : RNA Bacteriophage MS2 1995: Haemophilus influenzae 2003: Human genome reference sequence 2005: 265 genomes; 21 archaeal, 211 bacterial, 33 eukaryotic

Genome Browsers NCBI MAP Viewer EBI Ensembl UCSC Genome Browser