©CMBI 2008 Step 2: Bioinformatics databases & sequence retrieval. Content of lecture: I. Introduction, II. Bioinformatics data & databases, III. Sequence Retrieval.

Presentation transcript:

©CMBI 2008
Step 2: Bioinformatics databases & sequence retrieval
Content of lecture:
I. Introduction
II. Bioinformatics data & databases
III. Sequence Retrieval with MRS
Celia van Gelder, CMBI, UMC Radboud, Nov 2008

I. Bioinformatics questions
Lookup
- Is the gene known for my protein (or vice versa)?
- What sequence patterns are present in my protein?
- To what class or family does my protein belong?
Compare
- Are there sequences in the database which resemble the protein I cloned?
- How can I optimally align the members of this protein family?
Predict
- Can I predict the active site residues of this enzyme?
- Can I predict a (better) drug for this target?
- How can I predict the genes located on this genome?

Sequence similarity
MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE
WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK
VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW
GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG
DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG
Imagine you sequenced this human protein. You know it is a serine protease. Which residues belong to the active site? Is its sequence similar to the mouse serine protease?

Alignment
MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE
MMISRPPPAL GGDQFSILIL LVLLTSTAPI SAATIRVSPD CGKPQQLNRI VGGEDSMDAQ
*::*.**** **. :. : *:**:*** :.** * *.* *********: ****** *::

WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK
WPWIVSILKN GSHHCAGSLL TNRWVVTAAH CFKSNMDKPS LFSVLLGAWK LGSPGPRSQK
******* ** *:******** *.***:**** ***.*::** *********: **.**.****

VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW
VGIAWVLPHP RYSWKEGTHA DIALVRLEHS IQFSERILPI CLPDSSVRLP PKTDCWIAGW
**:*** *** ******: * ********:* ******:*** ****:*::** *:*.***:**

GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG
GSIQDGVPLP HPQTLQKLKV PIIDSELCKS LYWRGAGQEA ITEGMLCAGY LEGERDACLG
********** ********** ******:*. ********. ***.****** **********

DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG
DSGGPLMCQV DDHWLLTGII SWGEGCAD-D RPGVYTSLLA HRSWVQRIVQ GVQLRG----
********** *. ***:*** *******: : ***** ** * *****::*** ******

=> Transfer of information
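
The idea that a good alignment justifies transferring information can be made concrete by counting identical columns. The following is a minimal sketch, not the method used to produce the slide: the function names are illustrative, it marks only exact identities (not the ":" and "." similarity classes), and it uses the first 60 columns of the two aligned rows shown above.

```python
def identity_line(seq_a: str, seq_b: str) -> str:
    """Return a marker line: '*' under identical residues, ' ' elsewhere."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "*" if a == b and a != "-" else " " for a, b in zip(seq_a, seq_b)
    )


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of identical residues over columns without a gap."""
    marks = identity_line(seq_a, seq_b)
    compared = sum(1 for a, b in zip(seq_a, seq_b) if a != "-" and b != "-")
    if compared == 0:
        return 0.0
    return 100.0 * marks.count("*") / compared


# First alignment block from the slide; the grouping spaces are removed.
row_1 = "MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE".replace(" ", "")
row_2 = "MMISRPPPAL GGDQFSILIL LVLLTSTAPI SAATIRVSPD CGKPQQLNRI VGGEDSMDAQ".replace(" ", "")

if __name__ == "__main__":
    print(row_1)
    print(row_2)
    print(identity_line(row_1, row_2))
    print(f"{percent_identity(row_1, row_2):.1f}% identical over the first block")
```

Gap columns are excluded from the denominator here; other conventions (for example dividing by the full alignment length) give different percentages, so the convention should always be stated.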

II. Bioinformatics data and databases
- mRNA expression profiles (DNA microarrays)
- Collision Induced Dissociation Spectra (tandem mass-spectrometry)

Database size: EMBL DNA database

Biological databases (1)
Primary databases contain biomolecular sequences or structures (experimental data!) and associated annotation information.
Sequences
- Nucleic acid sequences: EMBL, GenBank, DDBJ
- Protein sequences: SwissProt, TrEMBL, UniProt
Structures
- Protein structures: PDB
- Structures of small compounds: CSD
Genomes
- Human Genome Database: HGD
- Mouse Genome Database: MGD

Biological databases (2)
Secondary databases contain data derived from primary database(s).
- Patterns, motifs, domains: PROSITE, PFAM, PRINTS, INTERPRO
- Disease mutations: OMIM / MIM
- Pathways: KEGG

Databases
Data must be in a certain format for software to recognize it. Every database can have its own format, but some data elements are essential for every database:
1. Unique identifier, or accession code
2. Name of depositor
3. Literature references
4. Deposition date
5. The real data
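
The five essential elements can be pictured as the fields of a single record. The sketch below is only an illustration: the class name `DatabaseEntry`, its field names, and the `is_complete` check are assumptions made for this example and do not correspond to the schema of any real database.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DatabaseEntry:
    """Hypothetical record holding the five essential data elements."""
    accession: str                      # 1. unique identifier / accession code
    depositor: str                      # 2. name of depositor
    deposition_date: date               # 4. deposition date
    data: str                           # 5. the real data (e.g. a sequence)
    references: list[str] = field(default_factory=list)  # 3. literature

    def is_complete(self) -> bool:
        """True when none of the essential elements is empty."""
        return bool(
            self.accession and self.depositor and self.references and self.data
        )


if __name__ == "__main__":
    # Accession and date are taken from the 1CRN example in this lecture;
    # depositor and reference are placeholders, not real values.
    entry = DatabaseEntry(
        accession="1CRN",
        depositor="(depositor name)",
        deposition_date=date(1981, 4, 30),
        data="(coordinates or sequence)",
        references=["(literature reference)"],
    )
    print(entry.accession, entry.is_complete())
```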

Quality of data
- SwissProt: data is only entered by annotation experts.
- EMBL, PDB: “everybody” can submit data. No human intervention when submitted; some automatic checks.

SwissProt database
- Database of protein sequence entries (Oct 2008)
- Ca. 200 annotation experts worldwide
- Keyword-organised flat file
- Obligatory deposit of sequence in SwissProt before publication
- Presently, databases are being merged into UniProt.

Important records in SwissProt (1)
ID   HBA_HUMAN   Reviewed;   142 AA.
AC   P69905; P01922; Q3MIF5; Q96KF1; Q9NYR7;
DT   21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DT   23-JAN-2007, sequence version 2.
DT   23-SEP-2008, entry version 63.
DE   RecName: Full=Hemoglobin subunit alpha;
DE   AltName: Full=Hemoglobin alpha chain;
DE   AltName: Full=Alpha-globin;
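
Because every line of a keyword-organised flat file starts with a two-letter line code, such records are easy to read with a few lines of code. The following is a minimal sketch, assuming only that the code occupies the first two characters of each line; the function names are illustrative, and a real parser (for example the one in Biopython) handles many more line types and edge cases.

```python
from collections import defaultdict

SAMPLE = """\
ID   HBA_HUMAN   Reviewed;   142 AA.
AC   P69905; P01922; Q3MIF5; Q96KF1; Q9NYR7;
DT   21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DT   23-JAN-2007, sequence version 2.
DT   23-SEP-2008, entry version 63.
DE   RecName: Full=Hemoglobin subunit alpha;
DE   AltName: Full=Hemoglobin alpha chain;
DE   AltName: Full=Alpha-globin;
"""


def parse_flatfile(text: str) -> dict[str, list[str]]:
    """Group the content of each line under its two-letter line code."""
    records: dict[str, list[str]] = defaultdict(list)
    for line in text.splitlines():
        if len(line.strip()) < 2:
            continue  # skip blank or too-short lines
        code, content = line[:2], line[2:].strip()
        records[code].append(content)
    return dict(records)


def accession_numbers(records: dict[str, list[str]]) -> list[str]:
    """Split the AC lines on ';' into individual accession numbers."""
    joined = " ".join(records.get("AC", []))
    return [acc.strip() for acc in joined.split(";") if acc.strip()]


if __name__ == "__main__":
    parsed = parse_flatfile(SAMPLE)
    print("Entry name:", parsed["ID"][0].split()[0])
    print("Accessions:", accession_numbers(parsed))
    print("Number of DT lines:", len(parsed["DT"]))
```

The first accession number on the AC line is the primary one; the others are secondary accessions kept so that older references to the entry still resolve.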

Important records in SwissProt (2)
Cross references section: hyperlinks to all entries in other databases which are relevant for the protein sequence.
HBA_HUMAN

Important records in SwissProt (3)
Features section: post-translational modifications, signal peptides, binding sites, enzyme active sites, domains, disulfide bridges, local secondary structure, sequence conflicts between references, etc.

And finally, the amino acid sequence!

EMBL database
- Nucleotide database
- EMBL: 146 million sequence entries comprising 235 billion nucleotides (Oct 2008)
- EMBL records follow roughly the same scheme as SwissProt
- Obligatory deposit of sequence in EMBL before publication
- Most EMBL sequences are never seen by a human

Protein Data Bank (PDB)
- Databank for macromolecular structure data (3-dimensional coordinates)
- Started ca. 30 years ago (on punched cards!)
- Obligatory deposit of coordinates in the PDB before publication
- ~ entries (April 2008) (~2500 “unique” structures)
- A PDB file is a keyword-organised flat file (80 columns):
  1) human readable
  2) every line starts with a keyword (3-6 letters)
  3) platform independent

PDB important records (1)
PDB nomenclature: filename = accession number = PDB code. The filename is 4 positions (often 1 digit & 3 letters, e.g. 1CRN).
HEADER describes the molecule and gives the deposition date:
HEADER PLANT SEED PROTEIN 30-APR-81 1CRN
COMPND gives the name of the molecule:
COMPND CRAMBIN
SOURCE gives the organism:
SOURCE ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED
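
Since a PDB file is a fixed-column format, records are read by column position rather than by splitting on whitespace. The sketch below assumes the HEADER layout of the wwPDB format description (classification in columns 11-50, deposition date in columns 51-59, ID code in columns 63-66); the helper names are illustrative. The slide text above has lost its original column spacing, so the example first rebuilds a properly padded line.

```python
def make_header(classification: str, dep_date: str, id_code: str) -> str:
    """Build an 80-column HEADER line with fields at their fixed positions."""
    line = f"HEADER    {classification:<40}{dep_date:<9}   {id_code:<4}"
    return line.ljust(80)


def record_name(line: str) -> str:
    """The keyword of any PDB line lives in columns 1-6."""
    return line[:6].strip()


def parse_header(line: str) -> dict[str, str]:
    """Extract the HEADER fields by slicing fixed column ranges."""
    if record_name(line) != "HEADER":
        raise ValueError("not a HEADER record")
    return {
        "classification": line[10:50].strip(),  # columns 11-50
        "dep_date": line[50:59].strip(),        # columns 51-59
        "id_code": line[62:66].strip(),         # columns 63-66
    }


if __name__ == "__main__":
    header = make_header("PLANT SEED PROTEIN", "30-APR-81", "1CRN")
    print(repr(header))
    print(parse_header(header))
```

Slicing by column keeps multi-word fields such as "PLANT SEED PROTEIN" intact, which a whitespace split would break apart.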

PDB important records (2)
SEQRES gives the sequence of the protein. Be aware: 3D coordinates are not always present for all the amino acids in SEQRES!
SEQRES 1 46 THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE 1CRN 51
SEQRES 2 46 ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS 1CRN 52
SEQRES 3 46 ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR 1CRN 53
SEQRES 4 46 CYS PRO GLY ASP TYR ALA ASN 1CRN 54
SSBOND gives the disulfide bridges:
SSBOND 1 CYS 3 CYS 40
SSBOND 2 CYS 4 CYS 32
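
SEQRES lists residues with three-letter codes, while sequence databases use one-letter codes, so a conversion is often the first step when comparing a structure with a sequence entry. The following is a minimal sketch using the four SEQRES lines above; the function name is illustrative, and tokens that are not one of the 20 standard three-letter codes (serial numbers, the residue count, the trailing "1CRN 51" columns) are simply skipped.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

SEQRES_LINES = [
    "SEQRES 1 46 THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE 1CRN 51",
    "SEQRES 2 46 ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS 1CRN 52",
    "SEQRES 3 46 ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR 1CRN 53",
    "SEQRES 4 46 CYS PRO GLY ASP TYR ALA ASN 1CRN 54",
]


def seqres_to_sequence(lines: list[str]) -> str:
    """Concatenate the residues of all SEQRES lines as one-letter codes."""
    residues = []
    for line in lines:
        if not line.startswith("SEQRES"):
            continue  # ignore every other record type
        for token in line.split()[1:]:
            if token in THREE_TO_ONE:
                residues.append(THREE_TO_ONE[token])
    return "".join(residues)


if __name__ == "__main__":
    sequence = seqres_to_sequence(SEQRES_LINES)
    print(sequence)
    print(len(sequence), "residues")
```

The length of the result can be checked against the residue count given in each SEQRES line (46 here). Non-standard residues would be silently dropped by this sketch, which a production parser should report instead.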

PDB important records (3)
At the end of the PDB file comes the “real” data.
ATOM: one line for each atom with its unique name and its x, y, z coordinates.
ATOM 1 N THR CRN 70
ATOM 2 CA THR CRN 71
ATOM 3 C THR CRN 72
ATOM 4 O THR CRN 73
ATOM 5 CB THR CRN 74
ATOM 6 OG1 THR CRN 75
ATOM 7 CG2 THR CRN 76
ATOM 8 N THR CRN 77
ATOM 9 CA THR CRN 78
ATOM 10 C THR CRN 79
ATOM 11 O THR CRN 80
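
An ATOM line is also read by column position. The sketch below assumes the ATOM layout of the wwPDB format description (serial in columns 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z in columns 31-38, 39-46 and 47-54). The helper names are illustrative, and the coordinates in the example are made-up values, since the numbers are not present in the slide text above.

```python
def format_atom(serial: int, name: str, res_name: str, chain_id: str,
                res_seq: int, x: float, y: float, z: float) -> str:
    """Build an 80-column ATOM line with fields at their fixed positions."""
    # Short atom names conventionally start in column 14, not column 13.
    name_field = name if len(name) == 4 else f" {name:<3}"
    line = (
        f"ATOM  {serial:>5} {name_field} {res_name:>3} {chain_id:1}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}"
    )
    return line.ljust(80)


def parse_atom(line: str) -> dict:
    """Extract the main ATOM fields by slicing fixed column ranges."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain_id": line[21],             # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
    }


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    atom_line = format_atom(1, "N", "THR", "A", 1, 1.5, -2.25, 3.0)
    print(repr(atom_line))
    print(parse_atom(atom_line))
```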

Structure Visualization
Structures from PDB can be visualized with:
1. Yasara
2. SwissPDBViewer
3. Protein Explorer
4. Cn3D

Part III: Sequence Retrieval with MRS
Google = the best generic search and retrieval system
MRS = Maarten’s Retrieval System
MRS is the Google of the biological database world:
- Search engine (like Google)
- Input/query = word(s)
- Output = entry/entries from database
Google searches everywhere for everything; MRS searches in selected data environments.

MRS
MRS is mainly used for (but not restricted to) protein/nucleic acid and related databases:
- DNA and protein sequences
- Sequence-related information (e.g. alignments, protein domains, enzymes, metabolic pathways, structural information)
- Genomic information
- Hereditary information

MRS Search Steps
1. Select database(s) of choice
2. Formulate your query
3. Hit “Search”
4. The result is a “query set” or “hitlist”
5. Analyze the results

MRS home page

MRS Database Selection
You can choose between selecting all databases or just one of them. But think about your query first!

MRS Search options
Simply type your keywords in the keyword field and choose SEARCH. If you know the fields of the database you are searching in, you can specify your query further. But think about your query first!

MRS Results (1)

MRS Results (2)

MRS Options
MRS creates a result, also called a “query set” or “hitlist”. With the result you can do different things in MRS:
- View the hits
- Blast single hit sequences
- Clustal multiple hit sequences

MRS - View Hits

Combine in MRS
- AND: write AND or & (AND is implicit)
- OR: write OR or |
- NOT: write NOT or !
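
Combining query terms amounts to set operations on hitlists: AND is an intersection, OR a union, NOT a difference. The sketch below illustrates this with a toy keyword index; it is not how MRS is implemented, and the index contents, entry identifiers (E1 to E5) and function name are invented for the example.

```python
# Toy keyword index: each word maps to the set of entries containing it.
# The entry identifiers are hypothetical.
INDEX = {
    "protease": {"E1", "E2", "E3"},
    "serine": {"E2", "E3", "E4"},
    "mouse": {"E3", "E5"},
}


def hits(word: str) -> set[str]:
    """Return the hitlist for one keyword (empty if the word is unknown)."""
    return set(INDEX.get(word.lower(), set()))


if __name__ == "__main__":
    # serine AND protease: entries that contain both words
    both = hits("serine") & hits("protease")
    print(sorted(both))

    # serine AND protease NOT mouse: remove entries that mention mouse
    not_mouse = both - hits("mouse")
    print(sorted(not_mouse))

    # protease OR mouse: entries that contain at least one of the words
    either = hits("protease") | hits("mouse")
    print(sorted(either))
```

Because AND narrows a hitlist and OR widens it, thinking about the query first, as the slides advise, mostly means deciding which of the two is needed for each term.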

MRS - Options
- Home brings you back to the start page of MRS, the page from which you can do keyword searches.
- Blast brings you to the MRS page from which you can do Blast searches.
- Blast results brings you to the page where MRS stores your Blast results.
- Clustal brings you to the MRS page from which you can do Clustal alignments.
- Settings lets you choose your favourite display style.
- Databanks lists all databases that MRS can search in.
- DB:uniprot shows the currently selected database.
- Help provides some help.

Try it yourself with the exercises!
Ground rules for bioinformatics
- Don't always believe what programs tell you - they're often misleading & sometimes wrong!
- Don't always believe what databases tell you - they're often misleading & sometimes wrong!
- Don't always believe what lecturers tell you - they're sometimes wrong!
- Don't be a naive user: computers don’t do biology & bioinformatics, you do!
(free after Terri Attwood)