©CMBI 2008 Data and Databases Your questions: –Lookup –Compare –Predict.

Presentation transcript:

©CMBI 2008 Data and Databases Your questions: –Lookup –Compare –Predict

©CMBI 2008 Your questions
Lookup
–Is the gene known for my protein (or vice versa)?
–What sequence patterns are present in my protein?
–Are the mutations that cause this disease known?
–To what class or family does my protein belong?
Compare
–Are there protein sequences in the database that resemble the protein I cloned?
–How can I optimally align the members of this protein family?
–Are these two sequences similar?
Predict
–Can I predict the active site residues of this enzyme?
–Can I make a 3D model for my protein?
–Can I predict a (better) drug for this target?
–How can I improve the thermostability of this protein? (protein engineering)
–How can I predict the genes located on this genome?

©CMBI 2008 Sequence similarity
MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE
WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK
VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW
GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG
DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG
Imagine you sequenced this human protein. You know it is a serine protease. Which residues belong to the active site? Is its sequence similar to the mouse serine protease?

©CMBI 2008 Alignment
MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE
MMISRPPPAL GGDQFSILIL LVLLTSTAPI SAATIRVSPD CGKPQQLNRI VGGEDSMDAQ
*::*.**** **. :. : *:**:*** :.** * *.* *********: ****** *::
WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK
WPWIVSILKN GSHHCAGSLL TNRWVVTAAH CFKSNMDKPS LFSVLLGAWK LGSPGPRSQK
******* ** *:******** *.***:**** ***.*::** *********: **.**.****
VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW
VGIAWVLPHP RYSWKEGTHA DIALVRLEHS IQFSERILPI CLPDSSVRLP PKTDCWIAGW
**:*** *** ******: * ********:* ******:*** ****:*::** *:*.***:**
GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG
GSIQDGVPLP HPQTLQKLKV PIIDSELCKS LYWRGAGQEA ITEGMLCAGY LEGERDACLG
********** ********** ******:*. ********. ***.****** **********
DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG
DSGGPLMCQV DDHWLLTGII SWGEGCAD-D RPGVYTSLLA HRSWVQRIVQ GVQLRG----
********** *. ***:*** *******: : ***** ** * *****::*** ******
=> Transfer of information
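Transferring information from one sequence to another only makes sense when the aligned sequences are sufficiently similar. As a rough sketch of how that similarity could be quantified, the Python snippet below computes the fraction of identical residues over the aligned, non-gap columns. The function name `percent_identity` and the choice to skip gap columns are assumptions made for illustration; they are not part of the lecture.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Return the fraction of aligned, non-gap columns with identical residues."""
    # Spaces are only a display aid on the slide (blocks of 10), so drop them.
    a = seq_a.replace(" ", "")
    b = seq_b.replace(" ", "")
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns where neither sequence has a gap.
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return matches / len(columns)


human = "MVVSGAPPAL"  # first 10 residues of the human sequence on the slide
mouse = "MMISRPPPAL"  # first 10 residues of the mouse sequence on the slide
print(percent_identity(human, mouse))
```

Real alignment tools also score conservative substitutions (the `:` and `.` marks in the consensus line); this sketch counts exact matches only.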

©CMBI 2008 Data & Databases
–Data in databases: what the data looks like
–Programs (tools) to search these databases: how the data can be accessed

©CMBI 2008 Biological Databases
The number of databases
- DBCAT lists over 1200 databases (2006)
The size of databases
- Grows exponentially
- EMBL database: new entries are entered at 5 sec/seq!

©CMBI 2008 Database Size

©CMBI 2008 Primary and Secondary Databases
Primary databases: REAL EXPERIMENTAL DATA
Biomolecular sequences or structures and associated annotation information (organism, function, mutation linked to disease, functional/structural patterns, bibliographic etc.)
Secondary databases: DERIVED INFORMATION
Fruits of analyses of sequences in the primary sources (patterns, blocks, profiles etc., that represent the most conserved features of multiple alignments)

©CMBI 2008 Primary Databases
Sequence Information
-DNA: EMBL, GenBank, DDBJ
-Protein: SwissProt, TrEMBL, PIR
Genome Information
-EnsEMBL, TIGR rice genome, Celera, SNP databases
Structure Information
-PDB, NDB, CCDB/CSD

©CMBI 2008 Secondary Databases
Sequence-related Information
-ProSite, Enzyme, REBase
Genome-related Information
-OMIM, TransFac
Structure-related Information
-DSSP, HSSP, FSSP, PDBFinder
Pathway Information
-KEGG, Pathways

©CMBI 2008 Prosite example

©CMBI 2008 KEGG example

©CMBI 2008 Databases
Data must be in a certain format for software to recognize it. Every database can have its own format, but some data elements are essential for every database:
1. Unique identifier, or accession code
2. Name of depositor
3. Literature references
4. Deposition date
5. The real data
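To make the five essential elements concrete, here is a minimal Python sketch of a record type that holds them. The class name `DatabaseEntry` and its field names are hypothetical and chosen only for illustration. The example values are taken from the crambin entry shown on the SwissProt slides, where the depositor is only implicit (the first author of the first reference).

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class DatabaseEntry:
    """Hypothetical container for the five essential elements of a database entry."""
    accession: str                                         # 1. unique identifier / accession code
    depositor: str                                         # 2. name of depositor
    references: List[str] = field(default_factory=list)    # 3. literature references
    deposition_date: Optional[date] = None                 # 4. deposition date
    data: str = ""                                         # 5. the real data (here: a sequence)


entry = DatabaseEntry(
    accession="P01542",
    depositor="Teeter M.M.",  # implicit depositor: first author of the first reference
    references=["Primary structure of the plant protein crambin."],
    deposition_date=date(1986, 7, 21),
    # Sequence copied in blocks of 10 as on the slide; spaces are removed.
    data="TTCCPSIVAR SNFNVCRLPG TPEALCATYT GCIIIPGATC PGDYAN".replace(" ", ""),
)
print(entry.accession, entry.deposition_date, len(entry.data))
```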

©CMBI 2008 Quality of Data
SwissProt: data is only entered by annotation experts
EMBL, PDB: everybody can submit data; there is no human intervention on submission, only some automatic checks

©CMBI 2008 SwissProt database
Database of protein sequences that are:
- translated from DNA (from the EMBL database)
- adapted from the PIR collection
- extracted from the literature
- directly submitted by researchers
entries (May 2008)
Ca. 200 annotation experts worldwide
Keyword-organised flat file
Obligatory deposit of sequences in SwissProt before publication
Presently, databases are being merged into UniProt.

©CMBI 2008 SwissProt records
ID: identification line
ID ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
ID CRAM_CRAAB STANDARD; PRT; 46 AA.
Format of the ENTRY_NAME: NAME_SPECIES ( 12 characters)
For a number of organisms (16) the species part is a recognizable name: HUMAN, MOUSE, CHICK, BOVIN, YEAST, ECOLI, …
N.B. The ID can change, e.g. the serotonin receptors have been given a new nomenclature.
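Because the ID line has a fixed field order, it can be split into its parts with a few string operations. The sketch below handles the layout shown on this slide (`ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.`); newer database releases use a different ID layout, so treat this as an illustration only. The names `IdLine` and `parse_id_line` are hypothetical.

```python
from typing import NamedTuple


class IdLine(NamedTuple):
    entry_name: str
    protein: str        # NAME part of NAME_SPECIES
    species: str        # SPECIES part of NAME_SPECIES
    data_class: str
    molecule_type: str
    length: int         # sequence length in amino acids


def parse_id_line(line: str) -> IdLine:
    """Parse an ID line in the layout shown on the slide."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line")
    # Drop the two-letter line code and the trailing full stop, then split on ';'.
    body = line[2:].strip().rstrip(".")
    first, molecule_type, length_field = [part.strip() for part in body.split(";")]
    entry_name, data_class = first.split()
    protein, species = entry_name.split("_", 1)
    length = int(length_field.split()[0])  # "46 AA" -> 46
    return IdLine(entry_name, protein, species, data_class, molecule_type, length)


record = parse_id_line("ID CRAM_CRAAB STANDARD; PRT; 46 AA.")
print(record)
```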

©CMBI 2008 SwissProt records
AC: accession number
AC P01542;
The AC is unique: name, sequence, everything can change, but the AC stays the same.
DT: deposition date
DT 21-JUL-1986 (Rel. 01, Created)
DT 30-MAY-2000 (Rel. 39, Last sequence update)
DT 30-MAY-2000 (Rel. 39, Last annotation update)
1) You cannot see what the last annotation update was.
2) There is no depositor record (implicit: the author of the first reference).

©CMBI 2008 SwissProt records
DE: description (general descriptive information)
DE CRAMBIN.
DE 6-phosphofructo-2-kinase 1 (EC ) (Phosphofructokinase 2 I)
GN: gene name
GN THI2.
OS, OC and OG: organism species, organism classification, organelle
OS Crambe abyssinica (Abyssinian crambe).
OC Eukaryota; Viridiplantae; Embryophyta; Tracheophyta;
OC Magnoliophyta; eudicotyledons; Rosidae; Brassicales;
OC Brassicaceae; Crambe.

©CMBI 2008 SwissProt records
RN: references
RN [1]
RP SEQUENCE.
RX MEDLINE;
RA Teeter M.M., Mazer J.A., L'Italien J.J.;
RT "Primary structure of the plant protein crambin.";
RL Biochemistry 20: (1981).
CC: comments or notes
CC -!- FUNCTION: THE FUNCTION OF THIS HYDROPHOBIC PLANT SEED PROTEIN
CC IS NOT KNOWN.
CC -!- MISCELLANEOUS: TWO ISOFORMS EXISTS (MAJOR ONE SHOWN HERE)
CC AND A MINOR FORM SI.
CC -!- SIMILARITY: BELONGS TO THE PLANT THIONIN FAMILY.

©CMBI 2008 SwissProt records
DR: database cross-reference
DR PIR; A01805; KECX.
DR PDB; 1AB1;
DR PDB; 1CBN;
(…)
DR PDB; 1JXY; X-ray; A=1-46.
DR InterPro; IPR001010; Thionin.
DR Pfam; PF00321; Thionin; 1.
DR PRINTS; PR00287; THIONIN.
DR PROSITE; PS00271; THIONIN; 1.
KW: keyword (not standardized; under control of the depositor)
KW Thionin; 3D-structure.

©CMBI 2008 SwissProt records
FT: feature table data
FT DISULFID 3 40
FT DISULFID 4 32
FT DISULFID
FT VARIANT P -> S (IN ISOFORM SI).
FT VARIANT L -> I (IN ISOFORM SI).
FT STRAND 2 3
FT HELIX 7 16
FT TURN
FT HELIX
FT TURN
FT STRAND
FT TURN 42 43

©CMBI 2008 Feature table (continued)
Other features: post-translational modifications, binding sites, enzyme active sites, local secondary structure or other characteristics reported in the cited references. Sequence conflicts between references are also included.
FT CONFLICT MISSING (IN REF. 2).
FT MUTAGEN G->R,L,M: DNA BINDING LOST.
FT MOD_RES PHOSPHORYLATION (BY PKC).
FT LIPID 1 1 MYRISTATE.
FT CARBOHYD GLUCOSYLGALACTOSE.
FT METAL COPPER (POTENTIAL).
FT BINDING HEME (COVALENT).
FT PROPEP ACTIVATION PEPTIDE.
FT DOMAIN EXTRACELLULAR (POTENTIAL).
FT ACT_SITE ACCEPTS A PROTON DURING CATALYSIS.

©CMBI 2008 SwissProt records
SQ: sequence header
SQ SEQUENCE 46 AA; 4736 MW; 919E68AF159EF722 CRC64;
Sequence data
TTCCPSIVAR SNFNVCRLPG TPEALCATYT GCIIIPGATC PGDYAN
Termination line
//
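Since every SwissProt line starts with a two-letter line code and the entry ends with `//`, a whole entry can be read with a very small parser. The following Python sketch groups lines by their code and collects the sequence that follows the SQ header. The function name `parse_flatfile_entry` is hypothetical, and the example entry is abbreviated to a few of the line types shown on the slides.

```python
from collections import defaultdict


def parse_flatfile_entry(text: str) -> dict:
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    sequence_parts = []
    in_sequence = False
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line.startswith("//"):      # the termination line ends the entry
            break
        if in_sequence:                # sequence data lines carry no line code
            sequence_parts.append(line.replace(" ", ""))
            continue
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
        if code == "SQ":               # everything after the SQ header is sequence
            in_sequence = True
    result = dict(fields)
    result["sequence"] = "".join(sequence_parts)
    return result


example = """\
ID CRAM_CRAAB STANDARD; PRT; 46 AA.
AC P01542;
DE CRAMBIN.
SQ SEQUENCE 46 AA; 4736 MW; 919E68AF159EF722 CRC64;
 TTCCPSIVAR SNFNVCRLPG TPEALCATYT GCIIIPGATC PGDYAN
//
"""
entry = parse_flatfile_entry(example)
print(entry["AC"], len(entry["sequence"]))
```

A useful sanity check is to compare the length stated in the SQ header with the length of the sequence that was actually read.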

©CMBI 2008 SwissProt entry in MRS


©CMBI 2008 EMBL database
Nucleotide database
–EMBL: 114 million sequence entries comprising 215 billion nucleotides (March 2008)
–Of which EMEST: 50 million sequence entries comprising 27 billion nucleotides (March 2008)
–EMBL records follow roughly the same scheme as SwissProt
–Obligatory deposit of sequence in EMBL before publication
–Most EMBL sequences are never seen by a human

©CMBI 2008 Protein Data Bank (PDB)
Databank for macromolecular structure data (3-dimensional coordinates).
Started ca. 30 years ago (on punched cards!)
Obligatory deposit of coordinates in the PDB before publication
~ entries (Aug 2007) (~2500 “unique” structures)
A PDB file is a keyword-organised flat file (80 columns):
1) human readable
2) every line starts with a keyword (3-6 letters)
3) platform independent

©CMBI 2008 PDB records
Filename = accession number = PDB code
The filename has 4 positions (often 1 digit & 3 letters, e.g. 1CRN)
HEADER: describes the molecule and gives the deposition date
HEADER PLANT SEED PROTEIN 30-APR-81 1CRN
COMPND: name of the molecule
COMPND CRAMBIN
SOURCE: organism
SOURCE ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED

©CMBI 2008 PDB records AUTHOR AUTHOR W.A.HENDRICKSON,M.M.TEETER 1CRN 6 Revision date REVDAT 5 16-APR-87 1CRND 1 HEADER 1CRND 2 REVDAT 4 04-MAR-85 1CRNC 1 REMARK 1CRNC 1 REVDAT 3 30-SEP-83 1CRNB 1 REVDAT 1CRNB 1 REVDAT 2 03-DEC-81 1CRNA 1 SHEET 1CRNB 2 REVDAT 1 28-JUL-81 1CRN 0 REMARK There are very many different REMARK records & subrecords! Not standardized. REMARK 1 REFERENCE 3 1CRNC 10 REMARK 1 AUTH M.M.TEETER,W.A.HENDRICKSON 1CRN 16 REMARK 1 TITL HIGHLY ORDERED CRYSTALS OF THE PLANT SEED PROTEIN 1CRN 17 REMARK 1 TITL 2 CRAMBIN 1CRN 18 REMARK 1 REF J.MOL.BIOL. V CRN 19 REMARK 1 REFN ASTM JMOBAK UK ISSN CRN 20 REMARK 2 1CRN 21 REMARK 2 RESOLUTION. 1.5 ANGSTROMS. 1CRN 22

©CMBI 2008 PDB records
SEQRES: sequence of the protein. Be aware: 3D coordinates are not always present for all the amino acids in SEQRES!
SEQRES 1 46 THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE 1CRN 51
SEQRES 2 46 ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS 1CRN 52
SEQRES 3 46 ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR 1CRN 53
SEQRES 4 46 CYS PRO GLY ASP TYR ALA ASN 1CRN 54
HET & FORMUL
HET NAD A 1 44 NAD CO-ENZYME 4MDH 219
HET SUL A 2 5 SULFATE 4MDH 220
HET NAD B 1 44 NAD CO-ENZYME 4MDH 221
HET SUL B 2 5 SULFATE 4MDH 222
FORMUL 3 NAD 2(C21 H28 N7 O14 P2) 4MDH 223
FORMUL 4 SUL 2(O4 S1) 4MDH 224
FORMUL 5 HOH *471(H2 O1) 4MDH 225
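SEQRES lists residues with three-letter codes, whereas sequence databases such as SwissProt use one-letter codes. A small lookup table is enough to convert between the two, which makes it easy to compare a PDB sequence with a database sequence. The sketch below covers only the 20 standard amino acids; the function name `seqres_to_one_letter` and the use of `X` for anything else are assumptions for illustration.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_one_letter(residues, unknown="X"):
    """Convert a list of three-letter residue names to a one-letter string."""
    return "".join(THREE_TO_ONE.get(name.upper(), unknown) for name in residues)


# Residue names from the first SEQRES record of 1CRN on the slide.
first_row = "THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE".split()
print(seqres_to_one_letter(first_row))
```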

©CMBI 2008 PDB records HELIX/SHEET/TURN Secondary structure elements provided by the crystallographer (subjective). HELIX 1 H1 ILE 7 PRO /10 CONFORMATION RES 17,19 SHEET 2 S1 2 CYS 32 ILE TURN 1 T1 PRO 41 TYR 44 SSBOND disulfide bridges SSBOND 1 CYS 3 CYS 40 SSBOND 2 CYS 4 CYS 32 CRYST1, ORIG, SCALE crystallographic parameters CRYST P CRN 63 SCALE CRN 67 SCALE CRN 68 SCALE CRN 69

©CMBI 2008 PDB records
ATOM: one line per atom, with a unique name and x, y, z coordinates
ATOM 1 N THR CRN 70
ATOM 2 CA THR CRN 71
ATOM 3 C THR CRN 72
ATOM 4 O THR CRN 73
ATOM 5 CB THR CRN 74
ATOM 6 OG1 THR CRN 75
ATOM 7 CG2 THR CRN 76
ATOM 8 N THR CRN 77
ATOM 9 CA THR CRN 78
ATOM 10 C THR CRN 79
ATOM 11 O THR CRN 80
TER: the TER record terminates the amino acid chain
ATOM 325 OD1 ASN CRN 394
ATOM 326 ND2 ASN CRN 395
ATOM 327 OXT ASN CRN 396
TER 328 ASN 46 1CRN 397
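PDB records are column-based: each field sits at a fixed position in the 80-column line, so ATOM records are read by slicing rather than by splitting on whitespace. The sketch below uses the column positions of the standard PDB format (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54). The helper `format_atom_line` and the coordinates in the example are made up for illustration; they are not the real 1CRN values, which were lost in this transcript.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Read one ATOM/HETATM record using fixed column positions."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21].strip(),     # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


def format_atom_line(serial, name, res_name, chain_id, res_seq, x, y, z):
    """Hypothetical helper: build a column-aligned ATOM line for testing.

    Atom-name justification is simplified compared with real PDB files.
    """
    return (f"{'ATOM':<6}{serial:>5} {name:^4} {res_name:>3} {chain_id:1}"
            f"{res_seq:>4}    {x:8.3f}{y:8.3f}{z:8.3f}")


# Made-up coordinates, used only to exercise the parser.
line = format_atom_line(1, "N", "THR", "A", 1, 1.0, 2.5, -3.25)
atom = parse_atom_line(line)
print(atom)
```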

©CMBI 2008 PDB records
HETATM: atomic coordinates for atoms within "non-standard" groups (cofactors, ions, …) and for water molecules
HETATM 5158 AP NAD B MDH5495
HETATM 5159 AO1 NAD B MDH5496
HETATM 5160 AO2 NAD B MDH5497
HETATM 5207 O HOH MDH5544
HETATM 5208 O HOH MDH5545
HETATM 5209 O HOH MDH5546
CONECT: connection records (not obligatory) indicate which atoms are connected (mainly for HETATM)