Structure databases, searches and alignments Marian Novotny Molecular Bioinformatics X3.



Outline
1. Structure databases
   - why do we need them?
   - types of structural databases
   - Protein Data Bank
   - other useful databases
2. Searches
   - text searches
3. Structure searches (alignments)
   - why?
   - how?
   - comparison of available tools

Structure databases: why?
- data tend to get lost
- source of information for further analysis
- better access to data for the general public
- validation of data is (sometimes) possible

A database is…
- "a structured collection of data held in computer storage; esp. one that incorporates software to make it accessible in a variety of ways; transf., any large collection of information." (Oxford English Dictionary)
- "a usually large collection of data organized especially for rapid search and retrieval (as by a computer)" (Merriam-Webster Online)

Databases
Types: primary databases, added-value databases, derived databases
Examples: RCSB, MSD, PDBJ, NDB, CSD, OCA, PDBSum, EDS, WhatCheck, Jena Image Library

ftp archive of flat files

Primary databases
- repositories of experimental data on macromolecular structures (X-ray, NMR, electron microscopy…)
- RCSB (USA), MSD (Europe) and PDBJ (Japan) collaborate to form the wwPDB. Data can be submitted to any of these databases. The databases exchange their new data on a regular basis, so their content is identical.
- the primary databases differ in how they present the data and in the extra services and links they provide

The Protein Data Bank (PDB)
- established in 1971 by Walter Hamilton at Brookhaven National Laboratory
- seven structures were deposited at the beginning
- the database was distributed on magnetic tapes
- RCSB is now run by a consortium of three institutions (San Diego Supercomputer Centre, Rutgers University and the Centre for Advanced Research in Biotechnology)
- structures ( )
- distributed over the internet
- released once a week

PDB FILE
HEADER HYDROLASE 27-OCT-03 1UR9
TITLE INTERACTIONS OF A FAMILY 18 CHITINASE WITH THE DESIGNED
TITLE 2 INHIBITOR HM508, AND ITS DEGRADATION PRODUCT,
TITLE 3 CHITOBIONO-DELTA-LACTONE
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: CHITINASE B;
COMPND 3 CHAIN: A, B;
COMPND 4 EC: ;
COMPND 5 ENGINEERED: YES;
COMPND 6 MUTATION: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: SERRATIA MARCESCENS;
SOURCE 3 STRAIN: BJL200;
SOURCE 4 EXPRESSION_SYSTEM: ESCHERICHIA COLI;
SOURCE 5 EXPRESSION_SYSTEM_STRAIN: DH5 ALPHA;
SOURCE 6 OTHER_DETAILS: CLONED GENE
KEYWDS CHITINASE, INHIBITION, LACTONE, CHITIN DEGRADATION,
KEYWDS 2 HYDROLASE, GLYCOSIDASE
EXPDTA X-RAY DIFFRACTION
AUTHOR G.VAAJE-KOLSTAD,A.VASELLA,M.G.PETER,C.NETTER,D.R.HOUSTON,
AUTHOR 2 B.WESTERENG,B.SYNSTAD,V.G.H.EIJSINK,D.M.F.VAN AALTEN
REVDAT 1 27-APR-04 1UR9 0
JRNL AUTH G.VAAJE-KOLSTAD,A.VASELLA,M.G.PETER,C.NETTER,
JRNL AUTH 2 D.R.HOUSTON,B.WESTERENG,B.SYNSTAD,V.G.H.EIJSINK
JRNL AUTH 2 D.M.F.VAN AALTEN
JRNL TITL INTERACTIONS OF A FAMILY 18 CHITINASE WITH THE
JRNL TITL 2 DESIGNED INHIBITOR HM508 AND ITS DEGRADATION
JRNL TITL 3 PRODUCT, CHITOBIONO-DELTA-LACTONE
JRNL REF J.BIOL.CHEM. V
JRNL REFN ASTM JBCHA3 US ISSN
REMARK 1
REMARK 1 REFERENCE 1
REMARK 1 AUTH D.M.F.VAN AALTEN,D.KOMANDER,B.SYNSTAD,S.GASEIDNES,
REMARK 1 AUTH 2 M.G.PETER,V.G.H.EIJSINK
REMARK 1 TITL STRUCTURAL INSIGHTS INTO THE CATALYTIC MECHANSIM OF
REMARK 1 TITL 2 A FAMILY 18 EXOCHITINASE
REMARK 1 REF PROC.NAT.ACAD.SCI.USA V
REMARK 1 REFN ASTM PNASA6 US ISSN
REMARK 1 REFERENCE 2
REMARK 1 AUTH D.M.F.VAN AALTEN,B.SYNSTAD,M.B.BRURBERG,E.HOUGH,
REMARK 1 AUTH 2 B.RIISE,V.G.H.EIJSINK,R.K.WIERENGA
REMARK 1 TITL STRUCTURE OF A TWO-DOMAIN CHITOTRIOSIDASE FROM

PDB file format
ATOM 340 N PHE A N
ATOM 341 CA PHE A C
ATOM 342 C PHE A C
ATOM 343 O PHE A O
ATOM 344 CB PHE A C
ATOM 345 CG PHE A C
ATOM 346 CD1 PHE A C
ATOM 347 CD2 PHE A C
ATOM 348 CE1 PHE A C
ATOM 349 CE2 PHE A C
ATOM 350 CZ PHE A C
Fields of an ATOM record: atom number, atom identifier, residue type, chain, residue number, X,Y,Z coordinates, occupancy, temperature factor, atom type


PDB files - problems
- the PDB format uses fixed-width fields, so one entry is limited to 99,999 atom records and the chain identifier is limited to a single character (not enough for structures of huge complexes, e.g. ribosomes and viruses)
  ATOM 340 N PHE A N
- parsing of PDB files is difficult: apart from the ATOM records the file is almost unstructured (e.g. there are no rules for describing structure determination in REMARK records)
- the mmCIF and XML formats deal with these issues

Trust PDB? The database centres can't refuse to accept any data, even if the curators of the PDB know that the data contain serious errors. So the PDB does contain a lot of errors, from sequence consistency errors (you'll deal with them) to completely wrong folds. And even the best data are still only models that best fit the experimental data. Never trust the PDB!

Trp D 67 in 7GPB: do you find this Trp normal?

Validation of structure files
- check statistics for bond lengths, angles, Ramachandran plots…: do the statistics look similar to those of other proteins? (WhatCheck, Procheck)
- how well does the model fit the experimental data? (EDS)
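
A minimal Python sketch of the first kind of check, comparing backbone bond lengths with reference statistics. The reference means and standard deviations are approximate Engh and Huber style values, and the cutoff of 4 standard deviations, the function names and the coordinates are all assumptions made for illustration; this is not how WhatCheck or Procheck are implemented.

```python
import math

# Approximate reference bond lengths in angstroms: (mean, standard deviation).
# Assumed values for illustration only.
IDEAL_BONDS = {
    ("N", "CA"): (1.458, 0.019),
    ("CA", "C"): (1.525, 0.021),
    ("C", "O"): (1.231, 0.020),
}


def bond_z_scores(atoms, ideal=IDEAL_BONDS):
    """Return {bond: z-score} for every reference bond present in one residue.

    `atoms` maps atom identifiers (e.g. "CA") to (x, y, z) coordinates.
    """
    scores = {}
    for (first, second), (mean, sd) in ideal.items():
        if first in atoms and second in atoms:
            length = math.dist(atoms[first], atoms[second])
            scores[(first, second)] = (length - mean) / sd
    return scores


def flag_outliers(scores, cutoff=4.0):
    """List the bonds whose length deviates by more than `cutoff` standard deviations."""
    return [bond for bond, z in scores.items() if abs(z) > cutoff]


# Made-up residue: the N-CA bond is ideal, the CA-C bond is far too long, O is missing.
residue = {"N": (0.0, 0.0, 0.0), "CA": (1.458, 0.0, 0.0), "C": (1.458, 1.9, 0.0)}
scores = bond_z_scores(residue)
outliers = flag_outliers(scores)
```

A check like this only shows that a model looks unusual compared with other proteins; whether it fits the experimental data is a separate question.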

Electron Density Server

PDBsum

PDBSum-Highlights

Text searches in structural databases
Options:
- PDB: SearchLite, SearchFields
- MSD: MSDlight, MSDpro (Java), MSDmine
- OCA
Find all the structures deposited by Gerard Kleywegt with resolution better than 2 Å and published in the Journal of Molecular Biology
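
The query above combines three fields: author, resolution and journal. The Python sketch below shows that kind of fielded filter over an in-memory list. It does not call any real search service; the `Entry` class, the `search` function and every record (identifiers, author names, values) are invented placeholders.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    pdb_id: str
    authors: tuple
    resolution: float  # in angstroms; a lower number means better resolution
    journal: str


def search(entries, author=None, max_resolution=None, journal=None):
    """Keep the entries that satisfy every criterion that was supplied."""
    hits = []
    for entry in entries:
        if author is not None and not any(
            author.lower() in name.lower() for name in entry.authors
        ):
            continue
        if max_resolution is not None and not entry.resolution < max_resolution:
            continue
        if journal is not None and entry.journal.lower() != journal.lower():
            continue
        hits.append(entry)
    return hits


# Placeholder records, not real PDB entries.
ENTRIES = [
    Entry("0AAA", ("A.AUTHOR", "B.OTHER"), 1.8, "J.Mol.Biol."),
    Entry("0AAB", ("A.AUTHOR",), 2.5, "J.Mol.Biol."),
    Entry("0AAC", ("A.AUTHOR",), 1.6, "Structure"),
    Entry("0AAD", ("C.SOMEONE",), 1.5, "J.Mol.Biol."),
]

hits = search(ENTRIES, author="author", max_resolution=2.0, journal="J.Mol.Biol.")
```

In this sketch the query from the slide would correspond to passing the depositor's surname as `author`, `max_resolution=2.0` and the journal abbreviation used in the records.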

Search Fields

Summary
- three major repositories of structural data: RCSB, MSD and PDBJ
- all three are part of the wwPDB
- structural data are deposited in PDB files; these have problems, hence the new formats mmCIF and XML
- validation tools are necessary: WhatCheck, EDS
- new services are being developed to analyze the whole database (MSD services)
- searches at various levels of depth/complexity: SearchLite, SearchFields
- added-value databases: OCA, PDBSum

Structural alignment

Why structural alignment?
We have sequence alignment (Clustal…):
KTHLCV
KSHA-V
It gives us an idea of the correspondence between the amino acids of two (or more) proteins. That enables us to infer information about the function and evolution of the protein, if the sequences are similar enough!
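
"Similar enough" is usually quantified as percent sequence identity. The Python sketch below computes it for the KTHLCV / KSHA-V pair above. Because conventions differ on the denominator, it returns two values: identity over all alignment columns and identity over columns without a gap. The function name is an assumption made for this example.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Return (identity over all columns, identity over gap-free columns) in percent."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    columns = list(zip(row_a, row_b))
    identical = sum(1 for a, b in columns if a == b and a != gap)
    gap_free = sum(1 for a, b in columns if a != gap and b != gap)
    over_all = 100 * identical / len(columns) if columns else 0.0
    over_gap_free = 100 * identical / gap_free if gap_free else 0.0
    return over_all, over_gap_free


over_all, over_gap_free = percent_identity("KTHLCV", "KSHA-V")
```

For this pair three columns are identical (K, H and V), which gives 50 % over all six columns and 60 % over the five gap-free columns.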

What is the twilight zone? Sequence alignment unambiguously distinguishes protein pairs of similar structure from pairs of non-similar structure only when the pairwise sequence identity is high. High sequence identity roughly means over 40 %. The signal gets blurred in the twilight zone of % sequence identity.

More of the twilight zone
- More than 90 % of sequence pairs with a sequence identity lower than 25 % have different structures.
- The significance of a sequence alignment is length dependent: the longer the alignment, the lower the identity required for it to be called significant. Nevertheless, the required identity converges to 25 % for alignments longer than 80 amino acids.
- The 'more similar than identical' rule can reduce the number of false positives.
- Using intermediate sequences to find links between more distant families can also reduce the number of false positives.
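
One commonly cited form of this length dependence is the HSSP curve of Sander and Schneider (1991), which levels off at 24.8 % for alignments of 80 residues or more. The source does not say which curve it has in mind, so using this one is an assumption, as are the function names below.

```python
def hssp_threshold(length):
    """Percent identity above which an alignment of `length` residues is
    taken to imply similar structure (Sander and Schneider curve, assumed here)."""
    if length <= 10:
        raise ValueError("the curve is not defined for alignments of 10 residues or fewer")
    if length >= 80:
        return 24.8
    return 290.15 * length ** -0.562


def is_significant(identity_percent, length):
    """True if the observed identity reaches the length-dependent threshold."""
    return identity_percent >= hssp_threshold(length)
```

Under this curve, 30 % identity over 100 aligned residues passes, while the same 30 % over only 20 residues does not, because short alignments need a much higher identity.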

How far can the sequence identity drop?
Average sequence identity of random alignments: %
Average sequence identity of remote homologues: 8.5 %

How does it work? From

Structural alignment because:
- Structures are better conserved than sequences.
- Structural alignment can imply a functional similarity that is not detectable from a sequence alignment.
- It might help to improve a sequence alignment when structures are available (phylogenetic studies, homology modeling).
- It will improve sequence alignment methods (use of substitution matrices and gap penalties derived from structural alignments).
- It will improve sequence prediction methods.

Structural versus sequence alignment

Sequence alignment:
1FWR_A MKNWKTSAESILTTGPVVPVIVVKKLEHAVPMAKA
2YPI_A ARTFFVGGNFKLNGSKQSIKEIVERLNTASIPENVEVVICPPATYLDYSVSLVKKPQVTV
       ::... : :. *.. :. *...
1FWR_A LVAGGVRVLEVTLRTECAVDAIRAIAKEVPEAIVGAGTVLNPQQLAEVTE AGA
2YPI_A GAQNAYLKASGAFTGENSVDQIKDVGAKWVILGHSERRSYFHEDDKFIADKTKFALGQGV
       .... :: * :** *: :. :. :: ::: *.
1FWR_A QFAISPGLTEPLLKAATEGTIPLIPGISTVSELMLGMDYGLKEFQFFPAEANGGVKALQA
2YPI_A GVILCIGETLEEKKAGKTLDVVERQLNAVLEEVKDWTNVVVAYEPVWAIGTGLAATPEDA
       . :. * * **.. : :.:.*: : :.:. :..... :*
1FWR_A IAGPFSQVRFCPKGGISPANYRDYLALKSVLCIGGSWLVPADALEAGDYDRITKLAREAV
2YPI_A QDIHASIRKFLASKLGDKAASELRILYGGSANGSNAVTFKDKADVDGFLVGGASLKPEFV
       * :*... *. :...:..* * :.* * *
1FWR_A EGAKL--
2YPI_A DIINSRN

Structural alignment:
Sequence 1 ART---FFVGGNFKLNG-SKQSI-KEIVERLNTASI--PENVEVVICP
.=ALI |=ID | | |.... |...
Sequence 2 MKNWKTSAESIL--TTGP--VVPVI--VVKKLEHAVP-MAKALVAG-GVR-----V-LEV
Sequence 1 PATYLDYSVSLV-KKPQVTVGAQ-N--AY-LKASGAFTGEN-S---VDQIKDVG
.=ALI |=ID |...||| |
Sequence 2 TLRTECAVDAIRAIAKEVP-E--AIVGAGTVLN-PQ QLAEVT--E---AG
Sequence 1 AKWVILGH--SERRSYFHEDDKFIADKTKFALGQGVGVILCIGETLEEKKAGKTLDVVER
.=ALI |=ID |...| |.| |..|....
Sequence 2 AQFAIS-PGL TEPLLKAATEGTIPLIPGIS TVS
Sequence 1 QLNAV-LEEVKDW-TNVVVAYEP--VW--AIGTGLAATPEDA--QDI--HASI-RKFLA-
.=ALI |=ID .| |
Sequence 2 ELMLGMD--YG-LK---EFQFFPAE-ANG G----VKA--LQA--IAG-P--FS
Sequence 1 SKLGDKAA-SELRILYGGSANGSN-AVTF---KDK-ADVDGFLVGGA-SLK
.=ALI |=ID . |....| |
Sequence 2 QV---RFCPKGGIS-PANY--RDYL--ALKSVLCIGG-SWL-VPADALEAGDY
Sequence 1 --P--EFV--DIIN--SR-N
.=ALI |=ID
Sequence 2 DRITKL-AREA--VEGAKL-

Sequence versus structural alignment
PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS
PHE ASN VAL CYS ARG THR PRO GLU ALA ILE CYS

Is it difficult to make a structural alignment? Yes, it is. Structural alignment is an NP-hard problem (NP stands for nondeterministic polynomial time). In other words, no efficient algorithm is known that solves it exactly. Even if there were one, the result would be correct from a technical point of view, but not necessarily from a biological point of view.

General solution
Use a heuristic approach:
1. Represent the proteins A and B in some coordinate-independent space
2. Compare A and B
3. Optimize the alignment between A and B (e.g. minimize the R.M.S.d.)
4. Measure the statistical significance of the alignment against some random set of structure comparisons
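
The quantity minimized in step 3 can be written down in a few lines of Python. The sketch below computes the root-mean-square deviation between two equally long lists of paired atom coordinates. It assumes the two structures have already been superposed and the residue pairing is given; finding the optimal superposition is a separate step that is not shown. The function name and the coordinates are invented.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Made-up coordinates: the second set is the first one shifted by 1 angstrom along z.
set_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
set_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
deviation = rmsd(set_a, set_b)
```

An R.M.S.d. is only comparable between alignments that cover a similar number of residues, which is one reason step 4 asks for a statistical significance instead of a raw value.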

"…in some coordinate-independent space…"
Make the problem easier by:
- comparing only distance matrices of atoms
- comparing secondary structure elements (SSEs)
- comparing cartoons
- comparing vectors of SSEs
- combining the methods mentioned above
- …
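
The first option can be illustrated with a short Python sketch: a matrix of intra-molecular distances does not change when the molecule is rotated or translated, so two structures can be compared without superposing them first. The coordinates and function names are invented for this example, and the rotation is restricted to the z axis to keep the code short.

```python
import math


def distance_matrix(coords):
    """All pairwise distances between the given (x, y, z) points."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def rotate_z_and_shift(coords, angle_rad, shift):
    """Rotate the points about the z axis, then translate them by `shift`."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    dx, dy, dz = shift
    return [(c * x - s * y + dx, s * x + c * y + dy, z + dz) for x, y, z in coords]


def max_abs_difference(m1, m2):
    """Largest element-wise difference between two equally sized matrices."""
    return max(abs(a - b) for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))


# Made-up trace of four atoms.
trace = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (8.1, 4.2, 2.0)]
moved = rotate_z_and_shift(trace, math.radians(40.0), (10.0, -5.0, 2.5))
difference = max_abs_difference(distance_matrix(trace), distance_matrix(moved))
```

The coordinates in `moved` differ from those in `trace`, yet the two distance matrices agree up to floating-point rounding.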

None of the methods is guaranteed to find the closest structure, and two methods can disagree at every amino acid position. Nevertheless, they can still provide valuable insight into the history of the protein and give hints about its function.

Methods for fold comparison

Server    Location                            Method                                   Ref.
CE                                            Extension of optimal path                1
DALI                                          Distance-matrix alignment                2
DEJAVU                                        SSE alignment with Cα atom optimisation  3
LOCK                                          Absolute orientation of corresponding points  4
MATRAS                                        Markov transition model of evolution     5
PRIDE                                         Cα-Cα atom distances                     6
SSM                                           Graph matching algorithm
TOP                                           SSE alignment                            7
TOPS      tops.ebi.ac.uk/tops/compare1.html   TOPS-diagram alignment                   8
TOPSCAN   pscan                               Secondary topology-string alignment      9
VAST      tsearch.html                        Vector alignment                         10

Protein structure classification
If you want to know which structures are similar to a known structure, these systems might help:
A) Manual - SCOP
B) Semi-automatic - CATH
C) Automatic - FSSP

CATH: topology or fold group level
From C. Orengo talk at EMBO course, Cambridge 2004

TIM barrel enzymes
- 18 different homologous families
- >60 different E.C. numbers
Figures: EC wheel of TIM barrels; structure of a TIM barrel (triose phosphate isomerase)
From J. Thornton talk at EMBO course, Cambridge 2004

CATH fold groups shown: Rossmann fold, alpha-beta plait, TIM barrel, jelly roll, immunoglobulin, OB fold, SH3-like, up-down, Arc repressor-like
Nearly one third of the superfamilies belong to <10 fold groups
From C. Orengo talk at EMBO course, Cambridge 2004

TargetDB contains sequences annotated like:
- hypothetical protein Af0491 from A. fulgidus
- putative serine hydrolase from S. cerevisiae
- predicted glutamine amidotransferase from P. aeruginosa
(January 2005)
PDB contains about 500 structures with a similar degree of confidence in functional assignment
Hypothetical Protein Mth938 (PDB ID: 1ihn)

Function from structure

- Fold and structural motifs: SSM fold search, surface clefts, residue conservation, DNA-binding HTH motifs, nest analysis
- Sequence motifs (PROSITE, BLOCKS, SMART, Pfam, etc.)
- Sequence scans: sequence search vs PDB, sequence search vs UniProt, Superfamily HMM library, gene neighbours
- n-residue templates: enzyme active sites, ligand binding sites, DNA binding sites, reverse templates

Summary
- Structural alignment can help with protein annotation even when the sequence similarity is not significant.
- The sequence identity of two proteins with similar structures can be lower than 10 %, because the number of folds is limited.
- Recent progress in protein structure determination increases the usefulness of structural alignment.
- Structural alignment is a difficult problem that is solved by heuristic methods. These methods simplify the problem and sacrifice the optimal result for speed.

Summary II
- Different methods can provide completely different alignments.
- In our results, CE, DALI, MATRAS and VAST were the best servers for finding structural relatives.
- A few structural classification systems have been developed (CATH, FSSP, SCOP). They provide a hierarchical classification of protein structures and make it possible to infer functional and evolutionary relationships between proteins.
- Folds are not distributed equally. The ten most frequent folds represent almost one third of all structures.