Protein structure – introduction “Bioinformatics: genes, proteins and computers” Orengo, Jones and Thornton (2003).


Secondary structure elements: α-helix, β-strand, β-sheet

Tertiary structure = protein fold: the complete 3-dimensional structure
Why is it interesting? Isn’t the sequence enough?
- a key to understanding protein function
- structure-based drug design
- detection of distant evolutionary relationships
- the structure is more conserved than the sequence!

Fold classification
Classification: clustering proteins into structural families.
Motivation?
- profound analysis of evolutionary mechanisms
- constraints on secondary structure packing?
- classification at domain level

CATH – Protein Structure Classification
- Hierarchical classification of protein domain structures in the Brookhaven Protein Data Bank (PDB).
- Domains are clustered at four major levels (Class, Architecture, Topology, Homologous superfamily), with Sequence family as a further level below them.

CATH – hierarchical classification
- Class: secondary structure content (mainly α, mainly β, α–β, low secondary structure content).
- Architecture: gross orientation of secondary structures, independent of connectivity.
- Topology (= fold): clusters structures according to their topological connections.

CATH – architectures

CATH – architectures (cont.)

CATH – hierarchical classification
- Homologous superfamily: homologous domains, identified by sequence similarity and structure similarity.
- Sequence family: domains clustered in the same sequence family, with sequence identity > 35%.
- Other classification schemes: SCOP, FSSP. There is partial disagreement between them.

Growing demand for protein structures!
- PDB contains 20,868 structures.
- GenBank contains 24,027,936 sequences!
- X-ray and NMR have limitations.
We need faster methods!

Protein Structure Prediction
I) Ab initio = ‘from the beginning’
- Simulation (physics): search for the conformation with the lowest energy
- Knowledge-based (i.e. statistics)
Protein sequence: RGYSLGNWVC KVFGRCELAA AMKRHGLDNY AAKFESNFNT QATNRNTDGS TDYGILQINS RWWCNDGRTP GSRNLCNIPC SALLSSDITA SVNCAKKIVS DGNGMNAWVA WRNRCKGTDV
Limited to very short peptides!

Can known structures assist prediction?
The number of possible folds seems to be limited!
- CATH inspection: more than 36,000 domains, but only ~800 topology groups.
- PDB inspection: a ‘new’ protein has a good chance of having an already known structure!
[Figure: total of "new folds" (light blue) and "old folds" (orange) for a given year]

Template-based prediction (fold recognition)
II) Comparative modeling (homology modeling)
- alignment with a homologous sequence of known structure
- areas of high sequence identity: similar structure
- variable areas: must be built
- can’t be used if no sequence similarity is found!
III) Threading
- alignment with the sequences of the structures in a fold library
- a sophisticated scoring function finds the most similar fold
- ‘threading’ aligns the target sequence onto the template structure

“What are the baselines for protein fold recognition?” McGuffin, Bryson and Jones (2001)
Goals:
1. What constitutes a baseline level of success for protein fold recognition methods, above random guesswork?
2. Can simple methods that make use of secondary structure information assign folds more reliably?
3. How valuable might these methods be in the rapid construction of a useful hierarchical classification?

The methods evaluated (ordered by complexity and runtime)
1. Absolute difference in length
2. Absolute difference in number of secondary structure elements
3. Simple alignment of secondary structure elements
4. Alignment of secondary structure elements (Przytycka et al., 1999)
5. Alignment of secondary structure elements without additional scoring
6. Alignment of secondary structure elements using DSSP as secondary structure assignment
7. Alignment of secondary structure elements with gap penalty
8. Alignment of secondary structure elements with gap penalty for long elements
9. Alignment of secondary structure elements with absolute difference in length as scoring scheme
10. Alignment of full length secondary structure strings
11. Alignment of primary sequence
Alignment of secondary structure elements:
- shorten the secondary structure strings: CCCHHHHCCCEEECCHHHCCC → HCECH
- pairwise alignment
- the scoring function also considers the length of the elements
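The string-shortening step can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names `collapse_elements` and `shorten` are hypothetical, and dropping coil at both termini is an assumption inferred from the slide example (the collapsed runs would otherwise read CHCECHC). The run lengths are kept because the scoring function is said to consider element length.

```python
from itertools import groupby


def collapse_elements(ss_string, coil="C"):
    """Return (state, length) runs of a 3-state string, without terminal coil."""
    runs = [(state, len(list(group))) for state, group in groupby(ss_string)]
    # Assumption: coil at either terminus is dropped, as the slide example suggests.
    while runs and runs[0][0] == coil:
        runs.pop(0)
    while runs and runs[-1][0] == coil:
        runs.pop()
    return runs


def shorten(ss_string):
    """Return one letter per secondary structure element."""
    return "".join(state for state, _ in collapse_elements(ss_string))


print(shorten("CCCHHHHCCCEEECCHHHCCC"))  # HCECH
print(collapse_elements("CCCHHHHCCCEEECCHHHCCC"))
```

The shortened strings of two domains can then be passed to any pairwise alignment routine.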

A representative set of protein domains
- A set of 1087 domains representing different “Sequence Families” was selected from CATH.
- An informative file was generated for each domain:
1. >1atx00
2. GAAaLbKSDGPNTRGNSMSGTIWVFGcPSGWNNbEGRAIIGYacKQ
3. EEE TTS S TTSSEEEEEESS TT EEE SSSSSEEEE
4. CEEEEEHHECEEEECCCECEEEECCCEECCEECEEECCEECEEEEC

First evaluation: true positive percentage
Compare the true positive percentage of each method at a fixed 3% false positive rate.
- Run each method on all possible pairs from the 1087 set: (a,b) (a,c) (a,d) ... (g,d) (g,e) ... (k,f) ... (r,s) ... ~590,000 pairs.
- Sort each score list by descending similarity score.
- For each list: go from the top downward and compare each assignment to CATH, keeping a true counter and a false counter.
Example: let's assume there are 100 structurally similar pairs and 100 dissimilar pairs.
(a,d) 0.99 CATH(a) = CATH(d)
(g,e) 0.98 CATH(g) != CATH(e)
(r,s) 0.87 CATH(r) = CATH(s)
(a,b) 0.63 CATH(a) != CATH(b)
(k,f) 0.45 CATH(k) != CATH(f)
STOP! 3% false positives reached: true counter = 2, false counter = 3.
(g,d) 0.37 (not examined)
True positive percentage for this method = 2%.
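The counting procedure of this slide can be written as a short Python sketch. The function name `true_positive_percentage` and the `(score, same_topology)` tuple layout are assumptions made for illustration. The data are the six example pairs from the slide, with totals of 100 similar and 100 dissimilar pairs as assumed there.

```python
def true_positive_percentage(scored_pairs, n_similar, n_dissimilar, fp_percent=3):
    """Walk down the ranked list until fp_percent of the dissimilar pairs are hit."""
    ranked = sorted(scored_pairs, key=lambda item: item[0], reverse=True)
    true_count = 0
    false_count = 0
    for _score, same_topology in ranked:
        if same_topology:
            true_count += 1
        else:
            false_count += 1
            # Integer comparison avoids floating point rounding at the cutoff.
            if false_count * 100 >= fp_percent * n_dissimilar:
                break
    return 100 * true_count / n_similar


example = [
    (0.99, True),   # (a,d): CATH(a) = CATH(d)
    (0.98, False),  # (g,e): CATH(g) != CATH(e)
    (0.87, True),   # (r,s): CATH(r) = CATH(s)
    (0.63, False),  # (a,b): CATH(a) != CATH(b)
    (0.45, False),  # (k,f): CATH(k) != CATH(f), third false positive
    (0.37, False),  # (g,d): label not given on the slide; never examined
]

print(true_positive_percentage(example, n_similar=100, n_dissimilar=100))  # 2.0
```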

We need lower and upper controls to compare with
Lower control: intelligent guesswork
1. Randomly assign CATH topology codes according to their frequency.
2. Calculate the true positive and false positive percentages.
Upper control: automated recognition (given the 3D structure)
1. The FSSP, SCOP and CATH databases were screened for all dissimilar domains that exist in all three of them.
2. FSSP gave similarity scores to all possible pairs.
3. FSSP assignments were compared against CATH, and against SCOP.
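A minimal sketch of the lower control, assuming it amounts to drawing one topology code per domain with probability proportional to how often that code occurs in the set. The function name and the short list of codes are illustrative only; the real set has 1087 domains.

```python
import random
from collections import Counter


def random_topology_assignment(true_codes, seed=0):
    """Draw one topology code per domain, weighted by observed code frequency."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    counts = Counter(true_codes)
    codes = list(counts)
    weights = [counts[code] for code in codes]
    return rng.choices(codes, weights=weights, k=len(true_codes))


# Illustrative codes only, not taken from the study's data set.
true_codes = ["1.10.490", "1.10.490", "1.10.490", "2.60.40", "2.60.40", "3.40.50"]
guessed = random_topology_assignment(true_codes)

# A pair counts as a "true" guess when both domains received the same code.
print(list(zip(true_codes, guessed)))
```

The guessed codes can then be evaluated against CATH in the same way as the scores of any real method.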

Optimisation of similarity scoring methods: “Class pre-filter”
Each domain was assigned a class according to its secondary structure: the percentage of residues constituting α-helices / β-strands.
Example: domain “1cgt03” has 80% of its amino acids in β-strands and 10% in α-helices.

- The most accurate is method number 5, “Alignment of secondary structure elements without additional scoring”, with 27.18% true positives.
- Partial agreement between classification schemes: FSSP compared with SCOP: 61.1%; FSSP compared with CATH: 46.7%.
- Methods that use secondary structure alignments are in better agreement with CATH.
- The accuracy ordering of the methods doesn’t correspond to their relative complexity.
- Methods that use secondary structure usually don’t benefit notably from the class pre-filter.
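The class pre-filter can be illustrated with a small Python sketch. The slides give no thresholds, so the cutoffs `low_cutoff` and `dominance` below are purely hypothetical, as are the function names. The sketch also assumes that the filter simply skips pairs whose classes differ.

```python
def assign_class(helix_pct, strand_pct, low_cutoff=20.0, dominance=2.0):
    """Assign a coarse class from helix/strand percentages (hypothetical cutoffs)."""
    if helix_pct + strand_pct < low_cutoff:
        return "low secondary structure"
    if helix_pct >= dominance * strand_pct:
        return "mainly alpha"
    if strand_pct >= dominance * helix_pct:
        return "mainly beta"
    return "alpha-beta"


def passes_prefilter(class_a, class_b):
    """Assumption: only pairs of domains from the same class are scored."""
    return class_a == class_b


# Domain 1cgt03 from the slide: 10% helix, 80% strand.
print(assign_class(helix_pct=10.0, strand_pct=80.0))  # mainly beta
```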

Second evaluation: CASP-like sensitivity
Similarly to CASP, we measure the sensitivity of each method: what is the probability of a method correctly assigning a fold?
Lower control: a random proportional fold assignment.
Upper control: FSSP was used as a scoring method.
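The slide does not spell out how sensitivity was computed. One plausible reading, sketched below, is that each domain is assigned the fold of its top-scoring partner and sensitivity is the percentage of domains for which that fold is correct. The function name, the toy domains and the toy scores are all hypothetical.

```python
def sensitivity(domains, topology, score):
    """Percentage of domains whose best-scoring partner shares their topology."""
    correct = 0
    for query in domains:
        candidates = [d for d in domains if d != query]
        best = max(candidates, key=lambda d: score(query, d))
        if topology[best] == topology[query]:
            correct += 1
    return 100 * correct / len(domains)


# Toy data: four domains in two topology groups, with made-up pair scores.
topology = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
pair_scores = {
    frozenset("ab"): 0.9, frozenset("ac"): 0.2, frozenset("ad"): 0.1,
    frozenset("bc"): 0.3, frozenset("bd"): 0.2, frozenset("cd"): 0.1,
}

result = sensitivity(
    list(topology), topology, lambda p, q: pair_scores[frozenset((p, q))]
)
print(result)  # 50.0
```

In the toy data, a and b find each other, while c and d are both drawn to b, so half of the assignments are correct.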

Sensitivity results:
- Method 5 wins again: 31.8% sensitivity. The other secondary-structure-based methods follow with a small gap.
- The sensitivity order of the methods ≈ the true positive percentage order.

Similarity trees - can we construct a classification?
The best method’s similarity scores for all pairs were clustered into a tree.
Whole tree: generally disordered.
a. globin-like <> casein kinase
b. immunoglobulin-like <> thrombin subunit H
[Figure: two subtrees of the similarity tree, with their leaf labels]
(a) 1ckjA2 1irk02 1phk02 1ampE2 1hcl02 1ckjA2 1gdj00 1kobA2 1hbg00 1babA0 1lhs00 1mba00 1eca00 1ithA0 1ash00 1flp00 1sctA0 1cpcA0 1ddt02 1colA0
(b) 1bec01 1tcrA2 1edhA2 1nfkA1 1itbB1 1cgt03 1svpA2 1jxpA2 1try02 1sgt02 1sgpE1 1sgpE2 1dar02

Conclusions
1. Baseline level to be exceeded by fold recognition methods: 27% true positive assignments while allowing 3% false positives; sensitivity level of 32%.
2. Methods which make use of secondary structure information seem more accurate and sensitive than those which don’t.
3. Simple secondary structure alignments alone cannot construct a reliable classification hierarchy.
4. The agreement between the FSSP, SCOP and CATH classification schemes is surprisingly low.