Protein Structure Prediction

? MNIFEMLRID EGLRLKIYKD TEGYYTIGIG HLLTKSPSLN AAKSELDKAI GRNCNGVITK DEAEKLFNQD VDAAVRGILR NAKLKPVYDS LDAVRRCALI NMVFQMGETG VAGFTNSLRM LQQKRWDEAA VNLAKSRWYN QTPNRAKRVI TTFRTGTWDA YKNL

Secondary Structure Predictions
Sequence:  KESAAAKFER QHMDSGNSPS SSSNYCNLMM CCRKMTQGKC KPVNTFVHES
Structure: HHHHHHH HH SSTT T HHHHHH HHTT SSSS SEEEEE S
Sequence:  LADVKAVCSQ KKVTCKNGQT NCYQSKSTMR ITDCRETGSS KYPNCAYKTT
Structure: HHHHHGGGGS EEE TTS S EEE SSEEE EEEEEE TTT BTTB EEEE
Sequence:  QVEKHIIVAC GGKPSVPVHF DASV
Structure: EEEEEEEEEE ETTTTEE EE EE
The secondary structure of a protein (alpha, beta, loop) can be predicted from its amino acid sequence. In a known 3D structure, the secondary structure is generally assigned from non-local interactions, that is, from the H-bonding profile between the CO and NH groups of the protein backbone.

Dihedral Angles

Ramachandran Plot

Different Levels of Protein Structure

Why predict when you can have the real thing?
UniProt Release 1.3 (02/2004) consists of:
- Swiss-Prot Release: 144731 protein sequences
- TrEMBL Release: 1017041 protein sequences
PDB structures: 24358 protein structures
Levels shown: Primary structure, Secondary structure, Tertiary structure, Quaternary structure, Function
Mind the (sequence-structure) gap!

Secondary Structure Prediction Given a protein sequence a1a2…aN, secondary structure prediction aims at defining the state of each amino acid ai as being either H (helix), E (extended=strand), or O (other) (Some methods have 4 states: H, E, T for turns, and O for other). The quality of secondary structure prediction is measured with a “3-state accuracy” score, or Q3. Q3 is the percent of residues that match “reality” (X-ray structure).

Quality of Secondary Structure Prediction
Determine secondary structure positions in known protein structures using DSSP or STRIDE:
- Kabsch and Sander. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577-2637 (1983) (DSSP)
- Frishman and Argos. Knowledge-based protein secondary structure assignment. Proteins 23:566-579 (1995) (STRIDE)

Limitations of Q3
Amino acid sequence:         ALHEASGPSVILFGSDVTVPPASNAEQAK
Actual secondary structure:  hhhhhooooeeeeoooeeeooooohhhhh
Prediction 1:                ohhhooooeeeeoooooeeeooohhhhhh   Q3=22/29=76% (useful prediction)
Prediction 2:                hhhhhoooohhhhooohhhooooohhhhh   Q3=22/29=76% (terrible prediction)
Q3 for random prediction is 33%.
Secondary structure assignment in real proteins is uncertain to about 10%; therefore, a “perfect” prediction would have Q3=90%.
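
The Q3 values on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name q3 and the variable names are illustrative choices, and the three strings are copied from the slide.

```python
def q3(predicted: str, actual: str) -> float:
    """Percentage of residues whose predicted state equals the assigned state."""
    if len(predicted) != len(actual):
        raise ValueError("prediction and assignment must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * matches / len(actual)


ACTUAL = "hhhhhooooeeeeoooeeeooooohhhhh"   # actual secondary structure
PRED_1 = "ohhhooooeeeeoooooeeeooohhhhhh"   # segments shifted by one residue
PRED_2 = "hhhhhoooohhhhooohhhooooohhhhh"   # both strands predicted as helix

for name, pred in (("prediction 1", PRED_1), ("prediction 2", PRED_2)):
    print(name, f"Q3 = {q3(pred, ACTUAL):.0f}%")
```

Both predictions match 22 of 29 residues, yet the second one contains no strand at all, which is why a per-residue score alone can hide a qualitatively wrong prediction.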

Early methods for Secondary Structure Prediction
- Chou and Fasman (Chou and Fasman. Prediction of protein conformation. Biochemistry, 13: 211-245, 1974)
- GOR (Garnier, Osguthorpe and Robson. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol., 120:97-120, 1978)

Chou and Fasman
Start by computing the propensity of each amino acid type to belong to a given type of secondary structure. A propensity > 1 means that residue type i is more likely than average to be found in the corresponding secondary structure type.
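
The propensity formula itself did not survive in this transcript. A minimal sketch, assuming the usual definition (fraction of residues of type i found in state s, divided by the fraction of all residues found in state s), could look like this. The function name and the toy sequences are hypothetical.

```python
from collections import Counter


def propensities(sequences, structures, state):
    """Assumed definition: P(aa) = (n(aa, state) / n(aa)) / (N(state) / N)."""
    total = Counter()      # n(aa): occurrences of each residue type
    in_state = Counter()   # n(aa, state): occurrences inside `state`
    for seq, ss in zip(sequences, structures):
        for aa, s in zip(seq, ss):
            total[aa] += 1
            if s == state:
                in_state[aa] += 1
    n_all = sum(total.values())
    n_state = sum(in_state.values())
    if n_state == 0:
        raise ValueError("state never occurs in the training data")
    background = n_state / n_all   # N(state) / N
    return {aa: (in_state[aa] / total[aa]) / background for aa in total}


# Toy training data (hypothetical): H = helix, O = other.
seqs = ["AAAAGG", "AGGG"]
sss = ["HHHHOO", "OOOO"]
p = propensities(seqs, sss, "H")
print(p)
```

In the toy data, 4 of 5 alanines are helical while only 4 of 10 residues overall are, so Ala gets a helix propensity of 2; Gly never occurs in a helix and gets 0.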

From sequences to profile
For each position along the sequence, tabulate how often each type of amino acid occurs (include ‘.’ for gap). The profile is always of size Nx21, no matter how many sequences are considered.
Position      1  2  3  4  5  6  7  8  9
Number of G   0  0  0  0  0  0  0  0  4
Number of A   0  0  0  0  0  0  0  0  0
Number of V   0  0  0  0  0  1  0  0  0
Number of I   0  0  0  0  0  0  0  1  0
Number of L   0  0  3  0  0  0  0  0  0
Number of F   3  0  0  0  0  0  0  0  0
Number of P   0  0  0  0  5  4  0  0  0
Number of M   0  0  0  0  0  0  0  0  0
Number of W   0  0  0  0  0  0  0  0  0
Number of C   0  5  0  0  0  0  0  0  0
Number of S   0  0  0  0  0  0  0  0  0
Number of T   0  0  0  0  0  0  0  3  0
Number of N   0  0  0  0  0  0  0  0  0
Number of Q   0  0  0  0  0  0  0  0  0
Number of H   0  0  0  0  0  0  0  0  0
Number of Y   1  0  0  0  0  0  3  0  0
Number of D   1  0  1  0  0  0  1  1  0
Number of E   0  0  0  4  0  0  0  0  0
Number of K   0  0  1  1  0  0  0  0  1
Number of R   0  0  0  0  0  0  1  0  0
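
Counting residues per alignment column is straightforward to sketch in Python. The alignment below is hypothetical: the original sequences are not in the transcript, so five sequences were chosen only so that they reproduce the counts in the table. The names ALPHABET and build_profile are illustrative.

```python
# 20 amino acids in the row order of the table, plus '.' for a gap (21 symbols).
ALPHABET = "GAVILFPMWCSTNQHYDEKR."


def build_profile(alignment):
    """Return an N x 21 table: profile[pos][k] counts ALPHABET[k] in column pos."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = [[0] * len(ALPHABET) for _ in range(length)]
    for row in alignment:
        for pos, symbol in enumerate(row):
            profile[pos][ALPHABET.index(symbol)] += 1
    return profile


# Hypothetical alignment consistent with the counts shown in the table.
alignment = [
    "FCLEPPYTG",
    "FCLEPPYTG",
    "FCLEPPYTG",
    "YCDEPPDIG",
    "DCKKPVRDK",
]
profile = build_profile(alignment)
print(len(profile), "x", len(profile[0]))
```

The size of the result depends only on the alignment length N, not on the number of sequences, which is what makes a profile a convenient fixed-size input for later methods.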

Chou and Fasman
Amino Acid   α-Helix   β-Sheet   Turn
Ala          1.29      0.90      0.78
Cys          1.11      0.74      0.80
Leu          1.30      1.02      0.59
Met          1.47      0.97      0.39
Glu          1.44      0.75      1.00
Gln          1.27      0.80      0.97
His          1.22      1.08      0.69
Lys          1.23      0.77      0.96
Val          0.91      1.49      0.47
Ile          0.97      1.45      0.51
Phe          1.07      1.32      0.58
Tyr          0.72      1.25      1.05
Trp          0.99      1.14      0.75
Thr          0.82      1.21      1.03
Gly          0.56      0.92      1.64
Ser          0.82      0.95      1.33
Asp          1.04      0.72      1.41
Asn          0.90      0.76      1.23
Pro          0.52      0.64      1.91
Arg          0.96      0.99      0.88
Rows are grouped, from top to bottom, into residues that favor α-helix, favor β-strand, and favor turn.

Chou and Fasman
Predicting helices:
- find nucleation site: 4 out of 6 contiguous residues with P(a)>1
- extension: extend helix in both directions until a set of 4 contiguous residues has an average P(a) < 1 (breaker)
- if average P(a) over whole region is >1, it is predicted to be helical
Predicting strands:
- find nucleation site: 3 out of 5 contiguous residues with P(b)>1
- extension: extend strand in both directions until a set of 4 contiguous residues has an average P(b) < 1 (breaker)
- if average P(b) over whole region is >1, it is predicted to be a strand
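
The helix rules can be sketched as follows. This is one possible reading of the slide, not the original program: the way the 4-residue breaker window is positioned during extension is an assumption, and the function names are hypothetical. The helix propensities are the α-helix column of the Chou and Fasman table.

```python
# Helix propensities P(a) from the Chou and Fasman table (one-letter codes).
P_ALPHA = {
    "A": 1.29, "C": 1.11, "L": 1.30, "M": 1.47, "E": 1.44, "Q": 1.27, "H": 1.22,
    "K": 1.23, "V": 0.91, "I": 0.97, "F": 1.07, "Y": 0.72, "W": 0.99, "T": 0.82,
    "G": 0.56, "S": 0.82, "D": 1.04, "N": 0.90, "P": 0.52, "R": 0.96,
}


def mean(values):
    return sum(values) / len(values)


def predict_helices(seq, nucleus=6, needed=4, breaker=4):
    """Mark residues as helical ('H') or not ('-') using nucleation and extension."""
    p = [P_ALPHA[aa] for aa in seq]
    n = len(seq)
    helical = [False] * n
    for start in range(n - nucleus + 1):
        # Nucleation: at least `needed` of `nucleus` contiguous residues with P(a) > 1.
        if sum(v > 1 for v in p[start:start + nucleus]) < needed:
            continue
        left, right = start, start + nucleus   # half-open region [left, right)
        # Assumed extension rule: add a residue while the 4-residue window
        # ending (or starting) at that residue still averages >= 1.
        while right < n and mean(p[right - breaker + 1:right + 1]) >= 1:
            right += 1
        while left > 0 and mean(p[left - 1:left - 1 + breaker]) >= 1:
            left -= 1
        # Final check: average P(a) over the whole region must exceed 1.
        if mean(p[left:right]) > 1:
            for i in range(left, right):
                helical[i] = True
    return "".join("H" if h else "-" for h in helical)


print(predict_helices("GGGGAEELLKKAGGGG"))
```

The strand rules have the same shape with P(b), a 5-residue nucleus and 3 required residues, so the same function could be reused with a different table and parameters.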

Chou and Fasman
Position-specific parameters for turn, f(i), f(i+1), f(i+2), f(i+3): each position has distinct amino acid preferences.
Examples:
- At position 2, Pro is highly preferred; Trp is disfavored
- At position 3, Asp, Asn and Gly are preferred
- At position 4, Trp, Gly and Cys are preferred

Chou and Fasman
Predicting turns:
- for each tetrapeptide starting at residue i, compute:
  - PTurn (average propensity over all 4 residues)
  - F = f(i)*f(i+1)*f(i+2)*f(i+3)
- if PTurn > Pa and PTurn > Pb and PTurn > 1 and F > 0.000075, the tetrapeptide is considered a turn.
Chou and Fasman prediction: http://fasta.bioch.virginia.edu/fasta_www/chofas.htm

The GOR method
Position-dependent propensities for helix, sheet or turn are calculated for each amino acid. For each position j in the sequence, eight residues on either side are considered. A helix propensity table contains the propensities of residues at the 17 window positions when the conformation of residue j is helical, so the helix propensity tables have 20 x 17 entries. Similar tables are built for strands and turns.
GOR simplification: the predicted state of AAj is calculated from the sum of the position-dependent propensities of all residues around AAj.
GOR can be used at: http://abs.cit.nih.gov/gor/ (current version is GOR IV)

GOR: the older standard
The GOR method (version IV) was reported by its authors to reach a single-sequence prediction accuracy of 64.4%, as assessed through jackknife testing over a database of 267 proteins with known structure (Garnier, J. G., Gibrat, J.-F., Robson, B. (1996) In: Methods in Enzymology (Doolittle, R. F., Ed.) Vol. 266, pp. 540-53).
The GOR method relies on the frequencies observed in the database for residues in a 17-residue window (i.e. eight residues N-terminal and eight C-terminal of the central window position) for each of the three structural states.
Figure: 17 x 20 frequency tables for the states H, E and C; versions GOR-I, GOR-II, GOR-III, GOR-IV.

How do secondary structure prediction methods work?
They often use a window approach: a local stretch of amino acids around a given sequence position is used to predict the secondary structure state of that position. The next slides give basic explanations of the window approach (with the GOR method as an example) and of a basic technique to train a method and predict: neural nets.

Sliding window
A constant window of n residues slides along a sequence of known structure (states such as H H H E E E E); the prediction is made for the central residue of the window. The frequencies of the residues in the window are converted to probabilities of observing a SS type. The GOR method uses three 17x20 tables for predicting helix, strand and coil, where 17 is the window length and 20 the number of amino acid types. At each position, the state with the highest probability (helix, strand or coil) is taken.
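
A window-based prediction of this kind can be sketched in a few lines. The table values below are made up for illustration (real GOR tables are derived from a database of known structures), and the names gor_predict and toy_table are hypothetical. Window positions that fall beyond the sequence termini are simply skipped, which is an assumption.

```python
HALF_WINDOW = 8   # eight residues on either side of the central residue: 17 positions


def gor_predict(seq, tables):
    """tables[state][k][aa]: contribution of residue aa at window position k (0..16)."""
    prediction = []
    for j in range(len(seq)):
        scores = {}
        for state, table in tables.items():
            total = 0.0
            for k in range(2 * HALF_WINDOW + 1):
                pos = j + k - HALF_WINDOW
                if 0 <= pos < len(seq):   # skip window positions outside the sequence
                    total += table[k].get(seq[pos], 0.0)
            scores[state] = total
        # The state with the highest summed score is taken for residue j.
        prediction.append(max(scores, key=scores.get))
    return "".join(prediction)


def toy_table(favoured):
    """Made-up table: one residue type scores 1.0 at the centre, 0.1 elsewhere."""
    return [{favoured: 1.0 if k == HALF_WINDOW else 0.1}
            for k in range(2 * HALF_WINDOW + 1)]


tables = {"H": toy_table("A"), "E": toy_table("V"), "C": toy_table("G")}
predicted = gor_predict("AAAAVVVVGGGG", tables)
print(predicted)
```

Each state has one 17-position table with an entry per amino acid type, matching the three 17x20 tables described on the slide.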

Accuracy Both Chou and Fasman and GOR have been assessed and their accuracy is estimated to be Q3=60-65%. (initially, higher scores were reported, but the experiments set to measure Q3 were flawed, as the test cases included proteins used to derive the propensities!)

Neural networks The most successful methods for predicting secondary structure are based on neural networks. The overall idea is that neural networks can be trained to recognize amino acid patterns in known secondary structure units, and to use these patterns to distinguish between the different types of secondary structure. Neural networks classify “input vectors” or “examples” into categories (2 or more). They are loosely based on biological neurons.

The perceptron
Diagram: inputs X1, X2, …, XN with weights w1, w2, …, wN feed a threshold unit (threshold T), which produces the output.
The perceptron classifies the input vector X into two categories. If the weights and threshold T are not known in advance, the perceptron must be trained. Ideally, the perceptron must be trained to return the correct answer on all training examples, and perform well on examples it has never seen. The training set must contain both types of data (i.e. with “1” and “0” output).

The perceptron
Notes:
- The input is a vector X and the weights can be stored in another vector W.
- The perceptron computes the dot product S = X.W
- The output F is a function of S: it is often discrete (i.e. 1 or 0), in which case F is the step function. For continuous output, a sigmoid is often used: F(S) = 1/(1 + exp(-S)), which rises from 0 to 1 and equals 1/2 at S = 0.
- Not every classification can be learned by a perceptron! (famous example: XOR)
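
The forward pass of a perceptron fits in a few lines of Python. This is a minimal sketch: the weights and threshold for the AND example are hand-picked for illustration, and the logistic function is assumed as the sigmoid.

```python
import math


def dot(x, w):
    """Dot product S = X.W"""
    return sum(xi * wi for xi, wi in zip(x, w))


def step_output(x, w, threshold):
    """Discrete output: 1 if S exceeds the threshold T, else 0."""
    return 1 if dot(x, w) > threshold else 0


def sigmoid(s):
    """Continuous output in (0, 1); equals 1/2 at S = 0."""
    return 1.0 / (1.0 + math.exp(-s))


INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
W_AND, T_AND = (1.0, 1.0), 1.5   # hand-picked weights that implement logical AND

for x in INPUTS:
    print(x, step_output(x, W_AND, T_AND))
```

AND works because a single line separates (1, 1) from the other three inputs. For XOR no such line exists, so no choice of W and T gives the right output on all four inputs.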

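The perceptron
Training a perceptron: find the weights W that minimize an error function E(W) measuring the mismatch between the perceptron output and the target over the training data, where:
- P: number of training data
- Xi: training vectors
- F(W.Xi): output of the perceptron
- t(Xi): target value for Xi
Use steepest descent:
- compute the gradient of E with respect to W
- update the weight vector by a step of size e against the gradient (e: learning rate)
- iterate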
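
The error function and update formulas are missing from this transcript. The sketch below assumes the common sum-of-squares error E(W) = 1/2 * sum over i of (F(W.Xi) - t(Xi))^2 with a sigmoid output, and trains a perceptron on logical OR by steepest descent. The threshold is folded into the weights through a constant input of 1; that choice, the learning rate, and the number of iterations are all illustrative assumptions.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def predict(w, x):
    """Perceptron output F(W.X) with a sigmoid."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def train(examples, targets, rate=0.5, epochs=5000):
    """Steepest descent on the assumed error E(W) = 1/2 * sum (F(W.Xi) - t(Xi))**2."""
    n = len(examples[0])
    w = [0.0] * n
    for _ in range(epochs):
        grad = [0.0] * n
        for x, t in zip(examples, targets):
            f = predict(w, x)
            delta = (f - t) * f * (1.0 - f)   # derivative of this example's error w.r.t. S
            for k in range(n):
                grad[k] += delta * x[k]
        # Update: W <- W - e * gradient, with e the learning rate.
        w = [wi - rate * gi for wi, gi in zip(w, grad)]
    return w


# Logical OR; the last input is a constant 1 that plays the role of the threshold.
EXAMPLES = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
TARGETS = [0, 1, 1, 1]
w = train(EXAMPLES, TARGETS)
print([round(predict(w, x), 2) for x in EXAMPLES])
```

After training, thresholding the sigmoid output at 1/2 reproduces the OR targets on all four examples.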

Neural Network
A complete neural network is a set of perceptrons interconnected such that the outputs of some units become the inputs of other units. Many topologies are possible! Neural networks are trained just like a perceptron, by minimizing an error function.

Neural networks and Secondary Structure prediction
Experience from Chou and Fasman and GOR has shown that:
- In predicting the conformation of a residue, it is important to consider a window around it.
- Helices and strands occur in stretches.
- It is important to consider multiple sequences.

PHD: Secondary structure prediction using NN

PHD: Input For each residue, consider a window of size 13: 13x20=260 values
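
Assembling the 260 input values for one residue can be sketched as below. The profile is assumed to be a list of N rows with 20 values each. How PHD handles windows that extend past the termini is not stated on the slide; padding with zeros is an assumption made here, and the function name is hypothetical.

```python
WINDOW = 13    # window size used for the first network
N_TYPES = 20   # values per profile row


def window_input(profile, j):
    """Flatten the 13 profile rows centred on residue j into 13 x 20 = 260 values."""
    half = WINDOW // 2
    values = []
    for pos in range(j - half, j + half + 1):
        if 0 <= pos < len(profile):
            values.extend(profile[pos])
        else:
            values.extend([0.0] * N_TYPES)   # assumed zero padding beyond the termini
    return values


# Hypothetical flat profile for a 30-residue protein.
profile = [[1.0 / N_TYPES] * N_TYPES for _ in range(30)]
print(len(window_input(profile, 15)))
```

The input size is the same for every residue, which is what allows a single network to be applied along the whole sequence.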

PHD: Network 1 (Sequence to Structure)
Input: 13x20 values. Output: 3 values, Pa(i), Pb(i), Pc(i).

PHD: Network 2 (Structure to Structure)
For each residue, consider a window of size 17. Input: 17x3=51 values, i.e. the 3 values Pa(i), Pb(i), Pc(i) from Network 1 at each window position. Output: 3 values, Pa(i), Pb(i), Pc(i).

PHD
- Sequence-Structure network: for each amino acid aj, a window of 13 residues aj-6…aj…aj+6 is considered. The corresponding rows of the sequence profile are fed into the neural network, and the output is 3 probabilities for aj: P(aj,alpha), P(aj,beta) and P(aj,other).
- Structure-Structure network: for each aj, PHD now considers a window of 17 residues; the probabilities P(ak,alpha), P(ak,beta) and P(ak,other) for k in [j-8,j+8] are fed into the second-layer neural network, which again produces probabilities that residue aj is in each of the 3 possible conformations.
- Jury system: PHD has trained several neural networks with different training sets; all neural networks are applied to the test sequence, and the results are averaged.
- Prediction: for each position, the secondary structure with the highest average score is output as the prediction.
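
The jury and prediction steps amount to averaging per-residue probabilities over networks and taking the best state. A minimal sketch follows, with made-up network outputs; the function name and the data layout are hypothetical.

```python
STATES = ("alpha", "beta", "other")


def jury_predict(network_outputs):
    """network_outputs[n][j] = (P_alpha, P_beta, P_other) from network n for residue j."""
    n_networks = len(network_outputs)
    length = len(network_outputs[0])
    prediction = []
    for j in range(length):
        # Average each of the 3 probabilities over all networks.
        averaged = [sum(net[j][s] for net in network_outputs) / n_networks
                    for s in range(len(STATES))]
        # The state with the highest average score is the prediction.
        prediction.append(STATES[averaged.index(max(averaged))])
    return prediction


# Made-up outputs of two networks for a two-residue sequence.
net1 = [(0.6, 0.3, 0.1), (0.2, 0.5, 0.3)]
net2 = [(0.4, 0.2, 0.4), (0.1, 0.3, 0.6)]
prediction = jury_predict([net1, net2])
print(prediction)
```

For the second residue the two networks disagree (beta versus other); averaging gives 0.40 for beta and 0.45 for other, so the jury outputs other.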

Secondary Structure Prediction
Available servers:
- JPRED: http://www.compbio.dundee.ac.uk/~www-jpred/
- PHD: http://cubic.bioc.columbia.edu/predictprotein/
- PSIPRED: http://bioinf.cs.ucl.ac.uk/psipred/
- NNPREDICT: http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
- Chou and Fasman: http://fasta.bioch.virginia.edu/fasta_www/chofas.htm
Interesting paper:
- Rost and Eyrich. EVA: Large-scale analysis of secondary structure prediction. Proteins 5:192-199 (2001)