Class 7: Protein Secondary Structure



Protein Structure
Amino-acid chains can fold to form three-dimensional structures.
Proteins are sequences that adopt a (more or less) stable three-dimensional conformation.

Why Is Structure Important?
The structure a protein adopts is crucial for its function:
Forms "pockets" that can recognize an enzyme's substrate
Positions the side chains of specific groups so that they co-locate to form regions with the desired chemical/electrical properties
Creates firm structural materials such as collagen, keratins, and fibroins

Determining Structure
X-ray crystallography and NMR allow the structures of proteins and protein complexes to be determined.
These methods are expensive and difficult; it can take several work-months to process one protein.
A centralized database (the PDB) contains all solved protein structures: the XYZ coordinates of atoms, within a specified precision.
~11,000 solved structures.

Structure Is Sequence Dependent
Experiments show that for many proteins, the three-dimensional structure is a function of the sequence:
Force the protein to lose its structure by introducing agents that change the environment (denaturation)
When the protein is put back in water, the original conformation and activity are restored
However, for complex proteins there are cellular processes (chaperones) that "help" in folding

Secondary Structure -helix -strands

α-Helix
Single protein chain
Shape maintained by intramolecular H-bonding between -C=O and H-N- groups

Hydrogen Bonds in α-Helices

β-Strands Form Sheets
Parallel or anti-parallel
The sheets are held together by hydrogen bonds across strands

Angular Coordinates
Secondary structures force specific backbone dihedral angles (φ, ψ) between residues

Ramachandran Plot
We can relate the (φ, ψ) angles to types of structures

Define "secondary structure" 3D protein coordinates may be converted to a 1D secondary structure representation using DSSP or STRIDE DSSP EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT DSSP= Database of Secondary Structure in Proteins 2

DSSP symbols
H = α-helix: backbone angles (φ, ψ) near (-60, -50) and an i → i+4 H-bonding pattern
E = extended strand: backbone angles near (-120, +120) with β-sheet H-bonds (parallel and anti-parallel are not distinguished)
B = β-bridge (an isolated pair of backbone H-bonds)
T = turn (specific sets of angles and one i → i+3 H-bond)
S = bend (high backbone curvature, no H-bond requirement)
G = 3_10 helix or turn (i → i+3 H-bonds)
I = π-helix (i → i+5 H-bonds); rare
_ (often written L or C) = unclassified, none of the above: a generic loop, or a β-strand with no regular H-bonding
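As a concrete illustration, here is a minimal Python sketch (the helper names are ours, not part of DSSP or STRIDE) that collapses the 8 DSSP symbols into the 3 states used later in the lecture; the exact grouping of G and B varies between benchmarks.

```python
# Minimal sketch (not the DSSP program itself): collapse 8-state DSSP codes
# into the 3 states used for prediction. The grouping below (helix = H/G/I,
# strand = E/B, everything else = coil) is one common convention; some
# benchmarks group G and B with coil instead.
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strand / bridge
    "T": "C", "S": "C", "_": "C",   # turn, bend, unclassified
    " ": "C", "L": "C", "C": "C",
}

def dssp_to_3state(dssp_string: str) -> str:
    """Convert a DSSP assignment string to a 3-state (H/E/C) string."""
    return "".join(DSSP_TO_3STATE.get(s, "C") for s in dssp_string)

if __name__ == "__main__":
    dssp = "EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT"
    print(dssp_to_3state(dssp))
```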

Prediction of Secondary Structure
Input: an amino-acid sequence
Output: an annotation sequence over three classes: alpha, beta, other (sometimes called coil/turn)
Measure of success: the percentage of residues that are correctly labeled

Accuracy of 3-state predictions
True SS:    EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT
Prediction: EEEELLLLHHHHHHLLLLEEEEEHHHHHHHHHHHHHHHHHHLL
Q3 score = % of 3-state symbols that are correct, measured on a "test set".
Test set = an independent set of cases (proteins) that were not used to train, or in any way derive, the method being tested.
Best methods: PHD (Burkhard Rost) 72-74% Q3; PSI-PRED (David T. Jones) 76-78% Q3.
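A minimal sketch of how Q3 could be computed, assuming the reference and the prediction have both been mapped to the same 3-state alphabet (for example with the helper above):

```python
def q3_score(true_ss: str, predicted_ss: str) -> float:
    """Fraction of residues whose 3-state label (H/E/C) is predicted correctly."""
    assert len(true_ss) == len(predicted_ss), "sequences must be aligned and of equal length"
    correct = sum(1 for t, p in zip(true_ss, predicted_ss) if t == p)
    return correct / len(true_ss)

# Example (toy strings, not from a real benchmark): 11 of 12 residues correct
print(q3_score("CCHHHHCCEEEC", "CCHHHCCCEEEC"))  # ~0.917
```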

Prediction Accuracy
Three-state per-residue accuracy: Q3
Other measures: per-segment accuracy (SOV), Matthews correlation coefficient

What can you do with a secondary structure prediction?
(1) Find out whether a homolog of unknown structure is missing any of the SS (secondary structure) units, i.e. a helix or a strand.
(2) Find out whether a helix or strand is extended or shortened in the homolog.
(3) Model a large insertion or a terminal domain.
(4) Aid tertiary structure prediction.

Statistical Methods
From the PDB, calculate the propensity of a given amino acid to adopt a certain SS type:
propensity = P(ss | aa) / P(ss) = [#aa in ss / #aa] / [#ss residues / #residues]
Example: #Ala = 2,000, #residues = 20,000, #helix residues = 4,000, #Ala in helix = 500
propensity = (500 / 2,000) / (4,000 / 20,000) = 0.25 / 0.20 = 1.25
Used in the Chou-Fasman algorithm (1974).
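A small sketch of this counting, with the example numbers from the slide hard-coded (the function and argument names are illustrative, not from any standard library):

```python
def propensity(n_aa_in_ss: int, n_aa: int, n_ss: int, n_total: int) -> float:
    """Propensity of an amino acid for a secondary-structure type:
    P(ss | aa) / P(ss) = (n_aa_in_ss / n_aa) / (n_ss / n_total)."""
    return (n_aa_in_ss / n_aa) / (n_ss / n_total)

# Example from the slide: alanine in helices
print(propensity(n_aa_in_ss=500, n_aa=2_000, n_ss=4_000, n_total=20_000))  # 1.25
```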

Chou-Fasman: Initiation
Identify regions where 4 of 6 residues have P(H) > 1.00: an "alpha-helix nucleus"

Chou-Fasman: Propagation
Extend the helix in both directions until a set of four residues has an average P(H) < 1.00.

Chou-Fasman Prediction
Predict as α-helix: a segment with E[P_α] > 1.03, E[P_α] > E[P_β], and not including proline.
Predict as β-strand: a segment with E[P_β] > 1.05 and E[P_β] > E[P_α].
Everything else is labeled as turn.
(Various extensions appear in the literature.)
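A simplified sketch of the nucleation-and-extension idea for helices only. The propensity table below is a toy placeholder covering a few residues, not the published Chou-Fasman parameters, and the thresholds follow the slides above rather than the original paper.

```python
# Toy helix propensities; real Chou-Fasman tables cover all 20 amino acids.
P_HELIX = {"A": 1.42, "E": 1.51, "L": 1.21, "K": 1.16, "V": 1.06,
           "G": 0.57, "P": 0.57, "S": 0.77}

def predict_helices(seq: str, nucleus_len: int = 6, min_above: int = 4):
    """Return (start, end) intervals predicted as helix, Chou-Fasman style:
    nucleate where >= 4 of 6 residues have P(H) > 1.0, then extend in both
    directions while the window of 4 residues at the growing edge keeps an
    average P(H) >= 1.0. Unknown residues default to a neutral propensity."""
    p = [P_HELIX.get(aa, 1.0) for aa in seq]
    raw = []
    for i in range(len(seq) - nucleus_len + 1):
        window = p[i:i + nucleus_len]
        if sum(1 for x in window if x > 1.0) >= min_above:
            start, end = i, i + nucleus_len          # nucleus is [start, end)
            while start > 0 and sum(p[start - 1:start + 3]) / 4 >= 1.0:
                start -= 1                           # extend left
            while end < len(seq) and sum(p[end - 3:end + 1]) / 4 >= 1.0:
                end += 1                             # extend right
            raw.append((start, end))
    merged = []                                      # merge overlapping intervals
    for start, end in sorted(raw):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(predict_helices("AAEELLKKGGPPSSVVAAEE"))       # predicted (start, end) helix intervals
```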

Achieved accuracy: around 50%.
Shortcoming of this method: it predicts from individual amino acids, ignoring the surrounding sequence context.
We would like to use the sequence context as input to a classifier.
There are many ways to do this; the most successful to date are based on neural networks.

A Neuron

Artificial Neuron
[Figure: inputs a_1, a_2, ..., a_k with weights W_1, W_2, ..., W_k feeding a single output.]
A neuron is a multiple-input, single-output unit: output = f(W_1·a_1 + ... + W_k·a_k + b)
W_i = weights assigned to the inputs; b = internal "bias"; f = output function (linear, sigmoid)

Artificial Neural Network
[Figure: input layer a_1 ... a_k, a hidden layer, and output layer o_1 ... o_m, connected by weights.]
Neurons in hidden layers compute "features" from the outputs of the previous layer.
The output neurons can be interpreted as a classifier.
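A minimal numpy sketch of such a feedforward network (one hidden layer, sigmoid units, randomly initialized weights), just to make the picture concrete; it is not any particular published architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(a, W1, b1, W2, b2):
    """One hidden layer: features = f(W1·a + b1), outputs = f(W2·features + b2)."""
    hidden = sigmoid(W1 @ a + b1)     # hidden neurons compute "features"
    return sigmoid(W2 @ hidden + b2)  # output neurons act as the classifier

rng = np.random.default_rng(0)
k, n_hidden, m = 8, 5, 3                       # inputs, hidden units, output classes
W1, b1 = rng.normal(size=(n_hidden, k)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(m, n_hidden)), np.zeros(m)

a = rng.normal(size=k)                         # an example input vector
print(forward(a, W1, b1, W2, b2))              # three output activations
```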

Example: Fruit Classifier

          Shape     Texture   Weight   Color
Apple     ellipse   hard      heavy    red
Orange    round     soft      light    yellow

Qian-Sejnowski Architecture
[Figure: a window of residues S_{i-w} ... S_i ... S_{i+w} feeds the input layer; one hidden layer; three output units o_α, o_β, o_coil.]

Neural Network Prediction
A neural network defines a function from inputs to outputs; inputs can be discrete or continuous valued.
In this case, the network defines a function from a window of size 2w+1 around a residue to a secondary-structure label for that residue.
The structure element is determined by max(o_α, o_β, o_coil).
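To make the input representation concrete, here is a sketch of the standard sliding-window encoding: one-hot, with a 21st symbol for window positions that fall off the ends of the chain. The window size and sequence are illustrative choices, not parameters from the Qian-Sejnowski paper.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"           # 20 residues; index 20 = "off the end"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_window(seq: str, i: int, w: int = 6) -> np.ndarray:
    """One-hot encode the window seq[i-w : i+w+1] as a flat (2w+1) x 21 vector."""
    vec = np.zeros((2 * w + 1, 21))
    for k, pos in enumerate(range(i - w, i + w + 1)):
        if 0 <= pos < len(seq):
            vec[k, AA_INDEX.get(seq[pos], 20)] = 1.0
        else:
            vec[k, 20] = 1.0                   # padding symbol beyond the termini
    return vec.ravel()                         # flat input vector for the network

x = encode_window("MTYKLILNGKTKGETTTEAV", i=0, w=6)
print(x.shape)                                 # (13 * 21,) = (273,)
```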

Training Neural Networks
By modifying the network weights, we change the function the network computes.
Training is performed by:
Defining an error score over training pairs <input, output>
Performing gradient-descent minimization of that error score
The back-propagation algorithm computes the gradient efficiently.
We have to be careful not to overfit the training data.

Smoothing Outputs
The Qian-Sejnowski network assigns each residue a secondary structure by taking max(o_α, o_β, o_coil).
Some label sequences are physically impossible (e.g. a helix only one residue long).
To smooth the output of the network, another layer is applied on top of the three output units of each residue:
a second neural network, or a Markov model.

Success Rate
Variants of this neural network architecture, and other methods, achieved an accuracy of about 65% on unseen proteins, depending on the exact choice of training/test sets.

Breaking the 70% Threshold
An innovation that made a crucial difference: using evolutionary information to improve prediction.
Key idea: structure is conserved more than sequence, and surviving mutations are not random.
Suppose we find homologues (same structure) of the query sequence.
The types of replacement seen at position i during evolution provide information about the role of residue i in the secondary structure.

Nearest Neighbor Approach
Select a window around the target residue.
Perform local alignment against sequences with known structure.
Choose an alignment weight matrix that can detect remote homologies; the alignment weight takes into account the secondary structure of the aligned sequence.
Assign the label by max(n_α, n_β, n_coil) (neighbor counts) or max(s_α, s_β, s_coil) (score-weighted counts).
Key: a scoring measure of evolutionary similarity.

PHD Approach
A multi-step procedure:
(1) Perform a BLAST search to find local alignments
(2) Remove alignments that are "too close" (nearly identical sequences)
(3) Perform a multiple alignment of the sequences
(4) Construct a profile (PSSM) of amino-acid frequencies at each residue
(5) Use this profile as input to the neural network
(6) A second network performs "smoothing"
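A toy sketch of step (4), building a per-position frequency profile from a multiple alignment. It omits pseudocounts, gaps, and sequence weighting, which real profile construction (e.g. in PSI-BLAST or PHD) would include.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def frequency_profile(alignment: list) -> np.ndarray:
    """Column-wise amino-acid frequencies for an ungapped toy alignment:
    returns an (L x 20) matrix, one row per alignment column."""
    length = len(alignment[0])
    profile = np.zeros((length, 20))
    for seq in alignment:
        for pos, aa in enumerate(seq):
            if aa in AMINO_ACIDS:              # skip gaps / unknown residues
                profile[pos, AMINO_ACIDS.index(aa)] += 1
    return profile / len(alignment)

aln = ["MTYKL", "MSYKL", "MTFRL", "MTYKI"]     # toy aligned homologues
print(frequency_profile(aln)[2])               # residue distribution at column 3 (Y/Y/F/Y)
```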

PHD Architecture

PSI-PRED: same idea
(Step 1) Run PSI-BLAST; the output is a sequence profile.
(Step 2) A 15-residue sliding window over the profile (15 × 21 = 315 inputs) is fed to the first neural network; the output is 3 scores (one for each state H, E or L) per position.
(Step 3) A 15-position window over these outputs (15 × 3 = 45 inputs) is fed to a second neural network; the output is the final 3-state prediction.
Performs slightly better than PHD.
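A shape-only sketch of this two-stage pipeline with random values, just to show how the dimensions in steps 2 and 3 fit together. The hidden layers of the real PSI-PRED networks are omitted; each network is reduced here to a single random linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 100                                        # protein length
profile = rng.random((L, 21))                  # step 1: sequence profile (toy values)

def windows(matrix, w):
    """Stack a sliding window of +/- w rows (zero-padded) into one vector per row."""
    pad = np.zeros((w, matrix.shape[1]))
    padded = np.vstack([pad, matrix, pad])
    return np.array([padded[i:i + 2 * w + 1].ravel() for i in range(matrix.shape[0])])

X1 = windows(profile, w=7)                     # step 2 input: (L, 15*21) = (L, 315)
W1 = rng.normal(size=(315, 3))
scores = X1 @ W1                               # 3 scores (H, E, L) per position

X2 = windows(scores, w=7)                      # step 3 input: (L, 15*3) = (L, 45)
W2 = rng.normal(size=(45, 3))
final = (X2 @ W2).argmax(axis=1)               # final 3-state prediction per residue
print(X1.shape, X2.shape, final.shape)         # (100, 315) (100, 45) (100,)
```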

Other Classification Methods
Neural networks were used as the classifier in the methods described above, but the same idea can be applied with other classifiers, e.g. the SVM.
Advantages: effectively avoids overfitting; supplies a prediction confidence.

SVM-Based Approach
Suggested by S. Hua and Z. Sun (2001).
Multiple sequence alignment taken from the HSSP database (as in PHD).
Sliding window of w residues → 21 × w input dimensions.
Apply an SVM with an RBF kernel.
Handling the multiclass problem:
Training: one-against-others classifiers (e.g. H/~H, E/~E, L/~L) and binary classifiers (e.g. H/E)
Combining: maximum output score, decision tree method, or jury decision method
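A compact sketch of this kind of classifier using scikit-learn (assumed available; this is not the authors' code): random toy feature vectors stand in for the alignment-profile windows, and a one-vs-rest RBF-kernel SVM plays the role of the one-against-others classifiers.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

# Toy training data: one feature vector per residue window plus its 3-state label.
# In the real method the features come from HSSP alignment profiles, not random numbers.
rng = np.random.default_rng(0)
X_train = rng.random((60, 21 * 13))            # 60 windows, w = 13, 21 values per position
y_train = rng.choice(list("HEL"), size=60)     # toy labels

model = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale", C=1.0))
model.fit(X_train, y_train)

X_new = rng.random((5, 21 * 13))
print(model.predict(X_new))                    # predicted state for 5 new windows
```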

Decision tree
[Figure: three example decision trees. Tree 1: test H/~H; if yes predict H, otherwise test E/C and predict E or C. Tree 2: test E/~E; if yes predict E, otherwise test C/H. Tree 3: test C/~C; if yes predict C, otherwise test H/E.]

Accuracy on the CB513 set

Classifier   Q3     QH     QE     QC     SOV
Max          72.9   74.8   58.6   79.0   75.4
Tree1        68.9   73.5   54.0   73.1   72.1
Tree2        68.2   72.0   61.0   69.0   71.4
Tree3        67.5   69.5   46.6   77.0   70.8
NN                  74.7   57.7   77.4   75.0
Vote         70.7   73.0          76.6   73.2
Jury                75.2   60.3   79.5   76.2

State of the Art
Both PHD and the nearest-neighbor method reach about 72-74% accuracy; both predicted well in CASP2 (1996).
PSI-PRED is slightly better (around 76%) and gave the best predictions in CASP3 (1998).
Recent trend: combining classification methods.
Failures: long-range effects (S-S bonds, parallel strands), chemical patterns, and wrong predictions at the ends of helices/strands.