Structure Prediction in 1D


Structure Prediction in 1D [Based on Structural Bioinformatics, chapter 28]

Protein Structure Amino-acid chains fold to form 3D structures. Proteins are sequences that have a (more or less) stable 3-dimensional configuration. Structure is crucial for function: areas with a specific property, enzymatic pockets, firm structures.

Levels of structure: primary structure

Levels of structure: secondary structure α helix β sheet David Eisenberg, PNAS 100: 11207-11210

Levels of structure: tertiary and quaternary structure Four levels of structure in hemagglutinin, which is a long multimeric molecule whose three identical subunits are each composed of two chains, HA1 and HA2. (a) Primary structure is illustrated by the amino acid sequence of residues 68–195 of HA1. This region is used by influenza virus to bind to animal cells. The one-letter amino acid code is used. Secondary structure is represented diagrammatically beneath the sequence, showing regions of the polypeptide chain that are folded into α helices (light blue cylinders), β strands (light green arrows), and random coils (white strands). (b) Tertiary structure constitutes the folding of the helices and strands in each HA subunit into a compact structure that is 13.5 nm long and divided into two domains. The membrane-distal domain is folded into a globular conformation. The blue and green segments in this domain correspond to the sequence shown in part (a). The proximal domain, which lies adjacent to the viral membrane, has a stemlike conformation due to alignment of two long helices of HA2 (dark blue) with β strands in HA1. Short turns and longer loops, which usually lie at the surface of the molecule, connect the helices and strands in a given chain. (c) The quaternary structure comprises the three subunits of HA; the structure is stabilized by lateral interactions among the long helices (dark blue) in the subunit stems, forming a triple-stranded coiled-coil stalk. Each of the distal globular domains in trimeric hemagglutinin has a site (red) for binding sialic acid molecules on the surface of target cells. Like many membrane proteins, HA has several covalently bound carbohydrate (CHO) chains.

Ramachandran Plot

Determining structure: X-ray crystallography

Determining structure: NMR spectroscopy

Determining Structure X-ray and NMR methods make it possible to determine the structure of proteins and protein complexes. These methods are expensive and difficult (several months to process one protein). A centralized database (PDB) contains all solved protein structures (www.rcsb.org/pdb/): XYZ coordinates of atoms within a specified precision; ~31,000 solved structures.

Sequence from structure All information about the native structure of a protein is coded in the amino acid sequence + its native solution environment (Anfinsen, 1973). Can we decipher the code? There is no general prediction of 3D structure from sequence yet.

One dimensional prediction Project 3D structure onto strings of structural assignments. This is a simplification of the prediction problem. Examples: secondary structure state for each residue [α, β, L]; accessibility of each residue [buried, exposed]; transmembrane helix.

Define secondary structure 3D protein coordinates may be converted into a 1D secondary structure representation using DSSP or STRIDE. DSSP: EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT. DSSP = Database of Secondary Structure in Proteins; STRIDE = Secondary STRucture IDEntification method.

Labeling Secondary Structure Both hydrogen bond patterns and backbone dihedral angles are used to assign secondary structure labels from the XYZ coordinates of the amino acids. These criteria do not lead to an absolute definition of secondary structure. Why is it not absolutely defined?

Prediction of Secondary Structure Input: amino-acid sequence. Output: annotation sequence of three classes [alpha, beta, other (sometimes called coil/turn)]. Measure of success: percentage of residues that were correctly labeled.

Accuracy of 3-state predictions True SS: EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT Prediction: EEEELLLLHHHHHHLLLLEEEEEHHHHHHHHHHHHHHHHHHLL Q3-score = % of 3-state symbols that are correct, measured on a "test set". Test set = an independent set of cases (proteins) that were not used to train, or in any way derive, the method being tested. Best methods: PHD (Burkhard Rost): 72-74% Q3; Psi-pred (David T. Jones): 76-78% Q3.
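
The Q3 score described on this slide can be sketched in a few lines of Python. This is only an illustration: the slides do not say how the DSSP symbols are collapsed into three states, so the reduction table below (H, G, I to helix; E, B to strand; everything else to loop) is an assumed convention, and the function names are hypothetical.

```python
# Assumed 8-to-3 state reduction; one common convention, not taken from the slides.
REDUCE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def to_three_state(dssp):
    """Collapse a DSSP string into the three states H, E and L."""
    return "".join(REDUCE.get(symbol, "L") for symbol in dssp)


def q3(true_ss, predicted):
    """Percentage of residues whose 3-state label is predicted correctly."""
    if len(true_ss) != len(predicted):
        raise ValueError("true and predicted strings must have equal length")
    reduced = to_three_state(true_ss)
    correct = sum(1 for t, p in zip(reduced, predicted) if t == p)
    return 100.0 * correct / len(reduced)


if __name__ == "__main__":
    # Toy example: three of the four residues match.
    print(q3("HHEE", "HHEL"))
```

A different reduction table will change the score for the same prediction, which is one reason reported Q3 values are only comparable when the same convention and test set are used.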

What can you do with a secondary structure prediction? Find out if a homolog of unknown structure is missing any of the SS (secondary structure) units, i.e. a helix or a strand. Find out whether a helix or strand is extended or shortened in the homolog. Model a large insertion or terminal domain. Aid tertiary structure prediction.

Statistical Methods From the PDB database, calculate the propensity for a given amino acid (aa) to adopt a certain secondary structure type (α). Example: #Ala = 2,000, #residues = 20,000, #helix = 4,000, #Ala in helix = 500. p(α,aa) = 500/20,000, p(α) = 4,000/20,000, p(aa) = 2,000/20,000. P = p(α,aa) / (p(α) p(aa)) = 500 / (4,000/10) = 1.25. Used in the Chou-Fasman algorithm (1974).
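
The arithmetic of this example can be checked with a short Python sketch. The function name and argument names are hypothetical; the propensity is taken to be the joint frequency divided by the product of the two marginal frequencies, which is the ratio that gives the slide's value of 1.25.

```python
def propensity(n_aa_in_ss, n_ss, n_aa, n_total):
    """Propensity = p(ss, aa) / (p(ss) * p(aa)), computed from raw counts.

    Written as a single ratio of counts so that no intermediate
    rounding is introduced.
    """
    return (n_aa_in_ss * n_total) / (n_ss * n_aa)


if __name__ == "__main__":
    # Counts from the slide: Ala in helix, helix residues, Ala residues, all residues.
    print(propensity(500, 4000, 2000, 20000))  # 1.25
```

A value above 1 means the amino acid is found in that structure type more often than expected if residue identity and structure were independent.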

Chou-Fasman: Initiation Identify regions where 4 out of 6 residues have propensity P(H) > 1.00. This forms an “alpha-helix nucleus”.

Chou-Fasman: Propagation Extend the helix in both directions until a set of four residues has an average P(H) < 1.00.
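
The initiation and propagation steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the published algorithm in full: the propensity table is passed in by the caller, the toy values in the example are invented for the demonstration, and the stopping rule is read as "stop when the four-residue window ending at the next candidate residue averages below the threshold", which is only one possible interpretation of the slide.

```python
def find_helix_nuclei(seq, p_helix, window=6, min_formers=4, threshold=1.0):
    """Start indices of windows in which at least `min_formers` of the
    `window` residues have helix propensity above `threshold`."""
    starts = []
    for i in range(len(seq) - window + 1):
        formers = sum(1 for aa in seq[i:i + window] if p_helix[aa] > threshold)
        if formers >= min_formers:
            starts.append(i)
    return starts


def extend_helix(seq, p_helix, start, end, stop_window=4, threshold=1.0):
    """Grow the half-open segment [start, end) one residue at a time in both
    directions while the four-residue window that includes the candidate
    residue keeps an average propensity of at least `threshold`."""
    def average(lo, hi):
        return sum(p_helix[aa] for aa in seq[lo:hi]) / (hi - lo)

    # Extend towards the C-terminus.
    while end < len(seq):
        lo = max(0, end + 1 - stop_window)
        if average(lo, end + 1) < threshold:
            break
        end += 1
    # Extend towards the N-terminus.
    while start > 0:
        hi = min(len(seq), start - 1 + stop_window)
        if average(start - 1, hi) < threshold:
            break
        start -= 1
    return start, end


if __name__ == "__main__":
    toy_propensity = {"A": 1.5, "G": 0.25}  # invented values, for illustration only
    sequence = "GGGAAAAAAGGG"
    print(find_helix_nuclei(sequence, toy_propensity))
    print(extend_helix(sequence, toy_propensity, 3, 9))
```

Overlapping nuclei are returned separately here; a fuller implementation would merge them before or after extension.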

Chou-Fasman Prediction Predict as α-helix a segment with E[P_α] > 1.03 and E[P_α] > E[P_β], not including proline. Predict as β-strand a segment with E[P_β] > 1.05 and E[P_β] > E[P_α]. Others are labeled as turns/loops. (Various extensions appear in the literature.) http://fasta.bioch.virginia.edu/o_fasta/chofas.htm

Achieved accuracy: around 50%. Shortcoming of this method: it ignores the sequence context when predicting from individual amino acids. We would like to use the sequence context as an input to a classifier. There are many ways to address this; the most successful to date are based on neural networks.

A Neuron

Artificial Neuron [Figure: inputs a1, a2, …, ak, weighted by W1, W2, …, Wk, feed a single output.] A neuron is a multiple-input, single-output unit. Wi = weights assigned to inputs; b = internal “bias”; f = output function (linear, sigmoid).
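
A neuron of this kind can be written in a few lines. The slide lists the ingredients (weights, bias, output function) without spelling out how they combine, so the sketch assumes the usual form, output = f(sum of Wi * ai + b); the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic output function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def linear(z):
    """Identity output function."""
    return z


def neuron_output(inputs, weights, bias, f=sigmoid):
    """Weighted sum of the inputs plus the bias, passed through f."""
    if len(inputs) != len(weights):
        raise ValueError("one weight is needed per input")
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return f(z)


if __name__ == "__main__":
    print(neuron_output([1.0, 2.0], [3.0, 4.0], 0.5, f=linear))  # 11.5
    print(neuron_output([1.0, 1.0], [0.0, 0.0], 0.0))            # 0.5
```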

Artificial Neural Network [Figure: input layer a1, a2, …, ak; hidden layer; output layer o1, …, om.] Neurons in hidden layers compute “features” from the outputs of previous layers. Output neurons can be interpreted as a classifier.

Example: Fruit Classifier [Figure: inputs Color (Yellow/Red), Weight (Light/Heavy), Texture (Soft/Hard), Shape (Round/Ellipse); outputs Orange, Apple.]

Qian-Sejnowski Architecture [Figure: input layer holding the residue window S(i-w), …, S(i), …, S(i+w); hidden layer; output layer with three units o_α, o_β, o_o.]

Neural Network Prediction A neural network defines a function from inputs to outputs. Inputs can be discrete or continuous valued. In this case, the network defines a function from a window of size 2w+1 around a residue to a secondary structure label for it. The structure element is determined by max(o_α, o_β, o_o).
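
The input and output ends of such a network can be sketched without any neural-network library. The encoding below is an assumption made for illustration: each window position becomes a 21-slot binary vector (20 amino acids plus one slot for positions that fall off the end of the sequence), and the helper names are hypothetical. The network itself is left out; only the window construction and the final max over the three output units are shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SLOTS = len(AMINO_ACIDS) + 1  # 20 residues + 1 "off the end" slot


def encode_window(sequence, i, w):
    """Binary encoding of the 2w+1 residues centred on position i.

    Raises ValueError if the sequence contains a letter outside AMINO_ACIDS.
    """
    vector = []
    for pos in range(i - w, i + w + 1):
        one_hot = [0] * SLOTS
        if 0 <= pos < len(sequence):
            one_hot[AMINO_ACIDS.index(sequence[pos])] = 1
        else:
            one_hot[SLOTS - 1] = 1  # window position lies outside the sequence
        vector.extend(one_hot)
    return vector


def pick_state(outputs):
    """Label of the output unit with the largest activation."""
    return max(outputs, key=outputs.get)


if __name__ == "__main__":
    x = encode_window("ACD", 0, 1)
    print(len(x), sum(x))
    print(pick_state({"alpha": 0.2, "beta": 0.7, "other": 0.1}))
```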

Training Neural Networks By modifying the network weights, we change the function. Training is performed by defining an error score for training pairs <input,output> and performing gradient-descent minimization of the error score. The back-propagation algorithm makes it possible to compute the gradient efficiently. We have to be careful not to overfit the training data.
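
To make the training loop concrete, here is a minimal sketch for the simplest possible case: a single sigmoid unit trained by per-example gradient descent on a squared-error score. The error function, learning rate, number of epochs and the toy OR data set are all assumptions chosen for illustration; a real predictor has hidden layers, where back-propagation applies the same chain rule layer by layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def unit_output(x, w, b):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def total_error(pairs, w, b):
    """Sum over training pairs of 0.5 * (output - target)^2."""
    return sum(0.5 * (unit_output(x, w, b) - t) ** 2 for x, t in pairs)


def train_unit(pairs, n_inputs, rate=0.5, epochs=200):
    """Gradient descent on the squared error, one training pair at a time."""
    w = [0.0] * n_inputs
    b = 0.0
    for _ in range(epochs):
        for x, target in pairs:
            out = unit_output(x, w, b)
            # d(error)/d(weighted sum) for a sigmoid unit with squared error.
            delta = (out - target) * out * (1.0 - out)
            w = [wi - rate * delta * xi for wi, xi in zip(w, x)]
            b -= rate * delta
    return w, b


if __name__ == "__main__":
    # Toy training set: logical OR of two inputs.
    data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
    weights, bias = train_unit(data, 2)
    print(total_error(data, [0.0, 0.0], 0.0), total_error(data, weights, bias))
```

Checking the error on proteins held out of training, rather than on the training pairs themselves, is the usual guard against the overfitting mentioned above.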

Smoothing Outputs Some sequences of secondary structure labels are impossible. To smooth the output of the network, another layer is applied on top of the three output units for each residue. Success rate: about 65% on unseen proteins.

Breaking the 70% Threshold An innovation that made a crucial difference uses evolutionary information to improve prediction. Key idea: structure is preserved more than sequence, and surviving mutations are not random. Exploit evolutionary information, based on conservation analysis of multiple sequence alignments.

Nearest Neighbor Approach Predict the secondary structure state based on the secondary structure of homologous segments from proteins with known 3D structure. A key element: the choice of scoring table for evaluating segment similarity. Use max(na, nb, nc), where na, nb and nc are the numbers of nearest neighbors with the helix, strand and coil types, respectively. [NNSSP: Nearest-Neighbor Secondary Structure Prediction]
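
The voting rule can be sketched as below. This is a toy illustration only: the similarity score is a plain count of identical positions standing in for the scoring table, the database of labelled segments is invented, and the function names are hypothetical; it is not the NNSSP implementation.

```python
from collections import Counter


def segment_score(a, b):
    """Stand-in scoring table: number of identical aligned positions."""
    return sum(1 for x, y in zip(a, b) if x == y)


def nearest_neighbor_state(query, database, k=3):
    """Vote among the k segments most similar to the query.

    `database` is a list of (segment, state) pairs, where state is the
    known secondary structure of the segment's central residue.
    """
    ranked = sorted(database,
                    key=lambda item: segment_score(query, item[0]),
                    reverse=True)
    votes = Counter(state for _, state in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    known = [("AEELL", "H"), ("AEELK", "H"), ("VIVIV", "E"), ("GPGSG", "L")]
    print(nearest_neighbor_state("AEELL", known))
```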

PHD Approach [The PredictProtein server] Perform a BLAST search to find local alignments. Remove alignments that are “too close”. Perform multiple alignment of the sequences. Construct a profile (PSSM) of amino-acid frequencies at each residue. Use this profile as input to the neural network. A second network performs “smoothing”. The third level computes a jury decision of several different instantiations of the first two levels.
The first level is composed of ANNs almost identical to the architecture described above. However, there are differences in the input representation and in the fact that multiple ANNs are used, each with its input window shifted by one residue. Rather than simply using a binary representation to signify which amino acids are in a sequence, PHD encodes the input signal using aligned sequences. For example, after a multiple alignment on the query sequence, each position along the query sequence will have some number of aligned residues from matched sequences. PHD takes the frequency of occurrence of an amino acid in the aligned sequences and transforms that into an analog value input into the ANN. The outputs from all of the level 1 ANNs are then fed into a single second level ANN. The level 2 ANN also consists of a hidden layer and three output units giving the likelihood that the central residue is an α-helix, β-sheet, or loop/coil. (It should be noted that the use of the term “likelihood” does not imply statistical significance but simply some measure of conformational state strength on a scale from 0 to 1.) The second level, therefore, maps secondary structure information from its inputs to secondary structure information at its output. The third level of PHD computes an arithmetic average or jury decision of several different instantiations of the first two levels. One instantiation added “conservation weights” to the sequence profiles. These weightings placed a higher emphasis on particularly well conserved positions. Another instantiation used a slightly different representation method for the input vectors. The third level, therefore, provides a jury decision on the likelihood of secondary structure based on many different versions of the level 1 and level 2 ANNs.
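
The profile step can be illustrated with a small sketch that turns a multiple alignment into per-position amino-acid frequencies. It assumes the alignment is a list of equal-length strings with "-" for gaps and that gaps are simply left out of the counts; PHD's actual weighting and encoding are not given in the slides and are not reproduced here. The function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_profile(alignment):
    """Per-column amino-acid frequencies of a multiple alignment.

    Characters outside AMINO_ACIDS (for example the gap symbol "-")
    are ignored when counting.
    """
    profile = []
    for column in range(len(alignment[0])):
        residues = [row[column] for row in alignment if row[column] in AMINO_ACIDS]
        counts = {aa: 0 for aa in AMINO_ACIDS}
        for residue in residues:
            counts[residue] += 1
        n = len(residues)
        profile.append({aa: (counts[aa] / n if n else 0.0) for aa in AMINO_ACIDS})
    return profile


if __name__ == "__main__":
    rows = ["ACD", "ACE", "A-D", "GCD"]
    for position in frequency_profile(rows):
        print({aa: f for aa, f in position.items() if f > 0})
```

Each column of the result replaces the single binary vector used for that residue in the sequence-only network, so conserved positions produce sharply peaked inputs and variable positions produce flat ones.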

Psi-pred: same idea (Step 1) Run PSI-BLAST to produce a sequence profile. (Step 2) A 15-residue sliding window = 315 values, multiplied by the hidden weights in the 1st neural net. Output is 3 values (a weight for each state H, E or L) per position. (Step 3) 60 input values, multiplied by the weights in the 2nd neural network and summed. Output is the final 3-state prediction. 315 = 15*21: 21 because there are 20 possible amino acids + 1 value to indicate a “missing value” near the N- or C-terminus of the sequence. 60 = 15*4: again, 4 is for the 3 possible structure states + 1 “missing value” for windows at the edges. Performs slightly better than PHD.
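
The 315-value input of Step 2 can be assembled as in the sketch below. It assumes the profile is a list with one 20-value list per residue and that window positions beyond either terminus are filled with zeros plus a missing-value flag of 1; the function name and the flat ordering of the values are illustrative choices, not taken from the Psi-pred source.

```python
def window_input(profile, center, half_width=7):
    """Flatten a (2 * half_width + 1)-residue window of profile columns.

    Each window position contributes 20 profile values followed by one
    flag that is 1.0 when the position lies outside the sequence.
    """
    values = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(profile):
            values.extend(profile[pos])
            values.append(0.0)
        else:
            values.extend([0.0] * 20)
            values.append(1.0)
    return values


if __name__ == "__main__":
    toy_profile = [[0.05] * 20 for _ in range(10)]  # flat toy profile, 10 residues
    x = window_input(toy_profile, 0)
    print(len(x))            # 15 * 21 = 315
    print(sum(x[20::21]))    # number of window positions marked missing
```

The Step 3 input follows the same pattern with 3 state values plus 1 flag per position, giving 15 * 4 = 60 values.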

Other Classification Methods Neural networks were used as the classifier in the described methods. The same idea can be applied with other classifiers, e.g. SVM. Advantages: effectively avoids over-fitting; supplies prediction confidence [S. Hua and Z. Sun, (2001)].

Secondary Structure Prediction - Summary 1st Generation (1970s): Chou & Fasman, Q3 = 50-55%. 2nd Generation (1980s): Qian & Sejnowski, Q3 = 60-65%. 3rd Generation (1990s): PHD, PSI-PRED, Q3 = 70-80%. Failures: long-range effects (S-S bonds, parallel strands); chemical patterns; wrong prediction at the ends of H/E segments.