Structural Bioinformatics Elodie Laine Master BIM-BMC Semester 3, 2013-2014 Genomics of Microorganisms, UMR 7238, CNRS-UPMC e-documents:

Slides:



Advertisements
Similar presentations
Proteins: Structure reflects function….. Fig. 5-UN1 Amino group Carboxyl group carbon.
Advertisements

Review.
Amino Acids PHC 211.  Characteristics and Structures of amino acids  Classification of Amino Acids  Essential and Nonessential Amino Acids  Levels.
François Fages MPRI Bio-info 2007 Formal Biology of the Cell Protein structure prediction with constraint logic programming François Fages, Constraint.
Secondary structure prediction from amino acid sequence.
Review of Basic Principles of Chemistry, Amino Acids and Proteins Brian Kuhlman: The material presented here is available on the.
1 Protein Structure, Structure Classification and Prediction Bioinformatics X3 January 2005 P. Johansson, D. Madsen Dept.of Cell & Molecular Biology, Uppsala.
Structure Prediction. Tertiary protein structure: protein folding Three main approaches: [1] experimental determination (X-ray crystallography, NMR) [2]
Protein Secondary Structures
Chapter 9 Structure Prediction. Motivation Given a protein, can you predict molecular structure Want to avoid repeated x-ray crystallography, but want.
Applied Bioinformatics The amino acids. Overview Proteins (sneak preview) – Primary structure – Secondary structure – Tertiary structure The amino acids.
Structure Prediction. Tertiary protein structure: protein folding Three main approaches: [1] experimental determination (X-ray crystallography, NMR) [2]
Computational Biology, Part 10 Protein Structure Prediction and Display Robert F. Murphy Copyright  1996, 1999, All rights reserved.
Protein Secondary Structures Assignment and prediction.
Protein Secondary Structures Assignment and prediction Pernille Haste Andersen
Protein structure determination & prediction. Tertiary protein structure: protein folding Three main approaches: [1] experimental determination (X-ray.
Computing for Bioinformatics Lecture 8: protein folding.
©CMBI 2006 Amino Acids “ When you understand the amino acids, you understand everything ”
Proteins account for more than 50% of the dry mass of most cells
Proteins Secondary Structure Predictions Structural Bioinformatics.
Prediction to Protein Structure Fall 2005 CSC 487/687 Computing for Bioinformatics.
©CMBI 2006 Amino Acids “ When you understand the amino acids, you understand everything ”
Protein Secondary Structure Prediction Some of the slides are adapted from Dr. Dong Xu’s lecture notes.
Neural Networks for Protein Structure Prediction Brown, JMB 1999 CS 466 Saurabh Sinha.
. Sequence Alignment. Sequences Much of bioinformatics involves sequences u DNA sequences u RNA sequences u Protein sequences We can think of these sequences.
Now playing: Frank Sinatra “My Way” A large part of modern biology is understanding large molecules like Proteins A large part of modern biology is understanding.
Secondary structure prediction
Learning Targets “I Can...” -State how many nucleotides make up a codon. -Use a codon chart to find the corresponding amino acid.
Protein Structure & Modeling Biology 224 Instructor: Tom Peavy Nov 18 & 23, 2009
Protein secondary structure Prediction Why 2 nd Structure prediction? The problem Seq: RPLQGLVLDTQLYGFPGAFDDWERFMRE Pred:CCCCCHHHHHCCCCEEEECCHHHHHHCC.
Protein Secondary Structure Prediction G P S Raghava.
1 Protein Structure Prediction (Lecture for CS397-CXZ Algorithms in Bioinformatics) April 23, 2004 ChengXiang Zhai Department of Computer Science University.
Protein Structure Prediction ● Why ? ● Type of protein structure predictions – Sec Str. Pred – Homology Modelling – Fold Recognition – Ab Initio ● Secondary.
Amino Acids ©CMBI 2001 “ When you understand the amino acids, you understand everything ”
Proteins.
Chapter 3 Proteins.
Proteins Secondary Structure Predictions
Secondary Structure Prediction Lecture 7 Structural Bioinformatics Dr. Avraham Samson
Comparative methods Basic logics: The 3D structure of the protein is deduced from: 1.Similarities between the protein and other proteins 2.Statistical.
Protein structure prediction Haixu Tang School of Informatics.
Proteins Structure Predictions Structural Bioinformatics.
Amino Acids. Amino acids are used in every cell of your body to build the proteins you need to survive. Amino Acids have a two-carbon bond: – One of the.
Sparse nonnegative matrix factorization for protein sequence motifs information discovery Presented by Wooyoung Kim Computer Science, Georgia State University.
1 4. Nucleic acids and proteins in one and more dimensions - second part.
Prepared By: Syed Khaleelulla Hussaini. Outline Proteins DNA RNA Genetics and evolution The Sequence Matching Problem RNA Sequence Matching Complexity.
Improved Protein Secondary Structure Prediction. Secondary Structure Prediction Given a protein sequence a 1 a 2 …a N, secondary structure prediction.
Proteins Tertiary Protein Structure of Enzyme Lactasevideo Video 2.
Structural Bioinformatics Elodie Laine Master BIM-BMC Semester 3, Genomics of Microorganisms, UMR 7238, CNRS-UPMC e-documents:
Structural Bioinformatics Elodie Laine Master BIM-BMC Semester 3, Genomics of Microorganisms, UMR 7238, CNRS-UPMC e-documents:
Structural Bioinformatics Elodie Laine Master BIM-BMC Semestre 3, Laboratoire de Biologie Computationnelle et Quantitative (LCQB) e-documents:
Amino acids Proof. Dr. Abdulhussien Aljebory College of pharmacy
Mandatory to put some order in such a vast wealth of structural knowledge 4. Nucleic acids and proteins in one and more dimensions - second part.
Amino acids.
Protein Folding Notes.
Protein Structure September 7,
Protein Sequence Alignments
Proteins.
Conformationally changed Stability
Introduction to Bioinformatics II
Chapter 3 Proteins.
Fig. 5-UN1  carbon Amino group Carboxyl group.
Protein Structure Prediction
Proteins Genetic information in DNA codes specifically for the production of proteins Cells have thousands of different proteins, each with a specific.
Conformationally changed Stability
The 20 amino acids.
The 20 amino acids.
Example of regression by RBF-ANN
Proteins Proteins have many structures, resulting in a wide range of functions Proteins do most of the work in cells and act as enzymes 2. Proteins are.
“When you understand the amino acids,
Neural Networks for Protein Structure Prediction Dr. B Bhunia.
Presentation transcript:

Structural Bioinformatics Elodie Laine Master BIM-BMC Semester 3, Genomics of Microorganisms, UMR 7238, CNRS-UPMC e-documents:

Elodie Laine –

Secondary structure prediction Elodie Laine – A secondary structure element can be defined as a consecutive fragment of a protein sequence which corresponds to a local region in the associated protein structure showing distinct geometric features. In general about 50% of all protein residues participate in α - helices and β –strands, while the remaining half is more irregularly structured.

Secondary structure prediction Elodie Laine – Input: protein sequence Output: protein secondary structure Input: protein sequence Output: protein secondary structure Assumption: amino acids display preferences for certain secondary structures.

MotivationMotivation Elodie Laine –  Fold recognition  confirm structural and functional link when sequence identity is low  Structure determination  in conjunction with NMR data or as ab initio prediction first step  Sequence alignment refinement  possibly aiming at structure prediction  Classification of structural motifs  Protein design  Fold recognition  confirm structural and functional link when sequence identity is low  Structure determination  in conjunction with NMR data or as ab initio prediction first step  Sequence alignment refinement  possibly aiming at structure prediction  Classification of structural motifs  Protein design

General principles Elodie Laine – Preferences of amino acids for certain secondary structures can be explained at least partly by their physico-chemical properties (volume, total and partial charges, bipolar moment…). Proteins are composed of: - a hydrophobic core with compacted helices and sheets - a hydrophylic surface with loops interacting with the solvent or substrate aRginine lysine (K) aspartate (D) glutamate (E) asparagiNe glutamine (Q) Cysteine Methionine Histidine Serine Threonine Valine Leucine Isoleucine phenylalanine (F) Tyrosine tryptophan (W) Glycine Alanine Proline one amino-acid α-helix β-sheet Structure breakers

MethodsMethods Elodie Laine –  Empirical  combine amino acid physico-chemical properties and frequencies  Statistical  derived from large databases of protein structures  Machine learning  neural network, support vector machines…  Hybrid or consensus  Empirical  combine amino acid physico-chemical properties and frequencies  Statistical  derived from large databases of protein structures  Machine learning  neural network, support vector machines…  Hybrid or consensus  About 80% accuracy for the best modern methods  Weekly benchmarks for assessing accuracy (LiveBench, EVA)  About 80% accuracy for the best modern methods  Weekly benchmarks for assessing accuracy (LiveBench, EVA)

Empirical methods Elodie Laine –  Guzzo (1965) Biophys J. (Non-)Helical parts of proteins based on hemoglobin & myoglobin structures: Pro, Asp, Glu and His destabilize helices  Prothero (1966) Biophys J. Refinment of Guzzo rules based on lysozyme, ribonuclease, α-chymotrypsine & papaine structures: 5 consecutive aas are in a helix if at least 3 are Ala, Val, Leu or Glu  Kotelchuck & Sheraga (1969) PNAS A minimum of 4 and 2 residues to respectively form and break a helix  Lim (1974) J Mol Biol. 14 rules to predict α -helices and β -sheets based on a series of descriptors (compactness, core hydrophobicity, surface polarity…)

Empirical methods Elodie Laine –  Shiffer & Edmundson (1967) Biophys J. Helices are represented by helical wheels and residues are projected onto the perpendicular axis of the helix: hydophobic aas tend to localize on one side (n, n±3, n±4) Helical wheel 2D representation of an α -helix from tuna myoglobin (residues 77-92, PDB file 2NRL)

Empirical methods Elodie Laine –  Mornon et al. (1987) FEBS Letter 2D representation of the protein where hydrophobic residues within a certain distance are connected: hydrophobic clusters are assigned to secondary structure motifs ! ! Not fully automatic: visual inspection is required

Statistical methods Elodie Laine –  Chou & Fasman (1974) Biochemistry ① Count occurrences of each on of the 20 aas in each structural motif (helix, sheet, coil): ② Classify residues according to their propensities ③ Refine prediction based on a series of rules CategoryHelixSheetExamples Strong formersHαHαHβHβLys, Val Weak formershαhαhβhβ IndifferentIαIαIβIβ Weak breakersbαbαbβbβ Strong breakersBαBαBβBβPro, Glu ! ! Propensities are determined for individual residues, not accounting for their environment

Statistical methods Elodie Laine –  Garnier, Osguthorpe et Robson (GOR) (1978,1987) The GOR algorithm is based on the information theory combined with Bayesian statistics. It accounts for the influence of the neighboring residues by computing the product of the conditional probabilities of each residue to be in the same secondary structure motif: Normalization to avoid bias toward the most frequent structural motifs These first methods were improved by the use of multiple alignments, based on the assumption that proteins with similar sequences display similar secondary structures. GOR III has also started to consider all possible pairwise interactions of the neighboring residues. i

Machine learning methods Elodie Laine –  Artificial neural networks Step 1: the algorithm learns to recognize complex patterns, e.g. sequence-secondary structure associations, in a training set, i.e. known protein structures. Weights are determined so as to optimize inputs/outputs. Step 2: Once weights are fixed, the neural network is used to predict secondary structures of the test set. LKYDEFGLIVALALKYDEFGLIVALA Input Layer Hidden Layer Output Layer α β coil uiui sjsj ykyk bhjbhj w h ij w 0 jk b0kb0k sigmoidal gaussian

Machine learning methods Elodie Laine –  Artificial neural networks The initial sequence is read by sliding a window of length N (10-17 residues) Input Layer: the 20 amino acid types by the length N Output Layer: the 3 secondary structure types LKYDEFGLIVALALKYDEFGLIVALA Input Layer Hidden Layer Output Layer α β coil uiui sjsj ykyk bhjbhj w h ij w 0 jk b0kb0k sigmoidal gaussian

Machine learning methods Elodie Laine –  Artificial neural networks: PHD method (Rost & Sander, 1993) Training set: HHSP database (Schneider & Sander) Input: multiple structure alignment (local and global sequence features) 3 levels: ① sequence -> structure ② structure-> structure ③ arithmetic average

Evaluating performance Elodie Laine –  By-residue score Percentage of correctly predicted residues in each class (helix, sheet, coil): q α,q β, q γ are the numbers of residues correctly predicted in α, β, γ respectively N is the total number of residues to which secondary structure was assigned Typically the data contain 32% α, 21% β, 47% γ Random prediction performance: 32% * % * % *0.47 = 37%  By-segment score Percentage of correctly predicted secondary structure elements Segment overlap can be computed as: minOV : length of the actual overlap maxOV: length of the total extent

Evaluating performance Elodie Laine – The data are separated between training set – to determine the parameters, and test set – to evaluate performance. There should be:  No significant sequence identity between training and test sets (<25%)  Representative test set to assess possible bias from training set  Results from a variety of methods for the test set (standard) A number of cross validations should be performed, e.g. with Jack knife procedure. The data are separated between training set – to determine the parameters, and test set – to evaluate performance. There should be:  No significant sequence identity between training and test sets (<25%)  Representative test set to assess possible bias from training set  Results from a variety of methods for the test set (standard) A number of cross validations should be performed, e.g. with Jack knife procedure. Score for the historic or most popular methods:  Chou & Fasman: 52%  GOR: 62%; GOR V: 73.5%  PHD: 73% Theoretical limit is estimated as 90%. Some proteins are difficult to predict, e.g. those displaying unusual characteristics and those essentially stabilized by tertiary interactions. Score for the historic or most popular methods:  Chou & Fasman: 52%  GOR: 62%; GOR V: 73.5%  PHD: 73% Theoretical limit is estimated as 90%. Some proteins are difficult to predict, e.g. those displaying unusual characteristics and those essentially stabilized by tertiary interactions.

Consensus methods Elodie Laine – Benchmarking results showed that structure prediction meta-servers which combine results from several independent prediction methods have the highest accuracy  Jpred (Cuff & Barton 1999) Q e =82% Large comparative analysis of secondary structure prediction algorithms motivated the development of a meta-server to strandardize inputs/outputs and combine the results. These methods were then replaced by a neural network program called Jnet.  CONCORD (Wei, 2011) Q e =83% Consensus scheme based On a mixed integer liNear optimization method for seCOndary stRucture preDiction utilising several popular methods, including PSIPRED, DSC, GOR IV,Predator,Prof, PROFphd and Sspro

ConclusionConclusion Elodie Laine – A secondary structure element is a contiguous segment of a protein sequence that presents a particular 3D geometry Protein secondary structure prediction can be a first step toward tertiary structure prediction PSSP algorithms historically rely on amino acid preferences for certain types of secondary structure to infer general rules The predictions can be refined by the use of multiple sequence alignments or some 3D-structural knowledge