Protein Secondary Structure Prediction Dong Xu Computer Science Department 271C Life Sciences Center 1201 East Rollins Road University of Missouri-Columbia.

Slides:



Advertisements
Similar presentations
Secondary structure prediction from amino acid sequence.
Advertisements

Prediction to Protein Structure Fall 2005 CSC 487/687 Computing for Bioinformatics.
1 Protein Structure, Structure Classification and Prediction Bioinformatics X3 January 2005 P. Johansson, D. Madsen Dept.of Cell & Molecular Biology, Uppsala.
Protein Secondary Structures
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Homology Modeling Anne Mølgaard, CBS, BioCentrum, DTU.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU April 8, 2003Claus Lundegaard Protein Secondary Structures Assignment and prediction.
Sequence analysis June 20, 2006 Learning objectives-Understand sliding window programs. Understand difference between identity, similarity and homology.
Predicting local Protein Structure Morten Nielsen.
Protein-a chemical view A chain of amino acids folded in 3D Picture from on-line biology bookon-line biology book Peptide Protein backbone N / C terminal.
1 Levels of Protein Structure Primary to Quaternary Structure.
Structure Prediction. Tertiary protein structure: protein folding Three main approaches: [1] experimental determination (X-ray crystallography, NMR) [2]
Sequence analysis June 18, 2008 Learning objectives-Understand the concept of sliding window programs. Understand difference between identity, similarity.
Protein Structure Databases Databases of three dimensional structures of proteins, where structure has been solved using X-ray crystallography or nuclear.
Protein Structure Modeling (1). Protein Folding Problem A protein folds into a unique 3D structure under physiological conditions Lysozyme sequence: KVFGRCELAA.
Protein Secondary Structures Assignment and prediction.
Sequence analysis June 19, 2007 Learning objectives-Understand the concept of sliding window programs. Understand difference between identity, similarity.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU April 8, 2003Claus Lundegaard Protein Secondary Structures Assignment and prediction.
Introduction to Structural Bioinformatics Dong Xu Computer Science Department 271C Life Sciences Center 1201 East Rollins Road University of Missouri-Columbia.
Sequence analysis June 17, 2003 Learning objectives-Review amino acids structures. Understand sliding window programs. Understand difference between identity,
Thomas Blicher Center for Biological Sequence Analysis
It & Health 2009 Summary Thomas Nordahl Petersen.
Protein Secondary Structures Assignment and prediction.
Introduction to bioinformatics
CISC667, F05, Lec20, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) Protein Structure Prediction Protein Secondary Structure.
Protein Secondary Structures Assignment and prediction Pernille Haste Andersen
Structure Prediction in 1D
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU October 29, 2004Claus Lundegaard Protein Secondary Structures Assignment and.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Homology Modelling Thomas Blicher Center for Biological Sequence Analysis.
Protein Secondary Structures Assignment and prediction.
Predicting local Protein Structure Morten Nielsen.
Protein Structure July 2, 2006 Learning objectives-Understand the basis of the secondary structure prediction program- Psi-PRED. Introduce the concept.
Class 7: Protein Secondary Structure
Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.
Protein Structure Elements Primary to Quaternary Structure.
Protein Structure FDSC400. Protein Functions Biological?Food?
Protein Structures.
Burkhard Rost (Columbia New York) Some gory details of protein secondary structure prediction Burkhard Rost CUBIC Columbia University
Protein structure prediction
What are proteins? Proteins are important; e.g. for catalyzing and regulating biochemical reactions, transporting molecules, … Linear polymer chain composed.
Rising accuracy of protein secondary structure prediction Burkhard Rost
Proteins Secondary Structure Predictions Structural Bioinformatics.
Prediction to Protein Structure Fall 2005 CSC 487/687 Computing for Bioinformatics.
©CMBI 2006 Amino Acids “ When you understand the amino acids, you understand everything ”
Protein Secondary Structure Prediction Some of the slides are adapted from Dr. Dong Xu’s lecture notes.
Protein Secondary Structure Prediction. Input: protein sequence Output: for each residue its associated Secondary structure (SS): alpha-helix, beta-strand,
Study of Loop Length & Residue Composition of β-Hairpin Motif
Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices Yan Liu Sep 29, 2003.
Neural Networks for Protein Structure Prediction Brown, JMB 1999 CS 466 Saurabh Sinha.
Protein Secondary Structure Prediction
Secondary structure prediction
2 o structure, TM regions, and solvent accessibility Topic 13 Chapter 29, Du and Bourne “Structural Bioinformatics”
Web Servers for Predicting Protein Secondary Structure (Regular and Irregular) Dr. G.P.S. Raghava, F.N.A. Sc. Bioinformatics Centre Institute of Microbial.
Protein structure prediction May 26, 2011 HW #8 due today Quiz #3 on Tuesday, May 31 Learning objectives-Understand the biochemical basis of secondary.
Protein Structure Prediction
Protein secondary structure Prediction Why 2 nd Structure prediction? The problem Seq: RPLQGLVLDTQLYGFPGAFDDWERFMRE Pred:CCCCCHHHHHCCCCEEEECCHHHHHHCC.
Protein Secondary Structure Prediction G P S Raghava.
1 Protein Structure Prediction (Lecture for CS397-CXZ Algorithms in Bioinformatics) April 23, 2004 ChengXiang Zhai Department of Computer Science University.
Protein Structure Prediction ● Why ? ● Type of protein structure predictions – Sec Str. Pred – Homology Modelling – Fold Recognition – Ab Initio ● Secondary.
Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.
Proteins Secondary Structure Predictions
Secondary Structure Prediction Lecture 7 Structural Bioinformatics Dr. Avraham Samson
Comparative methods Basic logics: The 3D structure of the protein is deduced from: 1.Similarities between the protein and other proteins 2.Statistical.
Machine Learning Methods of Protein Secondary Structure Prediction Presented by Chao Wang.
Protein structure prediction Haixu Tang School of Informatics.
Proteins Structure Predictions Structural Bioinformatics.
Predicting Structural Features Chapter 12. Structural Features Phosphorylation sites Transmembrane helices Protein flexibility.
Improved Protein Secondary Structure Prediction. Secondary Structure Prediction Given a protein sequence a 1 a 2 …a N, secondary structure prediction.
Introduction to Bioinformatics II
Protein Structures.
Levels of Protein Structure
Presentation transcript:

Protein Secondary Structure Prediction Dong Xu Computer Science Department 271C Life Sciences Center 1201 East Rollins Road University of Missouri-Columbia Columbia, MO (O)

Outline  What is Secondary Structure  Introduction to Secondary Structure Prediction  Chou-Fasman Method  Nearest Neighbor Method  Neural Network Method

Structures in Protein Language: Letters  Words  Sentences Protein: Residues  Secondary Structure  Tertiary Structure

 helix  Single protein chain (local)  Shape maintained by intramolecular H bonding between -C=O and H-N-

 sheet  Several protein chains  Shape maintained by intramolecular H bonding between chains  Non-local on protein sequence

 -sheet (parallel, anti-parallel)

Classification of secondary structure l Defining features å Dihedral angles å Hydrogen bonds å Geometry l Assigned manually by experimentalists l Automatic å DSSP (Kabsch & Sander,1983) å STRIDE (Frishman & Argos, 1995) å Continuum (Andersen et al.)

Classification l Eight states from DSSP  H:  helix  G: 3 10 helix  I:  -helix  E:  strand  B: bridge  T:  turn  S: bend  C: coil l CASP Standard  H = (H, G, I), E = (E, B), C = (C, T, S) E H < S R H < S N < K ! C I E -cd 58 89B L E -cd 59 90B V E -cd 60 91B G E -cd 61 92B 0

Dihedral angles

Ramachandran plot (alpha)

Ramachandran plot (beta)

Outline  What is Secondary Structure  Introduction to Secondary Structure Prediction  Chou-Fasman Method  Nearest Neighbor Method  Neural Network Method

What is secondary structure prediction? l Given a protein sequence (primary structure) HWIATGQLIREAYEDYSS GHWIATRGQLIREAYEDYRHFSSECPFIP l Predict its secondary structure content (C=Coils H=Alpha Helix E=Beta Strands) EEEEEHHHHHHHHHHHHH CEEEEECHHHHHHHHHHHCCCHHCCCCCC

Why secondary structure prediction? o An easier problem than 3D structure prediction (more than 40 years of history). o Accurate secondary structure prediction can be an important information for the tertiary structure prediction o Protein function prediction o Protein classification o Predicting structural change

Prediction methods o Statistical method oChou-Fasman method, GOR I-IV oNearest neighbors oNNSSP, SSPAL oNeural network oPHD, Psi-Pred, J-Pred oSupport vector machine (SVM) oHMM

Accuracy measure l Three-state prediction accuracy: Q 3 correctly predicted residues number of residues l A prediction of all loop: Q 3 ~ 40% l Correlation coefficients

Improvement of accuracy 1974 Chou & Fasman~50-53% 1978 Garnier63% 1987 Zvelebil66% 1988 Qian & Sejnowski64.3% 1993 Rost & Sander % 1997 Frishman & Argos<75% 1999 Cuff & Barton72.9% 1999 Jones76.5% 2000 Petersen et al.77.9%

Prediction accuracy (EVA)

How far can we go?  Currently ~76%  1/5 of proteins with more than 100 homologs  >80%  Assignment is ambiguous (5-15%).  non-unique protein structures, H-bond cutoff, etc.  Some segments can have multiple structure types.  Different secondary structures between homologues (~12%). Prediction limit  88%.  Non-locality.

Assumptions o The entire information for forming secondary structure is contained in the primary sequence. o Side groups of residues will determine structure. o Examining windows of residues is sufficient to predict structure. o Basis for window size selection:   -helices 5 – 40 residues long   -strands 5 – 10 residues long

Outline  What is Secondary Structure  Introduction to Secondary Structure Prediction  Chou-Fasman Method  Nearest Neighbor Method  Neural Network Method

Secondary structure propensity l From PDB database, calculate the propensity for a given amino acid to adopt a certain ss-type l Example: #Ala=2,000, #residues=20,000, #helix=4,000, #Ala in helix=500 P( ,aa i ) = 500/20,000, p(  p(aa i ) = 2,000/20,000 P = 500 / (4,000/10) = 1.25

Chou-Fasman algorithm l Helix, Strand 1. Scan for window of 6 residues where average score > 1 (4 residues for helix and 3 residues for strand) 2. Propagate in both directions until 4 (or 3) residue window with mean propensity < 1 3. Move forward and repeat l Conflict solution Any region containing overlapping alpha-helical and beta-strand assignments are taken to be helical if the average P(helix) > P(strand). It is a beta strand if the average P(strand) > P(helix). l Accuracy: ~50%  ~60% GHWIATRGQLIREAYEDYRHFSSECPFIP

Initiation Identify regions where 4/6 have a P(H) >1.00 “alpha-helix nucleus”

Propagation Extend helix in both directions until a set of four residues have an average P(H) <1.00.

Outline  What is Secondary Structure  Introduction to Secondary Structure Prediction  Chou-Fasman Method  Nearest Neighbor Method  Neural Network Method

Nearest neighbor method o Predict secondary structure of the central residue of a given segment from homologous segments (neighbors)  (i) From database, find some number of the closest sequences to a subsequence defined by a window around the central residue, or  (ii) Compute K best non-intersecting local alignments of a query sequence with each sequence. o Use max (n , n , n c ) for neighbor consensus or max(s , s , s c ) for consensus sequence hits

Environment preference score l Each amino acid has a preference to a specific structural environments. l Structural variables: å secondary structure, solvent accessibility l Non-redundant protein structure database: FSSP

“Singleton” score matrix Helix SheetLoop Buried Inter Exposed Buried Inter Exposed Buried Inter Exposed ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL

Total score l Alignment score is the sum of score in a window of length l: T R G Q L I R i-4 i-3 i-2 i-1 i i+1 i+2 i+3 i+4 E A Y E D Y R H F S S E C P F I P...E C Y E Y B R H R.. j-4 j-3 j-2 j-1 j j+1 j+2 j+3 j+4 | | | | | L H H H H H H L L

Neighbors 1 - L H H H H H H L L - S L L H H H H H L L - S L E E E E E E L L - S L E E E E E E L L - S 4 n - L L L L E E E E E - S n n+1 - H H H L L L E E E - S n+1 :  max (n , n , n L ) or max (  s ,  s ,  s L )

Evolutionary information l “All naturally evolved proteins with more than 35% pairwise identical residues over more than 100 aligned residues have similar structures.” l Stability of structure w.r.t. sequence divergence (<12% difference in secondary structure). l Position-specific sequence profile, containing crucial information on evolution of protein family, can help secondary structure prediction (increase information content). l Gaps rarely occur in helix and strand. l ~1.4%/year increase in Q3 due to database growth during past ~10 years.

How to use it l Sequence-profile alignment. l Compare a sequence against protein family. l More specific. l BLAST vs. PSI-BLAST. l Look up PSSM instead of PAM or BLOSUM. position amino acid type

Outline  What is Secondary Structure  Introduction to Secondary Structure Prediction  Chou-Fasman Method  Nearest Neighbor Method  Neural Network Method

Neurons normal state addictive state

Neural network Input signals are summed and turned into zero or one 3. J1J1 J2J2 J3J3 J4J4 Feed-forward multilayer network Input layerHidden layerOutput layer neurons

Enter sequences Compare Prediction to Reality Adjust Weights Neural network training

Simple neural network Error = | out_net – out_desired |

Training a neural network

Simple neural network with hidden layer

Neural network for secondary structure

PsiPred l D. Jones, J. Mol. Boil. 292, 195 (1999). l Method : Neural network l Input data : PSSM generated by PSI-BLAST l Bigger and better sequence database å Combining several database and data filtering l Training and test sets preparation å Secondary structure prediction only makes sense for proteins with no homologous structure. å No sequence & structural homologues between training and test sets by PSI-BLAST (mimicking realistic situation).

Psi-Pred (details) l Window size = 15 l Two networks l First network (sequence-to-structure): å 315 = (20 + 1)  15 inputs å extra unit to indicate where the windows spans either N or C terminus å Data are scaled to [0-1] range by using 1/[1+exp(-x)] å 75 hidden units å 3 outputs (H, E, L) l Second network (structure-to-structure): å Structural correlation between adjacent sequences å 60 = (3 + 1)  15 inputs å 60 hidden units å 3 outputs l Accuracy ~76%

Reading Assignments l Suggested reading: å Chapter 15 in “Current Topics in Computational Molecular Biology, edited by Tao Jiang, Ying Xu, and Michael Zhang. MIT Press ” l Optional reading: å Review by Burkhard Rost: ev_dekker/paper.html ev_dekker/paper.html

Develop a program that implements Chou- Fasman Algorithm 1. TA will give you a matrix table of Chou-Fasman indices 2. Using the FASTA as input format for sequence 3. Output format: Project Assignment KVFGRCELAA AMKRHGLDNY RGYSLGNWVC AAKFESNFNT QATNRNTDGS HHHHHH HHHH HHHHHH HHHHHH EEE TDYGILQINS RWWCNDGRTP GSRNLCNIPC EEE EE