Hybrid Fuzzy Neural Networks for Protein Secondary Structure Prediction
Yuchun Tang (1), Preeti Singh (1), Yanqing Zhang (1), Chung-Dar Lu (2) and Irene Weber (2)
(1) Department of Computer Science, Georgia State University, Atlanta, GA 30302-4110, USA
(2) Department of Biology, Georgia State University, Atlanta, GA 30303, USA

Contents
- Introduction
- Protein Secondary Structure Prediction
- Fuzzy Amino Acid Sets
- Hybrid Fuzzy Neural Network Architecture
- Simulations
- Conclusion

Introduction
- Proteins: the basis of cellular and molecular life
- 20 natural amino acids (ACDEFGHIKLMNPQRSTVWY) joined by peptide bonds
- The amino acid side chains (R) determine the structure and function of a protein
- The secondary structure of a protein is the folding or coiling of its polypeptide chains

Protein Secondary Structure
- The most commonly observed conformations in secondary structure are:
  - Alpha helix
  - Beta sheets/strands
  - Loops/coils/turns
- The conformation type is usually assigned from the dihedral angles along three consecutive residues
- Stable, well-defined secondary structure segments strongly influence the chain's folding

Protein Secondary Structure [3]
- Alpha helix: the structure repeats itself every 5.4 Angstroms along the helix axis; every main-chain CO and NH group is hydrogen bonded to a peptide bond 4 residues away
- Beta sheet: two or more polypeptide chains run alongside each other and are linked by hydrogen bonds

Secondary Structure Prediction Methods [3]
- Secondary structure prediction in three states (alpha helix, beta sheet, and coil) from sequence has reached an average accuracy of more than 70%
- Widely used and incorporated into many other modeling tools, such as tertiary structure prediction
- Methods include:
  - Statistical methods
  - Nearest neighbor approach
  - Neural networks approach
  - Hidden Markov models

Neural Networks Approach
- One of the most effective machine learning techniques for the analysis of biological sequences
- Strength: no rules about the problem being studied need to be built into the model
- The network extracts the rule (the relation between input and output) from a set of representative sequences
- The network is trained using sequence patterns/profiles whose structure is known
- The query sequence is then input and its output is calculated from the learned weights
- For a pattern similar to the training set, the network recalls the correct output; for a pattern not seen before, it attempts to generalize

Continued...
- The models use a sliding window of an odd number of consecutive residues (3, 5, 7, 9, ...) as the input to the network, to predict the secondary structure of the residue in the middle of the window
- The window incorporates the influence of neighboring residues into the prediction
- Normally there are three output nodes, each representing one class of secondary structure

Amino Acid Pattern Recognition Using Moving Windows
Single-letter codes: H -> alpha helix, E -> beta sheet, T -> coil
Protein sequence:    AABBBBCCCQQFFFAAAAQQBBA
Conformation class:  HHHHHTTEEEETTTTHHHHHH
5-residue windows: the first 'AABBB' -> H, the second 'ABBBB' -> H, and so on
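To make the windowing concrete, here is a minimal Python sketch of the moving-window extraction, using the toy sequence and conformation classes from this slide (the helper name make_windows is ours, not from the paper; the two strings are copied as given above).

def make_windows(sequence, labels, window=5):
    """Yield each window together with the label of its middle residue."""
    half = window // 2
    for i in range(half, len(sequence) - half):
        yield sequence[i - half:i + half + 1], labels[i]

seq    = "AABBBBCCCQQFFFAAAAQQBBA"   # toy protein sequence
states = "HHHHHTTEEEETTTTHHHHHH"     # toy conformation classes (H/E/T)

for win, label in list(make_windows(seq, states))[:2]:
    print(win, "->", label)          # AABBB -> H, then ABBBB -> H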

Continued...
- The 20 different amino acids need to be encoded
- Orthogonal encoding is common: each residue position uses 21 input nodes
- There are 3 outputs: (1,0,0) for helix, (0,1,0) for beta sheet, and (0,0,1) for coil
- Are there other ways to encode the 20 different amino acids?
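As an illustration of orthogonal encoding, the sketch below one-hot encodes a residue window. It assumes the 21st node is used for positions that fall off the end of the chain, which is one common reading of the "21 input nodes" above but is not confirmed by the slides.

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def orthogonal_encode(window):
    """One 21-bit vector per residue: 20 amino-acid bits plus one spacer bit
    for positions beyond the ends of the chain (an assumption, see above)."""
    vectors = []
    for ch in window:
        v = np.zeros(21)
        if ch in AMINO_ACIDS:
            v[AMINO_ACIDS.index(ch)] = 1.0
        else:                      # gap / chain terminus
            v[20] = 1.0
        vectors.append(v)
    return np.concatenate(vectors)

x = orthogonal_encode("MVLSE")     # 5-residue window -> 5 x 21 inputs
print(x.shape)                     # (105,)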

Data Preparation
- The original data set was obtained from the CMBI work of Hooft, Sander and Scharf
- Each entry gives the protein name in the first line, the amino acid sequence in the second line, and the secondary structure of each residue in the third line
- In this data, H, G, and I stand for alpha helix; E and B for beta sheet; and T and C for coil
- Example:
  >101M
  >MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKS
  > HHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHH
  >102L
  >MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSP
  > HHHHHHHHH EEEEEE TTS EEEETTEEEESSS
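A hedged sketch of how the three-line-per-protein format above could be parsed and collapsed to the three classes (H/G/I -> helix, E/B -> sheet, T/C -> coil). Only the class mapping comes from the slide; the parsing details are simplified and the function name is illustrative.

THREE_STATE = {"H": "H", "G": "H", "I": "H",   # alpha helix
               "E": "E", "B": "E",             # beta sheet
               "T": "C", "C": "C"}             # coil

def read_dataset(path):
    """Return (name, sequence, three-state labels) triples."""
    with open(path) as fh:
        lines = [ln.rstrip("\n") for ln in fh if ln.strip()]
    records = []
    for name, seq, struct in zip(lines[0::3], lines[1::3], lines[2::3]):
        seq = seq.lstrip("> ").strip()
        # letters not listed above (spaces, S, etc.) are treated as coil here
        labels = "".join(THREE_STATE.get(c, "C") for c in struct.lstrip(">"))
        records.append((name.lstrip("> ").strip(), seq, labels))
    return records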

Amino Acid Encoding Normalization Function
The set of 20 amino acids is represented by Σ = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}.
Definition 1. An amino acid encoding normalization function is defined by x' = F(x), where x ∈ Σ and the normalized amino acid value x' ∈ [0, 1].

Encoding of Amino Acid Data
- Orthogonal encoding is currently used, which requires many inputs, a lot of memory, and long convergence time
- A new coding scheme is proposed, based on the chemical properties of the amino acids such as being polar, acidic, basic, hydrophobic, or hydrophilic
- The coding scheme draws inspiration from solvent experiments with amino acids and the resulting grouping of similarly reacting amino acids

Experimental Data for Amino Acid Similarity

Encoding Normalization
- Amino acid ordering in which each amino acid can be substituted by its neighbor at the 95% confidence level: W, Y, F, L, I, M, V, A, P, C, S, T, G, N, D, E, Q, K, R, H
- Values for the training data set are taken in the region between 0 and 1
- All 20 amino acids were translated into numeric values:
  A = 0.40, C = 0.50, D = 0.75, E = 0.80, F = 0.15, G = 0.65, H = 0.99, I = 0.25, K = 0.90, L = 0.20,
  M = 0.30, N = 0.70, P = 0.45, Q = 0.85, R = 0.95, S = 0.55, T = 0.60, V = 0.35, W = 0.05, Y = 0.10
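Written out as a lookup table, the numeric code above can be applied directly to a residue string; the values below are a direct transcription of the slide, and the helper name encode is ours.

AA_VALUE = {
    "A": 0.40, "C": 0.50, "D": 0.75, "E": 0.80, "F": 0.15,
    "G": 0.65, "H": 0.99, "I": 0.25, "K": 0.90, "L": 0.20,
    "M": 0.30, "N": 0.70, "P": 0.45, "Q": 0.85, "R": 0.95,
    "S": 0.55, "T": 0.60, "V": 0.35, "W": 0.05, "Y": 0.10,
}

def encode(sequence):
    """Map a residue string to its normalized values in [0, 1]."""
    return [AA_VALUE[aa] for aa in sequence]

print(encode("EGEWQ"))   # [0.8, 0.65, 0.8, 0.05, 0.85]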

Continued...
- Suitable sequences were chosen from the original data set, with the same structure extending for a length of at least 7 amino acids
- The sequences were then converted using the new coding scheme
- Example: alpha helix sequences were picked and converted as follows:
  >EGEWQLVLHVWAKV
  >0.80 0.65 0.80 0.05 0.85 0.20 0.35 0.20 0.99 0.35 0.05 0.40 0.90 0.35
  >VAGHGQDILIRLFKS
  >0.35 0.40 0.65 0.99 0.65 0.85 0.75 0.25 0.20 0.25 0.95 0.20 0.15 0.90 0.55

Continued...
- Sequence patterns were then taken from the converted sequence, with the length depending on the window size of the network
- Example: from the coded alpha helix sequence EGEWQLVLHVWA, i.e. "0.80 0.65 0.80 0.05 0.85 0.20 0.35 0.20 0.99 0.35 0.05 0.40", for a 9-amino-acid network the training data patterns are:
  0.80 0.65 0.80 0.05 0.85 0.20 0.35 0.20 0.99
  0.65 0.80 0.05 0.85 0.20 0.35 0.20 0.99 0.35
  0.80 0.05 0.85 0.20 0.35 0.20 0.99 0.35 0.05
  0.05 0.85 0.20 0.35 0.20 0.99 0.35 0.05 0.40
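The pattern extraction above can be reproduced with a simple slice, as in this sketch using the encoded values from the example: twelve values and a 9-residue window give four overlapping training patterns.

values = [0.80, 0.65, 0.80, 0.05, 0.85, 0.20,
          0.35, 0.20, 0.99, 0.35, 0.05, 0.40]   # EGEWQLVLHVWA, encoded

window = 9
patterns = [values[i:i + window] for i in range(len(values) - window + 1)]
for p in patterns:
    print(p)   # the four 9-value patterns listed on the slide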

3 State BP Neural Network

Hybrid Network Architecture
[Diagram: the input sequence (x1, x2, x3, x4, ...) feeds three parallel networks - Neural Network 1 (for alpha helix), Neural Network 2 (for beta sheet), and Neural Network 3 (for coil/loop) - whose outputs o1, o2, o3 are combined as Output = f(o1, o2, o3)]

Hybrid Neural Network
- It consists of three separate neural networks N1, N2, and N3
- Each can predict only one type of structure (alpha helix, beta sheet, or coil); that is, each individual network is trained to recognize only one kind of structure
- A query sequence is input to all three networks and the outputs o1, o2, and o3 are calculated
- The maximum value decides the structure of the input residue sequence
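A minimal sketch of the combination step described above: the same encoded window is fed to the three one-class networks and the largest output wins. predict_helix, predict_sheet, and predict_coil are placeholders for the trained networks N1-N3, not the authors' code.

def classify(window, predict_helix, predict_sheet, predict_coil):
    """Pick the class whose network gives the largest output."""
    scores = {
        "H": predict_helix(window),   # o1
        "E": predict_sheet(window),   # o2
        "C": predict_coil(window),    # o3
    }
    return max(scores, key=scores.get)

# usage with dummy scorers:
print(classify([0.8, 0.65, 0.8], lambda w: 0.9, lambda w: 0.1, lambda w: 0.2))  # H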

Continued...
- For a 3-state network, the outputs are denoted by: alpha helix = 0.83, beta sheet = 0.50, coil = 0.15
- For the hybrid network, the outputs are denoted by:
  N1 - alpha helix = 0.9; non-alpha helix = 0.1
  N2 - beta sheet = 0.9; non-beta sheet = 0.1
  N3 - coil/turn = 0.9; non-coil/turn = 0.1

Fuzzy Amino Acid Sets
Definition 2. A fuzzy amino acid set A in Σ is defined as a set of ordered pairs A = {(x, μ_A(x)) | x ∈ Σ}, where x is the normalized amino acid value and μ_A(x) is the fuzzy amino acid membership function, which maps an amino acid x to a membership degree between 0 and 1 (μ_A(x) ∈ [0, 1]).

A Fuzzy Amino Acid Set
There are 20! = 2,432,902,008,176,640,000 ≈ 2.4 x 10^18 possible sequences (orderings) of the 20 amino acids for defining fuzzy amino acid sets.
[Diagram: membership degree (0 to 1) versus normalized amino acid value for the fuzzy sets "Small", "Middle", and "Large", with the 20 amino acids A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y placed along the value axis]

Fuzzy Amino Acid Rules
(1) Fuzzy rules using numerical linguistic values:
  IF X is "Small" AND Y is "Small" THEN Z is "Large"
  ...
  IF X is "Large" AND Y is "Large" THEN Z is "Small"
(2) Fuzzy rules using biological linguistic values:
  IF X is "A,C,D,E,F,G" AND Y is "A,C,D,E,F,G" THEN Z is "R,S,T,V,W,Y"
  IF X is "R,S,T,V,W,Y" AND Y is "R,S,T,V,W,Y" THEN Z is "A,C,D,E,F,G"
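For illustration only, the sketch below fires the first numerical-linguistic rule with min() as the fuzzy AND. The membership function is a toy triangular shape, not the Gaussian sets used later, and the actual inference scheme in the hybrid network may differ.

def small(x):
    """Toy 'Small' membership over the normalized value range [0, 1]."""
    return max(0.0, 1.0 - 2.0 * x)

def rule_small_small_then_large(x, y):
    """Firing strength of: IF X is Small AND Y is Small THEN Z is Large."""
    return min(small(x), small(y))

print(rule_small_small_then_large(0.1, 0.2))   # -> 0.6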

Fuzzy Neural Network

Fuzzy Amino Acid Sets with Gaussian Membership Functions
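A minimal sketch of a Gaussian membership function over the normalized amino acid values; the center and width below are illustrative choices, not the trained parameters from the paper.

import math

def gaussian_membership(x, center, width):
    """Degree to which value x belongs to a Gaussian fuzzy amino acid set."""
    return math.exp(-((x - center) ** 2) / (2.0 * width ** 2))

# e.g. how strongly T (0.60) belongs to an illustrative "Middle" set
print(round(gaussian_membership(0.60, center=0.5, width=0.15), 2))   # 0.8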

Fuzzy Neural Learning Algorithm
We then obtain a training algorithm that adjusts:
- the centers of the output membership functions
- the widths of the output membership functions
- the centers of the input membership functions
[Update equations shown on the original slide]

Simulation Results

  Type of NN    | Average Prediction Accuracy
  --------------+----------------------------
  HNN using BP  | 59.9%
  HFNN          | 75.1%

Conclusion
- Amino acid encoding normalization methods are proposed
- Fuzzy amino acid sets are discussed
- The hybrid fuzzy neural network is more effective than the traditional BP-based neural network in terms of prediction accuracy and speed

Future Work
- The over-fitting problem
- Fuzzy biological knowledge discovery
- Granular neural networks
- Other relevant intelligent techniques will be added

References
[1] Qian, N. and Sejnowski, T. J., "Predicting the secondary structure of globular proteins using neural network models", 1988.
[2] Rost, B. and Sander, C., "Prediction of protein secondary structure at better than 70% accuracy", 1993.
[3] Preeti Singh, "Protein Secondary Structure Prediction Using Neural Networks", M.S. Thesis, Aug. 2003.
[4] Y.-Q. Zhang, M. D. Fraser, R. A. Gagliano and A. Kandel, "Granular Neural Networks for Numerical-Linguistic Data Fusion and Knowledge Discovery", IEEE Transactions on Neural Networks, 11(3): 658-667, 2000.

Acknowledgement
This research is partially supported by NIH P20 GM065762 (NIH Planning Grant: "Georgia State University Biomedical Computing Center").

Thank you! Questions?