
1 Hybrid Fuzzy Neural Networks for Protein Secondary Structure Prediction
Yuchun Tang (1), Preeti Singh (1), Yanqing Zhang (1), Chung-Dar Lu (2) and Irene Weber (2). (1) Department of Computer Science, Georgia State University, Atlanta, GA, USA. (2) Department of Biology, Atlanta, GA, USA.

2 Contents
Introduction
Protein Secondary Structure Prediction
Fuzzy Amino Acid Sets
Hybrid Fuzzy Neural Network Architecture
Simulations
Conclusion

3 Introduction
Proteins: the basis of cellular and molecular life. There are 20 natural amino acids (ACDEFGHIKLMNPQRSTVWY), joined by peptide bonds. The amino acid side chains (R groups) determine the structure and function of a protein. The secondary structure of a protein is the folding or coiling of its polypeptide chains.

4 Protein Secondary Structure
The most commonly observed conformations in secondary structure are:
Alpha helix
Beta sheets/strands
Loops/coils/turns
The type is usually assigned from the dihedral angles along 3 consecutive residues. Stable, well-defined secondary structure segments strongly influence the chain's folding.

5 Protein Secondary Structure [3]
Alpha helix: the structure repeats itself every 5.4 Angstroms along the helix axis; every main-chain CO and NH group is hydrogen bonded to a peptide bond 4 residues away.
Beta sheet: two or more polypeptide chains run alongside each other and are linked by hydrogen bonds.

6 Secondary Structure Prediction Methods [3]
Secondary structure prediction in three states (alpha helix, beta sheet, and coil) from sequence has reached an average accuracy of more than 70%. It is widely used and incorporated into many other modeling tools, such as tertiary structure prediction. Methods include:
Statistical methods
Nearest neighbor approach
Neural networks approach
Hidden Markov models

7 Neural Networks Approach
One of the most efficient machine learning techniques for the analysis of biological sequences. Its strength is that no rules about the problem being studied need to be incorporated in the model; the network extracts the rule (the relation between input and output) from a set of representative sequences. The network is trained using sequence patterns/profiles whose structure is known. The query sequence is then input and its output value is calculated from the learned weights. For a pattern similar to the training set, the network recalls the correct output; for a pattern not seen before, the network attempts to generalize.

8 Continued.. The models use a sliding window of an odd number of consecutive residues (3, 5, 7, 9, ...) as the input to the network to predict the secondary structure of the residue in the middle of the window. The window is used to incorporate the influence of the neighbors into the prediction. Normally there are three output nodes, each representing one class of secondary structure.

9 Amino Acid Pattern Recognition Using Moving Windows
Single letter codes: H -> alpha helix, E -> beta sheet, T -> coil
Protein sequence: AABBBBCCCQQFFFAAAAQQBBA
Conformation class: HHHHHTTEEEETTTTHHHHHH
5-residue windows: the first 'AABBB' -> H, the second 'ABBBB' -> H, and so on; each window is labeled with the class of its middle residue.
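A minimal sketch of this windowing step (the function name extract_windows and the padded label string are illustrative, not from the original slides):

```python
def extract_windows(sequence, structure, window=5):
    """Slide a fixed-size window along the sequence and label each
    window with the secondary-structure class of its middle residue."""
    half = window // 2
    samples = []
    for i in range(half, len(sequence) - half):
        samples.append((sequence[i - half:i + half + 1], structure[i]))
    return samples

# Toy example from the slide (window size 5)
seq = "AABBBBCCCQQFFFAAAAQQBBA"
ss  = "HHHHHTTEEEETTTTHHHHHHHH"  # illustrative labels, padded to the sequence length
print(extract_windows(seq, ss)[:2])  # [('AABBB', 'H'), ('ABBBB', 'H')]
```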

10 Continued.. Need to encode the 20 different amino acids.
Orthogonal encoding is common: for each residue position there are 21 input nodes. There are 3 outputs: (1,0,0) for helix, (0,1,0) for beta sheet and (0,0,1) for coil. Are there other ways to encode the 20 different amino acids?
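A minimal sketch of orthogonal (one-hot) encoding. The slide does not say what its 21st node represents; the sketch assumes it marks spacer/unknown positions, which is a common convention:

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 amino acids; index 20 reserved for spacer/unknown (assumption)

def orthogonal_encode(window):
    """Encode a residue window as a flat vector with 21 input nodes per position."""
    vec = np.zeros((len(window), 21))
    for i, aa in enumerate(window):
        idx = ALPHABET.index(aa) if aa in ALPHABET else 20  # non-standard symbol -> 21st node
        vec[i, idx] = 1.0
    return vec.ravel()

print(orthogonal_encode("AABBB").shape)  # (105,) for a 5-residue window
```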

11 Data Preparation
The original data set was obtained from the CMBI work of Hooft, Sander and Scharf. Each entry gives the protein name on the first line, the amino acid sequence on the second line, and the secondary structure of each residue on the third line. In this data, H, G and I denote alpha helix; E and B denote beta sheet; and T and C denote coil. Example:
>101M
>MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKS
>    HHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHH
>102L
>MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSP
>   HHHHHHHHH EEEEEE TTS EEEETTEEEESSS

12 Amino Acid Encoding Normalization Function
The set of 20 amino acids is represented by Σ = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}.
Definition 1. An amino acid encoding normalization function is defined by x' = F(x), where x ∈ Σ and the normalized amino acid value x' ∈ [0, 1].

13 Encoding of Amino Acid Data
Currently orthogonal encoding is used, which requires many inputs, a lot of memory, and long convergence time. A new coding scheme is proposed, based on chemical properties of the amino acids such as polar, acidic, basic, hydrophobic, and hydrophilic. This coding scheme draws inspiration from experiments that use solvents on amino acids and the resulting grouping of similarly reacting amino acids.

14 Experimental Data for Amino Acid Similarity

15 Encoding Normalization
The amino acids are ordered so that each amino acid can be substituted by its neighbor at a 95% confidence level: W, Y, F, L, I, M, V, A, P, C, S, T, G, N, D, E, Q, K, R, H. Values for the training data set are taken in the region between 0 and 1. All 20 amino acids were translated into numeric values:
A = 0.40, C = 0.50, D = 0.75, E = 0.80, F = 0.15, G = 0.65, H = 0.99, I = 0.25, K = 0.90, L = 0.20, M = 0.30, N = 0.70, P = 0.45, Q = 0.85, R = 0.95, S = 0.55, T = 0.60, V = 0.35, W = 0.05, Y = 0.10
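A minimal sketch of this normalization as a lookup table (the names AA_VALUE and encode_normalized are illustrative; the values are the ones listed above):

```python
# Normalized amino acid values from the slide (0-1 scale, similarity ordering)
AA_VALUE = {
    'A': 0.40, 'C': 0.50, 'D': 0.75, 'E': 0.80, 'F': 0.15,
    'G': 0.65, 'H': 0.99, 'I': 0.25, 'K': 0.90, 'L': 0.20,
    'M': 0.30, 'N': 0.70, 'P': 0.45, 'Q': 0.85, 'R': 0.95,
    'S': 0.55, 'T': 0.60, 'V': 0.35, 'W': 0.05, 'Y': 0.10,
}

def encode_normalized(sequence):
    """Map a one-letter amino acid sequence to its normalized numeric values."""
    return [AA_VALUE[aa] for aa in sequence]

print(encode_normalized("EGEWQ"))  # [0.8, 0.65, 0.8, 0.05, 0.85]
```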

16 Continued.. Suitable sequences were chosen from the original data set, with the same structure extending for a length of at least 7 amino acids. The sequences were then converted using the new coding scheme. Example: an alpha helix sequence was picked and converted as follows (values from the table above):
EGEWQLVLHVWAKV -> 0.80 0.65 0.80 0.05 0.85 0.20 0.35 0.20 0.99 0.35 0.05 0.40 0.90 0.35
VAGHGQDILIRLFKS -> 0.35 0.40 0.65 0.99 0.65 0.85 0.75 0.25 0.20 0.25 0.95 0.20 0.15 0.90 0.55

17 Continued.. Sequence patterns were then taken from the converted sequence, the length depending on the window size of the network. Example: from the coded alpha helix sequence EGEWQLVLHVWA, i.e. "0.80 0.65 0.80 0.05 0.85 0.20 0.35 0.20 0.99 0.35 0.05 0.40", a 9-amino-acid network gives the training data patterns:
0.80 0.65 0.80 0.05 0.85 0.20 0.35 0.20 0.99
0.65 0.80 0.05 0.85 0.20 0.35 0.20 0.99 0.35
0.80 0.05 0.85 0.20 0.35 0.20 0.99 0.35 0.05
0.05 0.85 0.20 0.35 0.20 0.99 0.35 0.05 0.40
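A small check of this windowing over the coded sequence (variable names are illustrative):

```python
# Coded alpha helix sequence EGEWQLVLHVWA, using the normalization table above
coded = [0.80, 0.65, 0.80, 0.05, 0.85, 0.20, 0.35, 0.20, 0.99, 0.35, 0.05, 0.40]
window = 9
patterns = [coded[i:i + window] for i in range(len(coded) - window + 1)]
for p in patterns:
    print(p)  # 4 patterns of length 9, as on the slide
```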

18 3 State BP Neural Network
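The original slide showed a network diagram. As a hedged illustration only (not the authors' exact architecture, layer sizes, or hyperparameters), here is a minimal one-hidden-layer backpropagation network with three output nodes, one per secondary-structure class:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ThreeStateBPNet:
    """Minimal one-hidden-layer network trained with plain backpropagation.
    Input: a numerically encoded residue window; output: 3 nodes
    (helix, sheet, coil). Sizes and learning rate are illustrative."""
    def __init__(self, n_in, n_hidden=10, lr=0.5):
        self.W1 = rng.normal(0, 0.5, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.5, (n_hidden, 3))
        self.b2 = np.zeros(3)
        self.lr = lr

    def forward(self, x):
        self.h = sigmoid(x @ self.W1 + self.b1)
        self.o = sigmoid(self.h @ self.W2 + self.b2)
        return self.o

    def train_step(self, x, target):
        o = self.forward(x)
        # Squared-error gradients propagated through the sigmoid units
        delta_o = (o - target) * o * (1 - o)
        delta_h = (delta_o @ self.W2.T) * self.h * (1 - self.h)
        self.W2 -= self.lr * np.outer(self.h, delta_o)
        self.b2 -= self.lr * delta_o
        self.W1 -= self.lr * np.outer(x, delta_h)
        self.b1 -= self.lr * delta_h

# Toy usage: one 9-residue window of normalized values, target = helix (1,0,0)
net = ThreeStateBPNet(n_in=9)
x = np.array([0.80, 0.65, 0.80, 0.05, 0.85, 0.20, 0.35, 0.20, 0.99])
for _ in range(200):
    net.train_step(x, np.array([1.0, 0.0, 0.0]))
print(net.forward(x))  # the first output should approach 1
```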

19 Hybrid Network Architecture
[Diagram: the input sequence (x1, x2, x3, x4, ...) feeds three parallel networks - Neural Network 1 (for alpha helix, output o1), Neural Network 2 (for beta sheet, output o2), and Neural Network 3 (for coil/loop, output o3); the final output = f(o1, o2, o3).]

20 Hybrid Neural Network It consists of three separate neural networks N1, N2 and N3. Each can predict only one type of structure (alpha helix, beta sheet or coil); that is, each individual network is trained to recognize only one kind of structure. A query sequence is input to all three networks and the outputs o1, o2 and o3 are calculated. The maximum value decides the structure of the input residue sequence.

21 Continued.. For a 3-state network, the outputs are of the form:
Alpha helix = 0.83, Beta sheet = 0.50, Coil = 0.15
For the hybrid network, the outputs are of the form:
N1 - Alpha helix = 0.9; Non alpha helix = 0.1
N2 - Beta sheet = 0.9; Non beta sheet = 0.1
N3 - Coil/Turn = 0.9; Non coil/turn = 0.1
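A minimal sketch of the maximum-output decision rule described on the previous slide (the function name predict_structure is illustrative):

```python
def predict_structure(o1, o2, o3):
    """Pick the class whose dedicated network gives the largest output."""
    outputs = {"alpha helix": o1, "beta sheet": o2, "coil": o3}
    return max(outputs, key=outputs.get)

print(predict_structure(0.9, 0.1, 0.1))  # 'alpha helix'
```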

22 Fuzzy Amino Acid Sets
Definition 2. A fuzzy amino acid set A is defined as a set of ordered pairs A = {(x, μA(x))}, where x is the normalized amino acid value and μA(x) is the fuzzy amino acid membership function, which maps an amino acid value x to a membership degree between 0 and 1 (μA(x) ∈ [0, 1]).

23 A Fuzzy Amino Acid Set
There are 20! = 2,432,902,008,176,640,000 ≈ 2.4 × 10^18 possible sequences (orderings) of the 20 amino acids A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y for defining Fuzzy Amino Acid Sets.
[Figure: membership functions labeled "Small", "Middle" and "Large" over the normalized amino acid value axis.]

24 Fuzzy Amino Acid Rules
(1) Fuzzy rules using numerical linguistic values:
IF X is "Small" and Y is "Small" THEN Z is "Large"
...
IF X is "Large" and Y is "Large" THEN Z is "Small"
(2) Fuzzy rules using biological linguistic values:
IF X is "A,C,D,E,F,G" and Y is "A,C,D,E,F,G" THEN Z is "R,S,T,V,W,Y"
IF X is "R,S,T,V,W,Y" and Y is "R,S,T,V,W,Y" THEN Z is "A,C,D,E,F,G"
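The slides do not specify the inference operators. As one common choice (an assumption, not necessarily the authors' method), the sketch below evaluates the firing strength of a numerical-linguistic rule using min for AND and triangular membership functions for "Small", "Middle" and "Large":

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b (assumed shape, not from the slides)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Assumed linguistic values over the normalized amino acid range [0, 1]
small  = lambda x: tri(x, -0.5, 0.0, 0.5)
middle = lambda x: tri(x,  0.0, 0.5, 1.0)
large  = lambda x: tri(x,  0.5, 1.0, 1.5)

# Firing strength of: IF X is "Small" AND Y is "Small" THEN Z is "Large"
x, y = 0.15, 0.25   # e.g. the normalized values of F and I
strength = min(small(x), small(y))
print(strength)  # 0.5 -> the rule fires with moderate strength
```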

25 Fuzzy Neural Network

26 Fuzzy Amino Acid Sets with Gaussian Membership Functions
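The slide title refers to Gaussian membership functions; a minimal sketch follows, with centers and widths chosen here purely for illustration:

```python
import math

def gaussian_membership(x, center, width):
    """Gaussian membership degree of a normalized amino acid value x."""
    return math.exp(-((x - center) ** 2) / (2.0 * width ** 2))

# Illustrative fuzzy amino acid sets "Small", "Middle", "Large"
for name, c in [("Small", 0.0), ("Middle", 0.5), ("Large", 1.0)]:
    print(name, round(gaussian_membership(0.40, c, 0.2), 3))  # membership of A = 0.40
```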

27 Fuzzy Neural Learning Algorithm
We then obtain a training algorithm that adjusts:
the centers of the output membership functions
the widths of the output membership functions
the centers of the input membership functions
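As a hedged sketch (the slide's exact update equations are not reproduced in this transcript), fuzzy neural learning of this kind typically applies gradient descent to each of the three parameter groups, for an error function E and learning rate η:

```latex
c^{out}_{k}(t+1) = c^{out}_{k}(t) - \eta \,\frac{\partial E}{\partial c^{out}_{k}}, \qquad
\sigma^{out}_{k}(t+1) = \sigma^{out}_{k}(t) - \eta \,\frac{\partial E}{\partial \sigma^{out}_{k}}, \qquad
c^{in}_{j}(t+1) = c^{in}_{j}(t) - \eta \,\frac{\partial E}{\partial c^{in}_{j}}
```

Here c^out and σ^out are the centers and widths of the output membership functions and c^in are the centers of the input membership functions.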

28 Simulation Results
Type of NN      Average Prediction Accuracy
HNN using BP    59.9%
HFNN            75.1%

29 Conclusion Amino Acid Encoding Normalization methods are proposed.
Fuzzy Amino Acid Sets are discussed. The Hybrid Fuzzy Neural Network is more effective than the traditional BP-based neural network in terms of prediction accuracy and speed.

30 Future Work
Over-fitting problem.
Fuzzy biological knowledge discovery.
Granular neural networks.
Other relevant intelligent techniques will be added.

31 References
[1] Qian, N. and Sejnowski, T. J., "Predicting the secondary structure of globular proteins using neural network models", 1988.
[2] Rost, B. and Sander, C., "Prediction of protein secondary structure at better than 70% accuracy", 1993.
[3] Preeti Singh, "Protein Secondary Structure Prediction Using Neural Networks", M.S. Thesis, Aug.
[4] Y.-Q. Zhang, M. D. Fraser, R. A. Gagliano and A. Kandel, "Granular Neural Networks for Numerical-Linguistic Data Fusion and Knowledge Discovery", IEEE Transactions on Neural Networks, 11(3), 2000.

32 Acknowledgement
This research is partially supported by NIH P20 GM065762 (NIH Planning Grant: "Georgia State University Biomedical Computing Center").

33 Thank you! Questions?

