Protein Local 3D Structure Prediction by Super Granule Support Vector Machines (Super GSVM) Dr. Bernard Chen Assistant Professor Department of Computer Science University of Central Arkansas Fall 2009
Goal of the Dissertation The main purpose is to extract protein sequence motif information that is universally conserved across protein family boundaries, and then to use this information for protein local 3D structure prediction.
Research Flow Part 1: Bioinformatics Knowledge and Dataset Collection -> Part 2: Discovering Protein Sequence Motifs -> Part 3: Motif Information Extraction -> Part 4: Protein Local Tertiary Structure Prediction
Data set
HSSP matrix: 1b25
Representation of Segment Sliding window size: 9. Each window corresponds to one sequence segment, which is represented by a 9 × 20 matrix plus the nine corresponding secondary structure labels obtained from DSSP. More than 560,000 segments (413 MB) are generated by this method. DSSP: source of the secondary structure information.
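The sliding-window step can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `sliding_segments` and the input layout (a list of 20-value profile rows plus one secondary structure character per residue) are assumptions made for the example.

```python
def sliding_segments(profile, sec_struct, window=9):
    """Cut a protein into overlapping segments of `window` residues.

    profile    : list of rows, one per residue, each with 20 values
                 (assumed layout of the HSSP frequency profile)
    sec_struct : string with one secondary structure label per residue
                 (assumed to come from DSSP)
    Returns a list of (start, window x 20 rows, window labels) tuples.
    """
    n = len(profile)
    if any(len(row) != 20 for row in profile):
        raise ValueError("every profile row must hold 20 values")
    if len(sec_struct) != n:
        raise ValueError("one secondary structure label per residue is required")
    # A protein shorter than the window produces no segments.
    return [
        (start, profile[start:start + window], sec_struct[start:start + window])
        for start in range(n - window + 1)
    ]
```

A protein of length L yields L - 8 segments with the window size of 9 used here, so consecutive segments overlap in eight positions.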
Granular Computing Model Original dataset -> Fuzzy C-Means Clustering -> Information Granule 1 ... Information Granule M -> New Improved or Greedy K-means Clustering (on each granule) -> Join Information -> Final Sequence Motifs Information
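How fuzzy c-means can place one segment into several information granules can be sketched with the standard membership formula. The Euclidean distance, the fuzzifier `m = 2.0`, and the membership cutoff `threshold = 0.3` are illustrative assumptions; the slides do not state the values or distance measure actually used.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    on_center = [i for i, d in enumerate(dists) if d == 0.0]
    if on_center:
        share = 1.0 / len(on_center)
        return [share if i in on_center else 0.0 for i in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]

def assign_to_granules(point, centers, threshold=0.3, m=2.0):
    """Return the indices of every granule whose membership reaches the cutoff."""
    memberships = fcm_memberships(point, centers, m)
    return [i for i, u in enumerate(memberships) if u >= threshold]
```

A segment lying between two centers passes the cutoff for both and is copied into both granules, while a segment close to a single center lands in only one.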
Reduce Time-complexity Wei's method: 15 days * 6 = 90 days. Granular Model: FCM execution time + 2.7 days * 6 = 18 days.
Comparison of Quality Measures Columns: Different Methods, >60%, S.D., >70%, S.D., H-B Measure. Methods compared: Traditional (>60%: 25.82%), Zhong (three rows), FCM-K-means (>60%: 37.14%), FIK Model (multiple rows), FGK Model (multiple rows), Best Selection (44.18%, 0, 15.02%).
Super GSVM-FE Motivation First, the information we want to generate concerns sequence motifs, but the original input data are derived from whole protein sequences by a sliding window technique. Second, fuzzy c-means clustering can assign one segment to more than one information granule.
Original dataset -> Fuzzy C-Means Clustering -> Information Granule 1 ... Information Granule M -> Greedy K-means Clustering (on each granule) -> Super GSVM-FE Additional Portion: for each cluster, Ranking SVM Feature Elimination, then Collect Survived Segments, then Five iterations of traditional K-means -> Join Information -> Final Sequence Motifs Information
Extracted Motif Information
3D information 3D information is generated from PDB (Protein Data Bank) files; the example shown is the 1a3c PDB file.
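Reading 3D coordinates from a PDB file can be sketched as below. The fixed column positions follow the published PDB file format for ATOM records; keeping only C-alpha atoms and the function name `parse_ca_atoms` are assumptions for illustration, since the slides do not say which atoms were used.

```python
def parse_ca_atoms(pdb_lines):
    """Collect C-alpha coordinates from the ATOM records of a PDB file.

    Returns a list of (chain, residue number, residue name, (x, y, z)).
    """
    atoms = []
    for line in pdb_lines:
        # Record name occupies columns 1-6; skip HETATM, TER, REMARK, etc.
        if line[0:6] != "ATOM  ":
            continue
        # Atom name occupies columns 13-16.
        if line[12:16].strip() != "CA":
            continue
        res_name = line[17:20].strip()   # columns 18-20
        chain = line[21]                 # column 22
        res_seq = int(line[22:26])       # columns 23-26
        x = float(line[30:38])           # columns 31-38
        y = float(line[38:46])           # columns 39-46
        z = float(line[46:54])           # columns 47-54
        atoms.append((chain, res_seq, res_name, (x, y, z)))
    return atoms
```

With a local copy of a PDB entry, usage would be along the lines of `parse_ca_atoms(open(path))`, where `path` is a hypothetical location of the file.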
Testing Data The latest release of PISCES includes 4345 PDB files. Compared with the dataset used in our experiment, 2419 of these PDB files are not part of it. Therefore, we regard our 2710 protein files as the training dataset and the 2419 protein files as the independent testing dataset.
Testing Data We convert the testing dataset with the same sliding-window approach introduced earlier; more than 490,000 segments are generated as the testing dataset.
Super GSVM Training: Training dataset -> Fuzzy C-Means Clustering -> Information Granule 1 ... Information Granule M -> Greedy K-means Clustering (on each granule) -> For each cluster, train a Ranking SVM and then eliminate the 20% lowest-ranked members -> Five iterations of traditional K-means -> Collect all extracted clusters and Ranking SVMs (All Sequence clusters, All Ranking SVMs). Prediction: Independent testing dataset -> Find the closest cluster within a given distance threshold -> Feed the segment to that cluster's SVM -> If the rank belongs to the cluster, predict the local 3D structure; if not, find the next closest cluster.
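Two steps of this flow, eliminating the lowest-ranked 20% of a cluster's members and falling back to the next closest cluster during prediction, can be sketched as follows. This is a sketch under stated assumptions: segments are treated as flat numeric vectors, distance is Euclidean, and each cluster is a plain dictionary whose `accepts` callable is a hypothetical stand-in for the trained Ranking SVM check.

```python
import math

def keep_top_ranked(members, scores, drop_fraction=0.2):
    """Drop the lowest-scoring fraction of a cluster's members."""
    by_rank = sorted(range(len(members)), key=lambda i: scores[i], reverse=True)
    n_keep = len(members) - int(len(members) * drop_fraction)
    # Preserve the original member order among the survivors.
    return [members[i] for i in sorted(by_rank[:n_keep])]

def predict_segment(segment, clusters, threshold):
    """Try clusters from nearest to farthest within the distance threshold.

    Each cluster is a dict with keys (all hypothetical names):
      "center"    : cluster centroid as a flat vector
      "accepts"   : callable standing in for the cluster's Ranking SVM
      "structure" : the local 3D structure associated with the cluster
    Returns the structure of the first accepting cluster, or None.
    """
    ordered = sorted(clusters, key=lambda c: math.dist(segment, c["center"]))
    for cluster in ordered:
        if math.dist(segment, cluster["center"]) > threshold:
            break  # every remaining cluster is farther away still
        if cluster["accepts"](segment):
            return cluster["structure"]
    return None
```

Returning `None` when no cluster within the threshold accepts the segment means that some segments receive no prediction, which is why coverage is reported alongside accuracy.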
Prediction Accuracy
Prediction Coverage
Future Work Incorporate Chou-Fasman parameters into SVM training
Future Work For each cluster, build a Decision Tree (DT) instead of an SVM model. Training: Training dataset -> Fuzzy C-Means Clustering -> Information Granule 1 ... Information Granule M -> Greedy K-means Clustering (on each granule) -> For each cluster, build a Decision Tree -> Five iterations of traditional K-means -> Collect all extracted clusters (All Sequence clusters). Prediction: Independent testing dataset -> Find the closest cluster within a given distance threshold -> Feed the segment to that cluster's DT -> If the DT assigns the segment to the cluster, predict the local 3D structure; if not, find the next closest cluster.