Published by Dina Ramsey. Modified over 9 years ago.
1
Protein Local 3D Structure Prediction by Super Granule Support Vector Machines (Super GSVM)
Dr. Bernard Chen, Assistant Professor
Department of Computer Science, University of Central Arkansas
Fall 2009
2
Goal of the Dissertation
The main purpose is to obtain and extract protein sequence motif information that is universally conserved across protein family boundaries, and then to use this information for protein local 3D structure prediction.
3
Research Flow
Part 1: Bioinformatics Knowledge and Dataset Collection
Part 2: Discovering Protein Sequence Motifs
Part 3: Motif Information Extraction
Part 4: Protein Local Tertiary Structure Prediction
4
Data set
5
HSSP matrix: 1b25
8
Representation of Segment
Sliding window size: 9. Each window corresponds to a sequence segment, which is represented by a 9 × 20 matrix plus the nine corresponding secondary structure labels obtained from DSSP. More than 560,000 segments (413 MB) are generated by this method.
DSSP: source of the secondary structure information.
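The segment construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the window moves one residue at a time, that the per-residue profile is a list of 20 frequencies, and that the DSSP labels arrive as a string of equal length. The names `sliding_segments`, `toy_profile` and `toy_dssp` are hypothetical, and the toy data are made up.

```python
WINDOW = 9          # sliding window size stated on the slide
N_AMINO_ACIDS = 20  # one profile column per amino acid

def sliding_segments(profile, secondary, window=WINDOW):
    """Return (window x 20 profile rows, window-long secondary-structure string) pairs.

    Assumes a stride of one residue, which the slide does not state.
    """
    if len(profile) != len(secondary):
        raise ValueError("profile and secondary structure must have equal length")
    if any(len(row) != N_AMINO_ACIDS for row in profile):
        raise ValueError("every profile row must have 20 entries")
    segments = []
    for start in range(len(secondary) - window + 1):
        segments.append((profile[start:start + window],
                         secondary[start:start + window]))
    return segments

# Made-up protein of length 12 with a uniform profile.
toy_profile = [[1.0 / N_AMINO_ACIDS] * N_AMINO_ACIDS for _ in range(12)]
toy_dssp = "HHHHEEEECCCC"
segments = sliding_segments(toy_profile, toy_dssp)
print(len(segments), len(segments[0][0]), segments[0][1])
```

A protein of length L yields L - 8 segments under these assumptions, so proteins shorter than nine residues contribute none.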
10
Granular Computing Model
1. Original dataset
2. Fuzzy C-Means Clustering
3. Information Granule 1 ... Information Granule M
4. New Improved or Greedy K-means Clustering (on each information granule)
5. Join Information
6. Final Sequence Motifs Information
11
Reduce Time-complexity
Wei’s method: 1285968 sec (15 days) * 6 = 7715568 sec (90 days)
Granular Model: 154899 sec (FCM exe time) + 231720 sec (2.7 days) * 6 = 1545219 sec (18 days)
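The timing comparison can be re-derived with a few lines of arithmetic. The sketch below only uses the numbers printed on the slide; the variable names are hypothetical, and the factor 6 is taken as a repeat count without knowing what it stands for. Multiplying the stated single-run time by 6 gives 7,715,808 sec, slightly different from the 7715568 sec on the slide, while the totals in days are unaffected at the slide's precision.

```python
SECONDS_PER_DAY = 24 * 60 * 60

wei_single_run = 1285968      # sec per run of Wei's method (slide value)
fcm_time = 154899             # sec for fuzzy c-means (slide value)
granule_clustering = 231720   # sec per clustering pass over the granules (slide value)
repeats = 6                   # factor shown on the slide

wei_total = wei_single_run * repeats
granular_total = fcm_time + granule_clustering * repeats

print(f"Wei's method:   {wei_total} sec = {wei_total / SECONDS_PER_DAY:.1f} days")
print(f"Granular model: {granular_total} sec = {granular_total / SECONDS_PER_DAY:.1f} days")
print(f"Ratio: {wei_total / granular_total:.2f}")
```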
12
Comparison of Quality Measures

Different Methods    >60%     S.D.   >70%     S.D.   H-B Measure
Traditional          25.82%   0.93   10.44%   0.61   0.2543
Zhong-60-1020        31.46%   0.26   10.42%   0.59   0.2871
Zhong-61-985         31.71%   0.81   10.84%   0.07   0.2784
Zhong-62-900         31.04%   0.19   10.29%   0.64   0.2768
FCM-K-means          37.14%   1.46   12.99%   0.74   0.3589
FIK Model
FIK Model 0          40.15%   1.09   13.44%   0.49   0.3730
FIK Model 800        40.23%   0.45   13.37%   0.58   0.3717
FIK Model 1000       39.15%   0.39   13.27%   0.29   0.3665
FIK Model 1200       38.90%   0.43   12.89%   0.77   0.3697
FIK Model 1400       37.80%   0.80   12.59%   0.44   0.3655
FGK Model
FGK Model 200        42.45%   0.06   14.14%   0.02   0.3393
FGK Model 250        42.77%   0.07   14.06%   0.07   0.3443
FGK Model 300        41.08%   0.14   13.89%   0.02   0.3311
FGK Model 350        37.47%   0.51   13.49%   0.14   0.3489
FGK Model 400        37.62%   1.56   13.86%   1.29   0.3676
Best Selection       44.18%   0      15.02%   0      0.3664
14
Super GSVM-FE Motivation
First, the information we try to generate concerns sequence motifs, but the original input data are derived from whole protein sequences by a sliding window technique, so the input also contains segments that are not part of any motif. Second, fuzzy c-means clustering can assign one segment to more than one information granule.
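The second point, one segment landing in several granules, follows from how fuzzy c-means memberships work. The sketch below uses the standard FCM membership formula with fuzzifier m; the membership cutoff `threshold`, the function names, and the two-dimensional toy points are assumptions for illustration and are not taken from the slides.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster center."""
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0.0 for d in dists):
        # The point sits exactly on one or more centers: share membership among them.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

def granules_for(point, centers, threshold=0.3, m=2.0):
    """Indices of every granule whose membership reaches the (assumed) cutoff."""
    memberships = fcm_memberships(point, centers, m)
    return [i for i, u in enumerate(memberships) if u >= threshold]

toy_centers = [(0.0, 0.0), (4.0, 0.0)]
print(granules_for((2.0, 0.0), toy_centers))  # equidistant point
print(granules_for((0.5, 0.0), toy_centers))  # point close to the first center
```

A point midway between two centers receives equal membership in both and is copied into both granules, whereas a point close to one center stays in a single granule.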
15
1. Original dataset
2. Fuzzy C-Means Clustering produces Information Granule 1 ... Information Granule M
3. Greedy K-means Clustering within each information granule
4. Super GSVM-FE additional portion, for each cluster: Ranking SVM Feature Elimination, then collect the survived segments
5. Five iterations of traditional K-means
6. Join Information
7. Final Sequence Motifs Information
16
Extracted Motif Information
18
3D information
3D information is generated from PDB (Protein Data Bank) files. The example shown is the 1a3c PDB file.
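As a rough illustration of reading coordinates from a PDB file, the sketch below pulls the C-alpha atoms out of fixed-column ATOM records and measures the distance between two of them. Which quantities the authors actually extract is not stated on the slide, so the choice of C-alpha coordinates is an assumption. The function name `parse_ca_atoms` is hypothetical, and the two demo records are made up; they are not lines from 1a3c.

```python
import math

def parse_ca_atoms(pdb_lines):
    """Return (chain, residue number, residue name, (x, y, z)) for each C-alpha atom."""
    atoms = []
    for line in pdb_lines:
        # ATOM records are fixed-width: atom name in columns 13-16,
        # residue name 18-20, chain 22, residue number 23-26, x/y/z in 31-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((line[21], int(line[22:26]), line[17:20], coords))
    return atoms

# Made-up records laid out in the PDB column format.
demo_lines = [
    "ATOM      2  CA  MET A   1      11.104  13.207   2.100",
    "ATOM     10  CA  GLY A   2      14.104  17.207   2.100",
]
ca_atoms = parse_ca_atoms(demo_lines)
ca_distance = math.dist(ca_atoms[0][3], ca_atoms[1][3])
print(ca_atoms)
print(ca_distance)
```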
20
Testing Data
The latest release of PISCES includes 4345 PDB files. Compared with the dataset used in our experiment, 2419 of these PDB files are not included in it. Therefore, we regard our 2710 protein files as the training dataset and the 2419 protein files as the independent testing dataset.
21
Testing Data
We convert the testing dataset with the approach introduced earlier. More than 490,000 segments are generated as the testing dataset.
22
Super GSVM
Training:
1. Training dataset
2. Fuzzy C-Means Clustering produces Information Granule 1 ... Information Granule M
3. Greedy K-means Clustering within each information granule
4. For each cluster: train a Ranking SVM, then eliminate the 20% lowest-ranked members
5. Five iterations of traditional K-means
6. Collect all extracted clusters and Ranking SVMs (all sequence clusters, all Ranking SVMs)
Prediction:
1. Take a segment from the independent testing dataset
2. Find the closest cluster within a given distance threshold
3. Feed the segment to the Ranking SVM belonging to that cluster
4. If the rank belongs to the cluster, predict the local 3D structure
5. If not, find the next closest cluster
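Two steps of this flow lend themselves to a short sketch: dropping the lowest-ranked 20% of a cluster's members, and the prediction loop that walks through clusters from closest to farthest. The code below is a hedged illustration only. It does not train a Ranking SVM; each cluster carries a stand-in `accepts` callable in its place. The city-block distance, the dictionary layout, the function names, and the structure labels are all hypothetical placeholders.

```python
def keep_top_ranked(members, scores, drop_fraction=0.2):
    """Drop the lowest-scoring fraction of members, preserving the original order."""
    n_drop = int(len(members) * drop_fraction)
    ranked = sorted(range(len(members)), key=lambda i: scores[i])
    dropped = set(ranked[:n_drop])
    return [m for i, m in enumerate(members) if i not in dropped]

def city_block(a, b):
    """Sum of absolute coordinate differences (an assumed distance measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def predict_local_structure(segment, clusters, threshold, distance=city_block):
    """Try clusters from closest to farthest, stopping at the distance threshold."""
    by_distance = sorted(clusters, key=lambda c: distance(segment, c["center"]))
    for cluster in by_distance:
        if distance(segment, cluster["center"]) > threshold:
            break  # every remaining cluster is even farther away
        if cluster["accepts"](segment):  # stand-in for the cluster's Ranking SVM
            return cluster["structure"]
    return None  # no cluster claimed the segment, so no prediction is made

# Toy clusters with placeholder labels and fixed accept/reject behaviour.
toy_clusters = [
    {"center": (0.0, 0.0), "accepts": lambda s: False, "structure": "structure-A"},
    {"center": (1.0, 1.0), "accepts": lambda s: True, "structure": "structure-B"},
]
kept = keep_top_ranked(list("abcdefghij"), [5, 1, 9, 3, 7, 2, 8, 6, 4, 10])
wide = predict_local_structure((0.2, 0.1), toy_clusters, threshold=3.0)
narrow = predict_local_structure((0.2, 0.1), toy_clusters, threshold=0.5)
print(kept, wide, narrow)
```

Returning `None` when no cluster accepts a segment is one way to model segments that receive no prediction; a tighter threshold leaves more segments unpredicted.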
23
Prediction Accuracy
24
Prediction Coverage
25
Future Works
Incorporate Chou-Fasman parameters into SVM training.
26
Future Works
For each cluster, build a Decision Tree (DT) instead of an SVM model.
Training:
1. Training dataset
2. Fuzzy C-Means Clustering produces Information Granule 1 ... Information Granule M
3. Greedy K-means Clustering within each information granule
4. For each cluster: build a Decision Tree
5. Five iterations of traditional K-means
6. Collect all extracted clusters and Ranking-SVMs (all sequence clusters)
Test by DT:
1. Take a segment from the independent testing dataset
2. Find the closest cluster within a given distance threshold
3. Feed the segment to the DT belonging to that cluster
4. If the rank belongs to the cluster, predict the local 3D structure
5. If not, find the next closest cluster