
1 Master’s Thesis Defense Junlin Wang Advisor: Dr. Yi Shang
MACHINE LEARNING METHODS FOR EVALUATING THE QUALITY OF A SINGLE PROTEIN MODEL USING ENERGY AND STRUCTURAL PROPERTIES

2 Contents Introduction Related Work Methods and Implementation
Experimental Results Conclusion and Future Work

3 Protein Structure Prediction
Introduction Protein Structure Prediction
Protein structure prediction is the prediction of the three-dimensional structure of a protein from its amino acid sequence.
Experimental methods: X-ray crystallography, electron microscopy, Nuclear Magnetic Resonance (NMR). Costly, difficult and time consuming.
Computational methods: generate numerous alternative models, take much less time, and use only computational resources.

4 Protein Quality Assessment
Introduction Protein Quality Assessment
Most computational prediction methods use a sampling-and-selection strategy: structure prediction first generates a large pool of 3D structures, and quality assessment then selects the most protein-like model from that pool.

5 Problem and Motivation
Introduction Problem and Motivation
Compared with experimental methods, computational methods are simple and inexpensive and can generate large numbers of models in a limited time.
Problem: after the model generation step, how to pick the best model from the model pool becomes one of the most important problems.
The accuracy of state-of-the-art single-model QA methods is still not high enough for practical applications.
Consensus QA methods have their own defect: they require a large enough pool of models with diverse quality to perform well.

6 Goals and Contributions
Introduction Goals and Contributions
Goals:
An integrated dataset of energy and structural features for the CASP datasets
A system used to generate the different features for models
A framework used to test different machine learning methods on different datasets
Contributions:
An integrated feature dataset of energy and structural properties including 18 features
An implemented system that generates the different features and performs the format processing required by different analysis software (e.g. Matlab, Weka…)
An implemented framework that tests five machine learning methods on different datasets and parameters

7 Contents Introduction Related Work Methods and Implementation
Experimental Results Conclusion and Future Work

8 Structure Quality Assessment
Related Work Structure Quality Assessment
Quality assessment methods can be separated into single-model QA methods and consensus-based (cluster-based) QA methods.
Single-model QA: uses the knowledge of a model itself. Physics-based potential functions use physical laws to evaluate a model's quality; knowledge-based statistical methods are based on thermodynamic equilibrium and known information about protein structures.
Cluster-based QA: uses the knowledge of the models in a cluster, assuming that good models have more structural neighbors. Both the quality of the models in the cluster and the size of the cluster affect its performance.

9 Related Work Single-model QA Proq2 DFIRE, dDFIRE OPUS-Cα
Proq2, the best single-model QA method in CASP10, uses support vector machines to predict local as well as global quality of protein models.
DFIRE is a statistical energy function built on distance-scaling, which uses a reference state of uniformly distributed ideal gas points together with the statistics of the distance between two atoms in known protein structures. dDFIRE is short for dipolar DFIRE, which adds orientation dependence.
OPUS-Cα only uses the knowledge of Cα positions; pseudo-positions artificially built from the Cα trace are used to establish the contributions from other atomic positions.

10 Related Work Single-model QA RW
RW is a distance-dependent atomic potential that uses a random-walk ideal chain as the reference state; a side-chain orientation-dependent term generated from a non-redundant high-resolution structural database is added to RW.
DOPE (Discrete Optimized Protein Energy) is an atomic distance-dependent statistical potential extracted from a non-redundant set of 1472 crystallographic structures.
RAPDF is an all-atom distance-dependent conditional probability discriminatory function; it generates three discriminatory functions (two virtual-atom representations and one all-heavy-atom representation) to evaluate the quality of a model.

11 Related Work Structure Properties DSSP PSIPRED
DSSP (Define Secondary Structure of Proteins) is the standard algorithm for assigning secondary structure to the amino acids of a protein.
PSIPRED is a simple and accurate secondary structure prediction method, incorporating two feed-forward neural networks which perform an analysis on output obtained from PSI-BLAST (Position Specific Iterated BLAST).
SCRATCH is a server for predicting protein tertiary structure and structural features. The SCRATCH software suite includes predictors for secondary structure, relative solvent accessibility…

12 Linear Model Regression
Related Work Linear Model Regression
Linear regression is a commonly used statistical method for modeling the relationship between a scalar dependent variable y and one or more explanatory (independent) variables X. Model function: see the sketch below.
Neural Network
Artificial neural networks (ANNs) are a family of statistical learning algorithms in machine learning and cognitive science, inspired by biological neural networks (the central nervous systems of animals, in particular the brain).
In supervised learning, given a set of instance pairs (x, y), the aim is to find a function f(x)→y that matches the examples, i.e. to infer the mapping implied by the data.
The cost function is related to the errors between the network's output and the data, and it implicitly contains prior knowledge about the problem domain. A commonly used cost is the mean squared error, which minimizes the average squared error between the network's output f(x) and the dependent variable y in the training dataset.
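A hedged reconstruction of the linear model function and the mean-squared-error cost mentioned above, in standard notation (the exact formulas shown on the slide are not preserved in this transcript):

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(f(x_i) - y_i\bigr)^2
```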

13 Related Work Decision Tree
A decision tree builds regression or classification models in the form of a tree structure: it breaks the dataset down into smaller and smaller subsets while an associated decision tree is incrementally developed, finally resulting in a tree with decision nodes and leaf nodes.
The commonly used algorithm for building decision trees is ID3 by J. R. Quinlan. By replacing information gain with Standard Deviation Reduction (SDR), ID3 can be used to generate a decision tree for regression:
SDR(Y, X) = S(Y) - S(Y, X)
Every time the dataset is split, the decrease in standard deviation is calculated and the attribute with the highest standard deviation reduction is chosen. Once the termination criteria are met, the tree stops growing. A small worked sketch of this criterion follows.
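A minimal Matlab sketch of the standard deviation reduction criterion, assuming a numeric attribute split at a threshold; the helper name and the threshold-based split are illustrative, not the thesis code:

```matlab
% Standard deviation reduction (SDR) for a binary split of the target y
% on attribute values x at threshold t: SDR = S(Y) - S(Y,X), where S(Y,X)
% is the size-weighted standard deviation of the two resulting subsets.
function sdr = sdrSplit(x, y, t)
    left  = y(x <  t);                      % subset going to the left branch
    right = y(x >= t);                      % subset going to the right branch
    if isempty(left) || isempty(right)      % degenerate split: no reduction
        sdr = 0;
        return;
    end
    wL  = numel(left)  / numel(y);          % fraction of samples in the left subset
    wR  = numel(right) / numel(y);          % fraction of samples in the right subset
    sdr = std(y) - (wL * std(left) + wR * std(right));
end
```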

14 Related Work Boosting
Boosting is a machine learning ensemble meta-algorithm for reducing primarily bias, and also variance, in supervised learning; it is a family of machine learning algorithms that convert weak learners into strong ones. In this study boosting is applied to decision trees:
Fit a decision tree f_b to the current residuals
Update f by adding a shrunken version of the new tree
Update the residuals r
The final result of the boosting model is the sum of the shrunken trees (see the sketch below)
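A hedged sketch of these steps in the standard least-squares boosting notation (B trees, shrinkage parameter λ); the exact formulation shown on the slide is not preserved:

```latex
\hat f(x) \leftarrow 0, \qquad r_i \leftarrow y_i \quad (i = 1,\dots,n)
\text{for } b = 1,\dots,B:\quad \text{fit a regression tree } \hat f^{\,b} \text{ to } (X, r),\quad
\hat f(x) \leftarrow \hat f(x) + \lambda\,\hat f^{\,b}(x),\quad r_i \leftarrow r_i - \lambda\,\hat f^{\,b}(x_i)
\hat f(x) = \sum_{b=1}^{B} \lambda\,\hat f^{\,b}(x)
```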

15 Related Work Random Forest
The random forest algorithm, developed by Leo Breiman and Adele Cutler, combines the idea of "bagging" with the random selection of features. It is an ensemble learning method for classification, regression and other tasks that operates by constructing a large number of decision trees at training time and, in the regression case, outputting the mean prediction of the individual trees:
Given a training dataset D of size n, use bootstrap sampling to generate m new sample datasets of size n by sampling from D uniformly and with replacement
Fit m decision trees with the m new sample datasets
The data not used in growing each tree are called out-of-bag samples and are used to estimate the error rate of that tree
The average of the outputs from all the trees is used as the final result (see the sketch below)
Random forests correct the overfitting problem of a single decision tree and provide importance information for each input variable, which makes them suitable for retrieving information from high-dimensional, noisy datasets [36]. With these advantages, random forest has become a state-of-the-art machine learning method widely used to solve different biological problems.
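A minimal Matlab sketch of the bootstrap and out-of-bag mechanics listed above, assuming a feature matrix X and true scores y are already loaded; a full random forest would additionally sub-sample features at every split (as TreeBagger does), which is omitted here:

```matlab
n = size(X, 1);                              % number of training models
B = 500;                                     % number of trees in the ensemble
allPred = zeros(n, B);
oobMask = false(n, B);                       % true where a sample is out of bag for a tree
for b = 1:B
    idx = randsample(n, n, true);            % bootstrap sample of size n, with replacement
    oobMask(setdiff(1:n, idx), b) = true;    % samples never drawn are "out of bag"
    tree = fitrtree(X(idx, :), y(idx));      % fit one regression tree on the bootstrap sample
    allPred(:, b) = predict(tree, X);
end
% Out-of-bag prediction: for each sample, average only the trees that did not see it
oobPred = sum(allPred .* oobMask, 2) ./ max(sum(oobMask, 2), 1);
oobErr  = mean((oobPred - y).^2);            % out-of-bag estimate of the error
yhat    = mean(allPred, 2);                  % ensemble output: average over all trees
```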

16 Contents Introduction Related Work Methods and Implementation
Experimental Results Conclusion and Future Work

17 Methods and Implementation
A feature dataset of energy and structural properties
An implemented Java system
An implemented Matlab framework

18 Methods and Implementation
Feature dataset
An integrated feature dataset of energy and structural properties including 18 features
Referred model sets: CASP8, CASP9, CASP10, CASP10 QA stage one, CASP10 QA stage two
Target number: 344 targets
Model number: models
True scores: GDT-TS, TM-score, RMSD

19 Methods and Implementation
Feature dataset
Score function results: DDfire, Dfire, Dope, Opus, Rapdf, RW, Proq2
Secondary structure features: percentage of helix, percentage of sheet, percentage of coil, percentage of all matching secondary structure, confidence score of secondary structure
Solvent accessibility features (threshold = 0.2/0.25): matching of buried amino acids, matching of exposed amino acids, percentage of matching solvent accessibility

20 Methods and Implementation
A feature dataset of energy and structural properties
A feature generating system
An implemented Matlab framework

21 Methods and Implementation
Feature generating system architecture Score calculation module Data processing module Structural result calculation Module Utility tool module

22 Methods and Implementation
Score calculation module
True score calculation module: GDT-TS, RMSD and TM-score calculation; pairwise matrix calculation; batch calculation for all models under a directory
Score function result calculation module: DFIRE, DDFIRE, Dope, RAPDF, Opus-ca and RW result calculation; Proq2 result calculation

23 Methods and Implementation
Structural result calculation module: DSSP result calculation; SCRATCH result calculation; PSIPRED result calculation; structural feature extraction
Data processing module: data checking (target checking, model checking); data combination; format changing (file extension, separator setting, title setting); data filtering (picking out specific data)

24 Methods and Implementation
Feature extraction
Energy property features: seven single-model QA methods are used to generate seven energy property features: DDfire, Dfire, Dope, Opus, Rapdf, RW and Proq2.
Secondary structural features: for each model, the consistency of the secondary structure elements (helix, strand and coil) between the DSSP and PSIPRED results is calculated and then converted into %helix, %sheet and %coil by dividing by the total compared chain length; the results are used as three features (n is the matching part).

25 Methods and Implementation
Feature extraction
The total matching of the three secondary structures is also calculated and used as a feature (n is the matching part).
Secondary structure confidence score: for each amino acid position i, PSIPRED gives a confidence value C_i (C_i ∈ {0…9}) indicating how confident the prediction is; the secondary structure types s_i^DSSP and s_i^PSIPRED are calculated by DSSP and PSIPRED (s_i^DSSP, s_i^PSIPRED ∈ {H, E, C}), and d(s_i^DSSP, s_i^PSIPRED) gives 1 if they are identical, otherwise 0. The confidence score is the confidence-weighted agreement Σ_i C_i · d(s_i^DSSP, s_i^PSIPRED); a sketch of these features follows.
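A minimal Matlab sketch of the secondary structure features above, assuming ss_dssp and ss_pred are aligned per-residue state vectors over {H, E, C} (char arrays) and conf holds the PSIPRED confidence values; the normalization of the confidence score is an assumption, not taken from the thesis:

```matlab
L = numel(ss_dssp);                          % total compared chain length
match = (ss_dssp == ss_pred);                % d(s_i^DSSP, s_i^PSIPRED) = 1 where identical
pctHelix = sum(match & ss_dssp == 'H') / L;  % %helix: positions matching as helix
pctSheet = sum(match & ss_dssp == 'E') / L;  % %sheet: positions matching as strand
pctCoil  = sum(match & ss_dssp == 'C') / L;  % %coil: positions matching as coil
pctAll   = sum(match) / L;                   % percentage of all matching secondary structure
% Confidence score: PSIPRED confidence summed over agreeing positions,
% normalized here by the maximum attainable confidence (assumed normalization)
confScore = sum(conf(match)) / max(sum(conf), 1);
```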

26 Methods and Implementation
Feature extraction
Solvent accessibility features: the absolute solvent accessibility surface (ASAS) of each amino acid of a model is calculated with DSSP and divided by that amino acid's accessible surface area to give a ratio value. If the ratio is lower than 0.2 the amino acid is regarded as buried, otherwise as exposed.
For each model, the consistency of the solvent accessibility states (buried, exposed) between the DSSP and SCRATCH results is calculated and then converted into %buried, %exposed and %matching solvent accessibility by dividing by the total compared chain length; the results are used as three features (n is the matching part).
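A minimal Matlab sketch of these features, assuming asa holds the DSSP absolute solvent accessibility per residue, maxAsa the reference accessible surface area per residue type (lookup values not shown), and predExposed the SCRATCH buried/exposed prediction as a logical vector; all names are illustrative:

```matlab
ratio   = asa ./ maxAsa;            % relative solvent accessibility per residue
exposed = ratio >= 0.2;             % >= 0.2 exposed, < 0.2 buried (0.25 is used as a second threshold)
L = numel(ratio);                   % total compared chain length
matchBuried  = sum(~exposed & ~predExposed) / L;   % matching of buried amino acids
matchExposed = sum( exposed &  predExposed) / L;   % matching of exposed amino acids
matchAll     = matchBuried + matchExposed;         % percentage of matching solvent accessibility
```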

27 Methods and Implementation
A feature dataset of energy and structural properties
A feature generating system
An implemented Matlab framework

28 Methods and Implementation
An implemented Matlab framework
Modules: load data, data processing, correlation and true score lost calculation, parameter setting, machine learning algorithm testing.
The framework first loads all the data into Matlab; it can then either run the correlation and true score lost calculation on the features or on the consensus method, or perform parameter setting and machine learning algorithm testing (see the sketch below).
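A minimal sketch of the two evaluation quantities the framework computes for one target (variable names are assumed): the Pearson correlation between the QA scores and the true scores, and the true score lost, i.e. the gap between the best model in the pool and the model the QA method ranks first:

```matlab
% qaScore:   scores assigned by a QA method to every model of one target
% trueScore: the real GDT-TS (or TM-score) of those models
r = corr(qaScore(:), trueScore(:));        % Pearson correlation coefficient
[~, pick] = max(qaScore);                  % model the QA method would select
lost = max(trueScore) - trueScore(pick);   % true score lost by not picking the real best model
```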

29 Methods and Implementation
Parameters Optimization
When using machine learning methods to solve different problems it is important to optimize the parameters. In general there are several statistical parameters that can be tuned to improve the learning performance, and for some of the more complex methods the combination of parameters leads to an obvious difference in performance.
Linear Model
Model: linear / interactions ("interactions" adds pairwise interaction terms of the features in addition to the 15 features)
Stepwise: true / false (stepwise adds or removes terms one by one to check the performance of the model)
Best performance: the basic linear model without stepwise selection
Decision Tree
Number of features used every time the dataset is split: range {1…15}
Prune: true / false (a decision tree without pruning grows without limit)
Best performance: a pruned decision tree (prune = true) with more than ten features per split got the better result (see the sketch below)
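A minimal Matlab sketch of the linear-model and decision-tree settings listed above, assuming the feature matrix X (n-by-15) and GDT-TS labels y are already loaded; this illustrates the parameter choices and is not the thesis code:

```matlab
mdlLin   = fitlm(X, y, 'linear');            % basic linear model (best reported setting)
mdlInter = fitlm(X, y, 'interactions');      % adds pairwise interaction terms
mdlStep  = stepwiselm(X, y, 'linear');       % stepwise: add/remove terms one by one
treePruned = fitrtree(X, y, 'Prune', 'on');  % pruned regression tree
treeFull   = fitrtree(X, y, 'Prune', 'off'); % unpruned tree grows without limit
```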

30 Methods and Implementation
Parameters Optimization
Neural Network
Number of layers: 1–2; size of each layer: range {10, 30, 50, 100}
Training function: 'trainlm', 'trainbr', 'trainscg'
Transfer function (for each layer): 'tansig', 'logsig', 'purelin'
Best performance: a network with 2 layers of size 100, Levenberg-Marquardt training ('trainlm'), and the transfer function combinations {'logsig', 'logsig'} and {'logsig', 'tansig'} got the best results
Boosting:
Number of trees: from 100 to 1000 with a step of 100
Shrinkage rate: from 0.01 to 0.1 with a step of 0.01
Best performance: 800 or 900 trees with a shrinkage rate of 0.09 resulted in better performance (see the sketch below)
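A minimal Matlab sketch of the best-performing neural network and boosting configurations above (Neural Network Toolbox and Statistics Toolbox; X and y are the assumed feature matrix and labels):

```matlab
net = fitnet([100 100], 'trainlm');     % two hidden layers of size 100, Levenberg-Marquardt training
net.layers{1}.transferFcn = 'logsig';   % transfer functions tried: 'tansig', 'logsig', 'purelin'
net.layers{2}.transferFcn = 'tansig';
net = train(net, X', y');               % the toolbox expects samples as columns

% Boosted regression trees: 800 trees, shrinkage (learning rate) 0.09
boost = fitensemble(X, y, 'LSBoost', 800, 'Tree', 'LearnRate', 0.09);
```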

31 Methods and Implementation
Parameters Optimization
Random Forest:
Number of trees: {50, 100, 300, 500, 1000}
Number of features randomly chosen at each split: range {1…15}
Best performance: a random forest with at least 500 trees (beyond 500 there is no significant improvement) and two features (sometimes one or three features also got good results) gave the better performance most of the time (see the sketch below).
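A minimal Matlab sketch of this parameter sweep using TreeBagger; variable names are assumed, and the name-value arguments follow recent releases (older releases use 'NVarToSample' and 'oobpred' instead):

```matlab
bestErr = inf;
for nTrees = [50 100 300 500 1000]
    for nFeat = 1:15                       % features randomly chosen at each split
        rf = TreeBagger(nTrees, X, y, 'Method', 'regression', ...
                        'NumPredictorsToSample', nFeat, 'OOBPrediction', 'on');
        err = oobError(rf);                % out-of-bag error as a function of grown trees
        if err(end) < bestErr
            bestErr = err(end);
            best = [nTrees nFeat];         % remember the best (trees, features) pair
        end
    end
end
```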

32 Contents Introduction Related Work Methods and Implementation
Experimental Results Conclusion and Future Work

33 Experimental Results Dataset
Training datasets: all server models of CASP8 (34,266 models, 124 targets) and all server models of CASP9 (34,137 models, 117 targets)
Testing datasets: all server models of CASP10 (26,156 models, 103 targets), CASP10 QA stage one (20 models for each target), and CASP10 QA stage two (150 models for each target)

34 Experimental Results (All Model)
Features’ Correlation Coefficient of CASP10 All model

35 Experimental Results (All Model)
Performance of different QA methods for CASP10 All Models ("Best found" = number of targets for which the method directly finds the best model; "-" = value not shown on the slide)

Method          GDT-TS Corr  GDT-TS lost  Best found  TM Corr  TM lost  Best found
Consensus       0.8948       0.0553       6           0.8923   0.0555   4
Linear          0.5716       0.0725       10          0.5862   0.0709   7
Decision Tree   0.4195       0.1123       -           0.4536   0.0986   -
Neural Network  0.3705       0.0902       3           0.3530   0.0944   5
Boosting        0.5746       0.0728       -           0.5831   0.0764   -
Random Forest   0.5977       0.0776       -           0.6046   -        -
DDFIRE          0.2653       0.1421       2           0.2760   0.1454   -
DOPE            0.2027       0.1955       -           0.1673   0.2082   -
OPUS_CA         0.2993       0.1552       -           0.2941   0.1614   -
Proq2           0.4539       0.1063       -           0.4404   0.1140   9
RAPDF           0.2368       0.1517       -           0.2546   0.1567   -
RW              0.2564       0.1632       -           0.2852   0.1621   -

Consensus is still much better than the other methods.
Proq2 and the Linear Model can directly find the best model for more targets.
Linear Model, Boosting and Random Forest got the best performance among single-model QA methods.

36 Experimental Results (All Model)
True score lost: Consensus got the best true score lost (0.05+); Linear Model, Boosting and Random Forest follow (0.07+); Proq2 (0.1+).
Correlation: Consensus got the highest correlation coefficient (about 0.9); Linear Model, Boosting and Random Forest reach about 0.6; Proq2 about 0.45.

37 Experimental Results (Set20)
Features’ Correlation Coefficient of CASP10 Set 20

38 Experimental Results (Set 20)
Performance of different QA methods for CASP10 Set 20 (QA stage one)

Method          GDT-TS Corr  GDT-TS lost  Best found  TM Corr  TM lost  Best found
Consensus       0.6742       0.0647       18          0.6761   0.0655   17
Linear          0.5848       0.0632       21          0.5913   0.0637   25
Decision Tree   0.3930       0.0855       9           0.3707   0.0811   12
Neural Network  0.4818       0.0608       -           0.4913   0.0661   22
Boosting        0.5860       0.0512       -           0.5896   0.0599   -
Random Forest   0.5858       0.0607       -           0.5927   0.0549   23
DDFIRE          0.3219       0.1107       15          0.3128   0.1140   14
DOPE            0.2513       0.1105       10          0.2237   0.1136   11
OPUS_CA         0.3853       0.1061       -           0.3716   0.1074   -
Proq2           0.4947       0.0846       -           0.4846   0.0844   -
RAPDF           0.2872       0.0996       13          0.2832   0.0992   -
RW              0.3503       0.1128       -           0.3590   0.1121   16

Consensus still got the higher correlation coefficient.
Linear Model, Boosting and Random Forest got a better true score lost than Consensus.

39 Experimental Results (Set 20)
True score lost: Consensus (0.0647 GDT-TS, 0.0655 TM); Boosting got the best GDT-TS lost (0.0512); Random Forest got the best TM-score lost (0.0549); Proq2 (0.0846, 0.0844).
Correlation: Consensus got the highest correlation coefficient (0.67+); Linear Model, Boosting and Random Forest reach about 0.6; Proq2 about 0.5.

40 Experimental Results (Set150)
Features’ Correlation Coefficient of CASP10 Set 150

41 Experimental Results (Set 150)
Performance of different QA methods for CASP10 Set 150 (QA stage two)

Method          GDT-TS Corr  GDT-TS lost  Best found  TM Corr  TM lost  Best found
Consensus       0.5003       0.0541       6           0.4905   0.0537   4
Linear          0.4083       0.0608       9           0.4062   0.0568   -
Decision Tree   0.2158       0.0640       1           0.2492   0.0635   -
Neural Network  0.2590       0.0581       -           0.2060   0.0627   -
Boosting        0.4039       0.0513       -           0.3929   0.0506   7
Random Forest   0.3849       0.0583       -           0.3816   0.0519   -
DDFIRE          0.3171       0.0695       3           0.3142   0.0665   -
DOPE            0.2708       0.0751       -           0.2465   0.0749   -
OPUS_CA         0.3642       0.0685       8           0.3518   0.0666   -
Proq2           0.3292       0.0566       12          0.3200   0.0549   -
RAPDF           0.3087       0.0617       -           0.3240   0.0592   -
RW              0.2585       0.0677       5           0.2838   0.0622   -

Consensus still got the higher correlation coefficient.
Boosting got a better true score lost than Consensus.
Random Forest got more GDT-TS lost but less TM lost than Consensus.

42 Experimental Results (Set 150)
True score lost: Consensus (0.0541 GDT-TS, 0.0537 TM); Boosting got the best true score lost (0.0513, 0.0506); Proq2 (0.0566, 0.0549).
Correlation: Consensus got the highest correlation coefficient (about 0.5); Linear Model, Boosting and Random Forest reach about 0.4; Proq2 about 0.32.

43 Contents Introduction Related Work Methods and Implementation
Experimental Results Conclusion and Future Work

44 Conclusion
About Features:
Combining energy and structural properties can significantly improve the performance of single-model QA.
The matching between the predicted and actual secondary structure and solvent accessibility, together with the confidence of the predicted result, provides important information for evaluating the quality of a model.
Methods:
The linear model is an efficient and stable method with the smallest cost.
Neural Network and the single decision tree are not stable; when tested multiple times they show results with a much bigger difference (compared with the other methods).
Random Forest and Boosting are two advanced decision tree algorithms. They show comparable (or even better) performance within certain ranges of parameters and are much more stable than Neural Network and the single decision tree.
Random Forest and Boosting consume much more time to generate the large numbers of decision trees, and the models generated by these two algorithms are very large.

45 Future Work
Some of the software used to generate the features is not the newest version or does not use the newest reference database; updating the feature dataset may provide more information.
The testing of the Neural Network is not extensive enough to show the best performance of this algorithm: with different parameters, its performance showed big differences.
The parameter ranges for Boosting and Random Forest could be larger, especially for Boosting, where the decreasing tendency had not stopped at 1000 trees.
Based on the results of the different methods, for some targets single-model QA methods are much better than consensus-based methods; if a classifier could be trained to determine which targets are more suitable for single-model methods, the performance of QA could be improved further.

46 Thank You Q&A

