Published by Jared Wood. Modified over 9 years ago.
1 Improve Protein Disorder Prediction Using Homology
Instructor: Dr. Slobodan Vucetic
Student: Kang Peng
2 Protein Disorder Prediction
What is a protein?
- A protein is usually a chain built from 20 different amino acids (AAs), so it can be represented as a string over a 20-character alphabet.
- A protein usually has a 3D structure, which is important to its function.
What is a disordered protein?
- A disordered protein is one in which part or all of the chain has NO identified 3D structure.
Can protein disorder be predicted?
- The predictor developed by Dr. Vucetic predicts protein disorder with an accuracy of 82.6%.
Current dataset used to train the disorder predictor
- 145 proteins with a CONFIRMED long disordered region
- 130 proteins that are totally ordered
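As a minimal illustration of the string representation described on this slide, the sketch below checks that a sequence uses only the 20 standard one-letter amino acid codes and builds per-residue order/disorder labels. The function names, the example sequence, and the half-open 0-based region convention are assumptions made for illustration; they are not taken from the presentation.

```python
# The 20 standard amino acids, written with their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def is_valid_protein(seq: str) -> bool:
    """Return True if seq is non-empty and uses only the 20 standard codes."""
    return len(seq) > 0 and all(aa in AMINO_ACIDS for aa in seq.upper())


def label_residues(seq: str, disordered_regions: list[tuple[int, int]]) -> list[int]:
    """Label each residue 1 (disordered) or 0 (ordered).

    Regions are (start, end) pairs, 0-based and half-open; this convention
    is an assumption of the sketch.
    """
    labels = [0] * len(seq)
    for start, end in disordered_regions:
        for i in range(max(start, 0), min(end, len(seq))):
            labels[i] = 1
    return labels


if __name__ == "__main__":
    seq = "MKTAYIAK"  # hypothetical sequence, for illustration only
    print(is_valid_protein(seq))
    print(label_residues(seq, [(2, 5)]))
```

A per-residue label vector of this kind is one common way to attach disorder annotations to a sequence before training a predictor.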
3 The Objective
What are homologous/similar sequences?
- Proteins that may derive from the same ancestor. They tend to have SIMILAR amino acid sequences.
Where to find homologous/similar sequences?
- For a given protein (its amino acid sequence), homologous/similar sequences can be found using the NCBI BLAST Web server (http://www.ncbi.nlm.nih.gov/BLAST/).
The hypothesis
- Homologous/similar sequences may have similar structures and therefore similar disordered regions, so we can use similar sequences to enhance the training set.
Improve disorder prediction using homologous/similar sequences
4 Methodology
To enhance the training set using homologous sequences:
- Find homologous sequences that have segments similar to the disordered proteins in the original dataset
- Remove sequences that are too similar to the original sequences
- Label these segments as disordered
- Train disorder predictors with the new data
5 Get Homologous Sequences
- Each disordered segment in the original dataset is sent to the NCBI BLAST Web server
- Done automatically by a Visual Basic program
- Search against the non-redundant database (nr), returning sequences with E-value < 10
- 6380 sequences found
- Discard sequences that are too similar to the original sequences
- 444 sequences left in total, corresponding to 55 original disordered sequences
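The filtering described on this slide can be sketched roughly as follows. The slides do not say how "too similar" was measured, so the percent-identity cutoff of 90 is purely an assumption, as are the `Hit` record, its field names, and the function name. Only the E-value < 10 threshold comes from the presentation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST hit (hypothetical record layout for this sketch)."""
    query_id: str            # original disordered segment sent to BLAST
    subject_id: str          # sequence returned from the database
    evalue: float
    percent_identity: float  # identity to the query, 0-100


def select_homologs(hits, max_evalue=10.0, max_identity=90.0):
    """Keep hits with E-value below max_evalue that are not too similar
    to the query, grouped by the original sequence they matched.

    max_identity is an assumed cutoff; the slides give no value.
    """
    families = defaultdict(list)
    seen = set()
    for hit in hits:
        if hit.evalue >= max_evalue:
            continue  # not significant enough
        if hit.percent_identity >= max_identity:
            continue  # near-duplicate of the original sequence
        key = (hit.query_id, hit.subject_id)
        if key in seen:
            continue  # the same subject was already kept for this query
        seen.add(key)
        families[hit.query_id].append(hit.subject_id)
    return dict(families)


if __name__ == "__main__":
    hits = [
        Hit("q1", "s1", 1e-20, 99.0),  # dropped: too similar
        Hit("q1", "s2", 1e-5, 60.0),   # kept
        Hit("q1", "s3", 50.0, 30.0),   # dropped: E-value too high
        Hit("q2", "s4", 0.5, 45.0),    # kept
    ]
    print(select_homologs(hits))
```

Grouping the kept hits by query also gives the per-original-sequence families that a later training step can use.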
6 Which BLAST to Use?
Standard BLAST
- We may need a scoring matrix specially developed for disordered protein alignments
PSI-BLAST
- It is adaptive and can build a scoring matrix based on the results of the previous iteration, so the choice of the initial scoring matrix is not very important
Current experiment
- PSI-BLAST with initial matrix BLOSUM62, using the result of the 1st iteration
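To give a feel for what "building a scoring matrix from the previous iteration" means, here is a heavily simplified toy: it turns a set of already-aligned hit sequences into position-specific log-odds scores. This is not the PSI-BLAST implementation, which uses sequence weighting, real background frequencies, and data-dependent pseudocounts. The uniform background, the fixed pseudocount, and the function name are all assumptions of this sketch.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_pssm(aligned, pseudocount=1.0):
    """Build a toy position-specific scoring matrix.

    aligned: equal-length sequences; characters outside the 20 standard
    codes (for example '-') are treated as gaps and ignored.
    Returns one dict per column mapping amino acid -> log2-odds score
    against a uniform background (a simplifying assumption).
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned:
            if seq[col] in counts:
                counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append(
            {aa: math.log2(counts[aa] / total / background) for aa in AMINO_ACIDS}
        )
    return pssm


if __name__ == "__main__":
    matrix = toy_pssm(["MKT", "MRT", "MKS"])  # hypothetical alignment
    print(round(matrix[0]["M"], 2), round(matrix[0]["A"], 2))
```

Residues seen often in a column get positive scores and unseen residues get negative ones, so the scores depend on the hits found in the previous search and not only on the starting matrix.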
7 Train Disorder Predictor
Group sequences into families
- Group newly found sequences according to the original sequences they are similar to, giving 145 families in total (only 55 families contain new sequences)
Neural network + bagging
- Randomly sample a BALANCED training set and train a neural network (NN) on it. Repeat 10 times and combine the 10 NNs by majority voting
Cross-validation
- Randomly divide sequences into groups, use 1 group as the testing set, and randomly sample the training set from the remaining groups
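The balanced sampling and majority voting described on this slide can be sketched as below. To keep the example self-contained, a nearest-centroid classifier stands in for the neural network; that substitution, the tie-breaking rule (ties go to the disorder class), the sampling with replacement, and all function names are assumptions of this sketch and are not described in the presentation.

```python
import random


def balanced_sample(X, y, rng):
    """Draw the same number of disorder (1) and order (0) examples,
    sampling with replacement (an assumption of this sketch)."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    n = min(len(pos), len(neg))
    idx = rng.choices(pos, k=n) + rng.choices(neg, k=n)
    return [X[i] for i in idx], [y[i] for i in idx]


def train_centroid(X, y):
    """Stand-in base learner: store one mean vector per class."""
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    return centroid(0), centroid(1)


def predict_centroid(model, x):
    """Predict the class whose centroid is nearer (squared distance)."""
    c0, c1 = model
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    return 1 if d1 < d0 else 0


def train_bagged(X, y, n_models=10, seed=0):
    """Train n_models base learners, each on its own balanced sample."""
    rng = random.Random(seed)
    return [train_centroid(*balanced_sample(X, y, rng)) for _ in range(n_models)]


def predict_bagged(models, x):
    """Majority vote; a tie is resolved toward class 1 (assumed rule)."""
    votes = sum(predict_centroid(m, x) for m in models)
    return 1 if 2 * votes >= len(models) else 0


if __name__ == "__main__":
    # Hypothetical 2-feature data: four ordered and two disordered examples.
    X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [0.0, 0.0], [1.0, 0.9], [0.9, 1.1]]
    y = [0, 0, 0, 0, 1, 1]
    models = train_bagged(X, y)
    print(predict_bagged(models, [1.0, 1.0]), predict_bagged(models, [0.05, 0.05]))
```

Balancing each sample keeps the majority class from dominating any single learner, and voting over several learners smooths out the variation between samples.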
8 Results
The classification accuracies (%):

(a) Without Homologous Sequences
Experiment  Disorder  Order  All
1           76.10     89.75  82.92
2           75.02     89.61  82.32
3           74.90     90.29  82.60
4           74.08     89.64  81.86
5           73.63     89.56  81.59
6           75.05     89.47  82.26
7           75.07     89.90  82.48
8           74.80     88.72  81.76
9           74.85     90.07  82.46
10          74.61     89.51  82.06
Avg         74.81     89.65  82.23
Std         0.65      0.42   0.41

(b) With Homologous Sequences
Experiment  Disorder  Order  All
1           79.59     90.30  84.94
2           80.10     89.77  84.93
3           80.09     89.09  84.59
4           80.32     89.50  84.91
5           77.99     89.42  83.71
6           79.11     89.82  84.46
7           79.81     89.13  84.47
8           78.20     90.40  84.30
9           80.23     89.70  84.97
10          78.74     89.77  84.26
Avg         79.42     89.69  84.55
Std         0.86      0.43   0.40
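The Avg and Std rows can be recomputed from the ten per-experiment values. The sketch below does this for the Disorder and All columns and reports the gain from adding homologous sequences. It assumes the Std row is the sample standard deviation (n - 1 denominator); the variable and function names are illustrative.

```python
from statistics import mean, stdev

# Per-experiment accuracies (%) copied from tables (a) and (b).
without_homologs = {
    "disorder": [76.10, 75.02, 74.90, 74.08, 73.63, 75.05, 75.07, 74.80, 74.85, 74.61],
    "all": [82.92, 82.32, 82.60, 81.86, 81.59, 82.26, 82.48, 81.76, 82.46, 82.06],
}
with_homologs = {
    "disorder": [79.59, 80.10, 80.09, 80.32, 77.99, 79.11, 79.81, 78.20, 80.23, 78.74],
    "all": [84.94, 84.93, 84.59, 84.91, 83.71, 84.46, 84.47, 84.30, 84.97, 84.26],
}


def summarize(values):
    """Return (mean, sample standard deviation), rounded to 2 decimals."""
    return round(mean(values), 2), round(stdev(values), 2)


def gain(column):
    """Difference in mean accuracy (percentage points) for one column."""
    return mean(with_homologs[column]) - mean(without_homologs[column])


if __name__ == "__main__":
    for column in ("disorder", "all"):
        print(
            column,
            summarize(without_homologs[column]),
            summarize(with_homologs[column]),
            round(gain(column), 2),
        )
```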
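9 Conclusion
After adding homologous sequences to the training set, overall prediction accuracy increases by about 2 percentage points (from 82.23% to 84.55% on average). The gain comes from the disorder class, whose accuracy rises from 74.81% to 79.42%, while accuracy on the order class is essentially unchanged (89.65% vs. 89.69%).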
10 Thank You!