Slide 1: Sequence encoding, cross validation
Morten Nielsen, BioSys, DTU
Slide 2: Outline
- Sequence encoding: how to represent biological data
- Overfitting and cross-validation
- Method evaluation
Slide 3: Sequence encoding
Encoding of sequence data:
- Sparse encoding
- Blosum encoding
- Sequence profile encoding
- Reduced amino acid alphabets
Slide 4: Sparse encoding
[Table: sparse encoding of the amino acids; each amino acid (A, R, N, D, C, Q, E, ...) is assigned to one of 20 input neurons, so every residue becomes a binary vector with a single 1.]
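Sparse encoding is simple enough to sketch in a few lines of Python. This is not code from the slides; the alphabet order and function name are my own choices.

```python
# Sketch of sparse (one-hot) encoding: each amino acid becomes a
# 20-dimensional binary vector with a single 1 at its alphabet position,
# so every position of a peptide feeds 20 input neurons.
ALPHABET = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard amino acids

def sparse_encode(peptide):
    """Return a flat list of 20 * len(peptide) binary input values."""
    encoding = []
    for aa in peptide:
        vec = [0.0] * len(ALPHABET)
        vec[ALPHABET.index(aa)] = 1.0
        encoding.extend(vec)
    return encoding

print(len(sparse_encode("ALAKAAAAM")))  # a 9-mer gives 9 * 20 = 180 inputs
```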
Slide 5: BLOSUM encoding (Blosum50 matrix)
[Table: the 20 x 20 Blosum50 substitution matrix, with rows and columns labeled A R N D C Q E G H I L K M F P S T W Y V.]
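The slides only show the matrix itself. As a hedged sketch of how it is typically used for encoding, the snippet below assumes Biopython is installed and loads its bundled BLOSUM50 table; the scaling factor and function name are my own choices.

```python
# Sketch of Blosum encoding: each amino acid is replaced by its row of the
# Blosum50 substitution matrix, so similar amino acids (e.g. V and L) get
# similar input vectors instead of orthogonal ones.
from Bio.Align import substitution_matrices

BLOSUM50 = substitution_matrices.load("BLOSUM50")
ALPHABET = "ARNDCQEGHILKMFPSTWYV"

def blosum_encode(peptide, scale=5.0):
    """Encode a peptide as concatenated, scaled Blosum50 rows (20 values per residue)."""
    encoding = []
    for aa in peptide:
        encoding.extend(BLOSUM50[aa, other] / scale for other in ALPHABET)
    return encoding

print(len(blosum_encode("GILGFVFTM")))  # again 9 * 20 = 180 inputs per 9-mer
```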
Slide 6: Sequence encoding (continued)
- Sparse encoding: the vectors for V and L are orthogonal, so V·L = 0 (the encoding treats them as unrelated).
- Blosum encoding: the V and L vectors give a large dot product (highly related), while V·R is close to 0 (close to unrelated).
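The numerical vectors and dot products shown on the original slide are not reproduced above; continuing the two sketches from slides 4 and 5, they can be recomputed as follows (the exact Blosum values depend on the scaling chosen).

```python
# Dot products between per-residue vectors under the two encodings.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

print(dot(sparse_encode("V"), sparse_encode("L")))  # 0.0: V and L look unrelated
print(dot(blosum_encode("V"), blosum_encode("L")))  # large: V and L highly related
print(dot(blosum_encode("V"), blosum_encode("R")))  # much smaller: close to unrelated
```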
Slide 7: Sequence encoding (continued)
- Each amino acid is encoded by 20 variables, which can be highly inefficient.
- Can this number be reduced without losing predictive performance?
- Use a reduced amino acid alphabet: charge, volume, hydrophobicity, chemical descriptors (see the sketch below).
- Appealing, but in my experience it does not work :-)
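The slides conclude that this approach does not work well in practice, but for concreteness here is what such a reduced encoding could look like. The descriptor choice (Kyte-Doolittle hydrophobicity plus formal side-chain charge) and the scaling are my own illustrative assumptions, not taken from the slides.

```python
# Sketch of a reduced encoding: a few physico-chemical descriptors per residue
# instead of 20 sparse/Blosum inputs. Hydrophobicity is the Kyte-Doolittle
# scale; charge is the formal side-chain charge at neutral pH.
HYDROPHOBICITY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
CHARGE = {aa: 0.0 for aa in HYDROPHOBICITY}
CHARGE.update({"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0})

def reduced_encode(peptide):
    """Encode each residue by (scaled hydrophobicity, charge): 2 inputs per position."""
    encoding = []
    for aa in peptide:
        encoding.extend((HYDROPHOBICITY[aa] / 4.5, CHARGE[aa]))
    return encoding

print(len(reduced_encode("ALAKAAAAM")))  # 9 * 2 = 18 inputs instead of 180
```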
Slide 8: Evaluation of predictive performance
- A prediction method contains a very large set of parameters.
- A matrix for predicting binding of 9-meric peptides has 9 x 20 = 180 weights.
- Overfitting is a problem.
[Figure: temperature versus years; an illustration of overfitting.]
Slide 9: Evaluation of predictive performance
- Train a PSSM on the raw data: no pseudo counts, no sequence weighting.
- Fit 9 x 20 parameters to 9 x 10 data points.
- Evaluate on the training data: PCC = 0.97, AUC = 1.0, close to a perfect prediction method.
Binders: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Non-binders: MRSGRVHAV VRFNIDETP ANYIGQDGL AELCGDPGD QTRAVADGK GRPVPAAHP MTAQWWLDA FARGVVHVI LQRELTRLQ AVAEEMTKS
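The slides report only the resulting numbers. Below is a minimal sketch of the experiment, assuming a simple log2-odds PSSM with a uniform background of 1/20 and a large negative score standing in for log(0); both implementation details are my own choices.

```python
# Fit a PSSM to the 10 binders (no pseudo counts, no sequence weighting) and
# evaluate it on the same 20 training peptides. The near-perfect separation
# reflects overfitting of 180 weights to 90 data points, not predictive power.
import math

BINDERS = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
           "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]
NON_BINDERS = ["MRSGRVHAV", "VRFNIDETP", "ANYIGQDGL", "AELCGDPGD", "QTRAVADGK",
               "GRPVPAAHP", "MTAQWWLDA", "FARGVVHVI", "LQRELTRLQ", "AVAEEMTKS"]
ALPHABET = "ARNDCQEGHILKMFPSTWYV"

def fit_pssm(peptides, floor=-10.0):
    """Return 9 x 20 log-odds weights log2(f_observed / 0.05); floor replaces log(0)."""
    pssm = []
    for pos in range(9):
        column = [p[pos] for p in peptides]
        freqs = {aa: column.count(aa) / len(column) for aa in ALPHABET}
        pssm.append({aa: math.log2(f / 0.05) if f > 0 else floor
                     for aa, f in freqs.items()})
    return pssm

def score(pssm, peptide):
    return sum(pssm[pos][aa] for pos, aa in enumerate(peptide))

def auc(pos_scores, neg_scores):
    """Fraction of (binder, non-binder) pairs ranked correctly (ties count 0.5)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

pssm = fit_pssm(BINDERS)                         # 9 * 20 = 180 fitted weights
train_auc = auc([score(pssm, p) for p in BINDERS],
                [score(pssm, p) for p in NON_BINDERS])
print(train_auc)  # essentially 1.0 on the very data the PSSM was fitted to
```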
Slide 10: Evaluation of predictive performance
- Train a PSSM on permuted data: no pseudo counts, no sequence weighting.
- Fit 9 x 20 parameters to 9 x 10 data points.
- Evaluate on the training data: PCC = 0.97, AUC = 1.0, close to a perfect prediction method, AND the same performance as on the original data.
Binders (permuted): AAAMAAKLA AAKNLAAAA AKALAAAAR AAAAKLATA ALAKAVAAA IPELMRTNG FIMGVFTGL NVTKVVAWL LEPLNLVLK VAVIVSVPF
Non-binders: MRSGRVHAV VRFNIDETP ANYIGQDGL AELCGDPGD QTRAVADGK GRPVPAAHP MTAQWWLDA FARGVVHVI LQRELTRLQ AVAEEMTKS
Slide 11: Repeat on a large training set (229 ligands)
Slide 12: Cross validation
- Train on 4/5 of the data, test/evaluate on the remaining 1/5.
- This produces 5 different methods, each with a different prediction focus.
Slide 13: Method evaluation
- Use cross validation.
- Evaluate on the concatenated predictions, not as an average over the per-fold performances (see the sketch below).
- Even better: use an external evaluation set that is not part of the training data.
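The slides describe the protocol in words only. Below is a minimal Python sketch of 5-fold cross-validation with evaluation on the concatenated test predictions; train_model and predict are trivial placeholders standing in for whatever method (PSSM, neural network, ...) is being evaluated.

```python
# 5-fold cross-validation: every example is predicted exactly once by a model
# that never saw it, the five test sets are pooled, and performance is
# computed once on the concatenated predictions (not averaged over folds).
import random

def train_model(train_set):
    return sum(target for _, target in train_set) / len(train_set)  # placeholder

def predict(model, example):
    return model  # placeholder: a real method would score the example

def five_fold_cv(data, n_folds=5, seed=1):
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    models, pooled_pred, pooled_true = [], [], []
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_model(train)            # train on 4/5 of the data
        models.append(model)                  # keep all five models
        for example, target in test:          # predict the held-out 1/5
            pooled_pred.append(predict(model, example))
            pooled_true.append(target)
    return models, pooled_pred, pooled_true

# Toy (peptide, target) pairs, only to make the sketch runnable.
data = [("PEPTIDE%02d" % i, float(i % 2)) for i in range(20)]
models, preds, truth = five_fold_cv(data)
print(len(models), len(preds))  # 5 models; 20 pooled cross-validated predictions
# PCC, AUC, etc. are then computed once over preds versus truth.
```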
Slide 14: Method evaluation
- How is an external evaluation set evaluated against a 5-fold cross-validated training?
- The cross-validation produces 5 individual methods.
- Predict the evaluation set with each of the 5 methods and take the average (see the sketch below).
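Continuing the sketch above (reusing its models list and predict placeholder), the external evaluation set is scored by all five cross-validation models and the predictions are averaged; the example inputs here are arbitrary placeholders.

```python
# Ensemble prediction for an external evaluation set: average the scores
# from the five cross-validation models.
def ensemble_predict(models, example):
    return sum(predict(model, example) for model in models) / len(models)

external_set = ["EXTERNAL01", "EXTERNAL02"]   # placeholder evaluation examples
for example in external_set:
    print(example, ensemble_predict(models, example))
```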
Slide 15: Method evaluation
Slide 16: Method evaluation
Slide 17: 5-fold training: which PSSM to choose?
Slide 18: 5-fold training: use them all.
Slide 19: Summary
- Define which encoding scheme to use: sparse, Blosum, reduced alphabet, or combinations of these.
- Deal with overfitting.
- Evaluate the method using cross-validation.
- Evaluate external data using the method ensemble.