1
In silico immune response prediction based on peptide array data Mitja Luštrek Institute for Biostatistics and Informatics in Medicine and Aging Research
2
Overview Introduction Peptide arrays, the task of prediction, machine learning Immune response prediction Prediction methods and results, insights into immune system Limitations and future work Why do we not do better and how we might
3
Peptide arrays Peptides (antigen)
4
Peptide arrays Peptides (antigen) Serum (antibodies)
5
Peptide arrays Peptides (antigen) Serum (antibodies) Immune response
6
In silico prediction Predict in silico which peptides evoke an immune response
7
In silico prediction Predict in silico which peptides evoke an immune response Save costs by putting only the most promising peptides on an array Gain insight into the workings of the immune system
8
Task at hand Data: – 10,218 peptides (15-mers) with negative response – 3,420 peptides (15-mers) with positive response
9
Task at hand Data: – 10,218 peptides (15-mers) with negative response – 3,420 peptides (15-mers) with positive response multiplied to 10,218 for a balanced data set
10
Task at hand Data: – 10,218 peptides (15-mers) with negative response – 3,420 peptides (15-mers) with positive response multiplied to 10,218 for a balanced data set Method: – machine learning – 70 % data for training, 30 % for testing
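A rough sketch (not the authors' actual pipeline) of how the balancing and the 70/30 split could be done in Python; `peptides` and the label strings "pos"/"neg" are hypothetical names for the data structure.

```python
# Rough sketch, not the authors' pipeline: balance the classes by replicating
# positive peptides, then hold out 30 % of the data for testing.
# `peptides` is assumed to be a list of (sequence, label) pairs.
import random

def balance_and_split(peptides, test_fraction=0.30, seed=0):
    rng = random.Random(seed)
    neg = [p for p in peptides if p[1] == "neg"]
    pos = [p for p in peptides if p[1] == "pos"]
    # Replicate the minority (positive) class until it matches the majority.
    copies, remainder = divmod(len(neg), len(pos))
    balanced_pos = pos * copies + rng.sample(pos, remainder)
    data = neg + balanced_pos
    rng.shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]   # training set, test set
```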
11
Machine learning
Labeled (training) data, one instance (peptide) per row:
Attribute 1 | Attribute 2 | ... | Class
a11 | a12 | ... | neg
a21 | a22 | ... | pos
...
12
Machine learning
Labeled (training) data, one instance (peptide) per row:
Attribute 1 | Attribute 2 | ... | Class
a11 | a12 | ... | neg
a21 | a22 | ... | pos
...
A machine learning algorithm produces a classifier:
if Attribute 1 is such and such and Attribute 2 is such and such ... then Class = neg
if Attribute 1 is such and such and Attribute 2 is such and such ... then Class = pos
13
Machine learning
Unlabeled (test) data:
Attribute 1 | Attribute 2 | ... | Class
b11 | b12 | ... | ?
b21 | b22 | ... | ?
...
14
Machine learning
Unlabeled (test) data:
Attribute 1 | Attribute 2 | ... | Class
b11 | b12 | ... | ?
b21 | b22 | ... | ?
...
Applying the classifier gives classified data:
Attribute 1 | Attribute 2 | ... | Class
b11 | b12 | ... | pos
b21 | b22 | ... | neg
...
15
Support vector machine (SVM) Attribute 1 Attribute 2
16
Support vector machine (SVM) Attribute 1 Attribute 2
17
Support vector machine (SVM) Attribute 3 Attribute 1 Attribute 2
18
RIPPER rules (Repeated Incremental Pruning to Produce Error Reduction) Split data: – growing set (2/3) to grow new rules – pruning set (1/3) to prune them
19
RIPPER rules Split data: – growing set (2/3) to grow new rules – pruning set (1/3) to prune them Grow a rule: (Attribute 1 = a 1 ) Selection based on information gain
20
RIPPER rules Split data: – growing set (2/3) to grow new rules – pruning set (1/3) to prune them Grow a rule: (Attribute 1 = a 1 ) and (Attribute 2 <= a 2 )
21
RIPPER rules Split data: – growing set (2/3) to grow new rules – pruning set (1/3) to prune them Grow a rule: (Attribute 1 = a 1 ) and (Attribute 2 <= a 2 )... Class = c Until all instances matched by the rule belong to class c (the rule never misclassifies)
22
RIPPER rules Split data: – growing set (2/3) to grow new rules – pruning set (1/3) to prune them Grow a rule: (Attribute 1 = a 1 ) and (Attribute 2 <= a 2 )... Class = c The rule is perfect on the growing set. But does it overfit the growing set? We test it on the pruning set.
23
RIPPER rules Split data: – growing set (2/3) to grow new rules – pruning set (1/3) to prune them Grow a rule: (Attribute 1 = a 1 ) and (Attribute 2 <= a 2 )... Class = c Prune a rule: (Attribute 1 = a 1 ) and (Attribute 2 <= a 2 )... Class = c
24
RIPPER rules Split data: – growing set (2/3) to grow new rules – pruning set (1/3) to prune them Grow a rule: (Attribute 1 = a 1 ) and (Attribute 2 <= a 2 )... Class = c Prune a rule: (Attribute 1 = a 1 )... Class = c As long as removal improves performance on the pruning set
25
RIPPER rules Grow and prune a rule Delete instances covered by the rule Repeat the process until the last rule increases the total description length (length of the rules + misclassified instances) by more than a constant
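A minimal sketch of the pruning step described above, assuming a rule is simply a list of boolean conditions over an instance; plain precision on the pruning set stands in for RIPPER's actual pruning metric, which differs slightly.

```python
# Minimal sketch of RIPPER-style pruning: drop conditions from the end of the
# rule as long as that does not hurt its precision on the pruning set.
def rule_matches(rule, instance):
    return all(condition(instance) for condition in rule)

def pruning_precision(rule, target_class, pruning_set):
    covered = [cls for inst, cls in pruning_set if rule_matches(rule, inst)]
    return covered.count(target_class) / len(covered) if covered else 0.0

def prune_rule(rule, target_class, pruning_set):
    best = list(rule)
    best_score = pruning_precision(best, target_class, pruning_set)
    while len(best) > 1:
        candidate = best[:-1]                        # drop the last condition
        score = pruning_precision(candidate, target_class, pruning_set)
        if score >= best_score:                      # keep pruning while it helps
            best, best_score = candidate, score
        else:
            break
    return best
```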
26
Related work
EL-Manzalawy et al. (2008)
Data: 934 epitopes, 934 random peptides
Compared several machine learning methods
Best performance by support vector machine + string kernel
SVM accuracy: their data, string kernel: 69.59 %; our data, string kernel: 78.11 % (first baseline)
27
Immune response prediction
28
First attempt: AA counts
Attributes: one count per amino acid, A (alanine count), C (cysteine count), ..., Y, plus Class (positive or negative immune response)
29
First attempt: AA counts
Attributes: one count per amino acid, A (alanine count), C (cysteine count), ..., Y, plus Class (positive or negative immune response)
Example peptide: QGDYCRPTVQEERKL, response 35 (negative)
A C D E F G H I K L M N P Q R S T V W Y | Class
0 1 1 2 0 1 0 0 1 1 0 0 1 2 2 0 1 1 0 1 | neg
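The count attributes are straightforward to compute; a small Python sketch that reproduces the example row above:

```python
# Sketch: the 20 amino-acid-count attributes in the fixed order from the slide.
from collections import Counter

AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"

def aa_count_attributes(peptide):
    counts = Counter(peptide)
    return [counts.get(aa, 0) for aa in AA_ORDER]

# The example peptide from the slide (response 35, i.e. negative):
print(aa_count_attributes("QGDYCRPTVQEERKL"))
# [0, 1, 1, 2, 0, 1, 0, 0, 1, 1, 0, 0, 1, 2, 2, 0, 1, 1, 0, 1]
```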
30
First attempt: AA counts
SVM / Rules accuracy:
String kernel: 78.11 % / –
AA counts: 79.44 % / 73.93 % (second baseline)
Simple attributes are more accurate than the string kernel. SVM is more accurate than rules, but rules can be understood by a human.
31
First attempt: AA counts (Y = 0) and (F = 0) and (E >= 1) Class = neg (Y = 0) and (W = 0) and (R = 0) and (F <= 1) Class = neg (Y = 0 or <= 1) and... Class = neg... and (F = 0 or <= 1) and... Class = neg... and (W = 0 or <= 1) and... Class = neg... otherwise Class = pos Tyrosine Tryptophan Phenylalanine
32
First attempt: AA counts (Y = 0) and (F = 0) and (E >= 1) Class = neg (Y = 0) and (W = 0) and (R = 0) and (F <= 1) Class = neg (Y = 0 or <= 1) and... Class = neg... and (F = 0 or <= 1) and... Class = neg... and (W = 0 or <= 1) and... Class = neg... otherwise Class = pos No tyrosine in peptide Class = neg, otherwise Class = pos (66.53 %) No tryptophan in peptide Class = neg, otherwise Class = pos (59.27 %) No phenylalanine in peptide Class = neg, otherwise Class = pos (63.21 %) (Y = tyrosine, F = phenylalanine, W = tryptophan)
33
AA counts in sections of peptide AA counts ignore position in peptide
34
AA counts in sections of peptide
AA counts ignore position in peptide.
Peptide | X | Y | Class
XXYY | 2 | 2 | pos
XXYY | 2 | 2 | pos
XXYY | 2 | 2 | pos
YYXX | 2 | 2 | neg
YYXX | 2 | 2 | neg
YYXX | 2 | 2 | neg
35
AA counts in sections of peptide
AA counts ignore position in peptide.
Peptide | X | Y | Class | X left | Y left | X right | Y right
XXYY | 2 | 2 | pos | 2 | 0 | 0 | 2
XXYY | 2 | 2 | pos | 2 | 0 | 0 | 2
XXYY | 2 | 2 | pos | 2 | 0 | 0 | 2
YYXX | 2 | 2 | neg | 0 | 2 | 2 | 0
YYXX | 2 | 2 | neg | 0 | 2 | 2 | 0
YYXX | 2 | 2 | neg | 0 | 2 | 2 | 0
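A sketch of the per-section variant, assuming the 15-mer is cut into k roughly equal pieces (the exact splitting used in the experiments is not specified here):

```python
# Sketch: split the peptide into k roughly equal sections and compute the 20
# amino-acid counts separately in each section (a position-aware variant).
from collections import Counter

AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"

def section_count_attributes(peptide, k):
    n = len(peptide)
    bounds = [round(i * n / k) for i in range(k + 1)]
    attributes = []
    for i in range(k):
        counts = Counter(peptide[bounds[i]:bounds[i + 1]])
        attributes.extend(counts.get(aa, 0) for aa in AA_ORDER)  # 20 per section
    return attributes

# 3 sections of a 15-mer give 3 x 20 = 60 attributes
print(len(section_count_attributes("QGDYCRPTVQEERKL", 3)))  # 60
```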
36
AA counts in sections of peptide
SVM accuracy increases a bit, rules accuracy decreases. SVM better at coping with many attributes?
SVM / Rules:
AA counts: 79.44 % / 73.93 %
AA counts, 2 sections: 79.98 % / 73.84 %
AA counts, 3 sections: 80.31 % / 70.87 %
AA counts, 4 sections: 79.99 % / 69.05 %
AA counts, 5 sections: 79.98 % / 71.49 %
37
AA count differences Machine learning cannot infer all relations automatically; it needs help
38
AA count differences
Machine learning cannot infer all relations automatically; it needs help.
X | Y | Z | Class
1 | 2 | 5 | pos
2 | 3 | 3 | pos
3 | 4 | 1 | pos
2 | 1 | 5 | neg
3 | 2 | 3 | neg
4 | 3 | 1 | neg
39
AA count differences
Machine learning cannot infer all relations automatically; it needs help.
X | Y | Z | Class | X – Y | ...
1 | 2 | 5 | pos | –1 | ...
2 | 3 | 3 | pos | –1 | ...
3 | 4 | 1 | pos | –1 | ...
2 | 1 | 5 | neg | 1 | ...
3 | 2 | 3 | neg | 1 | ...
4 | 3 | 1 | neg | 1 | ...
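A sketch of the difference attributes: for every pair of amino acids, the difference of their counts is added explicitly (190 extra attributes for 20 amino acids):

```python
# Sketch: explicit count-difference attributes (AA1 count minus AA2 count) for
# every pair of amino acids, since the learner does not derive them on its own.
from collections import Counter
from itertools import combinations

AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"

def count_difference_attributes(peptide):
    counts = Counter(peptide)
    return [counts.get(a, 0) - counts.get(b, 0)
            for a, b in combinations(AA_ORDER, 2)]       # 20 choose 2 = 190

print(len(count_difference_attributes("QGDYCRPTVQEERKL")))  # 190
```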
40
AA count differences
SVM / Rules:
AA counts: 79.44 % / 73.93 %
AA count differences: 78.48 % / 72.91 %
AA counts + AA count differences: 79.39 % / 75.29 %
SVM accuracy decreases a bit, rules accuracy increases. The changes are small.
41
AA count differences (E – Y <= –1) and (N – R <= –1) Class = pos (E – Y <= 0) and (D – Y <= – 1) and (Q – Y <= – 2) Class = pos (E – something <= –1 or 0) and... Class = pos (something – Y <= –2, –1 or 0) and... Class = pos... otherwise Class = neg Tyrosine Glutamic acid
42
AA count differences (E – Y <= –1) and (N – R <= –1) Class = pos (E – Y <= 0) and (D – Y <= – 1) and (Q – Y <= – 2) Class = pos (E – something <= –1 or 0) and... Class = pos (something – Y <= –2, –1 or 0) and... Class = pos... otherwise Class = neg No tyrosine in peptide Class = neg, otherwise Class = pos (66.53 %) No glutamic acid in peptide Class = pos, otherwise Class = neg (59.77 %) Tyrosine Glutamic acid
43
Substring counts Perhaps single AAs are not informative enough and we should count longer substrings
44
Substring counts
Perhaps single AAs are not informative enough and we should count longer substrings.
Peptide | X | Y | Class
XXXYY | 3 | 2 | pos
YXXXY | 3 | 2 | pos
YYXXX | 3 | 2 | pos
XYXYX | 3 | 2 | neg
XYYXX | 3 | 2 | neg
XXYYX | 3 | 2 | neg
45
Substring counts
Perhaps single AAs are not informative enough and we should count longer substrings.
Peptide | X | Y | Class | XXX | ...
XXXYY | 3 | 2 | pos | 1 | ...
YXXXY | 3 | 2 | pos | 1 | ...
YYXXX | 3 | 2 | pos | 1 | ...
XYXYX | 3 | 2 | neg | 0 | ...
XYYXX | 3 | 2 | neg | 0 | ...
XXYYX | 3 | 2 | neg | 0 | ...
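A sketch of contiguous substring counting (lengths 1 to 3), returning a sparse dictionary of counts; `max_len` is the only assumed parameter name.

```python
# Sketch: count every contiguous substring of length 1..max_len; each distinct
# substring becomes one (sparse) attribute.
from collections import Counter

def substring_counts(peptide, max_len=3):
    counts = Counter()
    for length in range(1, max_len + 1):
        for start in range(len(peptide) - length + 1):
            counts[peptide[start:start + length]] += 1
    return counts

print(substring_counts("XXXYY"))
# X: 3, Y: 2, XX: 2, XY: 1, YY: 1, XXX: 1, XXY: 1, XYY: 1 (a Counter)
```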
46
Substring counts
SVM / Rules:
AA counts: 79.44 % / 73.93 %
Counts of substrings of lengths up to 2: 78.92 % / 74.00 %
Counts of substrings of lengths up to 3: 79.01 % / 73.91 %
The changes are small. Only one rule with substrings of length above 1:
(Y = 0) and (F = 0) and (W = 0) and (M = 0) and (I = 0) and (LL = 0) Class = neg
47
Substrings with gaps Machine learning needs recurring patterns Small counts for substrings of length above 1 – little recurrence Increase substring counts by allowing gaps between AAs
48
Substrings with gaps
Machine learning needs recurring patterns. Small counts for substrings of length above 1 – little recurrence. Increase substring counts by allowing gaps between AAs.
XYXABCXXY: ABC × 1
YYAXBCYYX: ABC × 0.5
XYABXXCYX: ABC × 0.5²
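A sketch of the gapped variant, assuming each skipped position inside a match multiplies its contribution by 0.5 (as in the ABC examples above) and that gaps between consecutive pattern letters are at most `max_gap` long; both the weighting and the parameter name are assumptions.

```python
# Sketch: count occurrences of `pattern` in `peptide`, allowing gaps of up to
# max_gap positions between consecutive pattern letters; every skipped position
# multiplies the contribution of that occurrence by 0.5.
def gapped_count(peptide, pattern, max_gap=2):
    def count_from(pos, idx, gaps):
        if idx == len(pattern):
            return 0.5 ** gaps                     # weight of one completed match
        # the first pattern letter may start anywhere; later letters may skip
        # at most max_gap positions
        max_offset = len(peptide) - pos - 1 if idx == 0 else max_gap
        total = 0.0
        for offset in range(max_offset + 1):
            nxt = pos + offset
            if nxt >= len(peptide):
                break
            if peptide[nxt] == pattern[idx]:
                total += count_from(nxt + 1, idx + 1, gaps + (offset if idx else 0))
        return total
    return count_from(0, 0, 0)

print(gapped_count("XYXABCXXY", "ABC"))  # 1.0
print(gapped_count("YYAXBCYYX", "ABC"))  # 0.5
print(gapped_count("XYABXXCYX", "ABC"))  # 0.25
```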
49
Substrings with gaps
SVM / Rules:
AA counts: 79.44 % / 73.93 %
Lengths up to 3, no gaps: 79.01 % / 73.91 %
Lengths up to 3, gap lengths up to 1: 79.11 % / 74.37 %
Lengths up to 3, gap lengths up to 2: 78.83 % / 74.62 %
Lengths up to 3, gap lengths up to 3: 79.10 % / 74.71 %
Lengths up to 3, gap lengths up to 4: 78.91 % / 75.54 %
SVM accuracy decreases a bit, rules accuracy increases. The changes are small.
50
Substrings with gaps Still no rules with substrings of length 3 More rules with substrings of length 2
51
Substrings with gaps
Still no rules with substrings of length 3. More rules with substrings of length 2.
(Y = 0) and (F = 0) and (E >= 1) and (W = 0) and (RL = 0) Class = neg... otherwise Class = pos
Leucine (L): positive response when in a pair.
Substring | Count
RL/LR | 5
LL | 5
LP | 2
SE | 2
KK | 2
EL | 1
SL | 1
RR | 1
PP | 1
52
Classes of AAs Perhaps individual AAs are too specific and we should merge similar AAs into classes
53
Classes of AAs
Perhaps individual AAs are too specific and we should merge similar AAs into classes.
Flexibility index (AAs ranked from least to most flexible):
W -0.727, Y -0.721, F -0.719, C -0.693, I -0.682, V -0.669, H -0.662, L -0.631, M -0.626, A -0.605, G -0.537, T -0.525, R -0.448, S -0.423, N -0.381, Q -0.369, D -0.279, P -0.271, E -0.160, K -0.043
Three classes: inflexible (1), medium (2), flexible (3)
54
Classes of AAs
Perhaps individual AAs are too specific and we should merge similar AAs into classes.
Flexibility index (AAs ranked from least to most flexible):
W -0.727, Y -0.721, F -0.719, C -0.693, I -0.682, V -0.669, H -0.662, L -0.631, M -0.626, A -0.605, G -0.537, T -0.525, R -0.448, S -0.423, N -0.381, Q -0.369, D -0.279, P -0.271, E -0.160, K -0.043
Three classes: inflexible (1), medium (2), flexible (3)
Example peptide: QGDYCRPTVQEERKL → 323113321333332
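A sketch of the class translation; the exact group boundaries are an assumption inferred from the worked example (WYFCIV inflexible, HLMAGT medium, RSNQDPEK flexible). The resulting class string can then be fed to the same substring-count attributes as before.

```python
# Sketch: translate a peptide into flexibility classes.  The three groups below
# are an assumption inferred from the slide's worked example.
FLEX_CLASS = {aa: "1" for aa in "WYFCIV"}      # inflexible
FLEX_CLASS.update({aa: "2" for aa in "HLMAGT"})    # medium
FLEX_CLASS.update({aa: "3" for aa in "RSNQDPEK"})  # flexible

def flexibility_string(peptide):
    return "".join(FLEX_CLASS[aa] for aa in peptide)

print(flexibility_string("QGDYCRPTVQEERKL"))  # 323113321333332
```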
55
Classes of AAs
Counts of substrings of lengths up to 3 over the class alphabet (gaps of length 1 allowed):
1, 2, 3
11, 12, 13, 21, 22, 23, 31, 32, 33
111, 112, 113, 121, 122, 123, 131, 132, 133, 211, 212, 213, 221, 222, 223, 231, 232, 233, 311, 312, 313, 321, 322, 323, 331, 332, 333
56
Classes of AAs
SVM / Rules:
AA counts: 79.44 % / 73.93 %
AAs, lengths up to 3, gaps up to 1: 79.11 % / 74.37 %
Aromatic/aliphatic: 72.51 % / 72.49 %
Basic/acidic: 61.15 % / 61.06 %
Flexibility: 69.24 % / 67.46 %
Hydrophobicity: 58.92 % / 56.85 %
Polarity: 64.94 % / 63.50 %
Size: 61.89 % / 60.58 %
Turns index: 58.92 % / 55.57 %
57
Aromatic/aliphatic AAs
Aromatic: phenylalanine (F), histidine (H), tryptophan (W), tyrosine (Y)
Aliphatic: isoleucine (I), leucine (L), valine (V)
Other: all the rest
58
Aromatic/aliphatic AAs
Aromatic: phenylalanine (F), histidine (H), tryptophan (W), tyrosine (Y)
Aliphatic: isoleucine (I), leucine (L), valine (V)
Other: all the rest
No tyrosine in peptide Class = neg, otherwise Class = pos (66.53 %)
No tryptophan in peptide Class = neg, otherwise Class = pos (59.27 %)
No phenylalanine in peptide Class = neg, otherwise Class = pos (63.21 %)
59
Aromatic/aliphatic AAs
Aromatic: phenylalanine (F), histidine (H), tryptophan (W), tyrosine (Y)
Aliphatic: isoleucine (I), leucine (L), valine (V)
Other: all the rest
No tyrosine in peptide Class = neg, otherwise Class = pos (66.53 %)
No tryptophan in peptide Class = neg, otherwise Class = pos (59.27 %)
No phenylalanine in peptide Class = neg, otherwise Class = pos (63.21 %)
1 or no aromatic AAs in peptide Class = neg, otherwise Class = pos (71.37 %)
60
Flexibility of AAs (F1 >= 5) Class = pos (F1 >= 4) and (F33 = 1.5) and (F231 = 0) and (F213 <= 1) Class = pos (F1 >= 4) and (F33 <= 3.5) Class = pos (F1 >= 3) and (F2 >= 5) and (F311 >= 1) Class = pos (F1 >= 3) and (F11 = 0) and (F322 = 1) Class = pos otherwise Class = neg
61
Flexibility of AAs Inflexible AAs indicate pos Flexibility linked to epitope propensity in the literature (F1 >= 5) Class = pos (F1 >= 4) and (F33 = 1.5) and (F231 = 0) and (F213 <= 1) Class = pos (F1 >= 4) and (F33 <= 3.5) Class = pos (F1 >= 3) and (F2 >= 5) and (F311 >= 1) Class = pos (F1 >= 3) and (F11 = 0) and (F322 = 1) Class = pos otherwise Class = neg ?
62
Flexibility of AAs Inflexible AAs indicate pos Flexibility linked to epitope propensity in the literature Y, W, F are inflexible, E is flexible (F1 >= 5) Class = pos (F1 >= 4) and (F33 = 1.5) and (F231 = 0) and (F213 <= 1) Class = pos (F1 >= 4) and (F33 <= 3.5) Class = pos (F1 >= 3) and (F2 >= 5) and (F311 >= 1) Class = pos (F1 >= 3) and (F11 = 0) and (F322 = 1) Class = pos otherwise Class = neg ?
63
Polarity of AAs (N >= 8) and (P + >= 2) and (P – <= 1) Class = pos (N >= 10) and (NP + >= 0.5) and (NNP 0 >= 1.5) Class = pos (...N... >= something) and... Class = pos (... P –... <= something) and... Class = pos... otherwise Class = neg
64
Polarity of AAs Non-polar AAs indicate pos Should polarity not be conducive to antibody binding? (N >= 8) and (P + >= 2) and (P – <= 1) Class = pos (N >= 10) and (NP + >= 0.5) and (NNP 0 >= 1.5) Class = pos (...N... >= something) and... Class = pos (... P –... <= something) and... Class = pos... otherwise Class = neg ?
65
Polarity of AAs Non-polar AAs indicate pos Should polarity not be conducive to antibody binding? Y, W, F are non-polar, E is polar negative (N >= 8) and (P + >= 2) and (P – <= 1) Class = pos (N >= 10) and (NP + >= 0.5) and (NNP 0 >= 1.5) Class = pos (...N... >= something) and... Class = pos (... P –... <= something) and... Class = pos... otherwise Class = neg ?
66
Classes of AAs
SVM / Rules:
AA counts: 79.44 % / 73.93 %
All AA class counts: 79.77 % / 75.07 %
All AA class counts and AA counts: 79.89 % / 76.12 %
SVM accuracy increases a bit. Highest rules accuracy.
67
Classes of AAs
SVM / Rules:
AA counts: 79.44 % / 73.93 %
All AA class counts: 79.77 % / 75.07 %
All AA class counts and AA counts: 79.89 % / 76.12 %
SVM accuracy increases a bit. Highest rules accuracy.
(aromatic AAs >= 3) and (non-polar AAs >= 8) and (medium turn presence AAs >= 8) Class = pos
(aromatic AAs >= 3) and (non-polar AAs >= 8) and (Y >= 2) Class = pos
Precision: 88.64 % and 91.71 %
68
AA pair counts Antibody Peptide
69
AA pair counts Antibody Peptide AA1 AA2 Distance d Attributes are counts of AA1 and AA2 at distance d for all AA1, AA2 and d
70
AA pair counts Too many attributes: 20 AAs × 20 AAs × 14 possible distances = 5,600 attributes
71
AA pair counts
Too many attributes: 20 AAs × 20 AAs × 14 possible distances = 5,600 attributes
Ways to reduce them:
– Classes of AAs instead of individual AAs
– Increment distances in steps > 1 (e.g. a step of 3 over d = 1, 2, 3, 4, 5, ...)
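One possible reading of the pair attributes with a distance step, sketched in Python; grouping distances into buckets of size `step` is an assumption about how the step reduces the attribute count (the slides do not spell out the exact reduction).

```python
# Sketch of pair-count attributes with a distance step: for every ordered pair
# of residues in the peptide, count the pair (AA1, AA2, distance bucket), where
# distances are grouped into buckets of size `step`.
from collections import Counter

def pair_count_attributes(peptide, step=4):
    counts = Counter()
    for i in range(len(peptide)):
        for j in range(i + 1, len(peptide)):
            bucket = (j - i - 1) // step               # bucketed distance
            counts[(peptide[i], peptide[j], bucket)] += 1
    return counts
```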
72
AA pair counts
SVM / Rules:
AA counts: 79.44 % / 73.93 %
AA × AA, distance step 4: 75.28 % / 71.02 %
AA × aromatic/aliphatic, step 3: 78.25 % / 72.80 %
Aromatic/aliphatic × aromatic/aliphatic, step 2: 72.56 % / 70.90 %
Accuracy decreases. Rules not particularly illuminating.
73
AA pair counts with a fixed side Antibody Easily accessible side of peptide Peptide array surface
74
AA pair counts with a fixed side Antibody Easily accessible side of peptide AA at fixed position AA at d Distance d Attributes are AA at fixed position and counts of AA at distance d for all AA and d Peptide array surface
75
AA pair counts with a fixed side Fewer attributes: 20 AAs at fixed position + 20 AAs × 14 possible distances = 300 attributes
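A sketch of the fixed-side attributes, assuming the fixed position is the first residue (the accessible end of the peptide); since a 15-mer has exactly one residue at each distance from that end, the per-distance counts reduce to 0/1 indicators, giving 20 + 20 × 14 = 300 attributes.

```python
# Sketch: fixed-side pair attributes, assuming the fixed position is the first
# residue.  20 indicators for the fixed residue + 20 x 14 for the residues at
# each distance from it = 300 attributes for a 15-mer.
AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"

def fixed_side_attributes(peptide):
    attributes = [1 if peptide[0] == aa else 0 for aa in AA_ORDER]   # fixed residue
    for d in range(1, len(peptide)):                                 # distances 1..14
        attributes.extend(1 if peptide[d] == aa else 0 for aa in AA_ORDER)
    return attributes

print(len(fixed_side_attributes("QGDYCRPTVQEERKL")))  # 300
```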
76
AA pair counts with a fixed side
SVM accuracy increases a bit. Rules accuracy decreases due to bad attributes.
SVM / Rules:
AA counts: 79.44 % / 73.93 %
Pair AA × AA, distance step 4: 75.28 % / 71.02 %
Fixed pair, step 2: 80.07 % / 69.35 %
Fixed pair, step 1: 78.82 % / 67.60 %
77
AA pair counts with a fixed side
SVM accuracy increases a bit. Rules accuracy decreases due to bad attributes.
SVM / Rules:
AA counts: 79.44 % / 73.93 %
Pair AA × AA, distance step 4: 75.28 % / 71.02 %
Fixed pair, step 2: 80.07 % / 69.35 %
Fixed pair, step 1: 78.82 % / 67.60 %
(Y at d 2 >= 1) Class = pos (Y at d 6 >= 1) Class = pos (Y at d 7 >= 1) Class = pos (Y at d 9 >= 1) Class = pos (Y at d 11 >= 1) Class = pos (Y at d 14 >= 1) Class = pos... otherwise Class = neg
78
AA properties (in sections of peptide) AA properties on which AA classes are based can be used directly – averaged over peptide
79
AA properties (in sections of peptide)
AA properties on which AA classes are based can be used directly – averaged over the peptide.
SVM / Rules:
AA counts: 79.44 % / 73.93 %
AA properties: 76.45 % / 72.63 %
AA properties, 2 sections: 76.97 % / 73.01 %
AA properties, 3 sections: 77.25 % / 71.13 %
AA properties and AA counts: 79.30 % / 74.97 %
AA properties and AA counts, 2 sections: 79.89 % / 74.24 %
AA properties and AA counts, 3 sections: 80.05 % / 70.80 %
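A sketch of a property attribute: average a per-amino-acid scale over the whole peptide, here using the flexibility index values listed earlier as the example scale (per-section averages work the same way on slices of the peptide).

```python
# Sketch: a property attribute is the average of a per-amino-acid scale over
# the peptide (or over each section).  The flexibility index from the earlier
# slide is used as the example scale.
FLEXIBILITY = {"W": -0.727, "Y": -0.721, "F": -0.719, "C": -0.693, "I": -0.682,
               "V": -0.669, "H": -0.662, "L": -0.631, "M": -0.626, "A": -0.605,
               "G": -0.537, "T": -0.525, "R": -0.448, "S": -0.423, "N": -0.381,
               "Q": -0.369, "D": -0.279, "P": -0.271, "E": -0.160, "K": -0.043}

def mean_property(peptide, scale=FLEXIBILITY):
    return sum(scale[aa] for aa in peptide) / len(peptide)

print(round(mean_property("QGDYCRPTVQEERKL"), 3))  # -0.422 for the example peptide
```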
80
AA properties (in sections of peptide) (aromatic >= 0.3) Class = pos (polarity <= something) and... Class = pos (basic >= something) and... Class = pos... otherwise Class = neg
81
AA properties (in sections of peptide) (aromatic >= 0.3) Class = pos (polarity <= something) and... Class = pos (basic >= something) and... Class = pos... otherwise Class = neg Aromatic AAs indicate pos Non-polar AAs indicate pos (Y, W, F non-polar) Presence of basic (H, K, R) / absence of acidic (E, D) AAs indicates pos
82
Limitations and future work
83
Why are we stuck at 80 % Training data A single antibody or a group of similar antibodies A single antibody or a group of similar antibodies Single different antibodies
84
Why are we stuck at 80 % Training data A single antibody or a group of similar antibodies A single antibody or a group of similar antibodies Single different antibodies Similar peptides Each peptide different
85
Why are we stuck at 80 % Training data A single antibody or a group of similar antibodies A single antibody or a group of similar antibodies Single different antibodies Similar peptides Similarity recognized by machine learning: our 80 % Each peptide different Ignored by machine learning
86
Why are we stuck at 80 % Training data Test data A single antibody or a group of similar antibodies A single antibody or a group of similar antibodies Single different antibodies Similar peptides Similarity recognized by machine learning: our 80 % Each peptide different Ignored by machine learning Recognized by machine learning: our 80 % again Not recognized
87
Where do we go from here Tyrosine followed by another aromatic AA followed by tryptophan followed by polar AA
88
Where do we go from here Tyrosine followed by another aromatic AA followed by tryptophan followed by polar AA Aggregating rules with different attributes Kernels to use many attributes simultaneously Peptide similarity
89
Aggregating rules Check if different attribute sets cover different instances If so, pick the best rules for each attribute set Use only the best rules for classification
90
Aggregating rules Check if different attribute sets cover different instances If so, pick the best rules for each attribute set Use only the best rules for classification [figure: Rule 1, Rule 2, Rule 3, ... covering different instances in the training and test sets]
91
Kernels Use many attributes without computing them explicitly Only works with some methods like SVM
92
Kernels Use many attributes without computing them explicitly Only works with some methods like SVM Instance 1: (a11, a12, ..., a1n) Instance 2: (a21, a22, ..., a2n)
93
Kernels Use many attributes without computing them explicitly Only works with some methods like SVM Instance 1: (a11, a12, ..., a1n) Instance 2: (a21, a22, ..., a2n) (a11, a12, ..., a1n) · (a21, a22, ..., a2n) = a11·a21 + a12·a22 + ... + a1n·a2n Only need the dot product Only need to compute attributes that are non-zero in both attribute vectors
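The point about non-zero attributes can be made concrete with sparse vectors stored as dictionaries; a small sketch with toy substring counts (not real data):

```python
# Sketch: with sparse attribute vectors stored as dictionaries, the dot product
# only touches attributes that are non-zero in both instances, so the full
# attribute vector is never materialised.
def sparse_dot(x, y):
    if len(x) > len(y):          # iterate over the smaller dictionary
        x, y = y, x
    return sum(value * y[key] for key, value in x.items() if key in y)

a = {"Y": 1, "LL": 2, "RL": 1}   # toy substring counts of one peptide
b = {"Y": 2, "RL": 1, "KK": 1}
print(sparse_dot(a, b))          # 1*2 + 1*1 = 3
```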
94
Peptide similarity Smart similarity: – Find best alignment of two peptides – Then compute similarity
95
Peptide similarity Smart similarity: – Find best alignment of two peptides – Then compute similarity Nearest-neighbor classification
96
Peptide similarity Smart similarity: – Find best alignment of two peptides – Then compute similarity Nearest-neighbor classification Clustering: – Find groups of similar peptides – Find groups of peptides that are similar in the same way
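A sketch of nearest-neighbor classification on top of a crude similarity (number of identical residues at the best ungapped offset); a real implementation would use a proper alignment with a substitution matrix, and the function and variable names here are only illustrative.

```python
# Sketch: nearest-neighbor classification with a crude ungapped similarity.
def similarity(p, q):
    best = 0
    for offset in range(-len(q) + 1, len(p)):          # slide q along p
        matches = sum(1 for i in range(len(p))
                      if 0 <= i - offset < len(q) and p[i] == q[i - offset])
        best = max(best, matches)
    return best

def nearest_neighbor_label(peptide, training_data):
    # training_data: list of (sequence, label) pairs
    return max(training_data, key=lambda item: similarity(peptide, item[0]))[1]
```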
97
Questions? Suggestions? Peptide arrays – SVM – RIPPER – EL-Manzalawy AA counts – AA counts in sections AA count differences – Substring counts (with gaps) AA classes – AA pairs (with a fixed side) – AA properties Stuck at 80 % Aggregating rules – Kernels – Peptide similarity