1 In silico immune response prediction based on peptide array data Mitja Luštrek Institute for Biostatistics and Informatics in Medicine and Aging Research

2 Overview
Introduction: peptide arrays, the task of prediction, machine learning
Immune response prediction: prediction methods and results, insights into the immune system
Limitations and future work: why we do not do better and how we might

3 Peptide arrays Peptides (antigen)

4 Peptide arrays Peptides (antigen) Serum (antibodies)

5 Peptide arrays Peptides (antigen) Serum (antibodies) Immune response

6 In silico prediction Predict in silico which peptides evoke an immune response

7 In silico prediction
Predict in silico which peptides evoke an immune response
Save costs by putting only the most promising peptides on an array
Gain insight into the workings of the immune system

8 Task at hand
Data:
– 10,218 peptides (15-mers) with negative response
– 3,420 peptides (15-mers) with positive response

9 Task at hand
Data:
– 10,218 peptides (15-mers) with negative response
– 3,420 peptides (15-mers) with positive response, multiplied to 10,218 for a balanced data set

10 Task at hand
Data:
– 10,218 peptides (15-mers) with negative response
– 3,420 peptides (15-mers) with positive response, multiplied to 10,218 for a balanced data set
Method:
– machine learning
– 70 % of the data for training, 30 % for testing

11 Machine learning
Labeled (training) data, one instance (peptide) per row:
Attribute 1 | Attribute 2 | ... | Class
a11 | a12 | ... | neg
a21 | a22 | ... | pos
...

12 Machine learning
Labeled (training) data, one instance (peptide) per row:
Attribute 1 | Attribute 2 | ... | Class
a11 | a12 | ... | neg
a21 | a22 | ... | pos
...
Machine learning algorithm produces a classifier:
if Attribute 1 is such and such and Attribute 2 is such and such ... then Class = neg
if Attribute 1 is such and such and Attribute 2 is such and such ... then Class = pos

13 Machine learning
Unlabeled (test) data:
Attribute 1 | Attribute 2 | ... | Class
b11 | b12 | ... |
b21 | b22 | ... |
...

14 Machine learning
Unlabeled (test) data is passed through the classifier to obtain classified data:
Attribute 1 | Attribute 2 | ... | Class
b11 | b12 | ... | pos
b21 | b22 | ... | neg
...

15 Support vector machine (SVM) [figure: instances plotted by Attribute 1 and Attribute 2]

16 Support vector machine (SVM) [figure: instances plotted by Attribute 1 and Attribute 2]

17 Support vector machine (SVM) [figure: instances plotted by Attribute 1, Attribute 2 and Attribute 3]
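To make the workflow of slides 11–17 concrete, here is a minimal sketch using scikit-learn's SVC. The attribute values and labels are invented toy data, not the talk's peptide data.

```python
# Minimal sketch of the train/classify workflow (slides 11-17) with an SVM.
# The toy data below is invented for illustration only.
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Labeled data: one row of attribute values per peptide, plus a class label.
X = [[0, 2, 1], [1, 0, 3], [2, 1, 0], [0, 3, 2], [1, 1, 1], [3, 0, 0]]
y = ["neg", "pos", "pos", "neg", "pos", "neg"]

# 70 % of the data for training, 30 % for testing, as in the talk.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An RBF-kernel SVM finds a separating surface in (a higher-dimensional
# mapping of) the attribute space.
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)

print(clf.predict(X_test))        # classified test data
print(clf.score(X_test, y_test))  # accuracy on the test set
```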

18 RIPPER rules (Repeated Incremental Pruning to Produce Error Reduction)
Split data:
– growing set (2/3) to grow new rules
– pruning set (1/3) to prune them

19 RIPPER rules
Split data:
– growing set (2/3) to grow new rules
– pruning set (1/3) to prune them
Grow a rule: (Attribute 1 = a1)
Conditions are selected based on information gain

20 RIPPER rules
Split data:
– growing set (2/3) to grow new rules
– pruning set (1/3) to prune them
Grow a rule: (Attribute 1 = a1) and (Attribute 2 <= a2)

21 RIPPER rules
Split data:
– growing set (2/3) to grow new rules
– pruning set (1/3) to prune them
Grow a rule: (Attribute 1 = a1) and (Attribute 2 <= a2) ... → Class = c
Conditions are added until all instances matched by the rule belong to class c (the rule never misclassifies)

22 RIPPER rules
Split data:
– growing set (2/3) to grow new rules
– pruning set (1/3) to prune them
Grow a rule: (Attribute 1 = a1) and (Attribute 2 <= a2) ... → Class = c
The rule is perfect on the growing set. But does it overfit the growing set? We test it on the pruning set.

23 RIPPER rules
Split data:
– growing set (2/3) to grow new rules
– pruning set (1/3) to prune them
Grow a rule: (Attribute 1 = a1) and (Attribute 2 <= a2) ... → Class = c
Prune a rule: (Attribute 1 = a1) and (Attribute 2 <= a2) ... → Class = c

24 RIPPER rules
Split data:
– growing set (2/3) to grow new rules
– pruning set (1/3) to prune them
Grow a rule: (Attribute 1 = a1) and (Attribute 2 <= a2) ... → Class = c
Prune a rule: (Attribute 1 = a1) ... → Class = c
Conditions are removed as long as removal improves performance on the pruning set

25 RIPPER rules
Grow and prune a rule
Delete the instances covered by the rule
Repeat until the last rule increases the description length (length of the rules + misclassified instances) by more than a constant
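A simplified sketch of the grow-and-prune step described on slides 18–25, under stated assumptions: conditions are thresholds on numeric attributes, growing uses FOIL-style information gain, and the full RIPPER/JRip algorithm's MDL stopping criterion and rule-set optimization are omitted. The data is toy data.

```python
# Simplified sketch of growing and pruning a single RIPPER-style rule.
import math

def covers(rule, x):
    return all((x[a] <= t) if op == "<=" else (x[a] >= t) for a, op, t in rule)

def pos_neg(rule, X, y):
    p = sum(1 for xi, yi in zip(X, y) if covers(rule, xi) and yi == "pos")
    n = sum(1 for xi, yi in zip(X, y) if covers(rule, xi) and yi == "neg")
    return p, n

def gain(rule, cond, X, y):
    # FOIL-style information gain of adding one condition to the rule.
    p0, n0 = pos_neg(rule, X, y)
    p1, n1 = pos_neg(rule + [cond], X, y)
    if p1 == 0:
        return -math.inf
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def grow_rule(X, y):
    rule = []
    while pos_neg(rule, X, y)[1] > 0:            # rule still covers negatives
        cands = [(a, op, xi[a]) for xi in X for a in range(len(xi)) for op in ("<=", ">=")]
        best = max(cands, key=lambda c: gain(rule, c, X, y))
        if gain(rule, best, X, y) <= 0:
            break
        rule.append(best)
    return rule

def prune_rule(rule, X, y):
    def value(r):                                # (p - n) / (p + n) on the pruning set
        p, n = pos_neg(r, X, y)
        return (p - n) / (p + n) if p + n else -1.0
    while len(rule) > 1 and value(rule[:-1]) >= value(rule):
        rule = rule[:-1]                         # drop the last condition
    return rule

# Toy usage: 2/3 growing set, 1/3 pruning set.
X = [[0, 2], [1, 3], [2, 1], [0, 0], [3, 2], [1, 1]]
y = ["neg", "pos", "pos", "neg", "pos", "neg"]
rule = prune_rule(grow_rule(X[:4], y[:4]), X[4:], y[4:])
print(rule, "-> Class = pos")
```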

26 Related work
EL-Manzalawy et al. (2008)
Data: 934 epitopes, 934 random peptides
Compared several machine learning methods
Best performance by support vector machine + string kernel
Data | SVM accuracy
Their data, string kernel | 69.59 %
Our data, string kernel | 78.11 % (first baseline)

27 Immune response prediction

28 First attempt: AA counts
Attributes: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y, Class
(A = alanine count, C = cysteine count, ...; Class = positive or negative immune response)

29 First attempt: AA counts
Attributes: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y, Class
(A = alanine count, C = cysteine count, ...; Class = positive or negative immune response)
Example peptide: QGDYCRPTVQEERKL, response 35 (negative)
A C D E F G H I K L M N P Q R S T V W Y | Class
0 1 1 2 0 1 0 0 1 1 0 0 1 2 2 0 1 1 0 1 | neg
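A minimal sketch of how the AA-count attributes can be computed, reproducing the example row above.

```python
# AA-count attributes (slides 28-29): one count per amino acid.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_counts(peptide):
    return [peptide.count(aa) for aa in AMINO_ACIDS]

peptide = "QGDYCRPTVQEERKL"
print(dict(zip(AMINO_ACIDS, aa_counts(peptide))))
# -> {'A': 0, 'C': 1, 'D': 1, 'E': 2, ..., 'Y': 1}, class = neg
```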

30 First attempt: AA counts
Method | SVM | Rules
String kernel | 78.11 % | –
AA counts | 79.44 % | 73.93 % (second baseline)
Simple attributes are more accurate than the string kernel
SVM is more accurate than rules, but rules can be understood by a human

31 First attempt: AA counts
(Y = 0) and (F = 0) and (E >= 1) → Class = neg
(Y = 0) and (W = 0) and (R = 0) and (F <= 1) → Class = neg
(Y = 0 or <= 1) and ... → Class = neg
... and (F = 0 or <= 1) and ... → Class = neg
... and (W = 0 or <= 1) and ... → Class = neg
... otherwise Class = pos
(Y = tyrosine, W = tryptophan, F = phenylalanine)

32 First attempt: AA counts
(Y = 0) and (F = 0) and (E >= 1) → Class = neg
(Y = 0) and (W = 0) and (R = 0) and (F <= 1) → Class = neg
(Y = 0 or <= 1) and ... → Class = neg
... and (F = 0 or <= 1) and ... → Class = neg
... and (W = 0 or <= 1) and ... → Class = neg
... otherwise Class = pos
No tyrosine in peptide → Class = neg, otherwise Class = pos (66.53 %)
No tryptophan in peptide → Class = neg, otherwise Class = pos (59.27 %)
No phenylalanine in peptide → Class = neg, otherwise Class = pos (63.21 %)

33 AA counts in sections of peptide AA counts ignore position in peptide

34 AA counts in sections of peptide
AA counts ignore position in peptide
Peptide | X | Y | Class
XXYY | 2 | 2 | pos
XXYY | 2 | 2 | pos
XXYY | 2 | 2 | pos
YYXX | 2 | 2 | neg
YYXX | 2 | 2 | neg
YYXX | 2 | 2 | neg

35 AA counts in sections of peptide
AA counts ignore position in peptide
Peptide | X | Y | Class | X left | Y left | X right | Y right
XXYY | 2 | 2 | pos | 2 | 0 | 0 | 2
XXYY | 2 | 2 | pos | 2 | 0 | 0 | 2
XXYY | 2 | 2 | pos | 2 | 0 | 0 | 2
YYXX | 2 | 2 | neg | 0 | 2 | 2 | 0
YYXX | 2 | 2 | neg | 0 | 2 | 2 | 0
YYXX | 2 | 2 | neg | 0 | 2 | 2 | 0
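A sketch of the sectioned variant: split the peptide into k roughly equal sections and count each amino acid per section. The exact section boundaries used in the talk are not given, so the even split here is an assumption.

```python
# Section-wise AA counts (slides 33-35): one count per amino acid per section.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def section_counts(peptide, k):
    bounds = [round(i * len(peptide) / k) for i in range(k + 1)]
    sections = [peptide[bounds[i]:bounds[i + 1]] for i in range(k)]
    return [sec.count(aa) for sec in sections for aa in AMINO_ACIDS]

peptide = "QGDYCRPTVQEERKL"
print(len(section_counts(peptide, 3)))   # 3 sections x 20 AAs = 60 attributes
```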

36 AA counts in sections of peptide
SVM accuracy increases a bit, rules accuracy decreases
SVM better at coping with many attributes?
Method | SVM | Rules
AA counts | 79.44 % | 73.93 %
AA counts, 2 sections | 79.98 % | 73.84 %
AA counts, 3 sections | 80.31 % | 70.87 %
AA counts, 4 sections | 79.99 % | 69.05 %
AA counts, 5 sections | 79.98 % | 71.49 %

37 AA count differences Machine learning cannot infer all relations automatically; it needs help

38 AA count differences
Machine learning cannot infer all relations automatically; it needs help
X | Y | Z | Class
1 | 2 | 5 | pos
2 | 3 | 3 | pos
3 | 4 | 1 | pos
2 | 1 | 5 | neg
3 | 2 | 3 | neg
4 | 3 | 1 | neg

39 AA count differences
Machine learning cannot infer all relations automatically; it needs help
X | Y | Z | Class | X – Y | ...
1 | 2 | 5 | pos | –1 | ...
2 | 3 | 3 | pos | –1 | ...
3 | 4 | 1 | pos | –1 | ...
2 | 1 | 5 | neg | 1 | ...
3 | 2 | 3 | neg | 1 | ...
4 | 3 | 1 | neg | 1 | ...
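A sketch of the count-difference attributes: one difference per pair of amino-acid counts.

```python
# AA count differences (slides 37-39): counts[a] - counts[b] for every pair.
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def count_differences(peptide):
    counts = {aa: peptide.count(aa) for aa in AMINO_ACIDS}
    return {f"{a}-{b}": counts[a] - counts[b] for a, b in combinations(AMINO_ACIDS, 2)}

diffs = count_differences("QGDYCRPTVQEERKL")
print(diffs["E-Y"])   # glutamic acid count minus tyrosine count: 2 - 1 = 1
```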

40 AA count differences
Method | SVM | Rules
AA counts | 79.44 % | 73.93 %
AA count differences | 78.48 % | 72.91 %
AA counts + AA count differences | 79.39 % | 75.29 %
SVM accuracy decreases a bit, rules accuracy increases; the changes are small

41 AA count differences
(E – Y <= –1) and (N – R <= –1) → Class = pos
(E – Y <= 0) and (D – Y <= –1) and (Q – Y <= –2) → Class = pos
(E – something <= –1 or 0) and ... → Class = pos
(something – Y <= –2, –1 or 0) and ... → Class = pos
... otherwise Class = neg
(Y = tyrosine, E = glutamic acid)

42 AA count differences
(E – Y <= –1) and (N – R <= –1) → Class = pos
(E – Y <= 0) and (D – Y <= –1) and (Q – Y <= –2) → Class = pos
(E – something <= –1 or 0) and ... → Class = pos
(something – Y <= –2, –1 or 0) and ... → Class = pos
... otherwise Class = neg
No tyrosine in peptide → Class = neg, otherwise Class = pos (66.53 %)
No glutamic acid in peptide → Class = pos, otherwise Class = neg (59.77 %)

43 Substring counts Perhaps single AAs are not informative enough and we should count longer substrings

44 Substring counts
Perhaps single AAs are not informative enough and we should count longer substrings
Peptide | X | Y | Class
XXXYY | 3 | 2 | pos
YXXXY | 3 | 2 | pos
YYXXX | 3 | 2 | pos
XYXYX | 3 | 2 | neg
XYYXX | 3 | 2 | neg
XXYYX | 3 | 2 | neg

45 Substring counts
Perhaps single AAs are not informative enough and we should count longer substrings
Peptide | X | Y | Class | XXX | ...
XXXYY | 3 | 2 | pos | 1 | ...
YXXXY | 3 | 2 | pos | 1 | ...
YYXXX | 3 | 2 | pos | 1 | ...
XYXYX | 3 | 2 | neg | 0 | ...
XYYXX | 3 | 2 | neg | 0 | ...
XXYYX | 3 | 2 | neg | 0 | ...

46 Substring counts
Method | SVM | Rules
AA counts | 79.44 % | 73.93 %
Counts of substrings of lengths up to 2 | 78.92 % | 74.00 %
Counts of substrings of lengths up to 3 | 79.01 % | 73.91 %
The changes are small
Only one rule with substrings of length above 1:
(Y = 0) and (F = 0) and (W = 0) and (M = 0) and (I = 0) and (LL = 0) → Class = neg

47 Substrings with gaps Machine learning needs recurring patterns Small counts for substrings of length above 1 – little recurrence Increase substring counts by allowing gaps between AAs

48 Substrings with gaps
Machine learning needs recurring patterns
Small counts for substrings of length above 1 – little recurrence
Increase substring counts by allowing gaps between AAs, e.g. counting ABC:
XYXABCXXY → ABC × 1
YYAXBCYYX → ABC × 0.5
XYABXXCYX → ABC × 0.5²
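A sketch of gapped-substring counting, assuming the weighting shown on slide 48 (each gap position halves an occurrence's contribution) and a cap on the gap length between consecutive letters.

```python
# Gapped substring counts (slides 47-48): occurrences with gaps contribute
# 0.5 per gap position (one gap -> 0.5, two gaps -> 0.5**2, ...).
def gapped_count(peptide, pattern, max_gap):
    def extend(start, idx, gaps):
        # idx: next pattern character to match; gaps: gap positions used so far
        if idx == len(pattern):
            return 0.5 ** gaps
        total = 0.0
        for g in range(max_gap + 1):
            pos = start + g
            if pos < len(peptide) and peptide[pos] == pattern[idx]:
                total += extend(pos + 1, idx + 1, gaps + g)
        return total
    return sum(extend(i + 1, 1, 0)
               for i, ch in enumerate(peptide) if ch == pattern[0])

print(gapped_count("XYXABCXXY", "ABC", max_gap=2))   # 1.0
print(gapped_count("YYAXBCYYX", "ABC", max_gap=2))   # 0.5
print(gapped_count("XYABXXCYX", "ABC", max_gap=2))   # 0.25
```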

49 Substrings with gaps
Method | SVM | Rules
AA counts | 79.44 % | 73.93 %
Lengths up to 3, no gaps | 79.01 % | 73.91 %
Lengths up to 3, gap lengths up to 1 | 79.11 % | 74.37 %
Lengths up to 3, gap lengths up to 2 | 78.83 % | 74.62 %
Lengths up to 3, gap lengths up to 3 | 79.10 % | 74.71 %
Lengths up to 3, gap lengths up to 4 | 78.91 % | 75.54 %
SVM accuracy decreases a bit, rules accuracy increases; the changes are small

50 Substrings with gaps Still no rules with substrings of length 3 More rules with substrings of length 2

51 Substrings with gaps
Still no rules with substrings of length 3
More rules with substrings of length 2
(Y = 0) and (F = 0) and (E >= 1) and (W = 0) and (RL = 0) → Class = neg
... otherwise Class = pos
(L = leucine – positive response when in a pair)
Substring | Count
RL/LR | 5
LL | 5
LP | 2
SE | 2
KK | 2
EL | 1
SL | 1
RR | 1
PP | 1

52 Classes of AAs Perhaps individual AAs are too specific and we should merge similar AAs into classes

53 Classes of AAs
Perhaps individual AAs are too specific and we should merge similar AAs into classes
Flexibility index, split into inflexible (1), medium (2), flexible (3):
W -0.727, Y -0.721, F -0.719, C -0.693, I -0.682, V -0.669, H -0.662, L -0.631, M -0.626, A -0.605, G -0.537, T -0.525, R -0.448, S -0.423, N -0.381, Q -0.369, D -0.279, P -0.271, E -0.160, K -0.043

54 Classes of AAs
Perhaps individual AAs are too specific and we should merge similar AAs into classes
Flexibility index, split into inflexible (1), medium (2), flexible (3), as on the previous slide
Example peptide:     QGDYCRPTVQEERKL
Flexibility classes: 323113321333332

55 Classes of AAs
Counts of substrings of lengths up to 3:
1, 2, 3
11, 12, 13, 21, 22, 23, 31, 32, 33
111, 112, 113, 121, 122, 123, 131, 132, 133, 211, 212, 213, 221, 222, 223, 231, 232, 233, 311, 312, 313, 321, 322, 323, 331, 332, 333
Gaps of length 1
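A sketch of the class mapping from slides 53–55. The boundaries between the three flexibility groups (in particular where H, S and N fall) are an assumption, since those residues do not occur in the example peptide; the mapping below reproduces the example on slide 54 either way, and the class string would then be fed into the same substring counting as before.

```python
# Flexibility classes (slides 53-55): replace each amino acid by its class
# (1 = inflexible, 2 = medium, 3 = flexible) and count substrings of classes.
# The exact group boundaries (H, S, N) are an assumption read off slide 54.
FLEXIBILITY = {**{aa: "1" for aa in "WYFCIVH"},
               **{aa: "2" for aa in "LMAGTRS"},
               **{aa: "3" for aa in "NQDPEK"}}

def to_classes(peptide, mapping):
    return "".join(mapping[aa] for aa in peptide)

print(to_classes("QGDYCRPTVQEERKL", FLEXIBILITY))   # 323113321333332, as on slide 54
```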

56 Classes of AAs
Method | SVM | Rules
AA counts | 79.44 % | 73.93 %
AAs, lengths up to 3, gaps up to 1 | 79.11 % | 74.37 %
Aromatic/aliphatic | 72.51 % | 72.49 %
Basic/acidic | 61.15 % | 61.06 %
Flexibility | 69.24 % | 67.46 %
Hydrophobicity | 58.92 % | 56.85 %
Polarity | 64.94 % | 63.50 %
Size | 61.89 % | 60.58 %
Turns index | 58.92 % | 55.57 %

57 Aromatic/aliphatic AAs
Aromatic: phenylalanine (F), histidine (H), tryptophan (W), tyrosine (Y)
Aliphatic: isoleucine (I), leucine (L), valine (V)
Other: all the rest

58 Aromatic/aliphatic AAs
Aromatic: phenylalanine (F), histidine (H), tryptophan (W), tyrosine (Y)
Aliphatic: isoleucine (I), leucine (L), valine (V)
Other: all the rest
No tyrosine in peptide → Class = neg, otherwise Class = pos (66.53 %)
No tryptophan in peptide → Class = neg, otherwise Class = pos (59.27 %)
No phenylalanine in peptide → Class = neg, otherwise Class = pos (63.21 %)

59 Aromatic/aliphatic AAs
Aromatic: phenylalanine (F), histidine (H), tryptophan (W), tyrosine (Y)
Aliphatic: isoleucine (I), leucine (L), valine (V)
Other: all the rest
No tyrosine in peptide → Class = neg, otherwise Class = pos (66.53 %)
No tryptophan in peptide → Class = neg, otherwise Class = pos (59.27 %)
No phenylalanine in peptide → Class = neg, otherwise Class = pos (63.21 %)
1 or no aromatic AAs in peptide → Class = neg, otherwise Class = pos (71.37 %)

60 Flexibility of AAs
(F1 >= 5) → Class = pos
(F1 >= 4) and (F33 = 1.5) and (F231 = 0) and (F213 <= 1) → Class = pos
(F1 >= 4) and (F33 <= 3.5) → Class = pos
(F1 >= 3) and (F2 >= 5) and (F311 >= 1) → Class = pos
(F1 >= 3) and (F11 = 0) and (F322 = 1) → Class = pos
otherwise Class = neg

61 Flexibility of AAs
Inflexible AAs indicate pos – but flexibility is linked to epitope propensity in the literature?
(F1 >= 5) → Class = pos
(F1 >= 4) and (F33 = 1.5) and (F231 = 0) and (F213 <= 1) → Class = pos
(F1 >= 4) and (F33 <= 3.5) → Class = pos
(F1 >= 3) and (F2 >= 5) and (F311 >= 1) → Class = pos
(F1 >= 3) and (F11 = 0) and (F322 = 1) → Class = pos
otherwise Class = neg

62 Flexibility of AAs
Inflexible AAs indicate pos – but flexibility is linked to epitope propensity in the literature?
Y, W, F are inflexible, E is flexible
(F1 >= 5) → Class = pos
(F1 >= 4) and (F33 = 1.5) and (F231 = 0) and (F213 <= 1) → Class = pos
(F1 >= 4) and (F33 <= 3.5) → Class = pos
(F1 >= 3) and (F2 >= 5) and (F311 >= 1) → Class = pos
(F1 >= 3) and (F11 = 0) and (F322 = 1) → Class = pos
otherwise Class = neg

63 Polarity of AAs
(N >= 8) and (P+ >= 2) and (P– <= 1) → Class = pos
(N >= 10) and (NP+ >= 0.5) and (NNP0 >= 1.5) → Class = pos
(... N ... >= something) and ... → Class = pos
(... P– ... <= something) and ... → Class = pos
... otherwise Class = neg

64 Polarity of AAs
Non-polar AAs indicate pos – but should polarity not be conducive to antibody binding?
(N >= 8) and (P+ >= 2) and (P– <= 1) → Class = pos
(N >= 10) and (NP+ >= 0.5) and (NNP0 >= 1.5) → Class = pos
(... N ... >= something) and ... → Class = pos
(... P– ... <= something) and ... → Class = pos
... otherwise Class = neg

65 Polarity of AAs
Non-polar AAs indicate pos – but should polarity not be conducive to antibody binding?
Y, W, F are non-polar, E is polar negative
(N >= 8) and (P+ >= 2) and (P– <= 1) → Class = pos
(N >= 10) and (NP+ >= 0.5) and (NNP0 >= 1.5) → Class = pos
(... N ... >= something) and ... → Class = pos
(... P– ... <= something) and ... → Class = pos
... otherwise Class = neg

66 Classes of AAs
Method | SVM | Rules
AA counts | 79.44 % | 73.93 %
All AA class counts | 79.77 % | 75.07 %
All AA class counts and AA counts | 79.89 % | 76.12 %
SVM accuracy increases a bit; highest rules accuracy

67 Classes of AAs
Method | SVM | Rules
AA counts | 79.44 % | 73.93 %
All AA class counts | 79.77 % | 75.07 %
All AA class counts and AA counts | 79.89 % | 76.12 %
SVM accuracy increases a bit; highest rules accuracy
(aromatic AAs >= 3) and (non-polar AAs >= 8) and (medium turn presence AAs >= 8) → Class = pos
(aromatic AAs >= 3) and (non-polar AAs >= 8) and (Y >= 2) → Class = pos
Precision of these rules: 88.64 % and 91.71 %

68 AA pair counts [figure: antibody binding a peptide]

69 AA pair counts
[figure: antibody contacting AA1 and AA2 of the peptide, which lie at distance d from each other]
Attributes are counts of AA1 and AA2 at distance d, for all AA1, AA2 and d

70 AA pair counts Too many attributes: 20 AAs × 20 AAs × 14 possible distances = 5,600 attributes

71 AA pair counts
Too many attributes: 20 AAs × 20 AAs × 14 possible distances = 5,600 attributes
Ways to reduce them:
– classes of AAs instead of individual AAs
– increment distances in steps > 1 (e.g. d = 1, 2, 3, 4, 5 sampled with step 3)
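A sketch of the pair-count attributes from slides 68–71. Whether the sampled distances start at 1 or at the step size is not specified on the slide, so starting at 1 is an assumption here.

```python
# AA pair counts: how often amino acids a1 and a2 occur at distance d.
from collections import Counter

def pair_counts(peptide, distance_step=1, max_distance=14):
    counts = Counter()
    for d in range(1, max_distance + 1, distance_step):
        for i in range(len(peptide) - d):
            counts[(peptide[i], peptide[i + d], d)] += 1
    return counts

pairs = pair_counts("QGDYCRPTVQEERKL")
print(pairs[("Q", "E", 1)])   # Q immediately followed by E occurs once (positions 10-11)
```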

72 AA pair counts
Method | SVM | Rules
AA counts | 79.44 % | 73.93 %
AA × AA, distance step 4 | 75.28 % | 71.02 %
AA × aromatic/aliphatic, step 3 | 78.25 % | 72.80 %
Aromatic/aliphatic × aromatic/aliphatic, step 2 | 72.56 % | 70.90 %
Accuracy decreases; rules not particularly illuminating

73 AA pair counts with a fixed side
[figure: antibody binding the easily accessible side of the peptide; the other side is attached to the peptide array surface]

74 AA pair counts with a fixed side
[figure: antibody binding the easily accessible side of the peptide, which is attached to the peptide array surface; the AA at the fixed position and the AA at distance d are marked]
Attributes are the AA at the fixed position and the counts of each AA at distance d from it, for all AAs and d

75 AA pair counts with a fixed side Fewer attributes: 20 AAs at fixed position + 20 AAs × 14 possible distances = 300 attributes
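A sketch of the fixed-side attributes from slides 73–75. Which end of the peptide is the "easily accessible side" is an assumption (the first residue is used here), and the distance-step variants are omitted.

```python
# Fixed-side pair attributes: the amino acid at the fixed end plus, for every
# distance d, which amino acid sits d positions away from it.
# 20 fixed-position indicators + 20 x 14 distance attributes = 300 attributes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def fixed_side_attributes(peptide):
    attrs = {f"fixed_{aa}": int(peptide[0] == aa) for aa in AMINO_ACIDS}
    for d in range(1, len(peptide)):
        for aa in AMINO_ACIDS:
            attrs[f"{aa}_at_d{d}"] = int(peptide[d] == aa)
    return attrs

attrs = fixed_side_attributes("QGDYCRPTVQEERKL")
print(attrs["fixed_Q"], attrs["Y_at_d3"])   # 1 1: Q fixed, tyrosine 3 positions away
```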

76 AA pair counts with a fixed side
SVM accuracy increases a bit; rules accuracy decreases due to bad attributes
Method | SVM | Rules
AA counts | 79.44 % | 73.93 %
Pair AA × AA, distance step 4 | 75.28 % | 71.02 %
Fixed pair, step 2 | 80.07 % | 69.35 %
Fixed pair, step 1 | 78.82 % | 67.60 %

77 AA pair counts with a fixed side
SVM accuracy increases a bit; rules accuracy decreases due to bad attributes
Method | SVM | Rules
AA counts | 79.44 % | 73.93 %
Pair AA × AA, distance step 4 | 75.28 % | 71.02 %
Fixed pair, step 2 | 80.07 % | 69.35 %
Fixed pair, step 1 | 78.82 % | 67.60 %
(Y at d2 >= 1) → Class = pos
(Y at d6 >= 1) → Class = pos
(Y at d7 >= 1) → Class = pos
(Y at d9 >= 1) → Class = pos
(Y at d11 >= 1) → Class = pos
(Y at d14 >= 1) → Class = pos
... otherwise Class = neg

78 AA properties (in sections of peptide) AA properties on which AA classes are based can be used directly – averaged over peptide
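A sketch of the property attributes, using the flexibility values listed on slide 53; the other scales (hydrophobicity, polarity, size, ...) would be plugged in the same way, and the sectioned variants average over each section instead of over the whole peptide.

```python
# AA property attributes (slide 78): average a numeric amino-acid scale over
# the peptide. The scale below is the flexibility index from slide 53.
FLEXIBILITY_INDEX = {
    "W": -0.727, "Y": -0.721, "F": -0.719, "C": -0.693, "I": -0.682,
    "V": -0.669, "H": -0.662, "L": -0.631, "M": -0.626, "A": -0.605,
    "G": -0.537, "T": -0.525, "R": -0.448, "S": -0.423, "N": -0.381,
    "Q": -0.369, "D": -0.279, "P": -0.271, "E": -0.160, "K": -0.043,
}

def mean_property(peptide, scale):
    return sum(scale[aa] for aa in peptide) / len(peptide)

print(round(mean_property("QGDYCRPTVQEERKL", FLEXIBILITY_INDEX), 3))
```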

79 AA properties (in sections of peptide)
Method | SVM | Rules
AA counts | 79.44 % | 73.93 %
AA properties | 76.45 % | 72.63 %
AA properties, 2 sections | 76.97 % | 73.01 %
AA properties, 3 sections | 77.25 % | 71.13 %
AA properties and AA counts | 79.30 % | 74.97 %
AA properties and AA counts, 2 sections | 79.89 % | 74.24 %
AA properties and AA counts, 3 sections | 80.05 % | 70.80 %

80 AA properties (in sections of peptide)
(aromatic >= 0.3) → Class = pos
(polarity <= something) and ... → Class = pos
(basic >= something) and ... → Class = pos
... otherwise Class = neg

81 AA properties (in sections of peptide)
(aromatic >= 0.3) → Class = pos
(polarity <= something) and ... → Class = pos
(basic >= something) and ... → Class = pos
... otherwise Class = neg
Aromatic AAs indicate pos
Non-polar AAs indicate pos (Y, W, F non-polar)
Presence of basic (H, K, R) / absence of acidic (E, D) AAs indicates pos

82 Limitations and future work

83 Why are we stuck at 80 %
Training data comes from:
– a single antibody or a group of similar antibodies (several such groups)
– single different antibodies

84 Why are we stuck at 80 %
Training data comes from:
– a single antibody or a group of similar antibodies (several such groups) → similar peptides
– single different antibodies → each peptide different

85 Why are we stuck at 80 %
Training data comes from:
– a single antibody or a group of similar antibodies (several such groups) → similar peptides → similarity recognized by machine learning: our 80 %
– single different antibodies → each peptide different → ignored by machine learning

86 Why are we stuck at 80 %
Training data comes from:
– a single antibody or a group of similar antibodies (several such groups) → similar peptides → similarity recognized by machine learning: our 80 %
– single different antibodies → each peptide different → ignored by machine learning
Test data:
– similar peptides → recognized by machine learning: our 80 % again
– each peptide different → not recognized

87 Where do we go from here Tyrosine followed by another aromatic AA followed by tryptophan followed by polar AA

88 Where do we go from here
Tyrosine followed by another aromatic AA followed by tryptophan followed by a polar AA
Aggregating rules with different attributes
Kernels to use many attributes simultaneously
Peptide similarity

89 Aggregating rules Check if different attribute sets cover different instances If so, pick the best rules for each attribute set Use only the best rules for classification

90 Aggregating rules
Check if different attribute sets cover different instances
If so, pick the best rules for each attribute set
Use only the best rules for classification
[figure: training and test coverage of Rule 1, Rule 2, Rule 3, ...]
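One possible reading of the aggregation idea, not the authors' method: keep only the most precise rules from each attribute set and classify a peptide as positive when any kept rule fires. The rule format, the precision threshold and the toy data are all invented for illustration.

```python
# Sketch of rule aggregation across attribute sets (slides 89-90).
def precision(rule, X, y):
    fired = [yi for xi, yi in zip(X, y) if rule(xi)]
    return fired.count("pos") / len(fired) if fired else 0.0

def aggregate(rule_sets, X, y, min_precision=0.85):
    # Keep only rules that are precise enough on the training data.
    return [r for rules in rule_sets for r in rules if precision(r, X, y) >= min_precision]

def classify(best_rules, x):
    return "pos" if any(rule(x) for rule in best_rules) else "neg"

# Toy usage with rules over two different attribute sets (dicts of attributes).
rules_aa = [lambda x: x["Y"] >= 1 and x["F"] >= 1]
rules_classes = [lambda x: x["aromatic"] >= 3]
X = [{"Y": 2, "F": 1, "aromatic": 3}, {"Y": 0, "F": 0, "aromatic": 1}]
y = ["pos", "neg"]
best = aggregate([rules_aa, rules_classes], X, y)
print(classify(best, {"Y": 1, "F": 2, "aromatic": 4}))   # pos
```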

91 Kernels
Use many attributes without computing them explicitly
Only works with some methods like SVM

92 Kernels
Use many attributes without computing them explicitly
Only works with some methods like SVM
Instance 1: (a11, a12, ..., a1n)
Instance 2: (a21, a22, ..., a2n)

93 Kernels
Use many attributes without computing them explicitly
Only works with some methods like SVM
Instance 1: (a11, a12, ..., a1n)
Instance 2: (a21, a22, ..., a2n)
(a11, a12, ..., a1n) · (a21, a22, ..., a2n) = a11·a21 + a12·a22 + ... + a1n·a2n
Only the dot product is needed
Only attributes that are non-zero in both attribute vectors need to be computed
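A sketch of the point on slide 93 for a substring-count kernel (not the string kernel of EL-Manzalawy et al.): the dot product only needs the substrings that occur in both peptides, so the full attribute vector is never built. The second peptide below is made up.

```python
# Kernel as a sparse dot product over substring counts (slides 91-93).
from collections import Counter

def substring_counts(peptide, max_len=3):
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(peptide) - n + 1):
            counts[peptide[i:i + n]] += 1
    return counts

def substring_kernel(p1, p2):
    c1, c2 = substring_counts(p1), substring_counts(p2)
    shared = c1.keys() & c2.keys()          # only attributes non-zero in both
    return sum(c1[s] * c2[s] for s in shared)

print(substring_kernel("QGDYCRPTVQEERKL", "QGDYARPTVQAARKL"))
```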

94 Peptide similarity Smart similarity: – Find best alignment of two peptides – Then compute similarity

95 Peptide similarity
Smart similarity:
– find best alignment of two peptides
– then compute similarity
Nearest-neighbor classification

96 Peptide similarity
Smart similarity:
– find best alignment of two peptides
– then compute similarity
Nearest-neighbor classification
Clustering:
– find groups of similar peptides
– find groups of peptides that are similar in the same way
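A sketch of the similarity idea under simple assumptions: the "best alignment" is taken to be the best ungapped offset and similarity is the number of identical aligned positions; a real implementation would use a proper alignment algorithm with substitution scores. The labeled peptides below are invented.

```python
# Alignment-based similarity and nearest-neighbor classification (slides 94-96).
def similarity(p1, p2):
    best = 0
    for offset in range(-len(p2) + 1, len(p1)):
        matches = sum(1 for i, aa in enumerate(p2)
                      if 0 <= i + offset < len(p1) and p1[i + offset] == aa)
        best = max(best, matches)
    return best

def nearest_neighbor(peptide, labeled):
    # Classify with the label of the most similar labeled peptide.
    return max(labeled, key=lambda item: similarity(peptide, item[0]))[1]

labeled = [("QGDYCRPTVQEERKL", "neg"), ("YWFAAAYWFAAAYWF", "pos")]
print(nearest_neighbor("QGDYARPTVQAARKL", labeled))   # closest to the first -> neg
```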

97 Questions? Suggestions?
Peptide arrays – SVM – RIPPER – EL-Manzalawy
AA counts – AA counts in sections
AA count differences – Substring counts (with gaps)
AA classes – AA pairs (with a fixed side) – AA properties
Stuck at 80 %
Aggregating rules – Kernels – Peptide similarity

