1
In silico immune response prediction based on peptide array data Mitja Luštrek Institute for Biostatistics and Informatics in Medicine and Aging Research
2
Overview Introduction Peptide arrays, the task of prediction, machine learning Immune response prediction Prediction methods and results, insights into immune system Limitations and future work Why do we not do better and how we might
3
Peptide arrays Peptides (antigen)
4
Peptide arrays Peptides (antigen) Serum (antibodies)
5
Peptide arrays Peptides (antigen) Serum (antibodies) Immune response
6
In silico prediction Predict in silico which peptides evoke an immune response
7
In silico prediction Predict in silico which peptides evoke an immune response Save costs by putting only the most promising peptides on an array Gain insight into the workings of the immune system
8
Task at hand Data: – 10,218 peptides (15-mers) with negative response – 3,420 peptides (15-mers) with positive response
9
Task at hand Data: – 10,218 peptides (15-mers) with negative response – 3,420 peptides (15-mers) with positive response multiplied to 10,218 for a balanced data set
10
Task at hand Data: – 10,218 peptides (15-mers) with negative response – 3,420 peptides (15-mers) with positive response multiplied to 10,218 for a balanced data set Method: – machine learning – 70 % data for training, 30 % for testing
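A rough sketch (not the authors' actual pipeline) of how the balancing and the 70/30 split could be done in Python; `peptides` and the label strings "pos"/"neg" are hypothetical names for the data structure.

```python
# Rough sketch, not the authors' pipeline: balance the classes by replicating
# positive peptides, then hold out 30 % of the data for testing.
# `peptides` is assumed to be a list of (sequence, label) pairs.
import random

def balance_and_split(peptides, test_fraction=0.30, seed=0):
    rng = random.Random(seed)
    neg = [p for p in peptides if p[1] == "neg"]
    pos = [p for p in peptides if p[1] == "pos"]
    # Replicate the minority (positive) class until it matches the majority.
    copies, remainder = divmod(len(neg), len(pos))
    balanced_pos = pos * copies + rng.sample(pos, remainder)
    data = neg + balanced_pos
    rng.shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]   # training set, test set
```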
11
Machine learning
Labeled (training) data, one instance (peptide) per row:
Attribute 1 | Attribute 2 | ... | Class
a11 | a12 | ... | neg
a21 | a22 | ... | pos
...
12
Machine learning
Labeled (training) data, one instance (peptide) per row:
Attribute 1 | Attribute 2 | ... | Class
a11 | a12 | ... | neg
a21 | a22 | ... | pos
...
A machine learning algorithm produces a classifier:
if Attribute 1 is such and such and Attribute 2 is such and such ... then Class = neg
if Attribute 1 is such and such and Attribute 2 is such and such ... then Class = pos
13
Machine learning
Unlabeled (test) data:
Attribute 1 | Attribute 2 | ... | Class
b11 | b12 | ... | ?
b21 | b22 | ... | ?
...
14
Machine learning
Unlabeled (test) data:
Attribute 1 | Attribute 2 | ... | Class
b11 | b12 | ... | ?
b21 | b22 | ... | ?
...
Applying the classifier gives classified data:
Attribute 1 | Attribute 2 | ... | Class
b11 | b12 | ... | pos
b21 | b22 | ... | neg
...
15
Support vector machine (SVM) Attribute 1 Attribute 2
16
Support vector machine (SVM) Attribute 1 Attribute 2
17
Support vector machine (SVM) Attribute 3 Attribute 1 Attribute 2
18
RIPPER rules (Repeated Incremental Pruning to Produce Error Reduction) Split data: – growing set (2/3) to grow new rules – pruning set (1/3) to prune them
19
RIPPER rules Split data: – growing set (2/3) to grow new rules – pruning set (1/3) to prune them Grow a rule: (Attribute 1 = a 1 ) Selection based on information gain
20
RIPPER rules Split data: – growing set (2/3) to grow new rules – pruning set (1/3) to prune them Grow a rule: (Attribute 1 = a 1 ) and (Attribute 2 <= a 2 )
21
RIPPER rules Split data: – growing set (2/3) to grow new rules – pruning set (1/3) to prune them Grow a rule: (Attribute 1 = a 1 ) and (Attribute 2 <= a 2 )... Class = c Until all instances matched by the rule belong to class c (the rule never misclassifies)
22
RIPPER rules Split data: – growing set (2/3) to grow new rules – pruning set (1/3) to prune them Grow a rule: (Attribute 1 = a 1 ) and (Attribute 2 <= a 2 )... Class = c The rule is perfect on the growing set. But does it overfit the growing set? We test it on the pruning set.
23
RIPPER rules Split data: – growing set (2/3) to grow new rules – pruning set (1/3) to prune them Grow a rule: (Attribute 1 = a 1 ) and (Attribute 2 <= a 2 )... Class = c Prune a rule: (Attribute 1 = a 1 ) and (Attribute 2 <= a 2 )... Class = c
24
RIPPER rules Split data: – growing set (2/3) to grow new rules – pruning set (1/3) to prune them Grow a rule: (Attribute 1 = a 1 ) and (Attribute 2 <= a 2 )... Class = c Prune a rule: (Attribute 1 = a 1 )... Class = c As long as removal improves performance on the pruning set
25
RIPPER rules Grow and prune a rule Delete instances covered by the rule Repeat the process until the last rule increases the total description length (length of the rules + misclassified instances) by more than a constant
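A minimal sketch of the pruning step described above, assuming a rule is simply a list of boolean conditions over an instance; plain precision on the pruning set stands in for RIPPER's actual pruning metric, which differs slightly.

```python
# Minimal sketch of RIPPER-style pruning: drop conditions from the end of the
# rule as long as that does not hurt its precision on the pruning set.
def rule_matches(rule, instance):
    return all(condition(instance) for condition in rule)

def pruning_precision(rule, target_class, pruning_set):
    covered = [cls for inst, cls in pruning_set if rule_matches(rule, inst)]
    return covered.count(target_class) / len(covered) if covered else 0.0

def prune_rule(rule, target_class, pruning_set):
    best = list(rule)
    best_score = pruning_precision(best, target_class, pruning_set)
    while len(best) > 1:
        candidate = best[:-1]                        # drop the last condition
        score = pruning_precision(candidate, target_class, pruning_set)
        if score >= best_score:                      # keep pruning while it helps
            best, best_score = candidate, score
        else:
            break
    return best
```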
26
Related work
EL-Manzalawy et al. (2008)
Data: 934 epitopes, 934 random peptides
Compared several machine learning methods
Best performance by support vector machine + string kernel
SVM accuracy: their data, string kernel: 69.59 %; our data, string kernel: 78.11 % (first baseline)
27
Immune response prediction
28
First attempt: AA counts
Attributes: one count per amino acid, A (alanine count), C (cysteine count), ..., Y, plus Class (positive or negative immune response)
29
First attempt: AA counts
Attributes: one count per amino acid, A (alanine count), C (cysteine count), ..., Y, plus Class (positive or negative immune response)
Example peptide: QGDYCRPTVQEERKL, response 35 (negative)
A C D E F G H I K L M N P Q R S T V W Y | Class
0 1 1 2 0 1 0 0 1 1 0 0 1 2 2 0 1 1 0 1 | neg
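The count attributes are straightforward to compute; a small Python sketch that reproduces the example row above:

```python
# Sketch: the 20 amino-acid-count attributes in the fixed order from the slide.
from collections import Counter

AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"

def aa_count_attributes(peptide):
    counts = Counter(peptide)
    return [counts.get(aa, 0) for aa in AA_ORDER]

# The example peptide from the slide (response 35, i.e. negative):
print(aa_count_attributes("QGDYCRPTVQEERKL"))
# [0, 1, 1, 2, 0, 1, 0, 0, 1, 1, 0, 0, 1, 2, 2, 0, 1, 1, 0, 1]
```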
30
First attempt: AA counts
SVM / Rules accuracy:
String kernel: 78.11 % / –
AA counts: 79.44 % / 73.93 % (second baseline)
Simple attributes are more accurate than the string kernel. SVM is more accurate than rules, but rules can be understood by a human.
31
First attempt: AA counts (Y = 0) and (F = 0) and (E >= 1) Class = neg (Y = 0) and (W = 0) and (R = 0) and (F <= 1) Class = neg (Y = 0 or <= 1) and... Class = neg... and (F = 0 or <= 1) and... Class = neg... and (W = 0 or <= 1) and... Class = neg... otherwise Class = pos Tyrosine Tryptophan Phenylalanine
32
First attempt: AA counts (Y = 0) and (F = 0) and (E >= 1) Class = neg (Y = 0) and (W = 0) and (R = 0) and (F <= 1) Class = neg (Y = 0 or <= 1) and... Class = neg... and (F = 0 or <= 1) and... Class = neg... and (W = 0 or <= 1) and... Class = neg... otherwise Class = pos No tyrosine in peptide Class = neg, otherwise Class = pos (66.53 %) No tryptophan in peptide Class = neg, otherwise Class = pos (59.27 %) No phenylalanine in peptide Class = neg, otherwise Class = pos (63.21 %) (Y = tyrosine, F = phenylalanine, W = tryptophan)
33
AA counts in sections of peptide AA counts ignore position in peptide
34
AA counts in sections of peptide
AA counts ignore position in peptide.
Peptide | X | Y | Class
XXYY | 2 | 2 | pos
XXYY | 2 | 2 | pos
XXYY | 2 | 2 | pos
YYXX | 2 | 2 | neg
YYXX | 2 | 2 | neg
YYXX | 2 | 2 | neg
35
AA counts in sections of peptide
AA counts ignore position in peptide.
Peptide | X | Y | Class | X left | Y left | X right | Y right
XXYY | 2 | 2 | pos | 2 | 0 | 0 | 2
XXYY | 2 | 2 | pos | 2 | 0 | 0 | 2
XXYY | 2 | 2 | pos | 2 | 0 | 0 | 2
YYXX | 2 | 2 | neg | 0 | 2 | 2 | 0
YYXX | 2 | 2 | neg | 0 | 2 | 2 | 0
YYXX | 2 | 2 | neg | 0 | 2 | 2 | 0
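A sketch of the per-section variant, assuming the 15-mer is cut into k roughly equal pieces (the exact splitting used in the experiments is not specified here):

```python
# Sketch: split the peptide into k roughly equal sections and compute the 20
# amino-acid counts separately in each section (a position-aware variant).
from collections import Counter

AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"

def section_count_attributes(peptide, k):
    n = len(peptide)
    bounds = [round(i * n / k) for i in range(k + 1)]
    attributes = []
    for i in range(k):
        counts = Counter(peptide[bounds[i]:bounds[i + 1]])
        attributes.extend(counts.get(aa, 0) for aa in AA_ORDER)  # 20 per section
    return attributes

# 3 sections of a 15-mer give 3 x 20 = 60 attributes
print(len(section_count_attributes("QGDYCRPTVQEERKL", 3)))  # 60
```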
36
AA counts in sections of peptide
SVM accuracy increases a bit, rules accuracy decreases. SVM better at coping with many attributes?
SVM / Rules:
AA counts: 79.44 % / 73.93 %
AA counts, 2 sections: 79.98 % / 73.84 %
AA counts, 3 sections: 80.31 % / 70.87 %
AA counts, 4 sections: 79.99 % / 69.05 %
AA counts, 5 sections: 79.98 % / 71.49 %
37
AA count differences Machine learning cannot infer all relations automatically; it needs help
38
AA count differences
Machine learning cannot infer all relations automatically; it needs help.
X | Y | Z | Class
1 | 2 | 5 | pos
2 | 3 | 3 | pos
3 | 4 | 1 | pos
2 | 1 | 5 | neg
3 | 2 | 3 | neg
4 | 3 | 1 | neg
39
AA count differences
Machine learning cannot infer all relations automatically; it needs help.
X | Y | Z | Class | X – Y | ...
1 | 2 | 5 | pos | –1 | ...
2 | 3 | 3 | pos | –1 | ...
3 | 4 | 1 | pos | –1 | ...
2 | 1 | 5 | neg | 1 | ...
3 | 2 | 3 | neg | 1 | ...
4 | 3 | 1 | neg | 1 | ...
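A sketch of the difference attributes: for every pair of amino acids, the difference of their counts is added explicitly (190 extra attributes for 20 amino acids):

```python
# Sketch: explicit count-difference attributes (AA1 count minus AA2 count) for
# every pair of amino acids, since the learner does not derive them on its own.
from collections import Counter
from itertools import combinations

AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"

def count_difference_attributes(peptide):
    counts = Counter(peptide)
    return [counts.get(a, 0) - counts.get(b, 0)
            for a, b in combinations(AA_ORDER, 2)]       # 20 choose 2 = 190

print(len(count_difference_attributes("QGDYCRPTVQEERKL")))  # 190
```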
40
AA count differences
SVM / Rules:
AA counts: 79.44 % / 73.93 %
AA count differences: 78.48 % / 72.91 %
AA counts + AA count differences: 79.39 % / 75.29 %
SVM accuracy decreases a bit, rules accuracy increases. The changes are small.
41
AA count differences (E – Y <= –1) and (N – R <= –1) Class = pos (E – Y <= 0) and (D – Y <= – 1) and (Q – Y <= – 2) Class = pos (E – something <= –1 or 0) and... Class = pos (something – Y <= –2, –1 or 0) and... Class = pos... otherwise Class = neg Tyrosine Glutamic acid
42
AA count differences (E – Y <= –1) and (N – R <= –1) Class = pos (E – Y <= 0) and (D – Y <= – 1) and (Q – Y <= – 2) Class = pos (E – something <= –1 or 0) and... Class = pos (something – Y <= –2, –1 or 0) and... Class = pos... otherwise Class = neg No tyrosine in peptide Class = neg, otherwise Class = pos (66.53 %) No glutamic acid in peptide Class = pos, otherwise Class = neg (59.77 %) Tyrosine Glutamic acid
43
Substring counts Perhaps single AAs are not informative enough and we should count longer substrings
44
Substring counts
Perhaps single AAs are not informative enough and we should count longer substrings.
Peptide | X | Y | Class
XXXYY | 3 | 2 | pos
YXXXY | 3 | 2 | pos
YYXXX | 3 | 2 | pos
XYXYX | 3 | 2 | neg
XYYXX | 3 | 2 | neg
XXYYX | 3 | 2 | neg
45
Substring counts
Perhaps single AAs are not informative enough and we should count longer substrings.
Peptide | X | Y | Class | XXX | ...
XXXYY | 3 | 2 | pos | 1 | ...
YXXXY | 3 | 2 | pos | 1 | ...
YYXXX | 3 | 2 | pos | 1 | ...
XYXYX | 3 | 2 | neg | 0 | ...
XYYXX | 3 | 2 | neg | 0 | ...
XXYYX | 3 | 2 | neg | 0 | ...
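A sketch of contiguous substring counting (lengths 1 to 3), returning a sparse dictionary of counts; `max_len` is the only assumed parameter name.

```python
# Sketch: count every contiguous substring of length 1..max_len; each distinct
# substring becomes one (sparse) attribute.
from collections import Counter

def substring_counts(peptide, max_len=3):
    counts = Counter()
    for length in range(1, max_len + 1):
        for start in range(len(peptide) - length + 1):
            counts[peptide[start:start + length]] += 1
    return counts

print(substring_counts("XXXYY"))
# X: 3, Y: 2, XX: 2, XY: 1, YY: 1, XXX: 1, XXY: 1, XYY: 1 (a Counter)
```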
46
Substring counts
SVM / Rules:
AA counts: 79.44 % / 73.93 %
Counts of substrings of lengths up to 2: 78.92 % / 74.00 %
Counts of substrings of lengths up to 3: 79.01 % / 73.91 %
The changes are small. Only one rule with substrings of length above 1:
(Y = 0) and (F = 0) and (W = 0) and (M = 0) and (I = 0) and (LL = 0) Class = neg
47
Substrings with gaps Machine learning needs recurring patterns Small counts for substrings of length above 1 – little recurrence Increase substring counts by allowing gaps between AAs
48
Substrings with gaps
Machine learning needs recurring patterns. Small counts for substrings of length above 1 – little recurrence. Increase substring counts by allowing gaps between AAs.
XYXABCXXY: ABC × 1
YYAXBCYYX: ABC × 0.5
XYABXXCYX: ABC × 0.5²
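A sketch of the gapped variant, assuming each skipped position inside a match multiplies its contribution by 0.5 (as in the ABC examples above) and that gaps between consecutive pattern letters are at most `max_gap` long; both the weighting and the parameter name are assumptions.

```python
# Sketch: count occurrences of `pattern` in `peptide`, allowing gaps of up to
# max_gap positions between consecutive pattern letters; every skipped position
# multiplies the contribution of that occurrence by 0.5.
def gapped_count(peptide, pattern, max_gap=2):
    def count_from(pos, idx, gaps):
        if idx == len(pattern):
            return 0.5 ** gaps                     # weight of one completed match
        # the first pattern letter may start anywhere; later letters may skip
        # at most max_gap positions
        max_offset = len(peptide) - pos - 1 if idx == 0 else max_gap
        total = 0.0
        for offset in range(max_offset + 1):
            nxt = pos + offset
            if nxt >= len(peptide):
                break
            if peptide[nxt] == pattern[idx]:
                total += count_from(nxt + 1, idx + 1, gaps + (offset if idx else 0))
        return total
    return count_from(0, 0, 0)

print(gapped_count("XYXABCXXY", "ABC"))  # 1.0
print(gapped_count("YYAXBCYYX", "ABC"))  # 0.5
print(gapped_count("XYABXXCYX", "ABC"))  # 0.25
```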
49
Substrings with gaps
SVM / Rules:
AA counts: 79.44 % / 73.93 %
Lengths up to 3, no gaps: 79.01 % / 73.91 %
Lengths up to 3, gap lengths up to 1: 79.11 % / 74.37 %
Lengths up to 3, gap lengths up to 2: 78.83 % / 74.62 %
Lengths up to 3, gap lengths up to 3: 79.10 % / 74.71 %
Lengths up to 3, gap lengths up to 4: 78.91 % / 75.54 %
SVM accuracy decreases a bit, rules accuracy increases. The changes are small.
50
Substrings with gaps Still no rules with substrings of length 3 More rules with substrings of length 2
51
Substrings with gaps
Still no rules with substrings of length 3. More rules with substrings of length 2.
(Y = 0) and (F = 0) and (E >= 1) and (W = 0) and (RL = 0) Class = neg... otherwise Class = pos
Leucine (L): positive response when in a pair.
Substring | Count
RL/LR | 5
LL | 5
LP | 2
SE | 2
KK | 2
EL | 1
SL | 1
RR | 1
PP | 1
52
Classes of AAs Perhaps individual AAs are too specific and we should merge similar AAs into classes
53
Classes of AAs
Perhaps individual AAs are too specific and we should merge similar AAs into classes.
Flexibility index (AAs ranked from least to most flexible):
W -0.727, Y -0.721, F -0.719, C -0.693, I -0.682, V -0.669, H -0.662, L -0.631, M -0.626, A -0.605, G -0.537, T -0.525, R -0.448, S -0.423, N -0.381, Q -0.369, D -0.279, P -0.271, E -0.160, K -0.043
Three classes: inflexible (1), medium (2), flexible (3)
54
Classes of AAs
Perhaps individual AAs are too specific and we should merge similar AAs into classes.
Flexibility index (AAs ranked from least to most flexible):
W -0.727, Y -0.721, F -0.719, C -0.693, I -0.682, V -0.669, H -0.662, L -0.631, M -0.626, A -0.605, G -0.537, T -0.525, R -0.448, S -0.423, N -0.381, Q -0.369, D -0.279, P -0.271, E -0.160, K -0.043
Three classes: inflexible (1), medium (2), flexible (3)
Example peptide: QGDYCRPTVQEERKL → 323113321333332
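A sketch of the class translation; the exact group boundaries are an assumption inferred from the worked example (WYFCIV inflexible, HLMAGT medium, RSNQDPEK flexible). The resulting class string can then be fed to the same substring-count attributes as before.

```python
# Sketch: translate a peptide into flexibility classes.  The three groups below
# are an assumption inferred from the slide's worked example.
FLEX_CLASS = {aa: "1" for aa in "WYFCIV"}      # inflexible
FLEX_CLASS.update({aa: "2" for aa in "HLMAGT"})    # medium
FLEX_CLASS.update({aa: "3" for aa in "RSNQDPEK"})  # flexible

def flexibility_string(peptide):
    return "".join(FLEX_CLASS[aa] for aa in peptide)

print(flexibility_string("QGDYCRPTVQEERKL"))  # 323113321333332
```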
55
Classes of AAs
Counts of substrings of lengths up to 3 over the class alphabet (gaps of length 1 allowed):
1, 2, 3
11, 12, 13, 21, 22, 23, 31, 32, 33
111, 112, 113, 121, 122, 123, 131, 132, 133, 211, 212, 213, 221, 222, 223, 231, 232, 233, 311, 312, 313, 321, 322, 323, 331, 332, 333
56
Classes of AAs
SVM / Rules:
AA counts: 79.44 % / 73.93 %
AAs, lengths up to 3, gaps up to 1: 79.11 % / 74.37 %
Aromatic/aliphatic: 72.51 % / 72.49 %
Basic/acidic: 61.15 % / 61.06 %
Flexibility: 69.24 % / 67.46 %
Hydrophobicity: 58.92 % / 56.85 %
Polarity: 64.94 % / 63.50 %
Size: 61.89 % / 60.58 %
Turns index: 58.92 % / 55.57 %
57
Aromatic/aliphatic AAs
Aromatic: phenylalanine (F), histidine (H), tryptophan (W), tyrosine (Y)
Aliphatic: isoleucine (I), leucine (L), valine (V)
Other: all the rest
58
Aromatic/aliphatic AAs
Aromatic: phenylalanine (F), histidine (H), tryptophan (W), tyrosine (Y)
Aliphatic: isoleucine (I), leucine (L), valine (V)
Other: all the rest
No tyrosine in peptide Class = neg, otherwise Class = pos (66.53 %)
No tryptophan in peptide Class = neg, otherwise Class = pos (59.27 %)
No phenylalanine in peptide Class = neg, otherwise Class = pos (63.21 %)
59
Aromatic/aliphatic AAs
Aromatic: phenylalanine (F), histidine (H), tryptophan (W), tyrosine (Y)
Aliphatic: isoleucine (I), leucine (L), valine (V)
Other: all the rest
No tyrosine in peptide Class = neg, otherwise Class = pos (66.53 %)
No tryptophan in peptide Class = neg, otherwise Class = pos (59.27 %)
No phenylalanine in peptide Class = neg, otherwise Class = pos (63.21 %)
1 or no aromatic AAs in peptide Class = neg, otherwise Class = pos (71.37 %)
60
Flexibility of AAs (F1 >= 5) Class = pos (F1 >= 4) and (F33 = 1.5) and (F231 = 0) and (F213 <= 1) Class = pos (F1 >= 4) and (F33 <= 3.5) Class = pos (F1 >= 3) and (F2 >= 5) and (F311 >= 1) Class = pos (F1 >= 3) and (F11 = 0) and (F322 = 1) Class = pos otherwise Class = neg
61
Flexibility of AAs Inflexible AAs indicate pos Flexibility linked to epitope propensity in the literature (F1 >= 5) Class = pos (F1 >= 4) and (F33 = 1.5) and (F231 = 0) and (F213 <= 1) Class = pos (F1 >= 4) and (F33 <= 3.5) Class = pos (F1 >= 3) and (F2 >= 5) and (F311 >= 1) Class = pos (F1 >= 3) and (F11 = 0) and (F322 = 1) Class = pos otherwise Class = neg ?
62
Flexibility of AAs Inflexible AAs indicate pos Flexibility linked to epitope propensity in the literature Y, W, F are inflexible, E is flexible (F1 >= 5) Class = pos (F1 >= 4) and (F33 = 1.5) and (F231 = 0) and (F213 <= 1) Class = pos (F1 >= 4) and (F33 <= 3.5) Class = pos (F1 >= 3) and (F2 >= 5) and (F311 >= 1) Class = pos (F1 >= 3) and (F11 = 0) and (F322 = 1) Class = pos otherwise Class = neg ?
63
Polarity of AAs (N >= 8) and (P + >= 2) and (P – <= 1) Class = pos (N >= 10) and (NP + >= 0.5) and (NNP 0 >= 1.5) Class = pos (...N... >= something) and... Class = pos (... P –... <= something) and... Class = pos... otherwise Class = neg
64
Polarity of AAs Non-polar AAs indicate pos Should polarity not be conducive to antibody binding? (N >= 8) and (P + >= 2) and (P – <= 1) Class = pos (N >= 10) and (NP + >= 0.5) and (NNP 0 >= 1.5) Class = pos (...N... >= something) and... Class = pos (... P –... <= something) and... Class = pos... otherwise Class = neg ?
65
Polarity of AAs Non-polar AAs indicate pos Should polarity not be conducive to antibody binding? Y, W, F are non-polar, E is polar negative (N >= 8) and (P + >= 2) and (P – <= 1) Class = pos (N >= 10) and (NP + >= 0.5) and (NNP 0 >= 1.5) Class = pos (...N... >= something) and... Class = pos (... P –... <= something) and... Class = pos... otherwise Class = neg ?
66
Classes of AAs
SVM / Rules:
AA counts: 79.44 % / 73.93 %
All AA class counts: 79.77 % / 75.07 %
All AA class counts and AA counts: 79.89 % / 76.12 %
SVM accuracy increases a bit. Highest rules accuracy.
67
Classes of AAs
SVM / Rules:
AA counts: 79.44 % / 73.93 %
All AA class counts: 79.77 % / 75.07 %
All AA class counts and AA counts: 79.89 % / 76.12 %
SVM accuracy increases a bit. Highest rules accuracy.
(aromatic AAs >= 3) and (non-polar AAs >= 8) and (medium turn presence AAs >= 8) Class = pos
(aromatic AAs >= 3) and (non-polar AAs >= 8) and (Y >= 2) Class = pos
Precision: 88.64 % and 91.71 %
68
AA pair counts Antibody Peptide
69
AA pair counts Antibody Peptide AA1 AA2 Distance d Attributes are counts of AA1 and AA2 at distance d for all AA1, AA2 and d
70
AA pair counts Too many attributes: 20 AAs × 20 AAs × 14 possible distances = 5,600 attributes
71
AA pair counts
Too many attributes: 20 AAs × 20 AAs × 14 possible distances = 5,600 attributes
Ways to reduce them:
– Classes of AAs instead of individual AAs
– Increment distances in steps > 1 (e.g. a step of 3 over d = 1, 2, 3, 4, 5, ...)
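One possible reading of the pair attributes with a distance step, sketched in Python; grouping distances into buckets of size `step` is an assumption about how the step reduces the attribute count (the slides do not spell out the exact reduction).

```python
# Sketch of pair-count attributes with a distance step: for every ordered pair
# of residues in the peptide, count the pair (AA1, AA2, distance bucket), where
# distances are grouped into buckets of size `step`.
from collections import Counter

def pair_count_attributes(peptide, step=4):
    counts = Counter()
    for i in range(len(peptide)):
        for j in range(i + 1, len(peptide)):
            bucket = (j - i - 1) // step               # bucketed distance
            counts[(peptide[i], peptide[j], bucket)] += 1
    return counts
```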
72
AA pair counts
SVM / Rules:
AA counts: 79.44 % / 73.93 %
AA × AA, distance step 4: 75.28 % / 71.02 %
AA × aromatic/aliphatic, step 3: 78.25 % / 72.80 %
Aromatic/aliphatic × aromatic/aliphatic, step 2: 72.56 % / 70.90 %
Accuracy decreases. Rules not particularly illuminating.
73
AA pair counts with a fixed side Antibody Easily accessible side of peptide Peptide array surface
74
AA pair counts with a fixed side Antibody Easily accessible side of peptide AA at fixed position AA at d Distance d Attributes are AA at fixed position and counts of AA at distance d for all AA and d Peptide array surface
75
AA pair counts with a fixed side Fewer attributes: 20 AAs at fixed position + 20 AAs × 14 possible distances = 300 attributes
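A sketch of the fixed-side attributes, assuming the fixed position is the first residue (the accessible end of the peptide); since a 15-mer has exactly one residue at each distance from that end, the per-distance counts reduce to 0/1 indicators, giving 20 + 20 × 14 = 300 attributes.

```python
# Sketch: fixed-side pair attributes, assuming the fixed position is the first
# residue.  20 indicators for the fixed residue + 20 x 14 for the residues at
# each distance from it = 300 attributes for a 15-mer.
AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"

def fixed_side_attributes(peptide):
    attributes = [1 if peptide[0] == aa else 0 for aa in AA_ORDER]   # fixed residue
    for d in range(1, len(peptide)):                                 # distances 1..14
        attributes.extend(1 if peptide[d] == aa else 0 for aa in AA_ORDER)
    return attributes

print(len(fixed_side_attributes("QGDYCRPTVQEERKL")))  # 300
```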
76
AA pair counts with a fixed side
SVM accuracy increases a bit. Rules accuracy decreases due to bad attributes.
SVM / Rules:
AA counts: 79.44 % / 73.93 %
Pair AA × AA, distance step 4: 75.28 % / 71.02 %
Fixed pair, step 2: 80.07 % / 69.35 %
Fixed pair, step 1: 78.82 % / 67.60 %
77
AA pair counts with a fixed side
SVM accuracy increases a bit. Rules accuracy decreases due to bad attributes.
SVM / Rules:
AA counts: 79.44 % / 73.93 %
Pair AA × AA, distance step 4: 75.28 % / 71.02 %
Fixed pair, step 2: 80.07 % / 69.35 %
Fixed pair, step 1: 78.82 % / 67.60 %
(Y at d 2 >= 1) Class = pos (Y at d 6 >= 1) Class = pos (Y at d 7 >= 1) Class = pos (Y at d 9 >= 1) Class = pos (Y at d 11 >= 1) Class = pos (Y at d 14 >= 1) Class = pos... otherwise Class = neg
78
AA properties (in sections of peptide) AA properties on which AA classes are based can be used directly – averaged over peptide
79
AA properties (in sections of peptide)
AA properties on which AA classes are based can be used directly – averaged over the peptide.
SVM / Rules:
AA counts: 79.44 % / 73.93 %
AA properties: 76.45 % / 72.63 %
AA properties, 2 sections: 76.97 % / 73.01 %
AA properties, 3 sections: 77.25 % / 71.13 %
AA properties and AA counts: 79.30 % / 74.97 %
AA properties and AA counts, 2 sections: 79.89 % / 74.24 %
AA properties and AA counts, 3 sections: 80.05 % / 70.80 %
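A sketch of a property attribute: average a per-amino-acid scale over the whole peptide, here using the flexibility index values listed earlier as the example scale (per-section averages work the same way on slices of the peptide).

```python
# Sketch: a property attribute is the average of a per-amino-acid scale over
# the peptide (or over each section).  The flexibility index from the earlier
# slide is used as the example scale.
FLEXIBILITY = {"W": -0.727, "Y": -0.721, "F": -0.719, "C": -0.693, "I": -0.682,
               "V": -0.669, "H": -0.662, "L": -0.631, "M": -0.626, "A": -0.605,
               "G": -0.537, "T": -0.525, "R": -0.448, "S": -0.423, "N": -0.381,
               "Q": -0.369, "D": -0.279, "P": -0.271, "E": -0.160, "K": -0.043}

def mean_property(peptide, scale=FLEXIBILITY):
    return sum(scale[aa] for aa in peptide) / len(peptide)

print(round(mean_property("QGDYCRPTVQEERKL"), 3))  # -0.422 for the example peptide
```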
80
AA properties (in sections of peptide) (aromatic >= 0.3) Class = pos (polarity <= something) and... Class = pos (basic >= something) and... Class = pos... otherwise Class = neg
81
AA properties (in sections of peptide) (aromatic >= 0.3) Class = pos (polarity <= something) and... Class = pos (basic >= something) and... Class = pos... otherwise Class = neg Aromatic AAs indicate pos Non-polar AAs indicate pos (Y, W, F non-polar) Presence of basic (H, K, R) / absence of acidic (E, D) AAs indicates pos
82
Limitations and future work
83
Why are we stuck at 80 % Training data A single antibody or a group of similar antibodies A single antibody or a group of similar antibodies Single different antibodies
84
Why are we stuck at 80 % Training data A single antibody or a group of similar antibodies A single antibody or a group of similar antibodies Single different antibodies Similar peptides Each peptide different
85
Why are we stuck at 80 % Training data A single antibody or a group of similar antibodies A single antibody or a group of similar antibodies Single different antibodies Similar peptides Similarity recognized by machine learning: our 80 % Each peptide different Ignored by machine learning
86
Why are we stuck at 80 % Training data Test data A single antibody or a group of similar antibodies A single antibody or a group of similar antibodies Single different antibodies Similar peptides Similarity recognized by machine learning: our 80 % Each peptide different Ignored by machine learning Recognized by machine learning: our 80 % again Not recognized
87
Where do we go from here Tyrosine followed by another aromatic AA followed by tryptophan followed by polar AA
88
Where do we go from here Tyrosine followed by another aromatic AA followed by tryptophan followed by polar AA Aggregating rules with different attributes Kernels to use many attributes simultaneously Peptide similarity
89
Aggregating rules Check if different attribute sets cover different instances If so, pick the best rules for each attribute set Use only the best rules for classification
90
Aggregating rules Check if different attribute sets cover different instances If so, pick the best rules for each attribute set Use only the best rules for classification [figure: Rule 1, Rule 2, Rule 3, ... covering different instances in the training and test sets]
91
Kernels Use many attributes without computing them explicitly Only works with some methods like SVM
92
Kernels Use many attributes without computing them explicitly Only works with some methods like SVM Instance 1: (a11, a12, ..., a1n) Instance 2: (a21, a22, ..., a2n)
93
Kernels Use many attributes without computing them explicitly Only works with some methods like SVM Instance 1: (a11, a12, ..., a1n) Instance 2: (a21, a22, ..., a2n) (a11, a12, ..., a1n) · (a21, a22, ..., a2n) = a11·a21 + a12·a22 + ... + a1n·a2n Only need the dot product Only need to compute attributes that are non-zero in both attribute vectors
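The point about non-zero attributes can be made concrete with sparse vectors stored as dictionaries; a small sketch with toy substring counts (not real data):

```python
# Sketch: with sparse attribute vectors stored as dictionaries, the dot product
# only touches attributes that are non-zero in both instances, so the full
# attribute vector is never materialised.
def sparse_dot(x, y):
    if len(x) > len(y):          # iterate over the smaller dictionary
        x, y = y, x
    return sum(value * y[key] for key, value in x.items() if key in y)

a = {"Y": 1, "LL": 2, "RL": 1}   # toy substring counts of one peptide
b = {"Y": 2, "RL": 1, "KK": 1}
print(sparse_dot(a, b))          # 1*2 + 1*1 = 3
```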
94
Peptide similarity Smart similarity: – Find best alignment of two peptides – Then compute similarity
95
Peptide similarity Smart similarity: – Find best alignment of two peptides – Then compute similarity Nearest-neighbor classification
96
Peptide similarity Smart similarity: – Find best alignment of two peptides – Then compute similarity Nearest-neighbor classification Clustering: – Find groups of similar peptides – Find groups of peptides that are similar in the same way
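A sketch of nearest-neighbor classification on top of a crude similarity (number of identical residues at the best ungapped offset); a real implementation would use a proper alignment with a substitution matrix, and the function and variable names here are only illustrative.

```python
# Sketch: nearest-neighbor classification with a crude ungapped similarity.
def similarity(p, q):
    best = 0
    for offset in range(-len(q) + 1, len(p)):          # slide q along p
        matches = sum(1 for i in range(len(p))
                      if 0 <= i - offset < len(q) and p[i] == q[i - offset])
        best = max(best, matches)
    return best

def nearest_neighbor_label(peptide, training_data):
    # training_data: list of (sequence, label) pairs
    return max(training_data, key=lambda item: similarity(peptide, item[0]))[1]
```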
97
Questions? Suggestions? Peptide arrays – SVM – RIPPER – EL-Manzalawy AA counts – AA counts in sections AA count differences – Substring counts (with gaps) AA classes – AA pairs (with a fixed side) – AA properties Stuck at 80 % Aggregating rules – Kernels – Peptide similarity