Immunological Bioinformatics Introduction to the immune system
Vaccination: administration of a substance to a person with the purpose of preventing a disease. Traditionally composed of a killed or weakened microorganism. Vaccination works by creating a type of immune response that enables the memory cells to later respond to a similar organism before it can cause disease.
Figure 1-20
Effectiveness of vaccines: 1958, start of the smallpox eradication program
The Immune System The innate immune system The adaptive immune system
The innate immune system Unspecific Antigen independent Immediate response No training/selection hence no memory Pathogen independent (but response might be pathogen type dependent)
The adaptive immune system Pathogen specific –Humoral –Cellular Bacteria Virus Parasite
Adaptive immune response Signal induced –Pathogens Antigens –Epitopes B Cell T Cell
Diversity is a hallmark of the (adaptive) immune system. Diversity of lymphocytes –Huge diversity within a host –At least $10^8$ different T & B cell clones. Receptors made by recombination & N-additions, and somatic mutation during immune response. Repertoires are (partly) random –Randomness requires self tolerance
Figure 1-14
The role of lymphocytes
Cartoon by Eric Reits Humoral immunity
Antibody - Antigen interaction Fab Antigen Epitope Paratope Antibody The antibody recognizes structural properties of the surface of the antigen
Antibody effect: virus or toxin, neutralizing antibodies
Cellular immune response Cartoon by Eric Reits
MHC-I molecules present peptides on the surface of most cells
CTL response: healthy cell, virus-infected cell, MHC-I
CTL response: virus-infected cell, MHC-I
The death of an infected cell
Polymorphism of MHC Within a host limited number of loci (genes) –only 6 different class I molecules (two A, B and C) –only 12 different class II molecules Within a population > 100 alleles per locus
More MHC molecules: more diversity in the presented peptides 1% probability that MHC molecule presents a peptide Different hosts sample different peptides from same pathogen.
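The 1% figure above can be turned into rough arithmetic. The sketch below assumes that each MHC molecule presents a given peptide independently with probability 0.01; the independence assumption and the function name `prob_presented` are illustrative and not part of the slide.

```python
def prob_presented(n_molecules: int, p_single: float = 0.01) -> float:
    """Probability that at least one of n MHC molecules presents a given peptide.

    Assumes each molecule presents the peptide independently with
    probability p_single (an illustrative simplification).
    """
    return 1.0 - (1.0 - p_single) ** n_molecules


if __name__ == "__main__":
    # Compare a host with 1, 3 and 6 distinct class I molecules.
    for n in (1, 3, 6):
        print(n, round(prob_presented(n), 4))
```

Under these assumptions a host with six distinct class I molecules presents a given peptide with a probability of roughly 5.9%, compared with 1% for a single molecule.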
Immunological benefits of MHC polymorphism Heterozygote advantage –Heterozygotes have a selective advantage because they can present more peptides (Hughes.n88). Coevolution –Pathogens avoid presentation on common MHC alleles (HIV) –Frequency dependent selection
Figure 5-13
Heterozygote disadvantage! (for vaccine design) Few human beings will share the same set of HLA alleles –Different persons will react differently to a pathogen infection. A CTL-based vaccine must include epitopes specific for each HLA allele in a population –A CTL-based vaccine must consist of ~800 HLA class I epitopes and ~400 class II epitopes
HLA specificity clustering A0201 A0101 A6802 B0702
HLA polymorphism - supertypes Each HLA molecule within a supertype binds essentially the same peptides Nine major HLA class I supertypes have been defined HLA-A1, A2, A3, A24,B7, B27, B44, B58, and B62 And maybe add three more HLA-A26, HLA-B8, and HLA-B39 => A CTL based vaccine must consist of 9-12 HLA class I epitopes Sette et al, Immunogenetics (1999) 50:
Summary. The adaptive immune system is extremely diverse –An immune response can be raised against anything foreign! Antibodies define the humoral response –Antibodies recognize structural properties on the surface of extracellular antigens. T cells define the cellular response –CTLs kill cells that present MHC molecules bound to foreign peptides derived from inside the cell
Anchor positions MHC class I with peptide
What makes a peptide a potential and effective epitope? Part of a pathogen protein Successful processing –Proteasome cleavage –TAP binding Binds to MHC molecule Protein function and expression –Early in replication –Highly expressed proteins are more likely to generate immunogens Sequence conservation in evolution
Prediction of HLA binding specificity: historical overview. Simple motifs –Allowed/non-allowed amino acids. Extended motifs –Amino acid preferences (SYFPEITHI) –Anchor/preferred/other amino acids. Hidden Markov models –Peptide statistics from sequence alignment. Neural networks –Can take sequence correlations into account
SYFPEITHI predictions. Extended motifs based on peptides from the literature and peptides eluted from cells expressing specific HLAs (i.e., binding peptides). Scoring scheme is not readily accessible. Positions defined as anchor or auxiliary anchor positions are weighted differently (higher). The final score is the sum of the scores at each position. Predictions can be made for several HLA-A, -B and -DRB1 alleles, as well as some mouse K, D and L alleles.
BIMAS. Matrix made from peptides with a measured $T_{1/2}$ for the MHC-peptide complex. The matrices are available on the website. The final score is the product of the scores of each position in the matrix, multiplied by a constant that differs for each MHC, to give a prediction of the $T_{1/2}$. Predictions can be obtained for several HLA-A, -B and -C alleles, mouse K, D and L alleles, and a single cattle MHC.
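The two scoring rules described for SYFPEITHI (sum over positions) and BIMAS (product over positions times an MHC-specific constant) can be contrasted in a minimal sketch. All matrix values and the constant below are hypothetical placeholders, not coefficients from either method, and the function names are illustrative.

```python
from math import prod


def additive_score(peptide, matrix):
    """SYFPEITHI-style score: sum of per-position scores (unlisted residues score 0)."""
    return sum(matrix[i].get(aa, 0) for i, aa in enumerate(peptide))


def multiplicative_score(peptide, matrix, constant):
    """BIMAS-style score: constant times the product of per-position coefficients.

    Residues not listed at a position get the neutral coefficient 1.0.
    """
    return constant * prod(matrix[i].get(aa, 1.0) for i, aa in enumerate(peptide))


# Hypothetical 9-mer matrices: only the anchor positions P2 and P9 carry values.
toy_additive = [{} for _ in range(9)]
toy_additive[1] = {"L": 10, "M": 8}
toy_additive[8] = {"V": 10, "L": 8}

toy_multiplicative = [{} for _ in range(9)]
toy_multiplicative[1] = {"L": 5.0}
toy_multiplicative[8] = {"V": 2.0}

if __name__ == "__main__":
    print(additive_score("GLLGNVSTV", toy_additive))
    print(multiplicative_score("GLLGNVSTV", toy_multiplicative, constant=0.5))
```

The additive form treats positions as independent contributions to a score, while the multiplicative form lets a single unfavorable coefficient scale down the whole prediction.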
SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAV LLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTL HLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTI ILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSL LERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGV PLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGV ILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQM KLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSV KTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKV SLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYV ILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLV TGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAA GAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLA KARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIV AVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVV GLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLV VLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQC ISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGA YTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYI NMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTV VVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQ GLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYL EAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAV YLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRL FLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF SLLQHLIGL ELTLGEFLK MINAYLDKL 
AAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYI AAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV Sequence information
Sequence information. Calculate $p_a$ at each position. Entropy: $S = -\sum_a p_a \log p_a$. Information content: $I = \log(20) - S = \log(20) + \sum_a p_a \log p_a$. Conserved positions –$p_V = 1$, $p_{\neg V} = 0$ => $S = 0$, $I = \log(20)$. Mutable positions –$p_a = 1/20$ for every amino acid => $S = \log(20)$, $I = 0$. Say that a peptide must have L at P2 in order to bind, and that A, F, W, and Y are found at P1. Which position has most information? How many questions do I need to ask to tell if a peptide binds looking at only P1 or P2? P1: 4 questions (at most). P2: 1 question (L or not). P2 has the most information.
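The two limiting cases on this slide (fully conserved and fully mutable positions) can be checked numerically. This is a minimal sketch that assumes the usual Shannon entropy $S = -\sum_a p_a \log p_a$ and $I = \log(20) - S$; base-2 logarithms are chosen here so that values are in bits, which is an assumption since the slide does not state the base.

```python
import math
from collections import Counter


def entropy(column):
    """Shannon entropy (bits) of the amino acids observed in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_content(column, alphabet_size=20):
    """Information content I = log2(alphabet_size) - S for one column."""
    return math.log2(alphabet_size) - entropy(column)


if __name__ == "__main__":
    p2 = "LLLL"  # conserved position: only L is allowed
    p1 = "AFWY"  # four amino acids, equally frequent
    print("P2:", entropy(p2), information_content(p2))
    print("P1:", entropy(p1), information_content(p1))
```

The conserved column has zero entropy and the maximal information content $\log_2(20)$, while the column with four equally frequent residues has a lower information content, in line with the conclusion that P2 carries the most information.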
Information content (table: per-position amino acid frequencies, columns A R N D C Q E G H I L K M F P S T W Y V, followed by entropy S and information content I)
Sequence logos Height of a column equal to I Relative height of a letter is p Highly useful tool to visualize sequence motifs High information positions HLA-A0201
Characterizing a binding motif from small data sets. 10 MHC restricted peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV. What can we learn? 1. A at P1 favors binding? 2. I is not allowed at P9? 3. K at P4 favors binding? 4. Which positions are important for binding?
Simple motifs: yes/no rules. 10 MHC restricted peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV. Only 11 of 212 peptides identified! Need more flexible rules: if a peptide does not fit at P1 but fits at P2, then it is OK. Not all positions are equally important: we know that P2 and P9 determine binding more than other positions. Cannot discriminate between good and very good binders.
Simple motifs: yes/no rules. Example: RLLDDTPEV 84 nM, GLLGNVSTV 23 nM, ALAKAAAAL 309 nM. The first two peptides will not fit the motif, yet all three are good binders (affinity < 500 nM). 10 MHC restricted peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
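The yes/no rule can be sketched as a set of allowed amino acids per position, built from the 10 MHC restricted peptides on the slide. The function names `build_motif` and `fits` are illustrative, and treating "allowed" as "observed at least once at that position" is an assumption about how the simple motif is defined.

```python
BINDERS = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]


def build_motif(peptides):
    """Return, for each position, the set of amino acids observed there."""
    return [set(column) for column in zip(*peptides)]


def fits(peptide, motif):
    """Yes/no rule: every residue must be in the allowed set at its position."""
    return len(peptide) == len(motif) and all(
        aa in allowed for aa, allowed in zip(peptide, motif)
    )


motif = build_motif(BINDERS)

if __name__ == "__main__":
    for pep in ("RLLDDTPEV", "GLLGNVSTV", "ALAKAAAAL"):
        print(pep, fits(pep, motif))
```

With this definition RLLDDTPEV is rejected at P1 (R was never observed) and GLLGNVSTV at P5 (N was never observed), while ALAKAAAAL is accepted, even though the two rejected peptides are the stronger binders.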
Extended motifs: fitness of an amino acid at each position given by P(aa). Example, P1: $P_A = 6/10$, $P_G = 2/10$, $P_T = P_K = 1/10$, $P_C = P_D = \dots = P_V = 0$. Problems –Few data –Data redundancy/duplication. Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV. RLLDDTPEV 84 nM, GLLGNVSTV 23 nM, ALAKAAAAL 309 nM
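The P1 frequencies quoted on this slide follow from plain counting over the 10 peptides. A minimal sketch, with the helper name `position_frequencies` chosen for illustration:

```python
from collections import Counter

BINDERS = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]


def position_frequencies(peptides, position):
    """Raw amino acid frequencies at a 1-based position (P1, P2, ...)."""
    column = [pep[position - 1] for pep in peptides]
    counts = Counter(column)
    return {aa: count / len(column) for aa, count in counts.items()}


if __name__ == "__main__":
    print(position_frequencies(BINDERS, 1))
```

Amino acids that never occur at a position are simply absent from the result, which corresponds to a frequency of zero and is exactly the weakness that sequence weighting and pseudo counts address.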
Sequence information: raw sequence counting. Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Sequence weighting: poor or biased sampling of sequence space. Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV. The first five peptides (ALAKAAAAM to ALAKAAAAV) are similar sequences, so each gets weight 1/5. Example, P1: $P_A = 2/6$, $P_G = 2/6$, $P_T = P_K = 1/6$, $P_C = P_D = \dots = P_V = 0$. RLLDDTPEV 84 nM, GLLGNVSTV 23 nM, ALAKAAAAL 309 nM
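One way to arrive at the 1/5 weights is to cluster similar peptides and give each member a weight of one over the cluster size. The sketch below uses a simple greedy clustering with a threshold of at least 8 identical positions out of 9; both the threshold and the greedy procedure are assumptions for illustration, not necessarily the scheme used to make the slide.

```python
BINDERS = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]


def identical_positions(a, b):
    """Number of positions at which two equal-length peptides agree."""
    return sum(x == y for x, y in zip(a, b))


def cluster_weights(peptides, min_identical=8):
    """Greedy clustering; each peptide gets weight 1 / (size of its cluster)."""
    clusters = []  # each cluster is a list of peptide indices
    for i, pep in enumerate(peptides):
        for cluster in clusters:
            if any(identical_positions(pep, peptides[j]) >= min_identical
                   for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found: start a new one
    weights = [0.0] * len(peptides)
    for cluster in clusters:
        for i in cluster:
            weights[i] = 1.0 / len(cluster)
    return weights


def weighted_frequencies(peptides, weights, position):
    """Weighted amino acid frequencies at a 1-based position."""
    total = sum(weights)
    freqs = {}
    for pep, w in zip(peptides, weights):
        aa = pep[position - 1]
        freqs[aa] = freqs.get(aa, 0.0) + w / total
    return freqs


if __name__ == "__main__":
    w = cluster_weights(BINDERS)
    print(w)
    print(weighted_frequencies(BINDERS, w, 1))
```

The five ALAKAAAA* peptides end up in one cluster and together count as a single sequence, so the weighted P1 frequencies match the slide: A and G at 2/6, T and K at 1/6.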
Sequence weighting. Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Pseudo counts. Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV. I is not found at position P9. Does this mean that I is forbidden (P(I) = 0)? No! Use the Blosum substitution matrix to estimate the pseudo frequency of I at P9.
The Blosum matrix (rows and columns: A R N D C Q E G H I L K M F P S T W Y V). Some amino acids are highly conserved (e.g. C), some have a high chance of mutation (e.g. I).
What is a pseudo count? (Blosum matrix, rows and columns: A R N D C … Y V.) Say I observe V at P1. Knowing that V at P1 binds, what is the probability that a peptide could have I at P1? P(I|V) = 0.16
Pseudo count estimation. Calculate the observed amino acid frequencies $f_a$. Pseudo frequency for amino acid b: $g_b = \sum_a f_a \, q(b|a)$, where $q(b|a)$ is the Blosum conditional probability of b given a. Example peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
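The pseudo frequency can be sketched as a weighted sum over the observed amino acids, assuming the usual Blosum-based definition $g_b = \sum_a f_a \, q(b|a)$. The only conditional probability used below is P(I|V) = 0.16 from the previous slide; a real calculation would need the full Blosum-derived conditional matrix, which is not reproduced here. The function name and the nested-dictionary layout are illustrative.

```python
def pseudo_frequency(b, observed_freqs, cond_prob):
    """Pseudo frequency g_b = sum over observed a of f_a * q(b|a).

    observed_freqs maps amino acid a -> f_a.
    cond_prob[a][b] holds q(b|a), the probability of b given that a was observed.
    """
    return sum(f_a * cond_prob[a][b] for a, f_a in observed_freqs.items())


if __name__ == "__main__":
    # Single observation: only V is seen at P1, so f_V = 1.
    observed = {"V": 1.0}
    cond = {"V": {"I": 0.16}}  # P(I|V) as quoted on the slide
    print(pseudo_frequency("I", observed, cond))
```

In this one-observation case the pseudo frequency of I equals P(I|V), so I receives a non-zero estimate even though it was never observed.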
Weight on pseudo count. Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV. Pseudo counts are important when only limited data is available. With large data sets only “true” observations should count. $p_a = \frac{\alpha f_a + \beta g_a}{\alpha + \beta}$, where $\alpha$ is the effective number of sequences (N-1) and $\beta$ is the weight on the prior.
Weight on pseudo count, example. If $\alpha$ is large, p ≈ f and only the observed data define the motif. If $\alpha$ is small, p ≈ g and the pseudo counts (or prior) define the motif. $\beta$ is normally in the range [50-200]. Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
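The two limits described here can be checked with a small sketch. It assumes the standard mixing formula $p = (\alpha f + \beta g)/(\alpha + \beta)$, with $\alpha$ the effective number of sequences and $\beta$ the weight on the prior; the symbol names and the example value of g are assumptions, since the slide text does not preserve them.

```python
def combine(f, g, alpha, beta):
    """Mix observed frequency f and pseudo frequency g.

    alpha: effective number of sequences (weight on the observed data).
    beta:  weight on the pseudo counts (prior).
    """
    return (alpha * f + beta * g) / (alpha + beta)


if __name__ == "__main__":
    # I is never observed at P9 (f = 0); g = 0.05 is a hypothetical pseudo frequency.
    print(combine(f=0.0, g=0.05, alpha=9, beta=50))
    # Large alpha: the observed frequency dominates.
    print(combine(f=0.3, g=0.1, alpha=1e9, beta=50))
    # alpha = 0: only the prior counts.
    print(combine(f=0.3, g=0.1, alpha=0, beta=50))
```

With only 10 sequences and a prior weight in the quoted range, the estimate stays close to g, so an unobserved amino acid still gets a small non-zero probability.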
Sequence weighting and pseudo counts. RLLDDTPEV 84 nM, GLLGNVSTV 23 nM, ALAKAAAAL 309 nM. $P_{7P}$ and $P_{7S} > 0$. Peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Position specific weighting We know that positions 2 and 9 are anchor positions for most MHC binding motifs –Increase weight on high information positions Motif found on large data set
Weight matrices. Estimate amino acid frequencies from the alignment, including sequence weighting and pseudo counts. What do the numbers mean? –$P_2(V) > P_2(M)$. Does this mean that V enables binding more than M? –In nature not all amino acids are found equally often: $q_M = 0.025$, $q_V = $ … Finding 7% V is hence not significant, but 2% M highly significant. In nature V is found more often than M, so we must somehow rescale with the background. (Matrix columns: A R N D C Q E G H I L K M F P S T W Y V)
Weight matrices. A weight matrix is given as $W_{ij} = \log(p_{ij}/q_j)$, where i is a position in the motif and j an amino acid; $q_j$ is the background frequency for amino acid j. W is an L x 20 matrix, where L is the motif length. (Matrix columns: A R N D C Q E G H I L K M F P S T W Y V)
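The log-odds definition can be sketched directly. The natural logarithm is used because the slide does not state a base, and all numbers in the example are hypothetical except $q_M = 0.025$, which is quoted earlier in the deck; the data layout (one dictionary per position) is an illustrative choice.

```python
import math


def weight_matrix(p, q):
    """Log-odds weight matrix W[i][j] = log(p[i][j] / q[j]).

    p: list with one dict per motif position, mapping amino acid -> probability
       (assumed > 0, e.g. after adding pseudo counts).
    q: dict mapping amino acid -> background frequency.
    """
    return [{aa: math.log(p_i[aa] / q[aa]) for aa in p_i} for p_i in p]


if __name__ == "__main__":
    # One hypothetical position with two amino acids; q_V is a made-up value.
    p = [{"M": 0.05, "V": 0.07}]
    q = {"M": 0.025, "V": 0.07}
    print(weight_matrix(p, q))
```

In this toy case V is more frequent than M in the motif, yet only M gets a positive weight, because V occurs no more often than its background frequency.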
Scoring a sequence to a weight matrix. Score a sequence against the weight matrix by looking up and adding L values from the matrix (matrix columns: A R N D C Q E G H I L K M F P S T W Y V). Which peptide is most likely to bind? Which peptide second? RLLDDTPEV 84 nM, GLLGNVSTV 23 nM, ALAKAAAAL 309 nM
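Scoring is a lookup-and-add over the L positions of the peptide. The sketch below uses a hypothetical 3-position weight matrix with made-up values purely to show the mechanics; it is not the HLA-A0201 matrix from the slide, and the function name `score` is illustrative.

```python
def score(peptide, W):
    """Sum the weight-matrix entries for the residue found at each position."""
    if len(peptide) != len(W):
        raise ValueError("peptide length must equal the motif length")
    return sum(W[i][aa] for i, aa in enumerate(peptide))


# Hypothetical log-odds values for a 3-residue motif.
TOY_W = [
    {"A": 0.5, "G": 1.0},
    {"L": 2.0, "M": 1.5},
    {"V": 1.0, "L": 0.2},
]

if __name__ == "__main__":
    # Rank candidate peptides from highest to lowest score.
    candidates = ["GLV", "AML", "ALL"]
    for pep in sorted(candidates, key=lambda s: score(s, TOY_W), reverse=True):
        print(pep, score(pep, TOY_W))
```

Ranking peptides by this sum is how the question on the slide would be answered: the peptide with the highest total is predicted to be the most likely binder.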