Biological Motif Discovery Concepts Motif Modeling and Motif Information EM and Gibbs Sampling Comparative Motif Prediction Applications Transcription.

Slides:



Advertisements
Similar presentations
Gene Regulation and Microarrays. Finding Regulatory Motifs Given a collection of genes with common expression, Find the TF-binding motif in common......
Advertisements

GS 540 week 5. What discussion topics would you like? Past topics: General programming tips C/C++ tips and standard library BLAST Frequentist vs. Bayesian.
Periodic clusters. Non periodic clusters That was only the beginning…
Hidden Markov Model in Biological Sequence Analysis – Part 2
PREDetector : Prokaryotic Regulatory Element Detector Samuel Hiard 1, Sébastien Rigali 2, Séverine Colson 2, Raphaël Marée 1 and Louis Wehenkel 1 1 Bioinformatics.
Statistics in Bioinformatics May 2, 2002 Quiz-15 min Learning objectives-Understand equally likely outcomes, Counting techniques (Example, genetic code,
Bioinformatics Motif Detection Revised 27/10/06. Overview Introduction Multiple Alignments Multiple alignment based on HMM Motif Finding –Motif representation.
Regulatory Motifs. Contents Biology of regulatory motifs Experimental discovery Computational discovery PSSM MEME.
Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.
CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays.
Motif Finding. Regulation of Genes Gene Regulatory Element RNA polymerase (Protein) Transcription Factor (Protein) DNA.
CS262 Lecture 17, Win07, Batzoglou Gene Regulation and Microarrays.
Transcription factor binding motifs (part I) 10/17/07.
DNA Regulatory Binding Motif Search Dong Xu Computer Science Department 109 Engineering Building West
A Very Basic Gibbs Sampler for Motif Detection Frances Tong July 28, 2004 Southern California Bioinformatics Summer Institute.
MOPAC: Motif-finding by Preprocessing and Agglomerative Clustering from Microarrays Thomas R. Ioerger 1 Ganesh Rajagopalan 1 Debby Siegele 2 1 Department.
CS262 Lecture 18, Win07, Batzoglou Sequence Logos Height of each letter proportional to its frequency Height of all letters proportional to information.
RNA Secondary Structure aagacuucggaucuggcgacaccc uacacuucggaugacaccaaagug aggucuucggcacgggcaccauuc ccaacuucggauuuugcuaccaua aagccuucggagcgggcguaacuc.
Biological Sequence Pattern Analysis Liangjiang (LJ) Wang March 8, 2005 PLPTH 890 Introduction to Genomic Bioinformatics Lecture 16.
In silico cis-analysis promoter analysis - Promoters and cis-elements - Searching for patterns - Searching redundant patterns.
(Regulatory-) Motif Finding. Clustering of Genes Find binding sites responsible for common expression patterns.
Cis-regultory module 10/24/07. TFs often work synergistically (Harbison 2004)
Statistics in Bioinformatics May 12, 2005 Quiz 3-on May 12 Learning objectives-Understand equally likely outcomes, counting techniques (Example, genetic.
Motif finding : Lecture 2 CS 498 CXZ. Recap Problem 1: Given a motif, finding its instances Problem 2: Finding motif ab initio. –Paradigm: look for over-represented.
Motif finding: Lecture 1 CS 498 CXZ. From DNA to Protein: In words 1.DNA = nucleotide sequence Alphabet size = 4 (A,C,G,T) 2.DNA  mRNA (single stranded)
Guiding Motif Discovery by Iterative Pattern Refinement Zhiping Wang, Mehmet Dalkilic, Sun Kim School of Informatics, Indiana University.
CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery.
Detecting binding sites for transcription factors by correlating sequence data with expression. Erik Aurell Adam Ameur Jakub Orzechowski Westholm in collaboration.
Scoring Matrices Scoring matrices, PSSMs, and HMMs BIO520 BioinformaticsJim Lund Reading: Ch 6.1.
Expectation Maximization and Gibbs Sampling – Algorithms for Computational Biology Lecture 1- Introduction Lecture 2- Hashing and BLAST Lecture 3-
발표자 석사 2 년 김태형 Vol. 11, Issue 3, , March 2001 Comparative DNA Sequence Analysis of Mouse and Human Protocadherin Gene Clusters 인간과 마우스의 PCDH 유전자.
* only 17% of SNPs implicated in freshwater adaptation map to coding sequences Many, many mapping studies find prevalent noncoding QTLs.
Sequence analysis – an overview A.Krishnamachari
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
High-resolution computational models of genome binding events Yuan (Alan) Qi Joint work with Gifford and Young labs Dana-Farber Cancer Institute Jan 2007.
Sampling Approaches to Pattern Extraction
Motif discovery Tutorial 5. Motif discovery MEME Creates motif PSSM de-novo (unknown motif) MAST Searches for a PSSM in a DB TOMTOM Searches for a PSSM.
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Gibbs Sampler in Local Multiple Alignment Review by 온 정 헌.
Motifs BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
Motif discovery and Protein Databases Tutorial 5.
From Genomes to Genes Rui Alves.
MEME homework: probability of finding GAGTCA at a given position in the yeast genome, based on a background model of A = 0.3, T = 0.3, G = 0.2, C = 0.2.
Cis-regulatory Modules and Module Discovery
Bayesian Machine learning and its application Alan Qi Feb. 23, 2009.
Biological Motif Discovery Concepts Motif Modeling and Motif Information EM and Gibbs Sampling Comparative Motif Prediction Applications Transcription.
1 Motifs for Unknown Sites Vasileios Hatzivassiloglou University of Texas at Dallas.
Local Multiple Sequence Alignment Sequence Motifs
CS 6243 Machine Learning Advanced topic: pattern recognition (DNA motif finding)
Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven
. Finding Motifs in Promoter Regions Libi Hertzberg Or Zuk.
Special Topics in Genomics Motif Analysis. Sequence motif – a pattern of nucleotide or amino acid sequences GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA.
Transcription factor binding motifs (part II) 10/22/07.
More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.
1 Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model Authors Mayetri Gupta & Jun S. Liu Presented by Ellen Bishop 12/09/2003.
Hidden Markov Models BMI/CS 576
Regulation of Gene Expression
A Very Basic Gibbs Sampler for Motif Detection
Gibbs sampling.
Motifs BCH364C/394P - Systems Biology / Bioinformatics
Learning Sequence Motif Models Using Expectation Maximization (EM)
Algorithms for Regulatory Motif Discovery
Ab initio gene prediction
Recitation 7 2/4/09 PSSMs+Gene finding
(Regulatory-) Motif Finding
Finding regulatory modules
Motif finding in groups of related sequences
Motifs BCH339N Systems Biology / Bioinformatics – Spring 2016
Presentation transcript:

Biological Motif Discovery Concepts Motif Modeling and Motif Information EM and Gibbs Sampling Comparative Motif Prediction Applications Transcription Factor Binding Site Prediction Epitope Prediction Lab Practical DNA Motif Discovery with MEME and AlignAce Co-regulated genes from TB Boshoff data set

Regulatory Motifs Find promoter motifs associated with co-regulated or functionally related genes

Transcription Factor Binding Sites Very Small Highly Variable ~Constant Size Often repeated Low-complexity-ish Slide Credit: S. Batzoglou

Other “Motifs” Splicing Signals –Splice junctions –Exonic Splicing Enhancers (ESE) –Exonic Splicing Surpressors (ESS) Protein Domains –Glycosylation sites –Kinase targets –Targetting signals Protein Epitopes –MHC binding specificities

Modeling Motifs –How to computationally represent motifs Visualizing Motifs –Motif “Information” Predicting Motif Instances –Using the model to classify new sequences Learning Motif Structure –Finding new motifs, assessing their quality Essential Tasks

Modeling Motifs

Consensus Sequences Useful for publication IUPAC symbols for degenerate sites Not very amenable to computation Nature Biotechnology 24, (2006)

Probabilistic Model ACGTACGT M1M1 MKMK M1M1 P k (S|M) Position Frequency Matrix (PFM) 1 K Count frequencies Add pseudocounts

Scoring A Sequence To score a sequence, we compare to a null model Background DNA (B) ACGTACGT Log likelihood ratio ACGTACGT Position Weight Matrix (PWM) PFM

Scoring a Sequence MacIsaac & Fraenkel (2006) PLoS Comp Bio Common threshold = 60% of maximum score

Visualizing Motifs – Motif Logos Represent both base frequency and conservation at each position Height of letter proportional to frequency of base at that position Height of stack proportional to conservation at that position

Motif Information The height of a stack is often called the motif information at that position measured in bits Information Why is this a measure of information?

Uncertainty and probability “The sun will rise tomorrow” “The sun will not rise tomorrow” Uncertainty is inversely related to probability of event Not surprising (p~1) Very surprising (p<<1) Uncertainty is related to our surprise at an event

Average Uncertainty A “The sun will rise tomorrow” B “The sun will not rise tomorrow” P(A)=p 1 P(B)=p 2 Two possible outcomes for sun rising = Entropy What is our average uncertainty about the sun rising

Entropy Entropy measures average uncertainty Entropy measures randomness If log is base 2, then the units are called bits

Entropy versus randomness Entropy is maximum at maximum randomness P(heads) Example: Coin Toss Entropy P(heads)=0.1 Not very random H(X)=0.47 bits P(heads)=0.5 Completely random H(X)=1 bits

Entropy Examples P(x) P(x) ATGC ATGC

Information Content Information is a decrease in uncertainty Once I tell you the sun will rise, your uncertainty about the event decreases H before (X)H after (X)-Information = Information is difference in entropy after receiving information

Motif Information 2- Motif Position Information = H background (X)H motif_i (X) Prior uncertainty about nucleotide Uncertainty after learning it is position i in a motif H(X)=2 bits ATGC P(x) H(X)=0.63 bits ATGC P(x) Uncertainty at this position has been reduced by 0.37 bits

Motif Logo Conserved Residue Reduction of uncertainty of 2 bits Little Conservation Minimal reduction of uncertainty

Background DNA Frequency 2- Motif Position Information = H motif_i (X) H(X)=1.9 bits ATGC P(x) Some motifs could have negative information! = -0.2 bits The definition of information assumes a uniform background DNA nucleotide frequency What if the background frequency is not uniform? H background (X) H(X)=1.7 bits ATGC P(x) (e.g. Plasmodium) 1.7

A Different Measure Relative entropy or Kullback-Leibler (KL) divergence Divergence between a “true” distribution and another “True” DistributionOther Distribution D KL is larger the more different P motif is from P background Same as Information if P background is uniform Properties

Comparing Both Methods Information assuming uniform background DNA KL Distance assuming 20% GC content (e.g. Plasmodium)

Online Logo Generation

Finding New Motifs Learning Motif Models

A Promoter Model Length K Motif The same motif model in all promoters ACGTACGT M1M1 MKMK M1M1 Background DNA P(S|B) P k (S|M)

Probability of a Sequence Given a sequence(s), motif model and motif location S i = nucleotide at position i in the sequence A T A T G C ACGTACGT M1M1 MKMK M1M1

Parameterizing the Motif Model Given multiple sequences and motif locations but no motif model ACGTACGT M1M1 M6M6 M1M1 3/4 ETC… AATGCG ATATGG ATATCG GATGCA Count Frequencies Add pseudocounts

Finding Known Motifs Given multiple sequences and motif model but no motif locations x x x x P(Seq window |Motif) window Calculate P(Seq window |Motif) for every starting location Choose best starting location in each sequence

Discovering Motifs Given a set of co-regulated genes, we need to discover with only sequences We have neither a motif model nor motif locations Need to discover both How can we approach this problem? (Hint: start with a random motif model)

Expectation Maximization (EM) Remember the basic idea! 1.Use model to estimate (distribution of) missing data 2.Use estimate to update model 3.Repeat until convergence Model is the motif model Missing data are the motif locations

EM for Motif Discovery 1. Start with random motif model 2. E Step: estimate probability of motif positions for each sequence 3. M Step: use estimate to update motif model 4. Iterate (to convergence) ACGTACGT ACGTACGT ETC… At each iteration, P(Sequences|Model) guaranteed to increase

MEME MEME - implements EM for motif discovery in DNA and proteins MAST – search sequences for motifs given a model

P(Seq|Model) Landscape P(Sequences|params1,params2) EM searches for parameters to increase P(seqs|parameters) Useful to think of P(seqs|parameters) as a function of parameters Parameter1 Parameter2 EM starts at an initial set of parameters And then “climbs uphill” until it reaches a local maximum Where EM starts can make a big difference

Search from Many Different Starts P(Sequences|params1,params2) To minimize the effects of local maxima, you should search multiple times from different starting points Parameter1 Parameter2 MEME uses this idea Start at many points Run for one iteration Choose starting point that got the “highest” and continue

Gibbs Sampling A stochastic version of EM that differs from deterministic EM in two key ways 1. At each iteration, we only update the motif position of a single sequence 2. We may update a motif position to a “suboptimal” new position

1. Start with random motif locations and calculate a motif model 2. Randomly select a sequence, remove its motif and recalculate tempory model 3. With temporary model, calculate probability of motif at each position on sequence 4. Select new position based on this distribution 5. Update model and Iterate Gibbs Sampling ACGTACGT ACGTACGT ETC… “Best” Location New Location

Gibbs Sampling and Climbing P(Sequences|params1,params2) Because gibbs sampling does always choose the best new location it can move to another place not directly uphill Parameter1 Parameter2 In theory, Gibbs Sampling less likely to get stuck a local maxima

AlignACE Implements Gibbs sampling for motif discovery –Several enhancements ScanAce – look for motifs in a sequence given a model CompareAce – calculate “similarity” between two motifs (i.e. for clustering motifs)

Assessing Motif Quality Scan the genome for all instances and associate with nearest genes Category Enrichment – look for association between motif and sets of genes. Score using Hypergeometric distribution –Functional Category –Gene Families –Protein Complexes Group Specificity – how restricted are motif instances to the promoter sequences used to find the motif? Positional Bias – do motif instances cluster at a certain distance from ATG? Orientation Bias – do motifs appear preferentially upstream or downstream of genes?

Comparative Motif Prediction

Kellis et al. (2003) Nature

Scer TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACA Spar CTATGTTGATCTTTTCAGAATTTTT-CACTATATTAAGATGGGTGCAAAGAAGTGTGATTATTATATTACATCGCTTTCCTATCATACACA Smik GTATATTGAATTTTTCAGTTTTTTTTCACTATCTTCAAGGTTATGTAAAAAA-TGTCAAGATAATATTACATTTCGTTACTATCATACACA Sbay TTTTTTTGATTTCTTTAGTTTTCTTTCTTTAACTTCAAAATTATAAAAGAAAGTGTAGTCACATCATGCTATCT-GTCACTATCACATATA * * **** * * * ** ** * * ** ** ** * * * ** ** * * * ** * * * Scer TATCCATATCTAATCTTACTTATATGTTGT-GGAAAT-GTAAAGAGCCCCATTATCTTAGCCTAAAAAAACC--TTCTCTTTGGAACTTTCAGTAATACG Spar TATCCATATCTAGTCTTACTTATATGTTGT-GAGAGT-GTTGATAACCCCAGTATCTTAACCCAAGAAAGCC--TT-TCTATGAAACTTGAACTG-TACG Smik TACCGATGTCTAGTCTTACTTATATGTTAC-GGGAATTGTTGGTAATCCCAGTCTCCCAGATCAAAAAAGGT--CTTTCTATGGAGCTTTG-CTA-TATG Sbay TAGATATTTCTGATCTTTCTTATATATTATAGAGAGATGCCAATAAACGTGCTACCTCGAACAAAAGAAGGGGATTTTCTGTAGGGCTTTCCCTATTTTG ** ** *** **** ******* ** * * * * * * * ** ** * *** * *** * * * Scer CTTAACTGCTCATTGC-----TATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCT Spar CTAAACTGCTCATTGC-----AATATTGAAGTACGGATCAGAAGCCGCCGAGCGGACGACAGCCCTCCGACGGAATATTCCCCTCCGTGCGTCGCCGTCT Smik TTTAGCTGTTCAAG ATATTGAAATACGGATGAGAAGCCGCCGAACGGACGACAATTCCCCGACGGAACATTCTCCTCCGCGCGGCGTCCTCT Sbay TCTTATTGTCCATTACTTCGCAATGTTGAAATACGGATCAGAAGCTGCCGACCGGATGACAGTACTCCGGCGGAAAACTGTCCTCCGTGCGAAGTCGTCT ** ** ** ***** ******* ****** ***** *** **** * *** ***** * * ****** *** * *** Scer TCACCGG-TCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAA-----TACTAGCTTTT--ATGGTTATGAA Spar TCGTCGGGTTGTGTCCCTTAA-CATCGATGTACCTCGCGCCGCCCTGCTCCGAACAATAAGGATTCTACAAGAAA-TACTTGTTTTTTTATGGTTATGAC Smik ACGTTGG-TCGCGTCCCTGAA-CATAGGTACGGCTCGCACCACCGTGGTCCGAACTATAATACTGGCATAAAGAGGTACTAATTTCT--ACGGTGATGCC Sbay GTG-CGGATCACGTCCCTGAT-TACTGAAGCGTCTCGCCCCGCCATACCCCGAACAATGCAAATGCAAGAACAAA-TGCCTGTAGTG--GCAGTTATGGT ** * ** *** * * ***** ** * * ****** ** * * ** * * ** *** Scer GAGGA-AAAATTGGCAGTAA----CCTGGCCCCACAAACCTT-CAAATTAACGAATCAAATTAACAACCATA-GGATGATAATGCGA------TTAG--T Spar AGGAACAAAATAAGCAGCCC----ACTGACCCCATATACCTTTCAAACTATTGAATCAAATTGGCCAGCATA-TGGTAATAGTACAG------TTAG--G Smik CAACGCAAAATAAACAGTCC----CCCGGCCCCACATACCTT-CAAATCGATGCGTAAAACTGGCTAGCATA-GAATTTTGGTAGCAA-AATATTAG--G Sbay GAACGTGAAATGACAATTCCTTGCCCCT-CCCCAATATACTTTGTTCCGTGTACAGCACACTGGATAGAACAATGATGGGGTTGCGGTCAAGCCTACTCG **** * * ***** *** * * * * * * * * ** Scer TTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCG--ATGATTTTT-GATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCAC-----TT Spar GTTTT--TCTTATTCCTGAGACAATTCATCCGCAAAAAATAATGGTTTTT-GGTCTATTAGCAAACATATAAATGCAAAAGTTGCATAGCCAC-----TT Smik TTCTCA--CCTTTCTCTGTGATAATTCATCACCGAAATG--ATGGTTTA--GGACTATTAGCAAACATATAAATGCAAAAGTCGCAGAGATCA-----AT Sbay TTTTCCGTTTTACTTCTGTAGTGGCTCAT--GCAGAAAGTAATGGTTTTCTGTTCCTTTTGCAAACATATAAATATGAAAGTAAGATCGCCTCAATTGTA * * * *** * ** * * *** *** * * ** ** * ******** **** * Scer TAACTAATACTTTCAACATTTTCAGT--TTGTATTACTT-CTTATTCAAAT----GTCATAAAAGTATCAACA-AAAAATTGTTAATATACCTCTATACT Spar TAAATAC-ATTTGCTCCTCCAAGATT--TTTAATTTCGT-TTTGTTTTATT----GTCATGGAAATATTAACA-ACAAGTAGTTAATATACATCTATACT Smik TCATTCC-ATTCGAACCTTTGAGACTAATTATATTTAGTACTAGTTTTCTTTGGAGTTATAGAAATACCAAAA-AAAAATAGTCAGTATCTATACATACA Sbay TAGTTTTTCTTTATTCCGTTTGTACTTCTTAGATTTGTTATTTCCGGTTTTACTTTGTCTCCAATTATCAAAACATCAATAACAAGTATTCAACATTTGT * * * * * * ** *** * * * * ** ** ** * * * * * *** * Scer TTAA-CGTCAAGGA---GAAAAAACTATA Spar TTAT-CGTCAAGGAAA-GAACAAACTATA Smik TCGTTCATCAAGAA----AAAAAACTA.. Sbay TTATCCCAAAAAAACAACAACAACATATA * * ** * ** ** ** GAL10 GAL1 TBP GAL4 MIG1 TBP MIG1 Factor footprint Conservation island slide credits: M. Kellis Conservation of Motifs

Spar Smik Sbay Scer Evaluate conservation within: Gal4Controls 4:11:3(2) Intergenic : coding 12:01:1(3) Upstream : downstream A signature for regulatory motifs 22%5%(1) All intergenic regions Genome-wide conservation

1.Enumerate all “mini-motifs” 2.Apply three tests 1.Look for motifs conserved in intergenic regions 2.Look for motifs more conserved intergenically than in genes 3.Look for motifs preferentially conserved upstream or downstream of genes CTACGA N slide credits: M. Kellis Finding Motifs in Yeast Genomes M. Kellis PhD Thesis

Test 1: Intergenic conservation Total count Conserved count CGG-11-CCG slide credits: M. Kellis

Test 2: Intergenic vs. Coding Coding Conservation Intergenic Conservation CGG-11-CCG Higher Conservation in Genes slide credits: M. Kellis

Test 3: Upstream vs. Downstream CGG-11-CCG Most Patterns Downstream Conservation Upstream Conservation slide credits: M. Kellis

Extend Collapse Full Motifs Constructing full motifs 2,000 Mini-motifs 72 Full motifs 6 CTA CGA R R CTGRC CGAA ACCTGCGAACTGRCCGAACTRAY CGAA Y 5 Extend Collapse Merge Test 1Test 2Test 3 slide credits: M. Kellis

RankDiscovered Motif Known TF motif Tissue Enrichment Distance bias 1RCGCAnGCGYNRF-1Yes 2CACGTGMYCYes 3SCGGAAGYELK-1Yes 4ACTAYRnnnCCCRYes 5GATTGGYNF-YYes 6GGGCGGRSP1Yes 7TGAnTCAAP-1Yes 8TMTCGCGAnRYes 9TGAYRTCAATF3Yes 10GCCATnTTGYY1Yes 11MGGAAGTGGABPYes 12CAGGTGE12Yes 13CTTTGTLEF1Yes 14TGACGTCAATF3Yes 15CAGCTGAP-4Yes 16RYTTCCTGC-ETS-2Yes 17AACTTTIRF1(*)Yes 18TCAnnTGAYSREBP- 1 Yes 19GKCGCn(7)TGAY G Yes 174 promoter motifs 70 match known TF motifs 60 show positional bias  75% have evidence Control sequences < 2% match known TF motifs < 3% show positional bias  < 7% false positives slide credits: M. Kellis Results

Antigen Epitope Prediction

Genome to “Immunome” Looking for a needle… –Only a small number of epitopes are typically antigenic …in a very big haystack –Vaccinia virus (258 ORFs): 175,716 potential epitopes (8-, 9-, and 10-mers) –M. tuberculosis (~4K genes): 433,206 potential epitopes –A. nidulans (~9K genes): 1,579,000 potential epitopes Can computational approaches predict all antigenic epitopes from a genome? Pathogen genome sequences provide define all proteins that could illicit an immune response

Antigen Processing & Presentation

Modeling MHC Epitopes Have a set of peptides that have been associate with a particular MHC allele Want to discover motif within the peptide bound by MHC allele Use motif to predict other potential epitopes

Motifs Bound by MHCs MHC 1 –Closed ends of grove –Peptides 8-10 AAs in length –Motif is the peptide MHC 2 –Grove has open ends –Peptides have broad length distribution: AAs –Need to find binding motif within peptides

MHC 2 Motif Discovery Nielsen et al (2004) Bioinf Use Gibbs Sampling! 462 peptides known to bind to MHC II HLA-DR4(B1*0401) 9-30 residues in length Goal: identify a common length 9 binding motif

Vaccinia Epitope Prediction Mutaftsi et al (2006) Nat. Biotech. Predict MHC1 binding peptides Using 4 matrices for H-2 Kb and Db Top 1% predictions experimentally validated 49 validated epitopes accounting for 95% of immune response