Motifs, logos, and Profile HMM’s

Motifs, logos, and Profile HMM’s
Ivan G. Costa Centro de Informática Universidade Federal de Pernambuco Basead em slides de Morten Nielsen e Neils Jones

Overview Visualization of binding motifs
Construction of sequence logos Understand the concepts of weight matrix construction One of the most important methods of bioinformatics Hidden Markov Profiles as they are just weight matrices with gaps

Controle da Regulação Gênica Eucarióticos

Fatores de Transcrição
Motifs indicam a preferência de ligação Não são idênticos

Binding Motif. MHC class I with peptide
Anchor positions

Sequence information SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAV LLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTL HLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTI ILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSL LERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGV PLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGV ILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQM KLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSV KTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKV SLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYV ILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLV TGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAA GAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLA KARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIV AVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVV GLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLV VLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQC ISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGA YTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYI NMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTV VVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQ GLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYL EAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAV YLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRL FLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF SLLQHLIGL ELTLGEFLK MINAYLDKL AAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYI AAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV

Sequence Information Say that a peptide must have L at P2 in order to bind, and that A,F,W,and Y are found at P1. Which position has most information? How many questions do I need to ask to tell if a peptide binds looking at only P1 or P2? P1: 4 questions (at most) P2: 1 question (L or not) P2 has the most information Calculate pa at each position Entropy Information content Conserved positions PV=1, P!v=0 => S=0, I=log(20) Mutable positions Paa=1/20 => S=log(20), I=0

Information content We look at the frequencies of aminoacids at each position A R N D C Q E G H I L K M F P S T W Y V S I

Sequence logos Height of a column equal to I
Relative height of a letter is p Highly useful tool to visualize sequence motifs HLA-A0201 High information positions

Simple motifs Yes/No rules
10 MHC restricted peptides ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV Only 11 of 212 peptides identified! Need more flexible rules If not fit P1 but fit P2 then ok Not all positions are equally important We know that P2 and P9 determines binding more than other positions Cannot discriminate between good and very good binders

Extended motifs Fitness of the amino acid at each position given by P(amino acid) Example P1 PA = 6/10 PG = 2/10 PT = PK = 1/10 PC = PD = …PV = 0 Problems Few data ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV RLLDDTPEV 84 nM GLLGNVSTV 23 nM ALAKAAAAL 309 nM

Pseudo counts ALAKAAAAM
ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV I is not found at position P9. Does this mean that I is forbidden (P(I)=0)? No! Use Blosum substitution matrix to estimate pseudo frequency of I at P9

Weight on pseudo count ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV Pseudo counts are important when only limited data is available With large data sets only “true” observation should count  is the effective number of sequences (N-1),  is the weight on prior

Weight matrices Estimate amino acid frequencies from alignment including sequence weighting and pseudo count What do the numbers mean? P2(V)>P2(M). Does this mean that V enables binding more than M. In nature not all amino acids are found equally often In nature V is found more often than M, so we must somehow rescale with the background qM = 0.025, qV = 0.073 Finding 7% V is hence not significant, but 7% M highly significant A R N D C Q E G H I L K M F P S T W Y V

Weight matrices A weight matrix is given as Wij = log(pij/qj)
where i is a position in the motif, and j an amino acid. qj is the background frequency for amino acid j. W is a L x 20 matrix, L is motif length A R N D C Q E G H I L K M F P S T W Y V

Scoring a sequence to a weight matrix
Score sequences to weight matrix by looking up and adding L values from the matrix A R N D C Q E G H I L K M F P S T W Y V Which peptide is most likely to bind? Which peptide second? RLLDDTPEV GLLGNVSTV ALAKAAAAL 11.9 14.7 4.3 84nM 23nM 309nM

HMMs and weight Matrices
Score sequences to weight matrix by looking up and adding L values from the matrix A R N D C Q E G H I L K M F P S T W Y V M1 M2 M3 M9 1 1 1 1 …

HMMs and weight Matrices
Score sequences to weight matrix by looking up and adding L values from the matrix A R N D C Q E G H I L K M F P S T W Y V Emission is a row in the weight matrix M1 M2 M3 M9 1 1 1 1 …

Example from real life 10 peptides from MHCpep database
Bind to the MHC complex Relevant for immune system recognition Estimate sequence motif and weight matrix Evaluate motif “correctness” on 528 peptides ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Prediction accuracy Measured affinity Prediction score
Pearson correlation 0.45 Measured affinity Prediction score

Predictive performance

Hidden Markov Models Weight matrices do not deal with insertions and deletions In alignments, this is done in an ad-hoc manner by optimization of the two gap penalties for first gap and gap extension HMM is a natural frame work where insertions/deletions are dealt with explicitly

HMM (a simple example) ACA---ATG TCAACTATC ACAC--AGC AGA---ATC
Example from A. Krogh Core region defines the number of states in the HMM (red) Insertion and deletion statistics are derived from the non-core part of the alignment (black) ACA---ATG TCAACTATC ACAC--AGC AGA---ATC ACCG--ATC Core of alignment

HMM construction ACA---ATG TCAACTATC ACAC--AGC AGA---ATC ACCG--ATC Gap
.4 A C G T .2 .4 .2 .2 .6 .6 A C G T .8 A C G T A C G T .8 A C G T 1 A C G T A C G T 1. 1. .4 1. 1. .8 .2 .8 .2 .2 .2 .2 .8 ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2 = 3.3x10-2

HMM’s and weight matrices
In the case of un-gapped alignments HMM’s become simple weight matrices To achieve high performance, the emission frequencies are estimated using the techniques of Sequence weighting Pseudo counts

Profile HMM’s Alignments based on conventional scoring matrices (BLOSUM62) scores all positions in a sequence in an equal manner Some positions are highly conserved, some are highly variable (more than what is described in the BLOSUM matrix) Profile HMM’s are ideal suited to describe such position specific variations

Finding Distant Members of a Protein Family
A distant cousin of functionally related sequences in a protein family may have weak pairwise similarities with each member of the family and thus fail significance test. However, they may have weak similarities with many members of the family. The goal is to align a sequence to all members of the family at once. Family of related proteins can be represented by their multiple alignment and the corresponding profile.

What are Profile HMMs ? A Profile HMM is a probabilistic representation of a multiple alignment. A given multiple alignment (of a protein family) is used to build a profile HMM. This model then may be used to find and score less obvious potential matches of new protein sequences.

Sequences of a Protein Family
Sequence profiles Sequences of a Protein Family Conserved Non-conserved Insertion ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I -TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD--- -TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE---- TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDGMERNTAGVP Matching any thing but G => large negative score Any thing can match

Example. (SGNH active site)

HMM vs. alignment Detailed description of core
Conserved/variable positions Price for insertions/deletions varies at different locations in sequence These features cannot be captured in conventional alignments

Profile HMM’s X = L1- Y2A3V4R5- I6 P = P1D2P3P4I4P5D6P7 Deletions
Insertions X = L1- Y2A3V4R5- I6 P = P1D2P3P4I4P5D6P7

Building a profile HMM Multiple alignment is used to construct the HMM model. Assign each column to a Match state in HMM. Add Insertion and Deletion state. Estimate the emission probabilities according to amino acid counts in column. Different positions in the protein will have different emission probabilities. Estimate the transition probabilities between Match, Deletion and Insertion states The HMM model gets trained to derive the optimal parameters.

Paths in Edit Graph and Profile HMM
A path through an edit graph and the corresponding path through a profile HMM

Making a Collection of HMM for Protein Families
Use Blast to separate a protein database into families of related proteins Construct a multiple alignment for each protein family. Construct a profile HMM model and optimize the parameters of the model (transition and emission probabilities). Align the target sequence against each HMM to find the best fit between a target sequence and an HMM

Application of Profile HMM to Modeling Globin Proteins
Globins represent a large collection of protein sequences 400 globin sequences were randomly selected from all globins and used to construct a multiple alignment. Multiple alignment was used to assign an initial HMM This model then get trained repeatedly with model lengths chosen randomly between 145 to 170, to get an HMM model optimized probabilities.

How Good is the Globin HMM?
625 remaining globin sequences in the database were aligned to the constructed HMM resulting in a multiple alignment. This multiple alignment agrees extremely well with the structurally derived alignment. 25,044 proteins, were randomly chosen from the database and compared against the globin HMM. This experiment resulted in an excellent separation between globin and non-globin families.

Motifs, logos, and Profile HMM’s

Similar presentations

Presentation on theme: "Motifs, logos, and Profile HMM’s"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Motifs, logos, and Profile HMM’s

Similar presentations

Presentation on theme: "Motifs, logos, and Profile HMM’s"— Presentation transcript:

Similar presentations

About project

Feedback