Multiple sequence alignments and motif discovery
Tutorial 5
Multiple sequence alignment
– ClustalW
– Muscle
Motif discovery
– MEME
– JASPAR
Multiple Sequence Alignment
More than two sequences
– DNA
– Protein
Evolutionary relation
– Homology
Phylogenetic tree
– Detect motif

Unaligned sequences:
GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT
GCGAAGAGGCGAGC
GCCGTCGCGTCGTAAC

Aligned sequences:
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC

[Figure labels: A, D, B, C]
Multiple Sequence Alignment
Dynamic Programming
– Optimal alignment
– Exponential in the number of sequences
Progressive
– Efficient
– Heuristic
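To make "exponential in the number of sequences" concrete: the textbook dynamic-programming table for k sequences has one cell per combination of prefix lengths, and each cell looks back at up to 2^k - 1 neighbours. The sketch below only counts that work; the function names are illustrative and not taken from any library.

```python
def dp_cells(lengths):
    """Number of cells in the full DP table: product of (length + 1)."""
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells


def predecessors_per_cell(k):
    """Each cell can be reached from 2**k - 1 neighbours, because every
    non-empty subset of the k sequences may advance by one residue."""
    return 2 ** k - 1


# The four unaligned DNA sequences shown on the slide.
slide_seqs = [
    "GTCGTAGTCGGCTCGAC",
    "GTCTAGCGAGCGTGAT",
    "GCGAAGAGGCGAGC",
    "GCCGTCGCGTCGTAAC",
]
slide_lengths = [len(s) for s in slide_seqs]
print(dp_cells(slide_lengths), predecessors_per_cell(len(slide_seqs)))

# Ten sequences of length 100 already need about 10**20 cells.
print(dp_cells([100] * 10))
```

This is why progressive methods give up the optimality guarantee in exchange for running time.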
ClustalW
“CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice”, J. D. Thompson et al.
Pairwise alignment: calculate the distance matrix
Guide tree
Progressive alignment using the guide tree
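The first two stages can be sketched in a few lines. This is a simplified illustration, not ClustalW's implementation: the pairwise scores (match/mismatch/gap values) are arbitrary assumptions, the distance is taken as 1 minus the fraction of identical columns, and a plain average-linkage clustering stands in for the neighbour-joining tree that ClustalW builds. The sequence names `s1`..`s4` are hypothetical labels.

```python
from itertools import combinations


def nw_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the fraction of
    alignment columns in which both sequences carry the same residue."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path, counting columns and identities.
    i, j, same, cols = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            same += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        cols += 1
    return same / cols


def distance_matrix(seqs):
    """Symmetric matrix of 1 - identity for every pair of sequences."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = 1.0 - nw_identity(seqs[i], seqs[j])
    return dist


def guide_tree(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples."""
    clusters = {i: (names[i], [i]) for i in range(len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        def avg(x, y):
            mx, my = clusters[x][1], clusters[y][1]
            return sum(dist[p][q] for p in mx for q in my) / (len(mx) * len(my))
        x, y = min(combinations(sorted(clusters), 2), key=lambda pair: avg(*pair))
        merged = ((clusters[x][0], clusters[y][0]), clusters[x][1] + clusters[y][1])
        del clusters[x], clusters[y]
        clusters[next_id] = merged
        next_id += 1
    return next(iter(clusters.values()))[0]


names = ["s1", "s2", "s3", "s4"]
seqs = ["GTCGTAGTCGGCTCGAC", "GTCTAGCGAGCGTGAT",
        "GCGAAGAGGCGAGC", "GCCGTCGCGTCGTAAC"]
dist = distance_matrix(seqs)
tree = guide_tree(names, dist)
print(tree)
```

The nested tuples give the order in which the progressive stage would merge sequences and sub-alignments: the closest pair is aligned first, the most distant groups last.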
ClustalW
Progressive
– At each step align two existing alignments or sequences
– Gaps present in older alignments remain fixed

-TGTTAAC
-TGT-AAC
-TGT--AC
ATGT---C
ATGT-GGC
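One way to picture "gaps remain fixed": when an existing alignment is merged with something new, gaps are only ever added as whole columns, so every row keeps the gaps it already had. The sketch below assumes (the slide does not state this grouping) that the first three rows of the example formed the older alignment before a leading column was opened for the rows that start with A. `insert_gap_column` is an illustrative helper, not a ClustalW function.

```python
def insert_gap_column(alignment, position):
    """Insert an all-gap column at `position` into every row of an alignment.

    Existing gaps are never moved or removed; they only shift right when
    the new column is opened to their left.
    """
    return [row[:position] + "-" + row[position:] for row in alignment]


# Assumed older alignment (first three rows of the slide, minus column 0).
older = ["TGTTAAC", "TGT-AAC", "TGT--AC"]

# Open a leading column so that rows beginning with "A" can be added.
widened = insert_gap_column(older, 0)
for row in widened:
    print(row)
```

Dropping the inserted column from `widened` gives back `older` unchanged, which is the "once a gap, always a gap" behaviour of progressive alignment.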
ClustalW - Input
Input sequences
Gap scoring
Scoring matrix
Address
Output format
ClustalW - Output
Match strength in decreasing order: * : .
ClustalW - Output
Pairwise alignment scores
Building alignment
Final score
Building tree
ClustalW - Output
Sequence names
Sequence positions
Match strength in decreasing order: * : .
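The match-strength line can be reproduced with a short sketch. It assumes the usual Clustal convention: `*` for a fully conserved column, `:` when all residues fall within one "strong" similarity group, `.` for one "weak" group, and a blank otherwise (including columns with gaps). The group lists below are the ones given in the Clustal documentation as best I recall them, so treat them as an assumption. The five peptide rows are the ones shown on the motif slide later in this tutorial.

```python
# Residue groups (assumed, as listed in the Clustal W/X documentation).
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
        "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_line(rows):
    """Return one symbol per alignment column: '*', ':', '.' or ' '."""
    symbols = []
    for column in zip(*rows):
        residues = set(column)
        if "-" in residues:
            symbols.append(" ")            # gapped columns are left blank
        elif len(residues) == 1:
            symbols.append("*")            # identical in every row
        elif any(residues <= set(group) for group in STRONG):
            symbols.append(":")            # all within one strong group
        elif any(residues <= set(group) for group in WEAK):
            symbols.append(".")            # all within one weak group
        else:
            symbols.append(" ")
    return "".join(symbols)


rows = ["YDEEGGDAEE", "YGEEGADYED", "YDEEGADYEE", "YNDEGDDYEE", "YHDEGAADEE"]
for row in rows:
    print(row)
print(conservation_line(rows))
```

For these rows the computed line has `*` at the columns where every row agrees (Y, E, G, E) and `:` where only D and E alternate.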
ClustalW - Output
Branch length
Muscle
Muscle - output
What’s the difference between Muscle and ClustalW?
[Figure panels: ClustalW, Muscle]
Can we find motifs using multiple sequence alignment?
[Figure: position frequency matrix with rows A, D, E, G, H, N, Y]

YDEEGGDAEE....
YGEEGADYED....
YDEEGADYEE....
YNDEGDDYEE....
YHDEGAADEE..
* :**   *:

Motif: a widespread pattern with biological significance
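A frequency matrix like the one on this slide can be tabulated directly from the aligned rows. The sketch below uses only the five rows that are visible here, so its fractions are fifths and need not match the slide's figure, which appears to be built from a larger set. `frequency_matrix` is an illustrative name.

```python
from collections import Counter
from fractions import Fraction


def frequency_matrix(rows):
    """Map each residue to its per-column frequency across the rows."""
    n = len(rows)
    columns = [Counter(column) for column in zip(*rows)]
    residues = sorted(set("".join(rows)))
    return {r: [Fraction(col[r], n) for col in columns] for r in residues}


rows = ["YDEEGGDAEE", "YGEEGADYED", "YDEEGADYEE", "YNDEGDDYEE", "YHDEGAADEE"]
matrix = frequency_matrix(rows)
for residue, freqs in matrix.items():
    print(residue, " ".join(str(f) for f in freqs))
```

Columns where a single residue has frequency 1 correspond to the `*` positions of the alignment.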
Can we find motifs using multiple sequence alignment?
YES!
NO
MEME – Multiple EM* for Motif Elicitation
Motif discovery from unaligned sequences
– Genomic or protein sequences
Flexible model of motif presence (a motif can be absent in some sequences or appear several times in one sequence)
*Expectation-maximization
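The expectation-maximization idea can be illustrated with a toy version. This is not MEME's algorithm: it assumes the simplest case of exactly one motif occurrence per sequence (the flexible presence model described above is not covered), keeps the background frequencies fixed, and starts from a single randomly chosen window. All names and parameter values (`em_motif`, `pseudo`, the 0.7/0.1 starting weights) are assumptions made for the illustration.

```python
import random

ALPHABET = "ACGT"


def em_motif(seqs, width, iterations=50, seed=0, pseudo=0.1):
    """Toy EM for one motif of fixed width, one occurrence per sequence.

    Returns (pwm, posteriors): pwm is a list of per-position base
    probabilities, posteriors[s][i] is the probability that the motif
    in sequence s starts at position i.
    """
    rng = random.Random(seed)
    total = sum(len(s) for s in seqs)
    background = {a: sum(s.count(a) for s in seqs) / total for a in ALPHABET}

    # Crude initialisation from one randomly chosen window.
    s0 = rng.choice(seqs)
    start = rng.randrange(len(s0) - width + 1)
    pwm = [{a: (0.7 if a == s0[start + k] else 0.1) for a in ALPHABET}
           for k in range(width)]

    posteriors = []
    for _ in range(iterations):
        # E-step: posterior over start positions in each sequence,
        # proportional to the motif/background likelihood ratio.
        posteriors = []
        for s in seqs:
            weights = []
            for i in range(len(s) - width + 1):
                ratio = 1.0
                for k in range(width):
                    ratio *= pwm[k][s[i + k]] / background[s[i + k]]
                weights.append(ratio)
            z = sum(weights)
            posteriors.append([w / z for w in weights])

        # M-step: re-estimate the motif from expected counts.
        counts = [{a: pseudo for a in ALPHABET} for _ in range(width)]
        for s, post in zip(seqs, posteriors):
            for i, p in enumerate(post):
                for k in range(width):
                    counts[k][s[i + k]] += p
        pwm = [{a: c[a] / sum(c.values()) for a in ALPHABET} for c in counts]
    return pwm, posteriors


seqs = ["GTCGTAGTCGGCTCGAC", "GTCTAGCGAGCGTGAT",
        "GCGAAGAGGCGAGC", "GCCGTCGCGTCGTAAC"]
pwm, posteriors = em_motif(seqs, width=4)
for s, post in zip(seqs, posteriors):
    best = max(range(len(post)), key=post.__getitem__)
    print(s, best, s[best:best + 4])
```

Note that the input sequences are unaligned: EM weighs every window of every sequence, which is what lets this style of method find motifs that a global multiple alignment would not line up.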
MEME - Input
Address
Input file (FASTA file)
How many times in each sequence?
How many motifs?
How many sites?
Range of motif lengths
MEME - Output
Motif length
Number of times
Motif score
MEME - Output
Low uncertainty = High information content
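The relation between uncertainty and information content can be written down per motif column: information is the maximum possible entropy minus the column's Shannon entropy, in bits. The sketch below assumes a DNA alphabet of size 4 and omits the small-sample correction that logo programs usually apply; `information_content` is an illustrative name.

```python
import math


def information_content(column, alphabet_size=4):
    """Bits of information in one motif column.

    `column` maps each letter to its probability. A column with no
    uncertainty scores log2(alphabet_size); a uniform column scores 0.
    """
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(alphabet_size) - entropy


print(information_content({"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}))
print(information_content({"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}))
print(information_content({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}))
```

In a sequence logo the height of each stack is this quantity, which is why fully conserved positions appear tallest.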
MEME - Output
Multilevel consensus
MEME - Output
Sequence names
Position in sequence
Strength of match
Motif within sequence
MEME - Output
Sequence names
Overall strength of motif matches
Motif location in the input sequence
MAST
Searches for motifs (one or more) in sequence databases:
– Like BLAST, but with motifs as input
– Similar to iterations of PSI-BLAST
Profile defines strength of match
– Multiple motif matches per sequence
– Combined E-value for all motifs
MEME uses MAST to summarize results:
– Each MEME result is accompanied by the MAST result of searching the given sequences for the discovered motifs.
MAST - Input
Address
Input file (motifs)
Database
JASPAR
Profiles
– Transcription factor binding sites
– Multicellular eukaryotes
– Derived from published collections of experiments
Open data access
JASPAR
Profiles
– Modeled as matrices
– Can be converted into PSSMs for scanning genomic sequences
[Figure: example matrix with rows A, D, E, G, H, N, Y]
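The conversion from a count matrix to a PSSM, and the scan itself, can be sketched as follows. The counts are made up for illustration and are not a JASPAR profile; the uniform background and the pseudocount of 1 are likewise assumptions, and `to_pssm` and `scan` are illustrative names rather than functions from a JASPAR tool.

```python
import math

BASES = "ACGT"


def to_pssm(counts, background=None, pseudocount=1.0):
    """Convert per-position base counts into log2-odds scores."""
    background = background or {b: 0.25 for b in BASES}
    width = len(counts["A"])
    pssm = []
    for k in range(width):
        total = sum(counts[b][k] for b in BASES) + pseudocount * len(BASES)
        pssm.append({
            b: math.log2((counts[b][k] + pseudocount) / total / background[b])
            for b in BASES
        })
    return pssm


def scan(sequence, pssm):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pssm)
    return [
        (i, sum(pssm[k][sequence[i + k]] for k in range(width)))
        for i in range(len(sequence) - width + 1)
    ]


# Toy counts for a three-position motif (hypothetical values).
toy_counts = {"A": [8, 0, 0], "C": [0, 8, 0], "G": [0, 0, 8], "T": [0, 0, 0]}
pssm = to_pssm(toy_counts)
hits = scan("TTACGTT", pssm)
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```

In practice a score threshold decides which windows are reported as putative binding sites; choosing that threshold is outside the scope of this sketch.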
Search profile
Score
Organism
Logo
Name of gene/protein