Tutorial 5 Motif discovery
Multiple sequence alignments and motif discovery: MEME, MAST, TOMTOM, GOMO, PROSITE
Can we find motifs using multiple sequence alignment? Motif: a widespread pattern with biological significance.
..YDEEGGDAEE..
..YGEEGADYED..
..YDEEGADYEE..
..YNDEGDDYEE..
..YHDEGAADEE..
[Table: residue frequency at each alignment position (columns 1-10; rows A, D, E, G, H, N, Y). Surviving entries: A 3/6, 1/6, 2/6; D 5/6; E 4/6; G 1/3.]
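The frequency table on this slide can be rebuilt from the aligned sequences. Below is a minimal Python sketch, assuming only the five sequences visible here. The slide's fractions are out of 6, so the original table presumably included a sixth sequence that is not shown, and the numbers printed here will differ from the slide. The function name column_frequencies is illustrative.

```python
from collections import Counter

# The five aligned sequences visible on the slide (a sixth is presumably missing).
sequences = [
    "YDEEGGDAEE",
    "YGEEGADYED",
    "YDEEGADYEE",
    "YNDEGDDYEE",
    "YHDEGAADEE",
]

def column_frequencies(seqs):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    n = len(seqs)
    # zip(*seqs) walks the alignment column by column.
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*seqs)
    ]

profile = column_frequencies(sequences)
for position, freqs in enumerate(profile, start=1):
    row = ", ".join(f"{r} {f:.2f}" for r, f in sorted(freqs.items()))
    print(f"position {position:2d}: {row}")
```

Columns where a single residue has frequency 1.00 (positions 1, 4, 5 and 9 in these five sequences) are the fully conserved part of the pattern.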
Can we find motifs using multiple sequence alignment (MSA)? YES! NO
Using MSA for motif discovery: this can only work if the sequences align well on their own. For most motifs this is not the case!
ClustalW - Input: input sequences, scoring matrix, gap scoring, output format, email address. http://www.ebi.ac.uk/Tools/clustalw2/index.html
Muscle - Input: input sequences, output format, email address. http://www.ebi.ac.uk/Tools/muscle/index.html
Motif search: from de-novo motifs to motif annotation (figure labels: gapped motifs; large DNA data). http://meme.sdsc.edu/
MEME - Multiple EM* for Motif Elicitation http://meme.sdsc.edu/
Motif discovery from unaligned sequences
Genomic or protein sequences
Flexible model of motif presence (a motif can be absent from some sequences or appear several times in one sequence)
*EM = expectation-maximization
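To make the expectation-maximization idea concrete, here is a minimal Python sketch. It is a toy, not MEME's implementation: it assumes exactly one motif occurrence per sequence (MEME's flexible model also allows zero or several), a uniform DNA background, a single starting point taken from a seed word, and no statistical scoring. The sequences, the seed, and all function names are made up for illustration.

```python
ALPHABET = "ACGT"
BACKGROUND = {b: 0.25 for b in ALPHABET}  # assumption: uniform background

def seed_pwm(word, weight=0.7):
    """Starting motif model: the seed letter gets `weight`, the rest share the remainder."""
    rest = (1.0 - weight) / (len(ALPHABET) - 1)
    return [{b: (weight if b == c else rest) for b in ALPHABET} for c in word]

def likelihood_ratio(window, pwm):
    """P(window | motif) / P(window | background)."""
    ratio = 1.0
    for column, base in zip(pwm, window):
        ratio *= column[base] / BACKGROUND[base]
    return ratio

def em_one_site_per_sequence(seqs, seed, n_iter=30, pseudocount=0.01):
    """Assumes every sequence is at least as long as the seed."""
    width = len(seed)
    pwm = seed_pwm(seed)
    for _ in range(n_iter):
        # E-step: posterior probability of each start position in each sequence.
        posteriors = []
        for s in seqs:
            scores = [likelihood_ratio(s[i:i + width], pwm)
                      for i in range(len(s) - width + 1)]
            total = sum(scores)
            posteriors.append([x / total for x in scores])
        # M-step: re-estimate each motif column from the expected letter counts.
        counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
        for s, post in zip(seqs, posteriors):
            for i, p in enumerate(post):
                for j in range(width):
                    counts[j][s[i + j]] += p
        pwm = [{b: c[b] / sum(c.values()) for b in ALPHABET} for c in counts]
    return pwm, posteriors

# Hypothetical unaligned sequences, each containing TGACTCA at a different offset.
toy_seqs = [
    "ACGTTGACTCAGGCTA",
    "TGACTCAAACCGTGGA",
    "GGCATTACGTGACTCA",
    "CCATGACTCATTGCGC",
]
pwm, posteriors = em_one_site_per_sequence(toy_seqs, seed="TGACTCA")
sites = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
consensus = "".join(max(col, key=col.get) for col in pwm)
print("most probable start per sequence:", sites)
print("consensus:", consensus)
```

The key point is that no alignment is needed up front: the E-step weighs every possible motif position, and the M-step rebuilds the motif from those weighted positions.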
MEME - Input: email address; input file (FASTA); how many times in each sequence?; range of motif lengths; how many motifs?; how many sites?
MEME - Output: motif score, motif length, number of times the motif occurs
MEME - Output: low uncertainty = high information content
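The relation "low uncertainty = high information content" can be made concrete with the textbook formula: information content of a column = log2(alphabet size) minus the Shannon entropy of the column, in bits. The sketch below uses that uncorrected formula; the logos MEME draws may apply further corrections (for example for small samples or background frequencies), so treat it as an illustration. The function name is hypothetical.

```python
import math
from collections import Counter

def column_information(column, alphabet_size):
    """Information content in bits of one alignment column (no corrections)."""
    n = len(column)
    counts = Counter(column)
    # Shannon entropy: the uncertainty about which residue appears here.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy

# DNA columns (alphabet size 4): conserved, half-and-half, fully mixed.
for column in ("AAAA", "AACC", "ACGT"):
    print(column, round(column_information(column, 4), 3))

# A fully conserved protein column (alphabet size 20).
print("YYYYY", round(column_information("YYYYY", 20), 3))
```

A fully conserved DNA column reaches the maximum of 2 bits, a fully mixed one carries 0 bits, and a conserved protein column reaches log2(20), about 4.32 bits.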
MEME - Output Multilevel Consensus
Patterns can be presented as regular expressions, e.g. [AG]-x-V-x(2)-{YW}
[] - Either residue
x - Any residue
x(2) - Any residue in each of the next 2 positions
{} - Any residue except these
Examples: AYVACM, GGVGAA
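The pattern notation above maps almost directly onto ordinary regular expressions. Here is a minimal Python sketch of that translation, covering only the elements listed on this slide plus repeat counts such as x(2) or x(2,4); anchors and other parts of the full pattern syntax are not handled. The function name prosite_to_regex is illustrative.

```python
import re

# One pattern element: [..], {..}, or a single letter, with an optional (n) or (n,m).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.strip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

regex = prosite_to_regex("[AG]-x-V-x(2)-{YW}")
print(regex)
for peptide in ("AYVACM", "GGVGAA", "AYVACW"):
    print(peptide, bool(re.fullmatch(regex, peptide)))
```

The two example peptides from the slide match; AYVACW does not, because its last residue is one of the excluded residues {YW}.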
MEME - Output: sequence names, position in sequence, strength of match, motif within sequence
MEME - Output: sequence names, motif location in the input sequence, overall strength of motif matches
What can we do with motifs?
MAST - Search for them in unannotated sequence databases (protein and DNA).
TOMTOM - Find the protein that binds the DNA motifs.
GOMO - Find putative target genes (DNA) of motifs and analyze their associated annotation terms.
PROSITE - Search for them in annotated protein sequence databases.
MAST http://meme.sdsc.edu/meme4_4_0/cgi-bin/mast.cgi
Searches for motifs (one or more) in sequence databases:
Like BLAST, but with motifs as input
Similar to iterations of PSI-BLAST
Profile defines strength of match
Multiple motif matches per sequence
Combined E-value for all motifs
MEME uses MAST to summarize results: each MEME result is accompanied by the MAST result of searching for the discovered motifs in the given sequences.
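"Profile defines strength of match" can be illustrated by scoring every window of a sequence against a motif profile. The Python sketch below shows only that raw log-odds scoring step, under assumed values: a uniform background and a made-up four-column DNA profile. MAST additionally converts scores to p-values and combines them into E-values, which is not reproduced here. All names are illustrative.

```python
import math

BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumption: uniform

def sharp_column(base, p=0.85):
    """Hypothetical profile column: `base` has probability p, the rest share 1 - p."""
    rest = (1.0 - p) / 3
    return {b: (p if b == base else rest) for b in "ACGT"}

def to_log_odds(profile, background=BACKGROUND):
    """Convert per-column probabilities to log2(probability / background)."""
    return [{b: math.log2(col[b] / background[b]) for b in col} for col in profile]

def scan(sequence, log_odds):
    """Score every window; return a list of (start, score) pairs."""
    width = len(log_odds)
    return [
        (start, sum(col[base] for col, base in zip(log_odds, sequence[start:start + width])))
        for start in range(len(sequence) - width + 1)
    ]

profile = [sharp_column(b) for b in "TATA"]   # made-up motif for the example
hits = scan("GGTATAGG", to_log_odds(profile))
best_start, best_score = max(hits, key=lambda hit: hit[1])
for start, score in hits:
    print(start, round(score, 2))
print("best window starts at", best_start)
```

Positive scores mean the window looks more like the motif than like background; a search tool then has to decide which scores are statistically significant.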
MAST - Input: email address, database, input file (motifs)
MAST - Output: input motifs, presence of the motifs in a given database
TOMTOM http://meme.sdsc.edu/meme/doc/tomtom.html Searches one or more query DNA motifs against one or more databases of target motifs, and reports for each query a list of target motifs, ranked by p-value. The output contains results for each query, in the order that the queries appear in the input file.
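The idea of ranking target motifs against a query motif can be sketched as a column-by-column comparison over all ungapped offsets. The Python sketch below is an assumption-laden toy: it uses mean Euclidean distance between columns and ranks by that distance, whereas the real tool ranks by p-value, which is not computed here. The target names and motifs are made up.

```python
import math

BASES = "ACGT"

def one_hot_motif(word):
    """Hypothetical helper: a motif whose columns put all weight on one base."""
    return [{b: (1.0 if b == c else 0.0) for b in BASES} for c in word]

def column_distance(p, q):
    """Euclidean distance between two frequency columns."""
    return math.sqrt(sum((p[b] - q[b]) ** 2 for b in BASES))

def best_offset(query, target, min_overlap):
    """Try every ungapped offset; return (offset, mean column distance) or None."""
    best = None
    for offset in range(-(len(query) - 1), len(target)):
        pairs = [(query[i], target[i + offset])
                 for i in range(len(query))
                 if 0 <= i + offset < len(target)]
        if len(pairs) < min_overlap:
            continue
        mean_distance = sum(column_distance(p, q) for p, q in pairs) / len(pairs)
        if best is None or mean_distance < best[1]:
            best = (offset, mean_distance)
    return best

def rank_targets(query, targets, min_overlap):
    """Return (distance, name, offset) tuples, closest target first."""
    scored = []
    for name, motif in targets.items():
        hit = best_offset(query, motif, min_overlap)
        if hit is not None:
            scored.append((hit[1], name, hit[0]))
    return sorted(scored)

targets = {  # made-up target database
    "target_A": one_hot_motif("TCACGTGA"),
    "target_B": one_hot_motif("GGGCGGGG"),
}
ranking = rank_targets(one_hot_motif("CACG"), targets, min_overlap=4)
for distance, name, offset in ranking:
    print(name, "offset", offset, "distance", round(distance, 3))
```

Requiring a minimum overlap keeps a short, chance match at the edge of a motif from outranking a full-length match.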
TOMTOM - Input: input motif, background frequencies, database
DNA IUPAC* code
A --> adenosine
C --> cytidine
G --> guanine
T --> thymidine
R --> G A (purine)
Y --> T C (pyrimidine)
K --> G T (keto)
M --> A C (amino)
S --> G C (strong)
W --> A T (weak)
B --> G T C
D --> G A T
H --> A C T
V --> G C A
N --> A G C T (any)
Example: YCAY = [TC]CA[TC]
*IUPAC = International Union of Pure and Applied Chemistry
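The YCAY example can be generalized: each IUPAC letter expands to a character class. Here is a minimal Python sketch of that expansion using the table above; the function name and the test sequence are made up for illustration.

```python
import re

# IUPAC DNA codes, with the bases in the order given in the table above.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
    "N": "AGCT",
}

def iupac_to_regex(motif):
    """Expand an IUPAC DNA motif into a regular expression string."""
    parts = []
    for code in motif.upper():
        bases = IUPAC_DNA[code]  # raises KeyError on an unknown symbol
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

pattern = iupac_to_regex("YCAY")
print(pattern)

# Start positions (0-based) of non-overlapping matches in a made-up sequence.
matches = [m.start() for m in re.finditer(pattern, "GGTCACTTCCATAA")]
print(matches)
```

Note that re.finditer reports non-overlapping matches only; overlapping sites would need a lookahead or a manual scan.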
TOMTOM - Output: input motif, matching motifs
TOMTOM - Output: wrong input, OK results
JASPAR: profiles of transcription factor binding sites in multicellular eukaryotes, derived from published collections of experiments. Open data access.
Logo, name of gene/protein, organism, score
GOMO: takes DNA-binding motifs, finds their putative target genes, and analyzes the GO terms associated with those genes. It returns a list of GO terms that are significantly associated with the target genes of each motif. Gene Ontology (GO) provides a controlled vocabulary to describe gene and gene product attributes in any organism.
GOMO - Input: email address, database, input file (motifs)
GOMO - Output: input motifs, GO annotation. MF - Molecular function; BP - Biological process; CC - Cellular component
PROSITE http://www.expasy.org/tools/scanprosite PROSITE is a database of protein domains and motifs that can be searched with either regular-expression patterns or sequence profiles.
PROSITE - Input: input motif (a regular expression), database, filters
PROSITE - Output: input motif, protein, location in the protein sequence