Special Topics in Genomics: Motif Analysis
Sequence motif: a pattern in nucleotide or amino acid sequences
Example: six DNA sequences, each containing a binding site for the same transcription factor (TF):
GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA
CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG
AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC
ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG
Transcription Factor Binding Sites (TFBS), one per sequence:
TGGGTGGTC
TGGGTGGTA
TGGGAGGTC
TGGGTGGTG
TGAGTGGTC
TGGGTGGTC
[Figure: an example DNA motif and an example protein motif]
Motif representation
Consensus sequence
Example: CACSTG, where S stands for C or G (CAC[C,G]TG)
Sequence Logo
Schneider & Stephens, Nucleic Acids Res. 18: (1990)
Entropy (Shannon): a measurement of uncertainty.
The amount of uncertainty reduced by observing the sequences is the amount of information (or information content) obtained. This is the total height of each position in the logo plot.
Within a position, the height of each nucleotide is proportional to its frequency.
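The equations on this slide did not survive, so the following is a minimal sketch of the usual calculation rather than the slide's own formulas. It assumes the standard Shannon entropy H = -sum_b f_b log2 f_b at each position and, for DNA with four equally likely bases beforehand, information content I = 2 - H bits. The small-sample correction used by Schneider & Stephens is left out, and the function names are hypothetical.

```python
import math
from collections import Counter


def position_information(sites):
    """Information content (bits) of each column of aligned DNA sites.

    Assumes a uniform prior over A, C, G, T, so the maximum is 2 bits.
    No small-sample correction is applied (a simplification).
    """
    n = len(sites)
    info = []
    for j in range(len(sites[0])):
        counts = Counter(site[j] for site in sites)
        # Shannon entropy of the observed base frequencies in column j
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        info.append(2.0 - entropy)
    return info


def letter_heights(sites):
    """Height of each base in the logo: frequency times column information."""
    n = len(sites)
    info = position_information(sites)
    heights = []
    for j, column_info in enumerate(info):
        counts = Counter(site[j] for site in sites)
        heights.append({base: (c / n) * column_info for base, c in counts.items()})
    return heights


# The six binding sites listed on the "Sequence motif" slide
sites = ["TGGGTGGTC", "TGGGTGGTA", "TGGGAGGTC",
         "TGGGTGGTG", "TGAGTGGTC", "TGGGTGGTC"]
info = position_information(sites)
heights = letter_heights(sites)
```

A fully conserved column (such as the first, always T) reaches the 2-bit maximum, while a column with some variation (such as the third, five G and one A) is shorter.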
Two questions in motif analysis
1. Known motif mapping: finding occurrences of a known motif in nucleotide or amino acid sequences
2. De novo motif discovery: finding motifs that are previously unknown
Known motif mapping: consensus mapping
STEP 1: provide a motif (e.g. CACSTG = CAC[C,G]TG)
STEP 2: specify the number of mismatches allowed (e.g. <=1)
STEP 3: scan the sequence
Example: CGCCGGGACCAGATCAACGCCGAGATCCGGCACATGAAGGAGCT
A window with m=3 mismatches is rejected (no); a window with m=1 mismatch, such as CACATG, is accepted (yes).
A useful tool: CisGenome
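The three steps of consensus mapping can be sketched as follows. This is an illustrative sketch, not CisGenome's implementation: it scans the forward strand only, and the function names and the IUPAC lookup table are introduced here for the example.

```python
# IUPAC nucleotide codes: each consensus letter maps to the bases it allows
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "S": "CG", "W": "AT", "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(consensus, window):
    """Number of positions where the window's base is not allowed by the consensus."""
    return sum(base not in IUPAC[code] for code, base in zip(consensus, window))


def scan_consensus(sequence, consensus, max_mismatches=1):
    """Return (start, window, mismatches) for every window within the limit.

    Forward strand only (an assumption of this sketch); start is 0-based.
    """
    k = len(consensus)
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        m = count_mismatches(consensus, window)
        if m <= max_mismatches:
            hits.append((start, window, m))
    return hits


sequence = "CGCCGGGACCAGATCAACGCCGAGATCCGGCACATGAAGGAGCT"
hits = scan_consensus(sequence, "CACSTG", max_mismatches=1)
```

With at most one mismatch allowed, the window CACATG in the example sequence is reported because only its fourth base (A instead of C or G) disagrees with the consensus.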
Known motif mapping: motif matrix mapping (CisGenome)
STEP 1: provide a motif matrix and a background model
STEP 2: specify a likelihood ratio cutoff (e.g. LR>=500)
STEP 3: scan the sequence
Example: GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGACTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA
A window with LR>=500 is reported as a site (yes); a window with LR<500 is not (no).
[Figure: the motif matrix and the background model, each a table of probabilities over A, C, G, T]
Another tool for matrix mapping: MAST
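A minimal sketch of likelihood-ratio scanning follows. It is not CisGenome's or MAST's code. Several choices are assumptions made only for illustration: the motif matrix is estimated from the six binding sites shown earlier with a pseudocount of 0.5, the background is a uniform zero-order model, only the forward strand is scanned, and all function names are hypothetical.

```python
import math

BASES = "ACGT"


def build_matrix(sites, pseudocount=0.5):
    """Estimate a position-specific probability matrix from aligned sites."""
    matrix = []
    for j in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        matrix.append({b: counts[b] / total for b in BASES})
    return matrix


def likelihood_ratio(window, matrix, background):
    """P(window | motif matrix) / P(window | background), computed in log space."""
    log_lr = sum(math.log(matrix[j][b]) - math.log(background[b])
                 for j, b in enumerate(window))
    return math.exp(log_lr)


def scan_matrix(sequence, matrix, background, cutoff=500.0):
    """Return (start, window, LR) for every window with LR >= cutoff."""
    k = len(matrix)
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        lr = likelihood_ratio(window, matrix, background)
        if lr >= cutoff:
            hits.append((start, window, lr))
    return hits


sites = ["TGGGTGGTC", "TGGGTGGTA", "TGGGAGGTC",
         "TGGGTGGTG", "TGAGTGGTC", "TGGGTGGTC"]
matrix = build_matrix(sites)
background = {b: 0.25 for b in BASES}  # assumed uniform background
sequence = ("GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA"
            "CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA")
hits = scan_matrix(sequence, matrix, background, cutoff=500.0)
```

Unlike consensus mapping, the matrix weighs each mismatch by how often that base was seen at that position, so a rare substitution lowers the score less than an unseen one.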
De novo motif discovery
Two major classes of methods:
1. Word enumeration
2. Matrix updating
Word enumeration
Example: Sinha & Tompa, Nucleic Acids Res. 30: (2002)
STEP 1: enumerate possible words;
STEP 2: count word occurrences;
STEP 3: compare the observed word count with the count expected at random.
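The three steps can be sketched with a simple z-score. This is a deliberately simplified illustration, not the method of Sinha & Tompa: the random expectation here assumes independent bases drawn at the frequencies observed in the input, treats windows as independent (a binomial approximation that ignores word overlap), and counts the forward strand only. The function name is hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"


def word_zscores(sequences, k):
    """Rank all k-mers by how far their count exceeds a simple random expectation.

    Returns (word, observed, expected, z) tuples sorted by z, highest first.
    """
    text = "".join(sequences)
    freq = {b: text.count(b) / len(text) for b in BASES}

    # STEP 2: count occurrences of every word of length k
    observed = {}
    n_windows = 0
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            observed[word] = observed.get(word, 0) + 1
            n_windows += 1

    # STEP 1 and STEP 3: enumerate all 4**k words and score each one
    scores = []
    for letters in product(BASES, repeat=k):
        word = "".join(letters)
        p = math.prod(freq[b] for b in word)   # chance a random window equals word
        expected = n_windows * p
        variance = expected * (1.0 - p)        # binomial variance
        if variance == 0.0:
            continue                           # word uses a base absent from the input
        count = observed.get(word, 0)
        z = (count - expected) / math.sqrt(variance)
        scores.append((word, count, expected, z))
    scores.sort(key=lambda row: row[3], reverse=True)
    return scores


ranked = word_zscores(["ACGTACGT", "TTACGTTT"], k=4)
```

Enumerating all 4**k words is only practical for short words, which is one reason real tools restrict or structure the word space.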
Matrix updating: CONSENSUS
(Stormo & Hartzell, PNAS, 86: , 1990)
STEP 1: use all k-mers in the first sequence as seeds;
STEP 2: find matches (often the best matches) of each seed in the second sequence;
STEP 3: update the seed matrices and exclude matrices with low information content;
STEP 4: repeat steps 2 and 3 for each remaining sequence.
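The greedy procedure above can be sketched as follows. This is a simplified illustration in the spirit of CONSENSUS, not the published program: it keeps only the single best match per sequence, scores matches by log-probability under the current matrix, measures information content against a uniform background, and prunes to a fixed number of matrices. The `keep` and `pseudocount` parameters and all function names are assumptions of this sketch.

```python
import math

BASES = "ACGT"


def column_freqs(sites, pseudocount=0.5):
    """Base frequencies per column of the aligned sites, with pseudocounts."""
    freqs = []
    for j in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        freqs.append({b: counts[b] / total for b in BASES})
    return freqs


def information_content(sites):
    """Total information (bits) of an alignment relative to a uniform background."""
    return sum(f * math.log2(f / 0.25)
               for col in column_freqs(sites) for f in col.values())


def best_match(sequence, sites):
    """STEP 2: the window of `sequence` that scores highest under the current matrix."""
    k = len(sites[0])
    freqs = column_freqs(sites)
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return max(windows,
               key=lambda w: sum(math.log(freqs[j][b]) for j, b in enumerate(w)))


def consensus_like(sequences, k, keep=50):
    """Greedy alignment growth; returns alignments sorted by information content."""
    first = sequences[0]
    # STEP 1: every k-mer of the first sequence seeds one alignment
    alignments = [[first[i:i + k]] for i in range(len(first) - k + 1)]
    # STEP 4: process the remaining sequences one at a time
    for seq in sequences[1:]:
        # STEP 2: extend each alignment with its best match in this sequence
        alignments = [sites + [best_match(seq, sites)] for sites in alignments]
        # STEP 3: keep only the most informative matrices
        alignments.sort(key=information_content, reverse=True)
        alignments = alignments[:keep]
    return alignments


toy = ["ACGTTGGGTGGTCAACA", "CCATGGGTGGTCGTTAG", "GATTACATGGGTGGTCT"]
best = consensus_like(toy, k=9)[0]
```

Because the search is greedy, the result depends on the order of the sequences; placing a sequence with a clear site first gives the seeds a better start.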
Matrix updating: mixture model
S: the observed sequence, e.g. GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGACTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA
A: the locations of motif sites in S
Model: a motif matrix and a background model (each a table of probabilities over A, C, G, T), the motif width W, and the mixture proportions q = [q0, q1]
Inference by iterative estimation or sampling:
EM: Lawrence and Reilly (1990), Bailey and Elkan (1994), etc.
Gibbs Sampler: Lawrence et al. (1993), Liu (1994), Liu et al. (1995), etc.
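To make the iterative estimation concrete, here is a minimal EM sketch for a two-component mixture over sequence windows. It is not the algorithm of any of the cited papers. The simplifying assumptions are: the width W is fixed by a seed word, the background is a fixed zero-order model estimated from the input (all four bases must occur), every window is treated as independent even when windows overlap, and `q1` is taken here to be the motif proportion. All names and default values are hypothetical.

```python
import math

BASES = "ACGT"


def em_motif(sequences, seed, n_iter=30, q1=0.02, pseudocount=0.1):
    """Fit a motif matrix and mixing proportion by EM, starting from a seed word.

    Returns (theta, q1): theta is a list of per-position base probabilities,
    q1 is the estimated fraction of windows generated by the motif.
    """
    width = len(seed)
    windows = [s[i:i + width] for s in sequences
               for i in range(len(s) - width + 1)]
    text = "".join(sequences)
    background = {b: text.count(b) / len(text) for b in BASES}
    # Initial matrix: favour the seed base at each position (0.7 vs 0.1)
    theta = [{b: (0.7 if b == s else 0.1) for b in BASES} for s in seed]

    for _ in range(n_iter):
        # E-step: posterior probability that each window is a motif site
        posterior = []
        for w in windows:
            log_motif = math.log(q1) + sum(
                math.log(theta[j][b]) for j, b in enumerate(w))
            log_bg = math.log(1.0 - q1) + sum(
                math.log(background[b]) for b in w)
            diff = min(log_bg - log_motif, 700.0)  # guard against overflow
            posterior.append(1.0 / (1.0 + math.exp(diff)))
        # M-step: re-estimate the matrix from posterior-weighted counts
        counts = [{b: pseudocount for b in BASES} for _ in range(width)]
        for w, z in zip(windows, posterior):
            for j, b in enumerate(w):
                counts[j][b] += z
        theta = [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]
        q1 = sum(posterior) / len(posterior)
    return theta, q1


sequences = [
    "GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA",
    "TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA",
    "CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA",
    "TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG",
    "AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC",
    "ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG",
]
theta, q1 = em_motif(sequences, seed="TGGGTGGTC")
consensus = "".join(max(col, key=col.get) for col in theta)
```

A Gibbs sampler follows the same alternation but draws site locations at random from the posterior instead of averaging over them, which can help it escape a poor starting point.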
Other issues
1. Dependencies within a motif
2. Functions of novel motifs