1
Comparative Motif Finding
CS 374 – Lecture 23 Mayukh Bhaowal
2
Reference Papers
Xiaohui Xie, Jun Lu, E. J. Kulbokas, Todd R. Golub, Vamsi Mootha, Kerstin Lindblad-Toh, Eric S. Lander, Manolis Kellis, “Systematic discovery of regulatory motifs in human promoters and 3’ UTRs by comparison of several mammals”, Nature, 2005
Mathieu Blanchette and Martin Tompa, “Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting”, Genome Res.
3
What is a Motif?
A motif is a nucleotide sequence pattern that has biological significance.
Regulatory motifs are DNA fragments
4
Motif Logos
The height of each letter represents the probability of finding that base at that position in the motif.
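As a rough illustration of the quantity a logo displays, the sketch below computes per-position base frequencies from a set of aligned motif instances. The function name `position_frequencies` and the instances are hypothetical and not taken from the slides. Real sequence logos usually also scale each column by its information content, which this sketch leaves out.

```python
from collections import Counter

def position_frequencies(instances):
    """For each motif position, return the fraction of instances showing each base."""
    length = len(instances[0])
    if any(len(s) != length for s in instances):
        raise ValueError("all motif instances must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in instances)
        freqs.append({base: counts.get(base, 0) / len(instances) for base in "ACGT"})
    return freqs

# Hypothetical aligned instances of one motif (illustrative only).
instances = ["TGACCTTG", "TGACCTTG", "TGACCTCG", "TAACCTTG"]
freqs = position_frequencies(instances)
```

Each entry of `freqs` is one logo column: a base with frequency 1.0 would be drawn at full height under this simplified reading, and a base with frequency 0.25 at a quarter of it.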
5
Why is it difficult to find them?
1. Short fragments
2. Degenerate
3. Unpredictable
Motifs can occur on either strand.
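Because a motif can sit on either strand, a search has to look for both the motif and its reverse complement. A minimal sketch, assuming exact matching only (no degeneracy) and a made-up input sequence; `find_on_both_strands` is a hypothetical helper name:

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

def find_on_both_strands(sequence, motif):
    """Return (position, strand) hits; positions are 0-based on the forward strand."""
    hits = []
    rc = reverse_complement(motif)
    k = len(motif)
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window == motif:
            hits.append((i, "+"))
        # A palindromic motif equals its own reverse complement; report it once.
        if window == rc and rc != motif:
            hits.append((i, "-"))
    return hits

# Made-up sequence containing one forward and one reverse-strand occurrence.
hits = find_on_both_strands("AATGACCTTGCCCAAGGTCATT", "TGACCTTG")
```

Handling degenerate motifs would need a scoring scheme instead of string equality, which is part of why the problem is hard.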
6
Promoter
In genetics, a promoter is a DNA sequence that enables a gene to be transcribed. The promoter is recognized by RNA polymerase, which then initiates transcription.
7
3’ UTR
The three prime untranslated region (3’ UTR) is a particular section of messenger RNA (mRNA). An mRNA codes for a protein through translation. The mRNA also contains regions that are not translated: in eukaryotes, the 5’ untranslated region, the 3’ untranslated region, the cap and the poly(A) tail.
8
What the paper proposes
What? Discovering the regulatory motifs in human promoters and 3’ UTRs.
How? By comparing sequence motifs across several mammals, which is why it is called comparative motif finding.
Which mammals? Human, mouse, rat and dog.
9
Conservation properties
10
Methods
Type              Total sequence
Promoter          68 Mb
3’ UTR            15 Mb
Intron (control)
Chose 17,700 well-annotated genes from the RefSeq database.
Promoters = 4 kb centered at the transcriptional start site (noncoding sequence only)
3’ UTRs = based on the annotation of the reference mRNA
Intronic sequences as a control (last two introns from each gene)
11
Motif Conservation Score
A motif occurrence is said to be conserved when an exact match is found in all 4 species.
Conservation = conserved occurrences / all occurrences
MCS = (observed conservation - random conservation) / standard deviation
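The conservation rate defined above can be sketched on toy data. This assumes a gap-free alignment in which all four sequences have the same length, so that a position in the human sequence lines up with the same index in the other species; real genome alignments contain gaps and need coordinate mapping. The sequences and the name `conservation_rate` are hypothetical.

```python
def conservation_rate(aligned, motif, reference="human"):
    """Count reference occurrences of motif and how many are exact in every species.

    aligned: dict mapping species name -> sequence, all of equal length
    (toy gap-free alignment, an assumption of this sketch).
    """
    ref = aligned[reference]
    k = len(motif)
    total = conserved = 0
    for i in range(len(ref) - k + 1):
        if ref[i:i + k] == motif:
            total += 1
            if all(seq[i:i + k] == motif for seq in aligned.values()):
                conserved += 1
    rate = conserved / total if total else 0.0
    return conserved, total, rate

# Hypothetical aligned sequences; the second occurrence is mutated in rat.
aligned = {
    "human": "TGACCTTGAAATGACCTTG",
    "mouse": "TGACCTTGAAATGACCTTG",
    "rat":   "TGACCTTGACATGACTTTG",
    "dog":   "TGACCTTGAAATGACCTTG",
}
conserved, total, rate = conservation_rate(aligned, "TGACCTTG")
```

Here one of the two human occurrences survives in all four species, so the toy conservation rate is 0.5.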
12
Known highly conserved motif
Err α [TGACCTTG]
Of the 434 times Err α occurs in human promoter regions, 162 are conserved across all 4 species.
Conservation rate = 37%
A random 8-mer motif shows only a 6.8% conservation rate
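Plugging the Err α numbers into the MCS definition gives a feel for the scale of the score. The slide says only "standard deviation"; the sketch below assumes a binomial model in which each of the 434 occurrences is conserved independently with the random rate of 6.8%. That model is an assumption made for this example.

```python
import math

def motif_conservation_score(conserved, total, background_rate):
    """Standard deviations by which the conserved count exceeds the random expectation.

    Assumes a binomial null model (an assumption of this sketch).
    """
    expected = total * background_rate
    std = math.sqrt(total * background_rate * (1.0 - background_rate))
    return (conserved - expected) / std

rate = 162 / 434                                     # observed conservation rate
mcs = motif_conservation_score(162, 434, 0.068)      # numbers from the slide
print(f"conservation rate = {rate:.1%}, MCS = {mcs:.1f}")
```

Under this assumption the score lands far above the MCS > 6 cutoff used for the highly conserved motifs.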
14
Results: Promoter Region
174 highly conserved motifs (MCS > 6)
59 strongly match known motifs; 10 match more weakly
105 potential new regulatory motifs
15
Approaches to explore biological significance
So why is the motif biologically significant?
1. Tissue specificity
2. Positional bias
16
Tissue Specificity
Tissue specificity of expression for genes containing the discovered motifs
Expression data for 75 tissues
59 of the 69 known motifs and 53 of the 105 new motifs show tissue specificity
17
Position Bias
Conserved motifs show strong position bias
Preferential occurrence within 100 bases of the TSS
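One simple way to quantify position bias is the fraction of motif occurrences that fall within 100 bases of the TSS. The sketch below assumes offsets are measured relative to the TSS (negative values upstream); the offsets are invented for illustration, and `fraction_near_tss` is a hypothetical helper.

```python
def fraction_near_tss(offsets, window=100):
    """Fraction of motif occurrences within `window` bases of the TSS.

    offsets: positions relative to the TSS, negative meaning upstream.
    """
    if not offsets:
        return 0.0
    near = sum(1 for off in offsets if abs(off) <= window)
    return near / len(offsets)

# Hypothetical offsets inside a 4 kb promoter window centered at the TSS.
offsets = [-1500, -80, -45, -10, 30, 95, 400, 1800]
frac = fraction_near_tss(offsets)
```

For comparison, occurrences spread uniformly over a 4 kb window would put only about 5% of hits within 100 bases of the TSS.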
18
Results: motifs in 3’ UTRs
In 3’ UTRs, 106 conserved motifs were found (MCS > 6)
3’ UTR motifs have not been studied before
Comparison of the discovered motifs to a large collection of previously known motifs is not possible
Two unique properties:
1. Strand specificity
2. Bias towards 8-mers
19
Property 1: strand specificity
Xie, X. et al., Nature, 2005
20
Property 2: bias towards 8-mers
Xie, X. et al., Nature, 2005
21
Digression: miRNA
Single-stranded RNA transcribed from DNA but not translated into protein
Many mature miRNAs start with U followed by a 7-base “seed” complementary to a site in the 3’ UTR of target mRNAs. Thus many of these sites are 8-mers.
[Image: a microRNA that regulates insulin secretion, from an NYU study published in Nature]
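The link between miRNAs and 8-mer motifs can be made concrete: the site that pairs with the first eight bases of a miRNA is their reverse complement. The sketch below assumes perfect Watson-Crick pairing over those eight bases and writes the site in the DNA alphabet; the miRNA sequence is used purely as an illustration, and `target_site_8mer` is a hypothetical name.

```python
def target_site_8mer(mirna):
    """Return the DNA 8-mer complementary to the first 8 bases of a mature miRNA.

    Assumes perfect Watson-Crick pairing (an assumption of this sketch).
    """
    first8 = mirna[:8].upper().replace("U", "T")
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(first8))

# Illustrative mature miRNA sequence starting with U.
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
site = target_site_8mer(mirna)
print(site)
```

Because the miRNA starts with U, the complementary 8-mer ends with A.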
22
Inference
Thus we can infer that many of the conserved 8-mer motifs act as binding sites for miRNAs
Leads to the discovery of 52% of the existing miRNA genes
Leads to the discovery of 129 new miRNA genes
23
Phylogenetic Footprinting
24
Problem Definition (why?)
A major challenge of current genomics is to understand how gene expression is regulated. An important step towards this understanding is the ability to identify regulatory elements.
25
What? Phylogenetic footprinting is
1. a method for the discovery of regulatory elements
2. in orthologous regulatory regions
3. from multiple species.
26
Image source: http://www.biorecipes.com/Orthologues/code.html
27
Main idea
Because of selective pressure, coding sequences evolve at a slower rate than non-coding sequences
A mutation in a coding sequence can alter the whole function of the encoded protein
A mutation in a non-coding sequence (RE) may only change the expression level of a gene
28
Phylogenetic Footprinting
Study orthologous non-coding DNA from related species (phylogenetic tree)
Differentiation: Tree
Find one motif in many species
Well conserved = possible regulatory element
29
Formalization
Given:
phylogenetic tree T,
set of orthologous sequences at the leaves of T,
length k of motif,
threshold d
Problem: Find each set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d.
30
Small Example
Size of motif sought: k = 4
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACGT... (Rabbit)
GAACGGAGTACGT... (Mouse)
TCGTGACGGTGAT... (Rat)
31
Solution
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
ACGG ACGT
Parsimony score: 1 mutation
32
An Exhaustive Algorithm
W_u[s] = best parsimony score for the subtree rooted at node u, if u is labeled with string s.
[Figure: the example tree with a table of W_u[ACGG] and W_u[ACGT] values at each node]
33
Simple Recurrence
W_u[s] = Σ_{v: child of u} min_t ( W_v[t] + h(s, t) )
where h(s, t) is the number of positions at which s and t differ.
In words: the score of a k-mer at a node is the sum of its children’s best parsimony scores for that k-mer.
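The recurrence can be tried on the small k = 4 example from the earlier slide. This is a minimal sketch, not the authors' implementation: it fills the W tables bottom-up and reports only the best score at the root, without enumerating the sets S with score at most d. The tree shape is an assumption, since the slide text does not give it, and only the sequence prefixes shown on the slide are used.

```python
from itertools import product

K = 4
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INF = float("inf")

def hamming(s, t):
    """h(s, t): number of positions at which s and t differ."""
    return sum(a != b for a, b in zip(s, t))

def leaf_table(seq):
    """W at a leaf: 0 for k-mers that occur in the sequence, infinity otherwise."""
    present = {seq[i:i + K] for i in range(len(seq) - K + 1)}
    return {s: (0 if s in present else INF) for s in KMERS}

def node_table(tree, seqs):
    """W table for a tree given as a species name (leaf) or a tuple of subtrees."""
    if isinstance(tree, str):
        return leaf_table(seqs[tree])
    child_tables = [node_table(child, seqs) for child in tree]
    table = {}
    for s in KMERS:
        total = 0
        for w in child_tables:
            # min over t of W_v[t] + h(s, t), skipping infinite entries
            total += min(w[t] + hamming(s, t) for t in KMERS if w[t] < INF)
        table[s] = total
    return table

# Sequence prefixes as shown on the slide.
seqs = {
    "Human":  "AGTCGTACGTGAC",
    "Chimp":  "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACGT",
    "Mouse":  "GAACGGAGTACGT",
    "Rat":    "TCGTGACGGTGAT",
}
# Assumed tree shape (not given in the slide text).
tree = (("Human", "Chimp"), ("Rabbit", ("Mouse", "Rat")))

root = node_table(tree, seqs)
best = min(root.values())
print(best, root["ACGT"])
```

Labeling the root with ACGT gives a score of 1, which agrees with the "1 mutation" solution on the slide: every species except Rat contains ACGT, and Rat contains ACGG.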
34
Running Time
W_u[s] = Σ_{v: child of u} min_t ( W_v[t] + h(s, t) )
n = number of species
l = average sequence length
k = motif length
O(k · 4^(2k)) time per node
Total time: O(n k (4^(2k) + l))
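The per-node cost of O(k · 4^(2k)) grows very quickly with the motif length. A short calculation, ignoring constant factors (so the numbers are orders of magnitude, not measured run times):

```python
def per_node_operations(k):
    """k * 4^(2k): all pairs (s, t) of k-mers, each compared in k steps."""
    return k * 4 ** (2 * k)

for k in (4, 8, 10):
    print(k, per_node_operations(k))
```

Going from k = 4 to k = 10 multiplies the per-node work by a factor of more than ten million, which is why the exhaustive form of the algorithm is practical only for short motifs.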
35
Results
Metallothionein Gene Family
Insulin Gene Family
C-myc promoter
36
Metallothionein Gene Family
Large number of promoter sequences
Large number of REs
Binding sites occur within 300 bp of the start codon
590 bp of sequence located upstream of the start codon
Conserved elements of lengths 7, 8, 9, 10 (k values)
Identified 12 motifs, of which 4 have been confirmed
38
Insulin Gene Family
Two rodents and a pig (two gene copies each)
Motifs with 0 mutations, k = 8
Motifs with 1 mutation, k = 9, 10
4 conserved motifs identified
Several binding sites missed as they contain very few mutations
39
C-myc Promoter
7 species analyzed
Contains members from diverse animal groups (fishes, birds, mammals, batrachians)
4 of 9 predictions are known binding sites
Most located in a 120 bp promoter region
40
Drawbacks
Some binding sites do not have significant matches in most other species
Some binding sites are well conserved only over stretches shorter than the lengths the footprinter looked at
T3R
41
Drawbacks cont’d
Deletions/insertions
Failure to meet statistical significance
Some TFs bind as dimers, where the binding site may consist of 2 conserved regions separated by a few variable nucleotides
42
Thank You