1
Regulatory Motif Finding
Wenxiu Ma CS374 Presentation 11/03/2005
2
Outline
Regulation of genes
Regulatory Motifs
Motif Representation
Current Motif Discovery Methods
3
Regulation of Genes
What turns genes on (producing a protein) and off?
When is a gene turned on or off?
Where (in which cells) is a gene turned on?
How many copies of the gene product are produced?
The human genome has about 3 billion nucleotides.
Different cell types differ dramatically in both structure and function.
Cell differentiation is often irreversible.
Q: Under what conditions is each gene product made, and once made, what does it do?
4
Overview of Gene Control
The mechanisms that control the expression of genes operate at many levels. Gene expression can be regulated at many of the steps in the pathway from DNA to RNA to Protein. Thus a cell can control the proteins it makes by (1) controlling when and how often a given gene is transcribed (transcriptional control), (2) controlling how the RNA transcript is spliced or otherwise processed (RNA processing control), (3) selecting which completed mRNAs in the cell nucleus are exported to the cytosol and determining where in the cytosol they are localized (RNA transport and localization control), (4) selecting which mRNAs in the cytoplasm are translated by ribosomes (translational control), (5) selectively destabilizing certain mRNA molecules in the cytoplasm (mRNA degradation control), or (6) selectively activating, inactivating, degrading, or compartmentalizing specific protein molecules after they have been made (protein activity control) source: Molecular Biology of the Cell (4th ed.), A. Johnson, et al.
5
Transcriptional Regulation
The transcription of each gene is controlled by a regulatory region of DNA relatively near the transcription start site (TSS).
Two types of fundamental components:
  short DNA regulatory elements
  gene regulatory proteins that recognize and bind to them
6
Regulation of Genes
[Figure labels: Transcription Factor (Protein), RNA polymerase, DNA, Gene, Regulatory Element. Source: M. Tompa, U. of Washington]
7
Regulation of Genes
[Figure labels: Transcription Factor (Protein), RNA polymerase, DNA, Regulatory Element, Gene. Source: M. Tompa, U. of Washington]
8
Regulation of Genes
[Figure labels: New protein, RNA polymerase, Transcription Factor, DNA, Regulatory Element, Gene. Source: M. Tompa, U. of Washington]
9
Outline
Regulation of genes
Regulatory Motifs
Motif Representation
Current Motif Discovery Methods
10
What is a motif?
A subsequence (substring) of biological importance that occurs in multiple sequences.
Motifs can be totally constant or have variable elements.
Protein motifs often result from structural features.
DNA motifs (regulatory elements):
  Binding sites for proteins
  Short sequences (5-25 bp)
  Up to 1000 bp (or farther) from the gene
  Inexactly repeating patterns
11
daf-19 Binding Sites in C. elegans
Gene      Binding site
che-2     GTTGTCATGGTGAC
daf-19    GTTTCCATGGAAAC
osm-1     GCTACCATGGCAAC
osm-6     GTTACCATAGTAAC
F02D8.3   GTTTCCATGGTAAC
Position axis: -150 to -1
Source: Peter Swoboda
12
Motif Representation
Consensus sequence: a single string with the most likely sequence (+/- wildcards)
Regular expression: a string with wildcards, constrained selection
Profile: a list of the letter frequencies at each position
Sequence logo: graphical depiction of a profile, showing the conservation of elements in a motif
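As a rough illustration of the profile and consensus representations, the sketch below builds both from the five daf-19 binding sites listed on the earlier slide. The function names `profile` and `consensus`, and the alphabetical tie-breaking rule, are choices made for this example and do not come from the presentation.

```python
from collections import Counter

# The five daf-19 binding sites listed on the earlier slide.
SITES = [
    "GTTGTCATGGTGAC",
    "GTTTCCATGGAAAC",
    "GCTACCATGGCAAC",
    "GTTACCATAGTAAC",
    "GTTTCCATGGTAAC",
]

def profile(sites):
    """Return one {base: frequency} dict per alignment column."""
    return [
        {base: count / len(sites) for base, count in Counter(column).items()}
        for column in zip(*sites)
    ]

def consensus(sites):
    """Most frequent base per column; ties are broken alphabetically
    (an arbitrary choice made for this sketch)."""
    letters = []
    for column in profile(sites):
        letters.append(max(sorted(column), key=lambda base: column[base]))
    return "".join(letters)

print(consensus(SITES))
print(profile(SITES)[1])   # second column: mostly T, one C
```

Column 4 of these sites is a tie between A and T, which shows why a single consensus string loses information that the profile keeps.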
13
Motif Logos: an Example
14
Measure of Conservation
Relative heights of letters reflect their abundance in the alignment.
Total height = entropy-based measurement of conservation.
Entropy(i) = -SUM { f(base, i) * log2[f(base, i)] } over all bases
Conservation(i) = 2 - Entropy(i)
Units of conservation = bits of information (hence the base-2 logarithm; the maximum for four bases is 2 bits)
Entropy measures variability/disorder.
Highly conserved = low entropy = tall stack
Very variable = high entropy = low stack
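The conservation measure can be sketched in a few lines. This is a minimal version that assumes a DNA alphabet and a base-2 logarithm (so the result is in bits), and it leaves out the small-sample correction that logo software often applies. The function name `column_conservation` is made up for this example.

```python
import math
from collections import Counter

def column_conservation(column):
    """Conservation of one alignment column in bits: 2 - Shannon entropy.

    `column` is a string of bases, one per aligned sequence.
    Bases with zero count contribute nothing to the entropy.
    """
    total = len(column)
    entropy = -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(column).values()
    )
    return 2.0 - entropy

for col in ("GGGGG", "AACC", "ACGT"):
    print(col, column_conservation(col))
```

A fully conserved column reaches the 2-bit maximum, while a column with all four bases equally frequent scores 0 bits.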
15
Outline
Regulation of genes
Regulatory Motifs
Motif Representation
Current Motif Discovery Methods
16
Finding Regulatory Motifs
Given a collection of genes with common expression, find the (TF-binding) motif they have in common.
17
Identifying Motifs: Complications
We do not know the motif sequence.
We do not know where it is located relative to the gene's start.
Motifs can differ slightly from one gene to another.
How do we discern it from “random” motifs?
18
Current Motif Discovery Methods
GOAL: comprehensive identification of all the regulatory motifs in genomes.
By overrepresentation: MEME, Gibbs sampling
By phylogenetic footprinting: Footprinter
Cross-species comparative analysis
Combining structure information
The majority of non-coding functional elements remain unknown.
19
Motif Finding: Comparative Analysis
Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Xie, X. et al., Nature (2005).
Identify motifs based on comparative analysis of the human, mouse, rat and dog genomes.
A systematic catalogue of human gene regulatory motifs:
  Short, functional sequences (6-10 bp) used many times in a genome
Focus regions:
  Promoters
  3’ untranslated regions (3’ UTRs): microRNAs (miRNAs), post-transcriptional regulation
Comparative analysis takes advantage of evolutionary conservation across related species.
20
Motif Discovery Procedure
Alignment of promoters & 3’ UTRs
Motif conservation score (MCS)
  Measures the extent of excess conservation
“Highly conserved motifs”: MCS > 6
Clustering
21
Alignment of promoters & 3’ UTRs
Construct a whole-genome alignment for the four mammalian genomes using the programs Blastz and Multiz.
Extract the aligned promoter and 3’ UTR portions, respectively.
Coordinates: the annotation of NCBI reference sequences (RefSeq).
The alignment was built in two steps: human and dog sequences were first aligned based on the human/dog syntenic map generated by the authors (to be reported elsewhere, see Lindblad-Toh et al.); the human/dog sequences were then aligned to the human/mouse/rat three-way alignment.
22
Motif Conservation Score (MCS)
Consensus sequence representation
  Alphabet size: 11 (A, C, G, T, [AC], [AG], [AT], [CG], [CT], [GT], [ACGT])
A conserved occurrence of a motif m is an instance in which an exact match to the motif is found in all four species.
Conservation rate p = ratio of conserved occurrences to total occurrences in human.
Expected conservation rate p0 = average conservation rate of 100 random motifs of the same length and redundancy.
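The definition of a conserved occurrence can be illustrated with a small sketch. It assumes the four species are given as gap-free aligned strings of equal length, which real Blastz/Multiz alignments are not, and it writes the motif directly as a regular expression, since the bracketed letters of the 11-symbol alphabet are already valid character classes. The function name `conservation_rate` and the toy sequences are hypothetical.

```python
import re

def conservation_rate(motif, alignment):
    """Count motif occurrences in human and how many are conserved.

    motif     : consensus string over the 11-letter alphabet, e.g. "TGACC[CT]TG"
    alignment : dict species -> aligned sequence (equal length, no gaps;
                a simplifying assumption of this sketch)
    Returns (conserved, total, rate).
    """
    human = alignment["human"]
    others = [seq for name, seq in alignment.items() if name != "human"]
    # A lookahead with a capture group also reports overlapping occurrences.
    pattern = re.compile(f"(?=({motif}))")
    total = conserved = 0
    for hit in pattern.finditer(human):
        start, end = hit.start(1), hit.end(1)
        total += 1
        # Conserved: the same aligned window matches in every other species.
        if all(re.fullmatch(motif, seq[start:end]) for seq in others):
            conserved += 1
    rate = conserved / total if total else 0.0
    return conserved, total, rate

site = "TGACCTTG"
toy = {
    "human": "AA" + site + "AA" + site,
    "mouse": "AA" + site + "AA" + "TGACGTTG",   # second site mutated
    "rat":   "AA" + site + "AA" + site,
    "dog":   "AA" + "TGACCCTG" + "AA" + site,   # still matches [CT]
}
print(conservation_rate("TGACC[CT]TG", toy))
```

In the toy input only the first human occurrence is matched in all four species, so the conservation rate is one half.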
23
MCS
MCS = number of standard deviations (s.d.) by which the observed conservation rate p of a motif exceeds the expected conservation rate p0.
p = k/n
Binomial probability of observing k conserved occurrences out of n
Estimated by way of the normal approximation to the binomial distribution
24
Conservation Properties of Regulatory Motifs
Known 8-mer TGACCTTG
  Conservation rate 37% (162 out of 434)
  Random rate 6.8%
  MCS = 25.2 s.d.
Promoter region
  TRANSFAC: 446 motifs
  MCS>3: 63%
  MCS>5: ~50%
3’ UTR
  No database analogous to TRANSFAC
  Some known motifs
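The MCS for the TGACCTTG example can be approximated as a z-score under the normal approximation to the binomial. The sketch below assumes the score is (k - n*p0) / sqrt(n*p0*(1 - p0)); the slides do not spell out the exact formula, so this is one plausible reading rather than the authors' code, and the function name is made up.

```python
import math

def motif_conservation_score(k, n, p0):
    """z-score of observing k conserved occurrences out of n, given an
    expected conservation rate p0 (normal approximation to the binomial)."""
    expected = n * p0
    sd = math.sqrt(n * p0 * (1.0 - p0))
    return (k - expected) / sd

# Figures from the slide: 162 conserved out of 434 occurrences, random rate 6.8%.
print(motif_conservation_score(162, 434, 0.068))
```

With the rounded random rate of 6.8% this gives roughly 25.3, in line with the 25.2 s.d. quoted on the slide.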
25
Motif Discovery Procedure
Alignment of promoters & 3’ UTRs
Motif conservation score (MCS)
“Highly conserved motifs”: MCS > 6
Clustering
26
Results: motifs in promoters
174 highly conserved motifs
  59 strong matches to known motifs, 10 weaker matches
  105 potential new regulatory motifs
Tissue specificity and position bias
  89% of known motifs
  69% of new motifs
Xie, X. et al., Nature, 2005
27
Results: motifs in 3’ UTRs
106 highly conserved motifs Two unusual properties Strand specificity Unusual length distribution
28
Property1: strand specificity
Xie, X. et al., Nature, 2005
29
Property 2: unusual length distribution
Xie, X. et al., Nature, 2005
30
Properties => miRNA
Strand specificity
  3’-UTR motifs act at the level of RNA rather than DNA
  They have a role in post-transcriptional regulation
Length distribution
  Many mature miRNAs start with a U followed by a 7-base “seed” complementary to a site in the 3’ UTR of target mRNAs.
Hypothesis: many of the highly conserved 8-mer motifs might be binding sites for conserved miRNAs.
31
The microRNA pathway
[Figure labels: 7mG(5’)ppp(5’)G, pri-miRNA, Drosha, Pasha, 3’-nA…AAA, pre-miRNA, Dicer, miR/miR* duplex, mature miRNA, miRNP]
Adapted from Tomari & Zamore, Curr Biol 2004
32
Relationship with miRNA
72 highly conserved 8-mer motifs
  Contiguous, non-degenerate
  ~46% of all 3’-UTR motifs
207 distinct human miRNAs
  From the current registry
Complementary matches
  Exact match: ~43.5%
  One mismatch: ~50%
  95% of matches begin at nt 1 or 2 of the miRNA gene
The 8-mer motifs represent target sites for miRNAs.
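The complementarity check can be sketched as follows: take the 8 nucleotides starting at position 1 or 2 of a miRNA, reverse-complement them, and compare the result with a 3’-UTR 8-mer motif. The sketch assumes positions are counted from the 5’ end of the mature miRNA and that UTR motifs are written in the DNA alphabet. The let-7a sequence is included only as an illustrative input and is not taken from the slides, and the function name is hypothetical.

```python
# RNA complement table: A<->U, C<->G.
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def seed_target_sites(mirna, length=8):
    """Return {start: site} for start = 1 and 2 (1-based), where `site` is
    the DNA-alphabet sequence complementary to mirna[start : start+length]."""
    sites = {}
    for start in (1, 2):
        segment = mirna[start - 1:start - 1 + length]
        reverse_complement = segment.translate(COMPLEMENT)[::-1]
        sites[start] = reverse_complement.replace("U", "T")
    return sites

# Mature let-7a (5' -> 3'), used here purely as an example input.
LET_7A = "UGAGGUAGUAGGUUGUAUAGUU"
sites = seed_target_sites(LET_7A)
print(sites)

# A candidate 3'-UTR 8-mer is an exact complementary match if it equals
# one of the two computed sites.
print("CTACCTCA" in sites.values())
```

Allowing one mismatch, as on the slide, would replace the equality test with a Hamming-distance comparison.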
33
8-mer motifs ->new miRNA genes
RNAfold program
  242 conserved and stable stem-loop sequences
  113 known, 129 potential new miRNAs
Biological validation
  12 selected new miRNA genes
  6 (50%) show clear expression in tissues.
34
Prevalence of miRNA regulation
20% of 3’ UTRs may be targets for conserved miRNA-based regulation at the 8-mer motifs. Unbiased assessment of the relative importance of miRNA-based regulation in the human genome
35
Summary: comparative genome analysis
4 mammalian species
An initial systematic catalogue
  Promoters
  3’ UTRs
Importance of the new miRNA regulatory mechanism
Future directions:
  Genome-wide discovery
  More genome alignments, e.g. the primates
36
Now… Motif Finding Methods Cross species comparative analysis
Combine structure information
37
Motif Finding: Structural Knowledge
Ab initio prediction of transcription factor targets using structural knowledge, Kaplan T, et al., PLoS Comput Biol (2005)
Proposes a general framework for predicting the DNA binding site (BS) sequences of novel TFs from a known family
Structure-based approach
  No prior TF binding data or target genes
Family-wise probabilistic model
  Context-specific amino acid-nucleotide recognition preferences
Note on ab initio gene prediction: traditionally, gene prediction programs that rely only on the statistical qualities of exons have been referred to as performing ab initio predictions. Ab initio prediction of coding sequences is an undeniable success by the standards of the machine-learning algorithm field, and most of the widely used gene prediction programs belong to this class of algorithms.
38
Structure-based approach
Family-wise probabilistic model
Input:
  pairs of TFs and their target DNA sequences
  structural information
Output:
  context-specific amino acid-nucleotide recognition preferences
  position specificity
Then, discover TFBSs of other TFs from the same family.
In previous studies, we used the empirical frequencies of amino acid–nucleotide interactions [4,5] in solved complexes (from various protein families) to build a set of “DNA-recognition preferences.” This approach assumed similar DNA-binding preferences of the amino acids for all structural domains and at all binding positions. However, there are clear experimental indications that this assumption is not always valid: a particular amino acid may have different binding preferences depending on its positional context.
39
Cys2His2 Zinc Finger protein family
Largest known DNA-binding family in multicellular organisms
Common, strict binding models
Figure: One type of zinc finger protein. This protein belongs to the Cys-Cys-His-His family of zinc finger proteins, named after the amino acids that grasp the zinc. (A) Schematic drawing of the amino acid sequence of a zinc finger from a frog protein of this class. (B) The three-dimensional structure of this type of zinc finger is constructed from an antiparallel β sheet (amino acids 1 to 10) followed by an α helix (amino acids 12 to 24). The four amino acids that bind the zinc (Cys 3, Cys 6, His 19, and His 23) hold one end of the α helix firmly to one end of the β sheet. (Adapted from M.S. Lee et al., Science 245: , 1989.)
There Are Several Types of DNA-binding Zinc Finger Motifs
The helix-turn-helix motif is composed solely of amino acids. A second important group of DNA-binding motifs adds one or more zinc atoms as structural components. Although all such zinc-coordinated DNA-binding motifs are called zinc fingers, this description refers only to their appearance in schematic drawings dating from their initial discovery (Figure 7-17A). Subsequent structural studies have shown that they fall into several distinct structural groups, two of which are considered here. The first type was initially discovered in the protein that activates the transcription of a eucaryotic ribosomal RNA gene. It is a simple structure, consisting of an α helix and a β sheet held together by the zinc (Figure 7-17B).
Source: Molecular Biology of the Cell (4th ed.), A. Johnson, et al.
40
Cys2His2 Zinc Finger: Canonical DNA binding model
Residues at positions 6, 3, 2, and -1 (relative to the beginning of the α-helix) at each finger interact with adjacent nucleotides in the DNA molecule (interactions shown with arrows). Kaplan et al., PLoS Comput Biol, 2005
41
Cys2His2 Zinc Finger: DNA Binding Model
Figure DNA binding by a zinc finger protein. (A) The structure of a fragment of a mouse gene regulatory protein bound to a specific DNA site. This protein recognizes DNA using three zinc fingers of the Cys-Cys-His-His type (see Figure 7-17) arranged as direct repeats. (B) The three fingers have similar amino acid sequences and contact the DNA in similar ways. In both (A) and (B) the zinc atom in each finger is represented by a small sphere. (Adapted from N. Pavletich and C. Pabo, Science 252: , 1991.) source: Molecular Biology of the Cell (4th ed.), A. Johnson, et al.
42
Cys2His2 Zinc Finger: Compiling dataset
Goal: DNA-recognition preferences for each of the four key positions
Every amino acid (AA) vs. every nucleotide (NT)
Insufficient solved protein-DNA complexes
Known protein sequence data and their DNA targets
TRANSFAC: 455 protein-DNA pairs
Non-canonical model
Profile HMM
No exact binding locations
CX(2-4)CX(11-13)HX(3-5)H
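The pattern CX(2-4)CX(11-13)HX(3-5)H translates directly into a regular expression, with X read as "any residue". The sketch below is only a pattern scan, not the profile HMM mentioned on the slide, and the toy protein string is synthetic rather than a real zinc finger.

```python
import re

# CX(2-4)CX(11-13)HX(3-5)H, where X stands for any amino acid.
C2H2_PATTERN = re.compile(r"C.{2,4}C.{11,13}H.{3,5}H")

def find_fingers(protein):
    """Return (start index, matched text) for each non-overlapping hit."""
    return [(m.start(), m.group()) for m in C2H2_PATTERN.finditer(protein)]

# Synthetic sequence: two flanking residues, then one pattern instance
# with spacer lengths 2, 12 and 3, then two more flanking residues.
toy_protein = "MK" + "C" + "AA" + "C" + "A" * 12 + "H" + "AAA" + "H" + "GG"
print(find_fingers(toy_protein))
```

A plain pattern scan like this locates candidate fingers; a profile HMM additionally scores how well each residue fits the family.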
43
Profile HMM
States: “silent” deletion states, insertion states, match states
Builds a model representing the consensus sequence for a family, rather than the sequence of any particular member
Finds potential alignments for new sequences
44
Example: full profile HMM
45
Structure-based approach
Input: set of pairs of TFs and their target DNA sequences Output: Context-specific amino acid-nucleotide recognition preferences Iterative Expectation Maximization (EM) algorithm
46
Cys2His2 Zinc Finger: Probabilistic Model
The set of interacting residues at the 4 different positions of the k fingers
Let N1, …, NL be a target DNA sequence.
The probability of an interaction starting from the j-th position in the DNA is expressed in terms of Pp(N|A), the conditional probability of nucleotide N given amino acid A at position p.
Kaplan et al., PLoS Comput Biol, 2005
47
EM algorithm
Iterative EM algorithm
Unknowns:
  exact binding locations for all protein-DNA pairs
  recognition preferences: Pp(N|A)
E-step: compute the expected posterior probability of the binding locations, based on the current preferences.
M-step: update the DNA-recognition preferences to maximize the likelihood, based on the distribution over possible binding locations from the preceding E-step.
The iteration may converge to a local optimum.
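The E-step and M-step can be sketched on a heavily simplified model. The code below assumes one residue contact per nucleotide of the site, so it ignores the way position 2 of the canonical model reaches into the neighbouring triplet. It keys preferences by a (position, amino acid) pair, starts from uniform preferences, and adds a pseudocount. None of these choices, nor the names `start_posteriors` and `em`, come from Kaplan et al.; they are assumptions made to keep the example short.

```python
NUCS = "ACGT"

def start_posteriors(contacts, dna, prefs):
    """E-step for one protein-DNA pair: posterior over binding start j.

    contacts : list of (position, amino_acid) keys, one per site nucleotide
    dna      : target sequence over ACGT, at least len(contacts) long
    prefs    : dict key -> {nucleotide: probability}
    """
    scores = []
    for j in range(len(dna) - len(contacts) + 1):
        score = 1.0
        for offset, key in enumerate(contacts):
            score *= prefs[key][dna[j + offset]]
        scores.append(score)
    total = sum(scores)
    return [s / total for s in scores]

def em(pairs, n_iter=20, pseudocount=0.1):
    """Estimate P(nucleotide | position, amino acid) from unaligned pairs."""
    keys = {key for contacts, _ in pairs for key in contacts}
    # Uniform start; a real run would try several initialisations because
    # EM only finds a local optimum.
    prefs = {key: {n: 0.25 for n in NUCS} for key in keys}
    for _ in range(n_iter):
        counts = {key: {n: pseudocount for n in NUCS} for key in keys}
        # E-step: expected nucleotide counts under the current posteriors.
        for contacts, dna in pairs:
            posterior = start_posteriors(contacts, dna, prefs)
            for j, weight in enumerate(posterior):
                for offset, key in enumerate(contacts):
                    counts[key][dna[j + offset]] += weight
        # M-step: renormalise the expected counts into probabilities.
        for key in keys:
            row_total = sum(counts[key].values())
            prefs[key] = {n: counts[key][n] / row_total for n in NUCS}
    return prefs

# Toy input: two hypothetical proteins with three contacts each.
toy_pairs = [
    ([("6", "R"), ("3", "E"), ("-1", "R")], "TTGCGTT"),
    ([("6", "R"), ("3", "H"), ("-1", "R")], "AGGGTA"),
]
learned = em(toy_pairs)
print(learned[("6", "R")])
```

Because the preference table is shared across all pairs, evidence from one protein's targets sharpens the binding-location estimate for the others, which is the point of the family-wise approach.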
48
Estimate DNA-recognition preferences
Figure 2. Estimating DNA-Recognition Preferences. The DNA-recognition preferences are estimated from unaligned pairs of transcription factors and their DNA targets [2] (above). The EM algorithm [13] is used to simultaneously assess the exact binding positions of each protein–DNA pair (bottom right), and to estimate four sets of position-specific DNA-recognition preferences (bottom left). Kaplan et al., PLoS Comput Biol, 2005
49
Apply on TFs from the same family
Figure 3. Predicting the DNA Binding Site Motifs of Novel Transcription Factors. The protein’s DNA-binding domains are identified using the Cys2His2 conserved pattern (top left). The residues at the key positions (6, 3, 2 and -1) of each finger (marked in red in the bottom left panel) are then assigned onto the canonical binding model (bottom right), and the sets of position-specific DNA-recognition preferences (top right panel) are used to construct a probabilistic model of the DNA binding site. For example, the lysine at the sixth position of the third finger faces the first position of the binding site (dotted blue arrow). We predict the nucleotide probabilities at this position using the appropriate recognition preferences (dotted black arrow). Kaplan et al., PLoS Comput Biol, 2005
50
Evaluation compatible with experimental results
10-fold cross-validation
Genome-wide scan of Drosophila melanogaster
  29 canonical Cys2His2 TFs
GO enrichment of predicted target genes
  21 enriched with at least one GO term
mRNA expression profiles of target genes
  21 showed significant associations in at least one embryogenesis experiment
51
Compare with other preferences
Kaplan. et al., PLoS Comput Biol, 2005
52
Summary Family-wise approach
Combine structure information with sequence data Learn context-specific AA-NT recognition preferences Predict binding preferences of new protein Identify TFBSs and target genes
53
Discussion Tradeoff between complexity and accuracy
Canonical model Extension to other DNA-Binding domain Restrictions: enough binding-data, common and strict binding model… Provide a promising way to predict target genes of novel proteins and to understand their function and activity
54
Thank you! Any questions?