Published by Juliet Powell. Modified over 8 years ago.
1
Computational Genomics and Proteomics
Lecture 8: Motif Discovery
Centre for Integrative Bioinformatics VU
2
Outline
  Gene Regulation: DNA; Transcription factors
  Motifs: What are they?; Binding Sites
  Combinatoric Approaches: Exhaustive searches; Consensus
  Comparative Genomics: Example
  Probabilistic Approaches: Statistics; EM algorithm; Gibbs Sampling
3
www.accessexcellence.org
6
Four DNA nucleotide building blocks. G-C is more strongly hydrogen-bonded than A-T (three hydrogen bonds versus two).
7
Degenerate code
  Four bases: A, C, G, T
  Two-fold degenerate IUB codes: R=[AG] (purines), Y=[CT] (pyrimidines), K=[GT], M=[AC], S=[GC], W=[AT]
  Four-fold degenerate: N=[AGCT]
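The degenerate codes above map directly onto regular-expression character classes. The following is a minimal sketch of that idea; the names `IUB_CODES`, `degenerate_to_regex` and `find_sites` are illustrative and not part of the lecture, and only the codes listed on the slide are covered.

```python
import re

# IUB codes listed on the slide, written as regex character classes.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",    # purines
    "Y": "[CT]",    # pyrimidines
    "K": "[GT]",
    "M": "[AC]",
    "S": "[GC]",
    "W": "[AT]",
    "N": "[ACGT]",  # any base
}


def degenerate_to_regex(pattern):
    """Translate a degenerate DNA pattern into a regular expression string."""
    return "".join(IUB_CODES[symbol] for symbol in pattern.upper())


def find_sites(pattern, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead matches without consuming characters, so overlaps are reported.
    regex = re.compile("(?=" + degenerate_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence.upper())]
```

For example, `find_sites("TATRNT", some_promoter)` would list every position where a site compatible with the degenerate pattern TATRNT starts (`some_promoter` being any DNA string).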
8
Transcription Factors
  Required but not a part of the RNA polymerase complex
  Many different roles in gene regulation: binding, interaction, initiation, enhancing, repressing
  Various structural classes (e.g. zinc finger domains)
  Consist of both a DNA-binding domain and an interactive domain
9
Motifs
  Short sequences of DNA or RNA (or amino acids)
  Often consist of 5-16 nucleotides
  May contain gaps
  Examples include: splice sites, start/stop codons, transmembrane domains, centromeres, phosphorylation sites, coiled-coil domains, transcription factor binding sites (TFBS, i.e. regulatory motifs)
10
TFBSs
  Difficult to identify
  Each transcription factor may have more than one binding site
  Degenerate
  Most occur upstream of the transcription start site (TSS) but are known to also occur in introns, exons and 3’ UTRs
  Usually occur in clusters, i.e. collections of sites within a region (modules)
  Often repeated
  Sites can be experimentally verified
11
Why are TFBSs important?
  Aid in identification of gene networks/pathways
  Determine correct network structure
  Drug discovery
  Switch production of gene product on/off
[Figure: Gene A, Gene B]
12
Consensus sequences
  A consensus sequence matches all of the example sequences closely but not exactly
  A single site: TACGAT
  A set of sites: TACGAT, TATAAT, GATACT, TATGAT, TATGTT
  Consensus sequence: TATAAT or TATRNT
  Trade-off: the number of mismatches allowed and the ambiguity in the consensus sequence versus the sensitivity and precision of the representation
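The trade-off can be made concrete by counting how many of the five example sites a consensus recovers as the number of allowed mismatches grows. This is a small sketch using the slide's sites and its consensus TATAAT; the helper names `hamming` and `sites_within` are illustrative.

```python
SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
CONSENSUS = "TATAAT"


def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def sites_within(consensus, sites, max_mismatches):
    """Sites that match the consensus with at most max_mismatches differences."""
    return [s for s in sites if hamming(consensus, s) <= max_mismatches]


for k in range(3):
    # Recovered sites per mismatch threshold: 1 site for k=0, 2 for k=1, 5 for k=2.
    print(k, sites_within(CONSENSUS, SITES, k))
```

Allowing two mismatches recovers every known site, but a looser threshold also matches more unrelated words in a real promoter, which is the loss of precision the slide refers to.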
13
Information Content and Entropy
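Assuming the usual Shannon definitions (entropy H = -Σ f_b log2 f_b per column, and information content 2 - H bits for DNA with a uniform background and no small-sample correction), the per-position information content of a set of aligned sites can be sketched as follows. The example sites are the ones used elsewhere in this lecture; the function names are illustrative.

```python
import math
from collections import Counter

SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]


def column_entropy(column):
    """Shannon entropy (in bits) of the letters observed in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_content(sites):
    """Information content per position: 2 bits minus the column entropy."""
    # zip(*sites) walks over the alignment column by column.
    return [2.0 - column_entropy(column) for column in zip(*sites)]


for position, bits in enumerate(information_content(SITES), start=1):
    print(position, round(bits, 3))
```

Fully conserved columns (positions 2 and 6 in these sites) reach the maximum of 2 bits, which is what a sequence logo draws as the tallest stacks.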
14
Sequence Logos
15
Frequency Matrices
  Given a collection of motifs: TACGAT, TATAAT, GATACT, TATGAT, TATGTT
  Create the matrix, with one row per base (T, A, C, G) and one column per motif position
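The matrix for the five sites can be computed directly by counting each base at each position. The sketch below keeps the row order T, A, C, G; the function names `count_matrix` and `frequency_matrix` are illustrative.

```python
SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
BASES = "TACG"  # row order assumed for the matrix


def count_matrix(sites, bases=BASES):
    """Count how often each base occurs at each motif position."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    return {b: [sum(site[i] == b for site in sites) for i in range(width)]
            for b in bases}


def frequency_matrix(sites, bases=BASES):
    """Convert the counts to relative frequencies (each column sums to 1)."""
    n = len(sites)
    return {b: [c / n for c in row] for b, row in count_matrix(sites, bases).items()}


for base, row in count_matrix(SITES).items():
    print(base, row)
```

Dividing each count by the number of sites (five here) gives the frequency matrix, whose columns each sum to 1.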
16
Position weight matrices
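A common way to build a position weight matrix is to take log-odds scores of the observed base frequencies against a background model. The sketch below assumes a uniform background (0.25 per base) and a pseudocount of 1 per base to avoid log(0); both choices, and all function names, are assumptions made for illustration.

```python
import math

SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
BASES = "ACGT"


def position_weight_matrix(sites, pseudocount=1.0, background=None):
    """Log2-odds score for each base at each position (list of dicts)."""
    background = background or {b: 0.25 for b in BASES}
    n = len(sites)
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        scores = {}
        for b in BASES:
            freq = (column.count(b) + pseudocount) / (n + pseudocount * len(BASES))
            scores[b] = math.log2(freq / background[b])
        pwm.append(scores)
    return pwm


def score(pwm, word):
    """Sum of the per-position scores of a word of motif width."""
    return sum(column[ch] for column, ch in zip(pwm, word))


def best_hit(pwm, sequence):
    """Start position of the highest-scoring window in a sequence."""
    width = len(pwm)
    starts = range(len(sequence) - width + 1)
    return max(starts, key=lambda i: score(pwm, sequence[i:i + width]))
```

Scanning a new promoter then amounts to calling `score` on every window and keeping those above a chosen threshold; `best_hit` returns only the single best window.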
17
Finding Motifs
Two problems:
  Given a collection of known motifs, develop a representation of the motifs such that additional occurrences can reliably be identified in new promoter regions.
  Given a collection of genes thought to be related somehow, find the location of the motif common to all and a representation for it.
Two approaches:
  Combinatorial
  Probabilistic
18
Combinatorial Approach
19
Exhaustive Search
20
Sample-driven here refers to trying all the words as they occur in the sequences, instead of trying all possible (4^W) words exhaustively.
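The difference between the two candidate sets can be sketched as follows. The scoring used here (number of input sequences containing an exact copy of the word) is a deliberately simple stand-in chosen for illustration, and the function names are hypothetical.

```python
from itertools import product


def all_words(width, alphabet="ACGT"):
    """Exhaustive candidate set: every one of the 4^W possible words."""
    return ("".join(letters) for letters in product(alphabet, repeat=width))


def sample_words(sequences, width):
    """Sample-driven candidate set: only words that occur in the input."""
    return {seq[i:i + width]
            for seq in sequences
            for i in range(len(seq) - width + 1)}


def best_word(sequences, candidates):
    """Candidate contained (exactly) in the largest number of sequences."""
    def support(word):
        return sum(word in seq for seq in sequences)
    # Sorting makes the choice deterministic when several words tie.
    return max(sorted(candidates), key=support)


sequences = ["GGTATAATGG", "CTATAATC", "TATAATAA"]
print(len(list(all_words(6))), len(sample_words(sequences, 6)))
```

For W = 6 the exhaustive set already holds 4096 words, while the sample-driven set is bounded by the number of windows in the input, which is why the sample-driven strategy scales better as W grows.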
21
Greedy Motif Clustering
24
Comparative Genomics
Main idea: conserved non-coding regions are important
  Align the promoters of orthologous co-expressed genes from two (or more) species, e.g. human and mouse
  Search for TFBSs only in conserved regions
Problems:
  Not all regulatory regions are conserved
  Which genomes to use?
25
Phylogenetic Footprinting
Phylogenetic footprinting refers to the task of finding conserved motifs across different species. Common ancestry and selection on these motifs have resulted in these “footprints”.
26
Phylogenetic Footprinting: An Example (Xie et al. 2005)
  Genome-wide alignments for four species (human, mouse, rat, dog)
  Promoter regions and 3’ UTRs then extracted for 17,700 well-annotated genes
  Promoter region taken to be (-2000, 2000) relative to the transcription start site
  This set of sequences then searched exhaustively for motifs
Nature 434, 338-345, 2005
27
The Search Xie et al. 2005
28
Expected Rate
29
Probabilistic Approach
30
Gibbs Sampling (applied to Motif Finding)
31
Gibbs Sampling Algorithm
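A Gibbs site sampler for motif finding can be sketched as below. This is a simplified illustration, not the algorithm as given on the slide or as implemented in AlignACE: it assumes exactly one motif occurrence per sequence, a fixed width, a uniform background (so the background term cancels from the sampling weights) and a pseudocount of 1. The function name and parameters are hypothetical.

```python
import random


def gibbs_motif_search(sequences, width, iterations=1000, pseudocount=1.0, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    bases = "ACGT"
    # Start from a random motif position in every sequence.
    positions = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        # Hold one sequence out and build a profile from the sites in the others.
        held_out = rng.randrange(len(sequences))
        others = [s[p:p + width]
                  for k, (s, p) in enumerate(zip(sequences, positions))
                  if k != held_out]
        total = len(others) + pseudocount * len(bases)
        profile = []
        for i in range(width):
            column = [site[i] for site in others]
            profile.append({b: (column.count(b) + pseudocount) / total for b in bases})
        # Weight every candidate start in the held-out sequence by the
        # probability of its window under the current profile.
        seq = sequences[held_out]
        weights = []
        for start in range(len(seq) - width + 1):
            w = 1.0
            for i in range(width):
                w *= profile[i][seq[start + i]]
            weights.append(w)
        # Sample the new position in proportion to the weights (not the argmax).
        positions[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions
```

Because positions are sampled rather than maximised, the result depends on the random seed; in practice the sampler is run from several starting points and the best-scoring alignment is kept.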
32
Gibbs Sampling – Motif Positions
33
AlignACE - Gibbs Sampling
34
Remainder of the lecture: Maximum likelihood and the EM algorithm
The remaining slides are for your information only and will not be part of the exam.
35
Basic Statistics
36
Maximum Likelihood Estimates
37
EM Algorithm
38
Basic idea (MEME) http://meme.nbcr.net/meme/meme-intro.html
39
Basic idea (MEME)
MEME is a tool for discovering motifs in a group of related DNA or protein sequences. A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences. MEME represents motifs as position-dependent letter-probability matrices, which describe the probability of each possible letter at each position in the pattern. Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs. MEME takes as input a group of DNA or protein sequences (the training set) and outputs as many motifs as requested. MEME uses statistical modeling techniques to automatically choose the best width, number of occurrences, and description for each motif.
http://meme.nbcr.net/meme/meme-intro.html
40
Basic MEME Model
41
MEME Background frequencies
42
MEME – Hidden Variable
43
MEME – Conditional Likelihood
44
EM algorithm
45
Example
46
E-step of EM algorithm
47
Example
48
M-step of EM Algorithm
49
Example
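The E-step and M-step can be sketched for a simplified model with exactly one motif occurrence per sequence. This is an illustration of the EM idea only, not MEME's actual implementation: the width is fixed, the background frequencies are estimated once from the data and not updated, and the starting point is a seed word supplied by the caller. All names and default values are assumptions.

```python
def em_motif_search(sequences, width, start_word, iterations=50, pseudocount=0.1):
    """Return (motif, z): letter-probability matrix and position posteriors."""
    bases = "ACGT"
    # Background frequencies from all letters in the data (kept fixed).
    letters = "".join(sequences)
    background = {b: (letters.count(b) + 1) / (len(letters) + 4) for b in bases}
    # Initial motif model: most of the probability on the seed word's letters.
    motif = [{b: (0.7 if b == ch else 0.1) for b in bases} for ch in start_word]
    all_z = []
    for _ in range(iterations):
        expected = [{b: pseudocount for b in bases} for _ in range(width)]
        all_z = []
        for seq in sequences:
            # E-step: posterior probability that the motif starts at each position.
            ratios = []
            for start in range(len(seq) - width + 1):
                ratio = 1.0
                for i in range(width):
                    ch = seq[start + i]
                    ratio *= motif[i][ch] / background[ch]
                ratios.append(ratio)
            total = sum(ratios)
            z = [r / total for r in ratios]
            all_z.append(z)
            # Accumulate expected letter counts, weighted by the posteriors.
            for start, weight in enumerate(z):
                for i in range(width):
                    expected[i][seq[start + i]] += weight
        # M-step: re-estimate the motif matrix from the expected counts.
        motif = [{b: column[b] / sum(column.values()) for b in bases}
                 for column in expected]
    return motif, all_z
```

Unlike the Gibbs sampler, every iteration here is deterministic: each candidate position contributes fractionally to the counts instead of one position being sampled, so the outcome depends only on the starting point.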
50
Characteristics of EM
51
Gibbs Sampling (versus EM)