Published by Tyler Perkins. Modified over 9 years ago.
1
Recognition of regulatory signals
Mikhail S. Gelfand
IntegratedGenomics-Moscow
NATO ASI School, October 2001
2
Why?
Additional annotation tool (e.g. specificity of transporters and enzymes from large families)
Important for practice (in addition to metabolic reconstruction)
Interesting from the evolutionary point of view
3
Overview
0. Biological introduction
1. Algorithms
–Representation of signals
–Deriving the signal
–Site recognition
2. Comparative genomics
–Phylogenetic footprinting
–Consistency filtering
4
Some biology
Transcription (DNA → RNA)
Splicing (pre-mRNA → mRNA)
Translation (mRNA → protein)
Regulation of transcription in prokaryotes … and eukaryotes
Initiation of translation
5
Transcription and translation in prokaryotes
6
Initiation of transcription (bacteria)
7
Translation in prokaryotes
8
Translation (details)
9
Splicing (eukaryotes)
10
Regulation of transcription in prokaryotes
11
Structure of DNA-binding domain. Example 1
12
Structure of DNA-binding domain. Example 2
13
Protein-DNA interactions
14
Regulation of transcription in eukaryotes
15
Representation of signals
Consensus
Pattern (consensus with degenerate positions)
Positional weight matrix (PWM, or profile)
Logical rules
RNA signals
16
Consensus
codB       CCCACGAAAACGATTGCTTTTT
purE       GCCACGCAACCGTTTTCCTTGC
pyrD       GTTCGGAAAACGTTTGCGTTTT
purT       CACACGCAAACGTTTTCGTTTA
cvpA       CCTACGCAAACGTTTTCTTTTT
purC       GATACGCAAACGTGTGCGTCTG
purM       GTCTCGCAAACGTTTGCTTTCC
purH       GTTGCGCAAACGTTTTCGTTAC
purL       TCTACGCAAACGGTTTCGTCGG
consensus     ACGCAAACGTTTTCGT
17
Pattern
codB       CCCACGAAAACGATTGCTTTTT
purE       GCCACGCAACCGTTTTCCTTGC
pyrD       GTTCGGAAAACGTTTGCGTTTT
purT       CACACGCAAACGTTTTCGTTTA
cvpA       CCTACGCAAACGTTTTCTTTTT
purC       GATACGCAAACGTGTGCGTCTG
purM       GTCTCGCAAACGTTTGCTTTCC
purH       GTTGCGCAAACGTTTTCGTTAC
purL       TCTACGCAAACGGTTTCGTCGG
consensus     ACGCAAACGTTTTCGT
pattern       aCGmAAACGtTTkCkT
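The consensus and the degenerate pattern on this slide can be recomputed from the nine aligned sites. The sketch below is only an illustration: the 3-bp offset of the 16-bp core inside each 22-bp row, and the thresholds in `pattern` (a base counts as "present" when it occurs at least twice; a single present base is upper case when it occurs at least 8 times out of 9), are assumptions chosen because they reproduce the slide, not rules stated in the source.

```python
from collections import Counter

# Aligned sites as listed on the slide (22 bp each).
SITES = {
    "codB": "CCCACGAAAACGATTGCTTTTT",
    "purE": "GCCACGCAACCGTTTTCCTTGC",
    "pyrD": "GTTCGGAAAACGTTTGCGTTTT",
    "purT": "CACACGCAAACGTTTTCGTTTA",
    "cvpA": "CCTACGCAAACGTTTTCTTTTT",
    "purC": "GATACGCAAACGTGTGCGTCTG",
    "purM": "GTCTCGCAAACGTTTGCTTTCC",
    "purH": "GTTGCGCAAACGTTTTCGTTAC",
    "purL": "TCTACGCAAACGGTTTCGTCGG",
}
CORE = slice(3, 19)  # assumed position of the 16-bp core within each row
cores = [seq[CORE] for seq in SITES.values()]

# IUPAC codes for two-base ambiguities (lower case, as on the slide).
IUPAC2 = {"AC": "m", "AG": "r", "AT": "w", "CG": "s", "CT": "y", "GT": "k"}


def columns(seqs):
    """Nucleotide counts for each alignment column."""
    return [Counter(col) for col in zip(*seqs)]


def consensus(seqs):
    """Most frequent nucleotide in each column."""
    return "".join(c.most_common(1)[0][0] for c in columns(seqs))


def pattern(seqs, strong=8, present=2):
    """Degenerate pattern; both thresholds are illustrative assumptions."""
    out = []
    for counts in columns(seqs):
        kept = sorted(b for b, n in counts.items() if n >= present)
        top, n_top = counts.most_common(1)[0]
        if len(kept) == 1:
            out.append(top if n_top >= strong else top.lower())
        elif len(kept) == 2:
            out.append(IUPAC2["".join(kept)])
        else:
            out.append("n")  # no informative base, or three or more bases
    return "".join(out)


print(consensus(cores))
print(pattern(cores))
```

With other thresholds the same columns would yield a different pattern, which is why a pattern is a lossier description of the sites than a frequency matrix.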
18
Frequency matrix
Information content: $I = \sum_j \sum_b f(b,j) \log \frac{f(b,j)}{p(b)}$
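A minimal sketch of the information-content calculation, assuming equal-length aligned sites, a uniform background p(b) = 0.25 unless one is supplied, and base-2 logarithms so that the result is in bits. The function name and arguments are hypothetical. Terms with f(b,j) = 0 are skipped, which matches the usual convention 0·log 0 = 0.

```python
import math
from collections import Counter


def information_content(sites, background=None):
    """Per-position information content (bits) of aligned sites."""
    background = background or {b: 0.25 for b in "ACGT"}
    n = len(sites)
    per_position = []
    for column in zip(*sites):
        info = 0.0
        for base, count in Counter(column).items():
            f = count / n  # observed frequency f(b, j)
            info += f * math.log2(f / background[base])
        per_position.append(info)
    return per_position


# A fully conserved column carries 2 bits; a uniform column carries 0.
print(information_content(["ACGT", "ACGT", "ACGA"]))
```

The total information content I of the signal is the sum of the per-position values.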
19
Sequence logo
20
Positional weight matrix (PWM)
21
Probabilistic motivation: log-likelihood (up to a linear transformation)
More probabilistic motivation: z-score (with the suitable base of the logarithm)
Thermodynamic motivation: free energy (assuming independence of positions, up to a linear transformation)
Pseudocounts
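The log-likelihood motivation can be written out as follows. This is a sketch under the usual assumptions (independent positions, positional frequencies f(b,j), background frequencies p(b)); the pseudocount symbol c and the total count n are notation introduced here, not taken from the slides.

```latex
\log \frac{P(b_1 \dots b_k \mid \text{site})}{P(b_1 \dots b_k \mid \text{background})}
  = \sum_{j=1}^{k} \log \frac{f(b_j, j)}{p(b_j)},
\qquad
f(b,j) \approx \frac{N(b,j) + c}{n + 4c}
```

Here N(b,j) is the number of training sites with nucleotide b at position j and n is the number of sites. The pseudocount c > 0 keeps the weight finite when a nucleotide was never observed at a position.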
22
Logical rules, trees etc.
23
Compilation of samples
Initial sample:
–GenBank
–specialized databases
–literature (reviews)
–literature (original papers)
Correction of GenBank errors
Checking the literature, removal of predicted sites
Removal of duplicates
24
Re-alignment approaches
Initial alignment by a biological landmark
–start of transcription for promoters
–start codon for ribosome binding sites
–exon-intron boundary for splicing sites
Deriving the signal within a sliding window
Re-alignment
etc. etc. until convergence
25
Gene starts of Bacillus subtilis
dnaN   ACATTATCCGTTAGGAGGATAAAAATG
gyrA   GTGATACTTCAGGGAGGTTTTTTAATG
serS   TCAATAAAAAAAGGAGTGTTTCGCATG
bofA   CAAGCGAAGGAGATGAGAAGATTCATG
csfB   GCTAACTGTACGGAGGTGGAGAAGATG
xpaC   ATAGACACAGGAGTCGATTATCTCATG
metS   ACATTCTGATTAGGAGGTTTCAAGATG
gcaD   AAAAGGGATATTGGAGGCCAATAAATG
spoVC  TATGTGACTAAGGGAGGATTCGCCATG
ftsH   GCTTACTGTGGGAGGAGGTAAGGAATG
pabB   AAAGAAAATAGAGGAATGATACAAATG
rplJ   CAAGAATCTACAGGAGGTGTAACCATG
tufA   AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ   TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA   CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM   AGATCATTTAGGAGGGGAAATTCAATG
26
dnaN   ACATTATCCGTTAGGAGGATAAAAATG
gyrA   GTGATACTTCAGGGAGGTTTTTTAATG
serS   TCAATAAAAAAAGGAGTGTTTCGCATG
bofA   CAAGCGAAGGAGATGAGAAGATTCATG
csfB   GCTAACTGTACGGAGGTGGAGAAGATG
xpaC   ATAGACACAGGAGTCGATTATCTCATG
metS   ACATTCTGATTAGGAGGTTTCAAGATG
gcaD   AAAAGGGATATTGGAGGCCAATAAATG
spoVC  TATGTGACTAAGGGAGGATTCGCCATG
ftsH   GCTTACTGTGGGAGGAGGTAAGGAATG
pabB   AAAGAAAATAGAGGAATGATACAAATG
rplJ   CAAGAATCTACAGGAGGTGTAACCATG
tufA   AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ   TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA   CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM   AGATCATTTAGGAGGGGAAATTCAATG
cons.  aaagtatataagggagggttaataATG
num.   001000000000110110000000111
       760666658967228106888659666
27
dnaN   ACATTATCCGTTAGGAGGATAAAAATG
gyrA   GTGATACTTCAGGGAGGTTTTTTAATG
serS   TCAATAAAAAAAGGAGTGTTTCGCATG
bofA   CAAGCGAAGGAGATGAGAAGATTCATG
csfB   GCTAACTGTACGGAGGTGGAGAAGATG
xpaC   ATAGACACAGGAGTCGATTATCTCATG
metS   ACATTCTGATTAGGAGGTTTCAAGATG
gcaD   AAAAGGGATATTGGAGGCCAATAAATG
spoVC  TATGTGACTAAGGGAGGATTCGCCATG
ftsH   GCTTACTGTGGGAGGAGGTAAGGAATG
pabB   AAAGAAAATAGAGGAATGATACAAATG
rplJ   CAAGAATCTACAGGAGGTGTAACCATG
tufA   AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ   TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA   CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM   AGATCATTTAGGAGGGGAAATTCAATG
cons.  tacataaaggaggtttaaaaat
num.   0000000111111000000001
       5755779156663678679890
28
Positional information content before and after re-alignment
29
Positional nucleotide frequencies after re-alignment (aGGAGG pattern)
30
Enhancement of a weak signal
31
Deriving the signal ab initio
“Discrete” (pattern-driven) approaches: word counting
“Continuous” (profile-driven) approaches: optimization
32
Word counting. Short words
Consider all k-mers
For each k-mer compute the number of sequences containing this k-mer
–(maybe with some mismatches)
Select the most frequent k-mer
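The short-word procedure can be sketched as an exhaustive search over all 4^k words. The function names and the `max_mismatch` parameter are hypothetical; ties are broken in favour of the lexicographically first word, which is an arbitrary choice.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def occurs(word, seq, max_mismatch):
    """True if seq contains word with at most max_mismatch mismatches."""
    k = len(word)
    return any(
        hamming(word, seq[i:i + k]) <= max_mismatch
        for i in range(len(seq) - k + 1)
    )


def most_frequent_kmer(seqs, k, max_mismatch=0):
    """Return the k-mer found in the largest number of sequences."""
    best_word, best_count = None, -1
    for word in map("".join, product("ACGT", repeat=k)):
        count = sum(occurs(word, s, max_mismatch) for s in seqs)
        if count > best_count:
            best_word, best_count = word, count
    return best_word, best_count


seqs = ["TTGGAGGTT", "AAGGAGGCA", "CGGAGGAC"]
print(most_frequent_kmer(seqs, 5))
```

The running time grows as 4^k times the total sequence length, which is why complete search is practical only for short words.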
33
Problem: Complete search is possible only for short words
Assumption: if a long word is over-represented, its subwords are also over-represented
Solution: select a set of over-represented words and combine them into longer words
34
Word counting. Long words
Consider some k-mers
For each k-mer compute the number of sequences containing this k-mer
–(maybe with some mismatches)
Select the most frequent k-mer
35
Problem: what k-tuples to start with?
1st attempt: those actually occurring in the sample.
But: the correct signal (the consensus word) may not be among them.
36
2nd attempt: those actually occurring in the sample and some neighborhood.
But:
–again, the correct signal (the consensus word) may not be among them;
–the size of the neighborhood grows exponentially
37
Graph approach Each k-mer in each sequence corresponds to a vertex. Two k-mers are linked by an arc, if they differ in at most h positions (h<<k). Thus we obtain an n-partite graph (n is the number of sequences). A signal corresponds to a clique (a complete subgraph) – or at least a dense subgraph – with vertices in each part.
38
A simple algorithm
Remove vertices that cannot be extended to complete subgraphs
–that is, do not have arcs to all parts of the graph
Remove pairs that cannot be extended …
–that is, do not form triangles with the third vertex in all parts of the graph
Etc.
(will not work “as is” for dense subgraphs)
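The graph construction and the first pruning step (removing vertices that lack an arc to some other part) can be sketched as below. Vertices are represented as (sequence index, position) pairs; all names are hypothetical, and the later steps (pairs, triples, etc.) are not implemented.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def build_graph(seqs, k, h):
    """n-partite graph: link k-mers from different sequences that
    differ in at most h positions."""
    verts = [(s, i) for s, seq in enumerate(seqs)
             for i in range(len(seq) - k + 1)]

    def word(v):
        return seqs[v[0]][v[1]:v[1] + k]

    adj = {v: set() for v in verts}
    for a in verts:
        for b in verts:
            if a[0] < b[0] and hamming(word(a), word(b)) <= h:
                adj[a].add(b)
                adj[b].add(a)
    return adj


def prune(adj, n_parts):
    """Repeatedly delete vertices that have no arc to some other part."""
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            parts_reached = {u[0] for u in adj[v]}
            if len(parts_reached) < n_parts - 1:
                for u in adj[v]:
                    adj[u].discard(v)
                del adj[v]
                changed = True
    return adj


seqs = ["AAGGAGGAA", "CCGGAGGCC", "TTGGTGGTT"]
survivors = prune(build_graph(seqs, k=5, h=1), n_parts=len(seqs))
print(sorted(survivors))
```

Deleting one vertex can leave a neighbour without an arc to some part, so the loop repeats until nothing changes. The graph construction compares all pairs of k-mers, so it is quadratic in the total sequence length.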
39
Optimization. EM algorithms
Generate an initial set of profiles (e.g. seed with all k-mers)
For each profile
–find the best (highest scoring) representative in each sequence
–update the profile
Iterate until convergence
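The loop described on this slide can be sketched for a single seed word. This is a "hard" (winner-takes-all) variant that keeps only the best site per sequence, rather than a full EM with expected counts; the log-odds scoring with a pseudocount of 0.5 and a uniform background is an assumption made for the illustration.

```python
import math


def build_profile(sites, pseudo=0.5):
    """Log-odds profile against a uniform background, with pseudocounts."""
    n = len(sites)
    profile = []
    for column in zip(*sites):
        profile.append({
            b: math.log((column.count(b) + pseudo) / (n + 4 * pseudo) / 0.25)
            for b in "ACGT"
        })
    return profile


def score(profile, word):
    """Sum of positional weights for one candidate word."""
    return sum(profile[j][b] for j, b in enumerate(word))


def best_site(profile, seq):
    """Highest-scoring window of the profile's length in seq."""
    k = len(profile)
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return max(windows, key=lambda w: score(profile, w))


def refine(seqs, seed, max_iter=100):
    """Iterate: profile -> best site per sequence -> updated profile."""
    sites = [seed]
    for _ in range(max_iter):
        profile = build_profile(sites)
        new_sites = [best_site(profile, s) for s in seqs]
        if new_sites == sites:  # converged: the chosen sites stopped changing
            break
        sites = new_sites
    return sites


seqs = ["TTTGGAGGTTT", "CCAGGAGGCAC", "ATAGGAGGATA"]
print(refine(seqs, "GGAGG"))
```

Running `refine` from every k-mer in the sample and keeping the result with the highest information content corresponds to the "seed with all k-mers" step.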
40
This algorithm converges. However, it cannot leave the basin of attraction. Thus, if the initial approximation is bad, it will converge to nonsense. Solution: stochastic optimization.
41
Simulated annealing
Goal: maximize the information content I
$I = \sum_j \sum_b f(b,j) \log \frac{f(b,j)}{p(b)}$
or any other measure of homogeneity of the sites
42
Let A be the current signal (set of candidate sites), and let I(A) be the corresponding information content.
Let B be a set of sites obtained by randomly choosing a different site in one sequence, and let I(B) be its information content.
If $I(B) \ge I(A)$, B is accepted.
If $I(B) < I(A)$, B is accepted with probability $P = \exp[(I(B) - I(A)) / T]$.
The temperature T decreases exponentially, but slowly; the initial temperature is chosen such that almost all changes are accepted.
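A minimal sketch of this acceptance rule, assuming one site of fixed length k per sequence, a uniform background, and an exponential cooling schedule. The parameter values `t0`, `cooling`, and `steps` are placeholders, not values from the source; in practice the initial temperature would be tuned so that almost all changes are accepted at the start.

```python
import math
import random


def info(sites):
    """Total information content (bits) against a uniform background."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        for b in set(column):
            f = column.count(b) / n
            total += f * math.log2(f / 0.25)
    return total


def anneal(seqs, k, t0=1.0, cooling=0.999, steps=5000, seed=0):
    """Return (information content, site positions) of the final state."""
    rng = random.Random(seed)

    def words(positions):
        return [s[p:p + k] for s, p in zip(seqs, positions)]

    pos = [rng.randrange(len(s) - k + 1) for s in seqs]
    current = info(words(pos))
    t = t0
    for _ in range(steps):
        i = rng.randrange(len(seqs))  # pick one sequence
        trial = pos[:]
        trial[i] = rng.randrange(len(seqs[i]) - k + 1)  # move its site
        cand = info(words(trial))
        # Accept improvements always, deteriorations with prob. exp(dI / T).
        if cand >= current or rng.random() < math.exp((cand - current) / t):
            pos, current = trial, cand
        t *= cooling  # slow exponential cooling
    return current, pos


seqs = ["TTTGGAGGTTT", "CCAGGAGGCAC", "ATAGGAGGATA", "GCGGAGGTCAA"]
print(anneal(seqs, 5))
```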
43
Gibbs sampler
Again, A is a signal (set of sites), and I(A) is its information content.
At each step a new site is selected in one sequence with probability $P \propto \exp[I(A_{\text{new}})]$.
For each candidate site the total time of occupation is computed.
(Note that the signal changes all the time)
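The sampling step can be sketched as follows, assuming one site of length k per sequence and the information content computed against a uniform background. The function names, the sweep schedule (each sequence is resampled once per sweep), and the number of sweeps are assumptions for the illustration.

```python
import math
import random
from collections import Counter


def info(sites):
    """Total information content (bits) against a uniform background."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        for b in set(column):
            f = column.count(b) / n
            total += f * math.log2(f / 0.25)
    return total


def gibbs(seqs, k, sweeps=200, seed=0):
    """Return, per sequence, how often each start position was occupied."""
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - k + 1) for s in seqs]
    occupancy = [Counter() for _ in seqs]
    for _ in range(sweeps):
        for i, s in enumerate(seqs):
            candidates = list(range(len(s) - k + 1))
            weights = []
            for p in candidates:
                trial = pos[:i] + [p] + pos[i + 1:]
                sites = [t[q:q + k] for t, q in zip(seqs, trial)]
                weights.append(math.exp(info(sites)))  # P ~ exp[I(A_new)]
            pos[i] = rng.choices(candidates, weights=weights)[0]
            occupancy[i][pos[i]] += 1
    return occupancy


seqs = ["TTTGGAGGTTT", "CCAGGAGGCAC", "ATAGGAGGATA", "GCGGAGGTCAA"]
for counts in gibbs(seqs, 5):
    print(counts.most_common(1))
```

The occupation counts, rather than the final state, are the output: positions that are occupied most of the time are the predicted sites.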
44
Use of symmetry
DNA-binding factors and their signals
Co-operative homogeneous
–Palindromes
–Repeats
Co-operative non-homogeneous
–Cassettes
Others
RNA signals
45
Recognition: PWM/profiles
The simplest technique: positional nucleotide weights are
$W(b,j) = \ln(N(b,j) + 0.5) - 0.25 \sum_i \ln(N(i,j) + 0.5)$
Score of a candidate site $b_1 \dots b_k$ is the sum of the corresponding positional nucleotide weights:
$S(b_1 \dots b_k) = \sum_{j=1}^{k} W(b_j, j)$
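The two formulas on this slide translate directly into code. The sketch below assumes equal-length training sites over the alphabet ACGT; the function names and the toy sites are hypothetical.

```python
import math


def pwm(sites):
    """W(b,j) = ln(N(b,j) + 0.5) - 0.25 * sum_i ln(N(i,j) + 0.5)."""
    weights = []
    for column in zip(*sites):
        logs = {b: math.log(column.count(b) + 0.5) for b in "ACGT"}
        mean = 0.25 * sum(logs.values())
        weights.append({b: logs[b] - mean for b in "ACGT"})
    return weights


def score(weights, word):
    """S(b_1...b_k) = sum over positions j of W(b_j, j)."""
    return sum(weights[j][b] for j, b in enumerate(word))


def scan(weights, seq):
    """Score every window of seq; returns (position, score) pairs."""
    k = len(weights)
    return [(i, score(weights, seq[i:i + k]))
            for i in range(len(seq) - k + 1)]


sites = ["GGAGG", "GGAGG", "GGTGG", "AGAGG"]  # toy training sample
W = pwm(sites)
print(max(scan(W, "TTTAGGAGGTTT"), key=lambda hit: hit[1]))
```

Subtracting the column mean makes the four weights at each position sum to zero, so positions with no preference contribute nothing to the score.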
46
Distribution of RBS profile scores on sites (green) and non-sites (red)
47
Pattern recognition
Linear discriminant analysis
Logical rules
Syntactic analysis
Context-sensitive grammars
Perceptron
Neural networks
48
Neural networks: architecture
4k input neurons (sensors), each responsible for observing a particular nucleotide at a particular position
OR 2k neurons (one discriminates between purines and pyrimidines, the other between AT and GC)
One or more layers of hidden neurons
One output neuron
49
Each neuron is connected to all neurons of the next layer
Each connection is ascribed a numerical weight
A neuron
–Sums the signals at incoming connections
–Compares the total with the threshold (or transforms it according to a fixed function)
–If the threshold is passed, excites the outgoing connections (resp. sends the modified value)
50
Training:
Sites and non-sites from the training sample are presented one by one.
The output neuron produces the prediction.
The connection weights and thresholds are modified if the prediction is incorrect.
Networks differ by architecture, particulars of the signal processing, the training schedule
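The training scheme can be illustrated with the simplest case: a single output neuron with 4k one-hot inputs and no hidden layer (a perceptron). This is a deliberately reduced sketch, not the multi-layer networks discussed on the slides; the learning rate, epoch limit, and toy training words are assumptions.

```python
def encode(word):
    """4k binary inputs: one sensor per nucleotide per position."""
    x = []
    for base in word:
        x.extend(1.0 if base == b else 0.0 for b in "ACGT")
    return x


def predict(w, bias, x):
    """Fire (1) if the weighted input sum passes the threshold."""
    total = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1 if total > 0 else 0


def train(sites, non_sites, epochs=100, rate=1.0):
    """Present samples one by one; change weights only after an error."""
    samples = [(encode(s), 1) for s in sites]
    samples += [(encode(s), 0) for s in non_sites]
    w = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, label in samples:
            delta = label - predict(w, bias, x)
            if delta:  # incorrect prediction: adjust weights and threshold
                errors += 1
                w = [wi + rate * delta * xi for wi, xi in zip(w, x)]
                bias += rate * delta
        if errors == 0:
            break
    return w, bias


sites = ["GGAGG", "GGTGG", "GGAGA"]
non_sites = ["TTTTT", "ACACA", "CCCCC"]
w, bias = train(sites, non_sites)
print([predict(w, bias, encode(s)) for s in sites + non_sites])
```

A network with hidden layers is trained on the same presentation scheme, but needs a rule such as backpropagation to distribute the correction across layers.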
51
Use of sequence context
Presence of multiple co-operative sites
–ArgR (E. coli), purine regulator (Pyrococcus)
–XylR+CRP; CytR+CRP (E. coli)
–MEF+MyoD in muscle-specific promoters (mammals)
Location relative to promoters
–repressors vs. activators
52
Benchmarking
Difficult, because:
Different algorithms are optimized for different performance parameters
Incompatible training sets
Difficult to construct a homogeneous and unambiguous testing set:
–Unobserved sites
–Competition between closely located sites
–Activation in specific conditions
–Non-specific binding (52 out of 54 candidate HNF-1 binding sites do bind the factor)
53
Promoters of E. coli
PWM at false positive rate 1 per 2000 bp:
–25% of all promoters,
–60% of constitutive (non-activated) promoters
PWMs perform as well as neural networks
54
Eukaryotic promoters
55
Ribosome binding sites
Information content of the profile predicts the average reliability of predictions
56
CRP (E. coli)
57
Comparative approach to the analysis of regulation
Making good predictions with bad rules
58
Regulation of transcription in prokaryotes
Difficult:
Small sample size
Weak signals (or we do not know what features are relevant, maybe the DNA structure)
59
CRP (E. coli)
60
GenBank entry for the E. coli genome
61
Many genomes are available => comparative approach
Basic assumption: regulons (sets of co-regulated genes) are conserved well
…in some cases
in fact, in many cases
62
Corollary: The consistency check
True sites occur upstream of orthologous genes
False sites are scattered at random
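The consistency check can be sketched as a simple filter, assuming the candidate sites have already been mapped to downstream genes and that an orthology table is available. The data layout, the function name, the `min_genomes` threshold, and all genome and gene identifiers below are hypothetical.

```python
def consistent_groups(hits, orthologs, min_genomes=2):
    """Keep ortholog groups with candidate sites in enough genomes.

    hits:      {genome: set of genes with a candidate site upstream}
    orthologs: {group name: {genome: gene}}
    """
    result = {}
    for group, members in orthologs.items():
        supporting = [
            genome for genome, gene in members.items()
            if gene in hits.get(genome, set())
        ]
        if len(supporting) >= min_genomes:
            result[group] = supporting
    return result


# Made-up example: one group is supported in two genomes, one in only one.
hits = {
    "genome1": {"purA", "xyz"},
    "genome2": {"purA_2"},
    "genome3": {"abc"},
}
orthologs = {
    "purA": {"genome1": "purA", "genome2": "purA_2", "genome3": "purA_3"},
    "xyz": {"genome1": "xyz", "genome2": "xyz_2"},
}
print(consistent_groups(hits, orthologs))
```

A false site upstream of `xyz` in one genome is discarded because its orthologs carry no site, which is how weak recognition rules can still give reliable predictions.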
63
Orthologs
Orthologous genes:
–diverged by speciation
–retain cellular role
Paralogous genes:
–diverged by duplication
–retain biochemical function only
64
Orthology (definition)
Genomes are shown as black “pipes”
1st event: duplication
2nd event: speciation
Genes of the same color are orthologous
Genes of different color are paralogous
[Figure labels: duplication; A1, B1, A2, B2; Genome 1, Genome 2]
A1 and A2 are orthologs, B1 and B2 are orthologs, all other pairs are paralogs
65
Search for orthologs (fast and dirty)
66
The basic procedure
[Diagram labels: Set of known sites; Profile; Genome 1, Genome 2, …, Genome N]
67
Accounting for the operon structure
68
Checklist
Presence of orthologous transcription factors
Really orthologous (BETs, COGs etc. are not sufficient)
* Conservation of the DNA-binding domain
* Conservation of the core pathway
69
Purine regulons of E. coli and H. influenzae
70
Predicted purine transporters
[Figure: phylogenetic tree with bootstrap values. Leaf labels: YgfO, YicE, UAPA_En, UAPC_En, YgfU, 2635740_Bs, 2635741_Bs, YcdG_Ec, UraA_Hi, UraA_Ec, 2895752_Ef, PyrP_Bc, PyrP_Bs, YjcD_Hi, YjcD, YgfQ, YtiP_Bs, 2239289_Bs, YieG, YicO, Y326_Mj, 2314333_Hp, 2689889_Bb, 2689890_Bb, PbuX_Bs]
71
Changes in the operon structure: more examples
glnK-amtB loci of methanogenic archaebacteria
72
Tryptophan operons
73
Heat shock (HrcA) regulons / CIRCE elements
74
Closely related genomes: Phylogenetic footprinting
Regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions.
75
High conservation
76
Low conservation
77
Degeneration of sites
78
Problems and solutions
Unique members of regulons may be lost: use of additional genomes decreases the number of “orphan” regulon members.
Closely related factors may have similar sites: careful analysis of function and analysis of particular sites is usually sufficient to resolve ambiguities.
Too many genomes and regulons: apply preliminary automated screening.
79
Modification: ubiquitous regulators
Present in many genomes
Only core regulon is conserved
Mode of regulation may vary
Signals may be slightly different
80
Arginine repressor ArgR/AhrC
81
ABC transporters (periplasmic components)
82
Modification: horizontal transfer
Impossible to resolve the orthology relationships: a homologous regulated gene is sufficient for corroboration
Often regulate large loci (several adjacent operons)
Signals are mainly conserved
83
New signals
Select a group of related genomes
In each genome select metabolically related genes
Add possibly co-transcribed genes
Compare upstream regions for each genome independently
Construct profiles
Compare constructed profiles: if similar, then relevant
84
The purine regulon of Pyrococcus spp.
Use functional annotation and COGs to select genes encoding enzymes from the purine pathway: purA, purB, purC, purF, purD, purE, purL-I, purL-II, purT, guaA.
Construct profiles for each genome.
The quality of profiles is weak (< 1 bit/position). However, the profiles are almost identical.
There is no significant similarity of upstream regions (outside sites). Thus the profiles are probably correct.
Low specificity of profiles, thus >300 candidate genes in each genome.
Observation: in upstream regions of all genes from the initial sample the candidate sites occur twice with a 22 bp spacer.
The new rule is absolutely specific: only one additional gene in each genome.
85
[Figure: the phylogenetic tree from the “Predicted purine transporters” slide (same leaf labels and bootstrap values), with additional labels PH, PA, A, PF]
86
Sources
G. Stormo
J. Fickett
W. Miller
I. Dubchak
Yuh et al. (1998)
Tronche et al. (1997)
textbooks
87
Discussions and collaboration
Farid Chetouani (Institut Pasteur)
Eugene Koonin (NCBI)
Yuri Kozlov (Ajinomoto)
Leonid Mirny (Harvard - MIT)
Alexander Mironov (GosNIIGenetika)
Vasily Lyubetsky (Inst. Probl. Inform. Trans.)
Andrey Osterman (IntegratedGenomics)
Danila Perumov (Inst. Nucl. Phys.)
Pavel Pevzner (UC San Diego)
Michael Roytberg (Inst. Math. Probl. Biol.)
88
Collaborators
Andrey A. Mironov
A. B. Rakhmaninova
Vadim Brodyansky
Lyudmila Danilova
Anna Gerasimova
Alexey Kazakov
Ekaterina Kotelnikova
Olga Laikova
Pavel Novichkov
Ekaterina Panina
Elya Permina
Dmitry Ravcheev
Dmitry Rodionov
Natalya Sadovskaya
Alexey Vitreschak