Recognition of regulatory signals Mikhail S. Gelfand IntegratedGenomics-Moscow NATO ASI School, October 2001
Why? Additional annotation tool (e.g. specificity of transporters and enzymes from large families) Important for practice (in addition to metabolic reconstruction) Interesting from the evolutionary point of view
Overview 0. Biological introduction 1. Algorithms Representation of signals Deriving the signal Site recognition 2. Comparative genomics Phylogenetic footprinting Consistency filtering
Some biology Transcription (DNA → RNA) Splicing (pre-mRNA → mRNA) Translation (mRNA → protein) Regulation of transcription in prokaryotes … and eukaryotes Initiation of translation
Transcription and translation in prokaryotes
Initiation of transcription (bacteria)
Translation in prokaryotes
Translation (details)
Splicing (eukaryotes)
Regulation of transcription in prokaryotes
Structure of DNA-binding domain. Example 1
Structure of DNA-binding domain. Example 2
Protein-DNA interactions
Regulation of transcription in eukaryotes
Representation of signals Consensus Pattern (consensus with degenerate positions) Positional weight matrix (PWM, or profile) Logical rules RNA signals
Consensus
codB       CCCACGAAAACGATTGCTTTTT
purE       GCCACGCAACCGTTTTCCTTGC
pyrD       GTTCGGAAAACGTTTGCGTTTT
purT       CACACGCAAACGTTTTCGTTTA
cvpA       CCTACGCAAACGTTTTCTTTTT
purC       GATACGCAAACGTGTGCGTCTG
purM       GTCTCGCAAACGTTTGCTTTCC
purH       GTTGCGCAAACGTTTTCGTTAC
purL       TCTACGCAAACGGTTTCGTCGG
consensus     ACGCAAACGTTTTCGT
Pattern
codB       CCCACGAAAACGATTGCTTTTT
purE       GCCACGCAACCGTTTTCCTTGC
pyrD       GTTCGGAAAACGTTTGCGTTTT
purT       CACACGCAAACGTTTTCGTTTA
cvpA       CCTACGCAAACGTTTTCTTTTT
purC       GATACGCAAACGTGTGCGTCTG
purM       GTCTCGCAAACGTTTGCTTTCC
purH       GTTGCGCAAACGTTTTCGTTAC
purL       TCTACGCAAACGGTTTCGTCGG
consensus     ACGCAAACGTTTTCGT
pattern       aCGmAAACGtTTkCkT
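A pattern with degenerate positions can be matched mechanically. The sketch below assumes that the pattern letters are standard IUPAC codes (m = A or C, k = G or T) and that lower case carries no special meaning for matching; both are assumptions, and the function names are illustrative only.

```python
# IUPAC nucleotide codes (a subset that covers the pattern, plus N).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "S": "CG", "W": "AT", "N": "ACGT",
}


def count_mismatches(pattern, site):
    """Number of positions where `site` is not allowed by `pattern`."""
    if len(pattern) != len(site):
        raise ValueError("pattern and site must have equal length")
    return sum(
        base.upper() not in IUPAC[code.upper()]
        for code, base in zip(pattern, site)
    )


def find_matches(pattern, sequence, max_mismatches=0):
    """Return (offset, window) for every window within the mismatch budget."""
    k = len(pattern)
    hits = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if count_mismatches(pattern, window) <= max_mismatches:
            hits.append((i, window))
    return hits


PATTERN = "aCGmAAACGtTTkCkT"
PURT = "CACACGCAAACGTTTTCGTTTA"   # purT row of the slide
CODB = "CCCACGAAAACGATTGCTTTTT"   # codB row of the slide

print(find_matches(PATTERN, PURT))                    # exact matches only
print(find_matches(PATTERN, CODB, max_mismatches=1))  # one mismatch allowed
```

Allowing a small mismatch budget is one simple way to treat the weakly conserved (lower-case) positions; a weight matrix handles them more gracefully.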
Frequency matrix. Information content: $I = \sum_j \sum_b f(b,j) \log [f(b,j) / p(b)]$
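As an illustration, the frequency matrix and the information content can be computed from a set of aligned sites as follows. This is a minimal sketch that assumes a uniform background p(b) = 0.25 and base-2 logarithms (so the result is in bits); the slide fixes neither choice, and the function names are hypothetical.

```python
import math
from collections import Counter


def frequency_matrix(sites, alphabet="ACGT"):
    """Per-position nucleotide frequencies f(b, j) of equally long sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for j in range(length):
        counts = Counter(s[j] for s in sites)
        matrix.append({b: counts[b] / len(sites) for b in alphabet})
    return matrix


def information_content(sites, background=None):
    """I = sum_j sum_b f(b,j) * log2(f(b,j) / p(b)); 0 * log 0 is taken as 0."""
    if background is None:
        background = {b: 0.25 for b in "ACGT"}  # assumed uniform background
    total = 0.0
    for column in frequency_matrix(sites):
        for b, f in column.items():
            if f > 0:
                total += f * math.log2(f / background[b])
    return total


print(information_content(["ACGT", "ACGT"]))      # fully conserved positions
print(information_content(["A", "C", "G", "T"]))  # uniform column
```

With these assumptions a fully conserved position contributes 2 bits and a position that matches the background contributes 0.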
Sequence logo
Positional weight matrix (PWM)
Probabilistic motivation: log-likelihood (up to a linear transformation) More probabilistic motivation: z-score (with the suitable base of the logarithm) Thermodynamical motivation: free energy (assuming independence of positions, up to a linear transformation) Pseudocounts
Logical rules, trees etc.
Compilation of samples Initial sample: –GenBank –specialized databases –literature (reviews) –literature (original papers) Correction of GenBank errors Checking the literature; removal of predicted sites Removal of duplicates
Re-alignment approaches Initial alignment by a biological landmark –start of transcription for promoters –start codon for ribosome binding sites –exon-intron boundary for splicing sites Deriving the signal within a sliding window Re-alignment etc. etc. until convergence
Gene starts of Bacillus subtilis
dnaN   ACATTATCCGTTAGGAGGATAAAAATG
gyrA   GTGATACTTCAGGGAGGTTTTTTAATG
serS   TCAATAAAAAAAGGAGTGTTTCGCATG
bofA   CAAGCGAAGGAGATGAGAAGATTCATG
csfB   GCTAACTGTACGGAGGTGGAGAAGATG
xpaC   ATAGACACAGGAGTCGATTATCTCATG
metS   ACATTCTGATTAGGAGGTTTCAAGATG
gcaD   AAAAGGGATATTGGAGGCCAATAAATG
spoVC  TATGTGACTAAGGGAGGATTCGCCATG
ftsH   GCTTACTGTGGGAGGAGGTAAGGAATG
pabB   AAAGAAAATAGAGGAATGATACAAATG
rplJ   CAAGAATCTACAGGAGGTGTAACCATG
tufA   AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ   TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA   CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM   AGATCATTTAGGAGGGGAAATTCAATG
dnaN   ACATTATCCGTTAGGAGGATAAAAATG
gyrA   GTGATACTTCAGGGAGGTTTTTTAATG
serS   TCAATAAAAAAAGGAGTGTTTCGCATG
bofA   CAAGCGAAGGAGATGAGAAGATTCATG
csfB   GCTAACTGTACGGAGGTGGAGAAGATG
xpaC   ATAGACACAGGAGTCGATTATCTCATG
metS   ACATTCTGATTAGGAGGTTTCAAGATG
gcaD   AAAAGGGATATTGGAGGCCAATAAATG
spoVC  TATGTGACTAAGGGAGGATTCGCCATG
ftsH   GCTTACTGTGGGAGGAGGTAAGGAATG
pabB   AAAGAAAATAGAGGAATGATACAAATG
rplJ   CAAGAATCTACAGGAGGTGTAACCATG
tufA   AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ   TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA   CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM   AGATCATTTAGGAGGGGAAATTCAATG
cons.  aaagtatataagggagggttaataATG
dnaN   ACATTATCCGTTAGGAGGATAAAAATG
gyrA   GTGATACTTCAGGGAGGTTTTTTAATG
serS   TCAATAAAAAAAGGAGTGTTTCGCATG
bofA   CAAGCGAAGGAGATGAGAAGATTCATG
csfB   GCTAACTGTACGGAGGTGGAGAAGATG
xpaC   ATAGACACAGGAGTCGATTATCTCATG
metS   ACATTCTGATTAGGAGGTTTCAAGATG
gcaD   AAAAGGGATATTGGAGGCCAATAAATG
spoVC  TATGTGACTAAGGGAGGATTCGCCATG
ftsH   GCTTACTGTGGGAGGAGGTAAGGAATG
pabB   AAAGAAAATAGAGGAATGATACAAATG
rplJ   CAAGAATCTACAGGAGGTGTAACCATG
tufA   AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ   TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA   CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM   AGATCATTTAGGAGGGGAAATTCAATG
cons.  tacataaaggaggtttaaaaat
(the per-sequence shifts of the re-alignment are not preserved in this text)
Positional information content before and after re-alignment
Positional nucleotide frequencies after re-alignment (aGGAGG pattern)
Enhancement of a weak signal
Deriving the signal ab initio “Discrete” (pattern-driven) approaches: word counting “Continuous” (profile-driven) approaches: optimization
Word counting. Short words Consider all k-mers For each k-mer compute the number of sequences containing this k-mer –(maybe with some mismatches) Select the most frequent k-mer
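The counting step for short words can be sketched as below. This version counts exact occurrences only (the "with some mismatches" variant is not implemented), and the three input sequences are toy data invented for the illustration.

```python
from collections import Counter


def kmer_sequence_counts(sequences, k):
    """For each k-mer, count the number of sequences that contain it."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each k-mer is counted once per sequence.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def most_frequent_kmers(sequences, k):
    """Return (sorted list of top k-mers, number of sequences containing them)."""
    counts = kmer_sequence_counts(sequences, k)
    if not counts:
        return [], 0
    best = max(counts.values())
    return sorted(w for w, c in counts.items() if c == best), best


toy = ["TTGGAGGAA", "CCGGAGGTT", "AGGAGGCCC"]  # hypothetical upstream regions
print(most_frequent_kmers(toy, 5))
```

Counting sequences rather than raw occurrences keeps a single repetitive sequence from dominating the result.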
Problem: Complete search is possible only for short words Assumption: if a long word is over-represented, its subwords are also over-represented Solution: select a set of over-represented words and combine them into longer words
Word counting. Long words Consider some k-mers For each k-mer compute the number of sequences containing this k-mer –(maybe with some mismatches) Select the most frequent k-mer
Problem: what k-tuples to start with? 1st attempt: those actually occurring in the sample. But: the correct signal (the consensus word) may not be among them.
2nd attempt: those actually occurring in the sample and some neighborhood. But: –again, the correct signal (the consensus word) may not be among them; –the size of the neighborhood grows exponentially
Graph approach Each k-mer in each sequence corresponds to a vertex. Two k-mers are linked by an arc, if they differ in at most h positions (h<<k). Thus we obtain an n-partite graph (n is the number of sequences). A signal corresponds to a clique (a complete subgraph) – or at least a dense subgraph – with vertices in each part.
A simple algorithm Remove vertices that cannot be extended to complete subgraphs –that is, do not have arcs to all parts of the graph Remove pairs that cannot be extended … –that is, do not form triangles with the third vertex in all parts of the graph Etc. (will not work “as is” for dense subgraphs)
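The first pruning step (removing vertices that lack arcs to some part of the graph) might look as follows. This is a sketch of that single step only, repeated until nothing changes; the pair and triangle filters are not implemented, and the toy sequences and function names are assumptions made for the example.

```python
def hamming(a, b):
    """Number of positions at which two equally long words differ."""
    return sum(x != y for x, y in zip(a, b))


def prune_vertices(sequences, k, h):
    """Keep only k-mers linked (at most h mismatches) to every other sequence.

    A vertex is (sequence index, offset). Removal is repeated until stable,
    because deleting one vertex can strip another of its last arc to a part.
    """
    words = {
        (s, i): seq[i:i + k]
        for s, seq in enumerate(sequences)
        for i in range(len(seq) - k + 1)
    }
    alive = set(words)
    n = len(sequences)
    changed = True
    while changed:
        changed = False
        for v in sorted(alive):
            linked_parts = {
                u[0]
                for u in alive
                if u[0] != v[0] and hamming(words[u], words[v]) <= h
            }
            if len(linked_parts) < n - 1:
                alive.remove(v)
                changed = True
    return {v: words[v] for v in sorted(alive)}


toy = ["AAGGAGGAA", "CCGGAGGCC", "TTGGTGGTT"]  # hypothetical sequences
print(prune_vertices(toy, k=5, h=1))
```

The surviving vertices are only candidates: they still have to be checked for forming a clique (or a dense subgraph) across all parts.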
Optimization. EM algorithms Generate an initial set of profiles (e.g. seed with all k-mers) For each profile –find the best (highest scoring) representative in each sequence –update the profile Iterate until convergence
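One pass of this loop for a single seed can be sketched as below. It is a simplified illustration, not a full EM implementation: the profile is a log-odds matrix with an assumed pseudocount of 0.5 and a uniform background, exactly one best site is taken per sequence, and the seed and sequences are toy data.

```python
import math


def build_profile(sites, pseudocount=0.5):
    """Log-odds profile (uniform background assumed) from aligned sites."""
    profile = []
    for j in range(len(sites[0])):
        column = {b: pseudocount for b in "ACGT"}
        for s in sites:
            column[s[j]] += 1
        total = sum(column.values())
        profile.append({b: math.log(column[b] / total / 0.25) for b in "ACGT"})
    return profile


def score(profile, word):
    return sum(col[b] for col, b in zip(profile, word))


def best_site(profile, seq):
    """Highest-scoring window of the profile's length in `seq`."""
    k = len(profile)
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return max(windows, key=lambda w: score(profile, w))


def refine(sequences, seed, max_iter=50):
    """Iterate: pick the best site per sequence, rebuild the profile."""
    profile = build_profile([seed])
    chosen = None
    for _ in range(max_iter):
        new = [best_site(profile, s) for s in sequences]
        if new == chosen:  # converged: the selected sites no longer change
            break
        chosen = new
        profile = build_profile(chosen)
    return chosen


toy = ["TTTTGGAGGTTTT", "CCCGGAGGCCCCC", "AAAAAGGTGGAAA"]  # hypothetical
print(refine(toy, seed="GGAGG"))
```

In practice the loop would be run from many seeds (for example all k-mers of the sample) and the resulting profiles compared by information content.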
This algorithm converges. However, it cannot leave the basin of attraction. Thus, if the initial approximation is bad, it will converge to nonsense. Solution: stochastic optimization.
Simulated annealing Goal: maximize the information content $I = \sum_j \sum_b f(b,j) \log [f(b,j) / p(b)]$ or any other measure of homogeneity of the sites
Let A be the current signal (set of candidate sites), and let I(A) be the corresponding information content. Let B be a set of sites obtained by randomly choosing a different site in one sequence, and let I(B) be its information content. If $I(B) \ge I(A)$, B is accepted; if $I(B) < I(A)$, B is accepted with probability $P = \exp[(I(B) - I(A)) / T]$. The temperature T decreases exponentially, but slowly; the initial temperature is chosen such that almost all changes are accepted.
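The acceptance rule and the cooling schedule can be written down directly. The sketch below covers only these two pieces; the cooling factor of 0.99 is an assumed value (the slide says only "exponentially, but slowly"), and the function names are illustrative.

```python
import math
import random


def acceptance_probability(i_current, i_candidate, temperature):
    """Probability of accepting candidate B given current A at temperature T."""
    if i_candidate >= i_current:
        return 1.0
    return math.exp((i_candidate - i_current) / temperature)


def accept(i_current, i_candidate, temperature, rng=random):
    """Draw the accept/reject decision for one proposed change."""
    return rng.random() < acceptance_probability(i_current, i_candidate, temperature)


def temperatures(t0, cooling=0.99, steps=5):
    """Exponentially decreasing schedule: T, T*c, T*c^2, ... (c is assumed)."""
    t = t0
    for _ in range(steps):
        yield t
        t *= cooling


print(acceptance_probability(5.0, 4.0, 1.0))  # worse candidate, T = 1
print(list(temperatures(10.0, cooling=0.5, steps=3)))
```

At high T nearly every change is accepted; as T falls, moves that lower I become increasingly rare.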
Gibbs sampler Again, A is a signal (set of sites), and I(A) is its information content. At each step a new site is selected in one sequence with probability $P \propto \exp[I(A_{\text{new}})]$. For each candidate site the total time of occupation is computed. (Note that the signal changes all the time)
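The sampling step can be sketched as follows, assuming the information content of the signal has already been computed for every candidate position in the sequence being resampled. Only the draw itself is shown; the bookkeeping of occupation times is omitted, and the names are hypothetical.

```python
import math
import random


def site_probabilities(information):
    """Normalize exp(I) over candidate positions (shifted by max for stability)."""
    top = max(information)
    weights = [math.exp(v - top) for v in information]
    total = sum(weights)
    return [w / total for w in weights]


def sample_site(information, rng=random):
    """Draw one candidate position with probability proportional to exp(I)."""
    probs = site_probabilities(information)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# I(A_new) for three hypothetical candidate positions in one sequence.
print(site_probabilities([0.0, math.log(3.0), 0.0]))
print(sample_site([0.0, math.log(3.0), 0.0], rng=random.Random(0)))
```

Subtracting the maximum before exponentiating does not change the probabilities but avoids overflow for large I.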
Use of symmetry DNA-binding factors and their signals Co-operative homogeneous Palindromes Repeats Co-operative non-homogeneous Cassettes Others RNA signals
Recognition: PWM/profiles The simplest technique: positional nucleotide weights are $W(b,j) = \ln(N(b,j) + 0.5) - 0.25 \sum_i \ln(N(i,j) + 0.5)$ Score of a candidate site $b_1 \dots b_k$ is the sum of the corresponding positional nucleotide weights: $S(b_1 \dots b_k) = \sum_{j=1}^{k} W(b_j, j)$
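The weights and the score translate directly into code. The sketch below follows the two formulas of this slide; the training sites, the scanned sequence and the threshold are toy values chosen for the illustration, and the function names are hypothetical.

```python
import math


def pwm_from_sites(sites):
    """W(b,j) = ln(N(b,j)+0.5) - 0.25 * sum_i ln(N(i,j)+0.5)."""
    pwm = []
    for j in range(len(sites[0])):
        counts = {b: 0 for b in "ACGT"}
        for s in sites:
            counts[s[j]] += 1
        logs = {b: math.log(counts[b] + 0.5) for b in "ACGT"}
        mean = 0.25 * sum(logs.values())
        pwm.append({b: logs[b] - mean for b in "ACGT"})
    return pwm


def score(pwm, site):
    """S(b_1..b_k) = sum over positions j of W(b_j, j)."""
    return sum(col[b] for col, b in zip(pwm, site))


def scan(pwm, sequence, threshold):
    """Return (offset, score) for every window scoring at least `threshold`."""
    k = len(pwm)
    hits = []
    for i in range(len(sequence) - k + 1):
        s = score(pwm, sequence[i:i + k])
        if s >= threshold:
            hits.append((i, s))
    return hits


training = ["AGGAGG", "AGGAGG", "AGGTGG", "TGGAGG"]  # hypothetical sites
pwm = pwm_from_sites(training)
print(scan(pwm, "CCCAGGAGGCCC", threshold=0.0))
```

Subtracting the column mean makes the weights in each column sum to zero, so a random site scores about zero on average under a uniform background.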
Distribution of RBS profile scores on sites (green) and non-sites (red)
Pattern recognition Linear discriminant analysis Logical rules Syntactic analysis Context-sensitive grammars Perceptron Neural networks
Neural networks: architecture 4k input neurons (sensors), each responsible for observing a particular nucleotide at a particular position OR 2k neurons (one discriminates between purines and pyrimidines, the other, between AT and GC) One or more layers of hidden neurons One output neuron
Each neuron is connected to all neurons of the next layer Each connection is ascribed a numerical weight A neuron Sums the signals at incoming connections Compares the total with the threshold (or transforms it according to a fixed function) If the threshold is passed, excites the outgoing connections (resp. sends the modified value)
Training: Sites and non-sites from the training sample are presented one by one. The output neuron produces the prediction. The connection weights and thresholds are modified if the prediction is incorrect. Networks differ by architecture, particulars of the signal processing, the training schedule
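The training scheme can be illustrated with the simplest possible network: a single output neuron connected directly to 4k input sensors (a perceptron, with no hidden layers). This is a deliberately reduced sketch, not the architecture of any particular published predictor; the learning rate, epoch limit and training words are assumptions.

```python
def one_hot(word):
    """4k inputs: one sensor per (position, nucleotide) pair."""
    return [1.0 if b == x else 0.0 for b in word for x in "ACGT"]


def predict(weights, bias, word):
    """Output 1 (site) if the weighted sum passes the threshold, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, one_hot(word)))
    return 1 if total > 0 else 0


def train(sites, non_sites, epochs=100, rate=1.0):
    """Present samples one by one; change weights only after a wrong answer."""
    k = len(sites[0])
    weights = [0.0] * (4 * k)
    bias = 0.0
    samples = [(w, 1) for w in sites] + [(w, 0) for w in non_sites]
    for _ in range(epochs):
        errors = 0
        for word, label in samples:
            err = label - predict(weights, bias, word)
            if err:
                errors += 1
                for idx, x in enumerate(one_hot(word)):
                    weights[idx] += rate * err * x
                bias += rate * err
        if errors == 0:  # every training sample is classified correctly
            break
    return weights, bias


sites = ["GGAGG", "GGAGG", "GGTGG"]               # hypothetical positives
non_sites = ["AAAAA", "CCCCC", "TTTTT", "ACGTA"]  # hypothetical negatives
weights, bias = train(sites, non_sites)
print([predict(weights, bias, w) for w in sites + non_sites])
```

A single neuron can only learn linearly separable classes; hidden layers are what allow a network to capture dependencies between positions.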
Use of sequence context Presence of multiple co-operative sites –ArgR (E. coli), purine regulator (Pyrococcus) –XylR+CRP; CytR+CRP (E. coli) –MEF+MyoD in muscle-specific promoters (mammals) Location relative to promoters –repressors vs. activators
Benchmarking Difficult, because: Different algorithms are optimized for different performance parameters Incompatible training sets Difficult to construct a homogeneous and unambiguous testing set: –Unobserved sites –Competition between closely located sites –Activation in specific conditions –non-specific binding (52 out of 54 candidate HNF-1 binding sites do bind the factor)
Promoters of E. coli PWM at false positive rate 1 per 2000 bp: –25% of all promoters, –60% of constitutive (non-activated) promoters PWMs perform as well as neural networks
Eukaryotic promoters
Ribosome binding sites Information content of the profile predicts the average reliability of predictions
CRP (E. coli)
Comparative approach to the analysis of regulation Making good predictions with bad rules
Regulation of transcription in prokaryotes Difficult: Small sample size Weak signals (or we do not know what features are relevant, maybe the DNA structure)
CRP (E. coli)
GenBank entry for the E. coli genome
Many genomes are available => comparative approach Basic assumption Regulons (sets of co-regulated genes) are well conserved … in some cases (in fact, in many cases)
Corollary: The consistency check True sites occur upstream of orthologous genes False sites are scattered at random
Orthologs Orthologous genes: –diverged by speciation –retain cellular role Paralogous genes: –diverged by duplication –retain biochemical function only
Orthology (definition) Genomes are shown as black “pipes” 1st event: duplication 2nd event: speciation Genes of the same color are orthologous Genes of different color are paralogous [Figure labels: duplication; A1, B1 (Genome 1); A2, B2 (Genome 2)] A1 and A2 are orthologs, B1 and B2 are orthologs, all other pairs are paralogs
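The consistency check amounts to counting, for each group of orthologous genes, the number of genomes in which a candidate site is found upstream. The sketch below assumes that site prediction and ortholog assignment have already been done; the genome names, group identifiers and the cutoff of two genomes are all hypothetical.

```python
from collections import Counter


def consistent_groups(hits_by_genome, min_genomes=2):
    """Ortholog groups with candidate sites in at least `min_genomes` genomes.

    hits_by_genome maps a genome name to the ortholog-group identifiers of
    the genes that have a candidate site upstream in that genome.
    """
    counts = Counter()
    for groups in hits_by_genome.values():
        counts.update(set(groups))  # count each group once per genome
    return {g: n for g, n in counts.items() if n >= min_genomes}


# Hypothetical predictions: grp1 and grp2 recur, the rest occur only once.
hits = {
    "genome1": ["grp1", "grp2", "grp7"],
    "genome2": ["grp1", "grp2", "grp9"],
    "genome3": ["grp1", "grp4"],
}
print(consistent_groups(hits))
```

Groups seen in a single genome are not necessarily false; they are just not corroborated and need other evidence.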
Search for orthologs (fast and dirty)
The basic procedure [Diagram labels: Set of known sites; Profile; Genome 1, Genome 2, …, Genome N]
Accounting for the operon structure
Checklist Presence of orthologous transcription factors Really orthologous (BETs, COGs etc. are not sufficient) * Conservation of the DNA-binding domain * Conservation of the core pathway
Purine regulons of E. coli and H. influenzae
Predicted purine transporters [Figure: phylogenetic tree; leaf labels include YgfO, YicE, UAPA_En, UAPC_En, YgfU, YcdG_Ec, UraA_Hi, UraA_Ec, PyrP_Bc, PyrP_Bs, YjcD_Hi, YjcD, YgfQ, YtiP_Bs, YieG, YicO, Y326_Mj, PbuX_Bs; several labels are truncated to species suffixes (_Bs, _Ef, _Hp, _Bb)]
Changes in the operon structure: more examples glnK-amtB loci of methanogenic archaebacteria
Tryptophan operons
Heat shock (HrcA) regulons / CIRCE elements
Closely related genomes: Phylogenetic footprinting Regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions.
High conservation
Low conservation
Degeneration of sites
Problems and solutions Unique members of regulons may be lost: use of additional genomes decreases the number of “orphan” regulon members. Closely related factors may have similar sites: careful analysis of function and analysis of particular sites is usually sufficient to resolve ambiguities. Too many genomes and regulons: apply preliminary automated screening.
Modification: ubiquitous regulators Present in many genomes Only core regulon is conserved Mode of regulation may vary Signals may be slightly different
Arginine repressor ArgR/AhrC
ABC transporters (periplasmic components)
Modification: horizontal transfer Impossible to resolve the orthology relationships: a homologous regulated gene is sufficient for corroboration Often regulate large loci (several adjacent operons) Signals are mainly conserved
New signals Select a group of related genomes In each genome select metabolically related genes Add possibly co-transcribed genes Compare upstream regions for each genome independently Construct profiles Compare constructed profiles: if similar, then relevant
The purine regulon of Pyrococcus spp. Use functional annotation and COGs to select genes encoding enzymes from purine pathway: purA, purB, purC, purF, purD, purE, purL-I, purL-II, purT, guaA. Construct profiles for each genome. The quality of profiles is weak (< 1 bit/position). However, the profiles are almost identical. There is no significant similarity of upstream regions (outside sites). Thus the profiles are probably correct. Low specificity of profiles, thus >300 candidate genes in each genome. Observation: in upstream regions of all genes from the initial sample the candidate sites occur twice with 22 bp spacer. The new rule is absolutely specific: only one additional gene in each genome.
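The paired-site rule can be applied as a filter on the start positions of candidate sites in one upstream region. In this sketch the spacer is taken to be the gap between the end of the first site and the start of the second, which is an assumption (the slide does not say how the 22 bp are measured); the site length of 16 and the positions are hypothetical.

```python
def paired_sites(starts, site_length, spacer=22, tolerance=0):
    """Pairs of candidate sites separated by `spacer` bp (within `tolerance`).

    The gap is measured from the end of the first site to the start of the
    second one (an assumed convention).
    """
    starts = sorted(set(starts))
    pairs = []
    for i, first in enumerate(starts):
        for second in starts[i + 1:]:
            gap = second - (first + site_length)
            if abs(gap - spacer) <= tolerance:
                pairs.append((first, second))
    return pairs


# Hypothetical candidate site starts in one upstream region.
print(paired_sites([10, 48, 100], site_length=16))
```

Requiring two weak sites at a fixed distance is what turns a profile with hundreds of candidates per genome into a specific rule.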
[Figure: the tree of predicted purine transporters with the same leaf labels as before (YgfO, YicE, UAPA_En, UAPC_En, …, PbuX_Bs) and additional labels PH, PA, A, PF]
Sources G. Stormo J. Fickett W. Miller I. Dubchak Yuh et al. (1998) Tronche et al. (1997) textbooks
Discussions and collaboration Farid Chetouani (Institut Pasteur) Eugene Koonin (NCBI) Yuri Kozlov (Ajinomoto) Leonid Mirny (Harvard - MIT) Alexander Mironov (GosNIIGenetika) Vasily Lyubetsky (Inst. Probl. Inform. Trans.) Andrey Osterman (IntegratedGenomics) Danila Perumov (Inst. Nucl. Phys.) Pavel Pevzner (UC San Diego) Michael Roytberg (Inst. Math. Probl. Biol.)
Collaborators Andrey A. Mironov A. B. Rakhmaninova Vadim Brodyansky Lyudmila Danilova Anna Gerasimova Alexey Kazakov Ekaterina Kotelnikova Olga Laikova Pavel Novichkov Ekaterina Panina Elya Permina Dmitry Ravcheev Dmitry Rodionov Natalya Sadovskaya Alexey Vitreschak