Recognition of regulatory signals Mikhail S. Gelfand IntegratedGenomics-Moscow NATO ASI School, October 2001.


Why?
Additional annotation tool (e.g. specificity of transporters and enzymes from large families)
Important for practice (in addition to metabolic reconstruction)
Interesting from the evolutionary point of view

Overview
0. Biological introduction
1. Algorithms
–Representation of signals
–Deriving the signal
–Site recognition
2. Comparative genomics
–Phylogenetic footprinting
–Consistency filtering

Some biology
Transcription (DNA → RNA)
Splicing (pre-mRNA → mRNA)
Translation (mRNA → protein)
Regulation of transcription in prokaryotes … and eukaryotes
Initiation of translation

Transcription and translation in prokaryotes

Initiation of transcription (bacteria)

Translation in prokaryotes

Translation (details)

Splicing (eukaryotes)

Regulation of transcription in prokaryotes

Structure of DNA-binding domain. Example 1

Structure of DNA-binding domain. Example 2

Protein-DNA interactions

Regulation of transcription in eukaryotes

Representation of signals
Consensus
Pattern (consensus with degenerate positions)
Positional weight matrix (PWM, or profile)
Logical rules
RNA signals

Consensus
codB       CCCACGAAAACGATTGCTTTTT
purE       GCCACGCAACCGTTTTCCTTGC
pyrD       GTTCGGAAAACGTTTGCGTTTT
purT       CACACGCAAACGTTTTCGTTTA
cvpA       CCTACGCAAACGTTTTCTTTTT
purC       GATACGCAAACGTGTGCGTCTG
purM       GTCTCGCAAACGTTTGCTTTCC
purH       GTTGCGCAAACGTTTTCGTTAC
purL       TCTACGCAAACGGTTTCGTCGG
consensus     ACGCAAACGTTTTCGT

Pattern
codB       CCCACGAAAACGATTGCTTTTT
purE       GCCACGCAACCGTTTTCCTTGC
pyrD       GTTCGGAAAACGTTTGCGTTTT
purT       CACACGCAAACGTTTTCGTTTA
cvpA       CCTACGCAAACGTTTTCTTTTT
purC       GATACGCAAACGTGTGCGTCTG
purM       GTCTCGCAAACGTTTGCTTTCC
purH       GTTGCGCAAACGTTTTCGTTAC
purL       TCTACGCAAACGGTTTCGTCGG
consensus     ACGCAAACGTTTTCGT
pattern       aCGmAAACGtTTkCkT
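The consensus and pattern rows can be derived mechanically from the aligned sites. The following is a minimal Python sketch, not the method used for the slides: it takes the 16-column core (columns 4 to 19) of the nine sites, picks the most frequent nucleotide in each column for the consensus, and builds the pattern with an assumed rule (upper case if at most one site deviates, a two-letter IUPAC code if two nucleotides cover all but at most one site, lower case otherwise). The rule, the `max_deviants` parameter and all function names are assumptions made for illustration.

```python
from collections import Counter

# The nine aligned sites shown on the slide (codB ... purL).
SITES = [
    "CCCACGAAAACGATTGCTTTTT",  # codB
    "GCCACGCAACCGTTTTCCTTGC",  # purE
    "GTTCGGAAAACGTTTGCGTTTT",  # pyrD
    "CACACGCAAACGTTTTCGTTTA",  # purT
    "CCTACGCAAACGTTTTCTTTTT",  # cvpA
    "GATACGCAAACGTGTGCGTCTG",  # purC
    "GTCTCGCAAACGTTTGCTTTCC",  # purM
    "GTTGCGCAAACGTTTTCGTTAC",  # purH
    "TCTACGCAAACGGTTTCGTCGG",  # purL
]

# Two-letter IUPAC degenerate codes, written in lower case as on the slide.
IUPAC2 = {
    frozenset("AC"): "m", frozenset("AG"): "r", frozenset("AT"): "w",
    frozenset("CG"): "s", frozenset("CT"): "y", frozenset("GT"): "k",
}


def columns(sites):
    """Nucleotide counts for every column of an ungapped alignment."""
    return [Counter(col) for col in zip(*sites)]


def consensus(sites):
    """Most frequent nucleotide in each column."""
    return "".join(c.most_common(1)[0][0] for c in columns(sites))


def pattern(sites, max_deviants=1):
    """Consensus with degenerate positions (hypothetical rule, see lead-in)."""
    n = len(sites)
    out = []
    for counts in columns(sites):
        ranked = counts.most_common()
        top, n_top = ranked[0]
        second, n_second = ranked[1] if len(ranked) > 1 else (None, 0)
        if n - n_top <= max_deviants:
            out.append(top)                      # strongly conserved
        elif n_second > max_deviants and n - n_top - n_second <= max_deviants:
            out.append(IUPAC2[frozenset(top + second)])  # two-letter choice
        else:
            out.append(top.lower())              # weakly conserved
    return "".join(out)


core = [s[3:19] for s in SITES]  # columns 4..19 of the alignment
print(consensus(core))  # ACGCAAACGTTTTCGT
print(pattern(core))    # aCGmAAACGtTTkCkT
```

With these nine sites the assumed rule happens to reproduce the pattern on the slide; other samples may call for different thresholds.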

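Frequency matrix
Information content: $I = \sum_j \sum_b f(b,j) \log [ f(b,j) / p(b) ]$, where f(b,j) is the frequency of nucleotide b at position j and p(b) is the background frequency of b.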
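As a worked illustration of the information content formula, the sketch below computes the contribution of each position j for a set of aligned sites. It assumes base-2 logarithms (bits) and a uniform background p(b) = 0.25 unless another background is passed; the function name is hypothetical.

```python
import math
from collections import Counter


def information_content(sites, background=None):
    """Per-position information content, in bits, of aligned sites.

    Each entry is sum_b f(b,j) * log2(f(b,j) / p(b)); nucleotides that do
    not occur in a column contribute zero.
    """
    if background is None:
        background = {b: 0.25 for b in "ACGT"}  # assumed uniform background
    n = len(sites)
    per_position = []
    for column in zip(*sites):
        info = 0.0
        for base, count in Counter(column).items():
            f = count / n
            info += f * math.log2(f / background[base])
        per_position.append(info)
    return per_position


# A fully conserved column carries 2 bits, a uniform column carries none.
print(information_content(["AA", "AC", "AG", "AT"]))  # [2.0, 0.0]
```

The total information content I of the signal is the sum of the per-position values.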

Sequence logo

Positional weight matrix (PWM)

Probabilistic motivation: log-likelihood (up to a linear transformation)
More probabilistic motivation: z-score (with the suitable base of the logarithm)
Thermodynamical motivation: free energy (assuming independence of positions, up to a linear transformation)
Pseudocounts

Logical rules, trees etc.

Compilation of samples
Initial sample:
–GenBank
–specialized databases
–literature (reviews)
–literature (original papers)
Correction of GenBank errors
Checking the literature
Removal of predicted sites
Removal of duplicates

Re-alignment approaches
Initial alignment by a biological landmark
–start of transcription for promoters
–start codon for ribosome binding sites
–exon-intron boundary for splicing sites
Deriving the signal within a sliding window
Re-alignment
etc. etc. until convergence

Gene starts of Bacillus subtilis
dnaN   ACATTATCCGTTAGGAGGATAAAAATG
gyrA   GTGATACTTCAGGGAGGTTTTTTAATG
serS   TCAATAAAAAAAGGAGTGTTTCGCATG
bofA   CAAGCGAAGGAGATGAGAAGATTCATG
csfB   GCTAACTGTACGGAGGTGGAGAAGATG
xpaC   ATAGACACAGGAGTCGATTATCTCATG
metS   ACATTCTGATTAGGAGGTTTCAAGATG
gcaD   AAAAGGGATATTGGAGGCCAATAAATG
spoVC  TATGTGACTAAGGGAGGATTCGCCATG
ftsH   GCTTACTGTGGGAGGAGGTAAGGAATG
pabB   AAAGAAAATAGAGGAATGATACAAATG
rplJ   CAAGAATCTACAGGAGGTGTAACCATG
tufA   AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ   TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA   CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM   AGATCATTTAGGAGGGGAAATTCAATG

dnaN   ACATTATCCGTTAGGAGGATAAAAATG
gyrA   GTGATACTTCAGGGAGGTTTTTTAATG
serS   TCAATAAAAAAAGGAGTGTTTCGCATG
bofA   CAAGCGAAGGAGATGAGAAGATTCATG
csfB   GCTAACTGTACGGAGGTGGAGAAGATG
xpaC   ATAGACACAGGAGTCGATTATCTCATG
metS   ACATTCTGATTAGGAGGTTTCAAGATG
gcaD   AAAAGGGATATTGGAGGCCAATAAATG
spoVC  TATGTGACTAAGGGAGGATTCGCCATG
ftsH   GCTTACTGTGGGAGGAGGTAAGGAATG
pabB   AAAGAAAATAGAGGAATGATACAAATG
rplJ   CAAGAATCTACAGGAGGTGTAACCATG
tufA   AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ   TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA   CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM   AGATCATTTAGGAGGGGAAATTCAATG
cons.  aaagtatataagggagggttaataATG
num.   001000000000110110000000111
       760666658967228106888659666
(num.: number of sequences matching the consensus at each position, read vertically as a two-digit number)

dnaN   ACATTATCCGTTAGGAGGATAAAAATG
gyrA   GTGATACTTCAGGGAGGTTTTTTAATG
serS   TCAATAAAAAAAGGAGTGTTTCGCATG
bofA   CAAGCGAAGGAGATGAGAAGATTCATG
csfB   GCTAACTGTACGGAGGTGGAGAAGATG
xpaC   ATAGACACAGGAGTCGATTATCTCATG
metS   ACATTCTGATTAGGAGGTTTCAAGATG
gcaD   AAAAGGGATATTGGAGGCCAATAAATG
spoVC  TATGTGACTAAGGGAGGATTCGCCATG
ftsH   GCTTACTGTGGGAGGAGGTAAGGAATG
pabB   AAAGAAAATAGAGGAATGATACAAATG
rplJ   CAAGAATCTACAGGAGGTGTAACCATG
tufA   AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ   TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA   CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM   AGATCATTTAGGAGGGGAAATTCAATG
cons.  tacataaaggaggtttaaaaat
num.   0000000111111000000001
       5755779156663678679890

Positional information content before and after re-alignment

Positional nucleotide frequencies after re-alignment (aGGAGG pattern)

Enhancement of a weak signal

Deriving the signal ab initio
“Discrete” (pattern-driven) approaches: word counting
“Continuous” (profile-driven) approaches: optimization

Word counting. Short words
Consider all k-mers
For each k-mer compute the number of sequences containing this k-mer
–(maybe with some mismatches)
Select the most frequent k-mer
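The short-word procedure can be sketched in a few lines of Python. This is an illustrative brute-force version, feasible only for small k because it enumerates all 4^k words; the function names, the `mismatches` parameter and the toy sequences are assumptions.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def support(word, seqs, mismatches=0):
    """Number of sequences containing `word` with at most `mismatches` mismatches."""
    k = len(word)
    return sum(
        any(hamming(word, s[i:i + k]) <= mismatches for i in range(len(s) - k + 1))
        for s in seqs
    )


def best_word(seqs, k, mismatches=0):
    """Enumerate all 4**k words and return the one present in most sequences."""
    words = ("".join(p) for p in product("ACGT", repeat=k))
    return max(((w, support(w, seqs, mismatches)) for w in words),
               key=lambda pair: pair[1])


# Toy upstream fragments (hypothetical data) sharing the word GGAGG.
SEQS = ["TTGGAGGA", "CGGAGGTT", "AAGGAGGC"]
print(best_word(SEQS, k=5))  # ('GGAGG', 3)
```

Counting sequences rather than occurrences keeps a word that is repeated many times in a single sequence from dominating the result.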

Problem: Complete search is possible only for short words
Assumption: if a long word is overrepresented, its subwords are also overrepresented
Solution: select a set of overrepresented words and combine them into longer words

Word counting. Long words
Consider some k-mers
For each k-mer compute the number of sequences containing this k-mer
–(maybe with some mismatches)
Select the most frequent k-mer

Problem: what k-tuples to start with?
1st attempt: those actually occurring in the sample.
But: the correct signal (the consensus word) may not be among them.

2nd attempt: those actually occurring in the sample and some neighborhood.
But:
–again, the correct signal (the consensus word) may not be among them;
–the size of the neighborhood grows exponentially

Graph approach Each k-mer in each sequence corresponds to a vertex. Two k-mers are linked by an arc, if they differ in at most h positions (h<<k). Thus we obtain an n-partite graph (n is the number of sequences). A signal corresponds to a clique (a complete subgraph) – or at least a dense subgraph – with vertices in each part.

A simple algorithm
Remove vertices that cannot be extended to complete subgraphs
–that is, do not have arcs to all parts of the graph
Remove pairs that cannot be extended …
–that is, do not form triangles with the third vertex in all parts of the graph
Etc.
(will not work “as is” for dense subgraphs)
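The first pruning step of this algorithm (removing vertices without arcs to every other part) can be sketched as follows. This is a hypothetical Python illustration of the vertex-removal step only; the pair and triangle steps are not implemented, and the names `prune`, `k` and `h` mirror the slide notation rather than any existing tool.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def prune(seqs, k, h):
    """Iteratively drop k-mers that lack a neighbor (<= h mismatches)
    in at least one other sequence, i.e. in another part of the graph."""
    # parts[a][i] is the k-mer starting at offset i of sequence a (a vertex).
    parts = [{i: s[i:i + k] for i in range(len(s) - k + 1)} for s in seqs]
    alive = [set(part) for part in parts]
    changed = True
    while changed:
        changed = False
        for a, part in enumerate(parts):
            for i in list(alive[a]):
                linked_everywhere = all(
                    any(hamming(part[i], parts[b][j]) <= h for j in alive[b])
                    for b in range(len(parts)) if b != a
                )
                if not linked_everywhere:
                    alive[a].discard(i)  # cannot belong to a clique
                    changed = True
    return [sorted((i, parts[a][i]) for i in alive[a]) for a in range(len(parts))]


# Toy sequences (hypothetical data); with h=0 only the shared word survives.
SEQS = ["ACGGAGGTC", "TTGGAGGAA", "CAGGAGGCT"]
print(prune(SEQS, k=5, h=0))
# [[(2, 'GGAGG')], [(2, 'GGAGG')], [(2, 'GGAGG')]]
```

The loop repeats until nothing changes because removing a vertex in one part can leave a vertex in another part without a neighbor.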

Optimization. EM algorithms
Generate an initial set of profiles (e.g. seed with all k-mers)
For each profile
–find the best (highest scoring) representative in each sequence
–update the profile
Iterate until convergence

This algorithm converges. However, it cannot leave the basin of attraction. Thus, if the initial approximation is bad, it will converge to nonsense. Solution: stochastic optimization.
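The iteration described here (score, pick the best representative in each sequence, rebuild the profile, repeat) can be sketched for a single seed as below. This is a simplified hard-assignment variant written for illustration, not a full expectation-maximization implementation; the log-odds weights with a pseudocount of 0.5, the uniform background and all names are assumptions.

```python
import math
from collections import Counter


def build_weights(sites, pseudo=0.5):
    """Log-odds profile from aligned sites (uniform background assumed)."""
    total = len(sites) + 4 * pseudo
    weights = []
    for column in zip(*sites):
        counts = Counter(column)
        weights.append({b: math.log((counts[b] + pseudo) / total / 0.25)
                        for b in "ACGT"})
    return weights


def score(weights, word):
    return sum(w[b] for w, b in zip(weights, word))


def best_window(weights, seq):
    """Offset of the highest-scoring window in one sequence."""
    k = len(weights)
    return max(range(len(seq) - k + 1), key=lambda i: score(weights, seq[i:i + k]))


def refine(seqs, seed, max_iter=100):
    """Start from one seed word and iterate until the chosen sites stop changing."""
    k = len(seed)
    weights = build_weights([seed])
    positions = None
    for _ in range(max_iter):
        new_positions = [best_window(weights, s) for s in seqs]
        if new_positions == positions:
            break  # converged: same representative in every sequence
        positions = new_positions
        weights = build_weights([s[i:i + k] for s, i in zip(seqs, positions)])
    return positions, weights


# Toy sequences (hypothetical data) with a shared GGAGG word at offset 2.
SEQS = ["ACGGAGGTC", "TTGGAGGAA", "CAGGAGGCT"]
positions, _ = refine(SEQS, seed="GGAGG")
print(positions)  # [2, 2, 2]
```

A poor seed can leave this procedure stuck at a meaningless local optimum, which is the basin-of-attraction problem noted on the slide.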

Simulated annealing
Goal: maximize the information content $I = \sum_j \sum_b f(b,j) \log [ f(b,j) / p(b) ]$, or any other measure of homogeneity of the sites.

Let A be the current signal (set of candidate sites), and let I(A) be the corresponding information content.
Let B be a set of sites obtained by randomly choosing a different site in one sequence, and let I(B) be its information content.
–if $I(B) \ge I(A)$, B is accepted
–if $I(B) < I(A)$, B is accepted with probability $P = \exp[(I(B) - I(A)) / T]$
The temperature T decreases exponentially, but slowly; the initial temperature is chosen such that almost all changes are accepted.
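The acceptance rule and cooling schedule can be sketched as follows. This is a minimal illustration under stated assumptions: information content is measured in bits against a uniform background, and the starting temperature, cooling factor, step count and function names are arbitrary choices, not values from the slides.

```python
import math
import random
from collections import Counter


def info(sites):
    """Total information content (bits, uniform background) of aligned sites."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        for count in Counter(column).values():
            f = count / n
            total += f * math.log2(f / 0.25)
    return total


def accept_probability(i_new, i_old, temperature):
    """Always accept improvements; accept losses with exp((I(B) - I(A)) / T)."""
    if i_new >= i_old:
        return 1.0
    return math.exp((i_new - i_old) / temperature)


def anneal(seqs, k, t_start=5.0, cooling=0.999, steps=5000, seed=0):
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - k + 1) for s in seqs]
    current = info([s[p:p + k] for s, p in zip(seqs, pos)])
    temperature = t_start
    for _ in range(steps):
        i = rng.randrange(len(seqs))
        # Propose a different site in one randomly chosen sequence.
        choices = [p for p in range(len(seqs[i]) - k + 1) if p != pos[i]]
        if choices:
            candidate = pos.copy()
            candidate[i] = rng.choice(choices)
            cand_info = info([s[p:p + k] for s, p in zip(seqs, candidate)])
            if rng.random() < accept_probability(cand_info, current, temperature):
                pos, current = candidate, cand_info
        temperature *= cooling  # slow exponential cooling
    return pos, current


SEQS = ["ACGGAGGTC", "TTGGAGGAA", "CAGGAGGCT"]  # toy data (hypothetical)
print(anneal(SEQS, k=5))
```

Because the run is stochastic, the final set of sites depends on the random seed and on the schedule; in practice several runs are usually compared.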

Gibbs sampler
Again, A is a signal (set of sites), and I(A) is its information content.
At each step a new site is selected in one sequence with probability $P \propto \exp[I(A_{new})]$.
For each candidate site the total time of occupation is computed. (Note that the signal changes all the time.)
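One way to read the sampling rule is sketched below: for the chosen sequence, every possible offset is weighted by exp(I(A_new)), one offset is drawn, and the occupation time of each site is tallied. This is an illustrative Python sketch, not a published Gibbs sampler implementation; information content is taken in bits against a uniform background and all names are hypothetical.

```python
import math
import random
from collections import Counter


def info(sites):
    """Total information content (bits, uniform background) of aligned sites."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        for count in Counter(column).values():
            f = count / n
            total += f * math.log2(f / 0.25)
    return total


def site_distribution(seqs, pos, which, k):
    """Probability of each offset in sequence `which`, proportional to exp(I(A_new))."""
    weights = []
    for p in range(len(seqs[which]) - k + 1):
        trial = pos.copy()
        trial[which] = p
        weights.append(math.exp(info([s[q:q + k] for s, q in zip(seqs, trial)])))
    total = sum(weights)
    return [w / total for w in weights]


def gibbs(seqs, k, steps=2000, seed=0):
    """Return, for every sequence, how many steps each offset was occupied."""
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - k + 1) for s in seqs]
    occupancy = [Counter() for _ in seqs]
    for _ in range(steps):
        which = rng.randrange(len(seqs))
        probs = site_distribution(seqs, pos, which, k)
        pos[which] = rng.choices(range(len(probs)), weights=probs)[0]
        for i, p in enumerate(pos):
            occupancy[i][p] += 1  # time of occupation of each candidate site
    return occupancy


SEQS = ["ACGGAGGTC", "TTGGAGGAA", "CAGGAGGCT"]  # toy data (hypothetical)
for counts in gibbs(SEQS, k=5):
    print(counts.most_common(2))
```

Sites with the longest occupation time are the natural candidates for the final signal.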

Use of symmetry
DNA-binding factors and their signals
–Co-operative homogeneous
–Palindromes
–Repeats
–Co-operative non-homogeneous
–Cassettes
–Others
–RNA signals

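Recognition: PWM/profiles
The simplest technique: positional nucleotide weights are
$W(b,j) = \ln(N(b,j) + 0.5) - 0.25 \sum_i \ln(N(i,j) + 0.5)$
where N(b,j) is the number of times nucleotide b occurs at position j in the sample and i runs over the four nucleotides.
Score of a candidate site $b_1 \ldots b_k$ is the sum of the corresponding positional nucleotide weights:
$S(b_1 \ldots b_k) = \sum_{j=1}^{k} W(b_j, j)$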
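The weight and score formulas translate directly into code. The sketch below builds W(b,j) from aligned sites, scores a candidate word, and scans a sequence for windows above a threshold; the toy sites, the threshold and the function names are hypothetical.

```python
import math
from collections import Counter


def pwm(sites):
    """W(b,j) = ln(N(b,j) + 0.5) - 0.25 * sum_i ln(N(i,j) + 0.5)."""
    weights = []
    for column in zip(*sites):
        counts = Counter(column)
        logs = {b: math.log(counts[b] + 0.5) for b in "ACGT"}
        mean = 0.25 * sum(logs.values())
        weights.append({b: logs[b] - mean for b in "ACGT"})
    return weights


def score(weights, word):
    """S(b_1 ... b_k): sum of the positional weights of the word's nucleotides."""
    return sum(w[b] for w, b in zip(weights, word))


def scan(weights, seq, threshold):
    """Offsets of all windows of `seq` scoring at least `threshold`."""
    k = len(weights)
    return [i for i in range(len(seq) - k + 1)
            if score(weights, seq[i:i + k]) >= threshold]


# Toy training sample (hypothetical data, loosely RBS-like).
TRAINING = ["AGGAGG", "GGGAGG", "AGGAGG", "AGGAGT", "AGGAGG", "TGGAGG"]
W = pwm(TRAINING)
print(round(score(W, "AGGAGG"), 2), round(score(W, "TTTTTT"), 2))
print(scan(W, "TTTAGGAGGTTT", threshold=score(W, "AGGAGG") - 1e-9))  # [3]
```

Subtracting the column mean makes the four weights at each position sum to zero, so scores of different positions are on a comparable scale.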

Distribution of RBS profile scores on sites (green) and non-sites (red)

Pattern recognition Linear discriminant analysis Logical rules Syntactic analysis Context-sensitive grammars Perceptron Neural networks

Neural networks: architecture
4 × k input neurons (sensors), each responsible for observing a particular nucleotide at a particular position
OR 2 × k neurons (one discriminates between purines and pyrimidines, the other between AT and GC)
One or more layers of hidden neurons
One output neuron

Each neuron is connected to all neurons of the next layer
Each connection is ascribed a numerical weight
A neuron
–sums the signals at incoming connections
–compares the total with the threshold (or transforms it according to a fixed function)
–if the threshold is passed, excites the outgoing connections (resp. sends the modified value)

Training:
Sites and non-sites from the training sample are presented one by one.
The output neuron produces the prediction.
The connection weights and thresholds are modified if the prediction is incorrect.
Networks differ by architecture, particulars of the signal processing, and the training schedule
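The training loop can be illustrated with the simplest possible network, a single threshold neuron (a perceptron) fed by 4 × k one-hot inputs. This is a sketch under stated assumptions: there is no hidden layer, the update is the classical perceptron rule, and the training words, learning rate and names are hypothetical.

```python
def encode(word):
    """4 x k input sensors: one per nucleotide per position (one-hot)."""
    return [1.0 if base == n else 0.0 for base in word for n in "ACGT"]


def predict(weights, threshold, word):
    """The neuron fires if the weighted input sum exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, encode(word)))
    return total > threshold


def train(sites, non_sites, epochs=100, rate=1.0):
    """Present examples one by one; adjust weights only after a wrong prediction."""
    k = len(sites[0])
    weights = [0.0] * (4 * k)
    threshold = 0.0
    samples = [(w, True) for w in sites] + [(w, False) for w in non_sites]
    for _ in range(epochs):
        errors = 0
        for word, label in samples:
            if predict(weights, threshold, word) != label:
                errors += 1
                sign = 1.0 if label else -1.0
                weights = [w + sign * rate * x
                           for w, x in zip(weights, encode(word))]
                threshold -= sign * rate
        if errors == 0:
            break  # every training example is classified correctly
    return weights, threshold


# Toy training sample (hypothetical data).
SITES = ["GGAGG", "GGAGA", "AGAGG"]
NON_SITES = ["TTTTT", "CCATC", "ATTCA"]
w, t = train(SITES, NON_SITES)
print([predict(w, t, s) for s in SITES + NON_SITES])
# [True, True, True, False, False, False]
```

A single neuron can only learn linearly separable samples; the hidden layers described on the architecture slide are what allow dependencies between positions to be captured.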

Use of sequence context
Presence of multiple co-operative sites
–ArgR (E. coli), purine regulator (Pyrococcus)
–XylR+CRP; CytR+CRP (E. coli)
–MEF+MyoD in muscle-specific promoters (mammals)
Location relative to promoters
–repressors vs. activators

Benchmarking
Difficult, because:
Different algorithms are optimized for different performance parameters
Incompatible training sets
Difficult to construct a homogeneous and unambiguous testing set:
–Unobserved sites
–Competition between closely located sites
–Activation in specific conditions
–Non-specific binding (52 out of 54 candidate HNF-1 binding sites do bind the factor)

Promoters of E. coli
PWM at a false positive rate of 1 per 2000 bp recognizes:
–25% of all promoters,
–60% of constitutive (non-activated) promoters
PWMs perform as well as neural networks

Eukaryotic promoters

Ribosome binding sites Information content of the profile predicts the average reliability of predictions

CRP (E. coli)

Comparative approach to the analysis of regulation Making good predictions with bad rules

Regulation of transcription in prokaryotes
Difficult:
Small sample size
Weak signals (or we do not know what features are relevant, maybe the DNA structure)

CRP (E. coli)

GenBank entry for the E. coli genome

Many genomes are available ⇒ comparative approach
Basic assumption: regulons (sets of co-regulated genes) are well conserved
…in some cases; in fact, in many cases

Corollary: The consistency check
True sites occur upstream of orthologous genes
False sites are scattered at random
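The consistency check amounts to counting, for each group of orthologous genes, the number of genomes in which a candidate site was found upstream of a member of that group. The sketch below is a hypothetical illustration: the input format (genome name mapped to the ortholog groups that have a candidate site), the `min_genomes` cutoff and the toy data are all assumptions.

```python
from collections import Counter


def consistent_groups(hits_by_genome, min_genomes=2):
    """Keep ortholog groups with candidate sites in at least `min_genomes` genomes."""
    support = Counter()
    for groups in hits_by_genome.values():
        support.update(set(groups))  # count each genome once per group
    return {group: n for group, n in support.items() if n >= min_genomes}


# Toy profile hits (hypothetical data): purA and purC recur across genomes,
# the other hits appear in one genome only and look like scattered false positives.
HITS = {
    "genome_1": {"purA", "purC", "xyz1"},
    "genome_2": {"purA", "purC", "abc7"},
    "genome_3": {"purA", "foo3"},
}
print(sorted(consistent_groups(HITS).items()))
# [('purA', 3), ('purC', 2)]
```

Random false positives rarely land upstream of orthologous genes in several genomes at once, which is why even a weak profile becomes usable after this filter.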

Orthologs
Orthologous genes:
–diverged by speciation
–retain cellular role
Paralogous genes:
–diverged by duplication
–retain biochemical function only

Orthology (definition)
Genomes are shown as black “pipes”
1st event: duplication
2nd event: speciation
Genes of the same color are orthologous
Genes of different color are paralogous
[Figure labels: duplication; genes A1, B1 (Genome 1); genes A2, B2 (Genome 2)]
A1 and A2 are orthologs, B1 and B2 are orthologs, all other pairs are paralogs

Search for orthologs (fast and dirty)

The basic procedure
[Figure labels: Set of known sites; Profile; Genome 1, Genome 2, …, Genome N]

Accounting for the operon structure

Checklist
Presence of orthologous transcription factors
Really orthologous (BETs, COGs etc. are not sufficient)
Conservation of the DNA-binding domain
Conservation of the core pathway

Purine regulons of E. coli and H. influenzae

Predicted purine transporters
[Figure: phylogenetic tree with bootstrap values. Leaves: YgfO, YicE, UAPA_En, UAPC_En, YgfU, 2635740_Bs, 2635741_Bs, YcdG_Ec, UraA_Hi, UraA_Ec, 2895752_Ef, PyrP_Bc, PyrP_Bs, YjcD_Hi, YjcD, YgfQ, YtiP_Bs, 2239289_Bs, YieG, YicO, Y326_Mj, 2314333_Hp, 2689889_Bb, 2689890_Bb, PbuX_Bs]

Changes in the operon structure: more examples
glnK-amtB loci of methanogenic archaebacteria

Tryptophan operons

Heat shock (HrcA) regulons / CIRCE elements

Closely related genomes: Phylogenetic footprinting Regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions.

High conservation

Low conservation

Degeneration of sites

Problems and solutions
–Unique members of regulons may be lost: use of additional genomes decreases the number of “orphan” regulon members.
–Closely related factors may have similar sites: careful analysis of function and analysis of particular sites is usually sufficient to resolve ambiguities.
–Too many genomes and regulons: apply preliminary automated screening.

Modification: ubiquitous regulators
Present in many genomes
Only core regulon is conserved
Mode of regulation may vary
Signals may be slightly different

Arginine repressor ArgR/AhrC

ABC transporters (periplasmic components)

Modification: horizontal transfer
Impossible to resolve the orthology relationships: a homologous regulated gene is sufficient for corroboration
Often regulate large loci (several adjacent operons)
Signals are mainly conserved

New signals
Select a group of related genomes
In each genome select metabolically related genes
Add possibly co-transcribed genes
Compare upstream regions for each genome independently
Construct profiles
Compare constructed profiles: if similar, then relevant

The purine regulon of Pyrococcus spp.
Use functional annotation and COGs to select genes encoding enzymes from the purine pathway: purA, purB, purC, purF, purD, purE, purL-I, purL-II, purT, guaA.
Construct profiles for each genome.
The quality of profiles is weak (< 1 bit/position). However, the profiles are almost identical.
There is no significant similarity of upstream regions (outside sites). Thus the profiles are probably correct.
Low specificity of profiles, thus >300 candidate genes in each genome.
Observation: in upstream regions of all genes from the initial sample the candidate sites occur twice with a 22 bp spacer.
The new rule is absolutely specific: only one additional gene in each genome.

[Figure: phylogenetic tree of predicted purine transporters with bootstrap values (leaves YgfO, YicE, UAPA_En, UAPC_En, YgfU, … PbuX_Bs), with additional labels PH, PA and PF]

Sources G. Stormo J. Fickett W. Miller I. Dubchak Yuh et al. (1998) Tronche et al. (1997) textbooks

Discussions and collaboration
Farid Chetouani (Institut Pasteur)
Eugene Koonin (NCBI)
Yuri Kozlov (Ajinomoto)
Leonid Mirny (Harvard - MIT)
Alexander Mironov (GosNIIGenetika)
Vasily Lybetsky (Inst. Probl. Inform. Trans.)
Andrey Osterman (IntegratedGenomics)
Danila Perumov (Inst. Nucl. Phys.)
Pavel Pevzner (UC San Diego)
Michael Roytberg (Inst. Math. Probl. Biol.)

Collaborators Andrey A. Mironov A. B. Rakhmaninova Vadim Brodyansky Lyudmila Danilova Anna Gerasimova Alexey Kazakov Ekaterina Kotelnikova Olga Laikova Pavel Novichkov Ekaterina Panina Elya Permina Dmitry Ravcheev Dmitry Rodionov Natalya Sadovskaya Alexey Vitreschak
