Recognition of regulatory signals Mikhail S. Gelfand IntegratedGenomics-Moscow NATO ASI School, October 2001.

Why? Additional annotation tool (e.g. specificity of transporters and enzymes from large families) Important for practice (in addition to metabolic reconstruction) Interesting from the evolutionary point of view

Overview 0. Biological introduction 1. Algorithms Representation of signals Deriving the signal Site recognition 2. Comparative genomics Phylogenetic footprinting Consistency filtering

Some biology Transcription (DNA → RNA) Splicing (pre-mRNA → mRNA) Translation (mRNA → protein) Regulation of transcription in prokaryotes … and eukaryotes Initiation of translation

Transcription and translation in prokaryotes

Initiation of transcription (bacteria)

Translation in prokaryotes

Translation (details)

Splicing (eukaryotes)

Regulation of transcription in prokaryotes

Structure of DNA-binding domain. Example 1

Structure of DNA-binding domain. Example 2

Protein-DNA interactions

Regulation of transcription in eukaryotes

Representation of signals Consensus Pattern (consensus with degenerate positions) Positional weight matrix (PWM, or profile) Logical rules RNA signals

Consensus
codB      CCCACGAAAACGATTGCTTTTT
purE      GCCACGCAACCGTTTTCCTTGC
pyrD      GTTCGGAAAACGTTTGCGTTTT
purT      CACACGCAAACGTTTTCGTTTA
cvpA      CCTACGCAAACGTTTTCTTTTT
purC      GATACGCAAACGTGTGCGTCTG
purM      GTCTCGCAAACGTTTGCTTTCC
purH      GTTGCGCAAACGTTTTCGTTAC
purL      TCTACGCAAACGGTTTCGTCGG
consensus    ACGCAAACGTTTTCGT

Pattern
codB      CCCACGAAAACGATTGCTTTTT
purE      GCCACGCAACCGTTTTCCTTGC
pyrD      GTTCGGAAAACGTTTGCGTTTT
purT      CACACGCAAACGTTTTCGTTTA
cvpA      CCTACGCAAACGTTTTCTTTTT
purC      GATACGCAAACGTGTGCGTCTG
purM      GTCTCGCAAACGTTTGCTTTCC
purH      GTTGCGCAAACGTTTTCGTTAC
purL      TCTACGCAAACGGTTTCGTCGG
consensus    ACGCAAACGTTTTCGT
pattern      aCGmAAACGtTTkCkT
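A minimal sketch of how a consensus and a degenerate pattern could be derived from aligned sites. The function name, the `min_share` threshold and the case convention (upper case for invariant positions, lower case otherwise) are assumptions made for illustration; the slides do not state the exact rule used to obtain the pattern shown.

```python
from collections import Counter

# Two-letter IUPAC degeneracy codes (single letters map to themselves).
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AC"): "M",
    frozenset("GT"): "K", frozenset("AT"): "W", frozenset("CG"): "S",
}

def consensus_and_pattern(sites, min_share=0.8):
    """Return (consensus, pattern) for equal-length aligned sites.

    Assumed convention: invariant position -> upper case letter;
    one letter covering >= min_share -> lower case letter;
    two letters covering >= min_share -> lower case IUPAC code;
    anything else -> 'n'.
    """
    consensus, pattern = [], []
    for column in zip(*sites):
        ranked = Counter(column).most_common()
        top, top_count = ranked[0]
        n = len(column)
        consensus.append(top)
        if top_count == n:
            pattern.append(top)
        elif top_count / n >= min_share:
            pattern.append(top.lower())
        elif (top_count + ranked[1][1]) / n >= min_share:
            pattern.append(IUPAC[frozenset(top + ranked[1][0])].lower())
        else:
            pattern.append("n")
    return "".join(consensus), "".join(pattern)

toy_sites = ["ACGA", "ACGC", "ACTA", "ACTC", "AAGA"]  # toy data, not from the slides
print(consensus_and_pattern(toy_sites))
```

The threshold controls how readily a position is declared degenerate; a different threshold or a different case convention gives a different pattern from the same sites.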

Frequency matrix. Information content: $I = \sum_j \sum_b f(b,j) \log [f(b,j) / p(b)]$
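The information content formula can be sketched in a few lines. Here f(b,j) is estimated as the raw frequency of nucleotide b at position j, p(b) is the background frequency, and the logarithm is taken to base 2 so the result is in bits; the uniform background and the function name are assumptions for illustration.

```python
import math
from collections import Counter

def information_content(sites, background=None):
    """Per-position information content (bits) of aligned, equal-length sites."""
    if background is None:
        background = {b: 0.25 for b in "ACGT"}  # assumed uniform background p(b)
    per_position = []
    for column in zip(*sites):
        n = len(column)
        info = 0.0
        # Nucleotides absent from the column contribute 0 (f log f -> 0).
        for base, count in Counter(column).items():
            f = count / n
            info += f * math.log2(f / background[base])
        per_position.append(info)
    return per_position

toy_sites = ["AC", "AG", "AT", "AA"]  # toy data, not from the slides
per_pos = information_content(toy_sites)
print(per_pos, sum(per_pos))
```

A fully conserved position gives 2 bits under a uniform background, and a position with all four nucleotides equally frequent gives 0 bits. The total I is the sum over positions.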

Sequence logo

Positional weight matrix (PWM)

Probabilistic motivation: log-likelihood (up to a linear transformation) More probabilistic motivation: z-score (with the suitable base of the logarithm) Thermodynamical motivation: free energy (assuming independence of positions, up to a linear transformation) Pseudocounts

Logical rules, trees etc.

Compilation of samples Initial sample: –GenBank –specialized databases –literature (reviews) –literature (original papers) Correction of GenBank errors Checking the literature; removal of predicted sites Removal of duplicates

Re-alignment approaches Initial alignment by a biological landmark –start of transcription for promoters –start codon for ribosome binding sites –exon-intron boundary for splicing sites Deriving the signal within a sliding window Re-alignment etc. etc. until convergence

Gene starts of Bacillus subtilis
dnaN   ACATTATCCGTTAGGAGGATAAAAATG
gyrA   GTGATACTTCAGGGAGGTTTTTTAATG
serS   TCAATAAAAAAAGGAGTGTTTCGCATG
bofA   CAAGCGAAGGAGATGAGAAGATTCATG
csfB   GCTAACTGTACGGAGGTGGAGAAGATG
xpaC   ATAGACACAGGAGTCGATTATCTCATG
metS   ACATTCTGATTAGGAGGTTTCAAGATG
gcaD   AAAAGGGATATTGGAGGCCAATAAATG
spoVC  TATGTGACTAAGGGAGGATTCGCCATG
ftsH   GCTTACTGTGGGAGGAGGTAAGGAATG
pabB   AAAGAAAATAGAGGAATGATACAAATG
rplJ   CAAGAATCTACAGGAGGTGTAACCATG
tufA   AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ   TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA   CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM   AGATCATTTAGGAGGGGAAATTCAATG

dnaN   ACATTATCCGTTAGGAGGATAAAAATG
gyrA   GTGATACTTCAGGGAGGTTTTTTAATG
serS   TCAATAAAAAAAGGAGTGTTTCGCATG
bofA   CAAGCGAAGGAGATGAGAAGATTCATG
csfB   GCTAACTGTACGGAGGTGGAGAAGATG
xpaC   ATAGACACAGGAGTCGATTATCTCATG
metS   ACATTCTGATTAGGAGGTTTCAAGATG
gcaD   AAAAGGGATATTGGAGGCCAATAAATG
spoVC  TATGTGACTAAGGGAGGATTCGCCATG
ftsH   GCTTACTGTGGGAGGAGGTAAGGAATG
pabB   AAAGAAAATAGAGGAATGATACAAATG
rplJ   CAAGAATCTACAGGAGGTGTAACCATG
tufA   AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ   TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA   CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM   AGATCATTTAGGAGGGGAAATTCAATG
cons.  aaagtatataagggagggttaataATG

dnaN   ACATTATCCGTTAGGAGGATAAAAATG
gyrA   GTGATACTTCAGGGAGGTTTTTTAATG
serS   TCAATAAAAAAAGGAGTGTTTCGCATG
bofA   CAAGCGAAGGAGATGAGAAGATTCATG
csfB   GCTAACTGTACGGAGGTGGAGAAGATG
xpaC   ATAGACACAGGAGTCGATTATCTCATG
metS   ACATTCTGATTAGGAGGTTTCAAGATG
gcaD   AAAAGGGATATTGGAGGCCAATAAATG
spoVC  TATGTGACTAAGGGAGGATTCGCCATG
ftsH   GCTTACTGTGGGAGGAGGTAAGGAATG
pabB   AAAGAAAATAGAGGAATGATACAAATG
rplJ   CAAGAATCTACAGGAGGTGTAACCATG
tufA   AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ   TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA   CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM   AGATCATTTAGGAGGGGAAATTCAATG
cons.  tacataaaggaggtttaaaaat

Positional information content before and after re-alignment

Positional nucleotide frequencies after re-alignment (aGGAGG pattern)

Enhancement of a weak signal

Deriving the signal ab initio “Discrete” (pattern-driven) approaches: word counting “Continuous” (profile-driven) approaches: optimization

Word counting. Short words Consider all k-mers For each k-mer compute the number of sequences containing this k-mer –(maybe with some mismatches) Select the most frequent k-mer
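The short-word counting procedure can be sketched as follows. This version counts exact occurrences only (the optional mismatches are left out), and the function name and tie-breaking rule are assumptions for illustration.

```python
from collections import Counter

def most_frequent_kmer(sequences, k):
    """Return (k-mer, number of sequences containing it) for the top k-mer.

    Each sequence contributes at most once per k-mer.
    Ties are broken alphabetically (an arbitrary choice).
    """
    counts = Counter()
    for seq in sequences:
        # A set, so repeated occurrences within one sequence count once.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return min(counts.items(), key=lambda item: (-item[1], item[0]))

toy = ["TTGGAGGAA", "CGGAGGTCA", "AAAGGAGGC"]  # toy data, not from the slides
print(most_frequent_kmer(toy, 5))
```

Enumerating all k-mers this way is feasible only for short words, which is the limitation the next slide addresses.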

Problem: Complete search is possible only for short words Assumption: if a long word is over-represented, its subwords are also over-represented Solution: select a set of over-represented words and combine them into longer words

Word counting. Long words Consider some k-mers For each k-mer compute the number of sequences containing this k-mer –(maybe with some mismatches) Select the most frequent k-mer

Problem: what k-tuples to start with? 1st attempt: those actually occurring in the sample. But: the correct signal (the consensus word) may not be among them.

2nd attempt: those actually occurring in the sample and some neighborhood. But: –again, the correct signal (the consensus word) may not be among them; –the size of the neighborhood grows exponentially

Graph approach Each k-mer in each sequence corresponds to a vertex. Two k-mers are linked by an arc, if they differ in at most h positions (h<<k). Thus we obtain an n-partite graph (n is the number of sequences). A signal corresponds to a clique (a complete subgraph) – or at least a dense subgraph – with vertices in each part.

A simple algorithm Remove vertices that cannot be extended to complete subgraphs –that is, do not have arcs to all parts of the graph Remove pairs that cannot be extended … –that is, do not form triangles with the third vertex in all parts of the graph Etc. (will not work “as is” for dense subgraphs)
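A sketch of the first pruning step of the graph approach, under stated assumptions: vertices are (sequence index, start position) pairs, two vertices from different sequences are linked if their k-mers differ in at most h positions, and a vertex is removed when it has no neighbour in some other part. The pair and triangle steps are not implemented, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))

def prune_kmer_graph(sequences, k, h):
    """Iteratively remove vertices lacking an arc to every other part."""
    n = len(sequences)
    alive = {(s, i) for s, seq in enumerate(sequences)
             for i in range(len(seq) - k + 1)}

    def word(v):
        return sequences[v[0]][v[1]:v[1] + k]

    changed = True
    while changed:
        changed = False
        for v in list(alive):
            # Parts (sequences) in which v still has at least one neighbour.
            parts = {u[0] for u in alive
                     if u[0] != v[0] and hamming(word(v), word(u)) <= h}
            if len(parts) < n - 1:
                alive.discard(v)
                changed = True
    return alive

toy = ["AAGGAGGTT", "CCGGAGGCC", "TTGGTGGAA"]  # toy data, not from the slides
print(sorted(prune_kmer_graph(toy, k=5, h=1)))
```

The surviving vertices are only candidates: a vertex may have neighbours in every part without belonging to a clique, which is why the slide continues with pairs and triangles.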

Optimization. EM algorithms Generate an initial set of profiles (e.g. seed with all k-mers) For each profile –find the best (highest scoring) representative in each sequence –update the profile Iterate until convergence
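A minimal sketch of the iteration described above for a single seed: build a profile, pick the highest-scoring word in each sequence, rebuild the profile, and stop when the chosen sites no longer change. The log-odds scoring with a pseudocount of 0.5, the uniform background and all function names are assumptions; this is a hard-assignment variant, not a full expectation-maximization implementation.

```python
import math
from collections import Counter

def build_profile(sites, pseudo=0.5):
    """Log-odds profile against a uniform background (assumed scoring)."""
    profile = []
    for column in zip(*sites):
        counts = Counter(column)
        total = len(column) + 4 * pseudo
        profile.append({b: math.log((counts[b] + pseudo) / total / 0.25)
                        for b in "ACGT"})
    return profile

def score(profile, word):
    return sum(col[b] for col, b in zip(profile, word))

def refine(sequences, seed, max_iter=50):
    """Iterate 'best site per sequence -> new profile' until convergence."""
    k = len(seed)
    sites = [seed]
    for _ in range(max_iter):
        profile = build_profile(sites)
        new_sites = [
            max((seq[i:i + k] for i in range(len(seq) - k + 1)),
                key=lambda w: score(profile, w))
            for seq in sequences
        ]
        if new_sites == sites:
            break
        sites = new_sites
    return sites

toy = ["TTGGAGGAA", "CGGAGGTCA", "AAAGGAGGC"]  # toy data, not from the slides
print(refine(toy, "GGAGG"))
```

In practice the procedure would be run from many seeds (for example all k-mers in the sample) and the resulting profiles compared.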

This algorithm converges. However, it cannot leave the basin of attraction. Thus, if the initial approximation is bad, it will converge to nonsense. Solution: stochastic optimization.

Simulated annealing Goal: maximize the information content $I = \sum_j \sum_b f(b,j) \log [f(b,j) / p(b)]$ or any other measure of homogeneity of the sites

Let A be the current signal (set of candidate sites), and let I(A) be the corresponding information content. Let B be a set of sites obtained by randomly choosing a different site in one sequence, and let I(B) be its information content. If $I(B) \ge I(A)$, B is accepted; if $I(B) < I(A)$, B is accepted with probability $P = \exp[(I(B) - I(A)) / T]$. The temperature T decreases exponentially, but slowly; the initial temperature is chosen such that almost all changes are accepted.
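The acceptance rule can be sketched as below. The starting temperature, cooling factor, number of steps and function names are illustrative assumptions, and the information content uses raw frequencies with a uniform background. For simplicity the proposed position may coincide with the current one.

```python
import math
import random
from collections import Counter

def info(sites):
    """Total information content (bits), uniform background assumed."""
    total = 0.0
    for column in zip(*sites):
        n = len(column)
        for count in Counter(column).values():
            f = count / n
            total += f * math.log2(f / 0.25)
    return total

def anneal(sequences, k, t_start=1.0, cooling=0.999, steps=5000, seed=0):
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - k + 1) for s in sequences]

    def sites(p):
        return [s[i:i + k] for s, i in zip(sequences, p)]

    current = info(sites(pos))
    t = t_start
    for _ in range(steps):
        j = rng.randrange(len(sequences))       # sequence whose site changes
        cand = list(pos)
        cand[j] = rng.randrange(len(sequences[j]) - k + 1)
        value = info(sites(cand))
        # Accept improvements always, deteriorations with prob. exp(dI / T).
        if value >= current or rng.random() < math.exp((value - current) / t):
            pos, current = cand, value
        t *= cooling                             # slow exponential cooling
    return pos, current

toy = ["TTGGAGGAA", "CGGAGGTCA", "AAAGGAGGC"]  # toy data, not from the slides
print(anneal(toy, 5))
```

The result depends on the random seed and the cooling schedule, so several runs are normally compared.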

Gibbs sampler Again, A is a signal (set of sites), and I(A) is its information content. At each step a new site is selected in one sequence with probability $P \propto \exp[I(A_{\text{new}})]$. For each candidate site the total time of occupation is computed. (Note that the signal changes all the time)
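A sketch of the sampling step as stated on the slide, with weights proportional to exp(I) for each alternative position in one sequence, and a per-position occupation counter. The information content in bits with a uniform background, the number of steps and the function names are assumptions; published Gibbs samplers for motif finding typically use a somewhat different weighting.

```python
import math
import random
from collections import Counter

def info(sites):
    """Total information content (bits), uniform background assumed."""
    total = 0.0
    for column in zip(*sites):
        n = len(column)
        for count in Counter(column).values():
            f = count / n
            total += f * math.log2(f / 0.25)
    return total

def gibbs(sequences, k, steps=2000, seed=0):
    """Return, for each sequence, how long each start position was occupied."""
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - k + 1) for s in sequences]
    occupancy = [Counter() for _ in sequences]
    for _ in range(steps):
        j = rng.randrange(len(sequences))
        candidates = list(range(len(sequences[j]) - k + 1))
        weights = []
        for c in candidates:
            trial = pos[:j] + [c] + pos[j + 1:]
            trial_sites = [s[i:i + k] for s, i in zip(sequences, trial)]
            weights.append(math.exp(info(trial_sites)))   # P ~ exp(I(A_new))
        pos[j] = rng.choices(candidates, weights=weights)[0]
        for s, p in enumerate(pos):
            occupancy[s][p] += 1
    return occupancy

toy = ["TTGGAGGAA", "CGGAGGTCA", "AAAGGAGGC"]  # toy data, not from the slides
for counts in gibbs(toy, 5):
    print(counts.most_common(2))
```

Positions with the largest occupation counts are the predicted sites; because the signal keeps changing, the counts rather than the final state carry the answer.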

Use of symmetry DNA-binding factors and their signals –Co-operative homogeneous –Palindromes –Repeats –Co-operative non-homogeneous –Cassettes –Others –RNA signals

Recognition: PWM/profiles The simplest technique: positional nucleotide weights are $W(b,j) = \ln(N(b,j)+0.5) - 0.25 \sum_i \ln(N(i,j)+0.5)$. Score of a candidate site $b_1 \ldots b_k$ is the sum of the corresponding positional nucleotide weights: $S(b_1 \ldots b_k) = \sum_{j=1}^{k} W(b_j, j)$
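The weight and score formulas above translate directly into code. N(b,j) is the count of nucleotide b at position j of the training sites; the function names, the scanning helper and its threshold parameter are assumptions for illustration.

```python
import math
from collections import Counter

def pwm_weights(sites):
    """W(b,j) = ln(N(b,j)+0.5) - 0.25 * sum_i ln(N(i,j)+0.5)."""
    weights = []
    for column in zip(*sites):
        counts = Counter(column)
        logs = {b: math.log(counts[b] + 0.5) for b in "ACGT"}
        mean = 0.25 * sum(logs.values())
        weights.append({b: logs[b] - mean for b in "ACGT"})
    return weights

def site_score(weights, word):
    """S(b1...bk) = sum over positions j of W(b_j, j)."""
    return sum(weights[j][b] for j, b in enumerate(word))

def scan(weights, sequence, threshold):
    """Start positions and scores of all windows scoring >= threshold."""
    k = len(weights)
    hits = []
    for i in range(len(sequence) - k + 1):
        s = site_score(weights, sequence[i:i + k])
        if s >= threshold:
            hits.append((i, s))
    return hits

training = ["GGAGG", "GGAGG", "GGTGG"]  # toy data, not from the slides
w = pwm_weights(training)
print(scan(w, "TTAAGGAGGTGATC", threshold=site_score(w, "GGTGG")))
```

By construction the four weights at each position sum to zero, so a random word has an expected score near zero under a uniform background.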

Distribution of RBS profile scores on sites (green) and non-sites (red)

Pattern recognition Linear discriminant analysis Logical rules Syntactic analysis Context-sensitive grammars Perceptron Neural networks

Neural networks: architecture 4 × k input neurons (sensors), each responsible for observing a particular nucleotide at a particular position, OR 2 × k neurons (one discriminates between purines and pyrimidines, the other between AT and GC) One or more layers of hidden neurons One output neuron

Each neuron is connected to all neurons of the next layer. Each connection is ascribed a numerical weight. A neuron: sums the signals at incoming connections; compares the total with the threshold (or transforms it according to a fixed function); if the threshold is passed, excites the outgoing connections (resp. sends the modified value)

Training: Sites and non-sites from the training sample are presented one by one. The output neuron produces the prediction. The connection weights and thresholds are modified if the prediction is incorrect. Networks differ by architecture, particulars of the signal processing, the training schedule
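A minimal sketch of the simplest case: 4 × k input neurons wired directly to one output neuron with a threshold (a perceptron, with no hidden layers), trained by changing weights only when a prediction is wrong. The learning rate, epoch limit, function names and toy data are assumptions for illustration.

```python
def one_hot(word):
    """4 x k input vector: one sensor per nucleotide per position."""
    vec = []
    for b in word:
        vec.extend(1.0 if b == x else 0.0 for x in "ACGT")
    return vec

def predict(weights, bias, word):
    """Output neuron: weighted sum compared with a threshold (-bias)."""
    total = sum(w * x for w, x in zip(weights, one_hot(word))) + bias
    return 1 if total > 0 else 0

def train(sites, non_sites, epochs=100, rate=1.0):
    k = len(sites[0])
    weights, bias = [0.0] * (4 * k), 0.0
    samples = [(w, 1) for w in sites] + [(w, 0) for w in non_sites]
    for _ in range(epochs):
        errors = 0
        for word, label in samples:          # presented one by one
            delta = label - predict(weights, bias, word)
            if delta != 0:                   # update only on a wrong prediction
                errors += 1
                for i, x in enumerate(one_hot(word)):
                    weights[i] += rate * delta * x
                bias += rate * delta
        if errors == 0:
            break
    return weights, bias

sites = ["GGAGG", "GGTGG", "GGAGA"]        # toy data, not from the slides
non_sites = ["CCTCC", "ATATA", "TTCAC"]
weights, bias = train(sites, non_sites)
print([predict(weights, bias, w) for w in sites + non_sites])
```

The trained weights of such a network play the same role as a positional weight matrix; hidden layers are needed only to capture dependencies between positions.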

Use of sequence context Presence of multiple co-operative sites –ArgR (E. coli), purine regulator (Pyrococcus) –XylR+CRP; CytR+CRP (E. coli) –MEF+MyoD in muscle-specific promoters (mammals) Location relative to promoters –repressors vs. activators

Benchmarking Difficult, because: Different algorithms are optimized for different performance parameters Incompatible training sets Difficult to construct a homogeneous and unambiguous testing set: –Unobserved sites –Competition between closely located sites –Activation in specific conditions –non-specific binding (52 out of 54 candidate HNF-1 binding sites do bind the factor)

Promoters of E. coli PWM at false positive rate 1 per 2000 bp: –25% of all promoters, –60% of constitutive (non-activated) promoters PWM perform as well as neural networks

Eukaryotic promoters

Ribosome binding sites Information content of the profile predicts the average reliability of predictions

CRP (E. coli)

Comparative approach to the analysis of regulation Making good predictions with bad rules

Regulation of transcription in prokaryotes Difficult: Small sample size Weak signals (or we do not know what features are relevant, maybe the DNA structure)

CRP (E. coli)

GenBank entry for the E. coli genome

Many genomes are available => comparative approach. Basic assumption: regulons (sets of co-regulated genes) are conserved well … in some cases; in fact, in many cases

Corollary: The consistency check True sites occur upstream of orthologous genes False sites are scattered at random
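The consistency check can be sketched as a simple filter. The input structures are hypothetical: `hits` maps each genome to the genes that have a candidate site upstream, and `ortholog_group` maps (genome, gene) to an ortholog group identifier. The `min_genomes` cutoff is likewise an assumption.

```python
from collections import defaultdict

def consistent_groups(hits, ortholog_group, min_genomes=2):
    """Keep ortholog groups with candidate sites in at least min_genomes genomes."""
    support = defaultdict(set)
    for genome, genes in hits.items():
        for gene in genes:
            group = ortholog_group.get((genome, gene))
            if group is not None:
                support[group].add(genome)
    return {group: sorted(genomes) for group, genomes in support.items()
            if len(genomes) >= min_genomes}

# Hypothetical toy input; genome and gene names are placeholders.
hits = {"genome1": {"a1", "x1"}, "genome2": {"a2", "y2"}, "genome3": {"a3"}}
ortholog_group = {
    ("genome1", "a1"): "A", ("genome2", "a2"): "A", ("genome3", "a3"): "A",
    ("genome1", "x1"): "X", ("genome2", "y2"): "Y",
}
print(consistent_groups(hits, ortholog_group))
```

Candidate sites upstream of genes X and Y appear in one genome only and are discarded as likely false positives, while group A is retained.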

Orthologs Orthologous genes: –diverged by speciation –retain cellular role Paralogous genes: –diverged by duplication –retain biochemical function only

Orthology (definition) Genomes are shown as black “pipes”. 1st event: duplication. 2nd event: speciation. Genes of the same color are orthologous. Genes of different color are paralogous. [Diagram labels: duplication; A1, B1 in Genome 1; A2, B2 in Genome 2.] A1 and A2 are orthologs, B1 and B2 are orthologs, all other pairs are paralogs

Search for orthologs (fast and dirty)

The basic procedure [diagram: set of known sites → profile → Genome 1, Genome 2, …, Genome N]

Accounting for the operon structure

Checklist Presence of orthologous transcription factors Really orthologous (BETs, COGs etc. are not sufficient) * Conservation of the DNA-binding domain * Conservation of the core pathway

Purine regulons of E. coli and H. influenzae

Predicted purine transporters [figure; labels include YgfO, YicE, UAPA_En, UAPC_En, YgfU, YcdG_Ec, UraA_Hi, UraA_Ec, PyrP_Bc, PyrP_Bs, YjcD_Hi, YjcD, YgfQ, YtiP_Bs, YieG, YicO, Y326_Mj, PbuX_Bs]

Changes in the operon structure: more examples glnK-amtB loci of methanogenic archaebacteria

Tryptophan operons

Heat shock (HrcA) regulons / CIRCE elements

Closely related genomes: Phylogenetic footprinting Regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions.
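One simple way to look for conserved islands is to scan an existing multiple alignment of upstream regions for runs of identical, ungapped columns. The sketch below assumes the alignment is already given as equal-length strings with "-" for gaps; the minimum island length and the function name are arbitrary choices for illustration.

```python
def conserved_islands(alignment, min_len=6):
    """Return (start, end) column ranges (end exclusive) of fully conserved runs."""
    conserved = [len(set(col)) == 1 and col[0] != "-"
                 for col in zip(*alignment)]
    islands, start = [], None
    # A trailing False closes a run that reaches the end of the alignment.
    for i, ok in enumerate(conserved + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_len:
                islands.append((start, i))
            start = None
    return islands

toy_alignment = [            # toy data, not from the slides
    "ACGTTGACAGCTA",
    "TCGATGACAGGTT",
    "GCTTTGACAGATC",
]
for start, end in conserved_islands(toy_alignment):
    print(start, end, toy_alignment[0][start:end])
```

Requiring perfect identity is strict; real footprinting usually tolerates some mismatches within a window, which matters for the low-conservation and degenerating sites shown on the following slides.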

High conservation

Low conservation

Degeneration of sites

Problems and solutions –Unique members of regulons may be lost: use of additional genomes decreases the number of “orphan” regulon members. –Closely related factors may have similar sites: careful analysis of function and analysis of particular sites is usually sufficient to resolve ambiguities. –Too many genomes and regulons: apply preliminary automated screening.

Modification: ubiquitous regulators Present in many genomes Only core regulon is conserved Mode of regulation may vary Signals may be slightly different

Arginine repressor ArgR/AhrC

ABC transporters (periplasmic components)

Modification: horizontal transfer Impossible to resolve the orthology relationships: a homologous regulated gene is sufficient for corroboration Often regulate large loci (several adjacent operons) Signals are mainly conserved

New signals Select a group of related genomes In each genome select metabolically related genes Add possibly co-transcribed genes Compare upstream regions for each genome independently Construct profiles Compare constructed profiles: if similar, then relevant

The purine regulon of Pyrococcus spp. Use functional annotation and COGs to select genes encoding enzymes from purine pathway: purA, purB, purC, purF, purD, purE, purL-I, purL-II, purT, guaA. Construct profiles for each genome. The quality of profiles is weak (< 1 bit/position). However, the profiles are almost identical. There is no significant similarity of upstream regions (outside sites). Thus the profiles are probably correct. Low specificity of profiles, thus >300 candidate genes in each genome. Observation: in upstream regions of all genes from the initial sample the candidate sites occur twice with 22 bp spacer. The new rule is absolutely specific: only one additional gene in each genome.

[Figure; same labels as in the predicted purine transporters figure, with additional labels PH, PA, PF]

Sources G. Stormo J. Fickett W. Miller I. Dubchak Yuh et al. (1998) Tronche et al. (1997) textbooks

Discussions and collaboration Farid Chetouani (Institut Pasteur) Eugene Koonin (NCBI) Yuri Kozlov (Ajinomoto) Leonid Mirny (Harvard - MIT) Alexander Mironov (GosNIIGenetika) Vasily Lyubetsky (Inst. Probl. Inform. Trans.) Andrey Osterman (IntegratedGenomics) Danila Perumov (Inst. Nucl. Phys.) Pavel Pevzner (UC San Diego) Michael Roytberg (Inst. Math. Probl. Biol.)

Collaborators Andrey A. Mironov A. B. Rakhmaninova Vadim Brodyansky Lyudmila Danilova Anna Gerasimova Alexey Kazakov Ekaterina Kotelnikova Olga Laikova Pavel Novichkov Ekaterina Panina Elya Permina Dmitry Ravcheev Dmitry Rodionov Natalya Sadovskaya Alexey Vitreschak