DNA Regulatory Binding Motif Search
Dong Xu
Computer Science Department
109 Engineering Building West
Lecture Outline
- Gene regulation
- Definition of regulatory motif search
- CONSENSUS (“Greedy” Algorithm)
- Gibbs Sampler
Gene Regulation
[Figure: DNA sequence with the promoter, the operator, and the start of transcription marked]
Key steps in transcription
- Initiation
- Elongation
- Termination
[Figure: DNA + RNA]
Initiation
RNA polymerases:
- RNA Pol I: ribosomal RNAs
- RNA Pol II: all protein genes; snRNAs U1, U2, etc.
- RNA Pol III: transfer RNAs, ribosomal RNAs
One of the first sequences to be described was the TATA box, with consensus TATA(A/T)A(A/T).
Transcription Initiation Complex
- TATA-binding protein (TBP) binds to the TATA box
- A macromolecular assembly of approximately 50 proteins
- Many are conserved from yeast to humans
[Figure: TATA box, TBP, TAF, RNA pol II]
Upstream Regulatory Elements
- In addition to the TATA box, comparison of many eukaryotic upstream sequences identified additional conserved motifs involved in the regulation of gene transcription.
- Some UREs are common to many genes; others are found only in genes expressed in specific cells or in response to specific stimuli.
- Promoters are sequences in the DNA just upstream of transcripts (coding sequences) that define the sites of initiation.
[Figure: TATA box and URE]
Transcription factor
Transcription factors are the proteins that modulate the rate of gene transcription through specific interactions with DNA and/or other proteins.
[Figure: TATA box, TBP, TAF, RNA pol II, motif]
Regulatory Network (1)
- Regulatory elements in eukaryotes are frequently arranged in “modules”.
- TFs frequently act as synergistic (cooperative) or antagonistic (competitive) pairs.
[Figure: Endo 16]
Regulatory Network (2)
Motif Identification
[Figure: regulatory regions containing AGCCA]
Motif: binding site???
What constitutes a motif?
- In S. cerevisiae, typically 6-10 conserved bases (the motif)
- Spacers varying in length (1-11 bp)
  - Usually located in the middle, e.g. ACCNNNNNNGTT
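A gapped pattern such as ACCNNNNNNGTT can be searched for with a regular expression. This is a minimal sketch, not part of the original slides: the function name `find_gapped_motif` and its parameters are hypothetical, and the 1-11 bp spacer range is taken from the slide.

```python
import re

def find_gapped_motif(seq, left="ACC", right="GTT", min_gap=1, max_gap=11):
    """Return (start, matched_text) for each left + spacer + right occurrence."""
    # A lookahead lets overlapping occurrences be reported; the lazy
    # quantifier picks the shortest spacer that works from each start.
    pattern = re.compile(
        rf"(?=({left}[ACGT]{{{min_gap},{max_gap}}}?{right}))"
    )
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq.upper())]

hits = find_gapped_motif("ttACCgatcgaGTTaa")
print(hits)
```

Only one match per start position is reported (the one with the shortest spacer), which is a simplification.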
Subproblem #1
- Given a collection of known binding sites
- Can we develop a representation to search for new binding sites?
Subproblem #2
- Given a set of sequences containing binding sites for a common factor
- Can we discover their location in each sequence?
Computational Approach
- Identify a set of genes believed to be controlled by the same regulatory mechanism (co-regulated genes).
- Extract the regulatory regions of the genes (usually upstream sequences) to form a sample of sequences.
- Find some way to identify conserved elements (ungapped patterns) in these sequences, resulting in a list of potential regulatory sites.
Motif Finding Problem
- Given a sample of sequences and an unknown pattern (motif) that appears at different unknown positions in each sequence, can we find the unknown pattern?
- Input: a set of sequences, each one with an unknown pattern at an unknown position.
- Output: the pattern and a set of starting positions of the pattern in each sequence.
Why Not Use Multiple Alignment
- The motif is short and may appear in different locations in different sequences. Most other areas are random.
- The problem is made more complicated since not every sequence contains a motif, because:
  - The upstream region used may not be long enough to include a regulatory site in every sequence.
  - Usually, potentially co-regulated genes are used to construct the sample, which means that we don’t know for sure whether all these genes are really co-regulated.
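Frequency matrix
Weight matrix from frequencies: $W(b,i) = \log \frac{f(b,i)}{p(b)}$, where $f(b,i)$ is the frequency of base $b$ at position $i$ and $p(b)$ is the background frequency of base $b$.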
Information Content Values
The functional constraints on each position of the pattern vary: some sites are absolutely conserved, others tolerate substitutions. Shannon’s information content $C_i$ of position $i$ ranges between 0 and 1.
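The frequency matrix and per-position information content can be sketched as follows. This is an illustration under a stated assumption: the slides do not give the formula for $C_i$, so the code uses $C_i = 1 + \sum_b f(b,i) \log_4 f(b,i)$, which is one normalization that yields the 0 to 1 range mentioned above. The function names are hypothetical.

```python
import math

BASES = "ACGT"

def frequency_matrix(sites):
    """f[b][i] = fraction of aligned sites with base b at position i."""
    width = len(sites[0])
    n = len(sites)
    return {b: [sum(s[i] == b for s in sites) / n for i in range(width)]
            for b in BASES}

def information_content(freq):
    """Per-position C_i = 1 + sum_b f(b,i) * log4 f(b,i) (assumed form)."""
    width = len(freq["A"])
    content = []
    for i in range(width):
        entropy = -sum(freq[b][i] * math.log(freq[b][i], 4)
                       for b in BASES if freq[b][i] > 0)
        content.append(1.0 - entropy)
    return content

sites = ["ACGT", "ACGA", "ACCC", "AGTG"]   # toy aligned sites
freq = frequency_matrix(sites)
ic = information_content(freq)
print([round(c, 3) for c in ic])
```

In this toy input the first column is fully conserved (content 1) and the last column is uniform (content 0).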
Sequence Logo
Example Data Set Experimentally determined CRP binding sites for 18 genes
CRP Dimer
The homodimeric structure suggests a symmetric model.
CRP Product Multinomial Model Logo
Palindromic product multinomial model of sites
Essentially a Multiple Local Alignment
Find the “best” multiple local alignment...
Difficulties
- Multiple factors for a single gene
- Variability in binding sites
  - The nature of the variability is NOT well understood
  - Insertions and deletions are uncommon
- Location, location, location…
- Confidence assessment
Early Statistical Approaches
- CONSENSUS: uses a greedy algorithm to iteratively build up motifs by adding more and more pattern instances.
- Gibbs sampler: starts from a random initial solution and uses the Gibbs sampling approach to make a series of local moves, trying to reach the solution with the best score.
- MEME: uses the expectation maximization (EM) algorithm.
CONSENSUS Algorithm
- CONSENSUS uses an iterative procedure to add more and more patterns to form potential motifs:
  - Initialize each l-mer in sequence 1 as a single-pattern motif.
  - Add each l-mer in sequence 2 to each single-pattern motif, forming motifs consisting of 2 patterns. Keep only the top n motifs.
  - Repeat the process by adding each l-mer in sequence 3 to the top n motifs from the last round, forming motifs consisting of 3 patterns, and so on until the last sequence. Only the top n motifs are kept each time.
More Details of CONSENSUS
- CONSENSUS uses the information content score to score a motif as a set of ungapped patterns.
- Instead of following the sequence order given in the input sequence set, a randomized ordering is used to avoid dependence on the input order.
CONSENSUS Procedure (1)
Cycle 1:
  For each word W_1 in S_1
    For each word W_2 in S_2
      Create alignment (gap free) of W_1, W_2
  Keep the n best alignments A_{1,1}, …, A_{n,1}
  ACGGTTG , CGAACTT , GGGCTCT …
  ACGCCTG , AGAACTA , GGGGTGT …
CONSENSUS Procedure (2)
Cycle t:
  For each alignment A_{j,t-1} from cycle t-1
    For each word W_{t+1} in S_{t+1}
      Create alignment (gap free) of W_{t+1}, A_{j,t-1}
  Keep the n best alignments A_{1,t}, …, A_{n,t}
  ACGGTTG , CGAACTT , GGGCTCT …
  ACGCCTG , AGAACTA , GGGGTGT …
  ………
  ACGGCTC , AGATCTT , GGCGTCT …
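The cycles above can be sketched in a few lines. This is a simplified illustration, not the CONSENSUS program itself: it scores alignments by summed information content against a uniform background (an assumption), processes sequences in input order rather than a randomized order, and uses hypothetical names (`info_score`, `consensus`, `top_n`).

```python
import math

BASES = "ACGT"

def info_score(words):
    """Summed information content (bits) of an ungapped alignment,
    measured against a uniform background (assumption of this sketch)."""
    n = len(words)
    total = 0.0
    for column in zip(*words):
        for b in BASES:
            f = column.count(b) / n
            if f > 0:
                total += f * math.log2(f / 0.25)
    return total

def consensus(seqs, width, top_n=50):
    """Greedy search; returns (score, words, starts) of the best alignment."""
    # Start: every l-mer of the first sequence is a single-pattern motif.
    beam = [([seqs[0][i:i + width]], [i])
            for i in range(len(seqs[0]) - width + 1)]
    for seq in seqs[1:]:
        candidates = []
        for words, starts in beam:
            for i in range(len(seq) - width + 1):
                new_words = words + [seq[i:i + width]]
                candidates.append(
                    (info_score(new_words), new_words, starts + [i]))
        # Keep only the top n alignments for the next cycle.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = [(w, s) for _, w, s in candidates[:top_n]]
    best_words, best_starts = beam[0]
    return info_score(best_words), best_words, best_starts

seqs = ["CCGTTGACAGG", "ATTGACACCTG", "GGCAATTGACA"]   # toy input
score, words, starts = consensus(seqs, width=6)
print(score, words, starts)
```

A small `top_n` makes the search faster but more likely to discard the true motif early, which is the usual trade-off of a greedy search.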
Weight matrix
- Probabilistic model: how likely is each letter at each motif position?
[Figure: matrix with rows A, C, G, T]
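A weight matrix built from known sites can be used to scan a new sequence for the most likely site. This is a minimal sketch under assumptions not stated in the slides: a pseudocount of 1 per base, a uniform background by default, and hypothetical function names.

```python
import math

BASES = "ACGT"

def probability_matrix(sites, pseudocount=1.0):
    """pwm[i][b] = estimated probability of base b at motif position i."""
    n = len(sites)
    return [{b: (column.count(b) + pseudocount) / (n + 4 * pseudocount)
             for b in BASES}
            for column in zip(*sites)]

def best_site(seq, pwm, background=None):
    """Return (start, log-odds score) of the highest-scoring window."""
    background = background or {b: 0.25 for b in BASES}
    width = len(pwm)
    scored = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(math.log2(pwm[i][b] / background[b])
                    for i, b in enumerate(window))
        scored.append((score, start))
    score, start = max(scored)
    return start, score

sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]   # toy known sites
pwm = probability_matrix(sites)
start, score = best_site("GGCGCTATAATGCGC", pwm)
print(start, round(score, 2))
```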
A.K.A.
Weight matrices are also known as
- Position-specific scoring matrices
- Position-specific probability matrices
- Position-specific weight matrices
Related concepts
- Information content
- Relative entropy
Scoring a motif model
- A motif is interesting if it is very different from the background distribution
[Figure: two matrices over A, C, G, T, labeled “more interesting” and “less interesting”]
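Relative entropy
- A motif is interesting if it is very different from the background distribution
- Use relative entropy as the objective function:
  $\sum_{i} \sum_{\beta \in \{A,C,G,T\}} p_{i,\beta} \log \frac{p_{i,\beta}}{b_\beta}$
  $p_{i,\beta}$ = probability of $\beta$ in matrix position $i$
  $b_\beta$ = background frequency of $\beta$ (in non-motif sequence)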
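The relative entropy of a matrix against a background can be computed directly from the definition. A minimal sketch, assuming the matrix is stored as a list of per-position dictionaries and using base-2 logarithms (both are choices of this example):

```python
import math

def relative_entropy(pwm, background):
    """Sum over positions i and letters b of p[i][b] * log2(p[i][b] / bg[b])."""
    return sum(p * math.log2(p / background[b])
               for column in pwm
               for b, p in column.items()
               if p > 0)

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
sharp = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}] * 3   # fully conserved
flat = [dict(uniform)] * 3                               # equals the background

print(relative_entropy(sharp, uniform))   # large: very different from background
print(relative_entropy(flat, uniform))    # zero: indistinguishable from background
```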
Computational Complexity
- n is a user-defined heuristic constant
- Running time: $O(N^2) + O(kNn)$
- where N: length of sequence; n: top n selections; k: number of sequences
Gibbs Sampling (1)
- Goal: find the best start positions a_k to maximize the difference between the motif and background base distributions.
[Figure: sequences with motif start positions a_1, a_2, a_3, a_4, …, a_k]
Gibbs Sampling (2)
- Step 1: Pick random start positions, compute the current motif matrix
- Step 2: Iterative update
  - Take one sequence out, update the motif matrix
  - Calculate the fitness score of each position of the left-out sequence
  - Pick a start position in the left-out sequence based on weight Ax
  - Take out another sequence, …, until convergence
- Step 3: Reset starting position
Liu, X
Gibbs Sampling (3)
Take out one sequence, calculate the fitness score of every subsequence relative to the current motif
[Figure: sequences with start positions a_1', a_2', a_3', a_4', …, a_k'; the left-out sequence is shown as ?????????????????]
Fitness Score
- Ax = Qx / Px
  - Qx: probability of generating subsequence x from the current motif
  - Px: probability of generating subsequence x from the background
[Table: Current Motif; positions 1, 2, 3 by bases A, T, G, C]
Background: P(A) = P(T) = 0.4, P(G) = P(C) = 0.1
X = GGA: Q? P?
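The Q? P? exercise can be worked through in code. The background values are the ones given on the slide; the motif matrix below is hypothetical, since the slide's table values did not survive, so only P is determined by the slide.

```python
def fitness(x, motif, background):
    """Return (Q, P, A) for subsequence x, where A = Q / P."""
    q = 1.0
    p = 1.0
    for i, base in enumerate(x):
        q *= motif[i][base]      # probability under the current motif
        p *= background[base]    # probability under the background
    return q, p, q / p

background = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}   # from the slide

# Hypothetical motif matrix, one dict per motif position (illustration only).
motif = [
    {"A": 0.1, "T": 0.1, "G": 0.7, "C": 0.1},
    {"A": 0.1, "T": 0.1, "G": 0.6, "C": 0.2},
    {"A": 0.7, "T": 0.1, "G": 0.1, "C": 0.1},
]

q, p, a = fitness("GGA", motif, background)
print(q, p, a)
```

With the slide's background, P(GGA) = 0.1 × 0.1 × 0.4 = 0.004 regardless of the motif matrix; Q and A depend on the matrix chosen.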
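An example
Current sites, with sequence 4 left out:
  ACAGTGT
  TAGGCGT
  ACACCGT
  ???????
  CAGGTTT
[Figure: weight matrix over A, C, G, T]
Sites after a new site is chosen in sequence 4:
  ACAGTGT
  TAGGCGT
  ACACCGT
  ACGCCGT
  CAGGTTT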
Gibbs pseudocode
  select sites at random
  compute the relative entropy
  for (iter = 0; iter < maxiter; iter++) {
    shuffle(sequences)
    foreach sequence in (sequences) {
      assign score to each site in sequence
      choose one site probabilistically
      compute the fitness score
      if (fitness score is best so far) {
        store a copy of the current sites
      }
    }
  }
  print the best scoring set of sites
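The pseudocode above can be turned into a small runnable sketch. Several choices here are assumptions rather than part of the slides: a uniform background, a pseudocount of 1 per base, relative entropy as the "best so far" score, and a fixed random seed for reproducibility. It needs at least two sequences over the alphabet ACGT.

```python
import math
import random

BASES = "ACGT"

def build_model(words, pseudocount=1.0):
    """Column-wise base probabilities with pseudocounts (assumed value)."""
    n = len(words)
    return [{b: (col.count(b) + pseudocount) / (n + 4 * pseudocount)
             for b in BASES} for col in zip(*words)]

def relative_entropy(model, bg):
    return sum(p * math.log2(p / bg[b])
               for col in model for b, p in col.items())

def gibbs_sampler(seqs, width, max_iter=100, seed=0):
    rng = random.Random(seed)
    bg = {b: 0.25 for b in BASES}              # assumed uniform background
    # Select sites at random.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]

    def current_words():
        return [s[a:a + width] for s, a in zip(seqs, starts)]

    best_score = relative_entropy(build_model(current_words()), bg)
    best_starts = list(starts)
    order = list(range(len(seqs)))
    for _ in range(max_iter):
        rng.shuffle(order)
        for k in order:
            # Leave sequence k out and rebuild the motif matrix.
            others = [w for j, w in enumerate(current_words()) if j != k]
            model = build_model(others)
            # Weight A_x = Q_x / P_x for every candidate start in sequence k.
            weights = []
            for pos in range(len(seqs[k]) - width + 1):
                a = 1.0
                for i, b in enumerate(seqs[k][pos:pos + width]):
                    a *= model[i][b] / bg[b]
                weights.append(a)
            # Choose one site probabilistically, in proportion to A_x.
            starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
            score = relative_entropy(build_model(current_words()), bg)
            if score > best_score:
                best_score, best_starts = score, list(starts)
    return best_starts, best_score

seqs = ["CCGTTGACAGG", "ATTGACACCTG", "GGCAATTGACA", "TTGACAGGCCA"]
print(gibbs_sampler(seqs, width=6, max_iter=20, seed=1))
```

Because the sampler is stochastic, different seeds can return different sites; running it from several starting points and keeping the best score is the usual remedy.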
Computational Complexity
- Running time of one iteration: $O(NK)$
  - Usually need < N iterations for convergence, and < N starting points.
  - Overall complexity: unclear; typically $O(N^2 K)$ to $O(N^3 K)$
- EM is a local optimization method
- Initial parameters matter
Biological Considerations
- In practice, motif finding algorithms have to take into account characteristics of real input samples. These include:
  - Motifs with unknown length.
  - Samples with biased nucleotide composition.
  - Corrupted samples (not every sequence contains a motif).
  - Regulatory sites can lie on either DNA strand.
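The last point, sites on either strand, is commonly handled by also searching for the reverse complement of each word. A minimal sketch with hypothetical function names, using exact matching only:

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_on_both_strands(seq, word):
    """Return (start, strand) hits; start is in forward-strand coordinates."""
    hits = []
    rc = reverse_complement(word)
    for start in range(len(seq) - len(word) + 1):
        window = seq[start:start + len(word)]
        if window == word:
            hits.append((start, "+"))
        # A palindromic word equals its reverse complement; report it once.
        if window == rc and rc != word:
            hits.append((start, "-"))
    return hits

print(reverse_complement("AACG"))
print(find_on_both_strands("GGAACGTTCGTTCC", "AACG"))
```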
Reading Assignments
- Suggested reading:
  - Chapter 10 in “Current Topics in Computational Molecular Biology”, edited by Tao Jiang, Ying Xu, and Michael Zhang. MIT Press.
- Optional reading:
  1. Victor Olman, Dong Xu, and Ying Xu. CUBIC: Identification of Regulatory Binding Sites through Data Clustering. Journal of Bioinformatics and Computational Biology. 1:
Project Assignment (1)
Develop a program that implements the “greedy” algorithm (CONSENSUS) for motif identification.
1. Use an objective function of total mismatches between words.
2. Test the program using the DNA sequences on the next page.
3. Output the motif and its location in each sequence.
Project Assignment (2)
Test DNA sequences (each line is a sequence):
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa
tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag
gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa