Modeling Regulatory Motifs 3/26/2013
Transcriptional Regulation
Transcription is controlled by the interaction of trans-acting elements called transcription factors (TFs) and cis-acting elements of DNA. Prediction of cis-acting elements, or TF binding sites, is a challenging problem in computational biology.
[Figure: Transcriptional regulation in prokaryotes. Labels: promoter region, TF binding sites (TF1, TF2), TSS (+1), 5'UTR, ribosome binding site, 3'UTR, terminator, RNA, transcription.]
Specific Protein-DNA interactions
Protein-DNA interactions are specific, guaranteeing that transcriptional regulation is specific and precise. The specificity of protein-DNA interactions is realized by the 3-D structures of the DNA-binding face of the TF protein and of the TF binding site in the DNA sequence. Usually a TF recognizes variable but similar binding sites associated with different genes. The set of all binding sites recognized by the same TF is called a TF-binding motif.
Experimental determination of binding sites
There are in vitro and in vivo methods for determining the binding sites of TFs. Systematic evolution of ligands by exponential enrichment (SELEX) is likely to identify all possible sequences recognized by a TF. However, SELEX may not work if the TF-DNA interaction requires unknown co-factors, and the method is laborious because tedious molecular cloning and sequencing are required to determine the binding sites.
Geertz M and Maerkl S J, Briefings in Functional Genomics 2010;9:
Experimental determination of binding sites
Protein binding microarray (PBM) is another in vitro method. It avoids the molecular cloning step, and the binding sites can be read out directly from the microarray. PBM can determine binding sites at single-base resolution. Like SELEX, however, PBM may not work if the TF-DNA interaction requires unknown co-factors. PBM may also fail if the binding site is long, e.g., longer than 12 bp. A putative binding site determined by PBM is not necessarily a real binding site in cells.
Geertz M and Maerkl S J, Briefings in Functional Genomics 2010;9:
Experimental determination of binding sites
ChIP-seq and ChIP-chip are two high-throughput in vivo methods for determining the binding sites of a TF. They can determine actual binding sites in a genome, but many cell types need to be explored to determine all binding sites.
Geertz M and Maerkl S J, Briefings in Functional Genomics 2010;9:
Profile representation of TF binding sites
Examples of σ70 binding sites in E. coli:
TACGAT
TATAAT
GATACT
TATGAT
TATGTT
TATAGT
Consensus sequence: TATAAT
Regular expression: [TG]A[TC][GA]XT (X denotes any base)
Frequency matrix: to avoid zero counts, add a pseudocount of 1.
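The count matrix and consensus for the six example sites listed on this slide can be sketched as follows. This is a minimal illustration, not the slide's original matrix: the names `count_matrix` and `consensus` are hypothetical, and ties between equally frequent bases are broken in A, C, G, T order, which is an arbitrary assumption.

```python
# The six example sites shown on the slide (not the full data set).
SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def count_matrix(sites, pseudocount=0):
    """Return {base: [count at position 0, 1, ...]}, each cell started at pseudocount."""
    length = len(sites[0])
    matrix = {b: [pseudocount] * length for b in BASES}
    for site in sites:
        for i, base in enumerate(site):
            matrix[base][i] += 1
    return matrix

def consensus(sites):
    """Most frequent base per column; ties resolved in BASES order (assumption)."""
    counts = count_matrix(sites)
    length = len(sites[0])
    return "".join(max(BASES, key=lambda b: counts[b][i]) for i in range(length))

counts = count_matrix(SITES, pseudocount=1)
for base in BASES:
    print(base, counts[base])
print("consensus:", consensus(SITES))
```

With a pseudocount of 1, no cell of the matrix is zero, which is what keeps later probabilities and log ratios finite.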
Profile representation of TF binding sites
Profile: for a motif of n samples (sequences), the probability of residue b at position i is
$$p_{b,i} = \frac{n_{b,i} + k}{n + 4k},$$
where $n_{b,i}$ is the frequency (count) of residue b at position i, and k is a pseudocount to avoid zero probabilities.
[Table: profile $p_{b,i}$ of the σ70 binding sites in E. coli, pseudocount k = 1.]
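A profile can be computed from aligned sites roughly as below. The sketch assumes the common pseudocount form p = (n_b,i + k) / (n + 4k), where k is added to each of the four bases, and uses only the six example sites from the earlier slide; the function name `profile` is hypothetical.

```python
SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def profile(sites, k=1.0):
    """p[b][i] = (count of b at position i + k) / (n + 4k); assumed pseudocount form."""
    n = len(sites)
    length = len(sites[0])
    return {
        b: [(sum(s[i] == b for s in sites) + k) / (n + 4 * k) for i in range(length)]
        for b in BASES
    }

p = profile(SITES, k=1.0)
for base in BASES:
    print(base, [round(x, 2) for x in p[base]])
```

Each column of the resulting profile sums to 1, so it can be read as a probability distribution over the four bases at that position.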
Profile representation of TF binding sites
Position-specific weight (scoring) matrix (PSWM): for a motif of n samples, the weight of residue b at position i is defined as
$$w_{b,i} = \log_2 \frac{p_{b,i}}{p_b},$$
where $p_{b,i}$ is the probability of residue b at position i, and $p_b$ is the probability of residue b in the background sequences.
[Table: PSWM of the σ70 binding sites in E. coli, assuming $p_A = p_C = p_G = p_T = 0.25$.]
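A log-odds weight matrix can be sketched as below. The base-2 logarithm, the pseudocount form (n_b,i + k) / (n + 4k), and the function name `pswm` are assumptions made for illustration; the sites are the six examples from the earlier slide and the background is uniform, as stated on this slide.

```python
import math

SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def pswm(sites, k=1.0, background=None):
    """Log2-odds weight of each base at each position against a background."""
    background = background or {b: 0.25 for b in BASES}
    n, length = len(sites), len(sites[0])
    weights = {}
    for b in BASES:
        weights[b] = []
        for i in range(length):
            # Pseudocount k keeps p strictly positive, so the log is defined.
            p = (sum(s[i] == b for s in sites) + k) / (n + 4 * k)
            weights[b].append(math.log2(p / background[b]))
    return weights

W = pswm(SITES)
for base in BASES:
    print(base, [round(x, 2) for x in W[base]])
```

Positive weights mark bases that are enriched relative to the background at that position, and negative weights mark depleted ones.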
Profile representation of TF binding sites
Information content at position i of the sequence profile is given by
$$I_i = 2 + \sum_{b} p_{b,i} \log_2 p_{b,i} - e(n),$$
where e(n) is a correction factor required when one has only a few (n) samples. A pseudocount is not added when computing $p_{b,i}$.
Information content of a motif:
$$I = \sum_{i} I_i$$
Logo representation: the height of each base b at position i is
$$h_{b,i} = p_{b,i} \, I_i$$
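The per-position information content and logo heights can be sketched as below, assuming the usual sequence-logo form I_i = 2 + sum_b p log2 p - e(n) in bits. The small-sample correction e(n) is not computed here; it is passed in as a hypothetical parameter `e_n` that defaults to 0. No pseudocount is used, as the slide states.

```python
import math

SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def column_probs(sites, i):
    """Raw base frequencies at position i (no pseudocount)."""
    n = len(sites)
    return {b: sum(s[i] == b for s in sites) / n for b in BASES}

def column_information(sites, i, e_n=0.0):
    """Information content in bits: 2 minus the entropy minus the correction e_n."""
    entropy = 0.0
    for p in column_probs(sites, i).values():
        if p > 0:  # 0 * log2(0) is taken as 0
            entropy -= p * math.log2(p)
    return 2.0 - entropy - e_n

def base_heights(sites, i, e_n=0.0):
    """Logo height of each base at position i: p_{b,i} times I_i."""
    info = column_information(sites, i, e_n)
    return {b: p * info for b, p in column_probs(sites, i).items()}

def motif_information(sites, e_n=0.0):
    """Sum of the per-position information contents."""
    return sum(column_information(sites, i, e_n) for i in range(len(sites[0])))

print([round(column_information(SITES, i), 2) for i in range(6)])
```

A fully conserved column reaches the maximum of 2 bits, while a column split evenly between two bases carries 1 bit.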
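Score of a sequence using a PSWM
If we represent a sequence $S = b_1 b_2 \dots b_j \dots b_n$ as a binary matrix $\{s_{j,b}\}_{n \times 4}$, where $s_{j,b} = 1$ if $b_j = b$ and $s_{j,b} = 0$ otherwise, the score of the sequence against a profile (or PSWM) is defined as
$$\mathrm{Score}(S) = \sum_{j=1}^{n} \sum_{b \in \{A,C,G,T\}} s_{j,b} \, w_{b,j},$$
where $w_{b,j}$ is the PSWM weight of residue b at position j.
Example: for S = TATAAT, the binary matrix $\{s_{j,b}\}$ is
    A C G T
T   0 0 0 1
A   1 0 0 0
T   0 0 0 1
A   1 0 0 0
A   1 0 0 0
T   0 0 0 1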
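Scoring through the binary matrix can be sketched as follows. The weight matrix is built from the six example sites of the earlier slide with an assumed pseudocount of 1, base-2 logs, and a uniform background; `one_hot` and `score` are hypothetical names.

```python
import math

SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def pswm(sites, k=1.0, bg=0.25):
    """Log2-odds weights with pseudocount k and a uniform background (assumptions)."""
    n, length = len(sites), len(sites[0])
    return {b: [math.log2((sum(s[i] == b for s in sites) + k) / (n + 4 * k) / bg)
                for i in range(length)] for b in BASES}

def one_hot(seq):
    """Binary matrix: one row per position, columns in A, C, G, T order."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]

def score(seq, weights):
    """Sum over positions j and bases b of s[j][b] * w[b][j]."""
    s = one_hot(seq)
    return sum(s[j][c] * weights[b][j]
               for j in range(len(seq)) for c, b in enumerate(BASES))

W = pswm(SITES)
print(one_hot("TATAAT"))
print(round(score("TATAAT", W), 3), round(score("GGGGGG", W), 3))
```

Because each row of the binary matrix has a single 1, the double sum reduces to looking up one weight per position; the matrix form just makes the definition explicit.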
Higher order PSWM
To account for the dependence among adjacent positions in TF-DNA interactions, we can use higher order PSWMs. A higher order PSWM corresponds to a k-th order Markov chain, in which position i depends on the previous k positions. A higher order PSWM is also called a position weight array.
Example sites: TACGAT, TATAAT, GATACT, TATGAT, TATGTT, TATAGT
[Table: first-order PSWM for the σ70 factor binding sites. To avoid zero counts, a pseudocount of 1 is added.]
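A first-order model conditions each position on the base immediately before it. The sketch below estimates P(base at i | base at i-1) from the six example sites with a pseudocount of 1; the data layout and the name `first_order_model` are assumptions for illustration, and the conversion to log-odds weights is left out.

```python
SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def first_order_model(sites, pseudocount=1.0):
    """cond[i][prev][cur] = P(cur at position i | prev at position i-1), 0-based i >= 1."""
    length = len(sites[0])
    cond = {}
    for i in range(1, length):
        cond[i] = {}
        for prev in BASES:
            # Start every dinucleotide count at the pseudocount.
            counts = {cur: pseudocount for cur in BASES}
            for s in sites:
                if s[i - 1] == prev:
                    counts[s[i]] += 1
            total = sum(counts.values())
            cond[i][prev] = {cur: c / total for cur, c in counts.items()}
    return cond

MODEL = first_order_model(SITES)
print(MODEL[1]["T"])
```

The number of parameters grows by a factor of four with each additional order, so higher orders need correspondingly more binding sites to estimate reliably.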
Maximal dependence decomposition
Maximal dependence decomposition (MDD) models the dependence between any two positions. It estimates the extent to which the nucleotide $b_j$ at position j depends on the nucleotide $b_i$ at position i. MDD uses the $\chi^2$ test to determine whether position j depends on position i.
For each position i, we divide the binding sites into two groups:
$C_i$: binding sites having the consensus base at i;
$\bar{C}_i$: binding sites having a non-consensus base at i.
Example sites: TACGAT, TATAAT, GATACT, TATGAT, TATGTT, TATAGT
Consensus bases: T A T A A T
Non-consensus bases by position: 1: G; 2: none; 3: C; 4: G; 5: C, G, T; 6: none
[Figure: the example sites with positions i and j marked, split into $C_i$ = {TACGAT, TATAAT, TATGAT} and $\bar{C}_i$ = {GATACT, TATGTT, TATAGT}.]
Maximal dependence decomposition
Let $f_b$ be the probability of base b at position j in the binding sites in $\bar{C}_i$. Let N and $N_b$ be the total number of binding sites in $C_i$ and the count of base b at position j in $C_i$, respectively. Then the $\chi^2$ statistic is defined as
$$\chi^2 = \sum_{b \in \{A,C,G,T\}} \frac{(N_b - N f_b)^2}{N f_b}$$
[Figure: the example sites split into $C_i$ (N binding sites, with counts $N_A, N_C, N_G, N_T$ at position j) and $\bar{C}_i$ (with probabilities $f_A, f_C, f_G, f_T$ at position j).]
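The statistic can be sketched as below, reading the slide as a goodness-of-fit test: observed counts N_b come from the consensus group and expected counts N * f_b from the non-consensus group. Two details are assumptions not stated on the slide: a pseudocount is added when estimating f_b so that an expected count is never zero, and the function returns 0 when either group is empty.

```python
SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def consensus_base(sites, i):
    """Most frequent base at position i (ties broken in BASES order, an assumption)."""
    return max(BASES, key=lambda b: sum(s[i] == b for s in sites))

def chi_square(sites, i, j, pseudocount=1.0):
    """Chi-square of position j given the consensus/non-consensus split at position i."""
    cons = consensus_base(sites, i)
    c = [s for s in sites if s[i] == cons]        # consensus group
    c_bar = [s for s in sites if s[i] != cons]    # non-consensus group
    if not c or not c_bar:
        return 0.0  # no partition, so no measurable dependence
    total = 0.0
    for b in BASES:
        # Pseudocount keeps the expected count strictly positive (assumption).
        f_b = (sum(s[j] == b for s in c_bar) + pseudocount) / (len(c_bar) + 4 * pseudocount)
        expected = len(c) * f_b
        observed = sum(s[j] == b for s in c)
        total += (observed - expected) ** 2 / expected
    return total

# Dependence of the second position on the fifth (0-based indices 1 and 4).
print(chi_square(SITES, 4, 1))
```

With only six sites the values are not statistically meaningful; the example only shows how the two groups enter the formula.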
Maximal dependence decomposition
This $\chi^2$ statistic describes the dependence of position j on position i, and is denoted $\chi^2(j|i)$. The MDD approach proceeds iteratively as follows.
1. For each position i, compute $S_i = \sum_{j \neq i} \chi^2(j|i)$;
2. Among all the positions, select the position i with maximum $S_i$, and partition the sequences into two groups $C_i$ and $\bar{C}_i$;
3. Repeat steps 1 and 2 separately for $C_i$ and $\bar{C}_i$;
4. Stop if there is no significant dependence or if there is an insufficient number of binding sites in $C_i$ or $\bar{C}_i$. In either case, construct a standard PSWM for the remaining subset of binding sites.
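The four steps can be sketched as a recursive partition. Several choices here are assumptions rather than part of the slide: "significant dependence" is approximated by a user-supplied positive cutoff `threshold` on S_i, "insufficient" means fewer than `min_sites` sites, a pseudocount is used when estimating f_b, and leaves simply hold the subset of sites from which a standard PSWM would be built. `DEMO_SITES` is a toy data set invented for the example.

```python
BASES = "ACGT"

def split(sites, i):
    """Partition sites by whether they carry the consensus base at position i."""
    cons = max(BASES, key=lambda b: sum(s[i] == b for s in sites))
    c = [s for s in sites if s[i] == cons]
    c_bar = [s for s in sites if s[i] != cons]
    return cons, c, c_bar

def chi_square(sites, i, j, pseudocount=1.0):
    """Chi-square of position j given the split at position i (0 if no split)."""
    _, c, c_bar = split(sites, i)
    if not c or not c_bar:
        return 0.0
    total = 0.0
    for b in BASES:
        f_b = (sum(s[j] == b for s in c_bar) + pseudocount) / (len(c_bar) + 4 * pseudocount)
        expected = len(c) * f_b
        total += (sum(s[j] == b for s in c) - expected) ** 2 / expected
    return total

def mdd(sites, threshold, min_sites=4):
    """Return a nested dict; leaves are {'leaf': sites} for a standard PSWM."""
    length = len(sites[0])
    if len(sites) < min_sites:
        return {"leaf": sites}
    # Step 1: S_i is the sum of chi-square(j | i) over all other positions j.
    scores = {i: sum(chi_square(sites, i, j) for j in range(length) if j != i)
              for i in range(length)}
    # Step 2: pick the position with maximum S_i.
    best = max(scores, key=scores.get)
    if scores[best] <= 0 or scores[best] < threshold:
        return {"leaf": sites}  # Step 4: no sufficient dependence
    cons, c, c_bar = split(sites, best)
    # Step 3: recurse on both groups.
    return {"position": best, "consensus": cons,
            "match": mdd(c, threshold, min_sites),
            "mismatch": mdd(c_bar, threshold, min_sites)}

DEMO_SITES = ["AA"] * 6 + ["CC"] * 6  # toy data with perfectly linked positions
tree = mdd(DEMO_SITES, threshold=10.0)
print(tree)
```

Scoring a new sequence would walk this tree from the root, following "match" or "mismatch" at each split position, and then apply the PSWM built at the leaf that is reached.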
Maximal dependence decomposition
Illustration of the MDD procedure: modeling
[Figure: a set of binding sites (AACGTG, AGGCTG, AGCTTT, TACGTG, CACGGT, GATGGG, ...) is partitioned at position 1 (maximum $S_1$); one resulting subset is partitioned again at position 3 (maximum $S_3$) and modeled by PSWM1 and PSWM2; partitioning stops where there is insufficient dependence.]
Maximal dependence decomposition
Illustration of the MDD procedure: scoring
[Figure: the same decomposition tree as in the modeling illustration, with splits at position 1 (maximum $S_1$) and position 3 (maximum $S_3$) and leaves PSWM1 and PSWM2.]
Example: X = AAGGTG. Position 1 has the consensus base 'A'; position 3 has the non-consensus base 'G'; therefore score X using PSWM2.
Modeling and detecting arbitrary dependencies
We can also use a digraph to model the dependence among the positions.
[Figure: four example digraphs, panels a to d, each over the nodes S1, S2, S3, S4.]
Searching for novel binding sites using a PSWM
Scan a sequence using a sliding window of the length of the PSWM, and return the windows that have a significantly high score.
...GAGTTATAATTAAGA...
The significance of a score S can be computed as an empirical p-value, or as the normalized score
$$\frac{S - S_{\min}}{S_{\max} - S_{\min}},$$
where $S_{\min}$ and $S_{\max}$ are the minimal and maximal scores attainable with the PSWM.
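A sliding-window scan can be sketched as below, using the example sequence from this slide. The sketch assumes the normalized score (S - S_min) / (S_max - S_min), a hypothetical cutoff of 0.9 on that value, and a weight matrix built from the six example sites with pseudocount 1 and a uniform background. It handles only the bases A, C, G, T.

```python
import math

SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def pswm(sites, k=1.0, bg=0.25):
    """Log2-odds weights with pseudocount k and a uniform background (assumptions)."""
    n, length = len(sites), len(sites[0])
    return {b: [math.log2((sum(s[i] == b for s in sites) + k) / (n + 4 * k) / bg)
                for i in range(length)] for b in BASES}

def scan(sequence, weights, cutoff=0.9):
    """Return (start, window, normalized score) for windows at or above the cutoff."""
    length = len(weights["A"])
    s_max = sum(max(weights[b][j] for b in BASES) for j in range(length))
    s_min = sum(min(weights[b][j] for b in BASES) for j in range(length))
    hits = []
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        s = sum(weights[base][j] for j, base in enumerate(window))
        relative = (s - s_min) / (s_max - s_min)
        if relative >= cutoff:
            hits.append((start, window, relative))
    return hits

W = pswm(SITES)
hits = scan("GAGTTATAATTAAGA", W)
print(hits)
```

An empirical p-value would instead compare each window score with scores obtained on background or shuffled sequences; that alternative is not shown here.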
De novo prediction of TF binding sites
The motif-finding problem: since there are usually no fixed patterns of cis-regulatory elements of a TF, a cis-regulatory element can only be predicted by comparing a set of sequences that are likely to contain binding sites of the same TF. The problem of finding cis-regulatory elements in a given set of sequences is called the motif-finding problem.
Currently, all sequence-based motif-finding algorithms are based on the assumption that the binding sites of a TF are more conserved than the flanking sequences in a genome.
A large number of motif-finding algorithms have been developed:
1. Greedy algorithms: CONSENSUS, DREME
2. Probabilistic algorithms: MEME, BioProspector
3. Graph-theoretic algorithms: CUBIC, MotifClick
4. ……
Methods for finding a set of intergenic sequences for motif-finding
One genome, multiple genes approach: identify a set of co-regulated genes from an organism of interest through clustering analysis of gene expression profiles.
[Figure: intergenic regions $I_A$, $I_B$, $I_C$, $I_D$, $I_E$, $I_F$ of the co-regulated genes are used as input for motif finding.]
Methods for finding a set of intergenic sequences for motif-finding
One gene, multiple genomes approach (phylogenetic footprinting): in closely related species, both the coding sequences and the cis-regulatory elements of orthologous genes are often conserved.
[Figure: an operon and the homologous operon from another genome, showing conserved TFBSs and genes.]
Phylogenetic footprinting
[Figure: workflow. Orthologue identification for target genes T.g_1 … T.g_m across genomes G_1 … G_n (G_1.g_1, G_2.g_1, …, G_n.g_m), then extraction of intergenic regions, then motif finding, yielding PSWM_1 … PSWM_m and the predicted binding sites.]
Additional hallmarks of functional TF binding sites
In higher eukaryotes, genes are regulated by multiple TFs binding to closely spaced clusters of their respective binding sites. These clusters of binding sites of the same and/or different TFs are called cis-regulatory modules (CRMs). CRMs can be in different orientations; can be located upstream of, downstream of, or in an intron of a gene; can be very far away from the target gene; and can even be on a different chromosome.
Borok M J et al. Development 2010;137:5-13
Wyeth W. Wasserman & Albin Sandelin Nature Reviews Genetics 2004; 5,