MotifClick: finding cis-regulatory k-length motifs in cliques of 2(k-1)-mers. Shaoqiang Zhang, April 3, 2013
Gene regulation in prokaryotes. [Figure: an operon (Gene1, Gene2, Gene3) with its promoter region, transcription start site (TSS, +1), UTR, and terminator; transcription factors TF 1 and TF 2 and their binding sites (cis-regulatory elements); transcription produces the mRNA.]
Transcription factor binding sites (TFBS). [Figure, two panels. Panel 1: co-regulated genes (a regulon) in a single genome; one TF binds sites BS1, BS2, BS3 upstream of Gene1 to Gene6. Panel 2: phylogenetic footprinting technique; binding sites BS1, BS2, BS3 upstream of orthologous genes in Genome1 to Genome5.] The aligned binding sites, for example TGTGAGATAGATCACA, CATGATTTAAATCGCA, ..., TGTGATCAACATCACA, form a cis-regulatory motif (binding site motif), shown as a motif logo.
Motif finding from co-regulated/orthologous genes. Many motif finding programs have been developed, such as MEME, BioProspector, MotifSampler, MotifCut, MDscan, Weeder, and CONSENSUS. We have also developed a motif finding program, MotifClick. [Figure: coverage of known BSs versus the top number of output motifs; series labeled All, MEME, BioProspector, CUBIC, MotifSampler, MDscan, Weeder, CONSENSUS.]
The binding sites of a TF may be divided into distinct sub-motifs. MotifClick: find the sub-motifs as cliques, then merge the cliques.
Previous works
BOBRO
1. Graph construction: G=(V,E), an un-weighted graph, where V={candidate motif segments} and E={for each pair of input sequences, the top 10 pairs of segments with the largest numbers of conserved segments in the input seqs}.
2. Find a clique starting from an edge.
3. Expand each clique to a closure by adding candidate segments.
4. Sort the motif closures in p-value order.
MotifCut
1. Graph construction: G=(V,E,W), a weighted graph, where V={all k-mers}, E={each pair of k-mers}, and W={the probability that two k-mers belong to the same motif under the nucleotide background distribution}.
2. Find the maximum density subgraph (max-flow min-cut algorithm).
3. Refine the density subgraph.
4. Sort the motifs in the order in which the maximum density subgraphs were constructed.
Main idea
Weighted graph: reduce the scale of the constructed graph by using 2(k-1)-mers.
Edge weight: use the match number and consider the background.
Clique finding: use the program we designed in GLECLUBS (find a clique from each node).
Expansion: expand cliques into quasi-cliques to include more segments.
Rank: based on the size of the cliques.
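Graph construction: vertex set. Input: a set of N sequences s_1, ..., s_i, ..., s_N. Each sequence is cut into segments of length 2(k-1) with step length k-1, so consecutive segments overlap by k-1 bases and each k-mer is located in exactly one 2(k-1)-mer. The size of the last segment of a sequence is in [k, 2(k-1)].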
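The vertex construction can be sketched in a few lines. This is only an illustration of the window/step rule on the slide, not MotifClick's own C++ code; the function name `split_into_segments` is hypothetical.

```python
def split_into_segments(seq, k):
    """Cut seq into overlapping 2(k-1)-mers with step length k-1.

    Returns a list of (start, segment) pairs. The last segment may be
    shorter than 2(k-1), but it always holds at least k bases.
    Assumes k >= 2.
    """
    step = k - 1
    segments = []
    start = 0
    # Open a new window only while at least one whole k-mer still fits.
    while len(seq) - start >= k:
        segments.append((start, seq[start:start + 2 * step]))
        start += step
    return segments


example_seq = "ACGTACGTACGTACGTACGT"
example_segments = split_into_segments(example_seq, 4)
```

With k=4 the windows have length 6 and start every 3 bases, so a k-mer starting at position p falls entirely inside the single window that starts at 3*floor(p/3).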
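Graph construction: edge set. For each pair of 2(k-1)-mers M’ and M”, calculate the maximum match number over the k-mers a in M’ and b in M” (formula missing from the slide text). The sum of squared distance (SSD) is computed from the probability of each base in a binding site. [Figure: SSD of known E. coli binding sites.] If the max match number >= cutoff and the two k-mers a and b with the max matches satisfy the SSD condition (condition missing from the slide text), then link M’ and M” with an edge.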
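A minimal sketch of the maximum match number between two 2(k-1)-mers, assuming that a "match" is simply an identical base at the same offset of two k-mers. The slide's own formula is not preserved and the main idea mentions a background correction, so the plain count below is an assumption; the names `count_matches` and `max_match` are hypothetical.

```python
def count_matches(a, b):
    """Number of positions at which two equal-length k-mers agree."""
    return sum(x == y for x, y in zip(a, b))


def max_match(seg1, seg2, k):
    """Best match count over all k-mer pairs (a in seg1, b in seg2).

    Returns (matches, a, b) for the first pair reaching the maximum.
    Assumption: no background weighting is applied.
    """
    best = (-1, None, None)
    for i in range(len(seg1) - k + 1):
        a = seg1[i:i + k]
        for j in range(len(seg2) - k + 1):
            b = seg2[j:j + k]
            m = count_matches(a, b)
            if m > best[0]:
                best = (m, a, b)
    return best
```

For example, `max_match("ACGTAC", "TTACGT", 4)` finds the shared 4-mer ACGT, so an edge would be drawn for any cutoff up to 4, subject to the SSD condition.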
How to select the cutoffs? Randomly select a k-mer in the input sequence set s_1, ..., s_i, ..., s_N and, in each sequence, find the k-mer having the max matches with it. Keep 95% of these k-mers by deleting the 5% with the fewest matches, and calculate the average match number of the remaining 95%. Sampling times = max{10, N/4}. The cutoff is set from this average match number. NOTE: the cutoff can be amended later.
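The sampling procedure can be sketched as follows. Several details are assumptions because the slide does not state them: the sampled sequence itself is skipped, N/4 is taken as integer division, the kept fraction is rounded to the nearest integer, and the per-sample averages are combined by a plain mean. The function names are hypothetical, and at least two input sequences are assumed.

```python
import random


def best_match_in_seq(kmer, seq):
    """Largest number of matching positions between kmer and any k-mer of seq."""
    k = len(kmer)
    return max(
        sum(x == y for x, y in zip(kmer, seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    )


def estimate_average_match(seqs, k, rng):
    """Estimate the average match number by random sampling (a sketch)."""
    n_samples = max(10, len(seqs) // 4)  # sampling times = max{10, N/4}
    estimates = []
    for _ in range(n_samples):
        s = rng.randrange(len(seqs))
        p = rng.randrange(len(seqs[s]) - k + 1)
        kmer = seqs[s][p:p + k]
        # Best-matching k-mer in every other sequence, highest score first.
        scores = sorted(
            (best_match_in_seq(kmer, t) for i, t in enumerate(seqs) if i != s),
            reverse=True,
        )
        # Keep 95% of the k-mers by dropping the lowest-scoring 5%.
        keep = max(1, int(round(0.95 * len(scores))))
        kept = scores[:keep]
        estimates.append(sum(kept) / len(kept))
    return sum(estimates) / len(estimates)
```

A call such as `estimate_average_match(seqs, 8, random.Random(0))` returns a value between 0 and k; the match cutoff is then derived from that estimate.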
Graph construction: G=(V,E). [Figure: 2(k-1)-mers of the sequences s_1, ..., s_i, s_j, ..., s_N linked by edges.]
MotifCut: max density subgraphs.
BOBRO: maximal clique starting from an edge.
MotifClick: maximal cliques starting from each node.
Graph construction: G=(V,E). We can correct the cutoff by calculating the graph density. If the graph density > 100, set cutoff = cutoff + 1 and update the graph, repeating until density <= 100. [Figure: the graph at cutoff=10 and at cutoff=11.]
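A sketch of the cutoff correction loop. The slide does not define graph density, so the ratio |E|/|V| used here is an assumption, as are the function name `tighten_cutoff` and the edge representation (a dict from vertex pairs to match numbers).

```python
def tighten_cutoff(num_vertices, edges, cutoff, max_density=100.0):
    """Raise the match cutoff until the graph is sparse enough.

    edges maps a vertex pair to its match number; every stored edge is
    assumed to have matches >= cutoff. Density is ASSUMED to be |E|/|V|.
    Returns the pruned edge dict and the final cutoff.
    """
    edges = dict(edges)
    while edges and len(edges) / num_vertices > max_density:
        # Delete the edges whose match number equals the current cutoff,
        # then raise the cutoff by one.
        edges = {e: m for e, m in edges.items() if m > cutoff}
        cutoff += 1
    return edges, cutoff
```

Because only edges with matches equal to the current cutoff are removed, the graph does not need to be rebuilt from the sequences at each step.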
Clique finding. [Figure: neighbor graph of vertex v.] Break ties by deleting the vertex with the minimum sum of weights in the induced subgraph. The cliques range from the max sum of matches to the min sum of matches. Top 1 motif: Clique1 (core) + other cliques (expansion). CliquesGroup=
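One way to read the clique-finding slide is as a greedy reduction of the neighbor graph of v. The slide only states the tie-breaking rule; the primary rule used below (remove the vertex with the fewest neighbors inside the current subgraph) is an assumption, and `clique_from_vertex` is a hypothetical name. The adjacency dict is assumed to be symmetric.

```python
def clique_from_vertex(v, adj, weight):
    """Greedily shrink the neighbor graph of v to a clique containing v.

    adj maps each vertex to the set of its neighbors; weight maps
    frozenset({u, w}) to the match number of edge (u, w).
    ASSUMED primary rule: drop the vertex with the smallest degree in
    the induced subgraph. Ties are broken as on the slide, by the
    minimum sum of edge weights in the induced subgraph.
    """
    members = {v} | adj[v]
    while True:
        stats = {}
        for u in members:
            nbrs = adj[u] & members
            stats[u] = (len(nbrs), sum(weight[frozenset((u, x))] for x in nbrs))
        # A clique is reached once every member is linked to all others.
        if all(deg == len(members) - 1 for deg, _ in stats.values()):
            return members
        worst = min(sorted(members - {v}), key=lambda u: stats[u])
        members.remove(worst)
```

Running this from every vertex gives one clique per node; the cliques can then be ordered by their sum of matches.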
Merge other cliques into Clique1. [Figure: a 5-clique and a 4-clique.] After merging some other cliques into Clique1, update the clique group by removing Clique1 and the cliques merged into it.
Gapless alignments. For all k-mers in the quasi-clique of 2(k-1)-mers, find the k-mer with the max number of neighbors, using cutoff = average match number. MUSCLE 4.0 is too strict to give ideal results. [Figure labels: k-mer, discard, max number of neighbors, final alignment.]
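A sketch of one possible gapless alignment step. The slide gives only the outline, so several points are assumptions: two k-mers count as neighbors when their match number reaches the cutoff, neighbors are counted only across different segments, each segment contributes its best-matching k-mer, and segments whose best k-mer falls below the cutoff are discarded. All names are hypothetical.

```python
def count_matches(a, b):
    """Number of positions at which two equal-length k-mers agree."""
    return sum(x == y for x, y in zip(a, b))


def gapless_align(segments, k, cutoff):
    """Pick a center k-mer and align one k-mer per segment to it (a sketch)."""
    kmers = [
        (i, seg[p:p + k])
        for i, seg in enumerate(segments)
        for p in range(len(seg) - k + 1)
    ]

    def n_neighbors(idx):
        i, a = kmers[idx]
        # ASSUMPTION: only k-mers from other segments are counted.
        return sum(1 for j, b in kmers if j != i and count_matches(a, b) >= cutoff)

    center = kmers[max(range(len(kmers)), key=n_neighbors)][1]
    aligned = []
    for seg in segments:
        candidates = (seg[p:p + k] for p in range(len(seg) - k + 1))
        best = max(candidates, key=lambda b: count_matches(center, b))
        if count_matches(center, best) >= cutoff:
            aligned.append(best)  # otherwise the segment is discarded
    return center, aligned
```

Since every aligned site is a k-mer cut directly from a segment, no gaps are introduced and the result can be stacked into a position frequency matrix.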
Main steps
1. Read the input FASTA file into a matrix.
2. Calculate the background.
3. Select the match cutoff by estimating the average match number.
4. Build the graph of 2(k-1)-mers.
5. Calculate the graph density.
6. If the graph density > density cutoff, update the graph by deleting the edges with matches = cutoff.
7. Find all cliques associated with each vertex.
8. Select the clique with the max sum of matches and merge it with other cliques.
9. Do gapless alignments on the expanded quasi-clique.
10. Update the clique group, and go back to step 8.
Flowchart of MotifClick
1. Estimate the average match number.
2. Set match cutoff = average match number + 1.
3. Build the graph of 2(k-1)-mers.
4. Graph density < 100? No: set match cutoff = cutoff + 1, update the graph, and check again. Yes: continue.
5. Find all cliques associated with each vertex.
6. Select the clique with the max sum of matches and merge it with other cliques.
7. Gapless alignments, using the average match number as cutoff.
8. Update the clique group and return to step 6.
Improvement: how many kinds of nucleotides appear in a binding site?
Kinds of nucleotides    1     2     3       4
Yeast (SGD)                   %     32.4%   63.2%
E. coli (RegulonDB)     0     1%    14%     85%
SGD: S. cerevisiae Genome Database. The yeast values for 1 and 2 kinds are missing from the slide text.
So we only search the k-mers containing at least 3 kinds of nucleotides.
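The filter on the number of nucleotide kinds is a one-line check. The function name `has_enough_kinds` is hypothetical; the threshold of 3 is the one stated on the slide.

```python
def has_enough_kinds(kmer, min_kinds=3):
    """True if the k-mer contains at least min_kinds distinct nucleotides."""
    return len(set(kmer)) >= min_kinds


# Example: keep only the candidate k-mers worth searching.
candidates = ["TTTTTTCA", "ATATATAT", "AAAAAAAA", "TGTGAGAT"]
kept = [kmer for kmer in candidates if has_enough_kinds(kmer)]
```

Here ATATATAT (2 kinds) and AAAAAAAA (1 kind) are skipped, which removes low-complexity k-mers from the search.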
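Improvement: percentage of the max length of single-nucleotide segments in BSs. Example: TTTTTTCA contains a run of six T's in eight positions, i.e. 0.75. [Figure: distribution of this percentage over known BSs.]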
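The quantity in the TTTTTTCA example can be computed as the longest single-nucleotide run divided by the k-mer length. The function name `longest_run_fraction` is hypothetical, and the slide does not say which threshold, if any, MotifClick applies to this value.

```python
from itertools import groupby


def longest_run_fraction(kmer):
    """Length of the longest single-nucleotide run, as a fraction of len(kmer)."""
    longest = max(len(list(run)) for _, run in groupby(kmer))
    return longest / len(kmer)
```

For TTTTTTCA the longest run is TTTTTT, giving 6/8 = 0.75 as on the slide.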
Sum of squared distance (SSD): SSD cutoff = 0.2. [Figure: percentage versus SSD.]
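The SSD formula itself is not preserved in the slide text. A plausible reading, used here purely as an assumption, is the sum over the four bases of the squared difference between the base frequency in the k-mer and the background frequency. The function name `ssd` is hypothetical.

```python
def ssd(kmer, background):
    """Sum of squared distance between k-mer base frequencies and background.

    ASSUMED definition: sum over b in ACGT of (freq_b - background_b)^2,
    where freq_b is the fraction of positions in kmer equal to b.
    """
    k = len(kmer)
    return sum((kmer.count(b) / k - background[b]) ** 2 for b in "ACGT")


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
```

Under this reading a k-mer whose composition equals the background has SSD 0, and a homopolymer against a uniform background has SSD 0.75.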
Command-line options
*********
USAGE:
*********
MotifClick [OPTIONS] > OutputFile
    Input: file containing DNA sequences in FASTA format
OPTIONS:
    -w  motif width (default=16)
    -n  maximum number of motifs to find (default=5)
    -b  2 to examine sites on both DNA strands (default=1, forward strand only)
    -d  upper bound of graph density (default=100)
    -s  0 for more degenerate sites (default=1, fewer sites)
*********
-s 1: match cutoff = average match number + 1
-s 0: match cutoff = average match number
Written in standard C++ and compiled with the GNU C++ compiler under Linux and Mac, and with MinGW (Minimalist GNU for Windows) under Windows (32-bit).
Synthetic data test. We compare MotifClick with the motif finding tools MEME, BioProspector, Weeder, and MotifCut. Hu et al. used the RegulonDB database to evaluate five algorithms (AlignACE, MEME, BioProspector, MDscan, and MotifSampler) for the prediction of prokaryotic binding sites, and found that MEME often achieved the best sensitivity and BioProspector often achieved the highest specificity. Tompa et al. used the TRANSFAC database to assess 13 computational tools for the discovery of transcription factor binding sites in eukaryotes, and found that Weeder was the best and that MEME was also good. Shaoqiang Zhang et al. found that MEME and BioProspector cover true BSs best, followed by CUBIC, MDscan, MotifSampler, CONSENSUS, ... We test the programs for k-mer sizes 8, 12, and 16. Weeder can only find motifs of length 6, 8, 10, or 12 (parameters: small (6, 8), medium (6, 8, 10), large (6, 8, 10, 12), extra (6-12, mainly 8, 10)).
Synthetic data test: binding-site-level accuracy
Sensitivity: Sn = TP/(TP+FN) = (number of correctly predicted BSs)/(number of actual BSs)
Specificity: Sp = TP/(TP+FP) = (number of correctly predicted BSs)/(number of predicted BSs)
Performance coefficient: PC = TP/(TP+FP+FN) = (number of correctly predicted BSs)/(number of {actual U predicted BSs})
F-measure (harmonic mean): F = 2*Sn*Sp/(Sn+Sp)
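The four measures follow directly from the counts of true positives, false positives, and false negatives. The helper below is a small sketch; its name and its convention of returning 0.0 for an undefined ratio are assumptions.

```python
def site_level_accuracy(tp, fp, fn):
    """Return (Sn, Sp, PC, F) at the binding-site level.

    Sp follows the slide's definition TP/(TP+FP).
    ASSUMPTION: a ratio with a zero denominator is reported as 0.0.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    sn = ratio(tp, tp + fn)
    sp = ratio(tp, tp + fp)
    pc = ratio(tp, tp + fp + fn)
    f = ratio(2 * sn * sp, sn + sp)
    return sn, sp, pc, f
```

For instance, with 20 actual BSs, 20 predicted BSs, and 15 of them correct (TP=15, FP=5, FN=5), the result is Sn=0.75, Sp=0.75, PC=0.6, F=0.75.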
Synthetic data test. Each motif contains 20 binding sites. The 20 BSs of a motif instance were randomly seeded into a synthetic FASTA file of 20 seqs, not necessarily one BS per seq. We generated the synthetic sets of background sequences using a 3rd-order Markov model. [Figure: motif seqs set + synthetic background seqs set.] We test on sets of 20 seqs of length 400 (400X20), 600X20, 800X20, and 1000X20.
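A sketch of how such a synthetic set could be produced: train a 3rd-order Markov model on some sequences, generate backgrounds from it, and seed the binding sites at random positions. The pseudocount of 1, the uniform choice of the first three bases, the rule that seeded sites must not overlap, and all function names are assumptions; the training sequences are assumed to contain only A, C, G, T, and all sites are assumed to have the same length.

```python
import random
from collections import defaultdict


def train_markov3(training_seqs):
    """Count next-base frequencies for every 3-base context (pseudocount 1)."""
    counts = defaultdict(lambda: {b: 1 for b in "ACGT"})
    for s in training_seqs:
        for i in range(3, len(s)):
            counts[s[i - 3:i]][s[i]] += 1
    return counts


def generate_background(model, length, rng):
    """Generate one background sequence from the 3rd-order model."""
    seq = [rng.choice("ACGT") for _ in range(3)]
    while len(seq) < length:
        w = model["".join(seq[-3:])]
        seq.append(rng.choices("ACGT", weights=[w[b] for b in "ACGT"])[0])
    return "".join(seq[:length])


def seed_sites(backgrounds, sites, rng):
    """Overwrite random, non-overlapping positions with the binding sites.

    A sequence may receive several sites or none. Returns the seeded
    sequences and the (sequence index, position) of every site.
    """
    seqs = [list(b) for b in backgrounds]
    placed = []
    for site in sites:
        while True:
            i = rng.randrange(len(seqs))
            p = rng.randrange(len(seqs[i]) - len(site) + 1)
            if all(j != i or abs(p - q) >= len(site) for j, q in placed):
                break
        seqs[i][p:p + len(site)] = site
        placed.append((i, p))
    return ["".join(s) for s in seqs], placed
```

Keeping the list of seeded positions makes it straightforward to count TP, FP, and FN for each tool's predictions afterwards.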
Synthetic data test (8-mer/octamer)
Synthetic background seqs: the dependencies of the 3rd-order Markov model were estimated from all intergenic seqs of the yeast genome. Yeast background: AT: 0.65, GC: 0.35. Motifs containing 20 BSs with information contents of 12 bits (at most 6 positions are conserved) were chosen from the SGD database.
Commands:
    meme inputfile.fasta -dna -mod anr -w 8 -nmotifs 1 -text >
    weederTFBS.out -f inputfile.fasta -W 8 -O SC -e 3 -R 50 -M -T 1
    adviser.out inputfile.fasta S
    weederlauncher.out inputfile SC medium M T1
    BioProspector -i inputfile.fasta -W 8 -d 1 -r 1 -o file.biop.out
    Motif_cuts.exe inputfile.fasta 8 1
    MotifClicker inputfile.fasta -w 8 -n 1 -s 1 >file.motifclick.out
    MotifClicker inputfile.fasta -w 8 -n 1 -s 0 >file.motifclick.out
[Callouts on the slide: SGD binding site length; binding sites count; number of mutations allowed; unfair to other tools.]
Sum of squared distance. Background seq set sizes: 400*20, 600*20, 800*20, 1000*20. Motifs are seeded into 100 instances of each size.
Synthetic data test (8-mer). [Figure panels: Average SSD=0.06; Average SSD= ; instances of 400*20 seq sets.] Note: Weeder did not output any results on the two motifs when the number of output motifs was set to “T1”, so we decided to use “T2” and consider only the top 1 motif of “T2”.
K-mer size 8 (using two motifs with SSD=0.06 and SSD=0.10, respectively, on 100 datasets). [Figure panels: Sensitivity, Specificity, PC, F-measure.]
Dodeca-mer (12-mer). Synthetic background seqs: the dependencies of the 3rd-order Markov model were estimated from all intergenic seqs of E. coli K12. Motifs containing 20 BSs with information contents of 14 bits (at most 7 positions are conserved) and an average SSD=0.02 between each BS and the background were chosen from the RegulonDB database. The motifs are seeded into 100 background seq sets. Test on 400*20, 600*20, 800*20, and 1000*20. We abandoned Weeder because its motif length can only be set to “small” (length 6 with 1 mutation, length 8 with 2 mutations), “medium” (like small, plus length 10 with 3 mutations), “large” (like medium, plus length 12 with 4 mutations), or “extra” (length 6 with 1 mutation, length 8 with 3 mutations, length 10 with 4 mutations, length 12 with 4 mutations). That is, Weeder only accepts even motif lengths between 6 and 12, and for length 12 it accepts at most 4 mutations.
K-mer size 12, seeded into 100 background seq sets. [Figure panels: Sn, Sp, PC, F-measure.]
12-mer, with added noise. [Figure panels: Sn, Sp, PC, F-measure.]
16-mer. Synthetic background seqs: the dependencies of the 3rd-order Markov model were estimated from all intergenic seqs of E. coli K12. Motifs containing 20 BSs with information contents of 16 bits (at most 8 positions are conserved) and an average SSD=0.02 between each BS and the background were chosen from the RegulonDB database. The motifs are seeded into 100 background seq sets. Test on 400*20, 600*20, 800*20, and 1000*20.
16-mer. [Figure panels: Sn, Sp, PC, F-measure.]
16-mer, with added noise. [Figure panels: Sn, Sp, PC, F-measure.]
Motif finding in Yeast (8-mer)
Motif finding in 5137 intergenic sequence sets of orthologous genes*, which contain 2932 BSs belonging to 99 TFs in SGD.
Motif finding tools   Top 1   Top 5   Top 10   Top 15   Top 20   Top 25
MotifClick            67/     /       /        /        /        /
MEME                  70/     /       /        /        /        /
MotifCut              65/     /       /        /        /        /
BioProspector         79/     /       /        /        /        /
Weeder                77/     /       /        /        /        /
*At least 3 orthologous genes for each intergenic sequence set.
Motif finding in E. coli K12 (16-mer)
E. coli K12: 2313 operon groups. RegulonDB v6.0: 122 TFs, 1411 BSs.
Tools                 Top 1    Top 5    Top 10   Top 15   Top 20   Top 25
MotifClick            331/     /        /        /        /        /
MEME                  298/     /        /        /        /        /
MotifCut              241/     /        /        /        /        /
BioProspector         354/     /        /        /        /        /
MotifClick+MEME       474/98   1029/    /        /        /        /120
BioProspector+MEME    472/92   1051/    /        /        /        /119
Weeder and CONSENSUS are the worst because they need high-quality input seq sets.
Tools                 Top 1    Top 5     Top 10    Top 15    Top 20    Top 25
MotifClick            331/85   793/      /         /         /         /117
MEME                  298/83   877/      /         /         /         /117
MotifCut              744/ /   (column positions unclear)
BioProspector         354/85   743/103   953/      /         /         /116
CUBIC                 242/75   563/98    791/108   905/109   999/      /114
MDscan                355/82   552/96    634/99    684/102   758/107   793/109
MotifSampler          168/61   486/92    612/102   729/102   792/107   831/108
Weeder                179/65   350/85    452/92    494/94    532/94    552/94
Consensus             168/63   186/68    200/74    210/76    214/76    220/76
MotifClick+MEME       474/98   1029/     /         /         /         /120
BioProspector+MEME    472/92   1051/     /         /         /         /119
Conclusions
Synthetic data: MotifCut has the highest specificity. MotifClick has the highest sensitivity. MotifClick is the most complementary to the other tools.
Yeast data and E. coli data: MotifClick and MEME have similar numbers of true predictions, and more true predictions than the other tools. MotifClick is the most complementary to the other tools.