MotifClick: cis-regulatory k-length motifs finding in cliques of 2(k-1)-mers. Shaoqiang Zhang, April 3, 2013.


1 MotifClick: cis-regulatory k-length motifs finding in cliques of 2(k-1)-mers. Shaoqiang Zhang, http://bioinfo.uncc.edu/szhang, April 3, 2013

2 Gene regulation in prokaryotes. [Figure: an operon (Gene1, Gene2, Gene3) with its promoter region (positions -300, -35, -10, and the TSS at +1), transcription factor binding sites for TF 1 and TF 2 (cis-regulatory elements), the terminator, the 3' UTR, and the transcribed mRNA.]

3 Transcription factor binding sites (TFBS). [Figure: a TF binding to sites BS1, BS2, BS3 upstream of co-regulated genes (a regulon, Gene1 to Gene6) in a single genome, and the phylogenetic footprinting technique applied to orthologous genes in Genome1 to Genome5. The binding sites, e.g. TGTGAGATAGATCACA, CATGATTTAAATCGCA, ..., TGTGATCAACATCACA, form a cis-regulatory motif (binding site motif), shown as a motif logo.]

4 Motifs
Binding sites of the example motif:
TTGTTACGTTATAACA
CGGTTATATTATAACA
CGGTTATGTTATAACA
TGGTTATGTTATAACA
CGGTTATGTTATAACA
TGGTTATGTTATAACA
TTGTTATGTTATAACG
ATGTTATATTATTACA
TTGTTATGTTATAACA
TTGTTATAGTATAACA
TTAAAATGTTATAACA
TTAATATGTTATAACA
TTGTTATAATATAACA
ATGTTACATTATAACA
CGGTTATGTTATAACA
TGGTTATGTTATAACA
TGGTTATGCTATAACA
TTAAAATGTTATAACA
TTAATATGTTATAACA
[Tables: motif frequency matrix and motif profile matrix (position weight matrix), with one row per base A, C, G, T and one column per motif position.]
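The two matrices on this slide can be illustrated with a minimal sketch. The slide does not state the exact scoring formula, so the log-odds form, the pseudocount, and the function names below are assumptions made for illustration only.

```python
import math

def frequency_matrix(sites):
    """Count each base at each position of equal-length binding sites."""
    width = len(sites[0])
    counts = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for i, base in enumerate(site):
            counts[base][i] += 1
    return counts

def position_weight_matrix(counts, background, pseudocount=1.0):
    """Log2-odds of the observed base frequency against the background.

    Assumption: a background-weighted pseudocount keeps zero counts finite.
    """
    n_sites = sum(counts[base][0] for base in "ACGT")
    pwm = {}
    for base in "ACGT":
        q = background[base]
        pwm[base] = [
            math.log2((c + pseudocount * q) / (n_sites + pseudocount) / q)
            for c in counts[base]
        ]
    return pwm

sites = ["ACGT", "ACGA", "ACTT", "TCGT"]
uniform = {base: 0.25 for base in "ACGT"}
counts = frequency_matrix(sites)
pwm = position_weight_matrix(counts, uniform)
```

Conserved positions get large positive scores and absent bases get strongly negative ones, which is the pattern visible in the slide's profile matrix.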

5 Motif finding from co-regulated/orthologous genes. Many motif-finding programs have been developed, such as MEME, BioProspector, MotifSampler, MotifCut, MDscan, Weeder, and CONSENSUS. We have also developed a motif-finding program, MotifClick: http://motifclick.uncc.edu [Figure: coverage of known BSs versus the top number of output motifs for MEME, BioProspector, CUBIC, MotifSampler, MDscan, Weeder, CONSENSUS, and all tools combined.]

6 The binding sites of a TF may be divided into distinct sub-motifs. Merge cliques MotifClick: sub-motifs

7 Previous works
BOBRO:
- Graph construction: G=(V,E), un-weighted graph, where V={candidate motif segments} and E={for each pair of input sequences, top 10 pairs of segments with the largest numbers of conserved segments in the input seqs}
- Finding a clique from an edge
- Expand each clique to a closure by adding candidate segments
- Sort motif closures in p-value order
MotifCut:
- Graph construction: G=(V,E,W), weighted graph, where V={all k-mers}, E={each pair of k-mers}, and W={the probability that two k-mers belong to the same motif under the nucleotide background distribution}
- Maximum density subgraph finding (max-flow min-cut algorithm)
- Refine the density subgraph
- Sort motifs in the order of constructing maximum density graphs

8 Main idea Weighted graph: reduce constructed graph scale by using 2(k-1)-mers. Edge weight: use match number and consider the background. Clique finding: use the program we designed in GLECLUBS (find clique from each node). Expansion: expand cliques into quasi-cliques to include more segments. Rank: based on the size of cliques.

9 Graph construction: Vertex set. Input a set of N sequences s_1, ..., s_i, ..., s_N. Each sequence is cut into 2(k-1)-mers with step length k-1. Each k-mer is located in exactly one 2(k-1)-mer. The size of the last 2(k-1)-mer of a sequence is in [k, 2(k-1)].
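The windowing rule for the vertex set can be sketched as follows. This is a minimal illustration, assuming 0-based positions and k >= 2; the function name is hypothetical and not taken from MotifClick.

```python
def split_into_windows(seq, k):
    """Cut seq into 2(k-1)-mers with step k-1.

    A new window starts as long as at least k bases remain, so the last
    window has a length between k and 2(k-1).
    """
    step = k - 1
    width = 2 * (k - 1)
    windows = []
    start = 0
    while len(seq) - start >= k:
        windows.append((start, seq[start:start + width]))
        start += step
    return windows

windows = split_into_windows("ACGTACGTAC", 4)
```

With k=4 the windows are 6 bases wide and start every 3 bases, so a k-mer starting at position p is fully contained only in the window whose start is the largest multiple of k-1 not exceeding p.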

10 Graph construction: Edge set. For each pair of 2(k-1)-mers M' and M'', calculate the maximum match number between a k-mer a in M' and a k-mer b in M''. If the max match number >= cutoff and the two k-mers a and b with the max matches satisfy the sum of squared distance (SSD) condition, which is based on the probability of each base in a binding site, then link M' and M'' with an edge. [Figure: SSD of E. coli known binding sites, axis range 0.02 to 0.2.]
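A minimal sketch of the maximum match number between two 2(k-1)-mers, assuming that a "match" is a position where two aligned k-mers carry the same base. The function names are hypothetical, and the SSD condition on a and b is left out because its formula is not preserved in the transcript.

```python
def kmers(window, k):
    """All k-mers contained in a window."""
    return [window[i:i + k] for i in range(len(window) - k + 1)]

def match_number(a, b):
    """Number of positions at which two equal-length k-mers agree."""
    return sum(x == y for x, y in zip(a, b))

def max_match(m1, m2, k):
    """Best (matches, a, b) over all k-mers a in m1 and b in m2."""
    best = (-1, None, None)
    for a in kmers(m1, k):
        for b in kmers(m2, k):
            m = match_number(a, b)
            if m > best[0]:
                best = (m, a, b)
    return best

result = max_match("ACGTAC", "TTGTAC", 4)
```

Under these assumptions, two windows would be linked when the first element of the result reaches the match cutoff and the SSD condition also holds.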

11 How to select the cutoffs? Randomly select a k-mer in the input seqs set and find, in each sequence s_1, ..., s_i, ..., s_N, the k-mer having the max matches with it. Delete the 5% of these k-mers with the fewest matches, keep the remaining 95%, and calculate the average match number of the kept k-mers. The cutoff estimate is this average match number. Sampling times = max{10, N/4}. NOTE: the cutoff can be amended later.
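The sampling procedure can be sketched as below. Several details are assumptions rather than facts from the slide: N/4 is taken as integer division, the sampled sequence itself is included among the N sequences, the per-sample averages are averaged again, and every sequence is assumed to be at least k bases long. The function names are hypothetical.

```python
import random

def best_match_in_seq(kmer, seq):
    """Max number of matching positions between kmer and any k-mer of seq."""
    k = len(kmer)
    return max(
        sum(x == y for x, y in zip(kmer, seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    )

def estimate_average_match(seqs, k, rng):
    """Estimate the average match number by random sampling."""
    n_samples = max(10, len(seqs) // 4)
    sample_means = []
    for _ in range(n_samples):
        source = rng.choice(seqs)
        start = rng.randrange(len(source) - k + 1)
        kmer = source[start:start + k]
        scores = sorted(best_match_in_seq(kmer, s) for s in seqs)
        n_drop = int(0.05 * len(scores))  # discard the lowest 5%
        kept = scores[n_drop:]
        sample_means.append(sum(kept) / len(kept))
    return sum(sample_means) / len(sample_means)

identical = ["ACGTACGTAC"] * 4
estimate = estimate_average_match(identical, 4, random.Random(0))
```

For identical input sequences every sampled k-mer is found exactly in every sequence, so the estimate equals k.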

12 Graph construction: G=(V,E). [Figure: graph over the 2(k-1)-mers of sequences s_1, ..., s_i, ..., s_j, ..., s_N.] MotifCut: max density subgraphs. BOBRO: maximal clique starting from an edge. MotifClick: maximal cliques starting from each node.

13 Graph construction: G=(V,E). We can correct the cutoff by calculating the graph density. If the graph density > 100, set cutoff = cutoff + 1 and update the graph, repeating until density <= 100. [Figure: the same graph at cutoff=10 and at cutoff=11.]
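The density correction can be sketched as a loop that raises the cutoff and drops the edges that no longer reach it. The slide does not define graph density, so edges per vertex is used here purely as an assumption, and the function name and edge representation are hypothetical.

```python
def tighten_cutoff(edges, n_vertices, cutoff, max_density=100.0):
    """Raise the match cutoff until the graph is sparse enough.

    edges maps (u, v) pairs to their max match number.
    Assumption: density = number of edges / number of vertices.
    """
    kept = {e: m for e, m in edges.items() if m >= cutoff}
    while kept and len(kept) / n_vertices > max_density:
        cutoff += 1
        kept = {e: m for e, m in kept.items() if m >= cutoff}
    return cutoff, kept

edges = {(0, 1): 10, (0, 2): 10, (1, 2): 11, (2, 3): 12}
new_cutoff, new_edges = tighten_cutoff(edges, 4, 10, max_density=0.5)
```

In this toy graph the cutoff moves from 10 to 11, which removes the two edges whose match number equals the old cutoff.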

14 Cliques finding. [Figure: neighbor graph of vertex v.] Break ties by deleting the vertex with the minimum sum of weights in the induced subgraph. CliquesGroup = [cliques ordered from max sum of matches to min sum of matches]. Top 1 motif: Clique1 (core) + other cliques (expansion).
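One way to picture clique finding from a vertex is a greedy peeling of its neighbor graph. This is a sketch under stated assumptions, not the GLECLUBS routine itself: the primary deletion criterion (lowest degree inside the induced subgraph) is an assumption, only the tie-break by minimum sum of weights comes from the slide, and vertices are assumed to be sortable.

```python
def greedy_clique(v, adj):
    """Peel the neighbor graph of v until the remaining vertices form a clique.

    adj maps each vertex to a dict {neighbor: weight} and is symmetric.
    Assumption: delete the lowest-degree vertex first; ties are broken by
    the minimum sum of edge weights inside the induced subgraph.
    """
    members = {v} | set(adj[v])
    while True:
        stats = {}
        for u in members:
            inside = [w for w in adj[u] if w in members]
            stats[u] = (len(inside), sum(adj[u][w] for w in inside))
        if all(deg == len(members) - 1 for deg, _ in stats.values()):
            return members
        candidates = sorted(u for u in members if u != v)  # never drop the seed
        worst = min(candidates, key=lambda u: stats[u])
        members.remove(worst)

adj = {
    0: {1: 10, 2: 11, 3: 12},
    1: {0: 10, 2: 10},
    2: {0: 11, 1: 10, 3: 12},
    3: {0: 12, 2: 12},
}
clique = greedy_clique(0, adj)
```

Vertices 1 and 3 both have degree 2 in the neighbor graph of vertex 0, and vertex 1 is deleted because its weight sum (20) is lower than that of vertex 3 (24).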

15 Merge other cliques into Clique1. [Figure: merging a 5-clique or a 4-clique into Clique1.] After merging some other cliques into Clique1, update the cliques group by removing Clique1 and the cliques merged into Clique1.

16 Gapless alignments. For all k-mers in the quasi-clique of 2(k-1)-mers, find the k-mer with the max number of neighbors. Cutoff = average match number. MUSCLE 4.0: too strict to get ideal results. [Figure labels: k-mer discard, max number of neighbors, final alignment.]

17 Main steps
1. Read the input FASTA file into a matrix.
2. Calculate the background.
3. Select the match cutoff by estimating the average match number.
4. Build the graph of 2(k-1)-mers.
5. Calculate the graph density.
6. If graph density > density cutoff, update the graph by deleting the edges with matches = cutoff.
7. Find all cliques associated with each vertex.
8. Select the clique with the max sum of matches and merge it with other cliques.
9. Do gapless alignments on the expanded quasi-clique.
10. Update the clique group, and go back to step 8.

18 Flowchart of MotifClick
Estimate average match number
-> Set match cutoff = average match num + 1
-> Build graph of 2(k-1)-mers
-> Graph density < 100?
   No: set match cutoff = cutoff + 1, update graph, and check the density again
   Yes: continue
-> Find all cliques associated with each vertex
-> Select the clique with max sum of matches and merge it with other cliques
-> Gapless alignments using average match number as cutoff
-> Update clique group

19 Improvement. How many kinds of nucleotides appear in a binding site?
Kinds of nucleotides | 1 | 2 | 3 | 4
Yeast (SGD) | 0 | 4.4% | 32.4% | 63.2%
E. coli (RegulonDB) | 0 | 1% | 14% | 85%
SGD (S. cerevisiae Genome Database): http://www.yeastgenome.org
So, we only search the k-mers containing at least 3 kinds of nucleotides.
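The filter on nucleotide kinds is simple enough to show directly. A minimal sketch, with a hypothetical function name:

```python
def has_enough_nucleotide_kinds(kmer, min_kinds=3):
    """True if the k-mer contains at least min_kinds distinct nucleotides."""
    return len(set(kmer.upper())) >= min_kinds

candidates = ["TTTTTTCA", "ATATATAT", "ACGTACGT", "GGGGGGGG"]
kept = [kmer for kmer in candidates if has_enough_nucleotide_kinds(kmer)]
```

Only k-mers built from three or four different bases survive, which matches where most known binding sites fall in the table above.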

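20 Improvement. Percent of max length of single-nucleotide segments in BSs. Example: in TTTTTTCA the longest single-nucleotide segment covers 6 of 8 bases, i.e. 0.75. [Figure: distribution of this percentage over known BSs.]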
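The quantity on this slide, the longest single-nucleotide run as a fraction of the site length, can be computed as in this minimal sketch. The function name is hypothetical and the site is assumed to be non-empty.

```python
from itertools import groupby

def max_run_fraction(site):
    """Length of the longest run of one repeated base, divided by site length."""
    longest = max(len(list(group)) for _, group in groupby(site))
    return longest / len(site)

fraction = max_run_fraction("TTTTTTCA")
```

For the slide's example TTTTTTCA the run of six T's gives 6/8.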

21 Sum of squared distance (SSD). [Figure: SSD distribution, axis ticks from 0.02 to 0.22.] SSD cutoff = 0.2
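The transcript does not preserve the SSD formula. A plausible reading, given that the slides speak of the probability of each base in a binding site and of the SSD between a binding site and the background, is the sum over the four bases of the squared difference between the site's base frequency and the background frequency. The sketch below implements that assumed definition with a hypothetical function name.

```python
from collections import Counter

def ssd(kmer, background):
    """Assumed SSD: sum over bases of (frequency in kmer - background)^2."""
    counts = Counter(kmer)
    return sum(
        (counts[base] / len(kmer) - background[base]) ** 2
        for base in "ACGT"
    )

uniform = {base: 0.25 for base in "ACGT"}
balanced = ssd("ACGT", uniform)
skewed = ssd("AAAA", uniform)
```

Under this definition a site with background-like composition scores 0, and a site made of a single repeated base scores highest.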

22 [Figure: percentage versus SSD.]

23 Command-line options
********* USAGE: *********
MotifClick [OPTIONS] > OutputFile
Input: file containing DNA sequences in FASTA format
OPTIONS:
  -w  motif width (default=16)
  -n  maximum number of motifs to find (default=5)
  -b  2 to examine sites on both DNA strands (default=1, forward strand only)
  -d  upper bound of graph density (default=100)
  -s  0 for more degenerate sites (default=1 for fewer sites)
*********
-s 1: match cutoff = average match number + 1
-s 0: match cutoff = average match number
Written in standard C++ and compiled with the GNU C++ compiler under Linux and Mac, and with MinGW (Minimalist GNU for Windows) under Windows (32-bit). http://bioinfo.uncc.edu/szhang/computing.htm

24 Synthetic data test. Compare with the motif-finding tools MEME, BioProspector, Weeder, and MotifCut. Hu et al. used the RegulonDB database to evaluate five algorithms (AlignACE, MEME, BioProspector, MDscan, and MotifSampler) for the prediction of prokaryotic binding sites, and found that MEME often achieved the best sensitivity and BioProspector often achieved the highest specificity. Tompa et al. used the TRANSFAC database to assess 13 computational tools for the discovery of transcription factor binding sites in eukaryotes, and found that Weeder was the best and that MEME was also good. Shaoqiang Zhang et al. found that MEME and BioProspector cover true BSs, then CUBIC, MDscan, MotifSampler, and CONSENSUS. We test the programs for k-mer sizes 8, 12, and 16. Weeder can only find motifs with lengths 6, 8, 10, and 12 (parameters: small (6, 8), medium (6, 8, 10), large (6, 8, 10, 12), extra (6-12, mainly 8, 10)).

25 Synthetic data test. Binding sites level accuracy:
Sensitivity: Sn = TP/(TP+FN) = (number of correctly predicted BSs)/(number of actual BSs)
Specificity: Sp = TP/(TP+FP) = (number of correctly predicted BSs)/(number of predicted BSs)
Performance coefficient: PC = TP/(TP+FP+FN) = (number of correctly predicted BSs)/(number of {actual U predicted BSs})
F-measure/Harmonic mean: F = 2*Sn*Sp/(Sn+Sp)
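The four site-level measures translate directly into code. In this minimal sketch a predicted site counts as correct only if it is identical to an actual site; that matching rule, the site identifiers, and the function name are assumptions for illustration.

```python
def site_level_metrics(predicted, actual):
    """Sn, Sp, PC and F-measure from sets of predicted and actual sites."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # correctly predicted BSs
    fp = len(predicted - actual)   # predicted but not actual
    fn = len(actual - predicted)   # actual but missed
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    pc = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    f = 2 * sn * sp / (sn + sp) if sn + sp else 0.0
    return {"Sn": sn, "Sp": sp, "PC": pc, "F": f}

actual_sites = set(range(20))         # 20 seeded BSs, labeled 0..19
predicted_sites = set(range(5, 25))   # 15 correct, 5 spurious
metrics = site_level_metrics(predicted_sites, actual_sites)
```

With 15 true positives, 5 false positives and 5 false negatives, Sn and Sp are both 0.75, PC is 0.6 and F is 0.75.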

26 Synthetic data test. A motif containing 20 binding sites: the motif instance of 20 BSs was randomly seeded into a synthetic FASTA file of 20 seqs, not necessarily one BS per seq. We generated synthetic sets of background sequences using a 3rd-order Markov model. [Figure: motif seqs set seeded into the synthetic background seqs set.] We will test on 20 seqs of length 400 (400 X 20), 600 X 20, 800 X 20, and 1000 X 20.
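A 3rd-order Markov background generator can be sketched as follows: each new base is drawn from the distribution observed after the preceding three bases in the training sequences. The function names, the uniform fallback for unseen contexts, and the toy training sequence are assumptions; the slides estimate the model from real intergenic sequences. Training sequences are assumed to be at least 4 bases long.

```python
import random
from collections import defaultdict

def train_markov3(training_seqs):
    """Count which base follows each 3-base context."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in training_seqs:
        for i in range(3, len(seq)):
            counts[seq[i - 3:i]][seq[i]] += 1
    return counts

def generate_background(counts, length, rng):
    """Generate one background sequence of the requested length."""
    seq = list(rng.choice(sorted(counts)))  # start from an observed context
    while len(seq) < length:
        following = counts.get("".join(seq[-3:]))
        if not following:
            base = rng.choice("ACGT")  # assumption: uniform for unseen contexts
        else:
            bases = sorted(following)
            weights = [following[b] for b in bases]
            base = rng.choices(bases, weights=weights)[0]
        seq.append(base)
    return "".join(seq[:length])

model = train_markov3(["ACGTACGTACGTTTGACA"])
background = generate_background(model, 400, random.Random(0))
```

Seeding the 20 binding sites would then overwrite randomly chosen positions in a set of 20 such sequences.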

27 Synthetic data test (8-mer/Octamer)
Synthetic background seqs: the dependencies of the 3rd-order Markov model were estimated from all intergenic seqs of the yeast genome. Yeast background: AT: 0.65, GC: 0.35.
Motifs containing 20 BSs with information contents of 12 bits (at most 6 positions are conserved) were chosen from the SGD database. [Figure: binding sites count versus SGD binding site length.]
Commands:
meme inputfile.fasta -dna -mod anr -w 8 -nmotifs 1 -text > file.meme.out
weederTFBS.out -f inputfile.fasta -W 8 -O SC -e 3 -R 50 -M -T 1
adviser.out inputfile.fasta S
weederlauncher.out inputfile SC medium M T1
BioProspector -i inputfile.fasta -W 8 -d 1 -r 1 -o file.biop.out
Motif_cuts.exe inputfile.fasta 8 1
MotifClicker inputfile.fasta -w 8 -n 1 -s 1 > file.motifclick.out
MotifClicker inputfile.fasta -w 8 -n 1 -s 0 > file.motifclick.out
Annotations on the Weeder commands: number of mutations allowed; unfair to other tools.

28 Sum of squared distance. [Figure: SSD distribution, axis ticks from 0.02 to 0.22.] Background seqs set sizes: 400*20, 600*20, 800*20, 1000*20. Seed motifs into 100 instances of each size.

29 Synthetic data test (8-mer). [Figure: results for the motif with average SSD=0.06 and the motif with average SSD=0.10 on 100 instances of 400*20 seq sets.] Note: Weeder did not output any results on the two motifs after setting the number of output motifs to "T1", so we decided to use "T2" and only consider the top 1 motif of "T2".

30 K-mer size 8 (using two motifs with SSD=0.06 and SSD=0.10, respectively, on 100 datasets). [Figure panels: Sensitivity, Specificity, PC, F-measure.]

31 Dodeca-mer (12-mer). Synthetic background seqs: the dependencies of the 3rd-order Markov model were estimated from all intergenic seqs of E. coli K12. Motifs containing 20 BSs with information contents of 14 bits (at most 7 positions are conserved) and an average SSD=0.02 between each BS and the background were chosen from the RegulonDB database. Seed motifs into 100 background seq sets. Test on 400*20, 600*20, 800*20, and 1000*20. We abandoned Weeder because its motif length can only be set to "small" (length 6 with 1 mutation, length 8 with 2 mutations), "medium" (like small, plus length 10 with 3 mutations), "large" (like medium, plus length 12 with 4 mutations), or "extra" (length 6 with 1 mutation, length 8 with 3 mutations, length 10 with 4 mutations, length 12 with 4 mutations). That is, Weeder only accepts even motif lengths between 6 and 12, and for length 12 it accepts at most 4 mutations.

32 K-mer size 12, seeded into 100 background seqs sets. [Figure panels: Sn, Sp, PC, F-measure.]

33 12-mer, add noise. [Figure panels: Sn, Sp, PC, F-measure.]

34 16-mer. Synthetic background seqs: the dependencies of the 3rd-order Markov model were estimated from all intergenic seqs of E. coli K12. Motifs containing 20 BSs with information contents of 16 bits (at most 8 positions are conserved) and an average SSD=0.02 between each BS and the background were chosen from the RegulonDB database. Seed motifs into 100 background seq sets. Test on 400*20, 600*20, 800*20, and 1000*20.

35 16-mer. [Figure panels: Sn, Sp, PC, F-measure.]

36 16-mer, add noise. [Figure panels: Sn, Sp, PC, F-measure.]

37 Motif finding in Yeast (8-mer)
Motif finding in 5137 intergenic sequence sets of orthologous genes*, which contain 2932 BSs belonging to 99 TFs in SGD. http://www.yeastgenome.org
Motif finding tools | Top 1 | Top 5 | Top 10 | Top 15 | Top 20 | Top 25
MotifClick | 67/585, 7158, 0.081 | 85/1200, 24916, 0.048 | 92/1638, 41752, 0.039 | 95/1923, 55084, 0.035 | 95/2107, 65852, 0.032 | 96/2222, 74820, 0.030
MEME | 70/754, 10107, 0.074 | 85/1202, 34010, 0.035 | 87/1615, 49958, 0.032 | 92/1931, 60805, 0.031 | 95/2087, 69709, 0.030 | 95/2198, 77405, 0.028
MotifCut | 65/474, 7632, 0.062 | 85/1189, 28974, 0.041 | 86/1641, 47583, 0.034 | 93/1893, 61107, 0.031 | 95/1983, 67017, 0.030 | 95/1998, 67503, 0.030
BioProspector | 79/780, 10049, 0.078 | 84/1145, 20418, 0.056 | 86/1465, 31935, 0.046 | 89/1701, 42305, 0.040 | 92/1911, 52296, 0.037 | 92/2038, 61564, 0.033
Weeder | 77/969, 23417, 0.041 | 88/1698, 56440, 0.030 | 92/2063, 81374, 0.025 | 94/2255, 96046, 0.023 | 94/2396, 106872, 0.022 | 96/2483, 113346, 0.022
*At least 3 orthologous genes for each intergenic sequence set.

38 Motif finding in E. coli K12 (16-mer)
E. coli K12: 2313 operon groups. RegulonDB v6.0: 122 TFs, 1411 BSs.
Tools | Top 1 | Top 5 | Top 10 | Top 15 | Top 20 | Top 25
MotifClick | 331/85, 2575, 0.129 | 793/108, 7706, 0.103 | 1055/114, 11056, 0.095 | 1186/114, 12779, 0.093 | 1262/117, 13592, 0.093 | 1296/117, 14026, 0.092
MEME | 298/83, 3352, 0.089 | 877/109, 14243, 0.062 | 1134/115, 20999, 0.054 | 1202/117, 23912, 0.050 | 1233/117, 25412, 0.049 | 1254/117, 26201, 0.048
MotifCut | 241/75, 1942, 0.124 | 487/89, 4763, 0.102 | 544/96, 6552, 0.083 | 640/102, 9145, 0.070 | 744/107, 10408, 0.071 | 836/108, 11212, 0.074
BioProspector | 354/85, 4950, 0.072 | 743/103, 7678, 0.097 | 953/112, 10090, 0.107 | 1056/112, 11287, 0.094 | 1150/116, 12306, 0.093 | 1181/116, 13041, 0.091
MotifClick + MEME | 474/98 | 1029/114 | 1259/118 | 1335/120 | 1357/120 | 1377/120
BioProspector + MEME | 472/92 | 1051/115 | 1258/118 | 1312/119 | 1339/119 | 1367/119
Weeder and CONSENSUS are the worst because they need a high-quality input seqs set.

39
Tools | Top 1 | Top 5 | Top 10 | Top 15 | Top 20 | Top 25
MotifClick | 331/85 | 793/108 | 1055/114 | 1186/114 | 1262/117 | 1296/117
MEME | 298/83 | 877/109 | 1134/115 | 1202/117 | 1233/117 | 1254/117
MotifCut | | | | | 744/107, 10408 | 836/108, 11212
BioProspector | 354/85 | 743/103 | 953/112 | 1056/112 | 1150/116 | 1181/116
CUBIC | 242/75 | 563/98 | 791/108 | 905/109 | 999/111 | 1062/114
MDscan | 355/82 | 552/96 | 634/99 | 684/102 | 758/107 | 793/109
MotifSampler | 168/61 | 486/92 | 612/102 | 729/102 | 792/107 | 831/108
Weeder | 179/65 | 350/85 | 452/92 | 494/94 | 532/94 | 552/94
Consensus | 168/63 | 186/68 | 200/74 | 210/76 | 214/76 | 220/76
MotifClick + MEME | 474/98 | 1029/114 | 1259/118 | 1335/120 | 1357/120 | 1377/120
BioProspector + MEME | 472/92 | 1051/115 | 1258/118 | 1312/119 | 1339/119 | 1367/119

40 Conclusions
Synthetic data: MotifCut has the highest specificity. MotifClick has the highest sensitivity. MotifClick is the most complementary to the other tools.
Yeast data and E. coli data: MotifClick and MEME have close numbers of true predictions, and more true predictions than the other tools. MotifClick is the most complementary to the other tools.

