cisGreedy Motif Finder for Cistematic. Sarah Aerni. Mentors: Ali Mortazavi, Barbara Wold


cisGreedy De novo motif finder that implements a greedy algorithm similar to the Consensus motif finder. Goal: to provide an efficient algorithm, included in the Cistematic package, that performs similarly to Consensus and MEME.

Cistematic Integrates visualization and refinement of motifs, and improves the performance of multiple motif finders, in a single package (Mortazavi, 2006).

Cistematic (Image: Ali Mortazavi) cisGreedy becomes part of the “Bottom Tier”. The motif finder would be included in the Cistematic package (prevents the need for complicated installations).

What is a Motif? cis-Regulatory elements – Transcription Factor Binding Sites (TFBS) – Binding by transcription factors may increase or decrease transcription of genes

What is a Motif? GAL4 in Yeast – Activator of galactose-induced genes (convert galactose to glucose) – Protein structure determines the motif: DNA-protein interactions require certain bases at specified locations, and the motif reflects the homodimer structure.

What is a Motif? Gene regulation is believed to be a major source of complexity – Plants may have more genes or larger genomes than humans; are they more complex? Identification of cis-regulatory elements will help us understand gene regulatory networks (the bigger picture).

How do we find motifs? Hard to identify: – Relatively short sequences (as small as 6 bases) – Many positions not well conserved. Factors improving identification: – Usually located near a gene (search within 3 kb upstream) – Some positions highly conserved – Use other data (microarray?)

Motif Finders Greedy – Maximizes similarity of motifs from sequences through a greedy approach – Eliminates background modeling by using Cistematic package preprocessing steps (improves speed, prevents false negatives) – Implements multiple models (ZOOPS, OOPS, TCM)

Consensus Scoring Uses an equation similar to log likelihood, called Information Content: $I = \sum_{j=1}^{L} \sum_{i \in A} f_{ij} \ln \frac{f_{ij}}{p_i}$, where L is the number of columns in the matrix, A = {A, C, G, T} is the alphabet, $f_{ij}$ is the frequency of each letter i at each position j, and $p_i$ is the a priori probability of letter i. Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 8 1999: 563-577.
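The information content described on this slide can be sketched in Python. This is a minimal illustration, assuming the standard Hertz and Stormo form in which f_ij * ln(f_ij / p_i) is summed over every letter i and position j. The function name `information_content`, the uniform default prior, and the natural logarithm are assumptions made for the example, not details taken from Consensus or Cistematic.

```python
import math
from collections import Counter

def information_content(aligned_sites, prior=None):
    """Information content of a set of equal-length aligned DNA sites.

    Hypothetical helper: sums f_ij * ln(f_ij / p_i) over positions j
    and letters i, skipping letters with zero frequency.
    """
    alphabet = "ACGT"
    if prior is None:
        # Assumption: uniform a priori probability for each base.
        prior = {base: 0.25 for base in alphabet}
    n_sites = len(aligned_sites)
    width = len(aligned_sites[0])
    total = 0.0
    for j in range(width):
        counts = Counter(site[j] for site in aligned_sites)
        for base in alphabet:
            f = counts[base] / n_sites
            if f > 0:
                total += f * math.log(f / prior[base])
    return total
```

A perfectly conserved column contributes ln(4) under a uniform prior, and a column with all four bases equally represented contributes 0.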

Removing Background Goal of a background model: differentiate noise from signal. Issues with background: – What background should be used? Whole genome? Conserved regions? – Selective pressures maintain conserved regions; arguably, searching in conserved regions guarantees there is little noise (it has been maintained). Solution: – Search in conserved regions – Use simple repeat masking – Sequences which reoccur are likely TFBS

cisGreedy scoring Scoring focuses on maximizing the number of identical bases – Percent identity depends on the number of deviations from the strict consensus – Background adds complexity that may lead to false negatives

cisGreedy Input sequences are analyzed. Randomly select 2 sequences to be compared.


cisGreedy The two selected sequences are analyzed independently of the remaining sequences. Windows of motif size are scanned starting at the beginning of each sequence.

cisGreedy Sequences are scanned to locate the highest-scoring alignment – Alignments are ungapped – Score is established as the number of sequences containing the most frequently occurring base at each position
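The scoring and pairwise scan described on this slide can be sketched as follows. This is a minimal illustration, not the cisGreedy source: the names `alignment_score` and `best_pair` are hypothetical, and summing the per-position counts into a single score is an assumption about how the column scores are combined.

```python
from collections import Counter

def alignment_score(windows):
    """Ungapped alignment score for equal-length windows.

    For each column, count the sequences carrying the most frequent
    base; the total is the sum over columns (assumed combination rule).
    """
    return sum(Counter(column).most_common(1)[0][1] for column in zip(*windows))

def best_pair(seq_a, seq_b, width):
    """Scan every pair of windows of the given width in two sequences.

    Returns (score, start_a, start_b) for the first highest-scoring pair.
    """
    best = (-1, 0, 0)
    for i in range(len(seq_a) - width + 1):
        for j in range(len(seq_b) - width + 1):
            score = alignment_score([seq_a[i:i + width], seq_b[j:j + width]])
            if score > best[0]:
                best = (score, i, j)
    return best
```

For two sequences, each column scores 2 when the bases match and 1 when they differ, so the scan favors the pair of windows with the most identical positions.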

cisGreedy Reverse complements are analyzed (user specified). Once start locations are established with a top alignment score, these are left unchanged (Greedy).

cisGreedy Select an additional sequence in which to identify the location of the motif. Windows in the additional sequence are aligned to previously established windows (Greedy).

cisGreedy The additional sequence is scanned as before, including its reverse complement (user specified). The alignment score is established as before.
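The greedy extension step can be sketched like this: the windows already chosen stay fixed, and only the new sequence (and optionally its reverse complement) is scanned. This is a hedged sketch under stated assumptions; `add_sequence`, `use_revcomp`, and the returned tuple layout are hypothetical names chosen for the example, and the score is the same summed column count assumed earlier.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def alignment_score(windows):
    """Sum over columns of the count of the most frequent base."""
    return sum(Counter(column).most_common(1)[0][1] for column in zip(*windows))

def add_sequence(fixed_windows, new_seq, width, use_revcomp=True):
    """Place a motif window in new_seq without moving earlier windows.

    Returns (score, start, strand, window). For strand "-", start is
    relative to the reverse-complemented sequence.
    """
    candidates = [("+", new_seq)]
    if use_revcomp:
        candidates.append(("-", reverse_complement(new_seq)))
    best = None
    for strand, seq in candidates:
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            score = alignment_score(fixed_windows + [window])
            if best is None or score > best[0]:
                best = (score, start, strand, window)
    return best
```

Because earlier windows are never revisited, each added sequence costs a single linear scan; the trade-off is that an early poor choice cannot be corrected later.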

cisGreedy Final motif locations are used to build position-specific frequency matrices (PSFM). The reverse complement sequence is used in building the PSFM if it was the strand selected.
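Building a position-specific frequency matrix from the final windows can be sketched as below. This is a minimal illustration; `build_psfm` and the list-of-dictionaries layout are assumptions made for the example and may differ from how Cistematic stores its matrices.

```python
from collections import Counter

def build_psfm(windows, alphabet="ACGT"):
    """Position-specific frequency matrix from equal-length windows.

    Returns one dictionary per motif position, mapping each base to its
    frequency among the windows. Windows taken from the reverse strand
    should already be reverse-complemented by the caller.
    """
    n_windows = len(windows)
    matrix = []
    for column in zip(*windows):
        counts = Counter(column)
        matrix.append({base: counts[base] / n_windows for base in alphabet})
    return matrix
```

For example, `build_psfm(["ACGT", "ACGA"])` gives a frequency of 1.0 for A at the first position and 0.5 each for T and A at the last.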

Testing cisGreedy AIY – 16 bp cis-regulatory motif drives expression – Experimentally verified – Gene battery consists of a set of genes bound by AIY. Orthologous genes contain highly specified binding sites. Individual binding sites of battery genes within a single species can vary considerably (Wenick and Hobert 759).

[Figure: Cistematic results for AIY, hen-1: regions of conservation in orthologous genes]

Results for AIY – AIY identified: AAATTGGCTTCCTCAAA – cisGreedy: TTTGAGGAAGCCAATTT (reverse complement: AAATTGGCTTCCTCAAA) – MEME: AAATTGGCTTCCTCAAA – AIY-Battery Consensus

[Figure: Cistematic results for AIY, hen-1]

[Figure: Results for AIY, hen-1]

Tompa Bakeoff 3 benchmark datasets: – Real – Markov Chain – Generic. 4 organisms: – Human – Mouse – Fruit fly – Yeast. Each dataset contains 0-1 motifs. Each sequence can have 0 or multiple motifs. Report 0-1 motif per dataset and the locations of motifs. Use statistical tools provided by the bakeoff to analyze runs.

Bakeoff example (hm03) Identify the most reasonable motif in each dataset independently.

Real An interesting pattern appears in 3 of 10 sequences.

Markov

Generic

Bakeoff example (hm03) Determine which motif appears most reasonable across the 3 benchmarks and map the motif in the sequences using Cistematic. Compare results to the actual locations (provided in the bakeoff package).

Solution

Markov

Bakeoff results Correlation coefficient: nCC = (nTP × nTN - nFN × nFP) / √((nTP + nFN)(nTN + nFP)(nTP + nFP)(nTN + nFN)). Sensitivity (fraction of known sites that are predicted): sSn = sTP / (sTP + sFN). Positive predictive value (fraction of predicted sites that are known): sPPV = sTP / (sTP + sFP).
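The three statistics on this slide can be computed directly from the counts of true and false positives and negatives. The sketch below is illustrative only and is not the bakeoff's own tooling; the function names are hypothetical, and returning 0.0 when a denominator is zero is an assumption made to keep the example total.

```python
import math

def correlation_coefficient(tp, tn, fp, fn):
    """nCC from nucleotide-level counts."""
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumption: treat an undefined coefficient as 0
    return (tp * tn - fn * fp) / denominator

def sensitivity(tp, fn):
    """sSn: fraction of known sites that are predicted."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def positive_predictive_value(tp, fp):
    """sPPV: fraction of predicted sites that are known."""
    return tp / (tp + fp) if (tp + fp) else 0.0
```

With 3 true positives and 1 false negative, `sensitivity(3, 1)` is 0.75; with 3 true positives and 3 false positives, `positive_predictive_value(3, 3)` is 0.5.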

Bakeoff results cisGreedy was overall the 7th best performer (excluding those with no data): – Overall top performer in fly – Worst performer in yeast – 3rd worst performer in mouse – 4th best performer in human

Bakeoff results When running programs in parallel, the correlation of motif finder results to true binding sites improves (adapted from Tompa, 2005).

Future goals Complete the analysis of results for cisGreedy using the benchmarks established by the Tompa paper (Nature Biotech, 2005). Document results and algorithm development. Continue improving cisGreedy.

References Bioalgorithms.info. Jones, Neil C., and Pavel A. Pevzner. An Introduction to Bioinformatics Algorithms. MIT Press, 2004. Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 8 1999: 563-577. Tompa, Martin, et al. "Assessing computational tools for the discovery of transcription factor binding sites." Nature Biotechnology January 2005: 137-144. Wenick, Adam S., and Oliver Hobert. "Genomic cis-Regulatory Architecture and trans-Acting Regulators of a Single Interneuron-Specific Gene Battery in C. elegans." Developmental Cell 6 (2005): 757-770. http://cistematic.caltech.edu

Acknowledgements Ali Mortazavi. Barbara Wold. Wold Lab funding provided by DOE & NASA. Additional funding by NSF & NIH. SoCalBSI faculty, staff and fellow students.
