1
cisGreedy Motif Finder for Cistematic. Sarah Aerni. Mentors: Ali Mortazavi, Barbara Wold
2
cisGreedy: De novo motif finder that implements a greedy algorithm similar to the Consensus motif finder. Goal: to provide an efficient algorithm, included in the Cistematic package, that performs similarly to Consensus and MEME.
3
Cistematic: Integrates visualization and refinement of motifs, and improves the performance of multiple motif finders, in a single package (Mortazavi, 2006).
4
Cistematic (image: Ali Mortazavi): cisGreedy becomes part of the “Bottom Tier”. The motif finder would be included in the Cistematic package, which avoids the need for complicated installations.
5
What is a Motif? cis-Regulatory elements – Transcription Factor Binding Sites (TFBS) – Binding by transcription factors may increase or decrease transcription of genes
6
What is a Motif? GAL4 in Yeast – Activator of galactose-induced genes (convert galactose to glucose) – Protein structure determines the motif: DNA-protein interactions require certain bases at specified locations, and the motif reflects the homodimer structure
7
What is a Motif? cis-Regulatory elements – Transcription Factor Binding Sites (TFBS) – Binding by transcription factors may increase or decrease transcription of genes. Gene regulation is believed to be a major source of complexity – Plants may have more genes or larger genomes than humans: are they more complex? Identification of cis-regulatory elements will help us understand gene regulatory networks (the bigger picture).
8
How do we find motifs? Hard to identify – Relatively short sequences (as small as 6 bases) – Many positions not well conserved. Factors improving identification – Usually located near a gene (search within 3 kb upstream) – Some positions highly conserved – Use other data (Microarray?)
9
Motif Finders: Greedy – Maximizes similarity of motifs from sequences through a greedy approach – Eliminates background modeling by using Cistematic package preprocessing steps (improves speed, prevents false negatives) – Implements multiple models (ZOOPS, OOPS, TCM)
10
Consensus Scoring: Uses an equation similar to log likelihood, called Information Content (Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 8 1999: 563-577): $I = \sum_{j=1}^{L} \sum_{i \in A} f_{i,j} \ln \frac{f_{i,j}}{p_i}$, where L = number of columns in the matrix, A = {A,C,G,T}, $f_{i,j}$ = frequency of each letter i at each position j, and $p_i$ = a priori probability of letter i
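The information content score described above can be sketched in a few lines of Python. This is an illustrative reading of the Hertz and Stormo measure, not code from Consensus or Cistematic; the function name `information_content` and the uniform default priors are assumptions made for the example.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def information_content(sites, priors=None):
    """Sum over columns j and letters i of f_ij * ln(f_ij / p_i).

    `sites` is a list of equal-length aligned DNA strings.
    `priors` maps each letter to its a priori probability;
    a uniform 0.25 is ASSUMED here when none is given.
    """
    if priors is None:
        priors = {base: 0.25 for base in ALPHABET}
    n = len(sites)
    width = len(sites[0])
    total = 0.0
    for j in range(width):
        counts = Counter(site[j] for site in sites)
        for base in ALPHABET:
            f = counts[base] / n
            if f > 0:  # the term f * ln(f / p) tends to 0 as f tends to 0
                total += f * math.log(f / priors[base])
    return total
```

A perfectly conserved column contributes ln(1 / 0.25) under uniform priors, while a column matching the priors exactly contributes zero.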
11
Removing Background: Goal of a background model: differentiate noise from signal. Issues with background: – What background should be used? Whole genome? Conserved regions? – Selective pressures maintain conserved regions; arguably, searching in conserved regions ensures there is little noise (the sequence has been maintained). Solution: – Search in conserved regions – Use simple repeat masking – Sequences which recur are likely TFBS
12
cisGreedy scoring Scoring focuses on maximizing number of identical bases – Percent identity is dependent on number of deviations from the strict consensus – Background adds complexity that may lead to false negatives
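One way to picture "deviations from the strict consensus" is the following sketch. The slides do not give the exact formula cisGreedy uses, so the definitions below (majority-vote consensus, identity counted over all bases of all sites) and the names `strict_consensus` and `percent_identity` are assumptions for illustration only.

```python
from collections import Counter

def strict_consensus(sites):
    """Most frequent base in each column of the aligned sites.

    Ties are broken by first occurrence (an arbitrary choice for this sketch).
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sites))

def percent_identity(sites):
    """Percentage of bases, over all sites, that match the strict consensus."""
    consensus = strict_consensus(sites)
    matches = sum(b == c for site in sites for b, c in zip(site, consensus))
    return 100.0 * matches / (len(sites) * len(consensus))
```

With this definition, every deviation from the consensus lowers the identity by the same amount, and no background model is consulted.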
13
cisGreedy: Input sequences are analyzed. Two sequences are randomly selected for comparison.
14
cisGreedy: The two selected sequences are analyzed independently of the remaining sequences.
15
cisGreedy: The two selected sequences are analyzed independently of the remaining sequences. Windows of motif size are scanned starting at the beginning of each sequence.
16
cisGreedy Sequences are scanned in an attempt to locate the highest scoring alignment – Alignments are ungapped – Score is established as the number of sequences containing the most frequently occurring base at each position
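The pairwise scan can be sketched as follows. This is a minimal illustration, not the cisGreedy source: it assumes the per-position counts are summed into one alignment score, it tries every window pair exhaustively, and the names `alignment_score` and `best_pair` are hypothetical.

```python
from collections import Counter

def alignment_score(windows):
    """Sum, over columns, of how many windows carry the most frequent base.

    ASSUMPTION: per-position counts are added up to give a single score.
    """
    return sum(max(Counter(col).values()) for col in zip(*windows))

def best_pair(seq_a, seq_b, width):
    """Score every ungapped window pair; return (score, start_a, start_b).

    The first pair reaching the top score is kept.
    """
    best = (-1, 0, 0)
    for i in range(len(seq_a) - width + 1):
        for j in range(len(seq_b) - width + 1):
            score = alignment_score([seq_a[i:i + width], seq_b[j:j + width]])
            if score > best[0]:
                best = (score, i, j)
    return best
```

For two sequences, a column scores 2 when the bases agree and 1 otherwise, so the top-scoring pair is the one with the most identical positions.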
17
cisGreedy: Reverse complements are analyzed (if the user specifies). Once start locations are established with a top alignment score, these are left unchanged (Greedy).
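Scanning both strands amounts to also scoring the reverse complement of each window. The sketch below shows one way to enumerate candidates; the helper names `reverse_complement` and `candidate_windows` and the `both_strands` flag are illustrative assumptions, not the Cistematic API.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def candidate_windows(seq, width, both_strands=True):
    """Yield (start, strand, window) for every window of the given width.

    When both_strands is True (the user-specified option in the slides),
    the reverse complement of each window is yielded as a '-' candidate.
    """
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        yield start, "+", window
        if both_strands:
            yield start, "-", reverse_complement(window)
```

As a check, `reverse_complement("TTTGAGGAAGCCAATTT")` gives `AAATTGGCTTCCTCAAA`, the relationship reported for the AIY result in this deck.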
18
cisGreedy: Select an additional sequence in which to identify the location of the motif. Windows in the additional sequence are aligned to previously established windows (Greedy).
19
cisGreedy: The additional sequence is scanned as before, including its reverse complement if the user specifies. The alignment score is established as before.
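The greedy extension step can be sketched like this: the windows already chosen stay fixed, and only the new sequence is scanned. This is an assumption-laden illustration (forward strand only, summed per-column majority counts as the score); `alignment_score` and `extend_alignment` are hypothetical names.

```python
from collections import Counter

def alignment_score(windows):
    """Sum, over columns, of how many windows carry the most frequent base."""
    return sum(max(Counter(col).values()) for col in zip(*windows))

def extend_alignment(fixed_windows, new_seq, width):
    """Pick the window of new_seq that best fits the fixed windows.

    The earlier windows are never revisited; that is what makes the
    procedure greedy. Returns (score, start) of the first best window.
    """
    best_score, best_start = -1, 0
    for start in range(len(new_seq) - width + 1):
        candidate = new_seq[start:start + width]
        score = alignment_score(list(fixed_windows) + [candidate])
        if score > best_score:
            best_score, best_start = score, start
    return best_score, best_start
```

Because earlier choices are frozen, each added sequence costs only one linear scan, at the risk of locking in a poor early pair.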
20
cisGreedy: Final motif locations are used to build position-specific frequency matrices (PSFMs). Where a site was found on the reverse strand, its reverse complement sequence is used in building the PSFM.
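Building a position-specific frequency matrix from the final sites is a per-column count. A minimal sketch, assuming the sites are already oriented on the same strand and that no pseudocounts are applied; the name `build_psfm` is hypothetical.

```python
def build_psfm(sites, alphabet="ACGT"):
    """Return one {base: frequency} dict per motif position.

    `sites` are equal-length strings, already reverse-complemented
    where needed so that all share one orientation.
    """
    n = len(sites)
    width = len(sites[0])
    return [
        {base: sum(site[j] == base for site in sites) / n for base in alphabet}
        for j in range(width)
    ]
```

Each column's frequencies sum to 1, so the matrix can be passed directly to a scoring function that expects frequencies rather than raw counts.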
21
Testing cisGreedy: AIY – 16 bp cis-regulatory motif drives expression – Experimentally verified – Gene battery consists of a set of genes bound by AIY. Orthologous genes contain highly specified binding sites. Individual binding sites of battery genes within a single species can vary considerably (Wenick and Hobert 759).
22
Cistematic Results for AIY: hen-1 (figure labels: regions of conservation, orthologous genes)
23
Results for AIY: AIY Identified: AAATTGGCTTCCTCAAA; cisGreedy: TTTGAGGAAGCCAATTT (reverse complement: AAATTGGCTTCCTCAAA); meme: AAATTGGCTTCCTCAAA; AIY-Battery Consensus
24
Cistematic Results for AIY hen-1
25
Results for AIY hen-1
26
Tompa Bakeoff: 3 benchmark datasets – Real – Markov Chain – Generic. 4 organisms – Human – Mouse – Fruitfly – Yeast. Each dataset contains 0-1 motifs; each sequence can have 0 or multiple motifs. Report 0-1 motifs per dataset and the locations of motifs. Use statistical tools provided by the bakeoff to analyze runs.
27
Bakeoff example (hm03): Identify the most reasonable motif in each dataset independently.
28
Real: Interesting pattern appears in 3 of 10 sequences
29
Markov
30
Generic
31
Bakeoff example (hm03): Identify the most reasonable motif in each dataset independently. Determine which motif appears most reasonable across the 3 benchmarks and map the motif in the sequences using Cistematic. Compare results to the actual locations (provided in the bakeoff package).
32
Solution
33
Real
34
Solution
35
Markov
36
Bakeoff results: Correlation Coefficient: nCC = (nTP·nTN − nFN·nFP) / √((nTP+nFN)(nTN+nFP)(nTP+nFP)(nTN+nFN)). Sensitivity (fraction of known sites that are predicted): sSn = sTP / (sTP + sFN). Positive Predictive Value (fraction of predicted sites that are known): sPPV = sTP / (sTP + sFP).
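The three statistics above translate directly into code. The sketch below is illustrative and is not the bakeoff's own statistics tool; the function names are hypothetical, and returning NaN for an undefined ratio is a choice made for this example.

```python
import math

def ncc(tp, tn, fp, fn):
    """Nucleotide-level correlation coefficient from the four counts."""
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return (tp * tn - fn * fp) / denom if denom else float("nan")

def sensitivity(tp, fn):
    """Fraction of known sites that are predicted: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else float("nan")

def ppv(tp, fp):
    """Fraction of predicted sites that are known: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else float("nan")
```

The counts passed to `ncc` are nucleotide-level (nTP, nTN, nFP, nFN), while `sensitivity` and `ppv` as written on the slide use site-level counts (sTP, sFN, sFP).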
37
Bakeoff results: cisGreedy overall 7th best performer (excluding those with no data): – Overall top performer in fly – Worst performer in yeast – 3rd worst performer in mouse – 4th best performer in human
38
Bakeoff results: When running programs in parallel, correlation of motif finder results to true binding sites improves (adapted from Tompa, 2005).
39
Future goals: Complete analysis of results for cisGreedy using benchmarks established by the Tompa paper (Nature Biotech, 2005). Document results and algorithm development. Continue improving cisGreedy.
40
References: Bioalgorithms.info. Jones, Neil C., and Pavel A. Pevzner. An Introduction to Bioinformatics Algorithms. MIT Press, 2004. Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 8 1999: 563-577. Tompa, Martin, et al. "Assessing computational tools for the discovery of transcription factor binding sites." Nature Biotechnology January 2005: 137-144. Wenick, Adam S., and Oliver Hobert. "Genomic cis-Regulatory Architecture and trans-Acting Regulators of a Single Interneuron-Specific Gene Battery in C. elegans." Developmental Cell 6 (2005): 757-770. http://cistematic.caltech.edu
41
Acknowledgements: Ali Mortazavi, Barbara Wold, Wold Lab. Funding provided by DOE & NASA. Additional funding by NSF & NIH. SoCalBSI faculty, staff and fellow students.