Nora Pierstorff Dept. of Genetics University of Cologne 30.8.2005 Combined ab initio and comparative analysis of putative regulatory regions
Outline
Introduction
Ab Initio Approach
Datasets
Comparative Analysis of Enhancers and Results
Combination of Both Approaches and Results
Discussion
Eukaryotic regulation model
3 Approaches: (1) Search for binding sites of known transcription factors using position weight matrices. (2) Search for conserved motifs in the upstream regions of homologous or co-regulated genes. (3) Search for statistically overrepresented motifs.
Ab Initio Approach (overrepresented patterns): Overrepresented patterns are frequent in DNA, which leads to many false-positive predictions. The amount of available data is not large enough to derive additional reliable, universally valid rules.
Dataset (collected by Nazina and Papatsenko 2003)
Target species: Drosophila melanogaster
Reference species: D. yakuba, D. ananassae, D. pseudoobscura, D. virilis
Number of sequences: 39
Total bp: 1080200
Number of regulatory regions: 87
bp in enhancers: 158317
Enhancers per sequence: 2.462
Fraction of bp in enhancers: 0.14656
[Figure: Dorsal motif; Dorsal matches]
Are enhancers alignable? Emberly et al. (2003): the overlap of binding sites and conserved sequence blocks is not much greater than expected by chance, but it is still statistically significant. Compared organisms: D. melanogaster and D. pseudoobscura. Alignment methods: LAGAN, SMASH (construct chains of local alignments).
Assumptions about enhancer conservation: Binding sites contain core sequences that are essential for binding the transcription factor. Core sequences are conserved among the binding sites within one species and between species. Binding sites are indicated by short, exactly conserved, overrepresented patterns.
Alignment of short exact matches. Input: a chain of high-scoring fragments from the blastn alignment of each sequence pair. Output: regions containing a large number of short conserved stretches.
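The extraction of short exact matches from one aligned fragment pair might look roughly like the sketch below. It assumes the two rows of a blastn fragment are already available as equal-length strings (gaps written as "-"); the function name and the minimum stretch length `min_len` are hypothetical, since the slides do not state the length threshold that was used.

```python
def conserved_stretches(aligned_a, aligned_b, min_len=6):
    """Return half-open (start, end) ranges, in alignment columns, where
    both rows carry identical bases for at least min_len columns.

    min_len is an assumed parameter, not a value taken from the slides.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have equal length")

    stretches = []
    run_start = None  # column where the current exact-match run began
    for i, (a, b) in enumerate(zip(aligned_a.upper(), aligned_b.upper())):
        # Gaps and ambiguity codes never count as a match.
        if a == b and a in "ACGT":
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_len:
                stretches.append((run_start, i))
            run_start = None

    # Close a run that extends to the end of the fragment.
    if run_start is not None and len(aligned_a) - run_start >= min_len:
        stretches.append((run_start, len(aligned_a)))
    return stretches
```

For example, `conserved_stretches("ACGTACGTAATTTGGGCCC", "ACGTACGTCATTAGGGCCC")` keeps the two runs of at least six identical columns and drops the three-column run between the mismatches.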
Result using only the comparative approach with 5 species (m8 region). Score = number of short conserved stretches in a 200 bp window.
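The window score defined above can be illustrated with a minimal sketch. It assumes that each conserved stretch is represented by its start coordinate in the target sequence and is counted in every window that contains that start; the function name and the `step` parameter are hypothetical and not taken from the slides.

```python
from bisect import bisect_left


def window_scores(stretch_starts, seq_len, window=200, step=1):
    """Count conserved stretches starting inside each sliding window.

    Returns one score per window start 0, step, 2*step, ...
    Each window covers the half-open interval [start, start + window).
    """
    starts = sorted(stretch_starts)
    scores = []
    for w in range(0, max(seq_len - window, 0) + 1, step):
        lo = bisect_left(starts, w)           # first stretch at or after w
        hi = bisect_left(starts, w + window)  # first stretch past the window
        scores.append(hi - lo)
    return scores


# Five stretch starts on a 1000 bp sequence, scored every 100 bp.
print(window_scores([10, 50, 150, 250, 600], seq_len=1000, step=100))
# prints [3, 2, 1, 0, 0, 1, 1, 0, 0]
```

Peaks in this score profile would then mark candidate regulatory regions.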
Searching for overrepresented motifs in conserved regions. Input: all short conserved words. Step 1: count the occurrences of every 5 bp substring of the word in the 1000 surrounding base pairs. Step 2: calculate one observed/expected ratio for every species. Output: conserved stretches containing at least one 5-mer that is overrepresented in each species.
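The two steps above can be sketched as follows. The slides do not say how the expected count is obtained, so this sketch assumes a simple background model of independent bases with the base frequencies of the surrounding sequence; the function names and the cut-off `threshold` are likewise hypothetical.

```python
from collections import Counter


def count_overlapping(seq, word):
    """Count occurrences of word in seq, including overlapping ones."""
    count = 0
    start = seq.find(word)
    while start != -1:
        count += 1
        start = seq.find(word, start + 1)
    return count


def observed_expected(context, kmer):
    """Observed/expected ratio of kmer in context (assumed i.i.d. background)."""
    context, kmer = context.upper(), kmer.upper()
    n_positions = len(context) - len(kmer) + 1
    freqs = Counter(context)
    total = sum(freqs[base] for base in "ACGT")
    if n_positions <= 0 or total == 0:
        return 0.0
    expected = float(n_positions)
    for base in kmer:
        expected *= freqs[base] / total
    if expected == 0:
        return 0.0
    return count_overlapping(context, kmer) / expected


def overrepresented_kmers(word, contexts, k=5, threshold=2.0):
    """Return the k-mers of word whose ratio reaches threshold in every species.

    contexts holds one surrounding sequence (about 1000 bp) per species.
    """
    kmers = {word[i:i + k] for i in range(len(word) - k + 1)}
    return sorted(
        km for km in kmers
        if all(observed_expected(ctx, km) >= threshold for ctx in contexts)
    )
```

A conserved word would be kept when `overrepresented_kmers` returns a non-empty list for it, which mirrors the requirement that at least one 5-mer is overrepresented in each species.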
Improvement by combination (m8 region). Comparative approach only: score = number of short conserved stretches in a 200 bp window. Combined approach: score = number of short conserved, overrepresented stretches in a 200 bp window.
improvement by combination
Discussion: Using a combination of methods improves predictions. In the near future, regulatory regions can be found without knowing the binding transcription factors, if enough related species are known. More features need to be found to distinguish conserved regulatory regions from other functional conserved regions.
References
E. Emberly, N. Rajewsky, E. Siggia (2003). Conservation of regulatory elements between two species of Drosophila. BMC Bioinformatics 4:57.
A. Nazina, D. Papatsenko (2003). Statistical extraction of Drosophila cis-regulatory modules using exhaustive assessment of local word frequency. BMC Bioinformatics 4:65.