1
Statistical Analysis for Word Counting in Drosophila Core Promoters
Yogita Mantri
April 27, 2005
Bioinformatics Capstone presentation
2
- Introduction & Motivation
- Dataset used
- Part I – Unbiased word counting
- Part II – TCAGT-centric word counting
- Conclusions and Future work
3
Introduction
Regulatory elements are short DNA sequences that control gene expression. They are often found around the Transcription Start Site (TSS), and sometimes further upstream.
Identification of promoters and regulatory elements is a major challenge in bioinformatics:
- Regulatory elements are not well conserved.
- Computational discovery of the TSS is not straightforward.
- Promoter sequences do not have distinguishable statistical properties.
- Transcription is a highly cooperative process involving competitive or cooperative binding, which is not completely determined by the rest of the genome's DNA sequence.
4
Drosophila Core Promoters
[Figure] Image edited from: http://163.238.8.180/~davis/Bio_327/lectures/Transcription/TranscriptionOver.html
“Computational analysis of core promoters in the Drosophila Genome”, Ohler, Rubin et al., Genome Biology 2002, 3(12):research0087.1–0087.12
5
Motivation for project
- A database of core promoters with experimentally determined TSSs is a major advantage over approaches that use only gene upstream regions.
- A word-counting method is used to find significant patterns, inspired by Dr. Peter Cherbas’ earlier work: “The arthropod initiator: the capsite consensus plays an important role in transcription”, Cherbas L, Cherbas P., Insect Biochem Mol Biol. 1993 Jan;23(1):81-90
7
The Database of Drosophila Core Promoters
Compiled by Sumit Middha. It consists of Drosophila core promoters from three experimental sources:
- Ohler, Rubin et al.: 1941 promoters. Stringent criteria for identifying TSSs, requiring the 5’ ends of multiple cDNAs to lie in close proximity.
- Kadonaga et al.: 205 promoters. The TSS had been changed to coincide with the A of the Inr consensus TCAGT even where experimental results reported a TSS in the vicinity. The discrepancy was fixed by taking the experimentally reported TSS.
- Eukaryotic Promoter Database: 1926 promoters. TSS assigned based on experimental data with a precision of +/- 5 bp or better.
3458 sequences remain after removing redundant entries from the dataset.
9
Word Analysis – Part I: Unbiased search
- Applied statistical measures such as the Z-score to all possible n-mers, both in the entire dataset and in specific windows.
- The goal was to see whether known patterns of interest were more significantly enriched in promoter sequences than other patterns.
10
Basic Statistics of the dataset
- 3458 promoter sequences in the database.
- The first step was a word-frequency analysis (pentamers were used for the initial analysis).
- The analysis was performed on the following sets:
  - Entire dataset (DS-1)
  - Subset of the above dataset containing only the -20 to +20 region (DS-2)
- Two types of analysis, differing in the “random” sequences used:
  - 1st-order Markov chains based on the base and transition probabilities of the respective dataset
  - “Non-coding” regions
11
Random set
- Generated 100 sets of 1st-order Markov chains.
- Each set contained the same number of sequences as the original dataset (3458), each with the same length (350).
- Computed the occurrence of each pentamer in the actual and random sequences.
- For the random sequences, calculated the average and S.D. over all sets.
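The random-set procedure on this slide can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the function names (`fit_first_order`, `sample_sequence`, `count_words`, `random_sets`), the lowercase `acgt` alphabet, and the fallback to base frequencies are all assumptions made for the example.

```python
import random
from collections import Counter

BASES = "acgt"  # assumed alphabet; other symbols are ignored when sampling

def fit_first_order(seqs):
    """Estimate base counts and first-order transition counts from sequences."""
    base = Counter()
    trans = {b: Counter() for b in BASES}
    for s in seqs:
        base.update(s)
        for a, b in zip(s, s[1:]):
            if a in trans and b in trans:
                trans[a][b] += 1
    return base, trans

def sample_sequence(base, trans, length, rng):
    """Draw one sequence from the fitted first-order Markov chain."""
    bases = list(BASES)
    seq = rng.choices(bases, weights=[base[b] for b in bases])
    while len(seq) < length:
        weights = [trans[seq[-1]][b] for b in bases]
        if sum(weights) == 0:
            # Assumption: fall back to base frequencies if no transition was seen.
            weights = [base[b] for b in bases]
        seq.append(rng.choices(bases, weights=weights)[0])
    return "".join(seq)

def count_words(seqs, k=5):
    """Count every overlapping k-mer (pentamers by default) across sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def random_sets(seqs, n_sets=100, length=350, seed=0):
    """Yield n_sets random sets, each with as many sequences as the input."""
    rng = random.Random(seed)
    base, trans = fit_first_order(seqs)
    for _ in range(n_sets):
        yield [sample_sequence(base, trans, length, rng) for _ in seqs]
```

Calling `count_words` on each yielded set gives the per-set pentamer counts from which the average and S.D. are taken.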
12
Z-score
- A test of significance.
- Mean and S.D. calculated over the 100 random sets.
- Z-scores calculated for all pentamers.
- Looking for pentamers with very high or very low Z-scores.
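As a small sketch of the Z-score computation described here: the observed count of a pentamer is compared with the mean and S.D. of its counts over the random sets. The slide does not say whether the sample or population S.D. was used; the sample S.D. below is an assumption, and the function name `z_score` is hypothetical.

```python
from statistics import mean, stdev

def z_score(observed, random_counts):
    """Z-score of an observed word count against counts from random sets.

    Assumption: uses the sample standard deviation (n - 1 denominator).
    Returns None when the random counts have zero spread.
    """
    mu = mean(random_counts)
    sd = stdev(random_counts)
    if sd == 0:
        return None
    return (observed - mu) / sd
```

A strongly positive value marks a word seen more often than the random model predicts, and a strongly negative value marks one seen less often.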
13
Rank of TCAGT and variants in entire dataset

Rank  Pattern  Z-score
1     aaaaa    113.037
2     ttttt    111.647
3     ttttg    88.1
4     gaaaa    83.156
5     aaaac    82.69
6     atttt    82.152
7     gtttt    82.067
8     ttttc    79.485
9     aaaat    78.348
10    gcagc    77.091
101   gcagt    29.269
115   tcagt    27.156
307   acagt    10.286
485   tcatt    1.375
965   tataa    -25.213
14
Summary of known pentamers in different windows

-20 to +20
Pattern  Z-score  Rank
tcagt    58.929   2
tcatt    3.6      418
gcagt    25.545   34
acagt    12.923   179
tataa    -25      1022

Sliding windows
Pattern  Z-score     Rank
tcagt    7.559871    254
tcatt    -1.402484   576
gcagt    9.0644839   200
acagt    2.7177419   409
tataa    -8.962065   880

Non-overlapping windows
Pattern  Z-score     Rank
tcagt    4.277429    356
tcatt    -2.00671    590
gcagt    7.714143    246
acagt    2.080429    435
tataa    -9.064      898
15
Z-score Plots of tcagt and variants using sliding windows of 10 bp
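A rough sketch of how per-window counts for such a plot could be produced is shown below. It only counts word occurrences per window; turning those counts into Z-scores would additionally need the same counts from the random sets. The function name `window_counts` and the convention that a word belongs to the window containing its start position are assumptions for this example.

```python
def window_counts(seqs, word, window=10, step=1):
    """Count occurrences of `word` whose start falls in each window.

    step=1 gives sliding windows; step=window gives non-overlapping windows.
    Returns a list of (window_start_index, count) pairs (0-based indices).
    """
    length = min(len(s) for s in seqs)
    k = len(word)
    # Number of sequences-wide occurrences starting at each position.
    starts = [0] * (length - k + 1)
    for s in seqs:
        for i in range(length - k + 1):
            if s[i:i + k] == word:
                starts[i] += 1
    return [
        (w, sum(starts[w:w + window]))
        for w in range(0, len(starts) - window + 1, step)
    ]
```

Mapping the 0-based window start to a TSS-relative coordinate depends on where the TSS sits in each sequence, which the slides do not state.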
16
Lesson: the position preference of regulatory motifs cannot be ignored!
18
Word Analysis – Part II: Guided search
Guided search, starting with the known INR element TCAGT:
- Identification of INR-enriched regions
- Identification of synonyms
- Correlation analysis of INR synonyms
19
TCAGT-centric word analysis

Window    Z-score
(-3,3)    130.58
(-4,2)    116.27
(-2,4)    105.67
(-5,1)    98.96
(-6,-1)   95.71
(-7,-2)   85.83
(-1,5)    59.23
(1,6)     47.68
(2,7)     43.30
(3,8)     28.79
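The counting step behind such a table might look like the sketch below. Several things here are assumptions rather than facts from the slides: that windows are inclusive TSS-relative ranges with no position 0 (+1 is the TSS), that a word counts only when it lies fully inside the window, and the hypothetical `tss_index` parameter giving the 0-based index of the +1 base.

```python
def rel_to_index(pos, tss_index):
    """Map a TSS-relative position (no position 0) to a 0-based index."""
    if pos == 0:
        raise ValueError("position 0 does not exist in this convention")
    return tss_index + pos - 1 if pos > 0 else tss_index + pos

def count_in_window(seqs, word, start, end, tss_index):
    """Count occurrences of `word` lying fully inside [start, end]."""
    lo = rel_to_index(start, tss_index)
    hi = rel_to_index(end, tss_index)  # inclusive last index of the window
    if lo < 0:
        raise ValueError("window starts before the sequence")
    k = len(word)
    total = 0
    for s in seqs:
        for i in range(lo, hi - k + 2):
            if s[i:i + k] == word:
                total += 1
    return total
```

Under this convention every window in the table is 6 bp wide, so a pentamer has two possible start positions per window; the resulting counts would then be compared against random sets to obtain Z-scores.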
20
INR Synonyms

Group 1: CTCAG---, ATCAG---, TTCAG---, GTCAG---, -TCAGT--, ---AGTTG, ---AGTCG, --CAGTT-, --CAGTC-
Group 2: TTAGT
Group 3: ACACT---, -CACTCTG
Group 4: -TCACA-, GTCAC--, --CACAC
Group 5: TCACTCT
Group 6: -CATTC, TCATT-

“Computational analysis of core promoters in the Drosophila Genome”, Ohler, Rubin et al., Genome Biology 2002, 3(12):research0087.1–0087.12
21
Binary Tree Representation of Dataset

TOTAL: 3412
  INR+: 1611
    TATA+: 410   (DPE+: 79,  DPE-: 331)
    TATA-: 1201  (DPE+: 369, DPE-: 832)
  INR-: 1801
    TATA+: 397   (DPE+: 76,  DPE-: 321)
    TATA-: 1404  (DPE+: 232, DPE-: 1172)
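A tree of this kind can be tallied from per-promoter presence flags, as in the sketch below. How a promoter was scored as INR+, TATA+ or DPE+ is not described on the slide, so the boolean flags are taken as given; `partition_counts` and its `total` helper are hypothetical names.

```python
from collections import Counter

def partition_counts(flags):
    """Tally promoters by (inr, tata, dpe) presence flags.

    flags: iterable of 3-tuples of booleans, one tuple per promoter.
    Returns (leaves, total): the leaf counts and a helper that sums the
    leaves matching any partially specified branch of the tree.
    """
    leaves = Counter(tuple(bool(x) for x in f) for f in flags)

    def total(inr=None, tata=None, dpe=None):
        want = (inr, tata, dpe)
        return sum(
            n for key, n in leaves.items()
            if all(w is None or w == k for w, k in zip(want, key))
        )

    return leaves, total
```

For example, `total(inr=True)` gives the size of the INR+ branch and `total(inr=True, tata=False)` the INR+, TATA- node.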
22
3 Clusters in INR-positive set
[Figure; axis scale 0.0 to 250.0]
Words shown: ggtcacact, ggtcacac, cggtcacac, ttcagtcg, cggacgtg, tataaaag, tcagt
Clusters: TATA (-40, -35), DPE (+20, -30), INR (-10, +2)
23
Contingency Matrices for INR, TATA, DPE

         TATA+   TATA-   Total
INR+     410     1201    1611
INR-     397     1404    1801
Total    807     2605
INR+, TATA+ Log Likelihood: 0.073

         DPE+    DPE-    Total
INR+     448     1163    1611
INR-     308     1493    1801
Total    756     2656
INR+, DPE+ Log Likelihood: 0.227

         DPE+    DPE-    Total
TATA+    155     652     807
TATA-    601     2004    2605
Total    756     2656
TATA+, DPE+ Log Likelihood: -0.143
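The slide does not define "Log Likelihood". One reading that reproduces all three reported values is the natural log of the observed co-occurrence count over the count expected if the two elements were independent. This is an inference from the numbers, not a definition given in the source; the function name `log_ratio` is hypothetical.

```python
import math

def log_ratio(n_both, n_a, n_b, n_total):
    """ln(observed / expected) for co-occurrence of two elements.

    Assumption: expected count under independence is n_a * n_b / n_total.
    n_both must be positive, otherwise the logarithm is undefined.
    """
    expected = n_a * n_b / n_total
    return math.log(n_both / expected)

# Values taken from the three contingency matrices (total = 3412).
inr_tata = log_ratio(410, 1611, 807, 3412)
inr_dpe = log_ratio(448, 1611, 756, 3412)
tata_dpe = log_ratio(155, 807, 756, 3412)
```

A positive value means the pair co-occurs more often than independence would predict, and a negative value means less often, as in the TATA by DPE matrix.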
24
Possible Alternative TATA and INR Synonyms ??
[Figure; axis scale 0.0 to 90.0]
Words shown: tctttcttt, ggtcacac, ctcgaggg, ctatcgat, cggtcacac, ttctttccg, gtcacact
Labels: TATA – 2 ?, INR – 2 ?
25
Enrichment further upstream – New Binding Sites?
Words shown: actatcgat ctatcgat tatcgataaactatcgat
26
Next Level of Binary Tree analysis
TOTAL: 3412 (INR+: 1611, INR-: 1801)
Counts shown: 410, 1201, 397, 1404
Splits shown: INR+ / INR-, TATA+ / TATA-, INR_2+ / INR_2-, TATA_2+ / TATA_2-, DPE+ / DPE-
Counts for the new DPE+ / DPE- branches are marked “?”.
27
Conclusions & Future steps
- The main goal of this project was to identify significant words based only on statistical over-representation.
- The first part of the analysis, using an unbiased search, was successful only in a very narrow range of positions around the TSS.
- However, the biased search starting with the Inr consensus revealed the 3 known regulatory elements in that region.
- An analysis of the Inr-negative set showed over-representation of patterns at the positions where the Inr, TATA and DPE would be expected; these could be synonyms.
- Thus the word-counting strategy has the potential to reveal:
  - Regulatory motifs and interrelationships that other motif discovery programs cannot
  - Synonyms for regulatory motifs
  - Dependencies among regulatory motifs
28
Acknowledgements
Dr. Haixu Tang
Dr. Sun Kim
Dr. Peter Cherbas
Sumit Middha
Bioinformatics Research Group