Simple cluster structure of triplet distributions in genetic texts. Andrei Zinovyev, Institut des Hautes Études Scientifiques, Bures-sur-Yvette
Markov chain models: …AGGTCGATC… Transition probabilities = frequencies of N-grams
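A minimal sketch of how transition probabilities could be estimated from N-gram counts, assuming the usual maximum-likelihood estimate P(next base | context) = count(context + base) / count(context + any base). The function name `transition_probabilities` and the `order` parameter are illustrative and not taken from the slides.

```python
from collections import Counter


def transition_probabilities(seq, order=2):
    """Estimate P(next base | previous `order` bases) from (order+1)-gram counts."""
    n = order + 1
    # Count every overlapping N-gram of length order + 1.
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Total count of N-grams sharing the same context (all but the last base).
    context_totals = Counter()
    for gram, c in counts.items():
        context_totals[gram[:-1]] += c
    # Normalise each N-gram count by the total for its context.
    return {gram: c / context_totals[gram[:-1]] for gram, c in counts.items()}


# Example on the fragment shown on the slide, with a first-order chain.
probs = transition_probabilities("AGGTCGATC", order=1)
```

With `order=5` the same function counts hexamers, which is the model order mentioned later for GLIMMER.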
AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC Sliding window of width W. Triplet frequencies: f_AAA, f_AAC, f_GGG, … = f_ijk, i,j,k ∈ {A,C,G,T}
Protein-coding sequences (bacterial gene): AGGTCG ATG AATCCGTATTGACAAATGAATCCG TAA TGACATGACAATCCAACATGACAAT Triplet frequencies in the correct frame: f_ijk; in the shifted frames: f_ijk^(1), f_ijk^(2)
“Shadow” genes: TCCAGC TTA TGAGGCATAACTGTTTACTGAGGC CAT ACT GTACTGTTAGGTTGTACTGTTA AGGTCG AAT ACTCCGTATTGACAAATGACTCCG GTA TGACATGACAATCCAACATGACAAT (shadow gene)
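The two strands on this slide are base-wise complements of each other. A sketch of how codon frequencies of a gene might look when that gene lies on the complementary strand, assuming a shadow gene is read on the forward strand as the reverse complement of its codons. The helper names are hypothetical.

```python
# Translation table for base complementarity: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def shadow_frequencies(freqs):
    """Map codon frequencies of a gene to the triplet frequencies observed
    on the forward strand when the gene sits on the complementary strand."""
    return {reverse_complement(triplet): f for triplet, f in freqs.items()}


# A start codon ATG and a stop codon TAA appear as CAT and TTA.
example = shadow_frequencies({"ATG": 0.5, "TAA": 0.5})
```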
When can we detect genes (by their content)? 1. When non-coding regions are very different in base composition (e.g., different GC-content). 2. When the distances between the phases are large.
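To make the second condition concrete, the triplet distribution can be computed in each of the three phases and the distributions compared. The sketch below assumes the Euclidean distance between frequency vectors; the slides do not say which distance is meant, and the function names are illustrative.

```python
import math
from collections import Counter


def frame_frequencies(seq, shift):
    """Frequencies of non-overlapping triplets read with a frame shift of 0, 1 or 2."""
    s = seq[shift:]
    triplets = [s[i:i + 3] for i in range(0, len(s) - 2, 3)]
    counts = Counter(triplets)
    total = len(triplets)
    return {t: c / total for t, c in counts.items()}


def phase_distance(f, g):
    """Euclidean distance between two triplet distributions (an assumed choice)."""
    keys = set(f) | set(g)
    return math.sqrt(sum((f.get(k, 0.0) - g.get(k, 0.0)) ** 2 for k in keys))


# Toy "gene" made of a single repeated codon: the three phases are far apart.
gene = "ATG" * 10
f0, f1, f2 = (frame_frequencies(gene, k) for k in range(3))
d01 = phase_distance(f0, f1)
```

For a sequence with no frame-dependent structure the three distributions would nearly coincide and the distances would be close to zero.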
Simple experiment:
1. Only the forward strands of genomes are used for triplet counting.
2. Every p positions in the sequence, open a window (x - W/2, x + W/2) of size W centered at position x.
3. Every window, starting from its first base pair, is divided into W/3 non-overlapping triplets, and the frequencies of all triplets f_ijk are calculated.
4. The dataset consists of N = [L/p] points, where L is the entire length of the sequence.
5. Every data point X_i = {x_is} corresponds to one window and has 64 coordinates, corresponding to the frequencies of all possible triplets, s = 1,…,64.
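The steps above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the talk: windows that would run past either end of the sequence are skipped (so the number of points is slightly below [L/p]), triplets containing characters other than A, C, G, T are ignored, and the names `triplet_frequencies` and `window_dataset` are hypothetical.

```python
from itertools import product

# The 64 possible triplets in a fixed order; INDEX maps a triplet to its coordinate s.
TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]
INDEX = {t: s for s, t in enumerate(TRIPLETS)}


def triplet_frequencies(window):
    """Split a window into non-overlapping triplets, starting at its first base,
    and return the 64 triplet frequencies."""
    counts = [0] * 64
    for i in range(0, len(window) - 2, 3):
        s = INDEX.get(window[i:i + 3])
        if s is not None:  # skip triplets with ambiguous bases such as N
            counts[s] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 64


def window_dataset(seq, W, p):
    """Open a window of size W every p positions and return a list of
    (window centre, 64-dimensional frequency vector) pairs."""
    half = W // 2
    starts = range(0, len(seq) - W + 1, p)
    return [(start + half, triplet_frequencies(seq[start:start + W]))
            for start in starts]


# Tiny example: four windows of width 6 taken at every position.
data = window_dataset("ATGAAATTT", W=6, p=1)
```

Each frequency vector is one data point X_i, so the returned list is the dataset that the subsequent analysis operates on.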
Principal Component Analysis (figure): 1st principal axis, along the direction of maximal dispersion; 2nd principal axis
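A dependency-free sketch of finding the first two principal axes of a dataset. It uses power iteration with deflation on the sample covariance matrix, which is a swapped-in technique chosen for brevity; the slides do not say how the axes were computed, and a linear-algebra library would normally be used instead. All function names are illustrative.

```python
import math
import random


def center(data):
    """Subtract the mean of every coordinate."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - mean[j] for j in range(d)] for row in data]


def covariance(centered):
    """Sample covariance matrix of already centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def leading_axis(cov, iterations=500, seed=0):
    """Power iteration: unit vector of maximal dispersion and its variance."""
    rng = random.Random(seed)
    d = len(cov)
    v = [rng.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    return v, variance


def principal_axes(data, k=2):
    """Return the first k principal axes as (unit vector, variance) pairs."""
    cov = covariance(center(data))
    d = len(cov)
    axes = []
    for _ in range(k):
        v, var = leading_axis(cov)
        axes.append((v, var))
        # Deflation: remove the dispersion already explained by this axis.
        cov = [[cov[a][b] - var * v[a] * v[b] for b in range(d)] for a in range(d)]
    return axes


# Points lying on the line y = 2x: all dispersion is along one axis.
axes = principal_axes([[t, 2 * t] for t in (-2, -1, 0, 1, 2)], k=2)
```

The sign of each axis is arbitrary, so only the direction up to sign is meaningful.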
ViDaExpert tool
Caulobacter crescentus (GenBank NC_002696)
“Path” of sliding window
Helicobacter pylori (GenBank NC_000921)
Saccharomyces cerevisiae chromosome IV
Model sequences (random codon usage)
Model sequences (random codon usage + 50% of frequencies set to 0)
Graph of coding phase
Assessment (completely blind prediction)
Columns: Sequence; L; W; % of coding bases; Sn 1; Sp 1; Sn 2; Sp 2
Sequences:
Helicobacter pylori, complete genome (NC_000921)
Caulobacter crescentus, complete genome (NC_002696)
Prototheca wickerhamii mitochondrion (NC_001613)
Saccharomyces cerevisiae chromosome III (NC_001135)
Saccharomyces cerevisiae chromosome IV (NC_001136)
Model text RANDOM
Model text RANDOM_BIAS
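The slide does not define Sn and Sp, nor the meaning of the subscripts 1 and 2. The sketch below assumes the nucleotide-level definitions commonly used when evaluating gene finders: sensitivity Sn = TP / (TP + FN) and specificity Sp = TP / (TP + FP), where a "positive" is a base labelled as coding. The function name and the per-base label format are hypothetical.

```python
def sensitivity_specificity(true_coding, predicted_coding):
    """Per-base Sn and Sp from two equal-length sequences of 0/1 labels
    (1 = coding). Returns NaN for a measure whose denominator is zero."""
    pairs = list(zip(true_coding, predicted_coding))
    tp = sum(1 for t, p in pairs if t and p)        # coding, predicted coding
    fn = sum(1 for t, p in pairs if t and not p)    # coding, missed
    fp = sum(1 for t, p in pairs if not t and p)    # non-coding, predicted coding
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tp / (tp + fp) if tp + fp else float("nan")
    return sn, sp


# Three truly coding bases, two of them predicted, no false positives.
sn, sp = sensitivity_specificity([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0])
```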
Dependence on window size (figure panels): W = 51, W = 252, W = 900, W = 2000
State of the art: GLIMMER strategy
1. Use a Markov model (MM) of 5th order (hexamers).
2. Use interpolation for transition probabilities.
3. Use long ORFs (>500 bp) as the learning dataset.
Problems:
1. The number of hexamers to be evaluated is still large.
2. Applicable only to assembled genomes of good quality (<1 frameshift per 1000 bp).
What can we learn from this game?
Learning can be replaced with self-learning.
Bacterial gene-finders work relatively well when the concentration of coding sequences is high.
Correlations in the order of codons are small.
Codon usage is approximately the same along the genome.
The method presented allows self-learning even on pieces of unassembled DNA (>150 bp).
The method gives an alternative to the HMM view of the problem of gene recognition.
Acknowledgements: Professor Alexander Gorban, Professor Misha Gromov. My coordinates: