Seven clusters and four types of symmetry in microbial genomes
Andrei Zinovyev (Bioinformatics service group of M. Gromov); Tatyana Popova (R&D Centre in Biberach, Germany); Alexander Gorban (Centre for Mathematical Modelling)
Symbol of GofG’05
Genomic sequence as a text in an unknown language: tagggrcgcacgtggtgagctgatgctaggg
Frequency dictionaries:
t a g g g r c g c a c g t g g t g a g c t g a t g c t a g g g
ta gg gr cg ca cg tg gt ga gc tg at gc ta gg
tagg grcg cacg tggt gagc tgat gcta gggr
Dictionary sizes: $N = 4 = 4^1$, $N = 16 = 4^2$, $N = 64 = 4^3$, $N = 256 = 4^4$
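The frequency dictionaries on this slide can be illustrated with a minimal Python sketch. It assumes that words are read without overlap starting from the first letter, as the doublet and quadruplet rows on the slide suggest; the function name `frequency_dictionary` is hypothetical and not taken from the talk.

```python
from collections import Counter

def frequency_dictionary(text, k):
    """Relative frequencies of non-overlapping words of length k,
    read from the first letter (assumption: no overlap, no frame search)."""
    words = [text[i:i + k] for i in range(0, len(text) - k + 1, k)]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# The example text from the slide (it contains the ambiguity letter 'r',
# which is simply treated as one more symbol here).
seq = "tagggrcgcacgtggtgagctgatgctaggg"
for k in (1, 2, 3, 4):
    d = frequency_dictionary(seq, k)
    print(k, "distinct words:", len(d))
```

For a real genome the dictionary of length-k words has at most 4^k entries, which gives the sizes N = 4, 16, 64, 256 quoted on the slide.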
From text to geometry: the sequence cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc is cut into fragments
cgtggtgagctgatgctagggrcgcac
ggtgagctgatgctagggrcgcacact
tgagctgatgctagggrcgcacaattc
gtgagctgatgctagggrcgcacggtg
……
gagctgatgctagggrcgcacaagtga
and each fragment becomes a point in $R^N$. [Slide labels: 10 7; length ~]
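A hedged sketch of the step from text to geometry: each fragment is mapped to a vector of word frequencies, that is, a point in R^N with N = 4^k. The window width and step below are illustrative assumptions (the values used in the talk are not recoverable from the slide), and the helper names are hypothetical.

```python
from itertools import product

ALPHABET = "acgt"

def sliding_fragments(sequence, width, step):
    """Cut the sequence into fragments of equal width (width and step are assumptions)."""
    return [sequence[s:s + width] for s in range(0, len(sequence) - width + 1, step)]

def fragment_vector(fragment, k=3):
    """Map a fragment to a point in R^N, N = 4**k: normalized counts of
    non-overlapping k-letter words read from the start of the fragment."""
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {w: i for i, w in enumerate(words)}
    vec = [0.0] * len(words)
    n = 0
    for i in range(0, len(fragment) - k + 1, k):
        w = fragment[i:i + k]
        if w in index:  # words with ambiguity letters such as 'r' are skipped
            vec[index[w]] += 1.0
            n += 1
    return [v / n for v in vec] if n else vec

seq = "cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc"
points = [fragment_vector(f) for f in sliding_fragments(seq, width=27, step=3)]
print(len(points), "points in R^%d" % len(points[0]))
```

Each fragment is read in whatever frame its start position happens to fall, so no knowledge of gene positions is needed at this stage.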
Method of visualization: principal components analysis, $R^N \to R^2$ (PCA plot)
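The projection from R^N to R^2 can be sketched in plain Python. This is a minimal illustration that finds the first two principal axes by power iteration with deflation; it is an assumption made for the sake of a self-contained example, not necessarily the numerical method used for the plots in the talk, and `pca_2d` is a hypothetical name.

```python
import random

def pca_2d(points, iterations=200, seed=0):
    """Project points in R^N onto their first two principal components."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - mean[j] for j in range(dim)] for p in points]

    def cov_times(v):
        # Covariance matrix times v, computed as X^T (X v) / n.
        proj = [sum(row[j] * v[j] for j in range(dim)) for row in centered]
        return [sum(proj[i] * centered[i][j] for i in range(n)) / n for j in range(dim)]

    def normalize(v):
        norm = sum(x * x for x in v) ** 0.5
        return [x / norm for x in v] if norm > 1e-12 else [0.0] * len(v)

    rng = random.Random(seed)
    axes = []
    for _ in range(2):
        v = normalize([rng.uniform(-1, 1) for _ in range(dim)])
        for _ in range(iterations):
            w = cov_times(v)
            for a in axes:  # deflation: remove directions already found
                dot = sum(x * y for x, y in zip(w, a))
                w = [x - dot * y for x, y in zip(w, a)]
            v = normalize(w)
        axes.append(v)
    return [[sum(r[j] * a[j] for j in range(dim)) for a in axes] for r in centered]

# Toy data: points lying in a plane embedded in R^3.
toy = [[3.0 * x, float(y), 5.0] for x in (-2, -1, 0, 1, 2) for y in (-1, 0, 1)]
for xy in pca_2d(toy)[:3]:
    print(xy)
```

The sign of each principal axis is arbitrary, so a PCA plot may appear mirrored between runs or implementations.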
Caulobacter crescentus (panels): singles, N=4; doublets, N=16; triplets, N=64; quadruplets, N=256
!!! The information in a genomic sequence is encoded by non-overlapping triplets (Nature, 1961)
First explanation cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc
Basic 7-cluster structure
gtgagctgatgctagggrcgcacgtggtgagc
gct gat gct agg grc gca cgt
ctg atg cta ggg rcg cac gtg
tga tgc tag ggr cgc acg tgg
gtgaatcggtgggtgagtgtgctgctatgagc
atc ggt ggg tga gtg tgc tgc
tcg gtg ggt gag tgt gct gct
cgg tgg gtg agt gtg ctg ctg
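The three triplet readings shown under each sequence can be reproduced with a short sketch. It rests on one reading of the slides, consistent with the papers listed at the end of the talk: the seven clusters are assumed to correspond to three frame shifts of genes on the forward strand, three frame shifts of genes on the backward strand, and the non-coding parts. The function names are hypothetical.

```python
# Complement table; 'r' (a or g) pairs with 'y' (c or t) in the IUPAC code.
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c", "r": "y", "y": "r"}

def reverse_complement(seq):
    """Sequence of the opposite strand; unknown letters become 'n'."""
    return "".join(COMPLEMENT.get(ch, "n") for ch in reversed(seq))

def triplets(seq, shift):
    """Non-overlapping triplets read with the given frame shift (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(shift, len(seq) - 2, 3)]

def six_readings(seq):
    """Three shifts on the given strand plus three on its reverse complement."""
    rc = reverse_complement(seq)
    readings = {}
    for shift in range(3):
        readings[("forward", shift)] = triplets(seq, shift)
        readings[("backward", shift)] = triplets(rc, shift)
    return readings

seq = "gtgagctgatgctagggrcgcacgtggtgagc"
for key, words in six_readings(seq).items():
    print(key, " ".join(words))
```

If a fragment lies inside a gene, only one of these six readings matches the codon frame. That is why fragments of coding regions can fall into six distinct groups of triplet frequencies, with non-coding fragments forming the seventh.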
Non-coding parts: gtgagctgatgctagggr cgcacgaat
Point mutations: insertions, deletions (slide example letter: a)
The flower-like 7 clusters structure is flat
Seven classes vs Seven clusters (Stanford, TIGR, Georgia Institute of Technology)
Computational gene prediction: accuracy > 90%
Mean-field approximation for triplet frequencies
$F_{IJK}$: frequency of triplet $IJK$ ($I, J, K \in \{A, C, G, T\}$): $F_{AAA}, F_{AAT}, F_{AAC}, \dots, F_{GGC}, F_{GGG}$: 64 numbers
Position-specific letter frequencies + correlations: 12 numbers
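A minimal sketch of the reduction from 64 to 12 numbers, assuming the product form $F_{IJK} \approx P^1_I P^2_J P^3_K$ that the talk states for the mean-field approximation. The toy codon list and the function names are illustrative assumptions, not data from the talk.

```python
from itertools import product

LETTERS = "ACGT"

def position_frequencies(F):
    """12 numbers P[pos][letter]: frequency of each letter at codon
    position 1, 2, 3, obtained by summing the triplet frequencies F."""
    P = [{x: 0.0 for x in LETTERS} for _ in range(3)]
    for triplet, f in F.items():
        for pos, letter in enumerate(triplet):
            P[pos][letter] += f
    return P

def mean_field(P):
    """64 numbers rebuilt from the 12: product of position-specific frequencies."""
    return {i + j + k: P[0][i] * P[1][j] * P[2][k]
            for i, j, k in product(LETTERS, repeat=3)}

# Toy codon usage: eight codons with equal frequency (illustrative only).
codons = ["ATG", "GAT", "GCT", "AGG", "GTC", "GCA", "CGC", "TAA"]
F = {c: 1.0 / len(codons) for c in codons}
P = position_frequencies(F)
M = mean_field(P)
# What the product form misses is the "correlations" term.
correlations = {t: F.get(t, 0.0) - M[t] for t in M}
print(max(abs(v) for v in correlations.values()))
```

The difference between the observed triplet frequencies and the product of the position-specific frequencies is what the slide calls correlations.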
Why hexagonal symmetry? $\text{GC-content} = P_C + P_G$
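The GC-content formula is simple enough to compute directly. The sketch below assumes that letters other than A, C, G, T (for example ambiguity codes) are ignored; the function name `gc_content` is hypothetical.

```python
def gc_content(seq):
    """GC-content = P_C + P_G, the combined frequency of letters C and G.
    Letters outside A, C, G, T are ignored (an assumption of this sketch)."""
    counted = [ch for ch in seq.upper() if ch in "ACGT"]
    if not counted:
        return 0.0
    return sum(ch in "GC" for ch in counted) / len(counted)

print(gc_content("tagggrcgcacgtggtgagctgatgctaggg"))
```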
Genome codon usage and mean-field approximation
… ggtgaATG gat gct agg … gtc gca cgc TAAtgagct … (correct frameshift): 64 frequencies $F_{IJK}$
… ggtgaATG gat gct agg … gtc gca cgc TAAtgagct …: 12 frequencies $P^1_I, P^2_J, P^3_K$
The position-specific letter frequencies $P^J_I$ are linear functions of GC-content (panels: eubacteria, archaea)
THE MYSTERY OF TWO STRAIGHT LINES ??? $R^{12}$, $R^{64}$: $F_{IJK} = P^1_I P^2_J P^3_K + \text{correlations}$
Codon usage signature 0-+
19 possible eubacterial signatures
Example: Palindromic signatures
Four symmetry types of the basic 7-cluster structure (eubacteria): flower-like, degenerated, perpendicular triangles, parallel triangles
B. halodurans (GC = 44%), S. coelicolor (GC = 72%), F. nucleatum (GC = 27%), E. coli (GC = 51%)
Web-site cluster structures in genomic sequences
Human genome (chr19): non-repetitive sequences and repetitive sequences (panels: singles, doublets, triplets)
Letter frequencies (3 dimensions): GC-content (50%), Purine-Pyrimidine (33%), Amino-Keto (17%)
A good non-linear 2D representation (elastic principal manifolds) [Figure labels: A, T, G, C; scale 0% to 100%]
Measuring densities [Figure: two panels labelled A, T, G, C]
Contrasting density distribution (two ideas): noise is Gaussian; noise is smooth
Contrasted density [Figure: two panels labelled A, T, G, C]
Excluding repeats [Figure: two panels labelled A, T, G, C]
Papers (type Zinovyev in Google)
Gorban A, Zinovyev A. PCA deciphers genome. Arxiv preprint.
Gorban A, Popova T, Zinovyev A. Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences. Physica A 353.
Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of microbial genomic sequences. In Silico Biology 5, 0025.
Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. In Silico Biology. V.3.
Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. Open Systems and Information Dynamics 10 (4).
People: Dr. Tanya Popova (Institute of Computational Modeling, Russia); Professor Alexander Gorban (University of Leicester, UK)