Codons, Genes and Networks Bioinformatics service group of M.Gromov Andrei Zinovyev.


1 Codons, Genes and Networks Bioinformatics service Math@Bio group of M.Gromov Andrei Zinovyev

2 Plan of the talk Part I: 7-clusters structure of genome (codons and genes) Part II: Coding and non-coding DNA scaling laws (genes and networks)

3 Part I: 7-clusters genome structure Dr. Tatyana Popova R&D Centre in Biberach, Germany Prof. Alexander Gorban Centre for Mathematical Modelling

4 Genomic sequence as a text in unknown language
tagggacgcacgtggtgagctgatgctaggg
Frequency dictionaries:
singles, $N = 4 = 4^1$: t a g g g a c g c a c g t g g t g a g c t g a t g c t a g g g
doublets, $N = 16 = 4^2$: ta gg ga cg ca cg tg gt ga gc tg at gc ta gg
triplets, $N = 64 = 4^3$
quadruplets, $N = 256 = 4^4$: tagg gacg cacg tggt gagc tgat gcta gggr
gggrcgccacgttggtgagctgatgctagggrcgacgtgg
tagggrcgcacgtggtgagctgatgctagggrcgacgtgg
agggrcgcacgtggtgagctgatgctagggrcgacgtggc
..cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacgacgtggtgagctgatgctagggacgc…
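A frequency dictionary of this kind can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `frequency_dictionary` is hypothetical, and words are taken without overlap, as in the splits shown on the slide.

```python
from collections import Counter


def frequency_dictionary(seq: str, k: int) -> dict:
    """Relative frequencies of non-overlapping words of length k.

    The dictionary can hold at most 4**k distinct words for a
    four-letter alphabet (4, 16, 64, 256 for k = 1..4).
    """
    seq = seq.lower()
    # Cut the text into consecutive, non-overlapping words; a trailing
    # remainder shorter than k is ignored.
    words = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]
    counts = Counter(words)
    total = len(words)
    return {word: n / total for word, n in counts.items()}


seq = "tagggacgcacgtggtgagctgatgctaggg"  # sequence from the slide
singles = frequency_dictionary(seq, 1)
doublets = frequency_dictionary(seq, 2)
```

For the slide's 31-letter example the doublet dictionary is built from 15 words (ta gg ga cg ca cg tg gt ga gc tg at gc ta gg), so each frequency is a multiple of 1/15.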

5 From text to geometry
Sequence (~$10^7$ letters): cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacgacgtggtgagctgatgctagggacgc
Fragments (length ~200-400; 10000-20000 fragments): cgtggtgagctgatgctagggacgcac, ggtgagctgatgctagggacgcacact, tgagctgatgctagggacgcacaattc, gtgagctgatgctagggacgcacggtg, ……, gagctgatgctagggacgcacaagtga
Each fragment becomes a point in $R^N$.
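The step from text to points in $R^N$ can be sketched as follows for triplets ($N = 64$). This is an illustrative sketch only: the names `fragments` and `triplet_vector` are hypothetical, and the window width and step are free parameters chosen by the user (the slide gives only the fragment length range).

```python
from itertools import product

# The 64 possible triplets in a fixed order; this order defines the
# coordinate axes of R^64.
TRIPLETS = ["".join(p) for p in product("acgt", repeat=3)]


def fragments(seq: str, width: int, step: int) -> list:
    """Cut a sequence into windows of `width` letters, sliding by `step`."""
    return [seq[i:i + width] for i in range(0, len(seq) - width + 1, step)]


def triplet_vector(fragment: str) -> list:
    """Frequencies of non-overlapping triplets, read from the fragment start."""
    counts = dict.fromkeys(TRIPLETS, 0)
    n = 0
    for i in range(0, len(fragment) - 2, 3):
        word = fragment[i:i + 3].lower()
        if word in counts:  # skip words with ambiguous letters such as 'r'
            counts[word] += 1
            n += 1
    return [counts[t] / n if n else 0.0 for t in TRIPLETS]


# Toy usage: a periodic sequence, so every window starting at a
# multiple of 3 contains only the triplet "acg".
toy = "acg" * 100
points = [triplet_vector(f) for f in fragments(toy, width=30, step=3)]
```

Each element of `points` is one fragment expressed as a 64-dimensional frequency vector, which is the input to the visualization step.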

6 Method of visualization: principal components analysis, $R^N \to R^2$ (PCA plot)
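The projection $R^N \to R^2$ can be illustrated with a small dependency-free sketch. It uses power iteration with deflation to find the first two principal directions; this is a stand-in chosen for self-containment, not necessarily the procedure used in the talk, and in practice a library PCA routine would normally be used. All function names are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(rows):
    """Subtract the column means from every row."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - mean[j] for j in range(d)] for r in rows]


def top_component(rows, iters=200, seed=0):
    """Leading principal direction via power iteration on X^T X."""
    rng = random.Random(seed)
    d = len(rows[0])
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        scores = [dot(r, v) for r in rows]  # X v
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(d)]  # X^T (X v)
        norm = dot(w, w) ** 0.5
        if norm == 0.0:  # no variance left
            break
        v = [x / norm for x in w]
    return v


def pca_2d(rows):
    """Coordinates of each row on the first two principal components."""
    x = center(rows)
    v1 = top_component(x)
    s1 = [dot(r, v1) for r in x]
    # Deflation: remove the part of each row that lies along v1.
    x2 = [[a - s * b for a, b in zip(r, v1)] for r, s in zip(x, s1)]
    v2 = top_component(x2)
    s2 = [dot(r, v2) for r in x2]
    return list(zip(s1, s2))


# Toy usage: points on a straight line have all their variance on PC1.
line = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
projected = pca_2d(line)
```

The sign of each component is arbitrary, so a plot produced this way may appear mirrored relative to another implementation.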

7 Caulobacter crescentus: singles (N=4), doublets (N=16), triplets (N=64), quadruplets (N=256). !!! The information in genomic sequence is encoded by non-overlapping triplets (Nature, 1961)

8 First explanation cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc

9 Basic 7-cluster structure
gtgagctgatgctagggrcgcacgtggtgagc read in three frames:
tga tgc tag ggr cgc acg tgg
ctg atg cta ggg rcg cac gtg
gct gat gct agg grc gca cgt
gtgaatcggtgggtgagtgtgctgctatgagc read in three frames:
atc ggt ggg tga gtg tgc tgc
tcg gtg ggt gag tgt gct gct
cgg tgg gtg agt gtg ctg ctg

10 Non-coding parts gtgagctgatgctagggr cgcacgaat Point mutations: insertions, deletions a

11 The flower-like 7 clusters structure is flat

12 Seven classes vs Seven clusters. Stanford, TIGR, Georgia Institute of Technology.
Ou H.-Y., Guo F.-B., Zhang C.-T. (2003). Analysis of nucleotide distribution in the genome of Streptomyces coelicolor A3(2) using the Z curve method. FEBS Letters 540(1-3), 188-194.
Audic S., Claverie J. (1998). Self-identification of protein-coding regions in microbial genomes. Proc Natl Acad Sci U S A 95(17), 10026-31.
Lomsadze A., Ter-Hovhannisyan V., Chernoff Y.O., Borodovsky M. (2005). Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Research 33(20).

13 Computational gene prediction Accuracy >90%

14 Mean-field approximation for triplet frequencies. $F_{IJK}$: frequency of triplet IJK ($I, J, K \in \{A, C, G, T\}$): $F_{AAA}, F_{AAT}, F_{AAC}, \dots, F_{GGC}, F_{GGG}$: 64 numbers. Position-specific letter frequencies (+ correlations): 12 numbers.

15 Why hexagonal symmetry? 0-+ -+0 +0- +-0 -0+ 0+- GC-content = $P_C + P_G$

16 Genome codon usage and mean-field approximation
… ggtgaATG gat gct agg … gtc gca cgc TAAtgagct … (correct frameshift): 64 frequencies $F_{IJK}$
… ggtgaATG gat gct agg … gtc gca cgc TAAtgagct …: 12 frequencies $P^1_I, P^2_J, P^3_K$
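The reduction from 64 codon frequencies to 12 position-specific letter frequencies, and the mean-field product built from them, can be sketched as below. The function names are hypothetical and the toy codon usage is invented purely for illustration.

```python
from itertools import product

LETTERS = "ACGT"


def position_frequencies(F: dict) -> list:
    """Marginals P[0][I], P[1][J], P[2][K] of codon frequencies F[IJK].

    Three positions times four letters gives the 12 numbers.
    """
    P = [dict.fromkeys(LETTERS, 0.0) for _ in range(3)]
    for codon, freq in F.items():
        for pos, letter in enumerate(codon):
            P[pos][letter] += freq
    return P


def mean_field(P: list) -> dict:
    """Mean-field estimate of all 64 codon frequencies: P1_I * P2_J * P3_K."""
    return {i + j + k: P[0][i] * P[1][j] * P[2][k]
            for i, j, k in product(LETTERS, repeat=3)}


# Toy codon usage (made up): only two codons occur.
F = {"ATG": 0.5, "GCT": 0.5}
P = position_frequencies(F)
M = mean_field(P)
# Whatever the product does not explain is the "correlations" term.
correlations = {codon: F.get(codon, 0.0) - M[codon] for codon in M}
```

In this toy case the letters at the three positions are strongly dependent, so the correlation term is large; for a codon usage that factorizes exactly, it vanishes.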

17 $P^J_I$ are linear functions of GC-content (eubacteria, archaea)

18 THE MYSTERY OF TWO STRAIGHT LINES ??? $R^{12}$, $R^{64}$: $F_{IJK} = P^1_I P^2_J P^3_K$ + correlations

19 Codon usage signature 0-+

20 19 possible eubacterial signatures

21 Example: Palindromic signatures

22 Four symmetry types of the basic 7-cluster structure (eubacteria): flower-like, degenerated, perpendicular triangles, parallel triangles

23 B. halodurans (GC=44%), S. coelicolor (GC=72%), F. nucleatum (GC=27%), E. coli (GC=51%)

24 Using branching principal components to analyze 7-clusters genome structures

25 Using branching principal components to analyze 7-clusters genome structures: Streptomyces coelicolor, Bacillus halodurans, Escherichia coli, Fusobacterium nucleatum

26 Web-site http://www.ihes.fr/~zinovyev/7clusters cluster structures in genomic sequences

27 Papers (type Zinovyev in Google)
Gorban A, Zinovyev A. PCA deciphers genome. 2005. Arxiv preprint.
Gorban A, Popova T, Zinovyev A. Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences. 2005. Physica A 353, 365-387.
Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of microbial genomic sequences. 2005. In Silico Biology 5, 0025.
Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology 3, 0039.
Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4).

28 Part II: Coding and non-coding DNA scaling laws. Dr. Thomas Fink, Bioinformatics service. Dr. Sebastian Ahnert, Cavendish Laboratory, University of Cambridge

29 C-value and G-value paradox. Neither genome length nor gene number accounts for the complexity of an organism. Drosophila melanogaster (fruit fly): C=120 Mb. Podisma pedestris (mountain grasshopper): C=1650 Mb

30 Non-linear growth of regulation. “Amount of regulation” scales non-linearly with the number of genes: every new gene with a new function requires specific regulation, but the regulators also need to be regulated. Plot: log number of regulatory genes vs log number of genes, for bacteria and archaea; slope = 1.96, slope = 1. Mattick, J. S. Nature Reviews Genetics 5, 316–323 (2004).

31 Complexity ceiling for prokaryotes. Adding a new function S requires adding a regulatory overhead R; the total increase is N = R + S. Since $R \sim N^2$, at some point R > S, i.e. the gain from a new function is too expensive for an organism: it requires too much regulation to be integrated. There is a maximum possible genome length for prokaryotes (~10Mb)
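One way to read the ceiling argument quantitatively is sketched below. It assumes an exact quadratic law $R = aN^2$ with a hypothetical constant $a$ (the slide states only the scaling), and treats R and S in the argument as the increments $\Delta R$ and $\Delta S$ that accompany a small growth $\Delta N$.

```latex
% Assumption: R(N) = a N^2, with a > 0 a hypothetical constant.
\begin{align*}
  \Delta N &= \Delta R + \Delta S, \\
  \Delta R &\approx \frac{dR}{dN}\,\Delta N = 2aN\,\Delta N, \\
  \Delta S &\approx (1 - 2aN)\,\Delta N, \\
  \Delta R > \Delta S
    &\iff 2aN > \tfrac{1}{2}
    \iff N > \frac{1}{4a}.
\end{align*}
% Under the same assumption the functional gain \Delta S
% vanishes entirely at N = 1/(2a).
```

Under this reading, beyond $N = 1/(4a)$ each addition costs more in regulation than it delivers in new function, which is the sense in which a ceiling appears.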

32 How did eukaryotes bypass this limitation? Presumably, they invented a cheaper (digital) regulatory system, based on RNA. This regulatory information is stored in the “non-coding” DNA

33 Simple model: Accelerated networks. Node is a gene (c genes). Edge is a “regulation” (n edges). $n \sim c^2$. Connectivity < $k_{max}$: regulators are only proteins. Connectivity > $k_{max}$: deficit of regulations is taken from non-coding DNA

34 How much regulation does a genome need to take from non-coding DNA? $c_{max}$ (prokaryotic ceiling). These regulations must be encoded in the non-coding part of the genome, therefore: N: non-coding DNA length; C: coding DNA length; $C_{prok}$: ceiling for prokaryotes (~10Mb); some coefficient
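The size of the regulatory deficit can be worked out from the accelerated-network picture under a few stated assumptions. The constant $\alpha$ and the factor $1/2$ (each edge shared by two nodes) are assumptions introduced here for illustration; the transcript does not preserve the slide's own formula.

```latex
% Assumptions (not from the transcript):
%   n = \alpha c^2          edges needed by a network of c genes
%   n_{prot} = k_{max} c/2  edges available if mean connectivity is capped at k_{max}
\begin{align*}
  \text{ceiling: } \alpha c_{max}^2 &= \frac{k_{max}\, c_{max}}{2}
     \;\Rightarrow\; \alpha = \frac{k_{max}}{2\, c_{max}}, \\
  \text{deficit}(c) &= \alpha c^2 - \frac{k_{max}\, c}{2}
     = \frac{k_{max}}{2}\,\frac{c}{c_{max}}\,(c - c_{max}),
     \qquad c > c_{max}.
\end{align*}
```

The deficit is zero at the ceiling and grows like $c^2$ above it, which has the same shape as the expression for $N_{reg}$ given on slide 36 once gene counts are replaced by coding lengths.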

35 Observation: coding length vs non-coding =1 Minimum non-coding length needed for the «deficit» regulation

36 Hypothesis. Prokaryotes: a little constant add-on (promoters, UTRs…), 15% ≈ 1/7. Eukaryotes: $N_{reg} = \frac{\text{coefficient}}{2} \cdot \frac{C}{C_{maxprok}} (C - C_{maxprok}) \sim C^2$, $C_{maxprok} \approx 10$ Mb. This is the amount necessary for regulation, but repeats, genome parasites, etc., might make a genome much bigger

37 This is only a hypothesis, but… Prediction on the $N_{reg}$ for human: $N_{reg}$ = 87 Mb = 3% of genome length; C = 48 Mb = 1.7%; $N_{reg}$ + C = 4.7%
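The arithmetic on this slide can be checked with a short sketch. The percentages follow directly from the slide's numbers. The function `n_reg` evaluates the slide-36 expression with a coefficient of 1 and a ceiling of exactly 10 Mb; both values are assumptions made here, since the transcript does not preserve the coefficient.

```python
def n_reg(c_mb: float, c_maxprok_mb: float = 10.0, coefficient: float = 1.0) -> float:
    """Slide-36 expression for the regulatory non-coding length, in Mb.

    `coefficient` = 1.0 and `c_maxprok_mb` = 10.0 are assumed values,
    not taken from the slides.
    """
    return coefficient / 2 * (c_mb / c_maxprok_mb) * (c_mb - c_maxprok_mb)


# Numbers stated on the slide.
n_reg_slide_mb = 87.0   # predicted N_reg for human
coding_mb = 48.0        # coding length C

# "87 Mb = 3% of genome length" implies a genome of about 2900 Mb.
genome_mb = n_reg_slide_mb / 0.03
coding_share = 100 * coding_mb / genome_mb          # about 1.7 %
total_share = 100 * (n_reg_slide_mb + coding_mb) / genome_mb  # about 4.7 %

# Formula under the assumed parameters.
estimate_mb = n_reg(coding_mb)
```

With the assumed parameters the formula gives roughly 91 Mb, close to but not identical with the 87 Mb on the slide, so the slide presumably used a slightly different coefficient or ceiling.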

38 Thank you for your attention Questions?

