Codons, Genes and Networks Bioinformatics service group of M.Gromov Andrei Zinovyev.

Plan of the talk. Part I: 7-cluster structure of the genome (codons and genes). Part II: Coding and non-coding DNA scaling laws (genes and networks).

Part I: 7-cluster genome structure. Dr. Tatyana Popova (R&D Centre in Biberach, Germany); Prof. Alexander Gorban (Centre for Mathematical Modelling).

Genomic sequence as a text in an unknown language: tagggacgcacgtggtgagctgatgctaggg
Frequency dictionaries:
single letters (t a g g g a c g c a c g t g g t g a g c t g a t g c t a g g g), $N = 4 = 4^{1}$
doublets (ta gg ga cg ca cg tg gt ga gc tg at gc ta gg), $N = 16 = 4^{2}$
triplets, $N = 64 = 4^{3}$
quadruplets (tagg gacg cacg tggt gagc tgat gcta gggr), $N = 256 = 4^{4}$
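The frequency dictionaries on this slide can be illustrated with a short sketch. It assumes the words are read without overlap from the first letter, as in the slide's doublet and quadruplet examples; the function name `frequency_dictionary` is hypothetical and not taken from the talk.

```python
from collections import Counter

def frequency_dictionary(text, n):
    """Frequencies of non-overlapping words of length n, read from position 0."""
    words = [text[i:i + n] for i in range(0, len(text) - n + 1, n)]
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

seq = "tagggacgcacgtggtgagctgatgctaggg"  # the example sequence from the slide
for n in (1, 2, 3, 4):
    d = frequency_dictionary(seq, n)
    # 4**n is the dictionary size N; a short text fills only part of it
    print(n, 4 ** n, len(d))
```

For a text this short most of the $4^{n}$ possible words never occur, which is why real analyses use long genomic fragments.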

From text to geometry. Sequence ($10^{7}$): cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacgacgtggtgagctgatgctagggacgc. Fragments: cgtggtgagctgatgctagggacgcac, ggtgagctgatgctagggacgcacact, tgagctgatgctagggacgcacaattc, gtgagctgatgctagggacgcacggtg, ……, gagctgatgctagggacgcacaagtga. Each fragment becomes a point in $R^{N}$.

Method of visualization: principal components analysis (PCA), projecting $R^{N}$ onto $R^{2}$ (PCA plot).
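A minimal sketch of the text-to-geometry step for triplets (N = 64). The window `width` and `step` are illustrative assumptions, since the slide's actual fragment length did not survive in the transcript, and the helper names are hypothetical.

```python
from itertools import product

TRIPLETS = ["".join(p) for p in product("acgt", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}

def triplet_vector(fragment):
    """Normalized counts of non-overlapping triplets read from the fragment start."""
    counts = [0.0] * 64
    for i in range(0, len(fragment) - 2, 3):
        idx = INDEX.get(fragment[i:i + 3])  # triplets with other letters are skipped
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

def fragments_to_points(sequence, width, step):
    """Slide a window along the sequence; each window becomes a point in R^64."""
    return [triplet_vector(sequence[s:s + width])
            for s in range(0, len(sequence) - width + 1, step)]

seq = "cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacgacgtggtgagctgatgctagggacgc"
points = fragments_to_points(seq, width=27, step=3)  # assumed toy parameters
print(len(points), len(points[0]))
```

Each window is counted from its own first letter, so windows that start at different offsets inside a gene see the codons in different phases.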
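The projection onto the first two principal components can be sketched without any libraries, using power iteration with deflation. This is only an illustration under that assumption; the talk does not say how PCA was computed, and in practice a library routine would normally be used. All function names are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(points):
    """Subtract the mean point so that the cloud is centered at the origin."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    return [[p[j] - mean[j] for j in range(d)] for p in points]

def leading_direction(rows, iters=200, seed=0):
    """Power iteration for the top eigenvector of X^T X (X = centered rows)."""
    rng = random.Random(seed)
    d = len(rows[0])
    v = [rng.gauss(0, 1) for _ in range(d)]
    norm = dot(v, v) ** 0.5
    v = [x / norm for x in v]
    for _ in range(iters):
        scores = [dot(r, v) for r in rows]                                   # X v
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(d)]  # X^T X v
        norm = dot(w, w) ** 0.5
        if norm == 0:  # no variance left in the data
            break
        v = [x / norm for x in w]
    return v

def pca_2d(points):
    """Coordinates of every point on the first two principal components."""
    rows = center(points)
    pc1 = leading_direction(rows)
    # Deflation: remove the pc1 component from each row, then iterate again.
    deflated = [[x - dot(r, pc1) * a for x, a in zip(r, pc1)] for r in rows]
    pc2 = leading_direction(deflated, seed=1)
    return [(dot(r, pc1), dot(r, pc2)) for r in rows]

# Toy cloud stretched along one direction, with a small wobble in another.
cloud = [(float(i), 2.0 * i, 0.1 * (-1) ** i) for i in range(10)]
for x, y in pca_2d(cloud)[:3]:
    print(round(x, 3), round(y, 3))
```

The signs of the two coordinates are arbitrary, as with any PCA implementation.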

Caulobacter crescentus: singles N=4, doublets N=16, triplets N=64, quadruplets N=256. !!! The information in a genomic sequence is encoded by non-overlapping triplets (Nature, 1961)

First explanation cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc

Basic 7-cluster structure. Each fragment read as triplets in three frames:
gtgagctgatgctagggrcgcacgtggtgagc: gct gat gct agg grc gca cgt / ctg atg cta ggg rcg cac gtg / tga tgc tag ggr cgc acg tgg
gtgaatcggtgggtgagtgtgctgctatgagc: atc ggt ggg tga gtg tgc tgc / tcg gtg ggt gag tgt gct gct / cgg tgg gtg agt gtg ctg ctg

Non-coding parts: gtgagctgatgctagggr cgcacgaat. Point mutations: insertions, deletions.

The flower-like 7-cluster structure is flat

Seven classes vs seven clusters (Stanford, TIGR, Georgia Institute of Technology).
Hong-Yu Ou, Feng-Biao Guo and Chun-Ting Zhang (2003). Analysis of nucleotide distribution in the genome of Streptomyces coelicolor A3(2) using the Z curve method. FEBS Letters 540(1-3).
Audic, S. and J. Claverie. Self-identification of protein-coding regions in microbial genomes. Proc Natl Acad Sci U S A, 95(17).
Lomsadze A., Ter-Hovhannisyan V., Chernoff YO, Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Research, 2005, Vol. 33, No. 20.

Computational gene prediction Accuracy >90%

Mean-field approximation for triplet frequencies. $F_{IJK}$: frequency of triplet $IJK$ ($I, J, K \in \{A, C, G, T\}$): $F_{AAA}, F_{AAT}, F_{AAC}, \dots, F_{GGC}, F_{GGG}$: 64 numbers. Position-specific letter frequencies: 12 numbers, plus correlations.
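A small sketch of how the 64 triplet frequencies reduce to 12 position-specific letter frequencies, and how their product gives the mean-field approximation. The function names are hypothetical, and the toy reading frame simply joins the codons shown elsewhere in the talk (atg gat gct agg gtc gca cgc taa); it is far too short for meaningful statistics.

```python
from itertools import product

LETTERS = "acgt"

def triplet_frequencies(coding_seq):
    """F[IJK]: frequencies of non-overlapping triplets in the given frame (64 numbers)."""
    counts = {"".join(p): 0 for p in product(LETTERS, repeat=3)}
    n = 0
    for i in range(0, len(coding_seq) - 2, 3):
        t = coding_seq[i:i + 3]
        if t in counts:
            counts[t] += 1
            n += 1
    if n == 0:
        raise ValueError("no valid triplets found")
    return {t: c / n for t, c in counts.items()}

def position_frequencies(F):
    """P[j][letter]: frequency of a letter at codon position j (12 numbers)."""
    P = [{letter: 0.0 for letter in LETTERS} for _ in range(3)]
    for triplet, f in F.items():
        for j, letter in enumerate(triplet):
            P[j][letter] += f
    return P

def mean_field(P):
    """Product approximation P1[I] * P2[J] * P3[K] for every triplet IJK."""
    return {i + j + k: P[0][i] * P[1][j] * P[2][k]
            for i, j, k in product(LETTERS, repeat=3)}

F = triplet_frequencies("atggatgctagggtcgcacgctaa")
P = position_frequencies(F)
M = mean_field(P)
correlations = {t: F[t] - M[t] for t in F}  # what the mean field leaves out
print(P[0])
```

The `correlations` dictionary is the remainder that the 12-number description cannot capture.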

Why hexagonal symmetry? GC-content $= P_{C} + P_{G}$

Genome codon usage and mean-field approximation. …ggtga ATG gat gct agg … gtc gca cgc TAA tgagct… (correct frame vs. frameshift): 64 frequencies $F_{IJK}$, approximated by 12 frequencies $P^{1}_{I}, P^{2}_{J}, P^{3}_{K}$.

$P^{j}_{I}$ are linear functions of GC-content (eubacteria, archaea)

THE MYSTERY OF TWO STRAIGHT LINES ??? $R^{12}$, $R^{64}$: $F_{IJK} = P^{1}_{I} P^{2}_{J} P^{3}_{K}$ + correlations

Codon usage signature 0-+

19 possible eubacterial signatures

Example: Palindromic signatures

Four symmetry types of the basic 7-cluster structure (eubacteria): flower-like, degenerated, perpendicular triangles, parallel triangles

B. halodurans (GC=44%), S. coelicolor (GC=72%), F. nucleatum (GC=27%), E. coli (GC=51%)

Using branching principal components to analyze 7-clusters genome structures

Using branching principal components to analyze 7-clusters genome structures: Streptomyces coelicolor, Bacillus halodurans, Escherichia coli, Fusobacterium nucleatum

Web-site cluster structures in genomic sequences

Papers (type Zinovyev in Google):
Gorban A, Zinovyev A. PCA deciphers genome. Arxiv preprint.
Gorban A, Popova T, Zinovyev A. Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences. Physica A 353.
Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of microbial genomic sequences. In Silico Biology 5, 0025.
Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. In Silico Biology. V.3.
Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. Open Systems and Information Dynamics 10 (4).

Part II: Coding and non-coding DNA scaling laws. Dr. Thomas Fink (Bioinformatics service); Dr. Sebastian Ahnert (Cavendish Laboratory, University of Cambridge).

C-value and G-value paradox. Neither genome length nor gene number accounts for the complexity of an organism. Drosophila melanogaster (fruit fly): C = 120 Mb. Podisma pedestris (mountain grasshopper): C = 1650 Mb.

Non-linear growth of regulation. Mattick, J. S. Nature Reviews Genetics 5, 316–323 (2004). “Amount of regulation” scales non-linearly with the number of genes: every new gene with a new function requires specific regulation, but the regulators also need to be regulated. Plot for bacteria and archaea: log number of regulatory genes vs. log number of genes, with slopes 1.96 and 1.

Complexity ceiling for prokaryotes. Adding a new function $S$ requires adding a regulatory overhead $R$; the total increase is $N = R + S$. Since $R \sim N^{2}$, at some point $R > S$, i.e. the gain from a new function becomes too expensive for the organism: it requires too much regulation to be integrated. There is a maximum possible genome length for prokaryotes (~10 Mb).

How did eukaryotes bypass this limitation? Presumably, they invented a cheaper (digital) regulatory system, based on RNA. This regulatory information is stored in the “non-coding” DNA.
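The ceiling argument can be made explicit with a short worked calculation. It is a sketch that assumes the quadratic law holds exactly, $R = aN^{2}$, with a hypothetical positive coefficient $a$ that is not given in the talk.

```latex
\begin{align*}
R &= aN^{2}, \qquad N = R + S \;\Rightarrow\; S(N) = N - aN^{2},\\
\frac{dS}{dN} &= 1 - 2aN = 0 \;\Rightarrow\; N^{*} = \frac{1}{2a},\\
R(N^{*}) &= S(N^{*}) = \frac{1}{4a}.
\end{align*}
```

Under this assumption the functional part $S$ stops growing at $N^{*}$, which is also the point where $R$ overtakes $S$; beyond it every added gene costs more regulation than it contributes in function.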

Simple model: accelerated networks. A node is a gene ($c$ genes); an edge is a “regulation” ($n$ edges); $n = c^{2}$. Connectivity < $k_{max}$: regulators are only proteins. Connectivity > $k_{max}$: the deficit of regulations is taken from non-coding DNA.

How much regulation does a genome need to take from non-coding DNA? $c_{max}$ is the prokaryotic ceiling. These regulations must be encoded in the non-coding part of the genome. Notation: $N$ is the non-coding DNA length, $C$ is the coding DNA length, $C_{prok}$ is the ceiling for prokaryotes (~10 Mb), and there is some coefficient.
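One way to see where a deficit of this kind comes from is the following sketch. It assumes an explicit proportionality constant $\alpha$ in the quadratic edge law and a mean connectivity $k = 2n/c$ (each edge counted at both of its ends); both are illustrative assumptions, not statements from the talk.

```latex
\begin{align*}
n &= \alpha c^{2}, \qquad k = \frac{2n}{c} = 2\alpha c, \qquad
k_{max} = 2\alpha c_{max},\\
\Delta n &= \alpha c^{2} - \frac{k_{max}}{2}\,c
        = \alpha c\,(c - c_{max})
        = \frac{k_{max}}{2}\,\frac{c}{c_{max}}\,(c - c_{max}), \qquad c > c_{max}.
\end{align*}
```

Here $\Delta n$ is the number of regulations that proteins alone cannot supply once connectivity is capped at $k_{max}$. It has the same functional form as the expression for $N_{reg}$ on the hypothesis slide: a coefficient over 2, times the ratio to the prokaryotic ceiling, times the excess over that ceiling.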

Observation: coding length vs non-coding length. Curve (coefficient = 1): minimum non-coding length needed for the «deficit» regulation.

Hypothesis. Prokaryotes: a small constant add-on (promoters, UTRs, …), 15% ≈ 1/7. Eukaryotes: $N_{reg} = \frac{\beta}{2}\,\frac{C}{C_{maxprok}}\,(C - C_{maxprok}) \sim C^{2}$, where $\beta$ is a coefficient and $C_{maxprok} \approx 10$ Mb. This is the amount necessary for regulation, but repeats, genome parasites, etc., might make a genome much bigger.

This is only a hypothesis, but… Prediction of $N_{reg}$ for human: $N_{reg}$ = 87 Mb = 3% of genome length; $C$ = 48 Mb = 1.7%; $N_{reg} + C$ = 4.7%.
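The order of magnitude of this prediction can be checked with a few lines of code. The sketch assumes the hypothesis formula has the form (coefficient / 2) × (C / C_maxprok) × (C − C_maxprok), uses the rounded ceiling of 10 Mb and a coefficient of 1, and returns zero below the ceiling; the function name `n_reg` and these defaults are illustrative choices, not values confirmed by the talk.

```python
def n_reg(c_mb, c_maxprok_mb=10.0, coeff=1.0):
    """Minimum regulatory non-coding length in Mb (assumed zero below the ceiling)."""
    if c_mb <= c_maxprok_mb:
        return 0.0
    return 0.5 * coeff * (c_mb / c_maxprok_mb) * (c_mb - c_maxprok_mb)

human_coding_mb = 48.0  # coding length quoted on the slide
estimate = n_reg(human_coding_mb)
print(f"N_reg ~ {estimate:.1f} Mb")
```

With these rounded inputs the estimate is about 91 Mb, the same order as the 87 Mb quoted on the slide; the exact inputs behind the slide's figure are not recoverable from the transcript.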

Thank you for your attention Questions?