A novel method for measuring codon usage bias and estimating its statistical significance

Codon usage bias (CUB), a phenomenon in which synonymous codons are used at different frequencies, is generally believed to be a combined outcome of mutation pressure, natural selection, and genetic drift.

Why should we care
Genes often exhibit variable CUBs, which are closely related to gene expression through translational efficiency and/or accuracy. The ability to accurately quantify CUB for protein-coding sequences is of fundamental importance in revealing the mechanisms behind codon usage and in understanding gene evolution and function in general.

What is the question
Extant measures for estimating CUB have limitations.
- They are not supplied with straightforward procedures for assessing the statistical significance of the bias in codon usage for any given gene.
- They are not fully effective at incorporating background nucleotide composition into CUB estimation.

How we can do it
Here we present a novel measure, the Codon Deviation Coefficient (CDC), and use it to characterize CUB and to ascertain its statistical significance.

Statistical significance of codon usage bias
To evaluate the statistical significance of codon usage bias, we implement a bootstrap resampling of N = 10000 replicates for any given sequence.
- Each replicate is randomly generated according to the sequence's background nucleotide composition (S_i and R_i, i = 1, 2, 3) and the sequence length.
- CUB is derived from each replicate, which yields a bootstrap distribution of N estimates of CUB.
- A two-sided bootstrap P-value is calculated as twice the smaller of the two one-sided P-values.
P ranges from 0 to 1. By convention, a statistically significant CUB is identified by a P-value below the chosen significance threshold. CDC features the first application of bootstrap resampling to estimating the statistical significance of CUB.
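The bootstrap procedure above can be sketched roughly as follows. This is a minimal illustration, not the CAT implementation: all function names are hypothetical, `toy_statistic` is only a stand-in for a real CUB measure, and two assumptions are made that the text does not state, namely that a nucleotide is drawn by sampling its GC/AT class and its purine/pyrimidine class independently, and that stop codons are redrawn.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def positional_gc_purine(seq):
    """Return [(S_i, R_i)] for codon positions i = 1, 2, 3."""
    usable = seq[: len(seq) // 3 * 3].upper()
    contents = []
    for pos in range(3):
        column = usable[pos::3]
        s = sum(n in "GC" for n in column) / len(column)  # GC content S_i
        r = sum(n in "AG" for n in column) / len(column)  # purine content R_i
        contents.append((s, r))
    return contents


def draw_nucleotide(s, r, rng):
    """ASSUMPTION: GC class and purine class are sampled independently."""
    strong = rng.random() < s
    purine = rng.random() < r
    if strong:
        return "G" if purine else "C"
    return "A" if purine else "T"


def random_replicate(contents, n_codons, rng):
    """Generate one replicate of n_codons sense codons (stop codons are redrawn)."""
    codons = []
    while len(codons) < n_codons:
        codon = "".join(draw_nucleotide(s, r, rng) for s, r in contents)
        if codon not in STOP_CODONS:
            codons.append(codon)
    return "".join(codons)


def two_sided_p(observed, replicate_values):
    """Twice the smaller of the two one-sided P-values, capped at 1."""
    n = len(replicate_values)
    p_upper = sum(v >= observed for v in replicate_values) / n
    p_lower = sum(v <= observed for v in replicate_values) / n
    return min(1.0, 2 * min(p_upper, p_lower))


def bootstrap_p_value(seq, statistic, n_replicates=10000, seed=0):
    """Bootstrap P-value of statistic(seq) against composition-matched replicates."""
    rng = random.Random(seed)
    contents = positional_gc_purine(seq)
    n_codons = len(seq) // 3
    values = [
        statistic(random_replicate(contents, n_codons, rng))
        for _ in range(n_replicates)
    ]
    return two_sided_p(statistic(seq), values)


def toy_statistic(seq):
    """Placeholder for a CUB measure: fraction of GCG codons (illustrative only)."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return codons.count("GCG") / len(codons)


if __name__ == "__main__":
    print(bootstrap_p_value("ATGGCGGCGGCGGCGAAA", toy_statistic, n_replicates=1000))
```

Passing the statistic as a parameter keeps the resampling logic separate from the bias measure, so any CUB estimator with the same signature could be substituted for the placeholder.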
Implementation and availability
CDC is written in standard C++ and implemented in the Composition Analysis Toolkit (CAT). The software package, including compiled executables for Linux/Mac/Windows, example data, documentation, and source code, is freely available.

Materials and methods

Expected codon usage
CDC considers both GC and purine contents as background nucleotide composition and derives expected codon usage from the observed positional GC and purine contents. We denote the contents of the four nucleotides (adenine, thymine, guanine, and cytosine), the GC content, and the purine content as A, T, G, C, S and R, respectively. The expected position-dependent nucleotide contents A_i, T_i, G_i and C_i at codon position i (i = 1, 2, 3) are formulated from S_i and R_i, the corresponding observed GC and purine contents at that position. For any sense codon xyz, the expected usage π_xyz is defined as the product of its constituent expected nucleotide contents x_1 y_2 z_3, normalized by the sum over all sense codons, viz.,

$$\pi_{xyz} = \frac{x_1 y_2 z_3}{\sum_{abc \,\in\, \text{sense codons}} a_1 b_2 c_3}$$

where x, y, z ∈ {A, T, G, C} and xyz is a sense codon.

Zhang Zhang
CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences

Literature cited
1. Bulmer, M. The selection-mutation-drift theory of synonymous codon usage. Genetics 129.
2. Hershberg, R., D.A. Petrov. Selection on codon bias. Annu Rev Genet 42.
3. Novembre, J.A. Accounting for background nucleotide composition when measuring codon usage bias. Mol Biol Evol 19.
4. Plotkin, J.B., G. Kudla. Synonymous but not the same: the causes and consequences of codon bias. Nature Reviews Genetics 12.
5. Sharp, P.M., W.H. Li. The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 15.
6. Wright, F. The 'effective number of codons' used in a gene. Gene 87.
7. Zhang, Z., J. Yu. Modeling compositional dynamics based on GC and purine contents of protein-coding sequences. Biol Direct 5:63.
8. Zhang, Z., J. Yu. On the organizational dynamics of the genetic code. Genomics Proteomics Bioinformatics 9.

Further information
The proposed measure (CDC) has been published as a journal article. Reference: Zhang, Z. et al. Codon Deviation Coefficient: a novel measure for estimating codon usage bias and its statistical significance. BMC Bioinformatics 13:43.

Codon usage bias
- Any coding sequence can be represented as a vector of n dimensions, whose entries correspond to the usages of the n sense codons in the sequence.
- The dimension n equals 61 for the canonical code. Although codons ATG and TGG could be set aside because they have no synonymous codons, calculating with a vector of 61 dimensions instead of 59 makes little substantial difference.
- To calculate CUB for any given sequence, we employ the cosine distance metric, which is based on the cosine of the angle between two vectors of n dimensions.
Therefore, when both the expected and the observed codon usage vectors are available for a given sequence, CDC yields a distance coefficient ranging from 0 (no bias) to 1 (maximum bias) that represents CUB as the deviation of the observed vector from the expected one.

Results and discussion

Comparative analysis on simulated data
To evaluate the performance of CDC and compare it against Nc (Effective Number of Codons) and the most powerful extant measure, Nc′ (a variant of Nc), we simulated sequences with different heterogeneities in positional background nucleotide composition (Figure 1) and with varying sequence lengths (Figure 2). Note that CDC ranges from 0 (no bias) to 1 (maximum bias), whereas Nc′ and Nc range from 20 (maximum bias) to 61 (no bias). To facilitate comparison of CDC with Nc′ and Nc, we use the formulas (61 − Nc′) / 41 and (61 − Nc) / 41 to rescale their ranges to run from 0 (no bias) to 1 (maximum bias); the rescaled values are denoted scaled Nc′ and scaled Nc, respectively.
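The expected usage, the cosine distance, and the Nc rescaling described above might be sketched as below. This is an illustration under stated assumptions rather than the published implementation: the function names are hypothetical, the expected nucleotide contents are assumed to follow from treating GC content and purine content as independent (A_i = (1 − S_i)R_i, T_i = (1 − S_i)(1 − R_i), G_i = S_i R_i, C_i = S_i(1 − R_i)), and the distance is assumed to be one minus the cosine similarity.

```python
from itertools import product
from math import sqrt

NUCLEOTIDES = "ATGC"
STOP_CODONS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = [
    "".join(c) for c in product(NUCLEOTIDES, repeat=3)
    if "".join(c) not in STOP_CODONS
]  # 61 codons under the canonical code


def expected_nucleotides(s, r):
    """ASSUMPTION: GC content s and purine content r combine independently."""
    return {"A": (1 - s) * r, "T": (1 - s) * (1 - r), "G": s * r, "C": s * (1 - r)}


def expected_codon_usage(contents):
    """contents is [(S_1, R_1), (S_2, R_2), (S_3, R_3)]; returns pi_xyz per sense codon."""
    per_pos = [expected_nucleotides(s, r) for s, r in contents]
    raw = {
        c: per_pos[0][c[0]] * per_pos[1][c[1]] * per_pos[2][c[2]]
        for c in SENSE_CODONS
    }
    total = sum(raw.values())  # normalize over sense codons only
    return {c: v / total for c, v in raw.items()}


def observed_codon_usage(seq):
    """Relative frequency of each sense codon in an in-frame coding sequence."""
    usable = seq[: len(seq) // 3 * 3].upper()
    counts = {c: 0 for c in SENSE_CODONS}
    for i in range(0, len(usable), 3):
        codon = usable[i:i + 3]
        if codon in counts:  # skips stop codons and ambiguous bases
            counts[codon] += 1
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def cosine_distance(expected, observed):
    """ASSUMPTION: distance = 1 - cosine similarity of the two usage vectors."""
    dot = sum(expected[c] * observed[c] for c in SENSE_CODONS)
    norm_e = sqrt(sum(v * v for v in expected.values()))
    norm_o = sqrt(sum(v * v for v in observed.values()))
    return 1 - dot / (norm_e * norm_o)


def scaled_nc(nc):
    """Rescale Nc or Nc' from [20, 61] to [1, 0], as given in the text."""
    return (61 - nc) / 41


if __name__ == "__main__":
    expected = expected_codon_usage([(0.5, 0.5)] * 3)  # uniform background
    observed = observed_codon_usage("ATGGCGGCGGCGAAA")
    print(cosine_distance(expected, observed), scaled_nc(45.0))
```

Because both usage vectors are non-negative, one minus their cosine similarity stays within 0 and 1, which matches the range stated for CDC.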
Figure 1 Codon usage bias across a variety of positional background nucleotide compositions. Heterogeneous positional background compositions were considered for GC content (panels A to C) and purine content (panels D to E), respectively. The expected values of codon usage bias are zero for all examined cases.

Figure 2 Codon usage bias across a range of sequence lengths. Sequences were simulated with the four non-uniform positional composition sets: Low (panel A), Med-1 (panel B), Med-2 (panel C) and High (panel D). The expected values of codon usage bias are zero for all examined cases.

Taken together, our simulation results demonstrate that CDC is superior to Nc′ and Nc.

Application to empirical data
To test CDC empirically and compare it with three popular measures, Nc′, Nc and CAI, we collected multiple expression data sets from different species and correlated CUB with gene expression level. Overall, CDC correlates positively with gene expression level, and much better than scaled Nc′, scaled Nc, and CAI (Table 1).

We proceeded to calculate CDC values for all E. coli genes.
- The gene with the highest CDC value and a statistically significant CUB is rpmI (CDC = 0.481, P < 0.05), which encodes ribosomal protein L35.
- CDC values for the 54 ribosomal protein (RP) genes in E. coli are larger than the mean and median values of all genes.
- Nearly all RP genes have statistically significant CUBs, with three exceptions (Table 2).

Table 1 Correlation coefficients of CUB with gene expression level
Columns: E. coli (1; LB, M9), S. cerevisiae (2), D. melanogaster (3), C. elegans (4), A. thaliana (5). Rows: CDC, scaled Nc′, scaled Nc, CAI.
Note: Expression data were obtained from (1) Bernstein et al., (2) Holstege et al., (3) Zhang et al., (4) Roy et al., and (5) Wuest et al. All values are statistically significant.

These results suggest that CDC has the potential to illuminate the evolutionary process that has operated on each gene.
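The correlation of a CUB measure with expression level, as summarized in Table 1, could be computed along these lines. The transcript does not say which correlation coefficient was used, so Spearman's rank correlation is assumed here; the function names are hypothetical and the numbers in the usage example are made up purely to show the call.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """ASSUMPTION: rank correlation; the source does not name the coefficient."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    cub_values = [0.12, 0.30, 0.25, 0.48]   # illustrative values, not real data
    expression = [3.1, 7.9, 6.5, 12.0]      # illustrative values, not real data
    print(spearman(cub_values, expression))
```

A rank-based coefficient is insensitive to the scale of the expression measurements, which differs between the data sets being compared.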
Conclusions
- CDC accounts for background nucleotide composition when estimating codon usage bias and uses a bootstrap assessment to evaluate the statistical significance of that bias.
- As validated on simulated sequences and empirical data, CDC outperforms extant measures by providing informative estimates of codon usage bias and of its statistical significance.