Models for the evolution of gene-duplicates: Applications of Phase-Type distributions


Tristan Stark (1), David Liberles (1), Małgorzata O’Reilly (2,3) and Barbara Holland (2)
(1) Temple University, Philadelphia
(2) University of Tasmania, Australia
(3) ARC Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS)
The Tenth International Conference on Matrix-Analytic Methods in Stochastic Models, 13-15 February 2019
This research was supported by the Australian Government through the Australian Research Council's Discovery Projects funding scheme (project DP180100352).

Talk aims
Set up the biological background required to understand the problem (for both this talk and Jiahao’s), so bear with me.
Show how the problem can be approached using tools from the MAM toolkit.
Encourage more interaction between the math biology and MAM communities.

Biological background
Gene duplication is thought to be a major source of evolutionary novelty.
For a gene to be maintained in a genome it needs to be protected by selection, but, by definition, a gene duplicate is redundant when it arises.
Various authors have proposed that this results in a “race” between different possible fates:
One copy of the gene gets destroyed by mutation (pseudogenization).
Both copies are kept, but with reduced and complementary functionality (subfunctionalization).
One gene acquires a new function that becomes protected (neofunctionalization).

Genes can have more than one function
Many genes have more than one function, e.g. they might be expressed in different tissues or at different developmental stages.
Different subfunctions tend to be controlled by different regulatory elements within the genome.

Theoretical model for the evolution of a duplicate gene pair, based on the paper by Force et al. (1999)
Genes are modelled as having two components: regulatory regions (short boxes), each responsible for some function of the gene, and the coding region (long boxes), which codes for protein.
[Figure: a gene after duplication, with loss-of-function mutations leading to nonfunctionalisation, subfunctionalisation or neofunctionalisation. Legend: full function, lost function, new function.]
Reference: Force, A., Lynch, M., Pickett, F. B., Amores, A., Yan, Y. L., & Postlethwait, J. (1999). Preservation of duplicate genes by complementary, degenerative mutations. Genetics, 151(4), 1531-1545.

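Absorbing state Markov chains
[Figure: state diagram of a tennis game from deuce, with states Deuce, Adv. Player 1, Adv. Player 2, Game Player 1 and Game Player 2. The two Game states are absorbing.]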
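The deuce example can be worked through numerically. The sketch below is illustrative only and is not taken from the slides: it assumes Player 1 wins each point independently with a fixed probability p (the names p, T, R and both function names are introduced here), and it computes the probability of absorption in each Game state by solving (I - T) B = R, where T holds the transitions among the transient states and R the transitions into the absorbing states.

```python
def absorption_probabilities(T, R):
    """Solve (I - T) B = R by Gauss-Jordan elimination.

    T[i][j]: one-step probability between transient states i and j.
    R[i][k]: one-step probability from transient state i to absorbing state k.
    Returns B, where B[i][k] is the probability of ending in absorbing state k
    when starting from transient state i.
    """
    n = len(T)
    # Augmented matrix [I - T | R].
    A = [[(1.0 if i == j else 0.0) - T[i][j] for j in range(n)] + list(R[i])
         for i in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        scale = A[col][col]
        A[col] = [x / scale for x in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [row[n:] for row in A]


def deuce_chain(p):
    """Transition blocks for a game at deuce (assumption: i.i.d. points).

    Transient states: 0 = Deuce, 1 = Adv. Player 1, 2 = Adv. Player 2.
    Absorbing states: 0 = Game Player 1, 1 = Game Player 2.
    """
    q = 1.0 - p
    T = [[0.0, p, q],
         [q, 0.0, 0.0],
         [p, 0.0, 0.0]]
    R = [[0.0, 0.0],
         [p, 0.0],
         [0.0, q]]
    return T, R


T, R = deuce_chain(0.6)
B = absorption_probabilities(T, R)
print("P(Player 1 wins | Deuce) =", B[0][0])
```

For this chain the result can be checked by hand: from Deuce, Player 1 wins with probability p^2 / (p^2 + q^2), where q = 1 - p.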

[Figure labels: Subfunctionalization, Pseudogenization]
State transition diagram for a duplicate pair with z = 4 regulatory regions, considering only pseudogenisation and subfunctionalisation.
Black regions are unaffected by mutation; white regions have had a null mutation, meaning that function is lost; grey regions are protected from null mutations by selection.
The top row shows gene pairs that have subfunctionalised, i.e. both genes are protected by selection; the bottom row and far right show pseudogenisation, i.e. one copy of the gene has been lost.

Phase-Type distributions
The problem is similar to a PH distribution, with the distinction that we have two absorbing states: pseudogenization (P) and subfunctionalisation (S).
[Matrices Q* and V shown on slide]
z is the number of regulatory regions.
States 0 up to z−1 track the number of regulatory regions that have been lost.
μ_c and μ_r are the rates of loss of coding and regulatory regions respectively.
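The matrix entries did not survive in this transcript, so the sketch below uses one plausible set of transition rates to show what a chain with transient states 0 to z−1 and two absorbing states looks like in practice. Everything specific here is an assumption, not the authors' matrices: Q* is read as the block of rates among transient states and V as the block of rates into (P, S); from state 0 a first regulatory loss occurs at rate 2·z·μ_r and a coding loss at rate 2·μ_c; from state i ≥ 1 a further loss on the damaged copy occurs at rate (z−i)·μ_r, a complementary loss on the other copy (leading to S) at rate (z−i)·μ_r, and a coding loss on the damaged copy (leading to P) at rate μ_c. The function names are hypothetical.

```python
def build_blocks(z, mu_r, mu_c):
    """Return (Q_star, V) under the ASSUMED rates described in the lead-in.

    Q_star: z x z block of rates among transient states 0..z-1.
    V:      z x 2 block of rates into the absorbing states, columns (P, S).
    """
    if z < 1:
        raise ValueError("z must be at least 1")
    Q = [[0.0] * z for _ in range(z)]
    V = [[0.0, 0.0] for _ in range(z)]
    for i in range(z):
        if i == 0:
            # Either copy can take the first hit (assumption).
            forward, to_P, to_S = 2 * z * mu_r, 2 * mu_c, 0.0
        else:
            # One copy is already damaged (assumption).
            forward, to_P, to_S = (z - i) * mu_r, mu_c, (z - i) * mu_r
        if i == z - 1:
            # Losing the last regulatory region of a copy silences it: go to P.
            to_P += forward
            forward = 0.0
        else:
            Q[i][i + 1] = forward
        Q[i][i] = -(forward + to_P + to_S)
        V[i] = [to_P, to_S]
    return Q, V


def absorption_probabilities(Q, V):
    """Probability of ending in P or S from each transient state.

    Uses backward substitution, valid because Q is upper bidiagonal here
    (the only transient-to-transient move is i -> i + 1).
    """
    z = len(Q)
    probs = [[0.0, 0.0] for _ in range(z)]
    for i in reversed(range(z)):
        total_rate = -Q[i][i]
        for k in range(2):
            onward = Q[i][i + 1] * probs[i + 1][k] if i + 1 < z else 0.0
            probs[i][k] = (V[i][k] + onward) / total_rate
    return probs


# Illustrative parameter values only (z = 4 matches the state diagram).
Q_star, V = build_blocks(z=4, mu_r=1.0, mu_c=0.5)
p_P, p_S = absorption_probabilities(Q_star, V)[0]
print("P(pseudogenization) =", p_P, " P(subfunctionalization) =", p_S)
```

The backward substitution is equivalent to computing (−Q*)⁻¹ V, the usual expression for absorption probabilities in a continuous-time chain, and avoids a general linear solve.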


Two kinds of hazard rates
The instantaneous rate of transition into state P, given that the process has not yet been absorbed into either state S or state P.
The instantaneous rate of transition into state P, given that the process has not yet been absorbed into state P (we call this the pseudogenization rate).
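The two definitions differ only in the conditioning event, which is easy to see numerically. The sketch below is a minimal illustration, not the authors' code: it computes the state probabilities at time t by uniformization and then forms both ratios. The state ordering (transient states first, then P, then S), the function names and the example generator (a z = 2 chain with assumed rates) are all introduced here for illustration.

```python
import math


def state_distribution(G, alpha, t, tol=1e-12, max_terms=10000):
    """State probabilities at time t by uniformization.

    G is the full generator (absorbing states have zero rows) and alpha the
    initial distribution. Suitable for moderate values of rate * t only.
    """
    n = len(G)
    lam = max(-G[i][i] for i in range(n))  # uniformization rate
    P = [[(1.0 if i == j else 0.0) + G[i][j] / lam for j in range(n)]
         for i in range(n)]
    weight = math.exp(-lam * t)            # Poisson(lam * t) weight for k = 0
    v = list(alpha)
    result = [weight * x for x in v]
    covered = weight
    k = 0
    while 1.0 - covered > tol and k < max_terms:
        k += 1
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= lam * t / k
        result = [r + weight * x for r, x in zip(result, v)]
        covered += weight
    return result


def hazards(G, alpha, t, n_transient):
    """Return (rate into P given not absorbed at all,
               rate into P given not absorbed into P)."""
    idx_P, idx_S = n_transient, n_transient + 1
    pi = state_distribution(G, alpha, t)
    density_P = sum(pi[i] * G[i][idx_P] for i in range(n_transient))
    not_absorbed = sum(pi[:n_transient])
    not_in_P = not_absorbed + pi[idx_S]
    return density_P / not_absorbed, density_P / not_in_P


# Illustrative z = 2 chain with ASSUMED rates; states ordered 0, 1, P, S.
mu_r, mu_c = 1.0, 0.1
G = [[-(4 * mu_r + 2 * mu_c), 4 * mu_r, 2 * mu_c, 0.0],
     [0.0, -(2 * mu_r + mu_c), mu_r + mu_c, mu_r],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
for t in (0.1, 0.5, 1.0, 2.0):
    print(t, hazards(G, [1.0, 0.0, 0.0, 0.0], t, n_transient=2))
```

The second quantity is never larger than the first, because its denominator also counts pairs that have already subfunctionalised and can no longer be lost.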

Different parameter choices give different hazard functions
Different choices of μ_c and μ_r (the rates of loss-causing mutations in the coding and regulatory regions) and z (the number of regulatory regions/functions) give differently shaped curves.
When μ_r / μ_c is below a critical threshold (which depends on z), the change in concavity occurs in positive time; otherwise the shape of the hazard function is indistinguishable from exponential decay.

Fitting to data
The data we have consist of counts of the number of duplicate pairs in a genome, with corresponding estimates of the cumulative number of silent substitutions per silent site (i.e. a proxy for age).
To draw a link between the hazard rate curves and the data we also need to make some assumptions about how duplicate genes arise:
Assume that gene duplicates arise according to a Poisson process with rate β_0.
Assume that all gene duplicates evolve under the same set of parameters.

Pulling it all together
Define a random variable Y(t) as the number of gene duplicates that have survived to time t.
This allows us to fit our model to data using a Maximum Likelihood approach.
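The likelihood itself is not reproduced in this transcript. As a rough sketch of how the two assumptions can be combined, and not necessarily the authors' formulation: if duplicates arise as a Poisson process with rate β_0 and each survives to age a with probability S(a), then the number of surviving pairs whose age falls in a bin [a_k, a_k+1) is Poisson with mean β_0 times the integral of S over that bin, independently across bins. The code below writes down that binned Poisson log-likelihood. The function names, the exponential stand-in for S(a), the bin edges and the counts are all made up for illustration.

```python
import math


def integrate(f, a, b, steps=200):
    """Composite trapezoid rule for the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for k in range(1, steps):
        total += f(a + k * h)
    return total * h


def expected_counts(beta0, survival, bin_edges):
    """Poisson means: beta0 times the integral of the survival function per bin."""
    return [beta0 * integrate(survival, lo, hi)
            for lo, hi in zip(bin_edges, bin_edges[1:])]


def log_likelihood(counts, beta0, survival, bin_edges):
    """Log-likelihood of independent Poisson counts, one per age bin."""
    ll = 0.0
    for n, mean in zip(counts, expected_counts(beta0, survival, bin_edges)):
        ll += n * math.log(mean) - mean - math.lgamma(n + 1)
    return ll


def beta0_mle(counts, survival, bin_edges):
    """Closed-form maximiser over beta0 for a fixed survival function."""
    exposure = sum(expected_counts(1.0, survival, bin_edges))
    return sum(counts) / exposure


# Toy inputs, invented for illustration only (not real data).
def survival(age):
    return math.exp(-2.0 * age)  # stand-in; a model-based S(a) would go here

bin_edges = [0.0, 0.1, 0.2, 0.3, 0.4]  # age proxy: silent substitutions per site
counts = [40, 31, 27, 20]

beta_hat = beta0_mle(counts, survival, bin_edges)
print("beta0 estimate:", beta_hat)
print("log-likelihood:", log_likelihood(counts, beta_hat, survival, bin_edges))
```

In a full fit, S(a) would come from the model (for example from the pseudogenization rate), and the remaining parameters would be chosen by numerically maximising this log-likelihood.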

Results
Previous results had suggested that subfunctionalization was not a good explanation for the observed data.
Using our model, we could show that subfunctionalization actually fits the observed data pretty well.

Extensions
More than 2 genes
Ongoing duplication
Partial duplication
Neofunctionalization
Speciation
More generally, it seems like evolutionary biology should be rife with other examples of PH distributions:
E.g. the covarion model of sequence evolution.
Current birth/death models for phylogenetic trees assume exponential waiting times (a terrible fit to actual tree shapes).