Maximum Likelihood Phylogenetic Reconstruction from High-Resolution Whole-Genome Data and a Tree of 68 Eukaryotes By: Yu Lin Fei Hu , Jijun Tang Bernard.

Slides:



Advertisements
Similar presentations
Quick Lesson on dN/dS Neutral Selection Codon Degeneracy Synonymous vs. Non-synonymous dN/dS ratios Why Selection? The Problem.
Advertisements

Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances.
1 General Phylogenetics Points that will be covered in this presentation Tree TerminologyTree Terminology General Points About Phylogenetic TreesGeneral.
Phylogenetics - Distance-Based Methods CIS 667 March 11, 2204.
Molecular Evolution Revised 29/12/06
© Wiley Publishing All Rights Reserved. Phylogeny.
Current Approaches to Whole Genome Phylogenetic Analysis Hongli Li.
. Computational Genomics 5a Distance Based Trees Reconstruction (cont.) Modified by Benny Chor, from slides by Shlomo Moran and Ydo Wexler (IIT)
. Phylogeny II : Parsimony, ML, SEMPHY. Phylogenetic Tree u Topology: bifurcating Leaves - 1…N Internal nodes N+1…2N-2 leaf branch internal node.
CISC667, F05, Lec14, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) Phylogenetic Trees (I) Maximum Parsimony.
Maximum Likelihood. Historically the newest method. Popularized by Joseph Felsenstein, Seattle, Washington. Its slow uptake by the scientific community.
Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.
Chapter 2 Opener How do we classify organisms?. Figure 2.1 Tracing the path of evolution to Homo sapiens from the universal ancestor of all life.
Inferring Phylogeny using Permutation Patterns on Genomic Data 1 Md Enamul Karim 2 Laxmi Parida 1 Arun Lakhotia 1 University of Louisiana at Lafayette.
Phylogenetic trees Sushmita Roy BMI/CS 576
Multiple Sequence Alignments and Phylogeny.  Within a protein sequence, some regions will be more conserved than others. As more conserved,
Combinatorial and Statistical Approaches in Gene Rearrangement Analysis Jijun Tang Computer Science and Engineering University of South Carolina
Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003
Phylogenetic Analysis. 2 Introduction Intension –Using powerful algorithms to reconstruct the evolutionary history of all know organisms. Phylogenetic.
Terminology of phylogenetic trees
Molecular phylogenetics
Binary Encoding and Gene Rearrangement Analysis Jijun Tang Tianjin University University of South Carolina (803)
Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.
Lecture 25 - Phylogeny Based on Chapter 23 - Molecular Evolution Copyright © 2010 Pearson Education Inc.
BINF6201/8201 Molecular phylogenetic methods
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Building phylogenetic trees. Contents Phylogeny Phylogenetic trees How to make a phylogenetic tree from pairwise distances  UPGMA method (+ an example)
Bayes estimators for phylogenetic reconstruction Ruriko Yoshida.
Introduction to Phylogenetics
Ch.6 Phylogenetic Trees 2 Contents Phylogenetic Trees Character State Matrix Perfect Phylogeny Binary Character States Two Characters Distance Matrix.
Bayes estimators for phylogenetic reconstruction Ruriko Yoshida.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
Statistical stuff: models, methods, and performance issues CS 394C September 16, 2013.
Chapter 10 Phylogenetic Basics. Similarities and divergence between biological sequences are often represented by phylogenetic trees Phylogenetics is.
Introduction to Phylogenetic trees Colin Dewey BMI/CS 576 Fall 2015.
Maximum Likelihood Given competing explanations for a particular observation, which explanation should we choose? Maximum likelihood methodologies suggest.
Phylogeny Ch. 7 & 8.
Selecting Genomes for Reconstruction of Ancestral Genomes Louxin Zhang Department of Mathematics National University of Singapore.
Ayesha M.Khan Spring Phylogenetic Basics 2 One central field in biology is to infer the relation between species. Do they possess a common ancestor?
Statistical Tests We propose a novel test that takes into account both the genes conserved in all three regions ( x 123 ) and in only pairs of regions.
Statistical stuff: models, methods, and performance issues CS 394C September 3, 2009.
Monkey Business Bioinformatics Research Center University of Aarhus Thomas Mailund Joint work with Asger Hobolth, Ole F. Christiansen and Mikkel H. Schierup.
Introduction to Bioinformatics Resources for DNA Barcoding
WABI: Workshop on Algorithms in Bioinformatics
Phylogenetic basis of systematics
Distance based phylogenetics
Research in Computational Molecular Biology , Vol (2008)
Pipelines for Computational Analysis (Bioinformatics)
In-Text Art, Ch. 16, p. 316 (1).
Multiple Alignment and Phylogenetic Trees
Slide 1: Thank you Elizabeth for the introduction, and hello everybody. So, I have been a PhD student with Charles Semple and Mike Steel at the UoC since.
Patterns in Evolution I. Phylogenetic
Phylogenetic Trees.
Summary and Recommendations
Mattew Mazowita, Lani Haque, and David Sankoff
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
A Unifying View of Genome Rearrangement
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Chapter 19 Molecular Phylogenetics
The Most General Markov Substitution Model on an Unrooted Tree
Phylogeny.
CS 394C: Computational Biology Algorithms
Double Cut and Join with Insertions and Deletions
Algorithms for Inferring the Tree of Life
Unit Genomic sequencing
Arun ganesh (UC BERKELEY)
But what if there is a large amount of homoplasy in the data?
MAGE: Models and Algorithms for Genome Evolution 2013
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Presentation transcript:

Maximum Likelihood Phylogenetic Reconstruction from High-Resolution Whole-Genome Data and a Tree of 68 Eukaryotes By: Yu Lin Fei Hu , Jijun Tang Bernard M.E. Moret Presented by: Trevor Thompson

Content Introduction Overview History Methods Encoding genomes into binary sequences Estimating transition parameters Reconstructing the phylogeny Results Conclusion

Content contd. 3. Results a)Simulating phylogenetic trees/evolutionary events along branches in the trees b)Results for simulations under rearrangements c)Results for simulations under the general model d)Results for simulated poor assemblies e)Results for a dataset of high-resolution eukaryotic genomes 4. Conclusion

1. Introduction Overview Phylogenetic analysis: one of the main tools of evolutionary biology. carried out using sequence data morphological data a branch of biology dealing with the study of the form and structure of organisms and their specific structural features.

Sequence data Cheap Large Amounts Well understood Unfortunately requires: Accurate results in orthologies different parts of the genome may evolve at different rates

3 events There are 3 events which: Change structure of the entire genome Create a better understanding of the history These events are: Rearrangements Duplications Lose

Whole Genome Data Has become very popular Researchers have found links between : large-scale genomic events various diseases large-scale genomic events: Rearrangements Duplications Diseases and health conditions Cancer Autism

The Challenges Shown to be a lot more challenging than assumed; Methods till now have shown: Oversimplified models Poor accuracy Poor scaling Lack of statistical assessment etc...

Prior Wor Rearrangement Data: Used 80 years ago by Sturtevant and Dobzhansky ignored for the next 45 years revived by Palmer and Thompson and Day and Sankoff last 30 years models: Evolution Distance measures Algorithms These all being of high interest

Sequenced Based Approaches Approaches for Whole Genome data: Parsimony-based approaches= minimize the total number of events needed to produce the given genomes from a common ancestor Distance-based approaches first estimate the pairwise distances between every pair of leaves then apply a method such as Neighbor-Joining Maximum-likelihood (ML) approaches seek the tree and associated model parameters that maximize the probability of producing the given set of leaf genomes

Down fall to ML approaches These approaches tend to be: Computationally expensive Of high accuracy Some software has been developed: RAxML Allowing for analysis of large trees and long sequences

2. Methods Will use both : gene adjacencies gene content To determine : transition parameters ML reconstruction to infer the tree Our new approach : Maximum Likelihood on Whole-genome Data =(MLWD)

2.1 Encoding genomes into binary sequences We represent the information as follows: adjacency information gene content Denote gene g as follows: gt=the tail of a gene g gh= its head Then +g to indicate an orientation from tail to head (gt → gh) And −g otherwise (gh → gt)

Multsets We have 2 consecutive genes a and b ; Connected by one adjacency of the types: {at, bt}, {ah, bt}, {at, bh}, and {ah, bh} And a gene c lies at one end of a linear chromosome; singleton set, {ct} or {ch} Called a telomere A genome can then be represented as a multiset of adjacencies and telomeres

Toy Genome example Composed of one linear chromosome; (+a, +b,−c, +a, +b, −d, +a) And a circular one, (+e, −f) Represented by the multiset of adjacencies and telomeres ; {{at}, {ah, bt}, {bh, ch}, {ct, ah}, {ah, bt}, {bh, dh}, {dt, ah}, {ah}, {eh, fh}, {et, ft}} No one-to-one correspondence between genomes and multisets of genes, adjacencies, and telomeres

The genome composed of the linear chromosome; (+a, +b,−d, +a, +b,−c, +a) Circular one; (+e, −f) Have the same multisets of adjacencies and telomeres as our toy example

Total Numbers of distinction If the total number of distinct genes among the input genomes is n; total number of distinct adjacencies and telomeres is: Actual appearance is typically far smaller—in fact, it is usually linear in n rather than quadratic

The resulting two toy genomes of Figure 1, the resulting binary sequences and their derivation are shown in Table 1.

2.2. Estimating transition parameters The parameters of the model are simply the transition probability from: presence (1) to absence (0) absence (0) to presence (1) First look at adjacencies: Every DCJ operation will select two adjacencies (or telomeres) uniformly at random (if adjacencies) break them to create two new adjacencies Each genome has n + O(1) adjacencies and telomeres O(1) is the number of linear chromosomes in the genome

the transition probability from 1 to 0 at some fixed index= Since there are up to possible adjacencies and telomeres the transition probability from 0 to 1 = Thus the transition from 0 to 1 is roughly 2n times less likely than that from 1 to 0

2.3. Reconstructing the phylogeny Once we have the binary sequences encoding the input genomes and the transition parameters use the ML reconstruction program RAxML; to build a tree from these sequences uses a time-reversible model=the model does not care which sequence is the ancestor and which is the descendant so long as all other parameters are held constant

3. Results 3.1. Experimental Design Ran a series of experiments on simulated datasets; to evaluate the performance of our approach against a known “ground truth” Ran the reconstruction algorithm on a dataset of 68 eukaryotic genomes: from unicellular parasites to mammalians obtained from the Eukaryotic Gene Order Browser (eGOB) database

Simulating phylogenetic trees A model tree consists of a rooted tree topology and corresponding branch lengths trees are generated by a three-step process; first generate birth-death trees using the tree generator (from the geiger library) in the software R birth rate of 0.001 and a death rate of 0 The branch lengths in such trees are ultrametric the root-to-leaf paths all have the same length

The second step, the branch lengths are modified as follows choose a parameter c; each branch we sample a number s uniformly from the interval [−c, +c] multiply the original branch length by es for the experiments in this paper, we set c = 2 Finally, we rescale all branch lengths to achieve a target diameter D; the length of the longest path, defined as the sum of the edge lengths along that path

Our experiments are conducted by varying three main parameters: the number of taxa the number of genes the target diameter Two values for each of the first two parameters: 50 and 100 taxa 1, 000 and 5, 000 genes The third parameter, the diameter of the tree; varied it from n to 4n n is the number of genes For each setting of the parameters, we generated 100 datasets

Simulating evolutionary events along branches in the trees In the rearrangement-only model, all evolutionary events along the branches are DCJ operations next event is then chosen uniformly at random among all possible DCJ operations In the general model, an event can be a DCJ operation or one of ; gene duplication, gene insertion, or gene loss Randomly sample three parameters for each branch: probability of occurrence of a gene duplication, pd, probability of occurrence of a gene insertion, pi probability of occurrence of a gene loss, pl

The probability of occurrence of a DCJ operation is then just pr = 1 − pd − pi − pl For gene duplication; set Lmax as the maximum number of genes in the duplicated segment; assume that the number of genes in that segment is a uniform random number between 1 and Lmax Lmax = 5 For gene insertion, we tested two different possible scenarios; genomes of prokaryotic type other for genomes of eukaryotic type We uniformly select one position and insert a new gene; we uniformly select one existing gene and mutate it into a new gene for gene loss, we uniformly select one gene and delete it.

3.2. Results for simulations under rearrangements We compared the accuracy of three different approaches; MLWD, MLWD* = same procedure as MLWD, but does not use our computation of transition probabilities TIBA=fast distance-based tool to reconstruct phylogenies from rearrangement data

3.3. Results for simulations under the general model We generated more complex datasets than for the previous set of experiments among our simulated eukaryotic genomes: largest genome has more than 20,000 genes biggest gene family in a single genome has 42 members In our approach: encoded sequence of each genome combines; adjacency and gene content information difficult to compute optimal transition probabilities If the transition probability of any gene or adjacency is set to be; m times less than that in the opposite direction

We name it MLWD(m); m = 10, 100, 1000 Whereas the best ratio in the rearrangement model was 2n The best ratio under the general model is much smaller can be attributed to the relatively modest change in gene content compared to the change in adjacencies

3.4. Results for simulated poor assemblies High-throughput sequencing has made it possible to sequence many genomes producing a good assembly are time-consuming and may require much additional laboratory work Many sequenced genomes remain broken; inducing a loss of adjacencies in the source data In addition, some assemblies may have errors; thereby producing spurious adjacencies losing others We introduce artificial breakages in the leaf genomes by “losing” adjacencies breaking chromosomes into multiple contigs

MLWD-x% represents the cases of losing x% of adjacencies where; x% of the adjacencies are selected uniformly at random and discarded for each genome. Our approach is relatively insensitive to the quality of assembly; when the tree diameter is large, that is, when it includes highly diverged taxa

3.5. Results for a dataset of high-resolution eukaryotic genomes Shows the reconstructed phylogeny of 68 eukaryotic genomes from the eGOB (Eukaryotic Gene Order Browser) database The database contains the order information of: orthologous genes (identified by OrthoMCL) 74 different eukaryotic species The total number of different gene markers; is around 100, 000 We selected 68 genomes for their size varying from 3k to 42k 6 genomes in the database have too few adjacencies fewer than 3, 000

We encode the adjacency and gene content information of all 68 genomes; into 68 binary sequences of length 652, 000 set the bias ratio to be 100 took under 3 hours of computing time on a desktop computer To explore the relationship between gene content and gene order; we ran MLWD* on the 68 eukaryotic genomes using only adjacency information/content information In this particular dataset; most genomes differ significantly in gene content the tree based on gene-content information approximately that obtained with both gene adjacencies and gene content

4. Conclusion Practice to date has continued to use selected sequences; moderate length using nucleotide- Aminoacid- codon-level models Lack of suitable tools that has prevented more widespread use of whole-genome data; previous tools all suffered from serious problems other statistical assessment procedures The approach we presented; is the first to overcome all of these difficulties is very accurate quite robust against typical assembly errors and omissions of genes supports standard bootstrapping methods

Contd. Our analysis of a 68-taxon collection of eukaryotic genomes; could not have been conducted without accepting severe compromises in the data equalizing gene content the quality of the analysis also helps make the case for phylogenetic reconstruction based on whole-genome data Also helps make the case for phylogenetic reconstruction based on whole-genome data; able to run a complete analysis on a “Tree of Life” with very divergent genomes (and hence very large pairwise distances) without taking any special precautions without pre interpreting the data