Combinatorial and Statistical Approaches in Gene Rearrangement Analysis Jijun Tang Computer Science and Engineering University of South Carolina

Presentation transcript:

Combinatorial and Statistical Approaches in Gene Rearrangement Analysis Jijun Tang Computer Science and Engineering University of South Carolina (803)

Outline: Background; Branch-and-Bound Algorithms for the Median Problem; Maximum Likelihood Methods for Phylogenetic Reconstruction; Post-Analysis; Conclusions.

Simple Rearrangements

Phylogenetic Reconstruction

Rearrangement Phylogeny

Median Problem. Goal: find M so that D_AM + D_BM + D_CM is minimized. NP-hard for most metric distances.
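The objective above can be made concrete with a small sketch. It assumes unsigned linear gene orders and a toy adjacency-based breakpoint distance, neither of which is specified on the slide; the function names are illustrative only. The exhaustive search is there to show what is being minimized, not how a practical solver works.

```python
from itertools import permutations

def adjacencies(genome):
    """Unordered adjacencies of an unsigned linear gene order."""
    return {frozenset(pair) for pair in zip(genome, genome[1:])}

def breakpoint_distance(g1, g2):
    """Toy distance (assumption): adjacencies of g1 that are absent from g2."""
    return len(adjacencies(g1) - adjacencies(g2))

def median_score(m, a, b, c, dist=breakpoint_distance):
    """D_AM + D_BM + D_CM for a candidate median m."""
    return dist(a, m) + dist(b, m) + dist(c, m)

def brute_force_median(a, b, c, dist=breakpoint_distance):
    """Try every ordering of the genes; only feasible for a handful of genes."""
    return min(permutations(a), key=lambda m: median_score(m, a, b, c, dist))

a, b, c = (1, 2, 3, 4), (1, 2, 4, 3), (2, 1, 3, 4)
m = brute_force_median(a, b, c)
print(m, median_score(m, a, b, c))
```

The search visits n! candidates, which is why bounding techniques are needed for realistic genome sizes.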

Multichromosomal Reversal Median Problem: find a median genome that minimizes the sum of the multichromosomal HP distances on the three edges. Events considered: reversal, translocation, fusion, fission. Exact and heuristic solvers exist for the Unichromosomal Reversal Median Problem (where reversals are the only events).

Capless Breakpoint Graph. Genome A → non-perfect matching M(A). Let a, b be adjacent genes in A; then (a^t, b^h) is an edge in M(A). A genome is composed of a set of edges and ends. Matchings naturally correspond to undirected genomes (flipping a chromosome does not alter the matching).

Example. Genomes A = {‹ -5, 1, 6, 3 ›, ‹ 2, 4 ›}, B = {‹ 1, 6 ›, ‹ -5, -4, -3, -2 ›}. [Figure: matchings M(A) and M(B) with A-ends and B-ends marked; adjacency graph]

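Capless Breakpoint Graph. [Figure: G(A,B) with A-ends and B-ends marked, including AB-paths of length 0] In G(A,B), let C(A,B) denote the number of cycles, AB the number of AB-paths, AA the number of AA-paths, BB the number of BB-paths, and n the number of genes. Example: n = 6, C(A,B) = 1, AB = 4, so d_HP ≥ 6 - 1 - 4/2 = 3.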
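The arithmetic in the example can be checked with a one-line helper. This is only a sketch of the expression evaluated in the example (n minus cycles minus half the AB-paths); the function name is hypothetical, and counting the cycles and paths from the graph itself is not shown.

```python
def hp_lower_bound(n_genes, cycles, ab_paths):
    """Evaluate n - C(A,B) - AB/2, the expression used in the worked example."""
    return n_genes - cycles - ab_paths / 2

# Values from the example: n = 6, C(A,B) = 1, AB = 4.
print(hp_lower_bound(6, 1, 4))  # 3.0
```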

A Lower Bound of the HP Distance. A simpler lower bound contains only the number of genes, cycles, and paths. Derived from Hannenhalli and Pevzner (1995): d_HP(A,B) ≥ n - C(A,B) - AB/2 + AA - BB. Pseudo-cycle of A and B:

Pseudo-cycle Distance Median Problem. Pseudo-cycle distance: Pseudo-cycle distance Median Problem (PMP): find a median genome that minimizes the sum of the pseudo-cycle distances on the three edges. We use the pseudo-cycle distance as a lower bound for the HP distance to derive an RMP solver.

Branch-and-Bound Algorithm. Enumerate the solution genomes gene by gene (genome enumeration). After enumerating a gene, compute an upper bound based on the partial solution genome. Bound: check whether the upper bound of the partial solution is less than a criterion. Branch: if so, discard the partial genome and enumerate another gene; otherwise update the criterion and continue the enumeration.
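The enumerate, bound, and branch loop can be sketched on a deliberately simplified problem. The code below is not the pseudo-cycle solver from the talk: it assumes unsigned linear gene orders and maximizes the total number of adjacencies shared with the input genomes, using a crude upper bound (every remaining adjacency matches every input). All names are illustrative.

```python
def adjacency_set(genome):
    """Unordered adjacencies of an unsigned linear gene order."""
    return {frozenset(pair) for pair in zip(genome, genome[1:])}

def branch_and_bound_median(genomes):
    """Return (genome, score) maximizing adjacencies shared with the inputs."""
    genes = list(genomes[0])
    n = len(genes)
    adj_sets = [adjacency_set(g) for g in genomes]
    best = {"score": -1, "genome": None}

    def gain(a, b):
        # Number of input genomes that contain the adjacency {a, b}.
        pair = frozenset((a, b))
        return sum(pair in s for s in adj_sets)

    def extend(partial, used, score):
        if len(partial) == n:
            if score > best["score"]:
                best["score"], best["genome"] = score, tuple(partial)
            return
        # Bound: each adjacency still to be placed matches at most all inputs.
        upper = score + len(genomes) * (n - len(partial))
        if upper <= best["score"]:
            return  # cannot beat the best complete genome: discard
        # Branch: enumerate the next gene.
        for g in genes:
            if g in used:
                continue
            add = gain(partial[-1], g) if partial else 0
            partial.append(g)
            used.add(g)
            extend(partial, used, score + add)
            partial.pop()
            used.remove(g)

    extend([], set(), 0)
    return best["genome"], best["score"]

print(branch_and_bound_median([(1, 2, 3, 4), (1, 2, 4, 3), (2, 1, 3, 4)]))
```

A tighter bound prunes more of the search tree; the bound here is loose on purpose so that its validity is easy to see.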

Genome Enumeration for Multichromosomal Genomes. [Figure: genome enumeration for genomes on genes {1,2,3}]

Features. Main components: contraction operation; upper bound on the number of pseudo-cycles; genome enumeration. Extension of Caprara’s method for unichromosomal genomes (1999).

Contraction Operation. Contraction of e = {a^t, b^h} on M(A): M(A)/e. [Figure: Case (1), Case (2), Case (3)]

Upper Bound on the Number of Pseudo-cycles. Let S be a genome and Z = {G_1, G_2, G_3} a set of three input genomes. The maximal γ(S,Z) is denoted by γ*. Based on the triangle inequality, an upper bound on the number of pseudo-cycles can be derived:

Notes. qn - γ* is the lower bound of the sum of pseudo-cycle distances between any S and each genome in Z = {G_1, G_2, G_3}. Given an edge e, assume genome S contains e and maximizes γ(S,Z); let Z’ = {G_1/e, G_2/e, G_3/e}, and assume S’ maximizes γ(S’,Z’); then S = S’ ∪ {e}.

Upper Bound Test. In a step of the algorithm, the current partial solution is S_i = {e_1, e_2, …, e_i}. The upper bound of γ(S,Z) over the genomes containing S_i is the following: Let UB be the current upper bound. If UB_Si < UB, then the best upper bound of the genomes containing S_i is worse than UB.

Branch-and-Bound Algorithm for Multichromosomal Genomes. Compute an initial upper bound (UB) from the input genomes. In each step, either an end or an edge is fixed in the solution. End fixing: mark a node as an end of a chromosome. Edge fixing: fix an edge e to the current partial solution genome S_i.

Genome Enumeration for Multichromosomal Genomes. [Figure: genome enumeration for genomes on genes {1,2,3}. Red line: end fixing. Black line: edge fixing.]

Properties. Can be extended to compute a given tree using iterative or progressive approaches. However, median computation is still difficult: large nuclear genomes, complex events. We also need to search for the best tree in the large tree space. N species: 20 species :
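The counts on this slide did not survive extraction. As a hedged illustration of how fast the tree space grows, the sketch below assumes the slide refers to unrooted binary tree topologies on N labeled species, for which the standard count is (2N - 5)!!. The function name is illustrative.

```python
def num_unrooted_binary_trees(n_species):
    """(2N - 5)!! unrooted binary topologies on N >= 3 labeled leaves."""
    if n_species < 3:
        raise ValueError("need at least 3 species")
    count = 1
    for k in range(3, 2 * n_species - 4, 2):  # odd factors 3, 5, ..., 2N - 5
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```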

Statistical Approaches. Combinatorial approaches are the focus of genome rearrangement research. Only one MCMC method exists. Maximum likelihood methods have been very popular in sequence phylogenetic analysis. Bootstrapping (data resampling) is a popular method to assess the quality of obtained trees. It is hard to directly apply ML and bootstrapping to gene order data.

Sequence ML Phylogeny. For each position, generate all possible tree structures. Based on the evolutionary model, calculate the likelihood of these trees and sum them to get the column likelihood. Calculate the tree likelihood by multiplying the likelihoods for each position. Choose the tree with the greatest likelihood.

Example. Sequences: A = acgcaa, B = acataa, C = atgtca, D = gcgtta. [Figure: the three possible tree topologies, (AB|CD), (AC|BD), (AD|CB)]

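All Possible Evolutionary Paths (Column 1). [Figure: leaves a, a, a, g; each internal node can be a, c, g, or t]

Likelihood for One Path. [Figure: leaves a, a, a, g with one fixed assignment of internal states]

Sum of All Paths (Column 1). [Figure: leaves a, a, a, g; sum over the internal states a, c, g, t]

Whole Sequence. [Figure: tree on A, B, C, D]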
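The column-by-column calculation can be sketched for the four example sequences. The slides do not name the substitution model or branch lengths, so this sketch assumes the Jukes-Cantor model, a single shared branch length t, and the topology (A,B | C,D) with two internal nodes; all function names are illustrative.

```python
import math
from itertools import product

STATES = "acgt"

def jc_prob(x, y, t):
    """Jukes-Cantor probability of state x becoming y along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def column_likelihood(column, t=0.1):
    """Sum over all internal-state paths for leaves (A, B, C, D) on tree AB|CD."""
    a, b, c, d = column
    total = 0.0
    for x, y in product(STATES, repeat=2):  # x joins A and B, y joins C and D
        total += (0.25 * jc_prob(x, y, t)
                  * jc_prob(x, a, t) * jc_prob(x, b, t)
                  * jc_prob(y, c, t) * jc_prob(y, d, t))
    return total

def tree_log_likelihood(sequences, t=0.1):
    """Multiply column likelihoods (sum of logs) over all positions."""
    return sum(math.log(column_likelihood(col, t)) for col in zip(*sequences))

seqs = ["acgcaa", "acataa", "atgtca", "gcgtta"]  # A, B, C, D from the example
print(column_likelihood("aaag"))   # column 1
print(tree_log_likelihood(seqs))
```

Comparing topologies would mean repeating the calculation with the leaves assigned differently (for example AC|BD) and keeping the tree with the largest value; branch-length optimization is omitted here.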

MLBE. Convert the gene orders into binary sequences based on adjacencies. Convert the binary sequences into protein or DNA sequences. Use RAxML to compute an ML tree on the sequences. Binary encoding was used before for parsimony analysis, with reasonable results.
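A minimal sketch of the first two steps follows, assuming signed linear gene orders in which the adjacency (a, b) is the same as (-b, -a). The 0/1-to-nucleotide mapping is a placeholder, since the slides do not say which characters are used, and the RAxML step is not shown.

```python
def adjacencies(genome):
    """Canonical adjacencies of a signed linear gene order."""
    result = set()
    for a, b in zip(genome, genome[1:]):
        # (a, b) read on the other strand is (-b, -a); keep one canonical form.
        result.add(min((a, b), (-b, -a)))
    return result

def binary_encode(genomes):
    """One column per adjacency seen in any genome; 1 if the genome has it."""
    per_genome = [adjacencies(g) for g in genomes]
    columns = sorted(set().union(*per_genome))
    rows = ["".join("1" if adj in s else "0" for adj in columns)
            for s in per_genome]
    return columns, rows

def to_dna(row, zero="A", one="C"):
    """Hypothetical mapping of binary characters to nucleotides."""
    return row.replace("0", zero).replace("1", one)

columns, rows = binary_encode([[1, 2, 3, 4], [1, -3, -2, 4]])
print(columns)
print(rows, [to_dna(r) for r in rows])
```

In the example, the second genome is the first with genes 2 and 3 inverted, so the two rows share only the column for the adjacency between genes 2 and 3.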

Binary Encoding

MLBE Sequences

Experimental Setup. Generate random trees of N taxa. Each tree is equally likely. Birth-death model is preferred. Starting from the root, apply r events along each edge: r is the expected number of events, and the actual number is sampled from 1…2r. Compare the inferred tree with the true tree using the RF rate.
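The per-edge simulation step can be sketched as follows. This assumes the event count is drawn uniformly from 1 to 2r (the slide only says "a sample between 1…2r") and models inversions only; transpositions, insertions, deletions, tree generation, and the RF comparison are left out. Names are illustrative.

```python
import random

def apply_inversion(genome, i, j):
    """Reverse genome[i..j] and flip the sign of every gene in the segment."""
    return genome[:i] + [-g for g in reversed(genome[i:j + 1])] + genome[j + 1:]

def evolve_along_edge(genome, r, rng):
    """Apply between 1 and 2r random inversions (uniform draw, an assumption)."""
    n_events = rng.randint(1, 2 * r)
    for _ in range(n_events):
        i = rng.randrange(len(genome))
        j = rng.randrange(i, len(genome))
        genome = apply_inversion(genome, i, j)
    return genome

rng = random.Random(42)
parent = list(range(1, 11))
child = evolve_along_edge(parent, r=3, rng=rng)
print(child)
```

Passing an explicit random.Random instance keeps simulated datasets reproducible across runs.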

Experimental Results (Equal Content 1) 80% inversion, 20% transposition

Experimental Results (Equal Content 2) 80% inversion, 20% transposition

Experimental Results (Unequal 1) 90% inversion, 10% of del/ins/dup, 5-30 genes per segment

Experimental Results (Unequal 2) 90% inversion, 10% of del/ins/dup, 5-30 genes per segment

Multistate Encoding

MLME Results (200 genes 20 genomes)

MLME Results (1000 genes 20 genomes)

Post Analysis. Bootstrapping has been widely used to assess the quality of sequence phylogenies. The same procedure is impossible for gene order data since there is only one character. We tested jackknifing on simulated data to determine: whether jackknifing is useful; the best jackknifing rate; the threshold for the support values.

DNA Bootstrapping

Bootstrapping Results

Jackknifing Procedure. Generate a new dataset by removing half of the genes from the original genomes (orders are preserved). Compute a tree on the new dataset. Repeat K times to obtain K replicates. Obtain a consensus tree with support values.
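The resampling step of this procedure can be sketched as below, assuming equal-content signed genomes and taking the rate to mean the fraction of genes removed (an interpretation, not stated on the slide). Tree inference on each replicate and the consensus step are not shown; the function name is illustrative.

```python
import random

def jackknife_replicate(genomes, rate, rng):
    """Remove the same random fraction of genes from every genome, keeping order."""
    genes = sorted({abs(g) for g in genomes[0]})
    n_remove = int(round(rate * len(genes)))
    removed = set(rng.sample(genes, n_remove))
    return [[g for g in genome if abs(g) not in removed] for genome in genomes]

rng = random.Random(0)
genomes = [[1, 2, 3, 4, 5, 6], [1, -3, -2, 4, 6, 5]]
replicates = [jackknife_replicate(genomes, 0.5, rng) for _ in range(3)]  # K = 3
for rep in replicates:
    print(rep)
```

Because the same genes are dropped from every genome, each replicate is still an equal-content dataset and can be passed to the same tree-building method as the original.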

An Example—New Genomes … …

Jackknifing Rate

Support Value Threshold - FP Up to 90% FP can be identified with 85% as the threshold

Trees with FP

Support Value Threshold - FN

Low Support Branches

Jackknife Properties. Jackknifing is necessary and useful for gene order phylogeny, and a large number of errors can be identified. A 40% jackknifing rate is reasonable. 85% is a conservative threshold; 75% can also be used. Low-support branches should be examined in detail.

Conclusions. Great progress has been made in genome rearrangement research. We are able to handle real-size data. Now the question is what data: data quality and biological modeling. Ancestral genome reconstruction is still difficult. Putting everything together has just started.

Thank You!