A Combinatorial Approach to Genome-Wide Ortholog Assignment: Beyond Sequence Similarity Search Tao Jiang Department of Computer Science University of California, Riverside Joint work with X. Chen, Z. Fu, J. Zheng, V. Vacic, P. Nan, Y. Zhong, and S. Lonardi
Outline An introduction to orthology Existing ortholog assignment methods Ortholog assignment via genome rearrangement An introduction to genome rearrangement Computing signed reversal distance with duplicates Minimum Common Substring Partition Maximum Cycle Decomposition Experimental results Summary and future directions 9/17/2018
Orthology. [Figure: gene family tree with one duplication and subsequent speciation events leading to mouse, chicken, and frog genes (labeled a and b), illustrating homologs, paralogs, and orthologs. From http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html]
Orthology – the more complicated picture. A true exemplar is the direct descendant of the ancestral gene of a given set of inparalogs. A main ortholog pair is defined as two true exemplar genes of two co-orthologous gene sets. Outparalogs evolved via a duplication prior to a given speciation event. Inparalogs evolved via a duplication posterior to a given speciation event. [Figure: gene tree with Speciation 1, Gene duplication 1, Speciation 2, and Gene duplication 2, relating genes A1, B1, B2, C1, C2, C3 in genomes G1, G2, G3.]
Significance. Orthologous genes in different species are evolutionary and functional counterparts. Many methods use orthologs in a critical way: function inference, protein structure prediction, motif finding, phylogenetic analysis, pathway reconstruction, and more. Identification of orthologs, especially exemplar genes, is a fundamental and challenging problem.
Existing Methods. Methods based on sequence similarity: BBH, Inparanoid/Multiparanoid, PhiGs, COG/KOG, OrthoMCL, MGD, TOGA/EGO, KEGG, HomoloGene. Methods based on phylogenetic trees: reconciled tree, Orthostrapper, OrthologID, RAP, RIO, PhyOP, TreeFam. Methods that take into account gene locations: shared genomic synteny.
Observations. Sequence similarity-based methods assume that the evolutionary rates of all genes in a homologous family are equal, so that divergence time can be estimated by comparing gene sequences. Tree-based methods rely critically on the correctness of the reconstructed gene and species trees. Gene location-based methods do not consider global genome rearrangements.
Molecular Evolution. Local mutation: base substitution, base insertion, base deletion. Global rearrangement and duplication: inversion/reversal, translocation, transposition, fusion/fission, duplication/loss. A complete ortholog assignment system should make use of information from both levels of molecular evolution.
Genome Rearrangement Operations. Reversal (inversion): 1 2 3 4 5 6 7 8 9 → 1 2 3 -6 -5 -4 7 8 9. Translocation: (1 2 3 4 5 6), (7 8 9 10 11 12 13) → (1 2 3 11 12 13), (7 8 9 10 4 5 6). Fusion (and its inverse, fission): two chromosomes carrying genes 1 2 3 4 5 6 7 8 are joined into a single chromosome 1 2 3 4 5 6 7 8.
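The reversal operation on a signed gene order can be sketched in a few lines. This is a minimal illustration, assuming a genome is stored as a Python list of signed integers; the function name `reverse_segment` and the 0-based inclusive indexing are choices made for this example, not part of the original slides.

```python
def reverse_segment(genome, i, j):
    """Return a copy of `genome` with genes i..j (0-based, inclusive)
    reversed in order and flipped in sign."""
    flipped = [-g for g in reversed(genome[i:j + 1])]
    return genome[:i] + flipped + genome[j + 1:]


genome = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(reverse_segment(genome, 3, 5))  # [1, 2, 3, -6, -5, -4, 7, 8, 9]
```

Applying the same function three times reproduces the signed sorting scenario 3 4 1 2 → -2 -1 -4 -3 → 1 2 -4 -3 → 1 2 3 4.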
Example. [Figure: the ancestral genome a1 b c a2 d e f g undergoes speciation; along the two resulting lineages a reversal, two duplications (creating a3 and a4), and a fission produce the two present-day genomes.] Given the evolutionary scenario, main ortholog pairs and inparalogs can be identified in a straightforward way.
The Parsimony Approach. Identify homologs using sequence similarity search (e.g., BLASTp). Reconstruct the evolutionary scenario on the basis of the parsimony principle: postulate the minimum possible number of rearrangement and duplication events in the evolution of two closely related genomes since their split, and use that scenario to assign orthologs. The ortholog assignment problem can then be formulated as finding a most parsimonious transformation from one genome into the other, without explicitly inferring their ancestral genome.
RD (Rearrangement-Duplication) Distance. RD distance: $RD(G, H) = R(G, H) + D(G, H)$, where $R(G, H)$ denotes the number of rearrangement events in a most parsimonious transformation and $D(G, H)$ denotes the number of gene duplications in a most parsimonious transformation.
The key algorithmic problem – SRDD (Signed Reversal Distance with Duplicates). Assumptions: two related (unichromosomal) genomes; no inparalogs, i.e., no post-speciation duplications; no gene losses, and thus equal gene content; only reversals have occurred. Problem: how to find a shortest sequence of reversals when duplicated genes are present. It generalizes the problem of sorting by reversals and is almost untouched in the literature.
When there are no (post-speciation) duplications, the most parsimonious rearrangement scenario may suggest the true orthology.
Sorting by reversal. Sorting a permutation into the identity by reversals; distinct genes only; signed vs. unsigned version. Sorting the signed permutation 3 4 1 2: 3 4 1 2 → -2 -1 -4 -3 → 1 2 -4 -3 → 1 2 3 4. Sorting the unsigned permutation 3 4 1 2: 3 4 1 2 → 1 4 3 2 → 1 2 3 4.
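Sorting signed permutation. Hannenhalli-Pevzner (HP) theory: polynomial-time solvable; based on the breakpoint graph and the notions of breakpoint, cycle, hurdle, and fortress. HP formula: $d(\pi) = b(\pi) - c(\pi) + h(\pi) + f(\pi)$, where $b$ is the number of breakpoints, $c$ the number of cycles in the breakpoint graph, $h$ the number of hurdles, and $f = 1$ if $\pi$ is a fortress and $0$ otherwise. Example: for the permutation 3 4 1 2, the breakpoint graph is built on 0 5 6 7 8 1 2 3 4 9, and d = 3 - 1 + 1 + 0 = 3. Hannenhalli and Pevzner, STOC, 178-187, 1995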
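The value d = 3 for the signed permutation 3 4 1 2 can be checked independently on such a small instance. The sketch below is not the Hannenhalli-Pevzner algorithm: it is a plain breadth-first search over all signed reversals, which is exponential in general and only meant as a sanity check. The function names are illustrative assumptions.

```python
from collections import deque


def signed_reversals(perm):
    """Yield every signed permutation reachable from `perm` by one reversal."""
    n = len(perm)
    for i in range(n):
        for j in range(i, n):
            flipped = tuple(-g for g in reversed(perm[i:j + 1]))
            yield perm[:i] + flipped + perm[j + 1:]


def reversal_distance_bfs(perm):
    """Exact signed reversal distance to the identity, by brute-force BFS."""
    start = tuple(perm)
    target = tuple(range(1, len(start) + 1))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == target:
            return dist[cur]
        for nxt in signed_reversals(cur):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return None


def breakpoints(perm):
    """Count breakpoints of a signed permutation framed by 0 and n + 1."""
    framed = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)


print(breakpoints((3, 4, 1, 2)))            # 3
print(reversal_distance_bfs((3, 4, 1, 2)))  # 3
```

The search agrees with the formula: three breakpoints, one cycle, one hurdle, and no fortress give a distance of three.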
Sorting unsigned permutation. NP-hard (Caprara, 1997). Breakpoint graph; maximum alternating cycle decomposition (NP-hard). 1.375-approximation (Berman et al., 2002). [Figure: breakpoint graph and an alternating cycle decomposition for the permutation 4 2 1 3, framed as 0 4 2 1 3 5.] Caprara, RECOMB, 75-83, 1997
A brief history (signed vs. unsigned permutations).
Kececioglu and Sankoff (1995): unsigned, 2-approximation.
Bafna and Pevzner (1996): signed, 1.5-approximation; unsigned, 1.75-approximation.
Hannenhalli and Pevzner (1995): signed, polynomial; unsigned, special cases polynomial.
Caprara (1997): unsigned, NP-hard.
Christie (1998).
Bader et al. (2001): signed, linear time (distance only).
Berman et al. (2002): unsigned, 1.375-approximation.
The work has also been extended to genomes with multiple chromosomes (Hannenhalli and Pevzner, 1995; Tesler, 2002; Ozery-Flato and Shamir, 2003).
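SRDD – The exhaustive method. Given genomes $G$ and $H$, $d(G, H) = \min_{M \in \mathcal{M}} d(G_M, H_M)$, where $\mathcal{M}$ is the set of all the possible ortholog assignments and $G_M$ (respectively $H_M$) is the genome after orthologs have been assigned. Example: assume one family with ten duplicated genes in each genome; there are then $10! = 3{,}628{,}800$ possible assignments for that family alone.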
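To see why exhaustive enumeration does not scale, one can count the candidate assignments. The sketch below assumes that every gene family has the same number of copies in both genomes (the equal gene content assumption of SRDD) and that an assignment is a one-to-one matching within each family; `count_assignments` is a hypothetical helper written for this illustration.

```python
from math import factorial, prod


def count_assignments(family_sizes):
    """Number of one-to-one ortholog assignments when family i has
    family_sizes[i] copies in each genome (k! matchings per family)."""
    return prod(factorial(k) for k in family_sizes)


print(count_assignments([10]))       # 3628800
print(count_assignments([2, 2, 3]))  # 24
```

Each candidate assignment would still require one signed reversal distance computation, so the count above is the number of distance evaluations in the exhaustive method.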
SRDD – Hardness. SRDD is NP-hard, even when the maximum size of a gene family is limited to two. Reduction from the problem of sorting an unsigned permutation by reversals: the unsigned permutation 3 4 1 2 becomes the signed sequence with duplicates +3 -3 +4 -4 +1 -1 +2 -2, and the identity 1 2 3 4 becomes +1 -1 +2 -2 +3 -3 +4 -4. [Figure: Case 1 and Case 2 for the pair +3 -3, each marked "No breakpoint".]
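The encoding used in the reduction can be written down directly. This is a small sketch of the mapping shown on the slide (each unsigned gene g becomes the adjacent pair +g -g); the function name `encode_unsigned` is an assumption for illustration, and the sketch does not attempt to reproduce the full hardness proof.

```python
def encode_unsigned(perm):
    """Map an unsigned permutation to a signed sequence with duplicates,
    replacing each gene g by the adjacent pair (+g, -g)."""
    encoded = []
    for g in perm:
        encoded.extend([g, -g])
    return encoded


def flip(block):
    """Signed reversal of a whole block: reverse the order, negate the signs."""
    return [-g for g in reversed(block)]


print(encode_unsigned([3, 4, 1, 2]))  # [3, -3, 4, -4, 1, -1, 2, -2]
print(flip([3, -3]))                  # [3, -3]
```

The second line of output illustrates why the encoding hides orientation: the pair +g -g is unchanged by a reversal, so each gene family of size two behaves like an unsigned gene.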
SRDD – A lower bound Partial graph : the number of edges linking two nodes labeled by and , respectively The number of breakpoints: Let and be a pair of related genomes. Their reversal distance is lower bounded by +3 -1 -2 +1 +4 3h 3t 1t 1h 2t 2h 1h 1t 4h 4t 3h 3t 1h 1t 2h 2t 1h 1t 4h 4t +3 +1 +2 +1 +4
(Sub)optimal assignment rules. [Figure: Rule one, illustrated on genes a b c f d e, and Rule two, illustrated on a b c f d and -e -d -b -c with e; each panel distinguishes trivial from non-trivial matches.]
The MCSP problem: Minimum Common Substring Partition. Partitioning two genomes into common substrings may help eliminate many duplicates, but the resulting blocks are different from syntenic blocks. Example of two related genomes: G: 3 1 2 -1 4 and H: -4 1 2 3 1. Without loss of generality, assume that the first genes and the last genes of the two related genomes are identical and positive singletons, respectively.
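For genomes as short as the example, a minimum common substring partition can be found by trying every way of cutting both genomes. The following is a brute-force sketch for illustration only, exponential in the genome length. It assumes that a block may match either directly or as its reverse complement (order reversed, signs flipped); the helper names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def canonical(block):
    """Represent a block and its reverse complement by the same tuple."""
    rc = tuple(-g for g in reversed(block))
    return min(tuple(block), rc)


def block_multisets(genome, k):
    """All multisets of canonical blocks obtained by cutting `genome` into k blocks."""
    n = len(genome)
    result = set()
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        blocks = [canonical(genome[a:b]) for a, b in zip(bounds, bounds[1:])]
        result.add(frozenset(Counter(blocks).items()))
    return result


def mcsp_bruteforce(g, h):
    """Smallest k such that g and h split into the same multiset of k blocks."""
    if len(g) != len(h):
        return None
    for k in range(1, len(g) + 1):
        if block_multisets(g, k) & block_multisets(h, k):
            return k
    return None


print(mcsp_bruteforce([3, 1, 2, -1, 4], [-4, 1, 2, 3, 1]))  # 3
```

For the example genomes the search returns three blocks, for instance G = (3 1)(2)(-1 4) and H = (-4 1)(2)(3 1), where (-4 1) is the reverse complement of (-1 4).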
MCSP – Hardness. Let k-MCSP denote the version of MCSP where each gene family is of size at most k. The problem k-MCSP is NP-hard for any k > 1. Petr Kolman gave a linear-time O( )-approximation algorithm for k-MCSP (MFCS’05), and thus for k-SRDD. The approximation ratio was recently improved to O(k). Goldstein, Kolman, and Zheng, ISAAC, 473-484, 2004
MCSP – Pair-match graph. A pair-match graph (V, E): single match vs. pair match; incompatible pair-matches. The maximum independent set problem on the pair-match graph is equivalent to the minimum common substring partition problem, i.e., $L(G, H) = n - |MIS(V, E)|$, where n is the genome size. Example: G: 3 1 2 -1 4 and H: 3 1 -2 -1 4, with pair-matches (G: 3 1, H: 3 1), (G: 2 -1, H: 1 -2), (G: -1 4, H: -1 4), and (G: 1 2, H: -2 -1). Goldstein, Kolman, and Zheng, ISAAC, 473-484, 2004
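The vertices of the pair-match graph can be enumerated with a double loop over adjacent gene pairs. This sketch assumes that a pair in G matches a pair in H either directly or as its reverse complement, which is consistent with the four pair-matches listed on the slide; it only lists the vertices and does not build the incompatibility edges or solve the independent set problem. The function name is an illustrative assumption.

```python
def pair_matches(g, h):
    """List (i, j, orientation) for every adjacent pair g[i], g[i+1]
    that matches the adjacent pair h[j], h[j+1]."""
    matches = []
    for i in range(len(g) - 1):
        direct = (g[i], g[i + 1])
        revcomp = (-g[i + 1], -g[i])
        for j in range(len(h) - 1):
            pair = (h[j], h[j + 1])
            if pair == direct:
                matches.append((i, j, "direct"))
            elif pair == revcomp:
                matches.append((i, j, "reversed"))
    return matches


G = [3, 1, 2, -1, 4]
H = [3, 1, -2, -1, 4]
for match in pair_matches(G, H):
    print(match)
```

On the slide's example this yields four pair-matches: (3 1) and (-1 4) match directly, while (1 2) and (2 -1) match (-2 -1) and (1 -2) in reversed orientation.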
MCSP – Approximation. Algorithm APPROX-MCSP(G, H) /* G and H are a pair of related genomes */: 1. Construct the pair-match graph for G and H. 2. Find an approximation of the vertex cover of the pair-match graph. 3. Identify segments based on the pair-matches outside the vertex cover (an independent set). 4. Output all the segments as a common substring partition. The size of the common substring partition found by APPROX-MCSP is bounded in terms of the ratio of the approximation algorithm for vertex cover and the genome size n. In particular, for 2-MCSP the algorithm achieves an approximation ratio of 1.5.
Maximum cycle decomposition. What if there still are some duplicates? Given any two genomes without duplicated genes, the (revised) HP formula for computing the rearrangement distance between the two genomes is as follows: Genome rearrangement distance: (Hannenhalli and Pevzner, 1995; Tesler, 2002; Ozery-Flato and Shamir, 2003) We could approximate the minimum rearrangement distance between two genomes by decomposing the complete-breakpoint graph to maximize , where is the number of cycles and paths and is the number of .
MSOAR. MSOAR is a high-throughput system for ortholog assignment between closely related genomes. MSOAR employs a heuristic algorithm to calculate the rearrangement/duplication (RD) distance between two genomes using the suboptimal assignment rules, MCSP, and MCD; the result can be used to reconstruct a most parsimonious evolutionary scenario. MSOAR extends SOAR by allowing for multi-chromosomal genomes and the detection of inparalogs.
“Noise” gene pair detection. The previous steps determine a one-to-one gene matching between two genomes. Unmatched genes are removed and marked as inparalogs. Then remove gene pairs whose deletion decreases the rearrangement distance by at least two: since each pair incurs two duplications, the RD distance will not increase. These deleted genes form inparalogs.
An outline of MSOAR. Input: Dataset A and Dataset B. Homology search: 1. Apply all-vs.-all comparison by BLASTp. 2. Only select the BLAST hits with similarity score above the cutoff. 3. Keep up to five top bi-directional best hits. Assign orthologs by minimizing RD distance: 1. Apply suboptimal rules. 2. Apply minimum common substring partition. 3. Maximum cycle decomposition. 4. Detect inparalogs by identifying “noise” gene pairs. Output: list of orthologous gene pairs.
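The homology search filter can be sketched as follows. This is one possible reading of "keep up to five top bi-directional best hits", namely that a pair is kept when each gene is among the other's k highest-scoring partners above the cutoff; that interpretation, the input format (a dictionary of similarity scores), and the function name are all assumptions made for this illustration rather than a description of MSOAR's actual code.

```python
from collections import defaultdict


def top_bidirectional_hits(scores, cutoff, k=5):
    """scores maps (gene_a, gene_b) to a similarity score.
    Keep pairs scoring at least `cutoff` where gene_b is among the top k
    hits of gene_a and gene_a is among the top k hits of gene_b."""
    hits_a = defaultdict(list)
    hits_b = defaultdict(list)
    for (a, b), score in scores.items():
        if score >= cutoff:
            hits_a[a].append((score, b))
            hits_b[b].append((score, a))
    # Sort each gene's hits by descending score and keep the best k partners.
    top_a = {a: {b for _, b in sorted(v, reverse=True)[:k]} for a, v in hits_a.items()}
    top_b = {b: {a for _, a in sorted(v, reverse=True)[:k]} for b, v in hits_b.items()}
    return sorted((a, b) for a, partners in top_a.items()
                  for b in partners if a in top_b[b])


# Hypothetical scores between human genes (h*) and mouse genes (m*).
scores = {("h1", "m1"): 90, ("h1", "m2"): 40, ("h2", "m1"): 60}
print(top_bidirectional_hits(scores, cutoff=50, k=1))  # [('h1', 'm1')]
```

With k = 1 the filter reduces to the classic bi-directional best hit criterion; larger k keeps more candidate homologs for the rearrangement-based steps to resolve.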
Simulated data test Simulated genome : 100 distinct genes Simulated genome : Randomly perform reversals on to obtain another genome Experiments One: Randomly copy some genes and insert them back into Two: Randomly copy some genes and insert them back into and (Inserted genes are inparalogs by definition.)
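A simulation of this kind can be sketched as below. Only the 100 distinct genes come from the slide; the number of reversals, the number of inserted copies, the choice of which genome receives the copies, and the function name `simulate_pair` are assumptions made for this example.

```python
import random


def simulate_pair(n_genes=100, n_reversals=10, n_inserted=5, seed=0):
    """Return (g, h): g is the identity genome, h is obtained from g by
    random signed reversals followed by random insertion of gene copies."""
    rng = random.Random(seed)
    g = list(range(1, n_genes + 1))
    h = list(g)
    for _ in range(n_reversals):
        i, j = sorted(rng.sample(range(n_genes), 2))
        h[i:j + 1] = [-x for x in reversed(h[i:j + 1])]
    for _ in range(n_inserted):
        gene = rng.choice(h)                       # copy an existing gene
        h.insert(rng.randrange(len(h) + 1), gene)  # the copy is an inparalog
    return g, h


g, h = simulate_pair()
print(len(g), len(h))  # 100 105
```

Because the inserted copies are known, the true orthologs are the original 100 genes, which gives a ground truth against which an assignment method can be scored.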
Simulated data test. Randomly generate two genomes ( , , , ). Results are averaged over 20 random instances for each parameter set. Our heuristic algorithm vs. the iterated exemplar algorithm (Sankoff, Bioinformatics, 1999).
Real data. Homo sapiens: Build 36.1 human genome assembly (UCSC hg18, March 2006), 20161 protein sequences in total. Mus musculus: Build 36 mouse genome assembly (UCSC mm8, February 2006), 19199 protein sequences in total.
MSOAR vs. Inparanoid. Validation: official gene symbols extracted from UniProt release 6.0 (September 2005). For 20161 human protein sequences and 19199 mouse protein sequences, MSOAR assigned 14362 orthologs between human and mouse, among which 11050 are true positives, 1748 are unknown pairs, and 1508 are false positives, resulting in a sensitivity of 92.26% and a specificity of 87.99%. [Table: the comparison between MSOAR and Inparanoid (Remm et al., J. Mol. Biol., 2001).]
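The reported specificity can be reproduced from the counts on the slide if it is taken to be the fraction of assigned pairs with a known status that are true positives, TP / (TP + FP). That definition is an inference from the numbers, not something the slide states. The sensitivity cannot be recomputed here because the total number of known true ortholog pairs is not given.

```python
true_positives = 11050
false_positives = 1508

# Assumed definition: unknown pairs are excluded from the denominator.
specificity = true_positives / (true_positives + false_positives)
print(f"{100 * specificity:.2f}%")  # 87.99%
```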
MSOAR vs. Inparanoid. [Figure: genes on human chromosome 20 (SNRPB, STK35, TGM3, TGM6, ZNF343, TMC2, NOL5A, IDH3B) aligned with genes on mouse chromosome 2 (Snrpb, Stk35, Tgm3, Tgm6, Tmc2, Nol5a, Idh3b).] The ortholog pair SNRPB (human) and Snrpb (mouse) are not bi-directional best hits, so the pair could be missed by sequence-similarity-based ortholog assignment methods like Inparanoid.
Number of main ortholog pairs assigned by MSOAR across the chromosome pairs
An alignment between syntenic blocks and MSOAR blocks
Validation by HCOP. The HGNC Comparison of Orthology Predictions (HCOP) is a tool that integrates and displays the human-mouse orthology assertions made by Ensembl, Homologene, Inparanoid, PhIGS, MGD and HGNC. (http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/hcop.pl)
Other validations. By PANTHER protein sequence classification (ftp://ftp.pantherdb.org/sequence_classifications/): MSOAR identified 14083 ortholog pairs with valid gene IDs between human and mouse, among which 11887 pairs have both orthologous genes in the same protein subfamily.
Summary and future work. Summary: presented a novel approach to assign orthologs between two genomes via genome rearrangement and gene duplication; introduced a rearrangement/duplication (RD) distance for genome comparisons; proposed a heuristic algorithm for assigning orthologs under maximum parsimony; developed a high-throughput system for ortholog assignment (MSOAR); tested the system on simulated data and real genomic data of human and mouse (MSOAR vs. the iterated exemplar algorithm, MSOAR vs. Inparanoid, various validation methods). Future directions: more efficient algorithms for MCSP and MCD; refine the evolutionary model for MSOAR (transposition, tandem duplication, gene loss, etc.); ortholog assignment for multiple genome comparison; more explicit treatment of one-to-many and many-to-many orthology relationships.
References
X. Chen, J. Zheng, Z. Fu, P. Nan, Y. Zhong, S. Lonardi, and T. Jiang. Computing the assignment of orthologous genes via genome rearrangement. Proc. 3rd Asia-Pacific Bioinformatics Conference (APBC), 2005, pp. 363-378.
X. Chen, J. Zheng, Z. Fu, P. Nan, Y. Zhong, S. Lonardi, and T. Jiang. Assignment of orthologous genes via genome rearrangement. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 2-4, 2005, pp. 302-315.
Z. Fu, X. Chen, V. Vacic, P. Nan, Y. Zhong, and T. Jiang. A parsimony approach to genome-wide ortholog assignment. Proc. 10th Annual International Conference on Research in Computational Molecular Biology (RECOMB), 2006, pp. 578-594.
Z. Fu, X. Chen, V. Vacic, P. Nan, Y. Zhong, and T. Jiang. MSOAR: A high-throughput ortholog assignment system based on genome rearrangement. Submitted, 2007.
Z. Fu and T. Jiang. Clustering of main orthologs for multiple genomes. To be presented at LSI Conference on Computational Systems Biology (CSB), 2007.
Acknowledgement. NSF; DoE Genomes to Life (GtL) program; National Key Project for Basic Research; NSFC; Changjiang Visiting Professorship, Tsinghua Univ. Discussion with Marek Chrobak, Petr Kolman, and Lan Liu on MCSP and MCIP.