Orthology Analysis. Erik Sonnhammer, Center for Genomics and Bioinformatics, Karolinska Institutet, Stockholm
Outline: Basic concepts. BLAST-based approaches to orthology. Tree-based approaches to orthology. Domain-level orthology.
Homologs = genes with a common origin. They may be genes in the same or in different organisms. Homology does not say that function is identical. It can only be true or false, and not a percentage! Homologs have the same 3D-structure layout.
[Figure: homologs divided into orthologs and paralogs]
[Figure: gene tree drawn along a time axis. Gene X in an ancient animal is duplicated (D) into Gene X and Gene Y in an ancient mammal. Speciation (S) then gives Gene X in human and Gene X in rat, and Gene Y in rat and in human, where a further duplication (D) gives Gene Y1 and Gene Y2. Labels: orthologs (separated by speciation), paralogs, in-paralogs, out-paralogs.]
In/Out-paralog definition: In-paralogs (~ co-orthologs) are paralogs that were duplicated after the speciation and hence are orthologs to a cluster in the other species. Out-paralogs (= not co-orthologs) are paralogs that were duplicated before the speciation, not necessarily in the same species. Sonnhammer & Koonin, Trends Genet. 18: (2002)
Orthologs for functional genomics: Co-orthologs / inparalogs are more likely than outparalogs to have identical biochemical functions and biological roles. Co-orthologs can be used to discover human gene function via model organism experiments. Co-orthologs are key to exploiting functional genomics/proteomics data in model organisms.
Orthology and function conservation: Orthology does not say anything about evolutionary distance. Close orthologs, e.g. human-mouse, are very likely to have the same biological role in the organism. Distant orthologs, e.g. human-worm, are less likely to have the same phenotypic role, but may have the same role in the corresponding pathway.
Ortholog Databases
Sequence database     Orthology detection method   Ortholog database
SwTrembl proteomes    Inparanoid (blast)           Inparanoid
proteomes             COGs (blast)                 COGs / KOGs
TIGR gene index       COGs (blast)                 TOGA/EGO
proteomes             OrthoMCL (blast)             OrthoMCL
Pfam                  Orthostrapper (tree)         HOPS
Pfam                  RIO (tree)
How to find orthologs? 1. Calculate a phylogenetic tree and look for orthologs in the tree (Orthostrapper, Rio). 2. Two-way best matches between two species can be used to find orthologs without trees. [However, in-paralogs are harder to find this way]
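The two-way best match idea can be sketched in a few lines. This is a minimal illustration, not the Inparanoid or COGs implementation: the input dictionaries, the function name `two_way_best_matches` and all gene IDs and scores are made up, and each dictionary is assumed to hold only the single best Blast hit per query.

```python
from typing import Dict, List, Tuple

# Hypothetical input: for each query protein, its best hit in the other
# species as (hit_id, score). A real run would parse Blast output instead.
BestHits = Dict[str, Tuple[str, float]]


def two_way_best_matches(a_vs_b: BestHits, b_vs_a: BestHits) -> List[Tuple[str, str]]:
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    pairs = []
    for a, (b, _score) in a_vs_b.items():
        back = b_vs_a.get(b)
        if back is not None and back[0] == a:
            pairs.append((a, b))
    return sorted(pairs)


# Toy data (made up): humY1 and humY2 are in-paralogs that both hit ratY,
# but ratY can only point back at one of them.
human_vs_rat = {
    "humX": ("ratX", 500.0),
    "humY1": ("ratY", 420.0),
    "humY2": ("ratY", 400.0),
}
rat_vs_human = {
    "ratX": ("humX", 500.0),
    "ratY": ("humY1", 420.0),
}

rbh = two_way_best_matches(human_vs_rat, rat_vs_human)
print(rbh)
```

In the toy data humY2 is dropped even though it is an in-paralog of humY1, which is the weakness noted in the bracketed remark on the slide.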
Two-way best match approach to finding orthologs
COGs. COG2813: [Figure labels: out-paralogs, orthologs]
Inpara-n-oid: Inparalog ‘n ortholog identification. [Figure legend: blue = species 1, red = species 2]
Inparanoid. [Figure legend: blue = species 1, red = species 2]
Resolve overlapping clusters: No overlap: no problems. Partial overlap: separate. Complete overlap: merge.
Inparalog score: Score for inparalog P = (scoreAP - scoreAB) / (scoreAA - scoreAB), given in %. [Figure labels: A, P, B]
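A small worked version of the inparalog score may help. It assumes, as a reading of the figure labels rather than something stated on the slide, that A is the main ortholog in one species, B its main ortholog in the other species, and P a candidate inparalog in the same species as A. The function name and the numbers are hypothetical.

```python
def inparalog_score(score_ap: float, score_ab: float, score_aa: float) -> float:
    """(scoreAP - scoreAB) / (scoreAA - scoreAB), returned as a fraction."""
    denom = score_aa - score_ab
    if denom <= 0:
        # Assumed precondition: A scores higher against itself than against B.
        raise ValueError("scoreAA must be larger than scoreAB")
    return (score_ap - score_ab) / denom


# Made-up scores: A vs itself, A vs its ortholog B, A vs candidate P.
halfway = inparalog_score(score_ap=700.0, score_ab=400.0, score_aa=1000.0)
print(f"{100 * halfway:.0f}%")
```

Under this reading, P scores 0 when it is no closer to A than B is, and 1 (100%) when it is as close to A as A is to itself.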
Confidence values for main orthologs from sampling
Original alignment:
  TVHIVDDEEPVR---KSLAFM---LTMNGFA
  T+ ++DD   +R   K L  M   +T+ G A
  TILLIDDHPMLRTGVKQLISMAPDITVVGEA
Sampling with replacement; insertions kept intact
“Bootstrap alignment” -> “bootstrap score”:
  GAFDEP---LVTHVR
  GA +     ++T +R
  GAEEHMAPDILTLLR
Confidence = (nr of bootstrap alignments that give best-best matches) / (nr of bootstraps)
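The sampling step can be sketched as follows. This is an illustration under several assumptions, not Inparanoid's code: gap runs are treated as single units to keep insertions intact, `toy_score` is a made-up stand-in for a Blast score, and the "still a best-best match" test is simplified to "the bootstrap score beats a fixed runner-up score" (`rival_score`, a hypothetical parameter).

```python
import random
from typing import List, Tuple

QUERY = "TVHIVDDEEPVR---KSLAFM---LTMNGFA"
SUBJECT = "TILLIDDHPMLRTGVKQLISMAPDITVVGEA"


def split_units(query: str, subject: str) -> List[Tuple[str, str]]:
    """Cut an aligned pair into units: single columns, or whole gap runs."""
    units = []
    i, n = 0, len(query)
    while i < n:
        j = i + 1
        if query[i] == "-" or subject[i] == "-":
            # Extend over the whole gap run so the insertion stays intact.
            while j < n and (query[j] == "-" or subject[j] == "-"):
                j += 1
        units.append((query[i:j], subject[i:j]))
        i = j
    return units


def bootstrap_alignment(query: str, subject: str, rng: random.Random) -> Tuple[str, str]:
    """Sample units with replacement, as many units as the original has."""
    units = split_units(query, subject)
    picked = [rng.choice(units) for _ in units]
    return "".join(q for q, _ in picked), "".join(s for _, s in picked)


def toy_score(query: str, subject: str) -> int:
    """Toy stand-in for a Blast score: +1 per identity, -1 per gap column."""
    score = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            score -= 1
        elif q == s:
            score += 1
    return score


def bootstrap_confidence(query: str, subject: str, rival_score: float,
                         n_boot: int = 1000, seed: int = 0) -> float:
    """Fraction of bootstrap alignments whose score beats the rival score."""
    rng = random.Random(seed)
    wins = sum(
        toy_score(*bootstrap_alignment(query, subject, rng)) > rival_score
        for _ in range(n_boot)
    )
    return wins / n_boot
```

A fuller version would re-score the competing alignments in each replicate as well, so that "best-best match" is decided per bootstrap rather than against a fixed number.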
inparanoid.cgb.ki.se. Remm et al., J. Mol. Biol. 314: (2001). Homo sapiens vs. C. elegans
Ortholog group sizes, human vs X
Nr of inparalogs per ortholog group. [Table: columns are Species; Avg. inparalogs in model organism ortholog groups; Avg. inparalogs in human ortholog groups. Rows are Mouse, Fly, Worm, Mustard weed, Yeast, E. coli. Values not recoverable.]
Drawbacks of Blast-based orthology assignment: No guarantee that the same segment is used in different sequences. No evolutionary distance model. Does not take multiple domains into account.
Domain orthology. Inparanoid Human-Fly ortholog pairs with domains in Pfam-A 13.0: Different domain architectures: 5411
–Many of these are minor differences, e.g. 22 vs 21 Spectrin repeats
–Sometimes the difference is big: ef-hand + UCH vs TBC + UCH
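One way to separate minor from big architecture differences is to collapse runs of the same domain before comparing. This is only a sketch: architectures are represented as plain lists of domain names, and treating "differs only in repeat count" as minor is an assumption, not a rule taken from the slides.

```python
from itertools import groupby
from typing import List


def collapse_repeats(architecture: List[str]) -> List[str]:
    """Replace each run of consecutive identical domains by one copy."""
    return [name for name, _run in groupby(architecture)]


def compare_architectures(a: List[str], b: List[str]) -> str:
    """Classify a pair of domain architectures."""
    if a == b:
        return "identical"
    if collapse_repeats(a) == collapse_repeats(b):
        return "minor (repeat count only)"
    return "different"


# Examples modelled on the slide; the list encoding is hypothetical.
spectrin = compare_architectures(["Spectrin"] * 22, ["Spectrin"] * 21)
shuffled = compare_architectures(["ef-hand", "UCH"], ["TBC", "UCH"])
print(spectrin, "|", shuffled)
```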
Tree-based approaches
Distance-based tree building
Alignment:
  A1 MKFYSLPNFPEN
  A2 MKYYKLPDLPDE
  A3 MRFYTACENPRS
Distance matrix:
       A2   A3
  A1    4    8
  A2        10
[Figure: tree with leaves A1, A2, A3]
Bootstrapping:
–randomly pick columns to bootstrap alignment, calculate tree
–Repeat 1000 times, frequency of node = bootstrap support
Orthology by tree reconciliation. [Figure: species tree and gene tree] Reconciling the two trees infers 2 duplications and 2 losses.
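A common way to reconcile a gene tree with a species tree is to map every gene-tree node to the last common ancestor (LCA) of its species, and to call a node a duplication when it maps to the same species-tree node as one of its children. The sketch below uses that LCA rule, which may differ from what the tools named in these slides do; it does not count losses, and it does not reproduce the slide's figure. The trees reuse the human/rat Gene X and Gene Y example from the earlier slide, and the tuple and dictionary encodings are hypothetical.

```python
from typing import Dict, List, Optional

# Species tree as child -> parent; the root has parent None.
SPECIES_PARENT: Dict[str, Optional[str]] = {
    "human": "mammal",
    "rat": "mammal",
    "mammal": None,
}

# Gene tree as nested 2-tuples; leaves are gene names.
GENE_TREE = (("X_human", "X_rat"), (("Y1_human", "Y2_human"), "Y_rat"))

SPECIES_OF = {
    "X_human": "human", "X_rat": "rat",
    "Y1_human": "human", "Y2_human": "human", "Y_rat": "rat",
}


def lca(a: str, b: str, parent: Dict[str, Optional[str]]) -> str:
    """Last common ancestor of two nodes in the species tree."""
    ancestors = set()
    node: Optional[str] = a
    while node is not None:
        ancestors.add(node)
        node = parent[node]
    node = b
    while node not in ancestors:
        node = parent[node]
    return node


def reconcile(tree, species_of, parent, events: List[str]) -> str:
    """Map a gene-tree node to a species-tree node and record its event type."""
    if isinstance(tree, str):
        return species_of[tree]
    left, right = tree
    m_left = reconcile(left, species_of, parent, events)
    m_right = reconcile(right, species_of, parent, events)
    here = lca(m_left, m_right, parent)
    # Duplication if the node maps to the same place as one of its children.
    events.append("duplication" if here in (m_left, m_right) else "speciation")
    return here


events: List[str] = []
root_species = reconcile(GENE_TREE, SPECIES_OF, SPECIES_PARENT, events)
print(root_species, events.count("duplication"), events.count("speciation"))
```

In this toy tree the rule finds two duplications (X versus Y, and Y1 versus Y2) and two speciations.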
Drawbacks of tree reconciliation for orthology assignment: Assumes that the species tree is fully known. Does not give confidence values. Gene trees become unreliable when involving a lot of sequences (more data -> less certainty). Computationally expensive.
Partial tree reconciliation: Find pairwise orthologs by computer parsing of the tree.
Pairwise orthology confidence by ‘orthostrapping’. [Figure: the original tree with bootstrap support values; leaf labels C14F5.4, AAF, AH6.2, F37H8.4, Y6E2A.9, C47D12.3, T04F8.1, AAF, PIR-S67168]
[Figure, three animation frames: a Fly vs. Worm table of pairwise counts over the same sequences, with one entry shown as 01, then 02, then 099]
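The counting behind a pairwise orthology confidence can be sketched as below. It assumes that each bootstrap tree has already been parsed into a set of ortholog pairs (that parsing step is not shown), and all IDs are placeholders rather than the genes in the figure. The 75% cut-off is borrowed from the "> 75% orthostrap" threshold quoted on the HOPS results slide.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def orthostrap_support(ortholog_sets: List[Set[Pair]]) -> Dict[Pair, float]:
    """Percentage of bootstrap trees in which each pair is called orthologous."""
    counts: Counter = Counter()
    for pairs in ortholog_sets:
        counts.update(set(pairs))
    n = len(ortholog_sets)
    return {pair: 100.0 * c / n for pair, c in counts.items()}


def confident_pairs(support: Dict[Pair, float], cutoff: float = 75.0) -> List[Pair]:
    """Pairs whose support is strictly above the cut-off."""
    return sorted(pair for pair, pct in support.items() if pct > cutoff)


# Made-up results from four bootstrap trees.
trees = [
    {("fly1", "worm1"), ("fly2", "worm2")},
    {("fly1", "worm1")},
    {("fly1", "worm1"), ("fly2", "worm2")},
    {("fly1", "worm1"), ("fly2", "worm1")},
]
support = orthostrap_support(trees)
print(confident_pairs(support))
```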
orthostrapper.cgb.ki.se
Orthology is not transitive! Multiple species at different distances may give erroneous groups that include out-paralogs.
Orthology is not transitive! -> Orthology is strictly defined for only 2 species/clades. Combining species of different distances is very dangerous, but it is OK to combine multiple equidistant ones. [Figure: trees with leaves Y, H1, D1, H2, D2]
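The problem can be made concrete with the leaf labels from the figure. The sketch assumes a reading of the figure that is not spelled out on the slide: Y is a single gene in a distant species, and H1/H2 and D1/D2 are genes in two closer species that arose from a duplication before those two species split. Pairs are orthologs when their last common node is a speciation; merging the pairwise results transitively then pulls out-paralogs into one group.

```python
from itertools import combinations

# Node = leaf name, or (event, left, right) with event "S" (speciation)
# or "D" (duplication). The topology is an assumed reading of the figure.
TREE = ("S", "Y", ("D", ("S", "H1", "D1"), ("S", "H2", "D2")))


def leaves(node):
    if isinstance(node, str):
        return [node]
    _event, left, right = node
    return leaves(left) + leaves(right)


def ortholog_pairs(node):
    """All leaf pairs whose last common node is a speciation."""
    if isinstance(node, str):
        return set()
    event, left, right = node
    pairs = ortholog_pairs(left) | ortholog_pairs(right)
    if event == "S":
        pairs |= {frozenset((a, b)) for a in leaves(left) for b in leaves(right)}
    return pairs


def transitive_groups(pairs):
    """Naively merge pairs that share a member into groups."""
    groups = []
    for pair in pairs:
        merged = set(pair)
        rest = []
        for group in groups:
            if group & merged:
                merged |= group
            else:
                rest.append(group)
        groups = rest + [merged]
    return groups


pairs = ortholog_pairs(TREE)
groups = transitive_groups(pairs)
# Pairs that end up in the same group although they are not orthologs.
wrong = sorted(
    tuple(sorted(p))
    for group in groups
    for p in map(frozenset, combinations(sorted(group), 2))
    if p not in pairs
)
print(wrong)
```

Y is orthologous to all four genes, so transitive merging puts H1 and D2 (and H2 and D1) in the same group even though they are separated by the duplication.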
Domain-level orthology
HOPS - Hierarchy of Orthologs and Paralogs
1. All species in Pfam are bundled in groups according to the scheme: eukaryota (metazoa (nematoda, arthropoda, chordata), viridiplantae, fungi)
2. Apply Orthostrapper to groups at the same level in Pfam families
3. Display results in NIFAS
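Step 2 implies a fixed set of pairwise comparisons between groups at the same level. The sketch below enumerates them, assuming the scheme nests as eukaryota over metazoa, viridiplantae and fungi, and metazoa over nematoda, arthropoda and chordata; that nesting is a reading of the scheme labels, and the dictionary encoding is hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Assumed nesting of the grouping scheme: parent group -> groups one level below.
LEVELS: Dict[str, List[str]] = {
    "eukaryota": ["metazoa", "viridiplantae", "fungi"],
    "metazoa": ["nematoda", "arthropoda", "chordata"],
}


def pairwise_comparisons(levels: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """All pairs of groups that sit at the same level under the same parent."""
    return [pair for groups in levels.values() for pair in combinations(groups, 2)]


comparisons = pairwise_comparisons(LEVELS)
for a, b in comparisons:
    print(a, "vs", b)
```

Under this assumption there are six comparisons, which agrees with the "all 6 pairwise comparisons" mentioned on the HOPS results slide.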
Pfam
Pfam in brief: [Figure: manually curated SEED alignment (representative members) and description file -> Profile-HMM (HMMer-2.0) -> search database -> automatically made FULL alignment]
Release 13.0 (April 2004):
–7426 Pfam-A domain families
–Based on sequences (Swissprot & Trembl)
–21980 unique Pfam-A domain architectures
–73% of all proteins have >=1 Pfam-A domain
HOPS results. Pfam 10, 6190 families: 2450 families (40%) have HOPS orthologs. 1319 families (21%) have HOPS orthologs in all 6 pairwise comparisons. pairwise orthology assignments (> 75% orthostrap). Storm and Sonnhammer, Genome Research 13: (2003)
Ways to access HOPS: NIFAS graphical browser. By sequence ID at Pfam.cgb.ki.se/HOPS. Flatfiles (Orthostrap tables of 2 clades).
Pfam.cgb.ki.se/HOPS
Evolution of Domain Architectures NIFAS:
ATP sulfurylase /APS kinase
Orthologous shuffled domains? ATP sulfurylase domain, metazoa vs fungi
APS kinase domain
HOPS orthologs of PPS1_HUMAN (ATP sulfurylase/APS kinase)
Summary of ATP sulfurylases/APS kinases: Shuffled non-orthologous domains. [Figure labels: Fungi, Metazoa]
Conclusions: Orthologs can be detected by Blast (fast) or by tree (slow but less error-prone). Species at different evolutionary distances should not be combined in orthology analysis. Inparanoid and Orthostrapper were designed to find inparalogs but exclude outparalogs. HOPS/NIFAS can be used to find domain orthologs and analyze domain architecture evolution.
Future perspectives: Multiparanoid: multiple-species merging of pairwise inparalogs. Functional divergence among inparalogs.
Acknowledgments: Christian Storm, Maido Remm, Andrey Alexeyenko, Volker Hollich, Mats Jonsson