Orthology Analysis Erik Sonnhammer Center for Genomics and Bioinformatics Karolinska Institutet, Stockholm.

Outline
– Basic concepts
– BLAST-based approaches to orthology
– Tree-based approaches to orthology
– Domain-level orthology

Homologs = genes with a common origin
– May be genes in the same or in different organisms
– Does not say that function is identical
– Can only be true or false, and not a percentage!
– Homologs have the same 3D-structure layout

Homologs
– Orthologs
– Paralogs

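Orthologs: separated by speciation
[Figure: gene tree drawn along a time axis, with duplications marked D and speciations marked S. Gene X in an ancient animal is duplicated, giving Gene X and Gene Y in an ancient mammal. Speciation then gives Gene X in human and Gene X in rat, and likewise Gene Y in human and Gene Y in rat. A later duplication in human gives Gene Y1 and Gene Y2. Labelled relationships: orthologs, in-paralogs, out-paralogs.]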

In/Out-paralog definition
– In-paralogs ~ co-orthologs: paralogs that were duplicated after the speciation and hence are orthologs to a cluster in the other species
– Out-paralogs = not co-orthologs: paralogs that were duplicated before the speciation. Not necessarily in the same species.
Sonnhammer & Koonin, Trends Genet. 18: (2002)
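The in/out-paralog distinction depends only on whether the duplication is younger or older than the speciation being considered. A minimal sketch, assuming event ages are given in arbitrary time units before present; the function name and the ages are hypothetical and not part of the slides:

```python
def classify_paralogs(duplication_age, speciation_age):
    """Classify a pair of paralogs relative to one speciation event.

    Ages are 'time before present' in any consistent unit, so a larger
    number means an older event. Both values are illustrative inputs.
    """
    if duplication_age < speciation_age:
        # Duplicated after the speciation: co-orthologs of the other lineage.
        return "in-paralogs"
    # Duplicated before the speciation.
    return "out-paralogs"


# Hypothetical ages: one speciation at 90, duplications at 40 and at 300.
print(classify_paralogs(40, 90))
print(classify_paralogs(300, 90))
```

The same pair of genes can therefore be in-paralogs with respect to one speciation and out-paralogs with respect to a more recent one.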

Orthologs for functional genomics
– Co-orthologs / inparalogs are more likely than outparalogs to have identical biochemical functions and biological roles.
– Co-orthologs can be used to discover human gene function via model organism experiments.
– Co-orthologs are key to exploiting functional genomics/proteomics data in model organisms.

Orthology and function conservation
– Orthology does not say anything about evolutionary distance.
– Close orthologs, e.g. human-mouse, are very likely to have the same biological role in the organism.
– Distant orthologs, e.g. human-worm, are less likely to have the same phenotypic role, but may have the same role in the corresponding pathway.

Ortholog Databases
Sequence database  | Orthology detection method | Ortholog database
SwTrembl proteomes | Inparanoid (blast)         | Inparanoid
proteomes          | COGs (blast)               | COGs / KOGs
TIGR gene index    | COGs (blast)               | TOGA/EGO
proteomes          | OrthoMCL (blast)           | OrthoMCL
Pfam               | Orthostrapper (tree)       | HOPS
Pfam               | RIO (tree)                 |

How to find orthologs?
1. Calculate a phylogenetic tree and look for orthologs in the tree (Orthostrapper, RIO).
2. Two-way best matches between two species can be used to find orthologs without trees. [However, in-paralogs are harder to find this way.]

Two-way best match approach to finding orthologs
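The two-way best match idea can be sketched in a few lines. This is an illustration only, assuming similarity scores (for example BLAST bit scores) have already been computed in both directions; the gene names, scores and function names are hypothetical:

```python
def best_hits(scores):
    """Map each query to its highest-scoring target.

    scores: dict mapping (query, target) -> similarity score.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def two_way_best_matches(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores for human genes (hX, hY1, hY2) and rat genes (rX, rY).
human_vs_rat = {
    ("hX", "rX"): 900, ("hX", "rY"): 300,
    ("hY1", "rY"): 800, ("hY1", "rX"): 310,
    ("hY2", "rY"): 700, ("hY2", "rX"): 290,
}
rat_vs_human = {
    ("rX", "hX"): 900, ("rX", "hY1"): 310,
    ("rY", "hY1"): 800, ("rY", "hY2"): 700, ("rY", "hX"): 300,
}
pairs = two_way_best_matches(human_vs_rat, rat_vs_human)
print(pairs)  # [('hX', 'rX'), ('hY1', 'rY')]
```

In this toy data hY2 is left out even though it is an in-paralog of hY1, which is the weakness of plain two-way best matches: only one member of each in-paralog cluster can be the best hit.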

COGs
[Figure: COG2813, a group containing both orthologs and out-paralogs]

Inpara-n-oid: Inparalog ‘n ortholog identification
[Figure: blue = species 1, red = species 2]

Inparanoid
[Figure: blue = species 1, red = species 2]

Resolve overlapping clusters
– No overlap: no problems
– Partial overlap: separate
– Complete overlap: merge

Inparalog score
Score for inparalog P = (scoreAP - scoreAB) / (scoreAA - scoreAB), in %
[Figure: sequences A, P and B]
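The inparalog score is a simple rescaling of pairwise scores. A small sketch, assuming A is the main ortholog in the same species as the candidate P, B is the main ortholog in the other species, and the trailing "%" on the slide means the ratio is reported as a percentage; these readings and the example scores are assumptions:

```python
def inparalog_score(score_ap, score_ab, score_aa):
    """Inparalog score for P, as a percentage.

    score_ap: similarity score between main ortholog A and candidate P
    score_ab: similarity score between the two main orthologs A and B
    score_aa: self-score of A (the maximum attainable score against A)
    """
    if score_aa <= score_ab:
        raise ValueError("self-score of A must exceed the A-B score")
    return 100.0 * (score_ap - score_ab) / (score_aa - score_ab)


# Hypothetical bit scores.
print(inparalog_score(score_ap=800, score_ab=600, score_aa=1000))  # 50.0
```

With this scaling, a candidate that scores no better against A than B does gets 0, and A itself gets 100.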

Confidence values for main orthologs from sampling
Original alignment:
TVHIVDDEEPVR---KSLAFM---LTMNGFA
T+ ++DD   +R   K L  M   +T+ G A
TILLIDDHPMLRTGVKQLISMAPDITVVGEA
Sampling with replacement; insertions kept intact:
GAFDEP---LVTHVR
GA +     ++T +R
GAEEHMAPDILTLLR
“Bootstrap alignment” -> “bootstrap score”
Confidence = (nr of bootstrap alignments giving best-best matches / nr of bootstraps)
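The sampling step can be illustrated with a simplified sketch. It assumes each pairwise alignment has already been reduced to a list of per-column scores, resamples those columns with replacement, and counts how often the main ortholog still has the highest bootstrap score. It checks one direction only and does not keep insertions intact, so it is a rough stand-in for the procedure on the slide, not a reimplementation; all names and numbers are hypothetical:

```python
import random


def bootstrap_score(column_scores, rng):
    """Sum of per-column scores after resampling columns with replacement."""
    n = len(column_scores)
    return sum(column_scores[rng.randrange(n)] for _ in range(n))


def bootstrap_confidence(alignments, main, n_boot=1000, seed=0):
    """Fraction of bootstraps in which `main` has the best bootstrap score.

    alignments: dict mapping candidate name -> list of per-column scores
                for its alignment against the query (assumed input format).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_boot):
        sampled = {name: bootstrap_score(cols, rng)
                   for name, cols in alignments.items()}
        if max(sampled, key=sampled.get) == main:
            wins += 1
    return wins / n_boot


# Hypothetical per-column scores for two candidate matches to one query.
candidates = {
    "geneB": [5, 4, -1, 6, 5, 0, 4, 5, -2, 6],
    "geneC": [4, 4, -1, 5, 3, 0, 4, 2, -2, 5],
}
print(bootstrap_confidence(candidates, "geneB"))
```

A full version would require the match to remain best in both directions before counting it as a best-best match.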

inparanoid.cgb.ki.se
Remm et al, J. Mol. Biol. 314: (2001)
Homo sapiens vs. C. elegans

Ortholog group sizes, human vs X

Nr of inparalogs per ortholog group
[Table: for each species (Mouse, Fly, Worm, Mustard weed, Yeast, E. coli), the avg. inparalogs in model organism ortholog groups and the avg. inparalogs in human ortholog groups]

Drawbacks of Blast-based orthology assignment
– No guarantee that the same segment is used in different sequences
– No evolutionary distance model
– Does not take multiple domains into account

Domain orthology
Inparanoid Human-Fly ortholog pairs with domains in Pfam-A 13.0:
Different domain architectures: 5411
– Many of these are minor differences, e.g. 22 vs 21 Spectrin repeats
– Sometimes the difference is big: [ef-hand, UCH] vs [TBC, UCH]

Tree-based approaches

Distance-based tree building
Alignment:
A1 MKFYSLPNFPEN
A2 MKYYKLPDLPDE
A3 MRFYTACENPRS
Distance matrix:
     A2  A3
A1    4   8
A2       10
Bootstrapping:
– randomly pick columns to bootstrap alignment, calculate tree
– Repeat 1000 times, frequency of node = bootstrap support
[Figure: tree with leaves A1, A2, A3]
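Column bootstrapping can be sketched on the three example sequences. This sketch assumes a plain mismatch count as the distance, so its numbers are not expected to match the distance values on the slide, which presumably come from a proper evolutionary distance model. With only three sequences, "the tree" reduces to which pair is joined first, so that pair's frequency stands in for node support:

```python
import random
from itertools import combinations

ALIGNMENT = {
    "A1": "MKFYSLPNFPEN",
    "A2": "MKYYKLPDLPDE",
    "A3": "MRFYTACENPRS",
}


def mismatch_distance(s, t):
    """Number of aligned positions that differ (toy distance, an assumption)."""
    return sum(a != b for a, b in zip(s, t))


def distance_matrix(alignment):
    """Pairwise distances keyed by sorted name pairs."""
    return {(x, y): mismatch_distance(alignment[x], alignment[y])
            for x, y in combinations(sorted(alignment), 2)}


def resample_columns(alignment, rng):
    """Build a bootstrap alignment by drawing columns with replacement."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}


def closest_pair(alignment):
    """Pair joined first by a distance method (ties go to the earlier pair)."""
    d = distance_matrix(alignment)
    return min(d, key=d.get)


def bootstrap_support(alignment, pair, n_boot=1000, seed=0):
    """Fraction of bootstrap alignments in which `pair` is joined first."""
    rng = random.Random(seed)
    hits = sum(closest_pair(resample_columns(alignment, rng)) == pair
               for _ in range(n_boot))
    return hits / n_boot


print(distance_matrix(ALIGNMENT))
print(bootstrap_support(ALIGNMENT, ("A1", "A2")))
```

For larger alignments each bootstrap replicate would go through a real tree-building method such as neighbor joining, and support would be counted per internal node.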

Orthology by tree reconciliation
[Figure: a species tree and a gene tree; reconciling them infers 2 duplications and 2 losses]
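One common way to reconcile a gene tree with a species tree is LCA mapping: each gene-tree node is mapped to the lowest species-tree node covering its leaves, and a node that maps to the same place as one of its children is called a duplication. The sketch below uses that technique on a small human/rat example with hypothetical gene names; it labels duplications and speciations only and does not count losses, and it is not claimed to be the method behind the slide's figure:

```python
def index_tree(tree):
    """Return parent and depth maps for every node of a nested-tuple tree."""
    parent, depth = {tree: None}, {tree: 0}
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, tuple):
            for child in node:
                parent[child] = node
                depth[child] = depth[node] + 1
                stack.append(child)
    return parent, depth


def lca(a, b, parent, depth):
    """Lowest common ancestor of two species-tree nodes."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a


def reconcile(gene_tree, species_of, species_tree):
    """Label each internal node of a binary gene tree by LCA mapping."""
    parent, depth = index_tree(species_tree)
    events = []

    def visit(node):
        if not isinstance(node, tuple):
            return species_of[node]  # leaf: map the gene to its species
        left, right = (visit(child) for child in node)
        mapped = lca(left, right, parent, depth)
        # Same mapping as a child means the split happened within a lineage.
        kind = "duplication" if mapped in (left, right) else "speciation"
        events.append((node, kind))
        return mapped

    visit(gene_tree)
    return events


species_tree = ("human", "rat")
species_of = {"hX": "human", "rX": "rat",
              "hY1": "human", "hY2": "human", "rY": "rat"}
gene_tree = (("hX", "rX"), (("hY1", "hY2"), "rY"))

events = reconcile(gene_tree, species_of, species_tree)
for node, kind in events:
    print(kind, node)
```

Gene pairs whose last common gene-tree node is labelled a speciation are orthologs; pairs that meet at a duplication are paralogs.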

Drawbacks of tree reconciliation for orthology assignment
– Assumption that the species tree is fully known
– Does not give confidence values
– Gene trees become unreliable when involving a lot of sequences (more data -> less certainty)
– Computationally expensive

Partial tree reconciliation Find pairwise orthologs by computer parsing of tree.

Pairwise orthology confidence by ‘orthostrapping’
[Figure: the original tree with bootstrap support values; leaf labels as they appear in the transcript: C14F5.4, AAF, AH6.2, F37H8.4, Y6E2A.9, C47D12.3, T04F8.1, AAF, PIR-S67168]
[Figure sequence: a Fly vs Worm table of pairwise ortholog counts, filled in step by step as the bootstrap trees are parsed]
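The counting step of orthostrapping can be sketched as follows. The sketch assumes the ortholog pairs have already been read off each bootstrap tree and only tallies how often each pair occurs; the gene names are hypothetical, and the 75% cut-off is the orthostrap threshold quoted later in the HOPS results:

```python
from collections import Counter


def orthostrap_support(ortholog_calls):
    """Percentage of bootstrap trees in which each pair is called orthologous.

    ortholog_calls: list with one set of (gene_a, gene_b) pairs per tree.
    """
    counts = Counter(pair for calls in ortholog_calls for pair in calls)
    n_trees = len(ortholog_calls)
    return {pair: 100.0 * count / n_trees for pair, count in counts.items()}


def confident_pairs(support, threshold=75.0):
    """Pairs whose orthostrap support exceeds the threshold."""
    return sorted(pair for pair, pct in support.items() if pct > threshold)


# Hypothetical ortholog calls from four bootstrap trees.
calls = [
    {("fly1", "worm1")},
    {("fly1", "worm1"), ("fly1", "worm2")},
    {("fly1", "worm1")},
    {("fly1", "worm1")},
]
support = orthostrap_support(calls)
print(support)
print(confident_pairs(support))
```

Because support is counted per pair rather than per tree node, a pair can be confidently orthologous even when the surrounding tree topology is poorly supported.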

orthostrapper.cgb.ki.se

Orthology is not transitive!
Multiple species at different distances may give erroneous groups that include out-paralogs

Orthology is not transitive!
-> Orthology strictly defined for only 2 species/clades
– Combining species of different distances is very dangerous
– But OK to combine multiple equidistant ones
[Figure: trees with leaves Y, H1, D1, H2, D2]

Domain-level orthology

HOPS - Hierarchy of Orthologs and Paralogs
1. All species in Pfam are bundled in groups according to scheme:
   eukaryota: metazoa, viridiplantae, fungi
   metazoa: nematoda, arthropoda, chordata
2. Apply Orthostrapper to groups at same level in Pfam families
3. Display results in NIFAS

Pfam

Pfam in brief:
[Figure: SEED alignment (representative members; manually curated) -> Profile-HMM (HMMer-2.0) -> search database -> FULL alignment (automatically made); description file]
Release 13.0 (April 2004):
– 7426 Pfam-A domain families
– Based on sequences (Swissprot & Trembl)
– 21980 unique Pfam-A domain architectures
– 73% of all proteins have >=1 Pfam-A domain

HOPS results
Pfam 10, 6190 families:
– 2450 families (40%) have HOPS orthologs
– 1319 families (21%) have HOPS orthologs in all 6 pairwise comparisons
– pairwise orthology assignments (> 75% orthostrap)
Storm and Sonnhammer, Genome Research 13: (2003)

Ways to access HOPS
– NIFAS graphical browser
– By sequence ID at Pfam.cgb.ki.se/HOPS
– Flatfiles (Orthostrap tables of 2 clades)

Pfam.cgb.ki.se/HOPS

Evolution of Domain Architectures NIFAS:

ATP sulfurylase /APS kinase

Orthologous shuffled domains? ATP sulfurylase domain, metazoa vs fungi

APS kinase domain

HOPS orthologs of PPS1_HUMAN (ATP sulfurylase/APS kinase)

Summary of ATP sulfurylases/APS kinases: Shuffled non-orthologous domains Fungi Metazoa

Conclusions
Orthologs can be detected by
– Blast: fast
– tree: slow but less error-prone
Species at different evolutionary distances should not be combined in orthology analysis
Inparanoid and Orthostrapper were designed to find inparalogs but not outparalogs
HOPS/NIFAS can be used to find domain orthologs and analyze domain architecture evolution

Future perspectives
– Multiparanoid: multiple species merging of pairwise inparalogs
– Functional divergence among inparalogs

Acknowledgments
– Christian Storm
– Maido Remm
– Andrey Alexeyenko
– Volker Hollich
– Mats Jonsson