BLAST program selection guide http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#tab31
Homology: Orthology, Paralogy, Xenology
Fitch WM. Trends Genet. 2000 May;16(5):227-31.
Analogy vs Homology
Analogy: The relationship of any two characters that have descended convergently from unrelated ancestors.
Homology: The relationship of any two characters that have descended, usually with divergence, from a common ancestral character.
Orthology: The relationship of any two homologous characters whose common ancestor lies in the cenancestor of the taxa from which the two sequences were obtained.
Paralogy: The relationship of any two homologous characters arising from a duplication of the gene for that character.
Xenology: The relationship of any two homologous characters whose history, since their common ancestor, involves an interspecies (horizontal) transfer of the genetic material for at least one of those characters.
A classic example (Figure from NCBI)
Test Yourself A1 – B1 A1 – B2 A1 – C3 B1 – C2 C2 – C3 B2 – C3 C3 – AB1
Test Yourself (answers) A1 – B1 = Ortho A1 – B2 = Ortho A1 – C3 = Ortho B1 – C2 = Para (out) C2 – C3 = Para (in) B2 – C3 = Ortho C3 – AB1 = Xeno
Homology on a Genome-Scale How many and which genes are common to two or more organisms? Which genes differentiate one organism from another? How is homology related to function?
Orthologs are the set of genes/proteins with gene trees identical to the species tree. We can understand other types of homology relationships by comparison to the species tree. But often we don’t know the species tree, and phylogenetic methods are complex.
Consider two genomes
Use BLASTP to compare one set of proteins (proteome) to the other.
Which set will you use as the query and which as the database?
What criteria will you use to define “a match”?
[Figure: genes 1, 2 and 3 of GenomeA and GenomeB with BLASTP matches] A1, A3, B2 and B3 are homologs (assuming the aligned regions overlap).
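One way to make the “what counts as a match” question concrete is to filter BLASTP tabular output. The sketch below assumes the standard 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name, the three thresholds, and the example hit lines are illustrative assumptions, not values recommended in these slides.

```python
def parse_blast_tabular(text, max_evalue=1e-5, min_identity=30.0, min_length=50):
    """Return (query, subject, bitscore) for hits that pass all three cutoffs.

    The cutoffs are hypothetical defaults chosen only for illustration.
    """
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        pident, length = float(fields[2]), int(fields[3])
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue <= max_evalue and pident >= min_identity and length >= min_length:
            hits.append((query, subject, bitscore))
    return hits


# Made-up example rows (not real BLAST output).
example = "\n".join([
    "A1\tB2\t62.5\t310\t110\t3\t1\t308\t2\t311\t1e-120\t350",
    "A1\tB3\t35.0\t290\t180\t5\t4\t290\t6\t295\t2e-30\t120",
    "A2\tB1\t28.0\t40\t28\t1\t10\t49\t12\t51\t0.5\t22.1",
])
matches = parse_blast_tabular(example)
print(matches)
```

Stricter or looser cutoffs change which genes count as homologs, so the thresholds should be reported alongside any result. An alignment-coverage criterion would additionally need the sequence lengths, which the default tabular columns do not include.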
Reciprocal Best Hits
Use BLASTP to compare the two sets of proteins (proteomes) to each other:
First, use GenomeA to query against GenomeB.
Then, use GenomeB to query against GenomeA.
Save only the one best match for each query.
Save only the reciprocal best matches as “orthologs”.
[Figure: genes 1, 2 and 3 of GenomeA and GenomeB, shown after each step] Result: the A3-B2 and A1-B3 homology relationships are lost.
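The reciprocal best hit procedure can be sketched in a few lines. This is a minimal illustration, assuming each BLASTP run has already been reduced to (query, subject, score) tuples. The function names and the scores are made up for the example, and ties are resolved by keeping the first hit seen, which is a simplification.

```python
def best_hits(hits):
    """Map each query to its single highest-scoring subject."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores for the four homologs A1, A3, B2 and B3.
a_vs_b = [("A1", "B2", 300.0), ("A1", "B3", 120.0),
          ("A3", "B3", 280.0), ("A3", "B2", 110.0)]
b_vs_a = [("B2", "A1", 300.0), ("B2", "A3", 110.0),
          ("B3", "A3", 280.0), ("B3", "A1", 120.0)]

rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh_pairs)  # [('A1', 'B2'), ('A3', 'B3')]
```

With these scores the weaker A1-B3 and A3-B2 matches are dropped, which is the loss of homology information noted on the slide.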
One case where RBH works
[Figure: genes 1, 2 and 3 of GenomeA and GenomeB, shown at each step of RBH]
GenomeA – gene 1 Glucose transport
GenomeB – gene 2 Glucose transport
GenomeA – gene 3 Fructose transport
GenomeB – gene 3 Galactose transport
One case where RBH fails
[Figure: genes 1, 2 and 3 of GenomeA and GenomeB, shown at each step of RBH]
In-paralogs: duplication since speciation
GenomeA – gene 1 Glucose transport
GenomeA – gene 3 Glucose transport
GenomeB – gene 2 Fructose transport
GenomeB – gene 3 Galactose transport
Software/Methods for Predicting Orthologs from Genome Sequences
RBH (Reciprocal Best Hits)
RSD (Reciprocal Shortest Distance)
INPARANOID
RIO
Orthostrapper
Ortholuge
TribeMCL
OrthoMCL
Li L, Stoeckert CJ Jr, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003 Sep;13(9):2178-89.
Pre-computed OrthoMCL results http://www.orthomcl.org/
Evaluating performance
No “gold standard” set of true orthologs
Latent Class Analysis
Agreement between methods provides confidence
27,562 proteins from 6 eukaryotes assigned to Pfam families
Performance Metrics
actual \ predicted    negative    positive
Negative              TN          FP
Positive              FN          TP
Accuracy: proportion of all predictions that are correct, (TN+TP)/total
Precision: proportion of predicted positives that are correct, TP/(FP+TP)
Sensitivity (TPR, recall): proportion of actual positives correctly predicted, TP/(FN+TP)
Specificity: proportion of actual negatives correctly predicted, TN/(TN+FP)
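As a worked example of these ratios, the sketch below computes them from the four cells of the confusion table. The function name and the counts are invented for illustration. The ratio TP/(FP+TP) is reported here under its common name, precision, and the code assumes no denominator is zero.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute the four ratios from confusion-table counts.

    Assumes every denominator is non-zero (no guard for empty classes).
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tn + tp) / total,   # proportion of all calls that are correct
        "precision": tp / (fp + tp),     # predicted positives that are correct
        "sensitivity": tp / (fn + tp),   # actual positives that were found
        "specificity": tn / (tn + fp),   # actual negatives that were rejected
    }


# Made-up counts: 100 candidate ortholog pairs in total.
metrics = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For these counts, accuracy is 85/100, precision 40/50, sensitivity 40/45 and specificity 45/55, which shows how a method can look good on one measure and weaker on another.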
Chen F, Mackey AJ, Vermunt JK, Roos DS. Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLoS ONE. 2007 Apr 18;2(4):e383.
Method Comparison
Is context useful for assigning homology type?
Prokaryotes vs eukaryotes
Evolutionary origin:
Paralogs that arise as tandem repeats of single genes
Paralogs that arise from duplication of larger regions
Xenologs that arise from acquisition of a similar gene from another lineage
Example: pectate lyases of soft-rot enterobacteria may be SymBets (symmetrical best hits), but genome context suggests they may not be orthologs.