
Benchmarking Orthology in Eukaryotes 12-01-2004 Nijmegen Tim Hulsen.


1 Benchmarking Orthology in Eukaryotes 12-01-2004 Nijmegen Tim Hulsen

2 Summary (1) An introduction to orthology (2) Orthology determination methods (3) Benchmarking: –co-expression –conservation of co-expression –SwissProt name (4) Conclusions

3 An introduction to orthology (from http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html)

4 Orthology determination methods Orthology databases/methods: COG/KOG Inparanoid OrthoMCL Inclusiveness: one-to-one/one-to-many/many-to-many organisms Best bidirectional hit/Phylogenetic trees

5 Benchmarking orthology Quality of orthology is difficult to test; there is no gold standard Orthologs should have highly similar functions Measuring conservation of function: –functional annotation –co-expression –domain structure

6 Benchmarked orthology determination methods
–BBH: Best Bidirectional Hit
–KOG: euKaryotic Orthologous Groups
–INP: INPARANOID
–MCL: OrthoMCL
–Z1H: All pairs with Z >= 100
–COM: Comics Phylogenetic Tree Method
–EQN: Equal SwissProt Names

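7 Data set used ‘Protein World’: all proteins in all available (SPTREMBL) proteomes compared to each other Smith-Waterman with Z-value statistics: 100 randomized shuffles to test significance of SW score
Example: original sequence (O) MFTGQEYHSV; shuffle 1: GQHMSVFTEY; shuffle 2: YMSHQFTVGE; etc.
[Figure: histogram of # seqs against SW score for the randomized (rnd) sequences; when the original (ori) score lies 5*SD above the mean of the randomized scores, Z = 5]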
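The Z-value idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scoring scheme (match/mismatch/linear gap) is a toy stand-in for a real substitution matrix, shuffling only the query sequence is an assumption, and the function names `sw_score` and `z_value` are hypothetical rather than part of the Protein World pipeline.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score (toy scoring, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def z_value(query, target, n_shuffles=100, seed=0):
    """Z = (original score - mean of shuffled scores) / SD of shuffled scores."""
    rng = random.Random(seed)
    original = sw_score(query, target)
    shuffled_scores = []
    for _ in range(n_shuffles):
        residues = list(query)
        rng.shuffle(residues)  # keeps composition and length, destroys order
        shuffled_scores.append(sw_score("".join(residues), target))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    if sd == 0:
        # Degenerate case: all shuffles scored the same.
        return float("inf") if original > mean else 0.0
    return (original - mean) / sd


if __name__ == "__main__":
    print(z_value("MFTGQEYHSV", "MFTGQEYHSV"))
```

Because every shuffle has the same composition and length as the original, the resulting Z-value is not inflated by those two properties, which is the compensation the next slide refers to.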

8 Data set used Z-value compensates for: –bias in amino acid composition –sequence length Proteomes used: –Human: 28,508 proteins –Mouse: 20,877 proteins → 595,161,516 pairs

9 BBH method Easiest method: ‘best bidirectional hit’ Human protein (1) → SW → best hit in mouse (2) Mouse protein (2) → SW → best hit in human (3) If 3 equals 1, the human and mouse protein are considered to be orthologs 12,817 human-mouse orthologous pairs (12,817 human, 12,817 mouse proteins)
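The BBH rule above can be sketched as follows. This is a minimal illustration, assuming the pairwise scores are already available as a dictionary keyed by (human, mouse) identifiers; the function names and the toy identifiers are hypothetical, and ties are broken by whichever pair is seen first.

```python
def best_hits(scores):
    """Return the best-scoring partner for every human and every mouse protein.

    scores: dict mapping (human_id, mouse_id) -> similarity score.
    """
    best_for_human, best_for_mouse = {}, {}
    for (human, mouse), score in scores.items():
        if human not in best_for_human or score > best_for_human[human][1]:
            best_for_human[human] = (mouse, score)
        if mouse not in best_for_mouse or score > best_for_mouse[mouse][1]:
            best_for_mouse[mouse] = (human, score)
    return best_for_human, best_for_mouse


def bidirectional_best_hits(scores):
    """Keep (human, mouse) pairs where each protein is the other's best hit."""
    best_for_human, best_for_mouse = best_hits(scores)
    return sorted(
        (human, mouse)
        for human, (mouse, _) in best_for_human.items()
        if best_for_mouse[mouse][0] == human
    )


if __name__ == "__main__":
    toy_scores = {
        ("h1", "m1"): 90, ("h1", "m2"): 50,
        ("h2", "m1"): 80, ("h2", "m2"): 60,
        ("h3", "m2"): 70,
    }
    print(bidirectional_best_hits(toy_scores))
```

In the toy data, h2's best hit is m1, but m1's best hit is h1, so h2 ends up without an ortholog. This is why the method yields strictly one-to-one pairs.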

10 KOG method KOG: euKaryotic Orthologous Groups Eukaryotic version of COG, Clusters of Orthologous Groups COG method: –All-vs-all seq. comparison (BLAST) –Detect and collapse obvious paralogs (Sp1-Sp1, Sp2-Sp2, Sp1-Sp2): E(Hs-Hs) < E(BBH) → paralogs; E(Mm-Mm) < E(BBH) → paralogs; etc. for other species → determine BBHs

11 KOG method –Detect triangles of best hits –Merge triangles with a common side to form COGs –Case-by-case ‘manual’ analysis, examination of large COGs (might be split up)
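The triangle steps above can be sketched as follows. This is a toy illustration, not the COG/KOG implementation: it assumes a triangle means three genes from three different species that are all pairwise best bidirectional hits, it enumerates gene triples by brute force (fine only for small inputs), and all names and identifiers are hypothetical. The manual curation step is not modelled.

```python
from itertools import combinations


def find_triangles(bbh_edges, species_of):
    """Find gene triples from three different species that are pairwise BBHs."""
    edges = {frozenset(edge) for edge in bbh_edges}
    genes = sorted({gene for edge in edges for gene in edge})
    triangles = []
    for trio in combinations(genes, 3):
        if len({species_of[gene] for gene in trio}) != 3:
            continue  # a triangle needs three distinct species
        if all(frozenset(pair) in edges for pair in combinations(trio, 2)):
            triangles.append(frozenset(trio))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share a side (two genes) into clusters."""
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) >= 2:  # common side
            parent[find(i)] = find(j)

    clusters = {}
    for i, triangle in enumerate(triangles):
        clusters.setdefault(find(i), set()).update(triangle)
    return sorted(sorted(cluster) for cluster in clusters.values())


if __name__ == "__main__":
    species_of = {"hs1": "Hs", "mm1": "Mm", "dm1": "Dm", "ce1": "Ce",
                  "hs2": "Hs", "mm2": "Mm", "dm2": "Dm"}
    edges = [("hs1", "mm1"), ("hs1", "dm1"), ("mm1", "dm1"),
             ("hs1", "ce1"), ("mm1", "ce1"),
             ("hs2", "mm2"), ("hs2", "dm2"), ("mm2", "dm2")]
    print(merge_triangles(find_triangles(edges, species_of)))
```

The two triangles that share the hs1-mm1 side collapse into one four-member group, while the second family stays separate.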

12 KOG method KOG method mainly the same as COG method; special attention for eukaryotic multidomain structure Group orthologies: many-to-many Cognitor: assign a KOG to each protein (mouse not yet in KOG) 810,697 human-mouse orthologous pairs (20,478 human, 15,640 mouse proteins) Tatusov et al., “The COG database: an updated version includes eukaryotes”, BMC Bioinformatics. 2003 Sep 11;4(1):41

13 INP method All-vs-all followed by a number of extra steps to add ‘in-paralogs’ → many-to-many relations possible 54,553 human-mouse orthologous pairs (19,504 human, 17,030 mouse proteins) Remm et al., “Automatic clustering of orthologs and in-paralogs from pairwise species comparisons”, J Mol Biol. 2001 Dec 14; 314(5):1041-52

14 MCL method All-vs-all BLASTP → determine orthologs + ‘recent’ paralogs → use Markov clustering to determine ortholog groups 7,322 human-mouse orthologous pairs (6,332 human, 6,115 mouse proteins) Li et al., “OrthoMCL: identification of ortholog groups for eukaryotic genomes”, Genome Res. 2003 Sep;13(9):2178-89

15 Z1H method All human-mouse pairs with Z >= 100 in Protein World set are considered to be orthologs 290,176 human-mouse orthologous pairs (19,055 human, 16,149 mouse proteins)

16 COM method Proteome (human) and proteomes (all 9 eukaryotic proteomes in Protein World) → selection of homologs (Z>20, RH>0.5*QL; 24,263 groups) → alignments and trees (phylome) → tree scanning → list (Hs-Mm: 85,848 pairs; Hs-Dm: 55,934 pairs; etc.)

17 COM method Example: BMP6 (Bone Morphogenetic Protein 6) → 5 Hs-Mm orthologous relations defined

18 EQN method Consider all Hs-Mm pairs with equal SwissProt names to be orthologous, e.g. ANDR_HUMAN → ANDR_MOUSE Used as benchmark later on 5,214 Hs-Mm orthologous pairs (5,214 human, 5,214 mouse proteins)
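The equal-name rule can be sketched as below. This is a minimal illustration, assuming entry names follow the `NAME_HUMAN` / `NAME_MOUSE` pattern shown in the ANDR example; the function name is hypothetical and the identifier lists are illustrative input, not the talk's data set.

```python
def equal_name_pairs(human_ids, mouse_ids):
    """Pair human and mouse entries whose names match before the species suffix."""

    def by_name(ids, suffix):
        # Map the protein-name part (e.g. "ANDR") to the full entry name.
        return {i[: -len(suffix)]: i for i in ids if i.endswith(suffix)}

    human = by_name(human_ids, "_HUMAN")
    mouse = by_name(mouse_ids, "_MOUSE")
    shared = human.keys() & mouse.keys()
    return sorted((human[name], mouse[name]) for name in shared)


if __name__ == "__main__":
    pairs = equal_name_pairs(
        ["ANDR_HUMAN", "BMP6_HUMAN", "P53_HUMAN"],
        ["ANDR_MOUSE", "BMP6_MOUSE", "ESR1_MOUSE"],
    )
    print(pairs)
```

Since each name maps to at most one entry per species, the resulting set is one-to-one by construction, matching the equal human and mouse protein counts on the slide.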

19 Benchmarking through co-expression Comparison of expression profiles of each orthologous gene pair Using GeneLogic Expressor data set:
organism  samples  fragments  tissue categories  SNOMED tissue categories
human     3269     44792      115                15
mouse     859      36701      25                 12

20 Expression tissue categories
HUMAN                        MOUSE
 1 Blood vessel               1 Blood vessel
 2 Cardiovascular system      2 Cardiovascular system
 3 Digestive organs           3 Digestive organs
 4 Digestive system           4 Digestive system
 5 Endocrine gland            -
 6 Female genital system      5 Female genital system
 7 Hematopoietic system       6 Hematopoietic system
 8 Integumentary system       7 Integumentary system
 9 Male genital system        8 Male genital system
10 Musculoskeletal system     9 Musculoskeletal system
11 Nervous system            10 Nervous system
12 Product of conception      -
13 Respiratory system        11 Respiratory system
14 Topographic region         -
15 Urinary tract             12 Urinary tract

21 Co-expression calculation Calculation of the correlation coefficient:
$$ r = \frac{N \sum xy - (\sum x)(\sum y)}{\sqrt{\left(N \sum x^2 - (\sum x)^2\right)\left(N \sum y^2 - (\sum y)^2\right)}} $$
Measured over the 12 corresponding SNOMED tissue categories
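The correlation coefficient on this slide can be computed directly from the sums in the formula. The sketch below assumes each gene's expression profile is a list with one value per shared tissue category (12 in the talk); the function name and the example profiles are hypothetical.

```python
import math


def correlation(x, y):
    """Pearson correlation coefficient r between two expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    denominator = math.sqrt(
        (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
    )
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return (n * sum_xy - sum_x * sum_y) / denominator


if __name__ == "__main__":
    human_profile = [1.0, 2.0, 3.0]  # toy values, one per tissue category
    mouse_profile = [2.0, 4.0, 6.0]
    print(correlation(human_profile, mouse_profile))
```

A value near 1 means the two orthologs rise and fall together across tissues; a value near -1 means their profiles are opposite.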

22 Co-expression example #1 High correlation: 0.914167

23 Co-expression example #2 Low correlation: -0.935731

24 Benchmarking through co-expression [Figure not preserved in transcript]

25 Benchmarking through conservation of co-expression Human: Gene A, Gene B; Mouse: Gene A’, Gene B’ Co-expression = Cab (-1 <= corr. <= 1) Ca’b’ >= Cab → Increases probability that A and B are involved in the same process (Co-expression calculated over 115 tissues in human, 25 in mouse) All-vs-all: Human: 40,678 chip fragments Mouse: 29,910 chip fragments

26 Benchmarking through conservation of co-expression Gene Ontology (GO) database: hierarchical system of function and location descriptions Orthologs are in same functional category when they are in the same 4th level GO Biological Process class

27 Benchmarking through conservation of co-expression

28 Benchmarking through SwissProt name How many of the predicted orthologous relations have equal SwissProt names (EQN set in other benchmarks) + reliable because checked by hand - assumes only one-to-one relationships are possible
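One simple way to express this benchmark is the fraction of a method's predicted pairs that also occur in the equal-name set. The sketch below assumes both sets are given as (human, mouse) identifier pairs; the function name and the toy identifiers are hypothetical, and the talk may have normalized its counts differently.

```python
def fraction_in_benchmark(predicted_pairs, benchmark_pairs):
    """Fraction of predicted orthologous pairs that appear in the benchmark set."""
    predicted = set(predicted_pairs)
    if not predicted:
        return 0.0
    return len(predicted & set(benchmark_pairs)) / len(predicted)


if __name__ == "__main__":
    predicted = [
        ("A_HUMAN", "A_MOUSE"),
        ("A_HUMAN", "B_MOUSE"),  # one-to-many prediction, not in benchmark
        ("C_HUMAN", "C_MOUSE"),
        ("D_HUMAN", "E_MOUSE"),
    ]
    benchmark = {
        ("A_HUMAN", "A_MOUSE"),
        ("C_HUMAN", "C_MOUSE"),
        ("F_HUMAN", "F_MOUSE"),
    }
    print(fraction_in_benchmark(predicted, benchmark))
```

The toy data shows the drawback listed on the slide: a method that predicts one-to-many relations is penalized for every extra pair, even if those relations are biologically valid.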

29 Benchmarking through SwissProt name (ALL: the result if all possible human-mouse pairs, or a random fraction of them, were considered orthologs)

30 Conclusions Hard to point out the ‘best’ orthology determination method In most cases: less=better, more=worse Method that should be used depends on research question: do you need few reliable orthologies or many less reliable orthologies? Future directions: look at conservation of domain structure as a benchmark

31 Credits Martijn Huynen Peter Groenen Comics Group Gert Vriend Rest of CMBI Organon Bioinf. Group

