Identification of Ortholog Groups by OrthoMCL

Presentation transcript:

Identification of Ortholog Groups by OrthoMCL
Protein sequences from organisms of interest
-> All-against-all BLASTP (similarity cutoff: P-value, % overlap)
-> Between species: reciprocal best similarity pairs -> putative orthologs
-> Within species: reciprocal better similarity pairs -> (recent) paralogs
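The pair-selection step above can be sketched in Python. This is a minimal illustration, not the OrthoMCL implementation: the function name, the `hits` and `species_of` dictionaries, and the reading of "reciprocal better" as "each protein scores higher against the other than against any protein from another species" are all assumptions, and the P-value and overlap cutoffs are left out.

```python
def ortholog_and_paralog_pairs(hits, species_of):
    """Hypothetical helper.
    hits: dict mapping (query, subject) -> similarity score (higher is better).
    species_of: dict mapping protein id -> species label.
    Returns (orthologs, paralogs) as sets of frozenset pairs."""
    # For every query, remember its best-scoring hit in each other species.
    best_cross = {}  # query -> {other_species: (score, subject)}
    for (q, s), score in hits.items():
        if species_of[q] != species_of[s]:
            per_species = best_cross.setdefault(q, {})
            current = per_species.get(species_of[s])
            if current is None or score > current[0]:
                per_species[species_of[s]] = (score, s)

    # Putative orthologs: each protein is the other's best hit across species.
    orthologs = set()
    for q, per_species in best_cross.items():
        for _, (_, s) in per_species.items():
            back = best_cross.get(s, {}).get(species_of[q])
            if back is not None and back[1] == q:
                orthologs.add(frozenset((q, s)))

    def best_cross_score(p):
        # Best score of p against any other species (-inf if it has none).
        return max((sc for sc, _ in best_cross.get(p, {}).values()),
                   default=float("-inf"))

    # (Recent) paralogs: same-species pairs that beat every cross-species hit
    # in both directions.
    paralogs = set()
    for (q, s), score in hits.items():
        if q == s or species_of[q] != species_of[s]:
            continue
        back = hits.get((s, q))
        if back is None:
            continue
        if score > best_cross_score(q) and back > best_cross_score(s):
            paralogs.add(frozenset((q, s)))
    return orthologs, paralogs
```

Called with symmetric scores A1-B1 = 150, A1-A2 = 200 and B1-B2 = 220, the sketch reports A1/B1 as the putative ortholog pair and A1/A2 and B1/B2 as recent paralogs.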

-> Similarity matrix
-> Markov Clustering (cluster tightness: inflation values (I))
-> Ortholog groups with (recent) paralogs

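Similarity Matrix (entries are similarity scores)

      A1    A2    B1    B2
A1    ─     200   150   0
A2    200   ─     0     0
B1    150   0     ─     220
B2    0     0     220   ─

Species A: A1 and A2 are paralogs (score 200)
Species B: B1 and B2 are paralogs (score 220)
Between species: A1 and B1 are orthologs (score 150)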

Markov Clustering (MCL) Algorithm
-> Similarity matrix
-> Markov matrix (transition probability matrix)
-> Matrix expansion (matrix powering)
-> Matrix inflation (entry powering)
-> Repeat expansion and inflation; terminate when no further change
-> Final matrix as clustering
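The loop above can be sketched in plain Python. This is a teaching sketch, not the MCL program itself: the function names, the self-loop weight (set to each protein's strongest similarity), the convergence tolerance and the rule for reading clusters off the final matrix (each column joins the row holding its largest entry) are assumptions made for illustration.

```python
def normalize_columns(m):
    """Scale every column to sum to 1 (transition probabilities)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def expand(m):
    """Matrix expansion: square the matrix (matrix powering)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, r):
    """Matrix inflation: raise each entry to the power r, then renormalize."""
    return normalize_columns([[x ** r for x in row] for row in m])

def mcl(similarity, inflation=1.1, max_iter=200, tol=1e-9):
    """Return clusters as sorted lists of node indices."""
    n = len(similarity)
    m = [list(row) for row in similarity]
    for i in range(n):
        # Assumed self-loop weight: strongest similarity of node i (1.0 if none).
        m[i][i] = max(similarity[i]) or 1.0
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        change = max(abs(nxt[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = nxt
        if change < tol:  # terminate when no further change
            break
    # Assumed read-out: column j belongs to the row with its largest entry.
    clusters = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        clusters.setdefault(attractor, []).append(j)
    return sorted(clusters.values())

# Scores from the four-protein example (order: A1, A2, B1, B2).
example = [
    [0, 200, 150, 0],
    [200, 0, 0, 0],
    [150, 0, 0, 220],
    [0, 0, 220, 0],
]
groups = mcl(example, inflation=1.1)
```

Raising `inflation` makes the entry powering more aggressive, which is how the inflation value controls cluster tightness; trying several values on `example` shows where the group starts to split.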

Application of OrthoMCL to Plasmodium, human and other model organisms
Genomes: Plasmodium falciparum, Human, Arabidopsis, Worm, Fly, Yeast, E. coli
6241 ortholog groups:
- 160 all included
- 551 only Eukaryotes
- 1182 only Metazoa
- 24 only Plasmodium & Arabidopsis
- 114 Plasmodium, not human
- …

An Example of Gamma-tubulin Ortholog Group

Comparing OrthoMCL with INPARANOID (two species)
INPARANOID clusters both orthologs and in-paralogs from two species by pairwise similarity:
- Find two-way best hits from pairwise similarity scores as the main ortholog pair
- Add additional orthologs (in-paralogs) from the same species for each main ortholog, by comparing the similarity score between the main ortholog and each putative in-paralog with the score between the main ortholog pair
- Resolve overlapping groups by merging, deleting or dividing them based on a set of rules
OrthoMCL can cluster orthologs and in-paralogs from multiple species
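The first two INPARANOID steps can be sketched as follows. This is a simplified illustration rather than the INPARANOID program: the function name and input layout are hypothetical, an in-paralog is assumed to be any same-species protein scoring higher against the main ortholog than the main pair scores against each other, and the rule-based resolution of overlapping groups is not attempted.

```python
def sketch_inparanoid(proteins_a, proteins_b, score):
    """Hypothetical helper.
    proteins_a, proteins_b: lists of protein ids from the two species.
    score: dict mapping an (x, y) pair to a similarity score; missing pairs
    count as 0 and either orientation of a pair is accepted."""
    def s(x, y):
        return score.get((x, y), score.get((y, x), 0))

    def best(x, others):
        # Best-scoring partner of x among `others`, or None if nothing scores.
        candidates = [(s(x, o), o) for o in others if s(x, o) > 0]
        return max(candidates)[1] if candidates else None

    groups = []
    for a in proteins_a:
        b = best(a, proteins_b)
        # Step 1: keep only two-way best hits as the main ortholog pair.
        if b is None or best(b, proteins_a) != a:
            continue
        main_score = s(a, b)
        # Step 2: same-species proteins closer to a main ortholog than the
        # main orthologs are to each other.
        in_a = [p for p in proteins_a if p != a and s(a, p) > main_score]
        in_b = [p for p in proteins_b if p != b and s(b, p) > main_score]
        groups.append({"main": (a, b), "inparalogs": sorted(in_a + in_b)})
    return groups
```

Note that the in-paralog test here is one-directional (measured from the main ortholog), whereas the OrthoMCL rule for recent paralogs is bi-directional.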

I. Yeast – Worm dataset (estimation)
Input: Yeast: 6358 proteins; Worm: […] proteins
INPARANOID: 4985 proteins (Yeast: 2283; Worm: […]) in […] groups
OrthoMCL (I = ?): 4428 proteins (Yeast: 2158; Worm: 2270) in ? groups (paralog groups?)
3931 proteins the same from both methods
Coherent grouping: ?

Contained groups: an OrthoMCL group and an INPARANOID group that overlap (∩), with one lying inside the other
Coherent groups = same groups + contained groups
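A small Python sketch of how each shared sequence could be labelled under these definitions. The function names are hypothetical, and it is an assumption that "contained" means the sequence's group from one method is a strict subset of its group from the other; the slides do not spell out the exact rule.

```python
def classify_grouping(groups_x, groups_y):
    """Label every sequence found by both methods as 'same', 'contained'
    or 'different'. groups_x, groups_y: iterables of sets of sequence ids."""
    def index(groups):
        # Map each sequence id to the (frozen) group that holds it.
        return {seq: frozenset(g) for g in groups for seq in g}

    ix, iy = index(groups_x), index(groups_y)
    labels = {}
    for seq in set(ix) & set(iy):
        gx, gy = ix[seq], iy[seq]
        if gx == gy:
            labels[seq] = "same"
        elif gx < gy or gy < gx:  # one group lies strictly inside the other
            labels[seq] = "contained"
        else:
            labels[seq] = "different"
    return labels

def percent_coherent(labels):
    """Coherent = same + contained, as a percentage of shared sequences."""
    coherent = sum(1 for v in labels.values() if v in ("same", "contained"))
    return 100.0 * coherent / len(labels)

# Heat-shock example quoted later in the slides.
ego = [{"Hsp-1", "Hsc70-4", "SSA2"}]
orthomcl = [{"Hsp-1", "Hsc70-1", "Hsc70-4", "SSA1", "SSA2", "SSA3", "SSA4"}]
hsp_labels = classify_grouping(ego, orthomcl)
```

For the heat-shock groups, all three shared proteins are labelled "contained", because the EGO group sits inside the larger OrthoMCL group.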

Table columns: Inflation (I) | # groups | # groups of paralogs | % seqs with same grouping* | % seqs with contained grouping* | % seqs with coherent grouping*
* Percentage of the 3931 sequences identified by both OrthoMCL and INPARANOID
The inflation value (I) regulates cluster tightness (scale shown from tight to loose)
So, choose I = 1.1 as the optimal inflation value

Possible reasons for including different sequences (OrthoMCL vs. INPARANOID)
- BLAST version: OrthoMCL: WU-BLAST; INPARANOID: NCBI-BLAST
- BLAST search: OrthoMCL: all-against-all, SEG filtered, fixed database size; INPARANOID: pairwise
- Similarity cutoff: OrthoMCL: P < 1e-5; INPARANOID: score >= 50 bits, overlap > 50%
- Reciprocal "best" hits: OrthoMCL: P-value, percent identity; INPARANOID: score
- Recent paralogs: OrthoMCL: bi-directional better within-species similarity; INPARANOID: one-way better within-species similarity from orthologs

Input: Yeast: 6358 proteins; Worm: […] proteins
INPARANOID: 4985 proteins (Yeast: 2283; Worm: […]) in […] groups
OrthoMCL (I = […]): 3949 proteins (Yeast: 1927; Worm: 2022) in […] groups
3765 proteins the same from both methods
86.3% same groups
98.1% coherent groups
Default parameters:
- Similarity cutoff: P-value […] 50%
- Cluster tightness: inflation value I = 1.1

II. Worm – Fly dataset (test)
Input: Worm: […] proteins; Fly: […] proteins
INPARANOID: […] proteins (Worm: 5399; Fly: […]) in […] groups
OrthoMCL (I = […]): 9623 proteins (Worm: 4997; Fly: 4626) in […] groups
8856 proteins the same from both methods
86% same groups
98% coherent groups
In conclusion: OrthoMCL and INPARANOID have similar clustering behavior when comparing two species

Comparison of OrthoMCL with EGO (multiple species)
III. Yeast – Worm – Fly dataset
EGO: TC/NP -> protein sequences -> BLASTP -> […] seqs -> remove redundancy -> 4776 proteins
EGO: 4776 unique proteins formed 3125 unique groups
OrthoMCL: […] proteins formed 4033 groups

4392 proteins the same from both
- 44.2% same groups
- 2.3% OrthoMCL contained in EGO
- 62% EGO contained in OrthoMCL
- 93.8% coherent groups

An Example: EGO Groups contained by OrthoMCL Groups
Proteins by species: Fly: Hsc70-1, Hsc70-4; Yeast: SSA1, SSA2, SSA3, SSA4; Worm: Hsp-1
EGO: Hsp-1, Hsc70-4, SSA2
OrthoMCL: Hsp-1, Hsc70-1, Hsc70-4, SSA1, SSA2, SSA3, SSA4

Back to Apicomplexa …
5333 proteins:
- 1846 orthologous to the other 6 organisms
- 1693 orthologous to Arabidopsis
- 483 orthologous to E. coli
- 1421 orthologous to yeast
- 1771 orthologous to fly, worm or human
- 1824 non-orthologous to human

Summary
- OrthoMCL automatically delineates the many-to-many orthologous relationships across multiple eukaryotic genomes
- When applied to pairwise comparison of two species, the performance of OrthoMCL is comparable to that of INPARANOID, which was designed for comparing two species
- When applied to multiple species and compared with the EGO database, OrthoMCL tends to identify more orthologous genes
- The underlying object-based relational storage model permits integration with organismal data, and queries based on a user-defined species distribution provide a snapshot of shared and diversified biological processes across species

Related Posters and References
Posters:
- 114A. Web-Based Biological Discovery using an Integrated Database.
- 146A. The Genomics Unified Schema (GUS).
- 170A. TESS-II: Describing and Finding Gene Regulatory Sequences with Grammars.
References:
- Remm et al. Automatic Clustering of Orthologs and In-paralogs from Pairwise Species Comparisons. J. Mol. Biol. (2001) 314
- Lee et al. Cross-Referencing Eukaryotic Genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res. (2002) 12
- Enright et al. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. (2002) 30