Ortholog prediction

Ortholog prediction: automated methods

Definitions
Homologs: genes that share a common evolutionary ancestor.
Orthologs: genes in different species that descend from the same ancestral gene. When two species arise from a common ancestor and both retain a gene that the ancestor had, those genes in the new species are called orthologs. They usually have similar functions.
Paralogs: genes within the same species that share a common ancestor, usually as the result of gene duplication.
In-paralogs: recently arisen paralogs (duplicated after the species split), which tend to have similar functions.
Out-paralogs: paralogs that arose long ago, whose functions differ.

Why?
To find proteins with similar function in different organisms.
Helps with genome annotation.
Helps identify gene and protein families.
Helps detect horizontal gene transfer.

How?
The proteins of the two genomes must be compared both with each other and with themselves.
For genomes A and B: A vs. A, B vs. B, A vs. B, B vs. A.
The most similar pairs between the genomes are identified.
The pairs are then grouped while avoiding overlaps between groups.

Options and limitations: comparing two sequences
There are many sequences: two genomes hold about 20,000.
Blast: fast; searches for local similarity; the score depends on the order in which the sequences are given (a vs. b can score differently from b vs. a).
Clustalw: slow; evolutionary distances can be computed from the alignment and used to draw trees.
rbh (reciprocal best hit): the search must be run in both directions, and a pair counts only if it gives the best score for both a vs. b and b vs. a.
rsd (reciprocal smallest distance).
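
The reciprocal best hit idea can be sketched in a few lines. This is a minimal illustration, not the method of any particular tool: the `(query, subject, score)` tuples, the function names and the example scores are all assumptions made up for the sketch, standing in for parsed similarity-search output.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, float]  # (query, subject, score); hypothetical input format


def best_hits(hits: List[Hit]) -> Dict[str, str]:
    """Return the highest-scoring subject for every query.

    On ties the first hit encountered is kept (an arbitrary choice for this sketch).
    """
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: List[Hit], b_vs_a: List[Hit]) -> List[Tuple[str, str]]:
    """Keep pairs (a, b) where b is the best hit of a AND a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Made-up scores: a2 and a3 both point to b2, but b2 points back only to a3.
a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 90.0), ("a2", "b2", 150.0), ("a3", "b2", 160.0)]
b_vs_a = [("b1", "a1", 195.0), ("b2", "a3", 158.0), ("b2", "a2", 149.0)]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Running the search in both directions is what removes a2 here: its best hit is b2, but b2 prefers a3.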

Options and limitations: finding groups
Finding orthologous groups
A group should include all genes that carry the same function: orthologs and in-paralogs.
A group must not contain false positives.
Groups must not overlap.
Figure: Clustering of additional orthologs (in-paralogs). Each circle represents a sequence from species A (black) or species B (grey). Main orthologs (pairs with mutually best hit) are denoted A1 and B1. Their similarity score is shown as S. The score should be thought of as reverse distance between A1 and B1, higher score corresponding to shorter distance. The main assumption for clustering of in-paralogs is that the main ortholog is more similar to in-paralogs from the same species than to any sequence from other species. On this graph it means that all in-paralogs with score S or better to the main ortholog are inside the circle with diameter S that is drawn around the main ortholog. Sequences outside the circle are classified as out-paralogs. In-paralogs from both species A and B are clustered independently.
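
The clustering rule in the figure caption (a same-species sequence joins the group if it scores S or better against the main ortholog) can be sketched as follows. This is a deliberately reduced illustration of that one rule: the dictionary layout, the function names and the scores are assumptions for the example, and a real tool would also have to resolve overlapping groups.

```python
from typing import Dict, List, Tuple

Scores = Dict[Tuple[str, str], float]  # (main ortholog, other sequence) -> within-species score


def in_paralogs(main: str, cutoff: float, within: Scores) -> List[str]:
    """Sequences from the same species that score `cutoff` or better against `main`."""
    found = []
    for (query, subject), score in within.items():
        if query == main and subject != main and score >= cutoff:
            found.append(subject)
    return sorted(found)


def ortholog_group(a1: str, b1: str, s: float, within_a: Scores, within_b: Scores) -> Dict[str, List[str]]:
    """Seed a group with the main ortholog pair (a1, b1), whose mutual score is s,
    then add in-paralogs for each species independently."""
    return {
        "A": [a1] + in_paralogs(a1, s, within_a),
        "B": [b1] + in_paralogs(b1, s, within_b),
    }


# Made-up scores: S = 100 between the main orthologs A1 and B1.
within_a = {("A1", "A2"): 140.0, ("A1", "A3"): 60.0}
within_b = {("B1", "B2"): 100.0, ("B1", "B3"): 99.0}
group = ortholog_group("A1", "B1", 100.0, within_a, within_b)
print(group)
```

In this example A3 and B3 fall outside the cutoff and would be treated as out-paralogs.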

Options and limitations: finding groups
Using an evolutionary tree
Computing a tree requires distances between sequences, which in turn requires a global alignment.
Clustalw is slow.
Identifying orthologs from a tree is hard to automate.
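
To show what "distance from an alignment" means in the simplest case, here is a sketch of the uncorrected p-distance: the fraction of aligned, ungapped positions at which two sequences differ. It is only an assumption that this measure suits the purpose; tree-building programs generally apply corrected distances. The function name and the toy sequences are made up.

```python
def p_distance(seq1: str, seq2: str, gap: str = "-") -> float:
    """Fraction of differing positions between two aligned sequences.

    Columns where either sequence has a gap are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatches = 0
    for x, y in zip(seq1, seq2):
        if x == gap or y == gap:
            continue  # ignore gapped columns
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / compared


# Toy alignment: 5 ungapped columns, 1 mismatch (K vs. R).
d = p_distance("MKT-LV", "MRTALV")
print(d)
```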

Ortholog databases
OrthoMCL database: http://www.cbil.upenn.edu/gene-family/
InParanoid database: www.cgb.ki.se/inparanoid/
COG database: www.ncbi.nlm.nih.gov/COG
TOGA database: www.tigr.org/tdb/toga/toga.shtml

OrthoMCL
Identifies orthologous groups automatically.
Uses WU-BLAST for sequence comparison.
Uses the Markov Cluster algorithm (MCL) for clustering.
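
The core loop of Markov clustering (expansion by matrix squaring, then inflation by element-wise powering and column renormalization) can be sketched in plain Python. This is a toy illustration and not the MCL program itself: the function names, the fixed iteration count, the self-loop handling and the way clusters are read off the final matrix are simplifying assumptions. The default inflation of 1.5 follows the value quoted in the OrthoMCL comparison table.

```python
from typing import List

Matrix = List[List[float]]


def normalize_columns(m: Matrix) -> Matrix:
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = m[i][j] / total
    return out


def matmul(a: Matrix, b: Matrix) -> Matrix:
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mcl(similarity: Matrix, inflation: float = 1.5, iterations: int = 50, eps: float = 1e-6) -> List[List[int]]:
    """Toy Markov clustering on a symmetric similarity matrix."""
    n = len(similarity)
    m = [[float(similarity[i][j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        m[i][i] = max(m[i][i], 1.0)  # self-loops keep every column non-empty
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)  # expansion: flow spreads along longer paths
        m = normalize_columns([[x ** inflation for x in row] for row in m])  # inflation
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > eps)
        if members:
            clusters.add(members)  # rows that still carry flow define the clusters
    return sorted(sorted(c) for c in clusters)


# Five proteins: 0, 1, 2 are mutually similar, 3 and 4 are similar to each other.
similarity = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
clusters = mcl(similarity)
print(clusters)
```

The example graph has two disconnected parts, so the outcome is easy to check by eye; on real similarity graphs the inflation value controls how finely connected regions are split.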

Flow chart

Similarity matrix
The BLAST scores in the similarity matrix are normalized so that the very high scores between in-paralogs do not dominate the MCL algorithm relative to the scores between orthologs.
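
One way to picture such a normalization is to divide each edge weight by the mean weight of all edges in the same category, where a category is a species pair (within one species, or between two). The slide does not give the formula, so this scheme is an assumption chosen for illustration, as are the function name, the tuple layout and the weights.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str, str, float]  # (protein1, species1, protein2, species2, weight)


def normalize_by_species_pair(edges: List[Edge]) -> Dict[Tuple[str, str], float]:
    """Divide every weight by the mean weight of edges joining the same species pair."""
    groups = defaultdict(list)
    for _, s1, _, s2, weight in edges:
        groups[frozenset((s1, s2))].append(weight)
    means = {key: sum(values) / len(values) for key, values in groups.items()}
    return {
        (p1, p2): weight / means[frozenset((s1, s2))]
        for p1, s1, p2, s2, weight in edges
    }


# Made-up weights: within-fly edges are much stronger than fly-worm edges.
edges = [
    ("f1", "fly", "f2", "fly", 300.0),
    ("f3", "fly", "f4", "fly", 100.0),
    ("f1", "fly", "w1", "worm", 100.0),
    ("f3", "fly", "w2", "worm", 60.0),
]
normalized = normalize_by_species_pair(edges)
print(normalized)
```

After this step the within-species and between-species weights sit on a comparable scale, each category having mean 1.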

Worm & Fly

Table 1. Comparison of Ortholog Groups Identified by OrthoMCL vs. INPARANOID

                     Total    OrthoMCL [a]   INPARANOID     Grouped by both [b]   Identical groups     Coherent groups
# Protein sequences  33,062   10,849 (33%)   11,357 (34%)   10,597 (98/93%)       8,629 (81%) [c]      10,229 (97%) [c]
Fly data set         13,288   5,133 (39%)    5,550 (42%)    5,006 (98/90%)        4,058 (81%)          4,820 (96%)
Worm data set        19,774   5,716 (29%)    5,807 (29%)    5,591 (98/96%)        4,571 (82%)          5,409 (97%)
# Groups             -        4,061          4,135          -                     3,735 (92/90%) [d]   3,888 / 3,912 [e] (96/95%) [d]

[a] Using inflation index I = 1.5 (see text).
[b] Percentages indicate percent of sequences grouped by either OrthoMCL (left) or INPARANOID (right).
[c] Percent of sequences grouped by both OrthoMCL and INPARANOID.
[d] Percent of OrthoMCL groups (left); percent of INPARANOID groups (right).
[e] OrthoMCL groups entirely contained within INPARANOID groups (left); INPARANOID groups entirely contained within OrthoMCL groups (right).

Three-species Data Set

References
Li Li, Christian J. Stoeckert Jr., and David S. Roos. "OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes." Genome Res. 2003.
Maido Remm, Christian E. V. Storm, and Erik L. L. Sonnhammer. "Automatic Clustering of Orthologs and In-paralogs from Pairwise Species Comparisons." JMB 2001.
Wall D.P. et al. "Detecting putative orthologs." Bioinformatics Appl. Notes 2003, v19, pp. 1710-1711.

Links to ortholog databases
OrthoMCL database: http://www.cbil.upenn.edu/gene-family/
InParanoid database: www.cgb.ki.se/inparanoid/
COG database: www.ncbi.nlm.nih.gov/COG
TOGA database: www.tigr.org/tdb/toga/toga.shtml