Comparative genomics and proteomics in Ensembl Sep 2006.


2 of 56 Overview: Rationale; Species available; Comparative proteomics (orthologue and paralogue prediction, protein clustering into families); Comparative genomics (genome-wide DNA alignments, synteny block characterisation); Future and perspectives

3 of 56 Compara: The Compara database is a single multispecies database covering gene orthology/paralogy prediction, protein clustering, whole genome alignments and synteny regions.

4 of 56 The era of sequencing genomes. [Figure: species tree drawn against a time axis in million years; axis values recoverable from the figure: 23 and 170.] Legend: Red: whole genome assembly available; Green: whole genome assembly due within the next year in Ensembl; * 19 species currently in Ensembl; + 10 in Pre! Ensembl. Clade labels: Chordata, Vertebrata, Amniota, Tetrapoda, Teleostei, Urochordata, Arthropoda, Nematoda, Fungi, Amphibia, Aves, Metatheria, Mammalia, Eutheria. Species, with markers in the order they appear in the figure: S. cerevisiae (baker’s yeast) * C. elegans (nematode) * A. mellifera (honey bee) * D. rerio (zebrafish) * D. melanogaster (fruitfly) * A. gambiae (African malaria mosquito) * A. aegypti (yellow fever mosquito) + C. intestinalis (transparent sea squirt) * C. savignyi (sea squirt) + T. rubripes (torafugu) * T. nigroviridis (spotted green pufferfish) * O. latipes (Japanese medaka) G. aculeatus (Stickleback) + O. aries (sheep) G. gallus (chicken) * X. laevis (African clawed frog) M. musculus (house mouse) * R. norvegicus (Norway rat) * M. mulatta (rhesus macaque) * P. troglodytes (chimpanzee) * C. familiaris (dog) * F. catus (cat) E. caballus (horse) S. scrofa (pig) B. taurus (cow) * M. domestica (opossum) * L. africana (elephant) H. sapiens (human) * + X. tropicalis (western clawed frog) *

5 of 56 Comparing different species. From the Ensembl perspective, Compara joins species through: –orthologous/paralogous gene links –chromosome synteny links –protein family links. From a broader perspective: –Where are syntenic regions located? –How many genes are conserved? –Where are orthologous/paralogous genes? –Is gene order conserved? –Where are potential regulatory regions? –What is missing in one species, present only in another?

6 of 56 Orthologue and Paralogue Prediction Evolutionary studies Identify potential species-specific proteins/genes Identify orthologues of (human) genes in model organisms

7 of 56 Gene Evolution Divergence Speciation / Duplication Change within allelic population Point Mutations / Selection / Drift Exon/domain shuffling Transposition / Translocation Retroposition (reverse transcription) Horizontal gene transfer? Orthologues and Paralogues Reconstruct the Molecular Evolutionary history from the evidence visible within the known extant genes

8 of 56 Homologue Relationships. Orthologues: any pairwise gene relation where the ancestor node is a speciation event. Paralogues: any pairwise gene relation where the ancestor node is a duplication event.
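The two definitions can be illustrated with a minimal Python sketch. It assumes a toy rooted gene tree whose internal nodes are already tagged as speciation or duplication events; the `Node` class, the gene names and the function names are hypothetical and not part of Ensembl. The relation between two genes is read off the event at their last common ancestor.

```python
class Node:
    """Toy gene-tree node: internal nodes carry an event, leaves carry a name."""
    def __init__(self, event=None, children=(), name=None):
        self.event = event              # "speciation" or "duplication" for internal nodes
        self.children = list(children)  # empty for leaves
        self.name = name                # gene name for leaves


def leaf_names(node):
    """Collect the names of all leaves below a node."""
    if not node.children:
        return {node.name}
    names = set()
    for child in node.children:
        names |= leaf_names(child)
    return names


def last_common_ancestor(node, a, b):
    """Return the deepest node whose leaves include both genes a and b."""
    for child in node.children:
        if {a, b} <= leaf_names(child):
            return last_common_ancestor(child, a, b)
    return node


def relation(root, a, b):
    """Classify two distinct genes by the event at their last common ancestor."""
    event = last_common_ancestor(root, a, b).event
    return "orthologue" if event == "speciation" else "paralogue"


# Hypothetical example: an ancestral duplication, then a human/mouse split in each copy.
copy1 = Node("speciation", [Node(name="H1"), Node(name="M1")])
copy2 = Node("speciation", [Node(name="H2"), Node(name="M2")])
tree = Node("duplication", [copy1, copy2])

print(relation(tree, "H1", "M1"))  # last common ancestor is a speciation node
print(relation(tree, "H1", "M2"))  # last common ancestor is the duplication node
```

In this toy tree H1 and M1 come out as orthologues, while H1 and M2 (and H1 and H2) come out as paralogues, because their last common ancestor is the duplication at the root.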

9 of 56 Homologue Relationships. [Figure: gene tree drawn against time with duplication and speciation nodes; labels A, A1, A2, M1, M2, M2’, H1, H2; relations marked as inparalogues, outparalogues and orthologues.] Orthologous genes have originated from a single ancestor and often have equivalent functions. Paralogous genes are related via duplication. Inparalogues (ortholog_one2one, ortholog_one2many, etc.): duplication follows speciation. Outparalogues (between_species_paralog): duplication precedes speciation.

10 of 56 Orthology Prediction Algorithm. Find orthologous genes by comparing the protein sets of two species (only the longest peptide per gene is considered). Run blastp+sw all versus all, one species pair at a time. Build a graph of gene relations based on BRH (best reciprocal hit) and BSR (BLAST score ratio). Extract connected components (single linkage clusters), each cluster representing a gene family. [Figure: mouse and human gene graphs.]
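The graph-building steps can be sketched roughly as follows in Python. This is a simplified illustration, not the Ensembl pipeline: the score dictionaries stand in for real blastp output, all gene names and scores are made up, and only BRH edges are built (BSR edges would be added to the same edge set using a threshold that the slides do not give).

```python
from collections import defaultdict


def best_hits(scores):
    """scores maps (query, target) -> score; return the best target for each query."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}


def connected_components(edges):
    """Single linkage clustering: every connected component becomes one cluster."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Hypothetical scores for two human and two mouse proteins.
human_vs_mouse = {("H1", "M1"): 200, ("H1", "M2"): 90, ("H2", "M2"): 150, ("H2", "M1"): 80}
mouse_vs_human = {("M1", "H1"): 200, ("M2", "H1"): 90, ("M2", "H2"): 150, ("M1", "H2"): 80}

brh = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
clusters = connected_components(brh)
```

With these made-up scores the BRH pairs are (H1, M1) and (H2, M2), so the graph falls into two clusters of two genes each.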

11 of 56 GeneTree prediction: MUSCLE/PHYML. Multiple alignment of each cluster (built from BRH and BSR) with MUSCLE. Unrooted gene tree built using PHYML (Guindon & Gascuel, 2003). Tree reconciliation (gene tree with species tree) using RAP (Dufayard et al. 2005) to call duplication events on internal nodes and to root the tree. Infer pairwise relations of orthology and paralogy types from each tree.
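To make the reconciliation step more concrete, here is a minimal Python sketch of the textbook last-common-ancestor mapping rule. It is not RAP itself, and unlike the pipeline above it assumes the gene tree is already rooted. The species tree, gene names and function names are hypothetical. Each gene-tree node is mapped to the lowest species-tree node that covers all species below it, and a node is called a duplication when it maps to the same species-tree node as one of its children.

```python
# Hypothetical species tree, stored as child -> parent.
SPECIES_PARENT = {
    "human": "human_mouse_ancestor",
    "mouse": "human_mouse_ancestor",
    "human_mouse_ancestor": "root",
    "chicken": "root",
}


def ancestors(species):
    """Path from a species-tree node up to the root, the node itself included."""
    path = [species]
    while species in SPECIES_PARENT:
        species = SPECIES_PARENT[species]
        path.append(species)
    return path


def species_lca(a, b):
    """Lowest species-tree node that is an ancestor of (or equal to) both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("species are not in the same tree")


def reconcile(tree, gene_species, events):
    """tree is a gene name (leaf) or a (left, right) tuple.

    Returns the species-tree node the subtree maps to and records the
    inferred event for every internal node in the events dict.
    """
    if isinstance(tree, str):
        return gene_species[tree]
    left, right = tree
    mapped_left = reconcile(left, gene_species, events)
    mapped_right = reconcile(right, gene_species, events)
    mapped = species_lca(mapped_left, mapped_right)
    # Duplication if the node maps to the same place as one of its children.
    is_duplication = mapped in (mapped_left, mapped_right)
    events[tree] = "duplication" if is_duplication else "speciation"
    return mapped


gene_species = {"H1": "human", "H2": "human", "M1": "mouse", "M2": "mouse", "C1": "chicken"}
gene_tree = ((("H1", "M1"), ("H2", "M2")), "C1")
events = {}
root_species = reconcile(gene_tree, gene_species, events)
```

In this example the node joining the (H1, M1) and (H2, M2) subtrees is called a duplication, because both children already map to the human/mouse ancestor; the other internal nodes are speciations.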

12 of 56 Molecular Phylogenetics Protein sequences in different species, both: Provide information about the history of evolution Reconstruct evolution We are after an alignment that equally reflects all species: Modeling the branching processes by comparing gene and species trees (tree reconciliation)

13 of 56 Phylogenies Duplication node Speciation node or leaf Revealing the evolutionary history that has led to the organisms at the current stage. - Leaves are real genomes - Internal nodes are ancestors

14 of 56 Orthologue and Paralogue types: ortholog_one2one, ortholog_one2many, ortholog_many2many, apparent_ortholog_one2one, within_species_paralog, between_species_paralog

15 of 56 …in Ensembl…

16 of 56 Orthologue and Paralogue types

17 of 56 GeneView

18 of 56 GeneView

19 of 56 Links to ATV and JalView GeneTree MUSCLE protein alignment GeneTreeView

20 of 56 Duplication node (red) Speciation node (blue) GeneTreeView

21 of 56 ATV

22 of 56 Protein clustering into families Cluster proteins from different organisms that may share the same function Obtain some kind of description for ‘novel’ genes/proteins Locate family members over the whole genome Identify possible orthologues and paralogues in other species

23 of 56 Protein Dataset. Nearly a million proteins clustered: –All Ensembl proteins from all species in Ensembl (513,256 predicted proteins) –All metazoan (animal) proteins in UniProt (55,892 UniProt/Swiss-Prot and 469,725 UniProt/TrEMBL). Blastp all versus all, then clustering with MCL.

24 of 56 Clustering Strategy BLASTP all-versus-all comparison Markov clustering For each cluster: –Calculation of multiple sequence alignments with ClustalW –Assignment of a consensus description

25 of 56 Markov Clustering (MCL). MCL (Markov CLustering algorithm) is based on flow simulation in graphs. It keeps only very well interconnected nodes (proteins) in the same cluster, which allows rapid and accurate detection of protein families on a large scale. An automatic description and a ClustalW multiple alignment are applied to each cluster.
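The flow-simulation idea can be sketched in a few lines of plain Python. This is a toy version of the expansion/inflation loop on a dense matrix, not the MCL program used by Ensembl; the inflation value, iteration count, cut-off and example graph are all assumptions chosen for illustration.

```python
def normalize(matrix):
    """Rescale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    out = [[0.0] * n for _ in range(n)]
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = matrix[i][j] / total if total else 0.0
    return out


def expand(matrix):
    """Expansion: square the matrix, i.e. let flow take one more random-walk step."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(matrix, power):
    """Inflation: raise entries to a power and renormalize, boosting strong flow."""
    return normalize([[value ** power for value in row] for row in matrix])


def mcl(adjacency, inflation=2.0, iterations=50):
    """Return clusters as a set of frozensets of node indices."""
    n = len(adjacency)
    # Add self loops, then turn edge weights into transition probabilities.
    matrix = normalize([[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                        for i in range(n)])
    for _ in range(iterations):
        matrix = inflate(expand(matrix), inflation)
    clusters = set()
    for row in matrix:
        members = frozenset(j for j, value in enumerate(row) if value > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Hypothetical similarity graph: two separate groups of three proteins each.
adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
families = mcl(adjacency)
```

In a real run the edge weights would come from the BLASTP all-versus-all comparison, and the inflation parameter controls how finely the graph is cut into families.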

26 of 56 Link to FamilyView ProtView

27 of 56 Ensembl family members within human Ensembl family members in other species JalView multiple alignments FamilyView

28 of 56 For each cluster. We store: –Description and score –Multiple alignment. Future extensions: –Improving descriptions –Multiple alignment assessment –Build a phylogeny on each cluster, using the multiple alignment and dS values (mainly inside mammals) –Extend paralogue prediction

29 of 56 Aligning complete genomes

30 of 56 Whole Genome Alignments. Understand what evolution has done to the species compared since speciation: –What is missing in one species, present only in another? –Differences between closely related species may help in understanding speciation. Define syntenic regions: long regions of DNA sequence where order and orientation are highly conserved. Conserved non-coding regions: –Guides to putative regulatory regions

31 of 56 Evolution at the DNA level …ACTGACATGTACCA… …AC----CATGCACCA… Mutation Sequence edits Rearrangements Deletion Inversion Translocation Duplication

32 of 56 Basic Idea Functional sequences evolve more slowly than non-functional sequences Comparing genomic sequences from species at different evolutionary distances allows us to identify: –Coding genes –Non-coding genes –Non-coding regulatory sequences
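As a rough illustration of this idea, the Python sketch below scans a pairwise alignment in fixed windows and reports the windows whose percent identity exceeds a threshold. The sequences, window size and function names are made up for the example; the 70% threshold is borrowed from the stringent rescoring mentioned elsewhere in this talk, and real conservation scoring is considerably more involved.

```python
def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def conserved_windows(a, b, window=10, threshold=70.0):
    """Return (start, identity) for non-overlapping windows above the threshold.

    a and b are two rows of the same alignment and must have equal length.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    hits = []
    for start in range(0, len(a) - window + 1, window):
        identity = percent_identity(a[start:start + window], b[start:start + window])
        if identity > threshold:
            hits.append((start, identity))
    return hits


# Hypothetical alignment: the first 10 columns are identical, the next 10 diverge.
seq_a = "ACGTACGTAC" + "ACGTACGTAC"
seq_b = "ACGTACGTAC" + "TTTTTTTTTT"
windows = conserved_windows(seq_a, seq_b)
```

Only the first window is reported here; the second has 2 matching columns out of 10 and falls well below the threshold.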

33 of 56 Aligning large genomic sequences. Independent from protein/gene predictions. Should find all highly similar regions between two sequences. Should allow for segments without similarity, rearrangements etc. Issues: –Heavy process –Scalability, as more and more genomes are sequenced –Time constraint –Computes run only by few dedicated groups –As the «true» alignment is not known, it is difficult to measure alignment accuracy and to choose the right method

34 of 56 Using a local aligner. Local alignment: –Finds all highly similar regions over 2 sequences, the orthologous as well as all the paralogous sequences –Aligned regions are separated by segments without alignment –Can handle rearranged sequences –Needs post-filtering to limit excessive overlapping alignments

35 of 56 Local v Global Alignment. [Figure: local and global alignments of the example sequences AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA and AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTTAATC.] Advantages: compares large genomic regions (requires syntenic maps); can detect rearrangements like translocations, inversions and duplications (!); detects insertions and deletions. Disadvantages: fails to identify insertions or deletions; fails to detect rearrangements (inversions).
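The difference between the two alignment modes can be seen in a small dynamic-programming sketch in Python. It uses the standard Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a linear gap penalty; the scoring values and the example sequences are arbitrary assumptions, and genome-scale aligners such as those named in this talk rely on seeding heuristics rather than a full matrix.

```python
def align_score(a, b, local, match=1, mismatch=-1, gap=-1):
    """Best alignment score of a and b; local=True gives Smith-Waterman,
    local=False gives Needleman-Wunsch."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Hypothetical sequences sharing only the core ACGT.
local_score = align_score("TTACGTTT", "GGACGTGG", local=True)
global_score = align_score("TTACGTTT", "GGACGTGG", local=False)
```

The local score (4) reflects only the shared ACGT core, while the global score (0) is dragged down by the unrelated flanks that a global alignment is forced to include.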

36 of 56 Glocal Alignment Problem. Find the least-cost transformation of one sequence into another using new operations: sequence edits (indels, mutations), inversions, translocations, duplications, or a combination of these. Example sequences: GTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGAG and AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACT. Glocal aligner (Brudno et al., 2003)

37 of 56 BLASTZ-net, tBLAT and MLAGAN. BLASTZ-net (comparison at the nucleotide level) is used for species that are evolutionarily close, e.g. human - mouse. Translated BLAT (comparison at the amino acid level) is used for evolutionarily more distant species, e.g. human - zebrafish. MLAGAN global alignment is used for multispecies alignments.

38 of 56 all versus all approach using BLASTZ (collaboration with UCSC) Can handle large sequences Used 2-weighted spaced seeding strategy Dynamic masking Makes distinction between repeat and non-repeat sequences (soft masking) Try aligning inside repeats One iterative step with lower threshold to expand alignments

39 of 56 Blastz strategy. 10Mb Human fragments (3000) against 30Mb Mouse fragments (100). Lineage-specific repeats removed. 48 hours on 1024 CPUs. Generates 9Gb of output; when filtered for best hit on Human, reduced to 2.5Gb.

40 of 56 Blastz human genome coverage. 40% of the human genome is covered by an alignment of mouse sequences. When the alignments are rescored with a “tight”, very stringent matrix that looks for high conservation (>70% identity), the coverage goes down to 6%.
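A coverage figure of this kind is the fraction of reference bases touched by at least one alignment. A minimal Python sketch, assuming alignments on a single sequence are given as half-open (start, end) intervals, is shown below; the interval values and the function name are invented for illustration.

```python
def covered_fraction(intervals, sequence_length):
    """Fraction of a sequence covered by the union of (start, end) intervals.

    Intervals are half-open and may overlap; overlapping bases count once.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged interval and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching interval: extend the current one.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / sequence_length


# Hypothetical alignments on a 100 bp sequence: two overlap, one stands alone.
fraction = covered_fraction([(0, 30), (20, 40), (70, 80)], 100)
```

Here the first two intervals merge into 40 covered bases and the third adds 10, giving a coverage of 0.5.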

41 of 56 DNA/DNA matches web display ContigView human EPO Conserved sequences

42 of 56 DotterView Mouse sequence Human sequence

43 of 56 Multiple alignments Currently 3 sets: –MLAGAN-primates: –MLAGAN-amniote vertebrates: –MLAGAN-eutherian mammals:

44 of 56 Strategy. Use all coding exons. Get sets of best reciprocal hits. Create orthology maps. Build multiple global alignments.

45 of 56 MultiContigView

46 of 56 Multiple alignments ContigView human EPO

47 of 56 AlignSliceView. Alignment on basepair level: Human, Dog, Rat, Mouse. Export alignments.

48 of 56 MultiContigView vs. AlignSliceView

49 of 56 AlignView

50 of 56 GeneSeqalignView

51 of 56 GeneSeqalignView

52 of 56 Syntenic Regions. Genome alignments are refined into larger syntenic regions. Alignments are clustered together when the relative distance between them is less than 100 kb and order and orientation are consistent. Any clusters shorter than 100 kb are discarded.
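The clustering rule can be sketched in Python as a greedy pass over alignments sorted along the reference chromosome. This is one simplified reading of the rule, not the Ensembl implementation: "relative distance" is taken to mean the gap on both genomes, each alignment is only compared with the most recently opened block, and block length is measured on the reference. The `Aln` record and all coordinates are hypothetical.

```python
from collections import namedtuple

# One pairwise alignment: position on the reference, then chromosome,
# position and strand (+1 or -1) on the other genome.
Aln = namedtuple("Aln", "ref_start ref_end chrom start end strand")

MAX_GAP = 100_000     # alignments closer than 100 kb are chained together
MIN_LENGTH = 100_000  # blocks shorter than 100 kb are discarded


def can_extend(block, aln):
    """True if aln continues the block with consistent order and orientation."""
    last = block[-1]
    if aln.chrom != last.chrom or aln.strand != last.strand:
        return False
    ref_gap = aln.ref_start - last.ref_end
    if aln.strand == 1:
        other_gap = aln.start - last.end    # forward strand: coordinates increase
    else:
        other_gap = last.start - aln.end    # reverse strand: coordinates decrease
    return ref_gap < MAX_GAP and 0 <= other_gap < MAX_GAP


def build_blocks(alignments):
    """Chain alignments into blocks and keep those spanning at least MIN_LENGTH."""
    blocks = []
    for aln in sorted(alignments, key=lambda a: a.ref_start):
        if blocks and can_extend(blocks[-1], aln):
            blocks[-1].append(aln)
        else:
            blocks.append([aln])
    return [b for b in blocks if b[-1].ref_end - b[0].ref_start >= MIN_LENGTH]


# Hypothetical alignments: two chainable hits on chr4, one isolated hit on chr7.
alignments = [
    Aln(0, 60_000, "chr4", 1_000_000, 1_060_000, 1),
    Aln(90_000, 150_000, "chr4", 1_080_000, 1_140_000, 1),
    Aln(400_000, 420_000, "chr7", 5_000, 25_000, 1),
]
blocks = build_blocks(alignments)
```

The two chr4 alignments are chained into one block spanning 150 kb of the reference, while the isolated 20 kb alignment on chr7 forms a block that is too short and is dropped.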

53 of 56 SyntenyView Human chromosome Mouse chromosomes Orthologues

54 of 56 Syntenic blocks CytoView

55 of 56 Outlook OrthoView Displaying alignments both from whole genome alignments and on orthologues Consider all isoforms for each gene Calculate dN/dS

56 of 56 Acknowledgements Abel Ureta-Vidal Benoît Ballester Kathryn Beal Stephen Fitzgerald Javier Herrero Albert Vilella Ensembl team Sep 2006

57 of 56 Basic idea Speciation event selection alignment mutations Ancestor sequence Mutation Regulatory region Exon

58 of 56 Global v Local Alignments. Advantages / Disadvantages. Local: compares large genomic regions (uses syntenic maps); can detect rearrangements like translocations, inversions and duplications (!); fails to identify insertions or deletions. Global: detects insertions and deletions; fails to detect rearrangements (inversions). [Figure labels: (-), inversion, duplication.] Glocal aligner (Brudno et al., 2003), pairwise only.

59 of 56 Inparalogues vs Outparalogues. Adapted from Sonnhammer & Koonin (2002) TIG 18, 12: 620

60 of 56 Problems: weak orthologies

61 of 56 Problems: misalignments

62 of 56 Possible solutions Weak orthologies: Poor alignments: –report to author –edit alignments, detect wrong edges, redefine blocks –use another aligner

63 of 56 From Edgar, R. C. (2004) NAR 32: