Comparative genomics and proteomics in Ensembl Sep 2006
2 of 56 Overview
Rationale
Species available
Comparative proteomics
–Orthologue and paralogue prediction
–Protein clustering into families
Comparative genomics
–Genome-wide DNA alignments
–Synteny block characterisation
Future and perspectives
3 of 56 The Compara database is one single multispecies database Gene orthology/paralogy prediction Protein clustering Whole genome alignments Synteny regions Compara
4 of 56 The era of sequencing genomes
[Figure: species tree on a time axis in million years. Red: whole genome assembly available. Green: whole genome assembly due within the next year in Ensembl.]
* 19 species currently in Ensembl
+ 10 Pre! Ensembl
Clade labels: Chordata, Vertebrata, Amniota, Tetrapoda, Teleostei, Urochordata, Arthropoda, Nematoda, Fungi, Amphibia, Aves, Metatheria, Mammalia, Eutheria
S. cerevisiae (baker’s yeast) *
C. elegans (nematode) *
A. mellifera (honey bee) *
D. rerio (zebrafish) *
D. melanogaster (fruitfly) *
A. gambiae (African malaria mosquito) *
A. aegypti (yellow fever mosquito) +
C. intestinalis (transparent sea squirt) *
C. savignyi (sea squirt) +
T. rubripes (torafugu) *
T. nigroviridis (spotted green pufferfish) *
O. latipes (Japanese medaka)
G. aculeatus (stickleback) +
O. aries (sheep)
G. gallus (chicken) *
X. laevis (African clawed frog)
M. musculus (house mouse) *
R. norvegicus (Norway rat) *
M. mulatta (rhesus macaque) *
P. troglodytes (chimpanzee) *
C. familiaris (dog) *
F. catus (cat)
E. caballus (horse)
S. scrofa (pig)
B. taurus (cow) *
M. domestica (opossum) *
L. africana (elephant)
H. sapiens (human) * +
X. tropicalis (western clawed frog) *
5 of 56 Comparing different species
From the Ensembl perspective, Compara joins species through
–orthologous/paralogous gene links
–chromosome synteny links
–protein family links
From a broader perspective
–Where are syntenic regions located?
–How many genes are conserved?
–Where are orthologous/paralogous genes?
–Is gene order conserved?
–Where are potential regulatory regions?
–What is missing in one species, present only in another?
6 of 56 Orthologue and Paralogue Prediction Evolutionary studies Identify potential species-specific proteins/genes Identify orthologues of (human) genes in model organisms
7 of 56 Gene Evolution Divergence Speciation / Duplication Change within allelic population Point Mutations / Selection / Drift Exon/domain shuffling Transposition / Translocation Retroposition (reverse transcription) Horizontal gene transfer? Orthologues and Paralogues Reconstruct the Molecular Evolutionary history from the evidence visible within the known extant genes
8 of 56 Homologue Relationships
Orthologues: any pairwise gene relation where the ancestor node is a speciation event
Paralogues: any pairwise gene relation where the ancestor node is a duplication event
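The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the Compara implementation: the `Node` class, the `relation` function and the toy tree are hypothetical names chosen here, and the tree is assumed to be already reconciled so that every internal node carries a speciation or duplication label.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    event: Optional[str] = None  # "speciation" or "duplication" on internal nodes
    children: List["Node"] = field(default_factory=list)


def path_to(root, leaf_name):
    """Return the list of nodes from root down to the named leaf, or None."""
    if not root.children:
        return [root] if root.name == leaf_name else None
    for child in root.children:
        sub = path_to(child, leaf_name)
        if sub is not None:
            return [root] + sub
    return None


def relation(root, gene_a, gene_b):
    """Classify a gene pair by the event at their last common ancestor node."""
    path_a, path_b = path_to(root, gene_a), path_to(root, gene_b)
    ancestor = None
    for x, y in zip(path_a, path_b):
        if x is not y:
            break
        ancestor = x  # deepest node shared by both paths so far
    return "orthologue" if ancestor.event == "speciation" else "paralogue"


# Hypothetical toy tree: an old duplication, then speciation in each copy,
# then a recent duplication in the mouse lineage of the second copy.
tree = Node("root", "duplication", [
    Node("copy1", "speciation", [Node("H1"), Node("M1")]),
    Node("copy2", "speciation", [
        Node("H2"),
        Node("mouse_dup", "duplication", [Node("M2"), Node("M2'")]),
    ]),
])

print(relation(tree, "H1", "M1"))   # orthologue
print(relation(tree, "H1", "H2"))   # paralogue
print(relation(tree, "H2", "M2'"))  # orthologue
```

Note that H2 is an orthologue of both M2 and M2', while M2 and M2' are paralogues of each other because their common ancestor is a duplication node.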
9 of 56 Homologue Relationships
[Figure: gene tree along a time axis with duplication and speciation events; genes A, A1, A2, H1, H2, M1, M2, M2’; pairs labelled orthologues, inparalogues and outparalogues.]
Orthologous genes originate from a single ancestral gene (and often have equivalent functions).
Paralogous genes are related via duplication:
–Inparalogues (ortholog_one2one, ortholog_one2many, etc.): duplication follows speciation
–Outparalogues (between_species_paralog): duplication precedes speciation
10 of 56 Orthology Prediction Algorithm
Find orthologous genes by comparing the protein sets of two species (only the longest peptide of each gene is considered).
–blastp+sw all versus all (on a paired species basis)
–Build a graph of gene relations based on BRH (best reciprocal hit) and BSR (BLAST score ratio)
–Extract connected components (single linkage clusters), each cluster representing a gene family
[Figure: mouse and human gene relation graphs]
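The graph-building and clustering steps can be sketched as follows. This is a rough illustration under stated assumptions, not the pipeline's code: the function names, the toy scores, the definition of BSR as the hit score divided by the larger of the two self-scores, and the `min_ratio` cutoff of 0.5 are all hypothetical choices made here.

```python
from collections import defaultdict


def best_hits(scores):
    """Map each query to its highest-scoring target; scores is {(query, target): score}."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def brh_edges(a_vs_b, b_vs_a):
    """Best reciprocal hits: a's best hit is b and b's best hit is a."""
    best_ab, best_ba = best_hits(a_vs_b), best_hits(b_vs_a)
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


def bsr_edges(a_vs_b, self_scores, min_ratio=0.5):
    """Edges whose BLAST score ratio passes an (assumed) cutoff."""
    edges = set()
    for (a, b), score in a_vs_b.items():
        ratio = score / max(self_scores[a], self_scores[b])
        if ratio >= min_ratio:
            edges.add((a, b))
    return edges


def connected_components(edges):
    """Single linkage clusters: connected components of the undirected graph."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Toy scores (made up for illustration only).
human_vs_mouse = {("H1", "M1"): 200, ("H1", "M2"): 150,
                  ("H2", "M1"): 120, ("H3", "M3"): 300}
mouse_vs_human = {("M1", "H1"): 200, ("M2", "H1"): 150,
                  ("M1", "H2"): 120, ("M3", "H3"): 300}
self_scores = {"H1": 220, "H2": 400, "H3": 310, "M1": 210, "M2": 180, "M3": 305}

edges = brh_edges(human_vs_mouse, mouse_vs_human) | bsr_edges(human_vs_mouse, self_scores)
clusters = connected_components(edges)
print(sorted(sorted(c) for c in clusters))  # [['H1', 'M1', 'M2'], ['H3', 'M3']]
```

In this toy case M2 joins the H1/M1 cluster through a BSR edge only, which is how single linkage can pull a candidate paralogue into a gene family.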
11 of 56 GeneTree prediction: MUSCLE/PHYML
Multiple alignment of each cluster (built from BRH and BSR) with MUSCLE
Unrooted gene tree built using PHYML (Guindon & Gascuel, 2003)
Tree reconciliation (gene tree with species tree) using RAP (Dufayard et al. 2005) to call duplication events on internal nodes and to root the tree
Infer pairwise relations of orthology and paralogy types (from each tree)
12 of 56 Molecular Phylogenetics Protein sequences in different species, both: Provide information about the history of evolution Reconstruct evolution We are after an alignment that equally reflects all species: Modeling the branching processes by comparing gene and species trees (tree reconciliation)
13 of 56 Phylogenies Duplication node Speciation node or leaf Revealing the evolutionary history that has led to the organisms at the current stage. - Leaves are real genomes - Internal nodes are ancestors
14 of 56 Orthologue and Paralogue types ortholog_one2one ortholog_one2many ortholog_many2many apparent_ortholog_one2one within_species_paralog between_species_paralog
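As a rough sketch of how the three basic orthologue types differ, the snippet below labels each orthologous pair between two species by counting partners on each side. The function name and the toy gene identifiers are hypothetical, and the sketch assumes the pairs come from a reconciled tree, so every gene in a group is paired with every gene of the other species in that group. The apparent_ortholog_one2one and paralogue types are not covered.

```python
from collections import defaultdict


def orthology_types(pairs):
    """Label each (gene_a, gene_b) orthologue pair as one2one, one2many or many2many."""
    partners_a = defaultdict(set)  # gene in species A -> orthologues in species B
    partners_b = defaultdict(set)  # gene in species B -> orthologues in species A
    for a, b in pairs:
        partners_a[a].add(b)
        partners_b[b].add(a)
    types = {}
    for a, b in pairs:
        n_a, n_b = len(partners_a[a]), len(partners_b[b])
        if n_a == 1 and n_b == 1:
            types[(a, b)] = "ortholog_one2one"
        elif n_a == 1 or n_b == 1:
            types[(a, b)] = "ortholog_one2many"
        else:
            types[(a, b)] = "ortholog_many2many"
    return types


# Toy identifiers, made up for illustration.
pairs = [
    ("H1", "M1"),
    ("H2", "M2a"), ("H2", "M2b"),
    ("H3a", "M3a"), ("H3a", "M3b"), ("H3b", "M3a"), ("H3b", "M3b"),
]
for pair, label in orthology_types(pairs).items():
    print(pair, label)
```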
15 of 56 …in Ensembl…
16 of 56 Orthologue and Paralogue types
17 of 56 GeneView
18 of 56 GeneView
19 of 56 Links to ATV and JalView GeneTree MUSCLE protein alignment GeneTreeView
20 of 56 Duplication node (red) Speciation node (blue) GeneTreeView
21 of 56 ATV
22 of 56 Protein clustering into families Cluster proteins from different organisms that may share the same function Obtain some kind of description for ‘novel’ genes/proteins Locate family members over the whole genome Identify possible orthologues and paralogues in other species
23 of 56 Protein Dataset
Nearly a million proteins clustered:
–All Ensembl proteins from all species in Ensembl: 513,256 predicted proteins
–All metazoan (animal) proteins in UniProt: 55,892 UniProt/Swiss-Prot and 469,725 UniProt/TrEMBL
Blastp all versus all, then clustering with MCL
24 of 56 Clustering Strategy BLASTP all-versus-all comparison Markov clustering For each cluster: –Calculation of multiple sequence alignments with ClustalW –Assignment of a consensus description
25 of 56 Markov Clustering (MCL)
MCL (Markov CLustering algorithm) is based on flow simulation in graphs
Keeps only very well interconnected nodes (proteins) in the same graph (cluster)
Allows rapid and accurate detection of protein families on a large scale
Automatic description and ClustalW multiple alignment applied to each cluster
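The flow simulation idea can be illustrated with a small pure-Python sketch of the two alternating MCL steps, expansion (matrix squaring) and inflation (element-wise power followed by column normalisation). This is a toy version for intuition only, not the MCL program itself: the function names, the added self-loops, the inflation value of 2.0, the fixed iteration count and the way clusters are read off the final matrix are simplifying assumptions made here.

```python
def normalize(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion: square the matrix, letting flow spread along longer paths."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation: raise entries to a power, strengthening strong flows."""
    return normalize([[x ** power for x in row] for row in m])


def mcl(adjacency, inflation=2.0, iterations=50):
    """Return clusters as a set of frozensets of node indices."""
    n = len(adjacency)
    # Add self-loops so that every column has a non-zero sum.
    m = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Two tightly connected triangles joined by one weak similarity edge (2-3).
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
for cluster in mcl(two_triangles):
    print(sorted(cluster))
```

In the protein family setting the nodes would be proteins and the edge weights would be derived from the all-versus-all BLASTP results; a higher inflation value generally gives smaller, tighter clusters.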
26 of 56 Link to FamilyView ProtView
27 of 56 Ensembl family members within human Ensembl family members in other species JalView multiple alignments FamilyView
28 of 56 For each cluster We store –Description and score –Multiple alignment Future extensions –Improving descriptions –Multiple alignment assessment –Build phylogeny on each cluster Using the multiple alignment Using dS values (mainly inside mammals) Extend paralogous prediction
29 of 56 Aligning complete genomes
30 of 56 Whole Genome Alignments
Understand what evolution has done to the species compared, after speciation
–What is missing in one species, present only in another?
–Differences between closely related species may help in understanding speciation
Define syntenic regions, those long regions of DNA sequence where order and orientation are highly conserved
Conserved non-coding regions
–Guides to putative regulatory regions
31 of 56 Evolution at the DNA level …ACTGACATGTACCA… …AC----CATGCACCA… Mutation Sequence edits Rearrangements Deletion Inversion Translocation Duplication
32 of 56 Basic Idea Functional sequences evolve more slowly than non-functional sequences Comparing genomic sequences from species at different evolutionary distances allows us to identify: –Coding genes –Non-coding genes –Non-coding regulatory sequences
33 of 56 Aligning large genomic sequences
Independent from protein/gene predictions
Should find all highly similar regions between two sequences
Should allow for segments without similarity, rearrangements etc.
Issues
–Heavy process
–Scalability, as more and more genomes are sequenced
–Time constraint
–Computes run only by a few dedicated groups
–As the «true» alignment is not known, it is difficult to measure alignment accuracy and to choose the right method
34 of 56 Using a local aligner
Local alignment
–Finds all highly similar regions over 2 sequences
–Finds the orthologous as well as all the paralogous sequences
–Aligned regions are separated by segments without alignment
–Can handle rearranged sequences
–Needs post-filtering to limit excessively overlapping alignments
35 of 56 Local v Global Alignment
[Figure: example sequences aligned locally and globally]
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTTAATC
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
Local
–Advantages: compares large genomic regions (requires syntenic maps); can detect rearrangements like translocations, inversions and duplications (!)
–Disadvantages: fails to identify insertions or deletions
Global
–Advantages: detects insertions and deletions
–Disadvantages: fails to detect rearrangements (inversions)
36 of 56 Glocal Alignment Problem
Find the least-cost transformation of one sequence into another using new operations
–Sequence edits (indels, mutations)
–Inversions
–Translocations
–Duplications
–A combination of these
GTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGAG
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACT
Glocal aligner (Brudno et al., 2003)
37 of 56 BLASTZ-net, tBLAT and MLAGAN
BLASTZ-net (comparison at the nucleotide level) is used for species that are evolutionarily close, e.g. human - mouse
Translated BLAT (comparison at the amino acid level) is used for evolutionarily more distant species, e.g. human - zebrafish
MLAGAN global alignment is used for multispecies alignments
38 of 56 all versus all approach using BLASTZ (collaboration with UCSC) Can handle large sequences Used 2-weighted spaced seeding strategy Dynamic masking Makes distinction between repeat and non-repeat sequences (soft masking) Try aligning inside repeats One iterative step with lower threshold to expand alignments
39 of 56 Blastz strategy
10Mb Human fragments (3000)
30Mb Mouse fragments (100)
Lineage-specific repeats removed
48 hours on 1024 CPUs
Generates 9Gb of output
When filtered for best hit on human, reduced to 2.5Gb
40 of 56 Blastz human genome coverage
40% of the human genome is covered by alignments to mouse sequence
After rescoring the alignments with a “tight”, very stringent matrix that looks for high conservation (>70% identity), the coverage goes down to 6%
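A coverage figure of this kind can be computed by merging overlapping alignment blocks and dividing the covered length by the genome length. The sketch below is a simplification made for illustration: it filters blocks by a stored identity value rather than rescoring them with a stringent matrix, the function name and the toy blocks are hypothetical, and coordinates are assumed to be half-open (start included, end excluded).

```python
def covered_fraction(blocks, genome_length, min_identity=None):
    """Fraction of the genome covered by alignment blocks.

    blocks: iterable of (start, end, identity) on the reference genome.
    min_identity: if given, keep only blocks with identity above this value.
    """
    kept = sorted(
        (start, end)
        for start, end, identity in blocks
        if min_identity is None or identity > min_identity
    )
    covered = 0
    cur_start = cur_end = None
    for start, end in kept:
        if cur_end is None or start > cur_end:
            # Gap before this block: close the previous merged interval.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent block: extend the merged interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / genome_length


# Toy blocks on a 1000 bp "genome" (made-up numbers).
blocks = [(0, 200, 0.65), (150, 300, 0.80), (500, 600, 0.72), (700, 800, 0.60)]
print(covered_fraction(blocks, 1000))                     # 0.5
print(covered_fraction(blocks, 1000, min_identity=0.70))  # 0.25
```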
41 of 56 DNA/DNA matches web display ContigView human EPO Conserved sequences
42 of 56 DotterView Mouse sequence Human sequence
43 of 56 Multiple alignments Currently 3 sets: –MLAGAN-primates: –MLAGAN-amniote vertebrates: –MLAGAN-eutherian mammals:
44 of 56 Strategy
Use all coding exons
Get sets of best reciprocal hits
Create orthology maps
Build multiple global alignments
45 of 56 MultiContigView
46 of 56 Multiple alignments ContigView human EPO
47 of 56 AlignSliceView
Alignment at the basepair level: Human, Dog, Rat, Mouse
Export alignments
48 of 56 MultiContigView vs. AlignSliceView
49 of 56 AlignView
50 of 56 GeneSeqalignView
51 of 56 GeneSeqalignView
52 of 56 Syntenic Regions
Genome alignments are refined into larger syntenic regions
Alignments are clustered together when the relative distance between them is less than 100 kb and order and orientation are consistent
Any clusters shorter than 100 kb are discarded
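The clustering rule can be sketched as a single pass over alignments sorted along the reference genome. This is one possible reading of the rule, not the Compara code: the `Aln` record, the function names, measuring the gap on both genomes, and measuring cluster length on the reference genome are assumptions made for this illustration.

```python
from typing import NamedTuple

MAX_GAP = 100_000  # alignments further apart than this start a new cluster
MIN_LEN = 100_000  # clusters shorter than this are discarded


class Aln(NamedTuple):
    ref_start: int
    ref_end: int
    chrom: str   # chromosome in the other species
    start: int
    end: int
    strand: int  # +1 or -1 relative to the reference


def compatible(prev, cur):
    """True if cur can extend the cluster ending in prev."""
    if prev.chrom != cur.chrom or prev.strand != cur.strand:
        return False
    if cur.ref_start - prev.ref_end >= MAX_GAP:
        return False
    if cur.strand == 1:
        in_order = cur.start >= prev.start
        gap = cur.start - prev.end
    else:
        # On the reverse strand, coordinates in the other genome decrease.
        in_order = cur.end <= prev.end
        gap = prev.start - cur.end
    return in_order and gap < MAX_GAP


def synteny_blocks(alignments):
    """Group alignments into clusters and drop clusters shorter than MIN_LEN."""
    blocks, current = [], []
    for aln in sorted(alignments, key=lambda a: a.ref_start):
        if current and not compatible(current[-1], aln):
            blocks.append(current)
            current = []
        current.append(aln)
    if current:
        blocks.append(current)
    return [b for b in blocks if b[-1].ref_end - b[0].ref_start >= MIN_LEN]


# Toy alignments (made-up coordinates).
alignments = [
    Aln(0, 40_000, "chr11", 1_000_000, 1_040_000, 1),
    Aln(60_000, 120_000, "chr11", 1_050_000, 1_110_000, 1),
    Aln(150_000, 210_000, "chr11", 1_130_000, 1_190_000, 1),
    Aln(500_000, 520_000, "chr4", 9_000_000, 9_020_000, -1),
]
for block in synteny_blocks(alignments):
    print(block[0].ref_start, block[-1].ref_end, block[0].chrom, len(block))
```

In this toy input the first three alignments chain into one 210 kb block, while the isolated 20 kb alignment on chr4 forms a cluster that is too short and is dropped.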
53 of 56 SyntenyView Human chromosome Mouse chromosomes Orthologues
54 of 56 Syntenic blocks CytoView
55 of 56 Outlook OrthoView Displaying alignments both from whole genome alignments and on orthologues Consider all isoforms for each gene Calculate dN/dS
56 of 56 Acknowledgements Abel Ureta-Vidal Benoît Ballester Kathryn Beal Stephen Fitzgerald Javier Herrero Albert Vilella Ensembl team Sep 2006
57 of 56 Basic idea Speciation event selection alignment mutations Ancestor sequence Mutation Regulatory region Exon
58 of 56 Global v Local Alignments
Local
–Advantages: compares large genomic regions (uses syntenic maps); can detect rearrangements like translocations, inversions and duplications (!)
–Disadvantages: fails to identify insertions or deletions
Global
–Advantages: detects insertions and deletions
–Disadvantages: fails to detect rearrangements (inversions)
[Figure labels: (-) inversion, duplication]
Glocal aligner (Brudno et al., 2003), pairwise only
59 of 56 Adapted from Sonnhammer & Koonin (2002) TIG 18, 12: 620 Inparalogues vs Outparalogues
60 of 56 Problems: weak orthologies
61 of 56 Problems: misalignments
62 of 56 Possible solutions Weak orthologies: Poor alignments: –report to author –edit alignments, detect wrong edges, redefine blocks –use another aligner
63 of 56 From Edgar, R. C. (2004) NAR 32: