Comparative Genomics
Overview
  - Orthologues and paralogues
  - Protein families
  - Genome-wide DNA alignments
  - Syntenic blocks
Comparative Genomics
  - Allows us to achieve a greater understanding of vertebrate evolution
  - Tells us what is common and what is unique between different species at the genome level
  - The function of human genes and other regions may be revealed by studying their counterparts in lower organisms
  - Helps identify both coding and non-coding genes and regulatory elements
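Species in Ensembl
  [Figure: vertebrate phylogeny drawn against a geological time axis in millions of years before present (MYBP). Periods: Cambrian, Ordovician, Silurian, Devonian, Carboniferous, Permian, Triassic, Jurassic, Cretaceous, Tertiary, with boundaries at 570, 505, 438, 408, 360, 286, 245, 208, 144 and 65 MYBP. Branch labels: mammals (placentals, monotremes, marsupials), birds (passerines, paleognaths, other birds), reptiles (crocodiles, turtles, lizards), amphibians, fishes (teleosts, sharks, rays, Latimeria, bichir/Polypterus, lungfishes), agnathans, non-vertebrates.]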
Orthologue / Paralogue Prediction Algorithm
  (1) Load the longest translation of each gene from all species used in Ensembl.
  (2) Run WUBLASTp+SmithWaterman of every gene against every other (both self and non-self species) in a genome-wide manner.
  (3) Build a graph of gene relations based on Best Reciprocal Hits (BRH) and Blast Score Ratio (BSR) values.
  (4) Extract the connected components (= single-linkage clusters), each cluster representing a gene family.
  (5) For each cluster, build a multiple alignment based on the protein sequences using MUSCLE.
  (6) For each aligned cluster, build a phylogenetic tree using PHYML. An unrooted tree is obtained at this stage.
  (7) Reconcile each gene tree with the species tree using RAP, to call duplication events on internal nodes and to root the tree.
  (8) From each gene tree, infer pairwise gene relations of orthology and paralogy types.
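Steps (3) and (4) can be illustrated with a small sketch. It is not the Ensembl pipeline: it links genes by best reciprocal hits across species only, leaves out the BSR criterion and within-species hits, and the "species:gene" identifiers and scores are made up for illustration.

```python
from collections import defaultdict

def best_hits(scores):
    """Return {(query, target_species): best-scoring target gene}.

    `scores` maps (query, target) -> alignment score. Gene IDs are
    hypothetical strings of the form "species:gene".
    """
    best = {}
    for (query, target), score in scores.items():
        q_species = query.split(":")[0]
        t_species = target.split(":")[0]
        if q_species == t_species:
            continue  # assumption: this sketch ignores within-species hits
        key = (query, t_species)
        if key not in best or score > best[key][1]:
            best[key] = (target, score)
    return {key: hit for key, (hit, _) in best.items()}

def reciprocal_edges(scores):
    """Keep only pairs where each gene is the other's best hit."""
    best = best_hits(scores)
    edges = set()
    for (query, _), target in best.items():
        q_species = query.split(":")[0]
        if best.get((target, q_species)) == query:
            edges.add(frozenset((query, target)))
    return edges

def connected_components(edges):
    """Single-linkage clusters: connected components of the BRH graph."""
    adjacency = defaultdict(set)
    for edge in edges:
        a, b = tuple(edge)
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, clusters = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

# Made-up scores for five genes in three species.
scores = {
    ("human:A", "mouse:A"): 950, ("mouse:A", "human:A"): 940,
    ("human:A", "mouse:B"): 300, ("mouse:B", "human:A"): 310,
    ("mouse:A", "fish:A"): 500, ("fish:A", "mouse:A"): 480,
    ("human:B", "mouse:B"): 800, ("mouse:B", "human:B"): 790,
}
clusters = connected_components(reciprocal_edges(scores))
```

Here fish:A joins the human:A / mouse:A family through mouse:A alone, which is the single-linkage behaviour named in step (4).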
Homologue Relationships
  - Orthologues: any pairwise gene relation where the ancestor node is a speciation event
  - Paralogues: any pairwise gene relation where the ancestor node is a duplication event
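The two definitions translate directly into a tree walk: for each internal node, every gene on the left paired with every gene on the right has that node as its most recent common ancestor, so the node's event type gives the relation. A minimal sketch, assuming a reconciled binary tree encoded as nested tuples (the encoding and gene names are hypothetical, not an Ensembl format):

```python
# Hypothetical reconciled gene tree: internal nodes are
# (event, left_subtree, right_subtree); leaves are gene names.
tree = ("speciation",
        ("duplication", "human_A1", "human_A2"),
        ("speciation", "mouse_A", "rat_A"))

def leaves(node):
    """List the gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def pairwise_relations(node, relations=None):
    """Label each gene pair by the event at its most recent common ancestor."""
    if relations is None:
        relations = {}
    if isinstance(node, str):
        return relations
    event, left, right = node
    kind = "orthologue" if event == "speciation" else "paralogue"
    # Pairs split across the two subtrees have this node as their ancestor.
    for a in leaves(left):
        for b in leaves(right):
            relations[frozenset((a, b))] = kind
    pairwise_relations(left, relations)
    pairwise_relations(right, relations)
    return relations

relations = pairwise_relations(tree)
```

In this toy tree the two human genes are paralogues of each other, and each of them is an orthologue of both rodent genes.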
Orthologue and Paralogue Types
GeneView
GeneTreeView
  - MUSCLE protein alignment
  - GeneTree
GeneTreeView
  - Speciation node (blue)
  - Duplication node (red)
Protein Dataset
  More than 1,500,000 proteins clustered:
  - All Ensembl protein predictions from all species supported: ~670,000 protein predictions
  - All metazoan (animal) proteins in UniProt:
      ~80,000 UniProt/Swiss-Prot
      ~830,000 UniProt/TrEMBL
Clustering Strategy
  - BLASTP all-versus-all comparison
  - Markov clustering
  - For each cluster:
      Calculation of multiple sequence alignments with ClustalW
      Assignment of a consensus description
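The Markov clustering step can be sketched in a few lines of plain Python. This is a toy version of the general MCL idea (alternate matrix expansion and inflation until the flow settles), not the tool or parameters used for the Ensembl families; the inflation value, iteration count and similarity matrix are assumptions chosen for illustration.

```python
def normalize_columns(matrix):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    size = len(matrix)
    for col in range(size):
        total = sum(matrix[row][col] for row in range(size))
        if total:  # leave all-zero columns untouched
            for row in range(size):
                matrix[row][col] /= total
    return matrix

def expand(matrix):
    """Expansion: square the matrix, i.e. take two-step random walks."""
    size = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(size))
             for j in range(size)] for i in range(size)]

def inflate(matrix, power):
    """Inflation: raise entries to a power and renormalize the columns."""
    return normalize_columns([[value ** power for value in row]
                              for row in matrix])

def markov_cluster(similarity, inflation=2.0, iterations=50):
    """Toy MCL; `inflation` and `iterations` are illustrative defaults."""
    size = len(similarity)
    # Add self-loops, then make the matrix column-stochastic.
    matrix = normalize_columns(
        [[float(similarity[i][j]) + (1.0 if i == j else 0.0)
          for j in range(size)] for i in range(size)])
    for _ in range(iterations):
        matrix = inflate(expand(matrix), inflation)
    # Rows that still carry flow name the members of one cluster.
    clusters = []
    for row in matrix:
        members = [j for j, value in enumerate(row) if value > 1e-6]
        if members and members not in clusters:
            clusters.append(members)
    return clusters

# Made-up BLASTP-style similarity graph: proteins 0-1-2 form a chain,
# proteins 3-4 form a separate pair.
similarity = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
families = markov_cluster(similarity)
```

In a real run the matrix entries would come from the all-versus-all BLASTP scores, and each resulting cluster would then be aligned and given a consensus description.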
GeneView / TransView / ProtView Link to FamilyView
FamilyView
  - Consensus annotation
  - JalView multiple alignments
  - Ensembl family members within human
  - UniProt family members
  - Ensembl family members in other species
JalView
Whole Genome Alignments
  Functional sequences evolve more slowly than non-functional sequences, therefore sequences that remain conserved may perform a biological function.
  Comparing genomic sequences from species at different evolutionary distances allows us to identify:
  - Coding genes
  - Non-coding genes
  - Non-coding regulatory sequences
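As a rough illustration of how conservation shows up in an alignment, the sketch below computes percent identity in fixed windows of a pairwise alignment. The sequences and window size are invented, and this is far simpler than the conservation scores and constrained elements derived from the multispecies alignments.

```python
def window_identity(seq_a, seq_b, window):
    """Fraction of identical, non-gap columns per non-overlapping window.

    `seq_a` and `seq_b` are two rows of a pairwise alignment ('-' = gap).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    result = []
    for start in range(0, len(seq_a) - window + 1, window):
        columns = zip(seq_a[start:start + window], seq_b[start:start + window])
        matches = sum(1 for a, b in columns if a == b and a != "-")
        result.append((start, matches / window))
    return result

# Invented toy alignment: a conserved first half, a diverged second half.
human = "ACGTACGTAAGGCCTT"
mouse = "ACGTACGTTTCCGGAA"
profile = window_identity(human, mouse, window=8)
```

Windows with high identity between distant species are the candidates for coding or regulatory function.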
Selection of Species for DNA Comparisons
  Human vs.                                Chimpanzee   Mouse     Opossum    Pufferfish
  Size (Gbp)                               3.0          2.5       4.2        0.4
  Time since divergence                    ~5 MYA       ~65 MYA   ~150 MYA   ~450 MYA
  Sequence conservation (coding regions)   >99%         ~80%      ~70-75%    ~65%
  Aids identification of:
  - Chimpanzee: recently changed sequences and genomic rearrangements
  - Mouse, opossum: both coding and non-coding sequences
  - Pufferfish: primarily coding sequences
Alignment Algorithm
  - Should find all highly similar regions between two sequences
  - Should allow for segments without similarity, rearrangements etc.
  Issues:
  - Computationally heavy process
  - Scalability, as more and more genomes are sequenced
  - Time constraints
BLASTZ-net, tBLAT and PECAN
  - BLASTZ-net (comparison at the nucleotide level) is used for species that are evolutionarily close, e.g. human - mouse
  - Translated BLAT (comparison at the amino acid level) is used for evolutionarily more distant species, e.g. human - zebrafish
  - PECAN is used for multispecies alignments:
      7 eutherian mammals
      10 amniote vertebrates
BLASTZ-net, tBLAT and PECAN
  The Comparative Genomics page (Help & Documentation > Genomic Data > Comparative Genomics) shows for which combinations of species whole genome alignments have been done:
ContigView
  - Constrained elements
  - Conservation score
  - PECAN alignments
  - BLASTZ mouse
  - tBLAT zebrafish
MultiContigView Conserved sequences human Conserved sequences dog
AlignSliceView Human Mouse Dog Rat
MultiContigView vs. AlignSliceView
AlignView
GeneSeqalignView
Syntenic Blocks
  - Genome alignments are refined into larger syntenic regions
  - Alignments are clustered together when the relative distance between them is less than 100 kb and order and orientation are consistent
  - Any clusters less than 100 kb are discarded
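The clustering rule above can be sketched as a single pass over alignments sorted by reference position. This is an illustrative reading of the rule, not the Ensembl implementation: it assumes the 100 kb distance applies to the gap on both genomes, measures block size on the reference genome, and uses a hypothetical Alignment record with made-up coordinates.

```python
from typing import NamedTuple

MAX_GAP = 100_000    # alignments closer than this are merged
MIN_BLOCK = 100_000  # clusters shorter than this are discarded

class Alignment(NamedTuple):
    """Hypothetical record for one pairwise alignment."""
    ref_start: int
    ref_end: int
    tgt_chrom: str
    tgt_start: int
    tgt_end: int
    strand: int  # +1 same orientation, -1 inverted

def can_extend(prev, nxt):
    """True if `nxt` continues `prev` in consistent order and orientation."""
    if prev.tgt_chrom != nxt.tgt_chrom or prev.strand != nxt.strand:
        return False
    ref_gap = nxt.ref_start - prev.ref_end
    if prev.strand == 1:
        tgt_gap = nxt.tgt_start - prev.tgt_end
    else:  # inverted: target coordinates decrease along the reference
        tgt_gap = prev.tgt_start - nxt.tgt_end
    # Assumption: the gap limit is applied on both genomes.
    return 0 <= ref_gap < MAX_GAP and 0 <= tgt_gap < MAX_GAP

def syntenic_blocks(alignments):
    """Chain alignments into blocks, then drop blocks under MIN_BLOCK."""
    blocks, current = [], []
    for aln in sorted(alignments, key=lambda a: a.ref_start):
        if current and not can_extend(current[-1], aln):
            blocks.append(current)
            current = []
        current.append(aln)
    if current:
        blocks.append(current)
    return [(b[0].ref_start, b[-1].ref_end, b[0].tgt_chrom, b[0].strand)
            for b in blocks
            if b[-1].ref_end - b[0].ref_start >= MIN_BLOCK]

# Made-up alignments of one reference chromosome against two target ones.
alignments = [
    Alignment(0, 60_000, "chr5", 1_000_000, 1_060_000, 1),
    Alignment(90_000, 150_000, "chr5", 1_080_000, 1_140_000, 1),
    Alignment(400_000, 430_000, "chr5", 1_500_000, 1_530_000, 1),
    Alignment(450_000, 520_000, "chr9", 200_000, 270_000, 1),
]
blocks = syntenic_blocks(alignments)
```

The first two alignments merge into one 150 kb block; the third is too far away and the fourth lies on another chromosome, so both end up in clusters under 100 kb and are dropped.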
SyntenyView Human chromosome Orthologues Mouse chromosomes
CytoView Syntenic blocks Orientation Chromosome
Q & A: Questions and Answers