Comparative genomics: Overview & Tools + MUMmer algorithm Urmila Kulkarni-Kale Bioinformatics Centre University of Pune, Pune 411 007.



Jan 21, 2010 © UKK, Bioinformatics Centre, University of Pune.

Genome sequence: Fact file
– 1995: the first complete genome sequence of a free-living organism, Haemophilus influenzae Rd, was published
– Biological systems are dynamic and evolving
– The fourth dimension: time
– A genome sequence is a snapshot of evolution
– Correlating phenotypic properties with genomic regions is not straightforward, because phenotypic properties result from many-to-many interactions

Genomes: the current status (GOLD database, as of Jan 21, 2010)
– Published complete genomes: 403
  » Archaeal: 81
  » Bacterial: 1226
  » Eukaryal: 169
– Ongoing:
  » Archaeal: 107
  » Prokaryotic: 3478
  » Eukaryotic: 1209
– Viral: >4500
– Metagenomics: 203

Genome databases
– Genomes at NCBI, EBI, TIGR

H. influenzae complete genome

Function information clock of E. coli (generated in March 2004)

Comparison of the coding regions
– Begins with the gene identification algorithm: infer which portions of the genomic sequence code for genes
– There are four basic approaches

Knowledge of the full genome sequence: solutions or new questions?
– Still struggling with the gene counters…
– Correct number of genes?

Genome analyses
Variation in
– Genome size (E. coli: 4.6 Mbp; M. pneumoniae: 0.81 Mbp; B. subtilis: 4.20 Mbp)
– GC content (B. burgdorferi: 29%; M. tuberculosis: 68%)
– Codon usage
– Amino acid composition (G, A, P, R: GC rich; I, F, Y, M, D: AT rich)
– Genome organisation
  – Single circular chromosomes
  – Linear chromosome + extrachromosomal elements
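
The GC content figures quoted on this slide are simple base-composition percentages. A minimal sketch of how such a value could be computed is given below; the function name `gc_content` and the toy input are illustrative assumptions, not part of the original slides, and ambiguous bases such as N are simply ignored.

```python
def gc_content(seq: str) -> float:
    """Return the percentage of G and C among the A, C, G, T bases of seq.

    Assumption for this sketch: any other symbol (e.g. N) is ignored.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / (gc + at)


# Toy input; a real genome would be read from a FASTA file instead.
print(gc_content("ATGC"))
```

For a whole genome the same calculation is usually applied both to the complete sequence and to sliding windows, since local GC content can differ from the genome-wide average.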

CG: Comparisons between genomes
– Strains of the same species
– Closely related species
– Distantly related species
  – List of orthologs
  – Evolution of individual genes
  – Evolution of organisms


CG helps to ask some interesting questions
Identifying similarities and differences between genomes may allow us to understand:
– How two organisms evolved
– Why certain bacteria cause disease while others do not
It also supports the identification and prioritization of drug targets.

CG: Unit of comparison
Unit of comparison: gene/genome
– Number
– Content (sequence)
– Location (map position)
– Gene order
– Gene cluster: genes that are part of a known metabolic pathway are found to exist as a group
– Synteny: colinearity of gene order
– Syntenic group (syntenic cluster): a conserved group of genes in the same order in two genomes
– Translocation: movement of a part of the genome from one position to another
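
The notion of a syntenic group (genes in the same order in two genomes) can be illustrated with a small sketch. It assumes each genome is given as a list of ortholog identifiers in map order, that every identifier occurs at most once per genome, and that only same-orientation runs are of interest; the function name `collinear_blocks` and the gene labels `g1`…`g9` are hypothetical.

```python
def collinear_blocks(order_a, order_b, min_len=2):
    """Return runs of genes that are adjacent and in the same order in both genomes.

    order_a, order_b: lists of ortholog identifiers in map order
    (assumed unique within each genome). Inverted runs are not detected.
    """
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current = [], []
    for gene in order_a:
        extends = (
            bool(current)
            and gene in pos_b
            and pos_b[gene] == pos_b[current[-1]] + 1
        )
        if extends:
            current.append(gene)
            continue
        # The run is broken: keep it if it is long enough, then start afresh.
        if len(current) >= min_len:
            blocks.append(current)
        current = [gene] if gene in pos_b else []
    if len(current) >= min_len:
        blocks.append(current)
    return blocks


# Hypothetical gene orders: g4-g5 is separated from g1-g3 in genome B,
# g6 has no ortholog in B, and g7 has moved to the other end.
genome_a = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
genome_b = ["g7", "g1", "g2", "g3", "g9", "g4", "g5"]
print(collinear_blocks(genome_a, genome_b))
```

Real synteny detection usually tolerates small gaps and inversions inside a block; the strict adjacency used here is a deliberate simplification.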

Structure of tryptophan operon (Dandekar et al., 1998)
– Numbers: gene number
– Arrows: direction of transcription
– //: dispersion of operon by 50 genes
– Domain fusion: trpD and trpG, trpF and trpC, trpB and trpA
– Genetically linked separate genes

Important observations with regard to gene order
– Order is highly conserved in closely related species but gets changed by rearrangements
– With greater evolutionary distance, the gene order of orthologous genes no longer corresponds
– Groups of genes with similar biochemical function tend to remain localized
  – e.g. genes required for synthesis of tryptophan (trp genes) in E. coli and other prokaryotes

Synteny
Refers to regions of two genomes that show considerable similarity in terms of
– sequence and
– conservation of the order of genes
and that are likely to be related by common descent.

COGs: Phylogenetic classification of proteins encoded in complete genomes

Pairwise genome comparison of protein homologs (symmetrical best hits)
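
A symmetrical (reciprocal) best hit is a pair of proteins, one from each genome, that are each other's top-scoring match. The sketch below shows the idea on made-up similarity scores; the function names, protein labels and scores are all assumptions for illustration, and in practice the scores would come from a BLAST or FASTA search in each direction.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) -> similarity score.
    On a tie, the first pair encountered is kept (an arbitrary choice).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def symmetrical_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: a2 and a3 both hit b2 best, but b2 prefers a3.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50, ("a2", "b2"): 120, ("a3", "b2"): 150}
b_vs_a = {("b1", "a1"): 198, ("b2", "a3"): 149, ("b2", "a2"): 118}
print(symmetrical_best_hits(a_vs_b, b_vs_a))
```

Requiring the hit to be best in both directions is what filters out one-sided matches such as a2 against b2 in this toy data.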

Integr8: CG site at EBI

Comparative Genomics Tools
– BLAST2
– MUMmer
– PipMaker
– AVID/VISTA
Comparisons and analyses at both the nucleic acid and the protein level

BLAST2
– Available at NCBI
– Input: GI or FASTA sequence (a range can be specified)
– Output:
  – Graphical
  – Alignment of 2 genomes

Genome Alignment Algorithm: MUMmer
– Developed by Dr. Steven Salzberg's group at TIGR
  – NAR (1999) 27:
  – NAR (2002) 30:
– Availability
  – Free
  – TIGR site

Features of MUMmer
– The algorithm assumes that the sequences are closely related
– Can quickly compare millions of bases
– Outputs:
  – Base-to-base alignment
  – Highlights the exact matches and differences in the genomes
  – Locates SNPs, large inserts, significant repeats, tandem repeats and reversals

Definitions are drawn from biology
– SNP: a single mutation surrounded by two matching regions
  – Regions of DNA where the 2 sequences have diverged by more than one SNP
– Large inserts: regions inserted into one of the genomes
  – Sequence reversals, lateral gene transfer
– Repeats: a form of duplication that has occurred in either genome
– Tandem repeats: regions of repeated DNA in immediate succession but with different copy numbers in different genomes
  – A repeat can occur 2.5 times

Techniques used in the MUMmer algorithm
– Compute suffix trees for every genome
– Longest Increasing Subsequence (LIS)
– Alignment using the Smith & Waterman algorithm
– Integration of these techniques for genome alignment

MUMmer: Steps in the alignment process
– Read two genomes
– Find the Maximal Unique Matches (MUMs) between the genomes
– Sort and order the MUMs using LIS
– Close the gaps in the alignment using SNPs, mutation regions, repeats and tandem repeats
– Output the alignment: the MUMs and the regions that do not match exactly

MUMmer steps
– Locating MUMs
– Sorting MUMs
– Closure with gaps
G1: ACTGATTACGTGAACTGGATCCA
G2: ACTCTAGGTGAAGTGATCCA

Genome1: ACTGATTACGTGAACTGGATCCA
Genome2: ACTCTAGGTGAAGTGATCCA

Alignment:
ACTGATTACGTGAACTGGATCCA
ACTC--TAGGTGAAGT-GATCCA
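
The short stretches between exact matches are closed by dynamic-programming alignment. The slides name the Smith & Waterman algorithm; the sketch below instead uses the closely related global (Needleman-Wunsch style) recurrence, which is a simplification chosen here because a gap region bounded by two matches is aligned end to end. The function name `global_align` and the scoring values (match +1, mismatch -1, gap -2) are assumptions, not MUMmer's actual parameters.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# The region between the matches GTGAA and GATCCA in the slide example:
# Genome1 has CTG there, Genome2 has GT.
print(global_align("CTG", "GT"))
```

With these assumed scores the region aligns as CTG over GT-, which agrees with the corresponding columns of the alignment shown on the slide.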

What is a MUM?
– A MUM is a subsequence that occurs exactly once in both genomes and is NOT part of any longer matching sequence
– The two characters that bound a MUM are always mismatches
– Principle: if a long matching sequence occurs exactly once in each genome, it is almost certainly part of the global alignment
GenA: tcgatcGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAcgactta
GenB: gcattaGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAtccagag
– Similar to BLAST & FASTA!
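
The MUM definition can be checked directly on short sequences. The sketch below is a deliberately naive quadratic-time search, not the suffix-tree method that MUMmer uses; the function names and the `min_len` parameter are assumptions for illustration. It reports each maximal exact match that occurs exactly once in both sequences as a tuple (start in A, start in B, length), with 0-based positions.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=5):
    """Naive search for maximal unique matches between a and b."""
    a, b = a.upper(), b.upper()
    mums = []
    prev = [0] * (len(b) + 1)  # prev[j + 1]: match length ending at a[i - 1], b[j]
    for i in range(len(a)):
        cur = [0] * (len(b) + 1)
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Extending the diagonal makes the match left-maximal by construction.
            cur[j + 1] = prev[j] + 1
            length = cur[j + 1]
            at_end = i + 1 == len(a) or j + 1 == len(b)
            right_maximal = at_end or a[i + 1] != b[j + 1]
            if length >= min_len and right_maximal:
                s = a[i - length + 1 : i + 1]
                if count_occurrences(a, s) == 1 and count_occurrences(b, s) == 1:
                    mums.append((i - length + 1, j - length + 1, length))
        prev = cur
    return sorted(mums)


gen_a = "tcgatcGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAcgactta"
gen_b = "gcattaGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAtccagag"
for start_a, start_b, length in find_mums(gen_a, gen_b, min_len=20):
    print(start_a, start_b, gen_a[start_a : start_a + length])
```

The quadratic table limits this sketch to toy inputs; suffix trees are what let the same definition be applied to whole genomes.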

Sorting & ordering MUMs
– MUMs are sorted according to their position in Genome A
– The order of the matching MUMs in Genome B is then considered
– The LIS algorithm locates the longest set of MUMs that occur in ascending order in both genomes
– This leads to a global MUM-alignment
Figure annotations: MUM5: transposition; MUM3: random match; inexact repeat
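
Once the MUMs are sorted by their position in Genome A, choosing the largest consistently ordered subset is a longest-increasing-subsequence problem over their Genome B positions. The sketch below is the textbook unweighted LIS with binary search; it ignores MUM lengths and overlaps, which a production aligner would need to consider, and the Genome B positions used in the example are invented.

```python
from bisect import bisect_left


def longest_increasing_subsequence(values):
    """Return the indices of one longest strictly increasing subsequence."""
    tail_idx = []   # tail_idx[k]: index of the smallest tail of a run of length k + 1
    tail_val = []   # the corresponding values, kept sorted for bisect
    prev = [-1] * len(values)
    for i, v in enumerate(values):
        k = bisect_left(tail_val, v)
        if k > 0:
            prev[i] = tail_idx[k - 1]
        if k == len(tail_val):
            tail_idx.append(i)
            tail_val.append(v)
        else:
            tail_idx[k] = i
            tail_val[k] = v
    # Walk the predecessor links back from the end of the longest run.
    chain = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:
        chain.append(i)
        i = prev[i]
    return chain[::-1]


# Hypothetical Genome B start positions of MUM1..MUM7 (already sorted by Genome A).
# MUM3 and MUM5 lie out of order in Genome B and should be dropped.
b_positions = [10, 22, 90, 35, 80, 48, 60]
kept = [i + 1 for i in longest_increasing_subsequence(b_positions)]
print("MUMs kept:", kept)
```

The MUMs left out of the chain are the candidates for transpositions or chance matches, as in the figure annotations on this slide.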

MUMmer Results
– 2 strains of M. tuberculosis: H37Rv & CDC1551
– Genome size: 4 Mb
– Time: 55 s
  – Generating suffix tree: 5 s
  – Sorting MUMs: 45 s
  – S&W alignment: 5 s

Alignment of M. tuberculosis strains CDC1551 (top) & H37Rv (bottom)
– Single green lines indicate SNPs
– Blue lines indicate insertions

Comparison of 2 Mycoplasma genomes: cousins that are distantly related
– M. genitalium: nt
– M. pneumoniae: ( )
– Analysis of proteins tells us that all M. genitalium proteins are present in M. pneumoniae
– Alignment was carried out using
  – FASTA (dividing each genome into 1000 bp segments)
  – All-against-all searches
  – Fixed length of pattern (25)
  – MUMmer (length = 25)
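
The fixed-length pattern comparison (25-mers) amounts to indexing every k-mer of one sequence and looking up every k-mer of the other; each shared k-mer gives one dot in a dot plot. A minimal sketch follows, assuming plain strings as input; the function name `kmer_matches` and the toy sequences are illustrative only.

```python
from collections import defaultdict


def kmer_matches(a, b, k=25):
    """Return sorted (pos_in_a, pos_in_b) pairs where the same k-mer starts."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i : i + k]].append(i)
    hits = []
    for j in range(len(b) - k + 1):
        # .get avoids creating empty entries for k-mers absent from a
        for i in index.get(b[j : j + k], ()):
            hits.append((i, j))
    return sorted(hits)


# Toy example with a short k so the result is easy to follow.
print(kmer_matches("ACGTACGT", "TACG", k=4))
```

Unlike the MUM approach, nothing here requires a match to be unique or maximal, so repeated k-mers produce many off-diagonal dots.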

Comparison of 2 Mycoplasma genomes
– Using FASTA
– Fixed-length patterns: 25mers
– MUMmer

Post-sequencing challenges
– Genome sequencing is just the beginning of appreciating biocomplexity
– Sequence-based function assignment approaches fail as the sequence similarity drops …
– Structure-based function prediction approaches are limited by the availability of structures and by the association of structural motifs with functional descriptors
– As a result, in any genome:
  – Genes with unknown function: ~60%
  – Genes with known function: ~40%