Efficient multiple genome comparison Mario Huerta

Efficient multiple genome comparison Mario Huerta http://revolutionresearch.uab.es

Genome Analysis
1) Whole genome analysis: A) Gene count; B) Gene classification; C) Repeat content; D) Chromosomal duplications
2) Multi-genome analysis: A) Synteny; B) Sequence similarity; C) Gene classification comparisons

Genome Alignment
Importance of genome alignment:
– Identify important matched and mismatched regions.
– "Matches" represent homolog pairs, conserved regions or long repeats.
– "Mismatches" represent foreign fragments inserted by transposition, sequence reversal or lateral transfer.
– Detect functional differences between pathogenic and non-pathogenic strains, evolutionary distance, mutations leading to disease, phenotypes, etc.
Problems:
– Requires large computational power, memory and execution time.
– Existing algorithms apply dynamic programming only to subsequences.
– Dynamic programming is computationally intensive to apply to whole sequences (O(n^2)).
– Thus it is applicable only to closely related genomes.

Whole Genome Alignment
Identify differences between organisms that may lead to an understanding of:
– How did the two organisms evolve? E.g. how are we different from chimps?
– Why do certain bacteria cause disease while their cousins do not? (Indels / HGT / mutations)
– Why are certain people more susceptible to disease than others? (SNPs)
– Identifying new drug targets (targets that are unique to the pathogen).

Why do we need a special algorithm for whole-genome alignment?
BLAST is suitable for local alignments against large databases.
Dynamic programming is suitable for pairwise alignment of small sequences.
We need an algorithm that:
– Scales up well (to millions of characters).
– Is able to detect large-scale changes.

Why not use standard pairwise methods?
We are looking at different features:
– For standard alignment: point mutations, insertions, deletions, etc.
– For genome alignment: transpositions, large insertions/deletions, syntenic blocks, etc.
Technical concerns:
– Time and space complexity, etc.

Information presented by whole genome alignment
Differences in repeat patterns:
– Duplications (large fragment, chromosomal).
– Tandem repeats.
Large insertions and deletions.
Translocations (moving from one part of the genome to another).
Single nucleotide polymorphisms.

Comparative Genome Analysis tools: MUMmer, MGA, Malgen, PipMaker, VISTA

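MUMmer
Whole genome alignments.
Compares closely related sequences.
Searches for the Maximal Unique Matching subsequences (MUMs): matches that occur exactly once in each genome and cannot be extended in either direction (the upper-case blocks below).
agctcgatGGGCTTTAGACTCTCGATAggcgcagagGCTCGCTAGAATCGCTAGATCac
agacctaaGGGCTTTAGACTCTCGATAagtctatccGCTCGCTAGAATCGCTAGATCta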

Segmentally duplicated regions in the Arabidopsis genome, detected using MUMmer (figure). Individual chromosomes are depicted as horizontal grey bars (with chromosome 1 at the top); centromeres are marked in black. Coloured bands connect corresponding duplicated segments. Source: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796-815 (2000).

Suffix tree
A compact data structure holding all suffixes of an input string S, merged along their common prefixes.
Each suffix of S can be located by a unique path from the root of the tree.
A suffix tree can be built in O(N) time, where N is the length of the string.
Locating a substring T (of length M) in S can be done in O(M) time.
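The O(M) lookup can be illustrated with a minimal sketch. Note the assumption: this builds an uncompressed suffix trie by naive insertion, which takes O(N^2) time and space, not the linear-time compact suffix tree the slide describes. The function names `build_suffix_trie` and `contains` are hypothetical and chosen only for this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (naive, O(N^2))."""
    root = {}
    text += "$"  # terminator so that every suffix ends at its own leaf
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Follow pattern from the root: one step per character, so O(M)."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("GATTACA")
print(contains(trie, "TTAC"))  # True
print(contains(trie, "TAG"))   # False
```

The lookup cost depends only on the pattern length because every substring of S is a prefix of some suffix of S, and all suffixes start at the root.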

Suffix tree

How MUMmer finds the MUMs
1. Build a suffix tree from all suffixes of genome A.
2. Insert every suffix of genome B into the suffix tree.
3. Label each leaf node with the genome it represents.

Suffix tree 4. The MUMs are the common prefixes that end in exactly two leaves, one from each original string.
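To make the definition of a MUM concrete, here is a brute-force sketch that applies it directly to the two example sequences from the MUMmer slide. It is an assumption-laden illustration, not MUMmer's method: it does not use a suffix tree and runs in roughly quadratic time or worse, so it only suits short strings. The names `find_mums`, `occurs_once` and the `min_len` threshold are hypothetical.

```python
def occurs_once(text, word):
    """True if word occurs exactly once in text (overlapping hits counted)."""
    first = text.find(word)
    return first != -1 and text.find(word, first + 1) == -1


def find_mums(a, b, min_len=15):
    """Return (pos_a, pos_b, length) for every maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k >= min_len:
                word = a[i:i + k]
                if occurs_once(a, word) and occurs_once(b, word):
                    mums.append((i, j, k))
    return mums


a = "agctcgatGGGCTTTAGACTCTCGATAggcgcagagGCTCGCTAGAATCGCTAGATCac".upper()
b = "agacctaaGGGCTTTAGACTCTCGATAagtctatccGCTCGCTAGAATCGCTAGATCta".upper()
mums = find_mums(a, b)
for pos_a, pos_b, length in mums:
    print(pos_a, pos_b, a[pos_a:pos_a + length])
```

The two upper-case blocks of the slide, starting at offsets 8 and 36 in both sequences, are among the matches reported.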

MUMmer pipeline (figure): suffix tree construction; Maximum Unique Matches; longest set of MUMs in the same order; close gaps longer than a limit by SW or NW.
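The "same order, longest set" step can be sketched as a longest increasing subsequence (LIS) problem: sort the MUMs by their position in genome A and keep the longest subset whose positions in genome B also increase. The sketch below is a simplification under stated assumptions: it ignores MUM lengths and overlaps, which a production tool would have to handle, and the function name `longest_ordered_mums` and the sample coordinates are made up for illustration.

```python
from bisect import bisect_left


def longest_ordered_mums(mums):
    """mums: list of (pos_a, pos_b, length). Returns the longest chain
    that is increasing in both pos_a and pos_b, in O(m log m) time."""
    mums = sorted(mums)          # order by position in genome A
    tails = []                   # tails[k]: index of last MUM of best chain of size k+1
    tail_keys = []               # pos_b of those MUMs, kept sorted for bisect
    prev = [None] * len(mums)    # back-pointers for reconstructing the chain
    for idx, (_, pos_b, _) in enumerate(mums):
        k = bisect_left(tail_keys, pos_b)
        if k == len(tails):
            tails.append(idx)
            tail_keys.append(pos_b)
        else:
            tails[k] = idx
            tail_keys[k] = pos_b
        prev[idx] = tails[k - 1] if k > 0 else None
    chain = []
    idx = tails[-1] if tails else None
    while idx is not None:
        chain.append(mums[idx])
        idx = prev[idx]
    return chain[::-1]


sample = [(0, 50, 20), (30, 10, 25), (60, 40, 20),
          (90, 70, 30), (130, 60, 20), (160, 100, 20)]
chain = longest_ordered_mums(sample)
print(chain)
```

MUMs left out of the chain, such as the first one in the sample, are candidates for rearrangements (for example translocations) rather than part of the collinear alignment.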

Malgen (figure): the green lines show MUMs; the red lines show MUMs between complementary substrings.

On-line search of MUMs using a sliding suffix tree
The algorithm finds the MUMs between the first sequence and the prefix of the second sequence read so far.
The space cost is linear in the length of the shorter sequence, and the time is linear in the length of the sequence pair.
This makes it possible to obtain the MUMs among a set of whole genomes in space that is still linear in the shorter sequence, and in time linear in the total length of the set of sequences.

Sliding suffix tree data structure
The suffix links in the data structure used by the suffix tree construction algorithm with slide nodes differ slightly from their classic definition. This difference provides the transverse path, the key concept of the MUMs On-Line algorithm.

The transverse path
Concatenating the t-prefixes yields the original sequence. These t-prefixes and their associated t-links form the transverse path.

The list of MUMs indexed by the sliding suffix tree
After the sliding suffix tree has been constructed, the list of MUMs is updated while the second sequence is read. In this process, the algorithm jumps between both structures, but without updating the suffix tree.

MUMs On-Line for pairwise alignment
Let S_1 and S_2 be genomic sequences; then the list of MUMs between them can be found on-line in linear time O(|S_1| + |S_2|) and linear space O(|S_1|).
Let S_1, ..., S_n be genomic sequences; then the list of MUMs between S_1 and the rest can be found in linear time O(|S_1| + ... + |S_n|) and linear space O(|S_1|).

Empirical results: MUMs On-Line vs MUMmer

Efficient Multiple Genome Alignment
Goal: to align more than two genomic sequences.
Global multiple alignment methods:
– Multidimensional dynamic programming: fairly slow.
– Divide-and-conquer alignment.
– Iterative pairwise alignment: ClustalW, MUMmer.
– Anchor-based multiple alignment: MGA, MUMs On-Line (Malgen).

Iterative Pairwise Alignment Strategy of iteratively merging two multiple alignments.

Iterative Pairwise Alignment (figure): pairwise distances are computed between the input sequences, a tree is built with UPGMA or Neighbor-Joining, and the sequences are then aligned iteratively in that order.
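The "pairwise distances, then UPGMA" stage can be sketched as follows. This is a toy under explicit assumptions: the sequences are short, already of equal length, and compared by the fraction of mismatching positions (`p_distance`), whereas real tools derive distances from pairwise alignments. The names `p_distance`, `upgma` and the sequences `s1` to `s4` are hypothetical. The result is only the merge order (a guide tree); the iterative alignment itself is not shown.

```python
from itertools import combinations


def p_distance(x, y):
    """Fraction of mismatching positions between two equal-length sequences."""
    return sum(c1 != c2 for c1, c2 in zip(x, y)) / len(x)


def upgma(names, dist):
    """dist maps frozenset({a, b}) to a distance. Returns a nested tuple
    describing the order in which clusters are merged."""
    clusters = {name: 1 for name in names}   # cluster label -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to the new cluster is the size-weighted average (UPGMA rule).
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


seqs = {"s1": "ATGCTTA", "s2": "ATGCTTG", "s3": "CTCATCG", "s4": "CTCATCA"}
dist = {frozenset((x, y)): p_distance(seqs[x], seqs[y])
        for x, y in combinations(seqs, 2)}
tree = upgma(list(seqs), dist)
print(tree)
```

A progressive aligner would then walk this tree from the leaves upward, aligning the closest sequences first and merging the resulting alignments.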

Anchor-based Multiple Alignment
Identify substrings that are likely to be part of the alignment: the anchors of the alignment. Then align the anchors and close the gaps.
The MUMmer algorithm can only obtain MUM anchors for a pairwise alignment.

Anchor-based Multiple Alignment (figure): find substrings that are likely to be part of the global alignment; select a maximal non-overlapping set of these strings; close the gaps (recursively or by DP).
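Closing a gap "by DP" can be illustrated with a small global alignment (Needleman-Wunsch) of the two unaligned stretches between consecutive anchors. This is a sketch under assumptions: a linear gap penalty and the scores `match=1`, `mismatch=-1`, `gap=-1` are arbitrary illustrative choices, and the function name `needleman_wunsch` is not taken from any of the tools discussed. The table has (r+1) x (s+1) cells for gap lengths r and s, which is why this is only applied to the short regions between anchors.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment of x and y. Returns (aligned_x, aligned_y, score)."""
    n, m = len(x), len(y)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay)), score[n][m]


aligned_x, aligned_y, best = needleman_wunsch("GATTACA", "GATACA")
print(aligned_x)
print(aligned_y)
print(best)  # 5: six matches and one gap
```

In an anchor-based aligner the inputs would be the sequence stretches lying between two consecutive anchors, so the quadratic cost applies per gap rather than to the whole genomes.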

MGA
Capable of aligning 3 or more genomes.
3 phases:
– Detect maximal multiple exact matches (multiMEMs).
– Compute anchors consisting of the longest non-overlapping sequence of multiMEMs.
– Close the gaps between the anchors.

MGA vs MUMmer
– Computation of MEMs is based on virtual suffix trees: less space and faster matching than the original suffix tree.
– Anchor finding: O(m^2) vs. O(m log m).
– Alignment of gaps: O(r·s) vs. O(e·min(r, s)).

Empirical results: MGA vs MUMmer
Phase 1: computing the MUMs of length >= L.
The time shown is only the time spent querying the structure, not the time needed to build it.

Empirical results: MGA vs MUMmer
Phases 2 and 3: closing the gaps.

MUMs On-Line for multiple alignment
Let S_1, ..., S_n be genomic sequences; then the list of MUMs between all of them can be found in linear time O(|S_1| + ... + |S_n|) and linear space O(|S_1|).

MUMs On-Line vs MGA
– MGA needs to store the structure of all the genomes to be compared. MUMs On-Line only needs the structure for the shorter sequence and the updated list of MUMs.
– The sliding suffix tree is much more efficient to construct than the virtual suffix tree.
– In MUMs On-Line you always know the MUMs between the genomes read so far. No post-processing step that queries the structure is needed to obtain the anchors.
– The MUMs describe the hidden structure of the compared genomes better.
– By splitting the shorter sequence, you can obtain the MUMs on a computer with very little memory.

Bibliography
– Efficient Space and Time multicomparison of genomes, Mario Huerta, Xavier Messeguer, Technical report LSI-02-64-R, Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya (2002).
– Suffix tree construction with slide nodes, Mario Huerta, Technical report LSI-02-63-R, Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya (2002).
– Identification of patterns in biological sequences at the ALGGEN server: PROMO and MALGEN, Domènec Farré, Mario Huerta, Romà Roset, José E. Adsuara, Llorenç Roselló, M. Mar Albà, and Xavier Messeguer, Nucleic Acids Research 31: 3651-3653 (2003).

Related bibliography
– Alignment of Whole Genomes, A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg, Nucleic Acids Research 27(11): 2369-2376 (1999).
– Fast Algorithms for Large-scale Genome Alignment and Comparison, A.L. Delcher, A. Phillippy, J. Carlton, and S.L. Salzberg, Nucleic Acids Research 30(11): 2478-2483 (2002).
– CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-specific Gap Penalties and Weight Matrix Choice, J.D. Thompson, D.G. Higgins, and T.J. Gibson, Nucleic Acids Research 22(22): 4673-4680 (1994).
– Efficient multiple genome alignment, Michael Hohl, Stefan Kurtz, and Enno Ohlebusch, ISMB 2002.
