Mapping Genomes onto each other – Synteny detection CS 374 Aswath Manohar.

Necessity is the mother of invention
Genome sequencing has produced voluminous amounts of genomic data. The human genome has been completely sequenced, and the rat and mouse genomes have also been completed. What do we do with all this data?

Necessity…
All this data needs to be analyzed meaningfully, which has given rise to the field of comparative genomics: the identification of functional DNA through comparative methods. A large set of functional elements in the rat, human and mouse genomes remains uncharacterized (Pash: Kalafus et al.).

Analysis Methods
Standard dynamic programming alignment algorithms (Needleman-Wunsch, Smith-Waterman) are highly sensitive aligners, but they are computationally prohibitive: their running time grows with the product of the sequence lengths, which makes them impractical for the analysis of multiple mammalian genomes.
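To make the cost concrete, here is a minimal sketch of the Needleman-Wunsch global alignment score. The function name and the scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions, not parameters taken from the slides. The nested loops visit every one of the M*N cells, which is what becomes prohibitive at genome scale.

```python
def needleman_wunsch_score(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment score of s and t under a simple linear gap penalty.

    Scoring values are illustrative assumptions. Every one of the
    len(s) * len(t) cells is visited; only two rows are kept in memory.
    """
    m, n = len(s), len(t)
    # Row 0: aligning the empty prefix of s against prefixes of t.
    prev = [j * gap for j in range(n + 1)]
    for i in range(1, m + 1):
        curr = [i * gap] + [0] * n
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[n]
```

For two sequences of a few billion bases each, the M*N cell count is far beyond what this approach can handle, which motivates the anchoring methods on the following slides.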

Methods…
Faster implementations of dynamic programming, such as LAGAN (Brudno et al. 2003), work well at the megabase level but require prior information ('anchors') at the genomic scale. Seed-and-extend methods first determine a 'seed' (a hotspot) and then extend it on either side. Again, the extension step is computationally expensive.
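The seed-and-extend idea can be sketched as follows. This is a deliberately simplified illustration, not the procedure of any particular tool: seeds are exact k-mer matches, and extension is ungapped and stops at the first mismatch. The function name `seed_and_extend` and its parameters are hypothetical.

```python
def seed_and_extend(s, t, k):
    """Return sorted (start_in_s, start_in_t, length) exact matches grown from k-mer seeds.

    Simplifying assumption: extension is ungapped and exact. Real tools
    tolerate mismatches and gaps, which is where the extension cost comes from.
    """
    # Index every k-mer of s by its start positions.
    index = {}
    for i in range(len(s) - k + 1):
        index.setdefault(s[i:i + k], []).append(i)

    hits = set()
    for j in range(len(t) - k + 1):
        for i in index.get(t[j:j + k], []):
            # Extend the seed to the left while the bases agree.
            lo_s, lo_t = i, j
            while lo_s > 0 and lo_t > 0 and s[lo_s - 1] == t[lo_t - 1]:
                lo_s -= 1
                lo_t -= 1
            # Extend the seed to the right while the bases agree.
            hi_s, hi_t = i + k, j + k
            while hi_s < len(s) and hi_t < len(t) and s[hi_s] == t[hi_t]:
                hi_s += 1
                hi_t += 1
            hits.add((lo_s, lo_t, hi_s - lo_s))
    return sorted(hits)
```

Overlapping seeds on the same diagonal extend to the same region, so the set removes those duplicates.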

Pash
So what is the solution? Use positional hashing! Pash: Efficient Genome-Scale Sequence Anchoring by Positional Hashing. Authors: Ken Kalafus, Andrew Jackson and Aleksandar Milosavljevic.

Pash in figures

More formally…
The sequences S and T are conceptually divided into sub-sequences of length L:
$S_i = S[iL+1, \ldots, (i+1)L]$
$T_{i'} = T[i'L+1, \ldots, (i'+1)L]$
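In 0-indexed code, the 1-indexed range [i*L+1, ..., (i+1)*L] corresponds to the slice `seq[i*L:(i+1)*L]`. A minimal sketch, where the helper name `split_blocks` is hypothetical and dropping a trailing partial block is an assumption:

```python
def split_blocks(seq, L):
    """Split seq into consecutive sub-sequences of length L.

    Block i covers 1-indexed positions i*L+1 .. (i+1)*L.
    Assumption: a trailing block shorter than L is dropped.
    """
    return [seq[i * L:(i + 1) * L] for i in range(len(seq) // L)]
```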

Hashing
The single scoring matrix is divided into L diagonal matrices, and each of these is further divided into L 'diagonal segment' matrices, giving L² 'diagonal segment' matrices in total. One hash table is used for each 'diagonal segment' matrix, so the total number of hash tables is L².

Hashing…
Each k-mer is mapped to a bin in the hash table. The indices of the k-mer are stored in one of two linked lists per bin (one for each sequence). We assume an efficient hash function.

Hashing…
If both lists in a bin are non-empty, then the k-mer corresponding to that bin is a matching k-mer. Collating the matching k-mers involves a single traversal of each list.
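The binning step described above can be sketched with a single hash table. This is a simplified illustration under stated assumptions: a Python dict keyed by the k-mer itself stands in for the hash table (so distinct k-mers never share a bin), Python lists stand in for the linked lists, and the split into L² positional tables is omitted. The name `matching_kmers` is hypothetical.

```python
from collections import defaultdict


def matching_kmers(s, t, k):
    """Bin the k-mers of s and t; return the bins where both lists are non-empty.

    Each bin holds two position lists: one for s and one for t.
    """
    bins = defaultdict(lambda: ([], []))  # k-mer -> (positions in s, positions in t)
    for i in range(len(s) - k + 1):
        bins[s[i:i + k]][0].append(i)
    for j in range(len(t) - k + 1):
        bins[t[j:j + k]][1].append(j)
    # A k-mer matches when it occurs in both sequences.
    return {kmer: lists for kmer, lists in bins.items() if lists[0] and lists[1]}
```

Filling the table is one pass over each sequence, and reporting matches only requires walking the two lists of each bin once.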

Running time
Worst case? It occurs when an all-against-all comparison has to be performed, which takes O(M*N) time for sequences of lengths M and N. This is highly unrealistic.

Running time…
In practical applications, the output size is O(M+N). If k-mers of sufficient length are used, each of the L² hash tables is populated with (M+N)/L k-mers. Hence the running time is O((M+N)*L). With L nodes working in parallel, the running time is O(M+N).
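The arithmetic behind these bounds can be written out explicitly. This sketch assumes constant work per stored k-mer and an even distribution of the hash tables across the L nodes; neither assumption is stated on the slide.

```latex
\[
\underbrace{L^{2}}_{\text{hash tables}} \times
\underbrace{\frac{M+N}{L}}_{\text{k-mers per table}}
= (M+N)\,L
\qquad\Longrightarrow\qquad
\frac{(M+N)\,L}{L} = M+N \quad \text{per node, with } L \text{ nodes.}
\]
```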

Significance of Similarities
For each match found, Pash reports both the number of matching bases and a bit score that indicates significance. The bit score is calculated according to the Algorithmic Significance method.

Significance of Similarities…
The score is based on the number of bits saved in a minimal encoding of the target sequence X = T, given that the source is known:
$d = I_0(X) - I(X)$
$I_0(X) = 2n$ bits, where n is the number of bases in X (2 bits per base)

Kmer encoding…
To compute I(X), one of two encodings is chosen on a case-by-case basis, and a 1-bit flag denotes which one is used. Let w be the number of matching k-mers, and let W be the maximum possible number of k-mers in a match. Conceptually, W corresponds to the length of the diagonal and is constant.

Kmer encoding…
There are C(W,w) possible lists of matching k-mers, so uniquely identifying a k-mer set takes $\log_2 C(W,w)$ bits. The k-mer encoding length is therefore:
$I_w(X) = 1 + \log_2 W + \log_2 C(W,w)$ bits

Base encoding
Base encoding is very similar to k-mer encoding. Let b be the number of bases defined in a match, and let B be the maximum possible number of bases contained in a match.
$I_b(X) = 1 + \log_2 B + \log_2 C(B,b)$ bits

Significance of Similarities
$I_{min}(X) = \min(I_w(X), I_b(X))$
$I(X) = I_{min}(X) + 2(n-b)$ bits
Therefore, after combining and simplifying:
$d = 2b - I_{min}(X)$
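The formulas on the significance slides can be checked numerically with a short sketch. The function name `bit_score` and its argument order are hypothetical; the symbols n, w, W, b and B follow the slides. The code computes d directly as I_0(X) minus I(X), which makes it easy to confirm that the result equals 2b minus I_min(X) regardless of n.

```python
from math import comb, log2


def bit_score(n, w, W, b, B):
    """Bits saved, d = I_0(X) - I(X), following the formulas on the slides.

    n: bases in the target X; w of at most W k-mers match;
    b of at most B bases are covered by the match. Requires W >= 1 and B >= 1.
    """
    i_w = 1 + log2(W) + log2(comb(W, w))   # k-mer encoding length
    i_b = 1 + log2(B) + log2(comb(B, b))   # base encoding length
    i_min = min(i_w, i_b)
    i_0 = 2 * n                            # plain encoding: 2 bits per base
    i_x = i_min + 2 * (n - b)              # unmatched bases still cost 2 bits each
    return i_0 - i_x
```

For example, with w = W = 8 and b = B = 16, both binomial terms are 1, so I_w = 4 and I_b = 5, and the score is 2*16 - 4 = 28 bits for any n.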

Results
Pash was used to compare the latest assembly of the rat genome to the human and mouse assemblies. Each pairwise comparison took 4 days on 6 CPUs (24 CPU-days). The computers ran 750 MHz Pentium III processors. Peak RAM usage was approximately 500 MB.

Results…

Discussion
In contrast to seed-and-extend methods, Pash represents sequences as short k-mers rather than bases. It is efficiently parallelizable. For applications requiring base-pair-level alignments, Pash can be used as an anchoring module whose output is post-processed by programs like LAGAN, AVID or BLASTZ.

Availability
Available free of charge for academic use.

Thanks!