Mapping Genomes onto each other – Synteny detection CS 374 Aswath Manohar.


1 Mapping Genomes onto each other – Synteny detection CS 374 Aswath Manohar

2 Necessity is the mother of invention
Genome sequencing has given rise to voluminous amounts of genomic data.
The human genome has been completely sequenced; the rat and mouse genomes have also been completed.
What do we do with all this data?

3 Necessity…
All this data needs to be analyzed meaningfully.
This need has given rise to the field of comparative genomics.
Functional DNA can be identified through comparative methods.
A large set of functional elements in the rat, human, and mouse genomes remains uncharacterized (Pash: Kalafus et al).

4 Analysis Methods
Standard dynamic programming alignment algorithms: Needleman-Wunsch, Smith-Waterman.
These are highly sensitive aligners.
They are computationally prohibitive at this scale: their running time grows with the product of the sequence lengths, so they cannot practically be applied to the analysis of multiple mammalian genomes.

5 Methods…
Faster implementations of dynamic programming, such as LAGAN (Brudno et al 2003).
These work well at the megabase level, but require prior information (‘anchors’) at the genomic scale.
Seed-and-extend methods: a ‘seed’ (hotspot) is determined, then extended on either side.
Again, the extension step is computationally expensive.

6 Pash
So what is the solution?
Use positional hashing!
Pash: Efficient Genome-Scale Sequence Anchoring by Positional Hashing
Authors: Ken Kalafus, Andrew Jackson and Aleksandar Milosavljevic

7 Pash in figures

8 More formally…
The sequences S and T are conceptually divided into subsequences of length L:
$S_i = S[iL+1, \ldots, (i+1)L]$
$T_{i'} = T[i'L+1, \ldots, (i'+1)L]$
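
The division into length-L subsequences can be sketched as below. This is only an illustration: the function name `split_into_blocks` and the example sequence are made up here, and Python's 0-based slices stand in for the 1-based positions used on the slide.

```python
def split_into_blocks(seq: str, L: int) -> list[str]:
    """Return consecutive, non-overlapping blocks of length L.

    Block i covers the slide's 1-based positions i*L+1 .. (i+1)*L.
    The last block may be shorter if len(seq) is not a multiple of L.
    """
    return [seq[start:start + L] for start in range(0, len(seq), L)]


# Hypothetical toy sequence, not taken from the presentation.
S = "ACGTACGTAC"
blocks = split_into_blocks(S, 4)
print(blocks)
```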

9 Hashing
The single scoring matrix is divided into L diagonal matrices.
Each of these is further divided into L ‘diagonal segment’ matrices, giving L² ‘diagonal segment’ matrices in total.
One hash table is used for each ‘diagonal segment’ matrix.
Therefore, the total number of hash tables is L².
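
One way to read the L² count is sketched below. It rests on an assumption the slide does not state: that table (a, b) receives the k-mers of S starting at positions congruent to a modulo L and the k-mers of T starting at positions congruent to b modulo L. The function name `assign_tables` is hypothetical, and positions near the sequence ends that cannot start a full k-mer are ignored for simplicity.

```python
def assign_tables(M: int, N: int, L: int) -> dict[tuple[int, int], int]:
    """Count start positions per (a, b) table under the assumed scheme.

    a is a residue of positions in S (length M), b a residue of positions
    in T (length N). This is an interpretation, not Pash's documented layout.
    """
    counts = {}
    for a in range(L):
        n_s = len(range(a, M, L))      # positions s in S with s % L == a
        for b in range(L):
            n_t = len(range(b, N, L))  # positions t in T with t % L == b
            counts[(a, b)] = n_s + n_t
    return counts


tables = assign_tables(M=12, N=8, L=4)
print(len(tables), set(tables.values()))
```

Under this reading there are L² tables and, when M and N are multiples of L, each holds (M+N)/L positions.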

10 Hashing…
Each k-mer is mapped to a bin in the hash table.
The indices of the k-mer are stored in one of two linked lists in that bin (one for each sequence).
We assume an efficient hash function.

11 Hashing…
If both lists in a bin are non-empty, the k-mer corresponding to that bin is a matching k-mer.
Collating the matching k-mers requires a single traversal of each list.
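
The bin-with-two-lists idea from the hashing slides can be sketched as follows. This is a simplified illustration, not Pash's implementation: a Python dictionary stands in for the hash table, Python lists for the linked lists, a single table is used instead of L² tables, and the name `matching_kmers` and the toy sequences are made up.

```python
from collections import defaultdict


def matching_kmers(S: str, T: str, k: int) -> dict[str, tuple[list[int], list[int]]]:
    """Map each k-mer found in both S and T to its start positions in each."""
    # Each bin holds two lists: positions in S and positions in T.
    bins = defaultdict(lambda: ([], []))
    for i in range(len(S) - k + 1):
        bins[S[i:i + k]][0].append(i)
    for j in range(len(T) - k + 1):
        bins[T[j:j + k]][1].append(j)
    # A k-mer matches when both lists in its bin are non-empty.
    return {kmer: lists for kmer, lists in bins.items() if lists[0] and lists[1]}


hits = matching_kmers("ACGTAC", "TTACGG", 3)
for kmer, (s_pos, t_pos) in sorted(hits.items()):
    print(kmer, s_pos, t_pos)
```

Filling the bins takes one pass over each sequence, and collecting the matches takes one pass over the bins.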

12 Running time
Worst case? An all-against-all comparison has to be performed: O(M*N), where M and N are the lengths of the two sequences.
This is highly unrealistic.

13 Running time…
In practical applications, the output size is O(M+N).
If k-mers of sufficient length are used, each of the L² hash tables is populated with (M+N)/L k-mers.
Hence the running time is O((M+N)*L).
With L nodes working in parallel, the running time is O(M+N).
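
The step from table size to running time can be written out as follows, assuming (as the slide implies but does not state) that the work per table is linear in the number of k-mers it holds and that the tables are spread evenly over the L nodes:

```latex
\underbrace{L^{2}}_{\text{tables}} \cdot \underbrace{\frac{M+N}{L}}_{\text{k-mers per table}}
  = (M+N)\,L ,
\qquad
\frac{(M+N)\,L}{L\ \text{nodes}} = M+N .
```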

14 Significance of Similarities
For each sequence found, Pash reports both the number of matching bases and a bit score that indicates significance.
The bit score is calculated according to the Algorithmic Significance method.

15 Significance of Similarities…
The score is based on the number of bits saved in a minimal encoding of the target sequence X = T, given that the source is known.
$d = I_0(X) - I(X)$
$I_0(X) = 2n$ bits (2 bits for each of the n bases of X)

16 Kmer encoding…
To compute I(X), one of two encodings is used on a case-by-case basis.
A 1-bit flag denotes which encoding is used.
Let w be the number of matching k-mers.
Let W be the maximum possible number of k-mers in a match.
Conceptually, W corresponds to the length of the diagonal and is constant.

17 Kmer encoding…
There are C(W,w) possible lists of matching k-mers.
To uniquely identify a k-mer set we need $\log_2 C(W,w)$ bits.
Therefore, the k-mer encoding length $I_w(X)$ is:
$I_w(X) = 1 + \log_2 W + \log_2 C(W,w)$ bits

18 Base encoding
Base encoding is very similar to k-mer encoding.
Let b be the number of bases defined in a match.
Let B be the maximum possible number of bases contained in a match.
$I_b(X) = 1 + \log_2 B + \log_2 C(B,b)$ bits

19 Significance of Similarities
Therefore $I_{\min}(X) = \min(I_w(X), I_b(X))$
$I(X) = I_{\min}(X) + 2(n-b)$ bits
Therefore, after combining with $I_0(X) = 2n$ and simplifying,
$d = I_0(X) - I(X) = 2n - I_{\min}(X) - 2(n-b) = 2b - I_{\min}(X)$
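
The formulas on the last few slides can be combined into a small calculation. The sketch below follows the slide formulas literally; the function names `encoding_bits` and `bit_score` and the example values for b, B, w and W are invented for illustration and are not taken from Pash itself.

```python
from math import comb, log2


def encoding_bits(total: int, chosen: int) -> float:
    """Slide formula: 1 flag bit + log2(total) + log2 C(total, chosen)."""
    return 1 + log2(total) + log2(comb(total, chosen))


def bit_score(b: int, B: int, w: int, W: int) -> float:
    """d = 2*b - I_min(X), with I_min the cheaper of the two encodings."""
    i_w = encoding_bits(W, w)  # k-mer encoding length I_w(X)
    i_b = encoding_bits(B, b)  # base encoding length I_b(X)
    return 2 * b - min(i_w, i_b)


# Hypothetical match: 40 of 64 possible bases, 5 of 8 possible k-mers.
d = bit_score(b=40, B=64, w=5, W=8)
print(f"bit score d = {d:.2f}")
```

In this toy case the k-mer encoding is the cheaper one, so it determines $I_{\min}(X)$.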

20 Results
Pash was used to compare the latest assembly of the rat genome to the human and mouse genomes.
Each pairwise comparison took 4 days on 6 CPUs = 24 CPU-days.
The computers ran 750 MHz Pentium III processors.
Peak RAM usage was approximately 500 MB.

21 Results…

22 Discussion
In contrast to seed-and-extend methods, Pash represents sequences as short k-mers rather than bases.
It is efficiently parallelizable.
For applications requiring base-pair-level alignments, Pash can be used as an anchoring module.
Its output can in turn be post-processed by programs like LAGAN, AVID or BLASTZ.

23 Availability
Available free of charge for academic use.
http://www.br1.bcm.tmc.edu

24 Thanks!

