The short-read alignment in distributed memory environment

Andrzej Dorobisz, Paweł Russek, Kazimierz Wiatr

Description of the problem

The task of short-read alignment, which comes from genomics, is to find the best match for very short DNA sequences (short reads) in a lengthy reference genome. The short reads and the genome are represented as text over the four-letter alphabet {'A', 'C', 'G', 'T'}, where each letter denotes a base pair (bp) of the DNA sequence. Typical parameters are as follows: the reference genome has a length GLEN of a few Gbp, and a single run can comprise 50-100 million short reads of 32-100 bp each. The main difficulty is that inexact matching is accepted: a short read can be edited to fit its location in the genome. Three edit operations are allowed:

- insertion (CGAT -> CGACT)
- deletion (CGAT -> CGT)
- replacement (CGAT -> CGCT)

Our goal was to propose a method of distributing the workload and data in a computer cluster that minimizes data redundancy and inter-node communication.

Distribution of data and work

The key concept of our solution is to distribute the trie of genome substrings (described in the Algorithm section) among one root process and many worker processes. Figure 1 shows one root and four workers (W=4). We assign the trie's root node ('_') and the top Ld trie levels to the root process (Ld=1 in Figure 1). The remaining subtrees are divided among the worker processes in such a way that each leaf of the root's sub-trie is the root node of a sub-trie stored by a single worker process.

The root process starts the procedure for each short read. It traces all necessary paths in its own sub-trie and delegates processing to the corresponding worker processes when its leaf nodes are reached. A worker receives the rest of the pattern, p = p_pos p_pos+1 ... p_n, together with k, and continues the backtracking procedure in its sub-trie. After processing, each worker sends its results back to the root, which stores them in the result file. With this method both data and work are distributed, so the problem is effectively parallelized.
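The partitioning described above can be illustrated with a small Python sketch (the authors' implementation is in C++11 with MPI and is not shown on the poster). The function names `assign_prefixes` and `windows_for_worker`, the round-robin assignment of sub-tries to workers, the convention that workers are numbered 1..W, and the inclusion of shorter windows at the end of the genome are all assumptions made for illustration; the poster does not state how sub-tries are mapped to workers.

```python
from itertools import product

ALPHABET = "ACGT"


def assign_prefixes(ld, workers):
    """Map every length-ld prefix (a leaf of the root's sub-trie) to a worker.

    Workers are numbered 1..workers; the round-robin policy is an assumption.
    """
    prefixes = ("".join(letters) for letters in product(ALPHABET, repeat=ld))
    return {prefix: 1 + i % workers for i, prefix in enumerate(prefixes)}


def windows_for_worker(genome, lmax, ld, owner, rank):
    """Return (position, substring) pairs that belong to the given worker.

    Each window has length lmax, except near the end of the genome where it
    is shorter (an assumption of this sketch). Positions are 0-based.
    """
    windows = []
    for start in range(len(genome) - ld + 1):
        window = genome[start:start + lmax]
        if owner[window[:ld]] == rank:
            windows.append((start, window))
    return windows


# Configuration of Figure 1: Ld=1, W=4, Lmax=4.
genome = "AGCATGCTGCAGTCATGCTTAGGCTA"
owner = assign_prefixes(ld=1, workers=4)
for rank in range(1, 5):
    share = windows_for_worker(genome, lmax=4, ld=1, owner=owner, rank=rank)
    print(rank, len(share))
```

With Ld=1 and W=4 each worker owns exactly one of the sub-tries rooted at 'A', 'C', 'G', and 'T'. Each worker would then build its sub-trie only from its own windows, so no substring is stored twice.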
Simplified method

If we assume that no edit operation occurs in the first few letters of a read, the distribution of work becomes very simple. The root process performs exact matching on this prefix and then sends the short read to exactly one worker, which continues processing with inexact matching. We have implemented this solution in sequential and distributed versions, using C++11 and, for the distributed version, the MPI library.

[Figure: Example alignment of the CGAT short read against the ACCGACTTCGTCGCGCTTA reference with one error allowed.]

Figure 1: An example of the trie for the AGCATGCTGCAGTCATGCTTAGGCTA reference genome and the partitioning of its nodes among MPI processes.

Algorithm

Since the maximum length of short reads is limited, we build a trie of all substrings of length Lmax of the reference genome, where Lmax is the maximum length of a short read. An example of the trie for Lmax=4 is given in Figure 1. Inexact matching in the trie is performed by the recursive procedure MismatchRec:

 1: procedure MismatchRec(p, pos, cur, k)
 2:   if cur is NULL then return ∅
 3:   end if
 4:   if pos == |p| + 1 then return SS(cur)
 5:   end if
 6:   R ← MismatchRec(p, pos+1, next(cur, p_pos), k)
 7:   if k > 0 then
 8:     for all x ∈ {A, C, G, T}, x ≠ p_pos do
 9:       R ← R ∪ MismatchRec(p, pos+1, next(cur, x), k−1)
10:     end for
11:     for all x ∈ {A, C, G, T} do
12:       R ← R ∪ MismatchRec(p, pos, next(cur, x), k−1)
13:     end for
14:     R ← R ∪ MismatchRec(p, pos+1, cur, k−1)
15:   end if
16:   return R
17: end procedure

In the presented algorithm, p = p_1 p_2 ... p_n is the searched pattern, pos is the current position in p, cur is the current trie node, and k is the number of allowed edit operations. The next(node, letter) function returns the descendant of node that corresponds to the given letter, and SS(cur) returns the positions of all substrings passing through the node cur. The procedure is a backtracking search in the trie.
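As an illustration only, MismatchRec can be sketched in Python as follows (the authors' implementation is in C++11 and is not shown on the poster). The names `Node`, `build_trie`, and `mismatch_rec` are hypothetical, positions are 0-based, and every node stores the start positions of all substrings passing through it, which is a simple but memory-hungry way to provide SS(cur). The sketch also assumes the trie is built deep enough for the read length plus k, so that paths containing inserted letters fit.

```python
ALPHABET = "ACGT"


class Node:
    """Trie node; positions holds the start of every substring through it."""
    __slots__ = ("children", "positions")

    def __init__(self):
        self.children = {}
        self.positions = []


def build_trie(genome, depth):
    """Insert genome[i:i+depth] for every i (shorter at the genome's end)."""
    root = Node()
    for start in range(len(genome)):
        node = root
        node.positions.append(start)
        for letter in genome[start:start + depth]:
            node = node.children.setdefault(letter, Node())
            node.positions.append(start)
    return root


def mismatch_rec(p, pos, cur, k):
    """Return start positions where p matches with at most k edit operations."""
    if cur is None:                       # path does not exist in the trie
        return set()
    if pos == len(p):                     # whole pattern consumed
        return set(cur.positions)
    # Unedited letter (line 6 of the pseudocode).
    result = mismatch_rec(p, pos + 1, cur.children.get(p[pos]), k)
    if k > 0:
        for x in ALPHABET:                # replacements (line 9)
            if x != p[pos]:
                result |= mismatch_rec(p, pos + 1, cur.children.get(x), k - 1)
        for x in ALPHABET:                # insertions (line 12)
            result |= mismatch_rec(p, pos, cur.children.get(x), k - 1)
        # Deletion of the current pattern letter (line 14).
        result |= mismatch_rec(p, pos + 1, cur, k - 1)
    return result


# The alignment example from the poster: CGAT with one error allowed.
root = build_trie("ACCGACTTCGTCGCGCTTA", depth=5)
print(sorted(mismatch_rec("CGAT", 0, root, 1)))  # [2, 8, 13]
```

The three reported positions correspond to the three edit operations listed in the problem description: CGACT at position 2 (insertion), CGT at position 8 (deletion), and CGCT at position 13 (replacement).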
At each trie node, all possible paths are checked: the unedited pattern (line 6), three replacements of the letter (line 9), four insertions (line 12), and the deletion of the letter (line 14).

Results

Our solution was tested on the Zeus cluster at ACC "CYFRONET". We measured the time of trie construction, data distribution, and alignment. Lmax was set to 100, and we assumed an exact prefix of length three. Results are given in Tables 1, 2, and 3.

Table 1. Data distribution time [s]

GLEN          50 000    100 000    1 000 000
sequential         -          -            -
W=4             0.26       0.24         2.17
W=20            0.35       0.45         2.55

Table 2. Trie construction time [s]

GLEN          50 000    100 000    1 000 000
sequential      3.07       6.09        60.88
W=4             1.06       2.17        26.63
W=20            0.29       0.51         5.26

Table 3. Overall short-read alignment time (2 000 reads and 100 kbp genome) [s]

mismatches         3          4            5
sequential      3.35      34.46            -
W=4             0.90       9.75
W=20            1.09       2.71        14.11

The tests confirmed that our solution works. The communication cost is small compared with the acceleration gained: for GLEN = 1 000 000, trie construction takes 5.26 s with W=20 versus 60.88 s sequentially, while data distribution takes 2.55 s. Both phases, trie construction and short-read alignment, were effectively parallelized. To summarize, we achieved our goal and showed that the short-read alignment problem can be run effectively in a distributed memory environment.

Acknowledgements

This work is supported by the PLGrid Core project no. POIG.02.03.00-12-137/13.