Published by Hubert Doyle. Modified over 9 years ago.
Experiences for computational biology on CUDA. Chun-Yuan Lin, Assistant Professor, Department of Computer Science and Information Engineering, Chang Gung University. GPU Workshop, 2015/12/13.
Introduction (1) The rapidly increasing power of the GPU (Graphics Processing Unit) and its streaming architecture open up a range of new possibilities for a variety of applications. Previous work on GPGPU (General-Purpose computation on GPUs) has shown the design and implementation of algorithms for non-graphics applications (scientific computing, computational geometry, image processing, bioinformatics, etc.).
Introduction (2) Some bioinformatics applications have already been ported to GPGPU successfully. Liu et al. (IPDPS 2006) implemented the Smith-Waterman algorithm (sequence alignment problem) to run on the nVidia GeForce 6800 GTO and GeForce 7800 GTX, and reported an approximate 16× speedup by computing the alignment score of multiple cells simultaneously. Charalambous et al. (LNCS 2005) ported an expensive loop from RAxML, an application for phylogenetic tree construction, and achieved a 1.2× speedup on the nVidia GeForce 5700 LE.
Introduction (3) Sequence alignment. DNA/RNA sequences: 4-letter alphabet (ATGC, AUGC). Protein sequences: 20-letter alphabet (or 23-letter alphabet). High sequence similarity usually implies functional or structural similarity.
Introduction (11) An evolutionary tree represents the evolutionary history of a set of species and helps biologists study existing species or evaluate their relationships in the taxonomy. The real evolutionary histories (trees), including the root and internal nodes, are unknown in practice. The majority of tree construction methods or models are based on two kinds of input: the sequences and the distance matrix. However, most optimization problems for evolutionary tree construction have been shown to be NP-hard.
Introduction (14) Liu et al. (IEEE TPDS 2007) presented a GPGPU approach to high-performance biological sequence alignment based on commodity PC graphics hardware, implemented in C++ and the OpenGL Shading Language (GLSL). It covers pairwise sequence alignment (Smith-Waterman algorithm, database scanning, no backtracking) and multiple sequence alignment (MSA). (Figure from Liu et al., TPDS 2007; intra-task parallel)
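The "scan database, no backtrack" mode only needs the best local score for each database sequence, so a single row of the alignment matrix is enough. A minimal Python sketch of that idea follows; the match/mismatch values, the linear gap penalty and the names `sw_score` and `scan_database` are illustrative assumptions, not taken from Liu et al.

```python
def sw_score(query, subject, match=2, mismatch=-1, gap=1):
    """Best Smith-Waterman local alignment score, without traceback.

    Assumed scoring (illustrative only): +match / mismatch per residue
    pair and a linear penalty of `gap` per gapped position.
    """
    prev = [0] * (len(subject) + 1)    # previous row of the DP matrix
    best = 0
    for q in query:
        curr = [0]                     # first column is always 0
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            up = prev[j] - gap         # move down from the row above
            left = curr[j - 1] - gap   # move right within the current row
            cell = max(0, diag, up, left)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def scan_database(query, database):
    """Score the query against every database sequence, best score first."""
    scores = [(sw_score(query, s), s) for s in database]
    return sorted(scores, key=lambda t: -t[0])
```

Because only `prev` and `curr` are kept, memory grows with the subject length rather than with the full matrix, which is what makes score-only scanning attractive on memory-constrained hardware.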
(Figure from Liu et al., TPDS 2007)
(Figure from Liu et al., TPDS 2007)
CUDA (1) CUDA (Compute Unified Device Architecture) is an extension of C/C++ that enables users to write scalable multi-threaded programs for CUDA-enabled GPUs. A CUDA program contains a kernel: code written in a sequential, per-thread style that is executed by many threads in parallel on the GPU. Readable and writable global memory (e.g. 1 GB): the effective bandwidth of global memory depends heavily on the memory access pattern (coalesced access). Readable and writable per-thread local memory (16 KB per thread): access to local memory is as expensive as access to global memory.
CUDA (2) Read-only constant memory (64 KB, cached, 8 KB of cache per multiprocessor): the reading cost scales with the number of different addresses read by all threads, and reading from constant memory can be as fast as reading from a register. Read-only texture memory (as large as global memory, cached, 8 KB of cache per multiprocessor): reading from texture memory is generally faster than reading from global or local memory. Readable and writable per-block shared memory (16 KB per block): shared memory is divided into equally sized banks that can be accessed simultaneously. Readable and writable per-thread registers (e.g. 8192 per block): the fastest memory.
(Figure: CUDA memory hierarchy. A grid contains blocks, each with its own shared memory; each thread has its own registers and local memory; global, constant and texture memory belong to the grid and are accessible from the host. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign; from Schatz et al., BMC Bioinformatics 2007)
CUDA (3) Several bioinformatics applications have now been ported to CUDA successfully. Smith-Waterman algorithm (database scanning, no alignment results): Manavski and Valle (BMC Bioinformatics 2008), Striemer and Akoglu (IPDPS 2009), Liu et al. (BMC Research Notes 2009). Multiple sequence alignment (ClustalW): Liu et al. (IPDPS 2009) for neighbor-joining tree construction, Liu et al. (ASAP 2009). Pattern matching (MUMmerGPU): Schatz et al. (BMC Bioinformatics 2007).
CUDA- Smith-Waterman algorithm (1) Manavski and Valle present the first solution (a CUDA solution) based on commodity hardware that efficiently computes the exact Smith-Waterman alignment. It runs from 2 to 30 times faster than any previous implementation on general-purpose hardware. (Figure from Schatz et al., BMC Bioinformatics 2007; inter-task parallel)
CUDA- Smith-Waterman algorithm (2) A query profile, parallel to the query sequence, is pre-computed for each possible residue. In the CUDA implementation, each GPU thread computes the whole alignment of the query sequence with one database sequence (the database sequences are pre-sorted by length). The sorted database is stored in global memory, while the query profile is stored in texture memory. For each alignment, the matrix is computed column by column, parallel to the query sequence (the columns are stored in the local memory of the thread).
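The query-profile idea can be sketched on the CPU as follows. This is a hedged Python illustration, not Manavski and Valle's code: the toy +2/-1 scoring, the linear gap penalty and all function names are assumptions. Each call to `sw_score_with_profile` stands in for the work of one GPU thread.

```python
def build_query_profile(query, alphabet, score):
    """profile[r][i] is the score of aligning residue r with query[i]."""
    return {r: [score(r, q) for q in query] for r in alphabet}


def sw_score_with_profile(profile, query_len, subject, gap=1):
    """Score-only Smith-Waterman, computed one column per subject residue."""
    col = [0] * (query_len + 1)        # previous column of the matrix
    best = 0
    for s in subject:
        scores = profile[s]            # one profile lookup per column
        new_col = [0]
        for i in range(1, query_len + 1):
            cell = max(0,
                       col[i - 1] + scores[i - 1],   # diagonal move
                       col[i] - gap,                 # from the previous column
                       new_col[i - 1] - gap)         # from the cell above
            new_col.append(cell)
            best = max(best, cell)
        col = new_col
    return best


def toy_score(a, b):
    """Illustrative scoring only: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


def scan_sorted_database(query, database, alphabet="ACGT"):
    """Pre-sort the database by length, then score every sequence."""
    profile = build_query_profile(query, alphabet, toy_score)
    ordered = sorted(database, key=len)   # neighbours get similar workloads
    return [(s, sw_score_with_profile(profile, len(query), s)) for s in ordered]
```

The profile replaces a per-cell substitution-matrix lookup with a per-column one, and sorting by length keeps the sequences handled by neighbouring threads close in cost.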
(Figure from Schatz et al., BMC Bioinformatics 2007; no backtrack) The GPU is able to read and write up to 128 bits of local memory with a single instruction.
(Figure from Schatz et al., BMC Bioinformatics 2007) CUPS: cell updates per second.
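As a worked illustration of the CUPS metric, one database scan updates (query length × total database residues) matrix cells. The helper below is a small sketch; the function name and the numbers in the usage note are hypothetical and not measurements from any of the cited papers.

```python
def gcups(query_length, database_residues, seconds):
    """Giga cell updates per second (GCUPS) for one database scan."""
    cells = query_length * database_residues   # matrix cells updated
    return cells / seconds / 1e9
```

For example, a 500-residue query scanned against 100 million database residues in 25 seconds corresponds to `gcups(500, 100_000_000, 25.0)`, i.e. 2 GCUPS.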
CUDA- Smith-Waterman algorithm (3) Striemer and Akoglu further studied the effect of memory organization and the instruction set architecture on GPU performance. They noted that, for both single- and dual-GPU configurations, Manavski's implementation uses the help of an Intel quad-core processor by distributing the workload among the GPU(s) and the quad-core processor. They pointed out that the query profile in Manavski's method has a major drawback in its use of the texture memory of the GPU: it leads to unnecessary cache misses (when it is larger than 8 KB). Long-sequence problem. (inter-task parallel)
CUDA- Smith-Waterman algorithm (4) They placed the substitution matrix in constant memory to exploit the constant cache, and created an efficient cost function to access it (the modulo operator (%) is extremely inefficient in CUDA, so no hash function is used). The substitution matrix needs to be rearranged in alphabetical order. They mapped the query sequence as well as the substitution matrix to constant memory. They calculated the SW score from the query sequence and database sequences column by column, four cells at a time, because of the limited size of shared memory.
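One way to read "rearranged in alphabetical order" is that the matrix index can be computed from the character codes alone, with no hashing and no modulo. The Python sketch below assumes a flat 26 × 26 layout indexed by letter; that layout, the toy scores and the function names are assumptions for illustration, not Striemer and Akoglu's actual cost function.

```python
ALPHABET_SIZE = 26  # assumed layout: one row and column per letter 'A'..'Z'


def build_flat_matrix(pair_scores, default=-1):
    """Flatten a {(a, b): score} dict into a 26*26 list in alphabetical order."""
    flat = [default] * (ALPHABET_SIZE * ALPHABET_SIZE)
    for (a, b), s in pair_scores.items():
        ia, ib = ord(a) - ord("A"), ord(b) - ord("A")
        flat[ia * ALPHABET_SIZE + ib] = s
        flat[ib * ALPHABET_SIZE + ia] = s   # keep the matrix symmetric
    return flat


def lookup(flat, a, b):
    """Score of residues a and b using a subtraction, a multiply and an add."""
    return flat[(ord(a) - 65) * ALPHABET_SIZE + (ord(b) - 65)]
```

The address arithmetic in `lookup` uses only integer operations that are cheap on a GPU, which is the point of avoiding a hash-based or modulo-based mapping.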
CUDA- Smith-Waterman algorithm (5) After the alignment is complete, the score is written to global memory. They pointed out that the main drawback of the GPU is its limited on-chip memory (its use needs to be designed carefully).
(Figure from Striemer and Akoglu, IPDPS 2009)
CUDA- Smith-Waterman algorithm (6) Liu et al. proposed CUDASW++, implemented in two versions: a single-GPU version and a multi-GPU version. The alignment can be computed in minor-diagonal order from the top-left corner to the bottom-right corner of the alignment matrix. The optimal local alignment of a query sequence and a subject sequence is considered a task. Inter-task parallelization: each task is assigned to exactly one thread, and dimBlock tasks are performed in parallel by different threads in a thread block. Intra-task parallelization: each task is assigned to one thread block, and all dimBlock threads in the thread block cooperate to perform the task in parallel.
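Minor-diagonal order works because every cell on diagonal d (all cells with i + j = d) depends only on cells of diagonals d-1 and d-2, so the cells of one diagonal can be updated independently. A sequential Python sketch of that traversal, with illustrative scoring values and a linear gap penalty as assumptions:

```python
def sw_score_by_diagonals(query, subject, match=2, mismatch=-1, gap=1):
    """Score-only Smith-Waterman, filling the matrix one minor diagonal at a time."""
    m, n = len(query), len(subject)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for d in range(2, m + n + 1):           # diagonal index d = i + j
        i_lo, i_hi = max(1, d - n), min(m, d - 1)
        # The iterations of this inner loop are independent of each other;
        # in an intra-task scheme they could be shared among the threads
        # of one block.
        for i in range(i_lo, i_hi + 1):
            j = d - i
            sub = match if query[i - 1] == subject[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
            best = max(best, H[i][j])
    return best
```

The full matrix is kept here only for clarity; three diagonals would suffice for a score-only computation.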
CUDA- Smith-Waterman algorithm (7) Inter-task parallelization occupies more device memory but achieves better performance than intra-task parallelization. Intra-task parallelization occupies significantly less device memory and can therefore support longer query/subject sequences (two-stage implementation; the threshold is set to 3,072). To achieve high efficiency for inter-task parallelization, the runtime of all threads in a thread block should be roughly identical (the database sequences are sorted by length).
CUDA- Smith-Waterman algorithm (8) Coalesced subject sequence arrangement. For inter-task parallelization, sorted subject sequences are arranged in an array (in global memory) like a multi-layer bookcase, where all symbols of a sequence are stored in the same column from top to bottom, and all sequences are arranged in increasing length order from left to right and top to bottom in the array. Sorted subject sequences for intra-task parallelization are stored sequentially in an array, row by row, from the top-left corner to the bottom-right corner. A hash table records the location coordinate in the array and the length of each sequence, providing fast access to any sequence.
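The bookcase arrangement for inter-task parallelization can be sketched as a row-major array in which each sequence occupies one column of a layer. In the Python illustration below, the layer width, the padding symbol and the function names are assumptions, and a plain list of (offset, length) pairs stands in for the hash table mentioned above.

```python
def pack_bookcase(sequences, width, pad="\0"):
    """Arrange length-sorted sequences column-wise, `width` columns per layer.

    Returns (flat, index): flat is the row-major array, and index[s] is
    (offset of the first symbol, length) for the s-th sorted sequence.
    """
    ordered = sorted(sequences, key=len)
    flat, index = [], []
    for start in range(0, len(ordered), width):
        layer = ordered[start:start + width]
        height = len(layer[-1])            # longest sequence in this layer
        base = len(flat)
        for row in range(height):
            for col in range(width):
                seq = layer[col] if col < len(layer) else ""
                flat.append(seq[row] if row < len(seq) else pad)
        index.extend((base + col, len(seq)) for col, seq in enumerate(layer))
    return flat, index


def read_sequence(flat, index, s, width):
    """Read back sequence s; its consecutive symbols are `width` elements apart."""
    offset, length = index[s]
    return "".join(flat[offset + k * width] for k in range(length))
```

With this layout, the k-th symbols of neighbouring sequences sit at neighbouring addresses, so threads that each process one sequence and advance in lockstep read a contiguous run of memory.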
CUDA- Smith-Waterman algorithm (9) Coalesced global memory access. During the execution of the SW algorithm, additional memory is required to store intermediate alignment data. To support much longer sequences, global memory is used to store the intermediate results. (A prerequisite for coalescing is that the words accessed by all threads in a half-warp must lie in the same segment.) For inter-task parallelization, a memory slot (MemSlot) is allocated to each thread in a thread block and is indexed top to bottom; the accesses of all threads in a half-warp to their MemSlots using the same index are coalesced into one or two memory transactions, depending on the compute capability of the device.
CUDA- Smith-Waterman algorithm (10) For intra-task parallelization, a memory slot is allocated to each thread block and is indexed left to right, so coalesced access can be obtained with the common global memory access pattern.
(Figure from Liu et al., BMC Research Notes 2009: coalesced subject sequence arrangement and coalesced global memory access)
CUDA- Smith-Waterman algorithm (11) Cell block division method. To maximize performance and to reduce the bandwidth demand on global memory, they propose a cell block division method for inter-task parallelization, in which the alignment matrix is divided into cell blocks of equal size. A cell block is a square matrix of size n × n. If the length of the query or subject sequence is not a multiple of n, the sequence is padded with an appropriate number of dummy symbols (which are added to the scoring matrix). However, the size of a cell block is limited by the number of registers available per thread (8 × 8 per thread).
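The padding and block enumeration can be sketched in a few lines of Python. The dummy symbol `#` and the function names are assumptions; in a real scoring matrix the dummy symbol would need entries chosen so that it can never raise an alignment score.

```python
def pad_to_multiple(seq, n, dummy="#"):
    """Pad seq with dummy symbols so that its length is a multiple of n."""
    remainder = len(seq) % n
    return seq if remainder == 0 else seq + dummy * (n - remainder)


def cell_blocks(query, subject, n=8):
    """Top-left (row, col) origin of every n x n cell block of the padded matrix."""
    q = pad_to_multiple(query, n)
    s = pad_to_multiple(subject, n)
    return [(i, j) for i in range(0, len(q), n) for j in range(0, len(s), n)]
```

Padding makes every block the same shape, so the per-block loop can be written for a fixed n (8 in the slide) and kept in registers.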
CUDA- Smith-Waterman algorithm (12) Constant memory is exploited to store the gap penalties, the scoring matrix and the query sequence. (Their implementation supports sequences of length up to 59K.)
(Figure from Liu et al., BMC Research Notes 2009: results on a single-GPU NVIDIA GeForce GTX 280 graphics card and a dual-GPU GeForce GTX 295 graphics card)
CUDA- Multiple sequence alignment. Liu et al. present MSA-CUDA, a parallel MSA program that parallelizes all three stages of the ClustalW processing pipeline using CUDA. Pairwise distance computation (as in Liu et al., BMC Research Notes 2009): a forward score-only pass using the Smith-Waterman (SW) algorithm, a reverse score-only pass using the SW algorithm, and a traceback computation pass using the Myers-Miller algorithm, for which they developed a new stack-based iterative implementation (CUDA does not support recursion). Neighbor-joining trees (as in Liu et al., IPDPS 2009): reconstruction of the unrooted NJ tree, then rooting the NJ tree and computing sequence weights. Progressive alignment: conducted iteratively in a multi-pass way.
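The general technique of replacing recursion with an explicit stack can be illustrated on a generic divide-and-conquer over a range. The Python sketch below is not the Myers-Miller algorithm and not MSA-CUDA's code; the function names and the midpoint split are assumptions chosen only to show that the iterative form visits the same subproblems in the same order.

```python
def split_recursive(lo, hi, leaf_size, out):
    """Reference recursive divide-and-conquer over the half-open range [lo, hi)."""
    if hi - lo <= leaf_size:
        out.append((lo, hi))
        return
    mid = (lo + hi) // 2
    split_recursive(lo, mid, leaf_size, out)
    split_recursive(mid, hi, leaf_size, out)


def split_iterative(lo, hi, leaf_size):
    """Same traversal, driven by an explicit stack instead of recursion."""
    out, stack = [], [(lo, hi)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo <= leaf_size:
            out.append((lo, hi))
            continue
        mid = (lo + hi) // 2
        stack.append((mid, hi))   # pushed first, so it is processed second
        stack.append((lo, mid))
    return out
```

The stack holds only pending subproblems, so its depth can be bounded in advance, which matters when the stack has to live in a fixed-size per-thread buffer.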
(Figure from Liu et al., ASAP 2009)
CUDA-Pattern matching (1) Exact or approximate string matching problem: given a query string P of length m, a text string T, and a distance k (k is 0 for the exact string matching problem), find all substrings t of T that are within distance k of P. A practical application involves more than a million query strings.
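The problem statement can be made concrete with a brute-force Python sketch. The slide does not say which distance is meant, so this example assumes Hamming distance (mismatches only, substrings of length m); the function name is hypothetical.

```python
def find_matches(pattern, text, k=0):
    """Start positions i where text[i:i+m] is within Hamming distance k of pattern."""
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(pattern, text[i:i + m]):
            if a != b:
                mismatches += 1
                if mismatches > k:     # stop early once the budget is exceeded
                    break
        if mismatches <= k:
            hits.append(i)
    return hits
```

This costs on the order of m × |T| comparisons per query, which is why indexing the text (for example with a suffix tree) pays off when there are millions of queries.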
CUDA-Pattern matching (2) Schatz et al. proposed MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program (exact sequence alignment) that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree.
(Figure from Schatz et al., BMC Bioinformatics 2007)
CUDA-Pattern matching (3) First, a suffix tree of the reference sequence is constructed on the CPU using Ukkonen's algorithm and transferred to the GPU. So that the reference suffix tree, query sequences, and output buffers fit on the GPU, MUMmerGPU builds k smaller suffix trees from overlapping segments of the reference. The suffix tree is "flattened" into two 2D textures, the node texture and the child texture (32 × 32). The queries are read from disk in blocks that fill the remaining (global) memory, concatenated into a single large buffer (separated by null characters), and transferred to the GPU. An auxiliary 1D array, also transferred to the GPU, stores the offset of each query in the query buffer.
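The query buffer and its offset array can be sketched as follows. This is a minimal Python illustration of the layout described above, assuming ASCII queries; the function names are hypothetical and not MUMmerGPU's.

```python
def pack_queries(queries):
    """Concatenate queries into one null-separated buffer plus an offset array."""
    buffer = bytearray()
    offsets = []
    for q in queries:
        offsets.append(len(buffer))          # where this query starts
        buffer += q.encode("ascii") + b"\0"  # null byte marks the end
    return bytes(buffer), offsets


def get_query(buffer, offsets, i):
    """Recover query i by scanning from its offset to the next null byte."""
    start = offsets[i]
    end = buffer.index(b"\0", start)
    return buffer[start:end].decode("ascii")
```

A single contiguous buffer can be copied to the device in one transfer, and the offset array lets each thread find its own query without scanning the buffer from the start.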
(Figure from Schatz et al., BMC Bioinformatics 2007: k smaller suffix trees)
CUDA-Pattern matching (4) Then the query sequences are transferred to the GPU and aligned to the tree on the GPU using the alignment algorithm. Each multiprocessor on the GPU is assigned a subset of queries to process in parallel, depending on the number of multiprocessors and processors available (inter- and intra-task parallel). A data reordering scheme (alphabetical order) attempts to increase the cache hit rate for a single thread. Alignment results are temporarily written to the GPU's global memory, and then transferred in bulk to host RAM once the alignment kernel is complete for all queries (the alignments are printed by the CPU).
(Figure from Schatz et al., BMC Bioinformatics 2007) The time for building the suffix tree, reading queries from disk, and printing alignment output is the same regardless of whether MUMmerGPU runs on the CPU or the GPU.
(Figure from Schatz et al., BMC Bioinformatics 2007)
Our work in progress
Objectives (1) Systematically implement existing and other bioinformatics tools on CUDA. Implement previous works to learn from the experience. Improve the performance of previous works or make them more practical. Port other tools to CUDA. Design new sequencing tools on CUDA. Many new sequencing techniques have been proposed in the last few years, such as the ABI-SOLiD, Roche-454 and Illumina-Solexa systems. These next-generation sequencing machines can generate more than 1000 million reads of length around 30 bp in a single run. This will be a very important and challenging problem in the future.
Objectives (2) Apply CUDA to the drug design, systems biology and virus research fields. Molecular dynamics, docking tools and modeling methodology (screening) for drug design. Analyzing and mining biological data for the systems biology and virus research fields. Construct a service platform for our development tools on CUDA. Provide open source software for download. Construct a service platform (website) for biologists.
Thanks to our team members: Chuan Yi Tang, Professor, CS, National Tsing Hua University. Wei Sheng Lee, master student, CS, National Tsing Hua University. Chen Hua Lu, master student, CS, National Tsing Hua University. Chien Pang Liu, undergraduate student, CS, National Tsing Hua University. Chun-Yuan Lin, Assistant Professor, CSIE, Chang Gung University. Bo Wei Yang, master student, CSIE, Chang Gung University. Yu-Shiang Lin, undergraduate student, CSIE, Chang Gung University. Sheng-Ta Li, undergraduate student, CSIE, Chang Gung University. Yi Fang Chung, undergraduate student, CSIE, Chang Gung University. Yi Zhi Liou, undergraduate student, CSIE, Chang Gung University. Thanks to NVIDIA for technical support.
Chuan Yi Tang, Professor, Computational Systems Biology and Bio-Medicine Laboratory, National Tsing Hua University. Tel: +886 3 5715131-1077. Chun-Yuan Lin, Assistant Professor, Parallel Processing and Computational Biology Laboratory, Chang Gung University. Tel: +886 3 2118800-3581.
Thank you for listening