Published by Melvin Hampton. Modified over 8 years ago.
1
Naga Shailaja Dasari, Ranjan Desh, Zubair M
Old Dominion University, Norfolk, Virginia, USA
2
Overview
- Planted motif problem
- Overview of the BitBased approach
- Parallelization on GPU
- Optimizations
  - Bit representation
  - Repartitioning and reordering
- Results
3
(l, d) Planted Motif Problem
- DNA sequence: a sequence over the alphabet {A, C, G, T}.
- l-mer: a sequence of length l.
- d-neighbors: two l-mers s1 and s2 are d-neighbors if the Hamming distance between them is at most d, i.e. H(s1, s2) ≤ d.
- Planted Motif Problem: given N input sequences, each of length L, find M, the set of l-mers that have at least one d-neighbor in each of the N input sequences.
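The definitions above can be checked directly on small inputs. The following is a minimal brute-force sketch, not the BitBased algorithm itself; the function names `hammingDistance`, `hasNeighborIn`, and `isMotif` are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hamming distance H(a, b) between two l-mers of equal length.
int hammingDistance(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());
    int dist = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) ++dist;
    }
    return dist;
}

// True if `seq` contains at least one d-neighbor of `candidate`.
bool hasNeighborIn(const std::string& candidate, const std::string& seq, int d) {
    const std::size_t l = candidate.size();
    for (std::size_t pos = 0; pos + l <= seq.size(); ++pos) {
        if (hammingDistance(candidate, seq.substr(pos, l)) <= d) return true;
    }
    return false;
}

// True if `candidate` belongs to M: it has a d-neighbor in every input sequence.
bool isMotif(const std::string& candidate,
             const std::vector<std::string>& seqs, int d) {
    for (const std::string& s : seqs) {
        if (!hasNeighborIn(candidate, s, d)) return false;
    }
    return true;
}
```

Applied to the three sequences of the (5,1) example on the next slide, `isMotif("GCAAT", seqs, 1)` holds, while the same call with d = 0 does not.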
4
(5,1) Planted Motif Problem
Input (N sequences, each of length L):
  CCGATCAAACTGGCTTATATGGCTATGTCAGTC
  TACCAATCCATTTCAGTGTACTCCGACATCGGA
  ACCAGATTCGATCAGCAGTGTACCAATGAGTAC
Motif in M: GCAAT
[Figure: the same three sequences with the 1-neighbors of GCAAT highlighted.]
5
Observation
S0 = CCGATCAAACTGGCTTATATGGCTATGTCAGTC
N0 (d-neighbors of S0) = {CCGAT, …, CGATC, …, GCTAT, …, GCAAT, …, AAGTC}
S1 = TACCAATCCATTTCAGTGTACTCCGACATCGGA
N1 (d-neighbors of S1) = {…, GCAAT, …}
6
Bit Array
For the (5, 1) problem, the bit array has 4^5 bits, indexed 0 to 1023.
Encoding: A - 00, C - 01, G - 10, T - 11

Index  Binary      l-mer
0      0000000000  AAAAA
1      0000000001  AAAAC
2      0000000010  AAAAG
3      0000000011  AAAAT
9      0000001001  AAAGC
1022   1111111110  TTTTG
1023   1111111111  TTTTT
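The mapping between an l-mer and its position in the bit array follows from the 2-bit code on the slide. Here is a minimal sketch; `baseCode`, `lmerToIndex`, and `indexToLmer` are hypothetical helper names, and the 32-bit index type limits this sketch to l ≤ 16.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// 2-bit code from the slide: A - 00, C - 01, G - 10, T - 11.
std::uint32_t baseCode(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  assert(base == 'T'); return 3;
    }
}

// Index of an l-mer in the bit array: the concatenation of its 2-bit codes.
std::uint32_t lmerToIndex(const std::string& lmer) {
    assert(lmer.size() <= 16);  // 2 bits per base must fit in 32 bits
    std::uint32_t index = 0;
    for (char base : lmer) index = (index << 2) | baseCode(base);
    return index;
}

// Inverse mapping: recover the l-mer of length l from its index.
std::string indexToLmer(std::uint32_t index, int l) {
    std::string lmer(l, 'A');
    for (int i = l - 1; i >= 0; --i) {
        lmer[i] = "ACGT"[index & 3u];  // lowest 2 bits encode the last base
        index >>= 2;
    }
    return lmer;
}
```

For instance, `lmerToIndex("AAAGC")` gives 9 and `indexToLmer(1022, 5)` gives `TTTTG`, matching the table rows.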
7
The BitBased approach
Two phases:
- Setting bits: generate the d-neighbor set for each input sequence.
- Finding motifs: find the l-mers that are common to all the d-neighbor sets.
8
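Setting Bits
Encoding: A - 00, C - 01, G - 10, T - 11
Si = CCGATCAAACTGGCTTATATGGCTATGTCAGTC
1-neighbors of the first l-mer of Si and the bits they set in Bi (a bit array of 4^l bits):

l-mer  Binary      Index in Bi
CCGAT  0101100011  355
ACGAT  0001100011  99
GCGAT  1001100011  611
…      …           …
CCGAG  0101100010  354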
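The setting-bits phase for one sequence can be sketched on the host as follows. This is an illustrative sequential version under stated assumptions, not the authors' implementation: the recursive enumeration in `markNeighbors` is one of several possible ways to generate d-neighbors, and all names (`BitArray`, `setBit`, `testBit`, `markNeighbors`, `buildBitArray`) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using BitArray = std::vector<std::uint32_t>;  // 4^l bits packed into 32-bit words

std::uint32_t baseCode(char b) { return b == 'A' ? 0u : b == 'C' ? 1u : b == 'G' ? 2u : 3u; }

std::uint32_t lmerToIndex(const std::string& lmer) {
    std::uint32_t index = 0;
    for (char base : lmer) index = (index << 2) | baseCode(base);
    return index;
}

void setBit(BitArray& bits, std::uint32_t index) { bits[index / 32] |= 1u << (index % 32); }

bool testBit(const BitArray& bits, std::uint32_t index) {
    return ((bits[index / 32] >> (index % 32)) & 1u) != 0;
}

// Sets the bit of `lmer` and of every l-mer reachable by substituting
// at most `d` further positions at or after `pos`.
void markNeighbors(std::string& lmer, std::size_t pos, int d, BitArray& bits) {
    setBit(bits, lmerToIndex(lmer));
    if (d == 0) return;
    for (std::size_t i = pos; i < lmer.size(); ++i) {
        const char original = lmer[i];
        for (char base : std::string("ACGT")) {
            if (base == original) continue;
            lmer[i] = base;
            markNeighbors(lmer, i + 1, d - 1, bits);
        }
        lmer[i] = original;  // restore before moving to the next position
    }
}

// Builds B_i: the d-neighbor set of all l-mers of one input sequence.
BitArray buildBitArray(const std::string& seq, std::size_t l, int d) {
    assert(l >= 3 && l <= 16);  // at least one full word, at most 32-bit indices
    BitArray bits((std::size_t{1} << (2 * l)) / 32, 0u);
    for (std::size_t pos = 0; pos + l <= seq.size(); ++pos) {
        std::string lmer = seq.substr(pos, l);
        markNeighbors(lmer, 0, d, bits);
    }
    return bits;
}
```

With the sequence from the slide, l = 5 and d = 1, bits 355, 99, 611 and 354 end up set, as in the table.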
9
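Finding Motifs
[Figure: the N bit arrays B0, B1, …, B(N-1), each of 4^l bits, are combined with bitwise AND into B = B0 & B1 & … & B(N-1). Each bit that is still 1 in B marks a motif.]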
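The AND step can be sketched sequentially as below. It assumes each bit array is stored as a vector of 32-bit words of equal length; `BitArray` and `findMotifIndices` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using BitArray = std::vector<std::uint32_t>;

// Computes B = B_0 & B_1 & ... & B_{N-1} word by word and returns the
// indices of the bits that remain set, i.e. the indices of the motifs.
std::vector<std::uint32_t> findMotifIndices(const std::vector<BitArray>& arrays) {
    assert(!arrays.empty());
    std::vector<std::uint32_t> motifs;
    for (std::size_t w = 0; w < arrays[0].size(); ++w) {
        std::uint32_t word = ~0u;  // all ones: the identity element for AND
        for (const BitArray& b : arrays) word &= b[w];
        if (word == 0) continue;   // no motif in this word
        for (std::uint32_t bit = 0; bit < 32; ++bit) {
            if ((word >> bit) & 1u) {
                motifs.push_back(static_cast<std::uint32_t>(w) * 32 + bit);
            }
        }
    }
    return motifs;
}
```

Each returned index can be decoded back into an l-mer with the 2-bit encoding (A - 00, C - 01, G - 10, T - 11).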
10
Memory Requirement
Space complexity (bit array size per instance):

Instance  Bit array size
(13,4)    8MB
(15,5)    128MB
(17,6)    2GB
(19,7)    32GB
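The sizes in the table are consistent with one bit per possible l-mer, that is 4^l bits per bit array. A small sketch of that arithmetic follows; `bitArrayBytes` is a hypothetical helper, and the one-bit-per-l-mer reading is an inference from the table values.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for one bit array holding one bit per possible l-mer (4^l bits).
std::uint64_t bitArrayBytes(int l) {
    assert(l >= 2 && l <= 31);                 // keep 4^l within 64 bits
    return (std::uint64_t{1} << (2 * l)) / 8;  // 4^l = 2^(2l) bits, 8 bits per byte
}
```

For l = 13 this gives 2^23 bytes (8MB), and each increase of l by 2 multiplies the size by 16, which matches the step from 8MB to 128MB to 2GB to 32GB.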
11
Iterative Approach
Si = CCGATCAAACTGGCTTATATGGCTATGTCAGTC
1-neighbors of the first l-mer of Si:

l-mer  Binary      Index
CCGAT  0101100011  355
ACGAT  0001100011  99
GCGAT  1001100011  611
…      …           …
CCGAG  0101100010  354
12
Parallelizing BitBased on GPU
Option 1: Distribute the input arrays among the thread blocks.
- Setting bits:
  - Input sequences: constant/texture memory
  - Bit arrays: global memory
  - Each thread reads an l-mer from the input sequence, enumerates its d-neighbors, and sets the corresponding bits in the bit array.
- Finding motifs: all the threads together perform the logical AND operation and write the result back to global memory.
13
int main()
{
    // host memory allocation

    // device memory: input sequences and bit arrays
    cudaMalloc((void **) &inArray_d, size);
    cudaMemcpy(inArray_d, inArray_h, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &bitArray_d, sizeofBitArray);
    cudaMemset(bitArray_d, 0, sizeofBitArray);

    // phase 1: one thread block per input sequence
    numThreads = 128;
    numBlocks = N; // number of sequences
    setBits<<<numBlocks, numThreads>>>(inArray_d, bitArray_d);
    cudaDeviceSynchronize();

    // phase 2: AND the bit arrays and collect the motifs
    numThreads = 128;
    numBlocks = sizeofBitArray / numThreads;
    findMotifs<<<numBlocks, numThreads>>>(bitArray_d, resultArray_d, resultArraySize);
    cudaMemcpy(resultArray_h, resultArray_d, resultArraySize, cudaMemcpyDeviceToHost);

    // free cuda memory and host memory
}
14
__global__ void setBits(char *inArray, int *bitArray)
{
    int seqNum = blockIdx.x;   // one thread block per input sequence
    int lmerIdx = threadIdx.x; // first l-mer handled by this thread
    while (lmerIdx < numoflmersInSequence) {
        char lmer[l];
        // copy lmer from input array
        for (int i = 0; i < l; i++)
            lmer[i] = inArray[seqNum * seqLength + lmerIdx + i];
        // enumerate the d-neighbors of lmer and set their bits
        int neighborIdx = 0;
        while ((neighborIdx = getNextdNeighbor(lmer, neighborIdx)) != -1) {
            setBit(bitArray, seqNum, neighborIdx);
        }
        lmerIdx += blockDim.x; // move on to this thread's next l-mer
    }
}
15
__global__ void findMotifs(int *bitArray, int *resultArray, int *resultArraySize)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int temp = ~0; // all bits set: starting from 0 would always yield 0
    for (int i = 0; i < N; i++) {
        temp &= bitArray[i * sizeofBitArray + idx];
    }
    if (temp != 0) {
        // reserve a slot in the result array and store the surviving word
        int resultIdx = atomicAdd(resultArraySize, 1);
        resultArray[resultIdx] = temp;
    }
}

Disadvantage of using option 1?
16
Parallelizing BitBased on GPU
Option 2:
- Partition the bit arrays into chunks that fit in shared memory.
- Distribute the chunks among the thread blocks.
- Setting bits and finding motifs are done in the same kernel.
  - Input sequences: constant/texture memory
  - Bit arrays: shared memory
- Setting bits: each thread reads an l-mer from the input sequence, enumerates its d-neighbors, and sets bits in the chunk of the bit array assigned to the block.
- Finding motifs: all the threads together perform the logical AND operation on the chunks of the bit arrays assigned to the block and write the result back to global memory.
[Figure: bit arrays B0, B1, …, B(n-1), each of 4^l bits, partitioned into chunks of 4^ls bits.]
17
Occupancy Calculator (compute capability 1.3)

                                         Case 1  Case 2  Case 3
Resource usage
  Threads per block                      128     128     128
  Registers per thread                   30      30      30
  Shared memory per block (bytes)        4096    8192    2048
GPU occupancy data
  Active threads per multiprocessor      512     256     512
  Active warps per multiprocessor        16      8       16
  Active thread blocks per multiprocessor  4     2       4
  Occupancy of each multiprocessor       50%     25%     50%
Allocation per thread block
  Warps                                  4       4       4
  Registers                              4096    4096    4096
  Shared memory (bytes)                  4096    8192    2048
Maximum thread blocks per multiprocessor
  Limited by max warps / multiprocessor  8       8       8
  Limited by registers / multiprocessor  4       4       4
  Limited by shared memory / multiprocessor  4   2       8

Physical limits for compute capability 1.3:
  Threads / warp                                   32
  Warps / multiprocessor                           32
  Threads / multiprocessor                         1024
  Thread blocks / multiprocessor                   8
  Total 32-bit registers / multiprocessor          16384
  Register allocation unit size                    512
  Shared memory / multiprocessor (bytes)           16384
  Warp allocation granularity (for register allocation)  2
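The occupancy figures above can be reproduced from the listed physical limits. The sketch below is an approximation under stated assumptions: the register rounding rule (round warps up to the warp allocation granularity, then round the block's registers up to the allocation unit) is assumed because it yields the 4096 registers per block shown on the slide, and shared memory is taken as already rounded. `GpuLimits`, `roundUp`, `blocksPerSM`, and `occupancyPercent` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

struct GpuLimits {  // values from the physical limits listed for compute capability 1.3
    int threadsPerWarp = 32;
    int maxWarpsPerSM = 32;
    int maxBlocksPerSM = 8;
    int registersPerSM = 16384;
    int registerAllocUnit = 512;
    int sharedMemPerSM = 16384;
    int warpAllocGranularity = 2;
};

int roundUp(int value, int unit) { return (value + unit - 1) / unit * unit; }

int warpsPerBlock(int threadsPerBlock, const GpuLimits& g) {
    return (threadsPerBlock + g.threadsPerWarp - 1) / g.threadsPerWarp;
}

// Resident thread blocks per multiprocessor: the tightest of the three limits.
int blocksPerSM(int threadsPerBlock, int regsPerThread, int sharedPerBlock,
                const GpuLimits& g = GpuLimits{}) {
    assert(threadsPerBlock > 0 && regsPerThread > 0 && sharedPerBlock > 0);
    const int warps = warpsPerBlock(threadsPerBlock, g);
    // Assumed allocation rule, see the lead-in paragraph.
    const int regsPerBlock = roundUp(
        roundUp(warps, g.warpAllocGranularity) * g.threadsPerWarp * regsPerThread,
        g.registerAllocUnit);
    const int byWarps = std::min(g.maxBlocksPerSM, g.maxWarpsPerSM / warps);
    const int byRegs = g.registersPerSM / regsPerBlock;
    const int byShared = g.sharedMemPerSM / sharedPerBlock;
    return std::min({byWarps, byRegs, byShared});
}

// Occupancy in percent: active warps divided by the maximum warps per multiprocessor.
int occupancyPercent(int threadsPerBlock, int regsPerThread, int sharedPerBlock,
                     const GpuLimits& g = GpuLimits{}) {
    const int blocks = blocksPerSM(threadsPerBlock, regsPerThread, sharedPerBlock, g);
    return 100 * blocks * warpsPerBlock(threadsPerBlock, g) / g.maxWarpsPerSM;
}
```

With 128 threads and 30 registers per thread, 4096 and 2048 bytes of shared memory both give 50% occupancy because registers are the binding limit, while 8192 bytes drops it to 25%.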
18
Chunk size
[Figure: bit arrays B0, B1, …, B(n-1), each of 4^l bits, partitioned into chunks of 4^ls bits.]
N = 8, l = 15, d = 5
Chunk size: 4096 bytes = 4096 * 8 bits
4^ls * N = 4096 * 8
ls = 6
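The same calculation can be written for other shared memory budgets and sequence counts. This is a small sketch of the arithmetic only; `chunkExponent` and `numChunks` are hypothetical helpers, and the assumption is that each thread block holds one chunk of 4^ls bits from each of the N bit arrays.

```cpp
#include <cassert>
#include <cstdint>

// Largest ls such that N chunks of 4^ls bits fit in `sharedBytes` bytes.
int chunkExponent(std::uint64_t sharedBytes, std::uint64_t numSequences) {
    assert(numSequences > 0);
    const std::uint64_t bitsPerArray = sharedBytes * 8 / numSequences;
    assert(bitsPerArray >= 1);
    int ls = 0;
    // Grow ls while 4^(ls+1) bits still fit into the per-array share.
    while (ls < 30 && (std::uint64_t{1} << (2 * (ls + 1))) <= bitsPerArray) ++ls;
    return ls;
}

// Number of chunks each bit array is split into: 4^l / 4^ls.
std::uint64_t numChunks(int l, int ls) {
    assert(ls >= 0 && ls <= l && l <= 31);
    return std::uint64_t{1} << (2 * (l - ls));
}
```

For the slide's values (4096 bytes, N = 8) this yields ls = 6, so each bit array of the (15,5) instance splits into 4^9 = 262144 chunks.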
19
Optimization
- Bit representation of input to reduce register usage.
- Repartitioning and reordering to avoid bank conflicts.
20
Bit Representation of Input Sequences
- Character representation: an l-mer takes l bytes.
  Ex: GCATGGATCC takes 10 bytes.
- Bit representation: an l-mer takes 4 bytes (integer size).
  Ex: GCATGGATCC = 10010011101000110101 takes 4 bytes.
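One way to obtain the packed form of every l-mer of a sequence is a sliding 2-bit window, sketched below. The sliding-window update is an illustrative choice, not something the slides describe, and `baseCode` and `packedLmers` are hypothetical names; a 32-bit integer holds l-mers up to l = 16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// 2-bit code: A - 00, C - 01, G - 10, T - 11.
std::uint32_t baseCode(char b) { return b == 'A' ? 0u : b == 'C' ? 1u : b == 'G' ? 2u : 3u; }

// Packs every l-mer of `seq` into one 32-bit integer each (2 bits per base).
std::vector<std::uint32_t> packedLmers(const std::string& seq, std::size_t l) {
    assert(l >= 1 && l <= 16);
    // Mask that keeps only the lowest 2*l bits of the window.
    const std::uint32_t mask = (l == 16) ? ~0u : ((1u << (2 * l)) - 1u);
    std::vector<std::uint32_t> out;
    std::uint32_t window = 0;
    for (std::size_t i = 0; i < seq.size(); ++i) {
        // Shift in the next base and drop the base that left the window.
        window = ((window << 2) | baseCode(seq[i])) & mask;
        if (i + 1 >= l) out.push_back(window);
    }
    return out;
}
```

`packedLmers("GCATGGATCC", 10)` returns a single value, 0x93A35, which is the bit string 10010011101000110101 from the slide.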
21
Shared memory banks
- Successive 32-bit words are assigned to successive banks.
- Multiple threads accessing the same bank result in bank conflicts, which serialize the accesses.
[Figure: words laid out across Bank 0, Bank 1, …, Bank 15, wrapping around to Bank 0.]
22
Repartitioning
The bit/integer array is partitioned into 16 chunks. Thread i in a half warp only accesses the integers in chunk i.
Example: bit array Bi of size 128 integers

Chunk (thread)  Integers   Banks
0               0..7       0..7
1               8..15      8..15
2               16..23     0..7
3               24..31     8..15
4               32..39     0..7
5               40..47     8..15
6               48..55     0..7
7               56..63     8..15
8               64..71     0..7
9               72..79     8..15
10              80..87     0..7
11              88..95     8..15
12              96..103    0..7
13              104..111   8..15
14              112..119   0..7
15              120..127   8..15
23
Reordering
Reorder the integer array such that thread i in a half warp only accesses bank i.

Reordered array (each row spans Bank 0 to Bank 15; entries are the original integer numbers):
  0  8  16 .. 120
  1  9  17 .. 121
  2  10 18 .. 122
  3  11 19 .. 123
  4  12 20 .. 124
  5  13 21 .. 125
  6  14 22 .. 126
  7  15 23 .. 127

Indices in the reordered array:
  0   1   2   .. 15
  16  17  18  .. 31
  32  33  34  .. 47
  48  49  50  .. 63
  64  65  66  .. 79
  80  81  82  .. 95
  96  97  98  .. 111
  112 113 114 .. 127

If a thread wants to write into integer I, it calculates the index of integer I in the reordered array and then writes at that index.
Ex: the index of integer 16 in the reordered array is 2.
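The index calculation implied by the two layouts above can be sketched as follows. `reorderedIndex` and `bankOf` are hypothetical names, and the constant 16 is the number of banks shown on the shared memory slide.

```cpp
#include <cassert>

constexpr int kNumBanks = 16;  // banks shown on the shared memory slide

// Position of integer `i` in the reordered array.
// `arraySize` must be a multiple of kNumBanks.
int reorderedIndex(int i, int arraySize) {
    assert(arraySize % kNumBanks == 0 && i >= 0 && i < arraySize);
    const int chunkSize = arraySize / kNumBanks;  // 8 for an array of 128 integers
    const int chunk = i / chunkSize;              // chunk id, equal to the owning thread id
    const int offset = i % chunkSize;             // position inside the chunk
    return offset * kNumBanks + chunk;            // row `offset`, column `chunk`
}

// Bank holding a position: successive 32-bit words go to successive banks.
int bankOf(int position) { return position % kNumBanks; }
```

For the 128-integer example, integer 16 maps to index 2, and every integer of chunk i lands in bank i, so the 16 threads of a half warp never touch the same bank.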
24
Multiple devices
Implemented with OpenMP.
[Figure: the chunks of the bit arrays B0, B1, …, B(n-1), each of 4^l bits, are distributed among Device 0, Device 1, Device 2, and Device 3.]
25
Results
- Implemented on Nvidia Tesla C1060 and Tesla S1070. The C1060 has a single GPU device running at 1.3 GHz, while the S1070 has four devices.
- Run on random data.

Results on GPU
#GPU devices  (15,5)  (17,6)  (19,7)  (21,8)
1             8s      91.2s   19.7m   4.5h
2             4.4s    46.1s   9.9m    2.3h
3             3.2s    31.1s   6.62m   1.5h
4             2.7s    23.9s   5m      1.1h

Results on Multicore (from previous work)
Algorithm    (13,4)  (15,5)  (17,6)  (19,7)  (21,8)
BitBased-16  2s      11s     2.4m    30.6m   6.9h
BitBased-8   2s      16s     3.5m    42.3m   -
BitBased-4   4s      29s     6.5m    1.3h    -
BitBased-1   9s      1.8m    20.6m   4.7h    -
PMSPrune     53s     9m      69m     9.2h    -
26
Comparison with CPU

Speedup compared to single core CPU (2.67 GHz)
#GPU devices  (15,5)  (17,6)  (19,7)  (21,8)
1             13.5    13.6    14.3    -
2             24.5    26.8    28.5    -
3             33.6    39.7    42.6    -
4             40      51.7    56.4    -

Speedup compared to 16 cores CPU (2.67 GHz)
#GPU devices  (15,5)  (17,6)  (19,7)  (21,8)
1             1.4     1.6     1.6     1.5
2             2.5     3.1     3.1     3.0
3             3.4     4.6     4.6     4.6
4             4.1     6.0     6.1     6.3
27
Questions?
28
Thank you