Naga Shailaja Dasari, Desh Ranjan, Mohammad Zubair. Old Dominion University, Norfolk, Virginia, USA.

Overview: planted motif problem; overview of the BitBased approach; parallelization on GPU; optimizations (bit representation, repartitioning and reordering); results.

(l, d) Planted Motif Problem. A DNA sequence is a sequence over the alphabet {A, C, G, T}. An l-mer is a sequence of length l. Two l-mers are d-neighbors if the Hamming distance between them is at most d: H(s1, s2) ≤ d. Planted Motif Problem: given N input sequences, each of length L, find M, the set of l-mers that have at least one d-neighbor in each of the N input sequences.
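As a concrete illustration (not from the slides), a minimal C sketch of the d-neighbor test, which is just a Hamming-distance comparison; isDNeighbor is an illustrative name:

#include <stdio.h>

/* Two l-mers are d-neighbors iff their Hamming distance is at most d. */
int isDNeighbor(const char *s1, const char *s2, int l, int d) {
    int dist = 0;
    for (int i = 0; i < l; i++)
        if (s1[i] != s2[i])
            dist++;
    return dist <= d;
}

int main(void) {
    /* GCAAT and GCTAT differ in one position, so they are 1-neighbors. */
    printf("%d\n", isDNeighbor("GCAAT", "GCTAT", 5, 1));   /* prints 1 */
    return 0;
}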

(5,1) Planted Motif Problem: an example with N = 3 input sequences of length L = 33. The motif M = GCAAT has a 1-neighbor in each sequence:

CCGATCAAACTGGCTTATATGGCTATGTCAGTC
TACCAATCCATTTCAGTGTACTCCGACATCGGA
ACCAGATTCGATCAGCAGTGTACCAATGAGTAC

Observation: for each input sequence S_i, let N_i be the set of d-neighbors of all its l-mers. For S0 = CCGATCAAACTGGCTTATATGGCTATGTCAGTC, N0 = {CCGAT, CGATC, ..., GCTAT, ..., GCAAT, ..., AAGTC}; for S1 = TACCAATCCATTTCAGTGTACTCCGACATCGGA, N1 also contains GCAAT. A motif is an l-mer that appears in every N_i.

Bit Array. Encode each base with two bits: A = 00, C = 01, G = 10, T = 11. Each l-mer then maps to an index in a bit array of 4^l bits, ordered AAAAA, AAAAC, AAAAG, AAAAT, ..., AAAGC, ..., TTTTG, TTTTT for the (5,1) problem.
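A minimal C sketch (not from the slides) of the index computation this encoding implies; lmerIndex is an illustrative name:

#include <stdio.h>

/* Map an l-mer to its bit-array index using A=00, C=01, G=10, T=11. */
unsigned int lmerIndex(const char *lmer, int l) {
    unsigned int idx = 0;
    for (int i = 0; i < l; i++) {
        unsigned int code = (lmer[i] == 'A') ? 0 : (lmer[i] == 'C') ? 1
                          : (lmer[i] == 'G') ? 2 : 3;   /* 'T' */
        idx = (idx << 2) | code;   /* append two bits per base */
    }
    return idx;
}

int main(void) {
    printf("%u\n", lmerIndex("AAAAA", 5));   /* 0: the first bit */
    printf("%u\n", lmerIndex("TTTTT", 5));   /* 1023 = 4^5 - 1: the last bit */
    return 0;
}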

The BitBased approach has two phases. Setting bits: generate the d-neighbor set of each input sequence. Finding motifs: find the l-mers common to all the d-neighbor sets.

Setting Bits. For each l-mer of sequence S_i (e.g., CCGAT from CCGATCAAACTGGCTTATATGGCTATGTCAGTC), enumerate its d-neighbors (for d = 1: ACGAT, GCGAT, ..., CCGAG, ...) and set the corresponding bits in the 4^l-bit array B_i, using the encoding A = 00, C = 01, G = 10, T = 11.

Finding Motifs. AND the N bit arrays of 4^l bits: B = B_0 & B_1 & ... & B_{N-1}. Every bit set in B corresponds to a motif.
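A minimal host-side C sketch (not from the slides) of this AND step over word-packed bit arrays; names are illustrative:

#include <stdint.h>
#include <stddef.h>

/* AND N bit arrays, each 'words' 32-bit words long; a set bit in 'out'
   marks an l-mer present in every d-neighbor set. */
void andBitArrays(const uint32_t *const *B, uint32_t *out, int N, size_t words) {
    for (size_t w = 0; w < words; w++) {
        uint32_t acc = ~0u;            /* start with all bits set */
        for (int i = 0; i < N; i++)
            acc &= B[i][w];            /* keep only bits present in every B_i */
        out[w] = acc;
    }
}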

Memory Requirement. Space complexity: O(4^l) bits per bit array, i.e. 4^l / 8 bytes.

Instance    Bit array size
(13,4)      8 MB
(15,5)      128 MB
(17,6)      2 GB
(19,7)      32 GB
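The sizes in the table follow directly from 4^l bits per array; a quick check (not from the slides):

#include <stdio.h>

int main(void) {
    for (int l = 13; l <= 19; l += 2) {
        unsigned long long bits = 1ULL << (2 * l);       /* 4^l = 2^(2l) */
        printf("l = %d: %llu MB\n", l, bits / 8 / (1024 * 1024));
    }
    return 0;   /* prints 8, 128, 2048, 32768 MB */
}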

Iterative Approach. For the larger instances the bit array does not fit in memory, so it is processed iteratively in parts: each iteration enumerates the neighbors of the l-mers of each sequence S_i as before (CCGAT: ACGAT, GCGAT, ..., CCGAG, ...) but sets only the bits that fall in the current part.

Parallelizing BitBased on GPU, option 1: distribute the input arrays among the thread blocks.
Setting bits: input sequences live in constant/texture memory, bit arrays in global memory. Each thread reads an l-mer from the input sequence, enumerates its d-neighbors, and sets bits in the bit array.
Finding motifs: all threads together perform the logical AND and write the result back to global memory.

main() {
    // host memory allocation (omitted)
    cudaMalloc((void **) &inArray_d, size);
    cudaMemcpy(inArray_d, inArray_h, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &bitArray_d, sizeofBitArray);
    cudaMemset(bitArray_d, 0, sizeofBitArray);

    numThreads = 128;
    numBlocks = N;   // number of sequences
    setBits<<<numBlocks, numThreads>>>(inArray_d, bitArray_d);
    cudaThreadSynchronize();

    numThreads = 128;
    numBlocks = sizeofBitArray / numThreads;
    findMotifs<<<numBlocks, numThreads>>>(bitArray_d, resultArray_d, resultArraySize);
    cudaMemcpy(resultArray_h, resultArray_d, resultArraySize, cudaMemcpyDeviceToHost);

    // free CUDA memory and host memory
}

__global__ void setBits(char *inArray, int *bitArray) {
    int seqNum = blockIdx.x;      // one block per input sequence
    int lmerIdx = threadIdx.x;
    int i, neighborIdx;
    while (lmerIdx < numOflmersInSequence) {
        char lmer[l];
        // copy the l-mer from the input array
        for (i = 0; i < l; i++)
            lmer[i] = inArray[seqNum * seqLength + lmerIdx + i];
        // enumerate all d-neighbors of the l-mer and set their bits
        neighborIdx = 0;
        while ((neighborIdx = getNextdNeighbor(lmer, neighborIdx)) != -1) {
            setBit(bitArray, seqNum, neighborIdx);
        }
        lmerIdx += blockDim.x;
    }
}

__global__ void findMotifs(int *bitArray, int *resultArray, int *resultArraySize) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int i, resultIdx;
    int temp = ~0;   // start with all bits set so the AND accumulates correctly
    // AND the word at this index across all N bit arrays
    for (i = 0; i < N; i++) {
        temp &= bitArray[i * sizeofBitArray + idx];
    }
    if (temp != 0) {
        // reserve a slot in the result array and store the surviving word
        resultIdx = atomicAdd(resultArraySize, 1);
        resultArray[resultIdx] = temp;
    }
}

Disadvantage of using option 1?

Parallelizing BitBased on GPU, option 2: partition the bit arrays into chunks that fit in shared memory, distribute the chunks among the thread blocks, and do setting bits and finding motifs in the same kernel.
Input sequences live in constant/texture memory, bit arrays in shared memory. Each thread reads an l-mer from the input sequence, enumerates its d-neighbors, and sets bits in the chunk of the bit array assigned to its block.
Finding motifs: all threads together AND the chunks of the bit arrays assigned to the block and write the result back to global memory.
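A minimal CUDA sketch (not the authors' kernel) of this fused structure, simplified to d = 0 so that neighbor enumeration does not obscure the memory pattern; L, N_SEQ, SEQ_LEN, and CHUNK_INTS are illustrative constants:

#define L 5
#define N_SEQ 3
#define SEQ_LEN 33
#define CHUNK_INTS 32      // 32 ints = 1024 bits = 4^5, one chunk per block

__device__ unsigned int lmerIndex(const char *s) {
    unsigned int idx = 0;
    for (int i = 0; i < L; i++) {
        unsigned int c = (s[i] == 'A') ? 0 : (s[i] == 'C') ? 1
                       : (s[i] == 'G') ? 2 : 3;
        idx = (idx << 2) | c;
    }
    return idx;
}

__global__ void bitBasedChunk(const char *seqs, unsigned int *result) {
    __shared__ unsigned int B[N_SEQ][CHUNK_INTS];    // this block's chunk of every B_i
    unsigned int base = blockIdx.x * CHUNK_INTS * 32;   // first bit index owned

    // zero the shared chunk
    for (int w = threadIdx.x; w < N_SEQ * CHUNK_INTS; w += blockDim.x)
        ((unsigned int *)B)[w] = 0;
    __syncthreads();

    // setting bits: threads stride over the l-mers of each sequence
    for (int s = 0; s < N_SEQ; s++)
        for (int j = threadIdx.x; j <= SEQ_LEN - L; j += blockDim.x) {
            unsigned int idx = lmerIndex(seqs + s * SEQ_LEN + j);
            if (idx >= base && idx < base + CHUNK_INTS * 32)
                atomicOr(&B[s][(idx - base) / 32], 1u << (idx % 32));
        }
    __syncthreads();

    // finding motifs: AND this chunk across all sequences, write to global memory
    for (int w = threadIdx.x; w < CHUNK_INTS; w += blockDim.x) {
        unsigned int acc = ~0u;
        for (int s = 0; s < N_SEQ; s++)
            acc &= B[s][w];
        result[blockIdx.x * CHUNK_INTS + w] = acc;
    }
}

Launched with one block per chunk, the kernel never materializes the full bit arrays in global memory; only the ANDed result is written out.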

Occupancy Calculator (compute capability 1.3)

Resource usage:
Threads per block: 128
Registers per thread: 30
Shared memory per block: 4096 bytes

GPU occupancy data:
Active threads per multiprocessor: 512
Active warps per multiprocessor: 16
Active thread blocks per multiprocessor: 4
Occupancy of each multiprocessor: 50%

Physical limits for CC 1.3:
Threads / warp: 32
Warps / multiprocessor: 32
Threads / multiprocessor: 1024
Thread blocks / multiprocessor: 8
Total 32-bit registers / multiprocessor: 16384
Register allocation unit size: 512
Shared memory / multiprocessor: 16384 bytes
Warp allocation granularity (for register allocation): 2

Allocation per thread block:
Warps: 4
Registers: 4096
Shared memory: 4096 bytes

Maximum thread blocks per multiprocessor:
Limited by max warps / multiprocessor: 8
Limited by registers / multiprocessor: 4
Limited by shared memory / multiprocessor: 4
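The 50% figure can be reproduced from those numbers; a quick check (not from the slides):

#include <stdio.h>

int main(void) {
    int warpsPerBlock = 128 / 32;       /* 4 warps per 128-thread block */
    int byMaxBlocks = 8;                /* hardware limit per SM */
    int byWarps = 32 / warpsPerBlock;   /* 8 blocks */
    int byRegs  = 16384 / 4096;         /* 4 blocks: 128 * 30 regs, rounded up
                                           to the 512-register allocation unit */
    int bySmem  = 16384 / 4096;         /* 4 blocks */
    int blocks = byMaxBlocks;
    if (byWarps < blocks) blocks = byWarps;
    if (byRegs  < blocks) blocks = byRegs;
    if (bySmem  < blocks) blocks = bySmem;   /* min = 4 blocks per SM */
    printf("occupancy = %d%%\n", blocks * warpsPerBlock * 100 / 32);  /* 50% */
    return 0;
}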

Chunk size example: N = 8, l = 15, d = 5, chunk size 4096 bytes = 4096 * 8 bits. Each chunk holds 4^ls bits for each of the N bit arrays B_0, ..., B_n, so 4^ls * N = 4096 * 8, giving ls = 6.

Optimizations: bit representation of the input to reduce register usage; repartitioning and reordering to avoid shared-memory bank conflicts.

Bit Representation of Input Sequences. Character representation: an l-mer takes l bytes, e.g. GCATGGATCC takes 10 bytes. Bit representation: an l-mer fits in 4 bytes (one integer), e.g. GCATGGATCC becomes 10010011101000110101, i.e. 4 bytes.
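A minimal C sketch (not from the slides) of why this helps: the current l-mer lives in one 32-bit register and slides with a shift and mask instead of a per-character copy; names are illustrative:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    const char *seq = "GCATGGATCCAT";
    const int l = 10;
    uint32_t mask = (1u << (2 * l)) - 1;   /* keep the low 2*l bits */
    uint32_t lmer = 0;
    for (int i = 0; seq[i] != '\0'; i++) {
        uint32_t c = (seq[i] == 'A') ? 0 : (seq[i] == 'C') ? 1
                   : (seq[i] == 'G') ? 2 : 3;
        lmer = ((lmer << 2) | c) & mask;   /* slide the window one base */
        if (i >= l - 1)
            printf("l-mer at %d: 0x%05x\n", i - l + 1, (unsigned)lmer);
    }
    return 0;
}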

Shared memory banks. Successive 32-bit words are assigned to successive banks; with 16 banks (compute capability 1.x), word i lives in bank i mod 16. Multiple threads of a half warp accessing the same bank cause a bank conflict, and the accesses are serialized.

Repartitioning. The bit/integer array is partitioned into 16 chunks, one per bank. Example: a bit array B_i of 128 integers is split into 16 chunks of 8 integers, and thread i of a half warp accesses only the integers in chunk i.

Reordering. Reorder the integer array so that thread i of a half warp accesses only bank i, i.e. so that all integers of chunk i fall in bank i. When a thread wants to write integer I, it computes the index of I in the reordered array and writes there. Example: the index of integer 16 in the reordered array is 2.
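A minimal C sketch (not from the slides) of that index calculation, for a 128-integer array in 16 chunks of 8; reorderedIndex is an illustrative name:

#include <stdio.h>

/* Integer j lives in chunk j/8 at offset j%8. Placing it at offset*16 + chunk
   puts every integer of chunk i in bank i (bank = index mod 16), so thread i
   of a half warp is the only one touching that bank. */
int reorderedIndex(int j, int chunkLen) {
    int chunk  = j / chunkLen;
    int offset = j % chunkLen;
    return offset * 16 + chunk;
}

int main(void) {
    printf("%d\n", reorderedIndex(16, 8));   /* prints 2, as in the slide */
    return 0;
}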

Multiple devices. Implemented with OpenMP: the 4^l-bit arrays B_0, ..., B_n are partitioned among the GPUs (Device 0 to Device 3).
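A minimal host-side sketch (not the authors' code) of the OpenMP pattern: one CPU thread per GPU, each bound to its device and given a share of the bit-array chunks; launchChunkKernels is a hypothetical placeholder:

#include <omp.h>
#include <cuda_runtime.h>

void launchChunkKernels(int firstChunk, int lastChunk);   /* hypothetical */

void runOnAllDevices(int numChunks) {
    int numDevices;
    cudaGetDeviceCount(&numDevices);
    omp_set_num_threads(numDevices);
    #pragma omp parallel
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);   /* bind this CPU thread to one GPU */
        /* each device takes a contiguous share of the chunks */
        int begin =  dev      * numChunks / numDevices;
        int end   = (dev + 1) * numChunks / numDevices;
        launchChunkKernels(begin, end);
    }
}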

Results. Implemented on an Nvidia Tesla C1060 and a Tesla S1070; the C1060 has a single GPU device running at 1.3 GHz, while the S1070 has four. Measured on random data.

Results on GPU:

#GPU devices   (15,5)   (17,6)   (19,7)   (21,8)
1              8s       91.2s    19.7m    4.5h
2              4.4s     46.1s    9.9m     2.3h
3              3.2s     31.1s    6.62m    1.5h
4              2.7s     23.9s    5m       1.1h

Results on multicore (from previous work):

Algorithm      (13,4)   (15,5)   (17,6)   (19,7)   (21,8)
BitBased-16    2s       11s      2.4m     30.6m    6.9h
BitBased-8     2s       16s      3.5m     42.3m    -
BitBased-4     4s       29s      6.5m     1.3h     -
BitBased-1     9s       1.8m     20.6m    4.7h     -
PMSPrune       53s      9m       69m      9.2h     -

Comparison with CPU: speedup of 1 to 4 GPU devices on the (15,5), (17,6), (19,7), and (21,8) instances, relative to a single CPU core (2.67 GHz) and to 16 CPU cores (2.67 GHz).

Questions?

Thank you