1 Size Matters: Space/Time Tradeoffs to Improve GPGPU Application Performance
Abdullah Gharaibeh, Matei Ripeanu
NetSysLab, The University of British Columbia
2 GPUs offer different characteristics
- High peak compute power
- High communication overhead
- High peak memory bandwidth
- Limited memory space
Implication: careful tradeoff analysis is needed when porting applications to GPU-based platforms
3 Motivating question: how should we design applications to efficiently exploit GPU characteristics?
Context: a bioinformatics problem, sequence alignment
- A string matching problem
- Data intensive (10^2 GB)
4 Past work: sequence alignment on GPUs
MUMmerGPU [Schatz 07, Trapnell 09]: a GPU port of the sequence alignment tool MUMmer [Kurtz 04]
- ~4x speedup (end-to-end) compared to the CPU version
- Hypothesis: mismatch between the core data structure (suffix tree) and GPU characteristics
[Figure: overhead breakdown (%), showing > 50% overhead]
5 Idea: trade time for space
Use a space-efficient data structure (though from a higher computational complexity class): the suffix array
- 4x speedup compared to the suffix tree-based GPU version
Consequences:
- Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
- Focus shifts towards optimizing the compute stage
- Significant overhead reduction
6 Outline
- Sequence alignment: background and offloading to the GPU
- Space/Time trade-off analysis
- Evaluation
7 Background: the sequence alignment problem
Find where each query most likely originated from
- Queries: 10^8 queries, 10^1 to 10^2 symbols per query
- Reference: 10^6 to 10^11 symbols
[Figure: short query reads aligned to their positions along a long reference sequence]
8 GPU Offloading: opportunity and challenges
Opportunity:
- Sequence alignment: easy to partition, memory intensive
- GPU: massively parallel, high memory bandwidth
Challenges:
- Data intensive, large output size
- Limited memory space
- No direct access to other I/O devices (e.g., disk)
9 GPU Offloading: addressing the challenges
- Data intensive problem and limited memory space => divide and compute in rounds
- Large output size => compressed output representation (decompress on the CPU)

High-level algorithm (executed on the host):

subrefs = DivideRef(ref)
subqrysets = DivideQrys(qrys)
foreach subqryset in subqrysets {
    results = NULL
    CopyToGPU(subqryset)
    foreach subref in subrefs {
        CopyToGPU(subref)
        MatchKernel(subqryset, subref)
        CopyFromGPU(results)
    }
    Decompress(results)
}
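To make the round-based host loop concrete, the sketch below shows one way it could look in CUDA C++. It is a minimal illustration, not the authors' implementation: match_kernel, align_in_rounds, and the chunking/decompression steps are hypothetical placeholders; only the CUDA runtime calls (cudaMalloc, cudaMemcpy, kernel launch, cudaFree) are real API. Error handling and the compressed-result format are omitted.

// Hypothetical sketch of the "divide and compute in rounds" host loop.
// Assumes equal-sized sub-query sets / sub-references and one int result
// per query; real code would size buffers per chunk and check errors.
#include <cuda_runtime.h>
#include <vector>

__global__ void match_kernel(const char* queries, int n_queries,
                             const char* subref, int subref_len,
                             int* results) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q < n_queries) {
        // ... match query q against subref, emit a compact result record ...
        results[q] = -1;  // placeholder
    }
}

void align_in_rounds(const std::vector<std::vector<char>>& subqrysets,
                     const std::vector<std::vector<char>>& subrefs,
                     int queries_per_set) {
    char *d_qrys = nullptr, *d_ref = nullptr;
    int  *d_results = nullptr;
    cudaMalloc((void**)&d_qrys, subqrysets[0].size());
    cudaMalloc((void**)&d_ref, subrefs[0].size());
    cudaMalloc((void**)&d_results, queries_per_set * sizeof(int));

    std::vector<int> results(queries_per_set);
    for (const auto& qset : subqrysets) {               // CopyToGPU(subqryset)
        cudaMemcpy(d_qrys, qset.data(), qset.size(), cudaMemcpyHostToDevice);
        for (const auto& subref : subrefs) {            // CopyToGPU(subref)
            cudaMemcpy(d_ref, subref.data(), subref.size(), cudaMemcpyHostToDevice);
            int threads = 256;
            int blocks = (queries_per_set + threads - 1) / threads;
            match_kernel<<<blocks, threads>>>(d_qrys, queries_per_set,
                                              d_ref, (int)subref.size(), d_results);
            cudaMemcpy(results.data(), d_results,       // CopyFromGPU(results)
                       queries_per_set * sizeof(int), cudaMemcpyDeviceToHost);
        }
        // Decompress(results): hypothetical CPU-side post-processing goes here.
    }
    cudaFree(d_qrys); cudaFree(d_ref); cudaFree(d_results);
}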
10 Space/Time Trade-off Analysis
11 The core data structure
A massive number of queries and a long reference => pre-process the reference into an index
Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
- Search: O(qry_len) per query
- Space: O(ref_len), but the constant is high: ~20x ref_len
- Post-processing: O(4^(qry_len - min_match_len)), DFS traversal per query
12 The core data structure
A massive number of queries and a long reference => pre-process the reference into an index
Past work: build a suffix tree (MUMmerGPU [Schatz 07])
- Search: O(qry_len) per query
- Space: O(ref_len), but the constant is high: ~20x ref_len
- Post-processing: O(4^(qry_len - min_match_len)), DFS traversal per query
The slide repeats the high-level host algorithm (slide 9) with its steps annotated as either expensive or efficient.
13 A better matching data structure
Example (sorted suffixes of TACACA$): 0 A$, 1 ACA$, 2 ACACA$, 3 CA$, 4 CACA$, 5 TACACA$

                Suffix Tree                        Suffix Array
  Space         O(ref_len), ~20x ref_len           O(ref_len), ~4x ref_len
  Search        O(qry_len)                         O(qry_len x log ref_len)
  Post-process  O(4^(qry_len - min_match_len))     O(qry_len - min_match_len)

Impact 1: reduced communication: less data to transfer
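To make the O(qry_len x log ref_len) search bound concrete, here is a small CPU-side sketch in C++. It is illustrative only and not the GPU kernel used in this work: build_suffix_array and find are hypothetical names, the construction uses a naive sort purely for brevity, and it looks for an exact occurrence of a query rather than the maximal matches MUMmer computes. Each binary-search step costs up to O(qry_len) for the string comparison, and there are O(log ref_len) steps.

// Illustrative suffix-array search (not the paper's GPU code).
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Naive construction by sorting all suffixes; fine for illustration only.
std::vector<int> build_suffix_array(const std::string& ref) {
    std::vector<int> sa(ref.size());
    for (size_t i = 0; i < ref.size(); ++i) sa[i] = (int)i;
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return ref.compare(a, std::string::npos, ref, b, std::string::npos) < 0;
    });
    return sa;
}

// Returns the reference position of one occurrence of qry, or -1 if absent.
int find(const std::string& ref, const std::vector<int>& sa, const std::string& qry) {
    int lo = 0, hi = (int)sa.size() - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        // Compare a qry_len-long window of the suffix against the query.
        int cmp = ref.compare(sa[mid], qry.size(), qry);
        if (cmp == 0) return sa[mid];
        if (cmp < 0) lo = mid + 1; else hi = mid - 1;
    }
    return -1;
}

int main() {
    std::string ref = "TACACA$";  // the example reference from the slide
    std::vector<int> sa = build_suffix_array(ref);
    std::printf("CACA found at position %d\n", find(ref, sa, "CACA"));  // expect 2
}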
14 A better matching data structure
(Comparison table as on the previous slide.)
Impact 2: better data locality is achieved at the cost of additional per-thread processing time
- Space for longer sub-references => fewer processing rounds
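As an illustrative back-of-the-envelope example (assuming one byte per reference symbol and ignoring the space needed for queries and results): on the 512 MB GeForce card used in the evaluation, a ~20x ref_len index limits each round to a sub-reference of roughly 25M symbols, while a ~4x ref_len index allows roughly 128M symbols, i.e., about 5x fewer rounds over the same reference.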
15 A better matching data structure
(Comparison table as above.)
Impact 3: lower post-processing overhead: O(qry_len - min_match_len) per query instead of O(4^(qry_len - min_match_len))
16 Evaluation
17 Evaluation setup
Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces):

  Workload / Species               Reference length   # of queries   Avg. read length
  HS1  - Human (chromosome 2)      ~238M              ~78M           ~200
  HS2  - Human (chromosome 3)      ~100M              ~2M            ~700
  MONO - L. monocytogenes          ~3M                ~6M            ~120
  SUIS - S. suis                   ~2M                ~26M           ~36

Testbed:
- Low-end GeForce 9800 GX2 GPU (512 MB)
- High-end Tesla C1060 (4 GB)
Baseline: suffix tree on the GPU (MUMmerGPU [Schatz 07, 09])
Success metrics: performance, energy consumption
18 Speedup: array-based over tree-based
19 Dissecting the overheads
Significant reduction in data transfers and post-processing
Workload: HS1, ~78M queries, ~238M ref. length, on the GeForce
20 Summary
- GPUs have drastically different performance characteristics
- Reconsidering the choice of data structure is necessary when porting applications to the GPU
- A good matching data structure ensures:
  - Low communication overhead
  - Data locality: might be achieved at the cost of additional per-thread processing time
  - Low post-processing overhead
21 Code available at: netsyslab.ece.ubc.ca