1
Name: Kaiyong Zhao
Supervisor: Dr. X.-W. Chu
2
Background & Related Work
  Multiple-Precision Integer
  GPU Computing & CUDA
Multiple-Precision Arithmetic for CUDA
  Multiple-Precision Arithmetic
Implementation on GPUs
  Data Structure
  Optimization of Data on CUDA
  Example
Experimental Result
3
Multiple-Precision Integer
  32-bit & 64-bit systems
  Multiple-Precision Integer
GPU Computing & CUDA
  GPGPU
  CUDA
4
Base-10 integers vs. big integers in the system: a big integer is stored in base b = 2^32 (one 32-bit word per digit).
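As a hedged sketch of such a representation (the limb count and struct name are illustrative, not taken from the thesis), a big integer in base b = 2^32 is simply an array of 32-bit machine words ("limbs"), least-significant limb first:

// Hypothetical sketch: a multiple-precision integer stored in base b = 2^32.
#define MP_LIMBS 8                    /* e.g. 8 x 32 bits = a 256-bit integer */

typedef struct {
    unsigned int limb[MP_LIMBS];      /* value = sum over i of limb[i] * 2^(32*i) */
} mp_int;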
5
Computing Capability
Memory Bandwidth
6
[Figure: NVIDIA G80 GPU architecture: host and input assembler, vertex/geometry/pixel thread issue, an array of Streaming Multiprocessors (SM) built from Streaming Processors (SP) with L1 caches and texture filter (TF) units, and L2 caches in front of the frame buffer (FB) partitions.]
7
CUDA: CPU + GPU C parallel computing model
Single Instruction, Multiple Threads (SIMT)
  All threads run the same function (1000s of threads in flight)
  Each core deals with different data
Hides I/O latency with many threads (more than 1000s of threads)
  Overlaps computation with I/O transfers
Coalesces I/O into one transaction
  When the threads of a half-warp access neighboring data
  1 cycle @ GPU vs. ~1000 cycles @ CPU
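As a minimal, hedged illustration of the SIMT model (not code from the thesis): every thread executes the same kernel, each on a different element, and neighboring threads touch neighboring addresses so a half-warp's loads can coalesce.

// Minimal SIMT sketch (illustrative only): all threads run the same function,
// each thread handling a different element of the data.
__global__ void scale(const float *in, float *out, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        out[i] = a * in[i];   // neighboring threads read neighboring words -> coalesced
}

// Host side: launch thousands of threads so memory latency is hidden, e.g.
//   scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);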
8
Background & Related Work.
Multiple-Precision Arithmetic for CUDA
  Multiple-Precision Arithmetic
Implementation on GPUs
  Data Structure
  Optimization of Data on CUDA
  Example
Experimental Result
9
1. Multiple-precision Comparison
2. Multiple-precision Addition
3. Multiple-precision Subtraction
4. Multiple-precision Modular Addition
5. Multiple-precision Modular Subtraction
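For example, multiple-precision addition (item 2) is a limb-by-limb add with carry propagation. The device function below is a minimal sketch assuming the hypothetical base-2^32 limb layout sketched earlier (the name mp_add and the constant MP_LIMBS are illustrative).

// Sketch of multiple-precision addition r = a + b over MP_LIMBS 32-bit limbs;
// returns the final carry out of the most significant limb.
__device__ unsigned int mp_add(unsigned int *r, const unsigned int *a,
                               const unsigned int *b)
{
    unsigned int carry = 0;
    for (int i = 0; i < MP_LIMBS; ++i) {
        unsigned long long s = (unsigned long long)a[i] + b[i] + carry;
        r[i]  = (unsigned int)s;          // low 32 bits of the limb sum
        carry = (unsigned int)(s >> 32);  // carry into the next limb (0 or 1)
    }
    return carry;
}

Modular addition (item 4) then subtracts the modulus once if the sum is not smaller than it.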
10
6. Multiple-precision Multiplication
7. Multiple-precision Division
8. Multiple-precision Montgomery Reduction
9. Multiple-precision Montgomery Multiplication
10. Barrett Modular Reduction Algorithm
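For reference (the standard textbook REDC, not copied from the thesis): with radix b, R = b^n, a modulus N coprime to b, N' = -N^{-1} mod R, and an input 0 <= T < RN, Montgomery reduction (item 8) computes T*R^{-1} mod N without a long division:

\[
\begin{aligned}
m &= \big((T \bmod R)\,N'\big) \bmod R,\\
t &= (T + mN)/R \quad (\text{exact division: the low } n \text{ limbs of } T + mN \text{ are zero}),\\
\operatorname{REDC}(T) &= \begin{cases} t - N & \text{if } t \ge N,\\ t & \text{otherwise,} \end{cases} \qquad \operatorname{REDC}(T) \equiv T\,R^{-1} \pmod{N}.
\end{aligned}
\]

Montgomery multiplication (item 9) combines this reduction with the multiplication, typically interleaving them limb by limb.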
11
11. Multiple-precision Multiplicative Inversion
12. Multiple-precision Montgomery Exponentiation
13. Montgomery Multi-Exponentiation
14. Multiple-precision Modular Addition
…
12
Background & Related Work.
Multiple-Precision Arithmetic for CUDA.
Implementation on GPUs
  Data Structure
  Optimization of Data on CUDA
  Example
Experimental Result
13
Data structure: two types
  Constant values: placed in cached constant memory
  Temporary values: placed in shared memory
Balance resources: balance the number of threads against the memory they use
Data encoding
Example
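As a hedged sketch of these two storage classes (names and sizes are illustrative, not the thesis's code, and MP_LIMBS refers to the earlier hypothetical layout): the prime modulus is a constant value, so it lives in cached __constant__ memory, while per-thread temporaries live in fast on-chip __shared__ memory.

// Illustrative sketch of the two storage classes described above.
__constant__ unsigned int c_prime[MP_LIMBS];      // constant value: cached constant memory

__global__ void mp_example_kernel(void)
{
    // temporary values: one MP_LIMBS-word scratch slot per thread of the block
    __shared__ unsigned int temp[128][MP_LIMBS];
    unsigned int tid = threadIdx.x;
    for (int k = 0; k < MP_LIMBS; ++k)
        temp[tid][k] = 0;                         // intermediate results go here
    // ... arithmetic would reduce temp[tid] modulo c_prime ...
}

Shared memory on G80/GT200 is only 16 KB per SM, so the block size has to be traded off against the per-thread temporary storage; this is the "balance resources" point above.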
14
C = vector A × matrix B mod prime
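As a simplified, hedged sketch of this running example, using single 32-bit words instead of multiple-precision integers (the thesis's version operates on limbs; all names here are illustrative), one thread computes one output element C[j]:

// Simplified sketch: C = A (1 x K vector) * B (K x N matrix) mod prime.
__global__ void vec_mat_mod(const unsigned int *A, const unsigned int *B,
                            unsigned int *C, int K, int N, unsigned int prime)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per output column
    if (j >= N) return;

    unsigned long long acc = 0;
    for (int i = 0; i < K; ++i) {
        // B stored row-major: threads j and j+1 read adjacent words -> coalesced
        unsigned long long prod = (unsigned long long)A[i] * B[i * N + j] % prime;
        acc = (acc + prod) % prime;                  // keep the accumulator below the prime
    }
    C[j] = (unsigned int)acc;
}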
15
There is no cache for global memory on G80/GT200
Constant memory & texture memory have only small caches
I/O latency is 400-600 clock cycles
This is the bottleneck, and the key to optimization!
19
Global memory accesses by the threads of a half-warp can be coalesced
  When the words accessed by all threads lie in the same segment of size:
    32 bytes if all threads access 8-bit words
    64 bytes if all threads access 16-bit words
    128 bytes if all threads access 32-bit or 64-bit words
  For any pattern of addresses requested by the half-warp
    Including patterns where multiple threads access the same address
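One hedged way to apply this rule to multiple-precision data (illustrative; not necessarily the thesis's exact layout) is to interleave the limbs of different integers so that, for a fixed limb index, consecutive threads read consecutive words:

// Hypothetical interleaved ("limb-major") layout for coalescing:
// limbs[k * num_ints + i] holds limb k of integer i. When the 16 threads of a
// half-warp process integers i .. i+15 and all load limb k, the 16 addresses are
// consecutive 32-bit words in one segment -> a single memory transaction.
__device__ unsigned int load_limb(const unsigned int *limbs,
                                  int num_ints, int i, int k)
{
    return limbs[k * num_ints + i];
}

With the naive layout (all limbs of one integer stored contiguously), the same 16 loads would be MP_LIMBS words apart and would spread across many segments, costing far more transactions.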
20
[Figure: coalescing example. Threads 0-15 of a half-warp access consecutive 4-byte words (addresses 0-124 in segment 0, 128-252 in segment 1), so each half-warp's requests fall within one 128-byte segment and the transaction can be reduced to 32 bytes. Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, 128 bytes for 32-, 64- and 128-bit data.]
21
C = vector A × matrix B mod prime
22
Background & Related Work.
Multiple-Precision Arithmetic for CUDA.
Implementation on GPUs.
Experimental Result
24
CPU: Intel® Core™ i7 860 @ 2.80 GHz (single thread)
GPU: XFX GTX 280, 1.24 GHz
25
C = vector A × matrix B mod prime
26
CPU: Intel® Core™ i7 860 @ 2.80 GHz (single thread)
GPU: XFX GTX 280, 1.24 GHz
27
1. Multiple-Precision Arithmetic
2. GPU Computing & Optimization
3. Example & Result
4. Summary