IMPLEMENTATION OF MULTIPLE-PRECISION MODULAR MULTIPLICATION ON GPU


IMPLEMENTATION OF MULTIPLE-PRECISION MODULAR MULTIPLICATION ON GPU
Presented by: ZHAO Kaiyong
Supervisor: Dr. CHU XiaoWen

OUTLINE
1. Background
2. Implementation of Modular Multiplication on GPU
3. Improving the Montgomery Modular Multiplication on GPU
4. Summary
5. Q&A
11/9/2018, Department of Computer Science, HKBU

1. Background (homomorphic hash function)

Network coding
  A field of information theory and coding theory for attaining maximum information flow in a network.
  Originally proposed to improve throughput.
  Information is coded at potentially every node.

Pollution attack
  A malicious node sends bogus data packets to others.
  The effect is far more serious with network coding: the bogus packet is mixed into other packets and propagates to the whole network.

Homomorphic hash function
  The hash of an encoded packet should be easily derived from the hashes of the original packets and the encoding coefficient vector.
  Assume the original blocks are $b_i$, $i = 1, \ldots, n$.
  The encoded block is $e = c_1 b_1 + \cdots + c_n b_n$.
  The coefficient vector is $(c_1, c_2, \ldots, c_n)$.
  The homomorphic hash function $h(\cdot)$ satisfies $h(e) = h(b_1)^{c_1} \, h(b_2)^{c_2} \cdots h(b_n)^{c_n}$.
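The homomorphic property can be checked with a small sketch. The slide does not say which hash function is used, so the construction below, h(b) = G^b mod P, along with the modulus `P`, the base `G`, and every function name, is an assumption chosen only because it satisfies the stated identity.

```python
# Toy homomorphic hash: h(b) = G^b mod P.
# P and G are illustrative assumptions, not parameters from the slides.
P = 2**127 - 1   # a Mersenne prime, used here as the modulus
G = 3            # hypothetical base


def h(block):
    """Hash one block (an integer) by modular exponentiation."""
    return pow(G, block, P)


def encode(blocks, coeffs):
    """Network-coded block e = c_1*b_1 + ... + c_n*b_n (over the integers)."""
    return sum(c * b for c, b in zip(coeffs, blocks))


def combine_hashes(hashes, coeffs):
    """Derive h(e) from the block hashes: the product of h(b_i)^(c_i) mod P."""
    result = 1
    for block_hash, c in zip(hashes, coeffs):
        result = result * pow(block_hash, c, P) % P
    return result
```

A receiver that knows only the hashes of the original blocks and the coefficient vector can compare `combine_hashes(...)` with `h(e)` to detect a polluted packet. Each check costs several modular exponentiations, which is why fast multiple-precision modular multiplication matters here.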

1. Background (why?)
Security issues (RSA, homomorphic hash function) rely on:
  Multiple-precision modular exponentiation
  Multiple-precision modular multiplication and modular reduction
  Multiple-precision add, sub, multiplication, division, modular …

1. Background (Karatsuba multiplication)
Each operand is split into a high and a low half: X -> (hi: x1, lo: x0), Y -> (hi: y1, lo: y0).
Base case multiplication, O(N^2): four half-size products x0*y0, x1*y0, x0*y1, x1*y1.
Karatsuba multiplication, O(N^1.585) [1]: three half-size products x1*y1, x0*y0, (x1-x0)*(y1-y0), combined with additions and subtractions.
[1] A. Karatsuba and Yu. Ofman (1962). "Multiplication of Many-Digital Numbers by Automatic Computers". Proceedings of the USSR Academy of Sciences 145: 293–294.
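A minimal recursive sketch of the three-product scheme, using the same (x1-x0)*(y1-y0) middle term as the slide. The function name, the `cutoff_bits` parameter, and the use of Python's built-in product as the base case are illustrative assumptions, not details of the presented implementation.

```python
def karatsuba(x, y, cutoff_bits=64):
    """Multiply non-negative integers x and y with three half-size products."""
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y  # base case: small operand, use the ordinary product

    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x1, x0 = x >> half, x & mask   # hi and lo halves of x
    y1, y0 = y >> half, y & mask   # hi and lo halves of y

    z2 = karatsuba(x1, y1, cutoff_bits)          # x1*y1
    z0 = karatsuba(x0, y0, cutoff_bits)          # x0*y0

    # (x1-x0)*(y1-y0) may be negative, so track the sign separately.
    dx, dy = x1 - x0, y1 - y0
    sign = -1 if (dx < 0) != (dy < 0) else 1
    zm = sign * karatsuba(abs(dx), abs(dy), cutoff_bits)

    z1 = z2 + z0 - zm                            # equals x1*y0 + x0*y1
    return (z2 << (2 * half)) + (z1 << half) + z0
```

The middle term follows from (x1-x0)*(y1-y0) = x1*y1 - x1*y0 - x0*y1 + x0*y0, so the two cross products are recovered with one multiplication plus additions and subtractions.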

1. Background (Montgomery multiplication)

Algorithm 1 Multiple-precision Montgomery Reduction
INPUT: integer $m$ with $n$ radix-$b$ digits and $\gcd(m, b) = 1$, $R = b^n$, $m' = -m^{-1} \bmod b$, and integer $A$ with $2n$ radix-$b$ digits and $A < m \cdot R$.
OUTPUT: $T = A \cdot R^{-1} \bmod m$.
    T <- A;
    for (i from 0 to n-1)
        u_i <- T_i * m' mod b;      (T_i is the i-th radix-b digit of T)
        T <- T + u_i * m * b^i;
    end for
    T <- T / b^n;
    if (T >= m) then T <- T - m;
    return T;

Algorithm 2 Multiple-precision Montgomery Multiplication
INPUT: non-negative integers $m$, $x$, $y$ with $n$ radix-$b$ digits, $x < m$, $y < m$, and $\gcd(m, b) = 1$, $R = b^n$, $m' = -m^{-1} \bmod b$.
OUTPUT: $T = x \cdot y \cdot R^{-1} \bmod m$.
    T <- 0;
    for (i from 0 to n-1)
        u_i <- (T_0 + x_i * y_0) * m' mod b;
        T <- (T + x_i * y + u_i * m) / b;
    end for
    if (T >= m) then T <- T - m;
    return T;

[2] Montgomery, P., 1985. Modular multiplication without trial division, Math. Computation, vol. 44, 1985, 519-521.
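Both algorithms can be transcribed almost line by line into a reference model. The sketch below assumes radix b = 2^32 and uses Python integers in place of digit arrays; the names `mont_setup`, `mont_reduce`, and `mont_mul` are illustrative, and `pow(m, -1, B)` requires Python 3.8 or later.

```python
B_BITS = 32
B = 1 << B_BITS          # radix b (assumed to be 2^32)
DIGIT_MASK = B - 1


def mont_setup(m):
    """Return (n, m') where n is the digit count of m and m' = -m^(-1) mod b."""
    n = (m.bit_length() + B_BITS - 1) // B_BITS
    m_prime = (-pow(m, -1, B)) % B   # needs gcd(m, b) = 1, i.e. m odd
    return n, m_prime


def mont_reduce(A, m):
    """Algorithm 1: return A * R^(-1) mod m for 0 <= A < m*R, with R = b^n."""
    n, m_prime = mont_setup(m)
    T = A
    for i in range(n):
        T_i = (T >> (B_BITS * i)) & DIGIT_MASK    # i-th digit of T
        u_i = (T_i * m_prime) & DIGIT_MASK
        T += (u_i * m) << (B_BITS * i)            # clears digit i of T
    T >>= B_BITS * n                              # exact division by b^n
    return T - m if T >= m else T


def mont_mul(x, y, m):
    """Algorithm 2: return x * y * R^(-1) mod m for x, y < m."""
    n, m_prime = mont_setup(m)
    y_0 = y & DIGIT_MASK
    T = 0
    for i in range(n):
        x_i = (x >> (B_BITS * i)) & DIGIT_MASK
        u_i = (((T & DIGIT_MASK) + x_i * y_0) * m_prime) & DIGIT_MASK
        T = (T + x_i * y + u_i * m) >> B_BITS     # exact division by b
    return T - m if T >= m else T
```

In both loops, u_i is chosen so that the lowest remaining digit becomes zero, which is what makes the division by b (or b^n) exact and removes the need for trial division.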

1. Background (GPU computing & CUDA)
GPU/CPU architecture (figure)
GPU computing power (figure): computing capability and memory bandwidth

1. Background (GPU computing & CUDA)
CUDA: a CPU + GPU C program
  CPU: serial code
  GPU: parallel processing of large data-parallel workloads by launching a large number of thin threads
Execution flow: CPU serial code -> kernel 0 (GPU parallel code) -> CPU serial code -> kernel 1 (GPU parallel code) -> ...
CPU and GPU code can execute concurrently.

2. Implementation of Modular Multiplication on GPU
Design and Implementation of Multiple-Precision Modular Arithmetic Library for CUDA
1. Multiple-precision comparison
2. Multiple-precision subtraction
3. Multiple-precision modular addition
4. Multiple-precision modular subtraction
5. Multiple-precision multiplication
6. Multiple-precision division
7. Multiple-precision multiplicative inversion
8. Multiple-precision modular exponentiation
…
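To make the first few library operations concrete, here is a hedged sketch of comparison, subtraction, and modular addition on little-endian arrays of 32-bit words. The slides do not show the library's data layout or function names, so the representation and every name below are assumptions.

```python
W = 1 << 32  # word base: one 32-bit limb per list element, least significant first


def to_limbs(x, s):
    """Split a non-negative integer into s 32-bit limbs."""
    return [(x >> (32 * i)) & (W - 1) for i in range(s)]


def from_limbs(a):
    """Reassemble an integer from its limbs."""
    return sum(d << (32 * i) for i, d in enumerate(a))


def mp_cmp(a, b):
    """Return -1, 0, or 1 by scanning from the most significant limb."""
    for a_i, b_i in zip(reversed(a), reversed(b)):
        if a_i != b_i:
            return 1 if a_i > b_i else -1
    return 0


def mp_add(a, b):
    """Return (a + b mod W^s, carry_out)."""
    out, carry = [], 0
    for a_i, b_i in zip(a, b):
        total = a_i + b_i + carry
        out.append(total % W)
        carry = total // W
    return out, carry


def mp_sub(a, b):
    """Return (a - b mod W^s, borrow_out)."""
    out, borrow = [], 0
    for a_i, b_i in zip(a, b):
        diff = a_i - b_i - borrow
        borrow = 1 if diff < 0 else 0
        out.append(diff % W)
    return out, borrow


def mp_mod_add(a, b, m):
    """Return (a + b) mod m for a, b < m: add, then subtract m at most once."""
    total, carry = mp_add(a, b)
    if carry or mp_cmp(total, m) >= 0:
        total, _ = mp_sub(total, m)
    return total
```

Because a and b are both below m, the sum is below 2m, so a single conditional subtraction is enough even when the addition overflows the top limb.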

2. Implementation of Modular Multiplication on GPU
1. CIOS Montgomery Modular Multiplication
2. Karatsuba Montgomery Modular Multiplication
Modular exponentiation always reduces to a sequence of modular multiplications.
We present the implementation details of these two Montgomery modular multiplication methods.
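The reduction from exponentiation to multiplication can be seen in a standard right-to-left square-and-multiply loop. This is a generic sketch, not the presented GPU code; the function name and the multiplication counter are illustrative, and the plain `%` operator stands in for whichever modular multiplication routine is plugged in.

```python
def mod_exp(base, exponent, m):
    """Square-and-multiply: return (base^exponent mod m, modular multiplications used)."""
    result = 1 % m
    base %= m
    mults = 0
    while exponent:
        if exponent & 1:
            result = result * base % m   # multiply step for a set exponent bit
            mults += 1
        base = base * base % m           # square step for every exponent bit
        mults += 1
        exponent >>= 1
    return result, mults
```

For a k-bit exponent the loop performs at most 2k modular multiplications, so the cost of exponentiation is dominated by the cost of one modular multiplication.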

2. Implementation of Modular Multiplication on GPU
CIOS (Coarsely Integrated Operand Scanning) Montgomery Modular Multiplication

Algorithm 3 Multiple-precision Montgomery multiplication
INPUT: integer $m$ with $n$ radix-$b$ digits and $\gcd(m, b) = 1$, $R = b^n$, $m' = -m^{-1} \bmod b$, positive integers $x$ and $y$ with $n$ radix-$b$ digits and $x < m$, $y < m$.
OUTPUT: $x \cdot y \cdot R^{-1} \bmod m$.
Notation in the pseudocode: a and b are the operands x and y, n is the modulus, n' is m', W is the word base, s is the number of words, and the local variable m is the per-iteration reduction multiplier.
    for (i from 0 up to s-1)
        C := 0
        for (j from 0 up to s-1)
            (C,S) := t[j] + a[j]*b[i] + C
            t[j] := S
        end for
        (C,S) := t[s] + C
        t[s] := S
        t[s+1] := C
        C := 0
        m := t[0]*n'[0] mod W
        for (j from 0 up to s-1)
            (C,S) := t[j] + m*n[j] + C
            t[j] := S
        end for
        (C,S) := t[s] + C
        t[s] := S
        t[s+1] := t[s+1] + C
        for (j from 0 up to s)
            t[j] := t[j+1]
        end for
    end for
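A word-level reference model of the CIOS loop can be checked against big-integer arithmetic. The sketch assumes W = 2^32, little-endian word lists, and a final conditional subtraction that the slide's pseudocode does not show; all names are illustrative.

```python
W_BITS = 32
MASK = (1 << W_BITS) - 1


def to_words(x, s):
    """Split x into s little-endian 32-bit words."""
    return [(x >> (W_BITS * i)) & MASK for i in range(s)]


def from_words(words):
    """Reassemble an integer from little-endian words."""
    return sum(w << (W_BITS * i) for i, w in enumerate(words))


def cios_mont_mul(a, b, n, n0_prime, s):
    """Return a*b*R^(-1) mod n as an integer, with R = 2^(32*s).

    a, b, n are word lists of length s; n0_prime = -n^(-1) mod 2^32.
    """
    t = [0] * (s + 2)
    for i in range(s):
        # Multiplication step: t += a * b[i]
        C = 0
        for j in range(s):
            cs = t[j] + a[j] * b[i] + C
            C, t[j] = cs >> W_BITS, cs & MASK
        cs = t[s] + C
        t[s + 1], t[s] = cs >> W_BITS, cs & MASK

        # Reduction step: t += m * n, which zeroes t[0]
        C = 0
        m = (t[0] * n0_prime) & MASK
        for j in range(s):
            cs = t[j] + m * n[j] + C
            C, t[j] = cs >> W_BITS, cs & MASK
        cs = t[s] + C
        t[s] = cs & MASK
        t[s + 1] += cs >> W_BITS

        # Divide by W: shift t down by one word
        for j in range(s + 1):
            t[j] = t[j + 1]

    # Final conditional subtraction (assumed; not shown on the slide)
    result, modulus = from_words(t[:s + 1]), from_words(n)
    return result - modulus if result >= modulus else result
```

Interleaving one multiplication pass and one reduction pass per outer iteration keeps the working array at s+2 words, which is consistent with the small register footprint the slides report for CIOS.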

2. Implementation of Modular Multiplication on GPU
Karatsuba Montgomery Modular Multiplication: in this method, the product is computed with Karatsuba multiplication, and Montgomery reduction is then applied to the result.

Algorithm 4 Multiple-precision Karatsuba and Montgomery Multiplication
INPUT: integer $m$ with $n$ radix-$b$ digits and $\gcd(m, b) = 1$, $R = b^n$, $m' = -m^{-1} \bmod b$, positive integers $x$ and $y$ with $n$ radix-$b$ digits and $x < m$, $y < m$.
OUTPUT: $x \cdot y \cdot R^{-1} \bmod m$.
    t := Karatsuba(x, y)
    for (i from 0 up to s-1)
        C := 0
        m := t[i]*n'[0] mod W
        for (j from 0 up to s-1)
            (C,S) := t[i+j] + m*n[j] + C
            t[i+j] := S
        end for
        ADD (t[i+s], C)
    end for
    for (j from 0 up to s)
        u[j] := t[j+s]
    end for
    B := 0
    for (i from 0 up to s-1)
        (B,D) := u[i] - n[i] - B
        t[i] := D
    end for
    (B,D) := u[s] - B
    t[s] := D
    if B = 0 then return t[0], t[1], ... , t[s-1]
    else return u[0], u[1], ... , u[s-1]
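The reduction phase of Algorithm 4 can be modelled on word lists as follows. To stay short, the sketch uses Python's built-in product as a stand-in for the Karatsuba step, which is a substitution and not the presented method; W = 2^32 and all names are assumptions.

```python
W_BITS = 32
MASK = (1 << W_BITS) - 1


def to_words(x, s):
    """Split x into s little-endian 32-bit words."""
    return [(x >> (W_BITS * i)) & MASK for i in range(s)]


def from_words(words):
    """Reassemble an integer from little-endian words."""
    return sum(w << (W_BITS * i) for i, w in enumerate(words))


def product_then_reduce(x, y, n, n0_prime, s):
    """Full product first, then Montgomery reduction; returns s result words."""
    # Stand-in for Karatsuba(x, y): any exact multiplication works here.
    t = to_words(from_words(x) * from_words(y), 2 * s + 1)

    for i in range(s):
        C = 0
        m = (t[i] * n0_prime) & MASK
        for j in range(s):
            cs = t[i + j] + m * n[j] + C
            C, t[i + j] = cs >> W_BITS, cs & MASK
        # ADD(t[i+s], C): propagate the carry into the upper words
        k = i + s
        while C:
            cs = t[k] + C
            C, t[k] = cs >> W_BITS, cs & MASK
            k += 1

    u = t[s:2 * s + 1]            # u = t / W^s, which is below 2n

    # Trial subtraction u - n with borrow B
    B = 0
    d = [0] * (s + 1)
    for i in range(s):
        diff = u[i] - n[i] - B
        B = 1 if diff < 0 else 0
        d[i] = diff & MASK
    diff = u[s] - B
    B = 1 if diff < 0 else 0
    d[s] = diff & MASK

    return d[:s] if B == 0 else u[:s]
```

Unlike CIOS, this variant keeps the whole 2s-word product live during reduction, which is one plausible reason for the much larger register and local-memory usage reported for K-MM.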

2. Implementation of Modular Multiplication on GPU
Test platform
  CPU: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
  GPU: GTX 295, 240 cores, 1.24GHz
Integer parameters
  Operands: 1024 bits x 1024 bits
  Modulus: 1024 bits
  Using 32-bit integers as the base

2. Implementation of Modular Multiplication on GPU
Comparing the Karatsuba method and the CIOS method
  K-MM (Karatsuba Montgomery multiplication): 60 registers, local memory 5132.
  CIOS: 14 registers, no local memory at all.

3. Improving the Montgomery Modular Multiplication on GPU

Algorithm 5 32-bit integer multiplication
INPUT: 32-bit integers A and B.
OUTPUT: A*B (the full 64-bit product).

    static inline __device__ unsigned __int64 mul_32x32(unsigned A, unsigned B)
    {
        // mul.wide.u32 multiplies two 32-bit operands into one 64-bit result.
        unsigned __int64 out;
        asm("mul.wide.u32 %0, %1, %2;" : "=l"(out) : "r"(A), "r"(B));
        return out;
    }

ASM of integer multiplication: MULT64X64LO needs more than 20 instructions; MULT32X32WIDE needs only 10 instructions.
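A short arithmetic check shows why a 32x32 -> 64-bit wide multiply is sufficient for the inner step (C,S) := t[j] + a[j]*b[i] + C. The Python model below only mimics the arithmetic of the instruction; the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def mul_wide_u32(a, b):
    """Model of a 32x32 -> 64-bit unsigned multiply."""
    return (a & MASK32) * (b & MASK32)


def multiply_accumulate(t, a, b, c):
    """Return (C, S) for t + a*b + c, where all four inputs are 32-bit words."""
    cs = t + mul_wide_u32(a, b) + c
    assert cs < 1 << 64          # the sum always fits in 64 bits
    return cs >> 32, cs & MASK32
```

With every input at its maximum, the sum is (2^32 - 1)^2 + 2*(2^32 - 1) = 2^64 - 1, so the carry word and the sum word together never overflow a 64-bit register.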

3. Improving the Montgomery Modular Multiplication on GPU
20% faster
The inline ASM function performs the 32-bit by 32-bit integer multiplication.
In the decuda output we can see that, in each loop iteration, the CIOS-ASM method uses 11 fewer instructions than the CIOS method.

3. Improving the Montgomery Modular Multiplication on GPU
GPU vs. CPU (GPU 20 times faster than CPU)
Total instructions
  CPU: 14s^2 + 16s + 5 ≈ 14850 (s = 32)
  GPU: 10~15 times more than the CPU, plus memory latency; factor = 1/40 ~ 1/60
Clock rate: CPU 2.4GHz, GPU 1.24GHz; factor = 1/2 * (1/40 ~ 1/60) = 1/80 ~ 1/120
Cores: CPU 4 cores, GPU 240 cores; factor = 240*4/4 = 240
Estimate: 240 * (1/80 ~ 1/120) = 2~3, i.e. almost 2-3 times faster than the 4-core CPU
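The back-of-the-envelope estimate can be reproduced with exact fractions. The factors are the ones stated on the slide; s = 32 is an assumption derived from 1024-bit operands in 32-bit words, and the function names are illustrative.

```python
from fractions import Fraction


def cpu_instruction_count(s):
    """Instruction count 14s^2 + 16s + 5 for s-word operands."""
    return 14 * s * s + 16 * s + 5


def estimated_speedup(per_thread_factor, clock_factor=Fraction(1, 2), core_factor=240):
    """Combine the per-thread slowdown, the clock ratio, and the core factor."""
    return per_thread_factor * clock_factor * core_factor


low = estimated_speedup(Fraction(1, 60))    # pessimistic per-thread factor
high = estimated_speedup(Fraction(1, 40))   # optimistic per-thread factor
```

For s = 32 the polynomial evaluates to 14853, which the slide rounds to 14850, and the combined factors give the stated range of 2 to 3.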

4. Summary
  Motivated by security issues: the homomorphic hash function is based on multiple-precision arithmetic.
  The GPU is good at parallel computing.
  Implemented multiple-precision arithmetic for CUDA.
  Improved the Montgomery modular multiplication.

5. Q&A
Thanks!