Download presentation
Presentation is loading. Please wait.
Published byHomer McDaniel Modified over 6 years ago
1
KyoungSoo Park Department of Electrical Engineering KAIST
High-performance Cryptographic Operations on General-Purpose Processors KyoungSoo Park Department of Electrical Engineering KAIST
2
Internet Traffic Is Still Rapidly Increasing
Cisco VNI Forecasts 194 EB per Month of IP Traffic by 2020* *The Zettabyte Era-Trends and Analysis (Cisco Visual Networking Index (VNI)) *
3
Encrypted Traffic Is Growing Fast
Our suggestion is to use GPU for offload SSL operations in a cost effective way. 50+% of transferred bytes in the Internet 60+% for mobile/cellular traffic
4
Growing Number of SSL/TLS Sites
1000x increase in valid SSL/TLS certificates for past 18 years Most popular sites support SSL (Google, Facebook, Amazon, etc.) * Source:
5
Practical Cryptography
Goal: “high-quality” traffic encryption at a small cost Secure protocols and implementations Secure key management Performance improvement with modern processors Multicore CPUs and manycore GPUs Cost-effective crypto performance acceleration Exploiting parallelism is the key to high performance SSL/TLS and IPsec Most widely used in VPN and secure Web transactions SSLShader [NSDI’11] and APUNet
6
Secure Sockets Layer (SSL)
A de-facto standard for secure communication Authentication, Confidentiality, Content integrity Client Server TCP handshake Key exchange using public key algorithm (e.g., RSA) Server identification Encrypted data
7
SSL Computation Overhead
Performance overhead (HTTPS vs. HTTP) Connection setup Data transfer Good privacy is expensive More servers H/W SSL accelerators Our suggestion: Use GPU to offload SSL operations 22x 50x Our suggestion is to use GPU for offload SSL operations in a cost effective way.
8
What Is GPU? I assume most of you are not familiar with GPU,
so let me introduce GPU briefly.
9
GPU = Graphics Processing Unit
The main chip of graphics cards Mainly used for real-time 3D game rendering Massively-parallel processing capacity (Battlefield Bad Company 2,
10
Small # of super-fast cores
CPU vs. GPU CPU: Small # of super-fast cores GPU: Large # of wimpy cores
11
How GPU Differs From CPU?
ALU Control ALU Cache Intel Xeon E5-2697v3 CPU: 14 cores NVIDIA Titan X: cores Core Cache Instructions / sec < 0.6×1012 11×1012
12
Single Instruction Multiple Threads (SIMT)
Example code: vector addition (C = A + B) CPU code GPU code void VecAdd( int *A, int *B, int *C, int N) { //iterate over N elements for(int i = 0; i < N; i++) C[i] = A[i] + B[i] } VecAdd(A, B, C, N); __global__ void VecAdd( int *A, int *B, int *C) { int i = threadIdx.x; C[i] = A[i] + B[i] } //Launch N threads VecAdd<<<1, N>>>(A, B, C);
13
Benefits of GPU over CPU
Higher raw computation power 0.6 x 1012 instrs/s (CPU) vs. 11 x 1012 instrs/s (GPU) Hide memory access latency Hardware-based context switch Higher memory bandwidth 68 GB/s (CPU) vs GB/s (GPU) BIG Caveat: 1, 2, 3 are realizable only through parallel execution of independent threads (*Intel E5-2697v3 CPU vs. NVIDIA TitanX GPU) .
14
SSLShader SSL-accelerator leveraging GPU SSL reverse proxy
High-performance Cost-effective SSL reverse proxy No modification on existing servers Cloud SSLShader Web Server SMTP Server POP3 Server Plain TCP SSL-encrypted session
15
Parallelism in SSL Processing
Client 1 SSLShader Client 2 1. Independent Sessions Client N 2. Independent SSL Record SSL Record SSL Record SSL Record 3. Parallelism in Cryptographic Operations
16
Our GPU Implementation
Choices of cipher-suite Optimization of GPU algorithms Exploiting massive parallel processing Parallelization of algorithms Batch processing Data copy overhead is significant Concurrent copy and execution Client Server Key exchange: RSA Encryption: AES Message Authentication: SHA1
17
C = Me mod n M = Cd mod n Basic RSA Operations
M: plain-text, C: cipher-text (e, n): public key, (d, n): private key Encryption: C = Me mod n Decryption: M = Cd mod n Small number: 3, 17, 65537 1024/2048 bits integer (300 ~ 600 digits) Exponentiation many multiplications Decryption at the server side is the bottleneck
18
Breakdown of Large Integer Multiplication
Schoolbook multiplication 649 X 63 280 4200 180 800 12000 5400 32000 406923 Accumulation is difficult to parallelize due to “overlapping digits” “carry propagation” 3 x 3 = 9 multiplications 8 addition of 6-digits integers
19
O(s) Parallel Multiplications
Example of 649 x 627 = 406,923 s = # of words in a large integer 2s steps 1 or 2 steps (s – 1 worst case)
20
More Optimizations on RSA
Common optimizations for RSA Chinese Remainder Theorem (CRT) Montgomery Multiplication Constant Length Non-zero Window (CLNW) Parallelization of serial algorithms Faster Calculation of M×n Interleaving of T + M×n Mixed-Radix Conversion Offloading GPU specific optimizations Warp Utilization Loop Unrolling Elimination of Divergence Avoiding Bank Conflicts Instruction-Level Optimization Read our NSDI’11 paper for details
21
Parallelism in SSL Processing
Client 1 SSLShader Client 2 1. Independent Sessions Client N Batch Processing 2. Independent SSL Record SSL Record SSL Record SSL Record 3. Parallelism in Cryptographic Operations
22
GTX580 Throughput w/o Batching
Intel Nehalem single core (2.66Ghz)
23
GTX580 Throughput w/ Batching
Batch size: 32~4096 depending on the algorithm Difference: ratio of computation to copy
24
Copy Overhead in GPU Cryptography
GPU processing works by Data copy: CPU memory GPU memory (PCIe) Execution in GPU Data copy: GPU memory -> CPU memory (PCIe) 4x w/o copy w/ copy 3.3x 2.4x AES-ENC (Gbps) AES-DEC HMAC-SHA1 GTX580 w/ copy 8.8 10 31 GTX580 no copy 21.8 33 124
25
Hiding Copy Overhead t Synchronous Execution Pipelining
Data copy: CPU -> GPU Execution in GPU Data copy: GPU -> CPU Processing time : 3t Pipelining … Data copy: CPU -> GPU Execution in GPU … … Data copy: GPU -> CPU Amortized processing time : t
26
GTX580 Performance w/ Pipelining
w/o copy synchronous pipelining 51% 36% 36% 14x 9x 9x
27
Summary of GPU Cryptography
Performance gain from GTX580 GPU performs as fast as 9 ~ 28 CPU cores Comparable with high-end hardware accelerators Lessons Batch processing is essential to fully utilize a GPU AES and SHA1 are bottlenecked by data copy RSA-1024 (ops/sec) AES-ENC (Gbps) AES-DEC SHA1 GTX580 91.9K 11.5 12.5 47.1 CPU core 3.3K 1.3 3.3
28
RSA Performance with Recent Processors
2.2x Or 31x CPU cores 2.1x Or 29x CPU cores 1.6 $2,095 $1,200
29
HMAC-SHA1 Performance with Recent Processors
2.0x
30
AES Performance with Recent Processors
CPU outperforms GPUs due to AES-NI Instructions.
31
Building SSL proxy that leverages GPU
32
Batching Crypto Operations
Network workloads vary over time Waiting for fixed batch size doesn’t work SSL Stack Input queue Output queue CPU GPU GPU Batch size is dynamically adjusted to queue length
33
Balancing Load Between CPU and GPU
For small batch, CPU is faster than GPU Opportunistic offloading CPU Input queue Output queue GPU queue Input queue length > threshold GPU CPU processing GPU processing when input queue length > threshold Cryptographic operation Minimum Maximum RSA (1024-bit) 16 512 AES Decryption 32 2048 AES Encryption 128 HMAC-SHA1
34
Scaling with Multiple Cores
CPU Core1 CPU Core2 CPU Input queues Output queues GPU queue CPU GPU GPU Per-core worker threads Network session handling, cryptographic operation Sharing a GPU with multiple cores More parallelism with larger batch size
35
HTTPS Connection Rate (SSL handshake)
Setup PC: 1 x E5-2697v3 CPU + 1 GPU Test: SSLShader + Plaintext lighttpd vs. OpenSSL lighttpd HTTP object: 1-byte object fetching Ciphersuite:RSA-2048, AES-128, HMAC-SHA1 6.2x
36
Latency and Throughput Experiments
Measurement done 5 years ago Performance trend is very similar to recent processors Setup Ciphersuite: RSA 1024, AES-128, HMAC-SHA1 CPU: 2 x X5650 (12 cores) GPU: 2 x GTX580 10G NIC Comparison of latency and throughput SSLShader proxy + plaintext lighttpd vs. OpenSSL-based lighttpd
37
Similar latency at light load
Lighttpd at 1k connections / sec SSLShader at 1k connections / sec Similar latency at light load
38
Lower latency and higher throughput at heavy load
Latency at Heavy Load SSLShader at 29k connections / sec Lighttpd at 11k connections / sec Lower latency and higher throughput at heavy load
39
Data Transfer Performance
2.1x SSLShader: 13 Gbps Lighttpd performance 0.87x Typical web content size is under 100KB
40
APUNet: Exploiting APU for Accelerating Network Applications
Accelerated Processor Unit (APU) CPU and GPU on one physical package Strength: no PCI-e overhead, low power consumption, inexpensive Weakness: shared memory bandwidth for CPU and GPU 4 CPU cores 512 GPU cores
41
IPsec Gateway Performance
AES-128 in CBC mode with HMAC-SHA1 PC: AMD Carrizo CPU + 40G NIC 2.2x Performance Improvement over CPU-only approach 16.4 Gbps IPsec Gateway Performance for $1,000 machine
42
Conclusions GPU Cryptographic algorithms in GPU SSLShader [NSDI’11]
Beneficial to compute-intensive network applications Cryptographic algorithms in GPU The fastest RSA implementation (29K ops/sec RSA-2048) Comparable to ~30 CPU cores Comparable to high-end hardware accelerators SSLShader [NSDI’11] Transparent integration Effective utilization of GPU for SSL processing Up to 6x connections / sec 13 Gbps throughput Linux network stack performance Can use mTCP user-level stack Copy overhead
43
More Information SSLShader Questions?
Questions?
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.