KyoungSoo Park Department of Electrical Engineering KAIST

KyoungSoo Park Department of Electrical Engineering KAIST
High-performance Cryptographic Operations on General-Purpose Processors KyoungSoo Park Department of Electrical Engineering KAIST

Internet Traffic Is Still Rapidly Increasing
Cisco VNI Forecasts 194 EB per Month of IP Traffic by 2020* *The Zettabyte Era-Trends and Analysis (Cisco Visual Networking Index (VNI)) *

Encrypted Traffic Is Growing Fast
Our suggestion is to use GPU for offload SSL operations in a cost effective way. 50+% of transferred bytes in the Internet 60+% for mobile/cellular traffic

Growing Number of SSL/TLS Sites
1000x increase in valid SSL/TLS certificates for past 18 years Most popular sites support SSL (Google, Facebook, Amazon, etc.) * Source:

Practical Cryptography
Goal: “high-quality” traffic encryption at a small cost Secure protocols and implementations Secure key management Performance improvement with modern processors Multicore CPUs and manycore GPUs Cost-effective crypto performance acceleration Exploiting parallelism is the key to high performance SSL/TLS and IPsec Most widely used in VPN and secure Web transactions SSLShader [NSDI’11] and APUNet

Secure Sockets Layer (SSL)
A de-facto standard for secure communication Authentication, Confidentiality, Content integrity Client Server TCP handshake Key exchange using public key algorithm (e.g., RSA) Server identification Encrypted data

SSL Computation Overhead
Performance overhead (HTTPS vs. HTTP) Connection setup Data transfer Good privacy is expensive More servers H/W SSL accelerators Our suggestion: Use GPU to offload SSL operations 22x 50x Our suggestion is to use GPU for offload SSL operations in a cost effective way.

What Is GPU? I assume most of you are not familiar with GPU,
so let me introduce GPU briefly.

GPU = Graphics Processing Unit
The main chip of graphics cards Mainly used for real-time 3D game rendering Massively-parallel processing capacity (Battlefield Bad Company 2,

Small # of super-fast cores
CPU vs. GPU CPU: Small # of super-fast cores GPU: Large # of wimpy cores

How GPU Differs From CPU?
ALU Control ALU Cache Intel Xeon E5-2697v3 CPU: 14 cores NVIDIA Titan X: cores Core Cache Instructions / sec < 0.6×1012 11×1012

Single Instruction Multiple Threads (SIMT)
Example code: vector addition (C = A + B) CPU code GPU code void VecAdd( int *A, int *B, int *C, int N) { //iterate over N elements for(int i = 0; i < N; i++) C[i] = A[i] + B[i] } VecAdd(A, B, C, N); __global__ void VecAdd( int *A, int *B, int *C) { int i = threadIdx.x; C[i] = A[i] + B[i] } //Launch N threads VecAdd<<<1, N>>>(A, B, C);

Benefits of GPU over CPU
Higher raw computation power 0.6 x 1012 instrs/s (CPU) vs. 11 x 1012 instrs/s (GPU) Hide memory access latency Hardware-based context switch Higher memory bandwidth 68 GB/s (CPU) vs GB/s (GPU) BIG Caveat: 1, 2, 3 are realizable only through parallel execution of independent threads (*Intel E5-2697v3 CPU vs. NVIDIA TitanX GPU) .

SSLShader SSL-accelerator leveraging GPU SSL reverse proxy
High-performance Cost-effective SSL reverse proxy No modification on existing servers Cloud SSLShader Web Server SMTP Server POP3 Server Plain TCP SSL-encrypted session

Parallelism in SSL Processing
Client 1 SSLShader Client 2 1. Independent Sessions Client N 2. Independent SSL Record SSL Record SSL Record SSL Record 3. Parallelism in Cryptographic Operations

Our GPU Implementation
Choices of cipher-suite Optimization of GPU algorithms Exploiting massive parallel processing Parallelization of algorithms Batch processing Data copy overhead is significant Concurrent copy and execution Client Server Key exchange: RSA Encryption: AES Message Authentication: SHA1

C = Me mod n M = Cd mod n Basic RSA Operations
M: plain-text, C: cipher-text (e, n): public key, (d, n): private key Encryption: C = Me mod n Decryption: M = Cd mod n Small number: 3, 17, 65537 1024/2048 bits integer (300 ~ 600 digits) Exponentiation  many multiplications Decryption at the server side is the bottleneck

Breakdown of Large Integer Multiplication
Schoolbook multiplication 649 X 63 280 4200 180 800 12000 5400 32000 406923 Accumulation is difficult to parallelize due to “overlapping digits” “carry propagation” 3 x 3 = 9 multiplications 8 addition of 6-digits integers

O(s) Parallel Multiplications
Example of 649 x 627 = 406,923 s = # of words in a large integer 2s steps 1 or 2 steps (s – 1 worst case)

More Optimizations on RSA
Common optimizations for RSA Chinese Remainder Theorem (CRT) Montgomery Multiplication Constant Length Non-zero Window (CLNW) Parallelization of serial algorithms Faster Calculation of M×n Interleaving of T + M×n Mixed-Radix Conversion Offloading GPU specific optimizations Warp Utilization Loop Unrolling Elimination of Divergence Avoiding Bank Conflicts Instruction-Level Optimization Read our NSDI’11 paper for details 

Parallelism in SSL Processing
Client 1 SSLShader Client 2 1. Independent Sessions Client N Batch Processing 2. Independent SSL Record SSL Record SSL Record SSL Record 3. Parallelism in Cryptographic Operations

GTX580 Throughput w/o Batching
Intel Nehalem single core (2.66Ghz)

GTX580 Throughput w/ Batching
Batch size: 32~4096 depending on the algorithm Difference: ratio of computation to copy

Copy Overhead in GPU Cryptography
GPU processing works by Data copy: CPU memory  GPU memory (PCIe) Execution in GPU Data copy: GPU memory -> CPU memory (PCIe) 4x w/o copy w/ copy 3.3x 2.4x AES-ENC (Gbps) AES-DEC HMAC-SHA1 GTX580 w/ copy 8.8 10 31 GTX580 no copy 21.8 33 124

Hiding Copy Overhead t Synchronous Execution Pipelining
Data copy: CPU -> GPU Execution in GPU Data copy: GPU -> CPU Processing time : 3t Pipelining … Data copy: CPU -> GPU Execution in GPU … … Data copy: GPU -> CPU Amortized processing time : t

GTX580 Performance w/ Pipelining
w/o copy synchronous pipelining 51% 36% 36% 14x 9x 9x

Summary of GPU Cryptography
Performance gain from GTX580 GPU performs as fast as 9 ~ 28 CPU cores Comparable with high-end hardware accelerators Lessons Batch processing is essential to fully utilize a GPU AES and SHA1 are bottlenecked by data copy RSA-1024 (ops/sec) AES-ENC (Gbps) AES-DEC SHA1 GTX580 91.9K 11.5 12.5 47.1 CPU core 3.3K 1.3 3.3

RSA Performance with Recent Processors
2.2x Or 31x CPU cores 2.1x Or 29x CPU cores 1.6 $2,095 $1,200

HMAC-SHA1 Performance with Recent Processors
2.0x

AES Performance with Recent Processors
CPU outperforms GPUs due to AES-NI Instructions.

Building SSL proxy that leverages GPU

Batching Crypto Operations
Network workloads vary over time Waiting for fixed batch size doesn’t work SSL Stack Input queue Output queue CPU GPU GPU Batch size is dynamically adjusted to queue length

Balancing Load Between CPU and GPU
For small batch, CPU is faster than GPU Opportunistic offloading CPU Input queue Output queue GPU queue Input queue length > threshold GPU CPU processing GPU processing when input queue length > threshold Cryptographic operation Minimum Maximum RSA (1024-bit) 16 512 AES Decryption 32 2048 AES Encryption 128 HMAC-SHA1

Scaling with Multiple Cores
CPU Core1 CPU Core2 CPU Input queues Output queues GPU queue CPU GPU GPU Per-core worker threads Network session handling, cryptographic operation Sharing a GPU with multiple cores More parallelism with larger batch size

HTTPS Connection Rate (SSL handshake)
Setup PC: 1 x E5-2697v3 CPU + 1 GPU Test: SSLShader + Plaintext lighttpd vs. OpenSSL lighttpd HTTP object: 1-byte object fetching Ciphersuite:RSA-2048, AES-128, HMAC-SHA1 6.2x

Latency and Throughput Experiments
Measurement done 5 years ago Performance trend is very similar to recent processors Setup Ciphersuite: RSA 1024, AES-128, HMAC-SHA1 CPU: 2 x X5650 (12 cores) GPU: 2 x GTX580 10G NIC Comparison of latency and throughput SSLShader proxy + plaintext lighttpd vs. OpenSSL-based lighttpd

Similar latency at light load
Lighttpd at 1k connections / sec SSLShader at 1k connections / sec Similar latency at light load

Lower latency and higher throughput at heavy load
Latency at Heavy Load SSLShader at 29k connections / sec Lighttpd at 11k connections / sec Lower latency and higher throughput at heavy load

Data Transfer Performance
2.1x SSLShader: 13 Gbps Lighttpd performance 0.87x Typical web content size is under 100KB

APUNet: Exploiting APU for Accelerating Network Applications
Accelerated Processor Unit (APU) CPU and GPU on one physical package Strength: no PCI-e overhead, low power consumption, inexpensive Weakness: shared memory bandwidth for CPU and GPU 4 CPU cores 512 GPU cores

IPsec Gateway Performance
AES-128 in CBC mode with HMAC-SHA1 PC: AMD Carrizo CPU + 40G NIC 2.2x Performance Improvement over CPU-only approach 16.4 Gbps IPsec Gateway Performance for $1,000 machine

Conclusions GPU Cryptographic algorithms in GPU SSLShader [NSDI’11]
Beneficial to compute-intensive network applications Cryptographic algorithms in GPU The fastest RSA implementation (29K ops/sec RSA-2048) Comparable to ~30 CPU cores Comparable to high-end hardware accelerators SSLShader [NSDI’11] Transparent integration Effective utilization of GPU for SSL processing Up to 6x connections / sec 13 Gbps throughput Linux network stack performance Can use mTCP user-level stack Copy overhead

More Information SSLShader Questions?
Questions?

KyoungSoo Park Department of Electrical Engineering KAIST

Similar presentations

Presentation on theme: "KyoungSoo Park Department of Electrical Engineering KAIST"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

KyoungSoo Park Department of Electrical Engineering KAIST

Similar presentations

Presentation on theme: "KyoungSoo Park Department of Electrical Engineering KAIST"— Presentation transcript:

Similar presentations

About project

Feedback