
High-performance Cryptographic Operations on General-Purpose Processors
KyoungSoo Park
Department of Electrical Engineering, KAIST

Internet Traffic Is Still Rapidly Increasing
Cisco VNI forecasts 194 EB per month of IP traffic by 2020*
* The Zettabyte Era: Trends and Analysis (Cisco Visual Networking Index (VNI)), http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/vni-hyperconnectivity-wp.html

Encrypted Traffic Is Growing Fast
50+% of transferred bytes in the Internet
60+% for mobile/cellular traffic

Growing Number of SSL/TLS Sites
1000x increase in valid SSL/TLS certificates over the past 18 years*
Most popular sites support SSL (Google, Facebook, Amazon, etc.)
* Source: https://www.netcraft.com/internet-data-mining/ssl-survey/

Practical Cryptography
Goal: “high-quality” traffic encryption at a small cost
  Secure protocols and implementations
  Secure key management
Performance improvement with modern processors
  Multicore CPUs and manycore GPUs
  Cost-effective crypto performance acceleration
  Exploiting parallelism is the key to high performance
SSL/TLS and IPsec
  Most widely used in VPN and secure Web transactions
  SSLShader [NSDI’11] and APUNet

Secure Sockets Layer (SSL)
A de-facto standard for secure communication
  Authentication, confidentiality, content integrity
(Diagram, client and server: TCP handshake; key exchange using a public-key algorithm (e.g., RSA); server identification; encrypted data)

SSL Computation Overhead
Performance overhead (HTTPS vs. HTTP)
  Connection setup, data transfer (figure labels, in order: 22x, 50x)
Good privacy is expensive
  More servers
  H/W SSL accelerators
Our suggestion: use the GPU to offload SSL operations in a cost-effective way

What Is a GPU?
I assume most of you are not familiar with GPUs, so let me introduce them briefly.

GPU = Graphics Processing Unit
The main chip of graphics cards
Mainly used for real-time 3D game rendering
Massively-parallel processing capacity
(Image: Battlefield Bad Company 2, http://www.techradar.com/news/gaming/top-10-best-3d-pc-games-914805)

CPU vs. GPU
CPU: small # of super-fast cores
GPU: large # of wimpy cores

How Does a GPU Differ From a CPU?
(Diagram labels: ALU, Control, Cache, Core)
Intel Xeon E5-2697v3 CPU: 14 cores, < 0.6×10^12 instructions / sec
NVIDIA Titan X: 3072 cores, 11×10^12 instructions / sec

Single Instruction Multiple Threads (SIMT)
Example code: vector addition (C = A + B)

CPU code:

    void VecAdd(int *A, int *B, int *C, int N)
    {
        // iterate over N elements
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }

    VecAdd(A, B, C, N);

GPU code:

    __global__ void VecAdd(int *A, int *B, int *C)
    {
        // each thread adds the one element selected by its thread index
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    // launch N threads in a single block
    VecAdd<<<1, N>>>(A, B, C);
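The `<<<1, N>>>` launch uses a single block, so N is capped by the per-block thread limit (1024 on current NVIDIA GPUs). A common generalization, which the slide does not show, spreads the work over several blocks and gives every thread a global index. The plain-C sketch below emulates that indexing on the CPU; `vec_add_thread` and `vec_add_emulated` are hypothetical names, and the nested loops only stand in for the hardware running the threads in parallel.

```c
#include <stddef.h>

/* Work done by one logical GPU thread. In CUDA the index would be
 * blockIdx.x * blockDim.x + threadIdx.x; here it is passed in explicitly. */
static void vec_add_thread(size_t block_idx, size_t block_dim, size_t thread_idx,
                           const int *a, const int *b, int *c, size_t n)
{
    size_t i = block_idx * block_dim + thread_idx;  /* global element index */
    if (i < n)                                      /* last block may be partly idle */
        c[i] = a[i] + b[i];
}

/* CPU emulation of launching ceil(n / block_dim) blocks of block_dim threads.
 * block_dim must be greater than zero. */
void vec_add_emulated(const int *a, const int *b, int *c, size_t n, size_t block_dim)
{
    size_t blocks = (n + block_dim - 1) / block_dim;
    for (size_t blk = 0; blk < blocks; blk++)
        for (size_t t = 0; t < block_dim; t++)
            vec_add_thread(blk, block_dim, t, a, b, c, n);
}
```

The bounds check matters whenever N is not a multiple of the block size: the surplus threads in the last block simply do nothing.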

Benefits of GPU over CPU*
1. Higher raw computation power
   0.6×10^12 instrs/s (CPU) vs. 11×10^12 instrs/s (GPU)
2. Hide memory access latency
   Hardware-based context switch
3. Higher memory bandwidth
   68 GB/s (CPU) vs. 336.5 GB/s (GPU)
BIG caveat: 1, 2, and 3 are realizable only through parallel execution of independent threads
* Intel E5-2697v3 CPU vs. NVIDIA Titan X GPU

SSLShader
SSL accelerator leveraging GPU
  High-performance
  Cost-effective
SSL reverse proxy
  No modification on existing servers
(Diagram labels: Cloud, SSLShader, Web Server, SMTP Server, POP3 Server, plain TCP, SSL-encrypted session)

Parallelism in SSL Processing
1. Independent sessions (Client 1, Client 2, ..., Client N connecting to SSLShader)
2. Independent SSL records
3. Parallelism in cryptographic operations

Our GPU Implementation
Choices of cipher suite
  Key exchange: RSA
  Encryption: AES
  Message authentication: SHA1
Optimization of GPU algorithms
  Exploiting massive parallel processing
    Parallelization of algorithms
    Batch processing
  Data copy overhead is significant
    Concurrent copy and execution

Basic RSA Operations
M: plain-text, C: cipher-text
(e, n): public key, (d, n): private key
Encryption: C = M^e mod n (small number: 3, 17, 65537)
Decryption: M = C^d mod n (1024/2048-bit integer, 300 ~ 600 digits)
Exponentiation -> many multiplications
Decryption at the server side is the bottleneck
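To see why exponentiation turns into many multiplications, here is a minimal square-and-multiply sketch in C. It assumes toy operands that fit in 64-bit words (modulus below 2^32), whereas real RSA needs a big-number library; `mod_pow` is a hypothetical helper name, not something taken from the talk.

```c
#include <stdint.h>

/* Computes (base^exp) mod n by square-and-multiply.
 * Assumption: n < 2^32, so every intermediate product fits in 64 bits. */
uint64_t mod_pow(uint64_t base, uint64_t exp, uint64_t n)
{
    uint64_t result = 1 % n;
    base %= n;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % n;  /* multiply step for a set bit */
        base = (base * base) % n;          /* square step for every bit */
        exp >>= 1;
    }
    return result;
}
```

With the textbook toy key n = 3233 (61 × 53), e = 17, d = 2753, `mod_pow(65, 17, 3233)` encrypts the message 65 and `mod_pow(c, 2753, 3233)` recovers it. The loop runs once per exponent bit, so a 2048-bit private exponent costs thousands of large multiplications while a small public exponent costs only a handful.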

Breakdown of Large Integer Multiplication
Schoolbook multiplication

         649
    x    627
    --------
          63
         280
        4200
         180
         800
       12000
        5400
       24000
    + 360000
    --------
      406923

3 x 3 = 9 multiplications
8 additions of 6-digit integers
Accumulation is difficult to parallelize due to “overlapping digits” and “carry propagation”

O(s) Parallel Multiplications
s = # of words in a large integer
Example of 649 x 627 = 406,923
(Figure labels: 2s steps; 1 or 2 steps (s - 1 worst case))
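One common way to make the accumulation friendlier to parallel hardware is to sum each output column without carries first and resolve carries in a separate pass. The serial C sketch below illustrates that general idea on base-10 digits; it is not the SSLShader GPU kernel, and `mul_deferred_carry` and `MAX_COLS` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_COLS 64  /* assumption: operands have at most 32 digits */

/* Multiplies two little-endian base-10 digit arrays of s digits each.
 * out receives 2*s little-endian digits. Returns 0 on success, -1 if
 * s is zero or too large for the fixed column buffer. */
int mul_deferred_carry(const uint8_t *a, const uint8_t *b, size_t s, uint8_t *out)
{
    uint32_t col[MAX_COLS] = {0};
    if (s == 0 || 2 * s > MAX_COLS)
        return -1;

    /* Phase 1: column k collects every product a[i]*b[j] with i + j == k.
     * No carries are produced here, so columns do not depend on each other. */
    for (size_t i = 0; i < s; i++)
        for (size_t j = 0; j < s; j++)
            col[i + j] += (uint32_t)a[i] * b[j];

    /* Phase 2: a single pass turns the column sums into digits. */
    uint32_t carry = 0;
    for (size_t k = 0; k < 2 * s; k++) {
        uint32_t v = col[k] + carry;
        out[k] = (uint8_t)(v % 10);
        carry = v / 10;
    }
    return 0;
}
```

For 649 x 627 the column sums are 63, 46, 104, 36, 36, and the carry pass turns them into the digits of 406923. On a GPU, each column (or each result word) could be assigned to its own thread in phase 1.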

More Optimizations on RSA
Common optimizations for RSA
  Chinese Remainder Theorem (CRT)
  Montgomery Multiplication
  Constant Length Non-zero Window (CLNW)
Parallelization of serial algorithms
  Faster calculation of M×n
  Interleaving of T + M×n
  Mixed-radix conversion offloading
GPU-specific optimizations
  Warp utilization
  Loop unrolling
  Elimination of divergence
  Avoiding bank conflicts
  Instruction-level optimization
Read our NSDI’11 paper for details
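Of the optimizations listed, the Chinese Remainder Theorem is the easiest to show at small scale: decryption is split into two exponentiations modulo the primes p and q, each with a half-size exponent, and the results are recombined. The C sketch below uses Garner's recombination with toy 64-bit values; the function names and the precomputed parameters `dp`, `dq`, and `qinv` are illustrative assumptions, not the talk's implementation.

```c
#include <stdint.h>

/* Square-and-multiply for small operands (modulus below 2^32). */
static uint64_t mod_pow_small(uint64_t base, uint64_t exp, uint64_t n)
{
    uint64_t result = 1 % n;
    base %= n;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % n;
        base = (base * base) % n;
        exp >>= 1;
    }
    return result;
}

/* RSA decryption via CRT.
 * dp = d mod (p-1), dq = d mod (q-1), qinv = q^(-1) mod p. */
uint64_t rsa_crt_decrypt(uint64_t c, uint64_t p, uint64_t q,
                         uint64_t dp, uint64_t dq, uint64_t qinv)
{
    uint64_t m1 = mod_pow_small(c, dp, p);    /* message mod p */
    uint64_t m2 = mod_pow_small(c, dq, q);    /* message mod q */
    uint64_t diff = (m1 + p - (m2 % p)) % p;  /* (m1 - m2) mod p, kept non-negative */
    uint64_t h = (qinv * diff) % p;
    return m2 + h * q;                        /* Garner recombination */
}
```

For the toy key p = 61, q = 53, d = 2753, the precomputed values are dp = 53, dq = 49, qinv = 38. The two half-size exponentiations are independent of each other, which also makes them a natural unit of parallel work.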

Parallelism in SSL Processing
1. Independent sessions (Client 1, Client 2, ..., Client N connecting to SSLShader)
2. Independent SSL records
3. Parallelism in cryptographic operations
Batch processing

GTX580 Throughput w/o Batching
Intel Nehalem single core (2.66 GHz)

GTX580 Throughput w/ Batching
Batch size: 32~4096 depending on the algorithm
Difference: ratio of computation to copy

Copy Overhead in GPU Cryptography
GPU processing works by
  1. Data copy: CPU memory -> GPU memory (PCIe)
  2. Execution in GPU
  3. Data copy: GPU memory -> CPU memory (PCIe)

                        AES-ENC (Gbps)   AES-DEC (Gbps)   HMAC-SHA1 (Gbps)
GTX580 w/ copy          8.8              10               31
GTX580 no copy          21.8             33               124
w/o copy vs. w/ copy    2.4x             3.3x             4x

Hiding Copy Overhead
Synchronous execution
  Data copy: CPU -> GPU, execution in GPU, and data copy: GPU -> CPU run one after another, each taking time t
  Processing time: 3t
Pipelining
  Data copy: CPU -> GPU, execution in GPU, and data copy: GPU -> CPU of successive batches overlap
  Amortized processing time: t
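The 3t versus t figures follow from a simple timing model. The C sketch below assumes three stages of equal duration t that overlap perfectly across batches; real copy and execution times differ, and the function names are hypothetical.

```c
/* Total time for n batches when copy-in, execute, and copy-out run
 * strictly one after another for every batch. */
unsigned long sync_time(unsigned long n, unsigned long t)
{
    return 3 * n * t;
}

/* Total time for n batches when the three stages are pipelined: the first
 * batch needs 3t to drain, then one batch completes every t. */
unsigned long pipelined_time(unsigned long n, unsigned long t)
{
    return n == 0 ? 0 : (n + 2) * t;
}
```

For 100 batches with t = 1, the synchronous model takes 300 time units and the pipelined model 102, so the per-batch cost approaches t as the number of batches grows.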

GTX580 Performance w/ Pipelining
(Figure: w/o copy, synchronous, and pipelining; labels: 51%, 36%, 36%, 14x, 9x, 9x)

Summary of GPU Cryptography
Performance gain from GTX580
  GPU performs as fast as 9 ~ 28 CPU cores
  Comparable with high-end hardware accelerators
Lessons
  Batch processing is essential to fully utilize a GPU
  AES and SHA1 are bottlenecked by data copy

            RSA-1024 (ops/sec)   AES-ENC (Gbps)   AES-DEC (Gbps)   SHA1 (Gbps)
GTX580      91.9K                11.5             12.5             47.1
CPU core    3.3K 1.3 3.3

RSA Performance with Recent Processors
(Figure labels: 2.2x or 31x CPU cores; 2.1x or 29x CPU cores; 1.6; $2,095; $1,200)

HMAC-SHA1 Performance with Recent Processors
(Figure label: 2.0x)

AES Performance with Recent Processors
CPU outperforms GPUs due to AES-NI instructions.

Building an SSL Proxy That Leverages the GPU

Batching Crypto Operations
Network workloads vary over time
  Waiting for a fixed batch size doesn’t work
Batch size is dynamically adjusted to queue length
(Diagram labels: SSL stack, input queue, output queue, CPU, GPU)

Balancing Load Between CPU and GPU
For small batch, CPU is faster than GPU
Opportunistic offloading
  CPU processing
  GPU processing when input queue length > threshold
(Diagram labels: CPU, input queue, output queue, GPU queue, GPU)

Cryptographic operation   Minimum   Maximum
RSA (1024-bit)            16        512
AES Decryption            32        2048
AES Encryption 128
HMAC-SHA1
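The offloading rule can be sketched as a small decision function. The C code below assumes that the table's Minimum column is the queue-length threshold and the Maximum column caps the GPU batch size; that reading, along with the type and function names, is an assumption made for illustration.

```c
#include <stddef.h>

typedef struct {
    size_t min_batch;  /* assumed: queue-length threshold for using the GPU */
    size_t max_batch;  /* assumed: largest batch handed to the GPU at once */
} offload_limits;

typedef struct {
    int use_gpu;       /* 1 = offload to GPU, 0 = process on CPU */
    size_t batch;      /* number of operations to take from the input queue */
} offload_decision;

/* Chooses where to run the next crypto operations based on queue length. */
offload_decision decide_offload(size_t queue_len, offload_limits lim)
{
    offload_decision d = {0, 0};
    if (queue_len == 0)
        return d;                          /* nothing to do */
    if (queue_len > lim.min_batch) {       /* enough work to amortize GPU overhead */
        d.use_gpu = 1;
        d.batch = queue_len < lim.max_batch ? queue_len : lim.max_batch;
    } else {
        d.batch = 1;                       /* CPU handles one operation at a time */
    }
    return d;
}
```

With the RSA (1024-bit) limits of 16 and 512, a queue of 10 requests stays on the CPU, a queue of 100 goes to the GPU as one batch, and a queue of 1000 is offloaded 512 at a time. Because the batch follows the queue length, it grows under load without waiting for a fixed size.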

Scaling with Multiple Cores
Per-core worker threads
  Network session handling, cryptographic operation
Sharing a GPU with multiple cores
  More parallelism with larger batch size
(Diagram labels: CPU Core1, CPU Core2, input queues, output queues, GPU queue, GPU)

HTTPS Connection Rate (SSL handshake)
Setup
  PC: 1 x E5-2697v3 CPU + 1 GPU
  Test: SSLShader + plaintext lighttpd vs. OpenSSL lighttpd
  HTTP object: 1-byte object fetching
  Ciphersuite: RSA-2048, AES-128, HMAC-SHA1
(Figure label: 6.2x)

Latency and Throughput Experiments
Measurement done 5 years ago
  Performance trend is very similar to recent processors
Setup
  Ciphersuite: RSA-1024, AES-128, HMAC-SHA1
  CPU: 2 x X5650 (12 cores)
  GPU: 2 x GTX580
  10G NIC
Comparison of latency and throughput
  SSLShader proxy + plaintext lighttpd vs. OpenSSL-based lighttpd

Similar latency at light load
Lighttpd at 1k connections / sec
SSLShader at 1k connections / sec

Latency at Heavy Load
SSLShader at 29k connections / sec
Lighttpd at 11k connections / sec
Lower latency and higher throughput at heavy load

Data Transfer Performance
SSLShader: 13 Gbps
Typical web content size is under 100KB
(Figure labels: Lighttpd performance; 2.1x; 0.87x)

APUNet: Exploiting APU for Accelerating Network Applications
Accelerated Processing Unit (APU)
  CPU and GPU on one physical package (4 CPU cores, 512 GPU cores)
  Strength: no PCIe overhead, low power consumption, inexpensive
  Weakness: shared memory bandwidth for CPU and GPU

IPsec Gateway Performance
AES-128 in CBC mode with HMAC-SHA1
PC: AMD Carrizo CPU + 40G NIC
2.2x performance improvement over the CPU-only approach
16.4 Gbps IPsec gateway performance for a $1,000 machine

Conclusions
GPU
  Beneficial to compute-intensive network applications
Cryptographic algorithms in GPU
  The fastest RSA implementation (29K ops/sec RSA-2048)
  Comparable to ~30 CPU cores
  Comparable to high-end hardware accelerators
SSLShader [NSDI’11]
  Transparent integration
  Effective utilization of GPU for SSL processing
  Up to 6x connections / sec
  13 Gbps throughput
Linux network stack performance
  Can use mTCP user-level stack
Copy overhead

More Information
SSLShader: https://shader.kaist.edu/sshslahder/
Questions?