Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Enabling High Performance Bulk Data Transfers With SSH Chris Rapier Benjamin.

Slides:



Advertisements
Similar presentations
3D Graphics Content Over OCP Martti Venell Sr. Verification Engineer Bitboys.
Advertisements

“Advanced Encryption Standard” & “Modes of Operation”
TIE Extensions for Cryptographic Acceleration Charles-Henri Gros Alan Keefer Ankur Singla.
BZUPAGES.COM 1 User Datagram Protocol - UDP RFC 768, Protocol 17 Provides unreliable, connectionless on top of IP Minimal overhead, high performance –No.
CSC457 Seminar YongKang Zhu December 6 th, 2001 About Network Processor.
SDN and Openflow.
Chapter 8 Hardware Conventional Computer Hardware Architecture.
Cryptography1 CPSC 3730 Cryptography Chapter 6 Triple DES, Block Cipher Modes of Operation.
1 CS 577 “TinySec: A Link Layer Security Architecture for Wireless Sensor Networks” Chris Karlof, Naveen Sastry, David Wagner UC Berkeley Summary presented.
High-performance bulk data transfers with TCP Matei Ripeanu University of Chicago.
Chapter 1 Read (again) chapter 1.
Chapter 7 Interupts DMA Channels Context Switching.
EE 4272Spring, 2003 Protocols & Architecture A Protocol Architecture is the layered structure of hardware & software that supports the exchange of data.
Computer Networks Transport Layer. Topics F Introduction  F Connection Issues F TCP.
CSE 451: Operating Systems Winter 2010 Module 13 Redundant Arrays of Inexpensive Disks (RAID) and OS structure Mark Zbikowski Gary Kimura.
Basics of Operating Systems March 4, 2001 Adapted from Operating Systems Lecture Notes, Copyright 1997 Martin C. Rinard.
OpenSSL acceleration using Graphics Processing Units
High Performance Networking with the SSH Protocol Chris Rapier Vancouver Joint Techs July 19, 2005.
JVM Tehnologic Company profile & core business Founded: February 1992; –Core business: design and implementation of large software applications mainly.
Chapter 51 Threads Chapter 5. 2 Process Characteristics  Concept of Process has two facets.  A Process is: A Unit of resource ownership:  a virtual.
Christopher Bednarz Justin Jones Prof. Xiang ECE 4986 Fall Department of Electrical and Computer Engineering University.
1 Chapter Client-Server Interaction. 2 Functionality  Transport layer and layers below  Basic communication  Reliability  Application layer.
Brierley 1 Module 4 Module 4 Introduction to LAN Switching.
Profiling Grid Data Transfer Protocols and Servers George Kola, Tevfik Kosar and Miron Livny University of Wisconsin-Madison USA.
Operating System 4 THREADS, SMP AND MICROKERNELS
Rensselaer Polytechnic Institute CSCI-4210 – Operating Systems CSCI-6140 – Computer Operating Systems David Goldschmidt, Ph.D.
MIDeA :A Multi-Parallel Instrusion Detection Architecture Author: Giorgos Vasiliadis, Michalis Polychronakis,Sotiris Ioannidis Publisher: CCS’11, October.
The NE010 iWARP Adapter Gary Montry Senior Scientist
Class 7 Practical Considerations CIS 755: Advanced Computer Security Spring 2014 Eugene Vasserman
Networked & Distributed Systems TCP/IP Transport Layer Protocols UDP and TCP University of Glamorgan.
Example: Sorting on Distributed Computing Environment Apr 20,
WEP Protocol Weaknesses and Vulnerabilities
Srihari Makineni & Ravi Iyer Communications Technology Lab
Shambhu Upadhyaya Security – AES-CCMP Shambhu Upadhyaya Wireless Network Security CSE 566 (Lecture 13)
Implementing Memory Protection Primitives on Reconfigurable Hardware Brett Brotherton Nick Callegari Ted Huffmire.
Harnessing Multicore Processors for High Speed Secure Transfer Raj Kettimuthu Argonne National Laboratory.
Operating System Structure A key concept of operating systems is multiprogramming. –Goal of multiprogramming is to efficiently utilize all of the computing.
We will focus on operating system concepts What does it do? How is it implemented? Apply to Windows, Linux, Unix, Solaris, Mac OS X. Will discuss differences.
Computer Network Lab. Korea University Computer Networks Labs Se-Hee Whang.
LRPC Firefly RPC, Lightweight RPC, Winsock Direct and VIA.
CSC Multiprocessor Programming, Spring, 2012 Chapter 11 – Performance and Scalability Dr. Dale E. Parson, week 12.
Department of Electronic Engineering City University of Hong Kong EE3900 Computer Networks Protocols and Architecture Slide 1 Use of Standard Protocols.
RTL Design Methodology Transition from Pseudocode & Interface
Prentice HallHigh Performance TCP/IP Networking, Hassan-Jain Chapter 13 TCP Implementation.
Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center HPN-SSH TIP’08 Enabling High Performance Bulk Data Transfers With SSH Chris Rapier Benjamin.
CCNA3 Module 4 Brierley Module 4. CCNA3 Module 4 Brierley Topics LAN congestion and its effect on network performance Advantages of LAN segmentation in.
1 Device Controller I/O units typically consist of A mechanical component: the device itself An electronic component: the device controller or adapter.
Doc.:IEEE /0313r1 Submission Robert Stacey (Intel) March 12, 2010 Slide 1 Rekeying Protocol Fix Authors: Date:
INDIANAUNIVERSITYINDIANAUNIVERSITY Tsunami File Transfer Protocol Presentation by ANML January 2003.
Chapter 7: Using Network Clients The Complete Guide To Linux System Administration.
References A. Silberschatz, P. B. Galvin, and G. Gagne, “Operating Systems Concepts (with Java)”, 8th Edition, John Wiley & Sons, 2009.
Mohit Aron Peter Druschel Presenter: Christopher Head
Computer and Network Security
Quick UDP Internet Connections
Transport Protocols over Circuits/VCs
TLS Receive Side Crypto Offload to NIC
Operating Systems CPU Scheduling.
Chapter 4: Threads.
PUSH Flag A notification from the sender to the receiver to pass all the data the receiver has to the receiving application. Some implementations of TCP.
TCP over SoNIC Abhishek Kumar Maurya
CSE 451: Operating Systems Winter 2009 Module 13 Redundant Arrays of Inexpensive Disks (RAID) and OS structure Mark Zbikowski Gary Kimura 1.
Fast Communication and User Level Parallelism
Threads Chapter 4.
CSE 451: Operating Systems Winter 2012 Redundant Arrays of Inexpensive Disks (RAID) and OS structure Mark Zbikowski Gary Kimura 1.
Outline Chapter 2 (cont) OS Design OS structure
Block Ciphers (Crypto 2)
November 26th, 2018 Prof. Ion Stoica
Counter Mode, Output Feedback Mode
Achieving reliable high performance in LFNs (long-fat networks)
CSC Multiprocessor Programming, Spring, 2011
Presentation transcript:

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Enabling High Performance Bulk Data Transfers With SSH Chris Rapier Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Moving Data Still crazy after all these years –Multiple solutions exist Protocols –UDT, SABUL, etc… Implementations –GridFTP, kFTP, bbFTP, hand rolled and more… Not to mention –Advanced congestion control, autotuning, jumbograms, etc…

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Many Solutions No Answers All developed as solutions to the same problem –Moving lots of data very fast and reliably can be very difficult Unfortunately, no single solution meets all needs. –Fast, easy to use, inexpensive to maintain, flexible, secure, ubiquitous

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Why not [insert app here]? Many will work just fine if the right environment exists. –Some solutions have significantly higher costs of entry than others. Not a problem for some but can be a serious barrier to smaller organizations. –Sometimes just getting the app installed on both ends is a problem. Can make it difficult to ‘test drive’

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 What About SSH? Easy to use. –Well known interface Cheap to maintain. –Often a ‘fire and forget’ installation Installed everywhere. –Included in most OS distributions Flexible. –Multiple authentication methods and functional modes Strong cryptography.

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Why not SSH? It can be slow. –Really. –Slow.

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 How slow?

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 A little better

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 What changed? Why the improvement in OpenSSH4.7? –SSH is a multiplexed application Each channel requires its own flow control which is implemented as a receive window –In 4.7 the maximum window size was increased to ~1MiB up from 64KiB

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Windows Receive and congestion windows advertise the amount of data a system or application is willing to accept per round trip time. Effective window size is the minimum of all windows; protocol and application. Each window must be tuned and in sync to maximize throughput. –If any one is out of tune the entire connection will suffer.

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 TCP

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 TCP

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 TCP

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 TCP

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 SSHTCP

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 SSHTCP

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Windows in HPN-SSH Dynamically defined receive window size grows to match the TCP window. –Set to TCP RWIN on start. –Grows with RWIN if autotuning system. –Dynamic sizing reduces issues of over- buffering problems.

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 HPN-SSH TCP

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 HPN-SSH TCP

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 HPN-SSH TCP

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 HPN-SSH TCP

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 SFTP is Special SFTP adds *another* layer of flow control. –All SFTP packets are treated as requests –By default no more than 16 outstanding requests. –Results in a 512KiB window –Increase using -R on command line

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 HPN-SSH TCPSFTP

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 A lot better

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 But… As the throughput increases crypto demands more of the processor. –The transfer is now processor bound

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 We Need More Power? Two solutions to processor bound transfers –Throw more processing power at the problem –Do the work more efficiently Define ‘work’

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 The None Switch Many people only need secure authentication. The data can pass in the clear. –HPN-SSH allows users to switch to a ‘None’ cipher after authentication.

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Done!

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 As far as we can go? Windows are already optimized. –No more real improvements available there NONE cipher is limited to a subset of transfers. –Sometimes you absolutely need full encryption. So what now?

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 More Power Common assumption that current hardware is incapable of meeting crypto demand –Is it true?

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Packetize Compute MAC Encrypt read(disk) write(net) Depacketize Compute MAC Decrypt write(disk) read(net) Tx Rx What does SSH need to do?

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Today's Hardware Laptop –Two 64bit general purpose cores –1GiB to 4GiB RAM –1Gbps ethernet Desktop/Workstation –Two to eight 64bit general purpose cores –1GiB to 8GiB RAM –1Gbps ethernet

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 OpenSSL Benchmarks Dual Intel Xeon 5345 Workstation –4 cores per socket, 8 cores 2.33Ghz –Fedora 7 stock OpenSSL build

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras Gbps, ~0.3 cores 1Gbps, ~1.34 cores Crypto 1Gbps, ~1.64 cores We have 8! We have the CPU power

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 MAC requires fraction of one core Cipher requires more than one core MAC, cipher, and more all within a single execution thread So what's the problem? ssh idle kernel I/O idle util %

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Multi-threading on functional boundaries –Perform MAC and cipher on a packet concurrently Possible on sender, not on receiver –Process multiple packets concurrently (pipeline) –Cipher still needs more than one core Multi-threading within cipher –Can it be parallelized? How can we fix it?

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 SSH Cipher Modes CBC –Most common –RFC 4253 “The Secure Shell (SSH) Transport Layer Protocol” specifies only CBC mode ciphers, arcfour, and none. CTR –Specified in RFC 4344 “SSH Transport Layer Encryption Modes” –More desirable security properties than CBC

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Cipher Block Chaining Mode Encryption Hello, my name is CBC XOR IVP0P0 Encrypt C0C0 XOR Encrypt C1C1 P1P1 Key...

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Cipher Block Chaining Mode Decryption Decrypt Key C0C0 XOR P0P0 Decrypt XOR P1P1 C1C1 IV... Hello, my name is CBC (cont)

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Encrypt must be serial Decrypt may be parallel That doesn't help so much :-( CBC Summary

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Counter Mode Encryption Hello, my name is CTR CTR Encrypt Key XOR P0P0 C0C0 CTR + 1 Encrypt XOR P1P1 C1C1...

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Counter Mode Decryption Hello, my name is CTR (cont) CTR Encrypt Key XOR C0C0 P0P0 Encrypt XOR C1C1 P1P1... CTR + 1

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Encrypt may be parallel Decrypt may be parallel Keystream can be pregenerated Let’s get to work… CTR Summary

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Uses arbitrary number of cipher threads (and cores) to generate a single keystream. Cipher threads pre-generate keystream, starting once a cipher context key and IV are known. Leaves only keystream dequeue & XOR for encrypt/decrypt operations in main SSH thread. Multi-threaded AES-CTR

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Single Cipher Thread Cipher Thread –AES_Encrypt(ctr) –Inc(ctr) Main Thread –read(disk) –Packetize –Compute MAC –XOR –write(net) Keystream Q

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Multiple Cipher Threads Ring of bounded queues –Each queue holds a portion of keystream –Each queue exclusively accessed Queue counters offset initially and each fill DRAINING FILLING EMPTY Main Thread Cipher Thread 1 Cipher Thread 2

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 M-T AES-CTR Results

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Conclusion SSH designed for security –HPN-SSH is performance enhancements to the most common SSH implementation, OpenSSH High throughput with high latency –Kernel auto-tuning adjusts TCP flow contol –HPN-SSH RecvBufferPolling adjusts SSH flow control High throughput with any latency –HPN-SSH None cipher for non-private data –HPN-SSH Multi-threaded AES-CTR cipher

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Future Work Approaching 10Gbps Continued multi-threading –Concurrent packet processing/pipelining Efficiency Striped data transfers Exotic architectures

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center Mardi Gras 2008 Where to get it