Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center HPN-SSH TIP’08 Enabling High Performance Bulk Data Transfers With SSH Chris Rapier Benjamin.

Presentation transcript:

Chris Rapier, Benjamin Bennett Pittsburgh Supercomputing Center HPN-SSH TIP’08 Enabling High Performance Bulk Data Transfers With SSH Chris Rapier Benjamin Bennett Pittsburgh Supercomputing Center TIP ‘08

Moving Data
Still crazy after all these years
  – Multiple solutions exist
Protocols
  – UDT, SABUL, etc.
Implementations
  – GridFTP, kFTP, bbFTP, hand-rolled and more
Not to mention
  – Advanced congestion control, autotuning, jumbograms, etc.

Many Solutions, No Answers
All developed as a solution to the same problem
  – Moving lots of data very fast can be very difficult
Unfortunately, no single solution meets all needs.
  – Fast, easy to use, inexpensive to maintain, flexible, secure

What About SSH?
Easy to use.
Cheap to maintain.
Installed everywhere.
Flexible.
Strong cryptography.

Why not SSH?
It can be really, really slow.

How slow?

A little better

What changed?
Why the improvement in OpenSSH 4.7?
  – SSH is a multiplexed application
    Each channel requires its own flow control, which is implemented as a receive window
  – In 4.7 the maximum window size was increased to ~1MiB, up from 64KiB

Windows
Receive windows advertise the amount of data a system or application is willing to accept per round trip time.
Effective window size is the minimum of all windows, protocol and application.
Each window must be tuned and in sync to maximize throughput.
  – If any one is out of tune the entire connection will suffer.
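
As a rough illustration of the window rule above, the sketch below takes the minimum of the windows in the stack and divides by the round-trip time to get a throughput ceiling. The 64KiB and ~1MiB SSH channel windows come from the slides; the 100 ms RTT, the 4MiB TCP window, and all function names are assumptions made for the example.

```python
# Throughput ceiling implied by flow-control windows: at most one
# effective window of data can be in flight per round trip.

KIB = 1024
MIB = 1024 * KIB


def effective_window(windows):
    """The connection is limited by the smallest window in the stack."""
    return min(windows.values())


def max_throughput_bps(windows, rtt_seconds):
    """Upper bound on throughput in bits per second."""
    return effective_window(windows) * 8 / rtt_seconds


rtt = 0.100  # assumed 100 ms round-trip time

# Assumed 4 MiB TCP receive window; SSH channel window of 64 KiB
# (before OpenSSH 4.7) versus ~1 MiB (4.7).
old = {"tcp": 4 * MIB, "ssh_channel": 64 * KIB}
new = {"tcp": 4 * MIB, "ssh_channel": 1 * MIB}

for label, windows in (("64KiB SSH window", old), ("1MiB SSH window", new)):
    mbps = max_throughput_bps(windows, rtt) / 1e6
    print(f"{label}: at most {mbps:.1f} Mbps")
```

In both cases the TCP window is far larger than the SSH window, so the SSH channel window alone sets the ceiling.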

TCP

SSH TCP

Windows in HPN-SSH
Dynamically defined receive window size grows to match the TCP window.
  – Set to TCP RWIN on start.
  – Grows with RWIN on autotuning systems.
  – Dynamic sizing reduces over-buffering problems.

HPN-SSH TCP

SFTP is Special
SFTP adds *another* layer of flow control.
  – All SFTP packets are treated as requests
  – By default no more than 16 outstanding requests
  – Results in a 512KiB window
  – Increase using -R on command line
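
The SFTP window arithmetic can be sketched as follows. The 16-request default and the 512KiB result are from the slide. The 32KiB per-request size is inferred from those two numbers, and the 1 Gbps, 50 ms path is purely an assumed example.

```python
# SFTP's request-based flow control: the amount of data in flight is
# bounded by (outstanding requests) x (bytes per request).

KIB = 1024


def sftp_window_bytes(num_requests, request_bytes):
    return num_requests * request_bytes


def requests_needed(target_window_bytes, request_bytes):
    """Smallest request count whose window covers the target (ceiling division)."""
    return -(-target_window_bytes // request_bytes)


REQUEST_BYTES = 32 * KIB  # assumed per-request size (512KiB / 16 requests)

default_window = sftp_window_bytes(16, REQUEST_BYTES)
print(default_window // KIB, "KiB")

# Assumed path: 1 Gbps at 50 ms RTT. Bandwidth-delay product in bytes,
# computed with integers: (bits/s / 8) * ms / 1000.
bdp = 1_000_000_000 // 8 * 50 // 1000
print(requests_needed(bdp, REQUEST_BYTES), "requests to cover", bdp, "bytes")
```

The second figure is the kind of value one might pass to -R on such a path, provided the SSH and TCP windows underneath are at least as large.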

HPN-SSH TCP SFTP

A lot better

But…
As the throughput increases, crypto demands more of the processor.
  – The transfer is now processor bound

We Need More Power?
Two solutions to processor bound transfers
  – Throw more processing power at the problem
  – Do the work more efficiently
    Define ‘work’

The None Switch
Many people only need secure authentication. The data can pass in the clear.
  – HPN-SSH allows users to switch to a ‘None’ cipher after authentication.

Done!

As far as we can go?
Windows are already optimized.
  – No more real improvements available there
NONE cipher is limited to a subset of transfers.
  – Sometimes you absolutely need full encryption.
So what now?

More Power
It is a common assumption that current hardware is incapable of meeting crypto demand
  – Is it true?

What does SSH need to do?
Tx: read(disk) -> Packetize -> Compute MAC -> Encrypt -> write(net)
Rx: read(net) -> Decrypt -> Compute MAC -> Depacketize -> write(disk)

Today's Hardware
Laptop
  – Two 64-bit general purpose cores
  – 1GiB to 4GiB RAM
  – 1Gbps ethernet
Desktop/Workstation
  – Two to eight 64-bit general purpose cores
  – 1GiB to 8GiB RAM
  – 1Gbps ethernet

OpenSSL Benchmarks
Dual Intel Xeon 5345 Workstation
  – 4 cores per socket, 8 cores total, 2.33GHz
  – Fedora 7 stock OpenSSL build

We have the CPU power
MAC at 1Gbps: ~0.3 cores
Cipher at 1Gbps: ~1.34 cores
Crypto (MAC + cipher) at 1Gbps: ~1.64 cores
We have 8!

So what's the problem?
MAC requires fraction of one core
Cipher requires more than one core
MAC, cipher, and more all within a single execution thread
[Chart of CPU util %; labels: ssh, kernel, I/O, idle]

How can we fix it?
Multi-threading on functional boundaries
  – Perform MAC and cipher on a packet concurrently
    Possible on sender, not on receiver
  – Process multiple packets concurrently (pipeline)
  – Cipher still needs more than one core
Multi-threading within cipher
  – Can it be parallelized?

SSH Cipher Modes
CBC
  – Most common
  – RFC 4253 “The Secure Shell (SSH) Transport Layer Protocol” specifies only CBC mode ciphers, arcfour, and none.
CTR
  – Specified in RFC 4344 “SSH Transport Layer Encryption Modes”
  – More desirable security properties than CBC

Hello, my name is CBC
Cipher Block Chaining Mode Encryption
C0 = Encrypt(Key, P0 XOR IV)
C1 = Encrypt(Key, P1 XOR C0)
...

Hello, my name is CBC (cont)
Cipher Block Chaining Mode Decryption
P0 = Decrypt(Key, C0) XOR IV
P1 = Decrypt(Key, C1) XOR C0
...

CBC Summary
Encrypt must be serial
Decrypt may be parallel
That doesn't help so much :-(
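
To make the serial/parallel asymmetry concrete, here is a minimal sketch of CBC chaining. It assumes a toy 16-byte Feistel cipher built from SHA-256 in place of a real block cipher such as AES. The toy cipher, the key, the all-zero IV, and every function name are illustrative only and offer no security.

```python
# CBC mode with a toy 16-byte Feistel block cipher built from SHA-256.
# The toy cipher is a stand-in for a real block cipher such as AES and
# is NOT secure; it only exists to make the chaining visible.
import hashlib

BLOCK = 16
HALF = BLOCK // 2
ROUNDS = 4


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def _round_fn(key, rnd, half):
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:HALF]


def encrypt_block(key, block):
    left, right = block[:HALF], block[HALF:]
    for rnd in range(ROUNDS):
        left, right = right, xor(left, _round_fn(key, rnd, right))
    return left + right


def decrypt_block(key, block):
    left, right = block[:HALF], block[HALF:]
    for rnd in reversed(range(ROUNDS)):
        left, right = xor(right, _round_fn(key, rnd, left)), left
    return left + right


def cbc_encrypt(key, iv, blocks):
    """Serial: block i needs ciphertext i-1 before it can start."""
    out, prev = [], iv
    for p in blocks:
        prev = encrypt_block(key, xor(p, prev))
        out.append(prev)
    return out


def cbc_decrypt_block(key, iv, cblocks, i):
    """Independent: block i needs only ciphertexts i-1 and i."""
    prev = iv if i == 0 else cblocks[i - 1]
    return xor(decrypt_block(key, cblocks[i]), prev)


key = b"demo key"
iv = bytes(BLOCK)  # all-zero IV, for the demo only
plain = [b"Hello, my name i", b"s CBC, block two", b"and block three!"]
assert all(len(b) == BLOCK for b in plain)

cipher = cbc_encrypt(key, iv, plain)
# Decrypt in reverse order to show that no block waits on another.
recovered = {i: cbc_decrypt_block(key, iv, cipher, i)
             for i in reversed(range(len(cipher)))}
print([recovered[i] for i in range(len(cipher))] == plain)
```

Only decryption can be spread across cores. Each call to `cbc_decrypt_block` reads ciphertext that is already available, whereas the loop in `cbc_encrypt` cannot start block i until block i-1 is finished.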

Hello, my name is CTR
Counter Mode Encryption
C0 = P0 XOR Encrypt(Key, CTR)
C1 = P1 XOR Encrypt(Key, CTR + 1)
...

Hello, my name is CTR (cont)
Counter Mode Decryption
P0 = C0 XOR Encrypt(Key, CTR)
P1 = C1 XOR Encrypt(Key, CTR + 1)
...

CTR Summary
Encrypt may be parallel
Decrypt may be parallel
Keystream can be pregenerated
Let’s get to work…
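
A minimal sketch of why counter mode parallelizes: each keystream block depends only on the key and a counter value, never on the data. SHA-256 truncated to 16 bytes stands in for the block cipher here, an assumption made so the example needs only the standard library. The key, starting counter, and function names are likewise illustrative.

```python
# Counter mode with a stand-in block function. SHA-256 truncated to 16
# bytes replaces AES_Encrypt here (an assumption so the example needs
# only the standard library); it is not a real cipher.
import hashlib

BLOCK = 16


def keystream_block(key, counter):
    """Stand-in for Encrypt(Key, CTR): depends only on key and counter."""
    return hashlib.sha256(key + counter.to_bytes(16, "big")).digest()[:BLOCK]


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def ctr_crypt(key, start_ctr, blocks):
    """Encrypt and decrypt are the same operation: XOR with keystream."""
    return [xor(b, keystream_block(key, start_ctr + i))
            for i, b in enumerate(blocks)]


key = b"demo key"
start = 1000  # assumed initial counter value
plain = [b"Hello, my name i", b"s CTR, block two"]

# Keystream can be generated before any plaintext exists, in any order.
pregenerated = {i: keystream_block(key, start + i) for i in (1, 0)}
cipher = [xor(p, pregenerated[i]) for i, p in enumerate(plain)]

print(cipher == ctr_crypt(key, start, plain))
print(ctr_crypt(key, start, cipher) == plain)
```

Because the keystream is independent of the data, the expensive step can run ahead of the transfer and leave only an XOR on the critical path.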

Multi-threaded AES-CTR
Uses arbitrary number of cipher threads (and cores) to generate a single keystream.
Cipher threads pre-generate keystream, starting once a cipher context key and IV are known.
Leaves only keystream dequeue & XOR for encrypt/decrypt operations in main SSH thread.

Single Cipher Thread
Cipher Thread
  – AES_Encrypt(ctr)
  – Inc(ctr)
Keystream Q
Main Thread
  – read(disk)
  – Packetize
  – Compute MAC
  – XOR
  – write(net)

Multiple Cipher Threads
Ring of bounded queues
  – Each queue holds a portion of keystream
  – Each queue exclusively accessed
Queue counters are offset initially and on each fill
[Diagram: ring of queues in states DRAINING, FILLING, EMPTY; Main Thread, Cipher Thread 1, Cipher Thread 2]
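
The sketch below is one simplified way to picture the multi-threaded keystream design. It is not the HPN-SSH implementation. It assumes one bounded queue per cipher thread, where the slides describe a ring of queues with EMPTY/FILLING/DRAINING states, and it uses a SHA-256 stand-in for AES. Chunk size, thread count, and queue depth are arbitrary choices. The point it illustrates is the counter offsetting: each thread starts at its own offset and skips past the other threads' chunks after every fill, so the main thread sees one continuous keystream.

```python
# Multi-threaded keystream pregeneration, simplified: one bounded queue
# per cipher thread, drained round-robin by the main thread.
import hashlib
import queue
import threading

BLOCK = 16
CHUNK_BLOCKS = 4   # keystream blocks per queue entry (assumed)
NUM_THREADS = 2    # cipher threads (assumed)
QUEUE_DEPTH = 2    # bounded, so threads cannot run far ahead


def keystream_block(key, counter):
    # Stand-in for AES_Encrypt(ctr); not a real cipher.
    return hashlib.sha256(key + counter.to_bytes(16, "big")).digest()[:BLOCK]


def cipher_thread(key, start_ctr, index, out_q, stop):
    # Counter is offset by this thread's position in the ring, then
    # jumps past the chunks owned by the other threads after each fill.
    ctr = start_ctr + index * CHUNK_BLOCKS
    while not stop.is_set():
        chunk = b"".join(keystream_block(key, ctr + i)
                         for i in range(CHUNK_BLOCKS))
        while not stop.is_set():
            try:
                out_q.put(chunk, timeout=0.05)
                break
            except queue.Full:
                continue
        ctr += NUM_THREADS * CHUNK_BLOCKS


def encrypt(key, start_ctr, data):
    stop = threading.Event()
    queues = [queue.Queue(maxsize=QUEUE_DEPTH) for _ in range(NUM_THREADS)]
    threads = [
        threading.Thread(target=cipher_thread,
                         args=(key, start_ctr, i, q, stop), daemon=True)
        for i, q in enumerate(queues)
    ]
    for t in threads:
        t.start()
    out = bytearray()
    chunk_bytes = CHUNK_BLOCKS * BLOCK
    for n, off in enumerate(range(0, len(data), chunk_bytes)):
        ks = queues[n % NUM_THREADS].get()  # main thread: dequeue and XOR only
        piece = data[off:off + chunk_bytes]
        out += bytes(a ^ b for a, b in zip(piece, ks))
    stop.set()
    for t in threads:
        t.join()
    return bytes(out)


def encrypt_serial(key, start_ctr, data):
    """Single-threaded reference used to check the threaded version."""
    nblocks = -(-len(data) // BLOCK)
    ks = b"".join(keystream_block(key, start_ctr + i) for i in range(nblocks))
    return bytes(a ^ b for a, b in zip(data, ks))


key = b"demo key"
msg = b"bulk data " * 50
ct = encrypt(key, 7, msg)
print(ct == encrypt_serial(key, 7, msg))
print(encrypt(key, 7, ct) == msg)
```

In CPython the global interpreter lock keeps these threads from running the hash in parallel, so this shows the structure of the approach and not its speedup.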

M-T AES-CTR Results

Conclusion
SSH designed for security
  – HPN-SSH is a set of performance enhancements to the most common SSH implementation, OpenSSH
High throughput with high latency
  – Kernel auto-tuning adjusts TCP flow control
  – HPN-SSH RecvBufferPolling adjusts SSH flow control
High throughput with any latency
  – HPN-SSH None cipher for non-private data
  – HPN-SSH Multi-threaded AES-CTR cipher

Future Work
Approaching 10Gbps
Continued multi-threading
  – Concurrent packet processing/pipelining
Efficiency
Striped data transfers
Exotic architectures

Where to get it