1
Internetworking: Hardware/Software Interface
CS 213, LECTURE 16, L.N. Bhuyan
2
Protocols: HW/SW Interface
Internetworking: allows computers on independent and incompatible networks to communicate reliably and efficiently
Enabling technologies: SW standards that allow reliable communication without reliable networks
A hierarchy of SW layers, each layer responsible for a portion of the overall communication task, called protocol families or protocol suites
Transmission Control Protocol/Internet Protocol (TCP/IP)
  This protocol family is the basis of the Internet
  IP makes a best effort to deliver; TCP guarantees delivery
TCP/IP is used even when communicating locally:
  NFS uses IP even when communicating across a homogeneous LAN
  Workstation companies used TCP/IP even over the LAN, because early Ethernet controllers were cheap but not reliable
3
TCP/IP packet
Application sends a message
TCP breaks the message into 64 KB segments and adds a 20 B header
IP adds a 20 B header and sends the packet to the network
If Ethernet, the data is broken into 1500 B packets with headers and trailers
Headers and trailers carry a length field, destination, window number, version, ...
[Figure: packet encapsulation: an Ethernet frame carrying an IP header, a TCP header, and TCP data (≤ 64 KB)]
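As a rough illustration (not from the slides), a small C sketch that uses the 20 B TCP and 20 B IP headers and the 1500 B Ethernet MTU above to estimate how many frames one 64 KB segment becomes and how much header overhead that adds. The 18 B Ethernet header/trailer and giving every frame its own TCP/IP headers are simplifying assumptions:

```c
#include <stdio.h>

/* Illustrative figures from the slide: 20 B TCP header, 20 B IP header,
 * 1500 B Ethernet payload. The 18 B Ethernet header/trailer and per-frame
 * TCP/IP headers are simplifying assumptions, not from the slide. */
#define TCP_HDR 20
#define IP_HDR  20
#define MTU     1500   /* bytes of IP packet carried per Ethernet frame */
#define ETH_OVH 18     /* assumed: 14 B header + 4 B CRC trailer */

int main(void)
{
    long app_bytes = 64L * 1024;                    /* one 64 KB segment */
    long payload   = MTU - IP_HDR - TCP_HDR;        /* data per frame    */
    long frames    = (app_bytes + payload - 1) / payload;
    long wire      = frames * (MTU + ETH_OVH);

    printf("%ld frames, %ld bytes on the wire, %.1f%% overhead\n",
           frames, wire, 100.0 * (wire - app_bytes) / app_bytes);
    return 0;
}
```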
4
Communicating with the Server: The O/S Wall
Problems:
O/S overhead to move a packet between the network and the application level => the protocol stack (TCP/IP)
O/S interrupts
Data copying from kernel space to user space and vice versa
Oh, the PCI bottleneck!
[Figure: packet path from the NIC across the PCI bus, through the kernel, up to the user-level application on the CPU]
5
The Send/Receive Operation
The application writes the transmit data to the TCP/IP sockets interface for transmission, in payload sizes ranging from 4 KB to 64 KB.
The data is copied from user space to kernel space.
The OS segments the data into maximum transmission unit (MTU)–size packets, and then adds TCP/IP header information to each packet.
The OS copies the data onto the network interface card (NIC) send queue.
The NIC performs the direct memory access (DMA) transfer of each data packet from the TCP buffer space to the NIC, and interrupts CPU activities to indicate completion of the transfer.
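For reference, a minimal sketch of step 1 from the application's point of view: the send() call that hands the buffer to the sockets interface and triggers the user-to-kernel copy; the remaining steps happen inside the OS and the NIC. The destination address and port are placeholders:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    /* Placeholder destination (documentation address); replace as needed. */
    struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(5001) };
    inet_pton(AF_INET, "192.0.2.10", &dst.sin_addr);

    int fd = socket(AF_INET, SOCK_STREAM, 0);            /* TCP socket */
    if (fd < 0 || connect(fd, (struct sockaddr *)&dst, sizeof dst) < 0) {
        perror("socket/connect");
        return 1;
    }

    static char payload[64 * 1024];                       /* 64 KB app buffer */
    memset(payload, 'x', sizeof payload);

    /* send() is where the user-to-kernel copy happens; the OS then segments
     * the data to MTU size, adds TCP/IP headers, queues it on the NIC, and
     * the NIC DMAs each packet out (steps 2-5 above). */
    ssize_t n = send(fd, payload, sizeof payload, 0);
    printf("handed %zd bytes to the kernel\n", n);

    close(fd);
    return 0;
}
```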
6
Transmitting data across the memory bus using a standard NIC
7
Timing Measurement in UDP Communication
X. Zhang, L. Bhuyan and W. Feng, "Anatomy of UDP and M-VIA for Cluster Communication," JPDC, October 2005
8
I/O Acceleration Techniques
TCP Offload: offload TCP/IP checksum and segmentation to interface hardware or a programmable device (e.g., TCP Offload Engines, TOEs). A TOE-enabled NIC using Remote Direct Memory Access (RDMA) can use zero-copy algorithms to place data directly into application buffers.
O/S Bypass: user-level software techniques to bypass the protocol stack, e.g., a zero-copy protocol (needs a programmable device in the NIC for direct user-level memory access via virtual-to-physical memory mapping; ex. VIA).
Architectural Techniques: instruction set optimization, multithreading, copy engines, onloading, prefetching, etc.
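To make the checksum-offload point concrete, here is a minimal sketch of the 16-bit one's-complement Internet checksum (RFC 1071 style) used by TCP and IP; on a checksum-offloading NIC or TOE this loop runs in the adapter instead of on the host CPU:

```c
#include <stdint.h>
#include <stddef.h>

/* Internet checksum (RFC 1071 style): 16-bit one's-complement sum.
 * With checksum offload, the NIC computes this over each outgoing and
 * incoming packet, so the host CPU never walks the payload just to
 * checksum it. */
uint16_t inet_checksum(const void *data, size_t len)
{
    const uint16_t *p = data;
    uint32_t sum = 0;

    while (len > 1) {                 /* sum 16-bit words */
        sum += *p++;
        len -= 2;
    }
    if (len == 1)                     /* trailing odd byte */
        sum += *(const uint8_t *)p;

    while (sum >> 16)                 /* fold carries back into 16 bits */
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```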
9
Comparing standard TCP/IP and TOE-enabled TCP/IP stacks
10
Chelsio 10 Gbps TOE
11
Cluster (Network) of Workstations/PCs
12
Myrinet Interface Card
13
InfiniBand Interconnection
Zero-copy mechanism: enables a user-level application to perform I/O on the InfiniBand fabric without being required to copy data between user space and kernel space.
RDMA: facilitates transferring data from remote memory to local memory without the involvement of host CPUs.
Reliable transport services: the InfiniBand architecture implements reliable transport services, so the host CPU is not involved in protocol-processing tasks such as segmentation, reassembly, and NACK/ACK.
Virtual lanes: the InfiniBand architecture provides 16 virtual lanes (VLs) to multiplex independent data lanes onto the same physical lane, including a dedicated VL for management operations.
High link speeds: the InfiniBand architecture defines three link speeds, characterized as 1X, 4X, and 12X, yielding data rates of 2.5 Gbps, 10 Gbps, and 30 Gbps, respectively.
Reprinted from Dell Power Solutions (October), by Onur Celebioglu, Ramesh Rajagopalan, and Rizwan Ali.
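As an illustration of the zero-copy point, a sketch (assuming libibverbs is installed and an HCA is present) of registering an ordinary user buffer with the adapter so it can be the direct source or target of DMA/RDMA with no kernel-space copy; queue-pair setup and the actual RDMA operations are omitted:

```c
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no InfiniBand devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register (pin) 1 MB of ordinary user memory with the HCA. From here
     * on, the adapter can DMA into/out of buf directly, and a remote peer
     * holding rkey can RDMA to it -- no copy through kernel space. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    printf("registered %zu bytes: lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```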
14
InfiniBand system fabric
15
UDP Communication – Life of a Packet
X. Zhang, L. Bhuyan and W. Feng, "Anatomy of UDP and M-VIA for Cluster Communication," Journal of Parallel and Distributed Computing (JPDC), Special Issue on Design and Performance of Networks for Super-, Cluster-, and Grid-Computing, Vol. 65, Issue 10, October 2005
16
Timing Measurement in UDP Communication
X. Zhang, L. Bhuyan and W. Feng, "Anatomy of UDP and M-VIA for Cluster Communication," JPDC, October 2005
17
Network Bandwidth is Increasing
TCP requirements rule of thumb: 1 GHz of CPU for 1 Gbps of network bandwidth
Network bandwidth outpaces Moore's Law: the gap between the rate at which processors can run network applications and the fast-growing network bandwidth keeps increasing
[Figure: network bandwidth (Gbps) vs. processor frequency (Moore's Law, GHz) on a log scale, 1990-2010]
18
Profile of a Packet
[Figure: breakdown of per-packet processing time: system overheads, descriptor & header accesses, IP processing, computes, TCB accesses, TCP processing, memory copy]
Total average clocks per packet: ~21K; effective bandwidth: 0.6 Gb/s (1 KB receive)
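To connect this to the 1 GHz per 1 Gbps rule of thumb: a 1 KB packet is 8192 bits, so 0.6 Gb/s is roughly 73K packets per second; at ~21K clocks per packet that consumes about 1.5 GHz of CPU. This back-of-the-envelope assumes a single CPU doing nothing but packet processing.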
19
Five Emerging Technologies
Optimized Network Protocol Stack (ISSS+CODES, 2003)
Cache Optimization (ISSS+CODES, 2003; ANCHOR, 2004)
Network Stack Affinity Scheduling
Direct Cache Access
Lightweight Threading
Memory Copy Engine (ICCD 2005 and IEEE TC)
20
Stack Optimizations (Instruction Count)
Separate data & control paths:
  TCP data-path focused
  Reduce # of conditionals
  NIC assist logic (L3/L4 stateless logic)
Basic memory optimizations:
  Cache-line aware data structures
  SW prefetches
Optimized computation:
  Standard compiler capability
Result: 3X reduction in instructions per packet
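A generic C sketch (not the authors' code) of two of the memory optimizations above: a cache-line aligned descriptor layout and a software prefetch issued one descriptor ahead of use:

```c
#include <stddef.h>
#include <stdint.h>

/* Cache-line aware data structure: each descriptor occupies its own
 * 64-byte line, so adjacent descriptors never share (and false-share)
 * a cache line. */
struct rx_desc {
    uint64_t addr;       /* buffer DMA address  */
    uint32_t len;        /* packet length       */
    uint32_t status;     /* done / error bits   */
} __attribute__((aligned(64)));

void process_ring(struct rx_desc *ring, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Software prefetch: pull the next descriptor toward the cache
         * while the current packet is still being processed. */
        if (i + 1 < n)
            __builtin_prefetch(&ring[i + 1], 0 /* read */, 3 /* keep cached */);

        /* ... per-packet TCP/IP processing on ring[i] would go here ... */
        (void)ring[i].status;
    }
}
```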
21
Network Stack Affinity
Assigns network I/O workloads to designated devices
Separates network I/O from application work
Reduces scheduling overheads
More efficient cache utilization
Increases pipeline efficiency
[Figure: multi-core CPU with one core dedicated to network I/O, connected through the chipset to memory and the I/O interface]
Intel calls this onloading
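One software-visible form of this idea, as a minimal Linux sketch: pinning the thread that runs the protocol-processing loop to a dedicated core with pthread_setaffinity_np, so network work stays off the application cores and keeps its cache state warm. Core 3 is an arbitrary example choice:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a core reserved for network I/O. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

int main(void)
{
    int rc = pin_to_core(3);                  /* core 3: arbitrary example */
    if (rc != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
        return 1;
    }
    puts("protocol-processing thread pinned to core 3");
    /* ... the network-stack processing loop would run here ... */
    return 0;
}
```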
22
Direct Cache Access (DCA)
Normal DMA writes: (1) the NIC DMA-writes the packet toward memory via the memory controller, (2) the write snoops and invalidates the CPU cache, (3) the data is written to memory, (4) the CPU read then misses and fetches the packet from memory.
Direct Cache Access: (1) the NIC DMA-writes the packet, (2) the cache is updated directly, (3) the CPU reads the packet from the cache.
DCA eliminates 3 to 25 memory accesses by placing packet data directly into the cache.
23
Lightweight Threading
Builds on helper threads; reduces CPU stalls
On a memory informing event (e.g., a cache miss), a thread manager switches between S/W-controlled threads sharing a single hardware context, so the single-core pipeline continues computing in the shadow of the cache miss
[Figure: thread manager multiplexing S/W-controlled threads 1 and 2 onto one execution pipeline (single hardware context, single core)]
24
Potential Efficiencies (10X)
[Figure: benefits of affinity and benefits of the architectural techniques]
Greg Regnier, et al., "TCP Onloading for Data Center Servers," IEEE Computer, Vol. 37, Nov. 2004
On the CPU, multi-gigabit, line-speed network I/O is possible
25
I/O Acceleration – Problem Magnitude
[Figure: workload classes (security services, storage over IP, networking) and their dominant operations: memory copies & effects of streaming, CRCs, crypto, parsing, tree construction]
I/O processing rates are significantly limited by the CPU in the face of data movement and transformation operations