1
The ‘zero-copy’ initiative: a look at the ‘zero-copy’ concept and an x86 Linux implementation for the case of outgoing packets
2
From Wikipedia, the free encyclopedia: Zero-copy is an adjective that refers to computer operations in which the CPU does not perform the task of copying data from one area of memory to another. The availability of zero-copy versions of operating system elements such as device drivers, file systems and network protocol stacks greatly increases the performance of many applications, since using a CPU that is capable of complex operations just to make copies of data can be a great waste of resources. Zero-copy also reduces the number of context-switches from User space to Kernel space and vice-versa. Several OS like Linux support zero copying of files through specific API's like sendfile, sendfile64, etc. Techniques for creating zero-copy software include the use of DMA-based copying, and memory-mapping through an MMU. These features require specific hardware support and usually involve particular memory alignment requirements. Zero-copy protocols are especially important for high-speed networks, as memory copies would cause a serious workload for the host cpu. Still, such protocols have some initial overhead so that avoiding programmed IO (PIO) there only makes sense for large messages.
3
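Application source-code

#include <stdio.h>      // for printf(), perror()
#include <stdlib.h>     // for exit()
#include <string.h>     // for strlen()
#include <fcntl.h>      // for open(), O_RDWR
#include <unistd.h>     // for write()

char message[] = "This is a test of network-packet transmission \n";

int main( void )
{
    // open the network-interface device-file
    int fd = open( "/dev/nic", O_RDWR );
    if ( fd < 0 ) { perror( "/dev/nic" ); exit(1); }

    // hand the message to the driver for transmission
    int msglen = strlen( message );
    int nbytes = write( fd, message, msglen );
    if ( nbytes < 0 ) { perror( "write" ); exit(1); }

    printf( "Transmitted %d bytes \n", nbytes );
}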
4
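Transmit operation
[Diagram] In user space, the application program hands its user data-buffer to the runtime library’s write(). In kernel space, the Linux file subsystem calls the nic device-driver’s my_write(), which uses copy_from_user() to copy the data into a kernel packet buffer. The hardware then fetches that packet buffer using DMA.
We want to eliminate this copying-operation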
5
Our driver’s packet-layout
packet-buffer in kernel-space: destn-address | source-address | TYPE/LENGTH | count | -- data --
Format for Legacy Transmit-Descriptor (16 bytes): base-address (64-bits) | Packet-length | CSO | cmd | status | CSS | special
6
Can zero-copy be transparent? We would like to implement the zero-copy concept in our ‘nic2.c’ character driver in such a manner that no changes would be required to an ‘application’ program’s code. We will show how to do this for ‘outgoing’ packets (i.e., by modifying ‘my_write()’), but achieving zero-copy with ‘incoming’ packets would be a lot more complicated!
7
TX Descriptor’s CMD byte
Command-Byte Format (bit 7 down to bit 0): IDE | VLE | 0 | 0 | RS | IC | IFCS | EOP
EOP = End-Of-Packet (1=yes, 0=no)
RS = Report Status (1=yes, 0=no)
VLE = VLAN-tag Enable
Key question: What will the NIC do if we don’t set the EOP-bit in a TX Descriptor?
8
Splitting our packet-layout
packet-buffer in kernel-space: destn-address | source-address | TYPE/LENGTH | count | -- data --
The first HDR bytes of the buffer hold the packet-header, and the remaining LEN bytes hold the data.
Format for Legacy Transmit-Descriptor Pair:
descriptor 1: base-address (64-bits) | Packet-Length (=HDR) | CSO | cmd (EOP=0) | status | CSS | special
descriptor 2: base-address (64-bits) | Packet-Length (=LEN) | CSO | cmd (EOP=1) | status | CSS | special
9
Splitting our packet-buffer
packet-buffer in kernel-space (HDR bytes): destn-address | source-address | TYPE/LENGTH | count
packet-buffer in user-space (LEN bytes): -- data --
Format for Legacy Transmit-Descriptor Pair:
descriptor 1: base-address (64-bits) | Packet-Length (=HDR) | CSO | cmd (EOP=0) | status | CSS | special
descriptor 2: base-address (64-bits) | Packet-Length (=LEN) | CSO | cmd (EOP=1) | status | CSS | special
Two physical packet-buffers comprise one logical packet that gets transmitted!
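The descriptor pair can be sketched in C roughly as follows. This is an illustration rather than the driver's actual code: the struct name `tx_desc`, the `CMD_*` constants and the helper `fill_split_packet()` are hypothetical names, and setting RS on the final descriptor is an assumption. The field order follows the 16-byte legacy descriptor, and the bit positions follow the command-byte format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Legacy transmit descriptor (16 bytes); struct name is hypothetical. */
struct tx_desc {
    uint64_t base_address;  /* physical address of this buffer   */
    uint16_t length;        /* number of bytes in this buffer    */
    uint8_t  cso;           /* checksum offset                   */
    uint8_t  cmd;           /* command byte                      */
    uint8_t  status;        /* written back by the NIC           */
    uint8_t  css;           /* checksum start                    */
    uint16_t special;
};

/* Command-byte bits used here; constant names are hypothetical. */
enum {
    CMD_EOP = 1 << 0,       /* End-Of-Packet  */
    CMD_RS  = 1 << 3        /* Report Status  */
};

/* Describe one logical packet as two physical buffers:
 * pair[0] -> header buffer (EOP=0), pair[1] -> data buffer (EOP=1). */
static void fill_split_packet(struct tx_desc pair[2],
                              uint64_t hdr_phys,  uint16_t hdr_len,
                              uint64_t data_phys, uint16_t data_len)
{
    assert(sizeof(struct tx_desc) == 16);

    memset(pair, 0, 2 * sizeof pair[0]);

    pair[0].base_address = hdr_phys;
    pair[0].length       = hdr_len;
    pair[0].cmd          = 0;                 /* EOP=0: packet continues */

    pair[1].base_address = data_phys;
    pair[1].length       = data_len;
    pair[1].cmd          = CMD_EOP | CMD_RS;  /* EOP=1: packet ends here
                                                 (RS is an assumption)   */
}
```

Only the second descriptor carries EOP=1, so the NIC treats both buffers as parts of a single packet.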
10
Transmitting a ‘split-packet’
[Diagram] The Application-program’s packet-data buffer lies in User-space and the Device-driver module’s packet-header buffer lies in Kernel-space; the NIC hardware fetches both buffers by DMA.
The 82573L controller ‘merges’ the contents of these separate buffers into just a single ethernet-packet
11
The ‘virt_to_phys()’ macro Linux provides a convenient macro which kernel-module code can employ to obtain the physical-address for a memory-region from its virtual-address, but it only works for addresses that aren’t in ‘high’ memory. For ‘normal’ memory-regions, conversion between ‘virtual’ and ‘physical’ addresses amounts to a simple addition/subtraction.
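The ‘simple addition/subtraction’ can be illustrated outside the kernel with ordinary integer arithmetic. This sketch assumes the common 32-bit x86 configuration in which kernel virtual addresses begin at 0xC0000000 and 896 MB of RAM is directly mapped; the names `PAGE_OFFSET_32`, `LOWMEM_LIMIT` and both helper functions are hypothetical, not the kernel’s own definitions.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET_32  0xC0000000u   /* assumed start of kernel space */
#define LOWMEM_LIMIT    (896u << 20)  /* assumed size of direct mapping */

/* 'normal' memory only: physical = virtual - PAGE_OFFSET */
static uint32_t lowmem_virt_to_phys(uint32_t vaddr)
{
    assert(vaddr >= PAGE_OFFSET_32);
    assert(vaddr - PAGE_OFFSET_32 < LOWMEM_LIMIT);
    return vaddr - PAGE_OFFSET_32;
}

/* 'normal' memory only: virtual = physical + PAGE_OFFSET */
static uint32_t lowmem_phys_to_virt(uint32_t paddr)
{
    assert(paddr < LOWMEM_LIMIT);
    return paddr + PAGE_OFFSET_32;
}
```

A user-space buffer address is below the assumed kernel range, so this arithmetic does not apply to it; that is why a page-table walk is needed for the application’s buffer.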
12
Linux memory-mapping
[Diagram] The CPU’s virtual address-space is divided into user space and kernel space. The first 896-MB of physical RAM has a persistent mapping into kernel space; physical RAM above that (HMA) is reached only through transient mappings.
There is more physical RAM in our classroom’s systems than can be ‘mapped’ into the available address-range for kernel virtual addresses
13
Two-Level Translation Scheme
[Diagram] CR3 locates the PAGE DIRECTORY, whose entries locate PAGE TABLES, whose entries locate PAGE FRAMES
14
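Linear to Physical
[Diagram] A linear address consists of a dir-index, a table-index and an offset. CR3 locates the page directory; the dir-index selects the page table; the table-index selects the page frame; the offset selects a location within that page frame in the physical address-space.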
15
Address-translation
The CPU examines any virtual address it encounters, subdividing it into three fields:
bits 31-22 (10-bits): index into page-directory. This field selects one of the 1024 array-entries in the Page-Directory
bits 21-12 (10-bits): index into page-table. This field selects one of the 1024 array-entries in that Page-Table
bits 11-0 (12-bits): offset into page-frame. This field provides the offset to one of the 4096 bytes in that Page-Frame
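As a small worked example, the three fields can be extracted from a 32-bit address with shifts and masks. The struct and function names here are hypothetical, and the sample address in the note below is arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields of a 32-bit virtual address (names hypothetical). */
struct vaddr_fields {
    uint32_t dindex;   /* bits 31-22: index into page-directory */
    uint32_t pindex;   /* bits 21-12: index into page-table     */
    uint32_t offset;   /* bits 11-0 : offset into page-frame    */
};

static struct vaddr_fields split_vaddr(uint32_t vaddr)
{
    struct vaddr_fields f;

    f.dindex = (vaddr >> 22) & 0x3FF;
    f.pindex = (vaddr >> 12) & 0x3FF;
    f.offset = vaddr & 0xFFF;

    /* the three fields together account for all 32 bits */
    assert(((f.dindex << 22) | (f.pindex << 12) | f.offset) == vaddr);
    return f;
}
```

For the address 0x08049F34 this yields dindex = 32, pindex = 73 (0x049) and offset = 0xF34.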
16
Format of a Page-Table entry
bits 31-12: PAGE-FRAME BASE ADDRESS
bits 11-9: AVAIL
bits 8-7: 0
bit 6: D
bit 5: A
bit 4: PCD
bit 3: PWT
bit 2: U
bit 1: W
bit 0: P
LEGEND
P = Present (1=yes, 0=no)
W = Writable (1=yes, 0=no)
U = User (1=yes, 0=no)
A = Accessed (1=yes, 0=no)
D = Dirty (1=yes, 0=no)
PWT = Page Write-Through (1=yes, 0=no)
PCD = Page Cache-Disable (1=yes, 0=no)
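The entry format translates directly into bit masks. The following is a minimal sketch with hypothetical constant and function names; it only decodes a 32-bit value and does not touch any real page table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag bits of a 32-bit page-table entry (constant names hypothetical). */
enum {
    PTE_P   = 1 << 0,   /* Present            */
    PTE_W   = 1 << 1,   /* Writable           */
    PTE_U   = 1 << 2,   /* User               */
    PTE_PWT = 1 << 3,   /* Page Write-Through */
    PTE_PCD = 1 << 4,   /* Page Cache-Disable */
    PTE_A   = 1 << 5,   /* Accessed           */
    PTE_D   = 1 << 6    /* Dirty              */
};

/* Page-Frame Number: bits 31-12; only meaningful when P=1. */
static uint32_t pte_pfn(uint32_t pte)
{
    assert(pte & PTE_P);
    return pte >> 12;
}

/* Physical base address of the page-frame (low 12 bits cleared). */
static uint32_t pte_frame_base(uint32_t pte)
{
    return pte & 0xFFFFF000u;
}

/* True if the page is present, writable and accessible from user mode. */
static bool pte_user_writable(uint32_t pte)
{
    uint32_t need = PTE_P | PTE_W | PTE_U;
    return (pte & need) == need;
}
```

For example, the entry value 0x12345067 has P, W, U, A and D set and refers to page-frame 0x12345.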
17
Finding the user-buffer’s PFN To program the ‘base-address’ field in the second TX-Descriptor, our driver’s ‘write()’ function will need to know which physical Page-Frame the application’s buffer lies in. Its PFN (Page-Frame Number) can be found from its virtual address by ‘walking-the-cpu-page-tables’, even when Linux puts some page-tables in ‘high’ memory.
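The walk itself can be rehearsed in user space on a simulated pair of tables. Everything in this sketch is hypothetical (the `sim_mmu` struct, its single page table, the `sim_walk()` helper); it mimics the two lookups but reads ordinary arrays rather than the CPU’s real paging structures, and it adds a Present-bit check on each entry.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_ENTRIES  1024
#define SIM_PRESENT  0x1u

/* A toy address space: one page directory and one page table. */
struct sim_mmu {
    uint32_t pgdir[SIM_ENTRIES];  /* page-directory entries            */
    uint32_t pgtbl_pfn;           /* frame number where pgtbl 'lives'  */
    uint32_t pgtbl[SIM_ENTRIES];  /* the only page table we simulate   */
};

/* Translate vaddr; returns 0 on success, -1 if a level is not present. */
static int sim_walk(const struct sim_mmu *m, uint32_t vaddr, uint32_t *paddr)
{
    assert(m != 0 && paddr != 0);

    uint32_t dindex = (vaddr >> 22) & 0x3FF;
    uint32_t pindex = (vaddr >> 12) & 0x3FF;
    uint32_t offset = vaddr & 0xFFF;

    /* level 1: directory entry gives the page table's frame number */
    uint32_t pde = m->pgdir[dindex];
    if (!(pde & SIM_PRESENT) || (pde >> 12) != m->pgtbl_pfn)
        return -1;

    /* level 2: table entry gives the data page's frame number */
    uint32_t pte = m->pgtbl[pindex];
    if (!(pte & SIM_PRESENT))
        return -1;

    *paddr = ((pte >> 12) << 12) + offset;
    return 0;
}
```

In a real driver the second-level table may sit in ‘high’ memory, so it has to be mapped temporarily before it can be read; the simulation skips that step.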
18
Performing ‘virt_to_phys()’

ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
{
    unsigned int  _cr3, *pgdir, *pgtbl, pfn_pgtbl, pfn_frame;
    unsigned int  dindex, pindex, offset;

    // take apart the virtual-address of the user's 'buf' variable
    dindex = ((unsigned long)buf >> 22) & 0x3FF;  // pgdir-index (10-bits)
    pindex = ((unsigned long)buf >> 12) & 0x3FF;  // pgtbl-index (10-bits)
    offset = ((unsigned long)buf >>  0) & 0xFFF;  // frame-offset (12-bits)

    // then walk the CPU's paging-tables to get buf's physical-address
    // (register names need '%%' when the asm statement has operands)
    asm( " mov %%cr3, %%eax \n mov %%eax, %0 " : "=m"(_cr3) : : "ax" );
    pgdir = (unsigned int *)phys_to_virt( _cr3 & ~0xFFF );
    pfn_pgtbl = (pgdir[ dindex ] >> 12);
    // the page-table may lie in 'high' memory, so map it temporarily
    pgtbl = (unsigned int *)kmap( &mem_map[ pfn_pgtbl ] );
    pfn_frame = (pgtbl[ pindex ] >> 12);
    kunmap( &mem_map[ pfn_pgtbl ] );
    txring[ txtail + 1 ].base_address = (pfn_frame << 12) + offset;
    // ...
19
Can’t cross a ‘page-boundary’ In order for the NIC to fetch the user’s data using its Bus-Master DMA capability, the buffer needs to reside in a physically contiguous memory-region. But we can’t be sure Linux will have set up the CPU’s page-tables that way, unless the ‘buf’ is confined to a single page-frame.
20
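Truncate ‘len’ if necessary

ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
{
    // keep the transfer inside the page-frame where 'buf' begins
    if ( offset + len > PAGE_SIZE ) len = PAGE_SIZE - offset;
    // ...

[Diagram] ‘buf’ begins ‘offset’ bytes into its page-frame, and ‘len’ must not extend past PAGE_SIZE.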
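The truncation rule can be checked in isolation with a small helper. The function name `clamp_to_page()` and the constant `PAGE_SIZE_4K` are hypothetical, and a 4096-byte page is assumed; the helper returns how many bytes, starting at `buf`, fit before the next page-boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_4K  4096u   /* assumed page size */

/* Largest length <= len that stays within the page containing buf. */
static size_t clamp_to_page(uintptr_t buf, size_t len)
{
    size_t offset = buf & (PAGE_SIZE_4K - 1);  /* position within page */
    size_t room   = PAGE_SIZE_4K - offset;     /* bytes left in page   */

    assert(room >= 1 && room <= PAGE_SIZE_4K);
    return (len > room) ? room : len;
}
```

A buffer at 0x08049FF0 has only 16 bytes left in its page, so a 48-byte request is cut to 16. Because write() reports a byte count, the caller would see a short write and can resubmit the remainder.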
21
‘zerocopy.c’ We created this modification of our ‘nic2.c’ device-driver so its ‘my_write()’ function lets an application perform transmissions without performing a memory-to-memory copy-operation (i.e., ‘copy_from_user()’). It is not so easy to implement ‘zero-copy’ for receiving packets. Can you say why?
22
Website article We’ve posted a link on our CS686 website to a frequently cited research-article about the various issues that arise when trying to implement the ‘zero-copy’ concept for the case of ‘incoming’ network-packets: The Need for Asynchronous, Zero-Copy Network I/O, by Ulrich Drepper, Red Hat, Inc.