
1 Architectural Characterization of TCP/IP Processing on the Intel® Pentium® M Processor
Srihari Makineni & Ravi Iyer, Communications Technology Lab, Intel® Corp. HPCA-10

2 Outline
Motivation
Overview of TCP/IP
Setup and Configuration
TCP/IP Performance Characteristics: Throughput and CPU Utilization
Architectural Characterization
TCP/IP in Server Workloads
Ongoing Work

3 Motivation
Why TCP/IP?
TCP/IP is the protocol of choice for data communications
What is the problem?
So far, system capabilities have allowed TCP/IP to process data at Ethernet speeds
But Ethernet speeds are jumping rapidly (1 to 10 Gbps)
Efficient processing is required to scale to these speeds
Why architectural characterization?
Analyze performance characteristics and identify processor architectural features that impact TCP/IP processing
Focus is not just on TCP/IP protocol processing but includes the entire stack

4 TCP/IP Overview: Transmit
[Diagram: the application buffer in user space is handed through the sockets interface to the TCP/IP stack, which looks up the TCB, splits the data into Tx buffers, and prepends TCP, IP, and ETH headers; the driver fills descriptors and the NIC DMAs the resulting Ethernet packets onto the wire]
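From the application's point of view, the entire transmit path in the diagram sits beneath a single send call. A minimal sketch (host and port are hypothetical; the kernel-side steps in the slide, such as TCB lookup, segmentation, descriptor setup, and NIC DMA, are invisible to this code):

```python
import socket

BUF_SIZE = 32 * 1024  # 32 KB application buffer, as in the measurements

def transmit(host="192.0.2.1", port=5001, total=BUF_SIZE):
    """Send `total` bytes; returns the number of bytes sent."""
    payload = b"x" * total
    with socket.create_connection((host, port)) as sock:
        sent = 0
        while sent < total:
            # send() copies from the user buffer into kernel socket buffers;
            # the stack then builds TCP/IP/ETH headers and hands descriptors
            # to the NIC driver for DMA.
            sent += sock.send(payload[sent:])
    return sent
```

Every byte crosses the user/kernel boundary in that copy, which is one reason the later slides focus on per-buffer copy and per-packet header costs.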

5 TCP/IP Overview: Receive
[Diagram: the NIC DMAs the incoming Ethernet packet into a driver buffer via a descriptor; the TCP/IP stack strips the ETH, IP, and TCP headers, matches the TCB, queues the payload in Rx buffers, and signals/copies the data through the sockets interface into the application buffer in user space]
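The receive side looks equally simple from user space; a sketch of the application half (the NIC DMA, header stripping, and TCB matching shown in the diagram all happen before recv returns):

```python
import socket

def drain(conn, expected):
    """Read `expected` bytes from a connected socket."""
    data = bytearray()
    while len(data) < expected:
        # recv() performs the kernel-to-user copy of the payload that was
        # DMAed into driver buffers and reassembled by the stack.
        chunk = conn.recv(65536)
        if not chunk:
            break
        data += chunk
    return bytes(data)

def receive_one(port=5001, expected=32 * 1024):
    # hypothetical port; accepts a single connection and drains it
    with socket.socket() as srv:
        srv.bind(("127.0.0.1", port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            return drain(conn, expected)
```

That per-payload copy into the application buffer is the "extra memory copy" blamed for Rx cache misses later in the talk.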

6 Setup and Configuration
Test setup
System Under Test (SUT): Intel® Pentium® M, 1600 MHz, 1 MB (64 B line) L2 cache
2 clients: four-way Itanium® 2, 1 GHz, 3 MB (128 B line) L3 cache
Operating System: Microsoft Windows* 2003 Enterprise Edition
Network: SUT – 4 Gbps total (2 dual-port Gigabit NICs); clients – 2 Gbps per client (1 dual-port Gigabit NIC)

7 Setup and Configuration
Tools
NTttcp: Microsoft application to measure TCP/IP performance
Tool to extract CPU performance counters
Settings
16 connections (4 per NIC port)
Overlapped I/O
Large Segment Offload (LSO)
Regular Ethernet frames (1518 bytes)
Checksum offload to NIC
Interrupt coalescing

8 Throughput and CPU Utilization
For a 32 KB buffer: Tx (LSO) throughput at 42% CPU; Rx throughput at % CPU
LSO kicks in for >1460-byte buffers
We observed a peak Tx throughput of 7.5 Gbps at 80% CPU for a 64 KB buffer
Rx throughput increases by 20% for a 64 KB buffer with 1 dual-port Gigabit NIC
Lower Rx performance for >512-byte buffer sizes
Rx and Tx (no LSO) CPU utilization is 100%
The benefit of LSO is significant (~250% for a 64 KB buffer)
Lower throughput for <1 KB buffers is due to buffer locking
TCP/IP at 1 Gbps with 1460-byte buffers requires more than one CPU

9 Processing Efficiency
Hz/bit
64-byte buffer: Tx (LSO) – and Rx – 13.7
64 KB buffer: Tx (LSO) – 0.212, Tx (no LSO) – 0.53, and Rx – 1.12
Several cycles are needed to move a bit, especially for Rx
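The Hz/bit metric follows directly from core frequency, CPU utilization, and achieved throughput. A one-line model (the numbers below are illustrative round figures, not the measured data):

```python
def hz_per_bit(cpu_hz, utilization, throughput_bps):
    """CPU cycles consumed per bit moved: (frequency x busy fraction) / bps."""
    return cpu_hz * utilization / throughput_bps

# Example: a 1.6 GHz core that is 100% busy while moving 1 Gbps
# spends 1.6 cycles on every bit transferred.
```

Small buffers score badly on this metric because per-packet and per-call overheads dominate while very few bits move.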

10 Architectural Characterization
CPI
Rx CPI is higher than Tx CPI for >512-byte buffers
Tx (LSO) CPI is higher than Tx (no LSO) CPI!
CPI needs to come down to achieve TCP/IP scaling

11 Architectural Characterization
Pathlength
Rx pathlength increases significantly beyond 1460-byte buffer sizes
For 64 KB, the TCP/IP stack has to receive and process 45 packets
The lower CPI for Tx (no LSO) versus Tx (LSO) is due to higher pathlength
High pathlength shows that there is room for stack optimizations
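The 45-packet figure follows from the standard Ethernet MSS: a 1518-byte frame carries at most 1460 bytes of TCP payload, so a 64 KB buffer is segmented (by LSO hardware on Tx, or arrives that way on Rx) into ceil(65536 / 1460) packets:

```python
import math

MSS = 1460  # TCP payload of a regular 1518-byte Ethernet frame

def packets_for(buffer_bytes, mss=MSS):
    """Number of full-size Ethernet packets needed to carry a buffer."""
    return math.ceil(buffer_bytes / mss)

# packets_for(64 * 1024) -> 45, matching the slide
```

Since per-packet protocol work repeats 45 times per 64 KB Rx buffer, pathlength grows roughly linearly with buffer size once buffers exceed one MSS.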

12 Architectural Characterization
Last-Level Cache Performance
Extra memory copy
Protocol processing for each incoming packet (64 full-size packets for a 64 KB buffer)
Rx has higher misses: the primary reason for its higher CPI
Lots of compulsory misses: source buffer, descriptors, and possibly the destination buffer
Tx (no LSO) has slightly higher misses per bit
Rx performance does not scale with cache size (many compulsory misses)
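A back-of-the-envelope model (an illustration, not the paper's methodology) of why a bigger cache does not help Rx: the payload copy touches every cache line of both buffers exactly once, so those misses are compulsory (first-touch) and no cache capacity can avoid them:

```python
LINE_BYTES = 64  # L2 line size of the Pentium M under test

def copy_compulsory_misses(buffer_bytes, line=LINE_BYTES):
    """First-touch misses if neither the source nor the destination of the
    Rx payload copy is cache-resident: one miss per line, per buffer."""
    lines = -(-buffer_bytes // line)  # ceiling division: lines per buffer
    return 2 * lines                  # source buffer + destination buffer

# A 64 KB buffer spans 1024 lines, so the copy alone implies ~2048 misses.
```

Descriptor and header accesses add further misses on top of this floor, but the copy already sets a lower bound that capacity cannot remove.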

13 Architectural Characterization
L1 Data Cache Performance
32 KB data cache in the Pentium® M processor
As expected, L1 data cache misses are higher for Rx
For Rx, 68% to 88% of L1 misses resulted in L2 hits
A larger L1 data cache has limited impact on TCP/IP

14 Architectural Characterization
L1 Instruction Cache Performance
32 KB instruction cache in the Pentium® M processor
Tx (no LSO) MPI is lower because of code temporal locality
The Rx code path generates L1 instruction capacity misses
A larger L1 instruction cache helps Rx processing

15 Architectural Characterization
TLB Performance
Size: 128 instruction and 128 data TLB entries
iTLB misses increase faster than dTLB misses

16 Architectural Characterization
Branch Behavior
19-21% branch instructions
Misprediction rate is higher in Tx than in Rx for <512-byte buffer sizes
>98% branch prediction accuracy

17 Architectural Characterization
CPI Contributors
Rx is more memory intensive than Tx
Frequency Scaling
Poor frequency scaling due to memory latency overhead
Frequency scaling alone will not deliver a 10x gain
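Why frequency scaling alone falls short can be seen with a simple two-component time model (an assumption for illustration): only the compute portion of execution time shrinks with the clock, while memory stall time stays fixed:

```python
def freq_speedup(freq_scale, mem_fraction):
    """Overall speedup when compute time shrinks by `freq_scale` but the
    `mem_fraction` of baseline time spent in memory stalls does not."""
    return 1.0 / ((1.0 - mem_fraction) / freq_scale + mem_fraction)
```

For example, if 40% of baseline time is memory stalls, a 10x clock yields only about 2.2x overall, which is why the ongoing work targets memory latency directly.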

18 TCP/IP in Server Workloads
Web server
TCP/IP data path overhead is ~28%
Each op: KB file size transmitted (avg.); receive: <256 bytes per op
330 kbps for 4000 connections
Back-end (database server with iSCSI)
TCP/IP data path overhead is ~35%
Metric: transactions per minute (tpmC)
Multiple I/Os per transaction (depends on memory size)
Each I/O: 8 KB storage request
Assumes a 70/30 split between read and write I/O
Assumes storage over IP
Front-end (e-commerce server)
TCP/IP data path overhead is ~29%
Metric: Web Interactions per Second (WIPS)
The web server transmits an average of 25 KB per WIP
The web server receives an average of 1.6 KB per WIP
TCP/IP processing is significant in commercial server workloads

19 Conclusions
Major Observations
TCP/IP at 1 Gbps with 1460-byte buffers requires more than one CPU
CPI needs to come down to achieve TCP/IP scaling
High pathlength shows that there is room for stack optimizations
Rx performance does not scale with cache size (compulsory misses)
A larger L1 data cache has limited impact on TCP/IP
A larger L1 instruction cache helps Rx processing
>98% branch prediction accuracy
Frequency scaling alone will not deliver a 10x gain
TCP/IP processing is significant in commercial server workloads
Key Issues
Memory stall time overhead
Pathlength (OS overhead, etc.)

20 Ongoing Work
Investigating solutions to the memory latency overhead
Copy acceleration: low-cost synchronous/asynchronous copy engine
DCA: incoming data is pushed into the processor's cache instead of memory
Lightweight threads to hide memory access latency: switch-on-event threads with small context and low switching overhead
Smart caching: cache structures and policies for networking
Partitioning: optimized TCP/IP stack running on dedicated processor(s) or core(s)
Other studies: connection processing, bi-directional data, application interference
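The asynchronous copy engine idea can be caricatured in software: the CPU posts a copy descriptor and continues protocol processing while something else performs the copy and signals completion. A toy sketch (the class, its `submit` method, and the thread standing in for a hardware engine are all illustrative, not an actual hardware interface):

```python
import threading

class AsyncCopyEngine:
    """Software stand-in for an offloaded copy engine."""

    def submit(self, src, dst, offset=0):
        """Post a copy of `src` into `dst` at `offset`; returns a handle
        the caller can wait on, instead of blocking in memcpy itself."""
        done = threading.Event()

        def _copy():
            dst[offset:offset + len(src)] = src  # the offloaded memcpy
            done.set()                           # completion signal

        threading.Thread(target=_copy, daemon=True).start()
        return done

engine = AsyncCopyEngine()
buf = bytearray(16)
handle = engine.submit(b"payload", buf)
# ... the CPU would do per-packet protocol work here, overlapping the copy ...
handle.wait()
```

The point of the design is overlap: the payload copy, which causes most of the compulsory misses, no longer serializes with header processing on the CPU.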

21 Q&A

