1
Hardware-Level Performance Analysis of Platform I/O
Patrick Lu, Roman Sudarikov – Intel
2
Legal Disclaimer & Optimization Notice
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Copyright © 2018, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #
3
New Modular PCIe Controller Interconnect
[Block diagrams: Intel® Xeon® Processor E7 v3 Family – 18C; Intel® Xeon® Processor Scalable Family – 28C]
Prior to the Intel® Xeon® Scalable Processor, all PCIe lanes aggregated to a single ring stop. In the Intel® Xeon® Scalable Processor, the 3x16 PCIe lanes are split across three mesh stops (higher concurrent I/O).
4
Re-Architected L2 & L3 Cache Hierarchy
[Block diagrams: previous architectures vs. Skylake-SP cache hierarchy]
- Previous architectures: the shared-distributed L3 is the primary cache – shared L3 of 2.5 MB/core (inclusive), private L2 of 256 KB per core.
- Skylake-SP architecture: the private-local L2 becomes the primary cache, with the shared L3 used as an overflow cache – private L2 of 1 MB per core, shared L3 of 1.375 MB/core (non-inclusive).
- The shared L3 changed from inclusive to non-inclusive: inclusive (prior architectures) means the L3 has copies of all lines in the L2; non-inclusive (Skylake architecture) means lines in the L2 may not exist in the L3.
- The on-chip cache balance shifted from shared-distributed (prior architectures) to private-local (Skylake architecture).
5
Inclusive vs Non-Inclusive L3
[Diagrams: inclusive L3 (prior architectures – 256 KB L2, 2.5 MB L3) vs. non-inclusive L3 (Skylake-SP architecture – 1 MB L2, 1.375 MB L3)]
- Memory reads fill directly to the L2, no longer to both the L2 and L3.
- When an L2 line needs to be removed, both modified and unmodified lines are written back to DRAM.
- Data shared across cores is copied into the L3 for servicing future L2 misses.
- Although software should be agnostic to the change, it may see a different traffic pattern versus prior server implementations.
6
Cache Usage on I/O and Core Interactions
- I/O reads to memory are non-allocating: if the line is not present in the LLC, it is not allocated; if the line is in the LLC, it stays there; if it is in one of the L2s, it moves or gets copied to the LLC.
- I/O writes to memory can be allocating or non-allocating, based on the no-snoop attribute.
- Allocating writes: writes invalidate core caches and allocate the line in the LLC where the I/O is done.
- Non-allocating writes: writes invalidate core caches and update memory.
- The CPU supports DDIO, where a subset of ways (default is 2) in the LLC is used to allocate data brought in by I/O agents (rough sizing below). DDIO ways can be written by cores, whereas I/O can't write into core ways.
- Any performance-enhancement technology built around the last-level cache may need to be revisited.
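To put the default of 2 DDIO ways in perspective, a rough back-of-the-envelope estimate (this calculation is not from the deck; it assumes the Skylake-SP LLC slice is 11-way set associative at 1.375 MB/core):

    DDIO-capable LLC per slice  ≈ (2 / 11) x 1.375 MB ≈ 0.25 MB
    DDIO-capable LLC, 28-core die ≈ 28 x 0.25 MB ≈ 7 MB

Inbound I/O data therefore competes for a fairly small fraction of the cache, which is why DDIO misses show up quickly under heavy I/O.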
7
IIO transaction flow – Overview
- IIO – Integrated I/O Controller: controls I/O flow between PCIe devices and main memory.
- "Allocating write transactions": the PCI root port utilizes write buffers backed by the LLC when the target write buffer has the WB attribute; data buffers naturally age out of the cache to main memory.
- "Non-allocating write transactions": the PCI root port write transactions utilize buffers not backed by the cache – write data moves to the iMC without cache delay.
- DDIO – Data Direct I/O: allows bus-mastering PCI and RDMA I/O to move data directly in/out of the LLC. Allocating write transactions utilize DDIO.
- Configurable globally per platform, per PCI root port, or per PCI transaction.
8
Configuring IIO transaction flow
- The Disable_all_allocating_flows setting trumps all other settings.
- When NoSnoopOpWrEn is set to 1, UseAllocatingFlowWr is a don't-care.
- When NoSnoopOpWrEn is set to 0, the NS bit in the PCI header is a don't-care.
- Option 1 (per platform, optional): Disable_all_allocating_flows – all IIO writes become non-allocating; DDIO is not used.
- Option 2 (per root port, default): UseAllocatingFlowWr selects the allocating (DDIO) flow for that root port.
- Option 3 (per PCI transaction, optional): with NoSnoopOpWrEn enabled on the root port, the TLP NS bit selects the flow per transaction (see the sketch below).
- Starting with Skylake Server, the IIO hardware looks at the No-Snoop bit in the PCI header to determine allocating or non-allocating flows.
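Whether a device actually sets the NS bit in its TLPs (the Option 3 path) is normally governed by the "Enable No Snoop" bit (bit 11) of the PCIe Device Control register. The following is a minimal sketch, not part of the original deck, that reads that bit through Linux sysfs; the BDF path is a placeholder for your device:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder BDF - substitute the NIC's PCI address. */
    const char *cfg = "/sys/bus/pci/devices/0000:3b:00.0/config";
    int fd = open(cfg, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint8_t cap_ptr;
    pread(fd, &cap_ptr, 1, 0x34);               /* capabilities list pointer */

    while (cap_ptr) {
        uint8_t id, next;
        pread(fd, &id, 1, cap_ptr);             /* capability ID */
        pread(fd, &next, 1, cap_ptr + 1);       /* next capability pointer */
        if (id == 0x10) {                       /* PCI Express capability */
            uint16_t devctl;
            pread(fd, &devctl, 2, cap_ptr + 8); /* Device Control register */
            printf("Enable No Snoop: %u\n", (devctl >> 11) & 1);
            break;
        }
        cap_ptr = next;
    }
    close(fd);
    return 0;
}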
9
Chassis Performance Monitoring
Core PerfMon can tell SW how it is performing on an IA core – e.g., was the code scheduled to execute efficiently? Were the tasks well balanced? Uncore monitoring can give SW a better sense of where all that traffic was routed and what was involved in processing it.
10
DPDK packet processing using Direct Data I/O
[Diagram: CPU socket with CPU cores, LLC, memory controller, RAM, and NIC; packet and RX descriptor (RXd) flow, steps 1–6]
1. Core writes RXd, preparing for receiving a packet.
2. NIC reads RXd to get the buffer address.
3. NIC writes the packet.
4. NIC writes RXd.
5. Core reads RXd (polling).
6. Core reads the packet and performs some action.
Most of the software thread's work stays in the CPU core and local cache, with minimal memory traffic per packet (see the sketch below).
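Steps 5–6 correspond to the receive side of a DPDK poll-mode loop. A minimal sketch, not taken from the deck – EAL and port/queue setup are omitted, and the port, queue, and burst size are placeholders:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define PORT_ID   0     /* placeholder port */
#define QUEUE_ID  0     /* placeholder RX queue */
#define BURST_SZ  32

static void rx_loop(void)
{
    struct rte_mbuf *pkts[BURST_SZ];

    for (;;) {
        /* Step 5: poll the RX ring - rte_eth_rx_burst reads the descriptors
         * the NIC wrote back and hands us the filled mbufs. */
        uint16_t nb = rte_eth_rx_burst(PORT_ID, QUEUE_ID, pkts, BURST_SZ);

        /* Step 6: touch the packet data - with DDIO it is typically already
         * resident in the LLC rather than in DRAM. */
        for (uint16_t i = 0; i < nb; i++) {
            /* ... application processing would go here ... */
            rte_pktmbuf_free(pkts[i]); /* return buffer to the pool so the
                                          driver can refill RXd (step 1) */
        }
    }
}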
11
Telemetry points
[Diagram: CPU socket with CPU cores, LLC, memory controller, RAM, and NIC; telemetry points 1–6 on the packet/RXd flow – core I/O data processing, integration with DPDK, memory bandwidth]
- The core communicates with the NIC through PCIe MMIO transactions.
- Intel DDIO makes the LLC the primary target of DMA operations.
- Writing back descriptors may result in partial PCIe transactions.
- Intel tools enable performance visibility of the platform at all layers: hardware with all its key components, and software acceleration engines.
12
Understanding PCIe performance
- Network Rx – ItoM/RFO (inbound PCIe write): avoid DDIO misses; avoid writing back "partial" descriptors.
- Network Tx – PCIeRdCur (inbound PCIe read): match against the workload's I/O throughput.
- MMIO Read – PRd (outbound CPU read): avoid MMIO reads.
- MMIO Write – WiL (outbound CPU write): optimize batch size (sketch below); look for MMIO write traffic.
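One way to act on the "optimize batch size" advice is to amortize the NIC doorbell (tail-register) update over a burst of packets. A minimal sketch, not from the deck, assuming DPDK's ethdev API; most PMDs issue a single MMIO tail write per rte_eth_tx_burst call regardless of how many packets it enqueues:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define PORT_ID   0     /* placeholder port */
#define QUEUE_ID  0     /* placeholder TX queue */

/* Enqueue a burst of already-built packets; one tail-register MMIO write
 * typically covers the whole burst instead of one write per packet. */
static void tx_burst_all(struct rte_mbuf **pkts, uint16_t n)
{
    uint16_t sent = 0;
    while (sent < n)
        sent += rte_eth_tx_burst(PORT_ID, QUEUE_ID, pkts + sent, n - sent);
}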
13
Measuring memory bandwidth
[Diagram: CPU socket with CPU cores, LLC, memory controller, RAM, and NIC; packet and RXd flow, steps 1–6]
Reasoning about the memory traffic you measure:
- Wrong NUMA allocation – fix it in the code! (see the sketch below)
- DDIO miss – receive packets ASAP; descriptor ring sizes need tuning.
- CPU data structures – may be okay, but check for latency!
- No memory traffic – best case.
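For the "wrong NUMA allocation" case, the usual DPDK fix is to create the packet buffer pool on the socket the NIC is attached to. A minimal sketch, not from the deck; the pool name and sizes are placeholders:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Create the RX mbuf pool on the NUMA node the NIC is attached to, so
 * descriptor and buffer traffic stays local to that socket. */
static struct rte_mempool *create_local_pool(uint16_t port_id)
{
    int socket = rte_eth_dev_socket_id(port_id);    /* NIC's NUMA node */

    return rte_pktmbuf_pool_create("rx_pool",       /* placeholder name */
                                   8192,            /* number of mbufs */
                                   256,             /* per-lcore cache */
                                   0,               /* private data size */
                                   RTE_MBUF_DEFAULT_BUF_SIZE,
                                   socket);
}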
14
Measuring cross socket bandwidth
[Diagram: two CPU sockets connected via UPI, each with CPU cores, LLC, memory controller, and RAM; the NIC sits on one socket while the packet and RXd land on the other]
Any cross-socket UPI (QPI) traffic will inevitably hurt performance. If it is unavoidable, watch the link utilization (a simple locality check is sketched below).
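A cheap guard against accidental cross-socket polling is to compare the NIC's NUMA node with the polling lcore's node at startup. A minimal sketch, not from the deck, assuming DPDK:

#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_log.h>

/* Warn if the lcore that will poll this port sits on a different socket
 * than the NIC - every RX descriptor and packet would then cross UPI. */
static void check_numa_locality(uint16_t port_id, unsigned lcore_id)
{
    int nic_socket   = rte_eth_dev_socket_id(port_id);
    int lcore_socket = (int)rte_lcore_to_socket_id(lcore_id);

    if (nic_socket >= 0 && nic_socket != lcore_socket)
        RTE_LOG(WARNING, EAL,
                "port %u (socket %d) polled from lcore %u (socket %d): "
                "expect cross-socket UPI traffic\n",
                port_id, nic_socket, lcore_id, lcore_socket);
}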
15
Intel® VTune™ Amplifier
16
Intel® VTune™ Amplifier
[VTune screenshot – highlighted metrics:]
- PCIe bandwidth, with traffic breakdown per physical device
- Intel DDIO misses that resulted in write-backs to RAM
- MMIO traffic – avoid reads and control writes
- DRAM bandwidth
- Socket interconnect traffic
17
Performance Counter Monitor (PCM)
18
DPDK ip_pipeline Forwarding Example
[Diagram: Intel® Xeon® Platinum 8180 with three Intel® X710-DA4 10G NICs, one each on IIO Stack 0, IIO Stack 1, and IIO Stack 2, each Gen3 x8]
- Measures individual PCIe device bandwidth at x4 granularity.
- Enumerates the downstream devices behind each IIO stack.
- The pcm-iio.x output showed three different I/O bandwidths coming from the three PCIe NIC cards: the first 4x10G receives and transmits 10% of line rate at 64 bytes, the second 20%, and the third 30%.
- Monitors both inbound and outbound traffic at DWord granularity (PCIe to system, CPU to PCIe).
- Can monitor the VT-d IOTLB miss rate (opCode.txt).
19
Analyzing performance with Intel tools
From easy to advanced:
- Calibrate I/O bandwidth against the expected performance (see the worked example below).
- Check for unexpected cross-NUMA traffic.
- Check for unexpected PCIe MMIO traffic.
- Optimize the workload to make use of Intel DDIO efficiently.
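As an illustration of the calibration step (these numbers are not from the deck; they assume 64-byte packets at 10 GbE line rate and a 16-byte descriptor write-back per packet):

    10 Gb/s at 64 B packets -> 10e9 / ((64 + 20) x 8) ≈ 14.88 Mpps per port
    inbound PCIe write bandwidth ≈ 14.88e6 x (64 B payload + 16 B descriptor) ≈ 1.19 GB/s per port

If the measured inbound PCIe write bandwidth is well below this for a port that should be at line rate, packets are being dropped or the device is idle; if DRAM write bandwidth tracks it closely, the inbound writes are missing DDIO.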
20
Questions?
Patrick Lu – patrick.lu@intel.com
Roman Sudarikov
21
Backup
22
IIO transaction flow
Inbound Write Flows:
- I2M – get line ownership for IIO; data is not required to be in the LLC.
- RFO – get line ownership and data in the LLC/eDRAM for IIO; deliver a copy of the data to IIO.
- WbMtoI – IIO delivers data to the Cbo, releases ownership.
- CLFlush – IIO delivers data to the Cbo, flushes to DRAM.
Inbound Read and Snoop Flows:
- RdCurrent – LLC delivers data to IIO.
- SnpInv – IIO always releases ownership.
Peer-to-Peer Flows:
- NcP2PB – peer write or completion.
- NcP2PS – peer-to-peer read.
23
Configuring IIO transaction flow
- Option 1: global, per platform
- Option 2: per root port
- Option 3: per PCI transaction