Are You Insured Against Your Noisy Neighbor - A VSPERF Use Case
  Sunku.Ranganath@Intel.com, Sridhar.Rao@Spirent.com, Shreya.Pandita@Spirent.com
Agenda
  - Intro to VSPERF
  - Intro to Intel RDT & Spirent Cloud Stress
  - Demo: Noisy Neighbor impact with VSPERF
  - Intro to RMD
  - Demo: Mitigating Noisy Neighbor impact with RMD
  - Call to Action
Intro to VSPERF
  - Define, implement and execute a test suite to characterize the performance of a virtual switch in the NFVi
  - Based on industry standards
  - Ability to assign and scale CPUs for VNFs
  - Supports multiple traffic generators and virtual switches with various VNF deployment scenarios
Common Contention in Cloud Deployments
  - Minimizing Total Cost of Ownership (TCO) often leads to oversubscription
  - Quality of Service (QoS) requirements
    - Service Level Agreements (SLAs)
    - Metrics: service availability, throughput, latency, scaling
  - Cloud vs. Network Function Virtualization deployments
  - Optimizing CPU resource utilization often leads to shared resource contention
  - Multi-tenancy & automated workload placement
  - Lack of control of cache by the orchestration layer
Intel® Resource Director Technology (Intel® RDT)
  Cache Monitoring Technology (CMT)
  - Identify misbehaving applications and reschedule according to priority
  - Cache occupancy reported on a per Resource Monitoring ID (RMID) basis (advanced telemetry)
  Cache Allocation Technology (CAT)
  - Last-Level Cache partitioning mechanism enabling separation and prioritization of apps or VMs
  - Misbehaving threads can be isolated to increase determinism
  [Figure: CAT and CMT diagrams, each showing APP, CORE, Last-Level Cache, and DRAM]
Key Concepts: Class of Service (CLOS)
  - Threads/apps/VMs are grouped into Classes of Service (CLOS, also written COS) for resource allocation
  - Resource usage of any thread, app, VM, or a combination is controlled with a CLOS
  - Specify the CLOS for a thread via the per-core IA32_PQR_ASSOC ("PQR") MSR
  Workflow
  1. Configure resource guidelines per CLOS
  2. Associate threads into CLOS
  3. Hardware manages resource allocation
  Bitmask types
  - Default bitmask: LLC is all shared
  - Overlapped bitmask: LLC is partially shared. Low-priority workloads are placed in a COS with shared resources
  - Isolated bitmask: LLC is allocated separately to individual COS
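The three bitmask types can be illustrated with a small Python sketch. It assumes a hypothetical 12-way LLC, and the helper names `is_contiguous` and `sharing` are illustrative only; they are not part of any Intel tool. The contiguity check reflects the CAT requirement that a capacity bitmask be a single unbroken run of 1 bits.

```python
# Hypothetical 12-way LLC: one bit per cache way (assumption for illustration).
FULL_MASK = 0xFFF


def is_contiguous(mask: int) -> bool:
    """Return True if the set bits of mask form one unbroken run."""
    if mask == 0:
        return False
    lowest = mask & -mask              # isolate the lowest set bit
    return (mask + lowest) & mask == 0  # adding it must clear the whole run


def sharing(mask_a: int, mask_b: int, full_mask: int = FULL_MASK) -> str:
    """Classify how two COS capacity bitmasks relate to each other."""
    if mask_a == full_mask and mask_b == full_mask:
        return "default"      # both COS see the whole LLC
    if mask_a & mask_b:
        return "overlapped"   # some ways are shared
    return "isolated"         # no way is shared


if __name__ == "__main__":
    print(sharing(0xFFF, 0xFFF))  # both masks cover all 12 ways
    print(sharing(0xFC0, 0x0FF))  # ways 6 and 7 are in both masks
    print(sharing(0xF00, 0x0FF))  # disjoint ways
```

A scheduler-side check like this could be used to validate a planned COS layout before it is written to hardware.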
Noisy Neighbor Impact & VSPERF
  - VSPERF integration with Collectd provides insight into NFVi data plane resource utilization
  - VSPERF automates the deployment of the Phy-VM-Phy setup
  - Cloud Stress as a Noisy Neighbor
  [Figure: A Phy-VM-Phy deployment. Labels: traffic generator port, 2000 flows; onboard Intel GbE NIC with tenant ports 1 and 2; Open vSwitch bridge with DPDK PMDs; VNF 1 (Testpmd L2 FWD) with virtio ports 0 and 1; two Cloud Stress noisy neighbors; Linux kernel and DPDK on an Intel Xeon platform (Pod 12, Node 4, NUMA node 0); components pinned to 2, 3, or 4 dedicated cores]
Intro to Spirent CloudStress
  - Web-based infrastructure validation application
  - Performance and capacity planning for compute, memory, storage and network I/O
  - Dynamic workloads to validate NFV/cloud infrastructure
Intro to Spirent CloudStress
  [Figure: a virtual firewall running on the NFVi (compute, network, storage)]
Creating Virtual Machine Profiles
  [Figure: a Spirent CloudStress instance on the NFVi (compute, network, storage)]
Creating Virtual Machine Profiles
  [Figure: multiple Spirent CloudStress instances on the NFVi under test (compute, network, storage)]
Capacity Planning
  [Figure: the NFVi under test (compute, network, storage)]
Cloud Stress as Noisy Neighbor
  - Assess the impact of resource contention on VNFs and/or NFV service chains.
  - Generate flaps or negative events on a system to cause intentional disruption. This helps in understanding the impact of noisy neighbors on the given system.
    - What is the impact of a CPU spike on one of the VMs on a fully loaded host?
    - If network load drops suddenly and, after a short time, returns all at once, is full capacity immediately available?
    - What is the effect on VMs on oversubscribed hosts of uneven loads on a small subset of VMs?
  [Figure: a noisy neighbor next to VNFs (vRouter, vFW, vCPE) on the NFVi under test (compute, network, storage), with its effect on VNF performance]
Demo: Impact of Noisy Neighbor on VNF Under Test
Planning For Resources
  - Remote analysis of resource utilization and granular resource control are not optimal for latency-sensitive workloads
  - Planning for your cache:
    - LLC profiling
    - LLC considerations
    - Class of Service construction
Class Of Service Construction
  - Considerations:
    - Capacity of cache
    - Would you require DDIO?
    - Isolated vs. overlapping cache COS
  - Crucial to have a local agent on the host to control & enforce COS associations for latency-sensitive workloads
  [Figure: Traffic flow from NIC to VMs. Labels: total LLC; path 1 is the non-DDIO packet path]
Enabling Options
  - Platform Quality of Service (pqos) tool
    - User space tool requiring access to Intel MSR
    - Associates LLC on a per-core-ID basis
    - https://github.com/intel/intel-cmt-cat
  - Resctrl file system
    - Extension of kernfs
    - Associates using PID on a per-thread basis
    - Kernel 4.10+
  - Resource Management Daemon
    - Newly open sourced
    - Based on resctrl fs
  [Figure: Kernel resctrl fs]
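The resctrl option can be sketched in Python as follows. The kernel interface itself is real: a directory under `/sys/fs/resctrl` represents a resource group, its `schemata` file holds hexadecimal L3 capacity bitmasks per cache ID, and its `tasks` file accepts PIDs. The helper names `schemata_line` and `apply_group`, the group name, the masks, and the PIDs are assumptions for illustration. Running against the real mount point requires root and a mounted resctrl file system; the `root` parameter lets the sketch be tried against a scratch directory instead.

```python
import os


def schemata_line(l3_masks):
    """Format an L3 schemata line, e.g. {0: 0xF0, 1: 0x0F} -> 'L3:0=f0;1=f'."""
    parts = [f"{cache_id}={mask:x}" for cache_id, mask in sorted(l3_masks.items())]
    return "L3:" + ";".join(parts)


def apply_group(name, l3_masks, pids, root="/sys/fs/resctrl"):
    """Create a resource group, set its L3 masks, and move tasks into it."""
    group = os.path.join(root, name)
    os.makedirs(group, exist_ok=True)      # mkdir creates the resctrl group

    with open(os.path.join(group, "schemata"), "w") as f:
        f.write(schemata_line(l3_masks) + "\n")

    with open(os.path.join(group, "tasks"), "w") as f:
        for pid in pids:
            f.write(f"{pid}\n")
            f.flush()                      # issue one write per PID


if __name__ == "__main__":
    import tempfile

    scratch = tempfile.mkdtemp()           # stand-in for /sys/fs/resctrl
    apply_group("vnf_under_test", {0: 0xF0}, [1234], root=scratch)
    print(schemata_line({0: 0xF0, 1: 0x0F}))
```

Because association is by PID, each thread of a workload (for example each vCPU thread of a VM) has to be written to `tasks` for the whole workload to land in the group.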
Resource Management Daemon
  What is RMD
  - A Linux daemon that runs on individual hosts, with pluggable interfaces to interact with orchestration, monitoring and enforcement layers
  - Communicates across control and data plane using a REST API
  - Receives resource policy from the orchestration layer and enforces it on the host
  - Enforces resource allocation using kernel interfaces like resctrlfs or libraries like libpqos
  Why RMD
  - Complex usage (mask)
  - Real-time tuning
  - Varying platforms (cache size, bandwidth, NUMA)
  - Fast-shifting workloads (local policy)
  - Uniform interface for RDT
  - Simple API interface
  Hosted at https://github.com/intel/rmd
RMD Architecture
  - Open sourced on Nov 9th 2017
  - Provides the construct of overlapped and isolated COSes
  - Helps tune the LLC for optimal performance
  - Simple to use with max_cache and min_cache constructs
  [Figure: RMD architecture, with WIP sections marked]
  Configuration   Policy Details
  osgroup         Cache ways reserved for operating system usage
  infragroup      Cache ways shared with other workloads
  guarantee       Allocate cache for workload; max_cache == min_cache > 0
  besteffort      Allocate cache for workload; max_cache > min_cache > 0
  shared          Allocate cache for workload; max_cache == min_cache == 0
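The max_cache/min_cache rules in the table can be expressed as a small Python function. This is a sketch of the table's logic only, not RMD's own code; the function name `policy_for` is hypothetical, and the handling of combinations the table does not list (for example max_cache > 0 with min_cache == 0) is an assumption.

```python
def policy_for(max_cache: int, min_cache: int) -> str:
    """Map a (max_cache, min_cache) request to the policy named in the table."""
    if max_cache < 0 or min_cache < 0 or min_cache > max_cache:
        raise ValueError("expected 0 <= min_cache <= max_cache")
    if max_cache == 0:
        return "shared"       # max_cache == min_cache == 0
    if max_cache == min_cache:
        return "guarantee"    # max_cache == min_cache > 0
    if min_cache > 0:
        return "besteffort"   # max_cache > min_cache > 0
    # Assumption: the table does not cover max_cache > 0 with min_cache == 0.
    raise ValueError("combination not covered by the policy table")


if __name__ == "__main__":
    for request in [(4, 4), (4, 2), (0, 0)]:
        print(request, policy_for(*request))
```

A workload that must never lose cache to a noisy neighbor would request equal, non-zero values; one that can tolerate contention would request zero for both.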
Demo: Mitigation of Noisy Neighbor Impact with RMD
Permutations of Test Scenarios
  - Overlapping COS between:
    - Virtual switch and VMs
    - Multiple VMs
    - OS and virtual switch
  - Isolated COS between:
  - DDIO considerations:
    - Exclusive to VMs
    - Exclusive to OS
    - Shared across virtual switch & VMs
  - Forced contentions
    - Limited LLC to VM under test
    - Limited LLC to virtual switch
  [Figure: Permutations of COS association. Core IDs (0, 7-9, 13-23; 1; 2; 3; 4; 5; 6; 10; 11; 12) are assigned to the hypervisor, PHY PMDs, vswitchd, OS, VM1, and VM2 and grouped into COS0 through COS3. The total LLC is divided into isolated LLC, isolated OVS LLC, overlapped OVS LLC, and overlapped LLC across VMs, with three DDIO placements (1, 2, 3)]
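The scenario dimensions listed on this slide can be combined mechanically into a test matrix. The sketch below is in Python and uses only the overlapping-COS, DDIO, and forced-contention options named above; the variable names and the dictionary layout are hypothetical and do not correspond to any VSPERF configuration format.

```python
from itertools import product

# Options copied from the slide; grouping them this way is an assumption.
OVERLAP_BETWEEN = ["virtual switch and VMs", "multiple VMs", "OS and virtual switch"]
DDIO = ["exclusive to VMs", "exclusive to OS", "shared across virtual switch and VMs"]
FORCED_CONTENTION = ["limited LLC to VM under test", "limited LLC to virtual switch"]


def scenarios():
    """Return every combination of the three dimensions as a list of dicts."""
    return [
        {"overlap": overlap, "ddio": ddio, "contention": contention}
        for overlap, ddio, contention in product(
            OVERLAP_BETWEEN, DDIO, FORCED_CONTENTION
        )
    ]


if __name__ == "__main__":
    matrix = scenarios()
    print(len(matrix), "scenarios")
    print(matrix[0])
```

In practice a subset of this matrix would likely be chosen, since not every combination is equally interesting for a given deployment.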
In Summary
  - Noisy neighbor effects are real and here to stay
  - RMD provides a REST API for the control/orchestration/management layer to request LLC for its VMs/containers/applications
Call To Action
  - Enable test cases for VSPERF with various combinations of cache associations
  - Scale the test scenarios for your projects with RMD and/or Cloud Stress
Questions?
References
  - Cloud Testing with Synthetic Workload Gen: https://www.spirent.com/-/media/White-Papers/Broadband/PAB/Cloud_testing_with_synthetic_workload_generators.pdf
  - Virtual Infrastructure Benchmarking: https://www.spirent.com/-/media/White-Papers/Broadband/PAB/Key_Considerations_for_Virtual_Infrastructure_Benchmarking_whitepaper.pdf
  - Intro to Intel RDT: https://01.org/intel-rdt-linux/blogs/fyu1/2017/resource-allocation-intel%C2%AE-resource-director-technology
  - Intro to RMD: https://github.com/intel/rmd
  - Deterministic NFV w/ Intel RDT: https://builders.intel.com/docs/networkbuilders/deterministic_network_functions_virtualization_with_Intel_Resource_Director_Technology.pdf