NUMA scaling issues in 10GbE NetConf 2009 PJ Waskiewicz Intel Corp.

Presentation transcript:

NUMA scaling issues in 10GbE
NetConf 2009
PJ Waskiewicz, Intel Corp., LAN Access Division

NUMA balancing on 10GbE
- No affinity to a socket when a driver loads
- insmod runs as a single thread, so it is indeterminate which node static structures will be allocated from (see the allocation sketch below)
- No linkage (currently) between where driver buffers are allocated and where the userspace apps that consume them run. Is this important?
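A minimal sketch of what node-aware allocation at driver load could look like, assuming the probe routine uses the device's firmware-reported NUMA node. The my_adapter structure and my_probe() function are hypothetical names for illustration; dev_to_node(), kzalloc_node() and numa_node_id() are existing kernel interfaces.

/* Sketch: allocate driver state on the NIC's home NUMA node instead of
 * whichever node the insmod thread happens to be running on.
 * my_adapter and my_probe() are hypothetical; dev_to_node() and
 * kzalloc_node() are the real kernel interfaces.
 */
#include <linux/pci.h>
#include <linux/slab.h>
#include <linux/topology.h>

struct my_adapter {
	void *rx_rings;
	void *tx_rings;
	int node;			/* NUMA node the device sits on */
};

static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	/* Firmware (ACPI _PXM) tells the PCI core which node the slot
	 * belongs to; dev_to_node() returns it, or NUMA_NO_NODE. */
	int node = dev_to_node(&pdev->dev);
	struct my_adapter *adapter;

	if (node == NUMA_NO_NODE)
		node = numa_node_id();	/* fall back to the probing CPU's node */

	/* Put the adapter structure (and later the descriptor rings and
	 * receive buffers) on that node, not on the insmod thread's node. */
	adapter = kzalloc_node(sizeof(*adapter), GFP_KERNEL, node);
	if (!adapter)
		return -ENOMEM;

	adapter->node = node;
	pci_set_drvdata(pdev, adapter);
	return 0;
}

The same node value can be reused for the later ring and buffer allocations, for example via alloc_pages_node(); the userspace half of the question (where the consuming application runs) is not something the driver can see today.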

Current issues observed in 10GbE
- Scaling multiple ports of 10GbE can cause NUMA memory bandwidth bottlenecks
- What happens in systems with a PCIe slot affinitized to a socket? How does the driver know, and allocate accordingly?
- The kernel currently references everything per-core (per_cpu lists, etc.). As core counts keep growing, referencing things per node starts to make more sense (see the per-node sketch below)
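A sketch of what per-node rather than per-CPU bookkeeping could look like, using a receive buffer pool as the example. The rx_pool structure and its helpers are hypothetical; for_each_online_node(), kzalloc_node() and numa_node_id() are existing kernel interfaces.

/* Sketch: per-node instead of per-CPU bookkeeping.  One pool per online
 * NUMA node, each allocated from that node's own memory, so the number
 * of instances scales with nodes rather than with cores. */
#include <linux/list.h>
#include <linux/nodemask.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/topology.h>

struct rx_pool {
	spinlock_t lock;
	struct list_head free_bufs;
};

static struct rx_pool *rx_pools[MAX_NUMNODES];

static int init_per_node_pools(void)
{
	int node;

	for_each_online_node(node) {
		struct rx_pool *pool;

		/* Allocate each pool from the node it will serve. */
		pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, node);
		if (!pool)
			return -ENOMEM;

		spin_lock_init(&pool->lock);
		INIT_LIST_HEAD(&pool->free_bufs);
		rx_pools[node] = pool;
	}
	return 0;
}

/* Fast path: a CPU uses its own node's pool, keeping the data and the
 * lock traffic local to that node. */
static struct rx_pool *this_node_rx_pool(void)
{
	return rx_pools[numa_node_id()];
}

The obvious tradeoff against per_cpu data is that CPUs on the same node now share a lock, so this is a direction to explore rather than a drop-in replacement.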

Tiny snapshot of balancing issue

Thoughts on which direction to go
- This problem isn't solved yet, and it affects almost everyone. It becomes even worse approaching 40GbE and 100GbE.
- Have an API of sorts to "properly" allocate memory for drivers in a NUMA environment
- Move towards using a single queue (or small set) per NUMA node, instead of a queue per CPU core; intra-socket performance is much better than inter-node (see the queue-per-node sketch below)
- Is there any benefit in trying to drive NUMA affinitization into userspace (possibly through the recent Flow Steering work from Tom Herbert)?
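A sketch of the queue-per-node idea, assuming a hypothetical adapter with one MSI-X vector per queue. The my_ring/my_adapter structures and setup_node_queues() are invented for illustration; num_online_nodes(), first_online_node, next_online_node(), cpumask_of_node() and irq_set_affinity_hint() are existing kernel interfaces (the affinity-hint call was merged after this talk).

/* Sketch: one queue per NUMA node instead of one per CPU core, with each
 * queue's MSI-X vector steered to the CPUs of the node it serves. */
#include <linux/interrupt.h>
#include <linux/nodemask.h>
#include <linux/pci.h>
#include <linux/slab.h>
#include <linux/topology.h>

struct my_ring {
	int node;		/* descriptor ring state, details omitted */
};

struct my_adapter {
	struct my_ring *rings[MAX_NUMNODES];
	struct msix_entry msix_entries[MAX_NUMNODES];
};

static int setup_node_queues(struct my_adapter *adapter)
{
	int nr_queues = num_online_nodes();	/* queues scale with nodes */
	int node = first_online_node;
	int q;

	for (q = 0; q < nr_queues; q++) {
		struct my_ring *ring;

		/* The descriptor ring lives on the node it will serve. */
		ring = kzalloc_node(sizeof(*ring), GFP_KERNEL, node);
		if (!ring)
			return -ENOMEM;

		ring->node = node;
		adapter->rings[q] = ring;

		/* Ask the IRQ layer to keep this vector on that node's CPUs,
		 * so completions are handled node-locally. */
		irq_set_affinity_hint(adapter->msix_entries[q].vector,
				      cpumask_of_node(node));

		node = next_online_node(node);
	}
	return 0;
}

Steering flows so that an application is serviced by the queue on its own node (the Flow Steering question) would then be the userspace-facing half of the problem, which a driver cannot decide on its own.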