Automatic NUMA Balancing. Rik van Riel, Principal Software Engineer, Red Hat; Vinod Chegu, Master Technologist, HP. Presented by 김해천, Embedded System Lab.

Presentation transcript:

Embedded System Lab. 김해천
Automatic NUMA Balancing
Rik van Riel, Principal Software Engineer, Red Hat
Vinod Chegu, Master Technologist, HP

김해천 Embedded System Lab.
Introduction to NUMA
NUMA: Non Uniform Memory Access
- Multiple physical CPUs in a system
- Each CPU has memory attached to it: local memory, fast
- Each CPU can also access the other CPUs' memory: remote memory, slower
(Figure: SMP vs. NUMA system layout)
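As a rough illustration of the local vs. remote distinction, the sketch below uses libnuma (assumed installed; link with -lnuma) on a machine with at least two NUMA nodes: the task is pinned to node 0, then touches one buffer placed on node 0 (local, fast) and one placed on node 1 (remote, slower, crossing the interconnect). The buffer size is arbitrary; timing the two memset() calls on real hardware makes the latency gap visible.

    /* Minimal sketch of local vs. remote NUMA allocation using libnuma.
     * Assumes at least two NUMA nodes; compile with: gcc numa_demo.c -lnuma */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    #define BUF_SIZE (64UL * 1024 * 1024)

    int main(void)
    {
        if (numa_available() < 0 || numa_max_node() < 1) {
            fprintf(stderr, "need libnuma and at least two NUMA nodes\n");
            return 1;
        }

        numa_run_on_node(0);                            /* run this task on node 0 */

        char *local  = numa_alloc_onnode(BUF_SIZE, 0);  /* memory placed on node 0 */
        char *remote = numa_alloc_onnode(BUF_SIZE, 1);  /* memory placed on node 1 */
        if (!local || !remote) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }

        memset(local, 1, BUF_SIZE);     /* local accesses: fast                  */
        memset(remote, 1, BUF_SIZE);    /* remote accesses: cross the interconnect */

        numa_free(local, BUF_SIZE);
        numa_free(remote, BUF_SIZE);
        return 0;
    }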

김해천 Embedded System Lab.
Automatic NUMA balancing
But NUMA brings performance penalties:
- Higher latency when accessing remote memory
- Interconnect contention
To improve NUMA performance, automatic NUMA balancing uses two strategies:
- CPU follows memory: try to run tasks where their memory is
- Memory follows CPU: move memory to where it is accessed
Automatic NUMA balancing uses both strategies; various mechanisms are involved, with lots of interesting corner cases.
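On Linux, the feature is controlled by the kernel.numa_balancing sysctl (present when the kernel is built with CONFIG_NUMA_BALANCING). The small check below is a sketch that assumes that sysctl file exists; it reports whether the two strategies above are currently active, and writing 0 or 1 to the same file turns them off or on.

    /* Sketch: report whether automatic NUMA balancing is enabled.
     * Assumes a kernel that exposes /proc/sys/kernel/numa_balancing. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/kernel/numa_balancing", "r");
        int enabled = 0;

        if (!f) {
            fprintf(stderr, "automatic NUMA balancing not available\n");
            return 1;
        }
        if (fscanf(f, "%d", &enabled) == 1)
            printf("automatic NUMA balancing: %s\n", enabled ? "on" : "off");
        fclose(f);
        return 0;
    }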

김해천 Embedded System Lab.
Automatic NUMA balancing: task placement, example 1
- Moving task A to node 1: 40% improvement
- Moving a task to node 1 removes a load imbalance
- Moving task A to an idle CPU on node 1 is desirable
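To make the "40% improvement" idea concrete, here is a much-simplified sketch of the "CPU follows memory" decision. It is not the kernel's placement code: a task's affinity to a node is modeled only as the share of its recent NUMA faults that hit that node, and the move is taken when the other node scores higher and has an idle CPU, as in the example.

    /* Simplified sketch of the "CPU follows memory" decision (example 1).
     * Not kernel code: affinity is just the share of recent NUMA faults. */
    #include <stdio.h>

    struct task_stats {
        unsigned long faults[2];       /* recent NUMA faults per node (2-node box) */
    };

    /* Percentage of this task's recent faults that landed on 'node'. */
    static int node_score(const struct task_stats *t, int node)
    {
        unsigned long total = t->faults[0] + t->faults[1];
        return total ? (int)(100 * t->faults[node] / total) : 0;
    }

    /* Move the task if the other node holds more of its memory
     * and has an idle CPU to run it on. */
    static int should_move(const struct task_stats *t, int cur, int other,
                           int other_has_idle_cpu)
    {
        return other_has_idle_cpu && node_score(t, other) > node_score(t, cur);
    }

    int main(void)
    {
        /* Task A runs on node 0 but most of its memory sits on node 1. */
        struct task_stats a = { .faults = { 300, 700 } };

        printf("move A from node 0 to node 1? %s\n",
               should_move(&a, 0, 1, /* idle CPU on node 1 */ 1) ? "yes" : "no");
        return 0;
    }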

김해천 Embedded System Lab.
Automatic NUMA balancing: task placement, example 2
- Moving task A to node 1: 40% improvement
- Moving task T to node 0: 20% improvement
- Swapping tasks A and T is desirable
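The swap case can be sketched the same way: when neither node has an idle CPU, the combined gain of exchanging the two tasks decides. The hypothetical should_swap() helper below simply adds the two per-task gains (40% and 20% in the example); the kernel's real decision weighs fault statistics and load rather than fixed percentages.

    /* Sketch of the swap decision from example 2; illustrative only. */
    #include <stdio.h>

    static int should_swap(int gain_a_to_node1, int gain_t_to_node0)
    {
        /* Swap whenever the exchange helps the pair overall. */
        return (gain_a_to_node1 + gain_t_to_node0) > 0;
    }

    int main(void)
    {
        printf("swap A and T? %s\n", should_swap(40, 20) ? "yes" : "no");
        return 0;
    }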

김해천 Embedded System Lab.
Automatic NUMA balancing: task grouping
- Multiple tasks can access the same memory
- The CPU number & PID stored in struct page are used to detect shared memory:
  at NUMA fault time, check the CPU where the page was last faulted, and
  group the tasks together in a numa_group if the PID matches
- Grouping related tasks improves NUMA task placement; only truly related
  tasks should be grouped
Task grouping & task placement
- The task placement code is similar to before
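The grouping check can be pictured with the sketch below (made-up structures, not kernel code): each page remembers the CPU and truncated PID of its last NUMA fault, and when a different task faults on the same page and the recorded PID matches the task still running on that CPU, the two tasks are merged into one numa_group.

    /* Illustrative sketch of the grouping rule done at NUMA fault time.
     * All names here are hypothetical. */
    #include <stdio.h>

    struct numa_group { int id; };

    struct page_hint {
        int last_cpu;              /* CPU of the previous NUMA fault on this page */
        int last_pid;              /* low PID bits of the previous faulting task  */
    };

    struct task {
        int pid;
        struct numa_group *grp;
    };

    static void on_numa_fault(struct task *t, int cpu, struct page_hint *p,
                              struct task *running_on_last_cpu)
    {
        /* Look at the task now running on the recorded CPU; a matching PID
         * guards against stale data.  A different task means the page is
         * shared, so join that task's numa_group. */
        if (running_on_last_cpu &&
            running_on_last_cpu->pid != t->pid &&
            (running_on_last_cpu->pid & 0xffff) == p->last_pid)
            t->grp = running_on_last_cpu->grp;

        /* Record this fault so the next faulting task can do the same check. */
        p->last_cpu = cpu;
        p->last_pid = t->pid & 0xffff;
    }

    int main(void)
    {
        struct numa_group g1 = { 1 }, g2 = { 2 };
        struct task a = { 100, &g1 }, b = { 200, &g2 };
        struct page_hint page = { 0, 0 };

        on_numa_fault(&a, 0, &page, NULL);   /* A touches the page first   */
        on_numa_fault(&b, 4, &page, &a);     /* B faults next: shared page */

        printf("task B grouped with A? %s\n", b.grp == a.grp ? "yes" : "no");
        return 0;
    }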

김해천 Embedded System Lab.
Automatic NUMA balancing: task grouping & task placement (figure-only slide)

김해천 Embedded System Lab.
Automatic NUMA balancing: pseudo-interleaving
- Sometimes one workload spans multiple nodes: more threads are running than
  one node has CPU cores, so the load balancer spreads them out
- Goals: maximize the memory bandwidth available to the workload, and
  minimize the number of page migrations
- This is the page interleaving problem that the next slide solves

김해천 Embedded System Lab.
Automatic NUMA balancing: pseudo-interleaving solution
- Allow NUMA page migration only on private faults
- Allow NUMA migration only from the more-used node to the lesser-used node
- Result: the nodes end up equally used, maximizing memory bandwidth, and
  NUMA page migration on shared faults is avoided
Definitions:
- Private fault: the memory is accessed by the same task twice in a row
- Shared fault: the memory is accessed by a different task than last time
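The two rules combine into a simple migration filter. The sketch below (illustrative names, not the kernel's) refuses migration on shared faults and otherwise lets pages flow only from the node holding more of the group's memory to the node holding less, so the group's pages level out and both memory controllers end up being used.

    /* Sketch of the pseudo-interleaving rule for a workload on two nodes. */
    #include <stdio.h>

    struct group_stats {
        unsigned long pages_on_node[2];   /* the numa_group's resident pages per node */
    };

    enum fault_kind { FAULT_PRIVATE, FAULT_SHARED };

    /* 'src' is the node the page lives on, 'dst' the faulting CPU's node. */
    static int should_migrate(const struct group_stats *g, enum fault_kind kind,
                              int src, int dst)
    {
        if (kind == FAULT_SHARED)
            return 0;                     /* never migrate on shared faults */

        /* Private fault: only flow from the busier node to the emptier one. */
        return g->pages_on_node[src] > g->pages_on_node[dst];
    }

    int main(void)
    {
        struct group_stats g = { .pages_on_node = { 8000, 2000 } };

        printf("private fault, node 0 -> 1: %s\n",
               should_migrate(&g, FAULT_PRIVATE, 0, 1) ? "migrate" : "stay");
        printf("private fault, node 1 -> 0: %s\n",
               should_migrate(&g, FAULT_PRIVATE, 1, 0) ? "migrate" : "stay");
        printf("shared fault,  node 0 -> 1: %s\n",
               should_migrate(&g, FAULT_SHARED, 0, 1) ? "migrate" : "stay");
        return 0;
    }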

김해천 Embedded System Lab.
The End

김해천 Embedded System Lab.
Introduction to NUMA (background)
- NUMA is an advanced concept derived from SMP: a multiprocessor computer
  architecture in which two or more processors share one memory.