Published by Angelina Kelly. Modified over 9 years ago.
1
User-Level Process towards Exascale Systems
Akio Shimada [1], Atsushi Hori [1], Yutaka Ishikawa [1], Pavan Balaji [2]
[1] RIKEN AICS, [2] Argonne National Laboratory
2
Background
MPI processes running on an HPC cluster communicate with each other to exchange data for parallel computation
  – An MPI process must wait for the completion of a communication
Latency hiding is an important issue for exascale systems
  – The network of an HPC cluster will become larger
3
Methods for Latency Hiding
Non-blocking communication
  – Overlapping communication and computation
Oversubscription
  – Binding multiple processes to one CPU core
  – Switching to another process when a process blocks waiting for a communication to complete
4
Problem
Process context switching is slow
  – The overhead of the process context switch spoils the benefit of process oversubscription in some cases [Iancu et al., IPDPS 2010]
    The overhead of entering the kernel context
    The overhead of switching the address space
5
Conventional Approach
Oversubscription using user-level threads (e.g. FG-MPI)
  – Invoking multiple user-level threads within a process
  – Assigning the role of an MPI process to each user-level thread
Pros and cons
  – Pros
    Fast context switch
      – The context switch between user-level threads can be performed in user space
      – The context switch between user-level threads does not require address space switching
  – Cons
    Modification to the application is required
      – Program code (text) and data (data, bss and heap) are shared among the user-level threads playing the role of MPI processes
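The pros and cons above can be illustrated with a minimal sketch. It uses the POSIX `ucontext` functions as a stand-in for a user-level thread library; this is an assumption for illustration, not how FG-MPI is implemented. The names `worker_a`, `worker_b`, `shared_counter` and `run_user_level_threads` are hypothetical. Two user-level threads hand control to each other without a kernel scheduling decision, and both update the same global variable, which is exactly the sharing that forces application changes.

```c
#include <assert.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)
#define ROUNDS 3

static ucontext_t main_ctx, ctx_a, ctx_b;
static char stack_a[STACK_SIZE], stack_b[STACK_SIZE];

/* Global data: visible to every user-level thread in the process. */
int shared_counter = 0;

static void worker_a(void)
{
    for (int i = 0; i < ROUNDS; i++) {
        shared_counter++;
        swapcontext(&ctx_a, &ctx_b);   /* yield to B */
    }
    /* Returning here resumes main_ctx through uc_link. */
}

static void worker_b(void)
{
    for (int i = 0; i < ROUNDS; i++) {
        shared_counter++;
        swapcontext(&ctx_b, &ctx_a);   /* yield to A */
    }
}

/* Prepare a context that runs fn on its own stack. */
static void init_ctx(ucontext_t *ctx, char *stack, void (*fn)(void))
{
    int rc = getcontext(ctx);
    assert(rc == 0);
    ctx->uc_stack.ss_sp = stack;
    ctx->uc_stack.ss_size = STACK_SIZE;
    ctx->uc_link = &main_ctx;          /* where to go when fn returns */
    makecontext(ctx, fn, 0);
}

/* Runs both workers and returns the final value of the shared counter. */
int run_user_level_threads(void)
{
    shared_counter = 0;
    init_ctx(&ctx_a, stack_a, worker_a);
    init_ctx(&ctx_b, stack_b, worker_b);
    swapcontext(&main_ctx, &ctx_a);    /* start A; returns when A finishes */
    return shared_counter;
}
```

Calling `run_user_level_threads()` from `main` returns the number of increments made by both workers combined. Note that glibc's `swapcontext` also saves and restores the signal mask, which costs a system call on Linux, so dedicated user-level thread libraries typically use a hand-written register switch instead.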
6
Our Solution
User-level process (ULP)
  – A ULP is a "process" that can be scheduled in user space
    The ULP has the beneficial features of the user-level thread
    The ULP has its own program code and data (therefore, we equate the ULP with a "process")
  – Capability of the ULP
    The ULP enables low-overhead process oversubscription
    Modification to the application is not required

                                  Kernel-level Process   User-level Thread   User-level Process
Context switch                    Slow                   Fast                Fast
Modification to the application   Not required           Required            Not required
7
Overview of User-level Process
[Figure: four panels, (a) Kernel-level Process, (b) User-level Process, (c) Kernel-level Thread, (d) User-level Thread. Each panel shows the memory segments (text, data, bss, heap, stack), the execution contexts, the address space boundaries, the CPU core (C), and a task scheduler in kernel space or user space.]
The ULP can be scheduled in user space
  – Low-overhead oversubscription can be achieved by avoiding the overhead of the process context switch
The ULP has its own program code and data
  – Modification to the application is not required
8
Address Space Design
[Figure: address space layouts, drawn from low to high address, for three cases.
  Process: TEXT, DATA&BSS, HEAP, STACK, KERNEL
  User-level Thread: TEXT, DATA&BSS, HEAP, STACK 0 to STACK N-1, KERNEL
  User-level Process: partitions ULP 0 to ULP N-1, each with its own TEXT, DATA&BSS, HEAP and STACK, plus KERNEL]
9
Context Switch
[Figure: context switch from ULP 0 to ULP 1. The address space, from low to high address, holds a partition for ULP 0 and a partition for ULP 1, each with text, data & bss, heap and stack. The CPU core ① saves the context (registers) of user-level process 0 and ② loads the context of user-level process 1.]
Segment registers must be considered on x86_64 architectures
  – Segment registers are not accessible from user space
  – The fs register is used for implementing Thread Local Storage (TLS)
  – Thread-safe functions must be built without using TLS
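The TLS constraint above can be made concrete with a small sketch. It again uses POSIX `ucontext` as a stand-in for a user-space context switch (an assumption; the ULP implementation is not shown in the slides), and the names `tls_value`, `other_task` and `tls_is_shared_across_user_contexts` are hypothetical. Because the switch happens entirely in user space on one kernel thread, the TLS base is left untouched, so the second context sees and overwrites the first context's thread-local variable.

```c
#include <assert.h>
#include <ucontext.h>

/* Thread-local variable; per the slide, on x86_64 its storage is
 * located through the fs register. */
static __thread int tls_value;

static ucontext_t tls_main_ctx, tls_other_ctx;
static char tls_other_stack[64 * 1024];
static int seen_by_other;

/* Runs in a second user-level context on the same kernel thread. */
static void other_task(void)
{
    seen_by_other = tls_value;   /* reads the value written by the caller */
    tls_value = 99;              /* and overwrites the caller's copy */
}

/* Returns 1 if both contexts observed the same TLS storage. */
int tls_is_shared_across_user_contexts(void)
{
    tls_value = 42;
    seen_by_other = -1;

    int rc = getcontext(&tls_other_ctx);
    assert(rc == 0);
    tls_other_ctx.uc_stack.ss_sp = tls_other_stack;
    tls_other_ctx.uc_stack.ss_size = sizeof tls_other_stack;
    tls_other_ctx.uc_link = &tls_main_ctx;   /* return here when done */
    makecontext(&tls_other_ctx, other_task, 0);

    /* User-space switch: general registers and stack change,
     * the TLS base does not. */
    swapcontext(&tls_main_ctx, &tls_other_ctx);

    return seen_by_other == 42 && tls_value == 99;
}
```

For user-level processes that are meant to be independent programs, this kind of silent sharing would be a correctness problem, which is one way to read the requirement that thread-safe functions be built without TLS.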
10
ULP API
int pvas_ulp_create(int *pvd)
  – pvas_ulp_create creates an address space for ULPs
int pvas_ulp_destroy(int pvd)
  – pvas_ulp_destroy destroys a created address space
int pvas_ulp_spawn(int pvd, int pvid, char *filename, char **argv, char **environ)
  – pvas_ulp_spawn spawns a kernel-level process with a ULP
int pvas_ulp_exec(int pvid, char *filename, char **argv, char **environ)
  – pvas_ulp_exec creates and executes a new ULP
int pvas_ulp_switch(int pvid)
  – pvas_ulp_switch performs a context switch from the current ULP to the indicated ULP
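As a hedged illustration of how a runtime might use a switch call of this shape, the sketch below picks the next ULP in round-robin order and hands control to it. Only the signature `int (int pvid)` is taken from the slide. The helpers `next_pvid` and `ulp_yield`, the assumption that ULP ids run from 0 to N-1, the assumption that 0 means success, and the recording stub `record_switch` are all hypothetical; the real `pvas_ulp_switch` is not called here.

```c
#include <assert.h>

/* Function pointer with the same shape as
 * int pvas_ulp_switch(int pvid) from the slide. */
typedef int (*ulp_switch_fn)(int pvid);

/* Hypothetical helper: next ULP id in a ring of ids 0 .. n_ulps-1. */
int next_pvid(int self, int n_ulps)
{
    return (self + 1) % n_ulps;
}

/* Hypothetical helper: yield from ULP `self` to the next ULP, for
 * example when `self` would otherwise block on a communication. */
int ulp_yield(int self, int n_ulps, ulp_switch_fn do_switch)
{
    assert(n_ulps > 0 && self >= 0 && self < n_ulps);
    return do_switch(next_pvid(self, n_ulps));
}

/* Stand-in for the real switch call: only records the target id.
 * Returning 0 for success is an assumption. */
int last_target = -1;

int record_switch(int pvid)
{
    last_target = pvid;
    return 0;
}
```

With three ULPs, `ulp_yield(0, 3, record_switch)` records target 1, and `ulp_yield(2, 3, record_switch)` wraps around to target 0. In a real runtime the stub would be replaced by the actual switch function.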
11
Preliminary Evaluation (context switch performance)
Benchmark
  – Invoking multiple parallel processes on a single CPU core
  – A parallel process may be a kernel-level process, a kernel-level thread, a user-level thread, or a user-level process
  – Measuring the time elapsed until every parallel process has performed 1000 context switches
The performance of the ULP is competitive with that of the user-level thread
Environment
  – CPU: Intel Xeon X5670 2.93 GHz
  – OS: Linux 2.6.32-el6 for x86_64
[Chart: benchmark results; lower is better]
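The kernel-level process baseline of such a measurement can be approximated with a common pipe ping-pong technique, sketched below. This is not the authors' benchmark: the function name `time_process_pingpong` is hypothetical, only two processes are used, and the processes are not pinned to one core, so on a multi-core machine the result reflects block and wake-up latency rather than a pure same-core context switch.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Bounce one byte between parent and child `rounds` times.
 * Each round trip forces both processes to block and be rescheduled.
 * Returns elapsed nanoseconds, or -1 on error. */
long long time_process_pingpong(int rounds)
{
    int to_child[2], to_parent[2];
    char byte = 'x';

    if (pipe(to_child) != 0 || pipe(to_parent) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {                       /* child: echo every byte back */
        close(to_child[1]);
        close(to_parent[0]);
        for (int i = 0; i < rounds; i++) {
            if (read(to_child[0], &byte, 1) != 1) _exit(1);
            if (write(to_parent[1], &byte, 1) != 1) _exit(1);
        }
        _exit(0);
    }

    close(to_child[0]);                   /* parent: drive the ping-pong */
    close(to_parent[1]);

    struct timespec t0, t1;
    int ok = 1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < rounds && ok; i++) {
        if (write(to_child[1], &byte, 1) != 1) ok = 0;
        else if (read(to_parent[0], &byte, 1) != 1) ok = 0;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(to_child[1]);                   /* unblocks the child on failure */
    close(to_parent[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !ok)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return (t1.tv_sec - t0.tv_sec) * 1000000000LL
         + (t1.tv_nsec - t0.tv_nsec);
}
```

Dividing the result of `time_process_pingpong(1000)` by the number of round trips gives a rough per-round-trip cost that can be compared against a user-space switch measured the same way.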
12
Summary and Future Work
Summary
  – The ULP enables low-overhead oversubscription by avoiding the overhead of the process context switch
  – Oversubscription using ULPs does not require any modification to the application
Future work
  – Embedding the capability of the ULP in MPI runtimes and evaluating it