1
Adaptive Two-level Thread Management for MPI Execution on Multiprogrammed Shared Memory Machines
Kai Shen, Hong Tang, and Tao Yang
Department of Computer Science, University of California, Santa Barbara
2
MPI-Based Parallel Computation on Shared Memory Machines
Shared Memory Machines (SMMs) and SMM clusters have become popular for high-end computing, and MPI is a portable, high-performance parallel programming model.
MPI on SMMs: threads are easier to program, but MPI is still used on SMMs because it offers
- better portability for running on other platforms (e.g., SMM clusters);
- good data locality due to data partitioning.
3
Scheduling for Parallel Jobs in Multiprogrammed SMMs
Gang scheduling:
- good for parallel programs that synchronize frequently;
- hurts resource utilization (processor fragmentation; not enough parallelism to use the allocated resources).
Space/time sharing:
- time sharing combined with dynamic partitioning;
- high throughput;
- popular in current operating systems (e.g., IRIX 6.5).
Impact on MPI program execution:
- not all MPI nodes are scheduled simultaneously;
- the number of available processors for each application may change dynamically.
Optimization is needed for fast MPI execution on SMMs.
4
Techniques Studied
Thread-based MPI execution [PPoPP'99]:
- compile-time transformation for thread-safe MPI execution (sketched below);
- fast context switch and synchronization;
- fast communication through address sharing.
Two-level thread management for multiprogrammed environments:
- even faster context switch/synchronization;
- use of scheduling information to guide synchronization.
Our prototype system: TMPI.
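The compile-time transformation is what makes it safe to run many MPI nodes as threads in one address space: per-process global state is privatized per MPI node. Below is a minimal illustrative sketch, not TMPI's generated code; tmpi_node_id() and TMPI_MAX_NODES are assumed names.

```c
/* Illustrative sketch only: a global variable in the original MPI program is
 * privatized so that each MPI node, now a user-level thread sharing one
 * address space, gets its own copy.  tmpi_node_id() is an assumed helper
 * returning the calling MPI node's rank. */

/* Original code (one MPI node per process):
 *     int iteration_count = 0;
 */

#define TMPI_MAX_NODES 64
extern int tmpi_node_id(void);

static int iteration_count_priv[TMPI_MAX_NODES];   /* one copy per MPI node */
#define iteration_count iteration_count_priv[tmpi_node_id()]
```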
5
Impact of Synchronization on Coarse-grain Parallel Programs
Running a communication-infrequent MPI program (SWEEP3D) on 8 SGI Origin 2000 processors with multiprogramming degree 3, synchronization costs 43%-84% of the total time.
(Chart: execution time breakdown for TMPI and SGI MPI.)
6
Related Work
MPI-related work:
- MPICH, a portable MPI implementation [Gropp/Lusk et al.];
- SGI MPI, highly optimized on SGI platforms;
- MPI-2, multithreading within a single MPI node.
Scheduling and synchronization:
- Process Control [Tucker/Gupta] and Scheduler Activations [Anderson et al.]: focus on OS research;
- Scheduler-conscious Synchronization [Kontothanassis et al.]: focus on primitives such as barriers and locks;
- Hood/Cilk threads [Arora et al.] and loop-level scheduling [Yue/Lilja]: focus on fine-grain parallelism.
7
Outline
- Motivations & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies
8
Context Switch/Synchronization in Multiprogrammed Environments
In multiprogrammed environments, synchronization leads to more context switches, which have a large performance impact.
A conventional MPI implementation maps each MPI node to an OS process. Our earlier work maps each MPI node to a kernel thread.
Two-level thread management maps each MPI node to a user-level thread:
- faster context switch and synchronization among user-level threads;
- very few kernel-level context switches.
9
System Architecture
(Diagram: multiple MPI applications, each running on its own TMPI runtime that manages user-level threads, on top of system-wide resource management.)
- Targeted at multiprogrammed environments
- Two-level thread management
10
Adaptive Two-level Thread Management
System-wide resource manager (OS kernel or user-level central monitor):
- collects information about active MPI applications;
- partitions processors among them.
Application-wide user-level thread management:
- maps each MPI node to a user-level thread;
- schedules user-level threads on a pool of kernel threads;
- keeps the number of active kernel threads close to the number of allocated processors.
Big picture (system-wide): #active kernel threads ≈ #processors, which minimizes kernel-level context switches.
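To make the big picture concrete, here is a minimal sketch, not TMPI's actual code, of how a system-wide monitor might partition P processors among the active MPI applications in proportion to their requested parallelism; app_t and its fields are assumptions for illustration.

```c
/* Minimal sketch (assumed names, not TMPI's code): proportionally partition
 * P processors among the active MPI applications. */
typedef struct {
    int requested;   /* number of MPI nodes the application was started with */
    int allocated;   /* processors granted in the current partition */
} app_t;

void partition_processors(app_t *apps, int napps, int P) {
    int total = 0;
    for (int i = 0; i < napps; i++)
        total += apps[i].requested;
    for (int i = 0; i < napps; i++) {
        apps[i].allocated = total ? (P * apps[i].requested) / total : 0;
        if (apps[i].allocated == 0 && apps[i].requested > 0)
            apps[i].allocated = 1;   /* give every active application some share */
    }
}
```

Each application's user-level runtime then adapts its number of active kernel threads to its allocated share, as described on the next slide.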
11
User-level Thread Scheduling
Every kernel thread can be either active (executing an MPI node, i.e., a user-level thread) or suspended.
Execution invariants for each application:
- #active kernel threads ≈ #allocated processors (minimizes kernel-level context switches);
- #kernel threads = #MPI nodes (avoids dynamic thread creation).
Every active kernel thread polls the system resource manager, which leads to one of:
- deactivation: suspending itself;
- activation: waking up some suspended kernel threads;
- no action.
When to poll?
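The following is a hedged sketch of the decision an active kernel thread could make when it polls; the counters and the suspend_self()/wake_one() helpers are assumed names standing in for TMPI's internal bookkeeping.

```c
#include <stdatomic.h>

/* Sketch of the poll that enforces "#active kernel threads ≈ #allocated
 * processors".  All identifiers below are illustrative assumptions. */
extern _Atomic int active_kthreads;    /* kernel threads currently running MPI nodes */
extern _Atomic int allocated_procs;    /* processors granted by the resource manager */
extern void suspend_self(void);
extern void wake_one(void);

void poll_resource_manager(void) {
    int active    = atomic_load(&active_kthreads);
    int allocated = atomic_load(&allocated_procs);

    if (active > allocated)
        suspend_self();    /* deactivation: too many runners for our share */
    else if (active < allocated)
        wake_one();        /* activation: a processor is idle, resume a parked thread */
    /* otherwise: no action */
}
```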
12
Polling in User-Level Context Switch
A context switch is the result of synchronization (e.g., an MPI node waits for a message).
The underlying kernel thread polls the system resource manager during the context switch:
- two stack switches if it deactivates (it suspends on a dummy stack);
- one stack switch otherwise.
After optimization, a context switch takes about 2 μs on average on the SGI Power Challenge.
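A sketch of this context-switch path, with the poll folded in, is shown below using POSIX ucontext purely for illustration (TMPI uses its own lightweight switch); dummy_ctx, next_ready(), and should_deactivate() are assumed names, not TMPI's API.

```c
#include <ucontext.h>

extern ucontext_t dummy_ctx;          /* assumed context on a small dummy stack that
                                         suspends the kernel thread once switched to */
extern ucontext_t *next_ready(void);  /* assumed: next runnable MPI node's context */
extern int should_deactivate(void);   /* assumed: result of polling the resource manager */

void switch_on_wait(ucontext_t *me) {
    if (should_deactivate()) {
        /* Two stack switches: leave this MPI node's stack for the dummy stack,
         * suspend the kernel thread there, and switch again when resumed. */
        swapcontext(me, &dummy_ctx);
    } else {
        /* One stack switch: jump straight to the next ready MPI node. */
        swapcontext(me, next_ready());
    }
}
```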
13
Outline
- Motivations & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies
14
Event Waiting Synchronization
All MPI synchronization is based on waitEvent.
The waiter calls waitEvent(*pflag == value) and waits by spinning or by yielding/blocking.
The caller sets *pflag = value and wakes up the waiter.
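A hedged sketch of a simple spin-then-block waitEvent in the spirit of this slide is shown below; SPIN_LIMIT, block_on(), and wakeup() are illustrative names, not TMPI's actual primitives.

```c
#define SPIN_LIMIT 1000

extern void block_on(volatile int *pflag, int value);  /* assumed: yield/block until woken */
extern void wakeup(volatile int *pflag);               /* assumed: wake any blocked waiter */

/* Waiter side. */
void waitEvent(volatile int *pflag, int value) {
    for (long i = 0; i < SPIN_LIMIT; i++)
        if (*pflag == value)          /* condition satisfied while spinning */
            return;
    while (*pflag != value)
        block_on(pflag, value);       /* give up the processor until the wakeup */
}

/* Caller side: publish the value, then wake the waiter. */
void signalEvent(volatile int *pflag, int value) {
    *pflag = value;
    wakeup(pflag);
}
```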
15
Tradeoff between spin and block
Basic rules for waiting with spin-then-block:
- spinning wastes CPU cycles;
- blocking introduces context-switch overhead, and always blocking is not good for dedicated environments.
Previous work focuses on choosing the best spin time. Our optimization focus and findings:
- fast context switch has a substantial performance impact;
- scheduling information can guide the spin/block decision:
  - spinning is futile when the caller is not currently scheduled;
  - most of the blocking cost comes from the cache-flushing penalty (the actual cost varies, up to several ms).
16
Scheduler-conscious Event Waiting
The user-level scheduler provides:
- scheduling information;
- affinity information.
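The sketch below shows how that information could drive the spin/block decision; is_scheduled(), has_affinity_here(), spin_for(), and block_on() are assumed helpers standing in for what the user-level scheduler exports, not TMPI's actual interface.

```c
extern int  is_scheduled(int node);                    /* assumed: is this MPI node running now?        */
extern int  has_affinity_here(void);                   /* assumed: did the waiter run here recently?    */
extern int  spin_for(volatile int *p, int v, long n);  /* assumed: spin up to n iterations, 1 if met    */
extern void block_on(volatile int *p, int v);          /* assumed: yield/block until the caller's wakeup */

void waitEvent_sc(volatile int *pflag, int value, int producer) {
    while (*pflag != value) {
        if (!is_scheduled(producer)) {
            /* The producer is not scheduled: spinning is futile, block right away. */
            block_on(pflag, value);
        } else {
            /* With a warm cache, blocking would mostly pay the cache-flush
             * penalty, so spin longer before giving up the processor. */
            long budget = has_affinity_here() ? 10000 : 1000;
            if (!spin_for(pflag, value, budget))
                block_on(pflag, value);
        }
    }
}
```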
17
Experimental Settings
Machines:
- SGI Origin 2000 with MIPS R10000 processors and 2 GB of memory;
- SGI Power Challenge with 4 200 MHz MIPS R4400 processors and 256 MB of memory.
Compared systems:
- TMPI-2: TMPI with two-level thread management;
- SGI MPI: SGI's native MPI implementation;
- TMPI: the original TMPI without two-level thread management.
18
Testing Benchmarks
Synchronization frequency is obtained by running each benchmark with 4 MPI nodes on the 4-processor Power Challenge.
The higher the multiprogramming degree, the more spin-blocks (context switches) occur during each synchronization.
The sparse LU benchmarks have much more frequent synchronization than the others.
19
Performance Evaluation on a Multiprogrammed Workload
The workload contains a sequence of six jobs launched at a fixed interval. We compare job turnaround times on the Power Challenge.
20
Workload with Certain Multiprogramming Degrees
Goal: identify the performance impact of the multiprogramming degree.
Experimental setting:
- each workload consists of one benchmark program;
- run n MPI nodes on p processors (n ≥ p), so the multiprogramming degree is n/p;
- compare megaflop rates or speedups of the kernel part of each application.
21
Performance Impact of Multiprogramming Degree (SGI Power Challenge)
22
Performance Impact of Multiprogramming Degree (SGI Origin 2000)
(Charts: performance ratios of TMPI-2 over TMPI, and of TMPI-2 over SGI MPI.)
23
Benefits of Scheduler-conscious Event Waiting
(Charts: improvement over simple spin-block on the Power Challenge and on the Origin 2000.)
24
Conclusions
Contributions for optimizing MPI execution:
- adaptive two-level thread management;
- scheduler-conscious event waiting;
- great performance improvement: up to an order of magnitude, depending on applications and load;
- in multiprogrammed environments, fast context switch/synchronization is important even for communication-infrequent MPI programs.
Current and future work: support threaded MPI on SMP clusters.