Optimizing MPI Implementation on Massively Parallel Many-Core Architectures
Min Si and Yutaka Ishikawa (advisor), University of Tokyo; Pavan Balaji (advisor), Argonne National Laboratory

My doctoral research focuses on exploiting the capabilities of massively threaded many-core architectures, such as the Intel® Xeon Phi co-processor and Blue Gene/Q, in widely used MPI implementations. I investigate the characteristics of MPI and develop design strategies that allow MPI implementations to perform efficiently for different kinds of user programming models. The research plan is broadly divided into three related pieces, one for each thread-safety mode.

Acknowledgments. This work has been partially supported by the CREST project of the Japan Science and Technology Agency (JST) and by the National Project of MEXT, "Feasibility Study on Advanced and Efficient Latency Core Architecture." It has also been supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357.

MPI thread-safety modes and their problems on many-core systems:
- MPI_THREAD_FUNNELED: multiple user threads; only the master thread is allowed to make MPI calls.
- MPI_THREAD_SERIALIZED: multiple user threads; only a single thread at a time is allowed to make MPI calls. Problem (funneled and serialized): all other threads are idle during MPI calls.
- MPI_THREAD_MULTIPLE: multiple user threads; all threads are allowed to make MPI calls and share MPI resources. Problem: overhead of keeping internal state consistent (locks, memory barriers, etc.) across hundreds of threads.
- MPI_THREAD_SINGLE: one MPI process per core, no threads. Problem: the core is idle whenever the MPI process is in a blocking wait during MPI calls.

Research I. Internal Multithreading in MPI (funneled / serialized thread safety: MPI internal parallelism)

Problem: idle resources during MPI calls. In both the funneled and the serialized pattern, every thread except the one making the MPI call sits idle while the MPI process communicates.

Funneled mode:
    #pragma omp parallel
    {
        /* User Computation */
    }
    MPI_Function();

Serialized mode:
    #pragma omp parallel
    {
        /* User Computation */
        #pragma omp single
        MPI_Function();
    }

[Figure: MPI process timelines showing the computation phase (COMP.) followed by single-threaded MPI communication (MPI COMM.).]

Solution: sharing the idle threads with the application inside MPI.
    #pragma omp parallel
    {
        /* User Computation */
    }
    MPI_Function()
    {
        #pragma omp parallel
        {
            /* Internal Processing */
        }
    }

Challenges:
- The number of idle threads required by some parallel algorithms is unknown.
- Nested parallelism creates new Pthreads and offloads thread scheduling to the OS, which causes a thread-overrunning issue.

OpenMP runtime extensions:
- Forcing waiting threads to enter a passive wait immediately (by default, threads waiting at the internal barrier spin-loop until a timeout, so they are not immediately available for reuse):
    #pragma omp parallel
    #pragma omp single
    {
        set_fast_yield(1);
        #pragma omp parallel
        { ... }
    }
- Getting the number of idle threads, so a nested parallel region can be sized to match:
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp parallel num_threads( get_num_idle_threads() )
        { ... }
    }
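As a concrete illustration of how MPI could use these extensions internally, here is a minimal sketch. It assumes the modified OpenMP runtime described above provides get_num_idle_threads(); the declared prototype and the helper name mtmpi_pack_vector are illustrative assumptions, not the actual implementation. The routine packs a strided, non-contiguous buffer inside an MPI call using only the threads that are currently idle in the application's parallel region.

    #include <stddef.h>

    /* Hypothetical prototype for the OpenMP runtime extension described
     * above; the real name and signature may differ. */
    int get_num_idle_threads(void);

    /* Sketch of an MPI-internal pack routine: copy `count` non-contiguous
     * blocks of `blocklen` doubles, separated by `stride` doubles, into a
     * contiguous buffer, using threads that are idle in the user's
     * enclosing parallel region. */
    static void mtmpi_pack_vector(const double *src, double *dst,
                                  long count, long blocklen, long stride)
    {
        int nthreads = get_num_idle_threads();  /* how many helpers are free */
        if (nthreads < 1)
            nthreads = 1;                       /* fall back to a sequential pack */

        /* Nested parallel region sized to the idle threads only. */
        #pragma omp parallel for num_threads(nthreads)
        for (long i = 0; i < count; i++) {
            for (long j = 0; j < blocklen; j++)
                dst[i * blocklen + j] = src[i * stride + j];
        }
    }

Sizing the nested region with the reported idle-thread count addresses the thread-overrunning challenge noted above, because the MPI call never requests more threads than the application has left idle.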
Research II. Fine-Grained Consistency Management in MPI (multithreaded thread safety)

Problem: with MPI_THREAD_MULTIPLE, heavy synchronization and memory barriers are required among hundreds of threads.

Runtime dynamic thread safety reduces the scope of the multithreaded region: only a small collection of threads shares each MPI communicator, so the critical region can shrink from all threads in the process to just that collection.
[Figure: process P0 with threads TH0-TH3 split between COMM 0 (rank 0) and COMM 1 (rank 2), comparing the original critical region with the ideal, smaller critical region.]
[Figure: processes P0-P3 overlapping computation (COMP.) and MPI communication (MPI COMM.) across their threads.]

Research III. Fiber-Level MPI Processes (single thread safety)

Problem: with one single-threaded MPI process per core, the core is idle while the process is in a blocking wait. How can the core be kept busy at all times?
- Run multiple fiber-level MPI processes per core; if one blocks (e.g., in MPI_Wait), it yields to another fiber.
- Schedule these fibers at a fine granularity within the OS process.
[Figure: one OS process hosting fiber-level MPI processes P0-P2; P0 yields inside MPI_Wait so P1 and P2 can continue computing and communicating.]
* This research is based on FG-MPI. We plan to improve various issues in the current design, such as task load balancing and the critical section for shared resources.

Case studies of MPI internal parallelism:

Derived data type packing
- MPI_Pack / MPI_Unpack and communication with derived data types transfer non-contiguous data and pack/unpack it internally.
- Parallelism: pack/unpack the non-contiguous elements in parallel (see the halo-exchange sketch after the case studies below).
[Figure: speedup vs. number of threads for local packing of a 2D matrix of double elements in a 2D halo exchange.]

Intra-node large message communication
- Original design: a sequential algorithm that pipelines data movement between sender and receiver through cells in a shared user-space buffer.
- Parallelism: reserve many available cells as one large cell and move that large cell in parallel.
- Speedup is close to linear!
[Figure: netmod shared buffer with Cell[0]-Cell[3] between the sender's and receiver's user buffers; several free cells are combined into one large parallel cell.]
[Figure: bandwidth improvement vs. number of threads, OSU point-to-point benchmark.]

Parallel InfiniBand communication
- One-sided applications using the parallel IB netmod issue a large number of small non-blocking RMA operations and must wait for all operations to complete before returning from the second synchronization call.
- Parallelism: parallelize messages sent to different targets.
[Figure: MPICH stack (ADI3, CH3, SHM/nemesis, TCP and IB netmods) over the HCA with IB context, queue pairs (QP), and completion queue (CQ); IB communication parallelism experiments are based on ib_write_bw.]
[Figure: bandwidth improvement vs. number of threads for ib_write_bw; harmonic mean TEPS and mean time (sec) vs. number of threads for the Graph500 benchmark (MPI one-sided version) using 9 MPI processes.]
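As a usage reference for the derived-datatype packing case study above, the following minimal sketch shows a 2D halo exchange in which a non-contiguous matrix column is described with MPI_Type_vector; the matrix size, neighbor ranks, and tags are illustrative assumptions. MPI must pack and unpack this column internally, which is exactly the step the parallel packing work accelerates.

    #include <mpi.h>

    /* Exchange one column halo of an N x N row-major matrix of doubles
     * with the left/right neighbors. The column is non-contiguous, so it
     * is sent as a derived datatype and packed inside MPI. At domain
     * boundaries, left/right may be MPI_PROC_NULL. */
    void exchange_column_halo(double *m, int N, int left, int right, MPI_Comm comm)
    {
        MPI_Datatype column;
        /* N blocks of 1 double, separated by a stride of N doubles */
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        /* Send the rightmost interior column, receive into the left ghost
         * column, and symmetrically for the other direction. */
        MPI_Sendrecv(&m[N - 2], 1, column, right, 0,
                     &m[0],     1, column, left,  0, comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&m[1],     1, column, left,  1,
                     &m[N - 1], 1, column, right, 1, comm, MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
    }

The internal pack loop over the N strided blocks of such a column has no cross-iteration dependencies, which is what makes it a natural target for the internal parallelization described in Research I.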