1
GACOP JACCA Meeting - February 27, 2004
PAL
A New Approach in the System Software Design for Large-Scale Parallel Computers
Juan Fernández 1,2, Eitan Frachtenberg 1, Fabrizio Petrini 1, Salvador Coll 1 and José C. Sancho 1
1 Performance and Architecture Lab, CCS-3 Modeling, Algorithms and Informatics, Los Alamos National Laboratory, NM 87545, USA. URL: http://www.c3.lanl.gov
2 Grupo de Arquitectura y Computación Paralela (GACOP), Dpto. Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, SPAIN. URL: http://www.ditec.um.es
email: {juanf,eitanf,fabrizio,scoll,jcsancho}@lanl.gov
2
Motivation
System software is a key factor in maximizing usability, performance and scalability on large-scale systems!!!
Hardware and OSs are glued together by system software:
–Resource Management
–Communications
–Parallel Development and Debugging Tools
–Parallel File System
–Fault Tolerance
3
Motivation
System software complexity due to multiple factors:
–Extremely complex global state
–Non-deterministic behavior inherent to computing systems and parallel apps
–Local OSs lack global awareness of parallel apps
–Independent design of different components
–User-level applications rely on system software
4
Outline
–Motivation
–Goals
–Core Primitives
–Resource Management
–Communication Libraries
–Ongoing and future work
5
Goals
Target
–Simplifying the design and implementation of system software for large-scale parallel computers
–Simplicity, performance, scalability, determinism
Approach
–Built atop a basic set of three primitives
–Global synchronization/scheduling
Vision
–SIMD system running MIMD applications (variable granularity on the order of hundreds of μs)
7
Core Primitives
System software built atop three primitives:
Xfer-And-Signal
–Transfer block of data to a set of nodes
–Optionally signal local/remote event upon completion
Compare-And-Write
–Compare global variable on a set of nodes
–Optionally write global variable on the same set of nodes
Test-Event
–Poll local event
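The semantics of the three primitives can be illustrated with a small in-process model. This is only a sketch under stated assumptions: the `Node` class, the function names and signatures, and the event names are hypothetical and are not the interface used by the authors. It also assumes that Compare-And-Write tests for equality and performs its optional write only when the comparison holds on every node; the slide does not specify either point. `poll_event` stands in for Test-Event.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of one node: global variables, memory, local events."""
    variables: dict = field(default_factory=dict)
    memory: dict = field(default_factory=dict)
    events: set = field(default_factory=set)


def xfer_and_signal(source, targets, key, data, remote_event=None, local_event=None):
    """Transfer `data` to every target, then optionally signal remote/local events."""
    for node in targets:
        node.memory[key] = data
        if remote_event is not None:
            node.events.add(remote_event)
    if local_event is not None:
        source.events.add(local_event)


def compare_and_write(targets, var, expected, write=None):
    """Compare `var` on all targets; if all match, optionally write (name, value)."""
    matched = all(node.variables.get(var) == expected for node in targets)
    if matched and write is not None:
        name, value = write
        for node in targets:
            node.variables[name] = value
    return matched


def poll_event(node, event):
    """Model of Test-Event: check a local event without blocking."""
    return event in node.events


# Node 0 sends a block to nodes 1..3 and signals an event on both sides.
nodes = [Node() for _ in range(4)]
xfer_and_signal(nodes[0], nodes[1:], "binary", b"job image",
                remote_event="received", local_event="sent")

# Each receiver marks itself ready; a single global query checks all of them.
for node in nodes[1:]:
    node.variables["ready"] = 1
all_ready = compare_and_write(nodes[1:], "ready", 1, write=("go", 1))
```

In this model the sender never inspects the receivers one by one: one transfer reaches the whole node set and one query covers it, which is the property the rest of the talk builds on.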
8
Core Primitives
Characteristic | Requirement | Solution
Job Launching | Data dissemination; Flow Control; Termination Detection | Xfer-And-Signal; Compare-And-Write
Job Scheduling | Heartbeat; Context switch responsiveness | Xfer-And-Signal; Prioritized messages / Multiple Rails
Communication | PUT; GET; Barrier; Broadcast; Reduce | Xfer-And-Signal; Compare-And-Write; Compare-And-Write+Xfer-And-Signal; Xfer-And-Signal / “Smart” NIC
The proposed mechanisms simplify design and implementation!!!
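As an illustration of how a collective operation might reduce to a global query, the sketch below expresses a barrier with Compare-And-Write alone. The table does not spell out which primitive serves which requirement row by row, so this mapping is an assumption, as are the variable names `arrived` and `released`, the dict-based node model, and the polling master.

```python
def compare_and_write(nodes, var, expected, write=None):
    """True if `var == expected` on every node; on success optionally write (name, value)."""
    matched = all(node.get(var) == expected for node in nodes)
    if matched and write is not None:
        name, value = write
        for node in nodes:
            node[name] = value
    return matched


def run_barrier(nodes, arrival_order, epoch):
    """Return how many global queries the master issues before the barrier opens."""
    queries = 0
    # Query once before any arrival, then once after each arrival.
    for arriving in [None, *arrival_order]:
        if arriving is not None:
            nodes[arriving]["arrived"] = epoch  # this node reaches the barrier
        queries += 1
        if compare_and_write(nodes, "arrived", epoch, write=("released", epoch)):
            return queries
    return None  # some node never arrived, so the barrier stays closed


nodes = [{} for _ in range(4)]
queries = run_barrier(nodes, arrival_order=[2, 0, 3, 1], epoch=1)
```

Each query costs the same regardless of how many nodes have arrived, which is why hardware support for global queries matters for scalability.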
9
Core Primitives
Implementation
–Global, virtually addressable shared memory
–Remote Direct Memory Access (RDMA)
–Hardware-supported multicast
–Hardware-supported global query
–Computing capability in the NIC
Portability
–Infiniband, BlueGene/L, QsNET
11
Resource Management
STORM: Scalable TOol for Resource Management [1,2]
Job launching
–binary and data dissemination
–actual launching of a parallel job
–reporting of job termination
Job scheduling
–FCFS, gang scheduling, ... [3]
–new scheduling algorithms can be “plugged”
Heartbeat/strobe at regular intervals (time slices)
Monitoring
Built atop the three core primitives
[1] “Scalable Resource Management in High Performance Computers.” E. Frachtenberg, J. Fernández, F. Petrini, and S. Coll. Cluster'02.
[2] “STORM: Lightning-Fast Resource Management.” E. Frachtenberg, J. Fernández, F. Petrini, S. Pakin and S. Coll. SC'02.
[3] “Flexible CoScheduling: Mitigating Load Imbalance and Improving Utilization of Heterogeneous Resources.” E. Frachtenberg, D. G. Feitelson, F. Petrini and J. Fernández. IPDPS'03.
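The idea of pluggable scheduling driven by a periodic strobe can be sketched as a loop that asks a policy function which job every node should run in the next time slice. This is a toy model, not STORM's code: the function names, the policy signature, and the job lengths (measured in time slices) are all assumptions made for illustration.

```python
def fcfs(queue, slice_index):
    """First-come first-served: run the head of the queue until it finishes."""
    return queue[0]


def round_robin_gang(queue, slice_index):
    """Gang scheduling: all nodes switch together to the next job on each strobe."""
    return queue[slice_index % len(queue)]


def run_slices(policy, jobs, max_slices):
    """Return the job chosen for each time slice.

    `jobs` maps a job name to the number of time slices it still needs.
    """
    remaining = dict(jobs)
    queue = list(jobs)  # arrival order
    timeline = []
    for slice_index in range(max_slices):
        if not queue:
            break
        job = policy(queue, slice_index)  # decision taken at the strobe
        timeline.append(job)
        remaining[job] -= 1
        if remaining[job] == 0:
            queue.remove(job)  # job termination is reported
    return timeline


jobs = {"A": 2, "B": 1}
fcfs_timeline = run_slices(fcfs, jobs, max_slices=10)
gang_timeline = run_slices(round_robin_gang, jobs, max_slices=10)
```

Swapping the policy argument is the only change needed to move from one algorithm to the other, which mirrors the "plugged" scheduling algorithms mentioned above.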
13
Communication Libraries
BCS-MPI: Buffered Coscheduled MPI [4]
Global synchronization [5]
–Heartbeat/strobe sent at regular intervals (time slices)
–All system activities are tightly coupled
Global scheduling
–Exchange of communication requirements
–Communication scheduling
–Perform real transmission and reduce computations [6]
Implementation on the NIC (Elan3 - QsNet)
Built atop the three core primitives
[4] “BCS-MPI: A New Approach in the System Software Design for Large-Scale Parallel Computers.” J. Fernández, E. Frachtenberg, and F. Petrini. SC'03.
[5] “Scalable Collective Communication on the ASCI Q Machine.” J. Fernández, E. Frachtenberg, and F. Petrini. HOTi'11.
[6] “Scalable NIC-based Reduction on Large-scale Clusters.” A. Moody, J. Fernández, F. Petrini and D. K. Panda. SC'03.
14
Communication Libraries
BCS-MPI: real-time communication scheduling
[Figure: timeline of one time slice (hundreds of μs), bounded by a global synchronization at each end: global strobe (time slice starts), exchange of communication requirements, communication scheduling, real transmission, global strobe (time slice ends)]
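The three steps inside one time slice can be sketched as a function that takes the send and receive descriptors posted so far, matches them in a fixed order, and transmits only the matched messages. This is a simplified model under stated assumptions: the `Send` and `Recv` tuples, the sort-based ordering, and the exact-match rule on (source, destination, tag) are hypothetical and are not taken from the BCS-MPI implementation.

```python
from collections import namedtuple

Send = namedtuple("Send", "src dst tag payload")
Recv = namedtuple("Recv", "dst src tag")


def time_slice(sends, recvs):
    """Run one slice; return (delivered, pending_sends, pending_recvs)."""
    # Step 1: exchange of communication requirements. Both lists are already
    # global here, so the exchange reduces to fixing a reproducible order.
    sends = sorted(sends, key=lambda s: (s.src, s.dst, s.tag))
    pending_recvs = sorted(recvs)

    # Step 2: communication scheduling. Pair each send with a posted receive.
    scheduled, pending_sends = [], []
    for send in sends:
        wanted = Recv(send.dst, send.src, send.tag)
        if wanted in pending_recvs:
            pending_recvs.remove(wanted)
            scheduled.append(send)
        else:
            pending_sends.append(send)  # retried in a later slice

    # Step 3: real transmission, restricted to the scheduled messages.
    delivered = {}
    for send in scheduled:
        delivered.setdefault(send.dst, []).append(send.payload)
    return delivered, pending_sends, pending_recvs


# Slice 1: two sends are posted but only one matching receive exists.
first, waiting_sends, waiting_recvs = time_slice(
    [Send(0, 1, 7, "a"), Send(2, 1, 7, "b")], [Recv(1, 0, 7)])

# Slice 2: the missing receive is posted and the leftover send goes through.
second, _, _ = time_slice(waiting_sends, waiting_recvs + [Recv(1, 2, 7)])
```

Because unmatched descriptors simply wait for the next strobe, the set of messages in flight at any time is known globally at each slice boundary.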
15
Ongoing and future work
Improved system utilization
–Scheduling multiple jobs
QoS for different types of traffic
–Scheduling messages may provide traffic segregation
Transparent fault tolerance [7]
–BCS-MPI simplifies the state of the machine
Kernel-level implementation of BCS-MPI
–User-level solution is already working
Deterministic replay of MPI programs
–Ordered resource scheduling may enforce reproducibility
[7] “On the Feasibility of Incremental Checkpointing for Scientific Computing.” J. C. Sancho, F. Petrini, G. Johnson, J. Fernández and E. Frachtenberg. IPDPS'04.
16
A New Approach in the System Software Design for Large-Scale Parallel Computers
Juan Fernández 1,2, Eitan Frachtenberg 1, Fabrizio Petrini 1
1 Performance and Architecture Lab, CCS-3 Modeling, Algorithms and Informatics, Los Alamos National Laboratory, NM 87545, USA. URL: http://www.c3.lanl.gov
2 Grupo de Arquitectura y Computación Paralela (GACOP), Dpto. Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, SPAIN. URL: http://www.ditec.um.es
email: {juanf,eitanf,fabrizio}@lanl.gov
17
Motivation
Characteristic | Workstation | Cluster
Job Launching | Operating System | Scripts/Middleware on top of the OS
Job Scheduling | Timeshared by OS | Batch queued or gang scheduled (with large quanta) using middleware
Communication | IPC/Shared Memory | Message Passing Library (e.g. MPI) / Data-Parallel Programming (e.g. HPF)
Fault Tolerance | Little or none | Application/application-assisted checkpointing
Storage | Standard file system | Custom parallel file system
Debuggability | Standard tools: Reproducibility!!! | Parallel debugging tools: Non-determinism!!!
Growing gap between workstation and cluster usability!!!
18
Motivation
System software complexity due to multiple factors:
Extremely complex global state
–Thousands of processes, threads, open files, pending messages, etc.
Non-deterministic behavior
–Inherent to computing systems: OS process scheduling
–Induced by parallel applications: MPI_ANY_SOURCE
Local OSs lack global awareness of parallel applications
–Interference with fine-grain synchronization operations: non-scalable collective communication primitives
Independent design of different components
–Redundancy of functionality: communication protocols
–Missing functionality: QoS for user-level traffic / system-level traffic
User-level applications rely on system software
–System software performance/scalability impacts user-application performance/scalability
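A small example shows why a wildcard receive such as MPI_ANY_SOURCE can make a run hard to reproduce. The sketch below does not use MPI at all: it models a receiver that accumulates floating-point contributions in whatever order they arrive, and compares that with processing the same contributions in a fixed rank order. The function name, the ranks, and the values are chosen purely for illustration.

```python
def reduce_in_order(contributions):
    """Sum (rank, value) pairs left to right, as a wildcard receive loop would."""
    total = 0.0
    for _rank, value in contributions:
        total += value
    return total


# The same three messages, arriving in two different orders on two runs.
run_a = [(0, 1e16), (1, 1.0), (2, -1e16)]
run_b = [(0, 1e16), (2, -1e16), (1, 1.0)]

# Arrival order leaks into the result because float addition is not associative.
result_a = reduce_in_order(run_a)
result_b = reduce_in_order(run_b)

# Imposing a fixed order (here: by sender rank) makes both runs agree.
ordered_a = reduce_in_order(sorted(run_a))
ordered_b = reduce_in_order(sorted(run_b))
```

Here `result_a` and `result_b` differ even though both runs received identical data, while the two rank-ordered sums are equal.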
19
Resource Management
Job Launching
STORM is 40 times faster than the best reported result!!!
20
Resource Management
Job Scheduling
STORM is able to use very small time slices: RESPONSIVENESS!!!
21
Communication Libraries
Non-blocking primitives: MPI_Isend/Irecv
22
Communication Libraries
Blocking primitives: MPI_Send/Recv
23
Communication Libraries
Global Synchronization Protocol
Global Message Scheduling Phase
–Microphases: Descriptor Exchange + Message Scheduling
Message Transmission Phase
–Microphases: Point-to-point, Barrier and Broadcast, Reduce
24
Communication Libraries
SAGE - timing.input (IA32)
0.5% SPEEDUP!!!