Eitan Frachtenberg MIT, 20-Sep PAL Designing Parallel Operating Systems using Modern Interconnects CCS-3
Designing Parallel Operating Systems using Modern Interconnects
Eitan Frachtenberg
With Fabrizio Petrini, Juan Fernandez, Dror Feitelson, Jose-Carlos Sancho, Kei Davis
Computer and Computational Sciences Division
Los Alamos National Laboratory
Ideas that change the world

Cluster Supercomputers
- Growing in prevalence and performance: 7 of the top 10 supercomputers
- Running parallel applications
- Advanced, high-end interconnects

Distributed vs. Parallel
Distributed and parallel applications (including operating systems) may be distinguished by their use of global and collective operations:
- Distributed: local information, relatively small number of point-to-point messages
- Parallel: global synchronization (barriers, reductions, exchanges)

System Software Components
Job Scheduling, Fault Tolerance, Parallel I/O, Communication Library, Resource Management

Problems with System Software
Independent single-node OSs (e.g., Linux) connected by distributed dæmons:
- Redundant components
- Performance hits
- Scalability issues
- Load balancing issues

OS’s Collective Operations
Many OS tasks are inherently global or collective operations:
- Job launching, data dissemination
- Context switching
- Job termination (normal and forced)
- Load balancing

[Diagram: Node 1 and Node 2 each show a Local Operating System with Resource Management, Parallel I/O, Fault Tolerance, Job Scheduling, and User-Level Communication; a Global Parallel Operating System shows Job Scheduling, Fault Tolerance, Communication, Parallel I/O, and Resource Mgmt]

The Vision
- Modern interconnects are very powerful:
  - Collective operations
  - Programmable NICs
  - On-board RAM
- Use a small set of network mechanisms as parallel OS infrastructure
- Build upon this infrastructure to create unified system software
- System software inherits scalability and performance from network features

Example: ASCI Q Barrier [HotI’03]

Parallel OS Primitives
- System software built atop three primitives:
  - Xfer-And-Signal
    - Transfer a block of data to a set of nodes
    - Optionally signal a local/remote event upon completion
  - Compare-And-Write
    - Compare a global variable on a set of nodes
    - Optionally write a global variable on the same set of nodes
  - Test-Event
    - Poll a local event

Core Primitives on QsNet
- System software built atop three primitives
  - Xfer-And-Signal (QsNet):
    - Node S transfers a block of data to nodes D1, D2, D3, and D4
    - Events are triggered at the source and the destinations
[Diagram: node S and nodes D1, D2, D3, D4; Source Event; Destination Events]

Core Primitives (cont.)
- System software built atop three primitives
  - Compare-And-Write (QsNet):
    - Node S compares variable V on nodes D1, D2, D3, and D4
[Diagram: node S and nodes D1, D2, D3, D4; query "Is V { , , >} to Value?"]

Core Primitives (cont.)
- System software built atop three primitives
  - Compare-And-Write (QsNet):
    - Node S compares variable V on nodes D1, D2, D3, and D4
    - Partial results are combined in the switches
[Diagram: node S and nodes D1, D2, D3, D4]

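The remark that partial results are combined in the switches can be illustrated with a small tree reduction. The sketch below is only a model of the idea, not the QsNet hardware logic: the function names are hypothetical and the switch radix of 4 is an assumed parameter. Each node evaluates the comparison locally, and each switch stage ANDs up to `radix` partial answers into one.

```python
import operator


def combine_in_switches(leaf_results, radix=4):
    """AND-reduce per-node results stage by stage, `radix` inputs per switch.

    Returns (global_result, stages). Requires at least one leaf result.
    """
    level = list(leaf_results)
    stages = 0
    while len(level) > 1:
        # One switch stage: every group of `radix` answers becomes one answer.
        level = [all(level[i:i + radix]) for i in range(0, len(level), radix)]
        stages += 1
    return level[0], stages


def compare_on_nodes(values, op, target, radix=4):
    """Each node compares its local copy of V; the switches combine the answers."""
    return combine_in_switches([op(v, target) for v in values], radix)
```

In this model the number of stages grows with the logarithm of the node count (16 nodes need 2 stages at radix 4, 64 nodes need 3), which is the property that lets a global query scale.
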
System Software Components
Job Scheduling, Fault Tolerance, Parallel I/O, Communication Library, Resource Management

Scalable Tool for Resource Management
- Inherits scalability from network primitives:
  - Data dissemination and coordination
  - Interactive job launching speeds
  - Context switching at the millisecond level
- Described in [SC’02]

State of the Art in Resource Management
Resource managers (e.g., PBS, LSF, RMS, LoadLeveler, Maui) are typically implemented using:
- TCP/IP, which favors portability over performance
- Poorly scaling algorithms for the distribution/collection of data and control messages
- Design choices that favor development time over performance
Scalable performance is not important for small clusters but is crucial for large ones. There is a need for fast and scalable resource management.

Experimental Setup
- 64-node/256-processor AlphaServer ES40 cluster
- 2 independent network rails of Quadrics Elan3
- Files are placed in a ramdisk to avoid I/O bottlenecks and expose the performance of the resource management algorithms

Launch Times (Unloaded System)
The launch time stays constant as the number of processors increases: STORM is highly scalable.

Launch Times (Loaded System, 12 MB)
Worst case: 1.5 seconds to launch a 12 MB file on 256 processors.

Measured and Estimated Launch Times
The model shows that on an ES40-based AlphaServer, a 12 MB binary can be launched in 135 ms on 16,384 nodes.

Comparative Evaluation (Measured & Modeled)

System Software Components
Job Scheduling, Fault Tolerance, Parallel I/O, Communication Library, Resource Management

Job Scheduling
- Controls the allocation of space and time resources to jobs
- HPC applications have special requirements:
  - Multiple processing and network resources
  - Synchronization (< 1 ms granularity)
  - Potential memory hogs with little locality
- Has a significant effect on throughput, responsiveness, and utilization

First-Come-First-Serve (FCFS)

Gang Scheduling (GS)

Implicit CoScheduling

Hybrid Methods
- Combine global synchronization and local information
- Rely on scalable primitives for global coordination and information exchange
- First implementation of two novel algorithms:
  - Flexible CoScheduling (FCS)
  - Buffered CoScheduling (BCS)

Flexible CoScheduling (FCS)
- Measure communication characteristics, such as granularity and wait times
- Classify processes based on synchronization requirements
- Schedule processes based on class
- Described in [IPDPS’03]

FCS Classification
[Diagram: processes placed along two axes, Granularity (Fine to Coarse) and Block times (Short to Long)]
- CS: always gang-scheduled
- F: preferably gang-scheduled
- DC: locally scheduled

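The measure, classify, schedule loop of FCS can be sketched as a small decision function. The slide's diagram does not survive extraction, so the mapping used here is an assumption: coarse-grained processes are treated as DC, fine-grained processes with short block times as CS, and fine-grained processes with long block times as F. The function name and both thresholds are hypothetical illustration values, not figures from the talk.

```python
# Scheduling policy per class, as listed on the slide.
POLICY = {
    "CS": "always gang-scheduled",
    "F": "preferably gang-scheduled",
    "DC": "locally scheduled",
}


def classify_process(granularity_ms, block_time_ms,
                     fine_threshold_ms=10.0, short_block_ms=1.0):
    """Classify a process from its measured communication behavior.

    granularity_ms: average compute time between communication events.
    block_time_ms:  average time spent blocked per communication event.
    Both thresholds are made-up defaults for illustration.
    """
    if granularity_ms > fine_threshold_ms:
        return "DC"   # coarse-grained: rarely synchronizes
    if block_time_ms <= short_block_ms:
        return "CS"   # fine-grained and well served by coscheduling
    return "F"        # fine-grained but still waits a long time
```

For example, `POLICY[classify_process(3.0, 0.2)]` yields "always gang-scheduled" under these assumed thresholds.
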
Methodology
- Synthetic, controllable MPI programs
- Workload:
  - Static: all jobs start together
  - Dynamic: different sizes, arrival times, and run times
- Various schedulers implemented:
  - FCFS, GS, FCS, SB (ICS), BCS
- Emulation vs. simulation:
  - An actual implementation takes into account all the overhead and factors of a real system

Hardware Environment
- Environment ported to three architectures and clusters:
  - Crescendo: 32x2 Pentium III, 1 GB
  - Accelerando: 32x2 Itanium II, 2 GB
  - Wolverine: 64x4 Alpha ES40, 8 GB

Synthetic Application
- Bulk synchronous, 3 ms basic granularity
- Can control: granularity, variability, and communication pattern

Synthetic Scenarios
- Balanced
- Complementing
- Imbalanced
- Mixed

Turnaround Time

Dynamic Workloads [JSSPP’03]
- Static workloads are simple and offer insights, but are not realistic
- Most real-life workloads are more complex
- Users submit jobs dynamically, with varying time and space requirements

Dynamic Workload Methodology
- Emulation using a workload model [Lublin03]
- 1000 jobs, approx. 12 days, shrunk to 2 hours
- Load varied by factoring arrival times
- Same synthetic application, with random:
  - Arrival time, run time, and size, based on the model
  - Granularity (fine, medium, coarse)
  - Communication pattern (ring, barrier, none)
- Recent study with scientific applications (as yet unpublished)

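Varying the load by factoring arrival times can be sketched as follows. The job tuple layout, the function names, and the simplified offered-load formula (requested processor-seconds divided by machine capacity over the arrival span) are assumptions for illustration; the talk does not give its exact definition. Multiplying arrival times by a factor below 1 packs the same work into a shorter span and so raises the load.

```python
def scale_arrivals(jobs, factor):
    """Multiply every arrival time by `factor`; run times and sizes are untouched.

    jobs: list of (arrival_s, runtime_s, size_in_processors).
    """
    return [(arrival * factor, runtime, size) for arrival, runtime, size in jobs]


def offered_load(jobs, processors):
    """Requested work divided by the capacity available over the arrival span.

    Simplification: the span runs from the first to the last arrival, so the
    tail of the final job is ignored. Needs at least two distinct arrivals.
    """
    arrivals = [arrival for arrival, _, _ in jobs]
    span = max(arrivals) - min(arrivals)
    work = sum(runtime * size for _, runtime, size in jobs)
    return work / (processors * span)
```

With three jobs arriving at 0 s, 200 s, and 400 s on a 32-processor machine, halving the arrival times doubles the offered load in this model while leaving each job's own requirements unchanged.
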
Load – Response Time

Load – Bounded Slowdown

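The slide does not define bounded slowdown, so the sketch below uses the definition common in the job-scheduling literature: response time divided by run time, with very short run times clamped to a threshold so that tiny jobs do not dominate the average. The threshold value of 10 seconds and the parameter names are assumptions, not values taken from the talk.

```python
def bounded_slowdown(response_s, run_s, threshold_s=10.0):
    """Bounded slowdown of one job.

    response_s:  time from submission to completion (wait + execution).
    run_s:       run time the job would need on a dedicated machine.
    threshold_s: lower bound applied to the run time (assumed value).
    """
    return max(1.0, response_s / max(run_s, threshold_s))


def mean_bounded_slowdown(jobs, threshold_s=10.0):
    """Average over (response_s, run_s) pairs."""
    values = [bounded_slowdown(resp, run, threshold_s) for resp, run in jobs]
    return sum(values) / len(values)
```

A one-second job that waits four seconds has a plain slowdown of 5 but a bounded slowdown of 1.0 under this threshold, while a 60-second job with a 120-second response time scores 2.0 either way.
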
Timeslice – Response Time

System Software Components
Job Scheduling, Fault Tolerance, Parallel I/O, Communication Library, Resource Management

Buffered CoScheduling (BCS)
- Buffer all communications
- Exchange information about pending communication every time slice
- Schedule and execute communication
- Implemented mostly on the NIC
- Requires fine-grained heartbeats
- Described in [SC’03]

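The buffer, exchange, schedule, execute cycle listed above can be modeled in a few lines. This is a toy host-side sketch, not the NIC implementation described in [SC’03]: the class and method names are hypothetical, and the rule that a message is transmitted only once both the send and the matching receive have been buffered is an assumption made to keep the example small.

```python
from collections import defaultdict


class BufferedCoScheduler:
    """Toy model: requests are only buffered until the timeslice boundary."""

    def __init__(self):
        self.sends = []                     # buffered (src, dst, tag, payload)
        self.recvs = []                     # buffered (dst, src, tag)
        self.delivered = defaultdict(list)  # dst -> [(src, tag, payload)]

    def post_send(self, src, dst, tag, payload):
        """Buffer a send; nothing is transmitted yet."""
        self.sends.append((src, dst, tag, payload))

    def post_recv(self, dst, src, tag):
        """Buffer a receive; nothing is matched yet."""
        self.recvs.append((dst, src, tag))

    def end_of_timeslice(self):
        """Exchange requirements, schedule matched pairs, then transmit them."""
        scheduled, still_waiting = [], []
        for src, dst, tag, payload in self.sends:
            if (dst, src, tag) in self.recvs:
                self.recvs.remove((dst, src, tag))
                scheduled.append((src, dst, tag, payload))
            else:
                still_waiting.append((src, dst, tag, payload))
        self.sends = still_waiting
        for src, dst, tag, payload in scheduled:
            self.delivered[dst].append((src, tag, payload))
        return len(scheduled)
```

Because all traffic moves only at timeslice boundaries, the state of the network between boundaries is known globally, which is the property the later fault-tolerance slides build on.
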
Design and Implementation
- Global synchronization
  - Strobe sent at regular intervals (time slices)
    - Compare-And-Write + Xfer-And-Signal (Master)
    - Test-Event (Slaves)
  - All system activities are tightly coupled
- Global scheduling
  - Exchange of communication requirements
    - Xfer-And-Signal + Test-Event
  - Communication scheduling
  - Real transmission
    - Xfer-And-Signal + Test-Event

Design and Implementation
- Implementation in the NIC
  - Application processes interact with NIC threads
    - MPI primitive: descriptor posted to the NIC
    - Communications are buffered
  - Cooperative threads running in the NIC
    - Synchronize
    - Partial exchange of control information
    - Schedule communications
    - Perform real transmissions and reduce computations
  - Computation and communication completely overlapped

Design and Implementation
- Non-blocking primitives: MPI_Isend/Irecv

Design and Implementation
- Blocking primitives: MPI_Send/Recv

Performance Evaluation
- BCS MPI vs. Quadrics MPI
  - Experimental setup
    - Benchmarks and applications
      - NPB (IS, EP, MG, CG, LU), Class C
      - SWEEP3D, 50x50x50
      - SAGE, timing.input
    - Scheduling parameters
      - 500 μs communication scheduling time slice (1 rail)
      - 250 μs communication scheduling time slice (2 rails)

Performance Evaluation
- Benchmarks and Applications (C)

Application        Slowdown
IS (32 PEs)         10.40%
EP (49 PEs)          5.35%
MG (32 PEs)          4.37%
CG (32 PEs)         10.83%
LU (32 PEs)         15.04%
SWEEP3D (49 PEs)    -2.23%
SAGE (62 PEs)       -0.42%

Performance Evaluation
- SAGE, timing.input (IA32): 0.5% speedup

Blocking Communication
Blocking vs. non-blocking, SWEEP3D (IA32):
- Blocking: MPI_Send/Recv
- Non-blocking: MPI_Isend/Irecv + MPI_Waitall

System Software Components
Job Scheduling, Fault Tolerance, Parallel I/O, Communication Library, Resource Management

Fault Tolerance Today
Fault tolerance is commonly achieved, if at all, by:
- Checkpointing
- Segmentation of the machine
- Removal of fault-prone components
Massive hardware redundancy is not considered economically feasible.

Our Approach to Fault Tolerance
- Recent work shows that scalable, system-level fault tolerance is within reach of current technology and can be achieved, with low overhead, through a global operating system
- Two results provide the basis for this claim:
  1. Buffered CoScheduling, which enforces frequent, global recovery lines and global control
  2. Feasibility of incremental checkpointing

Checkpointing and Recovery
- Simplicity
  - Easy implementation
- Cost-effective
  - No additional hardware support
Critical aspect: bandwidth requirements (saving process state)

Reducing Bandwidth
- Incremental checkpointing
  - Only the memory modified since the previous checkpoint is saved to stable storage
[Diagram: full process state vs. incremental checkpoint]

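A minimal sketch of incremental checkpointing at page granularity follows. It is an illustration under stated assumptions, not the mechanism evaluated in the talk: the 4096-byte page size and all function names are hypothetical, the process image is assumed to keep a fixed size, and modified pages are found by comparing against the previous copy, whereas a real system would more likely track writes through page protection.

```python
PAGE_SIZE = 4096  # bytes; assumed page size


def split_pages(memory):
    """Cut a bytes object into fixed-size pages."""
    return [memory[i:i + PAGE_SIZE] for i in range(0, len(memory), PAGE_SIZE)]


def incremental_checkpoint(memory, previous_pages):
    """Return (delta, pages): delta maps page index -> page for changed pages.

    Assumes `memory` has the same length as the previous image.
    """
    pages = split_pages(memory)
    delta = {i: page for i, page in enumerate(pages) if page != previous_pages[i]}
    return delta, pages


def restore(full_pages, deltas):
    """Rebuild the image from one full checkpoint plus deltas, oldest first."""
    pages = list(full_pages)
    for delta in deltas:
        for i, page in delta.items():
            pages[i] = page
    return b"".join(pages)
```

If a process touches one byte of a four-page image between checkpoints, the delta holds a single page, so only a quarter of the image has to travel to stable storage.
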
Enabling Automatic Checkpointing
[Diagram: implementation levels Hardware, Operating system, Run-time library, Application; scales for User intervention and Checkpoint data, each from Low to High; label: automatic]

The Bandwidth Challenge
Does current technology provide enough bandwidth for checkpointing that is both frequent and automatic?

Methodology
- Quantifying the bandwidth requirements
  - Checkpoint intervals: 1 s to 20 s
  - Comparison with the currently available bandwidth:
    - 900 MB/s: sustained network bandwidth (Quadrics QsNet II)
    - 75 MB/s: sustained bandwidth of a single disk (Ultra SCSI controller)

Memory Footprint (64 Itanium II processors)

Application    Memory footprint
Sage-1000MB    954.6 MB
Sage-500MB     497.3 MB
Sage-100MB     103.7 MB
Sage-50MB       55 MB
Sweep3D        105.5 MB
SP Class C      40.1 MB
LU Class C      16.6 MB
BT Class C      76.5 MB
FT Class C     118 MB
[Annotation: increasing memory footprint]

Bandwidth Requirements
[Figure: Sage-1000MB; y-axis: Bandwidth (MB/s); x-axis: Timeslices (s); marked values: 78.8 MB/s and 12.1 MB/s]
The required bandwidth decreases as the timeslice grows.

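The comparison the methodology calls for is simple arithmetic, sketched below with the figures quoted on the slides: 78.8 MB/s and 12.1 MB/s required for Sage-1000MB, against 900 MB/s of network bandwidth and 75 MB/s for a single disk. The dictionary keys and function names are hypothetical, and which timeslice each marked value belongs to is not recoverable from the extracted text, so the labels stay neutral.

```python
AVAILABLE_MBPS = {
    "QsNet II network": 900.0,
    "single Ultra SCSI disk": 75.0,
}

REQUIRED_MBPS = {
    "Sage-1000MB, higher marked value": 78.8,
    "Sage-1000MB, lower marked value": 12.1,
}


def utilization(required_mbps, available_mbps):
    """Fraction of the available bandwidth that checkpointing would consume."""
    return required_mbps / available_mbps


def report():
    """List (case, device, utilization, fits) for every combination."""
    rows = []
    for case, need in REQUIRED_MBPS.items():
        for device, have in AVAILABLE_MBPS.items():
            rows.append((case, device, utilization(need, have), need <= have))
    return rows
```

Under these numbers the higher requirement uses under a tenth of the network bandwidth but slightly exceeds what one disk sustains, while the lower requirement fits comfortably on both.
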
Bandwidth Requirements for 1 second
[Figure annotations: increases with memory footprint; single SCSI disk performance; most demanding]

Increasing Memory Footprint Size
[Figure: y-axis: Average Bandwidth (MB/s); x-axis: Timeslices (s); annotation: increases sublinearly]

Increasing Processor Count
[Figure: y-axis: Average Bandwidth (MB/s); x-axis: Timeslices (s); annotations: decreases slightly with processor count; weak scaling]

Technological Trends
[Figure: performance improvement per year; annotations: performance of applications bounded by memory improvements; increases at a faster pace]

Conclusions
- As clusters grow, interconnection technology advances:
  - Better bandwidth and latency
  - On-board programmable processor, RAM
  - Hardware support for collective operations
This allows the development of a common system infrastructure that is a parallel program in itself.

Conclusions (cont.)
- On top of this infrastructure we built:
  - Scalable resource management (STORM)
  - Novel job scheduling algorithms
  - Simplified system design and communication library
  - Possible basis for transparent fault tolerance

Conclusions (cont.)
- Experimental performance evaluation demonstrates:
  - Scalable interactive job launching and context switching
  - Multiprogramming parallel jobs is feasible
  - Adaptive scheduling algorithms adjust to different job requirements, improving response times and slowdown in various workloads
  - Transparent, frequent checkpointing is within current reach

References
- Eitan’s web page
- Fabrizio’s web page
- PAL team web page:

Resource Overlapping

Turnaround Time

Response Time

Timeslice – Bounded Slowdown

FCFS vs. GS and MPL

FCFS vs. GS and MPL (2)

Backfilling
- Backfilling is a technique to move jobs forward in the queue
- Can be combined with time-sharing schedulers such as GS when all timeslots are full

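The slide does not say which backfilling variant was used, so the sketch below shows one common form as an assumption: the blocked job at the head of the queue gets a reservation, and a later job may jump ahead only if it fits in the currently free processors and its estimated run time ends before that reservation. The function name and tuple layouts are hypothetical.

```python
def backfill(queue, free, running, now=0):
    """Return the names of jobs that can start at time `now`.

    queue:   list of (name, size, estimated_runtime) in arrival order.
    free:    number of idle processors.
    running: list of (end_time, size) for jobs already executing.
    """
    queue = list(queue)
    started = []
    # Plain FCFS: start jobs from the head of the queue while they fit.
    while queue and queue[0][1] <= free:
        name, size, runtime = queue.pop(0)
        free -= size
        running = running + [(now + runtime, size)]
        started.append(name)
    if not queue:
        return started
    # Reservation: earliest time at which the blocked head job will fit.
    head_size = queue[0][1]
    available, reservation = free, None
    for end_time, size in sorted(running):
        available += size
        if available >= head_size:
            reservation = end_time
            break
    # Backfill: later jobs may start now if they cannot delay the head job.
    for name, size, runtime in queue[1:]:
        if reservation is not None and size <= free and now + runtime <= reservation:
            free -= size
            started.append(name)
    return started
```

With 2 idle processors, a 6-processor job finishing at time 100, and a queue of A (4 processors), B (2 processors, 80 s), and C (2 processors, 200 s), only B starts: it fits and ends before A's reservation at time 100, whereas C would run past it.
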
Effect of Backfilling

Characterization
[Figure: Sage-1000MB; annotations: data initialization; regular processing bursts]

Communication Interleaved
[Figure: Sage-1000MB; annotation: regular communication bursts]