Kei Davis and Fabrizio Petrini, Europar 2004, Pisa, Italy. CCS-3 PAL: A New Approach

1 A New Approach
Kei Davis and Fabrizio Petrini {kei,fabrizio}, CCS-3 PAL. Europar 2004, Pisa, Italy.

2 Part 4: A New Approach (or “An Old Paradigm in a Bigger Bottle”)
Simply put, short-term strategies and one-time crash programs are unlikely to develop the technology pipelines and new approaches required to realize the petascale computing systems needed by a range of scientific, defense, and national security applications. Rather, multiple cycles of advanced research and development, followed by large-scale prototyping and product development, will be required to develop systems that can consistently achieve a high fraction of their peak performance on critical applications, while also being easier to program and operate reliably. [From Roadmap]

3 Overview
- Background
- Buffered Coscheduling
- Basic mechanisms
- BCS-MPI
- Fault tolerance
- Resource management

4 Definition
System software is all software running on a machine other than user applications. This includes the OS, and for large parallel systems typically also includes:
- Communication libraries, e.g., MPI, OpenMP
- Parallel file systems
- System monitor/manager
- Job scheduler/resource manager
- High performance external network

5 Fundamental Thesis
1. The fundamental problem is the use of largely independent, loosely-coupled compute nodes for the execution of what are inherently tightly-coupled applications (algorithms).
2. Greater hardware integration is a great enabler, but is not sufficient to solve this problem.
Tight coupling arises from data dependencies, realized as interprocess communication.

6 The Success of the Wright Brothers
- In December 1903 Orville Wright took off in a powered airplane and flew for 12 seconds and 120 feet
- It took several years for the Wrights to build the first truly controllable airplane

7 Control is the Key
- Wilbur Wright, in a talk in 1901, said that the “greatest obstacle to a functional airplane was the balancing and steering of the machine after it is actually in flight”

8 BSP as a Guiding Principle
There is a need for global coordination, enforced by global control, reified as a global operating system. We are inspired by the BSP model.
“Many of LANL’s applications could be recast in BSP style” -- idealistic postdocs
- There is neither budget nor manpower: legacy codes represent $billions in development effort
- There is no will to understand or learn a new programming paradigm
- BSP applies to the application domain; it does not directly support the amelioration of system software problems.

9 The Vision
We propose a new methodology for the design of parallel system software based on two cornerstones:
- BSP-like global coordination and control of all of the activities of the machine; and,
- with respect to coordination and control, treating the system software suite as any other application.
Overall we seek simplicity, uniformity of approach, efficiency, and very high scalability. We believe this can simplify almost all components of system software.

10 Intuition
The global operating system coordinates all system and application software activities in a BSP-like fashion.
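The BSP-like coordination named on this slide can be illustrated with a toy superstep loop. This is a minimal sketch, not the authors' implementation: `bsp_run` and the `ring_sum` example are hypothetical names introduced here. Each superstep has a local compute phase, a communication phase in which messages are only buffered, and a global barrier after which the buffered messages become visible.

```python
from typing import Callable, List, Tuple

# A step function takes (rank, local_state, inbox) and returns
# (new_state, outgoing), where outgoing is a list of (dest_rank, payload).
Step = Callable[[int, int, List[int]], Tuple[int, List[Tuple[int, int]]]]


def bsp_run(states: List[int], step: Step, supersteps: int) -> List[int]:
    """Run a number of BSP supersteps over simulated processes (toy model)."""
    n = len(states)
    inboxes: List[List[int]] = [[] for _ in range(n)]
    for _ in range(supersteps):
        next_inboxes: List[List[int]] = [[] for _ in range(n)]
        new_states: List[int] = []
        for rank in range(n):
            # Compute phase: a process sees only messages delivered
            # at the previous barrier.
            state, outgoing = step(rank, states[rank], inboxes[rank])
            new_states.append(state)
            # Communication phase: messages are buffered, not delivered.
            for dest, payload in outgoing:
                next_inboxes[dest].append(payload)
        # Barrier: all buffered messages become visible at once.
        states, inboxes = new_states, next_inboxes
    return states


def ring_sum(rank: int, state: int, inbox: List[int]) -> Tuple[int, List[Tuple[int, int]]]:
    """Add whatever arrived, then send the local value to the right neighbour
    in a ring of 4 processes (the ring size is fixed for this example)."""
    state += sum(inbox)
    return state, [((rank + 1) % 4, state)]
```

Calling `bsp_run([1, 2, 3, 4], ring_sum, 2)` leaves the states unchanged after the first superstep, because nothing has been delivered yet, and adds each left neighbour's value in the second.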

11 Distributed vs. Parallel
Distributed and parallel applications (including operating systems) may be distinguished by their use of global and collective operations:
- Distributed: local information, relatively small number of point-to-point messages;
- Parallel: global synchronization (barriers, reductions, exchanges).

12 OS’s Collective Operations
Many OS tasks are inherently global or collective operations:
- Context switching
- Job launching
- Job termination (normal and forced)
- Load balancing

13 [Figure: Node 1 and Node 2 each run a Local Operating System with its own Resource Management, Parallel I/O, Fault Tolerance, Job Scheduling, and User-Level Communication. This is contrasted with a single Global Parallel Operating System providing Job Scheduling, Fault Tolerance, Communication, Parallel I/O, and Resource Mgmt.]

14 Buffered CoScheduling
- Target
  - Simplifying design and implementation of the communication layer for large-scale systems
  - Simplicity, determinism, performance, scalability
- Approach
  - Built atop a basic set of three primitives
  - Global synchronization/scheduling
- Vision
  - BSP-like system running MIMD applications

15 BCS Core Primitives
- System software built atop three primitives
  - Xfer-And-Signal
    - Transfer block of data to a set of nodes
    - Optionally signal local/remote event upon completion
  - Compare-And-Write
    - Compare global variable on a set of nodes
    - Optionally write global variable on the same set of nodes
  - Test-Event
    - Poll local event
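The semantics of the three primitives can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not the BCS or QsNet API: the class `ToyNetwork`, its method signatures, and the rule that the optional write happens only when the comparison holds on every node in the set are all assumptions made for illustration.

```python
import operator
from typing import Callable, Dict, Iterable, List, Optional, Set


class Node:
    """One simulated node: global variables plus a set of raised events."""

    def __init__(self) -> None:
        self.memory: Dict[str, object] = {}
        self.events: Set[str] = set()


class ToyNetwork:
    """Hypothetical in-memory stand-in for the three BCS primitives."""

    def __init__(self, num_nodes: int) -> None:
        self.nodes: List[Node] = [Node() for _ in range(num_nodes)]

    def xfer_and_signal(self, src: int, dests: Iterable[int], key: str, data: object,
                        local_event: Optional[str] = None,
                        remote_event: Optional[str] = None) -> None:
        # Transfer a block of data to a set of nodes.
        for d in dests:
            self.nodes[d].memory[key] = data
            if remote_event is not None:
                self.nodes[d].events.add(remote_event)  # optional remote signal
        if local_event is not None:
            self.nodes[src].events.add(local_event)     # optional local signal

    def compare_and_write(self, dests: Iterable[int], var: str,
                          op: Callable[[object, object], bool], value: object,
                          write_var: Optional[str] = None,
                          write_value: object = None) -> bool:
        # Compare a global variable on every node in the set.
        targets = list(dests)
        ok = all(var in self.nodes[d].memory and op(self.nodes[d].memory[var], value)
                 for d in targets)
        # Assumption: write only if the comparison held everywhere.
        if ok and write_var is not None:
            for d in targets:
                self.nodes[d].memory[write_var] = write_value
        return ok

    def test_event(self, node: int, event: str) -> bool:
        # Poll a local event (non-consuming in this sketch).
        return event in self.nodes[node].events
```

For example, a source node can push a value to four destinations with `xfer_and_signal`, each destination can poll for arrival with `test_event`, and the source can then check with `compare_and_write` (passing `operator.ge` as the comparison) that every destination holds at least that value before setting a flag on all of them.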

16 Core Primitives on the Quadrics QsNET
- System software built atop three primitives
  - Xfer-And-Signal (QsNet):
    - Node S transfers block of data to nodes D1, D2, D3 and D4
    - Events triggered at source and destinations
[Figure: source node S and destination nodes D1 to D4, with a source event and destination events]

17 Core Primitives
- System software built atop three primitives
  - Compare-And-Write (QsNet):
    - Node S compares variable V on nodes D1, D2, D3 and D4
[Figure: S asks D1 to D4 whether V satisfies a comparison against Value; the operator list is garbled in the source and only ">" is legible]

18 Core Primitives
- System software built atop three primitives
  - Compare-And-Write (QsNet):
    - Node S compares variable V on nodes D1, D2, D3 and D4
    - Partial results are combined in the switches
[Figure: source node S and destination nodes D1 to D4]

19 Design and Implementation
- Global synchronization
  - Strobe sent at regular intervals (time slices)
    - Compare-And-Write + Xfer-And-Signal (Master)
    - Test-Event (Slaves)
  - All system activities are tightly coupled
- Global scheduling
  - Exchange of communication requirements
    - Xfer-And-Signal + Test-Event
  - Communication scheduling
  - Real transmission
    - Xfer-And-Signal + Test-Event
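The effect of strobing on message timing can be sketched with a small simulation. This is an illustration under assumptions rather than the BCS-MPI scheduler: `run_slices` is a hypothetical function, and the rule that a descriptor posted during one time slice is transmitted in the following slice is a simplification chosen for the example. The 500 µs slice length is the single-rail value quoted on the Performance Evaluation slide.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (post_time_us, src, dst, payload): a send request posted by an application.
Descriptor = Tuple[float, int, int, str]


def run_slices(descriptors: List[Descriptor], slice_us: float,
               num_slices: int) -> Dict[int, List[Tuple[int, int, str]]]:
    """Return {slice_index: [(src, dst, payload), ...]} of transmissions."""
    pending: Dict[int, List[Tuple[int, int, str]]] = defaultdict(list)
    for post_time, src, dst, payload in descriptors:
        posted_in = int(post_time // slice_us)
        # Buffered until the next strobe, then transmitted during the
        # following slice (assumption made for this sketch).
        pending[posted_in + 1].append((src, dst, payload))
    schedule: Dict[int, List[Tuple[int, int, str]]] = {}
    for k in range(num_slices):
        # Strobe k: all nodes synchronize, exchange requirements, transmit.
        schedule[k] = sorted(pending.get(k, []))
    return schedule


requests = [(120.0, 0, 1, "a"), (480.0, 2, 3, "b"), (510.0, 1, 0, "c")]
plan = run_slices(requests, slice_us=500.0, num_slices=3)
```

Requests posted at 120 µs and 480 µs fall in slice 0 and are both transmitted in slice 1, regardless of which was posted first; the request posted at 510 µs waits for slice 2. This is the source of the determinism listed as a target of Buffered CoScheduling.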

20 Design and Implementation
- Implementation in the NIC
  - Application processes interact with NIC threads
    - MPI primitive -> Descriptor posted to the NIC
    - Communications are buffered
  - Cooperative threads running in the NIC
    - Synchronize
    - Partial exchange of control information
    - Schedule communications
    - Perform real transmissions and reduce computations
  - Comp/comm completely overlapped

21 Design and Implementation
- Non-blocking primitives: MPI_Isend/Irecv

22 Design and Implementation
- Blocking primitives: MPI_Send/Recv

23 Performance Evaluation
- BCS MPI vs. Quadrics MPI
  - Experimental Setup
    - Benchmarks and Applications
      - NPB (IS, EP, MG, CG, LU): Class C
      - SWEEP3D: 50x50x50
      - SAGE: timing.input
    - Scheduling parameters
      - 500 μs communication scheduling time slice (1 rail)
      - 250 μs communication scheduling time slice (2 rails)

24 Performance Evaluation
- Benchmarks and Applications (C)

Application        Slowdown
IS (32 PEs)        10.40%
EP (49 PEs)         5.35%
MG (32 PEs)         4.37%
CG (32 PEs)        10.83%
LU (32 PEs)        15.04%
SWEEP3D (49 PEs)   -2.23%
SAGE (62 PEs)      -0.42%

25 Performance Evaluation
- SAGE, timing.input (IA32): 0.5% speedup

26 Blocking Communication
Blocking vs. non-blocking, SWEEP3D (IA32): MPI_Send/Recv -> MPI_Isend/Irecv + MPI_Waitall

27 Fault Tolerance
A hot research topic in academia, industry, and federal agencies

28 Fault Tolerance Today
Fault tolerance is commonly achieved, if at all, by
- Checkpointing
- Segmentation of the machine
- Removal of fault-prone components
Massive hardware redundancy is not considered economically feasible

29 Checkpointing
There are numerous schemes for checkpointing:
- User initiated
- System initiated
- By hardware (proposed)
checkpointing of
- user-specified data
- application image
- modified data

30 Checkpointing (cont’d)
Most common (in our environment, at least) is
- User-initiated checkpointing of user-specified data
Pro:
- Simple in concept
Cons:
- Effort and care by programmer, particularly for restart
- Error-prone: needed state may not be captured
- Opportunistically chosen program points, coarse granularity
  - Severe rollback penalty
  - Bursty I/O
- Worsens with scale as MTBF decreases
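The point that checkpointing worsens with scale can be made concrete with a back-of-the-envelope estimate. This is a sketch under an assumption that is not in the slides: node failures are independent with a constant rate, so the system MTBF is the node MTBF divided by the node count. The function names and all numbers in the usage note are hypothetical.

```python
def system_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """System MTBF, assuming independent constant-rate node failures."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return node_mtbf_hours / num_nodes


def expected_failures(run_hours: float, node_mtbf_hours: float, num_nodes: int) -> float:
    """Expected number of failures anywhere in the machine during one run."""
    return run_hours / system_mtbf_hours(node_mtbf_hours, num_nodes)
```

With a hypothetical node MTBF of 50,000 hours, 1,000 nodes give a system MTBF of 50 hours and 10,000 nodes give 5 hours, so a 100-hour run would expect 2 and 20 interruptions respectively. Each interruption rolls the job back to its last checkpoint, which is why coarse-grained checkpoints become more costly as machines grow.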

31 Checkpointing (cont’d)
Defensive I/O accounts for ~80% of I/O on ASCI machines. This biases the relative importance (cost) of the I/O subsystem.

32 Segmentation of Machine
The procrustean approach: segment the machine. Divide capability for capacity!

33 Elimination of Fault-Prone Components
- Cluster management software using control network eliminates need for floppy or optical drives on every node
- Eliminate hard disks
  - Makes checkpointing yet more expensive
- DRAM not straightforward

34 Our Approach to Fault Tolerance
- Our contribution is to show that scalable, system-level fault tolerance is within reach with current technology and can be achieved, with low overhead, through a global operating system
- Two results provide the basis for this claim
  1. Buffered CoScheduling, which enforces frequent, global recovery lines and global control
  2. Feasibility of incremental checkpointing

35 Checkpointing and Recovery
- Simplicity
  - Easy implementation
- Cost-effective
  - No additional hardware support
Critical aspect: bandwidth requirements
[Figure label: saving process state]

36 Reducing Bandwidth
- Incremental checkpointing
  - Only the memory modified since the previous checkpoint is saved to stable storage
[Figure: full process state vs. incremental checkpoint]
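Incremental checkpointing can be illustrated with a page-level sketch. Note the swapped-in technique: this example detects modified pages by comparing per-page hashes, which is easy to show in user-level Python, whereas a system-level implementation would more likely rely on page protection or dirty bits. The names `page_digests` and `incremental_checkpoint` and the 4096-byte page size are assumptions made for illustration.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

PAGE_SIZE = 4096  # assumed page granularity


def page_digests(memory: bytes) -> List[bytes]:
    """Hash each page of a memory image."""
    return [hashlib.sha256(memory[i:i + PAGE_SIZE]).digest()
            for i in range(0, len(memory), PAGE_SIZE)]


def incremental_checkpoint(memory: bytes,
                           previous: Optional[List[bytes]]
                           ) -> Tuple[Dict[int, bytes], List[bytes]]:
    """Return (pages_to_save, digests).

    pages_to_save maps page index to page contents and holds only the
    pages that changed since the checkpoint described by `previous`.
    Passing previous=None yields a full checkpoint.
    """
    digests = page_digests(memory)
    dirty: Dict[int, bytes] = {}
    for idx, digest in enumerate(digests):
        if previous is None or idx >= len(previous) or previous[idx] != digest:
            dirty[idx] = memory[idx * PAGE_SIZE:(idx + 1) * PAGE_SIZE]
    return dirty, digests


# Usage: a full checkpoint first, then one byte is modified in page 1.
image = bytearray(4 * PAGE_SIZE)
full, digests0 = incremental_checkpoint(bytes(image), None)
image[5000] = 1
delta, digests1 = incremental_checkpoint(bytes(image), digests0)
```

In this usage the full checkpoint saves all four pages, while the incremental one saves only page 1, which is the bandwidth reduction the slide describes.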

37 Enabling Automatic Checkpointing
[Figure: levels at which checkpointing can be implemented (hardware, operating system, run-time library, application), with scales labelled “User intervention” and “Checkpoint data” running from Low to High, and one end marked “automatic”]

38 The Bandwidth Challenge
Does the current technology provide enough bandwidth?
[Figure labels: Frequent, Automatic]

39 Methodology
- Quantifying the bandwidth requirements
  - Checkpoint intervals: 1 s to 20 s
  - Comparing with the current bandwidth available

Sustained network bandwidth (Quadrics QsNet II)            900 MB/s
Single sustained disk bandwidth (Ultra SCSI controller)     75 MB/s

40 Experimental Environment
- 32-node Linux cluster
  - 64 Itanium II processors
  - PCI-X I/O bus
  - Quadrics QsNet interconnection network
- Parallel scientific codes
  - Sage
  - Sweep3D
  - NAS parallel benchmarks: SP, LU, BT and FT
Representative of the ASCI production codes at LANL

41 Memory Footprint

Application     Memory footprint
Sage-1000MB     954.6 MB
Sage-500MB      497.3 MB
Sage-100MB      103.7 MB
Sage-50MB       55 MB
Sweep3D         105.5 MB
SP Class C      40.1 MB
LU Class C      16.6 MB
BT Class C      76.5 MB
FT Class C      118 MB

[Figure annotation: increasing memory footprint]

42 Characterization
[Figure: Sage-1000MB; annotations: data initialization, regular processing bursts]

43 Communication
[Figure: Sage-1000MB; annotations: interleaved, regular communication bursts]

44 Bandwidth Requirements
[Figure: Sage-1000MB, bandwidth (MB/s) vs. timeslice (s); annotated values 78.8 MB/s and 12.1 MB/s]
The required bandwidth decreases as the timeslice grows.
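The two annotated Sage-1000MB values can be put next to the available bandwidths quoted on the Methodology slide (900 MB/s sustained network, 75 MB/s single disk). The following is a small arithmetic sketch; the function name `utilization` is hypothetical, and the slide does not say which timeslice each value corresponds to, so the code does not assume one.

```python
NETWORK_MB_PER_S = 900.0  # sustained network bandwidth, from the Methodology slide
DISK_MB_PER_S = 75.0      # single sustained disk bandwidth, from the Methodology slide


def utilization(required_mb_per_s: float, available_mb_per_s: float) -> float:
    """Fraction of the available bandwidth that checkpointing would consume."""
    return required_mb_per_s / available_mb_per_s


for required in (78.8, 12.1):  # the two values annotated on the figure
    net = utilization(required, NETWORK_MB_PER_S)
    disk = utilization(required, DISK_MB_PER_S)
    print(f"{required:5.1f} MB/s -> network {net:.1%}, single disk {disk:.1%}")
```

The higher value uses under a tenth of the network bandwidth but slightly exceeds what one disk can sustain, while the lower value fits comfortably within both.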

45 Bandwidth Requirements for 1 second
[Figure: bandwidth required by each application, compared with single SCSI disk performance; the most demanding case is marked]
The required bandwidth increases with memory footprint.

46 Increasing Memory Footprint Size
[Figure: average bandwidth (MB/s) vs. timeslice (s)]
The required bandwidth increases sublinearly.

47 Increasing Processor Count
[Figure: average bandwidth (MB/s) vs. timeslice (s), weak scaling]
The required bandwidth decreases slightly with processor count.

48 Technological Trends
[Figure: performance improvement per year; annotations: “Performance of applications bounded by memory improvements” and “Increases at a faster pace”]

49 Resource Management
Seeking more effective use of cluster resources: reduced response time, greater throughput.

50 STORM
STORM (Scalable TOol for Resource Management)
- Based on buffered coscheduling,
- Easy to port,
- Enables resource management to exploit low-level network features,
- Is orders of magnitude faster than the best reported results in the literature.

51 State of the Art in Resource Management
Resource managers (e.g. PBS, LSF, RMS, LoadLeveler, Maui) are typically implemented using
- TCP/IP, which favors portability over performance,
- Poorly-scaling algorithms for the distribution/collection of data and control messages, which favors development time over performance.
Scalable performance is not important for small clusters but crucial for large ones. There exists a need for fast and scalable resource management.

52 Observation
If the cluster has a powerful, scalable network, why aren’t we using it?

53 Experimental Results
- 64-node/256-processor ES40 Alphaserver cluster
- 2 independent network rails of Quadrics Elan3
- Files are placed in ramdisk in order to avoid I/O bottlenecks and expose the performance of the resource management algorithms

54 Launch times (unloaded system)
The launch time is constant when we increase the number of processors. STORM is highly scalable.

55 Launch times (loaded system, 12 MB)
In the worst case it still takes only 1.5 seconds to launch a 12 MB file on 256 processors.

56 Measured and estimated launch times
The model shows that in an ES40-based Alphaserver a 12 MB binary can be launched in 135 ms on 16,384 nodes.

57 Measured and predicted performance of existing job launchers

58 Acknowledgements
- José Moreira and György Almási (BlueGene/L)
- Ron Brightwell (ASCI Red Storm)
- Mark Seager (ASCI Thunder)
- Paul Terry (Cray XD1)
- Srinidhi Varadarajan (System X)

59 PAL Team
- Juan Fernandez Peinador (BCS-MPI)
- Eitan Frachtenberg (STORM)
- José Carlos Sancho (Fault tolerance)
- Salvador Coll (Collective Communication)
- Scott Pakin (Noise analysis and STORM)
- Darren Kerbyson (Noise analysis)
- Adolfy Hoisie (team leader)

60 References
- PAL team web page
- Fabrizio’s web page

61 About the authors
Kei Davis is a team leader and technical staff member at Los Alamos National Laboratory (LANL) where he is currently working on system software solutions for reliability and usability of large-scale parallel computers. Previous work at LANL includes computer system performance evaluation and modeling, large-scale computer system simulation, and parallel functional language implementation. His research interests are centered on parallel computing; more specifically, various aspects of operating systems, parallel programming, and programming language design and implementation. Kei received his PhD in Computing Science from Glasgow University and his MS in Computation from Oxford University. Before his appointment at LANL he was a research scientist at the Computing Research Laboratory at New Mexico State University.
Fabrizio Petrini is a member of the technical staff of the CCS-3 group of the Los Alamos National Laboratory (LANL). He received his PhD in Computer Science from the University of Pisa in 1997. Before his appointment at LANL he was a research fellow of the Computing Laboratory of Oxford University (UK), a postdoctoral researcher at the University of California at Berkeley, and a member of the technical staff of the Hewlett Packard Laboratories. His research interests include various aspects of supercomputers, including high-performance interconnection networks and network interfaces, job scheduling algorithms, parallel architectures, operating systems and parallel programming languages. He has received numerous awards from the NNSA for contributions to supercomputing projects, and from other organizations for scientific publications.
