
Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 1 CCS-3 P AL A NEW APPROACH

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 2 CCS-3 P AL Part 4: A New Approach (or “An Old Paradigm in a Bigger Bottle”)
“Simply put, short-term strategies and one-time crash programs are unlikely to develop the technology pipelines and new approaches required to realize the petascale computing systems needed by a range of scientific, defense, and national security applications. Rather, multiple cycles of advanced research and development, followed by large-scale prototyping and product development, will be required to develop systems that can consistently achieve a high fraction of their peak performance on critical applications, while also being easier to program and operate reliably.” [From Roadmap]

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 3 CCS-3 P AL Overview
- Background
- Buffered Coscheduling
- Basic mechanisms
- BCS-MPI
- Fault tolerance
- Resource management

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 4 CCS-3 P AL Definition
System software is all software running on a machine other than user applications. This includes the OS, and for large parallel systems typically also includes:
- Communication libraries, e.g., MPI, OpenMP
- Parallel file systems
- System monitor/manager
- Job scheduler/resource manager
- High-performance external network

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 5 CCS-3 P AL Fundamental Thesis
1. The fundamental problem is the use of largely independent, loosely coupled compute nodes for the execution of what are inherently tightly coupled applications (algorithms).
2. Greater hardware integration greatly enables, but is not by itself sufficient to solve, this problem.
Tight coupling arises from data dependencies, realized as interprocess communication.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 6 CCS-3 P AL The Success of the Wright Brothers
- In December 1903 Orville Wright took off in a powered airplane and flew for 12 seconds and 120 feet
- It took several years for the Wrights to build the first truly controllable airplane

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 7 CCS-3 P AL Control is the Key
- Wilbur Wright, in a talk in 1901, said that the “greatest obstacle to a functional airplane was the balancing and steering of the machine after it is actually in flight”

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 8 CCS-3 P AL BSP as a Guiding Principle
There is a need for global coordination, enforced by global control, reified as a global operating system. We are inspired by the BSP model.
“Many of LANL’s applications could be recast in BSP style” --idealistic postdocs
- There is neither budget nor manpower—legacy codes represent $billions in development effort
- There is no will to understand or learn a new programming paradigm
- BSP applies to the application domain; it does not directly support the amelioration of system software problems

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 9 CCS-3 P AL The Vision
We propose a new methodology for the design of parallel system software based on two cornerstones:
- BSP-like global coordination and control of all of the activities of the machine; and,
- with respect to coordination and control, treating the system software suite as any other application.
Overall we seek simplicity, uniformity of approach, efficiency, and very high scalability. We believe this can simplify almost all components of system software.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 10 CCS-3 P AL Intuition The global operating system coordinates all system and application software activities in a BSP-like fashion.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 11 CCS-3 P AL Distributed vs. Parallel
Distributed and parallel applications (including operating systems) may be distinguished by their use of global and collective operations:
- Distributed—local information, relatively small number of point-to-point messages;
- Parallel—global synchronization: barriers, reductions, exchanges.
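To make the distinction concrete, here is a minimal MPI sketch using standard MPI calls (the ranks and buffer sizes are illustrative): the first half uses only point-to-point messages in the “distributed” style, the second half uses a barrier and an all-reduce in the “parallel” style.

```c
/* Point-to-point vs. global/collective operations in MPI (minimal example). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs, token = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* "Distributed" style: local information, a few point-to-point messages. */
    if (rank == 0 && nprocs > 1)
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* "Parallel" style: global synchronization and a reduction involving all ranks. */
    int local = rank, global_sum = 0;
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Allreduce(&local, &global_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("sum of ranks = %d\n", global_sum);
    MPI_Finalize();
    return 0;
}
```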

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 12 CCS-3 P AL OS’s Collective Operations
Many OS tasks are inherently global or collective operations:
- Context switching
- Job launching
- Job termination (normal and forced)
- Load balancing

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 13 CCS-3 P AL [Diagram: each node (Node 1, Node 2, ...) runs a local operating system together with resource management, parallel I/O, fault tolerance, job scheduling, and user-level communication; a global parallel operating system spans the nodes and coordinates job scheduling, fault tolerance, communication, parallel I/O, and resource management.]

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 14 CCS-3 P AL Buffered CoScheduling
- Target
  - Simplifying design and implementation of the communication layer for large-scale systems
  - Simplicity, determinism, performance, scalability
- Approach
  - Built atop a basic set of three primitives
  - Global synchronization/scheduling
- Vision
  - BSP-like system running MIMD applications

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 15 CCS-3 P AL BCS Core Primitives
System software is built atop three primitives:
- Xfer-And-Signal
  - Transfer a block of data to a set of nodes
  - Optionally signal a local/remote event upon completion
- Compare-And-Write
  - Compare a global variable on a set of nodes
  - Optionally write the global variable on the same set of nodes
- Test-Event
  - Poll a local event
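As an illustration only, the three primitives might be exposed to system software through C interfaces along the following lines. All names, types, and parameters here (bcs_xfer_and_signal, bcs_compare_and_write, bcs_test_event, node_set_t, and so on) are hypothetical sketches, not the actual BCS or QsNet API.

```c
/* Hypothetical C interfaces for the three BCS primitives (illustration only). */
#include <stddef.h>
#include <stdbool.h>

typedef struct { int *nodes; int count; } node_set_t;   /* set of destination nodes       */
typedef int bcs_event_t;                                 /* handle to a local/remote event */

typedef enum { BCS_LT, BCS_LE, BCS_EQ, BCS_GE, BCS_GT } bcs_cmp_t;

/* Xfer-And-Signal: transfer a block of data to a set of nodes and optionally
 * signal a local and/or remote event when the transfer completes. */
int bcs_xfer_and_signal(const void *buf, size_t len,
                        node_set_t dests,
                        bcs_event_t *local_event,    /* may be NULL */
                        bcs_event_t *remote_event);  /* may be NULL */

/* Compare-And-Write: compare a global variable against a value on a set of
 * nodes (partial results combined in the network) and optionally write a new
 * value to the same variable on the same nodes.  Returns whether the
 * comparison held on every node in the set. */
bool bcs_compare_and_write(int global_var_id, bcs_cmp_t op, long value,
                           node_set_t nodes,
                           bool do_write, long new_value);

/* Test-Event: non-blocking poll of a local event; returns true if it fired. */
bool bcs_test_event(bcs_event_t *event);
```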

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 16 CCS-3 P AL Core Primitives on the Quadrics QsNET
- Xfer-And-Signal (QsNet): node S transfers a block of data to nodes D1, D2, D3, and D4; events are triggered at the source and at the destinations
[Diagram: S sends to D1–D4, with a source event at S and destination events at D1–D4.]

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 17 CCS-3 P AL Core Primitives
- Compare-And-Write (QsNet): node S compares variable V on nodes D1, D2, D3, and D4
[Diagram: S asks D1–D4 whether V stands in a given relation (<, ≤, =, ≥, >) to a value.]

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 18 CCS-3 P AL Core Primitives
- Compare-And-Write (QsNet): node S compares variable V on nodes D1, D2, D3, and D4; partial results are combined in the switches
[Diagram: the per-node comparison results from D1–D4 are combined in the network switches on their way back to S.]

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 19 CCS-3 P AL Design and Implementation
- Global synchronization
  - A strobe is sent at regular intervals (time slices): Compare-And-Write + Xfer-And-Signal (master); Test-Event (slaves)
  - All system activities are tightly coupled (see the sketch after this list)
- Global scheduling
  - Exchange of communication requirements: Xfer-And-Signal + Test-Event
  - Communication scheduling
  - Real transmission: Xfer-And-Signal + Test-Event
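Continuing the hypothetical interfaces sketched above, the strobe might be driven roughly as follows. The 500 μs slice matches the value quoted later in the talk, but the roles, variable ids, and spin loops are illustrative simplifications, not the actual BCS implementation.

```c
/* Hypothetical strobe loop built on the primitives sketched above (illustration only;
 * node_set_t, bcs_event_t, and the bcs_* functions come from the previous sketch). */
#include <unistd.h>   /* usleep */

#define TIMESLICE_US 500   /* e.g., 500 microsecond time slice (illustrative) */
#define STROBE_VAR     0   /* id of a hypothetical global "strobe" variable   */

extern node_set_t  all_slaves;    /* all nodes except the master (set up elsewhere) */
extern bcs_event_t strobe_event;  /* local event signalled by the master's broadcast */

/* Master: at every time slice, check that all slaves reached the previous
 * strobe (Compare-And-Write), then broadcast the next strobe (Xfer-And-Signal). */
void master_strobe_loop(void) {
    long slice = 0;
    for (;;) {
        /* Wait until the global strobe counter equals the current slice on all slaves. */
        while (!bcs_compare_and_write(STROBE_VAR, BCS_EQ, slice, all_slaves,
                                      /*do_write=*/false, 0))
            ;  /* spin; a real implementation would back off or yield */

        slice++;
        /* Broadcast the new slice number and signal the strobe event on every slave. */
        bcs_xfer_and_signal(&slice, sizeof slice, all_slaves, NULL, &strobe_event);
        usleep(TIMESLICE_US);
    }
}

/* Slave: poll the local strobe event; when it fires, run this slice's scheduled
 * communication and advance the local copy of the strobe counter. */
void slave_strobe_loop(void) {
    for (;;) {
        if (bcs_test_event(&strobe_event)) {
            /* ... exchange communication requirements, schedule, transmit ... */
        }
    }
}
```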

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 20 CCS-3 P AL Design and Implementation
- Implementation in the NIC
  - Application processes interact with NIC threads: an MPI primitive results in a descriptor posted to the NIC; communications are buffered
  - Cooperative threads running in the NIC synchronize, partially exchange control information, schedule communications, and perform the real transmissions and reduce computations
  - Computation and communication are completely overlapped
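One way to picture the process–NIC interaction: each MPI call posts a small descriptor to the NIC, which the NIC threads later schedule and service within a time slice. The structure below is a hypothetical illustration of such a descriptor, not the actual BCS-MPI layout.

```c
/* Hypothetical communication descriptor posted to the NIC by an MPI call
 * (illustration only; not the actual BCS-MPI data structure). */
#include <stddef.h>

typedef enum { DESC_SEND, DESC_RECV } desc_kind_t;

typedef struct {
    desc_kind_t  kind;    /* send or receive                                     */
    int          peer;    /* MPI rank of the communication partner               */
    int          tag;     /* MPI message tag                                     */
    void        *buffer;  /* user buffer (data is buffered/staged by BCS)        */
    size_t       length;  /* message size in bytes                               */
    int          slice;   /* time slice in which the NIC schedules the transfer  */
    volatile int done;    /* set by the NIC thread when the transfer completes   */
} nic_descriptor_t;
```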

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 21 CCS-3 P AL Design and Implementation
- Non-blocking primitives: MPI_Isend/MPI_Irecv

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 22 CCS-3 P AL Design and Implementation
- Blocking primitives: MPI_Send/MPI_Recv

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 23 CCS-3 P AL Performance Evaluation
BCS-MPI vs. Quadrics MPI — experimental setup:
- Benchmarks and applications: NPB (IS, EP, MG, CG, LU) - Class C; SWEEP3D - 50x50x50; SAGE - timing.input
- Scheduling parameters: 500 μs communication scheduling time slice (1 rail); 250 μs communication scheduling time slice (2 rails)

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 24 CCS-3 P AL Performance Evaluation
Benchmarks and applications (Class C) — slowdown of BCS-MPI relative to Quadrics MPI (negative values indicate a speedup):
  Application          Slowdown
  IS       (32 PEs)     10.40%
  EP       (49 PEs)      5.35%
  MG       (32 PEs)      4.37%
  CG       (32 PEs)     10.83%
  LU       (32 PEs)     15.04%
  SWEEP3D  (49 PEs)     -2.23%
  SAGE     (62 PEs)     -0.42%

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 25 CCS-3 P AL Performance Evaluation
[Plot: SAGE - timing.input (IA32); BCS-MPI shows a 0.5% speedup.]

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 26 CCS-3 P AL Blocking Communication
[Plot: blocking vs. non-blocking SWEEP3D (IA32), replacing MPI_Send/MPI_Recv with MPI_Isend/MPI_Irecv + MPI_Waitall.]
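At the application level, the transformation named on the slide is the standard rewrite of a blocking exchange into its non-blocking form; a minimal sketch follows (the message size and neighbour ranks are illustrative).

```c
/* Rewriting a blocking exchange as non-blocking (illustration of the transformation). */
#include <mpi.h>

#define N 1024

/* Blocking version: MPI_Send / MPI_Recv.  With this naive ordering the
 * exchange serializes neighbours and can even deadlock for large messages. */
void exchange_blocking(double *sendbuf, double *recvbuf, int left, int right) {
    MPI_Send(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

/* Non-blocking version: MPI_Isend / MPI_Irecv + MPI_Waitall.  Both transfers
 * are posted first and completed together, allowing communication to overlap. */
void exchange_nonblocking(double *sendbuf, double *recvbuf, int left, int right) {
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```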

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 27 CCS-3 P AL Fault Tolerance A hot research topic in academia, industry, and federal agencies

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 28 CCS-3 P AL Fault Tolerance Today
Fault tolerance is commonly achieved, if at all, by:
- Checkpointing
- Segmentation of the machine
- Removal of fault-prone components
Massive hardware redundancy is not considered economically feasible.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 29 CCS-3 P AL Checkpointing
There are numerous schemes for checkpointing:
- User initiated
- System initiated
- By hardware (proposed)
checkpointing of:
- user-specified data
- application image
- modified data

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 30 CCS-3 P AL Checkpointing (cont’d)
Most common (in our environment, at least) is user-initiated checkpointing of user-specified data.
Pro:
- Simple in concept
Cons:
- Effort and care by the programmer, particularly for restart
- Error-prone—needed state may not be captured
- Opportunistically chosen program points, coarse granularity
  - Severe rollback penalty
  - Bursty I/O
- Worsens with scale as MTBF decreases
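For reference, application-level, user-initiated checkpointing of user-specified data typically looks like the sketch below. The file name and the choice of saved state are illustrative, and the restart path is exactly where the effort and the errors mentioned above tend to concentrate.

```c
/* Minimal sketch of user-initiated checkpointing of user-specified data
 * (illustrative; real codes must also save RNG state, iteration counters,
 * boundary data, etc., which is where restart bugs creep in). */
#include <stdio.h>
#include <stdlib.h>

#define CKPT_FILE "ckpt.dat"   /* hypothetical per-process checkpoint file name */

int checkpoint(const double *field, size_t n, int step) {
    FILE *f = fopen(CKPT_FILE, "wb");
    if (!f) return -1;
    fwrite(&step, sizeof step, 1, f);      /* program point we stopped at */
    fwrite(&n, sizeof n, 1, f);            /* size of the saved array     */
    fwrite(field, sizeof *field, n, f);    /* the user-specified data     */
    fclose(f);
    return 0;
}

int restart(double *field, size_t n, int *step) {
    size_t saved_n;
    FILE *f = fopen(CKPT_FILE, "rb");
    if (!f) return -1;                     /* no checkpoint: start from scratch */
    if (fread(step, sizeof *step, 1, f) != 1 ||
        fread(&saved_n, sizeof saved_n, 1, f) != 1 ||
        saved_n != n ||
        fread(field, sizeof *field, n, f) != n) { fclose(f); return -1; }
    fclose(f);
    return 0;
}
```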

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 31 CCS-3 P AL Checkpointing (cont’d)
Defensive I/O accounts for ~80% of I/O on ASCI machines. This biases the relative importance (cost) of the I/O subsystem.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 32 CCS-3 P AL Segmentation of Machine The procrustean approach: segment the machine. Divide capability for capacity!

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 33 CCS-3 P AL Elimination of Fault-Prone Components
- Cluster management software using the control network eliminates the need for floppy or optical drives on every node
- Eliminate hard disks
  - Makes checkpointing yet more expensive
- DRAM: not straightforward

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 34 CCS-3 P AL Our Approach to Fault Tolerance
- Our contribution is to show that scalable, system-level fault tolerance is within reach with current technology, and can be achieved with low overhead through a global operating system
- Two results provide the basis for this claim:
  1. Buffered CoScheduling, which enforces frequent, global recovery lines and global control
  2. The feasibility of incremental checkpointing

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 35 CCS-3 P AL Checkpointing and Recovery
Saving process state:
- Simplicity: easy implementation
- Cost-effective: no additional hardware support
Critical aspect: bandwidth requirements

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 36 CCS-3 P AL Reducing Bandwidth
- Incremental checkpointing: only the memory modified since the previous checkpoint is saved to stable storage
[Diagram: a full checkpoint saves the entire process state; an incremental checkpoint saves only the modified portion.]
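A common way to realize incremental checkpointing without extra hardware is page-granularity dirty tracking: write-protect the checkpointed region and catch the first write to each page. The POSIX sketch below illustrates this generic technique; it is not the specific mechanism used in the study.

```c
/* Sketch of page-granularity dirty tracking for incremental checkpointing
 * (generic POSIX technique, illustration only). */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_PAGES 256

static char  *region;                     /* checkpointed memory region */
static size_t page_size;
static unsigned char dirty[REGION_PAGES]; /* one flag per page          */

static void segv_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    char *addr = (char *)info->si_addr;
    size_t page = (size_t)(addr - region) / page_size;
    if (addr >= region && page < REGION_PAGES) {
        dirty[page] = 1;   /* first write to this page since the last checkpoint */
        mprotect(region + page * page_size, page_size, PROT_READ | PROT_WRITE);
    } else {
        _exit(1);          /* a genuine fault elsewhere */
    }
}

/* Write-protect the region so the next write to each page is caught. */
static void start_interval(void) {
    memset(dirty, 0, sizeof dirty);
    mprotect(region, REGION_PAGES * page_size, PROT_READ);
}

/* Save only the dirty pages (here: just count them), then start a new interval. */
static void incremental_checkpoint(void) {
    size_t n = 0;
    for (size_t i = 0; i < REGION_PAGES; i++)
        if (dirty[i]) n++;   /* a real system writes these pages to stable storage */
    printf("checkpoint: %zu of %d pages dirty\n", n, REGION_PAGES);
    start_interval();
}

int main(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    region = mmap(NULL, REGION_PAGES * page_size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = {0};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = segv_handler;
    sigaction(SIGSEGV, &sa, NULL);

    start_interval();
    region[0] = 1;                 /* touch two pages */
    region[10 * page_size] = 2;
    incremental_checkpoint();      /* reports 2 dirty pages */
    return 0;
}
```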

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 37 CCS-3 P AL Enabling Automatic Checkpointing
[Diagram: spectrum of checkpointing implementations — hardware, operating system, run-time library, application — trading off user intervention (low to high) against the amount of checkpoint data; lower levels are more automatic.]

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 38 CCS-3 P AL The Bandwidth Challenge
Does current technology provide enough bandwidth for frequent, automatic checkpointing?

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 39 CCS-3 P AL Methodology
- Quantifying the bandwidth requirements
  - Checkpoint intervals: 1 s to 20 s
  - Comparing with the bandwidth available today:
    - Sustained network bandwidth (Quadrics QsNet II): 900 MB/s
    - Single sustained disk bandwidth (Ultra SCSI controller): 75 MB/s

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 40 CCS-3 P AL Experimental Environment
- 32-node Linux cluster
  - 64 Itanium II processors
  - PCI-X I/O bus
  - Quadrics QsNet interconnection network
- Parallel scientific codes
  - Sage
  - Sweep3D
  - NAS parallel benchmarks: SP, LU, BT and FT
  - Representative of the ASCI production codes at LANL

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 41 CCS-3 P AL Memory Footprint
  Code          Footprint
  Sage-1000MB   954.6 MB
  Sage-500MB    497.3 MB
  Sage-100MB    103.7 MB
  Sage-50MB      55   MB
  Sweep3D       105.5 MB
  SP Class C     40.1 MB
  LU Class C     16.6 MB
  BT Class C     76.5 MB
  FT Class C    118   MB

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 42 CCS-3 P AL Characterization
[Plot: Sage-1000MB execution profile, showing a data-initialization phase followed by regular processing bursts.]

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 43 CCS-3 P AL Communication Interleaved
[Plot: Sage-1000MB, showing regular communication bursts interleaved with computation.]

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 44 CCS-3 P AL Bandwidth Requirements
[Plot: required checkpoint bandwidth (MB/s) versus time-slice length (s) for Sage-1000MB; the requirement decreases with the time slice, from 78.8 MB/s down to 12.1 MB/s.]
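The relation behind the plot is simply that the required checkpoint bandwidth for a time slice of length $\tau$ is the memory modified during that slice divided by $\tau$:

$$ B_{\mathrm{ckpt}}(\tau) \;=\; \frac{M(\tau)}{\tau} $$

Assuming, as the slide suggests, that 78.8 MB/s corresponds to the 1 s slice and 12.1 MB/s to the 20 s slice, Sage-1000MB modifies roughly $M(1\,\mathrm{s}) \approx 79$ MB in one second but only about $M(20\,\mathrm{s}) \approx 242$ MB in twenty: because the same pages tend to be re-dirtied within a slice, $M(\tau)$ grows sublinearly and the required bandwidth falls. Both points sit far below the 900 MB/s network figure, while the 1 s point exceeds the 75 MB/s of a single SCSI disk.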

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 45 CCS-3 P AL Bandwidth Requirements for 1 Second
[Plot: required bandwidth for a 1 s checkpoint interval across the codes; it increases with memory footprint, and the most demanding case exceeds single SCSI disk performance.]

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 46 CCS-3 P AL Increasing Memory Footprint Size
[Plot: average required bandwidth (MB/s) versus time slice (s) for increasing memory footprints; the requirement increases sublinearly with footprint.]

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 47 CCS-3 P AL Increasing Processor Count
[Plot: average required bandwidth (MB/s) versus time slice (s) under weak scaling; the requirement decreases slightly with processor count.]

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 48 CCS-3 P AL Technological Trends
[Chart: performance improvement per year; application performance is bounded by memory improvements, while the other components shown improve at a faster pace.]

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 49 CCS-3 P AL Resource Management
Seeking more effective use of cluster resources:
- Reduced response time
- Greater throughput

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 50 CCS-3 P AL STORM
STORM (Scalable TOol for Resource Management):
- Based on buffered coscheduling
- Easy to port
- Enables resource management to exploit low-level network features
- Is orders of magnitude faster than the best results reported in the literature

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 51 CCS-3 P AL State of the Art in Resource Management
Resource managers (e.g. PBS, LSF, RMS, LoadLeveler, Maui) are typically implemented using:
- TCP/IP—favors portability over performance
- Poorly scaling algorithms for the distribution/collection of data and control messages—favors development time over performance
Scalable performance is not important for small clusters but is crucial for large ones. There is a need for fast and scalable resource management.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 52 CCS-3 P AL Observation If the cluster has a powerful, scalable network, why aren’t we using it?
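One hedged answer, reusing the hypothetical primitive interfaces sketched earlier (the names and structure below do not reflect STORM's actual code): a launcher can broadcast the binary image with a single Xfer-And-Signal and detect global completion with Compare-And-Write, never touching TCP/IP.

```c
/* Hypothetical job launch using the BCS-style primitives (illustration only;
 * reuses the interfaces sketched earlier, not STORM's actual implementation). */
#include <stddef.h>

extern node_set_t  all_nodes;        /* every compute node in the partition */
extern bcs_event_t binary_received;  /* remote event signalled on each node */

#define LAUNCH_DONE_VAR 1            /* id of a hypothetical global flag */

void launch_job(const void *binary_image, size_t image_size) {
    /* 1. Broadcast the binary image to all nodes in one collective transfer,
     *    signalling an event on each destination when its copy has arrived. */
    bcs_xfer_and_signal(binary_image, image_size, all_nodes, NULL, &binary_received);

    /* 2. Each node, on seeing the event, sets LAUNCH_DONE_VAR = 1 locally and
     *    starts the binary.  The management node polls the global flag until it
     *    is 1 everywhere; the comparison is combined in the network switches. */
    while (!bcs_compare_and_write(LAUNCH_DONE_VAR, BCS_EQ, 1, all_nodes,
                                  /*do_write=*/false, 0))
        ;  /* spin; a real launcher would time out and report stragglers */
}
```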

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 53 CCS-3 P AL Experimental Results
- 64-node/256-processor ES40 AlphaServer cluster
- 2 independent network rails of Quadrics Elan3
- Files are placed in a ramdisk in order to avoid I/O bottlenecks and expose the performance of the resource management algorithms

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 54 CCS-3 P AL Launch times (unloaded system)
The launch time is constant when we increase the number of processors: STORM is highly scalable.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 55 CCS-3 P AL Launch times (loaded system, 12 MB)
In the worst case it still takes only 1.5 seconds to launch a 12 MB file on 256 processors.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 56 CCS-3 P AL Measured and estimated launch times
The model shows that on an ES40-based AlphaServer a 12 MB binary can be launched in 135 ms on 16,384 nodes.

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 57 CCS-3 P AL Measured and predicted performance of existing job launchers

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 58 CCS-3 P AL Acknowledgements
- José Moreira and György Almási (BlueGene/L)
- Ron Brightwell (ASCI Red Storm)
- Mark Seager (ASCI Thunder)
- Paul Terry (Cray XD1)
- Srinidhi Varadarajan (System X)

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 59 CCS-3 P AL PAL TEAM
- Juan Fernández Peinador (BCS-MPI)
- Eitan Frachtenberg (STORM)
- José Carlos Sancho (Fault tolerance)
- Salvador Coll (Collective communication)
- Scott Pakin (Noise analysis and STORM)
- Darren Kerbyson (Noise analysis)
- Adolfy Hoisie (team leader)

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 60 CCS-3 P AL References
- PAL team web page:
- Fabrizio’s web page:

Kei Davis and Fabrizio Petrini Europar 2004, Pisa Italy 61 CCS-3 P AL About the authors
Kei Davis is a team leader and technical staff member at Los Alamos National Laboratory (LANL), where he is currently working on system software solutions for the reliability and usability of large-scale parallel computers. Previous work at LANL includes computer system performance evaluation and modeling, large-scale computer system simulation, and parallel functional language implementation. His research interests are centered on parallel computing; more specifically, various aspects of operating systems, parallel programming, and programming language design and implementation. Kei received his PhD in Computing Science from Glasgow University and his MS in Computation from Oxford University. Before his appointment at LANL he was a research scientist at the Computing Research Laboratory at New Mexico State University.
Fabrizio Petrini is a member of the technical staff of the CCS-3 group at Los Alamos National Laboratory (LANL). He received his PhD in Computer Science from the University of Pisa in Before his appointment at LANL he was a research fellow of the Computing Laboratory of Oxford University (UK), a postdoctoral researcher at the University of California at Berkeley, and a member of the technical staff of the Hewlett-Packard Laboratories. His research interests include various aspects of supercomputers, including high-performance interconnection networks and network interfaces, job scheduling algorithms, parallel architectures, operating systems, and parallel programming languages. He has received numerous awards from the NNSA for contributions to supercomputing projects, and from other organizations for scientific publications.