Slide 1: Architecture of Parallel Computers, CSC / ECE 506
BlueGene Architecture, 4/26/2007, Dr. Steve Hunter

Slide 2: BlueGene/L Program
December 1999: IBM Research announced a five-year, $100M US effort to build a petaflop/s-scale supercomputer to attack science problems such as protein folding. Goals:
– Advance the state of the art of scientific simulation.
– Advance the state of the art in computer design and software for capability and capacity markets.
November 2001: Announced a research partnership with Lawrence Livermore National Laboratory (LLNL).
November 2002: Announced the planned acquisition of a BG/L machine by LLNL as part of the ASCI Purple contract.
May 11, 2004: Four racks of DD1 (4,096 nodes at 500 MHz) ran Linpack at 11.68 TFlop/s, ranked #4 on the 23rd Top500 list.
June 2, 2004: Two racks of DD2 (1,024 nodes at 700 MHz) ran Linpack at 8.655 TFlop/s, ranked #8 on the 23rd Top500 list.
September 16, 2004: Eight racks ran Linpack at 36.01 TFlop/s.
November 8, 2004: Sixteen racks ran Linpack at 70.72 TFlop/s, ranked #1 on the 24th Top500 list.
December 21, 2004: First 16 racks of BG/L accepted by LLNL.

Slide 3: BlueGene/L Program
A massive collection of low-power CPUs instead of a moderate-sized collection of high-power CPUs.
– A joint development of IBM and DOE's National Nuclear Security Administration (NNSA), installed at DOE's Lawrence Livermore National Laboratory.
BlueGene/L has occupied the No. 1 position on the last three TOP500 lists (http://www.top500.org/).
– It has reached a Linpack benchmark performance of 280.6 TFlop/s ("teraflops", or trillions of calculations per second) and remains the only system ever to exceed the 100 TFlop/s level.
– BlueGene/L machines hold the #1 and #3 positions in the top 10.
"The objective was to retain the exceptional cost/performance levels achieved by application-specific machines, while generalizing the massively parallel architecture enough to enable a relatively broad class of applications" - Overview of the Blue Gene/L system architecture, IBM JRD.
– The design approach was to use a very high level of integration, which made simplicity in packaging, design, and bring-up possible.
– The JRD issue is available at http://www.research.ibm.com/journal/rd49-23.html

Slide 4: BlueGene/L Program
BlueGene is a family of supercomputers.
– BlueGene/L is the first step, aimed as a multipurpose, massively parallel, and cost-effective supercomputer (12/04).
– BlueGene/P is the petaflop generation (12/06).
– BlueGene/Q is the third generation (~2010).
Requirements for future generations:
– Processors will be more powerful.
– Networks will have higher bandwidth.
– Applications developed on BlueGene/L will run well on BlueGene/P.

Slide 5: BlueGene/L Fundamentals
Low-complexity nodes give more flops per transistor and per watt.
A 3D interconnect suits many scientific simulations, since nature as we see it is three-dimensional.

Slide 6: BlueGene/L Fundamentals
Cellular architecture
– Large numbers of low-power, more efficient processors interconnected.
Rmax of 280.6 TFlop/s
– Maximal LINPACK performance achieved.
Rpeak of 360 TFlop/s
– Theoretical peak performance.
65,536 dual-processor compute nodes
– 700 MHz IBM PowerPC 440 processors
– 512 MB memory per compute node, 16 TB in the entire system
– 800 TB of disk space
– 2,500 square feet

Slide 7: Comparing Systems (Peak)

Slide 8: Comparing Systems (Byte/Flop)

System                 Byte/Flop                              Year
Red Storm              2.0                                    2003
Earth Simulator        2.0                                    2002
Intel Paragon          1.8                                    1992
nCUBE/2                1.0                                    1990
ASCI Red               1.0 (0.6)                              1997
T3E                    0.8                                    1996
BG/L                   1.5 = 0.75 (torus) + 0.75 (tree)       2004
Cplant                 0.1                                    1997
ASCI White             0.1                                    2000
ASCI Q                 0.05 (Quadrics)                        2003
ASCI Purple            0.1                                    2004
Intel Cluster          0.1 (IB)                               2004
Intel Cluster          0.008 (GbE)                            2003
Virginia Tech          0.16 (IB)                              2003
Chinese Acad. of Sci.  0.04 (QsNet)                           2003
NCSA-Dell              0.04 (Myrinet)                         2003
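As a rough sanity check on the BG/L entry (the per-processor accounting here is an assumption, not stated on the slide): taking the full 2.1 GB/s of torus bandwidth per node against the 2.8 GFlop/s peak of a single compute processor, as in coprocessor mode,

\frac{12 \times 175\ \text{MB/s}}{2.8\ \text{GFlop/s}} = \frac{2.1\ \text{GB/s}}{2.8\ \text{GFlop/s}} \approx 0.75\ \text{byte/flop (torus)}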

Slide 9: Comparing Systems (GFlops/Watt)
Power efficiencies of recent supercomputers (chart from the IBM Journal of Research and Development):
– Blue: IBM machines
– Black: other US machines
– Red: Japanese machines

Slide 10: Comparing Systems

                      ASCI White   ASCI Q   Earth Simulator   Blue Gene/L
Machine peak (TF/s)   12.3         30       40.96             367
Total mem. (TBytes)   8            33       10                32
Footprint (sq ft)     10,000       20,000   34,000            2,500
Power (MW)*           1            3.8      6-8.5             1.5
Cost ($M)             100          200      400               100
# Nodes               512          4,096    640               65,536
Clock (MHz)           375          1,000    500               700

* 10 megawatts is the approximate usage of 11,000 households
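Dividing peak performance by power from the table gives a rough efficiency comparison (approximate, since the power figures are themselves approximate):

\frac{367\ \text{TF/s}}{1.5\ \text{MW}} \approx 0.24\ \text{GFlop/s per watt} \quad \text{vs.} \quad \frac{40.96\ \text{TF/s}}{\approx 7\ \text{MW}} \approx 0.006\ \text{GFlop/s per watt}

That is roughly a 40x advantage at peak for Blue Gene/L over the Earth Simulator.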

Slide 11: BG/L Summary of Performance Results
DGEMM (double-precision general matrix multiply):
– 92.3% of dual-core peak on 1 node
– Observed performance at 500 MHz: 3.7 GFlop/s
– Projected performance at 700 MHz: 5.2 GFlop/s (tested in the lab up to 650 MHz)
LINPACK:
– 77% of peak on 1 node
– 70% of peak on 512 nodes (1,435 GFlop/s at 500 MHz)
sPPM (simplified piecewise parabolic method), UMT2000:
– Single-processor performance roughly on par with a 375 MHz POWER3
– Tested on up to 128 nodes (also the NAS Parallel Benchmarks)
FFT (fast Fourier transform):
– Up to 508 MFlop/s on a single processor at 444 MHz (TU Vienna)
– Pseudo-op performance (5N log N) at 700 MHz of 1,300 MFlop/s (65% of peak)
STREAM – impressive results even at 444 MHz (the standard triad kernel is sketched after this slide):
– Tuned: Copy 2.4 GB/s, Scale 2.1 GB/s, Add 1.8 GB/s, Triad 1.9 GB/s
– Standard: Copy 1.2 GB/s, Scale 1.1 GB/s, Add 1.2 GB/s, Triad 1.2 GB/s
– At 700 MHz these would beat the STREAM numbers for most high-end microprocessors
MPI:
– Latency: < 4,000 cycles (5.5 µs at 700 MHz)
– Bandwidth: full link bandwidth demonstrated on up to 6 links
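For reference, the STREAM triad measured above is, in its standard form, just the following loop; this is a generic sketch of the public benchmark kernel, not IBM's tuned BG/L version:

#include <stddef.h>

/* Standard STREAM "triad" kernel: a[i] = b[i] + scalar * c[i].
   Reported bandwidth counts 3 * n * sizeof(double) bytes moved per pass
   (two arrays read, one written). */
void stream_triad(double *a, const double *b, const double *c,
                  double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}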

Slide 12: BlueGene/L Architecture
To achieve this level of integration, the machine was developed around a processor of moderate frequency, available in system-on-a-chip (SoC) technology.
– This approach was chosen because of its performance/power advantage.
– In terms of performance per watt, the low-frequency, low-power, embedded IBM PowerPC core consistently outperforms high-frequency, high-power microprocessors by a factor of 2 to 10.
– The industry focuses on performance per rack (see the worked example after this slide):
» Performance / rack = Performance / watt × Watt / rack
» Watt / rack ≈ 20 kW, for power and thermal (cooling) reasons
Power and cooling
– Using conventional techniques, a 360 TFlop/s machine would require 10-20 megawatts.
– BlueGene/L uses only 1.76 megawatts.
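Plugging rough BG/L numbers into the relation above (an illustrative estimate, not a figure from the slide): a rack holds 1,024 nodes at 5.6 GFlop/s peak per node, so

1024 \times 5.6\ \text{GFlop/s} \approx 5.7\ \text{TFlop/s per rack}, \qquad \frac{5.7\ \text{TFlop/s}}{20\ \text{kW}} \approx 0.29\ \text{GFlop/s per watt}

which is consistent with the 367 TF/s, 65,536-node system in the comparison table.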

Slide 13: Microprocessor Power Density Growth

Slide 14: System Power Comparison

Slide 15: BlueGene/L Architecture
The networks were chosen with extreme scaling in mind:
– They scale efficiently in terms of both performance and packaging.
– They support very small messages:
» As small as 32 bytes.
– They include hardware support for collective operations:
» Broadcast, reduction, scan, etc.
Reliability, availability, and serviceability (RAS) is another critical issue for scaling:
– BG/L needs to be reliable and usable even at extreme scaling limits.
– 20 failures per 1,000,000,000 node-hours works out to about one node failure every 4.5 weeks across the full machine (see the arithmetic after this slide).
System software and monitoring are also important to scaling:
– BG/L is designed to efficiently utilize a distributed-memory, message-passing programming model.
– MPI is the dominant message-passing model, with hardware features added and parameters tuned for it.
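The 4.5-week figure follows directly from the stated failure rate, assuming it is per node and that all 65,536 compute nodes are counted:

\frac{10^{9}\ \text{node-hours}}{20 \times 65{,}536\ \text{nodes}} \approx 763\ \text{hours} \approx 4.5\ \text{weeks between node failures, system-wide}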

Slide 16: RAS (Reliability, Availability, Serviceability)
The system is designed for RAS from top to bottom.
– System issues:
» Redundant bulk supplies, power converters, fans, DRAM bits, and cable bits.
» Extensive data logging (voltage, temperature, recoverable errors, etc.) for failure forecasting.
» Nearly no single points of failure.
– Chip design:
» ECC on all SRAMs.
» All dataflow outside the processors is protected by error-detection mechanisms.
» Access to all state via a noninvasive back door.
– The low-power, simple design leads to higher reliability.
– All interconnects have multiple error detection and correction coverage:
» Virtually zero escape probability for link errors.

Slide 17: BlueGene/L System
136.8 TFlop/s on LINPACK (64K processors). 1 TFlop/s = 1,000,000,000,000 flop/s.
Rochester Lab, 2005.

Slide 18: BlueGene/L System

Slide 19: BlueGene/L System

Slide 20: BlueGene/L System

Slide 21: Physical Layout of BG/L

Slide 22: Midplanes and Racks

Slide 23: The Compute Chip
System-on-a-chip (SoC): one ASIC containing
– 2 PowerPC processors
– L1 and L2 caches
– 4 MB embedded DRAM
– DDR DRAM interface and DMA controller
– Network connectivity hardware
– Control / monitoring equipment (JTAG)

Slide 24: Compute Card

Slide 25: Node Card

Slide 26: BlueGene/L Compute ASIC
– IBM CU-11, 0.13 µm process
– 11 x 11 mm die size
– 25 x 32 mm CBGA package
– 474 pins, 328 signal
– 1.5 / 2.5 Volt

Slide 27: BlueGene/L Interconnect Networks
3-Dimensional Torus
– Main network, for point-to-point communication
– High speed, high bandwidth
– Interconnects all 65,536 compute nodes
– Virtual cut-through hardware routing
– 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
– 1 µs latency between nearest neighbors, 5 µs to the farthest node
– 4 µs latency for one hop with MPI, 10 µs to the farthest node
– Communications backbone for computations
– 0.7 / 1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth
Global Tree
– One-to-all broadcast functionality
– Reduction operations functionality
– MPI collective ops in hardware
– Fixed-size 256-byte packets
– 2.8 Gb/s of bandwidth per link
– One-way tree traversal latency of 2.5 µs
– ~23 TB/s total binary tree bandwidth (64K machine)
– Interconnects all compute and I/O nodes (1,024 I/O nodes)
– Also guarantees reliable delivery
Ethernet
– Incorporated into every node ASIC
– Active in the I/O nodes (1:64 ratio of I/O to compute nodes)
– Handles all external communication (file I/O, control, user interaction, etc.)
Low-Latency Global Barrier and Interrupt
– Round-trip latency of 1.3 µs
Control Network

Slide 28: The Torus Network
3-dimensional: 64 x 32 x 32
– Each compute node is connected to its six neighbors: x+, x-, y+, y-, z+, z-
– A compute card is 1 x 2 x 1
– A node card is 4 x 4 x 2 (16 compute cards in a 4 x 2 x 2 arrangement)
– A midplane is 8 x 8 x 8 (16 node cards in a 2 x 2 x 4 arrangement)
Communication path
– Each unidirectional link is 1.4 Gb/s, or 175 MB/s
– Each node can send and receive at 1.05 GB/s
– Supports cut-through routing, along with both deterministic and adaptive routing
– Variable-sized packets of 32, 64, 96, ... 256 bytes
– Guarantees reliable delivery
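A common way for application code to work with this kind of topology is MPI's Cartesian communicator support, which lets the MPI library map ranks onto a periodic 3D grid; the following is a generic MPI sketch (not BG/L-specific code from the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Describe a 3D torus: all three dimensions wrap around (periodic). */
    int dims[3]    = {0, 0, 0};   /* let MPI choose a factorization       */
    int periods[3] = {1, 1, 1};   /* periodic in x, y, z => a torus       */
    int nprocs, me;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 3, dims);

    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1 /* allow reorder */, &torus);

    /* Each rank finds its coordinates and its -x / +x neighbors. */
    int coords[3], left, right;
    MPI_Comm_rank(torus, &me);
    MPI_Cart_coords(torus, me, 3, coords);
    MPI_Cart_shift(torus, 0 /* x dimension */, 1, &left, &right);

    printf("rank %d at (%d,%d,%d): -x neighbor %d, +x neighbor %d\n",
           me, coords[0], coords[1], coords[2], left, right);

    MPI_Comm_free(&torus);
    MPI_Finalize();
    return 0;
}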

Slide 29: Complete BlueGene/L System at LLNL
(System diagram.) The 65,536 BG/L compute nodes and 1,024 BG/L I/O nodes connect through a 2,048-port federated Gigabit Ethernet switch to the front-end nodes, the service node and its control network, the cluster-wide file system (CWFS), visualization and archive systems, and the WAN.

Slide 30: System Software Overview
– Operating system: Linux
– Compilers: IBM XL C, C++, Fortran95
– Communication: MPI, TCP/IP
– Parallel file system: GPFS, with NFS support
– System management: extensions to CSM
– Job scheduling: based on LoadLeveler
– Math libraries: ESSL

Slide 31: BG/L Software Hierarchical Organization
– Compute nodes are dedicated to running the user application, and almost nothing else: a simple compute node kernel (CNK).
– I/O nodes run Linux and provide a more complete range of OS services: files, sockets, process launch, signaling, debugging, and termination.
– The service node performs system management services (e.g., heartbeating, monitoring errors), transparently to application software.

Slide 32: BG/L System Software
Simplicity
– Space-sharing
– Single-threaded
– No demand paging
Familiarity
– MPI (MPICH2)
– IBM XL compilers for PowerPC

Slide 33: Operating Systems
– Front-end nodes are commodity systems running Linux.
– I/O nodes run a customized Linux kernel.
– Compute nodes use an extremely lightweight custom kernel.
– The service node is a single multiprocessor machine running a custom OS.

Slide 34: Compute Node Kernel (CNK)
– Single-user, dual-threaded
– Flat address space, no paging
– Physical resources are memory-mapped
– Provides standard POSIX functionality (mostly)
– Two execution modes:
» Virtual node mode (both processors run application processes)
» Coprocessor mode (one processor computes, the other handles communication)

Slide 35: Service Node OS
Core Management and Control System (CMCS): BG/L's "global" operating system.
– MMCS: Midplane Monitoring and Control System
– CIOMAN: Control and I/O Manager
– DB2 relational database

Slide 36: Running a User Job
– The job is compiled on, and submitted from, a front-end node via an external scheduler.
– The service node sets up a partition and transfers the user's code to the compute nodes.
– All file I/O is done using standard Unix calls (via the I/O nodes).
– Post-facto debugging is done on the front-end nodes.
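From the application's point of view this forwarding is invisible: compute-node code just makes ordinary POSIX calls, and the kernel passes them to an I/O node, which performs them against the parallel file system. A minimal sketch in generic C (the file path shown is hypothetical):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* An ordinary POSIX open/write; on BG/L the compute node kernel
       forwards such calls to its I/O node, which carries them out
       (e.g., against GPFS). The path below is only an example. */
    const char msg[] = "checkpoint written\n";
    int fd = open("/gpfs/scratch/run.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (write(fd, msg, strlen(msg)) < 0)
        perror("write");
    close(fd);
    return 0;
}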

Slide 37: Performance Issues
– User code is easily ported to BG/L.
– However, an efficient MPI implementation requires effort and skill:
» Torus topology instead of a crossbar
» Special hardware, such as the collective (tree) network

Slide 38: BG/L MPI Software Architecture
Acronyms used in the architecture diagram:
– GI: Global Interrupt
– CIO: Control and I/O protocol
– CH3: the primary communication device distributed with MPICH2
– MPD: Multipurpose Daemon

Slide 39: MPI_Bcast
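For reference, the collective presumably charted on this slide is the standard MPI broadcast, which on BG/L can be carried by the global tree network described earlier; the code below is a generic MPI usage sketch, not BG/L-specific:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Root (rank 0) fills the buffer; MPI_Bcast delivers it to every rank. */
    double params[4] = {0.0, 0.0, 0.0, 0.0};
    if (rank == 0) {
        params[0] = 1.0; params[1] = 2.0; params[2] = 3.0; params[3] = 4.0;
    }
    MPI_Bcast(params, 4, MPI_DOUBLE, 0 /* root */, MPI_COMM_WORLD);

    printf("rank %d received params[0] = %g\n", rank, params[0]);

    MPI_Finalize();
    return 0;
}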

Slide 40: MPI_Alltoall

Slide 41: References
IBM Journal of Research and Development, Vol. 49, No. 2/3.
– http://www.research.ibm.com/journal/rd49-23.html
» "Overview of the Blue Gene/L system architecture"
» "Packaging the Blue Gene/L supercomputer"
» "Blue Gene/L compute chip: Memory and Ethernet subsystems"
» "Blue Gene/L torus interconnection network"
» "Blue Gene/L programming and operating environment"
» "Design and implementation of message-passing services for the Blue Gene/L supercomputer"

Slide 42: References (cont.)
– BG/L homepage @ LLNL:
– BlueGene homepage @ IBM:

Slide 43: The End

