Top 500 Computers
Federated Distributed Systems
Anda Iamnitchi

Objectives and Outline
Question to answer by the end: what are the differences between clusters and supercomputers?
Top 500 computers:
– The list
– How is it decided? (Linpack benchmark)
– Some historical data
IBM Blue Gene
Earth Simulator

System components: BG/L
Objective: 65,536 nodes produced in 130-nm copper IBM CMOS 8SFG technology.
Each node: a single application-specific integrated circuit (ASIC) with two processors and nine double-data-rate (DDR) synchronous dynamic random access memory (SDRAM) chips.
The system-on-a-chip (SoC) ASIC that powers the node incorporates all of the functionality needed by BG/L. It also contains 4 MB of extremely high-bandwidth embedded DRAM [15] that is on the order of 30 cycles from the registers on most L1/L2 cache misses.
Nodes are physically small, allowing for very high packaging density.

System components: BG/L (cont.)
Power is critical: the densities achieved are a factor of 2 to 10 greater than those available with traditional high-frequency uniprocessors.
System packaging: 512 processing nodes, each with a peak performance of 5.6 Gflops, on a double-sided board, or midplane (20 in. by 25 in.).
Each node contains two processors, which makes it possible to vary the running mode.
Each compute node runs a small operating system that handles basic I/O tasks and all functions necessary for high-performance code.
For file systems, compiling, diagnostics, analysis, and service of BG/L, an external host computer (or computers) is required. The I/O nodes contain a software layer, above the layer on the compute nodes, that handles communication with the host.
The machine space is partitioned so that each user has a dedicated set of nodes for their application, including dedicated network resources. This partitioning is used by a resource allocation system, which places user jobs on hardware partitions in a manner consistent with the hardware constraints.
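A quick back-of-the-envelope check implied by these figures (the arithmetic is added here, not stated on the slide): with 5.6 Gflops per node,

\[
512 \times 5.6~\text{Gflops} \approx 2.9~\text{Tflops per midplane}, \qquad
65{,}536 \times 5.6~\text{Gflops} \approx 367~\text{Tflops system peak}.
\]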

Component | Description | No. per rack
Node card | 16 compute cards, two I/O cards | 32
Compute card | Two compute ASICs, DRAM | 512
I/O card | Two compute ASICs, DRAM, Gigabit Ethernet | 2 to 64, selectable
Link card | Six-port ASICs | 8
Service card | One to twenty clock fan-out, two Gigabit Ethernet to 22 Fast Ethernet fan-out, miscellaneous rack functions, 2.5/3.3-V persistent dc | 2
Midplane | 16 node cards | 2
Clock fan-out card | One to ten clock fan-out, with and without master oscillator | 1
Fan unit | Three fans, local control | 20
Power system | ac/dc | 1
Compute rack | With fans, ac/dc power | 1
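The table is internally consistent with the 65,536-node target; a quick check (added here, not part of the original table):

\[
32~\text{node cards} \times 16~\text{compute cards} \times 2~\text{ASICs} = 1024~\text{compute nodes per rack}, \qquad
\frac{65{,}536}{1024} = 64~\text{racks}.
\]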

Nodes are interconnected through five networks, the most significant of which is a 64 × 32 × 32 three-dimensional torus that has the highest aggregate bandwidth and handles the bulk of all communication. There are virtually no asymmetries in this interconnect. This allows for a simple programming model because there are no edges in a torus configuration.
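A consistency check implied by the torus dimensions (added here, not stated on the slide): the torus spans exactly the full node count, and each node in a 3D torus has six nearest-neighbor links.

\[
64 \times 32 \times 32 = 65{,}536~\text{nodes}, \qquad 6~\text{torus links per node}~(\pm x, \pm y, \pm z).
\]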

Principles:
1. Simplicity, for ease of development and high reliability, e.g.:
– One job per partition (hence, no protection for communication)
– One thread per processor (hence, more deterministic computation and scalability)
– No virtual memory (hence, no page faults, thus more deterministic computation)
2. Performance
3. Familiarity: programming environments popular with scientists (MPI), for portability.
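To make item 3 concrete, here is a minimal sketch of the flat execution model described above: one single-threaded MPI process per processor, no virtual memory tricks. This is generic MPI code, not anything BG/L-specific.

/* Minimal MPI sketch: one single-threaded process per processor,
 * matching the "one thread per processor" model on this slide.
 * Generic MPI, not BG/L-specific code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id within the job's partition */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total processes in the partition */
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}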

The Earth Simulator: Development Schedule

Development Goals
A distributed-memory parallel computing system in which 640 processor nodes are interconnected by a single-stage crossbar network.
Processor node: 8 vector processors with shared memory.
Peak performance: 40 Tflops.
Total main memory: 10 TB.
Inter-node data transfer rate: 12.3 GB/s × 2.
Target for sustained performance: 5 Tflops using an atmospheric general circulation model (AGCM).
*This target was roughly one thousand times the AGCM performance achievable at the time (1997).

System Configuration
The ES is a highly parallel vector supercomputer system of the distributed-memory type, consisting of 640 processor nodes (PNs) connected by 640 × 640 single-stage crossbar switches. Each PN is a shared-memory system consisting of 8 vector-type arithmetic processors (APs), a 16-GB main memory system (MS), a remote access control unit (RCU), and an I/O processor. The peak performance of each AP is 8 Gflops. The ES as a whole thus consists of 5120 APs with 10 TB of main memory and a theoretical peak performance of 40 Tflops.
Peak performance/AP: 8 Gflops; total number of APs: 5120
Peak performance/PN: 64 Gflops; total number of PNs: 640
Shared memory/PN: 16 GB; total main memory: 10 TB
Total peak performance: 40 Tflops
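These totals follow directly from the per-node figures (arithmetic added here):

\[
640 \times 8 = 5120~\text{APs}, \qquad
5120 \times 8~\text{Gflops} = 40.96~\text{Tflops} \approx 40~\text{Tflops}, \qquad
640 \times 16~\text{GB} \approx 10~\text{TB}.
\]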

Each PN is a shared-memory system with:
– 8 vector-type arithmetic processors (APs), peak performance 8 Gflops each
– a 16-GB main memory system (MS)
– a remote access control unit (RCU)
– an I/O processor
Thus: 5120 APs with 10 TB of main memory and a theoretical peak performance of 40 Tflops.
Highly parallel vector supercomputer system with distributed memory: 640 processor nodes (PNs) connected by 640 × 640 single-stage crossbar switches.

The RCU is directly connected to the crossbar switches and controls inter-node data communication at a 12.3 GB/s bidirectional transfer rate (for both sending and receiving data). Thus the total bandwidth of the inter-node network is about 8 TB/s.
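The quoted aggregate is consistent with the per-node rate (arithmetic added here):

\[
640~\text{PNs} \times 12.3~\text{GB/s} \approx 7.9~\text{TB/s} \approx 8~\text{TB/s}.
\]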

The number of cables connecting the PN cabinets and the IN cabinets is 640 × 130 = 83,200, and their total length is about 2,400 km.
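Dividing the two figures gives the average cable length (derived here, not stated on the slide):

\[
\frac{2{,}400~\text{km}}{83{,}200~\text{cables}} \approx 29~\text{m per cable}.
\]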

Figure 1: Super Cluster System of ES. A hierarchical management system is introduced to control the ES. Every 16 nodes are grouped into a cluster, so there are 40 clusters in total. A cluster called the "S-cluster" is dedicated to interactive processing and small-scale batch jobs; a job that fits within one node can be processed on the S-cluster. The remaining clusters, called "L-clusters", are for medium-scale and large-scale batch jobs; parallel jobs spanning several nodes are executed on these clusters. Each cluster has a cluster control station (CCS), which monitors the state of its nodes and controls the power of the nodes belonging to the cluster. A super cluster control station (SCCS) integrates and coordinates the operations of all the CCSs.

Parallel File System
If a large parallel job running on 640 PNs reads from or writes to a single disk installed in one PN, each PN accesses that disk in sequence and performance degrades terribly. Local I/O, in which each PN reads from or writes to its own disk, solves that problem, but managing such a large number of partial files is very hard. Therefore, parallel I/O is strongly required in the ES, from the point of view of both performance and usability.
The parallel file system (PFS) provides the parallel I/O features of the ES (Figure 2). It enables multiple files, located on separate disks of multiple PNs, to be handled as logically one large file. Each process of a parallel program can read or write distributed data from or to the parallel file concurrently, with one I/O statement, achieving high performance and usability of I/O.
Figure 2: Parallel File System (PFS). A parallel file, i.e., a file on the PFS, is striped and stored cyclically, in the specified blocking size, onto the disk of each PN. When a program accesses the file, the File Access Library (FAL) sends an I/O request via the IN to the File Server on the node that owns the data to be accessed.
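The ES exposes this through its own File Access Library (FAL), whose API is not reproduced here. As a hedged illustration of the same "one I/O statement per process" idea, the sketch below uses standard MPI-IO collective writes to a single shared file; the file name parallel.dat and the block size N are arbitrary choices for the example.

/* Sketch of parallel I/O to one logical file: every process issues a
 * single collective write of its own block. This uses standard MPI-IO
 * as an analogy for the ES PFS/FAL, not the ES API itself. */
#include <mpi.h>

#define N 1024  /* elements owned by each process (assumed block size) */

int main(int argc, char **argv)
{
    int rank;
    double local[N];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < N; i++)
        local[i] = (double)rank;           /* fill this rank's block with dummy data */

    /* All processes open the same logical file... */
    MPI_File_open(MPI_COMM_WORLD, "parallel.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* ...and each writes its block at its own offset with one collective
     * call; the underlying file system stripes the data across disks. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * N * sizeof(double),
                          local, N, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}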

Two types of queues:
– L: production runs
– S: single-node batch jobs (pre- and post-processing: creating initial data, processing simulation results, and other tasks)

Programming Model in ES
The ES hardware has a 3-level hierarchy of parallelism: vector processing within an AP, parallel processing with shared memory within a PN, and parallel processing among PNs via the IN. To bring out the full performance of the ES, one must develop parallel programs that make the most of all three levels.
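A minimal sketch of how the three levels map onto code, assuming a generic MPI + OpenMP toolchain rather than the ES's own vectorizing compiler and microtasking (the array size N below is arbitrary):

/* Three levels of parallelism, illustrated with generic MPI + OpenMP:
 *   level 3: distributed-memory parallelism across PNs (MPI, via the IN)
 *   level 2: shared-memory parallelism inside a PN (OpenMP threads)
 *   level 1: a vectorizable inner loop on each AP
 * This is an analogy, not ES-specific code. */
#include <mpi.h>
#include <omp.h>

#define N 100000  /* per-process problem size (assumed) */

static double a[N], b[N], c[N];

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);        /* level 3: one MPI process per node */

    #pragma omp parallel for       /* level 2: threads share the node's memory */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + 2.0 * b[i];  /* level 1: simple loop the compiler can vectorize */

    MPI_Finalize();
    return 0;
}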

Chart legend: yellow = theoretical peak; pink = 30% sustained performance.
The top three were honorees of the Gordon Bell Award 2002 (peak performance, language, and special accomplishments).
Nov. 2003: Gordon Bell Award for Peak Performance: modeling of global seismic wave propagation on the Earth Simulator (5 Tflops).
Nov. 2004: Gordon Bell Award for Peak Performance: simulation of a dynamo and geomagnetic fields by the Earth Simulator (15.2 Tflops).

Cluster vs. Supercomputer? …