Hitachi SR8000 Supercomputer
LAPPEENRANTA UNIVERSITY OF TECHNOLOGY, Department of Information Technology
010652000 Introduction to Parallel Computing
Group 2: Juha Huttunen, Tite 4; Olli Ryhänen, Tite 4

History of SR8000
Successor to the Hitachi S-3800 vector supercomputer and the SR2201 parallel computer.

Overview of system architecture
A distributed-memory parallel computer with pseudo-vector SMP nodes.

Processing Unit
IBM PowerPC CPU architecture with Hitachi's extensions:
– 64-bit PowerPC RISC processors
– Available in speeds of 250 MHz, 300 MHz, 375 MHz and 450 MHz
– Hitachi extensions:
  - Additional 128 floating-point registers (total of 160 FPRs)
  - Fast hardware barrier synchronisation mechanism
  - Pseudo Vector Processing (PVP)

160 Floating-Point Registers
– FR0–FR31: global part
– FR32–FR159: slide part (the 128 additional registers)
– Floating-point operations are extended to handle the slide part
– Example use: inner product of two arrays
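The inner product is the textbook case of a loop whose partial results the compiler can keep in the slide registers while PVP streams the operands from memory. A minimal C sketch of such a kernel (plain portable code; the mapping onto the slide registers is done by the Hitachi compiler, not by the programmer):

    /* Inner (dot) product of two arrays: the kind of reduction loop the
     * SR8000 compiler software-pipelines, holding intermediate values in
     * the slide registers while PVP streams x[] and y[] from memory.    */
    double dot(const double *x, const double *y, long n)
    {
        double sum = 0.0;
        for (long i = 0; i < n; i++)
            sum += x[i] * y[i];   /* one multiply-add per element */
        return sum;
    }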

Pseudo Vector Processing (PVP)
– Introduced in the Hitachi SR2201 supercomputer
– Designed to solve the memory-bandwidth problems of RISC CPUs:
  - Performance similar to a vector processor
  - Non-blocking arithmetic execution
  - Reduced chance of cache misses
– Pipelined data loading through pre-fetch and pre-load instructions
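A rough illustration of the pre-fetch idea in portable C, using GCC's __builtin_prefetch; this is only an analogy for what the SR8000 compiler does automatically with its pre-fetch/pre-load instructions, not Hitachi's own API, and the pre-fetch distance is an arbitrary assumption:

    /* Pipelined data loading, sketched by hand: issue a pre-fetch a fixed
     * distance ahead of the use so the load latency overlaps with the
     * arithmetic.  __builtin_prefetch is a GCC/Clang builtin used purely
     * for illustration.                                                  */
    void scale(double *a, const double *b, double s, long n)
    {
        const long dist = 64;                         /* assumed pre-fetch distance */
        for (long i = 0; i < n; i++) {
            if (i + dist < n)
                __builtin_prefetch(&b[i + dist], 0, 0);   /* read, streaming access */
            a[i] = s * b[i];
        }
    }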

Pseudo Vector Processing (PVP)
Performance effect of PVP

Node Structure
Pseudo-vector SMP nodes:
– 8 instruction processors (IP) for computation
– 1 system control processor (SP) for management
– Co-operative Micro-processors in single Address Space (COMPAS)
– Maximum of 512 nodes (4096 processors)
Node types:
– Processing Nodes (PRN)
– I/O Nodes (ION)
– Supervisory Node (SVN), one per system

Node Partitioning/Grouping
– A physical node can belong to many logical partitions
– A node can belong to multiple node groups
– Node groups are created dynamically by the master node

COMPAS
– Auto parallelization by the compiler
– Hardware support for fast fork/join sequences:
  - Small start-up overhead
  - Cache coherency
  - Fast signalling between child and parent processes
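Conceptually the result is what an explicit OpenMP fork/join would express, except that on the SR8000 the compiler inserts it automatically and uses the hardware signalling between the IPs. A hedged C sketch of the equivalent explicit form, for illustration only:

    /* What COMPAS does implicitly for a parallelisable loop, written out
     * as an explicit OpenMP fork/join: the loop is forked across the
     * node's 8 IPs and joined at the end.                               */
    void daxpy(double *y, const double *x, double a, long n)
    {
        #pragma omp parallel for
        for (long i = 0; i < n; i++)
            y[i] += a * x[i];
    }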

COMPAS
Performance effect of COMPAS

Interconnection Network
– Multidimensional crossbar: 1-, 2- or 3-dimensional, with a maximum of 8 nodes per dimension
– External connections via the I/O nodes (Ethernet, ATM, etc.)
Remote Direct Memory Access (RDMA):
– Data transfer directly between nodes
– Minimizes operating-system overhead
– Supported in the MPI and PVM libraries
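Since RDMA is exposed through the MPI library, the natural way for an application to use it is MPI-2 one-sided communication. A minimal sketch with standard MPI-2 calls only (no SR8000-specific API is assumed):

    #include <mpi.h>

    /* Minimal MPI-2 one-sided sketch (run with at least two processes):
     * rank 0 writes its buffer directly into rank 1's window.  On the
     * SR8000 such transfers can use the RDMA feature and so bypass most
     * of the operating-system overhead.                                 */
    int main(int argc, char **argv)
    {
        double buf[1024] = {0.0};
        int rank;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Win_create(buf, sizeof(buf), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0)                     /* put 1024 doubles into rank 1 */
            MPI_Put(buf, 1024, MPI_DOUBLE, 1, 0, 1024, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }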

RDMA

Overview of Architecture

Software on SR8000
Operating system:
– HI-UX with MPP (Massively Parallel Processing) features
– Built-in maintenance tools
– 64-bit addressing with 32-bit code support
– A single system for the whole computer
Programming tools:
– Optimized F77, F90, Parallel Fortran, C and C++ compilers
– MPI-2 (Message Passing Interface)
– PVM (Parallel Virtual Machine)
– A variety of debugging tools (e.g. Vampir and TotalView)

Hybrid Programming Model
The SR8000 supports several parallel programming methods:
– MPI + COMPAS: one MPI process per node; pseudo-vectorization by PVP and auto parallelization by COMPAS within the node
– MPI + OpenMP: one MPI process per node, divided into threads across the 8 CPUs by OpenMP
– MPI + MPP: one MPI process per CPU (at most 8 processes per node)
– COMPAS: one process per node; pseudo-vectorization by PVP and auto parallelization by COMPAS

Hybrid Programming Model (continued)
– OpenMP: one process per node, divided into threads across the 8 CPUs by OpenMP
– Scalar: one application with a single thread on one CPU; can use the 9th CPU
– ION: the default model for commands such as 'ls', 'vi' etc.; can use the 9th CPU

Hybrid Programming Model Performance Effects
Parallel vector-matrix multiplication used as the example
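A hedged sketch of how such a vector-matrix multiplication might look in the MPI + OpenMP model: one MPI process per node owns a block of rows, and OpenMP spreads the row loop over the node's 8 CPUs. Only standard MPI and OpenMP are used; the row distribution and all names are illustrative assumptions, not the code behind the measurements:

    #include <mpi.h>

    /* Hybrid MPI + OpenMP sketch of y = A*x: each MPI process (one per
     * node) holds local_rows rows of A plus the full vector x, and OpenMP
     * divides the row loop among the node's 8 CPUs.                      */
    void matvec(const double *A, const double *x, double *y,
                long local_rows, long ncols)
    {
        #pragma omp parallel for
        for (long i = 0; i < local_rows; i++) {
            double sum = 0.0;
            for (long j = 0; j < ncols; j++)
                sum += A[i * ncols + j] * x[j];
            y[i] = sum;
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        /* ... distribute the rows of A, broadcast x, call matvec(), then
         *     gather the partial results, e.g. with MPI_Gatherv ...      */
        MPI_Finalize();
        return 0;
    }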

Performance Figures
– 10 places on the TOP500 list; highest rankings 26 and 27
– Theoretical maximum performance of 7.3 Tflop/s with 512 nodes
– Node performance depends on the model, from 8 Gflop/s to 14.4 Gflop/s depending on the CPU speed
– Maximum memory capacity of 8 TB
– Latency from a processor to various locations:
  - Memory: 30–200 nanoseconds
  - Remote memory via the RDMA feature: ~3–5 microseconds
  - MPI (without RDMA): ~6–20 microseconds
  - Disk: ~8 milliseconds
  - Tape: ~30 seconds
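These figures are mutually consistent if each CPU completes 4 floating-point operations per cycle (for example two fused multiply-adds); that rate is inferred from the numbers rather than stated on the slide. A worked check:

    \[ P_{\text{node}} = 8~\text{CPUs} \times 4~\text{flop/cycle} \times f_{\text{clock}} \]
    \[ 8 \times 4 \times 0.25~\text{GHz} = 8~\text{Gflop/s}, \qquad 8 \times 4 \times 0.45~\text{GHz} = 14.4~\text{Gflop/s} \]
    \[ 512~\text{nodes} \times 14.4~\text{Gflop/s} \approx 7.3~\text{Tflop/s} \]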

Scalability
Highly scalable architecture:
– Fast interconnection network and modular node structure
– By externally coupling two G1 frames, a performance of 1709 Gflop/s out of 2074 Gflop/s was achieved (82% efficiency)

Leibniz-Rechenzentrum
SR8000-F1 at the Leibniz-Rechenzentrum (LRZ), Munich: the German federal top-level compute server in Bavaria.
System information:
– 168 nodes (1344 processors, 375 MHz)
– 1344 GB of memory (8 GB/node; 4 nodes with 16 GB)
– 10 TB of disk storage

Leibniz-Rechenzentrum Performance
– Peak performance per CPU: 1.5 GFlop/s (12 GFlop/s per node)
– Total peak performance: 2016 GFlop/s (Linpack 1645 GFlop/s)
– I/O bandwidth: 600 MB/s to /home, 2.4 GB/s to /tmp
– Expected efficiency (from LRZ benchmarks): >600 GFlop/s
– Performance from main memory (most unfavourable case): >244 GFlop/s
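A short check that these figures follow from the 375 MHz configuration, using the same assumed 4 flop/cycle rate as above:

    \[ 0.375~\text{GHz} \times 4~\text{flop/cycle} = 1.5~\text{GFlop/s per CPU}, \qquad 1.5 \times 8 = 12~\text{GFlop/s per node} \]
    \[ 168~\text{nodes} \times 12~\text{GFlop/s} = 2016~\text{GFlop/s peak} \]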

Leibniz-Rechenzentrum
– Unidirectional communication bandwidth:
  - MPI without RDMA: 770 MB/s
  - MPI with RDMA: 950 MB/s
  - Hardware: 1000 MB/s
– 2 × unidirectional bisection bandwidth:
  - MPI with RDMA: 2 × 79 = 158 GB/s
  - Hardware: 2 × 84 = 168 GB/s