
1 “The Architecture of Massively Parallel Processor CP-PACS” Taisuke Boku, Hiroshi Nakamura, et al. University of Tsukuba, Japan by Emre Tapcı

2 Outline Introduction; Specification of CP-PACS; Pseudo Vector Processor PVP-SW; Interconnection Network of CP-PACS (Hyper-crossbar Network, Remote DMA message transfer, Message broadcasting, Barrier synchronization); Performance Evaluation; Conclusion, References, Questions & Comments

3 Introduction CP-PACS: Computational Physics by Parallel Array Computer Systems. The goal is to construct a dedicated MPP for computational physics, in particular the study of Quantum Chromodynamics (QCD). Center for Computational Physics, University of Tsukuba, Japan.

4 Specification of CP-PACS MIMD parallel processing system with distributed memory. Each Processing Unit (PU) has a RISC processor and a local memory. 2048 such PUs are connected by an interconnection network, together with 128 I/O units that provide a distributed disk space.

5 Specification of CP-PACS

6 Theoretical performance To be able to solve problems like QCD, astro-fluid dynamics, etc., a great number of PUs is required. For budget and reliability reasons, the number of PUs is limited to 2048.

7 Specification of CP-PACS Node processor Improve the function of the node processors first. Caches do not work efficiently on ordinary RISC processors, so a new technique for the cache function is introduced: PVP-SW.

8 Specification of CP-PACS Interconnection Network 3-dimensional Hyper-Crossbar (3-D HXB). Peak throughput of a single link: 300 MB/sec. Provides hardware message broadcasting, block-stride message transfer, and barrier synchronization.

9 Specification of CP-PACS I/O system 128 I/O units, equipped with a RAID-5 hard disk system; 528 GB total system disk space. The RAID-5 system increases fault tolerance.

10 Pseudo Vector Processor PVP-SW MPPs require high performance node processors, but a node processor cannot achieve high performance unless the cache system works efficiently. In these applications little temporal locality exists, and the data space of the application is much larger than the cache size.

11 Pseudo Vector Processor PVP-SW Vector processors Main memory access is pipelined, the vector length of load/store is long, and load/store is executed in parallel with arithmetic execution. We require these properties in our node processor, so PVP-SW, a pseudo-vector scheme, is introduced.

12 Pseudo Vector Processor PVP-SW The number of registers cannot simply be increased, because the register field in instructions is limited. So a new technique, Slide-Windowed Registers, is introduced.

13 Pseudo Vector Processor PVP-SW Slide-Windowed Registers The physical registers are organized into logical windows; a window consists of 32 registers and the total number of physical registers is 128. Registers are divided into global registers and window registers: global registers are static and shared by all windows, while the local (window) registers are not shared. Only one window is active at a certain time, as the sketch below illustrates.
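A minimal C model of the slide-window idea (an illustration, not the CP-PACS hardware layout): it assumes 128 physical registers, 32-register windows, a hypothetical count of 8 shared global registers, and a window pointer that selects where the active window starts in the physical register file.

    /* Illustrative model of slide-windowed registers.                        */
    #define PHYS_REGS   128          /* total physical registers              */
    #define WINDOW_SIZE 32           /* logical registers visible at once     */
    #define NUM_GLOBAL  8            /* assumed number of shared globals      */

    static double phys[PHYS_REGS];   /* the physical register file            */
    static int    window_ptr = 0;    /* start of the active window            */

    /* Map a logical register number (0..WINDOW_SIZE-1) to a physical index.
     * Logical 0..NUM_GLOBAL-1 are globals shared by every window; the rest
     * slide with window_ptr and wrap around the non-global registers.        */
    static int phys_index(int logical)
    {
        if (logical < NUM_GLOBAL)
            return logical;                              /* shared global     */
        return NUM_GLOBAL +
               (window_ptr + (logical - NUM_GLOBAL)) % (PHYS_REGS - NUM_GLOBAL);
    }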

14 Pseudo Vector Processor PVP-SW Slide-Windowed Registers The active window is identified by a pointer, FW-STP. New instructions are introduced to deal with FW-STP: FWSTPSet sets a new location for FW-STP; FRPreload loads data from memory into a window; FRPoststore stores data from a window into memory. A sketch of the resulting loop structure follows.
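The C sketch below shows the preload / compute / poststore pattern these instructions enable; the intrinsic-style functions are hypothetical stand-ins for the real machine instructions, and on the actual hardware the preload and poststore of neighbouring strips overlap with the arithmetic, which is what gives the vector-like behaviour.

    /* Hypothetical intrinsics standing in for the PVP-SW instructions
     * (names assumed; the real ones are machine instructions, not a C API). */
    extern void fwstp_set(int window);                            /* FWSTPSet    */
    extern void fr_preload(int window, const double *src, int n); /* FRPreload   */
    extern void fr_poststore(int window, double *dst, int n);     /* FRPoststore */

    /* y[i] += a * x[i] arranged strip by strip, one 32-element strip per
     * register window.  Each window is slid into place, filled from memory,
     * computed on, and drained back - the order a vector pipeline would use. */
    void daxpy_pvp(double *y, const double *x, double a, int n)
    {
        enum { W = 32 };                          /* one window = 32 registers */
        for (int i = 0; i < n; i += W) {
            int len = (n - i < W) ? n - i : W;
            int w   = (i / W) % 4;                /* rotate over four windows  */

            fwstp_set(w);                         /* make window w active      */
            fr_preload(w, &x[i], len);            /* load the strip of x       */
            for (int j = 0; j < len; j++)         /* arithmetic on the strip   */
                y[i + j] += a * x[i + j];
            fr_poststore(w, &y[i], len);          /* store the strip of y      */
        }
    }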

15 Pseudo Vector Processor PVP-SW

16 Interconnection Network of CP-PACS The topology is a Hyper-Crossbar Network (HXB) of size 8 x 17 x 16, connecting the 2048 PUs and 128 I/O units. On each dimension, the nodes are interconnected by a crossbar; for example, on the Y dimension a Y x Y crossbar is used. Routing is simple: route on the three dimensions consecutively. Wormhole routing is employed, as sketched below.
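A minimal sketch of the resulting dimension-ordered routing: since each dimension is a full crossbar, fixing one coordinate costs exactly one hop, so any destination is reached in at most three hops (the X, Y, Z order below is an assumption).

    #include <stdio.h>

    /* A node address in the 3-D hyper-crossbar (8 x 17 x 16 on CP-PACS). */
    struct node { int x, y, z; };

    /* Correct one coordinate per crossbar hop; at most three hops total. */
    static int route(struct node s, struct node d)
    {
        int hops = 0;
        if (s.x != d.x) { printf("X-crossbar hop to x=%d\n", d.x); s.x = d.x; hops++; }
        if (s.y != d.y) { printf("Y-crossbar hop to y=%d\n", d.y); s.y = d.y; hops++; }
        if (s.z != d.z) { printf("Z-crossbar hop to z=%d\n", d.z); s.z = d.z; hops++; }
        return hops;
    }

    int main(void)
    {
        struct node src = {0, 3, 7}, dst = {5, 3, 2};
        printf("total hops: %d\n", route(src, dst));   /* 2 for this pair */
        return 0;
    }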

17 Interconnection Network of CP-PACS Wormhole routing and the HXB together have these properties: small network diameter; a torus of the same size can be simulated; message broadcasting is done by hardware; a binary hypercube can be emulated; throughput under uniform random transfer is high.

18 Interconnection Network of CP-PACS Remote DMA transfer Making a system call to the OS and copying data into the OS area is costly; instead, the remote node's memory is accessed directly. Remote DMA is attractive because mode switching between kernel and user mode is avoided, and redundant data copying between user and kernel space is not done. A sketch of such an interface follows.
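The fragment below sketches what a user-level remote DMA put could look like; the rdma_put name and signature are hypothetical, not the actual CP-PACS interface. The point is only that data moves from one node's user buffer straight into another node's user buffer with no kernel copy.

    #include <stddef.h>

    /* Hypothetical user-level remote DMA call (name and signature assumed). */
    extern int rdma_put(int remote_pu,           /* destination processing unit */
                        void *remote_addr,       /* address in remote memory    */
                        const void *local_addr,  /* source in local user memory */
                        size_t bytes);

    /* Example: push a boundary strip of a local array to a neighbouring PU,
     * directly from user space, without copying through the OS.             */
    int send_halo(int neighbour, double *remote_halo,
                  const double *strip, size_t n)
    {
        return rdma_put(neighbour, remote_halo, strip, n * sizeof(double));
    }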

19 Interconnection Network of CP-PACS Message Broadcasting Supported by hardware: the broadcast is first performed on one dimension, then on the other dimensions. Hardware mechanisms are present to prevent the deadlock that two nodes broadcasting at the same time could cause. Hardware partitioning is also possible, so a broadcast message is sent only to the nodes in the sender's partition. The dimension-by-dimension fan-out is illustrated below.
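A small simulation of the fan-out order (coverage only; the real fan-out is performed by the crossbar hardware, and the X-then-Y-then-Z phase order is an assumption):

    #include <stdio.h>

    enum { NX = 8, NY = 17, NZ = 16 };              /* CP-PACS HXB dimensions */

    int main(void)
    {
        static int reached[NX][NY][NZ];
        reached[0][0][0] = 1;                       /* broadcasting source    */

        for (int x = 0; x < NX; x++)                /* phase 1: along X       */
            reached[x][0][0] |= reached[0][0][0];
        for (int x = 0; x < NX; x++)                /* phase 2: along Y       */
            for (int y = 0; y < NY; y++)
                reached[x][y][0] |= reached[x][0][0];
        for (int x = 0; x < NX; x++)                /* phase 3: along Z       */
            for (int y = 0; y < NY; y++)
                for (int z = 0; z < NZ; z++)
                    reached[x][y][z] |= reached[x][y][0];

        int total = 0;
        for (int x = 0; x < NX; x++)
            for (int y = 0; y < NY; y++)
                for (int z = 0; z < NZ; z++)
                    total += reached[x][y][z];
        printf("nodes reached: %d\n", total);       /* 8 * 17 * 16 = 2176     */
        return 0;
    }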

20 Interconnection Network of CP-PACS Barrier Synchronization A synchronization mechanism is required in any inter-process communication system. CP-PACS supports a hardware barrier synchronization facility, which makes use of special synchronization packets distinct from the usual data packets. Partitioned pieces of the network can also use barrier synchronization.

21 Performance Evaluation Based on the LINPACK benchmark: LU decomposition of a matrix. The outer product method is used, based on a 2-dimensional block-cyclic distribution (sketched below). All floating point and data load/store operations are done in the PVP-SW manner.
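A minimal sketch of 2-dimensional block-cyclic distribution: which PU in a logical P x Q grid owns a given matrix block. The grid shape and block size below are illustrative values, not the parameters of the CP-PACS LINPACK run.

    #include <stdio.h>

    /* The matrix is cut into NB x NB blocks and block (bi, bj) is owned by
     * PU (bi mod P, bj mod Q) in a P x Q logical grid of processing units. */
    enum { P = 32, Q = 64, NB = 64 };       /* 32 * 64 = 2048 PUs (assumed) */

    static void owner_of(int i, int j, int *pu_row, int *pu_col)
    {
        *pu_row = (i / NB) % P;             /* block row -> PU grid row     */
        *pu_col = (j / NB) % Q;             /* block col -> PU grid column  */
    }

    int main(void)
    {
        int r, c;
        owner_of(5000, 12345, &r, &c);
        printf("element (5000, 12345) lives on PU (%d, %d)\n", r, c);
        return 0;
    }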

22 Performance Evaluation


25 Conclusion CP-PACS is operational at the University of Tsukuba, working on large scale QCD calculations. Sponsored by Hitachi Ltd. and a Grant-in-Aid of the Ministry of Education, Science and Culture, Japan.

26 References T. Boku, H. Nakamura, K. Nakazawa, Y. Iwasaki, "The Architecture of Massively Parallel Processor CP-PACS", Institute of Information Sciences and Electronics, University of Tsukuba.

27 Questions & Comments

