Slides Prepared from the CI-Tutor Courses at NCSA By S. Masoud Sadjadi School of Computing and Information Sciences Florida.


Slides Prepared from the CI-Tutor Courses at NCSA
By S. Masoud Sadjadi, School of Computing and Information Sciences, Florida International University
March 2009
Parallel Computing Explained: About the IBM Regatta P690

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
9.1 IBM p690 General Overview
9.2 IBM p690 Building Blocks
9.3 Features Performed by the Hardware
9.4 The Operating System
9.5 Further Information

About the IBM Regatta P690 To obtain your program’s top performance, it is important to understand the architecture of the computer system on which the code runs. This chapter describes the architecture of NCSA's IBM p690. Technical details on the size and design of the processors, memory, cache, and the interconnect network are covered along with technical specifications for the compute rate, memory size and speed, and interconnect bandwidth.

IBM p690 General Overview
The p690 is IBM's latest Symmetric Multi-Processor (SMP) machine with Distributed Shared Memory (DSM): memory is physically distributed but logically shared. It is based on the Power4 architecture and is the successor to the Power3-II based RS/6000 SP system.

IBM p690 Scalability
The IBM p690 is a flexible, modular, and scalable architecture. It scales in terms of:
- Number of processors
- Memory size
- I/O and memory bandwidth
- Interconnect bandwidth

Agenda
9 About the IBM Regatta P690
9.1 IBM p690 General Overview
9.2 IBM p690 Building Blocks
Power4 Core
Multi-Chip Modules
The Processor
Cache Architecture
Memory Subsystem
9.3 Features Performed by the Hardware
9.4 The Operating System
9.5 Further Information

IBM p690 Building Blocks
An IBM p690 system is built from a number of fundamental building blocks. The first of these is the Power4 Core, which includes the processors and the L1 and L2 caches. At NCSA, four of these Power4 Cores are linked to form a Multi-Chip Module, which also includes the L3 cache. Four Multi-Chip Modules are in turn linked to form a 32-processor system (see the figure on the next slide). Each of these components is described in the following sections.

32-processor IBM p690 configuration (Image courtesy of IBM)

Power4 Core
The Power4 Chip contains:
- Two processors
- Local caches (L1), one set per processor
- A cache external to the processors (L2), shared between the two
- I/O and interconnect interfaces

The POWER4 chip (Image courtesy of IBM)

Multi-Chip Modules
Four Power4 Chips are assembled to form a Multi-Chip Module (MCM) that contains 8 processors. Each MCM also supports the L3 cache for each Power4 chip.

Multiple MCM interconnection (Image courtesy of IBM)
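The processor counts quoted for the building blocks follow from simple multiplication. The sketch below only restates the figures given in these slides (2 processors per chip, 4 chips per MCM, 4 MCMs in the NCSA configuration); the constant and function names are illustrative, not part of any IBM or NCSA tool.

```python
# Figures taken from the slides; names are hypothetical, for illustration only.
PROCESSORS_PER_CHIP = 2   # each Power4 chip holds two processors
CHIPS_PER_MCM = 4         # four Power4 chips form one Multi-Chip Module
MCMS_PER_SYSTEM = 4       # four MCMs form the NCSA p690 configuration


def processors_per_mcm() -> int:
    """Processors contained in one Multi-Chip Module."""
    return PROCESSORS_PER_CHIP * CHIPS_PER_MCM


def processors_per_system() -> int:
    """Processors in the full four-MCM system."""
    return processors_per_mcm() * MCMS_PER_SYSTEM


if __name__ == "__main__":
    print(processors_per_mcm())     # 8
    print(processors_per_system())  # 32
```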

The Processor
The processors at the heart of the Power4 Core are speculative, superscalar, out-of-order execution chips. The Power4 is a 4-way superscalar RISC architecture running instructions on its 8 pipelined execution units.

Speed of the Processor
The NCSA IBM p690 has CPUs running at 1.3 GHz.

64-Bit Processor

Execution Units
There are 8 independent, fully pipelined execution units:
- 2 load/store units for memory access
- 2 identical floating point execution units capable of fused multiply/add
- 2 fixed point execution units
- 1 branch execution unit
- 1 logic operation unit

The Processor
Per cycle, the processor can perform 4 floating point operations, fetch 8 instructions, and complete 5 instructions. It can handle up to 200 in-flight instructions.

Performance Numbers
Peak Performance: 4 floating point operations per cycle. 1.3 Gcycles/sec * 4 flop/cycle yields 5.2 GFLOPS.
MIPS Rating: 5 instructions per cycle. 1.3 Gcycles/sec * 5 instructions/cycle yields 6500 MIPS.

Instruction Set
The instruction set architecture (ISA) on the IBM p690 is the PowerPC AS instruction set.
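The peak figures are plain products of clock rate and per-cycle throughput. The following is a minimal sketch of that arithmetic, using the 1.3 GHz clock, 4 flop/cycle, and 5 instructions/cycle stated in the slides; the function names are assumptions made for this example. Note that 1.3 × 10^9 cycles/sec times 5 instructions/cycle is 6.5 × 10^9 instructions/sec, which is 6500 when expressed in millions of instructions per second.

```python
# Per-processor figures from the slides; function names are hypothetical.
CLOCK_HZ = 1.3e9              # 1.3 GHz clock
FLOPS_PER_CYCLE = 4           # 2 floating point units, each doing a fused multiply/add
INSTRUCTIONS_PER_CYCLE = 5    # instructions completed per cycle


def peak_gflops(clock_hz: float = CLOCK_HZ,
                flops_per_cycle: int = FLOPS_PER_CYCLE) -> float:
    """Theoretical peak in billions of floating point operations per second."""
    return clock_hz * flops_per_cycle / 1e9


def peak_mips(clock_hz: float = CLOCK_HZ,
              instructions_per_cycle: int = INSTRUCTIONS_PER_CYCLE) -> float:
    """Theoretical peak in millions of instructions per second."""
    return clock_hz * instructions_per_cycle / 1e6


if __name__ == "__main__":
    print(f"{peak_gflops():.1f} GFLOPS per processor")   # 5.2
    print(f"{peak_mips():.0f} MIPS per processor")       # 6500
    print(f"{32 * peak_gflops():.1f} GFLOPS for 32 processors")  # 166.4
```

These are upper bounds: they assume every cycle issues the maximum number of operations, which real codes rarely sustain.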

Cache Architecture
Each Power4 Core has both a primary (L1) cache associated with each processor and a secondary (L2) cache shared between the two processors. In addition, each Multi-Chip Module has an L3 cache.

Level 1 Cache
The Level 1 cache is in the processor core. It has split instruction and data caches.

L1 Instruction Cache
The properties of the Instruction Cache are:
- 64KB in size
- direct mapped
- cache line size is 128 bytes

L1 Data Cache
The properties of the L1 Data Cache are:
- 32KB in size
- 2-way set associative
- FIFO replacement policy
- 2-way interleaved
- cache line size is 128 bytes

Peak speed is achieved when the data accessed in a loop is entirely contained in the L1 data cache.
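The L1 parameters above determine how many lines and sets each cache has, and how large a loop's working set can be before it spills out of the L1 data cache. The sketch below derives those numbers under two assumptions that the slides do not state: that 1KB means 1024 bytes, and that a set is selected by the simple "line address modulo number of sets" rule. The class and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    """Size, line size, and associativity of a cache (ways=1 is direct mapped)."""
    size_bytes: int
    line_bytes: int
    ways: int

    @property
    def lines(self) -> int:
        return self.size_bytes // self.line_bytes

    @property
    def sets(self) -> int:
        return self.lines // self.ways

    @property
    def offset_bits(self) -> int:
        # Bits needed to address a byte within one line (line size is a power of two).
        return (self.line_bytes - 1).bit_length()

    @property
    def index_bits(self) -> int:
        # Bits needed to select a set (set count is a power of two).
        return (self.sets - 1).bit_length()

    def set_index(self, address: int) -> int:
        # Assumed textbook mapping: line address modulo the number of sets.
        return (address // self.line_bytes) % self.sets


# Parameters from the slides, assuming 1KB = 1024 bytes.
L1_INSTRUCTION = CacheGeometry(size_bytes=64 * 1024, line_bytes=128, ways=1)
L1_DATA = CacheGeometry(size_bytes=32 * 1024, line_bytes=128, ways=2)


def fits_in_l1_data(n_elements: int, element_bytes: int = 8) -> bool:
    """True if an array of n_elements (8-byte doubles by default) fits in the L1 data cache."""
    return n_elements * element_bytes <= L1_DATA.size_bytes


if __name__ == "__main__":
    print(L1_INSTRUCTION.lines, L1_INSTRUCTION.sets)  # 512 512
    print(L1_DATA.lines, L1_DATA.sets)                # 256 128
    print(fits_in_l1_data(4096), fits_in_l1_data(4097))  # True False
```

Under these assumptions, a loop over at most 4096 double-precision values can stay entirely within the L1 data cache, ignoring conflicts and any other data the loop touches.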

Cache Architecture
Level 2 Cache on the Power4 Chip
When the processor can't find a data element in the L1 cache, it looks in the L2 cache. The properties of the L2 Cache are:
- external to the processor
- unified instruction and data cache
- 1.41MB per Power4 chip (2 processors)
- 8-way set associative
- split between 3 controllers
- cache line size is 128 bytes
- pseudo LRU replacement policy for cache coherence
- GB/s peak bandwidth from L2

Cache Architecture
Level 3 Cache on the Multi-Chip Module
When the processor can't find a data element in the L2 cache, it looks in the L3 cache. The properties of the L3 Cache are:
- external to the Power4 Core
- unified instruction and data cache
- 128MB per Multi-Chip Module (8 processors)
- 8-way set associative
- cache line size is 512 bytes
- 55.5 GB/s peak bandwidth from L2
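As a rough illustration, the per-chip and per-module cache sizes can be scaled up to the 32-processor configuration (4 MCMs of 4 chips each), and the L3 set count can be derived from its size, line size, and associativity. This sketch assumes 1MB means 1024 × 1024 bytes and uses hypothetical function names; the 1.41MB L2 figure is a rounded value from the slides, so the L2 total is approximate.

```python
MIB = 1024 * 1024  # assumption: the slides' "MB" is treated as 2**20 bytes


def l3_sets(size_bytes: int = 128 * MIB, line_bytes: int = 512, ways: int = 8) -> int:
    """Number of sets in one MCM's L3 cache: lines divided by associativity."""
    return size_bytes // line_bytes // ways


def system_cache_totals_mb(mcms: int = 4,
                           chips_per_mcm: int = 4,
                           l2_mb_per_chip: float = 1.41,
                           l3_mb_per_mcm: int = 128) -> dict:
    """Aggregate L2 and L3 capacity (in MB) across the whole system."""
    return {
        "l2_mb": mcms * chips_per_mcm * l2_mb_per_chip,
        "l3_mb": mcms * l3_mb_per_mcm,
    }


if __name__ == "__main__":
    print(l3_sets())  # 32768
    totals = system_cache_totals_mb()
    print(f"L2 total: {totals['l2_mb']:.2f} MB, L3 total: {totals['l3_mb']} MB")
```

For the four-MCM system this gives about 22.56 MB of L2 and 512 MB of L3 in aggregate, although any one processor sees its own chip's L2 and its own module's L3 at the lowest latency.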

Memory Subsystem
The total memory is physically distributed among the Multi-Chip Modules of the p690 system (see the diagram in the next slide).

Memory Latencies
The latency penalties for each of the levels of the memory hierarchy are:
- L1 Cache - 4 cycles
- L2 Cache - 14 cycles
- L3 Cache cycles
- Main Memory cycles
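Latencies quoted in cycles can be converted to wall-clock time using the 1.3 GHz clock rate given earlier. The sketch below does this for the two values that are stated (4 cycles for L1, 14 cycles for L2); the function name is an assumption for this example, and no values are supplied for the levels whose cycle counts are not given.

```python
CLOCK_HZ = 1.3e9  # NCSA p690 processor clock, from the slides


def cycles_to_ns(cycles: float, clock_hz: float = CLOCK_HZ) -> float:
    """Convert a latency in processor cycles to nanoseconds."""
    return cycles / clock_hz * 1e9


if __name__ == "__main__":
    # Only the latencies stated in the slides are listed here.
    for level, cycles in {"L1 Cache": 4, "L2 Cache": 14}.items():
        print(f"{level}: {cycles} cycles = {cycles_to_ns(cycles):.2f} ns")
```

At 1.3 GHz one cycle lasts about 0.77 ns, so an L1 hit costs roughly 3.08 ns and an L2 hit roughly 10.77 ns.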

Memory distribution within an MCM

Agenda
9 About the IBM Regatta P690
9.1 IBM p690 General Overview
9.2 IBM p690 Building Blocks
9.3 Features Performed by the Hardware
9.4 The Operating System
9.5 Further Information

Features Performed by the Hardware
The following is done completely by the hardware, transparent to the user:
- Global memory addressing (makes the system memory shared)
- Address resolution
- Maintaining cache coherency
- Automatic page migration from remote to local memory (to reduce interconnect memory transactions)

The Operating System
The operating system is AIX. NCSA's p690 system is currently running version 5.1 of AIX. Version 5.1 is a full 64-bit file system.

Compatibility
AIX 5.1 is highly compatible with both BSD and System V Unix.

Further Information
- Computer Architecture: A Quantitative Approach. John Hennessy, et al. Morgan Kaufmann Publishers, 2nd Edition, 1996.
- Computer Organization and Design: The Hardware/Software Interface. David A. Patterson, et al. Morgan Kaufmann Publishers, 2nd Edition, 1997.
- IBM P Series [595] at the URL:
- IBM p690 Documentation at NCSA at the URL: