High performance computing architecture examples Unit 2.

Classification of parallel programming models
1.1 Process interaction
– Shared memory
– Message passing
– Implicit interaction
1.2 Problem decomposition
– Task parallelism
– Data parallelism
– Implicit parallelism
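The two explicit process-interaction styles listed above can be contrasted with a minimal Python sketch. The function names are hypothetical and not part of the slides. Both functions use Python threads, so the queue only models the message-passing style rather than crossing a real process or network boundary. Each worker sums one chunk of the input, which is also a small instance of data parallelism.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """Workers communicate by updating one shared variable under a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # protect the shared accumulator
            total += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def message_passing_sum(chunks):
    """Workers communicate only by sending their result as a message."""
    results = queue.Queue()

    def worker(chunk):
        results.put(sum(chunk))  # send one message per worker

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Receive exactly one message per worker and combine them.
    return sum(results.get() for _ in chunks)


chunks = [[1, 2, 3], [4, 5], [6]]
print(shared_memory_sum(chunks), message_passing_sum(chunks))
```

In the shared-memory version the lock is what keeps concurrent updates correct. In the message-passing version no worker touches another worker's data, so no lock appears in user code.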

IBM Cell Broadband Engine

Cell (microprocessor)
Cell is an asymmetric multi-core microprocessor microarchitecture that combines a general-purpose Power Architecture core of modest performance with streamlined coprocessing elements, which greatly accelerate multimedia and vector processing applications. It was developed by Sony, Toshiba, and IBM, an alliance known as "STI".
POWER: Performance Optimization With Enhanced RISC.
Power Architecture core: a core based on a long-established RISC instruction set architecture designed by IBM.

The Cell Broadband Engine, or Cell as it is more commonly known, is a microprocessor intended as a hybrid of conventional desktop processors (such as the Athlon 64 and Core 2 families) and more specialized high-performance processors, such as the NVIDIA and ATI graphics processors (GPUs).

The first major commercial application of Cell was in Sony's PlayStation 3 game console.
The Cell architecture includes a memory coherence architecture that emphasizes power efficiency, prioritizes bandwidth over low latency, and favors peak computational throughput over simplicity of program code.

Applications
Cell targets current and future online distribution systems. Application areas include:
1. High-definition displays and recording equipment
2. HDTV systems
3. Digital imaging systems (medical, scientific, etc.)
4. Physical simulation (e.g., scientific and structural engineering modeling)

IBM provides a Linux-based development platform to help developers program for Cell chips.
Cell is also useful for scientific computing.

Architecture
The Cell processor can be split into four components:
1. External input and output structures
2. The main processor, called the Power Processing Element (PPE)
3. Eight fully functional co-processors, called the Synergistic Processing Elements (SPEs)
4. A specialized high-bandwidth circular data bus connecting the PPE, the input/output elements, and the SPEs, called the Element Interconnect Bus (EIB)

To make the best use of the EIB, and to overlap computation and data transfer, each of the nine processing elements (PPE and SPEs) is equipped with a DMA engine. Since an SPE's load/store instructions can only access its own local memory, each SPE depends entirely on DMA to transfer data to and from main memory and other SPEs' local memories. A DMA operation can transfer either a single block of up to 16 KB, or a list of 2 to 2048 such blocks.
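The DMA size limits above imply that a large transfer has to be cut into blocks and the blocks grouped into lists. The following Python sketch shows only that bookkeeping. The function name `plan_dma` and the command representation are hypothetical, and the sketch ignores the alignment and size-granularity rules that real hardware imposes.

```python
MAX_DMA_BLOCK = 16 * 1024   # largest single DMA block stated in the text (16 KB)
MAX_LIST_ENTRIES = 2048     # largest DMA list stated in the text


def plan_dma(total_bytes):
    """Split a transfer into DMA commands (hypothetical helper).

    Returns a list of commands. Each command is a list of block sizes in
    bytes: one entry models a single-block DMA, several entries model a
    DMA list.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    # Cut the transfer into blocks of at most 16 KB.
    full_blocks, remainder = divmod(total_bytes, MAX_DMA_BLOCK)
    blocks = [MAX_DMA_BLOCK] * full_blocks
    if remainder:
        blocks.append(remainder)
    # Group the blocks into lists of at most 2048 entries.
    return [
        blocks[i:i + MAX_LIST_ENTRIES]
        for i in range(0, len(blocks), MAX_LIST_ENTRIES)
    ]


# A 40 KB transfer fits in one DMA list with three blocks.
print(plan_dma(40 * 1024))
```

With these limits a single DMA list can describe at most 2048 × 16 KB = 32 MB, so larger transfers need more than one command.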

The PPE, which is capable of running a conventional operating system, has control over the SPEs and can start, stop, interrupt, and schedule processes running on the SPEs. Unlike SPEs, the PPE can read and write the main memory and the local memories of SPEs through the standard load/store instructions. The SPEs are not fully autonomous and require the PPE to prime them before they can do any useful work.

Both the PPE and SPE are RISC architectures with a fixed-width 32-bit instruction format. The PPE contains a 64-bit general purpose register set (GPR), a 64-bit floating point register set (FPR), and a 128-bit AltiVec register set. The SPE contains 128-bit registers only. These can be used for scalar data types ranging from 8 bits to 64 bits in size, or for SIMD computations on a variety of integer and floating point formats. System memory addresses for both the PPE and SPE are expressed as 64-bit values, for a theoretical address range of 2^64 bytes (16 exabytes or 16,777,216 terabytes). In practice, not all of these bits are implemented in hardware.

Local store addresses internal to the SPU (Synergistic Processor Unit) processor are expressed as a 32-bit word. In documentation relating to Cell a word is always taken to mean 32 bits, a doubleword means 64 bits, and a quadword means 128 bits.
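The address-range figures and the register widths above can be checked with a few lines of Python. This is a small arithmetic sketch: the constant and function names are illustrative, and binary units (2^60 and 2^40 bytes) are assumed, since those are the units that reproduce the numbers quoted in the text.

```python
ADDRESS_BITS = 64
ADDRESS_SPACE_BYTES = 2 ** ADDRESS_BITS

EXA = 2 ** 60   # binary exabyte, assumed unit
TERA = 2 ** 40  # binary terabyte, assumed unit

print(ADDRESS_SPACE_BYTES // EXA)   # 16
print(ADDRESS_SPACE_BYTES // TERA)  # 16777216


def simd_lanes(register_bits, element_bits):
    """Number of elements of a given width that fit in one register."""
    return register_bits // element_bits


# A 128-bit SPE register (one quadword) viewed as 8- to 64-bit elements.
for element_bits in (8, 16, 32, 64):
    print(element_bits, "bit elements:", simd_lanes(128, element_bits), "lanes")
```

In the terminology above, one quadword register therefore holds four words or two doublewords.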

Nvidia Tesla
Nvidia Tesla is Nvidia's brand name for its products with very high computational power. Tesla products target the high performance computing market. They primarily operate:
– in simulations and in large-scale calculations (especially floating-point calculations)
– for high-end image generation for applications in professional and scientific fields
Tesla products are programmed with OpenCL or CUDA.

TPC: texture/processor cluster
SM: streaming multiprocessor
SP: streaming processor
Tex: texture
ROP: raster operation processor
SMC: SM controller

TESLA GPUS FOR WORKSTATIONS

Feature                                           Tesla K40     Tesla K20
Peak double precision floating point performance  1.43 Tflops   1.17 Tflops
Peak single precision floating point performance  4.29 Tflops   3.52 Tflops
Memory bandwidth (ECC off)                        288 GB/sec    208 GB/sec
Memory size (GDDR5)                               12 GB         5 GB
CUDA cores

Intel Larrabee microarchitecture
The product is intended as a co-processor for high performance computing and does not function as a graphics processing unit.
It is a hybrid between a multi-core CPU and a GPU, and has similarities to both. Its coherent cache hierarchy and x86 architecture compatibility are CPU-like, while its wide SIMD vector units and texture sampling hardware are GPU-like.

Differences with CPUs
– Larrabee's x86 cores were based on the much simpler P54C Pentium design.
– The P54C-derived core is superscalar but does not include out-of-order execution, though it has been updated with modern features such as x86-64 support.
– Each Larrabee core contained a 512-bit vector processing unit, able to process 16 single precision floating point numbers at a time.

– Additional features, such as scatter/gather instructions and a mask register, are designed to make the vector unit easier to use and more efficient. Larrabee derives most of its number-crunching power from these vector units.
– Fixed-function graphics hardware feature: texture sampling units. These perform trilinear and anisotropic filtering and texture decompression.

– 1024-bit (512-bit each way) ring bus for communication between cores and to memory.
– Explicit cache control instructions to reduce cache thrashing during streaming operations that read or write data only once.
– Explicit prefetching into L2 or L1 cache is also supported.
– Enhanced multithreading: 4 threads per core.
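The vector-unit features described above (16 single-precision lanes, a mask register, and scatter/gather) can be modeled in plain Python. This is a conceptual sketch only: the function names are hypothetical, lists stand in for vector registers, and nothing here corresponds to actual Larrabee instructions.

```python
VECTOR_BITS = 512
FLOAT_BITS = 32
LANES = VECTOR_BITS // FLOAT_BITS  # 16 single-precision lanes


def gather(memory, indices):
    """Load one element per lane from arbitrary positions in memory."""
    return [memory[i] for i in indices]


def masked_add(a, b, mask):
    """Add lane by lane where the mask is set; keep a's lane otherwise."""
    return [x + y if m else x for x, y, m in zip(a, b, mask)]


def scatter(memory, indices, values, mask):
    """Store only the active lanes back to arbitrary positions in memory."""
    for i, v, m in zip(indices, values, mask):
        if m:
            memory[i] = v


memory = [float(i) for i in range(64)]
indices = [3 * lane for lane in range(LANES)]    # strided, non-contiguous
mask = [lane % 2 == 0 for lane in range(LANES)]  # even lanes active

vec = gather(memory, indices)
vec = masked_add(vec, [1.0] * LANES, mask)
scatter(memory, indices, vec, mask)
print(memory[0], memory[3], memory[6])  # 1.0 3.0 7.0
```

The mask lets one vector operation handle a per-element condition without branching, and gather/scatter let it work on data that is not contiguous in memory.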

Architecture

– "Fully" programmable
– Legacy code easy to migrate and deploy

Nehalem (microarchitecture)
Nehalem is the codename for an Intel processor microarchitecture, which is the successor to the older Core microarchitecture.

Technology
Microarchitecture of a processor core in the quad-core implementation:
– Hyper-threading reintroduced
– 4–12 MB L3 cache
– Second-level branch predictor and translation lookaside buffer
– Native (all processor cores on a single die) quad- and octa-core processors
– Intel QuickPath Interconnect in high-end models, replacing the legacy front side bus
– 64 KB L1 cache per core (32 KB L1 data and 32 KB L1 instruction), and 256 KB L2 cache per core

– Integrated memory controller supporting two or three memory channels of DDR3 SDRAM or four FB-DIMM2 channels
– Second-generation Intel Virtualization Technology, which introduced Extended Page Table support, virtual processor identifiers (VPIDs), and non-maskable interrupt-window exiting
– 20 to 24 pipeline stages
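The per-core cache sizes listed above can be added up to estimate the total on-die cache. The Python sketch below does this for a hypothetical quad-core part with an 8 MB L3. The 8 MB figure is an assumption chosen from the 4–12 MB range given in the text, and the helper name is illustrative. The L3 is treated as shared by all cores.

```python
KB = 1024
MB = 1024 * KB

L1_PER_CORE = 32 * KB + 32 * KB  # 32 KB data + 32 KB instruction
L2_PER_CORE = 256 * KB


def total_cache_bytes(cores, l3_bytes):
    """Private L1 and L2 for every core plus one shared L3."""
    return cores * (L1_PER_CORE + L2_PER_CORE) + l3_bytes


# Hypothetical quad-core configuration with an assumed 8 MB L3.
total = total_cache_bytes(cores=4, l3_bytes=8 * MB)
print(total // KB, "KB")  # 9472 KB
```

In this configuration the L3 accounts for 8192 KB of the 9472 KB total.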

References wikipedia.org