High performance computing architecture examples Unit 2.


1 High performance computing architecture examples Unit 2

2 Classification of parallel programming models
1.1 Process interaction
– 1.1.1 Shared memory
– 1.1.2 Message passing
– 1.1.3 Implicit interaction
1.2 Problem decomposition
– 1.2.1 Task parallelism
– 1.2.2 Data parallelism
– 1.2.3 Implicit parallelism
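The distinction between the two decomposition styles above can be illustrated with a short sketch (plain Python, not from the slides): data parallelism applies the same operation to partitions of one dataset, while task parallelism runs distinct operations concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

# Data parallelism: the SAME operation (squaring) applied to chunks of one dataset.
def square_all(chunk):
    return [x * x for x in chunk]

data = list(range(8))
chunks = [data[:4], data[4:]]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = [y for part in pool.map(square_all, chunks) for y in part]

# Task parallelism: DIFFERENT operations (sum and max) running concurrently.
def total(xs):
    return sum(xs)

def peak(xs):
    return max(xs)

with ThreadPoolExecutor(max_workers=2) as pool:
    sum_future = pool.submit(total, data)
    max_future = pool.submit(peak, data)

print(results)              # [0, 1, 4, 9, 16, 25, 36, 49]
print(sum_future.result())  # 28
print(max_future.result())  # 7
```

Shared-memory interaction is implicit here (the threads share `data`); a message-passing version would instead send chunks between processes.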

3 IBM CELL Broadband Engine

4 Cell (microprocessor) Cell is an asymmetric multi-core microprocessor microarchitecture that combines a general-purpose Power Architecture core of modest performance with streamlined coprocessing elements which greatly accelerate multimedia and vector processing applications. It was developed by Sony, Toshiba, and IBM, an alliance known as "STI". POWER: Performance Optimization With Enhanced RISC. Power Architecture core: a microprocessor instruction set architecture designed by IBM.

5 The Cell Broadband Engine, or Cell as it is more commonly known, is a microprocessor intended as a hybrid of conventional desktop processors (such as the Athlon 64 and Core 2 families) and more specialized high-performance processors, such as NVIDIA and ATI graphics processors (GPUs).

7 The first major commercial application of Cell was in Sony's PlayStation 3 game console. The Cell architecture includes a memory coherence architecture that emphasizes power efficiency, prioritizes bandwidth over low latency, and favors peak computational throughput over simplicity of program code.

8 Applications
Current and future online distribution systems, such as:
1. high-definition displays and recording equipment,
2. HDTV systems,
3. digital imaging systems (medical, scientific, etc.), and
4. physical simulation (e.g., scientific and structural engineering modeling).

9 IBM provides a Linux-based development platform to help developers program for Cell chips. It is particularly useful for scientific computing.

10 Architecture
The Cell processor can be split into four components:
1. external input and output structures,
2. the main processor, called the Power Processing Element (PPE),
3. eight fully functional co-processors, called the Synergistic Processing Elements (SPEs), and
4. a specialized high-bandwidth circular data bus connecting the PPE, input/output elements, and the SPEs, called the Element Interconnect Bus (EIB).

12 To make the best use of the EIB, and to overlap computation with data transfer, each of the nine processing elements (the PPE and the eight SPEs) is equipped with a DMA engine. Since an SPE's load/store instructions can only access its own local memory, each SPE depends entirely on DMA to transfer data to and from main memory and other SPEs' local memories. A DMA operation can transfer either a single block of up to 16 KB, or a list of 2 to 2048 such blocks.
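The block-size constraint described above can be modelled with a small sketch. This is plain Python bookkeeping for illustration, not Cell SDK code: it splits a transfer into (offset, size) blocks of at most 16 KB, as a DMA list would.

```python
MAX_BLOCK = 16 * 1024   # a single DMA transfer moves at most 16 KB
MAX_LIST = 2048         # a DMA list batches at most 2048 blocks

def build_dma_list(total_bytes):
    """Split a transfer into (offset, size) blocks of at most 16 KB each."""
    blocks = []
    offset = 0
    while offset < total_bytes:
        size = min(MAX_BLOCK, total_bytes - offset)
        blocks.append((offset, size))
        offset += size
    if len(blocks) > MAX_LIST:
        raise ValueError("transfer needs more than one DMA list")
    return blocks

# A 40 KB transfer becomes three blocks: 16 KB + 16 KB + 8 KB.
blocks = build_dma_list(40 * 1024)
print(blocks)  # [(0, 16384), (16384, 16384), (32768, 8192)]
```

On real hardware the equivalent bookkeeping is done with MFC commands from the SPE side; the point here is only the 16 KB granularity.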

13 The PPE, which is capable of running a conventional operating system, has control over the SPEs and can start, stop, interrupt, and schedule processes running on the SPEs. Unlike SPEs, the PPE can read and write the main memory and the local memories of SPEs through the standard load/store instructions. The SPEs are not fully autonomous and require the PPE to prime them before they can do any useful work.

14 Both the PPE and SPE are RISC architectures with a fixed-width 32-bit instruction format. The PPE contains a 64-bit general purpose register set (GPR), a 64-bit floating point register set (FPR), and a 128-bit AltiVec register set. The SPE contains 128-bit registers only. These can be used for scalar data types ranging from 8 bits to 64 bits in size, or for SIMD computations on a variety of integer and floating point formats. System memory addresses for both the PPE and SPE are expressed as 64-bit values, for a theoretical address range of 2^64 bytes (16 exabytes, or 16,777,216 terabytes). In practice, not all of these bits are implemented in hardware.
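As an illustration of how one 128-bit register supports SIMD on multiple formats, the sketch below (plain Python, not SPE code) views 16 bytes as four 32-bit floats and performs one lane-wise "SIMD" add:

```python
import struct

def simd_add_4f(reg_a, reg_b):
    """Add two 16-byte (128-bit) values interpreted as four 32-bit floats."""
    a = struct.unpack("<4f", reg_a)
    b = struct.unpack("<4f", reg_b)
    return struct.pack("<4f", *(x + y for x, y in zip(a, b)))

# Pack four single-precision floats into one 128-bit "register" each.
ra = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
rb = struct.pack("<4f", 10.0, 20.0, 30.0, 40.0)
rc = simd_add_4f(ra, rb)

print(len(rc) * 8)               # 128 (bits in the register)
print(struct.unpack("<4f", rc))  # (11.0, 22.0, 33.0, 44.0)
```

The same 128 bits could equally be treated as two 64-bit doubles or sixteen 8-bit integers, which is the flexibility the slide describes.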

15 Local store addresses internal to the SPU (Synergistic Processor Unit) processor are expressed as a 32-bit word. In documentation relating to Cell a word is always taken to mean 32 bits, a doubleword means 64 bits, and a quadword means 128 bits.

16 Nvidia Tesla Nvidia Tesla is Nvidia's brand name for their products with very high computational power. Tesla products target the high performance computing market. They primarily operate in simulations and in large-scale calculations (especially floating-point calculations), and in high-end image generation for applications in professional and scientific fields, with the use of OpenCL or CUDA.

19 TPC: texture/processor cluster; SM: streaming multiprocessor; SP: streaming processor; Tex: texture unit; ROP: raster operation processor; SMC: SM controller

20 TESLA GPUS FOR WORKSTATIONS

Feature                                          | Tesla K40   | Tesla K20
-------------------------------------------------|-------------|------------
Peak double precision floating point performance | 1.43 Tflops | 1.17 Tflops
Peak single precision floating point performance | 4.29 Tflops | 3.52 Tflops
Memory bandwidth (ECC off)                       | 288 GB/sec  | 208 GB/sec
Memory size (GDDR5)                              | 12 GB       | 5 GB
CUDA cores                                       | 2880        | 2496

21 Intel Larrabee Microarchitecture The product is intended as a co-processor for high performance computing and does not function as a graphics processing unit. It is a hybrid between a multi-core CPU and a GPU, and has similarities to both: its coherent cache hierarchy and x86 architecture compatibility are CPU-like, while its wide SIMD vector units and texture sampling hardware are GPU-like.

22 Differences from CPUs Larrabee's x86 cores were based on the much simpler P54C Pentium design. The P54C-derived core is superscalar but does not include out-of-order execution, though it has been updated with modern features such as x86-64 support. Each Larrabee core contained a 512-bit vector processing unit, able to process 16 single-precision floating point numbers at a time.
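The lane count follows directly from the register width: 512 bits / 32 bits per single-precision float = 16 lanes. A minimal model of such a 16-wide vector add (illustrative Python, not Larrabee's instruction set):

```python
LANES = 512 // 32  # a 512-bit register holds 16 single-precision floats

def vadd_ps(va, vb):
    """Model one 16-wide single-precision vector add instruction."""
    assert len(va) == len(vb) == LANES
    return [a + b for a, b in zip(va, vb)]

va = [float(i) for i in range(LANES)]
vb = [1.0] * LANES
print(LANES)                # 16
print(vadd_ps(va, vb)[:4])  # [1.0, 2.0, 3.0, 4.0]
```

One such instruction performs 16 floating-point operations, which is where the vector units' throughput comes from.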

23 Additional features, like scatter/gather instructions and a mask register, are designed to make the vector unit easier to use and more efficient. Larrabee derives most of its number-crunching power from these vector units. Its fixed-function graphics hardware consists of texture sampling units, which perform trilinear and anisotropic filtering and texture decompression.
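To make the scatter/gather-plus-mask idea concrete, here is a small model (illustrative Python, not actual hardware semantics): each lane loads `memory[index[i]]` only where its mask bit is set, while masked-off lanes keep their previous value.

```python
def masked_gather(memory, indices, mask, old):
    """Model a gather under a per-lane mask register."""
    return [memory[i] if m else o for i, m, o in zip(indices, mask, old)]

memory = [10, 20, 30, 40, 50, 60, 70, 80]
indices = [7, 0, 3, 5]   # each lane loads from a different, non-contiguous address
mask = [1, 1, 0, 1]      # lane 2 is masked off
old = [0, 0, 0, 0]       # previous register contents survive in masked lanes

print(masked_gather(memory, indices, mask, old))  # [80, 10, 0, 60]
```

Without gather, loading from non-contiguous addresses would require one scalar load per lane; the mask lets conditional code stay vectorized.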

24 Larrabee has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory. It includes explicit cache control instructions to reduce cache thrashing during streaming operations which only read/write data once. Explicit prefetching into the L2 or L1 cache is also supported. It features enhanced multithreading: 4 threads per core.

25 Architecture

26 “FULLY” programmable. Legacy code is easy to migrate and deploy.

27 Nehalem (microarchitecture) Nehalem is the codename for an Intel processor microarchitecture, which is the successor to the older Core microarchitecture

28 Technology
Microarchitecture of a processor core in the quad-core implementation:
– Hyper-threading reintroduced
– 4–12 MB L3 cache
– Second-level branch predictor and translation lookaside buffer
– Native (all processor cores on a single die) quad- and octa-core processors
– Intel QuickPath Interconnect in high-end models, replacing the legacy front side bus
– 64 KB L1 cache per core (32 KB L1 data and 32 KB L1 instruction), and 256 KB L2 cache per core
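A quick arithmetic check of the cache figures above for a quad-core part (the 8 MB L3 value is an assumed mid-range figure; the slide gives a 4–12 MB range):

```python
CORES = 4
L1_PER_CORE_KB = 32 + 32   # split L1: 32 KB data + 32 KB instruction
L2_PER_CORE_KB = 256       # private L2 per core
L3_SHARED_KB = 8 * 1024    # shared L3, assumed 8 MB (slide range: 4-12 MB)

# Total on-die cache: per-core L1 + L2 across all cores, plus the shared L3.
total_kb = CORES * (L1_PER_CORE_KB + L2_PER_CORE_KB) + L3_SHARED_KB
print(L1_PER_CORE_KB)  # 64 KB L1 per core
print(total_kb)        # 9472 KB total for this assumed configuration
```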

29
– Integrated memory controller supporting two or three memory channels of DDR3 SDRAM or four FB-DIMM2 channels
– Second-generation Intel Virtualization Technology, which introduced Extended Page Table support, virtual processor identifiers (VPIDs), and non-maskable interrupt-window exiting
– 20 to 24 pipeline stages

31 References wikipedia.org

