Lecture 6: Multicore Systems

Multicore Computers (chip multiprocessors)
- Combine two or more processors (cores) on a single piece of silicon
- Each core consists of an ALU, registers, pipeline hardware, and L1 instruction and data caches
- Multithreading is used

Pollack’s Rule
- Performance increase is roughly proportional to the square root of the increase in complexity: performance ∝ √complexity
- Power consumption increase is roughly linearly proportional to the increase in complexity: power consumption ∝ complexity

Pollack’s Rule

  complexity    power    performance
       1          1           1
       4          4           2
      25         25           5

- 100s of low-complexity cores, each operating at very low power
- Ex: four small cores: complexity 4x1, power 4x1, performance 4x1 = 4
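
As a rough sketch (not from the slides, and using made-up normalized units), the table above follows directly from the two proportionalities; the code below compares one complex core against four simple ones at the same power budget.

#include <stdio.h>
#include <math.h>

/* Hypothetical model of Pollack's Rule: performance ~ sqrt(complexity),
   power ~ complexity (both normalized to a core of complexity 1).      */
static double perf(double complexity)  { return sqrt(complexity); }
static double power(double complexity) { return complexity; }

int main(void) {
    /* One big core of complexity 4 vs. four small cores of complexity 1 */
    printf("1 core of complexity 4:  perf = %.1f, power = %.1f\n",
           perf(4.0), power(4.0));               /* perf 2.0, power 4.0 */
    printf("4 cores of complexity 1: perf = %.1f, power = %.1f\n",
           4 * perf(1.0), 4 * power(1.0));       /* perf 4.0, power 4.0 */
    return 0;
}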

Increasing CPU Performance
Manycore chip composed of hybrid cores:
- Some general purpose
- Some graphics
- Some floating point

Exascale Systems Millions of cores Exascale systems (1018 Flop/s) Board composed of multiple manycore chips sharing memory Rack composed of multiple boards A room full of these racks Millions of cores Exascale systems (1018 Flop/s)

Moore’s Law Reinterpreted
- Number of cores per chip doubles every 2 years
- Number of threads of execution doubles every 2 years

Shared Memory MIMD
- Shared memory: single address space
- All processes have access to the pool of shared memory
[Figure: processors (P) connected to the shared memory over a bus]
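
A minimal sketch of the shared-memory model (not from the slides), in the same OpenMP style as the STREAM code at the end of this lecture: every thread reads and writes one shared array in a single address space.

#include <stdio.h>
#include <omp.h>

#define N 8
int shared_data[N];                 /* a single copy, visible to every thread */

int main(void) {
    #pragma omp parallel for        /* threads run in the same address space  */
    for (int i = 0; i < N; i++)
        shared_data[i] = omp_get_thread_num();   /* each thread records its id */

    for (int i = 0; i < N; i++)
        printf("shared_data[%d] was written by thread %d\n", i, shared_data[i]);
    return 0;
}

With OpenMP enabled (e.g. gcc -fopenmp), different elements end up written by different threads, yet all of the writes land in the one shared array.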

Shared Memory MIMD
- Each processor executes different instructions asynchronously, using different data
[Figure: several control unit (CU) / processing element (PE) pairs, each with its own instruction and data streams, connected to a shared memory]

Symmetric Multiprocessors (SMP)
- MIMD, shared memory, UMA
[Figure: two or more processors, each with private L1 and L2 caches, connected to main memory and I/O modules over a system bus (Stallings, 9th ed., p. 637)]

Symmetric Multiprocessors (SMP)
Characteristics:
- Two or more similar processors
- Processors share the same memory and I/O facilities
- Processors are connected by a bus or other internal connection scheme, such that memory access time is the same for each processor
- All processors share access to I/O devices
- All processors can perform the same functions
- The system is controlled by an integrated operating system that provides interaction between processors and their programs

Symmetric Multiprocessors (SMP)
Operating system:
- Provides tools and functions to exploit the parallelism
- Schedules processes or threads across all of the processors
- Takes care of scheduling of threads and processes on processors, and of synchronization among processors
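
As a hedged illustration of the last point (not from the slides): on a shared-memory machine, threads scheduled on different processors must synchronize whenever they update the same data. A minimal OpenMP sketch:

#include <stdio.h>

int main(void) {
    long counter = 0;                     /* shared by all threads */
    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++) {
        #pragma omp atomic                /* without this, concurrent updates race */
        counter++;
    }
    printf("counter = %ld\n", counter);   /* 1000000 only because each update is atomic */
    return 0;
}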

Multicore Computers
Dedicated L1 cache (ARM11 MPCore)
[Figure: each CPU core has its own L1-I and L1-D caches; a shared L2 cache connects the cores to main memory and I/O (Stallings, 9th ed., p. 697)]

Multicore Computers
Dedicated L2 cache (AMD Opteron)
[Figure: each CPU core has its own L1-I/L1-D caches and its own L2 cache; the cores share main memory and I/O]

Multicore Computers
Shared L2 cache (Intel Core Duo)
[Figure: each CPU core has its own L1-I/L1-D caches; all cores share a single L2 cache, main memory, and I/O]

Multicore Computers
Shared L3 cache (Intel Core i7)
[Figure: each CPU core has its own L1-I/L1-D caches and its own L2 cache; all cores share an L3 cache, main memory, and I/O]

Multicore Computers
Advantages of a shared L2 cache:
- Reduced overall miss rate: a thread on one core may cause a frame to be brought into the cache, and a thread on another core may then access the same location that has already been brought in
- Data shared by multiple cores is not replicated
- The amount of shared cache allocated to each core may be dynamic
- Interprocessor communication is easy to implement
Advantages of a dedicated L2 cache:
- Each core can access its private cache more rapidly
L3 cache:
- When the amount of memory and the number of cores grow, an L3 cache provides better performance
(Stallings, 9th ed., p. 649)

Multicore Computers
- On-chip interconnects: bus, crossbar
- Off-chip communication (CPU-to-CPU or I/O): bus-based

Multicore Computers
Multithreading:
- A multithreaded processor provides a separate PC for each thread (hardware multithreading)
- Implicit multithreading: concurrent execution of multiple threads extracted from a single sequential program
- Explicit multithreading: execute instructions from different explicit threads by interleaving instructions from different threads on shared or parallel pipelines
(Stallings, 9th ed., p. 649)

Multicore Computers
Explicit multithreading:
- Fine-grained multithreading (interleaved multithreading): the processor deals with two or more thread contexts at a time, switching from one thread to another at each clock cycle
- Coarse-grained multithreading (blocked multithreading): instructions of a thread are executed sequentially until an event that causes a delay (e.g. a cache miss) occurs; this event causes a switch to another thread
- Simultaneous multithreading (SMT): instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor; thread-level parallelism is combined with instruction-level parallelism (ILP)
- Chip multiprocessing (CMP): each processor of a multicore system handles separate threads

[Figure: coarse-grained, fine-grained, and simultaneous multithreading (SMT) compared with CMP (Patterson & Hennessy, Computer Organization and Design: The Hardware/Software Interface, 4th ed., p. 647)]

GPUs (Graphics Processing Units)
Characteristics of GPUs:
- GPUs are accelerators for CPUs
- SIMD
- GPUs have many parallel processors and many concurrent threads (i.e. 10 or more cores; 100s or 1000s of threads per core)
- The CPU-GPU combination is an example of heterogeneous computing
- GPGPU (general-purpose GPU): using a GPU to perform applications traditionally handled by the CPU

GPUs

GPUs
Core complexity:
- Out-of-order execution
- Dynamic branch prediction
- Larger pipelines for higher clock rates
→ More circuitry → High performance

GPUs
Complex cores are preferable for:
- Highly instruction-parallel numeric applications
- Floating-point applications
A large number of simple cores is preferable when:
- The application’s serial part is small
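
Amdahl's law is not stated on the slide, but it is the usual way to quantify that last point; the sketch below (with hypothetical core counts and serial fractions) shows how the payoff of many simple cores collapses as the serial part grows.

#include <stdio.h>

/* Amdahl's law: speedup on n cores when a fraction s of the work is serial */
static double speedup(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void) {
    int n = 64;                              /* hypothetical number of simple cores */
    double serial[] = { 0.01, 0.10, 0.50 };  /* 1%, 10%, 50% serial work            */
    for (int i = 0; i < 3; i++)
        printf("serial = %4.0f%%  ->  speedup on %d cores = %5.1fx\n",
               100.0 * serial[i], n, speedup(serial[i], n));
    /* prints roughly 39.3x, 8.8x, 2.0x: only a small serial part lets
       a large number of simple cores pay off                          */
    return 0;
}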

Cache Performance Intel Core i7

Roofline Performance Model
Arithmetic intensity is the ratio of floating-point operations in a program to the number of data bytes accessed by the program from main memory:

  Arithmetic intensity = floating-point operations / number of data bytes accessed   (FLOPs/byte)
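
A hedged worked example (not on the slide, and ignoring any write-allocate traffic the cache may add): the STREAM Triad kernel at the end of this lecture, a[j] = b[j] + scalar*c[j], performs 2 floating-point operations per iteration and moves three 8-byte doubles (reads b[j] and c[j], writes a[j]), so its arithmetic intensity is about 2 / 24 ≈ 0.083 FLOPs/byte.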

Roofline Performance Model

  Attainable GFLOPs/second = min( Peak memory bandwidth x Arithmetic intensity,
                                  Peak floating-point performance )
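
A minimal sketch of the same formula as code (not from the slides); the peak numbers in main are made-up placeholders rather than the specs of any particular machine.

#include <stdio.h>

/* Attainable GFLOP/s = min(peak memory bandwidth * arithmetic intensity,
                            peak floating-point performance)              */
static double attainable_gflops(double peak_gflops, double peak_bw_gb_s, double ai) {
    double memory_bound = peak_bw_gb_s * ai;   /* (GB/s) * (FLOPs/byte) = GFLOP/s */
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void) {
    double peak_gflops = 100.0;        /* hypothetical peak floating-point performance */
    double peak_bw     = 25.0;         /* hypothetical peak memory bandwidth in GB/s   */
    double ai_triad    = 2.0 / 24.0;   /* arithmetic intensity of STREAM Triad         */
    printf("Triad bound: %.2f GFLOP/s\n",
           attainable_gflops(peak_gflops, peak_bw, ai_triad));
    /* ~2.08 GFLOP/s, far below the 100 GFLOP/s peak: Triad is memory-bound */
    return 0;
}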

Roofline Performance Model
- Peak floating-point performance is given by the hardware specifications of the computer (FLOPs/second)
- For multicore chips, peak performance is the collective performance of all the cores on the chip, so multiply the peak per chip by the number of chips
- Peak memory performance is also given by the hardware specifications of the computer (Mbytes/second)
- The maximum floating-point performance that the memory system of the computer can support for a given arithmetic intensity can be plotted as
  Peak memory bandwidth x Arithmetic intensity: (bytes/second) x (FLOPs/byte) => FLOPs/second

Roofline Performance Model
- The roofline sets an upper bound on performance
- The roofline of a computer does not vary by benchmark kernel

Stream Benchmark
- A synthetic benchmark
- Measures the performance of long vector operations
- The kernels have no temporal locality and they access arrays that are larger than the cache size
- http://www.cs.virginia.edu/stream/ref.html

#define N 2000000
...

void tuned_STREAM_Copy() {
    int j;
    #pragma omp parallel for
    for (j=0; j<N; j++) c[j] = a[j];
}

void tuned_STREAM_Scale(double scalar) {
    int j;
    #pragma omp parallel for
    for (j=0; j<N; j++) b[j] = scalar*c[j];
}

void tuned_STREAM_Add() {
    int j;
    #pragma omp parallel for
    for (j=0; j<N; j++) c[j] = a[j]+b[j];
}

void tuned_STREAM_Triad(double scalar) {
    int j;
    #pragma omp parallel for
    for (j=0; j<N; j++) a[j] = b[j]+scalar*c[j];
}
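
If you want to try this yourself (not covered on the slide): the full stream.c from the URL above also declares the a, b, c arrays and the timing code; it is typically built with an OpenMP-enabled compiler, e.g. gcc -O2 -fopenmp stream.c -o stream, and the reported Copy/Scale/Add/Triad bandwidths can be plugged into the roofline model above.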