CS5102 High Performance Computer Systems Thread-Level Parallelism

CS5102 High Performance Computer Systems Thread-Level Parallelism Prof. Chung-Ta King Department of Computer Science National Tsing Hua University, Taiwan (Slides are from textbook, Prof. O. Mutlu)

Outline
- Introduction (Sec. 5.1)
- Centralized shared-memory architectures (Sec. 5.2)
- Distributed shared-memory and directory-based coherence (Sec. 5.4)
- Synchronization: the basics (Sec. 5.5)
- Models of memory consistency (Sec. 5.6)

Why Multiprocessors?
- Improve performance (execution time or task throughput)
- Reduce power consumption: 4N cores at frequency F/4 consume less power than N cores at frequency F
- Leverage replication, reduce complexity, improve scalability
- Improve dependability: redundant execution in space
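A rough back-of-the-envelope sketch of the power claim, assuming dynamic power dominates (P ≈ C · V² · f) and that supply voltage can be scaled roughly in proportion to frequency; neither assumption is stated on the slide:
  N cores at frequency F, voltage V:          P ≈ N · C · V² · F
  4N cores at frequency F/4, voltage ~V/4:    P ≈ 4N · C · (V/4)² · (F/4) = (N · C · V² · F) / 16
Without voltage scaling the two configurations have roughly the same dynamic power; the saving comes from running each core at a lower voltage.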

Types of Multiprocessors
Loosely coupled multiprocessors
- No shared global memory address space (processes)
- Multicomputer / network-based multiprocessors
- Usually programmed via message passing: explicit calls (send, receive) for communication
Tightly coupled multiprocessors
- Shared global memory address space
- Programming model similar to a uniprocessor (i.e., multitasking uniprocessor)
- Threads cooperate via shared variables (memory); operations on shared data require synchronization

Loosely-Coupled vs Tightly-Coupled
Summation of the elements of an array:
  for (i = 0; i < 10000; i++)
    sum = sum + A[i];
For tightly-coupled multiprocessors (10 nodes, shared A[] and sum):
  for (i = 1000*pid; i < 1000*(pid+1); i++)
    sum = sum + A[i];            // update of sum is a critical section!
For loosely-coupled multiprocessors (each node holds its own 1000 elements):
  for (i = 0; i < 1000; i++)
    sum = sum + A[i];
  if (pid != 0)
    send(0, sum);
  else
    for (i = 1; i < 10; i++) {   // node 0 collects partial sums from nodes 1..9
      receive(i, partial_sum);
      sum = sum + partial_sum;
    }
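A minimal runnable sketch of the tightly-coupled version using POSIX threads; the names (NUM_THREADS, worker, sum_lock) and the use of a single global mutex are illustrative assumptions, not from the slides:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 10
#define N 10000

double A[N];
double sum = 0.0;
pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread sums its own 1000-element slice locally, then
 * updates the shared total inside a critical section. */
void *worker(void *arg) {
    long pid = (long)arg;
    double local = 0.0;
    for (int i = 1000 * pid; i < 1000 * (pid + 1); i++)
        local += A[i];
    pthread_mutex_lock(&sum_lock);      /* critical section */
    sum += local;
    pthread_mutex_unlock(&sum_lock);
    return NULL;
}

int main(void) {
    pthread_t t[NUM_THREADS];
    for (int i = 0; i < N; i++) A[i] = 1.0;    /* example data */
    for (long p = 0; p < NUM_THREADS; p++)
        pthread_create(&t[p], NULL, worker, (void *)p);
    for (int p = 0; p < NUM_THREADS; p++)
        pthread_join(t[p], NULL);
    printf("sum = %f\n", sum);
    return 0;
}

Accumulating into a thread-local variable and taking the lock only once per thread keeps the critical section short, which matters for the synchronization-overhead point made later under Amdahl's Law.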

Loosely Coupled Multiprocessors
- Each node has private memory; it cannot directly access memory on another node
- Nodes use explicit send/recv to exchange data
- Data allocation is important
- Examples: IBM SP-2, clusters of workstations; MPI programming
[Figure: four nodes, each with processor (P), cache ($), local memory (Mem, addresses 0..N-1), and network interface (NI), connected by an interconnection network; data moves between nodes via send messages]

Message Passing Programming Model
- User-level send/receive abstraction
- Local buffers (x, y), process(or)s (P, Q), and a tag (t)
- Explicit communication and synchronization
  Send x, Q, t -- process P sends the contents of its local buffer x to Q with tag t
  Recv y, P, t -- process Q receives data from P with a matching tag t and puts it into its local buffer y
[Figure: processes P and Q have separate local address spaces; the send on P is matched with the receive on Q by the tag]
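A minimal MPI sketch of this send/receive pattern (the buffer names, tag value, and use of MPI_DOUBLE are illustrative assumptions):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double x = 3.14, y = 0.0;   /* local buffers on each process */
    int tag = 42;               /* tag used to match send and receive */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {            /* process P: send x to process 1 */
        MPI_Send(&x, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {     /* process Q: receive from 0 into y */
        MPI_Recv(&y, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received y = %f\n", y);
    }

    MPI_Finalize();
    return 0;
}

Run with at least two processes (e.g., mpirun -np 2). Note that the data movement is entirely explicit: nothing is shared, and the receive names both the expected sender and the tag.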

Thread-Level Parallelism
- Multiple program counters sharing an address space; uses the MIMD model
- The amount of computation assigned to each thread (grain size) must be sufficiently large, compared to array or vector processors
- We will focus on tightly-coupled multiprocessors: computers consisting of tightly-coupled processors whose coordination and usage are controlled by a single OS and that share memory through a shared address space
- If the multiprocessor is implemented on a single chip, we have a multicore

Tightly Coupled Multiprocessors
- Multiple threads/processors use a shared memory (address space)
- Communication is implicit, via loads and stores; the opposite of explicit message-passing multiprocessors
- Theoretical foundation: PRAM model
- Familiar to system (OS and DB) as well as application programmers. Why? The same techniques that use concurrency to overlap disk and network latencies also work for parallel speedup on MPs, and the same software can take advantage of MPs with only small changes
[Figure: processors P1-P4 sharing a single memory system]

Why Shared Memory?
Pros:
- Application sees a multitasking uniprocessor
- Familiar programming model, similar to a multitasking uniprocessor; no need to manage data allocation
- OS needs only evolutionary extensions
- Communication happens without involving the OS
Cons:
- Synchronization is complex
- Communication is implicit and indirect (hard to optimize)
- Hard to implement (in hardware)

Tightly Coupled Multiprocessors: 2 Types
Symmetric multiprocessors (SMP)
- Small number of cores
- Share a single, centralized memory with uniform memory access/latency (UMA)
(Fig. 5.1)

UMA: Uniform Memory/Cache Access
All cores have the same uncontended latency to memory; latencies get worse as the system grows
+ Data placement is unimportant/less important (easier to optimize code and make use of the available memory space)
- Contention can restrict bandwidth and increase latency

Tightly Coupled Multiprocessors: 2 Types
Distributed shared memory (DSM)
- Memory is distributed among the processors → supports a larger number of cores
- Non-uniform memory access/latency (NUMA)
(Fig. 5.2)

Alternative View of DSM
- All local memories are addressed by one global address space (node 0 holds addresses 0..N-1, node 1 holds N..2N-1, and so on)
- A node can directly access memory on other nodes using normal ld/st
- Nodes are connected via direct (switched) or multi-hop interconnection networks
[Figure: four nodes, each with processor (P), cache ($), network interface (NI), and a slice of the global memory (node 0: 0..N-1, node 1: N..2N-1, node 2: 2N..3N-1, node 3: 3N..4N-1), connected by an interconnection network; a ld on one node can reach memory on another]

NUMA: Non-Uniform Memory/Cache Access
Shared memory is split into local versus remote memory
+ Low latency, high bandwidth to local memory
- Much higher latency to remote memories
- Performance is very sensitive to data placement

Caveats of Parallelism
Amdahl's Law
  f: parallelizable fraction of a program
  N: number of processors
  Speedup = 1 / ((1 - f) + f/N)
- Maximum speedup is limited by the serial portion: the serial bottleneck
- The parallel portion is usually not perfectly parallel:
  - Synchronization overhead (e.g., updates to shared data)
  - Load imbalance overhead (imperfect parallelization)
  - Resource sharing overhead (contention among N cores)
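A quick worked instance of the formula (the numbers are illustrative, not from the slides):
  f = 0.95, N = 16
  Speedup = 1 / ((1 - 0.95) + 0.95/16) = 1 / 0.109375 ≈ 9.1
Even a 95%-parallel program gains only about 9x on 16 processors, and as N grows the speedup is bounded by 1/(1 - f) = 20 no matter how many processors are added.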

Bottlenecks in the Parallel Portion
- Synchronization: operations manipulating shared data cannot be parallelized
  - Locks, mutual exclusion, barrier synchronization
- Communication: tasks may need values from each other
  - Causes thread serialization when shared data is contended
- Load imbalance: parallel tasks have different lengths
  - Due to imperfect parallelization or microarchitectural effects
  - Reduces speedup in the parallel portion
- Resource contention: parallel tasks can share hardware resources, delaying each other
  - Replicating all resources (e.g., memory) is expensive
  - Additional latency not present when each task runs alone

Issues in Tightly Coupled Multiprocessors
- Exploiting parallelism in applications
- Long latency of remote accesses
- Shared memory synchronization: locks, atomic operations
- Cache coherence and memory consistency
- Ordering of memory operations: what should the programmer expect the hardware to provide?
- Resource sharing, contention, partitioning
- Communication: interconnection networks
- Load imbalance