Lecture 17: Multi-threaded Applications

Slides:



Advertisements
Similar presentations
Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
Advertisements

Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.
1 Lecture: Memory, Coherence Protocols Topics: wrap-up of memory systems, intro to multi-thread programming models.
1 Lecture 19: Shared-Memory Multiprocessors Topics: coherence protocols for symmetric shared-memory multiprocessors (Sections )
Lecture 18: Multiprocessors
1 Lecture 15: DRAM Design Today: DRAM basics, DRAM innovations (Section 5.3)
1 Lecture 18: Large Caches, Multiprocessors Today: NUCA caches, multiprocessors (Sections ) Reminder: assignment 5 due Thursday (don’t procrastinate!)
Multiprocessors ELEC 6200: Computer Architecture and Design Instructor : Agrawal Name: Nam.
1 Lecture 16: Virtual Memory Today: DRAM innovations, virtual memory (Sections )
1 Lecture 24: Multiprocessors Today’s topics:  Directory-based cache coherence protocol  Synchronization  Consistency  Writing parallel programs Reminder:
Multiprocessors CSE 471 Aut 011 Multiprocessors - Flynn’s Taxonomy (1966) Single Instruction stream, Single Data stream (SISD) –Conventional uniprocessor.
1 Lecture 23: Multiprocessors Today’s topics:  RAID  Multiprocessor taxonomy  Snooping-based cache coherence protocol.
1 Lecture 2: Intro and Snooping Protocols Topics: multi-core cache organizations, programming models, cache coherence (snooping-based)
1 Lecture 18: Shared-Memory Multiprocessors Topics: coherence protocols for symmetric shared-memory multiprocessors (Sections )
CPE 731 Advanced Computer Architecture Multiprocessor Introduction
1 Lecture 1: Introduction Course organization:  13 lectures on parallel architectures  ~5 lectures on cache coherence, consistency  ~3 lectures on TM.
1 Lecture 25: Multi-core Processors Today’s topics:  Writing parallel programs  SMT  Multi-core examples Reminder:  Assignment 9 due Tuesday.
Lecture 37: Chapter 7: Multiprocessors Today’s topic –Introduction to multiprocessors –Parallelism in software –Memory organization –Cache coherence 1.
1 Lecture: Memory, Coherence Protocols Topics: wrap-up of memory systems, multi-thread programming models, snooping-based protocols.
1 Lecture 2: Parallel Programs Topics: parallel applications, parallelization process, consistency models.
Multiprocessing. Going Multi-core Helps Energy Efficiency William Holt, HOT Chips 2005 Adapted from UC Berkeley "The Beauty and Joy of Computing"
Spring 2003CSE P5481 Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing.
Outline Why this subject? What is High Performance Computing?
Lecture 3: Computer Architectures
Multiprocessor So far, we have spoken at length microprocessors. We will now study the multiprocessor, how they work, what are the specific problems that.
1 Lecture 25: Multiprocessors Today’s topics:  Synchronization  Consistency  Shared memory vs message-passing  Simultaneous multi-threading (SMT)
1 Lecture: Memory Technology Innovations Topics: state-of-the-art and upcoming changes: buffer chips, 3D stacking, non-volatile cells, photonics Multiprocessor.
1 Lecture 17: Multiprocessors Topics: multiprocessor intro and taxonomy, symmetric shared-memory multiprocessors (Sections )
1 Lecture: Memory Technology Innovations Topics: memory schedulers, refresh, state-of-the-art and upcoming changes: buffer chips, 3D stacking, non-volatile.
1 Lecture 3: Memory Energy and Buffers Topics: Refresh, floorplan, buffers (SMB, FB-DIMM, BOOM), memory blades, HMC.
CDA-5155 Computer Architecture Principles Fall 2000 Multiprocessor Architectures.
Lecture 13 Parallel Processing. 2 What is Parallel Computing? Traditionally software has been written for serial computation. Parallel computing is the.
1 Lecture 16: Main Memory Innovations Today: DRAM basics, innovations, trends HW5 due on Thursday; simulations can take a few hours Midterm: 32 scores.
Group Members Hamza Zahid (131391) Fahad Nadeem khan Abdual Hannan AIR UNIVERSITY MULTAN CAMPUS.
These slides are based on the book:
CS5102 High Performance Computer Systems Thread-Level Parallelism
Distributed Processors
buses, crossing switch, multistage network.
Parallel Computers Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
CS 147 – Parallel Processing
Multiprocessors Oracle SPARC M core, 64MB L3 cache (8 x 8 MB), 1.6TB/s. 256 KB of 4-way SA L2 ICache, 0.5 TB/s per cluster. 2 cores share 256 KB,
Lecture 15: DRAM Main Memory Systems
Multi-Processing in High Performance Computer Architecture:
Lecture: Memory, Multiprocessors
CMSC 611: Advanced Computer Architecture
Lecture 26: Multiprocessors
Lecture 26: Multiprocessors
Lecture 1: Parallel Architecture Intro
Lecture 2: Parallel Programs
Chapter 17 Parallel Processing
Outline Interconnection networks Processor arrays Multiprocessors
Lecture: Memory Technology Innovations
Multiprocessors - Flynn’s taxonomy (1966)
Lecture: Coherence Protocols
Lecture 27: Pot-Pourri Today’s topics:
CS 213: Parallel Processing Architectures
Introduction to Multiprocessors
Parallel Processing Architectures
Lecture 24: Memory, VM, Multiproc
buses, crossing switch, multistage network.
Lecture: Coherence Protocols
Distributed Systems CS
Lecture 27: Multiprocessors
Lecture 26: Multiprocessors
Chapter 4 Multiprocessors
Lecture 24: Virtual Memory, Multiprocessors
Lecture 23: Virtual Memory, Multiprocessors
Lecture 3: Coherence Protocols
Lecture 19: Coherence Protocols
Lecture 18: Cache Coherence
Presentation transcript:

Lecture 17: Multi-threaded Applications Today: Memory wrap-up, multiprocessors, shared memory and message-passing HW6 will be posted this weekend Notes on memory systems will also be posted shortly

.. .. .. .. .. .. .. .. Modern Memory System PROC 4 DDR3 channels 64-bit data channels 800 MHz channels 1-2 DIMMs/channel 1-4 ranks/channel .. ..

.. .. .. .. Cutting-Edge Systems SMB PROC The link into the processor is narrow and high frequency The Scalable Memory Buffer chip is a “router” that connects to multiple DDR3 channels (wide and slow) Boosts processor pin bandwidth and memory capacity More expensive, high power

Future Memory Trends Processor pin count is not increasing High memory bandwidth requires high pin frequency High memory capacity requires narrow channels per “DIMM” 3D stacking can enable high memory capacity and high channel frequency (e.g., Micron HMC)

Future Memory Cells DRAM cell scaling is expected to slow down Emerging memory cells are expected to have better scaling properties and eventually higher density: phase change memory (PCM), spin torque transfer (STT-RAM), etc. PCM: heat and cool a material with elec pulses – the rate of heat/cool determines if the material is crystalline/amorphous; amorphous has higher resistance (i.e., no longer using capacitive charge to store a bit) Advantages: non-volatile, high density, faster than Flash/disk Disadvantages: poor write latency/energy, low endurance

Silicon Photonics Game-changing technology that uses light waves for communication; not mature yet and high cost likely No longer relies on pins; a few waveguides can emerge from a processor Each waveguide carries (say) 64 wavelengths of light (dense wave division multiplexing – DWDM) The signal on a wavelength can be modulated at high frequency – gives very high bandwidth per waveguide

Taxonomy SISD: single instruction and single data stream: uniprocessor MISD: no commercial multiprocessor: imagine data going through a pipeline of execution engines SIMD: vector architectures: lower flexibility MIMD: most multiprocessors today: easy to construct with off-the-shelf computers, most flexibility

Memory Organization - I Centralized shared-memory multiprocessor or Symmetric shared-memory multiprocessor (SMP) Multiple processors connected to a single centralized memory – since all processors see the same memory organization  uniform memory access (UMA) Shared-memory because all processors can access the entire memory address space Can centralized memory emerge as a bandwidth bottleneck? – not if you have large caches and employ fewer than a dozen processors

SMPs or Centralized Shared-Memory Processor Processor Processor Processor Caches Caches Caches Caches Main Memory I/O System

Memory Organization - II For higher scalability, memory is distributed among processors  distributed memory multiprocessors If one processor can directly address the memory local to another processor, the address space is shared  distributed shared-memory (DSM) multiprocessor If memories are strictly local, we need messages to communicate data  cluster of computers or multicomputers Non-uniform memory architecture (NUMA) since local memory has lower latency than remote memory

Interconnection network Distributed Memory Multiprocessors Processor & Caches Processor & Caches Processor & Caches Processor & Caches Memory I/O Memory I/O Memory I/O Memory I/O Interconnection network

Shared-Memory Vs. Message-Passing Well-understood programming model Communication is implicit and hardware handles protection Hardware-controlled caching Message-passing: No cache coherence  simpler hardware Explicit communication  easier for the programmer to restructure code Sender can initiate data transfer

Ocean Kernel Procedure Solve(A) begin diff = done = 0; while (!done) do diff = 0; for i  1 to n do for j  1 to n do temp = A[i,j]; A[i,j]  0.2 * (A[i,j] + neighbors); diff += abs(A[i,j] – temp); end for if (diff < TOL) then done = 1; end while end procedure

Shared Address Space Model procedure Solve(A) int i, j, pid, done=0; float temp, mydiff=0; int mymin = 1 + (pid * n/procs); int mymax = mymin + n/nprocs -1; while (!done) do mydiff = diff = 0; BARRIER(bar1,nprocs); for i  mymin to mymax for j  1 to n do … endfor LOCK(diff_lock); diff += mydiff; UNLOCK(diff_lock); BARRIER (bar1, nprocs); if (diff < TOL) then done = 1; endwhile int n, nprocs; float **A, diff; LOCKDEC(diff_lock); BARDEC(bar1); main() begin read(n); read(nprocs); A  G_MALLOC(); initialize (A); CREATE (nprocs,Solve,A); WAIT_FOR_END (nprocs); end main

Message Passing Model main() for i  1 to nn do read(n); read(nprocs); CREATE (nprocs-1, Solve); Solve(); WAIT_FOR_END (nprocs-1); procedure Solve() int i, j, pid, nn = n/nprocs, done=0; float temp, tempdiff, mydiff = 0; myA  malloc(…) initialize(myA); while (!done) do mydiff = 0; if (pid != 0) SEND(&myA[1,0], n, pid-1, ROW); if (pid != nprocs-1) SEND(&myA[nn,0], n, pid+1, ROW); RECEIVE(&myA[0,0], n, pid-1, ROW); RECEIVE(&myA[nn+1,0], n, pid+1, ROW); for i  1 to nn do for j  1 to n do … endfor if (pid != 0) SEND(mydiff, 1, 0, DIFF); RECEIVE(done, 1, 0, DONE); else for i  1 to nprocs-1 do RECEIVE(tempdiff, 1, *, DIFF); mydiff += tempdiff; if (mydiff < TOL) done = 1; for i  1 to nprocs-1 do SEND(done, 1, I, DONE); endif endwhile

Title Bullet