SIMD: Single instruction on multiple data

Presentation transcript:

SIMD
Single instruction on multiple data
  – This form of parallel processing has existed since the 1960s
  – The idea is that, rather than executing array operations in a loop, we execute all of the array operations in parallel on different processing elements (ALUs)
      we convert for(i=0;i<n;i++) a[i]++; into a single operation, say A=A+1
  – Not only do we get a speedup from the parallelism, we also remove the loop overhead (incrementing i, the comparison and the conditional branch)
There have been 3 approaches to SIMD
  – Vector architectures (including matrix architectures)
  – Multimedia SIMD instruction set extensions
  – Graphics processor units
      here we concentrate on the first and second

Two Views
If we have n processing elements
  – we view the CPU as having a control unit and n ALUs (processing elements, or PEs in the figure)
  – each PE handles 1 datum from the array, where the data is cached in the PE’s local cache
Otherwise, we use pipelined functional units
  – rather than executing the instruction on n data simultaneously, in each cycle we start the next array operation in the functional unit pipeline

The Pipelined Approach
Although simultaneous execution is the faster of the two, the pipelined approach is preferred in modern architectures for several reasons
  – It is a lot cheaper than having n PEs
  – We already have pipelined functional units, so vector processing does not require a significant change to our ALU
  – Simultaneous execution is limited to n parallel operations per cycle by the number of PEs, so we may still need the looping mechanism
      e.g., a loop over 100 array elements on an architecture with 8 PEs still needs the loop to iterate 13 times
  – There is no need to support numerous individual caches with parallel access, although we will use multi-banked caches
  – It uses less power, which is significant today

VMIPS
We alter MIPS to support vector operations
  – The idea is that we pack array elements together so that we can fetch several array elements in one cycle from memory (cache) and store them in large (wide) registers
      we will use vector registers, where one register stores multiple array elements, a portion of the entire array
  – This requires widening the bus and also costs us greater memory access times, because we are retrieving numerous words at a time
      in VMIPS, a vector register can store 64 elements of 64 bits each, and there are 8 such registers
  – Additionally, there are scalar registers (32 integer and 32 FP)
      the registers all connect via ports to all of the functional units as well as the load/store unit; there are numerous ports to support parallel data movement (see slide 2 or figure 4.2 page 265)

VMIPS Instruction Set
Aside from the ordinary MIPS instructions (scalar operations), we enhance MIPS with the following:
  – LV, SV: load vector, store vector
      LV V1, R1 loads vector register V1 with the data starting at the memory location stored in R1
      also LVI/SVI for indexed addressing, and LVWS/SVWS for strided addressing
  – ADDVV.D V1, V2, V3 (V1 ← V2 + V3)
  – ADDVS.D V1, V2, F0 (vector-scalar addition: F0 is added to each element of V2)
      similarly for SUB, MUL and DIV
  – S--VV.D V1, V2 and S--VS.D V1, F0 compare pairwise elements in V1 and V2, or V1 and F0
      -- is one of EQ, NE, GT, LT, GE, LE
      the result of the comparison is a set of boolean values placed into the bit vector register VM, which we can then use to implement if statements
  – POP R1, VM: count the number of 1s in VM and store the count in R1
      this is only a partial list of instructions, and only the FP operations; see figure 4.3 for more detail (the integer operations are missing here)

Example
Let’s look at a typical vector processing problem, computing Y = a*X + Y
  – X and Y are vectors and a is a scalar (e.g., y[i]=y[i]+a*x[i])
The MIPS code is shown first, followed by the VMIPS code

MIPS:
        L.D     F0, a
        DADDI   R4, Rx, #512
Loop:   L.D     F2, 0(Rx)
        MUL.D   F2, F2, F0
        L.D     F4, 0(Ry)
        ADD.D   F4, F4, F2
        S.D     F4, 0(Ry)
        DADDI   Rx, Rx, #8
        DADDI   Ry, Ry, #8
        DSUB    R20, R4, Rx
        BNEZ    R20, Loop

VMIPS:
        L.D     F0, a
        LV      V1, Rx
        MULVS.D V2, V1, F0
        LV      V3, Ry
        ADDVV.D V4, V2, V3
        SV      V4, Ry

In MIPS, we execute almost 600 instructions whereas in VMIPS we execute only 6 (there are 64 elements in the array to process, each 8 bytes long), and there are no RAW hazards or control hazards to deal with
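The "almost 600 versus 6" figure can be checked by counting instructions in the two listings. The sketch below is only bookkeeping, assuming the scalar loop has 2 setup instructions and a 9-instruction body as written above; the function names are made up for illustration.

```python
def scalar_mips_instructions(n_elements, setup=2, loop_body=9):
    """Dynamic instruction count of the scalar loop.

    setup     = L.D F0 and DADDI R4 before the loop
    loop_body = L.D, MUL.D, L.D, ADD.D, S.D, DADDI, DADDI, DSUB, BNEZ
    """
    return setup + loop_body * n_elements


def vmips_instructions():
    """The vector version: L.D, LV, MULVS.D, LV, ADDVV.D, SV."""
    return 6


if __name__ == "__main__":
    print(scalar_mips_instructions(64), "scalar instructions")
    print(vmips_instructions(), "vector instructions")
```

With 64 elements the scalar count comes to 2 + 9 * 64, which is where "almost 600" comes from.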

Vector Execution Time
Although we typically compute execution time in seconds (ns) or clock cycles, for vector operations architects are more interested in the number of distinct issues required to execute some chunk of code
  – This requires some explanation
  – The vector processor’s performance is affected by the length of the vector (the number of array values stored in a single vector), any structural hazards (caused by limits on the number and type of functional units) and data dependencies between vectors
      we will ignore the last one, at least for now
  – The vector processor’s performance, then, is primarily based on the length of the vector
      for instance, in VMIPS our vector length is 64 doubles, so if our array holds 128 doubles, we have to do our vector operation twice

Convoys and Chimes
A convoy is a set of sequential vector operations that can be issued together without a structural hazard
  – Because we are operating on vectors in a pipeline, the execution of these operations can be overlapped
      e.g., LV V1, Rx followed by ADDVV.D V3, V1, V2 would allow us to retrieve the first element of V1 and then start the addition while retrieving the second element of V1
A chime is the amount of time it takes to execute a convoy
  – We will assume that there are no stalls in executing the convoy, so the chime will take n + x – 1 cycles, where x is the length of the convoy and n is the number of data in the vector
  – A program of m convoys will take m chimes, or m * (n + x – 1) cycles (again, assuming no stalls)
  – The chime time ignores pipeline overhead, and so architects prefer to discuss performance in chimes
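The cycle estimate on this slide is simple enough to write down directly. This is a minimal sketch of the slide's own formula (no stalls, every convoy assumed to have the same length x); the function names are illustrative.

```python
def chime_cycles(n, x):
    """Cycles for one convoy of x overlapped operations on n elements,
    using the slide's no-stall estimate n + x - 1."""
    return n + x - 1


def program_cycles(m, n, x):
    """Cycles for m convoys, each estimated at n + x - 1 cycles."""
    return m * chime_cycles(n, x)


if __name__ == "__main__":
    # 3 convoys of 2 operations each on 64-element vectors
    print(program_cycles(m=3, n=64, x=2))
```

Counting in chimes (m) instead of cycles drops the x – 1 term and any start-up overhead, which is why it is only an approximation.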

Convoy Example
Assume we have 1 functional unit for each operation (load/store, add, multiply, divide)
  – We have the following VMIPS code executing on a vector of 64 doubles

        LV      V1, Rx
        MULVS.D V2, V1, F0
        LV      V3, Ry
        ADDVV.D V4, V2, V3
        SV      V4, Ry

The first LV and MULVS.D can be paired in a convoy, but not the next LV, because there is only 1 load/store unit
  – Similarly, the second LV and ADDVV.D are paired, but not the final SV
  – This gives us 3 convoys:
      1. LV, MULVS.D
      2. LV, ADDVV.D
      3. SV
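The grouping above can be reproduced with a greedy pass that starts a new convoy whenever the next instruction needs a functional unit the current convoy already uses. This is a sketch under the slide's assumptions only (one unit per operation type, dependent operations allowed to overlap); the `UNIT` table and `build_convoys` are hypothetical helpers, not part of VMIPS.

```python
# Which functional unit each opcode occupies (assumed from the slide:
# one load/store unit, one adder, one multiplier).
UNIT = {
    "LV": "load/store",
    "SV": "load/store",
    "MULVS.D": "multiply",
    "ADDVV.D": "add",
}


def build_convoys(opcodes, unit_of=UNIT):
    """Split opcodes, in program order, into convoys free of structural hazards."""
    convoys = []
    current, busy = [], set()
    for op in opcodes:
        unit = unit_of[op]
        if unit in busy:
            # Structural hazard: the unit is taken, so close this convoy.
            convoys.append(current)
            current, busy = [], set()
        current.append(op)
        busy.add(unit)
    if current:
        convoys.append(current)
    return convoys


if __name__ == "__main__":
    program = ["LV", "MULVS.D", "LV", "ADDVV.D", "SV"]
    for number, convoy in enumerate(build_convoys(program), start=1):
        print(number, convoy)
```

The number of convoys returned is the chime count for the sequence.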

Multiple Lanes
The original idea behind SIMD was to have n PEs so that n vector elements could be processed at the same time
  – We can combine the pipeline and the n PEs, in which case the parallel functional units are referred to as lanes
Without lanes, we launch 1 FP operation per cycle in our pipelined functional unit
  – With lanes, we launch n FP operations per cycle, one per lane, where elements are placed in a lane based on their index
      for instance, if we have 4 lanes, lane 0 gets all elements with index i % 4 == 0 whereas lane 1 gets all elements with i % 4 == 1
To support lanes, we need lengthy vectors
  – If our vector is 64 doubles and we have 16 lanes on a 7 cycle multiply functional unit, then we are issuing 16 operations per cycle over 4 cycles; before the first multiplies finish we are out of data, so we don’t get the full advantage of the pipelined nature of the functional units!
  – We also need multi-banked caches to permit multiple loads/stores per cycle to keep up with the lanes
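The lane assignment and the "out of data after 4 cycles" observation can be illustrated with a few lines. This is a sketch of the index arithmetic only; the function names are illustrative.

```python
def lane_of(index, lanes):
    """Lane that processes element `index` (elements are dealt round-robin)."""
    return index % lanes


def elements_in_lane(lane, vector_length, lanes):
    """All element indices handled by one lane."""
    return list(range(lane, vector_length, lanes))


def issue_cycles(vector_length, lanes):
    """Cycles needed to launch every element, one element per lane per cycle."""
    return -(-vector_length // lanes)  # ceiling division


if __name__ == "__main__":
    print(elements_in_lane(1, 16, 4))       # indices with i % 4 == 1
    print(issue_cycles(64, 16), "cycles to issue 64 elements on 16 lanes")
```

With 64 elements and 16 lanes all elements are issued in fewer cycles than the 7-cycle multiplier latency, which is the slide's point about needing long vectors.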

Handling Vectors > 64 Elements
The obvious question with respect to SIMD is what happens if our vector’s length exceeds the size of our vector register (which we will call the maximum vector length, or MVL)
  – If this is the case, then we have to issue the vector code multiple times, in a loop
      by resorting to a loop, we lose some of the advantage (the absence of branch penalties and loop mechanisms)
  – On the other hand, we cannot provide an infinitely long (or ridiculously long) vector register in the hope of satisfying all array usage
  – Strip mining is the process of generating code to handle such a loop, in which the number of loop iterations is n / MVL (rounded up), where n is the size of the program’s vector
      Note: if n / MVL leaves a remainder, then our last iteration will take place on only a partial vector
      see the discussion in the text for more detail
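Strip mining can be sketched as an outer loop over MVL-sized pieces with one shorter piece for the remainder. The version below follows the slide in placing the partial strip last (other presentations handle the odd-sized piece first); `strip_mine` and `daxpy_strip_mined` are illustrative names, and each list-comprehension strip stands in for one pass of the vector code.

```python
def strip_mine(n, mvl=64):
    """Return (start, length) pairs covering n elements in strips of at most mvl."""
    strips = []
    start = 0
    while start < n:
        length = min(mvl, n - start)   # the last strip may be partial
        strips.append((start, length))
        start += length
    return strips


def daxpy_strip_mined(a, x, y, mvl=64):
    """Compute y = a*x + y one strip at a time."""
    for start, length in strip_mine(len(x), mvl):
        end = start + length
        y[start:end] = [a * xi + yi for xi, yi in zip(x[start:end], y[start:end])]
    return y


if __name__ == "__main__":
    print(strip_mine(130, 64))   # two full strips and one partial strip
    print(daxpy_strip_mined(2.0, [1.0] * 5, [1.0] * 5, mvl=2))
```

The same ceiling division explains the earlier figure of 13 iterations for 100 elements on 8 PEs.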

Handling If Statements
As with loop unrolling, if our vector code employs if statements, we can find this a challenge to deal with
  – Consider for instance for(i=0;i<n;i++) if(x[i] != 0) x[i]=x[i] – y[i];
  – We cannot launch the subtraction down the FP adder functional unit until we know the result of the condition
In order to handle such a problem, vector processors use a vector mask register
  – The condition is applied in a pipelined unit, creating a list of 1s and 0s, one per vector element
  – This is stored in the VM (vector mask) register
  – Vector mask operations are available so that the functional unit only executes on vector elements where the corresponding mask bit is 1

        LV      V1, Rx
        LV      V2, Ry
        L.D     F0, #0
        SNEVS.D V1, F0
        SUBVV.D V1, V1, V2
        SV      V1, Rx

Notice SUBVV.D is a normal subtract instruction; we need to modify it to execute using the vector mask
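The effect of the vector mask on the loop above can be imitated in ordinary code: build the mask from the comparison, then apply the subtraction only where the mask bit is set. This is a sketch of the semantics, not of the hardware; `masked_subtract` is an illustrative name.

```python
def masked_subtract(x, y):
    """Mimic SNEVS.D followed by a masked SUBVV.D.

    Returns the mask and the updated x, where x[i] = x[i] - y[i]
    only for elements with x[i] != 0.
    """
    mask = [xi != 0 for xi in x]                       # SNEVS.D V1, F0 with F0 = 0
    result = [xi - yi if bit else xi                   # subtract under mask
              for xi, yi, bit in zip(x, y, mask)]
    return mask, result


if __name__ == "__main__":
    mask, result = masked_subtract([3.0, 0.0, 5.0], [1.0, 2.0, 5.0])
    print(mask)
    print(result)
    print(sum(mask), "elements enabled")   # what POP R1, VM would count
```

Elements whose mask bit is 0 still occupy a slot in the pipeline in a real vector unit; the mask only suppresses the write of their result.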

Memory Bank Support
We will use non-blocking caches with critical word first/early restart
  – However, that does not necessarily guarantee 1 vector element per cycle to keep up with the pipelined functional unit, because we may not have enough banks to accommodate the MVL
      the Cray T90 has 32 processors, each capable of generating up to 4 loads and 2 stores per clock cycle
      the cache has a response time of 15 ns, which spans several processor clock cycles
      to support the full performance of the processors, we need (15 ns / clock cycle time) * 32 * (4 + 2) = 1344 individual accesses in flight at once, thus 1344 banks! (so the 15 ns response time covers 1344 / (32 * 6) = 7 clock cycles)
      it actually has 1024 banks (altering the SRAM to permit pipelined accesses makes up for this)
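The bank count is a product of three factors: processors, memory references per processor per cycle, and the number of cycles a bank stays busy. In the sketch below the 7-cycle busy time is an assumption inferred from the slide's own total, 1344 / (32 * 6); `banks_needed` is an illustrative name.

```python
def banks_needed(processors, loads_per_cycle, stores_per_cycle, bank_busy_cycles):
    """Banks required so that every reference issued in any cycle finds a free bank."""
    references_per_cycle = processors * (loads_per_cycle + stores_per_cycle)
    return references_per_cycle * bank_busy_cycles


if __name__ == "__main__":
    # 32 processors, 4 loads + 2 stores each per cycle, bank busy for 7 cycles (assumed)
    print(banks_needed(32, 4, 2, 7))
```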

Strides
Notice that the vector registers store consecutive memory locations (e.g., a[i], a[i+1], a[i+2], …)
  – In some cases, code does not visit array locations in sequential order; this is especially problematic in 2-D array code
      a[i][j]=a[i][j] + b[i][k] * d[k][j]
A stride is the distance separating elements in a given operation
  – The optimal stride is 1, but for the above code we would have difficulty accessing either b[i][k] or d[k][j], depending on loop ordering, resulting in a stride as large as 100
  – The larger the stride, the less effective the vector operations may be, because multiple vector register loads will be needed cycle after cycle
      blocking (refer back to one of the compiler optimizations for cache) can be used to reduce the impact
  – To further reduce the impact, we use cache banks and also a vector load that loads vector elements based on strides rather than consecutive elements
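Where a stride such as 100 comes from is easiest to see with a 2-D array flattened in row-major order: walking along a row touches consecutive locations, while walking down a column jumps by one full row each time. The sketch below assumes row-major storage (as in C) and uses illustrative helper names; `strided_gather` plays the role of a strided vector load.

```python
def flat_index(i, j, cols):
    """Position of element [i][j] in a row-major array with `cols` columns."""
    return i * cols + j


def strided_gather(flat, start, stride, count):
    """Collect `count` elements beginning at `start`, `stride` elements apart."""
    return [flat[start + k * stride] for k in range(count)]


if __name__ == "__main__":
    rows, cols = 3, 4
    flat = list(range(rows * cols))           # a 3 x 4 array stored row by row
    print(strided_gather(flat, flat_index(1, 0, cols), 1, cols))     # row 1, stride 1
    print(strided_gather(flat, flat_index(0, 1, cols), cols, rows))  # column 1, stride 4
    # For a 100 x 100 array, stepping k in d[k][j] moves this many elements:
    print(flat_index(1, 0, 100) - flat_index(0, 0, 100))
```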

SIMD Extensions for Multimedia
When processors began to include graphics instructions, architects realized that operands do not necessarily need to be 32 bits wide
  – Graphics, for instance, often operates on several 8-bit values (one each for red, green, blue and transparency), so while a datum might be 32 bits in length, it really encodes 4 pieces of data, each of which could be operated on simultaneously within the adder
  – Additionally, sounds are typically stored as segments of 8 or 16 bit data
Thus, vector SIMD operations were incorporated into early MMX style architectures
  – This did not require additional hardware, just new instructions to take advantage of the hardware already available
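The idea of four 8-bit values sharing one 32-bit adder can be imitated in software by blocking the carry between bytes. The sketch below is a software stand-in for what a partitioned adder does in hardware, with wrap-around (not saturating) arithmetic assumed; all function names are illustrative.

```python
def pack_u8(lanes):
    """Pack four 8-bit values into one 32-bit word, first value in the top byte."""
    word = 0
    for value in lanes:
        word = (word << 8) | (value & 0xFF)
    return word


def unpack_u8(word):
    """Split a 32-bit word back into its four 8-bit values."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def packed_add_u8(a, b):
    """Add four 8-bit lanes at once; each lane wraps modulo 256 independently."""
    # Add the low 7 bits of every lane so no carry can cross a lane boundary,
    # then restore each lane's top bit with an XOR.
    low7 = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F)
    return (low7 ^ ((a ^ b) & 0x80808080)) & 0xFFFFFFFF


if __name__ == "__main__":
    pixel_a = pack_u8([1, 255, 127, 16])
    pixel_b = pack_u8([1, 1, 127, 32])
    print(unpack_u8(packed_add_u8(pixel_a, pixel_b)))
```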

Instructions
Unsigned add/subtract, maximum/minimum, average, shift right/left
  – These all allow for 32 8-bit, 16 16-bit, 8 32-bit or 4 64-bit operations
Floating point
  – 16 16-bit, 8 32-bit, 4 64-bit or 2 128-bit operations
Usually no conditional execution instructions, because there would not necessarily be a vector mask register
No sophisticated addressing modes to permit strides (or deal with sparse matrices, a topic we skipped)
The MMX extension to x86 architectures introduced hundreds of new instructions
The streaming SIMD extensions (SSE) to x86 in 1999 added 128-bit wide registers, and the advanced vector extensions (AVX) in 2010 added 256-bit registers

Example
In this example, we use a 256-bit SIMD MIPS
  – The 4D suffix implies 4 doubles per instruction
  – The 4 doubles are operated on in parallel, either because the FP unit is wide enough to accommodate 256 bits or because there are 4 parallel FP units

        L.D     F0, a
        MOV     F1, F0
        MOV     F2, F0
        MOV     F3, F0
        DADDI   R4, Rx, #512
Loop:   L.4D    F4, 0(Rx)
        MUL.4D  F4, F4, F0
        L.4D    F8, 0(Ry)
        ADD.4D  F8, F8, F4
        S.4D    F8, 0(Ry)
        DADDI   Rx, Rx, #32
        DADDI   Ry, Ry, #32
        DSUB    R20, R4, Rx
        BNEZ    R20, Loop

The 4D extension used with register F0 means that we are actually using F0, F1, F2 and F3 combined
L.4D/S.4D moves 4 array elements at a time

TLP: Multiprocessor Architectures
In chapter 3, we looked at ways to directly support threads in a processor; here we expand our view to multiple processors
  – We will differentiate among them as follows
      Multiple cores
      Multiple processors each with one core
      Multiple processors each with multiple cores
      And whether the processors/cores share memory
  – When processors share memory, they are known as tightly coupled, and they can promote two types of parallelism
      Parallel processing of multiple threads (or processes) which are collaborating on a single task
      Request-level parallelism, which has relatively independent processes running on separate processors (sometimes called multiprogramming)

Shared Memory Architecture
We commonly refer to this type of architecture as symmetric multiprocessors (SMP)
  – Tightly coupled, or shared memory
      also known as a uniform memory access multiprocessor
  – Probably only a few processors in this architecture (no more than 8, or shared memory becomes a bottleneck)
Although in the past multiprocessor computers could fall into this category, today we typically view this category as a multicore processor; true multiprocessor computers will use distributed memory instead of shared memory

Challenges
How much parallelism exists within a single program to take advantage of the multiple processors?
  – Within this challenge, we want to minimize the communication that will arise between processors (or cores), because its latency is so much higher than the latency of a typical memory access
      we wish to achieve an 80 times speedup from 100 processors; using Amdahl’s Law, compute the fraction of time the processors must be working on their own (not communicating with each other)
      Solution: 99.75% of the time (solution on page 349)
  – What is the impact of the latency of communication?
      we have 32 processors and a 200 ns communication latency which stalls the processor; if the processor’s clock rate is 3.3 GHz and the ideal CPI is 0.5, how much faster is a machine with no interprocess communication than one that spends 0.2% of the time communicating?
      3.4 times faster (solution on page 350)
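The 99.75% figure follows from Amdahl's Law, speedup = 1 / ((1 - f) + f / p), solved for the parallel fraction f. The sketch below simply rearranges that formula; the function names are illustrative.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Overall speedup when `parallel_fraction` of the work runs on all processors."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)


def required_parallel_fraction(target_speedup, processors):
    """Fraction of the work that must be parallel to reach `target_speedup`."""
    # From 1/S = (1 - f) + f/p  =>  f = (1 - 1/S) / (1 - 1/p)
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / processors)


if __name__ == "__main__":
    f = required_parallel_fraction(80, 100)
    print(f"parallel fraction needed: {f:.4%}")
    print(f"check: speedup = {amdahl_speedup(f, 100):.1f}")
```

Only about a quarter of one percent of the original work may be sequential, which is why communication has to be kept so rare.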

Cache Coherence
The most challenging aspect of a shared memory architecture is ensuring data coherence across processors
  – What happens if two processors both read the same datum? If one changes the datum, the other has a stale value; how do we alert it to update the value?
As an example, consider the following time line of events

Time | Event             | A’s cache storing X | B’s cache storing X | Memory item X
0    |                   |                     |                     | 1
1    | A reads X         | 1                   |                     | 1
2    | B reads X         | 1                   | 1                   | 1
3    | A stores 0 into X | 0                   | 1                   | 0

Cache Coherence Problem
We need our memory system to be both coherent and consistent
  – A memory system is coherent if
      a read by processor P of X that follows a write by P to X, with no writes of X by any other processor in between, always returns the value written by P
      a read by a processor of X following a write by another processor to X returns the written value, if the read and write are separated by a sufficient amount of time
      writes to the same location are serialized, so that the writes are seen by all processors in the same order
  – Consistency determines when a written value will be returned by a later read
      we will assume that a write is only complete once that write becomes available to all processors (that is, a write to a local cache does not mean a write has completed; the write must also be made to shared memory)
      if two writes take place, to X and Y, then all processors must see the two writes in the same order (X first and then Y, for instance)

Snooping Coherence Protocol
In an SMP, all of the processors have caches which are connected to a common bus
  – The snoopy cache listens to the bus for write updates
Data falls into one of these categories
  – Shared: the datum can be read by anyone and is valid
  – Modified: the datum has been modified by this processor and must be updated on all other processors
  – Invalid: the datum has been modified by another processor but not yet updated in this cache
The snooping protocol has two alternatives
  – Write-invalidate: upon a write, the other caches must mark their own copies as invalid and retrieve the updated datum before using it
      if two processors attempt to write at the same time, only one wins; the other must invalidate its write, obtain the new datum and then reperform its operation(s) on the new datum
  – Write-update: upon a write, update all other caches at the same time by broadcasting the new datum
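The write-invalidate alternative can be sketched as a tiny simulation of one datum X held in several caches on a shared bus, using the three states named above. This is a simplified model under stated assumptions (a single datum, write-back caches, no bus arbitration or timing); the `SnoopyBus` class is hypothetical.

```python
class SnoopyBus:
    """One shared datum X, cached by several processors, write-invalidate policy."""

    def __init__(self, n_caches, memory_value=0):
        self.memory = memory_value
        self.state = ["I"] * n_caches      # per cache: "M", "S" or "I"
        self.value = [None] * n_caches     # cached copy of X, if any

    def read(self, p):
        """Processor p reads X; a miss is serviced over the bus."""
        if self.state[p] == "I":
            # A cache holding X as Modified snoops the miss and supplies the
            # datum, writing it back and dropping to Shared.
            for q, st in enumerate(self.state):
                if st == "M":
                    self.memory = self.value[q]
                    self.state[q] = "S"
            self.value[p] = self.memory
            self.state[p] = "S"
        return self.value[p]

    def write(self, p, new_value):
        """Processor p writes X; every other copy is invalidated."""
        for q in range(len(self.state)):
            if q != p:
                if self.state[q] == "M":   # flush a dirty copy before discarding it
                    self.memory = self.value[q]
                self.state[q] = "I"
                self.value[q] = None
        self.value[p] = new_value
        self.state[p] = "M"                # memory is now stale until write-back


if __name__ == "__main__":
    bus = SnoopyBus(n_caches=2, memory_value=0)
    A, B = 0, 1
    bus.read(A)
    bus.read(B)
    bus.write(A, 1)
    print(bus.state, bus.value, bus.memory)   # B invalidated, memory stale
    bus.read(B)
    print(bus.state, bus.value, bus.memory)   # B refetches the new value
```

A write-update variant would replace the invalidation loop with a broadcast of `new_value` to every cache holding X.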

Processor  | Bus                     | A’s cache | B’s cache               | Memory
A reads X  | Cache miss              |           |                         |
B reads X  | Cache miss              | 0         | 0 (from A, not memory)  | 0
A writes X | Invalidate X (or ----)  |           |                         |
B reads X  | Cache miss              | 1         | 1                       | 1

Extensions to Protocol
MESI adds a state called Exclusive
  – If a datum is exclusive to the cache, it can be written without generating an invalidate message on the bus
  – If a read miss occurs to a datum that is exclusive to a cache, then the cache must intercept the miss, send the datum to the requesting cache and change the state to S (shared)
MOESI adds a state called Owned
  – In this case, the cache owns the datum AND the datum is out of date in memory (it hasn’t been written back yet)
  – This cache MUST respond to any requests for the datum, since memory is out of date
  – The advantage is that if a modified block is known to be exclusive, it can be changed to Owned to avoid writing back to memory at this time

A Variation of the SMP
As before, each processor has its own L1 and L2 caches
  – snooping must occur at the interconnection network in order to modify the L1/L2 caches
A shared L3 cache is banked to improve performance
The shared memory level is the backup to L3 as usual and is also banked

Performance for Shared Memory
Here, we concentrate just on memory accesses of a multicore processor with a snoopy protocol (not the performance of the processors themselves)
  – Overall cache performance is a combination of
      miss rate as derived from compulsory, conflict and capacity misses
      traffic from communication, including invalidations and the cache misses that follow invalidations; these are referred to as coherence misses and are either true sharing misses (the word being accessed was itself written by another processor) or false sharing misses (a different word in the same block was written)
Example
  – Assume that x1 and x2 are in the same cache block and are shared by P1 and P2; indicate the true and false sharing misses below:
      P1 writes x1: true (P1 must send out an invalidate signal)
      P2 reads x2: false (block was invalidated)
      P1 writes x1: false (block marked as shared because of P2’s read of x2)
      P2 writes x2: false (block marked as shared with P1)
      P1 reads x2: true (need new value from P2)

Commercial Workloads
To demonstrate the performance of the snoopy cache protocol on an SMP, we look at a study done on the DEC ALPHA from 1998
  – 4 processors, each issuing up to 4 instr/clock cycle, with 3 levels of cache
      L1: 8 KB/8 KB instr/data cache, direct-mapped, 32 byte blocks, 7 cycle miss penalty
      L2: 96 KB, 3 way set assoc, 32 byte block, 21 cycles
      L3: 2 MB, direct mapped, 64 byte block, 80 cycle miss
  – As a point of comparison, the Intel i7 has these three caches
      L1: 32 KB/32 KB instr/data cache, 4 way/8 way, 64 byte blocks, 10 cycle miss penalty
      L2: 256 KB, 8 way set assoc, 64 byte block, 35 cycles
      L3: 2 MB (per core), 16 way, 64 byte block, 100 cycle miss
The study looks at 3 benchmarks:
  – OLTP: user mode 71%, kernel time 18%, idle 11%
  – DSS: 87%, 4%, 9%
  – AltaVista (search engine): 98%, <1%, <1%

Distributed Shared Memory
The tightly coupled (shared memory) multiprocessor is useful for promoting parallelism within tasks (whether 1 process, a group of threads, or related processes)
However, when processes generally will not communicate with each other, there is little need to force the architect to build a shared memory system
  – The loosely coupled, or distributed memory, system is generally easier to construct and possibly cheaper
      in fact, any network of computers can be thought of as a loosely coupled multiprocessor
  – Any multicore multiprocessor will be of this configuration

DSM Architecture
Here, each multicore MP is an SMP as per our previous slides
Connecting the processors together is an interconnection network (ICN)
  – An example ICN is shown to the right; there are many topologies, including nearest neighbors of 1-D, 2-D, 3-D and hypercubes

Directory-based Protocol
The snoopy protocol requires that caches broadcast invalidates to other caches
For a DSM, this is not practical because of the lengthy latencies in communication
  – Further, there is no central bus that all processors are listening to for such messages (the ICN is at a lower level of the hierarchy, past the caches but possibly before a shared memory)
So the DSM requires a different form of coherence handling, and we turn to the directory-based protocol
  – We keep track of every block that may be cached in a central repository called a directory
  – This directory maintains information for each block:
      in which caches it is stored
      whether it is dirty
      who currently “owns” the block

The Basics of the Protocol
Cache blocks will have one of three states
  – Shared: one or more nodes currently have the block that contains the datum, and the value is up to date in all caches and main memory
  – Uncached: no node currently has the datum, only memory
  – Modified: the datum has been modified by one node, called the owner
      for a node to modify a datum, it must be the only node to store the datum, so this permits exclusivity
      if a node intends to modify a shared datum, it must first seek to own the datum from the other caches; this allows a node to modify a datum without concern that the datum is being or has been modified by another node in the time it takes to communicate
      once modified, the datum in memory (and any other cache) is invalid, or dirty

The Directory(ies)
The idea is to have a single directory which is responsible for keeping track of every block
  – But it is impractical to use a single directory because such an approach is not scalable
Therefore, the directory must be distributed
  – Refer back to the figure 3 slides ago; we enhance this by adding a directory to each MP
      each MP now has its multicores and L1/L2 caches, a shared L3 cache, I/O, and a directory
  – The local directory consists of 1 entry per block in the caches (assuming we are dealing with multicore processors and not collections of processors)
      we differentiate between the local node (the one making a request), the home node (the node storing or owning the datum) and a remote node (a node that has requested an item from the owner, or a node that requires invalidation once the owner has modified the datum)

Protocol Messages

Type             | Source    | Dest      | Content | Function
Read miss        | Local     | Directory | P, A    | P has a read miss at A, requests data to make P a sharer
Write miss       | Local     | Directory | P, A    | P has a write miss at A, requests data and makes P owner
Invalidate       | Local     | Directory | A       | Invalidate all remote caches for A
Invalidate       | Directory | Remote    | A       | Invalidate a shared copy of A
Fetch            | Directory | Remote    | A       | Fetch block A from remote cache, send to home directory and change A to shared
Fetch/Invalidate | Directory | Remote    | A       | Fetch block A from remote cache, send to home directory and change remote cache’s A to invalid
Data value reply | Directory | Local     | D       | Return datum from home
Data write-back  | Remote    | Directory | A, D    | Write A back
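How a home directory might react to the two miss messages can be sketched with the three block states (Uncached, Shared, Modified) and a sharer set per block. This is a simplified illustration, not the full protocol: it models only the directory's bookkeeping and the messages it would send, and the `Directory` class and its method names are hypothetical.

```python
class Directory:
    """Home-node directory: tracks state and sharers for each block address."""

    def __init__(self):
        self.state = {}     # address -> "Uncached", "Shared" or "Modified"
        self.sharers = {}   # address -> set of node names holding the block

    def read_miss(self, p, a):
        """Node p has a read miss at address a; returns (message, destination, address) tuples."""
        messages = []
        holders = self.sharers.setdefault(a, set())
        if self.state.get(a, "Uncached") == "Modified":
            # The owner must supply the block and keeps a shared copy.
            for owner in sorted(holders - {p}):
                messages.append(("Fetch", owner, a))
        holders.add(p)
        self.state[a] = "Shared"
        messages.append(("Data value reply", p, a))
        return messages

    def write_miss(self, p, a):
        """Node p has a write miss at address a and becomes the owner."""
        messages = []
        others = sorted(self.sharers.get(a, set()) - {p})
        if self.state.get(a, "Uncached") == "Modified":
            for owner in others:
                messages.append(("Fetch/Invalidate", owner, a))
        else:
            for node in others:
                messages.append(("Invalidate", node, a))
        self.sharers[a] = {p}
        self.state[a] = "Modified"
        messages.append(("Data value reply", p, a))
        return messages


if __name__ == "__main__":
    home = Directory()
    print(home.read_miss("P1", "A"))
    print(home.read_miss("P2", "A"))
    print(home.write_miss("P1", "A"))   # P2's copy is invalidated
    print(home.read_miss("P2", "A"))    # block fetched back from owner P1
```

The sharer set is what replaces the bus broadcast: the directory sends invalidates only to the nodes it knows hold the block.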

Example Protocol