Designing Memory Systems for Tiled Architectures. Anshuman Gupta, September 18, 2009.


1 Designing Memory Systems for Tiled Architectures
Anshuman Gupta
September 18, 2009

2 Multi-core Processors are abundant
Multi-cores increase the compute resources on the chip without increasing hardware complexity, keeping power consumption within budget.
Examples: AMD Phenom (4-core), Sun Niagara 2 (8-core), Tile64 (64-core), Intel Polaris (80-core)

3 Multi-Core Processors are underutilized
Single-thread code:
b = a + 4 … (0)
c = b * 8 … (1)
d = c - 2 … (2)
e = b * b … (3)
f = e * 3 … (4)
g = f + d … (5)
[Figure: serial vs. parallel execution schedules for instructions 0-5]
Software gets the responsibility of utilizing the cores with parallel instruction streams, and applications are hard to parallelize.
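The utilization gap on this slide can be made concrete with a tiny scheduling sketch. This is an illustrative model, not from the talk: every instruction takes 1 cycle, cores are unlimited, and communication is free; `deps` lists each instruction's producers from the code above.

```python
# Hypothetical ASAP-scheduling sketch of the slide's six instructions,
# assuming 1-cycle ops and free inter-core communication.
deps = {0: [], 1: [0], 2: [1], 3: [0], 4: [3], 5: [2, 4]}

def schedule_length(deps, serial=False):
    """Cycle count: serial issues one instruction per cycle;
    parallel issues each as soon as all of its producers finish."""
    if serial:
        return len(deps)
    finish = {}
    for i in sorted(deps):  # instructions are numbered in dependence order
        finish[i] = 1 + max((finish[d] for d in deps[i]), default=0)
    return max(finish.values())

print(schedule_length(deps, serial=True))  # 6 cycles on one core
print(schedule_length(deps))               # 4 cycles with parallelism
```

If communication between cores were made expensive rather than free, the parallel schedule would degrade, which is exactly the gap the following slides address.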

4 Tiled Architectures increase Utilization by enabling Parallelization
Tiled architectures are a class of multi-core architectures
They provide mechanisms to facilitate automatic parallelization of single-threaded programs
Fast On-Chip Networks (OCNs) connect the cores; the OCN communication latencies are on the order of 2 + (distance between tiles) cycles*
*Latency for the RAW inter-ALU OCN
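As a quick check of the quoted latency figure, here is the 2 + distance model in code; the grid coordinates and the Manhattan-distance assumption are mine, not from the slide.

```python
# Sketch of the quoted RAW inter-ALU OCN latency model:
# roughly 2 + (distance between tiles) cycles, taking distance as
# Manhattan distance on a 2D tile grid (coordinates are illustrative).
def ocn_latency(src, dst):
    distance = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return 2 + distance

print(ocn_latency((0, 0), (0, 1)))  # neighboring tiles: 3 cycles
print(ocn_latency((0, 0), (3, 3)))  # far corner of a 4x4 grid: 8 cycles
```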

5 Automatic Parallelization on Tiled Architectures
Single-thread code (instructions 0-5, as before):
b = a + 4 … (0)
c = b * 8 … (1)
d = c - 2 … (2)
e = b * b … (3)
f = e * 3 … (4)
g = f + d … (5)
[Figure: execution schedules on conventional multi-cores vs. a tiled architecture]
Dependent instructions can be placed on multiple cores with low penalty in tiled architectures, due to cheap inter-ALU communication.

6 Why aren't tiled architectures used everywhere?
What if we add some memory instructions?
(*b) = a + 4 … (0)
c = (*b) * 8 … (1)
(*d) = c - 2 … (2)
e = (*h) * 4 … (3)
f = e * 3 … (4)
g = f + (*i) … (5)
[Figure: execution schedules on multi-cores vs. a tiled architecture, showing stalls on memory dependencies]
Automatic parallelization is still very difficult due to slow resolution of remote memory dependencies. Tiled architecture memory systems therefore have a special requirement: fast memory dependence resolution.

7 Outline
Motivation
Preserving Memory Ordering
Memory Ordering in Existing Work
Analysis of Existing Work
Future Work and Conclusion

8 Memory Dependence
foo (int * a, int * b) {
  *a = …
  … = *b
}
Whether the store *a and the load *b depend on each other is classified by static analysis:
◦ No analysis - dependence unknown (e.g., a = 0x1000, b = 0x2000 at runtime, but the compiler cannot tell)
◦ Must - a true dependence (a = b = 0x1000)
◦ May - either a true dependence (a = b = 0x1000) or a false one (a = 0x1000, b = 0x2000), decided only at runtime
Static placement of the two operations must respect this classification.
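The No/Must/May classification above can be sketched as follows; representing each pointer by the set of addresses static analysis thinks it may reference is my modeling choice, not something the slide specifies.

```python
# Hypothetical classifier mirroring the slide's table for a store *a
# followed by a load *b.
def classify(a_addrs, b_addrs):
    """a_addrs / b_addrs: sets of addresses each pointer may reference,
    or None when static analysis learned nothing."""
    if a_addrs is None or b_addrs is None:
        return "no-analysis"        # must conservatively assume a dependence
    if len(a_addrs) == len(b_addrs) == 1 and a_addrs == b_addrs:
        return "must"               # definitely a true dependence
    if a_addrs & b_addrs:
        return "may"                # true or false, unknown until runtime
    return "none"                   # provably independent

print(classify({0x1000}, {0x1000}))          # must
print(classify({0x1000}, {0x1000, 0x2000}))  # may
print(classify({0x1000}, {0x2000}))          # none
print(classify(None, {0x2000}))              # no-analysis
```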

9 Memory Coherence
A coherent space provides the abstraction of a single data buffer with a single read/write port
Hierarchical implementations of shared memory ◦ require coherence protocols to provide the same abstraction
[Figure: Core 0 writes A = 1; without coherence Core 1 reads a stale A = 0 from its cache, while a single shared buffer returns A = 1. A dependence signal orders the write before the read.]
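Why private caches need a coherence protocol can be shown with a toy two-core model (write-through with invalidate-on-write; this is an illustration, not a real protocol): without invalidation, Core 1 keeps reading its stale copy of A.

```python
# Toy illustration of the slide's scenario: core 1 reads a stale cached
# copy of A unless core 0's write invalidates it.
class System:
    def __init__(self, coherent):
        self.memory = {"A": 0}
        self.caches = [{}, {}]        # one private cache per core
        self.coherent = coherent

    def write(self, core, addr, value):
        self.caches[core][addr] = value
        self.memory[addr] = value     # write-through for simplicity
        if self.coherent:             # invalidate every other copy
            for c, cache in enumerate(self.caches):
                if c != core:
                    cache.pop(addr, None)

    def read(self, core, addr):
        if addr not in self.caches[core]:
            self.caches[core][addr] = self.memory[addr]  # cache fill
        return self.caches[core][addr]

for coherent in (False, True):
    s = System(coherent)
    s.read(1, "A")             # core 1 caches A = 0
    s.write(0, "A", 1)         # core 0 writes A = 1
    print(coherent, s.read(1, "A"))  # stale 0 without coherence, else 1
```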

10 Improving Memory Dependence Resolution
Memory dependence resolution performance depends on:
◦ True dependence performance
◦ False dependence performance
◦ Coherence system performance

11 True Dependence Resolution
Delay 1 - determined by the signaling stage ◦ earlier is better
Delay 2 - determined by the signaling delay inside the ordering mechanism ◦ faster is better
Delay 3 - determined by the stalling stage ◦ later is better
Delays 1 and 3 are determined by the resolution model
[Figure: signal travels from the source's signal stage to the destination's stall stage; delays 1-3 annotate the path]
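The interplay of the three delays can be summarized in one formula: the destination loses cycles only if the signal (sent at the signaling stage, plus the delay through the ordering mechanism and network) arrives after the stalling stage. The stage times in cycles below are illustrative, not from the talk.

```python
# Sketch of the slide's rule of thumb: signal earlier (delay 1),
# deliver faster (delay 2), stall later (delay 3).
def exposed_delay(signal_stage, network_delay, stall_stage):
    arrival = signal_stage + network_delay
    return max(0, arrival - stall_stage)   # cycles the destination loses

print(exposed_delay(signal_stage=2, network_delay=4, stall_stage=8))  # 0: fully hidden
print(exposed_delay(signal_stage=5, network_delay=4, stall_stage=3))  # 6 cycles lost
```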

12 False Dependence Resolution
False dependencies occur when
◦ static analysis cannot disambiguate, or
◦ the memory dependence encoding is not partial
For false dependencies, the dependent instruction should ideally not wait for any signal:
◦ Runtime disambiguation - an address comparison done in hardware declares the dependent instruction free
◦ Speculation - the dependent instruction is issued speculatively, assuming the dependence is false
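A minimal sketch of the two tactics, with hypothetical helper names of my own: disambiguation compares addresses before deciding whether to stall, while speculation issues first and squashes on an address match.

```python
# Illustrative models of the two false-dependence tactics.
def must_wait(store_addr, load_addr):
    """Runtime disambiguation: stall only when the addresses match."""
    return store_addr == load_addr

def speculate(store_addr, load_addr):
    """Speculation: issue immediately; report whether a squash is needed."""
    issued = True
    squash = store_addr == load_addr
    return issued, squash

print(must_wait(0x1000, 0x2000))     # False: a false dependence, no stall
print(speculate(0x1000, 0x1000))     # (True, True): mis-speculation, squash
```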

13 Fast Data Access
Local L1 caches can help decrease average latencies ◦ no network delays
Cache Coherence (CC) ◦ a dynamic access is one whose data location is not known statically ◦ dynamic accesses are expensive in the absence of CC

14 What features to look out for?

15 Outline
Motivation
Preserving Memory Ordering
Memory Ordering in Existing Work ◦ RAW ◦ WaveScalar ◦ EDGE
Analysis of Existing Work
Future Work and Conclusion

16 RAW
A highly static tiled architecture
Array of simple in-order MIPS cores
Scalar Operand Network (SON) for fast inter-ALU communication
Shared address space, local caches, and shared DRAMs
No cache coherence mechanism; software cache management through flush and invalidation
*Taylor et al., IEEE Micro 2002

17 Artifacts of Software Cache Management
It is difficult to keep track of the most up-to-date version of a memory address
All memory accesses can be categorized as:
◦ Static access - the location of the cache line is known statically
◦ Dynamic access - a runtime lookup is required to determine the location of the cache line; these are really expensive (36 vs. 7)
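Taking the slide's 7 and 36 as the per-access costs of static and dynamic accesses (the slide does not name the units; cycles seems likely), even a small fraction of dynamic accesses dominates the average:

```python
# Back-of-envelope mix of static and dynamic access costs on RAW,
# using the slide's figures (static ~7, dynamic ~36).
def avg_latency(dynamic_fraction, static_cost=7, dynamic_cost=36):
    return (1 - dynamic_fraction) * static_cost + dynamic_fraction * dynamic_cost

print(avg_latency(0.0))   # 7.0
print(avg_latency(0.1))   # ~9.9: 10% dynamic accesses add ~40% latency
```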

18 Static-Dynamic Access Ordering
Between two static accesses ◦ synchronization over the SON
Between a static and a dynamic access ◦ synchronization over the SON between the static access and the static requestor or receiver of the dynamic access
Execute-side resolution, with no speculative runahead: false dependencies are as expensive as true dependencies

19 Summary

20 Dynamic Access Ordering
Execute-side resolution is very expensive, so resolution is done late, in the memory system
Static ordering point ◦ a turnstile tile, one per equivalence class ◦ an equivalence class is the set of all memory operations that can access the same memory address
Requests are sent over the static SON to the turnstile ◦ it receives them in memory order and forwards them over in-order dynamic network channels
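Functionally, the turnstile scheme reduces to a FIFO per equivalence class. This toy model (the class names are mine) shows one class's requests forwarded in arrival order while a provably disjoint class is unaffected.

```python
# Toy model of RAW's turnstile tiles: one ordering point per
# equivalence class, each forwarding its requests in arrival order.
class OrderingPoint:
    def __init__(self):
        self.turnstiles = {}

    def send(self, eq_class, request):       # arrives in memory order
        self.turnstiles.setdefault(eq_class, []).append(request)

    def drain(self, eq_class):               # forwarded in the same order
        out = self.turnstiles.get(eq_class, [])
        self.turnstiles[eq_class] = []
        return out

op = OrderingPoint()
op.send("class-A", "st X")    # operations that may touch address X
op.send("class-A", "ld X")
op.send("class-B", "ld Y")    # a disjoint class needs no ordering vs. A
print(op.drain("class-A"))    # ['st X', 'ld X']: memory order preserved
```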

21 Summary

22 Outline
Motivation
Preserving Memory Ordering
Memory Ordering in Existing Work ◦ RAW ◦ WaveScalar ◦ EDGE
Analysis of Existing Work
Future Work and Conclusion

23 WaveScalar
A fully dynamic tiled architecture with memory ordering
Clusters arranged in a 2D array connected by a dynamic mesh network
Each tile has a store buffer and a banked data cache
Secondary memory system made up of L2 caches around the tiles
Cache coherence
*Swanson et al., MICRO 2003

24 Memory Ordering
WaveScalar preserves memory ordering by giving each memory operation in a wave a sequence number ◦ unique ◦ indicates age
Each memory operation also stores its predecessor's and successor's sequence numbers ◦ "?" is used if one is not known at compile time
There cannot be a memory operation whose possible predecessor has its successor marked "?", and vice versa ◦ MEM-NOPs are inserted to break such chains
A request is allowed to go ahead once its predecessor has issued
In hardware this ordering is managed in the store buffers ◦ a single store buffer handles all memory requests for a dynamic wave
[Figure: a Load A, Store B, Load C sequence queued in a store buffer, with a NOP inserted to complete the chain]
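The issue rule can be sketched as follows, with the "?" links and MEM-NOPs omitted for brevity: each operation carries (its sequence number, its predecessor's sequence number), and an operation issues once that predecessor has issued, regardless of arrival order.

```python
# Minimal sketch of WaveScalar-style ordering in a store buffer.
def issue_order(ops):
    """ops: list of (seq, pred_seq) in arrival order; pred_seq is None
    for the first op of the wave. Returns the order in which ops issue."""
    issued, waiting, order = set(), list(ops), []
    progress = True
    while waiting and progress:
        progress = False
        for op in list(waiting):
            seq, pred = op
            if pred is None or pred in issued:
                issued.add(seq)
                order.append(seq)
                waiting.remove(op)
                progress = True
    return order

# Ops arrive out of order (2 before 1) but issue in sequence order.
print(issue_order([(0, None), (2, 1), (1, 0)]))  # [0, 1, 2]
```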

25 Removing False Load Dependencies
Sequence-number-based ordering is highly restrictive ◦ loads are stalled on previous loads
Each memory operation carries a ripple number: the sequence number of the last store before it
A memory operation can issue once the operation with its ripple number has issued ◦ loads can issue out of order
Stores still have a total ordering
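One plausible reading of the ripple rule in code (the exact store-issue condition is simplified here and is my assumption, not the slide's): loads wait only on the last store before them, while stores conservatively wait on everything earlier, which preserves their total order.

```python
# Sketch of the ripple relaxation: independent loads go out of order,
# stores stay ordered. Sequence numbers are assumed dense from 0.
def can_issue(op, issued):
    seq, kind, ripple = op
    if kind == "store":                          # stores keep total order:
        return all(s in issued for s in range(seq))
    return ripple is None or ripple in issued    # loads wait on last store only

issued = {0}                                     # store 0 has issued
loads = [(1, "load", 0), (2, "load", 0)]         # both ripple on store 0
store3 = (3, "store", None)
print([op[0] for op in loads if can_issue(op, issued)])  # [1, 2]: free, any order
print(can_issue(store3, issued))                 # False: waits for 1 and 2
```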

26 Summary

27 Outline
Motivation
Preserving Memory Ordering
Memory Ordering in Existing Work ◦ RAW ◦ WaveScalar ◦ EDGE
Analysis of Existing Work
Future Work and Conclusion

28 EDGE
A partially dynamic tiled architecture with block execution
Array of tiles connected over fast OCNs
Primary memory system distributed over the tiles ◦ each such tile has an address-interleaved data cache and a load-store queue
Distributed secondary memory system
Cache coherence
*S. Sethumadhavan et al., ICCD '06

29 Memory Ordering
A unique 5-bit tag called the LSID is used for ◦ detecting completion of block execution ◦ ordering of memory operations
Data tiles (DTs) get a list of all LSIDs in a block during the fetch stage
When a memory operation reaches a DT, its LSID is sent to all the DTs
A request is issued once all requests with earlier LSIDs have completed ◦ memory-side dependence resolution
When all memory operations have completed, the block is committed
[Figure: a control tile, execution tiles, and interleaved data tiles ordering Ld A, Ld B, St C, Ld C by LSIDs <0,1,2,3>]
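The LSID issue check at a data tile reduces to: every earlier LSID in the block's list has completed. A sketch with illustrative LSIDs:

```python
# Sketch of LSID-based issue at an EDGE data tile.
def ready(lsid, completed, block_lsids):
    """A request may issue once all earlier-LSID requests completed."""
    earlier = [l for l in block_lsids if l < lsid]
    return all(l in completed for l in earlier)

block = [0, 1, 2, 3]          # LSIDs announced to the DTs at fetch
completed = {0, 1}
print(ready(2, completed, block))  # True: 0 and 1 are done
print(ready(3, completed, block))  # False: 2 is still outstanding
# The block commits once all of its memory operations have completed.
print(set(block) <= completed)     # False
```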

30 Dependence Speculation
EDGE memory ordering is very restrictive ◦ total memory order
Loads execute speculatively ◦ an earlier store to the same address causes a squash ◦ a predictor is used to reduce squashes
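A toy event trace of this speculation (the function and event encoding are mine): a load issues early, and an earlier-LSID store to the same address that arrives later squashes it.

```python
# Illustrative EDGE-style dependence speculation with squash detection.
def run(events):
    """events: (kind, lsid, addr) tuples in arrival order.
    Returns the LSIDs of loads that must be squashed."""
    issued_loads = []          # (lsid, addr) of speculatively issued loads
    squashed = []
    for kind, lsid, addr in events:
        if kind == "load":
            issued_loads.append((lsid, addr))    # issue speculatively
        else:  # store: squash any younger load already issued to this addr
            for l, a in issued_loads:
                if l > lsid and a == addr:
                    squashed.append(l)
    return squashed

print(run([("load", 2, "X"), ("store", 1, "X")]))  # [2]: mis-speculated load
print(run([("load", 2, "Y"), ("store", 1, "X")]))  # []: speculation paid off
```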

31 Summary

32 Outline
Motivation
Preserving Memory Ordering
Memory Ordering in Existing Work
Analysis of Existing Work
Future Work and Conclusion

33 True Dependence Optimization

34 Memory-Side Resolution allows more Overlap
[Figure: request/response timelines between Requestor A, Requestor B, and the home node for RAWsd, EDGE/WaveScalar (tag buffer), and RAWdd (turnstile), showing how the coherence delays of A and B overlap under memory-side resolution]
*The lengths of the bars do not indicate delays

35 Network Stalls should be avoided
Execute-side resolution (e) ◦ RAWsd
Memory-side resolution (m) ◦ EDGE, WaveScalar
RAW dynamic ordering (m_t) ◦ the network delay to the memory system is overlapped
[Figure: pipeline-stage timelines comparing where each scheme (e, m, m_t) stalls]

36 False Dependence Optimization
Partial ordering reduces false dependencies
Speculation on false dependencies reduces stalls
Disambiguation should be done early

37 Outline Motivation Preserving Memory Ordering Memory Ordering in Existing Work Analysis of Existing Work Future Work and Conclusion 37

38 What's a Good Tiled Architecture Memory System?
Local caches for fast L1 hits
Cache coherence support for ease of programmability and no dynamic-access delays
Fast true dependence resolution ◦ performance comparable to same-core placement of operations ◦ late stalls ◦ early signaling
Reduction of false dependencies through partial memory-operation ordering
Fast false dependence resolution ◦ performance comparable to same-core placement of operations ◦ early runtime memory disambiguation ◦ speculative memory requests

39 Conclusion
Auto-parallelization on tiled architectures can benefit from fast memory dependence resolution ◦ multi-core memory systems were not designed with this goal
The performance of both true and false dependence resolution should be comparable to that of dependent memory instructions placed on the same core
The ISA should support partial memory-operation ordering to avoid artificial false dependencies
The memory system should have local caches and cache coherence for performance and programmability
Thank You! Questions?

40 Dynamic Accesses are expensive
X looks up a global address list and sends a dynamic request to owner Y
Y is interrupted, the data is fetched, and a dynamic request is sent to Z
Z is interrupted, and the data is stored in the local cache
One table lookup, two interrupt handlers, and two dynamic requests make dynamic loads expensive
(In the accompanying figure, lifted portions represent processor occupancy, while unlifted portions represent network latency)

