Download presentation
Presentation is loading. Please wait.
Published byBryan Cross Modified over 8 years ago
1
ISCA2000 Norman Margolus MIT/BU SPACERAM: An embedded DRAM architecture for large-scale spatial lattice computations
2
ISCA2000 Overview The Task: Large-scale brute force computations with a crystalline-lattice structure The Opportunity: Exploit row- at-a-time access in DRAM to achieve speed and size. The Challenge: Dealing efficiently with memory granularity, commun, proc The Trick: Data movement by layout and addressing 4Mbits/DRAM, 256 bit I/O 20 @200 MHz 1 Tbit/sec, 10 MB, 130 mm 2 (.18µm) and 8 W (mem)
3
ISCA2000 1. Crystalline lattice computations 2. Mapping a lattice into hardware
4
ISCA2000 Crystalline lattice computations Large scale, with spatial regularity, as in finite difference 1 Tbit/sec ≈ 10 10 32-bit mult- accum / sec (sustained) Many image processing and rendering algorithms have spatial regularity.
5
ISCA2000 Brute-force physical simulations (complex) Bit-mapped 3D games and virtual reality Logic simulation (physical wavefront) Pyramid, multigrid and multiresolution computations Symbolic lattice computations
6
ISCA2000 Site update rate / processor chip
7
ISCA2000 Site update rate / processor chip
8
ISCA2000 1. Crystalline lattice computations 2. Mapping a lattice into hardware
9
ISCA2000 Mapping a lattice into hardware Divide the lattice up evenly among mesh array of chips Each chip handles an equal sized sector of the emulated lattice
10
ISCA2000 Mapping a lattice into hardware
11
ISCA2000 Mapping a lattice into hardware Use blocks of DRAM to hold the lattice data (Tbits/sec) Use virtual processors for large comps (depth-first, wavefront, skew) Main challenge: mapping lattice computations onto granular memory
12
ISCA2000 Lattice computation model 1D example: the rows are bit-fields and columns sites Shift bit-fields periodically and uniformly (may be large) Process sites independently and identically (SIMD) A B A B
13
ISCA2000 Mapping the lattice into memory A B A B A B A B
14
ISCA2000 Mapping the lattice into memory CAM-8 approach: group adjacent into memory words Shifts can be long but re- aligning messy chaining Can process word-wide (CAM-8 did site-at-a-time) A B A B
15
ISCA2000 Mapping the lattice into memory CAM-8: Adjacent bits of bit-field are stored together =+++=+++
16
ISCA2000 Mapping the lattice into memory SPACERAM: bits stored as evenly spread skip samples. =+++=+++
17
ISCA2000 Processing skip-samples A[0] B[0] A[1] B[1] A[2] B[2] A[3] B[3] Process sets of evenly spaced lattice sites (site groups) Data movement is done by reading shifted data
18
ISCA2000 Processing skip-samples A[0] B[0] A[1] B[1] A[2] B[2] A[3] B[3] A[0] B[0]
19
ISCA2000 Processing skip-samples A[0] B[0] A[1] B[1] A[2] B[2] A[3] B[3] A[0] B[0] A[1] B[1]
20
ISCA2000 Processing skip-samples A[0] B[0] A[1] B[1] A[2] B[2] A[3] B[3] A[0] B[0] A[1] B[1] A[2] B[2]
21
ISCA2000 Processing skip-samples A[0] B[0] A[1] B[1] A[2] B[2] A[3] B[3] A[0] B[0] A[1] B[1] A[2] B[2] A[3] B[3]
22
ISCA2000 Processing skip-samples
23
ISCA2000 Processing skip-samples A[0] B[0] A[1] B[1] A[2] B[2] A[3] B[3] A[2] B[0]
24
ISCA2000 Processing skip-samples A[0] B[0] A[1] B[1] A[2] B[2] A[3] B[3] A[2] B[0] A[3] B[1]
25
ISCA2000 Processing skip-samples A[0] B[0] A[1] B[1] A[2] B[2] A[3] B[3] A[2] B[0] A[3] B[1] A[0] B[2]
26
ISCA2000 Processing skip-samples A[0] B[0] A[1] B[1] A[2] B[2] A[3] B[3] A[2] B[0] A[3] B[1] A[0] B[2] A[1] B[3]
27
ISCA2000 Processing skip-samples A[0] B[0] A[1] B[1] A[2] B[2] A[3] B[3] A[2] B[0] A[3] B[1] A[0] B[2] A[1] B[3]
28
ISCA2000 Processing skip-samples
29
ISCA2000 DRAM module Assume hardware word size is 64-bits Assume we can address any word For each site group: address next word we need; rot if necessary Each bit field resides in a different module DRAM block (64 bit words) Barrel Rotator 64 address shift amt to processor
30
ISCA2000 Problem: granularity DRAM block has structure: eg., all words in a row must be used before switching rows Solution: grouping + ( + ) = ( + )
31
ISCA2000 DRAM module 3 levels of granularity: 2K-bit row, 256-bit word, 64-bit word All handled by data layout, addressing, and rot within 64-word DRAM block 2K x 8 x 256 Barrel Rotator 64 row addr shift amt to processor 256 64 sub-addr 64 x 4:64 col addr
32
ISCA2000 Gluing chips together All chips operate in lockstep Shift is periodic within each sector To glue sectors, bits that stick out replace corresponding bits in the next sector
33
ISCA2000 DRAM module Any data that wraps around is transmitted over the mesh Corresponding bit on corresponding wire is substituted for it This is done sequentially in each of the three directions DRAM block 2K x 8 x 256 Barrel Rotator 64 shift amt to processor 256 64 sub-addr 64 x 4:64 Mesh xyz Subst 64 wrap info row addr col addr
34
ISCA2000 Partitioning a 2D sector
35
ISCA2000 Partitioning a 2D sector
36
ISCA2000 The SPACERAM chip.18 µm CMOS DRAM proc. (20W, 215 mm 2 ) 10% of memory bandwidth control Memory hierarchy (external memory) Single DRAM address space
37
ISCA2000 The SPACERAM chip module #00 module #01 module #02 module #03 module #19 PE0PE63 … … … … … …
38
ISCA2000 The SPACERAM chip Module #19 2K rows x 2K cols PE0PE1PE63 /20 /2K32//2K32//2K32/...
39
ISCA2000 SPACERAM: symbolic PE Computation by LUT Permuter lets any DRAM wire play any role MUX lets LUT output be used conditionally LUT is really a MUX, with data bussed
40
ISCA2000 SPACERAM: symbolic PE
41
ISCA2000 Conclusions http://www.im.lcs.mit.edu/nhm/isca.pdf. 18µm CMOS, 10 Mbytes DRAM, 1 Tbit/sec to mem, 215 mm 2, 20W Addressing is powerful Simple hardware provides a general mechanism for lattice data movement We can exploit the parallelism of DRAM access without extra overhead
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.