ISCA2000 Norman Margolus MIT/BU SPACERAM: An embedded DRAM architecture for large-scale spatial lattice computations

ISCA2000 Overview The Task: large-scale brute-force computations with a crystalline-lattice structure. The Opportunity: exploit row-at-a-time access in DRAM to achieve speed and size. The Challenge: dealing efficiently with memory granularity, communication, and processing. The Trick: data movement by layout and addressing. 4 Mbit per DRAM block with 256-bit I/O → 1 Tbit/sec total; 10 MB, 130 mm² (0.18 µm), and 8 W (memory).
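For scale, some back-of-envelope arithmetic (mine, not the slide's; the clock figure was lost from the transcript): the chip's 10 MB comprise twenty 4 Mbit blocks, so sustaining 1 Tbit/sec across twenty 256-bit ports implies a per-block clock of roughly

```latex
\frac{10^{12}\ \text{bit/s}}{20 \times 256\ \text{bit}} \approx 195\ \text{MHz per block}
```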

ISCA2000 Outline: 1. Crystalline lattice computations 2. Mapping a lattice into hardware

ISCA2000 Crystalline lattice computations Large scale, with spatial regularity, as in finite-difference computations. 1 Tbit/sec of sustained memory bandwidth corresponds to a comparable rate of bit multiply-accumulates per second. Many image-processing and rendering algorithms have spatial regularity.

ISCA2000 Examples: brute-force physical simulations (complex); bit-mapped 3D games and virtual reality; logic simulation (physical → wavefront); pyramid, multigrid, and multiresolution computations; symbolic lattice computations.

ISCA2000 Site update rate / processor chip (comparison charts; graphics not captured in the transcript)

ISCA2000 Outline: 1. Crystalline lattice computations 2. Mapping a lattice into hardware

ISCA2000 Mapping a lattice into hardware Divide the lattice up evenly among a mesh array of chips. Each chip handles an equal-sized sector of the emulated lattice.

ISCA2000 Mapping a lattice into hardware (diagram: the lattice divided into sectors across a mesh of chips)

ISCA2000 Mapping a lattice into hardware Use blocks of DRAM to hold the lattice data (Tbits/sec of bandwidth). Use virtual processors for large computations (depth-first, wavefront, skew). Main challenge: mapping lattice computations onto granular memory.

ISCA2000 Lattice computation model 1D example: the rows are bit-fields and the columns are sites. Shift bit-fields periodically and uniformly (shifts may be large). Process sites independently and identically (SIMD). (Diagram: bit-fields A and B, before and after a shift.)
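To make the model concrete, here is a minimal sketch in Python/NumPy (my own illustration, not code from the talk): two one-bit fields are shifted uniformly with periodic wraparound, then every site is updated by the same rule. The swap-on-collision rule is hypothetical.

```python
import numpy as np

N = 16  # lattice sites
rng = np.random.default_rng(0)
A = rng.integers(0, 2, N, dtype=np.uint8)  # bit-field A, one bit per site
B = rng.integers(0, 2, N, dtype=np.uint8)  # bit-field B, one bit per site

def step(A, B):
    # 1. Shift each bit-field uniformly and periodically (data movement).
    A = np.roll(A, 1)    # field A moves one site to the right
    B = np.roll(B, -1)   # field B moves one site to the left
    # 2. Update all sites independently and identically (SIMD).
    #    Hypothetical rule: exchange A and B wherever exactly one is set.
    swap = A ^ B
    return np.where(swap, B, A), np.where(swap, A, B)

A, B = step(A, B)
```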

ISCA2000 Mapping the lattice into memory (diagram: bit-fields A and B laid out in memory words, before and after a shift)

ISCA2000 Mapping the lattice into memory CAM-8 approach: group adjacent sites into memory words. Shifts can be long, but re-aligning words is messy → chaining. Can process word-wide (CAM-8 did site-at-a-time).

ISCA2000 Mapping the lattice into memory CAM-8: adjacent bits of a bit-field are stored together. (Diagram.)

ISCA2000 Mapping the lattice into memory SPACERAM: bits are stored as evenly spread skip samples. (Diagram.)

ISCA2000 Processing skip-samples Process sets of evenly spaced lattice sites (site groups). Data movement is done by reading shifted data. (Diagram: memory words pairing samples of bit-fields A and B.)

ISCA2000 Processing skip-samples (animation, unshifted read): successive words pair A[i] with B[i]: A[0] B[0], A[1] B[1], A[2] B[2], A[3] B[3].

ISCA2000 Processing skip-samples (animation, shifted read): reading bit-field A at an address offset of two pairs A[2] with B[0], A[3] with B[1], A[0] with B[2], and A[1] with B[3]; the whole field shifts without any data being copied.
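A minimal sketch of the skip-sample trick (my own illustration, with made-up sizes): because word w, lane k holds site w + k·W, a uniform periodic shift by s sites reduces to reading a different word address plus a lane rotation — exactly what the address offset and barrel rotator provide.

```python
import numpy as np

# Skip-sample layout: N sites stored as W words of G lanes,
# with word w, lane k holding site w + k*W.
N, W = 16, 4
G = N // W
sites = np.arange(N)              # let site j hold the value j, for clarity
mem = sites.reshape(G, W).T       # mem[w, k] = value of site w + k*W

def shifted_word(mem, w, s):
    """Word that supplies output word w after a uniform shift by s sites:
    an address offset plus a barrel rotation; no data is ever moved."""
    q, r = divmod(s % N, W)
    src = (w - r) % W                    # address offset: which word to read
    rot = q + (1 if w - r < 0 else 0)    # barrel-rotator amount (wraparound)
    return np.roll(mem[src], rot)

# Check against shifting the whole lattice as a flat array.
s = 5
shifted = np.roll(sites, s)
for w in range(W):
    assert (shifted_word(mem, w, s) == shifted[w::W]).all()
```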

ISCA2000 DRAM module Assume the hardware word size is 64 bits, and that we can address any word. For each site group: address the next word we need, rotating it if necessary. Each bit-field resides in a different module. (Block diagram: DRAM block of 64-bit words → 64-bit barrel rotator → to processor; controls: address, shift amount.)

ISCA2000 Problem: granularity A DRAM block has structure: e.g., all words in a row must be used before switching rows. Solution: grouping. (Diagram: grouped partial shifts combine into one shift.)

ISCA2000 DRAM module Three levels of granularity: 2K-bit row, 256-bit word, 64-bit hardware word. All are handled by data layout, addressing, and rotation within the 64-bit word. (Block diagram: 2K × 8 × 256 DRAM block → 64-wide 4:1 column select → 64-bit barrel rotator → to processor; controls: row addr, col addr, sub-addr, shift amt.)
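As a hedged illustration of how those three levels might appear in a flat bit address (field names and widths are my own inference from the sizes on the slide, not the chip's actual decode logic):

```python
# Hypothetical flat-address split matching the slide's granularity levels:
# a 2K-bit row holds eight 256-bit words; a 256-bit word holds four
# 64-bit hardware words.
def split_address(bit_addr: int):
    row   = bit_addr >> 11         # which 2K-bit DRAM row (2048 = 2**11)
    col   = (bit_addr >> 8) & 0x7  # which 256-bit word within the row
    sub   = (bit_addr >> 6) & 0x3  # which 64-bit word within that
    shift = bit_addr & 0x3F        # rotation within the 64-bit word
    return row, col, sub, shift
```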

ISCA2000 Gluing chips together All chips operate in lockstep. Shift is periodic within each sector. To glue sectors, bits that stick out replace the corresponding bits in the next sector.

ISCA2000 DRAM module Any data that wraps around is transmitted over the mesh, and the corresponding bit on the corresponding wire is substituted for it. This is done sequentially in each of the three mesh directions. (Block diagram: as before, with an xyz mesh-substitution stage between the barrel rotator and the processor; controls now include wrap info.)
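A toy 1D model of the gluing step (my sketch, not the hardware's protocol): each sector is shifted periodically on its own chip, and the bits that wrapped around are then replaced by the corresponding bits received from the neighboring sector over the mesh.

```python
import numpy as np

def glue_shift(left, right, s):
    """Shift the combined lattice [left|right] by s sites, with each
    sector shifted locally and wrap bits exchanged across the seam."""
    assert 0 < s <= len(left)
    L = np.roll(left, s)                  # local periodic shift
    R = np.roll(right, s)
    # The first s entries of each rolled sector are wrap bits;
    # substitute the neighbor's wrap bits in their place.
    L[:s], R[:s] = R[:s].copy(), L[:s].copy()
    return L, R

# The two glued sectors behave like one lattice shifted as a whole.
left, right = np.arange(4), np.arange(4, 8)
L, R = glue_shift(left, right, 1)
assert (np.concatenate([L, R]) == np.roll(np.arange(8), 1)).all()
```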

ISCA2000 Partitioning a 2D sector (diagrams; graphics not captured in the transcript)

ISCA2000 The SPACERAM chip 0.18 µm CMOS DRAM process (20 W, 215 mm²). 10% of the memory bandwidth goes to control. Memory hierarchy (external memory). Single DRAM address space.

ISCA2000 The SPACERAM chip (diagram: DRAM modules #00 through #19 feeding processing elements PE0 through PE63)

ISCA2000 The SPACERAM chip (diagram: module #19 detail, 2K rows × 2K columns, with buses to PE0, PE1, …, PE63)

ISCA2000 SPACERAM: symbolic PE Computation is done by LUT. A permuter lets any DRAM wire play any role. A MUX lets the LUT output be used conditionally. The LUT is really a MUX, with the data bussed.
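A small functional sketch of the symbolic PE as described (bit widths and the interface are my assumptions): a permuter routes selected DRAM wires to the LUT inputs, the LUT output is computed by table lookup, and a MUX conditionally substitutes a pass-through bit.

```python
import numpy as np

def symbolic_pe(wires, perm, lut, cond, passthru):
    """wires: 0/1 array of DRAM bits; perm: which wire drives each LUT
    input (the permuter); lut: 2**k-entry truth table; cond selects
    between the LUT output and the pass-through bit (the MUX)."""
    inputs = wires[perm]                               # permuter: any wire, any role
    index = int(inputs @ (1 << np.arange(len(perm))))  # pack bits into a LUT address
    out = lut[index]                                   # computation by lookup table
    return out if cond else passthru                   # conditional MUX

# Example: a 2-input XOR routed from wires 0 and 2.
wires = np.array([1, 0, 1, 1], dtype=np.uint8)
xor_lut = np.array([0, 1, 1, 0], dtype=np.uint8)
assert symbolic_pe(wires, np.array([0, 2]), xor_lut, True, 0) == 0  # XOR(1,1) = 0
```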


ISCA2000 Conclusions 0.18 µm CMOS, 10 Mbytes of DRAM, 1 Tbit/sec to memory, 215 mm², 20 W. Addressing is powerful: simple hardware provides a general mechanism for lattice data movement. We can exploit the parallelism of DRAM access without extra overhead.