Buffered Compares: Excavating the Hidden Parallelism inside DRAM Architectures with Lightweight Logic. Jinho Lee, Kiyoung Choi, and Jung Ho Ahn, Seoul National University.

Presentation transcript:

Buffered Compares: Excavating the Hidden Parallelism inside DRAM Architectures with Lightweight Logic Jinho Lee, Kiyoung Choi, and Jung Ho Ahn Seoul National University

Outline
- Introduction
- Our Approach
- Buffered Compare Architecture
- Evaluation
- Summary

Introduction – Memory Wall
The number of cores in a chip keeps increasing, but memory bandwidth is not growing as fast --> the "memory wall" problem. Emerging big-data applications require even more bandwidth. In fact, much of the available bandwidth is wasted!

Introduction – Table Scan
Which items are made of wood? Which items are heavier than 5kg?

Item#  Material  Weight
A      Wood      10kg
B      Metal     1.5kg
C                7kg
D      Stone     3kg
E                2kg
…

Introduction – Table Scan
[Diagram: the core reads the table data D0–D3 from DRAM (①), compares each entry against the search key (②), and keeps only the result (③).]
The data are read out and the comparisons are done on the core. We only need the result – a waste of bandwidth!

Introduction – Table Scan
[Diagram: the core sends the key to DRAM (❶), the comparison is done inside memory (❷), and only the result comes back (❸).]
Do the compare within the memory: only two transfers are needed instead of many. This is essentially a PIM (processing-in-memory) approach.

Introduction – PIM
PIM research was active from the late '90s to the early '00s: EXECUBE, IRAM, FlexRAM, Smart Memory, Yukon, DIVA, etc. Putting multiple cores in DRAM made these designs hard to integrate. PIM is regaining interest for several reasons:
- Big-data workloads
- Limited improvement in processor speed
- Limited improvement in memory bandwidth
- 3D-stacked memory (HMC, HBM, etc.)

Introduction – PIM
PIM with 3D-stacked memory, two recent examples:
[Diagram: PEI (PIM-enabled instructions) – out-of-order host cores with an HMC controller, plus PCUs with a PIM directory and locality monitor in the memory; J. Ahn et al., ISCA 2015.]
[Diagram: Tesseract – in-order cores with prefetch buffers, list prefetchers, and message queues inside the HMC; J. Ahn et al., ISCA 2015.]

Our Approach – DRAM Architecture & Motivation
[Diagram: a DRAM chip is organized into banks; each bank has a global row decoder, column decoder, and global sense amps (bank I/O), and is built from mats of 512 x 512 cells with local row decoders and local sense amps (row buffer), all connected to the chip I/O through an internal shared bus and the off-chip link.]
A single chip is composed of 8-16 banks. When data are accessed, a row in a bank is "activated" and stored in the row buffer, and a cache line (64B) is fetched in one burst.

Our Approach – DRAM Architecture & Motivation
One bank alone can fill up the bandwidth of the off-chip link. Because the time required to activate a row is very long, multiple banks are used, which gives 8x-16x internal bandwidth. Most of this internal bandwidth is wasted.
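A back-of-the-envelope check of the 8x-16x figure, as a minimal C sketch. The x8 device width and the DDR4-2000 rate are assumptions borrowed from the evaluation setup later in the deck; the premise that one bank can stream at the link rate is the slide's own.

    #include <stdio.h>

    int main(void) {
        /* DDR4-2000: 2000 MT/s; an x8 chip moves 1 byte per transfer. */
        double link_gbs = 2000e6 * 1.0 / 1e9;   /* GB/s on the off-chip link */
        int banks = 16;                         /* banks per chip (assumed)  */
        printf("off-chip link : %.1f GB/s per chip\n", link_gbs);
        printf("internal      : %.1f GB/s aggregate (%dx)\n",
               link_gbs * banks, banks);
        return 0;
    }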

Our Approach – DRAM Architecture & Motivation
Compute inside each bank to utilize the excess internal bandwidth.

Our Approach – Goal
- Utilize the unused internal bandwidth
- Add minimal area overhead to DRAM
- Be less invasive to the existing ecosystem (i.e., leave the DDR3/4 protocol intact as much as possible)

Our Approach – Goal
All PIM operations have deterministic latency:
- All DRAM commands (ACT, RD, …) have pre-determined latencies
- DDR protocols have no mechanism for the memory to signal the processor
- No branching, caching, or pipelining is allowed
This preserves the existing DDR interface and keeps the added logic lightweight.

Our Approach – Goal
Single-row restriction:
- Inter-bank communication is expensive
- Activating additional rows incurs further overhead
- Staying within one activated row allows the bank I/O to be used as an operand register

Our Approach – What to Compute with PIM?
We focus on 'compare-n-op' patterns over a long range of data: a key is compared (CMP) against every item D0 … DN stored in DRAM. The next three slides define the patterns; a C sketch of all three follows them.

Our Approach – What to Compute with PIM?
Compare-n-read: returns the match result for each item, e.g., Result: (=, <, =, …, >).

Our Approach – What to Compute with PIM?
Compare-n-select: returns the min/max among the items, e.g., Max: (D7).

Our Approach – What to Compute with PIM?
Compare-n-increment: increments the values of matching items, e.g., searching for key K2 turns (K2, V2) into (K2, V2++).
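To pin down the semantics, here is a minimal C sketch of the three patterns as plain host-side loops. The function names, the int64_t item type, and the result encoding are illustrative assumptions; in BC these loops run inside each DRAM bank.

    #include <stdint.h>
    #include <stddef.h>

    /* Compare-n-read: record how each item compares against the key. */
    void compare_n_read(const int64_t *d, size_t n, int64_t key, int *res) {
        for (size_t i = 0; i < n; i++)
            res[i] = (d[i] < key) ? -1 : (d[i] == key) ? 0 : 1; /* <, =, > */
    }

    /* Compare-n-select: return the index of the max item (min is symmetric). */
    size_t compare_n_select_max(const int64_t *d, size_t n) {
        size_t best = 0;
        for (size_t i = 1; i < n; i++)
            if (d[i] > d[best]) best = i;
        return best;
    }

    /* Compare-n-increment: bump the value of every item whose key matches. */
    void compare_n_increment(const int64_t *k, int64_t *v, size_t n, int64_t key) {
        for (size_t i = 0; i < n; i++)
            if (k[i] == key) v[i]++;
    }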

Buffered Compare Architecture
[Diagram: per-bank BC logic – a key buffer, arithmetic unit, result queue, and command generator (CGEN) sit next to the bank I/O on the global datalines; the banks share the internal bus to the chip I/O.]
Key buffer: holds a value written by the processor
Arithmetic unit: performs computation (cmp, add, etc.) using the bank I/O and the key buffer as operands
Result queue: stores the compare results
CGEN: repeats the bank-local commands
The datapath is 64 bits wide; the added logic amounts to a 0.53% overhead in DRAM area.
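As a rough behavioral model (not RTL), the per-bank logic can be pictured as below in C. The struct layout, queue depth, and result encoding are invented for illustration; the slide specifies only the four components and the 64-bit datapath.

    #include <stdint.h>

    #define RQ_DEPTH 64                 /* result-queue depth (assumed) */

    typedef struct {
        uint64_t key_buffer;            /* written once by the host     */
        uint8_t  result_queue[RQ_DEPTH];
        int      head, tail;
    } bc_bank_logic;

    /* One CGEN-driven step: the bank I/O presents the next 64-bit word
       of the activated row, the arithmetic unit compares it with the
       key buffer, and the outcome is pushed into the result queue.    */
    static void bc_step(bc_bank_logic *b, uint64_t bank_io_word) {
        uint8_t r = (bank_io_word < b->key_buffer)  ? 0 :
                    (bank_io_word == b->key_buffer) ? 1 : 2;  /* <, =, > */
        b->result_queue[b->tail] = r;
        b->tail = (b->tail + 1) % RQ_DEPTH;
    }

    /* Drain one result toward the host (step 6 of the flow below). */
    static uint8_t bc_pop(bc_bank_logic *b) {
        uint8_t r = b->result_queue[b->head];
        b->head = (b->head + 1) % RQ_DEPTH;
        return r;
    }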

Buffered Compare Architecture
[Diagram: detailed view of the per-bank datapath – 64-bit paths connect the bank I/O, the key buffer with its mask, the arithmetic unit, and the result queue, all sequenced by the command generator and attached to the internal shared bus.]

Buffered Compare Architecture – Compare-n-read
[Diagram: the per-bank datapath, stepped through the following sequence.]
❶ A DRAM row is activated and the data become ready in the row buffer.
❷ The host writes the search key into the key buffer.
❸ 64B of target data are fetched to the bank I/O.
❹ The arithmetic unit performs the comparison and queues the result.
❺ Steps ❸ and ❹ are repeated over the requested range by the command generator.
❻ The results are sent to the host.
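Viewed from the memory controller, the whole sequence might be issued as below. This is a hypothetical C sketch: only ACT is a real DDR command and CMP_RD comes from the programming-model slide later in the deck, while the other mnemonics and arguments are assumptions. Each stub just logs the command it would put on the bus.

    #include <stdint.h>
    #include <stdio.h>

    static void issue(const char *cmd, int bank, long a, long b) {
        printf("%-10s bank=%d  %ld %ld\n", cmd, bank, a, b);
    }

    void cmp_read(int bank, int row, int col, int ncols, uint64_t key) {
        issue("ACT",        bank, row, 0);        /* 1: activate the row     */
        issue("BC_SET_KEY", bank, (long)key, 0);  /* 2: fill the key buffer  */
        issue("CMP_RD",     bank, col, ncols);    /* 3-5: CGEN repeats the
                                                     column read + compare   */
        issue("BC_READ_RQ", bank, ncols, 0);      /* 6: drain result queue   */
    }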

Buffered Compare Architecture – Problems and Solutions
Problems:
- Virtual addresses cannot be handled: either physical addresses must be used, or virtual addresses must be translated within the DRAM.
- Cache coherence: the processor caches and the DRAM have to stay coherent.
Solution: a direct segment with a non-cacheable region. Base, limit, and offset registers are kept for one large memory segment, so translation can be done by simple additions (see the sketch below), and data within the segment are kept non-cacheable.
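A minimal sketch of the direct-segment translation the slide describes, assuming the usual base/limit/offset formulation; the types and names are illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct { uint64_t base, limit, offset; } direct_segment;

    /* Translation is a bounds check plus one addition -- no page-table
       walk is needed for addresses inside the segment. */
    bool ds_translate(const direct_segment *s, uint64_t va, uint64_t *pa) {
        if (va < s->base || va >= s->limit)
            return false;          /* outside the segment: normal paging */
        *pa = va + s->offset;      /* inside: a single addition          */
        return true;
    }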

Buffered Compare Architecture – Problems and Solutions
Data placement: a 64-bit word is normally distributed over the multiple chips of a rank, interleaved in units of 8 bits, but BC needs the whole word inside one chip.

Byte-interleaved (word 'A' is distributed):
Chip 0: A0 B0 …   Chip 1: A1 B1 …   …   Chip 7: A7 B7 …

Solution: use word-interleaving within the segment (a sketch follows):
Chip 0: A0 A1 …   Chip 1: B0 B1 …   …   Chip 7: H0 H1 …

Critical-word-first is disabled within the segment.
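The two placements can be written as address-mapping functions. This C sketch assumes a rank of eight x8 chips with consecutive words rotating across chips, which is one plausible reading of the slide's figure rather than the paper's exact mapping.

    #include <stdint.h>

    /* Conventional DDR placement: byte i of every 64-bit word lives in
       chip i, so each chip contributes 8 bits of each word. */
    static int chip_byte_interleaved(uint64_t word_idx, int byte_idx) {
        (void)word_idx;            /* every word is striped the same way */
        return byte_idx;           /* chips 0..7 hold bytes 0..7         */
    }

    /* Word-interleaved BC segment: word w lives entirely in one chip,
       so the in-bank logic sees complete 64-bit operands. */
    static int chip_word_interleaved(uint64_t word_idx) {
        return (int)(word_idx % 8);
    }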

Buffered Compare Architecture – Programming Model
SW code:

    __kernel void search(__global const long *keys, long searchkey,
                         __global int *d) {
        int id = get_global_id(0);
        if (keys[id] == searchkey)
            d[id] = 1;
    }

Instruction: BC_cmp_read(searchkey, keys, N)
DRAM cmd: CMP_RD(searchkey, addr, range)
OpenCL-based programming model: programmers need not be aware of DRAM parameters (page size, number of banks, …).

Evaluation – Setup
McSimA+ simulator
Processor: 22nm, 16 cores running at 3GHz; 16KB private L1; 32MB S-NUCA L2; directory-based MESI coherence
Memory: 28nm DDR4-2000; 4 ranks per channel; 16 banks per chip; PAR-BS (parallelism-aware batch scheduling)

Evaluation – Setup
Six workloads:
- TSC: in-memory linear scan (column-store)
- TSR: in-memory linear scan (row-store)
- BT: B+ tree traversal (index scan)
- MAX: MAX aggregation
- SA: sequence assembly
- KV: key-value store
BC was evaluated against the baseline and AMO (Active Memory Operation).

Evaluation – Speedup
[Chart: speedup of AMO and BC over the baseline across the six workloads.]
BC performs 3.62x better than the baseline.

Evaluation – Bandwidth Usage
BC utilizes more than 8.64x the internal bandwidth of the baseline on geomean.

Evaluation – Sensitivity
In general, the more aggregate banks, the higher the speedup; however, adding more ranks sometimes degrades performance.

Experimental Result
Energy consumption is reduced by 73.3% on average (Proc: 77.2%, Mem: 43.9%).

Summary
We proposed Buffered Compare, a processing-in-memory approach that utilizes the internal bandwidth of DRAM:
- Minimal overhead to the DRAM area
- Less invasive to existing DDR protocols
- 3.62x speedup and 73.3% energy reduction
Limitations:
- Operations are restricted to a single large segment
- With x4 devices, only up to 32-bit operands are supported

The End Thank you!