In-Situ Compute Memory Systems. Reetuparna Das, Assistant Professor, EECS Department
Massive amounts of data generated each day
Near Data Computing: move compute near storage. We can solve the data movement problem for huge volumes of data.
Evolution:
1997: Processing in Memory (PIM): IRAM, DIVA, Active Pages, etc.
2012: Resurgence of Processing in Memory: emergence of big data, data movement dominates Joules/op, 3D memories available; a logic layer near memory is enabled by 3D technology.
2014: Automata Processor: associative memory with custom interconnects.
2015: Compute Memories (bit-line computing): in-situ computation inside memory arrays.
Problem 1: Memories are big but inactive. Memory consumes most of the aggregate die area (e.g., the cache in a Haswell Core i7-5960X die).
Problem 2: Significant energy is wasted moving data through the memory hierarchy. Data movement costs 1000-2000 pJ versus 1-50 pJ for an operation, a 20-40x gap.
Problem 3: General-purpose processors are inefficient for data-parallel applications. Scalar execution is inefficient; small vectors (32 bytes) are more efficient; what about very large vectors, 1000x larger?
Problem summary: Conventional systems process data inefficiently. [Diagram: cores and memories annotated with Problems 1-3, highlighting area and energy costs.]
Key Idea Memory = Storage + In-place compute
Proposal: Repurpose memory logic for compute. Benefits: massive parallelism (up to 100X) and energy efficiency (up to 20X). (The energy figures depict dynamic energy; with total energy, even the baseline is dominated by caches.)
Compute Caches for Efficient Very Large Vector Processing. PIs: Reetuparna Das, Satish Narayanasamy, David Blaauw. Students: Shaizeen Aga, Supreet Jeloka, Arun Subramanian.
Proposal: Compute Caches. In-place compute SRAM (B = A op B) inside cache sub-arrays, with memory disambiguation support and a cache controller extension. [Diagram: CORE0-CORE3 with private L1/L2 caches, an interconnect, and L3 slices 0-3 built from banks of sub-arrays.] Challenges: data orchestration, managing parallelism, coherence and consistency.
Opportunity 1: Large vector computation. Operation width = row size. The cache is built from many smaller sub-arrays, and each sub-array can compute in parallel. Parallelism available in a 16 MB L3: 512 sub-arrays x 64 B rows = a 32 KB operand processed at once (128X).
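A quick back-of-the-envelope check of that arithmetic, written as a small C program (the 16 MB / 512 sub-array / 64 B row figures are the slide's example geometry, not a general rule):

```c
#include <stdio.h>

int main(void) {
    /* Example cache geometry from the slide:
       a 16 MB L3 built from 512 sub-arrays with 64-byte rows. */
    const long l3_bytes      = 16L * 1024 * 1024;
    const long num_subarrays = 512;
    const long row_bytes     = 64;

    long per_subarray = l3_bytes / num_subarrays;   /* capacity of one sub-array */
    long operand      = num_subarrays * row_bytes;  /* bytes operated on in parallel */

    printf("capacity per sub-array : %ld KB\n", per_subarray / 1024);  /* 32 KB */
    printf("parallel operand width : %ld KB\n", operand / 1024);       /* 32 KB */
    return 0;
}
```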
Opportunity 2: Save data movement energy. A significant portion of cache energy (60-80%) is wire energy, spent in the H-tree. Computing in place saves that wire energy and the energy of moving data to higher cache levels (L3 to L2 to L1).
Compute Cache Architecture: memory disambiguation for large vector operations, a cache controller extension, and in-place compute SRAM. [Diagram: CORE0-CORE3, private L1/L2 caches, interconnect, L3 slices 0-3.] More details in the upcoming HPCA paper.
SRAM array operation. A sub-array consists of wordlines, bit-line pairs (BL/BLB), a row decoder, precharge circuitry, and differential sense amplifiers. Read: precharge bit and bit_b, then raise the wordline; the differential sense amplifiers resolve the stored values. Write: drive data onto bit and bit_b, then raise the wordline.
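A behavioral sketch of that read/write sequence in C (a functional model only, assuming a simple 0/1 cell array; it ignores timing, precharge circuits, and analog behavior):

```c
#include <stdint.h>
#include <stdio.h>

#define ROWS 4
#define COLS 8

/* One bit per cell; cell[r][c] models the stored value Q (Qb is its complement). */
static uint8_t cell[ROWS][COLS];

/* Read: precharge BL and BLB, raise one wordline, let the accessed cell
   discharge one side, and resolve with a differential sense amplifier. */
static void sram_read(int row, uint8_t out[COLS]) {
    for (int c = 0; c < COLS; c++) {
        int bl = 1, blb = 1;            /* both bit-lines precharged high   */
        if (cell[row][c]) blb = 0;      /* cell stores 1: Qb pulls BLB down */
        else              bl  = 0;      /* cell stores 0: Q pulls BL down   */
        out[c] = (uint8_t)(bl > blb);   /* differential sensing             */
    }
}

/* Write: drive data onto BL/BLB and raise the wordline. */
static void sram_write(int row, const uint8_t in[COLS]) {
    for (int c = 0; c < COLS; c++) cell[row][c] = in[c] & 1;
}

int main(void) {
    uint8_t data[COLS] = {1, 0, 1, 1, 0, 0, 1, 0}, out[COLS];
    sram_write(2, data);
    sram_read(2, out);
    for (int c = 0; c < COLS; c++) printf("%d", out[c]);
    printf("\n");
    return 0;
}
```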
In-place compute SRAM changes: add a second row decoder (Row Decoder-O) so two wordlines can be activated at once, and replace the differential sense amplifiers with single-ended sense amplifiers referenced against Vref.
In-place compute SRAM: activate two rows, A and B, simultaneously; sensing the bit-lines single-ended yields A AND B directly on the columns. Example: A = 0101, B = 1001, so A AND B = 0001.
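A behavioral sketch of the same idea in C, using the operand bits from the slide (the NOR-on-BLB companion output is how such bit-line compute designs are commonly described and is included here as an assumption; only the AND result appears on the slide):

```c
#include <stdint.h>
#include <stdio.h>

#define COLS 4

/* Bit-line compute model: both wordlines (rows A and B) are raised at once
   and the bit-lines are sensed single-ended against Vref. */
static void bitline_compute(const uint8_t A[COLS], const uint8_t B[COLS],
                            uint8_t and_out[COLS], uint8_t nor_out[COLS]) {
    for (int c = 0; c < COLS; c++) {
        /* BL stays precharged only if neither cell pulls it down,
           i.e. both cells store 1: sensing BL yields A AND B. */
        and_out[c] = (uint8_t)(A[c] & B[c]);
        /* BLB stays precharged only if both cells store 0:
           sensing BLB yields A NOR B (assumed companion output). */
        nor_out[c] = (uint8_t)(!(A[c] | B[c]));
    }
}

int main(void) {
    uint8_t A[COLS] = {0, 1, 0, 1};   /* operand row A from the slide */
    uint8_t B[COLS] = {1, 0, 0, 1};   /* operand row B from the slide */
    uint8_t and_out[COLS], nor_out[COLS];
    bitline_compute(A, B, and_out, nor_out);
    printf("A AND B = ");
    for (int c = 0; c < COLS; c++) printf("%d", and_out[c]);   /* 0001 */
    printf("\n");
    return 0;
}
```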
SRAM Prototype Test Chip
Compute Cache ISA so far:
cc_copy a, b, n
cc_search a, k, r, n
cc_logical a, b, c, n
cc_clmul a, k, r, n
cc_buz a, n
cc_cmp a, b, r, n
cc_not a, b, n
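A sketch of the architectural semantics of a few of these operations as plain C reference functions (the cc_ mnemonics and the operands a/b/c, key k, result r, and size n follow the slide; the C signatures below are assumptions for illustration, and the real instructions execute in bulk, in place, inside cache sub-arrays rather than through the core):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* cc_copy a, b, n : copy n bytes from region a to region b. */
static void cc_copy(const void *a, void *b, size_t n) { memcpy(b, a, n); }

/* cc_buz a, n : bulk-zero n bytes starting at a. */
static void cc_buz(void *a, size_t n) { memset(a, 0, n); }

/* cc_not a, b, n : bitwise NOT of region a into region b. */
static void cc_not(const uint8_t *a, uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) b[i] = (uint8_t)~a[i];
}

/* cc_logical a, b, c, n : bitwise logical op of a and b into c
   (AND shown; the family also covers OR/XOR). */
static void cc_and(const uint8_t *a, const uint8_t *b, uint8_t *c, size_t n) {
    for (size_t i = 0; i < n; i++) c[i] = a[i] & b[i];
}

/* cc_cmp a, b, r, n : compare regions a and b, writing a match flag to r. */
static void cc_cmp(const uint8_t *a, const uint8_t *b, uint8_t *r, size_t n) {
    *r = (uint8_t)(memcmp(a, b, n) == 0);
}

int main(void) {
    uint8_t src[8] = {1, 2, 3, 4, 5, 6, 7, 8}, dst[8], flag;
    cc_copy(src, dst, sizeof dst);
    cc_cmp(src, dst, &flag, sizeof dst);
    printf("copy matches source: %d\n", flag);   /* prints 1 */
    return 0;
}
```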
Applications modeled using Compute Caches: text processing (WordCount: 10 MB, StringMatch: 50 MB), the FastBit bitmap database with bit matrix multiplication (256 bitmaps, ~256 KB of bitmaps), and in-memory checkpointing (100,000-instruction checkpointing interval).
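As a toy illustration of how a FastBit-style bitmap workload maps onto these operations, here is a conjunctive bitmap query expressed with a software stand-in for cc_logical (the predicates, bitmap contents, and C signature are made up for illustration):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BITMAP_BYTES 64   /* one 64 B cache row per bitmap in this toy example */

/* Software stand-in for cc_logical (AND); the real operation runs in-place
   inside the cache, one row per sub-array in parallel. */
static void cc_and(const uint8_t *a, const uint8_t *b, uint8_t *c, size_t n) {
    for (size_t i = 0; i < n; i++) c[i] = a[i] & b[i];
}

static int popcount_bytes(const uint8_t *a, size_t n) {
    int count = 0;
    for (size_t i = 0; i < n; i++)
        for (int bit = 0; bit < 8; bit++) count += (a[i] >> bit) & 1;
    return count;
}

int main(void) {
    /* Two column bitmaps from a bitmap index: bit i is set if row i
       satisfies the corresponding (made-up) predicate. */
    uint8_t age_over_30[BITMAP_BYTES], state_is_MI[BITMAP_BYTES], result[BITMAP_BYTES];
    memset(age_over_30, 0xF0, sizeof age_over_30);
    memset(state_is_MI, 0x3C, sizeof state_is_MI);

    /* The conjunctive query "age > 30 AND state = MI" becomes one bulk
       bitwise AND over the bitmaps, followed by a population count. */
    cc_and(age_over_30, state_is_MI, result, BITMAP_BYTES);
    printf("matching rows: %d\n", popcount_bytes(result, BITMAP_BYTES));
    return 0;
}
```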
Compute Cache Results Summary. [Charts: headline figures of 1.9X, 2.4X, and 4%.]
Compute Cache Summary: empower caches to compute. Key components: in-place compute SRAM (8% area overhead) and data placement / cache geometry for increased operand locality. Performance: large vector parallelism, 2.1X speedup. Energy: reduced on-chip data movement, 2.7X energy savings.
Future
Compute Memory System Stack:
Redesign applications (data analytics, crypto, image processing, bioinformatics, graphs, OS primitives, FSA, machine learning): express computation using in-situ operations.
Adapt PL/compiler (Java/C++, OpenCL, data-flow languages such as Google's TensorFlow [14], RAPID [20]): express parallelism.
Design architecture (ISA, data-flow, large SIMD): data orchestration, coherence and consistency, managing parallelism.
Design compute memories (volatile: SRAM, DRAM; non-volatile: Flash, MRAM, STT-RAM, Re-RAM): customize the hierarchy, decide where to compute, provide a rich operation set.
In-situ techniques: bit-line parallel, bit-line locality, automaton. Operation set: logical, data migration, comparison, search, addition, multiplication, convolution, FSM.
In-Situ Compute Memory Systems Thank You! Reetuparna Das