
In-Situ Compute Memory Systems
Reetuparna Das
Assistant Professor, EECS Department, University of Michigan

Massive amounts of data generated each day

Near-Data Computing: move compute near storage. By moving computation close to where huge volumes of data reside, we can solve the data movement problem.

Evolution
1997: Processing in Memory (PIM): IRAM, DIVA, Active Pages, etc.
2012: Resurgence of PIM, driven by the emergence of big data (data movement dominates Joules/op) and by 3D memories, whose logic layer near memory is enabled by 3D technology.
2014: Automaton Processor: associative memory with custom interconnects.
2015: Compute Memories (bit-line computing): in-situ computation inside memory arrays.

Problem 1: Memories are big and inactive. Memory consumes most of the aggregate die area (e.g., the caches on a Haswell Core i7-5960X die).

Problem 2: Significant energy is wasted moving data through the memory hierarchy. Moving an operand costs 1000-2000 pJ, while the operation itself costs only 1-50 pJ: a 20-40x gap.
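The 20-40x figure follows directly from the slide's pJ ranges; a quick sketch of the arithmetic (the ranges are the slide's illustrative numbers, and using the 50 pJ upper end of the operation range as the comparison point is an assumption):

```python
# Arithmetic behind the slide's 20-40x claim (illustrative ranges, not
# measurements from a specific chip):
move_low_pj, move_high_pj = 1000, 2000   # pJ to move data through the hierarchy
op_high_pj = 50                          # pJ for the compute operation itself

ratio_low = move_low_pj // op_high_pj    # 1000 / 50 = 20
ratio_high = move_high_pj // op_high_pj  # 2000 / 50 = 40
print(f"data movement costs {ratio_low}-{ratio_high}x the operation itself")
```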

Problem 3: General-purpose processors are inefficient for data-parallel applications. Scalar execution is inefficient; small vectors (32 bytes) are more efficient; what about very large vectors, perhaps 1000x larger?

Problem summary: Conventional systems process data inefficiently. Memory dominates die area (Problem 1), data movement wastes energy (Problem 2), and cores exploit too little parallelism (Problem 3).

Key Idea: Memory = Storage + In-place Compute

Proposal: Repurpose memory logic for compute. Benefits: massive parallelism (up to 100X) and energy efficiency (up to 20X). (Speaker note: the figures depict dynamic energy; with total energy, even the baseline is dominated by caches.)

Compute Caches for Efficient Very Large Vector Processing
PIs: Reetuparna Das, Satish Narayanasamy, David Blaauw
Students: Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan

Proposal: Compute Caches. In-place compute SRAM in each cache sub-array computes A op B in place. Components: a cache controller extension and memory disambiguation support, spanning cores (CORE0..CORE3), L1, L2, the interconnect, and the L3 slices (L3-Slice0..L3-Slice3) down to banks and sub-arrays. Challenges: data orchestration, managing parallelism, coherence and consistency.

Opportunity 1: Large vector computation. The cache hierarchy (L1/L2/L3) is built from many small sub-arrays, each of which can compute in parallel, with operation width equal to the row size. Parallelism available in a 16 MB L3: 512 sub-arrays x 64 B = a 32 KB operand per operation (the slide quotes a 128X gain).
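The operand-width arithmetic above can be spelled out in a few lines (the sub-array count and row size are the slide's numbers):

```python
# Operand width available when every sub-array of a 16 MB L3 computes at once:
subarrays = 512      # sub-arrays that can compute in parallel (slide's figure)
row_bytes = 64       # each sub-array operates on one 64 B row at a time

operand_bytes = subarrays * row_bytes
print(operand_bytes // 1024, "KB operand per compute-cache operation")  # 32
```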

Opportunity 2: Save data movement energy. A significant portion of cache energy (60-80%) is wire energy in the H-tree interconnect. Computing in place saves this wire energy and avoids moving data up the hierarchy (L3 to L2 to L1).

Compute Cache Architecture. Three components: (1) in-place compute SRAM in the L3 slices (L3-Slice0..L3-Slice3), (2) a cache controller extension at the cores' caches (CORE0..CORE3, L1, L2, interconnect), and (3) memory disambiguation support for large vector operations. More details in the upcoming HPCA paper.

SRAM array operation. A sub-array consists of wordlines (rows, selected by the row decoder) and bitline pairs (BL0/BLB0 .. BLn/BLBn) with precharge logic and differential sense amplifiers (SA).
Read: precharge bit and bit_b, then raise the wordline; the sense amplifiers resolve the differential swing.
Write: drive the data onto bit and bit_b, then raise the wordline.
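The read/write sequence above can be mimicked with a minimal behavioral model (class and method names are hypothetical; the analog precharge and sensing details are reduced to comments):

```python
class SramSubArray:
    """Behavioral sketch of the sub-array described above (digital view only)."""

    def __init__(self, rows, cols):
        self.cells = [[0] * cols for _ in range(rows)]

    def read(self, row):
        # Precharge bit/bit_b, raise one wordline; the differential
        # sense amplifiers then resolve each column's stored value.
        return list(self.cells[row])

    def write(self, row, data):
        # Drive data onto bit/bit_b, then raise the wordline.
        self.cells[row] = list(data)

arr = SramSubArray(rows=4, cols=8)
arr.write(2, [0, 1, 0, 1, 1, 0, 0, 1])
print(arr.read(2))  # [0, 1, 0, 1, 1, 0, 0, 1]
```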

In-place compute SRAM: changes. Two modifications to a conventional sub-array: add a second row decoder (Row Decoder-O) so that two wordlines can be activated simultaneously, and replace the differential sense amplifiers with single-ended sense amplifiers that compare each bitline against a reference voltage (Vref).

In-place compute SRAM: computing A AND B. Activate the rows holding A and B together; a bitline stays high only if both cells on it store 1, so sensing BL against Vref yields A AND B. Example from the slide: A = 0101, B = 1001, so A AND B = 0001.
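The slide's truth table can be reproduced by emulating the simultaneous activation of the two rows; sensing BL gives the AND (the complementary bitline would give NOR under the same bit-line computing technique, though the slide illustrates only the AND):

```python
# Row values from the slide's example:
A = [0, 1, 0, 1]
B = [1, 0, 0, 1]

# With both wordlines raised, a bitline stays high only if every accessed
# cell on it stores 1, so sensing BL against Vref yields A AND B per column.
bl = [a & b for a, b in zip(A, B)]
print(bl)  # [0, 0, 0, 1]
```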

SRAM Prototype Test Chip

Compute Cache ISA So Far
cc_copy a, b, n
cc_buz a, n
cc_not a, b, n
cc_logical a, b, c, n
cc_cmp a, b, r, n
cc_search a, k, r, n
cc_clmul a, k, r, n
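To make the operand conventions concrete, here is a hypothetical software model of two of the instructions; the mnemonics mirror the slide, but the exact semantics (a = source address, b = destination address, n = length in bytes) are an assumption for illustration:

```python
def cc_copy(mem, a, b, n):
    """cc_copy a, b, n: copy n bytes at address a to address b (assumed semantics)."""
    mem[b:b + n] = mem[a:a + n]

def cc_not(mem, a, b, n):
    """cc_not a, b, n: write the bitwise NOT of n bytes at a into b (assumed)."""
    mem[b:b + n] = bytes((~x) & 0xFF for x in mem[a:a + n])

mem = bytearray(16)
mem[0:4] = b"\x0f\xf0\x00\xff"
cc_copy(mem, 0, 4, 4)
cc_not(mem, 0, 8, 4)
print(mem[4:8].hex(), mem[8:12].hex())  # 0ff000ff f00fff00
```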

Applications modeled using Compute Caches:
Text processing: StringMatch (SM: 50 MB input) and WordCount (WC: 10 MB input)
FastBit bitmap database: Bit Matrix Multiplication (BMM: 256 bitmaps, ~256 KB of bitmaps)
In-memory checkpointing: 100,000-instruction checkpointing interval

Compute Cache Results Summary (chart: 1.9X performance, 2.4X energy savings; the slide also quotes a 4% figure).

Compute Cache Summary
Empower caches to compute using in-place compute SRAM
Performance: large vector parallelism; Energy: reduced on-chip data movement
Data placement and cache geometry for increased operand locality
Results: 8% area overhead; 2.1X performance, 2.7X energy savings

Future

Compute Memory System Stack
Redesign applications: data analytics, crypto, image processing, bioinformatics, graphs, OS primitives, FSA (finite-state automata), machine learning. Express computation using in-situ operations.
Adapt PL/compiler: data-flow languages; Java/C++, OpenCL; express parallelism (e.g., Google's TensorFlow [14], RAPID [20]).
Design architecture: ISA, data orchestration, coherence and consistency, data-flow execution, large SIMD, managing parallelism.
Design compute memories: volatile (SRAM, DRAM) and non-volatile (Re-RAM, STT-RAM/MRAM, Flash); customize the hierarchy; decide where to compute.
In-situ techniques and operations: bit-line parallel, automaton, bit-line locality; a rich operation set including logical ops, data migration, comparison, search, addition, multiplication, convolution, and FSM execution.

In-Situ Compute Memory Systems Thank You! Reetuparna Das