Untrodden Paths for Near Data Processing


Untrodden Paths for Near Data Processing
Rajeev Balasubramonian, School of Computing, University of Utah

Near Data Processing
[Figure: the Gartner Hype Curve, expectations vs. time (1995-2005), rising to the Peak of Inflated Expectations and settling at the Plateau of Productivity; NDP marked at the present day]

Zooming In
2011-2013: many rejected papers. Typical criticisms:
- "Zero novelty": see PIM
- "Too costly": see DRAM vendors
[Figure: the same hype curve, zoomed in on the descent toward the Plateau of Productivity]

The Inflection Point: Micron’s Hybrid Memory Cube
- Inspired the term "Near Data Processing"
- Spawned the Workshop on NDP, 2013-2015
- IEEE Micro article, 2014
- IEEE Micro Special Issue on NDP, 2016

The Inflection Point: Micron’s Hybrid Memory Cube
A low-cost approach to data/compute co-location

Low-Cost? Demands a diversified portfolio …

Talk Outline
- In-situ acceleration
- Feature-rich DIMMs
- Near-data security
[Figure: processor with memory controller (MC) and buffer-on-board (BoB); image source: gizmodo]

Memristors

In-Situ Operations
[Figure: memristor crossbar holding weights w00..w33; inputs x0..x3 drive the rows and outputs y0..y3 emerge on the columns. A voltage V1 applied across conductance G1 produces current I1 = V1·G1; likewise I2 = V2·G2; the currents sum on the column wire: I = I1 + I2 = V1·G1 + V2·G2]
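The crossbar's analog dot product can be sketched numerically. This is an idealized (noise-free, perfectly linear) model; the function name and the example voltages/conductances are illustrative, not from the talk.

```python
import numpy as np

# Ideal model of the crossbar: a voltage V_i on row i passes through a
# memristor with conductance G_ij, contributing a current V_i * G_ij to
# column j; Kirchhoff's current law sums the column currents, so each
# column computes a dot product in place (no weight fetch needed).
def crossbar_mvm(voltages, conductances):
    """voltages: shape (rows,); conductances: shape (rows, cols).
    Returns the column currents, shape (cols,)."""
    return voltages @ conductances

V = np.array([0.2, 0.4])                 # inputs x0, x1 encoded as voltages
G = np.array([[1.0, 2.0],                # weights w encoded as conductances
              [3.0, 4.0]])
I = crossbar_mvm(V, G)                   # [0.2*1 + 0.4*3, 0.2*2 + 0.4*4] = [1.4, 2.0]
```

In a real device the inputs pass through DACs and the column currents through ADCs, which is where much of the power goes (see the Challenges slide).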

Machine Learning Acceleration
[Figure: accelerator chip (node) built from tiles connected to an external I/O interface; each tile contains an eDRAM buffer, output registers (OR), max-pool (MP), sigmoid (s), and shift-and-add (S+A) units, plus In-Situ Multiply-Accumulate (IMA) units; each IMA combines crossbars (XB), DACs, ADCs, input registers (IR), and sample-and-hold (S+H) units]
- Low leakage
- No weight fetch
- No storage vs. compute divide

DaDianNao

Challenges
- High ADC power
- Difficult to exploit sparsity
- Precision and noise
- Other workloads

Focusing on Cost (energy per bit of data access)
- Commodity DDR3 DRAM: 70 pJ/bit
- Commodity LPDDR2: 40 pJ/bit
- GDDR5: 14 pJ/bit
- HMC data access: 10.5 pJ/bit
- HMC SerDes links: 4.5 pJ/bit
- HBM data access: 3.6 pJ/bit
- HBM interposer link: 0.3 pJ/bit
References: Malladi et al., ISCA’12; Jeddeloh & Keeth, Symp. VLSI’12; O’Connor et al., MICRO’17
[Image source: HardwareZone]
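To make the spread concrete, a few lines of arithmetic turn these per-bit figures into joules per gigabyte moved; the dictionary keys and the helper name are mine, the pJ/bit numbers are the ones cited on this slide.

```python
# Energy-per-bit figures quoted on the slide (pJ/bit). The span from
# DDR3 to an HBM interposer link is more than 200x in energy per bit.
PJ_PER_BIT = {
    "DDR3": 70.0,
    "LPDDR2": 40.0,
    "GDDR5": 14.0,
    "HMC data access": 10.5,
    "HMC SerDes links": 4.5,
    "HBM data access": 3.6,
    "HBM interposer link": 0.3,
}

def joules_to_move(num_bytes, pj_per_bit):
    """Energy in joules to move num_bytes at the given pJ/bit."""
    return num_bytes * 8 * pj_per_bit * 1e-12

GB = 1 << 30
for tech, pj in PJ_PER_BIT.items():
    print(f"{tech:>20}: {joules_to_move(GB, pj):.4f} J per GB")
```

Moving one gigabyte over DDR3 costs roughly 0.6 J; the same transfer over an HBM interposer link costs under 3 mJ, which is the cost gap the rest of the talk tries to navigate.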

Memory Interconnects (Pugsley et al., IEEE Micro 2014; Wang et al., HPCA 2018)
[Figure: processor with memory controller (MC) and buffer-on-board (BoB)]
- Interconnect architecture
- Computation off-loading (what, where)
- Auxiliary functions: coding, compression, encryption, etc.

Talk Outline
- In-situ acceleration
- Feature-rich DIMMs
- Near-data security
[Figure: processor with memory controller (MC) and buffer-on-board (BoB); image source: gizmodo]

Memory Vulnerabilities
- A malicious OS or hardware can modify data
- All buses are exposed
[Figure: processor with CORE 1 running VM 1 (victim) and CORE 2 running VM 2 (attacker) under one OS, sharing a memory controller (MC)]

Spectre Overview
Victim code:

    if (x < array1_size)
        y = array2[ array1[x] ];

- x is controlled by the attacker
- Thanks to branch prediction, x can be anything
- The out-of-bounds read array1[x] returns the secret
- The access pattern of array2[ ] betrays the secret
[Figure: array1[ ] holding 5, 10, 20, with SECRETS in adjacent memory; the speculatively loaded secret indexes array2[ ]]

Memory Defenses
- Memory timing channels: requires dummy memory accesses, overhead of up to 2x
- Memory access patterns: requires ORAM semantics, overhead of 280x
- Memory integrity: requires integrity trees and MACs, overhead of 10x
Prime candidate for NDP!

InvisiMem, ObfusMem
- Exploit HMC-like active memory devices
- MACs, deterministic schedule, double encryption
- Easily handle integrity, timing channels, trace leakage
(Figure from Awad et al., ISCA 2017)

Path ORAM
Step 1. Check the PosMap for block 0x1 (currently mapped to leaf 17).
Step 2. Read path 17 into the stash.
Step 3. Select the data block and change its leaf (17 becomes 25 in the PosMap).
Step 4. Write the stash back to path 17.
[Figure: CPU holding the PosMap (entries for 0x0-0x3; 0x1's leaf changes from 17 to 25) and a stash, accessing a binary tree of buckets]
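The four Path ORAM steps can be sketched as a toy simulator. This is a functional sketch only, with no encryption or authentication; the tree height, bucket size, and all names are illustrative, not from the protocol specification.

```python
import random

# Toy Path ORAM over a binary tree of buckets.
L = 3                          # tree height; leaves are 0 .. 2^L - 1
Z = 4                          # blocks per bucket
tree = {}                      # (level, index) -> list of (addr, data)
pos_map = {}                   # addr -> leaf
stash = {}                     # addr -> data

def path(leaf):
    """Bucket coordinates from the root (level 0) down to the leaf (level L)."""
    return [(lvl, leaf >> (L - lvl)) for lvl in range(L + 1)]

def access(addr, new_data=None):
    # Step 1: look up the block's leaf in the PosMap (random if unmapped).
    leaf = pos_map.get(addr, random.randrange(2 ** L))
    # Step 3: immediately remap the block to a fresh random leaf.
    pos_map[addr] = random.randrange(2 ** L)
    # Step 2: read every bucket on the old path into the stash.
    for node in path(leaf):
        for a, d in tree.pop(node, []):
            stash[a] = d
    if new_data is not None:
        stash[addr] = new_data
    result = stash.get(addr)
    # Step 4: write the stash back along the same path, deepest bucket
    # first, placing each block as deep as its new leaf allows.
    for lvl, idx in reversed(path(leaf)):
        fits = [a for a in stash if (pos_map[a] >> (L - lvl)) == idx]
        tree[(lvl, idx)] = [(a, stash.pop(a)) for a in fits[:Z]]
    return result

access(0x1, "secret")
print(access(0x1))  # prints: secret
```

Every access touches one full root-to-leaf path and then remaps the block, so repeated accesses to the same address look like accesses to independent random paths, which is the source of both the obliviousness and the roughly order-of-magnitude-squared bandwidth overhead quoted on the previous slide.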

Distributed ORAM with Secure DIMMs
[Figure: processor MC connected to multiple SDIMMs, each fronted by an authenticated buffer chip; all buses are exposed, but buffer-chip-to-processor communication is encrypted]
- ORAM operations shift from the processor to the SDIMMs
- The ORAM traffic pattern shifts from the memory bus to on-SDIMM "private" buses
- Bandwidth scales with the number of SDIMMs
- No trust in the memory vendor
- Commodity low-cost DRAM

SDIMM: Independent Protocol
The ORAM tree is split into two subtrees (ORAM0 and ORAM1, one per SDIMM). Steps:
1. CPU sends ACCESS(addr, DATA) to ORAM0.
2. ORAM0 locally performs the ORAM access; the CPU sends PROBE to check for completion.
3. CPU sends FETCH_RESULT; the new leaf ID is assigned in the CPU.
4. CPU broadcasts APPEND to all SDIMMs to move the block.
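The message flow above can be sketched at a purely functional level. The message names (ACCESS, PROBE, FETCH_RESULT, APPEND) are from the slide; the SDIMM class, its plain-dictionary "subtree", and the handler structure are illustrative stand-ins, not the hardware design.

```python
# Toy message-level sketch of the independent protocol: each SDIMM
# services ORAM requests locally, and the CPU only exchanges short
# protocol messages instead of streaming whole ORAM paths over the bus.
class SDIMM:
    def __init__(self):
        self.store = {}        # addr -> data (stands in for a local ORAM subtree)
        self.busy = False
        self.result = None

    def handle(self, msg, addr=None, data=None):
        if msg == "ACCESS":            # Step 1: CPU issues the request
            self.busy = True
            if data is not None:
                self.store[addr] = data
            self.result = self.store.get(addr)   # Step 2: local ORAM access
            self.busy = False
        elif msg == "PROBE":           # Step 2: CPU polls for completion
            return "DONE" if not self.busy else "BUSY"
        elif msg == "FETCH_RESULT":    # Step 3: CPU retrieves the block
            return self.result
        elif msg == "APPEND":          # Step 4: block re-inserted under its new leaf
            self.store[addr] = data

sdimms = [SDIMM(), SDIMM()]            # ORAM0 and ORAM1
sdimms[0].handle("ACCESS", addr=0x1, data="blk")
assert sdimms[0].handle("PROBE") == "DONE"
blk = sdimms[0].handle("FETCH_RESULT")
for s in sdimms:                       # CPU assigns the new leaf and broadcasts
    s.handle("APPEND", addr=0x1, data=blk)
```

The key property the sketch captures is that the expensive path reads and write-backs never cross the exposed memory bus; only the four short messages do.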

SDIMM: Split Protocol
[Figure: SDIMM 0 holds the even bits of data/metadata; SDIMM 1 holds the odd bits]

SDIMM: Split Protocol
1. Read a path into the local stashes.
2. Send metadata to the CPU.
3. The CPU re-assembles the metadata and decides the write-back order.
4. Send the metadata back to the SDIMMs.
5. Write back the path in the order determined by the CPU.
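The even/odd bit split that underlies this protocol can be shown in a few lines: each SDIMM stores only half of every value's bits, so neither one alone learns the data or metadata, and the CPU re-assembles the halves. The function names and the 16-bit width are illustrative.

```python
def split_bits(value, width=16):
    """Split a value into its even-position and odd-position bits
    (positions counted from bit 0), one half share per SDIMM."""
    even = odd = 0
    for i in range(width):
        bit = (value >> i) & 1
        if i % 2 == 0:
            even |= bit << (i // 2)
        else:
            odd |= bit << (i // 2)
    return even, odd

def merge_bits(even, odd, width=16):
    """CPU-side re-assembly of the two half shares."""
    value = 0
    for i in range(width):
        half = even if i % 2 == 0 else odd
        value |= ((half >> (i // 2)) & 1) << i
    return value

e, o = split_bits(0b1011_0110)        # SDIMM 0 gets e, SDIMM 1 gets o
assert merge_bits(e, o) == 0b1011_0110
```

Unlike a secret-sharing scheme, this plain bit split leaks half the bits to each device; the real design layers encryption between the authenticated buffer chips and the processor, so the split's role is to divide the ORAM work and traffic, not to hide data by itself.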

Take-Homes
- NDP is the key to reduced data movement
- 3D-stacked memory+logic devices are great, but expensive!
- Need diversified efforts:
  - New in-situ computation devices
  - Focus on traditional memory and interconnects
  - Focus on auxiliary features: security/privacy, compression, coding
Acks:
- Utah Arch students: Ali Shafiee, Anirban Nag, Seth Pugsley
- Collaborators: Mohit Tiwari, Feifei Li, Viji Srinivasan, Alper Buyuktosunoglu, Naveen Muralimanohar, Vivek Srikumar
- Funding: NSF, Intel, IBM, HPE Labs