Untrodden Paths for Near Data Processing


Untrodden Paths for Near Data Processing
Rajeev Balasubramonian, School of Computing, University of Utah

Near Data Processing
[Figure: the Gartner Hype Curve, expectations vs. time (1995-2005), rising to the Peak of Inflated Expectations and settling at the Plateau of Productivity; NDP marked at the present day]

Zooming In
2011-2013: many rejected papers. Typical criticisms:
- "Zero novelty": see PIM
- "Too costly": see DRAM vendors
[Figure: the same hype curve, zoomed in on the descent toward the Plateau of Productivity]

The Inflection Point: Micron’s Hybrid Memory Cube
- Inspired the term "Near Data Processing"
- Spawned the Workshop on NDP, 2013-2015
- IEEE Micro article, 2014
- IEEE Micro Special Issue on NDP, 2016

The Inflection Point: Micron’s Hybrid Memory Cube
A low-cost approach to data/compute co-location

Low-Cost? Demands a diversified portfolio …

Talk Outline
- In-situ acceleration
- Feature-rich DIMMs
- Near-data security
[Figure: processor with memory controller (MC) and buffer-on-board (BoB); image source: gizmodo]

Memristors

In-Situ Operations
[Figure: memristor crossbar holding weights w00..w33; inputs x0..x3 drive the rows and outputs y0..y3 emerge on the columns. A voltage V1 applied across conductance G1 produces current I1 = V1·G1; likewise I2 = V2·G2; the currents sum on the column wire: I = I1 + I2 = V1·G1 + V2·G2]
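The crossbar's analog dot product can be sketched numerically. This is an idealized (noise-free, perfectly linear) model; the function name and the example voltages/conductances are illustrative, not from the talk.

```python
import numpy as np

# Ideal model of the crossbar: a voltage V_i on row i passes through a
# memristor with conductance G_ij, contributing a current V_i * G_ij to
# column j; Kirchhoff's current law sums the column currents, so each
# column computes a dot product in place (no weight fetch needed).
def crossbar_mvm(voltages, conductances):
    """voltages: shape (rows,); conductances: shape (rows, cols).
    Returns the column currents, shape (cols,)."""
    return voltages @ conductances

V = np.array([0.2, 0.4])                 # inputs x0, x1 encoded as voltages
G = np.array([[1.0, 2.0],                # weights w encoded as conductances
              [3.0, 4.0]])
I = crossbar_mvm(V, G)                   # [0.2*1 + 0.4*3, 0.2*2 + 0.4*4] = [1.4, 2.0]
```

In a real device the inputs pass through DACs and the column currents through ADCs, which is where much of the power goes (see the Challenges slide).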

Machine Learning Acceleration
[Figure: accelerator chip (node) built from tiles connected to an external I/O interface; each tile contains an eDRAM buffer, output registers (OR), max-pool (MP), sigmoid (s), and shift-and-add (S+A) units, plus In-Situ Multiply-Accumulate (IMA) units; each IMA combines crossbars (XB), DACs, ADCs, input registers (IR), and sample-and-hold (S+H) units]
- Low leakage
- No weight fetch
- No storage vs. compute divide

DaDianNao

Challenges
- High ADC power
- Difficult to exploit sparsity
- Precision and noise
- Other workloads

Focusing on Cost (energy per bit of data access)
- Commodity DDR3 DRAM: 70 pJ/bit
- Commodity LPDDR2: 40 pJ/bit
- GDDR5: 14 pJ/bit
- HMC data access: 10.5 pJ/bit
- HMC SerDes links: 4.5 pJ/bit
- HBM data access: 3.6 pJ/bit
- HBM interposer link: 0.3 pJ/bit
References: Malladi et al., ISCA’12; Jeddeloh & Keeth, Symp. VLSI’12; O’Connor et al., MICRO’17
[Image source: HardwareZone]
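To make the spread concrete, a few lines of arithmetic turn these per-bit figures into joules per gigabyte moved; the dictionary keys and the helper name are mine, the pJ/bit numbers are the ones cited on this slide.

```python
# Energy-per-bit figures quoted on the slide (pJ/bit). The span from
# DDR3 to an HBM interposer link is more than 200x in energy per bit.
PJ_PER_BIT = {
    "DDR3": 70.0,
    "LPDDR2": 40.0,
    "GDDR5": 14.0,
    "HMC data access": 10.5,
    "HMC SerDes links": 4.5,
    "HBM data access": 3.6,
    "HBM interposer link": 0.3,
}

def joules_to_move(num_bytes, pj_per_bit):
    """Energy in joules to move num_bytes at the given pJ/bit."""
    return num_bytes * 8 * pj_per_bit * 1e-12

GB = 1 << 30
for tech, pj in PJ_PER_BIT.items():
    print(f"{tech:>20}: {joules_to_move(GB, pj):.4f} J per GB")
```

Moving one gigabyte over DDR3 costs roughly 0.6 J; the same transfer over an HBM interposer link costs under 3 mJ, which is the cost gap the rest of the talk tries to navigate.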

Memory Interconnects (Pugsley et al., IEEE Micro 2014; Wang et al., HPCA 2018)
[Figure: processor with memory controller (MC) and buffer-on-board (BoB)]
- Interconnect architecture
- Computation off-loading (what, where)
- Auxiliary functions: coding, compression, encryption, etc.

Talk Outline
- In-situ acceleration
- Feature-rich DIMMs
- Near-data security
[Figure: processor with memory controller (MC) and buffer-on-board (BoB); image source: gizmodo]

Memory Vulnerabilities
- A malicious OS or hardware can modify data
- All buses are exposed
[Figure: processor with CORE 1 running VM 1 (victim) and CORE 2 running VM 2 (attacker) under one OS, sharing a memory controller (MC)]

Spectre Overview
Victim code:

    if (x < array1_size)
        y = array2[ array1[x] ];

- x is controlled by the attacker
- Thanks to branch prediction, x can be anything
- The out-of-bounds read array1[x] returns the secret
- The access pattern of array2[ ] betrays the secret
[Figure: array1[ ] holding 5, 10, 20, with SECRETS in adjacent memory; the speculatively loaded secret indexes array2[ ]]

Memory Defenses
- Memory timing channels: requires dummy memory accesses, overhead of up to 2x
- Memory access patterns: requires ORAM semantics, overhead of 280x
- Memory integrity: requires integrity trees and MACs, overhead of 10x
Prime candidate for NDP!

InvisiMem, ObfusMem
- Exploit HMC-like active memory devices
- MACs, deterministic schedule, double encryption
- Easily handle integrity, timing channels, trace leakage
(Figure from Awad et al., ISCA 2017)

Path ORAM
Step 1. Check the PosMap for block 0x1 (currently mapped to leaf 17).
Step 2. Read path 17 into the stash.
Step 3. Select the data block and change its leaf (17 becomes 25 in the PosMap).
Step 4. Write the stash back to path 17.
[Figure: CPU holding the PosMap (entries for 0x0-0x3; 0x1's leaf changes from 17 to 25) and a stash, accessing a binary tree of buckets]
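The four Path ORAM steps can be sketched as a toy simulator. This is a functional sketch only, with no encryption or authentication; the tree height, bucket size, and all names are illustrative, not from the protocol specification.

```python
import random

# Toy Path ORAM over a binary tree of buckets.
L = 3                          # tree height; leaves are 0 .. 2^L - 1
Z = 4                          # blocks per bucket
tree = {}                      # (level, index) -> list of (addr, data)
pos_map = {}                   # addr -> leaf
stash = {}                     # addr -> data

def path(leaf):
    """Bucket coordinates from the root (level 0) down to the leaf (level L)."""
    return [(lvl, leaf >> (L - lvl)) for lvl in range(L + 1)]

def access(addr, new_data=None):
    # Step 1: look up the block's leaf in the PosMap (random if unmapped).
    leaf = pos_map.get(addr, random.randrange(2 ** L))
    # Step 3: immediately remap the block to a fresh random leaf.
    pos_map[addr] = random.randrange(2 ** L)
    # Step 2: read every bucket on the old path into the stash.
    for node in path(leaf):
        for a, d in tree.pop(node, []):
            stash[a] = d
    if new_data is not None:
        stash[addr] = new_data
    result = stash.get(addr)
    # Step 4: write the stash back along the same path, deepest bucket
    # first, placing each block as deep as its new leaf allows.
    for lvl, idx in reversed(path(leaf)):
        fits = [a for a in stash if (pos_map[a] >> (L - lvl)) == idx]
        tree[(lvl, idx)] = [(a, stash.pop(a)) for a in fits[:Z]]
    return result

access(0x1, "secret")
print(access(0x1))  # prints: secret
```

Every access touches one full root-to-leaf path and then remaps the block, so repeated accesses to the same address look like accesses to independent random paths, which is the source of both the obliviousness and the roughly order-of-magnitude-squared bandwidth overhead quoted on the previous slide.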

Distributed ORAM with Secure DIMMs
[Figure: processor MC connected to multiple SDIMMs, each fronted by an authenticated buffer chip; all buses are exposed, but buffer-chip-to-processor communication is encrypted]
- ORAM operations shift from the processor to the SDIMMs
- The ORAM traffic pattern shifts from the memory bus to on-SDIMM "private" buses
- Bandwidth scales with the number of SDIMMs
- No trust in the memory vendor
- Commodity low-cost DRAM

SDIMM: Independent Protocol
The ORAM tree is split into two subtrees (ORAM0 and ORAM1, one per SDIMM). Steps:
1. CPU sends ACCESS(addr, DATA) to ORAM0.
2. ORAM0 locally performs the ORAM access; the CPU sends PROBE to check for completion.
3. CPU sends FETCH_RESULT; the new leaf ID is assigned in the CPU.
4. CPU broadcasts APPEND to all SDIMMs to move the block.
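The message flow above can be sketched at a purely functional level. The message names (ACCESS, PROBE, FETCH_RESULT, APPEND) are from the slide; the SDIMM class, its plain-dictionary "subtree", and the handler structure are illustrative stand-ins, not the hardware design.

```python
# Toy message-level sketch of the independent protocol: each SDIMM
# services ORAM requests locally, and the CPU only exchanges short
# protocol messages instead of streaming whole ORAM paths over the bus.
class SDIMM:
    def __init__(self):
        self.store = {}        # addr -> data (stands in for a local ORAM subtree)
        self.busy = False
        self.result = None

    def handle(self, msg, addr=None, data=None):
        if msg == "ACCESS":            # Step 1: CPU issues the request
            self.busy = True
            if data is not None:
                self.store[addr] = data
            self.result = self.store.get(addr)   # Step 2: local ORAM access
            self.busy = False
        elif msg == "PROBE":           # Step 2: CPU polls for completion
            return "DONE" if not self.busy else "BUSY"
        elif msg == "FETCH_RESULT":    # Step 3: CPU retrieves the block
            return self.result
        elif msg == "APPEND":          # Step 4: block re-inserted under its new leaf
            self.store[addr] = data

sdimms = [SDIMM(), SDIMM()]            # ORAM0 and ORAM1
sdimms[0].handle("ACCESS", addr=0x1, data="blk")
assert sdimms[0].handle("PROBE") == "DONE"
blk = sdimms[0].handle("FETCH_RESULT")
for s in sdimms:                       # CPU assigns the new leaf and broadcasts
    s.handle("APPEND", addr=0x1, data=blk)
```

The key property the sketch captures is that the expensive path reads and write-backs never cross the exposed memory bus; only the four short messages do.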

SDIMM: Split Protocol
[Figure: SDIMM 0 holds the even bits of data/metadata; SDIMM 1 holds the odd bits]

SDIMM: Split Protocol
1. Read a path into the local stashes.
2. Send metadata to the CPU.
3. The CPU re-assembles the metadata and decides the write-back order.
4. Send the metadata back to the SDIMMs.
5. Write back the path in the order determined by the CPU.
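The even/odd bit split that underlies this protocol can be shown in a few lines: each SDIMM stores only half of every value's bits, so neither one alone learns the data or metadata, and the CPU re-assembles the halves. The function names and the 16-bit width are illustrative.

```python
def split_bits(value, width=16):
    """Split a value into its even-position and odd-position bits
    (positions counted from bit 0), one half share per SDIMM."""
    even = odd = 0
    for i in range(width):
        bit = (value >> i) & 1
        if i % 2 == 0:
            even |= bit << (i // 2)
        else:
            odd |= bit << (i // 2)
    return even, odd

def merge_bits(even, odd, width=16):
    """CPU-side re-assembly of the two half shares."""
    value = 0
    for i in range(width):
        half = even if i % 2 == 0 else odd
        value |= ((half >> (i // 2)) & 1) << i
    return value

e, o = split_bits(0b1011_0110)        # SDIMM 0 gets e, SDIMM 1 gets o
assert merge_bits(e, o) == 0b1011_0110
```

Unlike a secret-sharing scheme, this plain bit split leaks half the bits to each device; the real design layers encryption between the authenticated buffer chips and the processor, so the split's role is to divide the ORAM work and traffic, not to hide data by itself.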

Take-Homes
- NDP is the key to reduced data movement
- 3D-stacked memory+logic devices are great, but expensive!
- Need diversified efforts:
  - New in-situ computation devices
  - Focus on traditional memory and interconnects
  - Focus on auxiliary features: security/privacy, compression, coding
Acks:
- Utah Arch students: Ali Shafiee, Anirban Nag, Seth Pugsley
- Collaborators: Mohit Tiwari, Feifei Li, Viji Srinivasan, Alper Buyuktosunoglu, Naveen Muralimanohar, Vivek Srikumar
- Funding: NSF, Intel, IBM, HPE Labs