
1 Accelerating Multiplication and Parallelizing Operations in Non-Volatile Memory
Mohsen Imani, Saransh Gupta, Tajana S. Rosing
University of California San Diego, System Energy Efficiency Lab

2 Big Data Processing
Internet of Things (IoT): billions to trillions of interconnected devices
Critical requirement of IoT applications: Big Data processing, e.g., signal processing, machine learning, graph processing
Can today's systems process Big Data? In a general-purpose processor, the cores sit apart from the large memory that holds the data, so processing requires constant data movement between them, causing large energy consumption and performance degradation. Inefficient!

3 Cost of Operations
A DRAM access consumes 170x more energy than an FPU multiply
Ref: Dally, Tutorial, NIPS'15

4 Processing In Memory
Processing in memory (PIM): performing part of the computation tasks inside the memory
Instead of moving data from the large memory to the processor cores, computational logic is added next to the memory that holds the Big Data

5 Supporting In-Memory Operations
Supported operations: bitwise (OR, AND, XOR), addition/multiplication, search
Example operations: multiple-row bitwise operations, matrix multiplication, search/nearest search
Example applications: HD computing, graph processing, query processing, deep learning, security, multimedia, classification, clustering, databases

6 PIM for Addition/Multiplication
First work to support in-memory multiplication
Enables in-memory addition/multiplication using emerging NVM technology (memristor devices)
Does not require changing the memory sense amplifiers
Significantly speeds up in-memory processing
Works in both precise and approximate modes
Ref: Imani et al. DAC'17

7 Crossbar NOR Operation
Out = NOR(in1, in2, ..., inn), e.g., Z = NOR(W, X, Y)
Logic 0: high resistance (ROFF, near infinite); logic 1: low resistance (RON, near 0)
Ref: Kvatinsky et al. TCASII'14
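The gate's behavior can be sketched in software as follows (an assumed behavioral model, not a circuit simulation): the output memristor is preset to logic 1 and switches to 0 as soon as any input cell conducts.

```python
# Behavioral sketch of the crossbar (MAGIC-style) NOR gate.
# Logic 0 maps to high resistance (R_OFF), logic 1 to low resistance
# (R_ON); the output cell is initialized to 1 and flips to 0 whenever
# at least one input cell is 1.

R_ON, R_OFF = 10e3, 10e6  # ohms; device values from the experimental setup

def crossbar_nor(*inputs):
    out = 1                 # output memristor preset to logic 1
    if any(inputs):         # any conducting input pulls it to 0
        out = 0
    return out

# Z = NOR(W, X, Y)
assert crossbar_nor(0, 0, 0) == 1
assert crossbar_nor(1, 0, 1) == 0
```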

8 NOR-based Addition
Crossbar memory supports the NOR operation. Can we implement a 1-bit full adder using only NOR?
Cout: 4 NOR; S: 3 NOT, 5 NOR (a NOT is implemented as a NOR with one input)
Adding two N-bit numbers takes 12N + 1 cycles
Ref: Talati et al. TNano'16
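A quick software check that a 1-bit full adder really can be built from NOR alone. This is one possible decomposition, not necessarily the exact gate mapping of Talati et al.; NOT is realized as a single-input NOR.

```python
def NOR(*xs):
    return 0 if any(xs) else 1

def NOT(x):
    return NOR(x)   # NOT is a NOR with one input

def full_adder(a, b, cin):
    # Carry-out = majority(a, b, cin), built from 4 NOR operations.
    cout = NOR(NOR(a, b), NOR(b, cin), NOR(a, cin))
    # Sum, built from NOR/NOT only (one of several possible forms).
    all_zero = NOR(a, b, cin)                  # 1 iff a = b = cin = 0
    all_one = NOR(NOT(a), NOT(b), NOT(cin))    # 1 iff a = b = cin = 1
    s = NOR(all_zero, NOR(NOT(cout), all_one))
    return s, cout

# Exhaustive check against ordinary binary addition.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert 2 * cout + s == a + b + cin
```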

9 In-Memory Multiplication
N x N multiplication proceeds in three steps:
Partial product generation: creates the partial products for the multiplication
Fast addition: reduces the N partial products to 2 numbers
Product generation: adds the two numbers produced by the fast adder and outputs the N x N product
Fast addition and product generation are detailed on the next few slides
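Functionally, the three steps above can be sketched in software as follows. This is a behavioral model only; APIM performs each step inside the crossbar.

```python
def apim_multiply(a, b, n):
    """Behavioral model of the three-step N x N in-memory multiply."""
    # 1) Partial product generation: one shifted copy of a per set bit of b.
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(n)]
    # 2) Fast addition: carry-save 3:2 reduction until two rows remain.
    while len(rows) > 2:
        x, y, z = rows[:3]
        s = x ^ y ^ z                           # bitwise sums, carries saved
        c = ((x & y) | (y & z) | (x & z)) << 1  # saved carries, shifted left
        rows = [s, c] + rows[3:]
    # 3) Product generation: one carry-propagating add of the final two rows.
    return sum(rows)

assert apim_multiply(13, 11, 4) == 13 * 11
```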

10 Fast Addition
Carry Save Adder (CSA):
Makes additions independent and parallel, with no carry propagation
Propagates the carry only at the last stage
A 3:2 reduction (3 inputs to 2 outputs) has the same latency as a 1-bit addition: 13 cycles
The last stage depends on N and runs in 12N + 1 cycles, for a total of 12N + 14 cycles with one CSA stage; the key point is that carry propagation is delayed until the last stage
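Under the latency model on this slide (each 3:2 layer costs a constant 13 cycles regardless of width, and only the final carry-propagating add costs 12N + 1), the reduction depth and total cycle count can be estimated as:

```python
def reduction_layers(num_rows):
    """Number of 3:2 CSA layers needed to reduce num_rows rows to 2."""
    layers = 0
    while num_rows > 2:
        num_rows -= num_rows // 3   # each group of 3 rows becomes 2
        layers += 1
    return layers

def total_cycles(n):
    # n partial products -> 2 rows, then one (12N + 1)-cycle final add.
    return 13 * reduction_layers(n) + (12 * n + 1)

assert reduction_layers(3) == 1           # 3 -> 2: one layer of 13 cycles
assert total_cycles(3) == 12 * 3 + 14     # matches the 12N + 14 figure
```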

11 Configurable Interconnect
Problem with CSA: too many shift operations
Solution: divide the crossbar into multiple blocks connected via configurable interconnects, and use the interconnects to perform the shift operations

12 Product Generation
Carry propagation to generate the final answer; the final operands are 2N bits long
Requires 13 * 2N cycles to compute the result, so this latency is dominant
Approximate product generation dramatically speeds this up when a fully accurate result is not required
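One way such an approximation can work (an illustrative scheme, not necessarily APIM's actual design): cut the carry chain into fixed-width segments that add in parallel, dropping carries at segment boundaries and trading a small error for constant latency.

```python
def approx_add(x, y, nbits, seg=4):
    """Illustrative approximate add: independent seg-bit segments in
    parallel; carries crossing a segment boundary are discarded, so
    latency is that of a seg-bit add, not an nbits-bit one."""
    mask = (1 << seg) - 1
    out = 0
    for i in range(0, nbits, seg):
        out |= ((((x >> i) & mask) + ((y >> i) & mask)) & mask) << i
    return out

assert approx_add(3, 4, 8) == 7     # exact when no carry crosses a boundary
assert approx_add(15, 1, 8) != 16   # the cross-segment carry is dropped
```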

13 Experimental Setup
C++ cycle-accurate simulator to model the APIM functionality
Circuit-level simulation performed in Cadence Virtuoso using 45 nm CMOS technology
VTEAM memristor model [*] for simulating the memory design: RON and ROFF of 10 kΩ and 10 MΩ, respectively
Six general OpenCL applications: Sobel, Robert, Fast Fourier Transform (FFT), DwtHaar1D, Sharpen, QuasiRandom
Compared with a state-of-the-art AMD Radeon R9 390 GPU with 8 GB memory
Hioki 3334 power meter used to measure GPU power consumption
[*] Kvatinsky et al. TCASII'15

14 APIM Efficiency
Average improvement over the six applications compared to the GPU: 28x energy efficiency and 4.8x speedup
Performance speedup comes from reduced data movement; energy improvement comes from both reduced data movement and computation efficiency
Robert filter: 34.5x less energy, 4.6x speedup; DwtHaar1D: 24.7x less energy, 3.6x speedup

15 Supporting In-Memory Operations
Supported operations: bitwise (OR, AND, XOR), addition/multiplication, search
Example operations: multiple-row bitwise operations, matrix multiplication, search/nearest search
Example applications: HD computing, graph processing, query processing, deep learning, security, multimedia, classification, clustering, databases

16 Nearest Search In-Memory
Conventional content addressable memories (CAMs) support only exact matches and cannot implement even simple queries such as min/max
We enable nearest-distance search in an ordinary crossbar memory
Our new CAM supports:
Hamming distance search: hyperdimensional computing
Absolute distance search: kNN, k-means, query processing, simple queries such as min/max
Ref: Imani et al. HPCA'17, ISLPED'17, TCAD'18
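A software model of the two search modes (the crossbar CAM computes these in place; function names and data here are illustrative):

```python
def nearest_hamming(query, stored):
    """Return the stored word with the smallest Hamming distance to query."""
    return min(stored, key=lambda w: bin(w ^ query).count("1"))

def nearest_absolute(query, stored):
    """Return the stored value numerically closest to query (kNN-style)."""
    return min(stored, key=lambda w: abs(w - query))

rows = [0b1010, 0b0111, 0b1100]
assert nearest_hamming(0b1011, rows) == 0b1010   # distance 1 vs. 2 and 3
assert nearest_absolute(6, [2, 5, 9]) == 5
```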

17 In-Memory Computing Accelerators
Classification: hyperdimensional classification [HPCA'17] (supports both training and testing), Adaboost [ICCAD'17], DNN/CNN [DATE'17], decision tree, kNN [ICRC'17]
Clustering: k-means, hyperdimensional clustering
Graph processing
Database query processing [ISLPED'17][TCAD'18]

18 Neural Network PIM (NNPIM)
Uses simple crossbar memory and 2-level memristor devices rather than multi-level memory cells Supports all neural networks operations in-memory including: Weighted Accumulation Activation function Pooling Software support (weight sharing) reduces computations Can achieve on an average 4.9x energy efficiency and 5.7x speedup as compared to the state-of-the-art accelerators.
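The weight-sharing idea can be illustrated in software (a sketch of the principle, not NNPIM's implementation): when many weights take the same value, inputs can be accumulated per distinct value first, leaving only one multiplication per value.

```python
def shared_dot(inputs, weights):
    """Dot product with weight sharing: accumulate inputs per distinct
    weight value, then multiply once per value instead of once per pair."""
    acc = {}
    for x, w in zip(inputs, weights):
        acc[w] = acc.get(w, 0) + x             # cheap adds, grouped by weight
    return sum(w * s for w, s in acc.items())  # one multiply per distinct weight

x = [1, 2, 3, 4]
w = [5, 5, 7, 7]   # only 2 distinct values -> 2 multiplies instead of 4
assert shared_dot(x, w) == sum(a * b for a, b in zip(x, w))
```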

19 Query Processing in NVMs
A novel query processing accelerator (NVQuery)
Uses a memristor-based memory to process queries including comparison, aggregation, and prediction functions, among others
Provides 49.3x performance speedup and 32.9x energy savings compared to a traditional processor
Ref: Imani et al. ISLPED'17, TCAD'18

20 Hyperdimensional Computing
Training: encode each training example into a hypervector (e.g., a cat hypervector, a dog hypervector)
Testing: encode the query into a hypervector and run a similarity check against the trained hypervectors; the brain, too, checks for similarity
In-memory implementation provides 746x EDP improvement compared to an ASIC implementation
Ref: Imani et al. HPCA'17
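The train/test flow above can be sketched as a tiny software model (an illustrative encoding with random bipolar hypervectors; real HD systems use dimensionality around 10,000 and richer encoders):

```python
import random

D = 1000          # hypervector dimensionality (illustrative)
random.seed(0)    # deterministic demo

def rand_hv():
    """Random bipolar hypervector standing in for an encoded sample."""
    return [random.choice((-1, 1)) for _ in range(D)]

def bundle(hvs):
    """Elementwise majority of sample hypervectors: the class hypervector."""
    return [1 if sum(col) >= 0 else -1 for col in zip(*hvs)]

def similarity(a, b):
    """Dot-product similarity between two hypervectors."""
    return sum(x * y for x, y in zip(a, b))

# Training: bundle encoded samples per class.
cat_samples = [rand_hv() for _ in range(5)]
dog_samples = [rand_hv() for _ in range(5)]
classes = {"cat": bundle(cat_samples), "dog": bundle(dog_samples)}

# Testing: pick the class whose hypervector is most similar to the query.
query = cat_samples[0]   # a sample seen during training
pred = max(classes, key=lambda c: similarity(classes[c], query))
assert pred == "cat"
```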

21 Conclusion
We are working to accelerate a wide range of applications in memory
At the circuit level, we are working to support more operations in memory and to make the existing operations more efficient
At the architecture and system level, we are designing new application-specific accelerators
At the application level, we are designing libraries that provide an interface for programmers to accelerate their applications using PIM

22 References
Dally. Tutorial, NIPS'15.
Mohsen Imani, Saransh Gupta, and Tajana Rosing. "Ultra-efficient processing in-memory for data intensive applications." Proceedings of the 54th Annual Design Automation Conference (DAC), p. 6. ACM, 2017.
Shahar Kvatinsky, Dmitry Belousov, Slavik Liman, Guy Satat, Nimrod Wald, Eby G. Friedman, Avinoam Kolodny, and Uri C. Weiser. "MAGIC—Memristor-aided logic." IEEE Transactions on Circuits and Systems II: Express Briefs 61, no. 11 (2014).
Nishil Talati, Saransh Gupta, Pravin Mane, and Shahar Kvatinsky. "Logic design within memristive memories using memristor-aided loGIC (MAGIC)." IEEE Transactions on Nanotechnology 15, no. 4 (2016).
Shahar Kvatinsky, Misbah Ramadan, Eby G. Friedman, and Avinoam Kolodny. "VTEAM: A general model for voltage-controlled memristors." IEEE Transactions on Circuits and Systems II: Express Briefs 62, no. 8 (2015).
Mohsen Imani, Abbas Rahimi, Deqian Kong, Tajana Rosing, and Jan M. Rabaey. "Exploring hyperdimensional associative memory." IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017.
Mohsen Imani, Saransh Gupta, Atl Arredondo, and Tajana Rosing. "Efficient query processing in crossbar memory." IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2017.

23 References
Mohsen Imani, Saransh Gupta, Sahil Sharma, and Tajana Rosing. "NVQuery: Efficient query processing in non-volatile memory." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018, in press.
Yeseong Kim, Mohsen Imani, and Tajana Rosing. "Orchard: Visual object recognition accelerator based on approximate in-memory processing." IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2017.
Mohsen Imani, Daniel Peroni, Yeseong Kim, Abbas Rahimi, and Tajana Rosing. "Efficient neural network acceleration on GPGPU using content addressable memory." Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.
Mohammad Samragh Razlighi, Mohsen Imani, Farinaz Koushanfar, and Tajana Rosing. "LookNN: Neural network with no multiplication." Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.
Mohsen Imani, Yeseong Kim, and Tajana Rosing. "NNgine: Ultra-efficient nearest neighbor accelerator based on in-memory computing." IEEE International Conference on Rebooting Computing (ICRC), 2017.
Mohsen Imani, Deqian Kong, Abbas Rahimi, and Tajana Rosing. "VoiceHD: Hyperdimensional computing for efficient speech recognition." IEEE International Conference on Rebooting Computing (ICRC), 2017.

