ACAM: Approximate Computing Based on Adaptive Associative Memory with Online Learning
Mohsen Imani†, Yeseong Kim†, Abbas Rahimi‡, Tajana S. Rosing†
†University of California San Diego, ‡University of California Berkeley

Motivation
Internet of Things (IoT): billions to trillions of interconnected devices (a large-scale problem).
1.8 zettabytes of data generated in 2015, increased by 50% in 2020!
Battery powered, hence tight energy-efficiency requirements.
Solution: approximate computing gets results faster, at lower energy cost, and with sufficient accuracy.
Inaccuracy is inherent, so most applications do not need exact results, e.g. machine learning, speech recognition, search, and graphics.

Associative Memory
Associative memories are a promising way to reduce the energy consumption of parallel processors.
They prestore high-frequency patterns and their corresponding outputs.
[Figure: input operands go to both the processor pipeline and a TCAM (ternary content-addressable memory) lookup table with a result memory (MEM); on a match, the pipeline is clock-gated and the stored result of the computation is used.]
Requirements: low area, low energy consumption, high performance.

Related Work
Energy-efficient computation
  Energy-efficient computing units
  Associative memory ("computation with memory")
CMOS-based: [Ullah TCAS’12], [Pagiamtzis ISSCC’06], [Arsovski JSSC’03]
NVM-based (MTJ, memristor): [Li ISSCC’14], [Chang ISSCC’15], [Huang VLSIC’14], [Govindaraj JLPED’15]
Exact/approximate matching: [Zhang DAET’15], [Rahimi DAC’14]
Memoization, approximate computing, associative computing: [Guo ISCA’13], [Rahimi DATE’15], [Imani DATE’16], [Sharad DAC’13]

Associative Memory Energy Solution
[Diagram: approaches to reducing associative-memory energy. Labels: hit rate, energy-efficient TCAM, online profiling, NVM TCAM, low activity, voltage overscaling.]
Cited work: Guo ISCA’13; Yavits TACO’14; Sharad DAC’13; Zhang DAET’15; Imani DATE’16; Imani ISQED’16; Rahimi DATE’15; Imani TETC’16.
Questions raised on the slide: cost of profiling? Accuracy!

Base GPU Architecture for Approximate Design
AMD Radeon HD 7970 device from the Southern Islands family
32 compute units, each with 4 SIMD units
16 stream cores (parallel lanes) per SIMD unit
2048 stream cores per device (32 × 4 × 16)
Thermal design power: 300 W!
[Figure: the compute device contains an ultra-threaded dispatcher, compute units 1 to 32, and global memory. Each compute unit contains a wavefront scheduler, local memory, and SIMD units 1 to 4. Each SIMD unit contains IF/ID, a vector/scalar register file, and stream cores 1 to 16 with integer and floating-point units.]

Associative Memory Integration
We add MASC within each floating-point unit (FPU); it consists of a TCAM and a 1T-1R memory.
If the data is found in the MASC TCAM, we clock-gate the FPU.
The search completes in one FPU cycle.
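The lookup path on this slide can be sketched in software as follows. This is a minimal illustration, not the authors' hardware design: the names `TcamRow` and `AssociativeFpu`, the integer operand encoding, and the per-row mask are all assumptions made for the example. A real TCAM compares every row in parallel; the loop only models the same outcome.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class TcamRow:
    value: int     # stored operand pattern
    mask: int      # 1 = compare this bit, 0 = "don't care" (ternary X)
    result: float  # precomputed output held in the companion memory


class AssociativeFpu:
    """Hypothetical software model of an FPU with an associative memory beside it."""

    def __init__(self, rows: Iterable[TcamRow], compute: Callable[[int], float]):
        self.rows = list(rows)
        self.compute = compute  # stands in for the real FPU datapath
        self.hits = 0
        self.misses = 0

    def lookup(self, key: int) -> Optional[float]:
        # Hardware compares all rows in parallel; a loop models the same outcome.
        for row in self.rows:
            if ((key ^ row.value) & row.mask) == 0:
                return row.result
        return None

    def execute(self, key: int) -> float:
        stored = self.lookup(key)
        if stored is not None:
            self.hits += 1    # hit: the FPU would be clock-gated here
            return stored
        self.misses += 1      # miss: the FPU computes as usual
        return self.compute(key)


fpu = AssociativeFpu(
    rows=[TcamRow(value=0b1010_0000, mask=0b1111_0000, result=1.5)],
    compute=lambda key: float(key),
)
hit_result = fpu.execute(0b1010_0110)   # upper nibble matches, lower nibble ignored
miss_result = fpu.execute(0b0001_0000)  # no matching row, falls back to compute
```

Masking out low-order bits is one way to model approximate matching: operands that differ only in the ignored bits reuse the same stored result.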

Framework of Proposed Online Profiling
Machine learning: finds the images of interest in the input dataset based on pixel similarity (the most represented data).
Approximate concurrent state machine: profiles the selected images of interest approximately by tracking the number of repeated computations.
Associative memory: sits beside the processor cores and reduces redundant computations.

Online Learning Algorithm
Temporal difference (TD) learning:
$$V_t(S_i) = R_t(S_i) + \gamma \cdot V_{t-1}(S_i)$$
TD-LRU maintains $k$ states, representing $k$ image groups of interest.
$R_t(S_i)$ is the reward value for the pixel similarity.
$\gamma$ balances the impact of the obtained reward.
TD-LRU treats images whose reward is below a threshold as non-seen images.
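The value update on this slide is small enough to trace by hand. The sketch below applies it to three states over three time steps; the value `gamma=0.5`, the reward sequences, and the helper names `td_update` and `top_k_states` are assumptions chosen for illustration, not values from the presentation.

```python
def td_update(prev_values, rewards, gamma):
    """Return V_t(S_i) = R_t(S_i) + gamma * V_{t-1}(S_i) for every state i."""
    if len(prev_values) != len(rewards):
        raise ValueError("one reward per state is required")
    return [r + gamma * v for r, v in zip(rewards, prev_values)]


def top_k_states(values, k):
    """Indices of the k states with the highest value, best first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]


# Three states, three time steps of (already thresholded) rewards.
values = [0.0, 0.0, 0.0]
for rewards in ([0.5, 0.0, 0.9], [0.5, 0.0, 0.0], [0.5, 0.9, 0.0]):
    values = td_update(values, rewards, gamma=0.5)

best = top_k_states(values, k=2)
```

With these inputs, the state that is rewarded steadily and the state that was rewarded most recently rank above the state whose only reward lies two steps in the past, because each step halves the weight of older rewards.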

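TD-LRU Learning
Thresholds: $T_{acc} = 0.35$, $T_{new} = 10$.
For each input image, the pixel similarity $R_t(S_i)$ is compared with $T_{acc}$: a similarity below $T_{acc}$ is set to 0 and the counter $Q_t(S_i)$ is incremented; otherwise the counter is reset to 0.

State                       S_1             S_2   S_3             S_4   S_5   ...  S_M
Pixel similarity R_t(S_i)   0.3             0.5   0.1             0.9   0.5   ...  0.2
Thresholded R_t(S_i)        0               0.5   0               0.9   0.5   ...  0
Q_t(S_i) update             Q_{t-1}(S_1)+1  0     Q_{t-1}(S_3)+1  0     0     ...  Q_{t-1}(S_M)+1
Example Q_t(S_i)            14              0     4               0     0     ...  7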
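The thresholding step in this example can be reproduced with a few lines of code. This is a sketch under stated assumptions: the function names are hypothetical, the previous counter values are invented so that the result matches the slide's example, and the use of $T_{new}$ as a staleness limit (a state whose counter exceeds it becomes a candidate for replacement) is an interpretation, since the slide only gives the threshold value.

```python
def td_lru_step(similarity, prev_q, t_acc=0.35):
    """Zero out rewards below t_acc and update the per-state counters."""
    rewards, counters = [], []
    for r, q_prev in zip(similarity, prev_q):
        if r < t_acc:
            rewards.append(0.0)          # treated as "not seen" this step
            counters.append(q_prev + 1)  # one more step without a sighting
        else:
            rewards.append(r)
            counters.append(0)           # seen: reset the counter
    return rewards, counters


def stale_states(counters, t_new=10):
    """ASSUMPTION: states unseen for more than t_new steps may be replaced."""
    return [i for i, q in enumerate(counters) if q > t_new]


similarity = [0.3, 0.5, 0.1, 0.9, 0.5, 0.2]  # values from the slide
prev_q = [13, 2, 3, 5, 1, 6]                 # hypothetical previous counters
rewards, counters = td_lru_step(similarity, prev_q)
replaceable = stale_states(counters)
```

The resulting counters are 14, 0, 4, 0, 0, 7, matching the slide's example row, and only the first state exceeds the assumed staleness limit of 10.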

TD-LRU Learning
If the top $k$ states change, we start approximate profiling.
The share of associative memory allocated to each state $S_i$ is
$$P(S_i) = \frac{V_t(S_i)}{\sum_{j=1}^{M} V_t(S_j)}$$
For example, a state with $V_t(S_a) = 0.4$ gets twice as many rows as another image state with $V_t(S_b) = 0.2$.
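A proportional split like this has to be rounded to whole TCAM rows. The sketch below is one way to do that; the function name `allocate_rows`, the total of 60 rows, and the largest-remainder rounding are assumptions for illustration, since the slide gives only the proportion.

```python
def allocate_rows(values, total_rows):
    """Split total_rows among states in proportion to V_t(S_i)."""
    total = sum(values)
    if total <= 0:
        return [0] * len(values)
    exact = [total_rows * v / total for v in values]  # fractional row counts
    rows = [int(x) for x in exact]                    # round down first
    leftover = max(0, total_rows - sum(rows))
    # Hand the remaining rows to the states with the largest fractional part.
    by_remainder = sorted(
        range(len(values)), key=lambda i: exact[i] - rows[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        rows[i] += 1
    return rows


rows = allocate_rows([0.4, 0.2], total_rows=60)
```

With the slide's values 0.4 and 0.2, the first state receives twice as many rows as the second, and the allocation always sums to the available row count.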

Approximate Profiler
The profiler identifies frequent operand patterns approximately, minimizing the profiling overhead at the expense of accuracy.
Concurrent state machine: exploits a Bloom filter, implemented with $k$ hash functions, to generate an input signature (an $m$-bit vector).
False positive error: different operands may have the same signature value, at a rate of
$$f = \left(1 - e^{-nk/m}\right)^{k}$$
Parameters:
  $n$: degree of membership (no control)
  $m$: vector size (larger $m$ gives higher accuracy at higher energy)
  $k$: number of hash functions
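The false positive formula is easy to evaluate numerically, which makes the effect of the vector size visible. In the sketch below, the function name and the parameter values (`n=10`, `k=4`, and the two sizes for `m`) are illustrative assumptions, not settings from the presentation.

```python
import math


def false_positive_rate(n, m, k):
    """Evaluate f = (1 - e^(-n*k/m))^k for a Bloom filter.

    n: number of members inserted, m: vector size in bits, k: hash functions.
    """
    if n < 0 or m <= 0 or k <= 0:
        raise ValueError("n must be >= 0, and m and k must be positive")
    return (1.0 - math.exp(-n * k / m)) ** k


rate_128 = false_positive_rate(n=10, m=128, k=4)  # roughly 0.5%
rate_256 = false_positive_rate(n=10, m=256, k=4)  # smaller: the vector is longer
```

Doubling `m` at fixed `n` and `k` lowers the rate, which is the accuracy side of the trade-off; the energy side comes from storing and comparing the longer vector.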

Approximate Profiler
By generating 60K different signatures of 128-bit length, we can profile the frequent operand patterns using 5x less memory than exact profiling.
Acceptable error rate of 5.3%.
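Signature-based profiling can be sketched as follows. This is not the authors' concurrent state machine: it is a minimal software illustration in which each operand tuple is hashed to a 128-bit Bloom-style signature and frequencies are counted per signature instead of per operand. The class and function names, the choice of SHA-256 as the hash family, and `k=4` are assumptions.

```python
import hashlib
from collections import Counter


def signature(operands, m=128, k=4):
    """Map an operand tuple to an m-bit vector with up to k bits set."""
    data = repr(operands).encode()
    sig = 0
    for i in range(k):
        # Prefixing the index gives k different hash functions from one primitive.
        digest = hashlib.sha256(bytes([i]) + data).digest()
        bit = int.from_bytes(digest[:8], "big") % m
        sig |= 1 << bit
    return sig


class ApproximateProfiler:
    """Counts how often each signature occurs; colliding operands are merged."""

    def __init__(self, m=128, k=4):
        self.m = m
        self.k = k
        self.counts = Counter()

    def observe(self, operands):
        self.counts[signature(operands, self.m, self.k)] += 1

    def most_frequent(self, n):
        return self.counts.most_common(n)


profiler = ApproximateProfiler()
for operands in [(1.0, 2.0), (3.0, 4.0), (1.0, 2.0), (1.0, 2.0)]:
    profiler.observe(operands)
top_signature, top_count = profiler.most_frequent(1)[0]
```

The approximation is that two different operand tuples can map to the same signature and be counted together, which is the false positive error described on the previous slide. In exchange, the profiler stores fixed-width signatures rather than full operand values.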

Experimental Setup
AMD GPUs
  Multi2Sim for the AMD Southern Islands GPU, Radeon HD 7970 device
  OpenCL applications: Sobel, Robert, Sharpen, Shift
FPU ASIC flow
  Balanced FPUs generated by FloPoCo
  Synthesized and mapped in TSMC 45-nm technology
  Optimized for power and a clock period based on TCAM delay: Synopsys Design Compiler
  FPU power estimation: Synopsys PrimeTime (1.0 V)
ACAM design flow
  Transistor-level HSPICE simulations for power and delay using a 5T-4MTJ TCAM cell [Hanyu, DATE’15]

Results: GPU Energy Consumption
Trade-off between FPU and TCAM energy consumption: as the hit rate rises, FPU energy falls while TCAM energy rises.
Minimum point: the best trade-off between FPU and TCAM energy consumption.
[Figure: energy saving compared to the GPU alone.]

Hit Rate Comparison: Offline vs. Online
Offline profiling: samples data based on offline observation, so it cannot identify the proper images to profile, which leads to a low hit rate.
Online profiling: adaptively updates the pre-stored TCAM values over time by considering data locality.
Profiling repeats frequently until the selected images are good enough to represent the dataset.

Hit Rate & Energy Saving Comparison
The learning and profiling energy is fixed (for a fixed dataset).
The minimum energy occurs at a small TCAM size, so online learning is important!
ACAM saves 2.9x more energy on average than offline profiling.
Average energy saving over 4 applications:
  GPU + online ACAM: 35%
  GPU + offline ACAM: 12%

Energy in Different Dataset Sizes
Small dataset:
  The hit rates of the online and offline techniques differ little.
  The offline technique comes out ahead in energy saving because it spends no energy on profiling.
Large dataset:
  Offline profiling cannot find high-frequency patterns, so the ACAM hit rate is low.
  Online learning adaptively updates the TCAM based on temporal locality.
  Online profiling: 3.3x higher energy saving over 4 applications.

ACAM Robustness to Dataset
Metric: GPGPU energy difference between the random and locality dataset cases.
Online learning makes better image-selection decisions based on locality.
Online learning still outperforms offline profiling by 2.1x.
The energy difference increases with dataset size:
Dataset size: 100, 500, 1000, 2000, 3000, 4000
Robert: 0.04, 0.07, 0.09, 0.11, 0.13
Sobel: 0.02, 0.06, 0.15, 0.16, 0.18
Sharpen: 0.05, 0.08, 0.14, 0.17, 0.19, 0.21
Shift: 0.12

Quality Comparison
[Figure: output quality comparison for the Robert application.]

Conclusion
Associative memory exploits data locality to reduce redundant computations.
For IoT workloads with large datasets, online profiling is essential.
Our framework learns the workload and dynamically updates the ACAM values based on recently represented data.
35% GPGPU energy savings on average.
3.3x lower search energy consumption vs. state-of-the-art techniques.