APPROX-NoC: A Data Approximation Framework for Network-On-Chip Architectures Rahul Boyapati, Jiayi Huang, Pritam Majumder, Ki Hwan Yum, Eun Jung Kim.


Motivation
Perfect accuracy is not required: computer vision, machine learning, graph processing.
Large amount of data movement across the NoC: video frames, neuron weights, graph weights.
Leveraging inaccuracy to provide high throughput NoC.

Hardware Approximation
Compute approximation:
Variable voltage based ALUs [Esmaeilzadeh et al. ASPLOS’12]
Analog based circuit designs [St. Amant et al. ISCA’14]
Neural network acceleration [Esmaeilzadeh et al. MICRO’12, Moreau et al. HPCA’15]
Storage approximation:
Approximate main memory [Sampson et al. MICRO’13, Liu et al. ASPLOS’11]
Approximate cache [San Miguel et al. MICRO’15, MICRO’16]
No previous research on approximation in NoCs.

Approximation in NoCs
Why do we need approximation in NoCs?
Higher throughput.
Mitigate the memory bandwidth bottleneck.
Approximation increases data similarity to improve the compression rate.
Leveraging the inaccuracy tolerance of applications to improve effective bandwidth.

APPROX-NoC

Main Idea
Figure (source to destination): cache block -> VAXX -> approximated block (0xA 0xB 0xE 0xD) -> Compr -> network representation (e0+0xA, e1, e2, e0+0xD) -> Decompr -> decompressed block (0xA 0xB 0xE 0xD).
Encoding table in the figure: e0 = uncompressed, e1 = 0xB, e2 = 0xE.
Figure labels: Precise Encoding/Decoding; Approximate Encoding.

Challenges
Value approximation and compression are not cheap: latency overhead (on the critical path) and hardware cost.
Quality control is important: error calculation for every word, with power and latency overhead for the error computation.
Should be a light-weight design.

APPROX-NoC Architecture Overview
Figure: tiles attach to routers through network interfaces (NI). Each NI has an injection queue and an ejection queue. On the injection path, data from the processor or MC goes through an "Approx?" check, VAXX and the compressor (Compr). On the ejection path, data goes through the decompressor (Decompr) to the processor or MC.
Approximate to similar data to improve compression rate.

APPROX-NoC Operation Flow Chart
Figure (flow chart): Cache block -> Approximable? If N, bypass to the compressor. If Y, enter the Approximate Logic: Int or float? A float goes through mantissa extraction first, an int goes directly to the Approximate Value Compute Logic (AVCL). The result is passed to the compressor.
Data type aware approximation.
Bypass the approximation logic to reduce overhead on the critical path.
Seamlessly integrated with the compression unit in a plug-and-play manner.

Integer Approximation Datapath
Simple for integer: the complete word is passed for approximation.
Abstraction:
1. Calculate the error budget based on the threshold.
2. Detect the number of bits for the error budget, e.g. n bits.
3. Approximate the least significant (n-1) bits as don't care bits for compression-friendly data patterns.
Figure: integer (bit 31 down to bit 0) -> Approximate Logic -> approximated integer.
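The three integer steps can be sketched in Python as follows. This is a minimal illustration, not the hardware datapath from the slides: the function name `approximate_int`, the `(base, mask)` return convention and the restriction to unsigned 32-bit words are assumptions made for the example.

```python
def approximate_int(word: int, e: int) -> tuple[int, int]:
    """Return (base, dont_care_mask) for an unsigned 32-bit word.

    e is the error threshold in percent (0-100). Any value formed by
    filling the masked bits of base stays within the error budget.
    """
    if not 0 <= word < 2**32:
        raise ValueError("word must fit in 32 unsigned bits (assumption)")
    # Step 1: error budget derived from the threshold.
    budget = word * e // 100
    # Step 2: number of bits n needed to represent the budget.
    n = budget.bit_length()
    # Step 3: the low (n-1) bits become don't care bits. Because
    # budget >= 2**(n-1), changing them moves the value by less than budget.
    k = max(n - 1, 0)
    mask = (1 << k) - 1
    base = word & ~mask
    return base, mask


base, mask = approximate_int(1000, 10)
print(base, bin(mask))  # 960 0b111111
```

With a 10% threshold on 1000, the budget is 100, which needs 7 bits, so the low 6 bits are free; every value from 960 to 1023 is within 100 of the original.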

Floating-Point Approximation Datapath
Representation (IEEE 754): $(-1)^{\text{sign}} \times (1 + .\text{mantissa}) \times 2^{(\text{exponent} - \text{bias})}$
Bit layout: sign s (bit 31), exponent (bits 30-23), mantissa (bits 22-0).
No floating-point operation is needed for FP approximation.
Abstraction:
1. Extract the mantissa bits and normalize them as an integer: 0 ... 0 | 1 | mantissa, with the leading 1 at bit 23 and the mantissa in bits 22-0.
2. Approximate like an integer: 0 ... 0 | 1 | approx mantissa.
3. Concatenate the exponent (and sign) to recover the approximate float value: s | exponent | approx mantissa.
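A minimal Python sketch of the floating-point path, using only integer bit operations on a 32-bit IEEE 754 value. The function name `approximate_float`, the choice to clear the don't care bits to zero, and the pass-through of zero, subnormal, infinity and NaN inputs are assumptions for illustration; the slides do not specify them.

```python
import struct


def approximate_float(x: float, e: int) -> float:
    """Approximate a float32 value using integer-only bit operations.

    e is the error threshold in percent (0-100). Don't care mantissa
    bits are cleared to zero, which is one representative choice.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent in (0, 0xFF):
        # Zero, subnormal, inf and NaN are passed through (assumption).
        return struct.unpack("<f", struct.pack("<I", bits))[0]
    # Step 1: mantissa with the implicit leading 1, as a 24-bit integer.
    m = (1 << 23) | mantissa
    # Step 2: approximate like an integer.
    budget = m * e // 100
    k = max(budget.bit_length() - 1, 0)
    m_approx = m & ~((1 << k) - 1)
    # Step 3: reattach sign and exponent; the implicit 1 is dropped again.
    out = (sign << 31) | (exponent << 23) | (m_approx & 0x7FFFFF)
    return struct.unpack("<f", struct.pack("<I", out))[0]


print(approximate_float(3.14159, 10))  # 3.0
```

Because the value is the 24-bit mantissa integer scaled by a power of two, a relative error bound on the mantissa integer is the same relative bound on the float, which is why the exponent can be left untouched.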

Approximate Value Compute Logic (AVCL)
Unified logic for both integer and floating point.
Fast error budget compute, where e is the error threshold (0-100):
error_budget = given_value × (e/100) = given_value / (100/e)
100/e is predefined (e.g. 100/25 = 4 = B’100), so only bit shifting is needed.
Figure: AVCL datapath for one 32-bit data word, with int/float? multiplexers selecting between the whole word and the 24-bit normalized mantissa, the Approximate Logic block, a Float Exponent Detection block, and a final approx? multiplexer.
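The shift-only budget computation can be illustrated with a short Python sketch. It assumes the predefined divisor 100/e is a power of two, as in the slide's example (100/25 = 4); how other thresholds are handled is not stated in the slides, and the function name `error_budget` is hypothetical.

```python
def error_budget(value: int, divisor: int) -> int:
    """Compute value / (100/e) by shifting, where divisor = 100/e.

    Assumes divisor is a power of two (e.g. 4 for e = 25), so the
    division reduces to a right shift.
    """
    if divisor <= 0 or divisor & (divisor - 1):
        raise ValueError("divisor must be a power of two (assumption)")
    shift = divisor.bit_length() - 1
    return value >> shift


print(error_budget(1000, 4))  # 250
```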

APPROX-NoC Implementation Cases
Plug the VAXX approximation engine into compression units:
Frequent pattern compression (FP-COMP) [Das et al. HPCA’08]
Dictionary-based compression (DI-COMP) [Jin et al. MICRO’08]
Frequent Pattern Based VAXX (FP-VAXX): approximate the value, then compress the approximated pattern.
Dictionary-Based VAXX (DI-VAXX): use a TCAM to store approximated tracked patterns; approximation is off the critical path.

Frequent Pattern VAXX (FP-VAXX)
Figure: given word and error threshold -> Approximate Value Compute Logic (AVCL) -> approximate pattern -> Frequent Pattern Compressor -> encoded pattern.
First approximate the value with the AVCL.
Then compress the approximate pattern using frequent pattern compression.
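As a rough illustration of why approximating first helps a frequent pattern compressor, the Python sketch below checks a single pattern: an upper halfword followed by a zero lower halfword. That pattern, the function name `fp_vaxx_halfword`, and the policy of sending the original word on a miss are assumptions chosen for the example; the slides do not list the pattern table used by FP-COMP.

```python
def fp_vaxx_halfword(word: int, e: int) -> tuple[bool, int]:
    """Check whether approximation lets a word match one frequent pattern.

    Returns (matched, value_to_send). e is the error threshold in percent.
    """
    # AVCL step: derive the don't care bits from the error budget.
    budget = word * e // 100
    k = max(budget.bit_length() - 1, 0)
    mask = (1 << k) - 1
    base = word & ~mask
    # The pattern needs the low 16 bits to be zero. Don't care bits can be
    # set to zero freely, so only the remaining low bits must already be zero.
    if base & 0xFFFF == 0:
        return True, base
    return False, word


print(fp_vaxx_halfword(0x10004, 10))  # (True, 65536)
print(fp_vaxx_halfword(0x10004, 0))   # (False, 65540)
```

With a zero threshold the word 0x10004 misses the pattern because of its low bits; with a 10% threshold those bits fall inside the don't care range and the word is sent as 0x10000.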

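Dictionary-Based VAXX (DI-VAXX)
Figure: the given word is looked up in the dictionary (Lookup, Match?). On fill and update, the Approximate Value Compute Logic (AVCL) uses the error threshold to produce the approximate patterns that are stored.
Dictionary contents in the figure (approximate pattern -> encoded index): 010X -> e0, 10XX -> e1. The figure also shows the words 1001 and 1010, both of which match 10XX; the lookup result is encoded index e1.
Use a TCAM to store approximated patterns.
Precompute approximate patterns while updating and filling the dictionary.
Approximation is off the critical path.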
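The ternary lookup can be mimicked in software with a value and a care mask per entry. The Python sketch below is a stand-in for the TCAM, not the hardware design: the class name `TernaryDictionary`, its methods, and the first-match policy are hypothetical, and a real TCAM compares all entries in parallel rather than in a loop.

```python
from typing import Optional


class TernaryDictionary:
    """Software stand-in for a TCAM holding approximate patterns."""

    def __init__(self) -> None:
        # Each entry is (value, care_mask, encoded_index).
        self.entries: list[tuple[int, int, str]] = []

    def fill(self, pattern: str, index: str) -> None:
        """Store a pattern such as '10XX', where X marks a don't care bit."""
        value = int(pattern.replace("X", "0"), 2)
        care = int("".join("0" if c == "X" else "1" for c in pattern), 2)
        self.entries.append((value, care, index))

    def lookup(self, word: int) -> Optional[str]:
        """Return the index of the first entry matching on all cared-for bits."""
        for value, care, index in self.entries:
            if (word & care) == value:
                return index
        return None


d = TernaryDictionary()
d.fill("010X", "e0")
d.fill("10XX", "e1")
print(d.lookup(0b1001))  # e1
```

Since the don't care bits are computed when an entry is filled or updated, the lookup itself involves no error computation, which matches the slide's point that approximation stays off the critical path.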

Evaluation

Methodology
Workloads: Parsec 3.0; SSCA2 graph application; synthetic workload from benchmark traces.
Architecture: 32 out-of-order cores at 2 GHz; 32 KB L1I$ and 64 KB L1D$, 2-way; 2 MB L2 bank and 16 directories.
NoC: 4x4 2D concentrated mesh; 2 GHz, 3-stage router; 4 virtual channels, 4-flit buffer; 64-bit flit, X-Y routing.
Tools: Gem5 for full system performance; Pin-based simulator for application output error; in-house NoC simulator for synthetic study.

Packet Latency and Data Quality
Synthetic study: benchmark trace permutations, 75% approximable data packets and 10% error threshold.
DI-VAXX reduces latency by 11% and 40% compared to DI-COMP and Baseline.
FP-VAXX reduces latency by 21% and 46% over FP-COMP and Baseline.
For the data-intensive benchmark SSCA2, DI-VAXX outperforms DI-COMP by 22% and FP-VAXX outperforms FP-COMP by 36%.
Data value quality is higher than 97% (< 3% error).

Compression Ratio
Synthetic study: benchmark trace permutations, 75% approximable data packets and 10% error threshold.
Approximation can improve the compression ratio by up to 41%.
DI-VAXX and FP-VAXX improve the compression ratio by 10% and 30% in geomean.
A higher compression ratio reduces the number of flits, and thus reduces queuing and contention.

Throughput - Uniform Random
Synthetic study: Streamcluster trace permutations, 1:3 data to control packet ratio, 75% approximable data packets and 10% error threshold.
VAXX improves the throughput by up to 40%.

Throughput - Transpose
Synthetic study: Streamcluster trace permutations, 1:3 data to control packet ratio, 75% approximable data packets and 10% error threshold.
VAXX improves the throughput by up to 69%.

Application Error and Full System Performance
Application errors are less than 5% except for streamcluster and swaptions.
Performance is improved by up to 10% and 14% in swaptions and SSCA2.

Power Consumption and Area Overhead
Approximation power consumption is compensated by flit reduction.

Schemes                  DI-VAXX       FP-VAXX
Area Overhead (45 nm)    0.0037 mm²    0.0029 mm²

Conclusions
NoC data approximation framework for leveraging inaccuracy to provide high throughput.
Light-weight Approximate Compute to support both integer and floating-point.
Low-cost microarchitecture implementations of VAXX.
APPROX-NoC achieves up to 21% average packet latency reduction and 69% throughput improvement.

Thank You & Questions Jiayi Huang jyhuang@cse.tamu.edu