APPROX-NoC: A Data Approximation Framework for Network-On-Chip Architectures Rahul Boyapati, Jiayi Huang, Pritam Majumder, Ki Hwan Yum, Eun Jung Kim
Motivation: Perfect accuracy is not required in computer vision, machine learning, or graph processing. These applications move large amounts of data across the NoC: video frames, neuron weights, and graph weights. Goal: leverage inaccuracy to provide a high-throughput NoC.
Hardware Approximation: Compute approximation includes variable-voltage ALUs [Esmaeilzadeh et al. ASPLOS’12], analog circuit designs [St. Amant et al. ISCA’14], and neural network acceleration [Esmaeilzadeh et al. MICRO’12, Moreau et al. HPCA’15]. Storage approximation includes approximate main memory [Sampson et al. MICRO’13, Liu et al. ASPLOS’11] and approximate caches [San Miguel et al. MICRO’15, MICRO’16]. There is no previous research on approximation in NoCs.
Approximation in NoCs: Why do we need approximation in NoCs? For higher throughput and to mitigate the memory bandwidth bottleneck. Approximation increases data similarity, which improves the compression rate. Leveraging the inaccuracy tolerance of applications improves effective bandwidth.
APPROX-NoC
Main Idea: [Figure: at the source, VAXX approximates a cache block into the approximated block 0xA 0xB 0xE 0xD, which the compressor (Compr) encodes for the network. Encoding table: e0 uncompressed, e1 0xB, e2 0xE. Network representation: e0+0xA e1 e2 e0+0xD. At the destination, the decompressor (Decompr) recovers the decompressed block 0xA 0xB 0xE 0xD. Legend: precise encoding/decoding, approximate encoding.]
Challenges: Value approximation and compression are not cheap: they add latency overhead on the critical path and hardware cost. Quality control is important, but calculating the error for every word adds power and latency overhead. The design should therefore be light-weight.
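APPROX-NoC Architecture Overview: [Figure: tiles connect to routers through network interfaces (NIs). Inside an NI, the inject path (from the processor or MC, through the injection queue) includes an "Approx?" check, VAXX, and the compressor (Compr); the eject path (to the processor or MC, through the ejection queue) includes the decompressor (Decompr).] VAXX approximates data to similar values to improve the compression rate.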
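APPROX-NoC Operation Flow Chart: [Figure: a cache block is first checked for approximability. If it is not approximable (N), it goes straight to the compressor. If it is approximable (Y), the approximate logic checks whether the data is int or float: floats go through mantissa extraction first, ints go directly to the Approximate Value Compute Logic (AVCL), and the result is passed to the compressor.] The approximation is data-type aware. Non-approximable data bypasses the approximation logic to reduce overhead on the critical path. The logic integrates seamlessly with the compression unit in a plug-and-play manner.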
Integer Approximation Datapath: Simple for integers: the complete word is passed for approximation. Abstraction: (1) calculate the error budget based on the threshold; (2) detect the number of bits of the error budget, e.g. n bits; (3) approximate the least significant (n-1) bits as don't-care bits to form compression-friendly data patterns. [Figure: a 32-bit integer (bits 31 to 0) enters the approximate logic and leaves as a 32-bit approximated integer.]
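The three steps above can be sketched in software as follows. This is a minimal illustration, not the hardware datapath: the function names are hypothetical, the word is treated as an unsigned 32-bit value, and clearing the don't-care bits to zero is an assumed policy (the slide only says the bits are approximated toward compression-friendly patterns).

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


def error_budget(value: int, threshold_pct: int) -> int:
    """Step 1: largest absolute error allowed for this word."""
    return value * threshold_pct // 100


def dont_care_bit_count(value: int, threshold_pct: int) -> int:
    """Step 2: a budget that is n bits wide frees the low n-1 bits."""
    n = error_budget(value, threshold_pct).bit_length()
    return max(n - 1, 0)


def approximate_int(value: int, threshold_pct: int) -> int:
    """Step 3: clear the don't-care bits (assumed policy: set them to zero)."""
    value &= WORD_MASK
    k = dont_care_bit_count(value, threshold_pct)
    return value & ~((1 << k) - 1) & WORD_MASK


if __name__ == "__main__":
    word = 1000
    approx = approximate_int(word, 10)
    print(word, "->", approx, "error", word - approx)
```

With a 10% threshold, the word 1000 has a budget of 100, which is 7 bits wide, so the low 6 bits are cleared and the result is 960. Because the budget is at least 2^(n-1), changing only the low n-1 bits can never exceed it.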
Floating-Point Approximation Datapath: Representation (IEEE 754): $(-1)^{\text{sign}} \times (1 + .\text{mantissa}) \times 2^{\text{exponent} - \text{bias}}$, with the sign in bit 31, the exponent in bits 30 to 23, and the mantissa in bits 22 to 0. No floating-point operation is needed for FP approximation. Abstraction: (1) extract the mantissa bits and normalize them as an integer (zeros in the upper bits, a leading 1 in bit 23, and the mantissa in bits 22 to 0); (2) approximate like an integer; (3) concatenate the sign and exponent with the approximate mantissa to recover the approximate float value.
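A software sketch of the float path, under stated assumptions: single-precision values are reinterpreted as 32-bit words with Python's `struct` module, the helper names are hypothetical, don't-care bits are cleared to zero (an assumed policy), and zeros, subnormals, infinities, and NaNs are passed through unchanged because the slide does not say how they are handled.

```python
import struct


def float_to_bits(x: float) -> int:
    """Reinterpret a value as its IEEE 754 single-precision bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def approximate_float(x: float, threshold_pct: int) -> float:
    bits = float_to_bits(x)
    exponent = (bits >> 23) & 0xFF
    if exponent in (0, 0xFF):
        return bits_to_float(bits)  # special values: left untouched (assumption)
    sign_exp = bits & 0xFF800000
    # Step 1: normalize the mantissa as an integer with the hidden leading 1.
    significand = (1 << 23) | (bits & 0x007FFFFF)
    # Step 2: approximate it exactly like an integer word.
    budget = significand * threshold_pct // 100
    k = max(budget.bit_length() - 1, 0)
    approx = significand & ~((1 << k) - 1)
    # Step 3: drop the hidden 1 and concatenate sign and exponent again.
    return bits_to_float(sign_exp | (approx & 0x007FFFFF))


if __name__ == "__main__":
    print(approximate_float(3.14159, 10))
```

For thresholds up to 100 the hidden leading 1 is never cleared, so the exponent stays valid and the relative error of the float equals the relative error of the integer significand. With a 10% threshold, 3.14159 becomes 3.0.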
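Approximate Value Compute Logic (AVCL): Unified logic for both integer and floating point, operating on one 32-bit data word. Fast error budget computation, where e is the error threshold (0-100): error_budget = given_value × (e/100) = given_value / (100/e). The divisor 100/e is predefined (e.g. 100/25 = 4 = binary 100), so the computation needs only bit shifts. [Figure: AVCL datapath. Multiplexers controlled by "int/float?" select between the full 32-bit word and the normalized mantissa (0 … 0 1 mantissa) as input to the approximate logic, a float exponent detection block handles the upper bits of floats, and a final multiplexer controlled by "approx?" selects the output word.]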
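The shift-only budget computation can be illustrated as below. This sketch covers only thresholds for which 100/e is an exact power of two (e = 25, 50, 100), matching the slide's 100/25 = 4 example; how other thresholds are mapped to shifts is not stated in the slides, so the sketch rejects them. Function names are hypothetical.

```python
def shift_amount(threshold_pct: int) -> int:
    """Return s such that value / (100/e) equals value >> s."""
    if threshold_pct <= 0 or 100 % threshold_pct != 0:
        raise ValueError("threshold must divide 100 in this sketch")
    divisor = 100 // threshold_pct          # the predefined constant 100/e
    if divisor & (divisor - 1):
        raise ValueError("100/e must be a power of two in this sketch")
    return divisor.bit_length() - 1


def error_budget_by_shift(value: int, threshold_pct: int) -> int:
    """Error budget computed with a single right shift, no multiply or divide."""
    return value >> shift_amount(threshold_pct)


if __name__ == "__main__":
    # 100/25 = 4 = 0b100, so the budget is the value shifted right by 2.
    print(error_budget_by_shift(1000, 25))
```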
APPROX-NoC Implementation Cases: Plug the VAXX approximation engine into compression units: frequent pattern compression (FP-COMP) [Das et al. HPCA’08] and dictionary-based compression (DI-COMP) [Jin et al. MICRO’08]. Frequent Pattern Based VAXX (FP-VAXX): approximate the value, then compress the approximated pattern. Dictionary-Based VAXX (DI-VAXX): use a TCAM to store the tracked approximated patterns, which keeps approximation off the critical path.
Frequent Pattern VAXX (FP-VAXX): [Figure: the given word and the error threshold enter the Approximate Value Compute Logic (AVCL), which outputs an approximate pattern; the frequent pattern compressor turns it into an encoded pattern.] First approximate the value with AVCL, then compress the approximate pattern using frequent pattern compression.
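The two-stage flow (approximate, then pattern-compress) can be sketched as follows. The pattern table here is a simplified, hypothetical one in the spirit of frequent pattern compression with a 3-bit prefix; the exact table used by FP-COMP is not given in the slides. The approximation step clears don't-care bits to zero, which is also an assumption.

```python
WORD_MASK = 0xFFFFFFFF
PREFIX_BITS = 3  # assumed prefix width that identifies the matched pattern


def approximate(word: int, threshold_pct: int) -> int:
    """AVCL-style step: clear the low n-1 bits of an n-bit error budget."""
    budget = word * threshold_pct // 100
    k = max(budget.bit_length() - 1, 0)
    return word & ~((1 << k) - 1) & WORD_MASK


def sign_extends_from(word: int, bits: int) -> bool:
    """True if the 32-bit word is the sign extension of its low `bits` bits."""
    upper = word >> (bits - 1)  # the sign bit and everything above it
    return upper == 0 or upper == (1 << (32 - bits + 1)) - 1


def classify(word: int):
    """Return (pattern name, data bits) for the first matching pattern."""
    if word == 0:
        return ("zero", 0)
    for bits in (4, 8, 16):
        if sign_extends_from(word, bits):
            return (f"sign_extended_{bits}", bits)
    if word & 0xFFFF == 0:
        return ("halfword_zero_padded", 16)
    if word == (word & 0xFF) * 0x01010101:
        return ("repeated_bytes", 8)
    return ("uncompressed", 32)


def encoded_bits(word: int) -> int:
    return PREFIX_BITS + classify(word)[1]


if __name__ == "__main__":
    precise = 0x12340123
    approx = approximate(precise, 10)
    print(hex(precise), classify(precise), encoded_bits(precise))
    print(hex(approx), classify(approx), encoded_bits(approx))
```

In this example the precise word 0x12340123 matches no pattern and costs 35 bits, while its 10% approximation 0x12000000 matches the zero-padded halfword pattern and costs 19 bits, which is the kind of gain the approximate-then-compress ordering is meant to expose.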
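Dictionary-Based VAXX (DI-VAXX): [Figure: on dictionary fill and update, the Approximate Value Compute Logic (AVCL) uses the error threshold to produce approximate patterns with don't-care bits (X). Example table of approximate patterns and encoded indices: 010X maps to e0 and 10XX maps to e1, with the word 1001 shown next to 10XX. A lookup of the given word 1010 matches 10XX and is sent as encoded index e1.] Use a TCAM to store approximated patterns. Precompute approximate patterns while updating and filling the dictionary, which keeps approximation off the critical path.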
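A small software model of the ternary lookup, reproducing the 4-bit example in the figure. It is a sketch under assumptions: the class and method names are hypothetical, a sequential scan stands in for the parallel TCAM match, replacement policy is omitted, and a 50% threshold on 4-bit words is chosen only because it yields the patterns 010X and 10XX shown on the slide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TernaryEntry:
    value: int      # stored bits, with don't-care positions cleared
    care_mask: int  # 1 = bit must match, 0 = don't care (X)


class ApproxDictionary:
    def __init__(self, threshold_pct: int, width: int = 32):
        self.threshold_pct = threshold_pct
        self.width = width
        self.entries: List[TernaryEntry] = []

    def fill(self, word: int) -> int:
        """Precompute the approximate pattern when a word enters the dictionary."""
        budget = word * self.threshold_pct // 100
        k = max(budget.bit_length() - 1, 0)  # low k bits become don't-care
        care = ((1 << self.width) - 1) & ~((1 << k) - 1)
        self.entries.append(TernaryEntry(word & care, care))
        return len(self.entries) - 1

    def lookup(self, word: int) -> Optional[int]:
        """Return the encoded index of the first matching pattern, if any."""
        for index, entry in enumerate(self.entries):
            if word & entry.care_mask == entry.value:
                return index
        return None

    def pattern(self, index: int) -> str:
        """Render an entry as a string such as '10XX'."""
        e = self.entries[index]
        bits = range(self.width - 1, -1, -1)
        return "".join(
            str((e.value >> b) & 1) if (e.care_mask >> b) & 1 else "X" for b in bits
        )


if __name__ == "__main__":
    d = ApproxDictionary(threshold_pct=50, width=4)
    d.fill(0b0101)
    d.fill(0b1001)
    print(d.pattern(0), d.pattern(1), d.lookup(0b1010))
```

Because the don't-care mask is computed at fill time, the lookup on the injection path is a plain masked compare with no error arithmetic, which is the sense in which approximation stays off the critical path.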
Evaluation
Methodology: Workloads (Parsec 3.0, SSCA2 graph application, synthetic workloads from benchmark traces). Architecture (32 out-of-order cores at 2 GHz; 32 KB L1I$ and 64 KB L1D$, 2-way; 2 MB L2 bank and 16 directories). NoC (4x4 2D concentrated mesh; 2 GHz, 3-stage router; 4 virtual channels, 4-flit buffer; 64-bit flit, X-Y routing). Tools (Gem5 for full-system performance; Pin-based simulator for application output error; in-house NoC simulator for the synthetic study).
Packet Latency and Data Quality: Synthetic study using permutations of benchmark traces, with 75% approximable data packets and a 10% error threshold. DI-VAXX reduces latency by 11% and 40% compared to DI-COMP and Baseline, respectively. FP-VAXX reduces latency by 21% and 46% over FP-COMP and Baseline, respectively. For the data-intensive benchmark SSCA2, DI-VAXX outperforms DI-COMP by 22% and FP-VAXX outperforms FP-COMP by 36%. Data value quality is higher than 97% (< 3% error).
Compression Ratio: Synthetic study using permutations of benchmark traces, with 75% approximable data packets and a 10% error threshold. Approximation can improve the compression ratio by up to 41%. DI-VAXX and FP-VAXX improve the compression ratio by 10% and 30%, respectively, in geometric mean. A higher compression ratio reduces the number of flits and thus reduces queuing and contention.
Throughput - Uniform Random: Synthetic study using permutations of Streamcluster traces, with a 1:3 data to control packet ratio, 75% approximable data packets, and a 10% error threshold. VAXX improves the throughput by up to 40%.
Throughput - Transpose: Synthetic study using permutations of Streamcluster traces, with a 1:3 data to control packet ratio, 75% approximable data packets, and a 10% error threshold. VAXX improves the throughput by up to 69%.
Application Error and Full System Performance: Application errors are less than 5% except for streamcluster and swaptions. Performance is improved by up to 10% and 14% in swaptions and SSCA2, respectively.
Power Consumption and Area Overhead: The power consumed by approximation is compensated by flit reduction. Area overhead (45 nm): DI-VAXX 0.0037 mm², FP-VAXX 0.0029 mm².
Conclusions: A NoC data approximation framework that leverages inaccuracy to provide high throughput. Light-weight approximate compute logic that supports both integer and floating-point data. Low-cost microarchitecture implementations of VAXX. APPROX-NoC achieves up to 21% average packet latency reduction and 69% throughput improvement.
Thank You & Questions Jiayi Huang jyhuang@cse.tamu.edu