Address-Stride Assisted Approximate Load Value Prediction in GPUs

Address-Stride Assisted Approximate Load Value Prediction in GPUs Haonan Wang, Mohamed Ibrahim (College of William & Mary), Sparsh Mittal (IIT Hyderabad), Adwait Jog (College of William & Mary)

Motivating anecdote: Haonan, at William & Mary in Williamsburg, VA, has a DSN presentation in Portland, OR on June 26 and an ICS presentation in Phoenix, AZ (FCRC) on June 27. Traveling to both consumes time and energy, and he might be late. Adwait: "I'm correlated with Haonan. I can present for him." Haonan: "Wait.. this is exactly the case for Value Approximation. If a value is correlated with existing local values, it does not need to be fetched from the memory!"

Executive Summary
Problem: Data movement across different levels of the memory hierarchy causes high energy consumption and high memory bandwidth utilization, and existing value prediction solutions are not optimal.
Observation: Values can be predicted more accurately by exploiting the correlation between the address stride and the value stride, which exists in many data inputs processed by GPU applications.
Solution: ASAP, a novel value approximation technique for GPUs with low hardware overhead and better prediction accuracy.

Outline
Background & Motivation
Design of ASAP
Evaluation
Conclusion

GPU Architecture with a (Rollback-Free) Value Predictor
[Figure: a GPU with cores Core1..Core30, each pairing its L1 cache with a value predictor (VP); the cores connect through an interconnect to L2 slices and DRAM partitions. On a miss, the VP returns a predicted data value to the core and is trained/updated with the actual data value when it arrives from memory.]
Speaker notes: After the prediction, if the predicted value is not the correct value, the execution of the program must normally be rolled back to the state before the prediction. However, if certain quality loss can be tolerated, rollbacks are not needed.
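
The flow above can be summarized in a short sketch. This is our own illustration of rollback-free prediction on an L1 miss, not the paper's implementation; all names (ValuePredictor, on_l1_miss, safe_to_approximate) are hypothetical.

    // Minimal sketch of rollback-free load value prediction (illustrative).
    #include <cstdint>
    #include <optional>

    struct ValuePredictor {
        // Returns a predicted value if the predictor is confident enough.
        virtual std::optional<uint32_t> predict(uint64_t addr) = 0;
        // Trains on the actual value once the memory request completes.
        virtual void train(uint64_t addr, uint32_t value) = 0;
        virtual ~ValuePredictor() = default;
    };

    // On an L1 miss for a safe-to-approximate load, hand the core a
    // predicted value immediately and never roll back; the request still
    // goes to memory in the background so the predictor can be trained
    // via train() when the actual data returns.
    std::optional<uint32_t> on_l1_miss(ValuePredictor& vp, uint64_t addr,
                                       bool safe_to_approximate) {
        if (!safe_to_approximate) return std::nullopt;  // must wait for DRAM
        return vp.predict(addr);  // core proceeds with the approximate value
    }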

CPUs: Prior Value Prediction Works
Context-based predictors: use a context table (containing history PCs) to index into a prediction table.
Hash-function-based predictors: use instruction information (e.g., PCs) to index into a prediction table.
These are not optimized for GPUs: they use large per-thread prediction tables, which are expensive under heavy multi-threading; they do not directly consider the memory access order; and they do not handle memory divergence. Moreover, rollback is more expensive in GPUs.

GPUs: Prior Value Prediction Works Sub-predictor 1 Sub-predictor 2 Rollback-Free Value prediction (RFVP) .... word 0-15 word 16-31 Value Base0 Hash Fn PC Warp ID Value Stride1 Value Base16 Value Stride2 One Stride Predictor (OSP): Training: Stride = Difference between Values Prediction: Value = Base + Stride Limitations: Does not consider memory addresses and their order -- Limiting the predictability of values Using per Warp and PC information -- May require large number of entries Two Stride Predictor (TSP): In training, it confirms: Current Stride = Previous Stride Mention TSP. – add confidence to training Amir Yazdanbakhsh, et al. “RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads.” TACO’2016.

How can we improve existing GPU value predictors?

Observation: Address Stride & Value Stride Correlation
[Figure: an image traversed in row-major vs. column-major order. Nearby pixels are related (e.g., have linearly increasing grayscales), so consecutive requests have an address stride of 1 in row-major order and an address stride equal to the row size in column-major order.]
Tracking address strides can help achieve better value prediction.
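
A toy illustration of the correlation (our own example, not from the paper): when values ramp linearly with the address, the value stride is proportional to the address stride, so the address stride pins down the expected value stride. The constants K and C and the 4-pixel row width are arbitrary assumptions.

    // For a linear ramp val[a] = K*a + C (e.g., a smooth grayscale
    // gradient), value stride = K * address stride.
    #include <cstdio>

    int main() {
        const int K = 3, C = 7, N = 16;
        int val[N];
        for (int a = 0; a < N; ++a) val[a] = K * a + C;

        // Row-major traversal: address stride 1 -> value stride K.
        printf("addr stride 1 -> value stride %d\n", val[5] - val[4]);  // 3
        // Column-major traversal of a 4-wide image: address stride 4
        // (the row size) -> value stride 4*K.
        printf("addr stride 4 -> value stride %d\n", val[8] - val[4]);  // 12
        return 0;
    }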

Track Address Strides for Better Value Prediction (1)
Consider three addresses addr0, addr1, addr2 holding values 0, 2, 4 (a constant value stride of 2 between consecutive addresses).
Sequence1 (addr0 → addr1 → addr2): training on the first two requests yields value base 2 and value stride 2, so the prediction for addr2 is 2 + 2 = 4, which is correct.
Sequence2 (addr0 → addr2 → addr1): training on the first two requests yields value base 4 and value stride 4, so the prediction for addr1 is 4 + 4 = 8, which is wrong (the actual value is 2).
The same data requested in a different order trains a different value stride: an address-oblivious stride predictor is at the mercy of the request order.

Track Address Strides for Better Value Prediction (2)
[Figure: error indicator (lower is better) when the request order is enforced in three ways: only accepting constant address strides; only accepting address stride = 1 or address stride = row size; and no enforcement.]
Using the address stride improves the predictability of values. We prefer lower stride differences, since constant value strides indicate good predictability.

Outline
Background & Motivation
Design of ASAP
Evaluation
Conclusion

Challenges of ASAP
How do we efficiently track address strides?
How do we handle irregular memory access patterns?
How do we ensure the design of the value predictor is area-efficient?

Address Stride Assisted Approximate Load Value Predictor (ASAP)
[Figure: each ASAP entry holds an Address Base plus two address strides (Address Stride Short, Address Stride Long) and, for each of the two sub-predictors (word 0-15 and word 16-31), a value base (Value Base0, Value Base16) with its own short and long value strides. A small number of entries is kept, with LRU eviction. For an upcoming address, the stride from the Address Base is matched against the short or long address stride, and a MUX selects the corresponding value stride to add to the value base. Example with an entry holding address base 2, address strides 1 (short) and 2 (long), value bases 4 and 8, and value strides 2 (short) and 4 (long): upcoming address 3 matches the short stride and yields predicted values 6 and 10, while upcoming address 4 matches the long stride and yields predicted values 8 and 12.]
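
A minimal sketch of the ASAP prediction path implied by this structure; the field layout follows the figure, while the field widths and the no-match policy are our assumptions.

    // One ASAP-style predictor entry and its prediction path (sketch).
    #include <cstdint>
    #include <optional>

    struct AsapEntry {
        uint64_t addr_base;          // last observed address
        int64_t  addr_stride_short;  // e.g., +1 within a row
        int64_t  addr_stride_long;   // e.g., +row_size across rows
        uint32_t value_base;         // value base of one sub-predictor
        int32_t  value_stride_short; // value stride paired with short stride
        int32_t  value_stride_long;  // value stride paired with long stride
    };

    // Predict only when the upcoming address matches one of the two tracked
    // address strides; otherwise decline (trading coverage for accuracy).
    std::optional<uint32_t> asap_predict(const AsapEntry& e, uint64_t addr) {
        const int64_t stride = static_cast<int64_t>(addr - e.addr_base);
        if (stride == e.addr_stride_short)
            return e.value_base + e.value_stride_short;
        if (stride == e.addr_stride_long)
            return e.value_base + e.value_stride_long;
        return std::nullopt;  // no match -> no prediction
    }

With the figure's numbers for the first sub-predictor (entry {2, 1, 2, 4, 2, 4}), asap_predict returns 6 for upcoming address 3 and 8 for upcoming address 4, matching the example above.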

Walkthrough Example – Effectiveness of ASAP over Prior Work

Operation of RFVP (Prior GPU Value Predictor)
Address sequence: 0 → 1 → 2 → 4 → 3 → 5, with actual values 0, 2, 4, 8, 6, 10 (value = 2 × address).
Training on the first two requests (values 0 and 2) yields value base 2 and value stride 2. RFVP then predicts every remaining request, each time adding the stride to the running base:
addr 2: predicted 4, actual 4 (correct)
addr 4: predicted 6, actual 8 (wrong)
addr 3: predicted 8, actual 6 (wrong)
addr 5: predicted 10, actual 10 (correct)
Coverage = 4 / 6, Accurate Predictions = 2 / 4

Operation and Advantage of our new ASAP Predictor
Same address sequence: 0 → 1 → 2 → 4 → 3 → 5, with actual values 0, 2, 4, 8, 6, 10.
Training on the first two requests yields address base 1, address stride short 1, address stride long 2, value base 2, value stride short 2, and value stride long 4. ASAP predicts only when the upcoming address stride matches:
addr 2 (address stride 1, short match): predicted 2 + 2 = 4, actual 4 (correct)
addr 4 (address stride 2, long match): predicted 4 + 4 = 8, actual 8 (correct)
addr 3 (address stride −1): not applicable, no prediction; the fetched value 6 trains the value base
addr 5 (address stride 2, long match): predicted 6 + 4 = 10, actual 10 (correct)
ASAP: Coverage = 3 / 6, Accurate Predictions = 3 / 3
RFVP: Coverage = 4 / 6, Accurate Predictions = 2 / 4
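
This walkthrough can be checked mechanically. The sketch below is our own replay of the sequence against both predictors (with the trained strides hard-coded from the slides above); it reproduces the 4/6, 2/4 and 3/6, 3/3 counts.

    // Replay of the walkthrough example (illustrative check).
    #include <cstdio>

    int main() {
        const int addrs[] = {0, 1, 2, 4, 3, 5};
        auto val = [](int a) { return 2 * a; };  // actual values: 2 * address

        // RFVP-style one-stride predictor, trained on the first two requests.
        int base = val(addrs[1]), stride = val(addrs[1]) - val(addrs[0]);
        int cov = 0, acc = 0;
        for (int i = 2; i < 6; ++i) {
            int pred = base + stride;            // always predicts
            ++cov;
            if (pred == val(addrs[i])) ++acc;
            base = pred;                         // base tracks the prediction
        }
        printf("RFVP: coverage %d/6, accurate %d/%d\n", cov, acc, cov);

        // ASAP-style predictor: predict only on an address-stride match.
        int abase = addrs[1], vbase = val(addrs[1]);
        const int as_short = 1, as_long = 2, vs_short = 2, vs_long = 4;
        int cov2 = 0, acc2 = 0;
        for (int i = 2; i < 6; ++i) {
            int as = addrs[i] - abase;
            if (as == as_short || as == as_long) {
                int pred = vbase + (as == as_short ? vs_short : vs_long);
                ++cov2;
                if (pred == val(addrs[i])) ++acc2;
                vbase = pred;
            } else {
                vbase = val(addrs[i]);  // no prediction; train on actual value
            }
            abase = addrs[i];
        }
        printf("ASAP: coverage %d/6, accurate %d/%d\n", cov2, acc2, cov2);
        return 0;
    }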

Outline
Background & Motivation
Design of ASAP
Evaluation
Conclusion

Evaluation Methodology
Evaluated using GPGPU-Sim, a cycle-level GPU simulator.
Baseline configuration (more details in the paper): 30 SMs, 32 SIMT lanes, 32 threads/warp, 48 warps/SM; 16KB L1 (4-way, 128B cache blocks) plus 32KB shared memory per SM; 256KB L2 (8-way, 128B cache blocks) per memory partition; 6 GDDR5 memory partitions with 16 banks each; 1 crossbar per direction.
Workloads: 12 applications from the CUDA SDK and Polybench, divided into groups based on their characteristics.

Application Error
[Figure: application error at 10% and 20% prediction coverage for RFVP and ASAP variants.]
ASAP achieves better prediction accuracy with a low entry requirement, and it leads to larger improvements at higher prediction coverage: the error increase from 10% to 20% coverage is much higher for the RFVP predictors than for the ASAP predictors.

Image Quality at 10% Coverage
[Figure: output images.] No Prediction: 0% application error; RFVP-TSP-8: 40.1%; RFVP-TSP-Unlimited: 16.9%; ASAP-TSP-8: 13.6%.
Utilizing the address stride & value stride correlation for value approximation is effective in improving the output quality.

IPC & Energy
ASAP achieves similar performance and energy improvements with less application error, and more performance and energy improvements under the same application error budget.

Conclusions
Goal: design a low-overhead, more accurate value approximation technique for GPUs.
Contributions: we demonstrate the correlation between the address stride and the value stride, and we design ASAP, a novel value approximation technique for GPUs with a low capacity requirement that adapts to the complex access patterns in GPUs.
Across a variety of GPGPU applications, ASAP reduces application error by 84% to 95% in different scenarios.

Thank You! Questions? We acknowledge the support of the National Science Foundation (NSF) grants (#1657336, #1717532, and #1750667) and the Science & Engineering Research Board (SERB) award (#ECR/2017/000622).

Application Characterization

Case Study: Effectiveness of the Long Address Stride
Miss Match Rate: the maximum percentage of missed requests that can be matched with ASAP.
The long address stride effectively improves the Miss Match Rate.

Case Study: Effect of Number of Entries
Generally, 8 entries provide a sufficient Miss Match Rate for ASAP; the prediction table size requirement for ASAP is thus limited.

Case Study: Effect of Warp Scheduler
The Miss Match Rate increases with a more regular scheduling scheme.

CPUs: Prior Value Prediction Works
Context-based predictors: use a context table (containing history PCs) to index into a prediction table.
Hash-function-based predictors: use instruction information (e.g., PCs) to index into a prediction table.
These are not optimized for GPUs: they use large per-thread prediction tables (expensive under multi-threading), do not directly consider the memory access order, and do not handle memory divergence.

                       CPU                     GPU
    Multi-threading    Limited                 Highly multi-threaded
    Instruction Order  Limited out-of-order    Determined by warp scheduler
    Memory Divergence  Limited                 Yes

Rollback is more expensive in GPUs.

Compiler Assisted Code Annotation
Annotated CUDA code:

    #pragma add_pred{fetch, 9, predict, 1}
    ...
    #pragma approx{B}
    C[i] = A[i] + B[i];

Generated PTX code:

    .fetch 9
    .predict 1
    ...
    ld.global.u32.approx %r0, [%r1]