Augsburg University, February 18th, 2010
Anticipatory Techniques in Advanced Processor Architectures
Professor Lucian N. VINŢAN, PhD
Lucian Blaga University of Sibiu (RO), Computer Engineering Department, Advanced Computer Architecture & Processing Systems Lab
Academy of Technical Sciences from Romania

ILP Paradigm. A Typical Superscalar Microarchitecture
Pipeline phases: IFetch, IDecode, Dispatch, Issue, ALU, Mem, Wr_Back, Commit.

Instruction Pipeline Processing. Branch Prediction
[Figure: overlapped instruction pipelines (IFetch, IDecode, Dispatch, Issue, ALU, Mem, WBack, Commit), with a branch (BR, Addr) resolved late in the pipeline]
Branch prediction necessity increases with:
Pipeline depth;
Superscalar factor (maximum achievable ILP).
Branch prediction significantly increases performance.

A Dynamic Adaptive Branch Predictor

A Typical per-Branch FSM Predictor (2 Prediction Bits)
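A minimal sketch of such a 2-bit saturating-counter predictor in Python; the table size, the indexing by low-order PC bits and the initial state are illustrative assumptions rather than values taken from the slide.

```python
# A 2-bit saturating-counter FSM, one counter per branch table entry.
# States: 0 = strongly not-taken, 1 = weakly not-taken,
#         2 = weakly taken,      3 = strongly taken.

class TwoBitPredictor:
    def __init__(self, table_bits=12):
        self.mask = (1 << table_bits) - 1
        self.counters = [1] * (1 << table_bits)   # start in "weakly not-taken"

    def predict(self, pc):
        """Predict taken when the counter is in one of the two 'taken' states."""
        return self.counters[pc & self.mask] >= 2

    def update(self, pc, taken):
        """Move the saturating counter toward the resolved outcome."""
        idx = pc & self.mask
        if taken:
            self.counters[idx] = min(3, self.counters[idx] + 1)
        else:
            self.counters[idx] = max(0, self.counters[idx] - 1)

# A loop branch taken 9 times, then not taken at loop exit.
p = TwoBitPredictor()
correct = 0
for taken in [True] * 9 + [False]:
    correct += p.predict(0x4004F0) == taken
    p.update(0x4004F0, taken)
print(f"accuracy = {correct / 10:.0%}")   # 80%: one warm-up miss, one exit miss
```

The 2-bit hysteresis is what keeps a single anomalous outcome (such as the loop exit) from flipping the prediction for the next loop entry.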

Fundamental Limits of the ILP Paradigm. Solutions
FETCH BOTTLENECK
Fetch rate is limited by the basic-block size (7-8 instructions in SPEC 2000); the fetch bottleneck is due to the programs' intrinsic characteristics.
Solutions:
Trace Cache & multiple (M-1) branch predictors: a TC entry contains N instructions or M basic blocks (N > M), written in the order they were executed;
Branch prediction increases ILP by predicting branch directions and targets and by speculatively processing multiple basic blocks in parallel;
As instruction issue width and pipeline depth grow, accurate branch prediction becomes even more essential.
Some challenges:
Identifying and resolving difficult-to-predict (unbiased) branches;
Helping the computer architect better understand branch predictability and decide whether the predictor should be improved for difficult-to-predict branches.

Trace-Cache with a multiple-branch predictor
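To illustrate the trace-cache idea, the sketch below stores retired instruction traces keyed by their starting PC and the directions of the embedded branches, so one access delivers a multi-block fetch packet; the entry capacity and key format are assumptions made for this sketch.

```python
# Illustrative trace-cache sketch: an entry holds up to N instructions spanning
# several basic blocks, recorded in the order they retired; a lookup is keyed
# by the starting PC plus the predicted directions of the embedded branches,
# so a single access can deliver a whole multi-block fetch packet.

N_INSTR = 16                                    # assumed capacity per entry

class TraceCache:
    def __init__(self):
        self.entries = {}                       # (start_pc, branch_dirs) -> trace

    def fill(self, start_pc, branch_dirs, instructions):
        """Record a retired trace, truncated to the assumed entry capacity."""
        self.entries[(start_pc, tuple(branch_dirs))] = instructions[:N_INSTR]

    def fetch(self, start_pc, predicted_dirs):
        """Return the whole trace on a hit, or None on a miss."""
        return self.entries.get((start_pc, tuple(predicted_dirs)))

tc = TraceCache()
tc.fill(0x1000, [True, False], ["i1", "i2", "br1", "i3", "br2", "i4"])
print(tc.fetch(0x1000, [True, False]))          # hit: multiple basic blocks at once
```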

Fundamental Limits of the ILP Paradigm. Solutions
ISSUE BOTTLENECK (DATA-FLOW)
Conventional processing models are limited in their processing speed by the dynamic program's critical path (Amdahl); this is due to the intrinsic sequentiality of programs.
Two solutions, both exploiting value locality:
Dynamic Instruction Reuse (DIR) is a non-speculative technique: it compresses the program's critical path by reusing (dependent chains of) instructions;
Value Prediction (VP) is a speculative technique: it compresses the program's critical path by predicting instruction results during the fetch or decode pipeline stages and unblocking dependent waiting instructions.
Challenge: exploiting Selective Instruction Reuse and Value Prediction in a Superscalar / Simultaneous Multithreaded (SMT) architecture to anticipate long-latency instructions.
Results: Selective Instruction Reuse (MUL & DIV); Selective Load Value Prediction (Critical Loads).

Dynamic Instruction Reuse vs. Value Prediction

A Dynamic Instruction Reuse Scheme
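A minimal sketch of a reuse buffer; indexing by the instruction's PC together with its source operand values, and the crude FIFO-style eviction, are illustrative choices rather than the exact scheme on the slide.

```python
# Illustrative reuse buffer: results are stored non-speculatively, keyed by the
# instruction's PC and its source operand values; a hit returns the previously
# computed result without re-executing the instruction.

class ReuseBuffer:
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.table = {}                              # (pc, op1, op2) -> result

    def lookup(self, pc, op1, op2):
        return self.table.get((pc, op1, op2))        # None on miss

    def insert(self, pc, op1, op2, result):
        if len(self.table) >= self.capacity:
            self.table.pop(next(iter(self.table)))   # crude FIFO-like eviction
        self.table[(pc, op1, op2)] = result

rb = ReuseBuffer()
pc, a, b = 0x2000, 12345, 678
if (res := rb.lookup(pc, a, b)) is None:             # miss: execute and record
    res = a * b
    rb.insert(pc, a, b, res)
# A later dynamic instance with identical operands reuses the stored result.
assert rb.lookup(pc, a, b) == a * b
```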

A Last Value Prediction Scheme
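A minimal sketch of a last value predictor; the direct-mapped organization indexed by low-order PC bits and the absence of a confidence mechanism are simplifications for illustration.

```python
# Illustrative last value predictor: a small table indexed by the instruction's
# PC remembers the most recent result and speculatively predicts it again.

class LastValuePredictor:
    def __init__(self, table_bits=10):
        self.mask = (1 << table_bits) - 1
        self.last = {}                          # index -> last committed value

    def predict(self, pc):
        """Speculative value at fetch/decode time (None if no history)."""
        return self.last.get(pc & self.mask)

    def train(self, pc, value):
        """Update with the architecturally correct result at commit."""
        self.last[pc & self.mask] = value

lvp = LastValuePredictor()
lvp.train(0x3000, 42)
assert lvp.predict(0x3000) == 42    # dependent instructions may start early;
                                    # a wrong prediction triggers recovery
```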

Identifying Difficult-to-Predict Branches
Our scientific hypothesis: a branch in a certain dynamic context (GHR, LHRs, etc.) is difficult-to-predict if:
It is unbiased: its behavior (taken / not taken) is not sufficiently polarized for that context;
Its taken / not taken outcomes are highly shuffled.
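A minimal sketch, in Python, of the polarization test this hypothesis implies: a (branch, context) pair is classified as unbiased when neither outcome clearly dominates. The 0.95 polarization threshold and the bookkeeping details are illustrative assumptions.

```python
# Sketch: classify a branch in a given dynamic context (e.g. its GHR pattern)
# as unbiased when neither "taken" nor "not taken" clearly dominates.
# The 0.95 polarization threshold is an illustrative choice.

from collections import defaultdict

THRESHOLD = 0.95
counts = defaultdict(lambda: [0, 0])        # (pc, context) -> [not_taken, taken]

def record(pc, context, taken):
    counts[(pc, context)][int(taken)] += 1

def is_unbiased(pc, context):
    nt, t = counts[(pc, context)]
    polarization = max(nt, t) / (nt + t)    # 1.0 = fully biased, 0.5 = random
    return polarization < THRESHOLD

# Example: a branch whose outcomes are highly shuffled in context 0b1010.
for outcome in [True, False, True, True, False, False, True, False]:
    record(0x4000, 0b1010, outcome)
print(is_unbiased(0x4000, 0b1010))          # True: 50/50 within this context
```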

An Unbiased Branch. Context Extension

Identification Methodology: Identifying Difficult-to-Predict Branches (SPEC)
[Cascade diagram: branches that remain unbiased under one context are re-examined under progressively longer contexts: LHR 16 / GHR 16 bits, LHR 16 / GHR 20 bits, LHR 20 / GHR 24 bits, LHR 24 / GHR 28 bits, LHR 28 / GHR 32 bits, LHR 32 bits; the output is the set of remaining unbiased branches]
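A schematic sketch of this cascade; for simplicity it uses only a global-history context of increasing length per step, whereas the methodology on the slide combines GHR and LHR lengths, and the polarization threshold is the same illustrative 0.95 as above.

```python
# Sketch of the cascaded identification methodology: branches still unbiased
# under a short context are re-examined under progressively longer contexts.

from collections import defaultdict

CONTEXT_LENGTHS = [16, 20, 24, 28, 32]      # context bits per cascade step
THRESHOLD = 0.95                            # illustrative polarization threshold

def unbiased_set(trace, ctx_bits, candidates):
    """trace: list of (pc, ghr, taken); returns the pcs still unbiased."""
    counts = defaultdict(lambda: [0, 0])
    for pc, ghr, taken in trace:
        if pc in candidates:
            counts[(pc, ghr & ((1 << ctx_bits) - 1))][int(taken)] += 1
    unbiased = set()
    for (pc, _), (nt, t) in counts.items():
        if max(nt, t) / (nt + t) < THRESHOLD:   # at least one unbiased context
            unbiased.add(pc)
    return unbiased

def cascade(trace):
    candidates = {pc for pc, _, _ in trace}
    for bits in CONTEXT_LENGTHS:
        candidates = unbiased_set(trace, bits, candidates)
    return candidates                        # the "remaining unbiased" branches

# A branch whose outcome alternates within each context: it stays unbiased.
trace = [(0x4000, i % 4, (i // 4) % 2 == 0) for i in range(64)]
print(cascade(trace))    # {16384}: unbiased even for the longest context
```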

Decreasing the average percentage of unbiased branches by extending the contexts (GHR, LHRs)

Decreasing the average percentage of unbiased branches by adding new information (PATH, PBV)

Predicting Unbiased Branches
Even state-of-the-art branch predictors are unable to accurately predict unbiased branches;
The problem consists in finding new relevant information that could reduce their entropy, rather than in developing new predictors;
Challenge: adequately representing unbiased branches in the feature space!
Accurately predicting unbiased branches is still an open problem!

Random Degrees of Unbiased Branches
Random degree metrics based on:
Hidden Markov Models (HMM), a strong method to evaluate the predictability of the sequences generated by unbiased branches;
Discrete entropy of the sequences generated by unbiased branches;
Compression rate (Gzip, Huffman) of the sequences generated by unbiased branches.

Random Degrees of Unbiased Branches Prediction Accuracies using our best evaluated HMM (2 hidden states)

Random Degrees of Unbiased Branches Random Degree Metric Based on Discrete Entropy
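A minimal sketch of a first-order discrete-entropy measure in the spirit of this metric; the exact formulation used in the paper may differ.

```python
# Sketch: Shannon entropy of the taken/not-taken sequence produced by a branch
# in one dynamic context. H close to 1 bit means no dominant outcome;
# H close to 0 means the sequence is strongly biased.

import math

def discrete_entropy(outcomes):
    """outcomes: sequence of booleans (taken / not taken)."""
    p = sum(outcomes) / len(outcomes)        # relative frequency of "taken"
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(discrete_entropy([True] * 95 + [False] * 5))    # ~0.29 bit: biased branch
print(discrete_entropy([True] * 50 + [False] * 50))   # 1.00 bit: unbiased branch
```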

Random Degrees of Unbiased Branches Random Degree Metric Based on Discrete Entropy. Results

Random Degrees of Unbiased Branches Space Savings using Gzip and Huffman Algorithms
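A minimal sketch of the compression-based metric; zlib (DEFLATE, the algorithm behind gzip) stands in for the Gzip/Huffman coders used in the study, and the one-byte-per-outcome encoding is a simplification.

```python
# Sketch: space saving of a compressed taken/not-taken sequence as a proxy for
# its randomness; sequences produced by unbiased branches compress poorly.

import random
import zlib

def space_saving(outcomes):
    raw = bytes(int(t) for t in outcomes)       # one byte per outcome (simplistic)
    return 1 - len(zlib.compress(raw, 9)) / len(raw)

random.seed(0)
regular  = [True, False] * 500                           # perfectly regular pattern
shuffled = [random.random() < 0.5 for _ in range(1000)]  # highly shuffled outcomes
print(f"regular:  {space_saving(regular):.2f}")    # close to 1: very compressible
print(f"shuffled: {space_saving(shuffled):.2f}")   # noticeably lower: more random
```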

Exploiting Selective Instruction Reuse and Value Prediction in a Superscalar Architecture
Long-latency instructions represent another source of ILP limitation;
This limitation is accentuated by the fact that about 28% of branches (5.61% of them unbiased) subsequently depend on critical Loads, and 21% of branches (3.76% of them unbiased) subsequently depend on Mul/Div instructions;
In such cases the misprediction penalty is much higher, because the long-latency instruction must be resolved first;
Therefore we speed up the execution of long-latency instructions by anticipating their results: we predict critical Loads and reuse Mul/Div results.

Parameters of the simulated superscalar/(SMT) architecture (M-Sim)

Exploiting Selective Instruction Reuse and Value Prediction in a Superscalar Architecture
[Diagram: the M-SIM cycle-level performance simulator takes a hardware configuration and a SPEC benchmark as inputs; hardware access counts drive the power models, producing both a performance estimation and a power estimation]

Exploiting Selective Instruction Reuse and Value Prediction in a Superscalar Architecture
Selective Instruction Reuse (MUL & DIV)
Trivial MUL/DIV detection: V*0, V*1; 0/V, V/1, V/V.
The RB is accessed during the issue stage, because most of the MUL/DIV instructions found in the RB during the dispatch stage do not have their operands ready.
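A minimal sketch of the trivial-operation detector described above; the exact issue-stage integration is omitted.

```python
# Sketch of the trivial-operation detector used alongside the reuse buffer:
# results of V*0, V*1, 0/V, V/1 and V/V are produced directly, without
# occupying a long-latency multiply/divide unit.

def trivial_mul(a, b):
    """Return the product if it is trivially known, else None."""
    if a == 0 or b == 0:
        return 0
    if a == 1:
        return b
    if b == 1:
        return a
    return None

def trivial_div(a, b):
    """Return the quotient if it is trivially known, else None."""
    if a == 0:
        return 0
    if b == 1:
        return a
    if a == b:
        return 1
    return None

# Issue-stage check: bypass the functional unit when the result is trivial.
for a, b in [(7, 0), (1, 9), (12, 12)]:
    print(trivial_mul(a, b), trivial_div(a, b))
```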

Exploiting Selective Instruction Reuse and Value Prediction in a Superscalar Architecture Selective Load Value Prediction (Critical Loads)
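A minimal sketch of a selective load value predictor restricted to critical loads (those missing in the D-Cache) and guarded by a small confidence counter; the table size, confidence threshold and indexing are illustrative assumptions.

```python
# Sketch of a selective load value predictor: only loads that miss in the
# D-Cache ("critical loads") are predicted, and only when a small confidence
# counter is saturated.

class SelectiveLVP:
    def __init__(self, entries=1024, conf_max=3, conf_threshold=3):
        self.entries = entries
        self.conf_max = conf_max
        self.conf_threshold = conf_threshold
        self.table = {}                        # index -> (last_value, confidence)

    def _index(self, pc):
        return pc % self.entries

    def predict(self, pc, dcache_miss):
        """Predict only critical (missing) loads with high confidence."""
        if not dcache_miss:
            return None
        entry = self.table.get(self._index(pc))
        if entry and entry[1] >= self.conf_threshold:
            return entry[0]
        return None

    def update(self, pc, actual_value):
        """At commit: adjust confidence and remember the latest value."""
        idx = self._index(pc)
        value, conf = self.table.get(idx, (None, 0))
        conf = min(self.conf_max, conf + 1) if value == actual_value else 0
        self.table[idx] = (actual_value, conf)

slvp = SelectiveLVP()
for _ in range(4):                             # a load repeatedly returning 100
    slvp.update(0x5000, 100)
print(slvp.predict(0x5000, dcache_miss=True))  # 100: predicted, hides miss latency
```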

LVP Recovery Principle
1: ld.w 0($r2) -> $r3   ; miss in D-Cache!
2: add $r4, $r3 -> $r5
3: add $r3, $r6 -> $r8
Unless $r3 is known, instructions 2 and 3 must both wait for instruction 1;
The LVP allows predicting $r3 before instruction 1 completes, so that 2 and 3 can start earlier (and in parallel);
Whenever the prediction turns out to be wrong, a recovery mechanism is activated;
In this example, recovery consists of squashing instructions 2 and 3 from the ROB and re-executing them with the correct $r3 value (selective re-issue).

Benchmarking Methodology
7 integer SPEC 2000 benchmarks: computation-intensive (bzip, gcc, gzip) and memory-intensive (mcf, parser, twolf, vpr);
6 floating-point SPEC 2000 benchmarks (applu, equake, galgel, lucas, mesa, mgrid);
Results reported on 1 billion dynamic instructions, skipping the first 300 million instructions.

Exploiting Selective Instruction Reuse and Value Prediction in a Superscalar Architecture Relative IPC speedup and relative energy-delay product gain with a Reuse Buffer of 1024 entries, the Trivial Operation Detector, and the Load Value Predictor

Multithreaded Processing Metaphors
[Diagram: issue-slot utilization in Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing and Simultaneous Multithreading execution models, showing five threads and idle slots]

Selective Instruction Reuse and Value Prediction in Simultaneous Multithreaded Architectures
[Block diagram: SMT architecture (M-SIM) enhanced with per-thread RB and LVPT structures; Fetch Unit, Branch Predictor, PC, I-Cache, Decode, Issue Queue, Rename Table, Physical Register File, ROB, LVPT, Functional Units, LSQ, D-Cache, RB]
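A minimal sketch of the per-thread organization shown above; the thread count and the table contents are illustrative.

```python
# Sketch of the per-thread anticipatory structures in the SMT configuration:
# every hardware thread owns a private Reuse Buffer and Load Value Prediction
# Table, indexed as in the single-threaded case, so threads do not pollute
# each other's histories.

class PerThreadTables:
    def __init__(self, num_threads):
        self.rb   = [dict() for _ in range(num_threads)]   # (pc, operands) -> result
        self.lvpt = [dict() for _ in range(num_threads)]   # pc -> last load value

    def reuse(self, tid, pc, operands):
        return self.rb[tid].get((pc, operands))

    def predict_load(self, tid, pc):
        return self.lvpt[tid].get(pc)

tables = PerThreadTables(num_threads=3)            # thread count is illustrative
tables.lvpt[0][0x6000] = 7
assert tables.predict_load(0, 0x6000) == 7         # hit for thread 0 ...
assert tables.predict_load(1, 0x6000) is None      # ... no interference with thread 1
```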

Selective Instruction Reuse and Value Prediction in Simultaneous Multithreaded Architectures
IPCs obtained using the SMT architecture with/without the 1024-entry RB & LVPT structures

Selective Instruction Reuse and Value Prediction in Simultaneous Multithreaded Architectures Relative IPC speedup and EDP gain (enhanced SMT vs. classical SMT)

Superscalar/Simultaneous Multithreaded Architectures with only SLVP

Design Space Exploration in the Superscalar & SLVP Architecture (1/4D+SLVP)

Design Space Exploration in the SMT & SLVP Architecture (1/4D+SLVP)

Conclusions and Further Work: Conclusions (I)
We developed several random degree metrics to characterize the randomness of the sequences produced by unbiased branches. All these metrics show that the sequences generated by unbiased branches have high random degrees. They might help the computer architect;
We improved superscalar architectures by selectively anticipating long-latency instructions. IPC speedup: 3.5% on SPECint2000, 23.6% on SPECfp2000. EDP gain: 6.2% on SPECint2000, 34.5% on SPECfp2000;
We analyzed the efficiency of these selective anticipatory methods in SMT architectures: they improve the IPC on all evaluated architectural configurations.

Conclusions and Further Work: Conclusions (II)
The SLVP reduces the energy consumption of the on-chip memory compared with a non-selective LVP scheme;
It creates room for a reduction of the D-Cache size while preserving performance, thus enabling a reduction of the system cost;
A 1024-entry SLVP + ¼ D-Cache (16 KBytes / 2-way / 64 B lines) seems to be a good trade-off in both the superscalar and SMT cases.

Conclusions and Further Work: Further Work
Indexing the SLVP table with the memory address instead of the instruction address (PC);
Exploiting N-value locality instead of 1-value locality;
Generating the thermal maps for the optimal superscalar and SMT configurations (and, if necessary, developing a run-time thermal manager);
Understanding and exploiting instruction reuse and value prediction benefits in a multicore architecture.

Anticipatory Multicore Architectures
Anticipatory multicores would significantly reduce the pressure on the interconnection network's performance/energy;
Predicting an instruction value and later verifying the prediction might not be sufficient: data consistency errors could appear (e.g., the CPU correctly predicts a value representing a D-memory address, but it could subsequently read an incorrect value from that speculatively accessed address!), requiring consistency violation detection and recovery;
The cause of the inconsistency: VP might execute some dependent instructions out of order;
Between value prediction, multithreading and the cache coherence/consistency mechanisms there are subtle, not well-understood relationships;
Nobody has analyzed Dynamic Instruction Reuse in a multicore system. It would additionally raise Reuse Buffer coherence problems; the already developed cache coherence mechanisms would help in solving Reuse Buffer coherency.

Some References
L. VINTAN, A. GELLERT, A. FLOREA, M. OANCEA, C. EGAN – Understanding Prediction Limits through Unbiased Branches, The Eleventh Asia-Pacific Computer Systems Architecture Conference (ACSAC 2006), Shanghai, September 6-8, 2006.
A. GELLERT, A. FLOREA, M. VINTAN, C. EGAN, L. VINTAN – Unbiased Branches: An Open Problem, The Twelfth Asia-Pacific Computer Systems Architecture Conference (ACSAC 2007), Seoul, Korea, August 23-25, 2007.
L. N. VINTAN, A. FLOREA, A. GELLERT – Random Degrees of Unbiased Branches, Proceedings of the Romanian Academy, Series A: Mathematics, Physics, Technical Sciences, Information Science, Volume 9, Number 3, Bucharest, 2008.
A. GELLERT, A. FLOREA, L. VINTAN – Exploiting Selective Instruction Reuse and Value Prediction in a Superscalar Architecture, Journal of Systems Architecture, vol. 55, issue 3, Elsevier, 2009.
A. GELLERT, G. PALERMO, V. ZACCARIA, A. FLOREA, L. VINTAN, C. SILVANO – Energy-Performance Design Space Exploration in SMT Architectures Exploiting Selective Load Value Predictions, Design, Automation & Test in Europe Conference (DATE 2010), Dresden, Germany, March 8-12, 2010.

THANK YOU!