"DDMT": Speculative Data-Driven Multithreading (an implementation of pre-execution)
Amir Roth and Gurindar S. Sohi
HPCA-7, Jan. 22, 2001


Slide 2: Pre-Execution
- Goal: high single-thread performance
- Problem: μarchitectural latencies of "problem" instructions (PIs)
  - Memory: cache misses; pipeline: mispredicted branches
- Solution: decouple μarchitectural latencies from the main thread
  - Execute copies of PI computations in parallel with the whole program
  - Copies execute PIs faster than the main thread, hence "pre-execute"
  - Why faster? Fewer instructions
  - Initiate cache misses earlier; pre-compute branch outcomes and relay them to the main thread
- DDMT: an implementation of pre-execution

Slide 3: Pre-Execution is a Supplement
- Fundamentally: tolerating non-execution latencies (pipeline, memory) requires values faster than execution can provide them
- Ways of providing values faster than execution:
  - Old: behavioral prediction (table lookup): + small effort per value, - less than perfect ("problems")
  - New: pre-execution (executes fewer instructions): + perfect accuracy, - more effort per value
- Solution: supplement behavioral prediction with pre-execution
- Key: behavioral prediction must handle the majority of cases
- Good news: it already does

Slide 4: Data-Driven Multithreading (DDMT)
- DDMT: an implementation of pre-execution
- Data-Driven Thread (DDT): the pre-executed computation of a PI
- Implementation: an extension to simultaneous multithreading (SMT)
  - SMT is a reality (Alpha 21464)
  - Low static cost: minimal additional hardware
- Pre-execution siphons execution resources
  - Low dynamic cost: fine-grain, flexible bandwidth partitioning
  - Take only as much as you need; minimize contention and overhead
- Paper: metrics, algorithms, and mechanics. Talk: mostly mechanics

Slide 5: Talk Outline
- Working example in 3 parts
- Some details
- Numbers, numbers, numbers

Slide 6: Example, Part 1: Identify PIs
- Use profiling to find PIs
- Running example (same as the paper): a simplified loop from EM3D

    for (node = list; node; node = node->next)
        if (node->neighbor != NULL)
            node->val -= node->neighbor->val * node->coeff;

- Static code:

    I1:  beq r1, I12      # exit loop if node == NULL
    I2:  ldq r2, 8(r1)    # r2 = node->neighbor
    I3:  beq r2, I10      # skip if neighbor == NULL
    I4:  ldt f0, 16(r1)   # f0 = node->val
    I5:  ldt f1, 16(r2)   # f1 = neighbor->val (the problem load)
    I6:  ldt f2, 24(r1)   # f2 = node->coeff
    I7:  mult f1, f2, f3
    I8:  subt f0, f3, f0
    I9:  stt f0, 16(r1)   # node->val = f0
    I10: ldq r1, 0(r1)    # r1 = node->next
    I11: br I1

- The dynamic instruction stream repeats I1..I11 once per iteration
- Few static PIs cause most dynamic "problems"
- Good coverage with few static DDTs

Slide 7: Example, Part 2: Extract DDTs
- Examine program traces; start with the PIs
- Work backwards, gathering backward slices
- Eventually stop. When? (see paper)
- Pack the last N-1 slice instructions into a DDT
- Use the first instruction as the DDT trigger; dynamic trigger instances signal a DDT fork
- Load the DDT into the DDT cache (DDTC)
- The resulting static DDT, triggered at I10:

    I10: ldq r1, 0(r1)    # r1 = node->next
    I2:  ldq r2, 8(r1)    # r2 = node->neighbor
    I3:  beq r2, I10
    I5:  ldt f1, 16(r2)   # f1 = neighbor->val

Slide 8: Example, Part 3: Pre-Execute DDTs
- The main thread (MT) and the DDT execute in parallel
- MT executed a trigger instruction? Fork the DDT (a μarchitectural event)
- The DDT initiates the cache miss early and "absorbs" its latency
- MT integrates DDT results
  - Instructions are not re-executed, which reduces contention
  - Shortens the MT critical path
  - A pre-computed branch outcome avoids a misprediction

Slide 9: Details, Part 1: More About DDTs
- Composed of instructions from the original program
  - Required by integration
  - Should look like normal instructions to the processor
- Data-driven: instructions are not "sequential"; there is no explicit control flow
- How are they sequenced? Packed into "traces" (in the DDTC)
  - Execute all instructions (branches too); save results for integration
  - No runaway threads, better overhead control
  - "Contain" any control flow (e.g., unrolled loops)

Slide 10: Details, Part 2: More About Integration
- Centralized physical register file: used for DDT-MT communication
- Integration is an implementation of squash reuse [MICRO-33]
- Fork: copy the MT rename map to the DDT
  - The DDT locates MT values via lookups; this "roots" integration
- Integration: match PCs and physical registers to establish a DDT-MT instruction correspondence
  - MT "claims" physical registers allocated by the DDT
  - Register renaming is modified to do this

Slide 11: Details, Part 3: More About DDT Selection
- A very important problem
- Fundamental aspects: metrics and algorithms; a promising start (see paper)
- Practical aspects: who implements the algorithm? How do DDTs get into the DDTC?
  - Paper: profile-driven, offline, executable annotations
  - An open question

Slide 12: Performance Evaluation
- SPEC2K and Olden benchmarks, Alpha EV6, -O3 -fast; chose programs with problems
- SimpleScalar-based simulation environment
  - DDT selection phase: functional simulation on a small input
  - DDT measurement phase: timing simulation on a larger input
- 8-wide superscalar out-of-order core
  - 128 ROB, 64 LDQ, 32 STQ, 80 RS (shared)
  - Pipeline: 3 fetch, 2 rename/integrate, 2 schedule, 2 register read, 2 load
- 32KB I$ / 64KB D$ (2-way), 1MB L2$ (4-way), memory bandwidth: 8 bytes/cycle
- DDTC: 16 DDTs, 32 instructions (max) per DDT

Slide 13: Numbers, Part 1: The Bottom Line
- Cache misses: speedups vary, 10-15%
  - DDT "unrolling" increases latency tolerance (paper)
- Branch mispredictions: speedups lower, 5-10%
  - More PIs, lower coverage
  - Branch integration != perfect branch prediction
- Effects mix

Slide 14: Numbers, Part 2: A Closer Look
- DDT overhead: fetch utilization ~5% (reasonable)
  - Fewer MT fetches (always): contention
  - Fewer total fetches: early branch resolution
- DDT utility: integration rates vary, mostly ~30% (low)
  - Completed DDTs: well done. Not completed: a little late
  - Causes: input-set differences; greedy DDTs with no early exits

Slide 15: Numbers, Part 3: Is Full DDMT Necessary?
- How important is integration? Important for branch resolution, less so for prefetching
- How important is decoupling? What if we just priority-scheduled? Extremely important: without data-driven sequencing there is no speedup

Slide 16: Summary
- Pre-execution supplements behavioral prediction
  - Decouples/absorbs the μarchitectural latencies of "problem" instructions
- DDMT: an implementation of pre-execution
  - An extension to SMT; few hardware changes
  - Bandwidth-allocation flexibility reduces overhead
- Future work: DDT selection
  - Fundamental: better metrics/algorithms to increase utility and coverage
  - Practical: an easy implementation
- Coming to a university/research lab near you