A Quantitative Framework for Pre-Execution Thread Selection. Amir Roth, University of Pennsylvania; Gurindar S. Sohi, University of Wisconsin-Madison. MICRO-35, Nov. 22, 2002.

Presentation transcript:

Slide 1: A Quantitative Framework for Pre-Execution Thread Selection. Amir Roth, University of Pennsylvania; Gurindar S. Sohi, University of Wisconsin-Madison. MICRO-35, Nov. 22, 2002.

Slide 2: Pre-execution: proactive multithreading to tolerate latency. Start with a cache miss, extract its computation (dynamic backward slice), and launch it as a redundant thread (p-thread). A p-thread = trigger (launch PC) + body (slice).
Effect I: compressed fetch — the p-thread fetches, initiates, and completes the miss earlier.
Effect II: decoupling — a miss in the p-thread does not stall the main thread.
Sum effect: the miss is "moved" to the p-thread.

Slide 3: This paper is not about pre-execution. These papers are:
- Assisted Execution [Song+Dubois, USC-TR98]
- SSMT [Chappell+, ISCA99, ISCA02, MICRO02]
- Virtual Function Pre-Computation [Roth+, ICS99]
- DDMT [Roth+Sohi, HPCA01]
- Speculative Pre-Computation [Collins+, ISCA01, MICRO01, PLDI02]
- Speculative Slices [Zilles+Sohi, ISCA01]
- Software-Controlled Pre-Execution [Luk, ISCA01]
- Slice Processors [Moshovos+, ICS01]

Slide 4: P-thread selection: pre-?-execution. What p-threads? When to launch? (These are the same question.) (Figure: candidate p-threads A, B, C, D of increasing length.)
Static p-threads: target static problem loads; choose once, launch many instances.
Selection is hard because the criteria are antagonistic: miss coverage, latency tolerance per instance, overhead per instance, useless instances. A longer p-thread is better by some criteria and worse by others.

Slide 5: A quantitative p-thread selection framework:
- Simultaneously optimizes all four criteria
- Accounts for p-thread overlap (later)
- Automatic p-thread optimization and merging (paper only)
- Abstract pre-execution model (applies to SMT, CMP); 4 processor parameters
- Structured as a pre-execution limit study; may be used to study pre-execution potential
This paper: static p-threads for L2 misses.

Slide 6: Rest of Talk
- Propaganda
- Framework proper: master plan, aggregate advantage, p-thread overlap
- Quick example
- Quicker performance evaluation

Slide 7: Plan of Attack and Key Simplifications
Divide: consider p-threads for one static load at a time.
Enumerate all possible* static p-threads (*only p-threads sliced directly from the program, with a priori length restrictions).
Assign a benefit estimate to each static p-thread: the number of cycles by which execution time will be reduced.
Use iterative methods to find the set with maximum advantage.
Conquer: merge p-threads with redundant sub-computations.
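The enumerate/score/select loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the candidate names and benefit scores are hypothetical placeholders, and this simple ranked pass ignores the overlap correction discussed later in the talk.

```python
# Sketch of the per-load plan of attack: enumerate candidate static
# p-threads for one load, score each one, then keep every candidate
# whose benefit estimate is positive, best first.

def select_pthreads(candidates, advantage):
    """Return positively scored candidates in decreasing-benefit order."""
    ranked = sorted(candidates, key=advantage, reverse=True)
    return [p for p in ranked if advantage(p) > 0]

# Hypothetical candidates A-D (slices of increasing length) with
# hypothetical benefit values; the real framework derives these
# from program traces.
scores = {"A": -40, "B": 140, "C": 160, "D": 140}
print(select_pthreads(scores, scores.get))
```

A real selector would re-score the remaining candidates after each pick to account for overlap; this ranked pass is only the starting point.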

Slide 8: Estimating Static P-thread Goodness
Key contribution: simplifications for computational tractability.
1. One p-thread instance executes at a time (framework) → a p-thread interacts with the main thread only.
2. No natural miss parallelism (framework; not a bad assumption for L2 misses) → a p-thread interacts with one main-thread miss only.
3. Control-less p-threads (by construction) → dynamic instances are identical.
4. No chaining (by construction) → a fixed number of them.
Strategy: model the interaction of one dynamic instance with the main thread, then multiply by the (expected) number of dynamic instances.

Slide 9: Aggregate Advantage (ADVagg)
ADVagg: the static p-thread goodness function — the number of cycles by which a static p-thread will accelerate the main thread. It combines all four criteria into one number.
Collect the raw materials from traces:
- DCpt-cm = # misses covered
- LT = latency tolerance per instance (next slide)
- DCtrig = # instances launched
- OH = overhead per instance (SIZE / BWseq)
ADVagg = (DCpt-cm * LT) - (DCtrig * OH)
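The formula is direct to compute once the trace-derived quantities are in hand. A minimal sketch; the example numbers correspond to p-thread C from the later slide-12 example (30 covered misses, LT = 8, 80 launches, 1 cycle of overhead each):

```python
# ADVagg: cycles of latency tolerated across covered misses, minus
# the fetch overhead paid across all launched instances.

def adv_agg(dc_pt_cm, lt, dc_trig, oh):
    """ADVagg = (DCpt-cm * LT) - (DCtrig * OH)."""
    return dc_pt_cm * lt - dc_trig * oh

# 30 misses covered at 8 cycles tolerated each = 240 cycles gained;
# 80 instances at 1 cycle overhead each = 80 cycles paid.
print(adv_agg(dc_pt_cm=30, lt=8, dc_trig=80, oh=1.0))  # 160.0
```

Note that DCtrig can exceed DCpt-cm: every launched instance pays overhead, but only the instances that actually cover a miss contribute latency tolerance.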

Slide 10: Dynamic P-thread Latency Tolerance (LT)
SCDH: sequencing-constrained dataflow height — an estimate of the execution time of an instruction sequence. Each instruction is constrained by dataflow and by fetch (insn# / BWseq). The p-thread's computation is explicit; other main-thread work appears as sparse instruction numbers.
LT = min(SCDHmt - SCDHpt, Lcm)
Starting at the trigger, LT is the main thread's miss execution time (SCDHmt) minus the p-thread's miss execution time (SCDHpt), bounded by the miss latency (Lcm).

Slide 11: SCDH and LT Example
Dataflow constraints — main thread: miss latency = 8, other latencies = 1, serial dependences; p-thread: same.
Sequencing constraints (insn# / fetch bandwidth) — main thread: sparse insn#s / 2; p-thread: dense insn#s / 1.

Instruction          SCDHmt                   SCDHpt
addi R5, R5, #16     max(0/2, 0)  + 1 = 1     max(0/1, 0) + 1 = 1
lw   R6, 4(R5)       max(7/2, 1)  + 1 = 5     max(1/1, 1) + 1 = 2
slli R7, R6, #2      max(10/2, 5) + 1 = 6     max(2/1, 2) + 1 = 3
addi R8, R7, #rx     max(11/2, 6) + 1 = 7     max(3/1, 3) + 1 = 4
lw   R9, 0(R8)       max(12/2, 7) + 8 = 15    max(4/1, 4) + 8 = 12

LT = min(15 - 12, 8) = 3

Slide 12: ADVagg Example
As p-thread length increases (A → B → C → D): LT and DCpt-cm trade off against OH and DCtrig — LT rises while DCpt-cm falls, and OH and DCtrig rise.
Params: 40 misses, 8 cycles each. Max ADVagg = [40 * 8] - [40 * 0] = 320 — impossible to achieve.

     DCpt-cm  LT   LTagg  DCtrig  OH    OHagg  ADVagg
A    40       0    0      80      2/4   40     0 - 40 = -40
B    40       5    200    80      3/4   60     200 - 60 = 140
C    30       8    240    80      4/4   80     240 - 80 = 160
D    30       8    240    80      5/4   100    240 - 100 = 140
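The table follows mechanically from the ADVagg formula. A sketch that recomputes it — the DCtrig and per-instance OH values are as reconstructed from the slide's build-animation residue, so treat them as illustrative:

```python
# Recomputing the slide-12 table: four candidate p-threads of
# increasing length targeting the same load (40 misses, 8 cycles each).

def adv_agg(dc_pt_cm, lt, dc_trig, oh):
    return dc_pt_cm * lt - dc_trig * oh

#             DCpt-cm  LT  DCtrig  OH (cycles/instance)
candidates = {"A": (40, 0, 80, 2/4),
              "B": (40, 5, 80, 3/4),
              "C": (30, 8, 80, 4/4),
              "D": (30, 8, 80, 5/4)}
for name, args in candidates.items():
    print(name, adv_agg(*args))  # A -40.0, B 140.0, C 160.0, D 140.0
```

C wins: the longer slice D tolerates no more latency than C but pays more overhead per instance, and the shorter slice A tolerates none at all.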

Slide 13: P-thread Overlap
Fact: B, C, and D all have positive ADVagg. Q: Why not choose all three? A: They cover the same misses — their LTagg's overlap.
The framework handles p-thread overlap: it represents overlap with a slice tree (paper) and corrects for it by reducing LTagg by the shared component. An overlapping p-thread is chosen only if its corrected advantage (ADVagg-red) is positive — which it is not in this example:

     OHagg  LTagg         LTagg-ovlp    LTagg-red        ADVagg-red
B    60     40*5 = 200    30*5 = 150    200 - 150 = 50   50 - 60 = -10
C    80     30*8 = 240    --            240              240 - 80 = 160
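The overlap correction can be sketched as below. The assumption, matching the slide's numbers, is that once C is chosen, the 30 misses B shares with C contribute nothing extra, so B's aggregate latency tolerance loses the shared component 30 * 5 = 150:

```python
# Overlap correction: reduce a p-thread's aggregate latency tolerance
# by the component already provided by previously chosen p-threads,
# then re-test whether the corrected advantage is still positive.

def adv_agg_red(lt_agg, lt_agg_ovlp, oh_agg):
    """ADVagg-red = (LTagg - LTagg-ovlp) - OHagg."""
    lt_agg_red = lt_agg - lt_agg_ovlp
    return lt_agg_red - oh_agg

# B alone: LTagg = 40*5 = 200, OHagg = 60 -> ADVagg = 140 (positive).
# B after C: 30 of B's 40 covered misses overlap C, removing 30*5 = 150.
print(adv_agg_red(lt_agg=200, lt_agg_ovlp=150, oh_agg=60))  # -10
```

So B, profitable in isolation, is rejected once C is in the chosen set: its residual coverage no longer pays for its overhead.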

Slide 14: Performance Evaluation
- SPEC2000 benchmarks; Alpha EV6, -O2 -fast
- Complete train-input runs, 10% sampling
- SimpleScalar-derived simulator
- Aggressive 6-wide superscalar; 256KB 4-way L2, 100-cycle memory latency
- SMT with 4 threads (p-threads and the main thread contend)
- P-threads for L2 misses; prefetch into L2 only

Slides 15-18: Contribution of Framework Features
The framework accounts for latency, overhead, and overlap. To isolate each consideration: 4 experiments, 3 diagnostics, measured via p-thread simulation.
- GREEDY: as much LT as possible
- +LT: as much LT as needed
- +OH: account for overhead
- +OVLP: account for overlap
Diagnostics: % execution increase (OH), % speedup, % misses covered (LT).

Slide 19: Pre-Execution Behavior Study
Example: max p-thread length — the full framework with length limits of 8, 16, and 32. (Figure: speedup in %.)
ADVagg is just a model, not completely accurate; its validation is an important part of the paper:
- V1: ADVagg predictions should match the simulated diagnostics
- V2: lying to ADVagg should produce worse p-threads
Encouraging (intuitive) result: flexibility increases performance.
Also in the paper: merging, optimization, input sets.

Slide 20: Summary
P-thread selection is important and hard. The quantitative static p-thread selection framework: enumerate all possible static p-threads, assign each a benefit value (ADVagg), and use standard techniques to find the maximum-benefit set.
Results: accounting for overhead, overlap, and optimization helps; many more results in the paper.
Future work: Is ADVagg accurate? Are the simplifications valid? Non-iterative approximations for real implementations.