Be-Nice Scheduling for embedded SMT processors
Apr 6th, 2008, Boston
Handong Ye

Be-Nice Scheduling
– ITS (Inter-Thread Stall) Introduction
– Be-Nice Scheduling
– Some experimental results

Be-Nice Scheduling
– ITS Introduction
  – ITS in Out-Of-Order processors
  – ITS in In-Order processors
– Be-Nice Scheduling
– Some experimental results

ITS Introduction
– ITS in an Out-Of-Order machine
  – A thread holds (or fills) shared resources, e.g., the instruction queue or reservation stations, for too long and blocks other threads
  – Flush, …
– ITS in an In-Order machine
  – A thread holds Functional Units, blocking other threads
  – 2 examples
– What can the compiler do?

ITS Introduction
– ITS in an In-Order machine
– Examples assume:
  – SMT, 2 threads
  – Embedded processor
  – 2 LS units and 2 ALUs
  – Separate dispatch buffers
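To make the machine model used by the examples concrete, here is a minimal C++ sketch of the assumed configuration; the struct and field names are illustrative only and are not taken from the paper or from Open64.

// Machine model assumed by the following examples (illustrative sketch).
struct SMTMachineModel {
  int num_threads = 2;                    // SMT with 2 hardware threads
  int num_ls_units = 2;                   // 2 load/store (LS) units
  int num_alus = 2;                       // 2 ALUs
  bool separate_dispatch_buffers = true;  // one dispatch buffer per thread
  bool in_order = true;                   // in-order, embedded-class pipeline
};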

ITS Introduction
– ITS in an In-Order machine
– Example 1 (Same-FU ITS)
  – A missed load can block other threads that are using the same LS unit

[Diagram] Example 1: same-FU block. Pipeline stages EXE/MEM/WB with units LS1/LS2/ALU1/ALU2. Thread-A's missed ld occupies an LS unit in the MEM stage, so Thread-B's ld waiting in the dispatch buffer for the same LS unit cannot issue.

ITS Introduction
– ITS in an In-Order machine
– Example 2 (Cross-FU ITS)
  – A missed load can block other threads that are using non-LS Functional Units, e.g., the ALUs

[Diagram] Example 2: cross-FU block. Thread-A's missed ld stalls in the MEM stage while its add still holds an ALU, so Thread-B's add in the dispatch buffer cannot issue to that ALU.

ITS Introduction
– ITS in an In-Order machine
– The effect of ITS from Thread-A on Thread-B
Assume:
1. Thread-A's cache miss rate is around 1%~2%
2. Thread-B always hits
Results:
1. Half of the idle cycles are due to ITS
2. Almost 1/3 of the cycles are idle

ITS Introduction
– ITS in an In-Order machine
– What can the compiler do?
  – Focus on in-order embedded processors
  – Needs only a few simple HW supports
  – Implemented in Open64, in instruction scheduling

– ITS (Inter-Thread Stall) Introduction
– Be-Nice Scheduling
– Some experimental results

Be-Nice Scheduling
– Intuitive thinking
  – Prefetch: unacceptable for embedded systems
  – Reduce Cross-FU ITS: reduce the number of FUs held by Thread-A
  – Reduce Same-FU ITS: avoid issuing instructions from other threads into the blocked FUs

[Diagram] The same pipeline as before, now showing the original Thread-A code (add, ld, add) next to its rescheduled ("sched") version, illustrating how Be-Nice scheduling changes which FUs Thread-A occupies alongside Thread-B.

Be-Nice Scheduling
– Objective
  – Schedule n (>= 2) loads back-to-back
  – Issue the n loads to the same FU
– Compiler + HW solution
  – HW side
    – Add an extra load form, ld.n (n = 1, 2), which sends the load only to the n-th LS unit
    – Each thread has its own preferred LS unit
  – Compiler side
    – Use profiling to find the loads that are likely to miss, say 'load_a' (see the sketch below)
    – Schedule another load, say 'load_b', behind 'load_a', and glue them together as a pseudo OP
    – Change 'load_a' and 'load_b' to the thread's preferred LS unit, e.g., both become 'ld.1'
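As a concrete illustration of the profiling step, here is a minimal C++ sketch; LoadProfile, FindMissProneLoads, and the 2% threshold are illustrative assumptions, not Open64 APIs or values from the paper.

#include <string>
#include <vector>

// Profile record for one static load instruction (illustrative).
struct LoadProfile {
  std::string op_id;   // identifies the load (e.g., a PC or an OP handle)
  long accesses = 0;   // profiled executions
  long misses = 0;     // profiled cache misses
};

// Return the ids of loads whose profiled miss ratio exceeds `threshold`.
std::vector<std::string> FindMissProneLoads(const std::vector<LoadProfile>& profile,
                                            double threshold = 0.02) {
  std::vector<std::string> result;
  for (const LoadProfile& p : profile) {
    if (p.accesses > 0 &&
        static_cast<double>(p.misses) / static_cast<double>(p.accesses) > threshold) {
      result.push_back(p.op_id);
    }
  }
  return result;
}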

– A Compiler + HW solution: example

Original BB1 (the first load is identified by profiling as likely to miss):
BB1:
  $r1 = ld $r2
  $r2 = $r2 + 4
  $r3 = ld $r4
  $r3 = $r3 + 4
  $r5 = $r1 + $r3

After scheduling the two loads back-to-back:
BB1:
  $r1 = ld $r2
  $r3 = ld $r4
  $r2 = $r2 + 4
  $r3 = $r3 + 4
  $r5 = $r1 + $r3

After steering both loads to the thread's preferred LS unit:
BB1:
  $r1 = ld.1 $r2
  $r3 = ld.1 $r4
  $r2 = $r2 + 4
  $r3 = $r3 + 4
  $r5 = $r1 + $r3
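The transformation above can be sketched as a small pass over a basic block. This is an illustrative C++ sketch, not the Open64 implementation: the Inst representation, DependsOn, and BeNiceSchedule are assumed names, and a real pass would work on CGIR with a full dependence graph (including memory and anti/output dependences).

#include <cstddef>
#include <string>
#include <vector>

struct Inst {
  std::string opcode;            // "ld", "ld.1", "ld.2", "add", ...
  std::string def;               // destination register
  std::vector<std::string> uses; // source registers
  bool likely_miss = false;      // set from profile feedback
};

static bool IsLoad(const Inst& i) { return i.opcode.rfind("ld", 0) == 0; }

// True if `b` reads the register defined by `a` (register RAW only; a real
// pass also checks memory and anti/output dependences).
static bool DependsOn(const Inst& b, const Inst& a) {
  for (const std::string& u : b.uses)
    if (!a.def.empty() && u == a.def) return true;
  return false;
}

// Pair each miss-prone load with the next independent load in the block and
// steer both loads to this thread's preferred LS unit (ld.1 or ld.2).
void BeNiceSchedule(std::vector<Inst>& bb, int preferred_ls_unit) {
  const std::string steered = "ld." + std::to_string(preferred_ls_unit);
  for (std::size_t i = 0; i < bb.size(); ++i) {
    if (!IsLoad(bb[i]) || !bb[i].likely_miss) continue;
    for (std::size_t j = i + 1; j < bb.size(); ++j) {
      if (!IsLoad(bb[j])) continue;
      // The partner must not read anything it would be hoisted above.
      bool safe = true;
      for (std::size_t k = i; k < j && safe; ++k)
        if (DependsOn(bb[j], bb[k])) safe = false;
      if (!safe) continue;
      // Hoist the partner to right behind the miss-prone load.
      Inst partner = bb[j];
      bb.erase(bb.begin() + static_cast<std::ptrdiff_t>(j));
      bb.insert(bb.begin() + static_cast<std::ptrdiff_t>(i) + 1, partner);
      // Steer both loads to the same LS unit.
      bb[i].opcode = steered;
      bb[i + 1].opcode = steered;
      break;
    }
  }
}

On the BB1 example, with $r1 = ld $r2 marked as likely to miss, this sketch produces the final form shown above: the two loads end up back-to-back as ld.1 instructions.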

[Diagram] The Open64 code generator pipeline: WHIRL is expanded (CG-expand) into CGIR, which then goes through control flow optimization, if-conversion, loop optimizations (software pipelining, loop unrolling), the scheduling pre-pass (GCM, where Be-Nice Scheduling is implemented), global and local register allocation, the scheduling post-pass, prolog and epilog generation, the extended block optimizer, and code emission (.s).

Be-Nice Scheduling (in Open64 GCM and LIS)
– The key points during code motion
  – Use GCM to find candidate load pairs
  – Move the pair as a single 'pseudo' instruction
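One way to picture the "pseudo single instruction" idea is the C++ sketch below: the glued pair exposes the union of its members' defs and uses to the scheduler, so GCM and list scheduling move it as one unit, and it is expanded back into two loads when the schedule is emitted. The Op type, GluePair, ExpandPseudoOps, and the ld_pair opcode are illustrative names, not Open64's GCM/LIS interfaces.

#include <set>
#include <string>
#include <vector>

struct Op {
  std::string opcode;
  std::set<std::string> defs;   // registers written
  std::set<std::string> uses;   // registers read
  std::vector<Op> members;      // non-empty only for a glued pseudo OP
};

// Glue two loads into one pseudo OP whose defs/uses are the unions of its
// members', so dependence tests and code motion treat the pair as one unit.
Op GluePair(const Op& load_a, const Op& load_b) {
  Op pseudo;
  pseudo.opcode = "ld_pair";    // hypothetical pseudo opcode
  pseudo.members = {load_a, load_b};
  pseudo.defs = load_a.defs;
  pseudo.defs.insert(load_b.defs.begin(), load_b.defs.end());
  pseudo.uses = load_a.uses;
  pseudo.uses.insert(load_b.uses.begin(), load_b.uses.end());
  return pseudo;
}

// After scheduling, expand pseudo OPs back into their member loads so the
// two ld.n instructions are emitted back-to-back.
std::vector<Op> ExpandPseudoOps(const std::vector<Op>& scheduled) {
  std::vector<Op> out;
  for (const Op& op : scheduled) {
    if (op.members.empty()) out.push_back(op);
    else out.insert(out.end(), op.members.begin(), op.members.end());
  }
  return out;
}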

Some experimental results
– Be-Nice Scheduling is applied to Thread-A
– The performance difference is measured on Thread-B

Be-Nice Scheduling
Some experimental results
[Chart] The number of ITS cycles in Thread-B: with Be-Nice vs. without Be-Nice

Be-Nice Scheduling
Some experimental results
[Chart] IPC improvement of Thread-B with Be-Nice instruction scheduling