Be-Nice Scheduling for embedded SMT processors Apr 6th, 2008, Boston Handong Ye
Be-Nice Scheduling ITS (Inter-Thread Stall) Introduction Be-Nice Scheduling Some experimental results
Be-Nice Scheduling ITS Introduction –ITS in Out-Of-Order processor –ITS in In-Order processor Be-Nice Scheduling Some experimental results
ITS Introduction –ITS in an Out-Of-Order machine A thread holds (or fills up) shared resources, e.g., the instruction queue, reservation stations, ..., for too long and blocks the others Flush, … –ITS in an In-Order machine A thread holds Functional Units, blocking the others Two examples What can the compiler do? Be-Nice Scheduling
ITS Introduction –ITS in an In-Order machine Examples, assume: –SMT, 2 threads –Embedded –2 LS units and 2 ALUs –Separate dispatch buffers Be-Nice Scheduling
ITS Introduction –ITS in an In-Order machine Example 1 (Same-FU ITS) –A missed load can block other threads that are using the same LS unit Be-Nice Scheduling
Figure: Example 1, same-FU block. Thread-A's missed load occupies an LS unit in the MEM stage, so Thread-B's load waiting in the dispatch buffer for the same LS unit stalls. (FUs: LS1, LS2, ALU1, ALU2; stages: EXE, MEM, WB)
ITS Introduction –ITS in an In-Order machine Example 2 (Cross-FU ITS) –A missed load can block other threads that are using non-LS Functional Units, e.g., the ALUs Be-Nice Scheduling
Figure: Example 2, cross-FU block. Thread-A's missed load stalls in the MEM stage and blocks Thread-B's adds on the ALUs. (FUs: LS1, LS2, ALU1, ALU2; stages: EXE, MEM, WB)
ITS Introduction –ITS in an In-Order machine Assume: 1. Thread-A's cache-miss rate is around 1%~2% 2. Thread-B always hits Results: 1. Half of the idle cycles are due to ITS 2. Almost 1/3 of all cycles are idle The effect of ITS from Thread-A on Thread-B Be-Nice Scheduling
ITS Introduction –ITS in an In-Order machine What can the compiler do? –Focus on in-order embedded processors –Needs only a few simple HW supports –Implemented in Open64, in Instruction Scheduling Be-Nice Scheduling
ITS (Inter-Thread Stall) Introduction Be-Nice Scheduling Some experimental results
Be-Nice Scheduling Intuitive thinking –Prefetch: unacceptable for embedded systems –Reduce cross-FU ITS: reduce the number of FUs held by Thread-A –Reduce same-FU ITS: avoid issuing instructions from other threads into the blocked FUs Be-Nice Scheduling
Figure: Thread-A before (original) and after Be-Nice scheduling, shown against the dispatch buffer, the FUs (LS1, LS2, ALU1, ALU2), and the pipeline stages (EXE, MEM, WB).
Be-Nice Scheduling –Objective Schedule n (>=2) loads back-to-back Issue the n loads to the same FU –Compiler + HW solution HW side –Add an extra load opcode, ld.n (n=1,2), meaning the load is issued only to the n-th LS unit –Each thread has its own preferred LS unit Compiler side –Use profiling to identify the loads that are highly likely to miss, say 'load_a' –Schedule another load, say 'load_b', right behind 'load_a', and glue the two into a pseudo OP –Change 'load_a' and 'load_b' to the thread's preferred LS unit, e.g., both become 'ld.1' Be-Nice Scheduling
–A Compiler + HW solution Be-Nice Scheduling
Original code (the load '$r1 = ld $r2' is identified by profiling as likely to miss):
BB1:
  $r1 = ld $r2
  $r2 = $r2 + 4
  $r3 = ld $r4
  $r3 = $r3 + 4
  $r5 = $r1 + $r3
After scheduling the second load back-to-back with it:
BB1:
  $r1 = ld $r2
  $r3 = ld $r4
  $r2 = $r2 + 4
  $r3 = $r3 + 4
  $r5 = $r1 + $r3
After retargeting both loads to the thread's preferred LS unit:
BB1:
  $r1 = ld.1 $r2
  $r3 = ld.1 $r4
  $r2 = $r2 + 4
  $r3 = $r3 + 4
  $r5 = $r1 + $r3
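Below is a minimal standalone sketch of the compiler-side pairing step described above, assuming a toy one-source instruction form. The Op struct, canHoistTo(), and beNicePair() names are hypothetical, not Open64 CGIR or GCM APIs; the real pass runs inside GCM with full dependence information. The sketch only shows the idea: start from a load that profiling marked as likely to miss, hoist an independent load to sit immediately behind it, and retarget both loads to the thread's preferred LS unit.

// Hypothetical sketch of the compiler-side pairing step; not Open64 code,
// and the instruction form (one destination, one source) is a simplification.
#include <iostream>
#include <string>
#include <vector>

struct Op {
    std::string opcode;   // e.g. "ld", "ld.1", "add"
    std::string dst;      // destination register
    std::string src;      // (single) source register, simplified
    bool likely_miss;     // set from profile feedback
};

// Rough legality check for hoisting bb[from] to just after bb[to]:
// no register dependence with any op in between.
static bool canHoistTo(const std::vector<Op>& bb, size_t from, size_t to) {
    for (size_t i = to + 1; i < from; ++i) {
        if (bb[i].dst == bb[from].src) return false;   // true dependence
        if (bb[i].src == bb[from].dst) return false;   // anti dependence
        if (bb[i].dst == bb[from].dst) return false;   // output dependence
    }
    return true;
}

// Pair a likely-to-miss load with the next independent load and retarget
// both to the thread's preferred LS unit (e.g. "ld.1").
static void beNicePair(std::vector<Op>& bb, const std::string& preferredLd) {
    for (size_t i = 0; i < bb.size(); ++i) {
        if (bb[i].opcode != "ld" || !bb[i].likely_miss) continue;
        for (size_t j = i + 1; j < bb.size(); ++j) {
            if (bb[j].opcode != "ld" || !canHoistTo(bb, j, i)) continue;
            Op partner = bb[j];
            bb.erase(bb.begin() + j);
            bb.insert(bb.begin() + i + 1, partner);   // glue back-to-back
            bb[i].opcode = preferredLd;               // load_a -> ld.1
            bb[i + 1].opcode = preferredLd;           // load_b -> ld.1
            return;
        }
    }
}

int main() {
    // The BB1 example from the slide; the first load is marked as likely to miss.
    std::vector<Op> bb = {
        {"ld",  "$r1", "$r2", true},
        {"add", "$r2", "$r2", false},
        {"ld",  "$r3", "$r4", false},
        {"add", "$r3", "$r3", false},
        {"add", "$r5", "$r1", false},
    };
    beNicePair(bb, "ld.1");
    for (const Op& op : bb)
        std::cout << op.dst << " = " << op.opcode << " " << op.src << "\n";
    return 0;
}

Run on the BB1 example, the sketch prints the final ld.1-paired sequence shown in the last stage above.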
WHIRL → CG-expand → CGIR → Control flow opt. → If-conversion → Loop optimizations (Software pipelining, Loop unrolling) → Scheduling pre-pass (GCM here) → Global register alloc → Local register alloc → Scheduling post-pass → Prolog and Epilog → Extended block optimizer → Code emission → .s Be-Nice Scheduling
Be-Nice Scheduling (in Open64 GCM and LIS) –The key points during code motion Use GCM to find candidate loads to pair Move the pair as a single 'pseudo' instruction (see the sketch below) Be-Nice Scheduling
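As a rough illustration of the second point, the sketch below treats a glued pair as one scheduling unit, so later code motion can neither split the two loads nor reorder them against each other. The SchedUnit type and the glue()/emit() helpers are hypothetical, not Open64 LIS/GCM structures.

// Hypothetical illustration of moving a glued pair as one pseudo instruction;
// not Open64 LIS/GCM code.
#include <iostream>
#include <string>
#include <vector>

struct Inst {
    std::string text;                 // e.g. "$r1 = ld.1 $r2"
};

// A scheduling unit is either one instruction or a glued pair; the
// scheduler only ever places whole units.
struct SchedUnit {
    std::vector<Inst> members;        // 1 member normally, 2 for a glued pair
};

// Gluing merges two units into one pseudo instruction.
SchedUnit glue(const SchedUnit& a, const SchedUnit& b) {
    SchedUnit p = a;
    p.members.insert(p.members.end(), b.members.begin(), b.members.end());
    return p;
}

// Emission walks units in scheduled order; the members of a glued pair
// always come out back-to-back.
void emit(const std::vector<SchedUnit>& sched) {
    for (const SchedUnit& u : sched)
        for (const Inst& i : u.members)
            std::cout << i.text << "\n";
}

int main() {
    SchedUnit loadA{{{"$r1 = ld.1 $r2"}}};    // the likely-miss load
    SchedUnit loadB{{{"$r3 = ld.1 $r4"}}};    // its glued partner
    SchedUnit pairOp = glue(loadA, loadB);    // the pseudo OP

    // A possible final schedule: the pair is placed as a single node.
    std::vector<SchedUnit> sched = {
        pairOp,
        {{{"$r2 = $r2 + 4"}}},
        {{{"$r3 = $r3 + 4"}}},
        {{{"$r5 = $r1 + $r3"}}},
    };
    emit(sched);
    return 0;
}

One practical benefit of this representation is that the pair's latency and resource usage can be modeled at a single node during scheduling, rather than at two separate decision points.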
Some experimental results –Be-Nice Scheduling applied to Thread-A –Performance difference measured on Thread-B
Be-Nice Scheduling Some experimental results The number of ITS cycles in Thread-B: with Be-Nice vs. without Be-Nice
Be-Nice Scheduling Some experimental results IPC improvement of Thread-B with Be-Nice Instruction Scheduling