Colorado Computer Architecture Research Group
Architectural Support for Enhanced SMT Job Scheduling
Alex Settle, Joshua Kihm, Andy Janiszewski, Daniel A. Connors
University of Colorado at Boulder
2 Colorado Architecture Research Group Introduction
The shared memory system of an SMT processor limits performance:
  Threads continuously compete for shared cache resources
  Interference between threads causes workload slowdown
Detecting thread interference is a challenge for real systems:
  It requires low-level cache monitoring
  Run-time data is difficult to exploit
Goal: design the performance monitoring hardware required to capture thread interference information that can be exposed to the operating system scheduler to improve workload performance
3 Colorado Architecture Research Group Simultaneous Multithreading (SMT)
Concurrently executes instructions from different contexts
  Exploits thread-level parallelism (TLP)
  Improves instruction-level parallelism (ILP)
  Improves utilization of the base processor
Intel Pentium 4 Xeon
  2-level cache hierarchy with an instruction trace cache
  8 KB data cache: 4-way associative, 64 bytes per line
  512 KB unified L2 cache: 8-way associative, 64 bytes per line
  2-way SMT
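As a point of reference for the super-set counters introduced later, the number of sets in each cache follows directly from these parameters (size / (associativity x line size)). The small C sketch below simply computes those figures from the slide's numbers; it is an illustration of the geometry, not code from the paper.

    #include <stdio.h>

    /* Cache geometry from the slide: sets = size / (ways * line_size). */
    static unsigned num_sets(unsigned size_bytes, unsigned ways, unsigned line_bytes)
    {
        return size_bytes / (ways * line_bytes);
    }

    int main(void)
    {
        /* 8 KB L1 data cache, 4-way, 64-byte lines -> 32 sets */
        printf("L1D sets: %u\n", num_sets(8 * 1024, 4, 64));
        /* 512 KB unified L2, 8-way, 64-byte lines -> 1024 sets */
        printf("L2 sets:  %u\n", num_sets(512 * 1024, 8, 64));
        return 0;
    }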
4 Colorado Architecture Research Group Inter-thread Interference
Competition for shared resources:
  Memory system: buses and physical cache storage
  Fetch and issue queues
  Functional units
Threads evict cache data belonging to other threads
  The resulting increase in cache misses diminishes processor utilization
Inter-thread kick-outs (ITKO)
  Measured in the simulator: the thread id of the evicted cache line is compared to that of the new cache line
  Increased ITKO leads to decreased IPC
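The ITKO measurement described above amounts to tagging each cache line with the thread that filled it and, on every eviction, checking whether the victim belonged to a different thread. The C sketch below is a minimal illustration of that bookkeeping inside a hypothetical fill routine; the structure and function names are assumptions, not the paper's simulator code.

    #include <stdint.h>

    #define WAYS 8

    struct cache_line {
        uint64_t tag;
        int      valid;
        int      owner_tid;   /* hardware thread that filled this line */
    };

    struct cache_set {
        struct cache_line way[WAYS];
    };

    static uint64_t itko_count;   /* inter-thread kick-outs seen so far */

    /* Fill 'victim_way' of 'set' on behalf of thread 'tid'; count an ITKO
     * when a valid line owned by a different thread is evicted. */
    static void fill_line(struct cache_set *set, int victim_way,
                          uint64_t new_tag, int tid)
    {
        struct cache_line *victim = &set->way[victim_way];

        if (victim->valid && victim->owner_tid != tid)
            itko_count++;          /* eviction caused by the other thread */

        victim->tag = new_tag;
        victim->valid = 1;
        victim->owner_tid = tid;
    }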
5 Colorado Architecture Research Group ITKO to IPC Correlation
Level 3 cache; IPC recorded for each phase interval
A high ITKO rate leads to a significant drop in IPC
Large variability in IPC over the workload lifetime, caused by cache interference
6 Colorado Architecture Research Group Related Work
The interference problem has been addressed at different levels.
Compiler
  [Kumar, Tullsen; MICRO '02] Procedure placement optimization; workload fixed at compile time
  [J. Lo; MICRO '97] Tailoring compiler optimizations for SMT; effects of traditional optimizations on SMT performance; static optimizations
Operating system
  [Tullsen, Snavely; ASPLOS '00] Symbiotic job scheduling; profile based, with a simulated OS and architecture
  [J. Lo; ISCA '98] Data cache address remapping; workload dependent, database applications
Microarchitecture
  [Brown; MICRO '01] Issue policy feedback from the memory system; improved fetch and issue resource allocation; does not tackle inter-thread interference
7 Colorado Architecture Research Group Motivation
Improve performance by reducing inter-thread interference
A multi-faceted problem:
  Dependent on thread pairings
  Occurs at low-level cache line granularity
  Difficult to detect at runtime
OS scheduling decisions affect microarchitecture performance
  Observed on both the simulator and a real system
Observation: cache access footprints vary over program lifetimes, and accesses are concentrated in small cache regions
8 Colorado Architecture Research Group Concentration of L2-Cache Access
Cache access and miss footprints vary across program phases
Intervals with high access and miss rates are concentrated in small physical regions of the cache (green and red regions in the figure)
Current performance counters cannot detect that activity is concentrated in small regions
9 Colorado Architecture Research Group Cache Use Map: Runtime Monitoring
Figure: cache use map with spatial locality on the vertical axis and temporal locality on the horizontal axis
10 Colorado Architecture Research Group Benchmark Pairings ITKO
Pairings measured: gzip/mesa, equake/perl, mesa/perl, gzip/perl, mesa/equake, gzip/equake
Yellow represents very high interference (figure legend)
Interference is dependent on the job mix
11 Colorado Architecture Research Group Performance Guided Scheduling Theory
Jobs: equake, gzip, perl, mesa
Each phase, the scheduler selects the jobs with the least interference
Total ITKOs (chart annotations, best static vs. dynamic schedule):
  2.91 million vs. 2.55 million
  2.91 million vs. 2.90 million
  7.30 million vs. 6.70 million
  2.91 million vs. 2.55 million
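One way to read the "least interference each phase" idea is as a per-phase search over the possible co-schedules: with four jobs and two hardware contexts there are only three ways to split them into two pairs, so a scheduler that has a per-pair interference estimate for the current phase can simply pick the cheapest split. The C sketch below illustrates that selection; the interference matrix and its values are hypothetical placeholders, not numbers from the paper.

    #include <stdio.h>

    /* Jobs 0..3 (e.g., equake, gzip, perl, mesa). pair_cost[i][j] is a
     * per-phase estimate of inter-thread interference (e.g., predicted
     * ITKOs) when jobs i and j share the SMT core. Placeholder values. */
    static const double pair_cost[4][4] = {
        /*           equake  gzip  perl  mesa */
        /* equake */ { 0.0,  1.2,  0.4,  2.0 },
        /* gzip   */ { 1.2,  0.0,  0.9,  0.3 },
        /* perl   */ { 0.4,  0.9,  0.0,  1.5 },
        /* mesa   */ { 2.0,  0.3,  1.5,  0.0 },
    };

    int main(void)
    {
        /* The three ways to split jobs {0,1,2,3} into two co-scheduled pairs. */
        static const int splits[3][4] = {
            { 0, 1, 2, 3 },   /* (0,1) then (2,3) */
            { 0, 2, 1, 3 },   /* (0,2) then (1,3) */
            { 0, 3, 1, 2 },   /* (0,3) then (1,2) */
        };
        int best = 0;
        double best_cost = 1e9;

        for (int s = 0; s < 3; s++) {
            double cost = pair_cost[splits[s][0]][splits[s][1]] +
                          pair_cost[splits[s][2]][splits[s][3]];
            if (cost < best_cost) {
                best_cost = cost;
                best = s;
            }
        }
        printf("co-schedule (%d,%d) and (%d,%d), predicted cost %.2f\n",
               splits[best][0], splits[best][1],
               splits[best][2], splits[best][3], best_cost);
        return 0;
    }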
12 Colorado Architecture Research Group Solution to Inter-thread Interference
Predict future interference
  Capture inter-thread interference behavior
  Introduce cache line activity counters
Expose the information to the operating system
  Current schedulers use symmetric multiprocessing (SMP) algorithms for SMT processors
Activity-based job scheduler
  Schedule for minimal inter-thread interference
13 Colorado Architecture Research Group Activity Vectors
Interface between the OS and the microarchitecture
Divide the cache into super sets; an access counter is assigned to each super set
One vector bit corresponds to each counter; the bit is set when a threshold is exceeded
Job scheduler: compare the active job's vector with the vectors of jobs in the run queue and select the job with the fewest common set bits
Thresholds are established through static analysis (global median across all benchmarks)
Figure: example per-super-set access counts (7949, 4271, 3678, 1760, 2511, 2204, 1474 and 1234, 526, 876, 1635, 1067, 1137, 1220, 254) compared against thresholds such as Xi > 1024, Xi > 2048, and Xi > 4096 to produce vector bits (e.g., 1 0 0 1 1 1 0)
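A minimal sketch of the mechanism described above: each super-set counter is thresholded into one bit of the activity vector, and the scheduler scores run-queue candidates by how many set bits they share with the currently running job (fewer shared bits means less expected cache overlap). The 32-super-set width, threshold value, and function names are assumptions for illustration only.

    #include <stdint.h>

    #define SUPER_SETS 32          /* assumed number of super sets per cache */

    /* Build an activity vector: bit i is set when super-set i's access
     * counter exceeded the threshold during the last interval. */
    static uint32_t build_vector(const uint32_t counters[SUPER_SETS],
                                 uint32_t threshold)
    {
        uint32_t vec = 0;
        for (int i = 0; i < SUPER_SETS; i++)
            if (counters[i] > threshold)
                vec |= 1u << i;
        return vec;
    }

    /* Pick the run-queue job whose vector shares the fewest set bits with
     * the vector of the job already running on the other context. */
    static int pick_job(uint32_t running_vec,
                        const uint32_t queue_vec[], int njobs)
    {
        int best = 0, best_overlap = SUPER_SETS + 1;
        for (int j = 0; j < njobs; j++) {
            int overlap = __builtin_popcount(running_vec & queue_vec[j]);
            if (overlap < best_overlap) {
                best_overlap = overlap;
                best = j;
            }
        }
        return best;
    }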
14 Colorado Architecture Research Group Vector Prediction - Simulator
Use the last vector to approximate the next vector; simple and effective, with an average accuracy of 91%

Activity Vector   Use Predictability   Miss Predictability
D-Cache           82.3%                93.6%
I-Cache           94.9%                90.3%
L2-Cache          93.8%                94.6%
Average           90.3%                92.8%
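A sketch of how last-value vector prediction and its accuracy can be expressed: the predicted vector for interval t is simply the observed vector from interval t-1, and predictability can be scored as the fraction of matching bits. This is an illustrative reading of the slide's metric under assumed names and a 16-bit vector width, not the paper's exact definition.

    #include <stdint.h>

    #define SUPER_SETS 16   /* assumed vector width */

    /* Last-value prediction: the next interval's vector is assumed to equal
     * the one just observed. Returns the fraction of bits that match the
     * vector that actually occurred, across all intervals. */
    static double vector_predictability(const uint32_t observed[], int intervals)
    {
        const uint32_t mask = (1u << SUPER_SETS) - 1;
        long matching = 0, total = 0;

        for (int t = 1; t < intervals; t++) {
            uint32_t predicted = observed[t - 1];         /* last vector */
            uint32_t same = ~(predicted ^ observed[t]);   /* 1 where bits agree */
            matching += __builtin_popcount(same & mask);
            total += SUPER_SETS;
        }
        return total ? (double)matching / (double)total : 0.0;
    }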
15 Colorado Architecture Research Group OS Scheduling Algorithm
A weighted sum of the vectors at each cache level is used; vectors from the L2 are given the highest weight
Figure: a physical processor with two logical CPUs (CPU 0, CPU 1), each fed by its own run queue; the vector of the running job (twolf) is compared against the vectors of the run-queue jobs (perlbmk, gzip, mesa, an OS task, mcf, ammp, parser)
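The weighted-sum comparison can be sketched as follows: each candidate's overlap with the running job is computed separately for the D-cache, I-cache, and L2 vectors and combined with per-level weights, the L2 weighted most heavily. The weights and structure names below are assumptions for illustration, not values from the modified kernel.

    #include <stdint.h>

    struct activity_vectors {
        uint32_t dcache;
        uint32_t icache;
        uint32_t l2;
    };

    /* Interference score between the running job and a candidate: weighted
     * sum of shared set bits per cache level. L2 overlap gets the highest
     * weight; the specific weights (1, 1, 4) are illustrative only. */
    static int interference_score(const struct activity_vectors *running,
                                  const struct activity_vectors *candidate)
    {
        int d  = __builtin_popcount(running->dcache & candidate->dcache);
        int i  = __builtin_popcount(running->icache & candidate->icache);
        int l2 = __builtin_popcount(running->l2     & candidate->l2);
        return 1 * d + 1 * i + 4 * l2;
    }

    /* The scheduler would evaluate interference_score() for every job in
     * the run queue and dispatch the one with the lowest score. */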
16 Colorado Architecture Research Group Activity Vector Procedure
Real system:
  Modified Linux kernel 2.6.0, tested on an Intel P4 Xeon with Hyper-Threading
  Activity counter registers are emulated
  Vectors are generated off-line with the Valgrind memory simulator and written to a text file
  The vectors are copied into kernel memory space, the vector scheduler is activated, and the workloads are timed and run
Simulator: vector hardware with a simulated OS
Example per-phase vector table:
  Program Phase   D-cache Vector   L2-cache Vector
  0               11100110         00111011
  1               11000000         01111000
  2               00111101         11010000
  N               11100001         00011100
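The off-line path described above boils down to a table of per-phase vectors that is read from a text file and handed to the kernel. The sketch below parses one plausible line format ("phase dcache_vector l2_vector" as binary strings) into an in-memory table; the file format, field widths, and names are assumptions, since the actual format used with the modified kernel is not given here.

    #include <stdio.h>
    #include <stdint.h>

    #define MAX_PHASES 1024

    struct phase_vectors {
        uint32_t dcache;
        uint32_t l2;
    };

    /* Convert a string of '0'/'1' characters into a bit vector (MSB first). */
    static uint32_t parse_bits(const char *s)
    {
        uint32_t v = 0;
        for (; *s == '0' || *s == '1'; s++)
            v = (v << 1) | (uint32_t)(*s - '0');
        return v;
    }

    /* Read lines of the form: <phase> <dcache bits> <l2 bits>
     * e.g. "0 11100110 00111011". Returns the number of phases read,
     * or -1 if the file cannot be opened. */
    static int load_vectors(const char *path, struct phase_vectors *tab)
    {
        char dbits[40], lbits[40];
        int phase, n = 0;
        FILE *f = fopen(path, "r");
        if (!f)
            return -1;
        while (n < MAX_PHASES &&
               fscanf(f, "%d %39s %39s", &phase, dbits, lbits) == 3) {
            tab[n].dcache = parse_bits(dbits);
            tab[n].l2     = parse_bits(lbits);
            n++;
        }
        fclose(f);
        return n;
    }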
17 Colorado Architecture Research Group Workloads - Xeon
WL1: gzip, vpr, gcc, mesa, art, mcf, equake, crafty
WL2: parser, gap, vortex, bzip2, vpr, mesa, crafty, mcf
WL3: mesa, twolf, vortex, gzip, gcc, art, crafty, vpr
WL4: gzip, twolf, vpr, bzip2, gcc, gap, mesa, parser
WL5: equake, crafty, mcf, parser, art, gap, mesa, vortex
WL6: twolf, bzip2, vortex, gap, parser, crafty, equake, mcf
8 SPEC 2000 jobs per workload, a combination of integer and floating point applications
Run to completion in parallel with OS-level jobs
18 Colorado Architecture Research Group Comparison of Scheduling Algorithms
Default Linux vs. activity-based scheduling
More than 30% of the default scheduler's decisions could have been improved by the activity-based scheduler
19 Colorado Architecture Research Group Activity Vector Performance - Xeon
20 Colorado Architecture Research Group Comparing Activity Vectors to Existing Performance Counters - Simulation
On average, the activity-vector schedule makes decisions that differ from the performance-counter-based schedule 23% of the time.

Benchmark Mix                                     % Diff.
164.gzip, 164.gzip, 181.mcf, 183.equake           0.0%
164.gzip, 164.gzip, 188.ammp, 300.twolf           12.0%
164.gzip, 177.mesa, 181.mcf, 183.equake           0.0%
164.gzip, 177.mesa, 183.equake, 183.equake        0.0%
164.gzip, 197.parser, 253.perlbmk, 300.twolf      44.4%
177.mesa, 177.mesa, 197.parser, 300.twolf         11.1%
177.mesa, 181.mcf, 253.perlbmk, 256.bzip2         0.0%
177.mesa, 188.ammp, 253.perlbmk, 300.twolf        59.5%
177.mesa, 197.parser, 197.parser, 256.bzip2       96.2%
181.mcf, 181.mcf, 256.bzip2, 256.bzip2            0.0%
181.mcf, 183.equake, 253.perlbmk, 300.twolf       4.0%
181.mcf, 253.perlbmk, 253.perlbmk, 256.bzip2      0.0%
183.equake, 188.ammp, 188.ammp, 256.bzip2         11.1%
188.ammp, 188.ammp, 197.parser, 197.parser        96.2%
188.ammp, 300.twolf, 300.twolf, 300.twolf         8.0%
197.parser, 197.parser, 253.perlbmk, 256.bzip2    0.0%
Average                                           22.5%
21 Colorado Architecture Research Group ITKO Reduction - Simulation

Benchmarks                   % ITKO Reduction   % IPC Gain
gzip.gzip.mcf.equake         54.0%              3.6%
gzip.gzip.ammp.twolf         10.5%              4.5%
gzip.mesa.mcf.equake         39.5%              3.0%
gzip.mesa.equake.equake      47.0%              2.4%
mesa.mesa.parser.twolf       10.3%              4.8%
mcf.equake.perlbmk.twolf     1.7%               3.0%
mcf.perlbmk.perlbmk.bzip2    13.0%              12.1%
ammp.twolf.twolf.twolf       1.9%               6.1%
Average                      22%                5%
22 Colorado Architecture Research Group Contributions
Interference analysis of cache accesses
Introduced fine-grained performance counters
A general-purpose, adaptable optimization
  Exposes the microarchitecture to the OS
  Workload independent
Tested on a real SMT machine
  Implemented on the Linux kernel, 2-way SMT core
23 Colorado Architecture Research Group Activity Based Scheduling Summary
Prevents inter-thread interference
  Monitors cache access behavior
  Co-schedules jobs with expected low interference
  Adapts to phased workload behavior
Performance improvements
  Greater than 30% opportunity to improve the default Linux scheduling decisions
  22% reduction in inter-thread interference
  5% improvement in execution time
24 Colorado Architecture Research Group Thank You
25 Colorado Architecture Research Group Super Set Size
What happens when we change the number of super sets used?
Can we include a graph here? Slide 17 once we have the data…
May want to include the tree chart
26 Colorado Architecture Research Group Performance Challenges
Interference is difficult to detect
Inter-thread interference is a multi-faceted problem:
  Occurs at low-level cache line granularity
  Temporal variability in benchmark memory requests
  Dependent on thread pairings
OS scheduling decisions affect performance
Current systems: increased cache associativity; could use PMU register feedback
27 Colorado Architecture Research Group Activity Vectors
Interface between the OS and the microarchitecture
Divide the cache into super sets; an access counter is assigned to each super set
One vector bit corresponds to each counter; the bit is set when a threshold is exceeded
Job scheduler: compare the active job's vector with the vectors of jobs in the run queue and select the job with the fewest common set bits
Figure: example counts (1234, 526, 876, 1635, 1067, 1137, 1000, 254) thresholded with Xi > 1024 to yield the vector 1 0 0 1 1 1 0 0; overlapping set bits mean expected interference, disjoint bits mean no expected interference
28 Colorado Architecture Research Group OS Scheduling
OS scheduling matters when there are more jobs than hardware contexts
Current schedulers use symmetric multiprocessing (SMP) algorithms for SMT processors
Proposed work: for each time interval, co-schedule jobs whose cache accesses fall in different regions
29 Colorado Architecture Research Group
Prevent jobs from running together during program phases where they exhibit high degrees of cache interference
Example per-phase vector table:
  Program Phase   D-cache Vector   L2-cache Vector
  0               11100110         00111011
  1               11000000         01111000
  2               00111101         11010000
  N               11100001         00011100