Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
Outline Helper threads VMT ideas Implementation details Hardware Firmware Compiler Results Conclusion.
Helper threads Used in Multi-threaded architectures to prefetch hard-to-predict delinquent data or compute hard-to-predict branches. Threads share resources as fetch bandwith and functional units.
Hyper Threading (Intel - P4) Each hardware thread context is exposed as logical processor to the OS. OS finds threads for execution and binds them to the logical processor. User has to use OS-visible thread API to create and manage threads.
Helper Threads - Issues Resource contention among multiple helper threads Adaptable invocation for different program phases. Threads have to be self-throttling. OS based thread synchronization is unpredictable and has long latency ( ~micro secs)
Virtual Multithreading Single processor supports multiple thread contexts. Monitors long latency micro-architectural events. Switches to different Instruction in same program in 100 cycles. OS transparent Uses firmware support in Itanium 2 processor to reduce context switch time.
Context switch requires
Advantages with VMT on Itanium Ability to track micro-architectural events without involvement of the OS. Eg: Last level cache misses. Large register set partitioned by compiler for helper threads Register communication is easier Value Synchronization - no memory comm. OS context switches allow threads to be resumed on any processor.
New Instructions Yield Synchronous transfer to VMT thread, similar to branch misprediction Yield conditional Transfer only when pipeline stalls at some later instruction. Execution proceed, instructions retire No pipeline stall instruction behaves as nop
Key Characteristics Self throttling – main and helper threads keep counters to track progress (iteration counter) Helper thread falls behind -> reload value Helper thread runs too far ahead -> relinquish ctrl. Main thread begins execution at instruction that triggered helper thread invocation. VMT preserves thread continuation of helper threads -> helper thread can restart where it stopped.
Key Characteristics VMT has to maintain Initial instruction address. Continuation instruction address Compiler preserves 2 registers for the purpose. Support for multiple helper threads can be done by reprogramming these registers.
Itanium Firmware Programmable debugging hardware support for PAL To enable silicon debugging & validation. PAL can program PMU to monitor and count events of interest - opcode monitoring, instruction addr., Data addr. Debugging hardware can trigger a PAL handler when the monitored even occurs.
Firmware VMT mechanism emulated by firmware infrastructure. Opcode monitoring to simulate yield and yield conditional. PAL programs PMU to track Last level cache misses Pipeline stalls Instructions with special opcodes. Thread switch latency = pipeline flush + overhead for manipulating registers. (~140 cycles giving 60 cycles of computation time when memory miss ~200 cycles)
Experimental machine. 4 way 1.5 Ghz Itanium 2 processor based MP system with 16 GB of RAM. Separate 16 KB 4-way set associate L1 I- and D- cache Shared 256 KB 8-way set associative L2 cache. 6 MB 24-way L3 cache that can be configured as 1 MB 4-way set associative cache.
workloads MCF – combinatorial optimization. VPR – FPGA Circuit Placement and Routing DOT – graph layout optimization tool DSS system running on 100 GB IBM DB2 database 6 queries with long run time and span large portions of database. 95% cpu utilization, 40 concurrent threads
Compiler and optimizations Electon –O3, IPA, Profile guided opt., Itanium2 specific opt. Recompiled to obtain threads and linked with original binaries. Register partitioning to minimize VMT context switch Aggressive software prefetching with profile feedback. Ld.s, chk, predication, branch prediction hints.
SpeedUp A few helper threads give good speedup Significant fraction of L3 misses are removed from main thread (avg. 48%). Capacity misses in L3 are due to pointer chasing. Helper thread size is small. Helper thread can contain control flow dependencies also. Throughput improved by reducing latency of individual threads
Conclusion Fly-weight context switching 5.8 – 38.5% increase for SPEC2000 INT 5-12% speedup on DSS workload. VMT threads are to be invoked based on program behavior depending on number of cache misses.
My view on limitations Requires large register files and firmware support. Too Itanium (not adaptable to other architectures). Scalability of helper threads.(# helper threads running at one time…to complex)