Avoiding Initialization Misses to the Heap
Jarrod Lewis, Bryan Black, and Mikko H. Lipasti
Department of Electrical and Computer Engineering, University of Wisconsin-Madison
Intel Labs
http://www.ece.wisc.edu/~pharm
Motivation
- Memory bandwidth is expensive
  - Shouldn’t waste it on useless traffic
  - Can be put to better use: multithreading, prefetching, MLP, etc.
- Search and destroy useless traffic
- Focus of this talk: heap initialization
  - Detect and optimize initialization of newly allocated memory
  - 23% of misses in a 2MB cache fetch invalid data
April 6, 2019 Avoiding Initialization Misses to the Heap – Mikko Lipasti
Dynamically Allocated Memory
- State diagram of heap space:
  - Unallocated (Invalid) to Allocated Invalid on malloc()
  - Allocated Invalid to Allocated Valid on initializing store
  - Allocated Valid stays Allocated Valid on load or store
  - Either allocated state returns to Unallocated (Invalid) on free()
- Invalid memory need not be transferred
- Provide interface that expresses this directly?
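The state diagram on this slide can be sketched as a small transition function. This is only an illustration: the names `HeapState`, `next_state`, and `must_fetch_on_miss` are hypothetical, and the sketch assumes state is kept per cache block.

```python
from enum import Enum


class HeapState(Enum):
    UNALLOCATED_INVALID = 1  # free heap space; contents are meaningless
    ALLOCATED_INVALID = 2    # returned by malloc(), not yet written
    ALLOCATED_VALID = 3      # written at least once since allocation


def next_state(state, event):
    """Advance one block's state for an event: 'malloc', 'free', 'load', 'store'."""
    if event == "free":
        return HeapState.UNALLOCATED_INVALID
    if event == "malloc" and state is HeapState.UNALLOCATED_INVALID:
        return HeapState.ALLOCATED_INVALID
    if event == "store" and state is HeapState.ALLOCATED_INVALID:
        return HeapState.ALLOCATED_VALID  # the initializing store
    return state  # loads and later stores leave the state unchanged


def must_fetch_on_miss(state, event):
    """Only a store to allocated-but-invalid memory can skip the memory transfer."""
    return not (event == "store" and state is HeapState.ALLOCATED_INVALID)
```

Under these assumptions, the first store after malloc() is the only access for which the fetch from memory is avoidable; every later miss to the same block must transfer real data.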
Talk Outline
- Motivation
- Analysis of Heap Behavior
- Detecting Initializing Writes
- Performance Analysis
- Conclusions
Allocation Analysis
- Two main modes
  - Single dominant allocation (up to 100MB), or
  - Numerous moderate allocations
- Initialization of allocations
  - 88% initialized with store miss
  - Little temporal reuse of freed memory
- Phase behavior
  - Start of program often dominates
  - Even SPEC has counterexamples (gcc, vortex)
Cache Miss Behavior
- Initializing stores cause up to 60% of misses (avg 23%)
- These are 35% of all compulsory misses
Detecting Initializing Writes
- Annotate malloc()
  - Record base, size in allocation range cache
- Key questions
  - What is working set?
  - How are ranges represented?
    - Valid bits? Not scalable for 100MB allocation
    - Base + bound
  - How are ranges updated on writes?
    - Split vs. truncate
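A minimal sketch of an allocation range cache with base + bound entries, plus the two update policies named above. The class and function names are hypothetical, the LRU replacement policy is an assumption, and "keep the larger remaining piece" is only one possible reading of truncation.

```python
from collections import OrderedDict


class AllocationRangeCache:
    """Small LRU table of [base, bound) ranges believed to be allocated-invalid."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.ranges = OrderedDict()  # base -> bound (exclusive), oldest first

    def record_malloc(self, base, size):
        """Called from an annotated malloc(): remember the new allocation."""
        self.ranges[base] = base + size
        self.ranges.move_to_end(base)
        if len(self.ranges) > self.capacity:
            self.ranges.popitem(last=False)  # evict the least recently used range

    def lookup(self, addr):
        """Return the (base, bound) range containing addr, or None."""
        for base, bound in list(self.ranges.items()):
            if base <= addr < bound:
                self.ranges.move_to_end(base)
                return base, bound
        return None


def split(base, bound, lo, hi):
    """Remove written bytes [lo, hi); keep both remaining pieces (may need a new entry)."""
    return [(b, e) for b, e in ((base, lo), (hi, bound)) if b < e]


def truncate(base, bound, lo, hi):
    """Remove written bytes [lo, hi); keep only the larger remaining piece."""
    return max(split(base, bound, lo, hi), key=lambda r: r[1] - r[0], default=None)
```

Splitting is exact but can double the number of entries per write; truncation keeps one entry per allocation at the cost of forgetting part of the range.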
Allocation Working Set
- 4-8 entries sufficient, except parser needs 64
Sequential Initialization
- Figure: blocks A-F of one allocation, each marked Allocated-Invalid, Initialized, or Unknown
  - Initialization pattern: 1. Sequential
  - Tracking scheme: 1. Forward Sweep
- Forward sweep captures 90%+ except bzip, gzip, perl
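One way to picture forward-sweep tracking is the sketch below. It assumes a single base + bound range per allocation, counted in cache blocks, where each captured store advances the base past the written block and any skipped blocks become "unknown". The function name and this exact update rule are illustrative assumptions.

```python
def forward_sweep(num_blocks, store_order):
    """Count first stores recognized as initializing when only the base may advance."""
    base, bound = 0, num_blocks  # blocks in [base, bound) are known allocated-invalid
    captured = 0
    for blk in store_order:
        if base <= blk < bound:
            captured += 1    # known invalid: the miss need not fetch from memory
            base = blk + 1   # truncate: blocks skipped below blk become unknown
    return captured
```

With six blocks, a sequential order `[0, 1, 2, 3, 4, 5]` captures all six stores in this model, while an alternating order `[0, 5, 1, 4, 2, 3]` captures only two, because the second store pushes the base past everything in between.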
Alternating Initialization
- Figure: blocks A-F of one allocation, each marked Allocated-Invalid, Initialized, or Unknown
  - Initialization pattern: 2. Alternating
  - Tracking scheme: 2. Bidirectional Sweep
- Bidirectional sweep captures 90%+ of perl
- Doesn’t help bzip or gzip
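A bidirectional sweep can be sketched by letting both ends of the range move. The rule used here (advance the base when the store is nearer the low end, otherwise pull the bound down) is an assumption made for illustration, as is the function name.

```python
def bidirectional_sweep(num_blocks, store_order):
    """Count first stores recognized as initializing when both ends may move."""
    base, bound = 0, num_blocks  # blocks in [base, bound) are known allocated-invalid
    captured = 0
    for blk in store_order:
        if base <= blk < bound:
            captured += 1
            if blk - base <= (bound - 1) - blk:
                base = blk + 1   # nearer the low end: advance the base
            else:
                bound = blk      # nearer the high end: pull the bound down
    return captured
```

In this model an alternating order such as `[0, 5, 1, 4, 2, 3]` is fully captured, while a stride-2 order `[0, 2, 4, 1, 3, 5]` still loses the blocks skipped on the first pass.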
Striding Initialization
- Figure: blocks A-F of one allocation tracked in the order A C E B D F, each marked Allocated-Invalid, Initialized, or Unknown
  - Initialization pattern: 3. Striding
  - Tracking scheme: 3. Interleaving
- Interleaving captures 90%+ of gzip
- Still only 60% of bzip
  - bzip has a large allocation with random initialization
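The slide does not spell out how interleaving is implemented. One possible reading, sketched here purely as an assumption, treats the allocation as `ways` interleaved sub-ranges (blocks A C E and B D F for `ways=2`) and runs an independent forward sweep over each. The function and parameter names are hypothetical.

```python
def interleaved_sweep(num_blocks, store_order, ways=2):
    """Forward sweep applied independently to `ways` interleaved sub-ranges."""
    # Sub-range w covers blocks w, w + ways, w + 2*ways, ...; each keeps its own base.
    next_index = [0] * ways
    limit = [len(range(w, num_blocks, ways)) for w in range(ways)]
    captured = 0
    for blk in store_order:
        w, idx = blk % ways, blk // ways
        if next_index[w] <= idx < limit[w]:
            captured += 1
            next_index[w] = idx + 1  # truncate within this sub-range only
    return captured
```

Under this reading, a stride-2 order `[0, 2, 4, 1, 3, 5]` looks sequential inside each sub-range and is fully captured; with `ways=1` the function degenerates to a plain forward sweep. A randomly ordered initialization, as described for bzip, would still defeat it.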
PHARMsim Overview
- Diagram labels: PHARMsim (OOO core, Gigaplane); Block; Simple; SimOS-PPC (AIX 4.3.1, disk driver, Ethernet driver); Ethernet
- Device simulation, etc. from SimOS-PPC [IBM ARL]
- PHARMsim replaces functional simulators
- Full OOO core model, values in rename registers
- Supports privileged mode, MMU, TLB, exceptions, interrupts, barriers, flushes, etc.
- Lead developer: Trey Cain (thanks Trey!)
Operating System Effects
- Widely accepted for SPECINT: safe to ignore O/S paths
- Most popular tool (SimpleScalar)
  - Intercepts system calls
  - Emulates on host, updates “flat” memory
  - Returns “magically” with cache contents intact
- We have found that [CAECW2002]:
  - Omitting system references leads to dramatic error (5.8x L2 miss rate, 100% IPC in worst case)
  - Specifically, AIX page fault handler eliminates many initializing write misses
- Had we not used PHARMsim?
  - Dramatically overstated performance benefit
AIX Page Installation
- Sequence (animated figure of the data segment; regions labeled Unallocated, Allocated, Valid):
  1. Heap manager calls sbrk
  2. Malloc returns block < 4KB
  3. Program writes to block
  4. First reference causes page fault
  5. AIX installs entire page using dcbz
Block vs. Page Installation
- Practically free as part of page fault
- Shortcomings of page installation
  - Pollutes cache
  - Not scalable to superpages (AIX v5.1)
  - Does not work for heap reuse
- Our short simulations don’t show this benefit
  - I.e. high overlap between initializing writes and first reference to extended data segment
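A rough back-of-the-envelope sketch of why page installation pollutes the cache and scales poorly. It assumes the 64-byte cache blocks from the machine model and, for the superpage case, a 16MB page size; that size is an assumption and is not stated on the slide. The function names are hypothetical.

```python
BLOCK_BYTES = 64  # cache block size taken from the machine model slide


def blocks_zeroed(page_bytes):
    """Cache blocks installed when an entire page is zeroed block by block."""
    return page_bytes // BLOCK_BYTES


def blocks_outside_allocation(page_bytes, alloc_bytes):
    """Installed blocks that the just-allocated region does not cover."""
    used = -(-min(alloc_bytes, page_bytes) // BLOCK_BYTES)  # ceiling division
    return blocks_zeroed(page_bytes) - used


small_page = blocks_zeroed(4 * 1024)          # 4KB page
super_page = blocks_zeroed(16 * 1024 * 1024)  # assumed 16MB superpage
```

Here a 4KB page costs 64 block installs, 60 of which fall outside a 256-byte allocation, while the assumed 16MB superpage would need 262,144 installs, far more blocks than a 2MB cache holds.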
Integrating ARC
Speedup
- Very aggressive core model
  - Still can’t tolerate all store miss latency
- Block mode slightly better than page mode
  - Cache pollution, less coverage
Program Phase Behavior
- Only benefits initialization program phase
- Some programs initialize throughout execution
Conclusions
- Initializing writes
  - Cause 23% of all misses in 2MB L2
  - Avoid miss with block or page mode install
  - Up to 41% performance improvement
  - Subject to initialization:computation ratio
- Tracking allocation ranges
  - Working set very small (4-8 entries; 64 for parser)
  - Forward/bidirectional/interleaved sweep enables range truncation
Acknowledgments
- Originated as course project: Gordie Bell, Trey Cain, Kevin Lepak
- PHARMsim infrastructure
  - Lead developer: Trey Cain
- Financial and equipment support
  - IBM and Intel Corp
  - National Science Foundation
  - University of Wisconsin
Questions?
Backup Slides
Invalid Memory Traffic
- Real data traffic that transfers invalid data
Initializing Store
- Initial write to a storage location that contains invalid data
Allocation Analysis
- Single dominant allocation vs. numerous moderate allocations
Initialization of Heap
- 88% initialized by store miss
- Relatively little temporal reuse of freed memory
PHARMsim Pipeline
- Pipeline diagram stages: Fetch, Translate, Decode, Execute, Mem, Commit
- Substantially similar to IBM Power4
  - Some instructions “cracked” (1:2 expansion)
  - Others (e.g. lmw) use a microcode stream
- Mem stage
  - Interface to 2-level cache model
  - Sun Gigaplane XB snoopy MP coherence
  - Caches contain values, must remain coherent
- No cheating! No “flat” memory model for reference/redirect
Machine Model
- Unrealistically aggressive model to devalue the impact of store misses
- 8-wide, 6-stage pipeline
- 8K entry combining predictor
- 128 RUU, 64 LSQ entries, 64 write buffers
- 256KB 4-way associative L1D cache
- 64KB 2-way associative L1I
- 2MB 4-way associative L2 unified cache
- All cache blocks are 64 bytes
- L2 latency is 10 cycles
- Memory latency is 70 cycles
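For orientation, the cache geometry implied by these parameters can be worked out with the standard relation sets = size / (associativity x block size). The helper name `num_sets` and the dictionary below are illustrative only.

```python
KB, MB = 1024, 1024 * 1024


def num_sets(size_bytes, ways, block_bytes=64):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * block_bytes)


geometry = {
    "L1D": num_sets(256 * KB, 4),  # 256KB, 4-way
    "L1I": num_sets(64 * KB, 2),   # 64KB, 2-way
    "L2": num_sets(2 * MB, 4),     # 2MB, 4-way unified
}

l2_blocks = (2 * MB) // 64           # blocks the L2 can hold
alloc_blocks = (100 * MB) // 64      # blocks in a 100MB allocation
```

This gives 1024 sets for the L1D, 512 for the L1I, and 8192 for the L2. The L2 holds 32,768 blocks, so a single 100MB allocation (1,638,400 blocks) is 50 times larger than the cache.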