Memory Hierarchy Adaptivity: An Architectural Perspective
Alex Veidenbaum
AMRM Project, sponsored by DARPA/ITO
Opportunities for Adaptivity
  Cache organization
  Cache performance “assist” mechanisms
  Hierarchy organization
  Memory organization (DRAM, etc.)
  Data layout and address mapping
  Virtual Memory
  Compiler assist
Opportunities - Cont’d
Cache organization: adapt what?
  –Size: NO
  –Associativity: NO
  –Line size: MAYBE
  –Write policy: YES (fetch, allocate, write-back/write-through)
  –Mapping function: MAYBE
Opportunities - Cont’d
Cache “Assist”: prefetch, write buffer, victim cache, etc., between different levels.
Adapt what?
  –Which mechanism(s) to use
  –Mechanism “parameters”
Opportunities - Cont’d
Hierarchy Organization:
  –Where are cache assist mechanisms applied?
      Between L1 and L2
      Between L1 and Memory
      Between L2 and Memory
  –What are the data-paths like?
      Is prefetch, victim cache, write buffer data written into the cache?
      How much parallelism is possible in the hierarchy?
Opportunities - Cont’d
Memory Organization
  –Cached DRAM?
  –Interleave change?
  –PIM
Opportunities - Cont’d
Data layout and address mapping
  –In theory, something can be done but…
  –MP case is even worse
  –Adaptive address mapping or hashing based on ???
Opportunities - Cont’d
Compiler assist
  –Can select initial configuration
  –Pass hints on to hardware
  –Generate code to collect run-time info and adjust execution
  –Adapt configuration after being “called” at certain intervals during execution
  –Select/run-time optimize code
Opportunities - Cont’d
Virtual Memory can adapt
  –Page size?
  –Mapping?
  –Page prefetching/read ahead
  –Write buffer (file cache)
  –The above under multiprogramming?
Applying Adaptivity
What drives adaptivity?
  Performance impact, overall and/or relative
  “Effectiveness”, e.g. miss rate
  Processor stall introduced
  Program characteristics
When to perform adaptive action
  –Run time: use feedback from hardware
  –Compile time: insert code, set up hardware
Where to Implement
In Software: compiler and/or OS
  +(Static) Knowledge of program behavior
  +Factored into optimization and scheduling
  -Extra code, overhead
  -Lack of dynamic run-time information
  -Rate of adaptivity
  -Requires recompilation, OS changes
Where to Implement - Cont’d
Hardware
  +Dynamic information available
  +Fast decision mechanism possible
  +Transparent to software (thus safe)
  –Delay, clock rate limit algorithm complexity
  –Difficult to maintain long-term trends
  –Little knowledge of program behavior
Where to Implement - Cont’d
Hardware/software
  +Software can set coarse hardware parameters
  +Hardware can supply software dynamic info
  +Perhaps more complex algorithms can be used
  –Software modification required
  –Communication mechanism required
Current Investigation
L1 cache assist
  –See wide variability in assist-mechanism effectiveness
      Between individual programs
      Within a program as a function of time
  –Propose hardware mechanisms to select between assist types and allocate buffer space
  –Give compiler an opportunity to set parameters
Mechanisms Used
Prefetching
  –Stream Buffers
  –Stride-directed, based on address alone
  –Miss Stride: prefetch the same address using the number of intervening misses
Victim Cache
Write Buffer
All are placed after L1
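As an illustration of stride-directed prefetch "based on address alone", here is a minimal sketch that looks only at the stream of miss addresses, with no program-counter information. The class name, the `degree` parameter, and the rule of waiting for the same stride to repeat once are assumptions made for this example, not details taken from the slides.

```python
class StridePrefetcher:
    """Hypothetical stride detector driven by miss addresses only (no PC)."""

    def __init__(self, degree=1):
        self.degree = degree      # assumed knob: how many strides ahead to prefetch
        self.last_addr = None     # address of the previous miss
        self.last_stride = None   # stride between the two previous misses

    def on_miss(self, addr):
        """Record a miss and return the addresses to prefetch (possibly none)."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Prefetch only once the same non-zero stride has been seen twice in a row.
            if stride != 0 and stride == self.last_stride:
                prefetches = [addr + stride * k for k in range(1, self.degree + 1)]
            self.last_stride = stride
        self.last_addr = addr
        return prefetches


if __name__ == "__main__":
    p = StridePrefetcher(degree=2)
    for a in (100, 132, 164, 170):
        print(a, p.on_miss(a))
```

The first two misses only establish the stride; the third confirms it and triggers prefetches, and an irregular address resets the pattern.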
Mechanisms Used - Cont’d
  A mechanism can be used by itself, or all can be used at once
  Buffer space size and organization are fixed
  No adaptivity involved
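A fixed, non-adaptive configuration of this kind can be sketched for one mechanism, the victim cache. The model below assumes a direct-mapped L1 backed by a small fully associative victim cache with LRU replacement; the class name, the sizes, and the swap-on-victim-hit behavior are illustrative assumptions rather than details from the slides.

```python
from collections import OrderedDict


class L1WithVictimCache:
    """Hypothetical model: direct-mapped L1 plus a fixed-size LRU victim cache."""

    def __init__(self, num_sets=4, victim_entries=2):
        self.num_sets = num_sets
        self.lines = [None] * num_sets    # one resident line address per set
        self.victims = OrderedDict()      # oldest entry first (LRU order)
        self.victim_entries = victim_entries
        self.l1_hits = 0
        self.victim_hits = 0
        self.misses = 0

    def access(self, line_addr):
        """Access one cache line; return 'l1', 'victim', or 'miss'."""
        idx = line_addr % self.num_sets
        if self.lines[idx] == line_addr:
            self.l1_hits += 1
            return "l1"
        evicted = self.lines[idx]
        if line_addr in self.victims:
            # Victim hit: the line moves back into L1 (swap with the evicted line).
            del self.victims[line_addr]
            self.victim_hits += 1
            result = "victim"
        else:
            self.misses += 1
            result = "miss"
        self.lines[idx] = line_addr
        if evicted is not None:
            self.victims[evicted] = True
            if len(self.victims) > self.victim_entries:
                self.victims.popitem(last=False)  # drop the least recently inserted
        return result


if __name__ == "__main__":
    cache = L1WithVictimCache()
    # Lines 0 and 4 conflict in set 0 of a 4-set direct-mapped cache.
    print([cache.access(a) for a in (0, 4, 0, 4)])
```

In this model two lines that conflict in L1 stop missing after their first touch, which is the effect a victim cache targets; the buffer size never changes, so there is no adaptivity.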
Observed Behavior
  Programs benefit differently from each mechanism; none is a consistent winner
  Within a program, the same holds in the time domain: the relative effectiveness of the mechanisms varies over time
Observed Behavior - Cont’d
Both of the above facts indicate a likely improvement from adaptivity
  –Select the better one among the mechanisms
Even more can be expected from adaptively reallocating from the combined buffer pool
  –To reduce stall time
  –To reduce the number of misses
Proposed Adaptive Mechanism
Hardware:
  –a common pool of 2-4 word buffers
  –a set of possible policies, a subset of:
      Stride-directed prefetch
      PC-based prefetch
      History-based prefetch
      Victim cache
      Write buffer
Adaptive Hardware - Cont’d
Performance monitors for each type/buffer
  –misses, stall time on hit, thresholds
Dynamic buffer allocator among mechanisms
Allocation and monitoring policy:
  –Predict future behavior from observed past
  –Observe over a time interval dT, set the allocation for the next interval
  –Save performance trends in next-level tags (< 8 bits)
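The "observe over dT, set for the next interval" policy can be sketched as a function that splits a common buffer pool according to what each mechanism contributed in the interval just observed. Everything specific here is an assumption for illustration: the function name, the use of "useful hits" per mechanism as the monitored quantity, the guaranteed minimum of one buffer per mechanism (so that every mechanism can still be monitored), and the proportional largest-remainder split. The slides do not specify the allocation rule.

```python
def reallocate(pool_size, useful_hits, min_per_mechanism=1):
    """Split `pool_size` buffers among mechanisms for the next interval.

    `useful_hits` maps a mechanism name to the count observed during the
    last interval dT (an assumed metric; misses or stall time could be used).
    """
    names = list(useful_hits)
    if pool_size < min_per_mechanism * len(names):
        raise ValueError("pool too small for the guaranteed minimum")

    alloc = {n: min_per_mechanism for n in names}
    spare = pool_size - min_per_mechanism * len(names)
    total = sum(useful_hits.values())

    if total == 0:
        # No evidence from the last interval: spread spare buffers round-robin.
        for i in range(spare):
            alloc[names[i % len(names)]] += 1
        return alloc

    # Proportional split of the spare buffers, largest remainders rounded up.
    shares = {n: spare * useful_hits[n] / total for n in names}
    for n in names:
        alloc[n] += int(shares[n])
    leftover = pool_size - sum(alloc.values())
    by_remainder = sorted(names, key=lambda n: shares[n] - int(shares[n]), reverse=True)
    for n in by_remainder[:leftover]:
        alloc[n] += 1
    return alloc


if __name__ == "__main__":
    observed = {"prefetch": 30, "victim": 10, "write": 0}
    print(reallocate(8, observed))
```

Predicting the next interval from the last one is the simplest form of "predict future behavior from observed past"; a hardware version would likely use thresholds and saturating counters rather than division.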
Further opportunities to adapt
  L2 cache organization
    –variable-size line
  L2 non-sequential prefetch
  In-memory assists (DRAM)
MP Opportunities
  Even longer latency
  Coherence, hardware or software
  Synchronization
  Prefetch under and beyond the above
    –Avoid coherence if possible
    –Prefetch past synchronization
  Assist Adaptive Scheduling