Trace Substitution
Hans Vandierendonck, Hans Logie, Koen De Bosschere
Ghent University
Euro-Par 2003, Klagenfurt

August 27, 2003, Euro-Par

Instruction Fetch
Wide-issue superscalar processors need to fetch multiple branches per cycle
– IPC = 8 implies fetching ~16 instructions/cycle and predicting ~3 branches/cycle
– Multi-ported instruction cache?
Trace cache:
– Packs fetch groups in a trace
– Trace tagged with PC, path, next fetch PC
– Multiple branch predictor (MBP) predicts branch directions

The Trace Cache
[Figure: block diagram of the fetch unit. Components: instruction cache, trace cache, MBP, MUX, fill unit. Labeled signals: fetch address, predicted path, hit/miss, select, predicted trace, predicted instructions, next address, instructions. Annotation: only executed paths!]

Overview
Observation
– Trace cache misses are (sometimes) branch mispredictions
Trace Substitution
– How to make use of it
Evaluation
– Is it worth it?
Conclusion

Observation
Multiple branch predictor affects trace cache:
– Non-perfect branch predictors reduce the trace cache hit rate
– FIPA correlates better with TC hit rate than with MBP accuracy
TC: 16K traces, 4-way set-associative, path associativity
MGAg, Mgshare: 12-bit history
repeat: 8 Kbit hybrid, accessed 3x

TC Misses Are a Tell-Tale for MBP Misses
Trace cache misses coincide with branch mispredictions, e.g.:
– 16K-entry trace cache, 12-bit MGAg (high accuracy, low coverage):
   – 84.9% of TC misses are also MBP misses
   – 37.6% of MBP misses are also TC misses
– 256-entry trace cache, 12-bit MGAg (low accuracy, higher coverage):
   – 25.1% of TC misses are also MBP misses
   – 55.9% of MBP misses are also TC misses
This work: use TC misses to detect MBP misses and fix them
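
The two percentages per configuration can be read as the accuracy and coverage of using a TC miss as a detector for an MBP miss. A minimal sketch of that computation follows; the function name and the event counts are hypothetical and chosen only for illustration, they are not measurements from the talk.

```python
def detection_quality(both_miss, tc_miss_only, mbp_miss_only):
    """Treat a TC miss as a detector for an MBP miss.

    accuracy: fraction of TC misses that are also MBP misses
    coverage: fraction of MBP misses that are also TC misses
    """
    tc_misses = both_miss + tc_miss_only
    mbp_misses = both_miss + mbp_miss_only
    accuracy = both_miss / tc_misses if tc_misses else 0.0
    coverage = both_miss / mbp_misses if mbp_misses else 0.0
    return accuracy, coverage


# Hypothetical counts (assumption, not data from the slides):
# 30 accesses miss in both, 10 miss only in the TC, 70 only in the MBP.
accuracy, coverage = detection_quality(30, 10, 70)
print(f"accuracy={accuracy:.2f} coverage={coverage:.2f}")
```

A large trace cache rarely misses for capacity reasons, so its misses mostly point at mispredictions (high accuracy) but catch fewer of them (low coverage); a small trace cache shows the opposite trade-off.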

Trace Substitution
Assumption: TC miss implies MBP miss
– Correlation between branches implies that some paths never occur
– TC stores only those paths that do occur
If the predicted path is wrong …
– Fetch a different trace
– Override MBP with the MRU trace starting at the fetch PC
   – Detect the MRU trace from the LRU bits stored in the TC
   – No substitution is applied if no such trace exists
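
The lookup described above can be sketched in software as follows. This is a toy model under stated assumptions, not the hardware design from the talk: the class and method names are hypothetical, sets are indexed by `pc % num_sets`, recency is tracked with an ordered dictionary instead of LRU bits, and a substitution leaves the recency order unchanged.

```python
from collections import OrderedDict


class TraceCache:
    """Toy trace cache with path associativity (hypothetical model)."""

    def __init__(self, num_sets=64, ways=4):
        self.num_sets = num_sets
        self.ways = ways
        # Each set maps (pc, path) -> trace; the last entry is the MRU one.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _set_for(self, pc):
        return self.sets[pc % self.num_sets]

    def fill(self, pc, path, trace):
        """Fill unit: store an executed trace, evicting the LRU way if full."""
        entries = self._set_for(pc)
        entries[(pc, path)] = trace
        entries.move_to_end((pc, path))
        if len(entries) > self.ways:
            entries.popitem(last=False)  # drop the least recently used trace

    def fetch(self, pc, predicted_path):
        """Return (outcome, path, trace) for one fetch unit access."""
        entries = self._set_for(pc)
        key = (pc, predicted_path)
        if key in entries:
            entries.move_to_end(key)
            return "hit", predicted_path, entries[key]
        # TC miss: assume the MBP is wrong and override it with the
        # most recently used trace that starts at the same fetch PC.
        mru_key = next((k for k in reversed(entries) if k[0] == pc), None)
        if mru_key is not None:
            return "substituted", mru_key[1], entries[mru_key]
        # No trace starts at this PC: no substitution is applied.
        return "miss", predicted_path, None


tc = TraceCache(num_sets=4, ways=4)
tc.fill(0x40, (True, False, True), ["i0", "i1"])
tc.fill(0x40, (False, False, True), ["j0", "j1"])
print(tc.fetch(0x40, (True, True, True))[0])
```

In this sketch a plain miss would fall back to the instruction cache, as in the fetch unit without substitution.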

Implementation
[Figure: block diagram of the fetch unit extended for trace substitution. Components: instruction cache, trace cache, MBP, MUX, fill unit. Labeled signals: fetch address, predicted path, hit/miss, hit, MRU hit, MRU, select, predicted trace, predicted instructions, next address, instructions.]

Evaluation Setup
Benchmarks
– SPECint95 (except compress, go), reference inputs
– 500 million instructions from start of program
– Compiled for Alpha ISA, Compaq C compiler, -O4
Fetch Unit
– TC: 1 trace = 16 instructions, 3 conditional branches; trace ends at system call, indirect jump
– TC: 4-way set-associative, path associativity
– MBP: MGAg, varying history length
– Instruction cache: 32K, 2-way, 32-byte blocks, LRU
Metric
– FIPA = fetched instructions per fetch unit access

Evaluation (1)
Observations:
– Gap between MGAg and perfect prediction increases with TC size
– 20-40% of the gap is filled with trace substitution
– Substitution applies only on a TC miss, so the performance increase drops with TC size
TC: 4-way set-associative
MGAg: 12-bit history

Evaluation (2)
Observations:
– Trace substitution compensates for a poor branch predictor
– No history with substitution ~ MGAg with 10-bit history
– Improvement drops with a more accurate predictor
TC: 256 traces, 4-way

Accuracy vs. Usage
Definitions:
– Usage = substitutions per fetch unit access
– Accuracy = fraction of substitutions that are correct
Note
– Accuracy is limited because the correct-path trace is not always present!
TC: 256 traces, 4-way
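
The two definitions can be tracked with a pair of counters, as in the sketch below. The class name, its fields, and the recorded events are hypothetical and serve only to make the ratios concrete; they are not results from the talk.

```python
from dataclasses import dataclass


@dataclass
class SubstitutionStats:
    """Counters for the usage and accuracy metrics (hypothetical helper)."""

    accesses: int = 0       # fetch unit accesses
    substitutions: int = 0  # accesses where a trace was substituted
    correct: int = 0        # substitutions that fetched the correct path

    def record(self, substituted, was_correct=False):
        self.accesses += 1
        if substituted:
            self.substitutions += 1
            if was_correct:
                self.correct += 1

    @property
    def usage(self):
        # Usage = substitutions per fetch unit access
        return self.substitutions / self.accesses if self.accesses else 0.0

    @property
    def accuracy(self):
        # Accuracy = fraction of substitutions that are correct
        return self.correct / self.substitutions if self.substitutions else 0.0


# Hypothetical event stream: 10 accesses, 4 substitutions, 1 of them correct.
stats = SubstitutionStats()
for substituted, ok in [(True, True), (True, False), (True, False), (True, False)]:
    stats.record(substituted, ok)
for _ in range(6):
    stats.record(False)
print(stats.usage, stats.accuracy)
```

A substitution counts as incorrect here even when the correct-path trace was never in the cache, which is the reason the slide gives for the accuracy ceiling.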

Conclusion
Proposed trace substitution
– TC miss flags MBP miss
   – Not always correct, not all MBP misses found
– Fetch MRU trace instead: cheap implementation
Results in
– Consistent performance improvement
   – No history + substitution ~ MGAg with 10-bit history
   – In other cases: 0.2 instructions/access, or the same performance as with a 16 times smaller MBP
– Most effective when MBP or TC is small