Simulation of Decode Filter Cache Using the SimpleScalar Simulator
Presented by Fei Hong
Motivation & Goals
- Instruction fetch and decode are the major on-chip power consumers.
- Reduce power consumption by reducing the number of instruction fetches and decodes.
- Simulate the decode filter cache (DFC) architecture using SimpleScalar.
- Evaluate the performance of the DFC.
Prediction Mechanism
- Each sector in the DFC has the following fields: (tag, sector_valid, next_address).
- If A is not equal to C, a different control path will be taken: tag(A) != tag(C) (1)
- A and B are accessed consecutively. If they belong to a small loop: tag(A) == tag(B) (2)
- Based on (1) and (2), the prediction for the next fetch is: tag(C) == tag(B) (3)
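The sector fields listed above can be illustrated with a small Python sketch. It shows one possible reading of how next_address could drive the prediction: the next fetch is predicted to hit the DFC when the recorded next address has the same tag as the current address, which is the small-loop case. The class and function names (Sector, PredictiveDFC, record, predict_next_hit) and the address-to-tag mapping are assumptions made for illustration; they are not taken from the actual SimpleScalar modification. The geometry (4 B instructions, 4 instructions per sector, 32 sectors) comes from the memory hierarchy parameters slide.

```python
from dataclasses import dataclass

# Geometry from the parameter slide: 4 B instructions,
# 4 decoded instructions per sector, 32 direct-mapped sectors.
INSTR_BYTES = 4
INSTRS_PER_SECTOR = 4
NUM_SECTORS = 32
SECTOR_SPAN = INSTR_BYTES * INSTRS_PER_SECTOR  # 16 B of address space per sector


def sector_index(addr: int) -> int:
    """Direct-mapped index of the sector that holds addr (assumed mapping)."""
    return (addr // SECTOR_SPAN) % NUM_SECTORS


def tag(addr: int) -> int:
    """Address bits above the index and offset bits (assumed mapping)."""
    return addr // (SECTOR_SPAN * NUM_SECTORS)


@dataclass
class Sector:
    # The three fields named on the slide.
    tag: int = 0
    sector_valid: bool = False
    next_address: int = 0


class PredictiveDFC:
    """Hypothetical model: tracks sector metadata only, not decoded instructions."""

    def __init__(self) -> None:
        self.sectors = [Sector() for _ in range(NUM_SECTORS)]

    def record(self, addr: int, next_addr: int) -> None:
        """Fill the sector for addr and remember which address was fetched next."""
        s = self.sectors[sector_index(addr)]
        s.tag = tag(addr)
        s.sector_valid = True
        s.next_address = next_addr

    def predict_next_hit(self, addr: int) -> bool:
        """Predict a DFC hit for the next fetch when the recorded successor
        shares the current address's tag (the small-loop case)."""
        s = self.sectors[sector_index(addr)]
        if not (s.sector_valid and s.tag == tag(addr)):
            return False  # sector not present: no basis for predicting a hit
        return tag(s.next_address) == tag(addr)
```

In this sketch, a predicted miss would send the next fetch straight to the L1 I-cache, and a predicted hit would probe the DFC first. How the real design handles mispredictions is not stated on the slide.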
Working Process
The Platform
- Host computer: ACPI x86-based PC
- Host computer operating system: Microsoft Windows Vista Ultimate
- Virtual machine: VMware Workstation version 6.03
- Linux operating system: Fedora Core 6
- Simulator: SimpleScalar version 3.0
Work done so far
- Set up the platform.
- Read the SimpleScalar source code.
- Applied my DFC structure and working process to SimpleScalar.
- Found benchmarks and compiled them on the platform.
- Ran simulations using the given memory hierarchy parameters.
MiBench
- dijkstra: constructs a large graph in an adjacency matrix representation, then calculates the shortest path between every pair of nodes using repeated applications of Dijkstra’s algorithm.
- stringsearch: searches for given words in phrases using a case-insensitive comparison algorithm.
- rijndael encrypt/decrypt: the block cipher selected by the National Institute of Standards and Technology as the Advanced Encryption Standard (AES).
- CRC32: performs a 32-bit Cyclic Redundancy Check (CRC) on a file. CRC checks are often used to detect errors in data transmission.
Memory hierarchy parameters
Parameter      Value
Instr. size    4B
DFC            direct-mapped, 32 sectors, 4 decoded instr. per sector, 8B per decoded instr.
L1 I-cache     16KB, 2-way, 32B line, 1-cycle hit latency
L1 D-cache     8KB, 2-way, 32B line, 1-cycle hit latency
Memory         30-cycle latency
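The parameters above imply some simple geometry, worked through in the Python sketch below. It derives the DFC's decoded-instruction storage and address split, and the number of sets in each L1 cache. The 32-bit address width and the LRU replacement letter are assumptions, since the slide states neither. The spec string follows the name:nsets:bsize:assoc:repl form that SimpleScalar's cache options use, as far as I recall; check it against the simulator's own help output before relying on it.

```python
def cache_sets(size_bytes: int, line_bytes: int, assoc: int) -> int:
    """Number of sets in a set-associative cache."""
    return size_bytes // (line_bytes * assoc)


def cache_spec(name: str, size_bytes: int, line_bytes: int, assoc: int,
               repl: str = "l") -> str:
    """Build a name:nsets:bsize:assoc:repl string.
    repl='l' (LRU) is an assumption; the slide gives no replacement policy."""
    nsets = cache_sets(size_bytes, line_bytes, assoc)
    return f"{name}:{nsets}:{line_bytes}:{assoc}:{repl}"


def dfc_geometry(sectors: int = 32, instrs_per_sector: int = 4,
                 decoded_bytes: int = 8, instr_bytes: int = 4,
                 addr_bits: int = 32) -> dict:
    """Derive DFC sizes from the slide's parameters.
    addr_bits=32 is an assumption, not stated on the slide."""
    span = instrs_per_sector * instr_bytes       # address bytes covered per sector
    offset_bits = (span - 1).bit_length()        # bits selecting a byte in a sector
    index_bits = (sectors - 1).bit_length()      # bits selecting a sector
    return {
        "decoded_storage_bytes": sectors * instrs_per_sector * decoded_bytes,
        "address_span_bytes": sectors * span,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": addr_bits - index_bits - offset_bits,
    }


if __name__ == "__main__":
    print(dfc_geometry())
    print(cache_spec("il1", 16 * 1024, 32, 2))
    print(cache_spec("dl1", 8 * 1024, 32, 2))
```

With these numbers the DFC holds 1 KB of decoded instructions and covers 512 B of instruction address space, compared with 16 KB for the L1 I-cache.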
Simulation results
[Figure: % reduction in instruction fetches and decodes]
Simulation results
[Figure: prediction hit rate]
Simulation results
Benchmarks (columns): dijkstra, stringsearch, rijndael, CRC32
Statistics (rows): sim_num_insn, il1.accesses, il1.hits, il1.misses, il1.miss_rate, dfc.accesses, dfc.hits, dfc.misses, dfc.miss_rate
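The derived statistics relate to the raw counters in a straightforward way, sketched below in Python. The miss-rate formula is the usual misses/accesses ratio. The reduction formula is an assumption: it treats every DFC hit as one avoided I-cache fetch and one avoided decode, and takes the total fetch count as an input, because the slides do not define the metric precisely. The counter values in the example are hypothetical placeholders, not measured results.

```python
def miss_rate(misses: int, accesses: int) -> float:
    """Misses divided by accesses; 0.0 when there were no accesses."""
    return misses / accesses if accesses else 0.0


def fetch_decode_reduction_pct(dfc_hits: int, total_fetches: int) -> float:
    """Percentage of fetches served by the DFC, under the assumption that
    each DFC hit avoids one I-cache access and one decode."""
    return 100.0 * dfc_hits / total_fetches if total_fetches else 0.0


if __name__ == "__main__":
    # Hypothetical counters for illustration only (not results from the deck).
    stats = {"dfc.accesses": 1000, "dfc.hits": 900, "dfc.misses": 100}
    print("dfc.miss_rate =", miss_rate(stats["dfc.misses"], stats["dfc.accesses"]))
    print("reduction % =",
          fetch_decode_reduction_pct(stats["dfc.hits"], stats["dfc.accesses"]))
```

If the prediction mechanism lets some fetches bypass the DFC entirely, dfc.accesses is smaller than the total number of fetches, and the total should instead come from a counter that covers every fetch.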
Conclusion
- The DFC stores decoded instructions and can be very small and energy-efficient.
- A DFC hit eliminates both the access to the much larger instruction cache and the entire decode step.
- The simulation results show that most instruction fetches and decodes can be eliminated by using the DFC.
- The DFC is therefore an effective way to reduce the power consumption of embedded processors.
Thank you!