Data Reuse in Embedded Processors
Peter Trenkle
CPE631 Project Presentation

Problem Statement
Multi-media embedded applications have many recurring, time-consuming, long-latency instructions
–Floating point operations
–Time-consuming instructions (multiplies and divides), which can cause cycle delays in embedded processors

Problem Statement
–Because of the demand for more portable computing power, power consumption is a major design constraint in embedded systems, so being able to lower the clock speed is important
–Long-latency instructions can cause data hazards, which decrease performance

Goals
Develop a methodology to increase the performance of embedded applications
–Reduce the need to execute a complete multiply or divide instruction, which creates opportunities for program speedup
–Lower the embedded system’s clock frequency, reducing power consumption
–Reduce the number of data hazards caused by long latencies

Applications of Solution
Image processing
–Low local entropy of processed data sets
Speech encoding
–Human speech characteristics
High-speed signal processing
–Values may change very little over a short run, so repeated calculations can be avoided

Solution = Data Reuse
Establish a memo table of a set length on an ARM processor that holds the operands and results of past multiply and/or divide instructions
Send the operands to both the memo table and the multiply/division unit; on a memo table hit, the multi-cycle instruction completes in one clock cycle

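The cycle accounting implied by this slide can be sketched as follows. This is only an illustration: the function name `total_cycles` and the `unit_latency` parameter are hypothetical, and the latency of the multiply/division unit is left as an input because the presentation does not state one.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles spent on multiply/divide instructions for one run.
   A memo table hit is charged one cycle; a miss is charged the full
   latency of the multiply/division unit (an assumed input). */
uint64_t total_cycles(uint64_t hits, uint64_t misses, unsigned unit_latency)
{
    assert(unit_latency >= 1);
    return hits * 1u + misses * (uint64_t)unit_latency;
}
```

For example, with an assumed 4-cycle unit, 3 hits and 2 misses cost 11 cycles instead of the 20 cycles needed without a memo table.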
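Diagram of Memo Table
[Figure: block diagram with Operand 1 and Operand 2 as inputs, a Multiply/Division Unit block and a Memo Table block, signals labeled “Operation Complete” and “Hit/Miss”, and an output labeled Result]
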
Definition of the Memo Table
The memo table is set up as a lookup table that holds the most recently used entries
Each entry consists of a long tag, made up of the two operands, and the result
Lookup and calculation are done in parallel to avoid adding latency

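A minimal software sketch of such a table is shown below, assuming 32-bit unsigned operands and a table of 8 entries. All names (`memo_table`, `memo_lookup`, `memo_insert`, `memo_mul`) are hypothetical, and the round-robin overwrite is a simplification chosen for brevity. In hardware the lookup and the multiply would run in parallel; here the multiply simply runs when the lookup misses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEMO_ENTRIES 8  /* table length; an arbitrary choice for this sketch */

/* One entry: the tag is the operand pair, the payload is the result. */
typedef struct {
    int      valid;
    uint32_t op1, op2;
    uint32_t result;
} memo_entry;

typedef struct {
    memo_entry entry[MEMO_ENTRIES];
    unsigned   next;  /* next slot to overwrite (simple round-robin) */
} memo_table;

void memo_init(memo_table *t)
{
    assert(t != NULL);
    memset(t, 0, sizeof *t);
}

/* Returns 1 and writes *result on a hit, 0 on a miss. */
int memo_lookup(const memo_table *t, uint32_t op1, uint32_t op2, uint32_t *result)
{
    for (unsigned i = 0; i < MEMO_ENTRIES; i++) {
        const memo_entry *e = &t->entry[i];
        if (e->valid && e->op1 == op1 && e->op2 == op2) {
            *result = e->result;
            return 1;
        }
    }
    return 0;
}

/* Stores an operand pair and its result, overwriting the oldest slot. */
void memo_insert(memo_table *t, uint32_t op1, uint32_t op2, uint32_t result)
{
    t->entry[t->next] = (memo_entry){ 1, op1, op2, result };
    t->next = (t->next + 1) % MEMO_ENTRIES;
}

/* Multiply through the table; *hit reports whether the table supplied the result. */
uint32_t memo_mul(memo_table *t, uint32_t a, uint32_t b, int *hit)
{
    uint32_t r = 0;
    *hit = memo_lookup(t, a, b, &r);
    if (!*hit) {
        r = a * b;
        memo_insert(t, a, b, r);
    }
    return r;
}
```

Because the tag is the ordered operand pair, this sketch treats (6, 7) and (7, 6) as different entries.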
Constraints of the Memo Table
Trivial calculations, such as multiplying by 0, are not logged in the table
Trivial calculations can be handled by the execution unit
A lookup in which an operand is the negative of the corresponding operand stored in the table still results in a hit

Current Implementations
One paper deals with this concept: “Accelerating Multi-Media Processing by Implementing Memoing in Multiplication and Division Units” (Citron, Feitelson, and Rudolph)
The paper dealt with the Pentium Pro, Alpha 21164, UltraSPARC-II, and MIPS R10000, leading microprocessors at the time

Experiment Configuration
A modified sim-safe application saves all instructions to a file (safet)
A C program reads in that data and simulates the hit rates of selected instructions as if they were loaded into the memo table (insomnia)
Floating-point-intensive MiBench benchmarks were used (rsynth, lame)

Configuration: Safet
Modified version of sim-safe (which performs functional simulation and checks for correct memory references); command-line options specify a log file and the number of instructions to record
Creates 300 MB to 4 GB of opcode and operand data; results are discarded
Shows most instructions run by the benchmark

Configuration: Insomnia
Insomnia allows specification of the log file, number of instructions per log file, replacement policy, number of entries in the memo table, number of log files, and opcode to be observed
Insomnia returns the number of times the opcode was called, number of memo table hits, number of zero operands, and number of negative operand hits

Configuration: Benchmarks
Uses MiBench ARM processor benchmarks
–Rsynth – text-to-speech encoder; the program executes 82 million instructions to encode a review of “Apocalypse Now” as speech
–Lame – WAV to MP3 encoder
In both benchmarks, over 20% of all instructions are floating point, making them prime candidates for a memo table implementation

Experiments Run
The opcode chosen for the experiments was 102 (MUL)
It was run with the following table lengths: 4, 8, 16, 32, 64, 128, 256
Three replacement policies were run: FIFO, LRU, and Random

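The three replacement policies can be compared with a small trace-driven counter like the one below. This is a sketch under stated assumptions, not the insomnia source: the names `repl_policy`, `operand_pair`, and `count_hits` are hypothetical, operands are assumed to be 32-bit values, and the table is searched fully associatively.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef enum { POLICY_FIFO, POLICY_LRU, POLICY_RANDOM } repl_policy;

typedef struct { uint32_t op1, op2; } operand_pair;

#define MAX_ENTRIES 256  /* largest table length used in the experiments */

/* Counts memo table hits for a trace of operand pairs. */
size_t count_hits(const operand_pair *trace, size_t n,
                  size_t entries, repl_policy policy)
{
    operand_pair table[MAX_ENTRIES];
    uint64_t     stamp[MAX_ENTRIES];  /* FIFO: insertion time; LRU: last-use time */
    size_t   used = 0, hits = 0;
    uint64_t now = 0;

    assert(entries >= 1 && entries <= MAX_ENTRIES);

    for (size_t i = 0; i < n; i++) {
        size_t j, victim;
        now++;
        for (j = 0; j < used; j++)
            if (table[j].op1 == trace[i].op1 && table[j].op2 == trace[i].op2)
                break;
        if (j < used) {                    /* hit */
            hits++;
            if (policy == POLICY_LRU)
                stamp[j] = now;            /* refresh recency on use */
            continue;
        }
        if (used < entries) {              /* miss with a free slot */
            victim = used++;
        } else if (policy == POLICY_RANDOM) {
            victim = (size_t)rand() % entries;
        } else {                           /* FIFO or LRU: evict the oldest stamp */
            victim = 0;
            for (j = 1; j < entries; j++)
                if (stamp[j] < stamp[victim])
                    victim = j;
        }
        table[victim] = trace[i];
        stamp[victim] = now;
    }
    return hits;
}
```

FIFO and LRU differ only in whether a hit refreshes the timestamp, so the same eviction scan serves both. On the five-multiply trace A, B, A, C, A with a 2-entry table, FIFO scores one hit and LRU scores two.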
Results
Opcode 102 (MUL) from rsynth has been tested
Rsynth has over 82 million instructions
Opcode 102 accounts for only 134,000 entries

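The share of MUL in the instruction stream follows directly from these two counts. The helper below is a hypothetical illustration of that arithmetic; using the rounded figures from the slide (134,000 out of 82 million) it gives roughly 0.16%.

```c
#include <assert.h>
#include <stdint.h>

/* Percentage of all executed instructions that carry a given opcode. */
double percent_of_total(uint64_t opcode_count, uint64_t total_instructions)
{
    assert(total_instructions > 0);
    return 100.0 * (double)opcode_count / (double)total_instructions;
}
```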
Results from LRU Replacement
Results from FIFO Replacement
Results from Random Replacement
Analysis of Results
The order of the multiplications helped the hit rates of the smaller memo tables
Example:
–…..
–…..
With this operand ordering, even a single-entry memo table would have a significant hit rate

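The effect of operand ordering on a single-entry table can be sketched as follows. A one-entry table always holds the most recent operand pair, so a multiply hits exactly when it repeats the pair of the multiply immediately before it. The names `mul_operands` and `single_entry_hits` are hypothetical, and any operand values used with this function are made up for illustration, not taken from the rsynth trace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int32_t op1, op2; } mul_operands;

/* Hits seen by a one-entry memo table: each multiply hits only if it
   repeats the operand pair of the multiply immediately before it. */
size_t single_entry_hits(const mul_operands *trace, size_t n)
{
    size_t hits = 0;
    assert(trace != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        if (trace[i].op1 == trace[i - 1].op1 &&
            trace[i].op2 == trace[i - 1].op2)
            hits++;
    return hits;
}
```

A made-up trace in which the same pair appears in back-to-back runs, such as (5, 9), (5, 9), (5, 9), (2, 7), (2, 7), (5, 9), scores 3 hits out of 6 multiplies even with one entry.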
Analysis of Results
For more reliable results, other benchmarks with more representative operand ordering should be tested
MUL accounts for less than 1% of all operations, while FP ADD accounts for close to 20%; a memo table could possibly be used to optimize FP ADD instead

Conclusions
For future tests, the number of operands present in the code should also be analyzed to determine the best instruction to memoize
The potential for better performance exists, but verifying it fully requires testing many different applications