Data Reuse in Embedded Processors
Peter Trenkle, CPE631 Project Presentation



Problem Statement
Multimedia embedded applications contain many recurring, time-consuming, long-latency instructions:
–Floating-point operations
–Time-consuming instructions (multiplies and divides), which can cause cycle delays in embedded processors

Problem Statement
–Because portable devices demand computing power on a tight energy budget, power consumption is a major design constraint in embedded systems, so being able to lower the clock speed matters
–Long-latency instructions can cause data hazards, decreasing performance

Goals
Develop a methodology to increase embedded application performance:
–Avoid executing a complete multiply or divide instruction when possible, creating opportunities for program speedup
–Allow a lower clock frequency, reducing power consumption
–Reduce the number of data hazards caused by long latencies

Applications of the Solution
Image processing
–Low local entropy of processed data sets
Speech encoding
–Human speech characteristics repeat
High-speed signal processing
–Values may change very little over a short run, saving duplication of instructions

Solution: Data Reuse
Establish a memo table of a set length on an ARM processor that holds the operands and results of past multiply and/or divide instructions
Send the operands to both the memo table and the multiply/divide unit; on a hit in the memo table, a multi-cycle instruction completes in one clock cycle
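The memoing scheme above can be sketched in C as follows. This is a hypothetical software illustration of the hardware idea, not the actual ARM implementation or the insomnia source; the table size, FIFO logging, and `memo_mul` interface are all assumptions made for the example.

```c
#include <stdint.h>
#include <string.h>

#define MEMO_ENTRIES 16

typedef struct {
    int32_t op1, op2;   /* tag: the two operands */
    int64_t result;     /* cached product */
    int     valid;
} MemoEntry;

typedef struct {
    MemoEntry entry[MEMO_ENTRIES];
    int next;           /* FIFO replacement pointer */
} MemoTable;

/* Look up an operand pair. On a hit, return 1 with the cached result
   (the one-cycle path). On a miss, compute the full product, log it
   into the table (FIFO replacement), and return 0. */
static int memo_mul(MemoTable *t, int32_t a, int32_t b, int64_t *out)
{
    for (int i = 0; i < MEMO_ENTRIES; i++) {
        if (t->entry[i].valid &&
            t->entry[i].op1 == a && t->entry[i].op2 == b) {
            *out = t->entry[i].result;      /* hit */
            return 1;
        }
    }
    *out = (int64_t)a * b;                  /* miss: full multiply */
    t->entry[t->next] = (MemoEntry){ a, b, *out, 1 };
    t->next = (t->next + 1) % MEMO_ENTRIES;
    return 0;
}
```

In hardware the table lookup and the multiplier run in parallel, so the loop here stands in for an associative tag compare.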

Diagram of the Memo Table
[Block diagram: Operand 1 and Operand 2 feed both the multiply/division unit and the memo table in parallel; the unit signals "Operation Complete", the table signals "Hit/Miss", and either path delivers the Result]

Definition of the Memo Table
The memo table is set up as a look-up table holding the most recently used entries
Each entry consists of a long tag (the two operands) and the result
Look-up and calculation are done in parallel, so a miss adds no latency

Constraints of the Memo Table
Trivial calculations, such as multiplying by 0, are not logged into the table; the execution unit handles them directly
If one of the operands in the table is referenced by its negation, it still results in a hit
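These two constraints can be sketched as helper predicates. The exact trivial-value set and the sign-patching scheme are assumptions for illustration; the slides only state that trivial cases are excluded and that a negated operand still hits.

```c
#include <stdlib.h>

/* Trivial products (x*0, and here also x*1) are handled directly by
   the execution unit and never logged into the memo table. */
static int is_trivial_mul(int a, int b)
{
    return a == 0 || b == 0 || a == 1 || b == 1;
}

/* Tag match on magnitudes: a lookup of (-a, b) hits an entry (a, b).
   The sign of the cached result is patched on the way out from the
   signs of the incoming operands. */
static int magnitude_hit(int tag_a, int tag_b, long tag_result,
                         int a, int b, long *out)
{
    if (abs(tag_a) == abs(a) && abs(tag_b) == abs(b)) {
        long r = labs(tag_result);
        *out = ((a < 0) != (b < 0)) ? -r : r;
        return 1;
    }
    return 0;
}
```

Matching on magnitude doubles the effective reach of each entry at the cost of a small amount of sign logic.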

Current Implementations
One prior paper deals with this concept: "Accelerating Multi-Media Processing by Implementing Memoing in Multiplication and Division Units" (Citron, Feitelson, and Rudolph)
That paper targeted the Pentium Pro, Alpha 21164, UltraSPARC-II, and MIPS R10000, the leading microprocessors at the time

Experiment Configuration
A modified sim-safe application, safet, saves all executed instructions to a file
A C program, insomnia, reads in the data and simulates memo-table hit rates for selected instructions
Floating-point-intensive MiBench benchmarks were used (rsynth, lame)

Configuration: Safet
A modified version of sim-safe (which performs functional simulation while checking for correct memory references); command-line options specify the log file and the number of instructions to capture
Creates 300 MB to 4 GB of opcode and operand data; instruction results are discarded
Shows most instructions run by the benchmark

Configuration: Insomnia
Insomnia allows specification of the log file, the number of instructions per log file, the replacement policy, the number of memo-table entries, the number of log files, and the opcode to observe
Insomnia returns the number of times the opcode was called, the number of memo-table hits, the number of zero operands, and the number of negative-operand hits
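The counters insomnia reports could be tallied with a loop like the following. This is a hypothetical reconstruction of the tool's inner bookkeeping, not its actual source; the `TraceRec` layout and the `outcome` encoding are invented for the example.

```c
/* Statistics insomnia reports for one observed opcode. */
typedef struct { long calls, hits, zero_ops, neg_hits; } MemoStats;

/* One logged instruction from the safet trace. */
typedef struct { int opcode, op1, op2; } TraceRec;

/* Tally one trace record. `outcome` is the simulated table lookup
   result: 0 = miss, 1 = plain hit, 2 = hit via a negated operand.
   Zero operands are counted separately and never logged. */
static void tally(MemoStats *s, const TraceRec *r, int opcode, int outcome)
{
    if (r->opcode != opcode) return;        /* only the observed opcode */
    s->calls++;
    if (r->op1 == 0 || r->op2 == 0) { s->zero_ops++; return; }
    if (outcome == 1) s->hits++;
    else if (outcome == 2) { s->hits++; s->neg_hits++; }
}
```

Running this over every record in every log file yields the per-opcode hit-rate numbers plotted in the results slides.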

Configuration: Benchmarks
Uses MiBench ARM processor benchmarks:
–rsynth, a text-to-speech encoder; the program executes 82 million instructions encoding a review of "Apocalypse Now" to speech
–lame, a WAV-to-MP3 encoder
Both benchmarks spend over 20% of their total instructions on floating point, making them prime candidates for a memo-table implementation

Experiments Run
The opcode chosen for the experiments was 102 (MUL)
It was run with table lengths of 4, 8, 16, 32, 64, 128, and 256 entries
Three replacement policies were tested: FIFO, LRU, and Random
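The three replacement policies differ only in how they pick the entry to evict when the table is full. A minimal sketch, with the bookkeeping (round-robin pointer for FIFO, last-use timestamps for LRU) assumed rather than taken from the insomnia source:

```c
#include <stdlib.h>

typedef enum { POLICY_FIFO, POLICY_LRU, POLICY_RANDOM } Policy;

/* Pick the victim entry in a full table of n entries.
   FIFO cycles a round-robin pointer; LRU evicts the entry with the
   smallest last-use timestamp; Random picks uniformly. */
static int pick_victim(Policy p, int n, int *fifo_next, const long *last_use)
{
    switch (p) {
    case POLICY_FIFO: {
        int v = *fifo_next;
        *fifo_next = (*fifo_next + 1) % n;  /* advance for next eviction */
        return v;
    }
    case POLICY_LRU: {
        int v = 0;
        for (int i = 1; i < n; i++)
            if (last_use[i] < last_use[v]) v = i;
        return v;
    }
    default:
        return rand() % n;                  /* POLICY_RANDOM */
    }
}
```

FIFO needs only one counter, LRU needs a timestamp per entry, and Random needs no state at all, which is why small hardware tables often favor FIFO or Random.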

Results
Opcode 102 (MUL) from rsynth was tested
rsynth executes over 82 million instructions
Opcode 102 accounts for only about 134,000 of them

Results from LRU Replacement

Results from FIFO Replacement

Results from Random Replacement

Analysis of Results
The ordering of the multiplications helped the hit-rate results of the smaller memo tables
Example:
–…
–…
With that operand ordering, even a single-entry memo table would have a significant hit rate
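The single-entry claim above can be illustrated directly: when identical operand pairs arrive back to back, a one-entry table hits on every repeat. The operand stream below is made up for the example (the actual rsynth values are not shown in the slides).

```c
/* Hypothetical operand stream with heavy back-to-back repetition,
   standing in for the rsynth MUL trace (values are invented). */
static const int stream[][2] = {
    {3, 5}, {3, 5}, {3, 5}, {2, 9}, {2, 9}, {3, 5}, {3, 5}, {7, 4},
};

/* Hit rate of a single-entry memo table: a hit occurs whenever a
   pair matches the immediately preceding one. */
static double single_entry_hit_rate(const int (*s)[2], int n)
{
    int hits = 0;
    for (int i = 1; i < n; i++)
        if (s[i][0] == s[i - 1][0] && s[i][1] == s[i - 1][1])
            hits++;
    return (double)hits / n;
}
```

For this stream, 4 of the 8 multiplications repeat their predecessor, so even one entry captures half the calls, which is why the small-table results look better than a uniformly mixed trace would allow.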

Analysis of Results
For better results, other benchmarks with more representative operand orderings should be tested
MUL is less than 1% of the total operations, while FP ADD is close to 20%; the memo table could instead be applied there

Conclusions
Future tests should also analyze the number of distinct operand values present in the code, to determine the best instruction to memoize
The chance for better performance exists, but many different applications are needed to verify it completely