CACHE-DSP Tool -- How to avoid having a SHARC thrashing on a cache-line. M. Smith, University of Calgary, Canada; B. Howse, Cell-Loc, Calgary, Canada.


Cache-DSP Tool 2/28 -- Series of Talks and Workshops
- CACHE-DSP: talk on a simple process tool to identify cache conflicts in DSP code.
- SQUISH-DSP: talk on using a project management tool to automate identification of parallel DSP processor instructions.
- SHARC Ecology 101: workshop showing how to systematically write parallel 2106X code.
- SHARC Ecology 201: workshop on the SQUISH-DSP and CACHE-DSP tools.

Cache-DSP Tool 3/28 -- Concepts to be discussed
- Concept behind the 2106X instruction cache
- Cache operation
- Introduction to CACHE THRASHING
- Solutions that avoid a cache thrash without delaying product release
- Basis of the Cache-DSP tool
- Acknowledgements

Cache-DSP Tool 4/28 -- Purpose of the SHARC instruction cache
- Harvard processor architecture: one bus for fetching instructions, another bus for fetching data. The twin-bus architecture avoids instruction/data fetch conflicts.
- DSP algorithms are addition and multiplication intensive, and multiple simultaneous accesses to data structures are typically needed. The twin-bus architecture does not avoid data/data fetch conflicts.

Cache-DSP Tool 5/28 -- Solutions to data/data fetch conflicts
- Cache a single instruction (single-instruction loop): frees up the instruction bus for use as a data bus fetching from separate data memory. Very limited in application.
- Three-bus processor: expensive to implement for all memory.
- The ADSP21XXX approach is to make a three-bus processor architecture available for a limited number of instructions on an 'as-needed' basis -- the instruction cache.

Cache-DSP Tool 6/28 -- Example C code
- Converts a temperature array from Celsius to Fahrenheit.
- The assembly code has 6 PM( ) operations.

Cache-DSP Tool 7/28 -- Example: pipeline operation (Fetch | Decode | Execute)
Cycle 1: Fetch F1=, r0=dm (instr. on PM)
Cycle 2: Fetch F13=, r2=dm, pm= (instr. on PM) | Decode F1=, r0=dm
Cycle 3: Fetch F8=, r0=dm (instr. on PM) | Decode F13=, r2=dm, pm= | Execute F1=, r0=dm (data on DM)
Cycle 4: Fetch F12=, r2=dm, pm= (instr. on PM) | Decode F8=, r0=dm | Execute F13=, r2=dm, pm= (data on DM and PM)
Cycle 5: Decode F12=, r2=dm, pm= | Execute F8=, r0=dm (data on DM)

Cache-DSP Tool 8/28 -- First time round the loop: STALL
Cycle 1: Fetch F1=, r0=dm (instr. on PM)
Cycle 2: Fetch F13=, r2=dm, pm= (instr. on PM) | Decode F1=, r0=dm
Cycle 3: Fetch F8=, r0=dm (instr. on PM) | Decode F13=, r2=dm, pm= | Execute F1=, r0=dm (data on DM)
Cycle 4: Fetch F12=, r2=dm, pm= (instr. on PM, STALLS and goes to cache) | Decode F8=, r0=dm | Execute F13=, r2=dm, pm= (data on DM and PM conflicts with the fetch)

Cache-DSP Tool 9/28 -- Second time round the loop: 3-bus operation
Cycle 1: Fetch F1=, r0=dm (instr. on PM)
Cycle 2: Fetch F13=, r2=dm, pm= (instr. on PM) | Decode F1=, r0=dm
Cycle 3: Fetch F8=, r0=dm (instr. on PM) | Decode F13=, r2=dm, pm= | Execute F1=, r0=dm (data on DM)
Cycle 4: Fetch F12=, r2=dm, pm= (instr. from cache) | Decode F8=, r0=dm | Execute F13=, r2=dm, pm= (data on DM and PM, no conflict)
Cycle 5: Decode F12=, r2=dm, pm= | Execute F8=, r0=dm (data on DM)

Cache-DSP Tool 10/28 -- Instruction cache characteristics
- 32 cache locations.
- 32 locations may look small, but the cache is used ONLY when a data access on the PM bus conflicts with an instruction access on the PM bus.
- Typically satisfactory for tight DSP algorithm loops of up to 100+ atomic operations.

Cache-DSP Tool 11/28 -- MAJOR LIMITATION POSSIBLE
- The cache is 2-way set associative: the 32 cache locations are grouped in 16 sets of 2.
- An instruction's storage location in the cache is determined by the last 4 bits of its address: instruction N is stored at cache line N modulo 16.
- Each line also has a least-recently-used (LRU) bit; the LRU instruction is replaced on a cache miss.
- This makes it possible to induce a CACHE THRASH.

Cache-DSP Tool 12/28 -- Simple example
- Assume the cache is 2-way associative with 8 (not 32) locations, so the cache line is given by the last 2 bits of the address.
- 6 cache operations are to be placed into the 8 cache locations.
- Address-to-line mapping: 0 = %00, 1 = %01, 2 = %10, 3 = %11, 4 = %00, 5 = %01, 6 = %10, 7 = %11, 8 = %00, 9 = %01, 10 = %10, 11 = %11, 12 = %00.

Cache-DSP Tool 13/28 -- Simple example: first cache operation
- Instruction 2 forces Instruction 4 into cache line %00.
- Cache state: 4 -- %00.

Cache-DSP Tool 14/28 -- Simple example
- The next 2 cache operations place instructions 6 and 9 into the cache.
- Cache state: 4 -- %00, 6 -- %10, 9 -- %01.

Cache-DSP Tool 15/28 -- Simple example
- The 4th and 5th cache operations set the LRU bits for cache lines %00 and %10: instruction 10 joins line %10 and instruction 12 joins line %00, so the earlier entries (6 and 4) become least recently used.
- Cache state: 4 -- %00 (LRU), 6 -- %10 (LRU), 9 -- %01, 10 -- %10, 12 -- %00.

Cache-DSP Tool 16/28 -- Execution of Instruction 12
- Execution of instruction 12 occurs during the fetch of instruction 2, the next time round the loop.
- This is the 3rd cache operation involving cache line %10: instruction 2 goes to cache line %10.
- Cache state before the replacement: 4 -- %00 (LRU), 6 -- %10 (LRU), 9 -- %01, 10 -- %10, 12 -- %00.

Cache-DSP Tool 17/28 -- Summary of cache operations, first time round the loop
- Instr. 2 pushes Instr. 4 to cache line %00.
- Instr. 4 pushes Instr. 6 to cache line %10.
- Instr. 7 pushes Instr. 9 to cache line %01.
- Instr. 8 pushes Instr. 10 to cache line %10.
- Instr. 10 pushes Instr. 12 to cache line %00.
- INSTR. 12 pushes INSTR. 2 to cache line %10, WHERE IT REPLACES INSTR. 6 (the LRU entry for %10).

Cache-DSP Tool 18/28 -- The cache thrash starts operating, second time round the loop
- Instr. 4 comes from cache line %00 (hit).
- Instr. 4 pushes Instr. 6 to cache line %10, REPLACING INSTR. 10 (LRU for %10).
- Instr. 9 comes from cache line %01 (hit).
- Instr. 8 pushes Instr. 10 to cache line %10, REPLACING INSTR. 2 (LRU for %10).
- Instr. 12 comes from cache line %00 (hit).
- Instr. 12 pushes Instr. 2 to cache line %10, REPLACING INSTR. 6 (LRU for %10).
- We lose 3 cycles each time around the loop.

Cache-DSP Tool 19/28 -- Easy to fix in this example
- We can delay the PM( ) access from INSTR. 2 until INSTR. 3.
- This forces INSTR. 5 into the cache (line %01), where it does not replace anything.
- Cache state: 4 -- %00, 5 -- %01, 6 -- %10, 9 -- %01 (LRU), 10 -- %10, 12 -- %00.

Cache-DSP Tool 20/28 -- Real life is more difficult
- Larger number of instructions in the loop.
- Jump operations (conditional or not).
- Register dependencies.
- Many PM( ) operations may need to be moved, and all this takes time.
- We need a systematic approach that gains speed while getting the product out the door in the shortest time.
- ADD-A-NOP: waste 1 cycle to gain 3.

Cache-DSP Tool 21/28 -- ADD A CACHE FREEZE at the end of the loop
- The CACHE THRASH (3 cycles wasted) is replaced by a STALL (the instruction can't go into the frozen cache) plus the freeze instructions (2 cycles wasted).
- Instruction 1 now stalls instead of entering the cache.
- Cache state: 4 -- %00 (LRU), 6 -- %10 (LRU), 9 -- %01 (LRU), 10 -- %10, 12 -- %00.
- BIT SET MODE2 CAFRZ (cache freeze); BIT CLR MODE2 CAFRZ (cache unfreeze).

Cache-DSP Tool 22/28 -- ADD A NOP at the end of the loop
- The CACHE THRASH (3 cycles wasted) IS AVOIDED at a cost of only 1 cycle per loop for the additional NOP instruction.
- With the NOP in place, instruction 1 goes to cache line %01, where it conflicts with nothing.
- Cache state: 1 -- %01, 4 -- %00 (LRU), 6 -- %10 (LRU), 9 -- %01 (LRU), 10 -- %10, 12 -- %00.

Cache-DSP Tool 23/28 -- Cache-DSP tool concept
- Original code -- loop cycles = C1: 1, 2, 3, 4, 5, 6, 7, endloop
- Trial 1 -- loop cycles = C2: 1, 2, 3, 4, 5, 6, 7, NOP, endloop
- Trial 2 -- loop cycles = C3: 1, 2, 3, 4, 5, 6, NOP, 7, endloop
- Trial 3 -- loop cycles = C4: 1, 2, 3, 4, 5, NOP, 6, 7, endloop

Cache-DSP Tool 24/28 -- Cache-DSP tool
- Identifies the number of cache operations and cache thrashes in the current code.
- Calculates the advantage, in reduced cache thrashes, of adding a NOP after/before each instruction in the loop.
- Remembers the best-case scenario.
- Then determines the effect of placing 2 NOPs (then 3, 4, etc.) somewhere in the code, preferably at the end of the loop.

Cache-DSP Tool 25/28 -- Advantages
- Typical DSP loops are small, so a brute-force approach can be used to identify where the NOPs should be placed.
- If the code meets the time constraints of your project, ship with the NOPs included.
- If it does not meet the time constraints, the positions of the NOPs give hints as to which PM( ) operations to delay.
- Works with any processor architecture.

Cache-DSP Tool 26/28 -- Hint: the instruction with the PM( ) access is the key
- Reformat the loop so that Instr. 1 is outside the loop and repeated as Instr. 13, with Instr. 12's PM( ) access moved onto it: F1=, r0=dm( ), pm( )=
- Instruction 3 now goes to cache line %11, and the cache thrash is removed with no wasted cycles.
- Cache state: 3 -- %11, 4 -- %00 (LRU), 6 -- %10 (LRU), 9 -- %01, 10 -- %10, 12 -- %00.

Cache-DSP Tool 27/28 -- Problems to overcome
- Jumps inside loops complicate which instructions get cached.
- A conditional jump changes which instruction gets cached (a dynamic effect).
- It is complicated to model the effect of placing a NOP into a delay slot, which displaces an instruction out of the delay slot.
- The effect of loops inside loops.

Cache-DSP Tool 28/28 -- Concepts discussed
- Concept behind the ADI instruction cache
- Cache operation
- Introduction to CACHE THRASHING
- Solutions that avoid a cache thrash without delaying product release
- Introduction of NOP instructions into the code: wasting 1 cycle to save 3 cycles
- Identification of the PM( ) operations to move
- Basis of the Cache-DSP tool

Cache-DSP Tool 29/28 -- Acknowledgements
- Financial support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and the University of Calgary.
- Financial support from Analog Devices through the ADI University Professorship for 2001/2002 (Dr. Smith).
- Future work will be financed in part by the Alberta Government through the Alberta Software Engineering Research Consortium (ASERC).