Understanding the TigerSHARC ALU pipeline: Determining the speed of one stage of an IIR filter – Part 3: Understanding the memory pipeline issues

Presentation transcript:

Slide 1: Understanding the TigerSHARC ALU pipeline – Determining the speed of one stage of an IIR filter – Part 3: Understanding the memory pipeline issues

Slide 2: Understanding the TigerSHARC ALU pipeline
- TigerSHARC has many pipelines
- Review of how the COMPUTE pipeline works
- Interaction of memory (data) operations with COMPUTE operations
  - What do we want to be able to do?
  - The problems we expect to have to solve
  - Using the pipeline viewer to see what really happens
- Changing code practices to get better performance
  - Specialized C++ compiler options and #pragmas (will be covered by an individual student presentation)
  - Optimized assembly code and optimized C++

Slide 3: Processor Architecture
- 128-bit data busses
- 2 Integer ALUs
- 2 Computational Blocks, each containing:
  - ALU (float and integer)
  - SHIFTER
  - MULTIPLIER
  - COMMUNICATIONS logic unit (CLU)

Slide 4: Simple Example IIR -- Biquad

For (Stage = 0 to 3) Do
  S0 = Xin * H5 + S2 * H3 + S1 * H4
  Yout = S0 * H0 + S1 * H1 + S2 * H2
  S2 = S1
  S1 = S0

[Figure: biquad stage block diagram, with state variables S0, S1 and S2]
(A C++ sketch of this loop follows.)
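The slide's pseudocode maps directly onto a few lines of C++. The sketch below is a minimal interpretation, assuming a cascade of four biquad stages where each stage's output feeds the next, with six coefficients (H0..H5) and two delay-line values (S1, S2) kept per stage; the actual coefficient layout in the lab code may differ.

    // Minimal C++ sketch of the four-stage biquad cascade on this slide.
    // Names (H, S1, S2, Xin, Yout) follow the slide; the per-stage arrays
    // and the cascading of stage outputs are assumptions for illustration.
    float IIR_Biquad(float Xin, const float H[4][6], float S1[4], float S2[4])
    {
        float Yout = Xin;                        // stage 0 input; later stages
        for (int stage = 0; stage < 4; ++stage)  //   take the previous output
        {
            float S0 = Yout * H[stage][5] + S2[stage] * H[stage][3]
                                          + S1[stage] * H[stage][4];
            Yout     = S0   * H[stage][0] + S1[stage] * H[stage][1]
                                          + S2[stage] * H[stage][2];
            S2[stage] = S1[stage];               // shuffle the delay line
            S1[stage] = S0;
        }
        return Yout;
    }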

Slide 5: Pipeline stages
- See page 8-34 of the processor manual
- 10 pipeline stages, but they may be completely desynchronized (happen semi-independently)
- Instruction fetch -- F1, F2, F3 and F4
- Integer ALU – PreDecode, Decode, Integer, Access
- Compute Block – EX1 and EX2
(These stage names are collected in the small reference enum below.)
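For bookkeeping during the cycle counting later in the talk, the ten stage names from this slide can be collected into a small enum. This is only a reference aid, listing the stages in the order given above; it is not a definition taken from the processor manual.

    // The ten TigerSHARC pipeline stages named on this slide, in order.
    enum TigerSharcPipelineStage {
        F1, F2, F3, F4,                        // instruction fetch
        PreDecode, Decode, Integer, Access,    // integer ALU (IALU) stages
        EX1, EX2                               // compute-block execute stages
    };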

Slide 6: (pipeline viewer screenshot)
- Instruction 0x17e (XFR8 = R8 + R23) is STALLED, waiting for instruction 0x17d (XFR23 = R8 * R4) to complete, because it needs the R23 result
- A bubble ("B") means that the pipeline is doing "nothing" in that slot; the instruction shown there is just a placeholder (garbage)

Slide 7: Code with stalls shown
- 8 code lines
- 5 expected stalls
- Expect 8 + 5 = 13 cycles to complete if the theory is correct

Slide 8: The analysis approach IS correct
- The code takes the same time whether our "SHOW-STALL" instructions are present or not

Slide 9: Process for coding for improved speed – code re-organization
- Make a copy of the code so you can test both iirASM( ) and iirASM_Optimized( ) and make sure each gets the correct result
- Make a table of the code showing ALU resource usage (paper, Excel, or a Gantt chart in Project)
- Identify data dependencies
- KEY – make every "temporary operation" use a different register (illustrated in the sketch after this list)
- Move instructions "forward" to fill delay slots, BUT don't break data dependencies
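The "different register for every temporary" rule is about removing false dependencies so that instructions can be re-ordered. A plain C++ analogue is sketched below; the function names and values are invented, and on the TigerSHARC the same change is made to the compute-block registers in the assembly rather than to C++ variables.

    // Re-using one temporary serializes the two multiplies: the second
    // product cannot be moved ahead of the instruction that reads the first.
    float reuse_temp(float a, float b, float c, float d)
    {
        float t = a * b;        // t written ...
        float sum = t;          // ... then read ...
        t = c * d;              // ... then overwritten (false dependency)
        sum = sum + t;
        return sum;
    }

    // Giving each temporary its own name leaves only the true dependencies,
    // so the two multiplies can be issued back to back to fill delay slots.
    float separate_temps(float a, float b, float c, float d)
    {
        float t0 = a * b;
        float t1 = c * d;
        return t0 + t1;
    }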

Slide 10: Show resource usage and data dependencies

Slide 11: Change all temporary registers to use different register names
- Then check that the code still produces the correct answer

Slide 12: Move instructions forward, without breaking data dependencies
- What appears possible!
- DO one thing at a time, and then check that the code still works

Slide 13: CHECK THE PIPELINE AFTER TESTING
- There are many more COMPUTE pipeline improvements possible
- However, let's not spend too much time here, as we are only looking at half of the problem
- The coefficients are unlikely to be hard-coded, and the state variables can't be if we are to call IIR( ) in a loop to filter a series of values

Slide 14: Expect the (optimized) code to take 8 cycles to execute
- This is not real life
- We must call IIR( ) in a loop in order to filter a series of values, and IIR( ) will involve multiple stages (see the sketch after this list)
- That means reading the filter coefficients from memory
- It also means reading the state values from memory, and storing (writing) the changed state values back at the end of the function
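The "real life" usage pattern the slide describes looks roughly like the loop below. The function signature, array names and parameter layout are assumptions for illustration; the point is simply that the coefficients and state now live in memory and must be fetched and stored on every call.

    // Assumed prototype: coefficients and state are passed in, not hard-coded.
    float IIR(float Xin, const float *coeffs, float *state);

    // Filter a buffer of N samples by calling IIR( ) once per sample.
    void FilterBuffer(const float *in, float *out, int N,
                      const float *coeffs, float *state)
    {
        for (int i = 0; i < N; ++i)
            out[i] = IIR(in[i], coeffs, state);   // state is read and written
                                                  // back on every call
    }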

Slide 15: Rewrite the tests so that the IIR( ) function can take parameters
- Let's make things real by passing in the state variables through an "overloaded" C++ function (sketched below)
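One way to read "overloaded" here is sketched below: keep the original IIR( ) with its hard-coded values so the old tests still build, and add a second version that takes the coefficient and state pointers. The exact signatures are not shown on the slide, so these prototypes are assumptions.

    float IIR(float Xin);                                     // original, hard-coded values
    float IIR(float Xin, const float *coeffs, float *state);  // new overload, memory-based

    // The compiler selects the overload from the argument list:
    //   y = IIR(x);                  -> old behaviour, existing tests unchanged
    //   y = IIR(x, coeffs, state);   -> new behaviour under test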

Slide 16: Rewrite the "C++ code"
- I leave the old "fixed" values in until I can get the new code to work
- That proved useful this time, as the code failed
- Why did it fail to return the correct value?

Slide 17: Explore design issues – 1: What do we expect to have to worry about?

    XR0 = 0.0;          // Set Fsum = 0
    XR1 = [J1 += 1];    // Fetch a coefficient from memory
    XFR2 = R1 * R4;     // Multiply by Xinput (XR4)
    XFR0 = R0 + R2;     // Add to sum
    XR3 = [J1 += 1];    // Fetch a coefficient from memory
    XR5 = [J2 += 1];    // Fetch a state value from memory
    XFR5 = R3 * R5;     // Multiply coeff and state
    XFR0 = R0 + R5;     // Perform a sum
    XR5 = XR12;         // Update a state variable (dummy)
    XR12 = XR13;        // Update a state variable (dummy)
    [J3 += 1] = XR12;   // Store state variable to memory
    [J3 += 1] = XR5;    // Store state variable to memory

(A rough C++ rendering of this fragment follows.)
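As a reading aid, here is a rough C++ rendering of the fragment above. The function and parameter names are invented; the register re-use in the assembly (XR5 holds the fetched state, then the product, then a value to be stored) has been unrolled into separate variables.

    float ExploreFragment(float Xinput,                 // held in XR4
                          const float *coeff,           // J1 points here
                          const float *state_in,        // J2 points here
                          float *state_out,             // J3 points here
                          float dummy12, float dummy13) // stand-ins for XR12/XR13
    {
        float Fsum = 0.0f;            // XR0 = 0.0
        float c0 = *coeff++;          // XR1 = [J1 += 1]
        Fsum += c0 * Xinput;          // XFR2 = R1 * R4;  XFR0 = R0 + R2
        float c1 = *coeff++;          // XR3 = [J1 += 1]
        float s  = *state_in++;       // XR5 = [J2 += 1]
        Fsum += c1 * s;               // XFR5 = R3 * R5;  XFR0 = R0 + R5
        float saved = dummy12;        // XR5 = XR12   (dummy state update)
        dummy12 = dummy13;            // XR12 = XR13  (dummy state update)
        *state_out++ = dummy12;       // [J3 += 1] = XR12
        *state_out++ = saved;         // [J3 += 1] = XR5
        return Fsum;                  // running sum kept in XR0
    }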

Slide 18: Explore design issues – 2: COMPUTE stalls expected (possible)
(Same code listing as slide 17; the candidates here are the COMPUTE instructions that immediately use the result of the previous COMPUTE, e.g. XFR0 = R0 + R2 right after XFR2 = R1 * R4.)

Slide 19: Explore design issues – 3: Probable memory stalls expected
(Same code listing as slide 17; the candidates here are the memory reads, e.g. XR1 = [J1 += 1] followed closely by XFR2 = R1 * R4, which uses the value just fetched.)

Slide 20: Memory pipeline issues expected, given the COMPUTE pipeline issues already seen
- When you start reading values from memory, how soon is the fetched value available for use within a COMPUTE instruction?
- When you have adjacent memory accesses (read or write), does the pipeline work better (higher speed) with [J1 += 1];; or with [J1 += J4];; where J4 has been set to 1?

Slide 21: Write a quick test to explore the code example (one possible shape for such a test is sketched below)
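A "quick test" for this exploration does not need to check much; it mainly has to call the assembly stub so its behaviour can be watched in the pipeline viewer. The sketch below is one possible shape; the stub name, its signature and the data values are all assumptions, and the lab's own test framework should be used in practice.

    #include <cstdio>

    // Assumed assembly stub under test (name and signature are placeholders).
    extern "C" float Explore_ASM(const float *coeffs, float *state);

    int main()
    {
        float coeffs[6] = { 0.5f, 0.25f, 0.125f, 0.0625f, 0.03125f, 1.0f };
        float state[2]  = { 0.0f, 0.0f };

        float result = Explore_ASM(coeffs, state);   // set a breakpoint here and
                                                     // single-step in the simulator
        std::printf("result = %f, state = {%f, %f}\n", result, state[0], state[1]);
        return 0;
    }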

Slide 22: Code stub – partly copied from the optimized IIR code

Slide 23: What is the MINIMUM number of ";;" instruction-line terminators that must be added so that the code is valid TigerSHARC (multi-instruction) code?

Slide 24: The code assembles, but when it runs, it crashes the tests
- Neat mid-term question:
  - Explain why this code crashed (corrupted) the processor so that the remaining tests never completed
  - What is the minimum number of lines that must be deleted or added to make the tests run? (It says run, not necessarily work.)

Slide 25: Switched to the simulator
- #if 0 / #endif placed around the unnecessary tests (see the snippet below)
- Breakpoint set at the start of the test code
- Already seeing lots of pipeline issues
- Bigger picture on the next slide
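The #if 0 / #endif trick mentioned above is just standard C/C++ preprocessing; a minimal example is shown below. The test-function names are invented for illustration.

    void Test_iirASM();
    void Test_iirASM_Optimized();
    void Test_ExploreMemoryPipeline();

    void RunSelectedTests()
    {
    #if 0
        Test_iirASM();                 // disabled while exploring the pipeline
        Test_iirASM_Optimized();       // disabled while exploring the pipeline
    #endif
        Test_ExploreMemoryPipeline();  // the only test left active
    }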

Slide 26: Lots of instruction-fetch issues
- PROBABLY caused by jumping into the new routine, so that the instruction pipeline is not yet filled by the start of this code
- Remove this problem from the analysis by placing 10 NOP;; instructions at the beginning of the exploratory code
- WHY 10 NOPs and not 4? (Hint: recall the 10 pipeline stages from slide 5)

Slide 27: Looking much better

Slide 28: Pipeline issues – are they as we expected?
- COMPUTE operations – a 1-cycle delay is expected if the next instruction needs the result of the previous instruction
- When you start reading values from memory, how soon is the fetched value available for use within a COMPUTE instruction?
- When you have adjacent memory accesses (read or write), does the pipeline work better with [J1 += 1];; or with [J1 += J4];; where J4 has been set to 1?

Slide 29: Pipeline performance seen

Slide 30: Pipeline performance predicted
- Memory reads – a 1-cycle delay before the fetched value is available for use within a COMPUTE instruction
- COMPUTE operations – a 1-cycle delay expected if the next instruction needs the result of the previous instruction
- Adjacent memory accesses (read or write): does the pipeline work better with [J1 += 1];; or with [J1 += J4];; where J4 has been set to 1?
  - [J1 += 1];; works just fine here (no delay); worry about [J1 += J4];; another day
(A small cycle-counting sketch based on these two rules follows.)
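Taking the two 1-cycle rules above at face value, quick hand estimates can be made with a helper like the one sketched below. This is a deliberately simplified model (it looks only one instruction back and ignores dual-issue, IALU scheduling and instruction-fetch effects), and every name in it is invented.

    #include <cstring>
    #include <vector>

    struct Instr {
        const char *dest;          // register written, e.g. "R2" ("" for a store)
        const char *src1, *src2;   // registers read ("" if unused)
    };

    // One cycle per instruction, plus one stall whenever an instruction reads
    // the register written by the instruction immediately before it.
    int EstimateCycles(const std::vector<Instr> &code)
    {
        int cycles = 0;
        for (std::size_t i = 0; i < code.size(); ++i) {
            ++cycles;
            if (i > 0 && code[i - 1].dest[0] != '\0') {
                const char *prev = code[i - 1].dest;
                if (std::strcmp(prev, code[i].src1) == 0 ||
                    std::strcmp(prev, code[i].src2) == 0)
                    ++cycles;      // depends on the previous result: 1 stall
            }
        }
        return cycles;
    }

    // Example: load, multiply using the load, add using the multiply
    //   EstimateCycles({ {"R1","",""}, {"R2","R1","R4"}, {"R0","R0","R2"} })
    // predicts 3 instructions + 2 stalls = 5 cycles.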

Slide 31: Understanding the TigerSHARC ALU pipeline (summary)
- TigerSHARC has many pipelines
- Review of how the COMPUTE pipeline works
- Interaction of memory (data) operations with COMPUTE operations
  - What do we want to be able to do?
  - The problems we expected to have to solve
  - Using the pipeline viewer to see what really happens
- Changing code practices to get better performance
  - We can now predict compute and memory stalls
  - We have enough information to predict performance for real IIR code involving memory fetches and stores
  - We have almost enough information to tackle Lab. 2 with the IIR filter