Understanding the TigerSHARC ALU pipeline

Presentation transcript:

Understanding the TigerSHARC ALU pipeline: Determining the speed of one stage of an IIR filter – Part 5. What syntax makes the code more parallel?

Understanding the TigerSHARC Parallel Operations
- TigerSHARC has many pipelines
- Review of how the COMPUTE pipeline works
- Interaction of memory (data) operations with COMPUTE operations
- Specialized C++ compiler options and #pragmas (will be covered by an individual student presentation)
- Optimized assembly code and optimized C++

Processor Architecture
- 3 × 128-bit data busses
- 2 integer ALUs (J-IALU and K-IALU)
- 2 computational blocks (X and Y), each containing an ALU (float and integer), a SHIFTER, a MULTIPLIER, and a CLU (communications logic unit)

Use the C++ IIR code as comments. Things to think about prior to code writing – register name reorganization:
- Keep XR4 for xInput – saves a cycle
- Put S1 and S2 into XR0 and XR1 – a chance to fetch 2 memory values in one cycle using L[ ]
- Put H0 to H5 in XR12 to XR16 – a chance to fetch 4 memory values in one cycle using Q[ ], followed by one normal fetch. Problem – if there is more than one IIR stage, then the second stage's fetches are not quad aligned
- There are two sets of multiplications using S1 and S2. Can these be done in the X and Y compute blocks in one cycle?
The state-copy fragment from the C++ code:
    float *copyStateStartAddress = state;
    S1 = *state++;
    S2 = *state++;
    *copyStateStartAddress++ = S1;
    *copyStateStartAddress++ = S2;
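
A rough sketch of how those fetches might look (the pointer assignments – J6 for the state values, J5 for the coefficients – and the exact TS201 load syntax are assumptions to be checked against the assembler manual):
    XR1:0   = L[J6 += 2];;    // S1 -> XR0, S2 -> XR1: two state values in one 64-bit fetch (J6 must be 64-bit aligned)
    XR15:12 = Q[J5 += 4];;    // four coefficients in one 128-bit quad fetch (J5 must be 128-bit aligned)
    XR16    = [J5 += 1];;     // remaining coefficient with a normal 32-bit fetch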

Register name conversion done in steps
- Setting Xin = XR4 and Yout = XR8 saves one cycle
- Aim was a bulk conversion with no errors. In practice, so many errors were made during the bulk conversion that I switched to a find/replace/test pass for each register individually

Fix bringing the state variables in. QUESTION: we have XR18 = [J6 += 1] (load S1) and R19 = [J6 += 1] (load S2). Both are valid – what is the difference?

That difference – could it be used to our advantage?
- XR18 = [J6 += 1];; reads the value at memory location [J6], updates J6 to J6 + 1 after the fetch, and stores the fetched value in XR18
- XYR19 = [J6 += 1];; reads the value at memory location [J6], updates J6 to J6 + 1 after the fetch, and stores the fetched value in XR19 AND YR19
- XYR19 = L[J6 += 2];; – concept correct, but executes faster: reads the value at [J6], updates J6 to J6 + 1, and stores it in XR19; AND reads the value at the (new) J6, updates J6 to J6 + 1, and stores it in YR19 – PROVIDED J6 was originally aligned on a 64-bit boundary
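
Putting the three forms side by side as a sketch (the per-register destinations follow the description above and should be re-checked against the TS201 programming reference):
    XR18  = [J6 += 1];;    // 32-bit load into the X compute block only
    XYR19 = [J6 += 1];;    // same 32-bit load, broadcast into both the X and Y compute blocks
    XYR19 = L[J6 += 2];;   // 64-bit load: one word to XR19, the next word to YR19, in a single fetch
                           // (only valid when J6 is 64-bit aligned)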

Send the state variables out
- Go for the gusto – use L[ ] (64-bit)
- Need to recalculate the test result: state[1] is NOT Yout
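
A minimal sketch of that 64-bit write-back, assuming the updated state values sit in the register pair XR1:0 and that a hypothetical pointer register J7 holds copyStateStartAddress (store syntax to be verified against the manual):
    L[J7 += 2] = XR1:0;;   // write S1 and S2 back to the state array in one 64-bit store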

Working solution – Part 1

Working solution – Part 2

Working solution – Part 3
- I could not spot where any extra stalls would occur because of memory pipeline reads and writes
- All values were in place when needed
- Need to check with the pipeline viewer

Let's look at DATA MEMORY and COMPUTE pipeline issues – 1. No problems here.

Weird stuff happening with the INSTRUCTION pipeline: only 9 instructions are being fetched, but we are executing 21! Why all these instruction stalls?

Analysis
- We are seeing the impact of the processor doing quad fetches of instructions (128 bits) into the IAB (instruction alignment buffer)
- Once in the IAB, the instructions (32 bits each) are issued to the various execution units as needed

Before we do any further optimization, we need to understand processor parallelism
- We already know about parallel multiplications and additions and their associated stalls
- What about parallel memory fetches?

Parallel memory fetches – what is permissible? Can we do:
- Parallel fetches into XY registers at the same time?
- Parallel fetches into an X and a Y register?
- Parallel fetches into two X registers?

Parallel memory syntax – not too difficult
- Only this syntax is illegal
- Will need to do more research to discover whether "legal" means that the operation is performed without stalling the memory pipeline
- NOTE: need to transfer INPAR3 (J6) into a K-register (K6) in order to be able to use both the J and K data busses during the IIR operation (see the sketch below)
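
A sketch of that idea, assuming a direct register move between the IALUs is available and that J5 still points at the other array (both the move syntax and the pointer assignments are assumptions to verify against the TS201 manual):
    K6 = J6;;                            // move the pointer into the K-IALU so its accesses use the K data bus
    XR0 = [J5 += 1]; YR1 = [K6 += 1];;   // two data fetches issued in one instruction line,
                                         // one over the J bus and one over the K bus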

Question: How do you (in C++) place IIR coefficients in one memory block and state values into another?

Question: How do you (in assembly code) place IIR coefficients in one memory block and state values into another?

The C++ manual talks about 2 data spaces (dm and pm) for STATIC or GLOBAL variables

The bad news: you can use the VDSP C++ extension pm to specify a different memory space. HOWEVER, there is no such thing as a pm stack, so all pm variables must be declared "static" or "global". dm arrays can be placed on the stack, but there may be alignment issues.
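
A minimal sketch of such declarations, assuming the VDSP dm/pm memory-space qualifiers (the array names and sizes are illustrative only):
    static float pm coeffs[6];   /* coefficients forced into the pm memory block (must be static or global) */
    static float dm state[2];    /* state variables in the dm memory block */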

The assembler manual says something similar but different
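
On the assembly side, the usual route is to declare the buffers in different linker sections so the LDF can map them to different memory blocks. A sketch, assuming the VisualDSP++ .SECTION and .VAR directives and hypothetical section names that the LDF maps to separate internal blocks:
    .SECTION data1;      // assumed to be mapped to one internal memory block in the LDF
    .VAR state[2];
    .SECTION data2;      // assumed to be mapped to a different memory block
    .VAR coeffs[6];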

VDSP C++ extensions dm and pm
- The parameters are still being passed into the function via J5 and J6 as before
- Notice the very big difference in the "absolute addresses", indicating that the data blocks are in very different memory spaces
- The data memory addresses are also very different from the instruction memory space, so the processor can do an instruction fetch and 2 data fetches at the same time

IIR function using TigerSHARC C++ DSP extensions dm and pm
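
A hypothetical prototype for such a function (the name and parameter list are assumptions, not the actual code from the slide):
    /* dm/pm pointer qualifiers tell the compiler which memory block each array lives in,
       so a coefficient fetch and a state fetch can travel over different busses */
    float IIR_stage(float xIn, float dm *state, float pm *coeffs);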

Using dm and pm shows up as a little more parallel than using dm alone

From the TigerSHARC TS201 programming reference manual

Memory block operation will need to be explored in more detail later

Understanding the TigerSHARC Parallel Operations
- TigerSHARC has many pipelines
- Review of how the COMPUTE pipeline works
- Interaction of memory (data) operations with COMPUTE operations
- Specialized C++ compiler options and #pragmas (will be covered by an individual student presentation)
- Optimized assembly code and optimized C++