TigerSHARC CLU Closer look at the XCORRS M. Smith, University of Calgary, Canada

Slides:



Advertisements
Similar presentations
Machine cycle.
Advertisements

Lab 2 – DSP software architecture and the real life DSP characteristics of signals that make it necessary.
DSPs Vs General Purpose Microprocessors
PIPELINE AND VECTOR PROCESSING
Computer Science and Engineering Laboratory, Transport-triggered processors Jani Boutellier Computer Science and Engineering Laboratory This.
9-6 The Control Word Fig The selection variables for the datapath control the microoperations executed within datapath for any given clock pulse.
Computer Architecture Lecture 7 Compiler Considerations and Optimizations.
Processor Architecture Needed to handle FFT algoarithm M. Smith.
Vector Processing. Vector Processors Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations.
Blackfin ADSP Versus Sharc ADSP-21061
Detailed look at the TigerSHARC pipeline Cycle counting for the IALU versionof the DC_Removal algorithm.
Daddy! -- Where do instructions come from? Program Sequencer controls program flow and provides the next instruction to be executed Straight line code,
This presentation will probably involve audience discussion, which will create action items. Use PowerPoint to keep track of these action items during.
Software and Hardware Circular Buffer Operations First presented in ENCM There are 3 earlier lectures that are useful for midterm review. M. R.
A look at interrupts What are interrupts and why are they needed in an embedded system? Equally as important – how are these ideas handled on the Blackfin.
Understanding the TigerSHARC ALU pipeline Determining the speed of one stage of IIR filter.
3/5/2004DSP Applied to GPS Algorithms1 of 14 DSP Applied to GPS Algorithms.
Detailed look at the TigerSHARC pipeline Cycle counting for COMPUTE block versions of the DC_Removal algorithm.
A look at interrupts What are interrupts and why are they needed.
TigerSHARC CLU Closer look at the XCORRS M. Smith, University of Calgary, Canada
TigerSHARC processor General Overview. 6/28/2015 TigerSHARC processor, M. Smith, ECE, University of Calgary, Canada 2 Concepts tackled Introduction to.
4/29/09Prof. Hilfinger CS164 Lecture 381 Register Allocation Lecture 28 (from notes by G. Necula and R. Bodik)
 Send in audio signals and use sharp FIR filter to pick out 42 Hz and 59 Hz signals and send out warning tones ◦ Try FIR filter of 256 taps, down sample.
Ultra sound solution Impact of C++ DSP optimization techniques.
Processor Architecture Needed to handle FFT algoarithm M. Smith.
Pipeline And Vector Processing. Parallel Processing The purpose of parallel processing is to speed up the computer processing capability and increase.
Understanding the TigerSHARC ALU pipeline Determining the speed of one stage of IIR filter – Part 3 Understanding the memory pipeline issues.
© 2007 SET Associates Corporation SAR Processing Performance on Cell Processor and Xeon Mark Backues, SET Corporation Uttam Majumder, AFRL/RYAS.
Understanding the TigerSHARC ALU pipeline Determining the speed of one stage of IIR filter – Part 2 Understanding the pipeline.
Lab. 2 Overview Move the tasks you developed in Lab. 1 into the more controllable TTCOS operating system Manual control of RC car.
Moving Arrays -- 1 Completion of ideas needed for a general and complete program Final concepts needed for Final Review for Final – Loop efficiency.
Lab. 2 Overview. Echo Switches to LED Lab1 Task 7 12/4/2015 TDD-Core Timer Library, Copyright M. Smith, ECE, University of Calgary, Canada 2 / 28.
Introduction to MMX, XMM, SSE and SSE2 Technology
A first attempt at learning about optimizing the TigerSHARC code TigerSHARC assembly syntax.
Reduced Instruction Set Computing Ammi Blankrot April 26, 2011 (RISC)
1 SVY 207: Lecture 5 The Pseudorange Observable u Aim of this lecture: –To understand how a receiver extracts a pseudorange measurement from a GPS signal.
Logic Gates Dr.Ahmed Bayoumi Dr.Shady Elmashad. Objectives  Identify the basic gates and describe the behavior of each  Combine basic gates into circuits.
CPIT Program Execution. Today, general-purpose computers use a set of instructions called a program to process data. A computer executes the.
Generating a software loop with memory accesses TigerSHARC assembly syntax.
Riyadh Philanthropic Society For Science Prince Sultan College For Woman Dept. of Computer & Information Sciences CS 251 Introduction to Computer Organization.
System Programming and administration
Embedded Systems Design
A Closer Look at Instruction Set Architectures
Developing a multi-thread Simulation of GPS system You’ll only need to add the threads – all functions (except correlation( )) provided M. Smith Electrical.
Software and Hardware Circular Buffer Operations
General Optimization Issues
TigerSHARC processor General Overview.
Generating “Rectify( )”
Developing a multi-thread product -- Introduction
Trying to avoid pipeline delays
Generating a software loop with memory accesses
Understanding the TigerSHARC ALU pipeline
Developing a multi-thread product -- Introduction
This presentation will probably involve audience discussion, which will create action items. Use PowerPoint to keep track of these action items during.
Convolution, GPS and the TigerSHARC XCORRS instr.
Chapter 5: Computer Systems Organization
Moving Arrays -- 1 Completion of ideas needed for a general and complete program Final concepts needed for Final Review for Final – Loop efficiency.
* M. R. Smith 07/16/96 This presentation will probably involve audience discussion, which will create action items. Use PowerPoint.
This presentation will probably involve audience discussion, which will create action items. Use PowerPoint to keep track of these action items during.
Getting serious about “going fast” on the TigerSHARC
General Optimization Issues
Explaining issues with DCremoval( )
General Optimization Issues
Lab. 4 – Part 2 Demonstrating and understanding multi-processor boot
Understanding the TigerSHARC ALU pipeline
This presentation will probably involve audience discussion, which will create action items. Use PowerPoint to keep track of these action items during.
A first attempt at learning about optimizing the TigerSHARC code
Working with the Compute Block
Instruction execution and ALU
A first attempt at learning about optimizing the TigerSHARC code
Presentation transcript:

TigerSHARC CLU Closer look at the XCORRS M. Smith, University of Calgary, Canada

The practice Suppose we have the vector – in-phase and out-of- phase data gathered over an antenna from a satellite for example. Gain issues make it x j, 16+16j, 16+16j, j 16+16j, 16+16j j, 16+16j, 16+16j, j 16+16j, 16+16j, j 16+16j, 16+16j, etc Question – if the original data from the satellite had this form -1-j,1+j,1+j, -1-j,1+j,1+j, -1-j,1+j,1+j, -1-j,1+j,1+j, -1-j,1+j,1+j, -1-j,1+j,1+j, How is the satellite data delayed? FOR THIS EXAMPLE …….. 0, 3, 6, 9, 12 etc

Tackle the issue with FIR First – modify correlation function to handle complex values Ignore that issue at the moment Imagine 1024 data points PRN Need to do 1024 FIR each of 1024 taps We know how to optimize to do 2 taps every cycle (one in X and one in Y) Cycle time is 1024 * 512 cycles = 1 ms at 500 MHz XCORS can do 8 * 16 taps each cycle in each compute block – 148 times faster

Where does the CLU fit in?

XCORRS definition

THEORY Mathematical definition Uses registers TR D C And something called CUT

Satellite data Quad fetch brings in 8 complex values 8 bits each Pattern here is j, 1 + 0j, 1 + 0j, j, 1 + 0j, 1 + 0j, ……….

PRN code – 2 bit complex number Seems strange to have two dummy bits But actually makes sense PRN j, 1 + j, 1 + j, j, 1 + j, 1 + j, ………. +1, -1 are associated with the PSK – more next lecture Problem BINARY means 1 and 0, so how represent 1 and -1

PRN

0x3 value go in as C15 and C C15 = -1 –j C16 = +1 + j

Loading the THR registers

Standard XCORRS instruction Lower 46 bits ofTHR1:0 R7:3 TR0, TR1, TR2 ……. TR15

TR15:0 = XCORRS(R7:4, THR3:0) TR0 += D7 * C22 + D6 * C21 +… 8 taps TR1 += D7 * C21 + D6 * C20 +… 8 taps ……….. TR15 += D7 * C7 + D6 * C6 + … 8 taps 64 taps each cycles – on both x and y compute blocks – if set up properly 128 taps each cycle – these are “complex taps” compared to 2 real taps / cycle after lab. 3

TR15:0 = XCORRS(R7:4, THR3:0) (CUT -7) TR0 += D7 * C22 + D6 * C21 + … 8 taps TR1 += D7 * C21 + D6 * C20 + … 8 taps ……….. TR14 += D7 * C8 + D6 * C7 2 taps TR15 += D7 * C7 1 taps

TR15:0 = XCORRS(R7:4, THR3:0) (CUT -15) TR0 += D7 * C22 + D6 * C21 … 8 taps TR1 += D7 * C21 + D6 * C20 … 7 taps ……….. TR7 += D7 * C15 … 1 taps TR0 += 0 … 0 taps ……….. TR15 += 0 … 0 taps

TR15:0 = XCORRS(R7:4, THR3:0) (CUT +15) TR0 += 0 … 0 taps TR1 += D0 *C14 1 taps ……….. TR7 += D6 * C14 + D5 * C13 + … 7 taps TR0 += D7 * C14 + D6 * C13 + … 8 taps ……….. TR15 += D7 * C7 + D6 * C7 + … 8 taps

TR15:0 = XCORRS(R7:4, THR3:0) (CUT -15) TR0 += D7 * C22 + D6 * C21 … 8 taps TR1 += D7 * C21 + D6 * C20 … 7 taps ……….. TR7 += D7 * C15 … 1 taps TR0 += 0 … 0 taps ……….. TR15 += 0 … 0 taps

TR15:0 = XCORRS(R7:4, THR3:0) (CUT -7) TR0 += D7 * C22 + D6 * C21 + … 8 taps TR1 += D7 * C21 + D6 * C20 + … 8 taps ……….. TR14 += D7 * C8 + D6 * C7 2 taps TR15 += D7 * C7 1 taps

TR15:0 = XCORRS(R7:4, THR3:0) TR0 += D7 * C22 + D6 * C21 +… 8 taps TR1 += D7 * C21 + D6 * C20 +… 8 taps ……….. TR15 += D7 * C7 + D6 * C6 + … 8 taps 64 taps each cycles – on both x and y compute blocks – if set up properly 128 taps each cycle – these are “complex taps” compared to 2 real taps / cycle after lab. 3

Problem at this point -- THR3:2 empty Need to bring in more PRN values

TR15:0 = XCORRS(R7:4, THR3:0) (CUT +15) TR0 += 0 … 0 taps TR1 += D0 *C14 1 taps ……….. TR7 += D6 * C14 + D5 * C13 + … 7 taps TR0 += D7 * C14 + D6 * C13 + … 8 taps ……….. TR15 += D7 * C7 + D6 * C7 + … 8 taps

Final Result Maximum correlation occurs every 3 shifts – which is what we expect Is it the correct results

Correlation – result expected In step -1 +0j, 1 + 0j, 1 + 0j, … 16 times with -1 - j, 1 + j, 1 + j, … 16 times -1 * * * = 0x30 -- Real component Out of step -1 +0j, 1 + 0j, 1 + 0j, … 16 times with 1 + j, 1 + j, -1 - j, … 16 times -1 * * * = -0x10 = 0xFFF0

Final Result 1) Now have correlation values for 16 shifts in TR registers – store to external memory Repeat for all other necessary shifts – find the maximum 2) Now make parallel in SISD mode 3) Now make parallel in SIMD

Take home Quiz 4 Old requirement Do Lab 4 with FFT and XCORRS Write tests and demonstrate XCORRS used for correlation a)Not parallel instruction format – but in a loop b) Now do in optimized SISD mode c) Now do in optimized SIMD mode