TigerSHARC CLU Closer look at the XCORRS M. Smith, University of Calgary, Canada

Overview: recap of GPS correlation, then a detailed look at the XCORRS instruction (this was part of the take-home quiz for 5005). Additional information on the web: Xcorrs.asm, the assembly code discussed in class; Xmain.cpp, which demonstrates the use of the xcorrs.asm code; XcorrsTest.cpp, which demonstrates testing of all the functions being used; and additional correlation presentations (not XCORRS) from Analog Devices developers. In 2005 we pointed out many errors in the TigerSHARC XCORRS explanation. If my figures are not the same as those in the manual, then the manual errors have since been fixed.

GPS positioning concepts (1). For now, make 2 assumptions: we know the distance to each satellite, and we know where each satellite is. With this information from 2 satellites, you know you are somewhere on the "plane of intersection" (the circle where the two distance spheres meet). In this "ideal" scenario, 3 satellites are required for a 3-D position; 4 satellites are required to account for local receiver clock drift.

Determining time. Use the PRN code to determine time, then use time to determine the distance to the satellite: distance = speed of light * time (1). The signal is sent by the satellite and received by you. You know what signal was sent, so perform correlations until you get a match.
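Equation (1) is a single multiplication. A minimal C++ sketch follows; the function name `distanceToSatellite` is hypothetical (it is not part of Xmain.cpp), and no receiver clock correction is applied.

```cpp
// Speed of light in a vacuum, in metres per second (exact SI value).
constexpr double kSpeedOfLight = 299792458.0;

// Equation (1): distance = speed of light * time.
// travelTimeSeconds is the signal travel time recovered by the correlation
// search; this sketch applies no receiver clock-drift correction.
double distanceToSatellite(double travelTimeSeconds) {
    return kSpeedOfLight * travelTimeSeconds;
}
```

For example, a travel time of 0.07 s gives roughly 20,985 km, and a 1 µs timing error shifts the range by about 300 m, which is why the receiver clock drift must be solved for.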

The practice. Suppose we have a vector of in-phase and out-of-phase data gathered over an antenna from a satellite, for example. Gain issues scale it by 16, so we receive -16-16j, 16+16j, 16+16j, -16-16j, 16+16j, 16+16j, -16-16j, 16+16j, 16+16j, etc. Question: if the original data from the satellite had the form -1-j, 1+j, 1+j, -1-j, 1+j, 1+j, -1-j, 1+j, 1+j, …, how much is the satellite data delayed? FOR THIS EXAMPLE the answer is 0, 3, 6, 9, 12, etc. samples, because the pattern repeats every 3 values.
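The delay question can be answered by brute force: correlate the known pattern against the received data at each shift and look for peaks. The sketch below uses `std::complex<double>` and the conjugate-product definition of complex correlation; the helper names are hypothetical, and the XCORRS hardware convention (for example, whether the coefficient is conjugated) is not assumed to match.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Repeat the slide's pattern -1-j, 1+j, 1+j 'periods' times, scaled by 'gain'.
std::vector<cplx> repeatPattern(std::size_t periods, double gain) {
    const cplx base[3] = {cplx(-1, -1), cplx(1, 1), cplx(1, 1)};
    std::vector<cplx> out;
    for (std::size_t p = 0; p < periods; ++p)
        for (const cplx& v : base) out.push_back(gain * v);
    return out;
}

// Complex correlation at one shift: sum_n rx[n + shift] * conj(ref[n]).
// rx must hold at least ref.size() + shift samples.
cplx correlateAt(const std::vector<cplx>& rx, const std::vector<cplx>& ref,
                 std::size_t shift) {
    cplx acc(0, 0);
    for (std::size_t n = 0; n < ref.size(); ++n)
        acc += rx[n + shift] * std::conj(ref[n]);
    return acc;
}
```

With the received data scaled by 16 and a 12-sample reference, shifts 0 and 3 both give a real part of 384, while shifts 1 and 2 give -128, so the peaks repeat every 3 shifts; the gain changes the peak height, not its position.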

Tackle the issue with FIR. First, the correlation function must be modified to handle complex values; ignore that issue for the moment. Going complex changes each tap from 1 add + 1 multiplication + 2 memory fetches to 3 adds + 4 multiplications + 4 memory fetches. Imagine a 1024-point PRN code: we need to do 1024 FIRs, each of 1024 taps. We know how to optimize to do 2 taps every cycle (one in X and one in Y), so the time is 1024 * 512 cycles, about 1 ms at 500 MHz. XCORRS can do 8 * 16 taps each cycle in each compute block, 128 times faster.
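The cycle arithmetic on this slide can be checked with a few lines of C++. This is a back-of-envelope sketch (hypothetical `searchCost` helper) that assumes every cycle delivers its full quota of taps, with no loop overhead, pipeline stalls, data loading or CUT edge effects.

```cpp
#include <cstdint>

// Cost of the full search: one correlation per possible shift, N taps each.
struct SearchCost {
    std::uint64_t totalTaps;
    std::uint64_t cycles;
    double milliseconds;
};

SearchCost searchCost(std::uint64_t codeLength, std::uint64_t tapsPerCycle,
                      double clockHz) {
    SearchCost c{};
    c.totalTaps = codeLength * codeLength;   // N shifts x N taps
    c.cycles = c.totalTaps / tapsPerCycle;   // ideal: no stalls or overhead
    c.milliseconds = 1e3 * static_cast<double>(c.cycles) / clockHz;
    return c;
}
```

`searchCost(1024, 2, 500e6)` gives 524,288 cycles (about 1.05 ms); with 256 taps per cycle (8 * 16 in each of the two compute blocks) it drops to 4,096 cycles, a factor of 128.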

Where does the CLU (communications logic unit) fit in?

XCORRS definition

THEORY: mathematical definition. XCORRS uses these registers: TR, the accumulators; D, the 8 data values; C, the coefficients (each product term uses one C value); and something called CUT, which is essentially a window operation: where f_cut = 0, the term is not used.
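A form of the definition that is consistent with the TR expansions on the following slides is given below. It is reconstructed from those worked examples, not copied from the Analog Devices manual; m is the CUT amount.

```latex
TR_n \leftarrow TR_n + \sum_{k=0}^{7} f_{\mathrm{cut}}(15-n+k)\, D_k\, C_{15-n+k},
\qquad n = 0, 1, \ldots, 15

f_{\mathrm{cut}}(i) =
\begin{cases}
1 & \text{no CUT} \\
1 \text{ if } i \ge m, \text{ else } 0 & \text{CUT } -m \quad (m = 7, 15) \\
1 \text{ if } i \le 14, \text{ else } 0 & \text{CUT } +15
\end{cases}
```

For n = 0 and no CUT this expands to TR0 += D7 * C22 + D6 * C21 + … + D0 * C15, and for n = 15 to TR15 += D7 * C7 + … + D0 * C0.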

2005 Lab. 4 satellite data. A quad fetch brings in 8 complex values, each with 8-bit real and 8-bit imaginary parts. The pattern here is -1 + 0j, 1 + 0j, 1 + 0j, -1 + 0j, 1 + 0j, 1 + 0j, …
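To make the quad-fetch layout concrete, the sketch packs 8 complex samples with 8-bit components into four 32-bit words (128 bits). The byte order used here, with sample k in word k/2 and the real part in the low byte of each 16-bit half, is an assumption for illustration only; check the TigerSHARC documentation and Xcorrs.asm for the real layout. `packQuad` and `realPart` are hypothetical names.

```cpp
#include <array>
#include <cstdint>

// One quad-word fetch: 4 x 32-bit words = 128 bits = 8 complex samples.
using QuadWord = std::array<std::uint32_t, 4>;

// ASSUMED layout (hypothetical): sample k lives in word k/2, low or high
// 16-bit half; within a half, the real part is the low byte and the
// imaginary part the high byte, both 8-bit two's complement.
QuadWord packQuad(const std::array<std::int8_t, 8>& re,
                  const std::array<std::int8_t, 8>& im) {
    QuadWord q{};
    for (int k = 0; k < 8; ++k) {
        std::uint32_t half =
            static_cast<std::uint8_t>(re[k]) |
            (static_cast<std::uint32_t>(static_cast<std::uint8_t>(im[k])) << 8);
        q[k / 2] |= half << (16 * (k % 2));
    }
    return q;
}

// Extract the real part of sample k under the same assumed layout.
std::int8_t realPart(const QuadWord& q, int k) {
    return static_cast<std::int8_t>((q[k / 2] >> (16 * (k % 2))) & 0xFF);
}
```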

PRN code: a 2-bit complex number. It seems strange to have two dummy bits, but it actually makes sense. The PRN pattern is -1 - j, 1 + j, 1 + j, -1 - j, 1 + j, 1 + j, … The +1 and -1 values are associated with PSK (more in another lecture). Problem: BINARY means 1 and 0, so how do we represent +1 and -1? The -1 values are stored as 1's and the +1 values as 0's (DAMY).
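The storage rule (1 means -1, 0 means +1) can be sketched as a pair of conversion functions. Which of the two bits holds the real part is an assumption here (bit 0 real, bit 1 imaginary); for 0x0 and 0x3 the choice makes no difference. The function names are hypothetical.

```cpp
#include <complex>
#include <cstdint>

// Decode a 2-bit PRN value: a stored 1 means -1, a stored 0 means +1.
// ASSUMPTION: bit 0 carries the real part and bit 1 the imaginary part.
std::complex<double> decodePrn2(std::uint8_t code) {
    const double re = (code & 0x1) ? -1.0 : 1.0;
    const double im = (code & 0x2) ? -1.0 : 1.0;
    return {re, im};
}

// Inverse mapping for each component: -1 -> 1, +1 -> 0.
std::uint8_t encodePrn2(int re, int im) {
    return static_cast<std::uint8_t>((re < 0 ? 0x1 : 0x0) | (im < 0 ? 0x2 : 0x0));
}
```

`decodePrn2(0x3)` returns -1 - j and `decodePrn2(0x0)` returns +1 + j; for the -1 - j, 1 + j, 1 + j pattern only these two codes occur.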

PRN

The 0x3 values go in as C15 and C…: C15 = -1 - j, C16 = +1 + j.

Loading the THR registers

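Standard XCORRS instruction: the lower 46 bits of THR1:0 hold the PRN coefficients (23 two-bit values, C0 to C22), the data come from R7:4, and the results accumulate in TR0, TR1, TR2, …, TR15.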
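Since each PRN coefficient takes 2 bits, 46 bits hold 23 coefficients, matching the C0 to C22 indices used in the TR equations. The sketch packs them into a `std::uint64_t` standing in for THR1:0; placing Ci in bits 2i+1:2i is a hypothetical layout, not taken from the manual, and `packThr` / `coeffAt` are illustrative names.

```cpp
#include <array>
#include <cstdint>

// Pack 23 two-bit coefficients C0..C22 into the low 46 bits of a 64-bit
// stand-in for THR1:0. ASSUMED layout: Ci occupies bits 2i+1:2i.
std::uint64_t packThr(const std::array<std::uint8_t, 23>& c) {
    std::uint64_t thr = 0;
    for (int i = 0; i < 23; ++i)
        thr |= static_cast<std::uint64_t>(c[i] & 0x3) << (2 * i);
    return thr;
}

// Read coefficient Ci back out under the same assumed layout.
std::uint8_t coeffAt(std::uint64_t thr, int i) {
    return static_cast<std::uint8_t>((thr >> (2 * i)) & 0x3);
}
```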

TR15:0 = XCORRS(R7:4, THR3:0): doing 8 complex taps for each of 16 correlations every cycle.
TR0 += D7 * C22 + D6 * C21 + … (8 taps)
TR1 += D7 * C21 + D6 * C20 + … (8 taps)
…
TR15 += D7 * C7 + D6 * C6 + … (8 taps)
That is 8 * 16 = 128 taps each cycle in each compute block, or 256 taps each cycle across the X and Y compute blocks if set up properly. These are "complex taps", compared to the 2 real taps per cycle achieved after Lab. 3.
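A plain C++ reference model of these TR equations is handy for checking assembly results, in the spirit of XcorrsTest.cpp (this is not that file). It encodes the index pattern TR[n] += D[k] * C[15 - n + k] read off the slide; word widths, saturation and the hardware's conjugation convention are deliberately not modelled, and `xcorrsModel` is a hypothetical name.

```cpp
#include <array>
#include <complex>

using cplx = std::complex<double>;

// Reference model of the no-CUT form:
//   TR[n] += sum_{k=0..7} D[k] * C[15 - n + k],  n = 0..15
// so TR0 uses C22..C15 and TR15 uses C7..C0.
// d: the 8 data samples (R7:4); c: the 23 PRN values (THR).
void xcorrsModel(std::array<cplx, 16>& tr, const std::array<cplx, 8>& d,
                 const std::array<cplx, 23>& c) {
    for (int n = 0; n < 16; ++n)
        for (int k = 0; k < 8; ++k)
            tr[n] += d[k] * c[15 - n + k];
}
```

With all D values equal to 1 and C[i] = i, TR0 becomes 15 + 16 + … + 22 = 148 and TR15 becomes 0 + 1 + … + 7 = 28, which confirms the C22 … C15 and C7 … C0 ranges.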

TR15:0 = XCORRS(R7:4, THR3:0) (CUT -7). Because of offsets, sometimes we must use only some of the taps.
TR0 += D7 * C22 + D6 * C21 + … (8 taps)
TR1 += D7 * C21 + D6 * C20 + … (8 taps)
…
TR14 += D7 * C8 + D6 * C7 (2 taps)
TR15 += D7 * C7 (1 tap)

TR15:0 = XCORRS(R7:4, THR3:0) (CUT -15)
TR0 += D7 * C22 + D6 * C21 + … (8 taps)
TR1 += D7 * C21 + D6 * C20 + … (7 taps)
…
TR7 += D7 * C15 (1 tap)
TR8 += 0 (0 taps)
…
TR15 += 0 (0 taps)
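The CUT -7 and CUT -15 listings both fit one rule: under CUT -m, a term D[k] * C[15 - n + k] is kept only when its coefficient index is at least m. That rule is inferred from the two listings, not taken from the Analog Devices manual; the sketch (hypothetical `tapsWithNegativeCut`) counts the surviving taps per TR register so the listings can be checked.

```cpp
#include <array>

// Taps accumulated by each TR register under "CUT -m", using the rule
// inferred from the slides: keep D[k]*C[15-n+k] only if 15-n+k >= m.
std::array<int, 16> tapsWithNegativeCut(int m) {
    std::array<int, 16> taps{};
    for (int n = 0; n < 16; ++n)
        for (int k = 0; k < 8; ++k)
            if (15 - n + k >= m) ++taps[n];
    return taps;
}
```

For m = 7 it gives 8 taps for TR0 through TR8, then 7, 6, … down to 2 for TR14 and 1 for TR15; for m = 15 it gives 8, 7, …, 1 for TR0 through TR7 and 0 for TR8 through TR15.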

TR15:0 = XCORRS(R7:4, THR3:0) (CUT +7?)
TR0 += 0 (0 taps)
TR1 += D0 * C14 (1 tap)
…
TR7 += D6 * C14 + D5 * C13 + … (7 taps)
TR8 += D7 * C14 + D6 * C13 + … (8 taps)
…
TR15 += D7 * C7 + D6 * C6 + … (8 taps)

Problem at this point: THR3:2 is empty, so we need to bring in more PRN values.

TR15:0 = XCORRS(R7:4, THR3:0) (CUT +15)
TR0 += 0 (0 taps)
TR1 += D0 * C14 (1 tap)
…
TR7 += D6 * C14 + D5 * C13 + … (7 taps)
TR8 += D7 * C14 + D6 * C13 + … (8 taps)
…
TR15 += D7 * C7 + D6 * C6 + … (8 taps)
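The CUT +15 listing matches the opposite window: a term is kept only when its coefficient index is 14 or less. As with the negative cuts, this rule is inferred from the slide, and `tapsWithCutPlus15` is a hypothetical helper.

```cpp
#include <array>

// Taps accumulated by each TR register under "CUT +15", using the rule
// read off the slide: keep D[k]*C[15-n+k] only if 15-n+k <= 14.
std::array<int, 16> tapsWithCutPlus15() {
    std::array<int, 16> taps{};
    for (int n = 0; n < 16; ++n)
        for (int k = 0; k < 8; ++k)
            if (15 - n + k <= 14) ++taps[n];
    return taps;
}
```

Per register, the CUT +15 and CUT -15 tap counts add up to 8, so between them the two forms cover every term exactly once.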

Final result: the maximum correlation occurs every 3 shifts, which is what we expect. But is it the correct result?

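Correlation: the expected result.
In step: -1 + 0j, 1 + 0j, 1 + 0j, … (16 times) with -1 - j, 1 + j, 1 + j, … (16 times). Every product has real part +1, so the real component is 48 * 1 = 0x30.
Out of step: -1 + 0j, 1 + 0j, 1 + 0j, … (16 times) with 1 + j, 1 + j, -1 - j, … (16 times). Each group of three contributes -1 + 1 - 1 = -1 to the real part, so the result is 16 * (-1) = -0x10 = 0xFFF0 as a 16-bit value.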
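The expected values can be checked by summing the real parts directly. Every data sample is real (d = ±1) and every PRN value is s * (1 + j) with s = ±1, so the real part of each product is simply d * s. The sketch (hypothetical names) repeats the 3-value pattern 16 times, i.e. 48 taps, and shows the 16-bit register pattern for the result.

```cpp
#include <array>
#include <cstdint>

// dataSigns: real parts of the received samples (-1, 1, 1).
// prnSigns:  s in each PRN value c = s * (1 + j).
// Re(d * s * (1 + j)) = d * s, so each tap adds d * s to the real sum.
int realCorrelation(const std::array<int, 3>& dataSigns,
                    const std::array<int, 3>& prnSigns, int repeats) {
    int acc = 0;
    for (int r = 0; r < repeats; ++r)
        for (int i = 0; i < 3; ++i)
            acc += dataSigns[i] * prnSigns[i];
    return acc;
}

// The result viewed as a 16-bit two's-complement register value.
std::uint16_t asRegister16(int value) {
    return static_cast<std::uint16_t>(value);
}
```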

Final result
1) We now have correlation values for 16 shifts in the TR registers; store them to external memory. Repeat for all other necessary shifts and find the maximum.
2) Now make it parallel in SISD mode.
3) Now make it parallel in SIMD mode.
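Step 1 can be prototyped in plain C++ before moving to assembly: compute the correlation for every shift, 16 shifts at a time to mirror TR0 to TR15, store each block, then search for the maximum. This is a real-valued, scalar sketch with hypothetical names; it does not use XCORRS and does not cover the SISD or SIMD versions.

```cpp
#include <cstddef>
#include <vector>

// Evaluate shifts 0..numShifts-1 in blocks of 16 (like TR0..TR15), store
// every result, then return the shift with the largest correlation.
// Precondition: rx.size() >= code.size() + numShifts - 1.
std::size_t bestShift(const std::vector<int>& rx, const std::vector<int>& code,
                      std::size_t numShifts) {
    std::vector<long> results(numShifts, 0);     // stands in for external memory
    for (std::size_t base = 0; base < numShifts; base += 16) {
        long tr[16] = {};                        // stands in for TR0..TR15
        for (std::size_t t = 0; t < 16 && base + t < numShifts; ++t)
            for (std::size_t n = 0; n < code.size(); ++n)
                tr[t] += static_cast<long>(rx[base + t + n]) * code[n];
        for (std::size_t t = 0; t < 16 && base + t < numShifts; ++t)
            results[base + t] = tr[t];           // store the block
    }
    std::size_t best = 0;
    for (std::size_t s = 1; s < numShifts; ++s)
        if (results[s] > results[best]) best = s;
    return best;
}
```

Embedding a 12-value ±1 code at offset 21 in an otherwise zero buffer and searching 40 shifts returns 21; that peak falls in the second block of 16.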
