Ultra sound solution Impact of C++ DSP optimization techniques.

Presentation transcript:

Ultra sound solution Impact of C++ DSP optimization techniques

Research team discussion. An ultrasound probe (20 MHz) sends signals into the body that reflect off moving blood cells (in an artery? a vein?). The received ultrasound frequency is Doppler shifted compared to the transmitted frequency, the same as the sound when an ambulance goes by: higher if approaching, lower if receding. The team gets the positive frequencies (towards) on the left audio channel and the negative frequencies (away) on the right audio channel. 9/17/2015 ENCM515 – Ultrasound Problem Copyright 2 / 33
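
The Doppler relationship described above can be sketched in a few lines of C++. This is a minimal illustration, not the research team's code: the function name, the default speed of sound (about 1540 m/s, a common figure for soft tissue) and the example blood speed are assumptions, and the formula is the usual pulse-echo approximation for speeds much smaller than the speed of sound.

```cpp
#include <cassert>
#include <cmath>

// Doppler shift (Hz) seen by a pulse-echo ultrasound probe.
// f0_hz:     transmitted frequency
// v_mps:     blood speed along the flow direction (positive = towards the probe)
// angle_rad: angle between the beam and the flow direction
// c_mps:     speed of sound in the medium (ASSUMED 1540 m/s for soft tissue)
double dopplerShiftHz(double f0_hz, double v_mps, double angle_rad,
                      double c_mps = 1540.0) {
    assert(c_mps > 0.0);
    // Factor 2: the wave is shifted once on the way to the moving cell
    // and once more on the reflection back to the probe.
    return 2.0 * f0_hz * v_mps * std::cos(angle_rad) / c_mps;
}
```

For a 20 MHz probe and a hypothetical 0.5 m/s flow straight towards the probe, `dopplerShiftHz(20e6, 0.5, 0.0)` is about 13 kHz. A shift of that size sits in the audio range, which is consistent with the signal arriving on audio channels. A negative speed (flow away) gives a negative shift.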

Picture looks like this. Note that the display loses all direction information. Can I help them to output the maximum frequency?

Captured audio signal. Engineering problems:
Problem 5 – Different amplitudes are common.
Problem 6 – Why are the funny dead spots not lining up in the left and right channels? Handling stereo, not mono, signals. Incorrect labeling / misinterpretation.
Problem 7 – How to remove the dead spots?

Max frequency – definition 1. The frequency below which X% of the frequencies fall. The signal is noisy for large thresholds (> 80%).
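
Definition 1 can be read as a percentile of the power spectrum. The sketch below assumes the spectrum is already available as non-negative power values per frequency bin; the function names, the bin-width parameter and the choice of power (rather than counts) as the weighting are illustrative assumptions, not taken from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index of the first bin at which the cumulative power reaches `percent`
// of the total power. `power` holds one non-negative value per bin.
std::size_t percentileBin(const std::vector<double>& power, double percent) {
    assert(!power.empty());
    assert(percent > 0.0 && percent <= 100.0);
    double total = 0.0;
    for (double p : power) total += p;
    const double target = total * percent / 100.0;
    double running = 0.0;
    for (std::size_t k = 0; k < power.size(); ++k) {
        running += power[k];
        if (running >= target) return k;
    }
    return power.size() - 1;  // only reached through rounding error
}

// "Maximum frequency" in Hz, given the width of one bin.
double maxFrequencyHz(const std::vector<double>& power, double binWidthHz,
                      double percent) {
    return static_cast<double>(percentileBin(power, percent)) * binWidthHz;
}
```

With the power values {1, 1, 1, 1, 0, 0, 0, 0}, 100 Hz bins and X = 75, the function returns 200 Hz. Pushing X towards 100 makes the result depend on the last few low-power bins, which is one way to see why large thresholds give a noisy estimate.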

After XPI Stage 2. Have a working algorithm concept.
Engineering Problem 1 – Complex math (a + jb) on SHARC!
Engineering Problem 2 – Define maximum frequency: zillions of blood cells, therefore a distribution of frequencies. Workable prototype; discuss more with customer.
Engineering Problem 3 – SHARC D/A can’t handle a DC signal. Workable prototype; discuss more with customer.
Engineering Problem 4 – Can SHARC handle all this in real time?
Problem 5 – Are different amplitudes of the input channels common? Yes.
Problem 6 – Why are the funny dead spots not lining up in the left and right channels? Artifact: mislabeled and misinterpreted samples.
Problem 7 – How to remove the dead spots? Discuss more with customer.

ProcessBlock is DONE OUTSIDE the INTERRUPT, which AVOIDS a RACE.
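
One common way to keep block processing out of the interrupt is a ping-pong buffer: the interrupt handler only stores samples and posts a "block ready" index, and the main loop does the work. The sketch below is an assumption about how this could look in portable C++; all names and the block size are hypothetical, the "interrupt" is an ordinary function, and the processing is a placeholder sum.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

constexpr std::size_t kBlockSize = 4;   // hypothetical block length

int g_buf[2][kBlockSize];               // ping-pong capture buffers
std::size_t g_writeBuf = 0;             // buffer currently filled by the ISR
std::size_t g_fill = 0;                 // samples stored in that buffer
std::atomic<int> g_readyBuf{-1};        // index of a full buffer, -1 if none

// Stand-in for the audio interrupt handler: store one sample and, when a
// block is complete, publish it and switch buffers. No processing here.
void audioIsr(int sample) {
    assert(g_fill < kBlockSize);
    g_buf[g_writeBuf][g_fill++] = sample;
    if (g_fill == kBlockSize) {
        g_readyBuf.store(static_cast<int>(g_writeBuf), std::memory_order_release);
        g_writeBuf ^= 1u;
        g_fill = 0;
    }
}

// Called from the main loop. Processes a published block, if any, while the
// ISR keeps filling the other buffer. Returns true if a block was processed.
bool processPendingBlock(long& sumOut) {
    const int ready = g_readyBuf.exchange(-1, std::memory_order_acquire);
    if (ready < 0) return false;
    long sum = 0;
    for (std::size_t i = 0; i < kBlockSize; ++i) sum += g_buf[ready][i];
    sumOut = sum;                       // placeholder for the real processing
    return true;
}
```

The design only stays race free while the main loop finishes a block before the ISR has filled the other buffer, so it moves the real-time question to "is one block period long enough?".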

Real life problem -- Stereo. Minor changes to the audio preemptive task.

Make “C” code more general. Moved buffer[ ] to external files. Unknown size of arrays being processed.

Switch to Release mode. Switching to the optimizing compiler (ReleaseNWC) means you can no longer set breakpoints. Fix with these steps.

First look at code. Timing -- software loop with r2 as loop counter, test at end. N * (10 – 1) cycles (jump is not db); -1 for 1 parallel instruction.

Use Compiler Info button. 3 stalls: 2 on the software jump, 1 on ?

Obvious things to do. We are already processing left and right channels in one program. Switch to left audio in dm memory and right audio in pm memory. Need to do: make the right buffers ‘pm’; change the prototype of the function to pass pm.
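
The two "need to do" items could look roughly like the sketch below. It is an assumption, not the course code: `USE_SHARC_PM` and `PM` are hypothetical macros introduced here so that the same source builds as plain C++ on a desktop compiler, and the placement of the `pm` qualifier follows the usual SHARC C convention but should be checked against the compiler manual. The buffer names and block size are also hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// HYPOTHETICAL switch: on a SHARC toolchain, building with USE_SHARC_PM
// defined would make PM expand to the `pm` memory qualifier. Elsewhere PM
// expands to nothing and this is ordinary C++.
#ifdef USE_SHARC_PM
#define PM pm
#else
#define PM
#endif

constexpr std::size_t kN = 8;   // hypothetical block size

float leftIn[kN];               // left channel stays in default (dm) memory
float leftOut[kN];
float PM rightIn[kN];           // right channel buffers moved to pm memory
float PM rightOut[kN];

// Prototype changed so the right-channel pointers carry the PM qualifier.
void copyStereo(const float* lIn, float* lOut,
                const float PM* rIn, float PM* rOut, std::size_t n) {
    assert(lIn && lOut && rIn && rOut);
    for (std::size_t i = 0; i < n; ++i) {
        lOut[i] = lIn[i];       // dm access
        rOut[i] = rIn[i];       // pm access, which SHARC can issue in parallel
    }
}
```

Calling `copyStereo(leftIn, leftOut, rightIn, rightOut, kN)` copies both channels in one loop; whether the two accesses really share a cycle is up to the compiler and the memory layout.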

As expected, 2 cycles saved. Parallel dm and pm reads and writes.

Why software loop? The compiler does not know the size of the loop, so it can’t optimize the loop. THIS PRAGMA IS A CONTRACT BETWEEN THE DEVELOPER AND THE COMPILER. DON’T LIE.

This does not compile. Pragma variables are not handled by the preprocessor.
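
Standard C++11 offers the `_Pragma` operator for this situation, because its argument can be produced by macro expansion. Whether the SHARC compiler used in the course accepts it is not stated in the slides, so the sketch below uses a GCC pragma purely as a stand-in; `UNROLL_FACTOR`, `DO_PRAGMA` and `sumArray` are hypothetical names.

```cpp
#include <cassert>

#define UNROLL_FACTOR 4              // hypothetical tuning constant
#define DO_PRAGMA_(x) _Pragma(#x)    // turn the expanded tokens into a string
#define DO_PRAGMA(x) DO_PRAGMA_(x)   // extra level so macros inside x expand first

int sumArray(const int* data, int n) {
    assert(data != nullptr);
    int sum = 0;
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 8
    // Expands to the equivalent of: #pragma GCC unroll 4
    DO_PRAGMA(GCC unroll UNROLL_FACTOR)
#endif
    for (int i = 0; i < n; ++i) sum += data[i];
    return sum;
}
```

The two-level macro matters: with a single level, the `#` operator would stringify the name `UNROLL_FACTOR` itself instead of its value.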

Variable as end of loop. The compiler will not optimize when the loop parameter is declared external, or internal or static.

Loop parameters all constants known to the compiler. Drop from 8 cycles to 2 cycles, as the compiler knows enough to switch to hardware loop control. STALLS FROM JUMP GONE.
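
In portable C++, one way to make the trip count a constant that the compiler can see is to carry it in the type. The sketch below contrasts a run-time bound with a compile-time bound; the function names are hypothetical, and whether a particular compiler then picks hardware loop control is its own decision, not something this code guarantees.

```cpp
#include <cassert>
#include <cstddef>

// Trip count supplied at run time: the compiler must emit a general loop.
void addOffsetRuntime(int* data, std::size_t n, int offset) {
    assert(data != nullptr);
    for (std::size_t i = 0; i < n; ++i) data[i] += offset;
}

// Trip count N is a template constant, so it is known at every call site.
template <std::size_t N>
void addOffsetFixed(int (&data)[N], int offset) {
    for (std::size_t i = 0; i < N; ++i) data[i] += offset;
}
```

Passing an `int buffer[128]` to `addOffsetFixed` instantiates a version whose loop bound is the literal 128, which is the situation this slide describes.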

Where am I getting all my info?

Can we switch to SIMD mode? VECTORIZATION MAY NOT BE POSSIBLE IF THE COMPILER DOES NOT KNOW ABOUT THE ALIGNMENT OF ARRAYS (how the arrays are placed in memory).
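
In standard C++ the alignment of an array can be stated with `alignas`, and a function can document the alignment it relies on with an assertion. The sketch below assumes an 8-byte (two 32-bit words) boundary as the requirement for paired accesses; that figure, the array names and the function names are illustrative assumptions rather than values from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPoints = 128;      // hypothetical array length
constexpr std::size_t kAlignBytes = 8;    // ASSUMED boundary for paired loads

alignas(kAlignBytes) float g_left[kPoints];
alignas(kAlignBytes) float g_right[kPoints];
alignas(kAlignBytes) float g_sum[kPoints];

// True if p sits on a multiple of `alignment` bytes.
bool isAligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Element-wise add that states its preconditions: aligned arrays and an
// even point count, so elements can be handled two at a time.
void addAligned(const float* a, const float* b, float* out, std::size_t n) {
    assert(isAligned(a, kAlignBytes));
    assert(isAligned(b, kAlignBytes));
    assert(isAligned(out, kAlignBytes));
    assert(n % 2 == 0);
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}
```

A pointer such as `g_left + 1` fails the check, which is exactly the case a compiler has to worry about when it only sees a bare `float*` parameter.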

Impact of vectorization. Before -- loop count was 0x80, with memory operations of the form r2 = dm(i4, m6) where m6 = 1, meaning the code is doing r2 = *i4++;

New instructions – SIMD mode. Bit set mode1 0x (bit clr mode1). The processor doing r2 = dm(i5, 2) is the same as r2 = dm(i5, 1) AND s2 = dm(i5, 1): loading two registers.
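
The effect of loading two registers per access can be mimicked in plain C++ by handling two neighbouring elements per loop pass. This is only an emulation for counting iterations, not SIMD code; the function names and the gain operation are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// One element per pass: n passes through the loop.
std::size_t scaleScalar(float* data, std::size_t n, float gain) {
    std::size_t passes = 0;
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= gain;
        ++passes;
    }
    return passes;
}

// Two neighbouring elements per pass, mimicking one access that fills a
// register pair (r2 and s2 on the slide). n must be even.
std::size_t scalePairs(float* data, std::size_t n, float gain) {
    assert(n % 2 == 0);
    std::size_t passes = 0;
    for (std::size_t i = 0; i < n; i += 2) {
        data[i] *= gain;       // value that would land in r2
        data[i + 1] *= gain;   // value that would land in s2
        ++passes;
    }
    return passes;
}
```

For an array of 0x80 points, `scaleScalar` makes 0x80 passes and `scalePairs` makes 0x40, so the loop count is halved while the address step per pass doubles.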

Try using #pragma inline. BEFORE / AFTER (20 cycles faster?)

C++ showing out of order execution. WARNING

Let’s do “inline”. ProcessOneBlock( ) is called by four subroutines – let’s inline

Mixed mode view is interesting

Mixed Mode. Out of order execution with 4 copies of the code for DoCopyBlock( ) (one for each of Process0, Process1, Process2, Process3). NO CODE OF ProcessOneBlock( ).

Speed improvement.
Moving from the software loop and using dm and pm memories caused a change from 8 cycles / pt to 2 cycles for two points processed in SIMD (4 CALLS * 7 CYCLES SAVED * N POINTS PROCESSED).
Moving to IN_LINE causes a change of around 120 cycles for each subroutine call (4 CALLS * 120 CYCLES SAVED).
N = (4 * 1800 to 4 * 120) 480 MHz processor us to 1 us
LESSON LEARNT – SPEND YOUR TIME OPTIMIZING THE LOOPS – THE REST IS SMALLER AND GETS SMALLER WITH LARGER N
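
The cycle counts on this slide can be turned into time with one division, since a clock in MHz is a count of cycles per microsecond. The sketch below assumes that "4 * 1800" and "4 * 120" are cycle counts for the loop change and the inline change respectively; that reading, and all names, are assumptions made for illustration.

```cpp
#include <cassert>

// Converts a cycle count to microseconds for a clock given in MHz.
// A clock of f MHz executes f cycles per microsecond.
double cyclesToMicroseconds(double cycles, double clockMHz) {
    assert(clockMHz > 0.0);
    return cycles / clockMHz;
}

const double kClockMHz = 480.0;                  // processor clock from the slide
const double kLoopCyclesSaved = 4.0 * 1800.0;    // ASSUMED reading of "4 * 1800"
const double kInlineCyclesSaved = 4.0 * 120.0;   // 4 calls * 120 cycles saved
```

Under that reading, `cyclesToMicroseconds(kLoopCyclesSaved, kClockMHz)` is 15 us and `cyclesToMicroseconds(kInlineCyclesSaved, kClockMHz)` is 1 us. The loop term also grows with N while the inline term does not, which is the point of the lesson learnt.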

Other improvements depend on the specific characteristics of the code.

Profile guided optimization

Memory alignment can be important. After the first char fetch, the system can move to moving 8 chars in SIMD.

Conditional code (manual PGO)

Correct ways to process loops

#pragma all_aligned
#pragma loop_unroll N
#pragma SIMD_for
#pragma align num
#pragma alignment_region( and #pragma alignment_region_end