S3.kth.se DSP Lecture 30/3-2010 Per Zetterberg. Agenda General. Starting CCS Comparing matlab and DSP results. Profiling when comparing matlab and DSP.

Presentation transcript:

s3.kth.se DSP Lecture 30/3-2010 Per Zetterberg

Agenda: General. Starting CCS. Comparing Matlab and DSP results. Profiling when comparing Matlab and DSP results. Matlab DSP communication. EDMA. QUAD_DAC_ADC (headphones). _empty. State machine using case statement. Data formats. Overlap and add. Stack and heap. Simple optimization rules. Cache. Some advice.

DSP Programming. Setup in the project course: a PC (the "host") and a DSP (the "target", or DSK).

What is a DSP ? A CPU which is optimized for signal processing: Special instructions for common signal processing operations, e.g. multiply and accumulate. Often on-chip circuits that handle input/output (IO). Low power consumption. Cheap (compared to processors in e.g. desktop computers).

Project Prototype: DSP versus PC. Programs run concurrently on both the DSP and the PC. The DSP card is used for: signal processing; IO (sampling/playback). The PC is used for: the Graphical User Interface (GUI); controlling the application and receiving results.

The DSP in the project course. You will use a Texas Instruments C6713 floating-point digital signal processor. Highly parallel VLIW (very long instruction word) architecture: up to eight 32-bit instructions are executed simultaneously. Running at 225 MHz, giving 1.35 GFLOPS peak performance. Belongs to the TI C6x family of DSPs. Widely used in industry.

Software pipelining. The processor can be programmed to perform eight operations in parallel (e.g. MULT, ADD, MV). Every instruction has a certain latency. The compiler will pipeline code, i.e. execute several instructions in parallel in loops, if: – There are no function calls in the loop. – Optimization -o3 is selected. – ... Check that important loops are pipelined.
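
A minimal sketch of a loop shape that a compiler can typically software-pipeline: a simple counted loop with no function calls in the body. The function name dot_product and the use of the C99 restrict qualifier are illustrative assumptions, not taken from the course skeletons.

```c
#include <stddef.h>

/* Multiply-accumulate loop with no function calls in the body.
 * restrict promises the compiler that a and b do not overlap,
 * which gives it more freedom to schedule loads in parallel. */
float dot_product(const float *restrict a, const float *restrict b, size_t n)
{
    float acc = 0.0f;
    size_t i;
    for (i = 0; i < n; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}
```

Whether a given loop was actually pipelined still has to be checked in the compiler output.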

Technical Requirements of Prototype. Real-time functionality. DSP card: signal processing; PC: user interface. User interface through a GUI (Windows style) implemented in Matlab. No unnecessary use of processor time on the PC. Well-structured and adequately commented source code. For more details see

Development Tools. Matlab: – Algorithm development. – Prototype verification. – User interface development (GUI). – Control of DSP card. – Control of code profiling. DSP (Code Composer Studio): – Algorithm implementation in C/Assembler. – Debugging in conjunction with the Matlab implementation. – Code profiling.

How to learn … How to Quickly Learn DSP Programming : getting_started.shtml Our web-pages: Ask me: Search on the net, newsgroups, ….

PC programming (GUI). Two methods: using GUIDE (a GUI for creating a GUI), or programmatically.

Starting CCS. CCStudio v3.3 is the code development environment. Use Setup CCStudio v3.3 when you need to change between targets: – C6713 DSK-USB. – C6713 Device Cycle Accurate Simulator (little endian). – C6416 Device Cycle Accurate Simulator (little endian). (Diagram labels on the target list: "The hardware" and "When doing tutorial".) Connect to Matlab: – cc=ccsdsp; – cc.visible(0), cc.run, cc.isrunning.

Comparing Matlab and DSP results. Principle for testing isolated functions, e.g. a decoder: Generate input in Matlab. Write input to the DSP. Call the DSP version of the function. Read output from the DSP. Call the Matlab version of the function. Compare results. Let's have a look at the compare_with_matlab_31 skeleton!
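
In the course skeleton the comparison step is done in Matlab. As a rough sketch of what "compare results" amounts to, the same check written in C could look as follows; the function names and the tolerance parameter are hypothetical, and a tolerance is used because floating-point results from two implementations rarely agree bit for bit.

```c
#include <stddef.h>

/* Largest absolute difference between a reference vector and a
 * test vector of the same length. */
float max_abs_error(const float *ref, const float *out, size_t n)
{
    float worst = 0.0f;
    size_t i;
    for (i = 0; i < n; i++) {
        float err = ref[i] - out[i];
        if (err < 0.0f) {
            err = -err;
        }
        if (err > worst) {
            worst = err;
        }
    }
    return worst;
}

/* Returns 1 if every sample agrees within tol, otherwise 0. */
int results_match(const float *ref, const float *out, size_t n, float tol)
{
    return max_abs_error(ref, out, n) <= tol;
}
```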

Test important functions as follows: Copy the entire compare_with_matlab_31.pjt project. Replace FuncionToBeTested with your code, both in the C code and in the Matlab code. Define input and output data and parameters as relevant for your function. Change the Matlab code to generate relevant input data. This is sometimes called a "test harness" in industry.

Matlab DSP communication 1(3). Sending data between Matlab and the DSP when the DSP is not running (Matlab code):
Input_obj=createobj(cc,'Input'); % Input is a global in the DSP code.
write(Input_obj,Input); % write data
Input=read(Input_obj); % read data

DSP -> PC communication 2(3). When the DSP is running (RTDX).
On the DSP side: RTDX_write(&ctrl_chan_dsp2pc, &data_to_matlab, sizeof(float)*NO_FLOATS_TO_MATLAB);
On the Matlab side: data_from_DSP=readmsg(cc.rtdx,'ctrl_chan_dsp2pc','single')
Recommendation: re-use code in the "_empty" skeletons.

Matlab DSP communication 3(3). The PC DSP interface is slow => allowed cheating (if necessary): Pre-read data into memory before real-time processing. Read the result from memory after real-time processing. Large memory areas are available in external memory:
#pragma DATA_SECTION(Data,".external_mem") // On DSP
short Data[1000]; // On DSP
write(cc,h_Data.address(1), int16(Data)); % In Matlab
The data is not cleared when the program is reloaded.

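Enhanced Direct Memory Access (EDMA). [Diagram: one EDMA channel moves data from the TX buffer in memory to the McBSP DXR register and on to the DAC; another EDMA channel moves data from the ADC via the McBSP DRR register to the RX buffer in memory.] Triggers interrupt HWI_INT8 when ready. Leaves the DSP free from moving data back and forth to the ADC/DAC!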

EDMA PaRAM

Ping-Pong Buffering. [Diagram: two EDMA reload parameter sets that link to each other. hEdmaReloadXmtPing: SRC=&gBufferXmtPing, LINK=hEdmaReloadXmtPong. hEdmaReloadXmtPong: SRC=&gBufferXmtPong, LINK=hEdmaReloadXmtPing. DST=DXR.] Let me show you EDMA_RTDX_GPIO_empty and QUAD_DAC_ADC_empty!
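
On the DSP the EDMA switches between the two buffers by itself through the LINK fields. The bookkeeping this implies for the CPU can be sketched in portable C as below; the names ping, pong and pingpong_swap are hypothetical and nothing here touches real EDMA registers.

```c
#define PP_LEN 4

static short ping[PP_LEN];
static short pong[PP_LEN];
static int   dma_uses_ping = 1;   /* buffer the (simulated) DMA is sending */

/* Called when a transfer completes. The DMA moves on to the other
 * buffer, and the buffer it just finished is returned so the CPU
 * can refill it while the other one is being transmitted. */
short *pingpong_swap(void)
{
    short *cpu_buffer = dma_uses_ping ? ping : pong;
    dma_uses_ping = !dma_uses_ping;
    return cpu_buffer;
}
```

The point of the scheme is that the CPU never writes to the buffer the DMA is currently reading.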

Skeleton programs handling EDMA+RTDX. "Single-antenna": EDMA_RTDX_GPIO_31_empty (code development), EDMA_RTDX_GPIO_31 (Matlab prototype). "Dual-antenna": QUAD_DAC_ADC_31_empty (code development), QUAD_DAC_ADC_31 (Matlab prototype).

QUAD_DAC_ADC_31. Let's go through QUAD_DAC_ADC_31_empty, then go through QUAD_DAC_ADC_31. This is the DSP Matlab interface to be used in the Matlab prototype! Note: documentation is in "main.c".

State Machine using Case Statement in appl_Process
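
The slide's own code is not part of the transcript. A minimal sketch of a state machine driven by a switch/case statement, called once per buffer, might look like this; the states, the function name appl_step and its parameters are all hypothetical.

```c
/* Hypothetical states for a simple receiver. */
typedef enum { STATE_IDLE, STATE_SYNC, STATE_RECEIVE } app_state_t;

static app_state_t state = STATE_IDLE;
static int buffers_left = 0;

/* Advance the state machine by one buffer. signal_detected and
 * sync_found stand in for real detection results. */
app_state_t appl_step(int signal_detected, int sync_found, int payload_buffers)
{
    switch (state) {
    case STATE_IDLE:
        if (signal_detected) {
            state = STATE_SYNC;
        }
        break;
    case STATE_SYNC:
        if (sync_found) {
            buffers_left = payload_buffers;
            state = STATE_RECEIVE;
        }
        break;
    case STATE_RECEIVE:
        if (--buffers_left <= 0) {
            state = STATE_IDLE;
        }
        break;
    default:
        state = STATE_IDLE;   /* recover from a corrupted state */
        break;
    }
    return state;
}
```

Keeping the state in a static variable lets the per-buffer routine return quickly and pick up where it left off on the next call.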

Data formats. C types: char = 8 bits, short = 16 bits, int = 32 bits, float = 32 bits. Integers are signed or unsigned. Float: sign = 1 bit, exponent = 8 bits, fraction = 23 bits. In C, conversion is automatic (when pointers are not involved...). However, note the range ...
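
One place where the range matters is converting a float result back to a 16-bit short for the DAC. A small sketch of a conversion that clips instead of overflowing (the function name float_to_short_sat is an assumption):

```c
/* Convert a float sample to a 16-bit short, clipping to the
 * representable range [-32768, 32767] instead of overflowing. */
short float_to_short_sat(float x)
{
    if (x >= 32767.0f) {
        return 32767;
    }
    if (x <= -32768.0f) {
        return -32768;
    }
    return (short)x;   /* in range: the cast truncates toward zero */
}
```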

The buffers in QUAD_DAC_ADC: appl_Process(short *receive_buffer, short *transmit_buffer). The buffers consist of BUFFSIZE shorts (range [-2^15, 2^15-1]). BUFFSIZE is defined in EDMA_RTDX_GPIO.h to be 1024. The number of bytes is 2*BUFFSIZE = 2048. In EDMA_RTDX_GPIO there are 4 channels (i.e. ADC and DAC converters), which are interleaved. Thus the number of 4-dimensional vector samples is BUFFSIZE/4 = 256. BUFFSIZE can be changed.
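
A sketch of splitting an interleaved buffer into one array per channel. It assumes the samples are ordered channel 0, 1, 2, 3, 0, 1, ... which the slide does not state; the names deinterleave and NUM_CHANNELS are illustrative.

```c
#define NUM_CHANNELS 4

/* Split an interleaved buffer of buffsize shorts into one array per
 * channel. Each output array must hold buffsize / NUM_CHANNELS samples. */
void deinterleave(const short *buffer, int buffsize, short *ch[NUM_CHANNELS])
{
    int frames = buffsize / NUM_CHANNELS;   /* vector samples per buffer */
    int n, c;
    for (n = 0; n < frames; n++) {
        for (c = 0; c < NUM_CHANNELS; c++) {
            ch[c][n] = buffer[n * NUM_CHANNELS + c];
        }
    }
}
```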

Overlap and add. Say we want to implement a FIR filter. The input buffer is 128 samples. The filter is 10 samples. The filtered signal is 128+10-1 = 137 samples. But the output buffer is 128 samples. Solution: overlap and add. Variant 1: Save the last 9 samples. Add them to the next buffer. Variant 2: Overlap and add with an additional buffer. See next slide.

Overlap and Add: With additional buffer. [Diagram: a buffer of 128 samples plus 9 extra samples. Steps: add the new signal, move samples, zero the vacated samples.] Good if the transmit signal is 128 samples and unsynchronized!
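
A minimal sketch of the first variant (save the last 9 samples and add them to the next buffer), using the block and filter lengths from the example. The function name and the static tail array are assumptions, and the straightforward double loop is written for clarity, not speed.

```c
#define BLOCK_LEN 128                 /* samples per buffer            */
#define FILT_LEN  10                  /* FIR filter taps               */
#define TAIL_LEN  (FILT_LEN - 1)      /* 9 samples carried to next call */

static float tail[TAIL_LEN];          /* static storage starts at zero */

/* Filter one block. The full result is BLOCK_LEN + TAIL_LEN = 137
 * samples: the first 128 go to out[], the last 9 are kept in tail[]
 * and added to the start of the next block. */
void fir_overlap_add(const float *in, const float *h, float *out)
{
    float full[BLOCK_LEN + TAIL_LEN];
    int n, k;

    for (n = 0; n < BLOCK_LEN + TAIL_LEN; n++) {
        full[n] = 0.0f;
    }
    /* Linear convolution of the block with the filter. */
    for (n = 0; n < BLOCK_LEN; n++) {
        for (k = 0; k < FILT_LEN; k++) {
            full[n + k] += in[n] * h[k];
        }
    }
    /* Add the tail saved from the previous block. */
    for (n = 0; n < TAIL_LEN; n++) {
        full[n] += tail[n];
    }
    for (n = 0; n < BLOCK_LEN; n++) {
        out[n] = full[n];
    }
    /* Save the new tail for the next call. */
    for (n = 0; n < TAIL_LEN; n++) {
        tail[n] = full[BLOCK_LEN + n];
    }
}
```

Feeding an impulse in the last sample of one block makes the filter taps h[1] to h[9] appear at the start of the next block's output, which is a quick way to check the carry-over.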

Stack and Heap.
float myfunction(short *buffer) { float internal_buffer[1000]; … }
This data is stored on the stack. At least 4000 bytes are needed. The stack size is set in "build options". No warning is given by the compiler if the stack size is too small!
float *internal_buffer; internal_buffer = (float *) malloc(1000*sizeof(float)); …
This is allocated on the heap. The heap size is also set in "build options". Again, no warning!
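
Since neither an undersized stack nor an undersized heap produces a compiler warning, two defensive habits are sketched below: giving large work buffers static storage, and checking the return value of malloc. The names are hypothetical.

```c
#include <stdlib.h>

#define WORK_LEN 1000

/* Option 1: static storage. The 4000 bytes are reserved at link
 * time and use neither the stack nor the heap. */
static float static_buffer[WORK_LEN];

float *get_static_buffer(void)
{
    return static_buffer;
}

/* Option 2: heap storage. malloc returns NULL when the heap is too
 * small, so the result is checked before it is used. */
float *alloc_work_buffer(size_t n)
{
    float *p = (float *)malloc(n * sizeof(float));
    size_t i;
    if (p == NULL) {
        return NULL;              /* caller must handle the failure */
    }
    for (i = 0; i < n; i++) {
        p[i] = 0.0f;
    }
    return p;
}
```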

Code Optimization Let me show you optimization_example.

Simple Optimization Rules 1(2). Turn optimization on: flag "-o3", plus program mode compilation "-pm" and "-op3" if possible. Turn debug off, i.e. do not use "-g". Avoid function calls inside loops! Division "/" is a function call; use _rcpsp instead. For other intrinsics see table 8-6 in spru187n. Avoid math functions such as "sin(x)"; use look-up tables instead. Check that all important loops are pipelined by searching for "SOFTWARE PIPELINE INFORMATION" in the generated ".asm" files.
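
A portable sketch of the "no division inside the loop" rule: the reciprocal is computed once and the loop body only multiplies. The function name scale_by_reciprocal is an assumption. On the C67x the single division could itself be replaced by the _rcpsp intrinsic named above, which to my understanding returns an approximate reciprocal, so the result may need refinement if full precision matters.

```c
/* Scale a buffer by 1/divisor without dividing inside the loop. */
void scale_by_reciprocal(float *x, int n, float divisor)
{
    float inv = 1.0f / divisor;   /* one division, outside the loop */
    int i;
    for (i = 0; i < n; i++) {
        x[i] *= inv;              /* multiply only, no call in the loop */
    }
}
```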

Simple Optimization Rules 2(2). Allocate all time-critical code and data in internal memory. (In our skeletons this is the default; allocating to external memory requires a #pragma statement.) Use the touch function in an initialization routine to have the most important data structures cached in internal memory. (This function can be copied from the cache_miss_example skeleton.) float ImportantData[100]; … touch(ImportantData,100);
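
The touch function from the skeleton is not reproduced in the transcript. The idea can be sketched as reading one byte from every cache line an object occupies, so that each line is fetched before real-time processing starts. The function below is a hypothetical stand-in with its own name and a byte-count argument, and it assumes a 32-byte line size.

```c
#define CACHE_LINE_BYTES 32u   /* assumed L1D line size */

/* Read one byte from every cache line covered by the object so the
 * lines are brought into the cache. The reads go through a volatile
 * pointer and are summed so they cannot be optimised away. */
unsigned touch_bytes(const void *data, unsigned num_bytes)
{
    const volatile unsigned char *p = (const volatile unsigned char *)data;
    unsigned sum = 0;
    unsigned i;
    for (i = 0; i < num_bytes; i += CACHE_LINE_BYTES) {
        sum += p[i];
    }
    if (num_bytes > 0) {
        sum += p[num_bytes - 1];   /* covers the last line if unaligned */
    }
    return sum;
}
```

For the array in the slide, a call would be touch_bytes(ImportantData, sizeof(ImportantData)).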

TMS320C6713 cache. [Diagram: CPU core; L1P (program cache), 4 kB; L1D (data cache), 4 kB; memory: 256 kB internal, 16 MB external.]

One-way cache (L1P). [Diagram: each cache line holds one of the SDRAM blocks that map to it.]
Line 0: Mem 0x0000-0x001F, Mem 0x1000-0x101F, ...
Line 1: Mem 0x0020-0x003F, Mem 0x1020-0x103F, ...
Line 127: Mem 0x0FE0-0x0FFF, Mem 0x1FE0-0x1FFF, ...

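Two-way cache (L1D). [Diagram: each set has two lines, A and B, so two blocks that map to the same set can be cached at the same time.]
Lines 0A/0B: Mem 0x0000-0x001F, Mem 0x0800-0x081F, ...
Lines 1A/1B: Mem 0x0020-0x003F, Mem 0x0820-0x083F, ...
Lines 63A/63B: Mem 0x07E0-0x07FF, Mem 0x0FE0-0x0FFF, ...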

L1D cache. [Diagram: the address is split into tag, set index and offset fields.] L1D allocation: a new line of 32 bytes is loaded on a read miss, with a penalty of 4 clock cycles. If two words (8 bytes) are loaded per clock cycle (reading sequentially from a memory segment), the overhead is 8/32*4 = 1 clock cycle per instruction cycle. A write miss does not lead to the loading of a new line. A write buffer of four words handles up to four misses without penalty.

cache_miss_example. main.c: illustrates the impact of L1D write and read misses (compulsory misses). main2.c: illustrates the problem with several data objects in the same set (thrashing). Two data objects are in the same set if Aa = K*2048 + Ab, for some addresses Aa and Ab in object A and B respectively, and for some integer K. Two code objects are in the same set if Aa = K*4096 + Ab, for some addresses Aa and Ab in object A and B respectively, and for some integer K.
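
The data-object condition can be checked with a few lines of C. This sketch assumes the L1D geometry described in the lecture (32-byte lines, 64 sets, two ways, so addresses 2048 bytes apart share a set); the function names are hypothetical.

```c
#define L1D_LINE_BYTES 32u
#define L1D_NUM_SETS   64u    /* 64 sets x 2 ways x 32 bytes = 4 kB */

/* Set index that an address maps to in the two-way L1D cache. */
unsigned l1d_set_index(unsigned long addr)
{
    return (unsigned)((addr / L1D_LINE_BYTES) % L1D_NUM_SETS);
}

/* Returns 1 if the two addresses compete for the same L1D set. */
int l1d_same_set(unsigned long a, unsigned long b)
{
    return l1d_set_index(a) == l1d_set_index(b);
}
```

Running such a check over the start addresses from the .map file is one way to spot three or more buffers that land in the same set.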

What to consider when programming to make good use of the cache. Align all data buffers on 32-byte boundaries (#pragma DATA_ALIGN). Avoid allocating more than two objects that map to the same set in the same algorithm. Avoid having two or more computationally complex algorithms that map to the same set. Profile the algorithms with and without cached data and program (see cache_miss_example). Force caching of important data and code before the real-time program starts (e.g. in appl_Init()) by reading the data (touch) and calling the functions. Test processing data in smaller buffers to see if performance improves.

Some advice 1(2). Start with a skeleton. Only insert functions which have been checked against Matlab. Make one change at a time => much easier to find out what went wrong. Save "before" and "after" code. Don't use printf.

Some advice 2(2). Check that all pointers are initialized. If a variable is corrupted, check the .map file to see how it could have been overwritten. Use an extern declaration both in the file where the variable is declared and where it is used. In real-time debugging, store results to "debug globals". When using sqrt, log or log10, use "#include <math.h>".