DSP Lecture 01 Chapter 1 Introduction.


1 DSP Lecture 01 Chapter 1 Introduction

2 Learning Objectives Why process signals digitally?
Definition of a real-time application. Why use Digital Signal Processing processors? What are the typical DSP algorithms? Parameters to consider when choosing a DSP processor. Programmable vs ASIC DSP. Texas Instruments’ TMS320 family.

3 DSP: Technology Enabler
Present Day Applications:
Wireless / Cellular: Voice-band audio. RF codecs. Voltage regulation.
HDD: PRML read channel. MR pre-amp. Servo control. SCSI transceivers.
Consumer Audio: Stereo A/D, D/A. PLL. Mixers.
Automotive: Digital radio. A/D/A. Active suspension. Voltage regulation.
Multimedia: Stereo audio. Imaging. Graphics palette. Voltage regulation.
DTAD: Speech synthesizer. Mixed-signal processor.

4 Why go digital? Digital signal processing techniques are now so powerful that sometimes it is extremely difficult, if not impossible, for analogue signal processing to achieve similar performance. Examples: FIR filter with linear phase. Adaptive filters.
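A linear-phase FIR filter is usually obtained by making the coefficients symmetric about the centre tap, h[k] = h[N-1-k], which is trivial to guarantee in software. The following is a minimal sketch of that idea; the function names `fir_is_symmetric` and `fir_sample` are hypothetical and are not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the taps are symmetric about the centre (h[k] == h[n-1-k]),
   the usual condition for an FIR filter to have linear phase. */
int fir_is_symmetric(const float *h, size_t n)
{
    assert(h != NULL);
    for (size_t k = 0; k < n / 2; k++) {
        if (h[k] != h[n - 1 - k]) {
            return 0;
        }
    }
    return 1;
}

/* Computes one output sample y = sum of h[k] * x[k], where x[0] is the
   newest input sample and x[n-1] the oldest. */
float fir_sample(const float *h, const float *x, size_t n)
{
    assert(h != NULL && x != NULL);
    float y = 0.0f;
    for (size_t k = 0; k < n; k++) {
        y += h[k] * x[k];
    }
    return y;
}
```

Calling `fir_is_symmetric` on a coefficient table before using it is one simple way to confirm that a design still has linear phase after the coefficients have been edited.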

5 Why go digital? Analogue signal processing is achieved by using analogue components such as: Resistors. Capacitors. Inductors. The inherent tolerances associated with these components, temperature, voltage changes and mechanical vibrations can dramatically affect the effectiveness of the analogue circuitry.
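To put a number on the effect of component tolerances, consider a first-order RC low-pass filter, whose cutoff frequency is 1 / (2·π·R·C). The sketch below is only an illustration: the RC filter itself, the function names and any component values passed in are assumptions, not something the slides specify.

```c
#include <assert.h>

/* Cutoff frequency in Hz of a first-order RC low-pass: fc = 1 / (2*pi*R*C). */
double rc_cutoff_hz(double r_ohms, double c_farads)
{
    const double pi = 3.14159265358979323846;
    assert(r_ohms > 0.0 && c_farads > 0.0);
    return 1.0 / (2.0 * pi * r_ohms * c_farads);
}

/* Lowest cutoff: both parts at the top of their tolerance band. */
double rc_cutoff_min_hz(double r_ohms, double c_farads, double tol)
{
    return rc_cutoff_hz(r_ohms * (1.0 + tol), c_farads * (1.0 + tol));
}

/* Highest cutoff: both parts at the bottom of their tolerance band. */
double rc_cutoff_max_hz(double r_ohms, double c_farads, double tol)
{
    return rc_cutoff_hz(r_ohms * (1.0 - tol), c_farads * (1.0 - tol));
}
```

With hypothetical 10 kΩ and 10 nF parts at 10% tolerance (`tol = 0.10`), the nominal cutoff of about 1.59 kHz can land anywhere between roughly 1.32 kHz and 1.96 kHz, before temperature or ageing are taken into account.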

6 Why go digital?
With DSP it is easy to: Change applications. Correct applications. Update applications.
Additionally DSP reduces: Noise susceptibility. Chip count. Development time. Cost. Power consumption.

7 Why NOT go digital? High-frequency signals cannot be processed digitally for two reasons: Analog-to-Digital Converters (ADCs) cannot work fast enough. The application can be too complex to be performed in real-time.

8 Real-time processing DSP processors have to perform tasks in real-time, so how do we define real-time? The definition of real-time depends on the application. Example: a 100-tap FIR filter is performed in real-time if the DSP can perform and complete the following operation between two samples: $y(n) = \sum_{k=0}^{99} a_k \, x(n-k)$

9 Real-time processing
[Timing diagram: the sample time between samples n and n+1 is divided into processing time followed by waiting time.]
We can say that we have a real-time application if: Waiting Time ≥ 0
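The real-time condition can be checked numerically: the waiting time is the sample time minus the processing time, and it must not be negative. The sketch below assumes the processing cost is known as a cycle count per sample; the function names and any sample rates or clock frequencies passed in are illustrative assumptions.

```c
#include <assert.h>

/* Time between two consecutive samples, in nanoseconds. */
double sample_period_ns(double sample_rate_hz)
{
    assert(sample_rate_hz > 0.0);
    return 1e9 / sample_rate_hz;
}

/* Time spent processing one sample, in nanoseconds. */
double processing_time_ns(long cycles_per_sample, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)cycles_per_sample * 1e9 / clock_hz;
}

/* Waiting time = sample time - processing time. */
double waiting_time_ns(double sample_rate_hz, long cycles_per_sample,
                       double clock_hz)
{
    return sample_period_ns(sample_rate_hz)
         - processing_time_ns(cycles_per_sample, clock_hz);
}

/* Real-time if the waiting time is not negative. */
int is_real_time(double sample_rate_hz, long cycles_per_sample,
                 double clock_hz)
{
    return waiting_time_ns(sample_rate_hz, cycles_per_sample, clock_hz) >= 0.0;
}
```

For example, at a hypothetical 48 kHz sample rate on a 150 MHz device, an algorithm needing 100 cycles per sample leaves about 20 µs of waiting time, whereas one needing 5000 cycles per sample overruns the sample period and is not real-time.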

10 Why do we need DSP processors?
Why not use a General Purpose Processor (GPP) such as a Pentium instead of a DSP processor? What is the power consumption of a Pentium and a DSP processor? What is the cost of a Pentium and a DSP processor?

11 Why do we need DSP processors?
Use a DSP processor when the following are required: Cost saving. Smaller size. Low power consumption. Processing of many “high” frequency signals in real-time.
Use a GPP when the following are required: Large memory. Advanced operating systems.

12 What are the typical DSP algorithms?
The Sum of Products (SOP) is the key element in most DSP algorithms:

13 What Problem Are We Trying To Solve?
[Block diagram: x → ADC → DSP → DAC → Y. Inset: digital sampling of an analog signal, amplitude A against time t.]
Most DSP algorithms can be expressed with MAC: $Y = \sum_{i=1}^{\mathrm{count}} a_i \, x_i$
for (i = 0; i < count; i++){ sum += m[i] * n[i]; }
Over the next 20 slides, we want to provide an example to anchor the presentation and provide context. What better algorithm than the standard sum of products? The question lead-in is “so, what problem are we trying to solve?”
The basics of DSP involve first sampling an analog signal and converting it to digital. What do we do then? Some type of algorithm to shape, modify, etc. the signal. This is easily done in the digital realm. So, the time between samples is our limit to how fast we need to do the algorithm. What does a typical algorithm look like? This: a simple sum of products. Let’s look at a typical DSP algorithm and see how the processor is designed to handle it.
Spend about 1 minute on this slide. If the group is VERY new to DSP, you might embellish slightly on any areas you feel comfortable with. But remember, the focus is not WHY DSP, it is “assuming you know why you’d want to use this algorithm, let’s see how the processor is built to handle it”. The lead-in to the next slide is the question shown on the slide. Also state that we plan to write the code for this algorithm and see how the architecture is designed to handle it efficiently.
What does it take to do this fast … and easy?
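The sum of products can be written as a small, self-contained C function. This is a sketch under assumptions: the function name `sum_of_products` and the choice of 16-bit inputs with a wider accumulator are illustrative and do not come from the slides. C arrays are indexed from 0, so the loop runs from 0 to count - 1 to cover all count terms.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of products: returns a[0]*x[0] + ... + a[count-1]*x[count-1].
   The accumulator is wider than the inputs so that each 16 x 16-bit
   product fits without overflow. */
long sum_of_products(const short *a, const short *x, size_t count)
{
    assert(a != NULL && x != NULL);
    long sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += (long)a[i] * (long)x[i];   /* one multiply-accumulate (MAC) */
    }
    return sum;
}
```

Each pass through the loop body is one MAC, which is why the number of MACs a processor can complete per cycle is such a common figure of merit for DSP devices.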

14 How does the ‘C6000 achieve such performance from C?
Fast MAC using only C. Multiply-Accumulate (MAC) in natural C code: for (i = 0; i < count; i++){ sum += m[i] * n[i]; }
Fastest execution of MACs: the ‘C6x roadmap ... from 200 to 2400 MMACs.
Ease of C programming: even using natural C, the ‘C6000 architecture can perform 2 to 4 MACs per cycle. Compiler generates % efficient code.

15 Sample Compiler Benchmarks
Great out-of-box experience. Completely natural C code (non ’C6x specific). Code available at: Versus hand-coded assembly based on cycle count.
HIDDEN SLIDE To view this slide while presenting (in case of customer questions on C efficiency), click the button in the far upper-right corner.

16 'C6000 Architecture: Built for Speed
[Block diagram: Memory; register file A (A0 ... A15 ... A31) with functional units .D1, .M1, .L1, .S1; register file B (B0 ... B15 ... B31) with functional units .D2, .M2, .L2, .S2; Controller/Decoder.]
‘C6000 compiler excels at natural C.
While dual-MAC speeds math-intensive algorithms, the flexibility of 8 independent functional units allows the compiler to quickly perform other types of processing.
All ‘C6000 instructions are conditional, allowing efficient hardware pipelining.
Instruction set and CPU hardware orthogonality allow the compiler to achieve % efficiency.

17 Fastest MAC using Natural C
[Block diagram: Memory; register file A (A0 ... A15 ... A31) with functional units .D1, .M1, .L1, .S1; register file B (B0 ... B15 ... B31) with functional units .D2, .M2, .L2, .S2; Controller/Decoder.]
float mac(float *m, float *n, int count) { int i; float sum = 0; for (i=0; i < count; i++) { sum += m[i] * n[i]; } …
;** *
LOOP: ; PIPED LOOP KERNEL
LDDW .D1 A4++,A7:A6
|| LDDW .D2 B4++,B7:B6
|| MPYSP .M1X A6,B6,A5
|| MPYSP .M2X A7,B7,B5
|| ADDSP .L1 A5,A8,A8
|| ADDSP .L2 B5,B8,B8
|| [A1] B .S2 LOOP
|| [A1] SUB .S1 A1,1,A1
SINGLE-CYCLE LOOP KERNEL: The ‘C6000 compiler generates code that performs at the rate of 2 MACs per cycle! It does this by performing two taps (results) per cycle. That is, all 40 results in about 20 cycles. The compiler generates these results from natural ANSI C code - no “tweaking” required.
Side Notes: For simplicity, and since we were running out of room on the foil, the compiler output was abbreviated. The actual compiler results are slightly different for two reasons:
It actually takes something like 28 cycles to calculate 40 terms: 20 iterations (2 terms per cycle) plus 8 cycles of setup. If we were doing 1000 taps, it would take 508 cycles.
Due to the latency of some of the instructions, the code must be unrolled to achieve maximum performance. That is, the compiler actually generates a four-cycle loop which calculates 8 results. Again, the rate is still 2 MACs per cycle.
We’re not ignoring all that needs to be done... but if there is high interest, encourage attendance of the 4-day workshop...
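The cycle estimate in the side notes (2 MACs per cycle plus about 8 cycles of setup) can be captured in a one-line model. The sketch below is only that model, not a measurement; the function name `estimated_mac_cycles` is hypothetical and the constants are the figures quoted in the notes.

```c
#include <assert.h>

/* Estimated cycles for a sum of products of `taps` terms, assuming the
   figures quoted in the notes: 2 MACs per cycle plus 8 cycles of setup. */
long estimated_mac_cycles(long taps)
{
    const long macs_per_cycle = 2;
    const long setup_cycles = 8;
    assert(taps >= 0);
    /* Round up so that an odd number of taps still costs a whole cycle. */
    return (taps + macs_per_cycle - 1) / macs_per_cycle + setup_cycles;
}
```

For 1000 taps this gives 1000 / 2 + 8 = 508 cycles, matching the figure in the notes; the fixed setup cost matters less and less as the number of taps grows.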

18 Looking at the internal buses ...
'C6000 System Block Diagram
[Block diagram: CPU (register set A with .D1, .M1, .L1, .S1; register set B with .D2, .M2, .L2, .S2), internal buses, internal memory, external memory, peripherals.]
The point of this slide is to transition from the CPU description (now in the lower-right-hand block) to the internal buses diagram. This slide should only take a couple of seconds to present.

19 ‘C6000 Internal Buses
[Bus diagram connecting the CPU (PC, A regs, B regs), internal memory, external memory, peripherals and DMA. Buses shown: Program Addr x32; Program Data x256; Data Addr - T x32; Data Data - T x32/64; Data Addr - T2 x32; Data Data - T x32/64; DMA Addr - Read; DMA Data - Read; DMA Addr - Write; DMA Data - Write.]
Each of these buses animates in separately:
The first bus is program. If asked about the 256-bit bus, this allows us to fetch 8 instructions simultaneously, which allows us to execute an instruction on each of our 8 functional units in parallel.
Two data buses - one for each register set (A & B). Each ‘C62x data bus can load/store 32 bits per cycle. The ‘C67x can load up to 64 bits per cycle, supporting single-cycle loads of double-float values or the ability to load 4 single-precision floats per cycle. (Stores are still 32-bit - but that’s OK since DSPs perform many more reads than writes.) The ‘C64 performs 64-bit loads and stores.
Read and write buses for DMA: this allows the DMA to support single-cycle transfer rates (a DMA read and write in one cycle). Note, on the 6211, 6711 and 6712, EDMA is serviced on-chip by a 64-bit bus. The external bus, though, is 32 bits for the ‘11 devices and 16 bits for the ‘12.

20 Next, the internal memory ...
'C6000 System Block Diagram
[Block diagram: CPU (register set A with .D1, .M1, .L1, .S1; register set B with .D2, .M2, .L2, .S2), internal buses, internal memory, external memory.]
The point of this slide is to transition to the peripherals description. Essentially, the next few slides describe each peripheral. One slide per peripheral with a few bullets to highlight the key features. Don’t get into too much detail on any one peripheral - unless the question is simple/quick to answer. The McBSP and EDMA are covered in more detail later in this workshop. The others cannot be examined further due to limited time. The 4-day workshop spends more time examining other peripherals.

21 ‘C6711 Memory
[Diagram: CPU with a 4K program cache and a data cache (level 1), backed by 64KB internal prog/data memory (level 2). Memory map from 0000_0000 to FFFF_FFFF: 64KB internal at 0000_0000; on-chip peripherals at 0180_0000; 128MB external ranges at 8000_0000, 9000_0000, A000_0000 and B000_0000.]
The CPU can access two dedicated level-1 caches: a 4K direct-mapped cache for program code and a 2-way data cache. These level-1 caches provide single-cycle access to the CPU.
The level-2 memory is larger and a bit slower. It’s accessed whenever there is a level-1 cache miss. Even though it’s a little slower than the level-1 memory, it’s still faster than going off-chip. If the term “level-2 cache” sounds familiar, it’s because many personal computers now employ this same type of mechanism.
The level-1 vs. level-2 access is all automatic. YOU, the programmer, don’t have to worry about a thing. Just write your code as you normally would and the hardware figures out the quickest way to get the CPU your code and data.
What if the code/data isn’t in either the level-1 or level-2 memory? Then ... cache logic, cache details.

22 ‘C6711 Cache Logic
[Flowchart:]
CPU requests data.
Is data in L1? Yes: send data to CPU.
No: is data in L2? Yes: copy data from L2 to L1, then send data to CPU.
No: copy data from external mem to L2, then from L2 to L1, then send data to CPU.
HIDDEN FOIL This foil is here so that it could be linked into the student notes. If you find this diagram useful, you can either ‘un-hide’ it or click the top arrow on the preceding foil.
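The cache lookup sequence in this flowchart can be sketched as a small C function. This is a toy model under assumptions: it tracks a single block of data with two presence flags, and the type and function names (`block_state`, `hit_level`, `cpu_request`) are hypothetical. The real hardware does all of this automatically and in parallel for many cache lines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Where a request was finally satisfied from. */
typedef enum { HIT_L1, HIT_L2, HIT_EXTERNAL } hit_level;

/* Hypothetical presence flags for a single block of data. */
typedef struct {
    bool in_l1;
    bool in_l2;
} block_state;

/* Follows the flowchart: check L1, then L2, then fall back to external
   memory, copying the block inwards so that it ends up in L1. */
hit_level cpu_request(block_state *b)
{
    assert(b != NULL);
    if (b->in_l1) {
        return HIT_L1;        /* send data to CPU straight away */
    }
    if (b->in_l2) {
        b->in_l1 = true;      /* copy from L2 to L1 */
        return HIT_L2;
    }
    b->in_l2 = true;          /* copy from external memory to L2 */
    b->in_l1 = true;          /* then from L2 to L1 */
    return HIT_EXTERNAL;
}
```

In this model the first request for a block is served from external memory and every later request hits in L1, which is the behaviour that makes the slower levels tolerable in practice.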

23 ‘C6711 Cache Details
[Diagram: CPU connected to the L1 program cache (4KB) and the L1 data cache, both backed by the L2 unified memory (64KB). Data path widths shown: 256, 8/16/32/64, 128.]
Level 1 Program: Always cache. 1-way cache (direct mapped). Zero wait-state. Line size: 512 bits (or 16 instr).
Level 1 Data: 2-way cache. Line size: 256 bits.
Level 2 Unified (prog or data): RAM or cache. 1-4 way cache. 32 data bytes in 4 cycles. 16 instr. in 5 cycles. Line size: 1024 bits (or 128 bytes).
HIDDEN FOIL This foil was included to add the width of the data paths on the diagram two foils ago. If you want to use this diagram, you can either ‘un-hide’ it or click on the bottom arrow in the upper right corner of the foil two preceding this one.
Note, the data paths are larger than expected. In fact, when there is a transfer from Level-2 to either program or data Level-1, two transfers actually take place. That is, two fetch packets, or 32 bytes of data, are transferred to the Level-1 caches. This “look ahead” or “burst” feature was designed to minimize Level-1 cache misses.
L1P: 4 Kbytes = 1K instructions = 128 fetch packets (FP). Line size is 512 bits = 16 instructions = 2 FP.
L2: Line size is 1024 bits = 4 FP (2x L1P line size) = 128 bytes (4x L1D line size).
Internal EDMA bus is 64 bits wide, though 6211/6711 devices only have a 32-bit external bus. (6712 has a 16-bit external bus.)
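The size relationships quoted in these notes follow from simple arithmetic, and a direct-mapped cache such as L1P maps every address to exactly one line. The sketch below uses the sizes given on the slide; the constant and function names are hypothetical, and the index calculation is the generic direct-mapped scheme rather than a description of the actual ‘C6711 hardware.

```c
#include <assert.h>

enum {
    INSTR_BYTES        = 4,     /* one 32-bit instruction */
    FETCH_PACKET_BYTES = 32,    /* 8 instructions = 256 bits */
    L1P_BYTES          = 4096,  /* 4 Kbytes */
    L1P_LINE_BYTES     = 64,    /* 512 bits */
    L1D_LINE_BYTES     = 32,    /* 256 bits */
    L2_LINE_BYTES      = 128    /* 1024 bits */
};

/* Number of lines in the direct-mapped L1P cache. */
unsigned l1p_line_count(void)
{
    assert(L1P_BYTES % L1P_LINE_BYTES == 0);
    return L1P_BYTES / L1P_LINE_BYTES;
}

/* In a direct-mapped cache each address maps to exactly one line:
   drop the offset within the line, then wrap around the line count. */
unsigned l1p_line_index(unsigned long address)
{
    return (unsigned)((address / L1P_LINE_BYTES) % l1p_line_count());
}
```

With these sizes L1P holds 64 lines, so two addresses that are 4 Kbytes apart compete for the same line. That is the usual trade-off of a direct-mapped cache: a very simple lookup in exchange for possible conflicts.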

24 Looking at each peripheral ...
'C6000 System Block Diagram
[Block diagram: CPU (register set A with .D1, .M1, .L1, .S1; register set B with .D2, .M2, .L2, .S2), internal buses, internal memory, external memory, peripherals.]
The point of this slide is to transition to the peripherals description. Essentially, the next few slides describe each peripheral. One slide per peripheral with a few bullets to highlight the key features. Don’t get into too much detail on any one peripheral - unless the question is simple/quick to answer. The McBSP and EDMA are covered in more detail later in this workshop. The others cannot be examined further due to limited time. The 4-day workshop spends more time examining other peripherals.

25 'C6000 Peripherals
[Block diagram: CPU (register set A with .D1, .M1, .L1, .S1; register set B with .D2, .M2, .L2, .S2), internal buses, internal memory, external memory.]
Peripherals: McBSPs. Utopia. GPIO. VCP. TCP. DMA, EDMA (Boot). Timers. PLL. XB, PCI, Host Port. EMIF.
HIDDEN SLIDE

26 Hardware vs. Microcode multiplication
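DSP processors are optimised to perform multiplication and addition operations. Multiplication and addition are done in hardware and in one cycle. Example: 4-bit multiply (unsigned), 1011 x 1110.
Hardware: the product 10011010 is available after a single cycle.
Microcode: one partial product is formed per cycle and the partial products are then summed:
Cycle 1: 0000
Cycle 2: 1011.
Cycle 3: 1011..
Cycle 4: 1011...
Cycle 5: 10011010 (sum of the partial products)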
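The microcoded approach corresponds to the classic shift-and-add algorithm, in which each bit of the multiplier contributes one shifted copy of the multiplicand. A minimal sketch, assuming unsigned 4-bit operands as in the example; the function name `multiply_shift_add` is hypothetical.

```c
#include <assert.h>

/* Multiplies two unsigned 4-bit values the way a microcoded loop would:
   one partial product per step, shifted left and added to the total. */
unsigned multiply_shift_add(unsigned a, unsigned b)
{
    assert(a < 16u && b < 16u);
    unsigned product = 0;
    for (unsigned bit = 0; bit < 4; bit++) {
        if ((b >> bit) & 1u) {
            product += a << bit;   /* add the shifted multiplicand */
        }
    }
    return product;
}
```

For 1011 x 1110 (11 x 14) the loop adds 22, 44 and 88 to give 154, which is 10011010 in binary. A hardware multiplier produces the same result without the loop, which is where the single-cycle advantage comes from.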

27 Parameters to consider when choosing a DSP processor
Parameter | TMS320C6211 | TMS320C6711
Arithmetic format | 32-bit | 32-bit
Extended floating point | N/A | 64-bit
Extended Arithmetic | 40-bit | 40-bit
Performance (peak) | 1200MIPS | 1200MFLOPS
Number of hardware multipliers | 2 (16 x 16-bit) with 32-bit result | 2 (32 x 32-bit) with 32 or 64-bit result
Number of registers | 32 | 32
Internal L1 program memory cache | 32K | 32K
Internal L1 data memory cache | 32K | 32K
Internal L2 cache | 512K | 512K
C6711 Datasheet: \Links\TMS320C6711.pdf
C6211 Datasheet: \Links\TMS320C6211.pdf

28 Parameters to consider when choosing a DSP processor
Parameter | TMS320C6211 / TMS320C6711
I/O bandwidth: Serial Ports (number/speed) | 2 x 75Mbps
DMA channels | 16
Multiprocessor support | Not inherent
Supply voltage | 3.3V I/O, 1.8V Core
Power management | Yes
On-chip timers (number/width) | 2 x 32-bit
Cost | US$ 21.54
Package | 256 Pin BGA
External memory interface controller |
JTAG |

29 Floating vs. Fixed point processors
A floating-point processor is needed for applications which require: High precision. Wide dynamic range. High signal-to-noise ratio. Ease of use.
Drawbacks of floating-point processors: Higher power consumption. Can be more expensive. Can be slower than fixed-point counterparts and larger in size.
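The dynamic-range point can be illustrated with Q15, a common 16-bit fixed-point format (1 sign bit, 15 fraction bits). The sketch below is an illustration only: Q15 is not named on the slides, and the function names are hypothetical. It shows that values outside [-1, 1) saturate and very small values round to zero, whereas a float would represent both.

```c
#include <assert.h>
#include <stdint.h>

/* Converts a value in [-1, 1) to Q15 fixed point (1 sign bit, 15 fraction
   bits), rounding to nearest and saturating at the ends of the range. */
int16_t float_to_q15(float x)
{
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f) {
        return INT16_MAX;
    }
    if (scaled <= -32768.0f) {
        return INT16_MIN;
    }
    return (int16_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Converts a Q15 value back to floating point. */
float q15_to_float(int16_t q)
{
    return (float)q / 32768.0f;
}
```

Converting 0.5 is exact, but 0.00001 becomes 0 and 2.0 is clipped to the largest Q15 value. On a fixed-point device the programmer has to manage this scaling by hand, which is much of what "ease of use" means for floating point.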

30 Floating vs. Fixed point processors
It is the application that dictates which device and platform to use in order to achieve optimum performance at a low cost. For educational purposes, use the floating-point device (C6711) as it can support both fixed and floating point operations.

31 General Purpose DSP vs. DSP in ASIC
Application Specific Integrated Circuits (ASICs) are semiconductors designed for dedicated functions. The advantages and disadvantages of using ASICs are listed below:
Advantages: High throughput. Lower silicon area. Lower power consumption. Improved reliability. Reduction in system noise. Low overall system cost.
Disadvantages: High investment cost. Less flexibility. Long time from design to market.

32 General-purpose DSP market in 2003

33 System Considerations
Performance. Interfacing. Power. Size.
Ease-of-Use: Programming. Interfacing. Debugging.
Integration: Memory. Peripherals.
Cost: Device cost. System cost. Development cost. Time to market.

34 Texas Instruments’ TMS320 family
Different families and sub-families exist to support different markets.
C2000: Lowest Cost. Control Systems: Motor Control. Storage. Digital Ctrl Systems.
C5000: Efficiency, Best MIPS per Watt / Dollar / Size. Wireless phones. Internet audio players. Digital still cameras. Modems. Telephony. VoIP.
C6000: Performance & Best Ease-of-Use. Multi Channel and Multi Function App's: Comm Infrastructure. Wireless Base-stations. DSL. Imaging. Multi-media Servers. Video.

35 Texas Instruments’ TMS320 family
TMS320C64x: The C64x fixed-point DSPs offer the industry's highest level of performance to address the demands of the digital age. At clock rates of up to 1 GHz, C64x DSPs can process information at rates up to 8000 MIPS with costs as low as $ In addition to a high clock rate, C64x DSPs can do more work each cycle with built-in extensions. These extensions include new instructions to accelerate performance in key application areas such as digital communications infrastructure and video and image processing.
TMS320C62x: These first-generation fixed-point DSPs represent breakthrough technology that enables new equipment and energizes existing implementations for multi-channel, multi-function applications, such as wireless base stations, remote access servers (RAS), digital subscriber loop (xDSL) systems, personalized home security systems, advanced imaging/biometrics, industrial scanners, precision instrumentation and multi-channel telephony systems.
TMS320C67x: For designers of high-precision applications, C67x floating-point DSPs offer the speed, precision, power savings and dynamic range to meet a wide variety of design needs. These dynamic DSPs are the ideal solution for demanding applications like audio, medical imaging, instrumentation and automotive.

36 C6000 Roadmap
[Roadmap chart: performance against time, with object code software compatibility across the platform. 1st generation: C6201, C6202, C6203, C6204, C6205, C6211, C6701, C6711, C6712, C6713. 2nd generation: C6411, C6412, C6414, C6415, C6416, DM642. Beyond: C64x™ DSP at 1.1 GHz, multi-core, floating point, highest performance. C62x/C64x/DM642: fixed point. C67x: floating point.]

37 ’C6000 Floating-Point Performance
[Chart: floating-point performance against time. C30, C31, C32 and C33 at the 150 MFLOPS mark; C6712: 600 MFLOPS; C6711: 900 MFLOPS; C6701: 1 GFLOPS; C67x: 3 GFLOPS and beyond.]

38 TI Floating-Point Innovation
TI Floating Point - A History of Firsts:
First commercially successful floating-point DSP: ‘C30 (1987).
First floating-point DSP with multiprocessing support: ‘C40 (1991).
First $10 floating-point DSP: ‘C32 (1995).
First 1-GFLOPS DSP: ‘C6701 (1998).
First $5 floating-point DSP: ‘C33 (1999).
First 2-level cache floating-point DSP: ‘C6711 (1999).
First to offer 600 MFLOPS for under $10: ‘C6712 (2000).

39 Useful Links Selection Guide: \Links\DSP Selection Guide.pdf (3Q 2004)

40 Looking for Literature on DSP?
“A Simple Approach to Digital Signal Processing” by Craig Marven and Gillian Ewers; ISBN
“DSP Primer (Primer Series)” by C. Britton Rorabaugh; ISBN
“Understanding Digital Signal Processing” by Richard G. Lyons; Prentice Hall; 2nd edition (March 15, 2004); ISBN
“DSP First: A Multimedia Approach” by James H. McClellan, Ronald W. Schafer, and Mark A. Yoder; ISBN

41 Looking for Books on ‘C6000 DSP?
“Digital Signal Processing Implementation using the TMS320C6000™ DSP Platform” by Naim Dahnoun; ISBN
“C6x-Based Digital Signal Processing” by Nasser Kehtarnavaz and Burc Simsek; ISBN
“Real-Time Digital Signal Processing: Based on the TMS320C6000” by Nasser Kehtarnavaz; Newnes; Book & CD-Rom (July 14, 2004); ISBN
“Digital Signal Processing and Applications with the C6713 and C6416 DSK (Topics in Digital Signal Processing)” by Rulph Chassaing; Wiley-Interscience; Book & CD-Rom (December 3, 2004); ISBN

42 Looking for Books on ‘C6000 DSP?
“Real-Time Digital Signal Processing from Matlab to C with the TMS320C6x DSK” by Thad B. Welch; Cameron Wright; Michael Morrow; Book & CD-Rom (2006) ISBN

43 Chapter 1 Introduction - End -

