Published by Jaylen Sidebotham. Modified over 9 years ago.
1
Introduction to C6000
Chapter 1
C6000 Integration Workshop
Copyright © 2005 Texas Instruments. All rights reserved. Technical Training Organization (TTO)
2
What Problem Are We Trying To Solve?
Digital sampling of an analog signal:
[Figure: sampled analog waveform, axes A and t]
[Block diagram: x -> ADC -> DSP -> DAC -> Y]
Most DSP algorithms can be expressed with MAC:
    $Y = \sum_{i=1}^{\mathrm{count}} \mathrm{coeff}_i \cdot x_i$
    for (i = 0; i < count; i++) {
        Y += coeff[i] * x[i];
    }
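The sum of products above can be sketched as a complete C function. This is a minimal illustration, not code from the workshop: the function name `mac`, the `short` sample type, and the `long long` accumulator are assumptions chosen so the example runs on any host compiler.

```c
#include <stddef.h>

/* Sum of products: y = coeff[0]*x[0] + ... + coeff[count-1]*x[count-1].
 * C arrays are 0-based, so the formula's i = 1..count maps to i = 0..count-1.
 * The name, types, and accumulator width are illustrative assumptions. */
long long mac(const short *coeff, const short *x, size_t count)
{
    long long y = 0;                        /* wide accumulator */
    for (size_t i = 0; i < count; i++) {
        y += (long long)coeff[i] * x[i];    /* one multiply-accumulate per sample */
    }
    return y;
}
```

Calling `mac` with coefficients {1, 2, 3} and samples {4, 5, 6} accumulates 1*4 + 2*5 + 3*6.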
4
Fast MAC using only C
Fastest Execution of MACs
    - The 'C6x roadmap... from 200 to 4000 MMACs
Ease of C Programming
    - Even using natural C, the 'C6000 architecture can perform 2 to 4 MACs per cycle
    - Compiler generates 80-100% efficient code
Multiply-Accumulate (MAC) in Natural C Code
    for (i = 0; i < count; i++) {
        Y += coeff[i] * x[i];
    }
How does the 'C6000 achieve such performance from C?
5
'C6000 CPU Architecture
'C6000 Compiler excels at Natural C
    - While dual-MAC speeds math-intensive algorithms, the flexibility of 8 independent functional units allows the compiler to quickly perform other types of processing
    - All 'C6000 instructions are conditional, allowing efficient hardware pipelining
    - 'C6000 CPU can dispatch up to eight parallel instructions each cycle
[Block diagram: Memory; register files A0..A15..A31 and B0..B15..B31; functional units .L1 .S1 .M1 .D1 and .L2 .S2 .M2 .D2 (the two .M units labeled "Dual MACs"); Controller/Decoder]
6
Fastest MAC using Natural C
    float mac(float *m, float *n, int count)
    {
        int i;
        float sum = 0;
        for (i = 0; i < count; i++) {
            sum += m[i] * n[i];
        }
        …

    ;** --------------------------------------------------*
    LOOP:   ; PIPED LOOP KERNEL
               LDDW   .D1   *A4++,A7:A6
    ||         LDDW   .D2   *B4++,B7:B6
    ||         MPYSP  .M1X  A6,B6,A5
    ||         MPYSP  .M2X  A7,B7,B5
    ||         ADDSP  .L1   A5,A8,A8
    ||         ADDSP  .L2   B5,B8,B8
    || [A1]    B      .S2   LOOP
    || [A1]    SUB    .S1   A1,1,A1
    ;** --------------------------------------------------*
[Block diagram: Memory; register files A0..A15..A31 and B0..B15..B31; functional units .M1 .L1 .D1 .S1 and .M2 .L2 .D2 .S2; Controller/Decoder]
The C67x compiler gets two 32-bit floating-point Sum-of-Products per iteration
Can the 'C64x do better?
7
Given this simple loop …
    $y = \sum_{n=1}^{40} a_n \cdot x_n$

    short mac(short *m, short *n, int count)
    {
        for (i=0; i < count; i++) {
            sum += m[i] * n[i];
        }
        …

               MVK   .S1   40, cnt
    loop:      LDH   .D1   *ap++, a
               LDH   .D1   *xp++, x
               MPY   .M1   a, x, prod
               ADD   .L1   y, prod, y
               SUB   .L1   cnt, 1, cnt
       [cnt]   B     .S1   loop
               STW   .D    y, *yp
[Figure: registers a, x, prod, y, cnt, *ap, *xp, *yp and functional units .M1 .L1 .S1 .D1]
How many of these instructions can we get in parallel?
8
C62x Intense Parallelism
Given this C code:
    short mac(short *m, short *n, int count)
    {
        for (i=0; i < count; i++) {
            sum += m[i] * n[i];
        }
        …
The C62x compiler can achieve two Sum-of-Products per cycle:
    L2:     ; PIPED LOOP PROLOG
               LDW   .D1   *A4++,A3
    ||         LDW   .D2   *B6++,B7

               LDW   .D1   *A4++,A3
    ||         LDW   .D2   *B6++,B7

       [B0]    B     .S1   L3
    ||         LDW   .D1   *A4++,A3
    ||         LDW   .D2   *B6++,B7

       [B0]    B     .S1   L3
    ||         LDW   .D1   *A4++,A3
    ||         LDW   .D2   *B6++,B7

       [B0]    B     .S1   L3
    ||         LDW   .D1   *A4++,A3
    ||         LDW   .D2   *B6++,B7

               MPY   .M2   B7,A3,B4
    ||         MPYH  .M1   B7,A3,A5
    || [B0]    B     .S1   L3
    ||         LDW   .D1   *A4++,A3
    ||         LDW   .D2   *B6++,B7

               MPY   .M2   B7,A3,B4
    ||         MPYH  .M1   B7,A3,A5
    || [B0]    B     .S1   L3
    ||         LDW   .D1   *A4++,A3
    ||         LDW   .D2   *B6++,B7
    ;** -----------------------*
    L3:     ; PIPED LOOP KERNEL
               ADD   .L2   B4,B5,B5
    ||         ADD   .L1   A5,A0,A0
    ||         MPY   .M2   B7,A3,B4
    ||         MPYH  .M1   B7,A3,A5
    || [B0]    B     .S1   L3
    || [B0]    SUB   .S2   B0,1,B0
    ||         LDW   .D1   *A4++,A3
    ||         LDW   .D2   *B6++,B7
    ;** -----------------------*
What about the 'C67x?
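The kernel above keeps two running sums, one fed by MPY (low halfwords) and one by MPYH (high halfwords). The following C sketch models that idea on a host machine. It is an illustration only and not what the compiler emits: the function names are hypothetical, `count` is assumed to be even, and the mapping of even and odd elements to low and high halfwords assumes little-endian word loads.

```c
#include <stddef.h>

/* Reference version: one product per loop iteration. */
int mac_simple(const short *m, const short *n, size_t count)
{
    int sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += m[i] * n[i];
    return sum;
}

/* Two products per iteration with independent accumulators,
 * modeling the kernel's MPY/MPYH pair and its two ADDs.
 * Assumption: count is even; an odd tail element is not handled here. */
int mac_by_pairs(const short *m, const short *n, size_t count)
{
    int sum_lo = 0, sum_hi = 0;
    for (size_t i = 0; i + 1 < count; i += 2) {
        sum_lo += m[i]     * n[i];       /* low halfword product  (MPY)  */
        sum_hi += m[i + 1] * n[i + 1];   /* high halfword product (MPYH) */
    }
    return sum_lo + sum_hi;              /* combine the partial sums once */
}
```

Because the two accumulators never depend on each other inside the loop, the two multiply and add chains can proceed in parallel, which is what lets the hardware issue both in the same cycle.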
9
Sample Compiler Benchmarks
TI C62x™ Compiler Performance Release 4.0: Execution Time in µs @ 300 MHz
Versus hand-coded assembly, based on cycle count

Algorithm | Used In | Asm Cycles | Assembly Time (µs) | C Cycles (Rel 4.0) | C Time (µs) | % Efficiency vs Hand Coded
----------|---------|------------|--------------------|--------------------|-------------|---------------------------
Block Mean Square Error (MSE of a 20 column image matrix) | For motion compensation of image data | 348 | 1.16 | 402 | 1.34 | 87%
Codebook Search | CELP based voice coders | 977 | 3.26 | 961 | 3.20 | 100%
Vector Max (40 element input vector) | Search Algorithms | 61 | 0.20 | 59 | 0.20 | 100%
All-zero FIR Filter (40 samples, 10 coefficients) | VSELP based voice coders | 238 | 0.79 | 280 | 0.93 | 85%
Minimum Error Search (Table Size = 2304) | Search Algorithms | 1185 | 3.95 | 1318 | 4.39 | 90%
IIR Filter (16 coefficients) | Filter | 43 | 0.14 | 38 | 0.13 | 100%
IIR – cascaded biquads (10 cascaded biquads, Direct Form II) | Filter | 70 | 0.23 | 75 | 0.25 | 93%
MAC (two 40 sample vectors) | VSELP based voice coders | 61 | 0.20 | 58 | 0.19 | 100%
Vector Sum (two 44 sample vectors) | | 51 | 0.17 | 47 | 0.16 | 100%
MSE (MSE between two 256 element vectors) | Mean Sq. Error Computation in Vector Quantizer | 279 | 0.93 | 274 | 0.91 | 100%

    - Great out-of-box experience
    - Completely natural C code (non 'C6000 specific)
    - Code available at dspvillage.com
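The time and efficiency columns follow from the cycle counts. The sketch below shows the arithmetic under two assumptions inferred from the rows rather than stated on the slide: efficiency is assembly cycles divided by C cycles, rounded to the nearest percent, and it is reported as 100% whenever the C version needs no more cycles than the assembly version. The function names are hypothetical.

```c
/* Execution time in microseconds for a cycle count at a clock given in MHz. */
double cycles_to_us(unsigned cycles, double clock_mhz)
{
    return cycles / clock_mhz;
}

/* Efficiency of compiled C versus hand-coded assembly, in whole percent.
 * Assumed rule (inferred from the table): asm_cycles / c_cycles, rounded
 * to nearest, capped at 100 when C is at least as fast as the assembly. */
unsigned efficiency_percent(unsigned asm_cycles, unsigned c_cycles)
{
    if (c_cycles <= asm_cycles)
        return 100;
    return (100u * asm_cycles + c_cycles / 2) / c_cycles;
}
```

For the first row, 348 cycles at 300 MHz is 348 / 300 = 1.16 µs, and 348 / 402 rounds to 87%.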
10
C67x MAC using Natural C
    float mac(float *m, float *n, int count)
    {
        int i;
        float sum = 0;
        for (i = 0; i < count; i++) {
            sum += m[i] * n[i];
        }
        …

    ;** --------------------------------------------------*
    LOOP:   ; PIPED LOOP KERNEL
               LDDW   .D1   *A4++,A7:A6
    ||         LDDW   .D2   *B4++,B7:B6
    ||         MPYSP  .M1X  A6,B6,A5
    ||         MPYSP  .M2X  A7,B7,B5
    ||         ADDSP  .L1   A5,A8,A8
    ||         ADDSP  .L2   B5,B8,B8
    || [A1]    B      .S2   LOOP
    || [A1]    SUB    .S1   A1,1,A1
    ;** --------------------------------------------------*
[Block diagram: Memory; register files A0..A15 and B0..B15; functional units .M1 .L1 .D1 .S1 and .M2 .L2 .D2 .S2; Controller/Decoder]
The C67x compiler gets two 32-bit floating-point Sum-of-Products per iteration
Can the 'C64x do better?
11
C64x gets four MACs using DOTP2
    short mac(short *m, short *n, int count)
    {
        int i;
        short sum = 0;
        for (i = 0; i < count; i++) {
            sum += m[i] * n[i];
        }
        …

    ;** --------------------------------------------------*
    ; PIPED LOOP KERNEL
    LOOP:
               ADD    .L2    B8,B6,B6
    ||         ADD    .L1    A6,A7,A7
    ||         DOTP2  .M2X   B4,A4,B8
    ||         DOTP2  .M1X   B5,A5,A6
    || [ B0]   B      .S1    LOOP
    || [ B0]   SUB    .S2    B0,-1,B0
    ||         LDDW   .D2T2  *B7++,B5:B4
    ||         LDDW   .D1T1  *A3++,A5:A4
    ;** --------------------------------------------------*
[Figure: DOTP2. A5 holds m1:m0, B5 holds n1:n0; DOTP2 produces A6 = m1*n1 + m0*n0, which is added to the running sum in A7]
How many multiplies can the 'C6x perform?
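The DOTP2 figure describes a packed dot product: two 16-bit values per 32-bit register, multiplied pairwise and summed. A portable C model of that behavior might look like the following. It is a host-side sketch only; the names `pack2` and `dotp2_model` are hypothetical and this is not the device instruction or a TI compiler intrinsic.

```c
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word (hi in bits 31..16). */
uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Model of a packed dot product: a_hi*b_hi + a_lo*b_lo.
 * The narrowing casts rely on two's-complement wraparound, which is the
 * behavior of mainstream compilers. The single case where both pairs are
 * -32768 * -32768 would overflow int32_t and is not handled here. */
int32_t dotp2_model(uint32_t a, uint32_t b)
{
    int16_t a_hi = (int16_t)(a >> 16), a_lo = (int16_t)(a & 0xFFFFu);
    int16_t b_hi = (int16_t)(b >> 16), b_lo = (int16_t)(b & 0xFFFFu);
    return (int32_t)a_hi * b_hi + (int32_t)a_lo * b_lo;
}
```

With m1:m0 = 3:-2 and n1:n0 = 4:5, the model returns 3*4 + (-2)*5. Two such operations per cycle, one on each .M unit, give the four MACs in the slide title.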
12
MMACs
How many 16-bit MMACs (millions of MACs per second) can the 'C6201 perform?
    400 MMACs (two .M units x 200 MHz)
How about 16x16 MMACs on the 'C64x devices?
      2 .M units
    x 2 16-bit MACs (per .M unit / per cycle)
    x 1 GHz
    ----------------
      4000 MMACs
How many 8-bit MMACs on the 'C64x?
    8000 MMACs (on 8-bit data)
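The three answers above come from one product: number of .M units, MACs per .M unit per cycle, and clock rate in MHz. A small C sketch of that arithmetic follows. The function name is hypothetical, and the factor of four 8-bit MACs per .M unit is inferred from the 8000 MMACs figure rather than stated on the slide.

```c
/* Peak MMACs = .M units x MACs per .M unit per cycle x clock in MHz. */
unsigned peak_mmacs(unsigned m_units, unsigned macs_per_unit, unsigned clock_mhz)
{
    return m_units * macs_per_unit * clock_mhz;
}
```

For example, `peak_mmacs(2, 1, 200)` reproduces the 'C6201 figure and `peak_mmacs(2, 2, 1000)` the 16-bit 'C64x figure.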
13
C6415 DSP (1 GHz)
[Block diagram. Blocks shown: C64x™ CPU Core (5760 MIPS), L1P Cache, L1D Cache, L2 Memory, Enhanced DMA Controller (64 channels), EMIF 64, EMIF 16, McBSP 0, McBSP 1, McBSP 2, Utopia 2, HPI32, Timer 0, Timer 1, Timer 2, PLL, Power Down Logic, JTAG RTDX. Bandwidth labels on the figure: 32 GB/s, 16 GB/s, 2.9 GB/s, 1064 MB/s, 266 MB/s, 133 MB/s, 100 MB/s, 12.5 MB/s]
How does the DSP fit into a system?
14
'C6000 Peripherals Summary
[Block diagram: CPU with Register Set A (.D1 .M1 .L1 .S1) and Register Set B (.D2 .M2 .L2 .S2), Internal Buses, Internal Memory, External Memory]
Peripherals:
    - EMIF
    - McBSP
    - GPIO
    - VCP, TCP
    - DMA, EDMA (Boot)
    - Timers
    - PLL
    - XB, PCI, Host Port
15
Example C6000 System
[Block diagram: C6000 CPU with EDMA, VCP, TCP and Boot Loader, surrounded by peripherals and their external connections]
    - EMIF (16, 32, or 64 bits): SDRAM, Sync SRAM, EPROM
    - McBSP: Serial Codec
    - McASP: Audio Codec
    - EMAC: Ethernet (TCP/IP stack available)
    - Utopia 2: ATM
    - PCI (/32): PCI
    - HPI (/16 or 32): Host µP
    - HWI: NMI, Reset, Ext Interrupts (/4)
    - GPIO (/0-16+): Switches, Lamps, Latches, FPGA, etc.
    - Timer / Counters
    - PLL: Clockin, Clockout x PLL
    - Video Ports (DM64x)
Note: Not all 'C6000 devices have all the various peripherals shown above. Please refer to the C6000 Product Update for a device-by-device listing.
16
C6416T DSK
Diagnostic Utility included with DSK...
17
C6416 DSK
Diagnostic Utility included with DSK...
18
C6416 DSK Memory Map

Address   | TMS320C6416                      | C6416 DSK
----------|----------------------------------|---------------
0000_0000 | Internal RAM: 1MB                |
0010_0000 | Internal Peripherals or reserved |
6000_0000 | EMIFB CE0: 64MB                  | CPLD
6400_0000 | EMIFB CE1: 64MB                  | Flash: 512KB
6800_0000 | EMIFB CE2: 64MB                  |
6C00_0000 | EMIFB CE3: 64MB                  |
8000_0000 | EMIFA CE0: 256MB                 | SDRAM: 16MB
9000_0000 | EMIFA CE1: 256MB                 |
A000_0000 | EMIFA CE2: 256MB                 | Daughter Card
B000_0000 | EMIFA CE3: 256MB                 | Daughter Card

CPLD: LEDs, DIP Switches, DSK status, DSK rev#, Daughter Card
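In application code, a memory map like this is commonly captured as a set of base-address constants. The sketch below copies the addresses and sizes from the map above; the macro names and the `in_region` helper are hypothetical and are not taken from the DSK support library.

```c
#include <stdint.h>

/* Base addresses copied from the memory map; the names are illustrative. */
#define INTERNAL_RAM_BASE  0x00000000u   /* 1MB internal RAM           */
#define EMIFB_CE0_BASE     0x60000000u   /* CPLD on the DSK            */
#define EMIFB_CE1_BASE     0x64000000u   /* Flash (512KB) on the DSK   */
#define EMIFB_CE2_BASE     0x68000000u
#define EMIFB_CE3_BASE     0x6C000000u
#define EMIFA_CE0_BASE     0x80000000u   /* SDRAM (16MB) on the DSK    */
#define EMIFA_CE1_BASE     0x90000000u
#define EMIFA_CE2_BASE     0xA0000000u   /* Daughter card              */
#define EMIFA_CE3_BASE     0xB0000000u   /* Daughter card              */

#define INTERNAL_RAM_SIZE  (1u * 1024u * 1024u)
#define FLASH_SIZE         (512u * 1024u)
#define SDRAM_SIZE         (16u * 1024u * 1024u)

/* Return 1 if addr lies inside [base, base + size), else 0. */
int in_region(uint32_t addr, uint32_t base, uint32_t size)
{
    return addr >= base && (addr - base) < size;
}
```

The spacing of the constants matches the table: consecutive EMIFB spaces are 64MB apart and consecutive EMIFA spaces are 256MB apart, while the populated devices (Flash, SDRAM) occupy only part of their chip-enable space.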
19
DSK's Diagnostic Utility
    - Test/Diagnose DSK hardware
    - Verify USB emulation link
    - Use Advanced tests to facilitate debugging
    - Reset DSK hardware
DSK Contents...
20
DSK Contents (i.e. what you get…)
Hardware
    - 1 GHz C6416T DSP or 225 MHz C6713 DSP
    - TI 24-bit A/D Converter (AIC23)
    - External Memory
        - 8 or 16MB SDRAM
        - Flash ROM: C6416 (512KB), C6713 (256KB)
Software
    - Code Composer Studio
    - SD Diagnostic Utility
    - Example Programs
Misc Hardware
    - LEDs and DIPs
    - Daughter card expansion
    - 1 or 2 additional expansions
    - Power Supply & USB Cable
Documentation
    - DSK Technical Reference
    - eXpressDSP for Dummies
21
Lab 1
Hardware
    1. Hook up the DSK
    2. Supply power and observe POST
Software (CCS)
    1. Run Diagnostic Utility
    2. Run CCS Setup
    3. Start CCS
    4. Configure CCS Options
    5. Close CCS
Time: 20 minutes
22
Lab Exercises – C67x vs. C64x
Which DSK are you using?
We provide instructions and solutions for both C67x and C64x. We have tried to call out the few differences in lab steps as explicitly as possible:
23
Optional Topics
    - POST
    - DSK Help
24
C6416 DSK - Power On Self Test (POST)
    - Stored in FLASH memory and runs every time DSK is powered on
    - Source code on DSK CD-ROM
    - When a test is performed, its index number is shown on the LEDs. If a test fails, the index of that test will blink continuously.
    - When complete, all LEDs will blink three times, then turn off
    - See C6713 DSK help file for its index of tests.

Test | LED 4 | LED 3 | LED 2 | LED 1 | Description
-----|-------|-------|-------|-------|------------------------------------
1    | 0     | 0     | 0     | 1     | DSP's Internal Memory test
2    | 0     | 0     | 1     | 0     | External SDRAM test
3    | 0     | 0     | 1     | 1     | Check manufacture ID of Flash chip
4    | 0     | 1     | 0     | 0     | McBSP 0 loopback test
5    | 0     | 1     | 0     | 1     | McBSP 1 loopback test
6    | 0     | 1     | 1     | 0     | McBSP 2 loopback test
7    | 0     | 1     | 1     | 1     | Transfer small array with EDMA
8    | 1     | 0     | 0     | 0     | Codec test (output 1 kHz tone)
9    | 1     | 0     | 0     | 1     | Timer test (cfg and wait for 100 ints)
BLINK ALL                            | All tests completed successfully
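The LED columns in the POST table are the test index written in binary, with LED 1 as the least significant bit. A minimal C sketch of that encoding follows. The function name is hypothetical and the code only models the table; it does not drive the DSK's LEDs.

```c
/* State of LED `led` (1..4) while POST test `test_index` (1..9) runs.
 * The index is shown in binary: LED 1 is bit 0, LED 4 is bit 3.
 * Returns 1 if the LED is lit, 0 otherwise. */
int post_led_state(unsigned test_index, unsigned led)
{
    return (int)((test_index >> (led - 1u)) & 1u);
}
```

For test 5 (McBSP 1 loopback) this yields LED 4 off, LED 3 on, LED 2 off, LED 1 on, matching the 0101 row.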
25
DSK Help