Introduction to C6000. Chapter 1, C6000 Integration Workshop. Copyright © 2005 Texas Instruments. All rights reserved. Technical Training Organization.


1 Introduction to C6000 (Chapter 1, C6000 Integration Workshop)

2 What Problem Are We Trying To Solve?

Digital sampling of an analog signal (plot of amplitude A versus time t).

Signal chain: x -> ADC -> DSP -> DAC -> Y

Most DSP algorithms can be expressed with MAC (multiply-accumulate):

$Y = \sum_{i=1}^{\text{count}} \text{coeff}_i \cdot x_i$

    for (i = 1; i <= count; i++) {
        Y += coeff[i] * x[i];
    }
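As a point of reference, here is a minimal, self-contained sketch of the sum of products in standard C. It uses zero-based indexing, so the loop runs from 0 to count - 1 rather than 1 to count. The function name `mac16`, the 16-bit input type, and the 32-bit accumulator are assumptions made for illustration; the slide does not specify data types.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of products over `count` elements:
 *   y = coeff[0]*x[0] + ... + coeff[count-1]*x[count-1]
 * Assumption: 16-bit samples and a 32-bit accumulator. */
int32_t mac16(const int16_t *coeff, const int16_t *x, size_t count)
{
    int32_t y = 0;
    for (size_t i = 0; i < count; i++) {
        /* Widen before multiplying so the product is formed in 32 bits. */
        y += (int32_t)coeff[i] * x[i];
    }
    return y;
}
```

Calling `mac16` with coefficients {1, 2, 3, 4} and samples {5, 6, 7, 8} accumulates 5 + 12 + 21 + 32.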


4 Fast MAC using only C

- Fastest execution of MACs
  - The 'C6x roadmap: from 200 to 4000 MMACs
- Ease of C programming
  - Even using natural C, the 'C6000 architecture can perform 2 to 4 MACs per cycle
  - Compiler generates 80-100% efficient code

Multiply-Accumulate (MAC) in natural C code:

    for (i = 1; i <= count; i++) {
        Y += coeff[i] * x[i];
    }

How does the 'C6000 achieve such performance from C?

5 'C6000 CPU Architecture

Block diagram: Memory; register file A (A0 to A15 or A31) and register file B (B0 to B15 or B31); functional units .S1, .D1, .L1, .M1 and .S2, .D2, .L2, .M2 (the two .M units are the dual MACs); Controller/Decoder.

- The 'C6000 compiler excels at natural C
- While the dual MACs speed math-intensive algorithms, the flexibility of 8 independent functional units allows the compiler to quickly perform other types of processing
- All 'C6000 instructions are conditional, allowing efficient hardware pipelining
- The 'C6000 CPU can dispatch up to eight parallel instructions each cycle

6 Fastest MAC using Natural C

    float mac(float *m, float *n, int count)
    {
        int i;
        float sum = 0;

        for (i = 0; i < count; i++) {
            sum += m[i] * n[i];
        }
        …

    ;** --------------------------------------------------*
    LOOP:   ; PIPED LOOP KERNEL
               LDDW    .D1    *A4++,A7:A6
    ||         LDDW    .D2    *B4++,B7:B6
    ||         MPYSP   .M1X   A6,B6,A5
    ||         MPYSP   .M2X   A7,B7,B5
    ||         ADDSP   .L1    A5,A8,A8
    ||         ADDSP   .L2    B5,B8,B8
    || [A1]    B       .S2    LOOP
    || [A1]    SUB     .S1    A1,1,A1
    ;** --------------------------------------------------*

(CPU block diagram repeated: Memory, register files A and B, units .M1 .L1 .D1 .S1 .M2 .L2 .D2 .S2, Controller/Decoder.)

The C67x compiler gets two 32-bit floating-point sum-of-products per iteration.

Can the 'C64x do better?

7 Given this simple loop …

    short mac(short *m, short *n, int count)
    {
        for (i = 0; i < count; i++) {
            sum += m[i] * n[i];
        }
        …

For 40 terms this computes $y = \sum_{n=1}^{40} a_n \cdot x_n$, which maps to the following instructions:

              MVK   .S1   40, cnt
    loop:     LDH   .D1   *ap++, a
              LDH   .D1   *xp++, x
              MPY   .M1   a, x, prod
              ADD   .L1   y, prod, y
              SUB   .L1   cnt, 1, cnt
       [cnt]  B     .S1   loop
              STW   .D    y, *yp

Registers: a, x, prod, y, cnt, *ap, *xp, *yp. Functional units: .M1, .L1, .S1, .D1.

How many of these instructions can we get in parallel?

8 C62x Intense Parallelism

Given this C code:

    short mac(short *m, short *n, int count)
    {
        for (i = 0; i < count; i++) {
            sum += m[i] * n[i];
        }
        …

the compiler produces:

    L2:     ; PIPED LOOP PROLOG
               LDW    .D1   *A4++,A3
    ||         LDW    .D2   *B6++,B7

               LDW    .D1   *A4++,A3
    ||         LDW    .D2   *B6++,B7

       [B0]    B      .S1   L3
    ||         LDW    .D1   *A4++,A3
    ||         LDW    .D2   *B6++,B7

       [B0]    B      .S1   L3
    ||         LDW    .D1   *A4++,A3
    ||         LDW    .D2   *B6++,B7

       [B0]    B      .S1   L3
    ||         LDW    .D1   *A4++,A3
    ||         LDW    .D2   *B6++,B7

               MPY    .M2   B7,A3,B4
    ||         MPYH   .M1   B7,A3,A5
    || [B0]    B      .S1   L3
    ||         LDW    .D1   *A4++,A3
    ||         LDW    .D2   *B6++,B7

               MPY    .M2   B7,A3,B4
    ||         MPYH   .M1   B7,A3,A5
    || [B0]    B      .S1   L3
    ||         LDW    .D1   *A4++,A3
    ||         LDW    .D2   *B6++,B7
    ;** -----------------------*
    L3:     ; PIPED LOOP KERNEL
               ADD    .L2   B4,B5,B5
    ||         ADD    .L1   A5,A0,A0
    ||         MPY    .M2   B7,A3,B4
    ||         MPYH   .M1   B7,A3,A5
    || [B0]    B      .S1   L3
    || [B0]    SUB    .S2   B0,1,B0
    ||         LDW    .D1   *A4++,A3
    ||         LDW    .D2   *B6++,B7
    ;** -----------------------*

The C62x compiler can achieve two sum-of-products per cycle.

What about the 'C67x?
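The kernel above keeps two independent partial sums: each LDW fetches a 32-bit word holding two 16-bit samples, MPY and MPYH multiply the two halves in parallel, and two ADDs accumulate the results separately. The following portable C sketch mirrors that structure at source level. The function name `mac16_two_sums` and the tail handling for odd counts are assumptions for illustration; this is not the compiler's actual transformation, only a model of the idea.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of products computed with two independent accumulators,
 * mirroring the two MPY/ADD chains of the pipelined kernel. */
int32_t mac16_two_sums(const int16_t *m, const int16_t *n, size_t count)
{
    int32_t sum_a = 0;  /* partial sum for one half of each loaded pair  */
    int32_t sum_b = 0;  /* partial sum for the other half of each pair   */
    size_t i;

    /* Process two elements per iteration; the two products are independent. */
    for (i = 0; i + 1 < count; i += 2) {
        sum_a += (int32_t)m[i]     * n[i];
        sum_b += (int32_t)m[i + 1] * n[i + 1];
    }

    /* Tail: one leftover element when count is odd. */
    if (i < count) {
        sum_a += (int32_t)m[i] * n[i];
    }

    /* The two partial sums are combined once, after the loop. */
    return sum_a + sum_b;
}
```

Because the two accumulators never depend on each other inside the loop, the multiplies and adds of one chain can overlap with those of the other.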

9 Sample Compiler Benchmarks

TI C62x™ Compiler Performance, Release 4.0: execution time in µs @ 300 MHz, versus hand-coded assembly, based on cycle count.

Algorithm | Used In | Asm Cycles | Assembly Time (µs) | C Cycles (Rel 4.0) | C Time (µs) | % Efficiency vs Hand Coded
Block Mean Square Error (MSE of a 20 column image matrix) | For motion compensation of image data | 348 | 1.16 | 402 | 1.34 | 87%
Codebook Search | CELP based voice coders | 977 | 3.26 | 961 | 3.20 | 100%
Vector Max (40 element input vector) | Search algorithms | 61 | 0.20 | 59 | 0.20 | 100%
All-zero FIR Filter (40 samples, 10 coefficients) | VSELP based voice coders | 238 | 0.79 | 280 | 0.93 | 85%
Minimum Error Search (table size = 2304) | Search algorithms | 1185 | 3.95 | 1318 | 4.39 | 90%
IIR Filter (16 coefficients) | Filter | 43 | 0.14 | 38 | 0.13 | 100%
IIR, cascaded biquads (10 cascaded biquads, Direct Form II) | Filter | 70 | 0.23 | 75 | 0.25 | 93%
MAC (two 40 sample vectors) | VSELP based voice coders | 61 | 0.20 | 58 | 0.19 | 100%
Vector Sum (two 44 sample vectors) | | 51 | 0.17 | 47 | 0.16 | 100%
MSE (MSE between two 256 element vectors) | Mean square error computation in vector quantizer | 279 | 0.93 | 274 | 0.91 | 100%

- Great out-of-box experience
- Completely natural C code (non 'C6000 specific)
- Code available at dspvillage.com
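The time and efficiency columns can be reproduced from the cycle counts. The sketch below assumes that time is cycles divided by the 300 MHz clock, and that efficiency is the ratio of assembly cycles to C cycles, rounded to the nearest percent and capped at 100%. The capping rule is inferred from the table's figures rather than stated on the slide, and both function names are hypothetical.

```c
/* Execution time in microseconds for `cycles` at `clock_mhz` MHz. */
double cycles_to_us(unsigned cycles, double clock_mhz)
{
    return (double)cycles / clock_mhz;
}

/* Compiler efficiency versus hand-coded assembly, in percent.
 * Assumptions: c_cycles > 0; result is round(100 * asm / c), capped at 100
 * (the cap is inferred from the table, not stated in the source). */
unsigned efficiency_percent(unsigned asm_cycles, unsigned c_cycles)
{
    /* Integer round-to-nearest of 100 * asm_cycles / c_cycles. */
    unsigned pct = (200u * asm_cycles + c_cycles) / (2u * c_cycles);
    return pct > 100u ? 100u : pct;
}
```

For the Block Mean Square Error row, 348 assembly cycles against 402 C cycles gives 87%, and 348 cycles at 300 MHz is 1.16 µs.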

10 C67x MAC using Natural C

    float mac(float *m, float *n, int count)
    {
        int i;
        float sum = 0;

        for (i = 0; i < count; i++) {
            sum += m[i] * n[i];
        }
        …

    ;** --------------------------------------------------*
    LOOP:   ; PIPED LOOP KERNEL
               LDDW    .D1    *A4++,A7:A6
    ||         LDDW    .D2    *B4++,B7:B6
    ||         MPYSP   .M1X   A6,B6,A5
    ||         MPYSP   .M2X   A7,B7,B5
    ||         ADDSP   .L1    A5,A8,A8
    ||         ADDSP   .L2    B5,B8,B8
    || [A1]    B       .S2    LOOP
    || [A1]    SUB     .S1    A1,1,A1
    ;** --------------------------------------------------*

(CPU block diagram repeated: Memory, register files A0-A15 and B0-B15, units .M1 .L1 .D1 .S1 .M2 .L2 .D2 .S2, Controller/Decoder.)

The C67x compiler gets two 32-bit floating-point sum-of-products per iteration.

Can the 'C64x do better?

11 C64x gets four MACs using DOTP2

    short mac(short *m, short *n, int count)
    {
        int i;
        short sum = 0;

        for (i = 0; i < count; i++) {
            sum += m[i] * n[i];
        }
        …

    ;** --------------------------------------------------*
    ; PIPED LOOP KERNEL
    LOOP:
               ADD     .L2     B8,B6,B6
    ||         ADD     .L1     A6,A7,A7
    ||         DOTP2   .M2X    B4,A4,B8
    ||         DOTP2   .M1X    B5,A5,A6
    || [ B0]   B       .S1     LOOP
    || [ B0]   SUB     .S2     B0,-1,B0
    ||         LDDW    .D2T2   *B7++,B5:B4
    ||         LDDW    .D1T1   *A3++,A5:A4
    ;** --------------------------------------------------*

DOTP2 (from the diagram): A5 holds the pair (m1, m0) and B5 holds the pair (n1, n0). DOTP2 writes m1*n1 + m0*n0 to A6, which is then added to the running sum in A7.

How many multiplies can the 'C6x perform?
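To make the DOTP2 data flow concrete, here is a portable C model of the operation described on the slide: two 16-bit values are packed into each 32-bit operand, and the result is the sum of the two pairwise products. The names `pack2`, `dotp2_model`, and `mac16_dotp2` are hypothetical helpers written for this sketch; they are not TI compiler intrinsics, and the code runs on any host rather than relying on C64x hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word (hi in the upper half). */
uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Model of DOTP2: hi(a)*hi(b) + lo(a)*lo(b), halves treated as signed. */
int32_t dotp2_model(uint32_t a, uint32_t b)
{
    int16_t a_hi = (int16_t)(uint16_t)(a >> 16);
    int16_t a_lo = (int16_t)(uint16_t)(a & 0xFFFFu);
    int16_t b_hi = (int16_t)(uint16_t)(b >> 16);
    int16_t b_lo = (int16_t)(uint16_t)(b & 0xFFFFu);

    /* Each product fits in 32 bits; add as unsigned to avoid signed overflow
     * in the corner case where all four halves are -32768. */
    uint32_t sum = (uint32_t)((int32_t)a_hi * b_hi)
                 + (uint32_t)((int32_t)a_lo * b_lo);
    return (int32_t)sum;
}

/* Sum of products built from the DOTP2 model, two elements per step. */
int32_t mac16_dotp2(const int16_t *m, const int16_t *n, size_t count)
{
    int32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < count; i += 2) {
        sum += dotp2_model(pack2(m[i + 1], m[i]), pack2(n[i + 1], n[i]));
    }
    if (i < count) {                 /* leftover element when count is odd */
        sum += (int32_t)m[i] * n[i];
    }
    return sum;
}
```

The kernel on the slide issues two DOTP2 instructions per cycle, one on each .M unit, which is where the four MACs per cycle come from; the model above shows only what a single DOTP2 computes.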

12 MMACs

- How many 16-bit MMACs (millions of MACs per second) can the 'C6201 perform?
  400 MMACs (two .M units x 200 MHz)

- How about 16x16 MMACs on the 'C64x devices?
  2 .M units x 2 16-bit MACs (per .M unit, per cycle) x 1 GHz = 4000 MMACs

- How many 8-bit MMACs on the 'C64x?
  8000 MMACs (on 8-bit data)
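The arithmetic behind these figures is a single product: number of .M units, times MACs per unit per cycle, times clock rate in MHz. A small sketch follows, with the hypothetical function name `peak_mmacs`. The figure of four 8-bit MACs per .M unit per cycle is not stated on the slide; it is inferred here from the 8000 MMACs result at 1 GHz.

```c
/* Peak millions of MACs per second:
 *   .M units x MACs per unit per cycle x clock in MHz. */
unsigned peak_mmacs(unsigned m_units, unsigned macs_per_unit_per_cycle,
                    unsigned clock_mhz)
{
    return m_units * macs_per_unit_per_cycle * clock_mhz;
}
```

With two .M units, `peak_mmacs(2, 1, 200)` corresponds to the 'C6201 case and `peak_mmacs(2, 2, 1000)` to the 16-bit 'C64x case.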

13 C6415 DSP (1 GHz)

Block diagram (labels recovered from the figure):
- C64x™ CPU core: 5760 MIPS
- L1P cache, L1D cache, L2 memory
- Enhanced DMA controller (64 channels)
- Peripherals: McBSP 0, McBSP 1, McBSP 2, Utopia 2, EMIF 64, EMIF 16, HPI32
- Timer 0, Timer 1, Timer 2
- PLL, power-down logic, JTAG, RTDX
- Peripheral bandwidth labels: 133 MB/s, 12.5 MB/s, 100 MB/s, 1064 MB/s, 266 MB/s, 12.5 MB/s
- Internal bandwidth labels: 16 GB/s, 32 GB/s, 2.9 GB/s

How does the DSP fit into a system?

14 'C6000 Peripherals Summary

Block diagram (labels recovered from the figure):
- CPU: Register Set A with .D1, .M1, .L1, .S1; Register Set B with .D2, .M2, .L2, .S2
- Internal buses
- Internal memory
- External memory (via EMIF)
- Peripherals: McBSP, GPIO, VCP, TCP, DMA/EDMA (boot), timers, PLL, XB, PCI, Host Port, EMIF

15 Example C6000 System

System diagram (labels recovered from the figure):
- On-chip blocks: C6000 CPU, EDMA, VCP, TCP, Boot Loader, EMIF, Timer/Counters, HWI, PCI, HPI, Utopia 2, McASP, McBSP, EMAC, GPIO, PLL (Clockin, Clockout x PLL), Video Ports (DM64x)
- External connections: SDRAM, Sync SRAM, and EPROM (16, 32, or 64 bits); serial codec; audio codec; Ethernet (TCP/IP stack available); ATM (8 bits); PCI (32 bits); host µP (16 or 32 bits); NMI, Reset, and external interrupts (4 lines); GPIO to switches, lamps, latches, FPGA, etc. (0 to 16+ lines)

Note: Not all 'C6000 devices have all the various peripherals shown above. Please refer to the C6000 Product Update for a device-by-device listing.

16 C6416T DSK

Diagnostic Utility included with DSK...

17 C6416 DSK

Diagnostic Utility included with DSK...

18 C6416 DSK Memory Map

Address | TMS320C6416 | C6416 DSK
0000_0000 | Internal RAM: 1MB |
0010_0000 | Internal peripherals or reserved |
6000_0000 | EMIFB CE0: 64MB | CPLD
6400_0000 | EMIFB CE1: 64MB | Flash: 512KB
6800_0000 | EMIFB CE2: 64MB |
6C00_0000 | EMIFB CE3: 64MB |
8000_0000 | EMIFA CE0: 256MB | SDRAM: 16MB
9000_0000 | EMIFA CE1: 256MB |
A000_0000 | EMIFA CE2: 256MB | Daughter card
B000_0000 | EMIFA CE3: 256MB |

CPLD:
- LEDs
- DIP switches
- DSK status
- DSK rev#
- Daughter card
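One way to work with a memory map like this in code is a small lookup table. The sketch below encodes the base addresses and sizes from the table and finds the region containing a given address. The type `region_t`, the table name `c6416_map`, and the function `find_region` are hypothetical names chosen for this illustration; they do not come from any TI header or board support library.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of the memory map: base address, size in bytes, label. */
typedef struct {
    uint32_t    base;
    uint32_t    size;
    const char *name;
} region_t;

/* Regions taken from the memory map table (1MB = 0x00100000,
 * 64MB = 0x04000000, 256MB = 0x10000000). */
static const region_t c6416_map[] = {
    { 0x00000000u, 0x00100000u, "Internal RAM" },
    { 0x60000000u, 0x04000000u, "EMIFB CE0 (CPLD on the DSK)" },
    { 0x64000000u, 0x04000000u, "EMIFB CE1 (Flash on the DSK)" },
    { 0x68000000u, 0x04000000u, "EMIFB CE2" },
    { 0x6C000000u, 0x04000000u, "EMIFB CE3" },
    { 0x80000000u, 0x10000000u, "EMIFA CE0 (SDRAM on the DSK)" },
    { 0x90000000u, 0x10000000u, "EMIFA CE1" },
    { 0xA0000000u, 0x10000000u, "EMIFA CE2" },
    { 0xB0000000u, 0x10000000u, "EMIFA CE3" },
};

/* Returns the region containing `addr`, or NULL if the address falls in
 * none of the listed regions (internal peripherals, reserved, unmapped). */
const region_t *find_region(uint32_t addr)
{
    size_t n = sizeof c6416_map / sizeof c6416_map[0];
    for (size_t i = 0; i < n; i++) {
        /* Unsigned subtraction wraps for addr < base, so one compare suffices. */
        if (addr - c6416_map[i].base < c6416_map[i].size) {
            return &c6416_map[i];
        }
    }
    return NULL;
}
```

Note that the region sizes are the chip-enable window sizes; the devices actually populated on the DSK (for example 512KB of Flash in a 64MB window) occupy only part of each window.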

19 DSK's Diagnostic Utility

- Test/diagnose DSK hardware
- Verify USB emulation link
- Use Advanced tests to facilitate debugging
- Reset DSK hardware

DSK Contents...

20 DSK Contents (i.e. what you get…)

Hardware
- 1 GHz C6416T DSP or 225 MHz C6713 DSP
- TI 24-bit A/D converter (AIC23)
- External memory
  - 8 or 16MB SDRAM
  - Flash ROM: C6416 (512KB), C6713 (256KB)

Software
- Code Composer Studio
- SD Diagnostic Utility
- Example programs

Misc hardware
- LEDs and DIPs
- Daughter card expansion
- 1 or 2 additional expansions
- Power supply & USB cable

Documentation
- DSK Technical Reference
- eXpressDSP for Dummies

21 Lab 1 CCS

Hardware
1. Hook up the DSK
2. Supply power and observe POST

Software
1. Run Diagnostic Utility
2. Run CCS Setup
3. Start CCS
4. Configure CCS options
5. Close CCS

Time: 20 minutes

22 Lab Exercises – C67x vs. C64x

- Which DSK are you using?
- We provide instructions and solutions for both C67x and C64x.
- We have tried to call out the few differences in lab steps as explicitly as possible:

23 Optional Topics

- POST
- DSK Help

24 C6416 DSK - Power On Self Test (POST)

Test | LED 4 | LED 3 | LED 2 | LED 1 | Description
1 | 0 | 0 | 0 | 1 | DSP's internal memory test
2 | 0 | 0 | 1 | 0 | External SDRAM test
3 | 0 | 0 | 1 | 1 | Check manufacturer ID of Flash chip
4 | 0 | 1 | 0 | 0 | McBSP 0 loopback test
5 | 0 | 1 | 0 | 1 | McBSP 1 loopback test
6 | 0 | 1 | 1 | 0 | McBSP 2 loopback test
7 | 0 | 1 | 1 | 1 | Transfer small array with EDMA
8 | 1 | 0 | 0 | 0 | Codec test (output 1 kHz tone)
9 | 1 | 0 | 0 | 1 | Timer test (configure and wait for 100 interrupts)
All | blink | blink | blink | blink | All tests completed successfully

- Stored in Flash memory and runs every time the DSK is powered on
- Source code on DSK CD-ROM
- While a test is performed, its index number is shown on the LEDs. If a test fails, the index of that test will blink continuously.
- When complete, all LEDs will blink three times, then turn off
- See the C6713 DSK help file for its index of tests.
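The LED columns in the table are simply the test index written in binary, with LED 1 as the least significant bit and LED 4 as the most significant. A minimal sketch of that encoding follows; the function name `post_led_state` is hypothetical, and the code only models the table, not how the POST firmware actually drives the LEDs.

```c
/* State (0 = off, 1 = on) of LED `led` (1..4) while POST test
 * `test_index` is running. LED 1 is bit 0 of the index, LED 4 is bit 3.
 * Returns 0 for an LED number outside 1..4. */
int post_led_state(unsigned test_index, unsigned led)
{
    if (led < 1u || led > 4u) {
        return 0;
    }
    return (int)((test_index >> (led - 1u)) & 1u);
}
```

For example, test 6 (McBSP 2 loopback) is 0110 in binary, so LED 3 and LED 2 are on while LED 4 and LED 1 are off.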

25 DSK Help


