LPC Speech Coder on the TI C6x DSP Mark Anderson, Jeff Burke EE213A / EE298-2 Prof. Ingrid Verbauwhede.


1 LPC Speech Coder on the TI C6x DSP Mark Anderson, Jeff Burke EE213A / EE298-2 Prof. Ingrid Verbauwhede

2 Summary Implementation platform Texas Instruments TMS320C6000 Low-quantity cost US $35 (‘C6211) Architecture clock frequency 150 MHz (‘C6211) Throughput 75-80 channels @ 8000 samples/sec

3 Summary Total energy per sample 1.8 uJ/sample ‘Area’ 1.2% of cycle budget per chan. per frame 8.5% of unified memory per channel 25% of unified memory for algorithm

4 Summary Flexibility of implementation High; programmable processor with C compiler, GUI debugger & simulator SegSNR_A: ? SegSNR_Q: 26 dB (voiced segments)

5 Architecture overview 256-bit VLIW Two “clustered” data paths Four functional units in each data path: 16x16 multiplier, two ALUs, data addressing unit 32-bit instruction for each functional unit (256-bit “instruction” for 8 functional units)

6 Data path diagram

7 Architecture overview Split register file Only two cross-paths exist; each cluster is limited to one source read from the opposite register file per cycle. Data types 8, 16, 32-bit with 40-bit accumulate 40-bit = register pair

8 Memory architecture ‘C6211 (US$35) has a cache! 4kB L1 Instruction cache (L1P) 4kB L1 Data cache (L1D) 64kB L2 Unified memory and/or cache Extra DMA channels

9 Memory architecture

10 Design Tools Command-line Compiler, debugger, simulator Code Composer Studio Same tools Windows NT GUI 30-day “evaluation” license Draconian copy protection, pulls out the rug from under you

11 Design Flow Consolidate Matlab reference into a single function Matlab rewritten C-style Verified C-style Matlab C prototype created Imported into Code Composer, optimized & simulated

12 Fixed-point quantization Input samples 16-bit, normalized to [-1,1) format used Coefficient quantization Hamming window, pre-emphasis, FIR format used No noticeable change in characteristics

13 Fixed-point quantization Most values 16 bit Take advantage of 16x16 fast multipliers Remain close to other class implementations Add metric for overpowered LPC engine Use # of channels as performance metric
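The 16-bit samples normalized to [-1,1) and the 16x16 multipliers described above can be illustrated with a small fixed-point sketch. The transcript has lost the exact format names, so the Q15 layout (1 sign bit, 15 fractional bits) used here is an assumption, and the function names `q15_from_double` and `q15_mul` are hypothetical, not taken from the project.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: Q15 (1 sign bit, 15 fractional bits) represents [-1, 1). */
typedef int16_t q15_t;

/* Convert a value in [-1, 1) to Q15, saturating at the range limits. */
q15_t q15_from_double(double x)
{
    assert(x == x);                      /* NaN has no fixed-point value */
    double scaled = x * 32768.0;
    if (scaled >= 32767.0) return INT16_MAX;
    if (scaled <= -32768.0) return INT16_MIN;
    /* Round to nearest; the cast truncates toward zero. */
    return (q15_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* 16x16 -> 32-bit multiply, rounded and scaled back to Q15. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;       /* Q30 result */
    /* Right shift of a negative value is arithmetic on common compilers,
       but that is implementation-defined in C. */
    int32_t rounded = (product + (1 << 14)) >> 15;
    if (rounded > INT16_MAX) rounded = INT16_MAX;    /* only -1 * -1 */
    return (q15_t)rounded;
}
```

For example, `q15_mul(q15_from_double(0.5), q15_from_double(0.5))` yields 8192, the Q15 encoding of 0.25. The only product that needs saturation is -1 times -1, which would otherwise land just outside the representable range.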

14 Fixed-point quantization Energy stored in Prevent overflow, provide precision for low energy segments Temporary values stored in Take advantage of extended precision Modified autocorrelation used All whole numbers
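The idea of keeping energy and temporary sums in a wider type than the 16-bit samples can be sketched as below. This is the plain autocorrelation, not the "modified" variant the slide mentions (its details are not given), and `int64_t` stands in for the C6x 40-bit register-pair accumulator as a portability assumption. The function name `autocorr_q15` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain autocorrelation: r[k] = sum over i of x[i] * x[i + k],
   for k = 0..max_lag. The caller provides r with max_lag + 1 entries. */
void autocorr_q15(const int16_t *x, size_t n, int64_t *r, size_t max_lag)
{
    assert(max_lag < n);
    for (size_t k = 0; k <= max_lag; ++k) {
        int64_t acc = 0;   /* wide accumulator, so the sum cannot wrap */
        for (size_t i = 0; i + k < n; ++i)
            acc += (int32_t)x[i] * (int32_t)x[i + k];   /* 16x16 multiply */
        r[k] = acc;
    }
}
```

Lag 0 is the frame energy, so the same accumulator choice covers both the energy calculation and the autocorrelation. Four full-scale samples already exceed 32 bits at lag 0, which is why a 32-bit accumulator would not be enough without scaling.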

15 Fixed-Point SNR Matlab simulation of magnitude truncation Tools again. SegSNR_A = ? SegSNR_Q = 26 dB Voiced segments only Sent_female test data

16 Performance results Initial version: 80,000 CPU cycles/frame Optimization: take advantage of VLIW and pipelining (observe assembly, modify C loops); use TI’s DSP Library (assembly advantage without assembly) Optimized version: 30,182 cycles/frame Had to stop early, still at least 5K cycles wasted

17 Performance Then, the tool license expired. The tool would not install on other machines. TI responded, but wasn’t too helpful. Moral #1: Avoid the evaluation version. Moral #2: Give tools away to sell hardware

18 Cycle count details
Routine                                  %      Cycles/frame
Windowing, pre-emphasis                  4.3    1285
Energy calc                              0.8    254
Autocorrelation in Levinson-Durbin       8.0    2421
Autocorrelation in pitch detection       51     15334
Algorithm total                          95     28561
Total w/ housekeeping                           30182

19 Additional optimizations Use more DSPLIB routines Autocorrelation Assembly-level optimization Code size reduction? Reduce number of buffers to reduce L1D usage per frame

20 Energy per sample ‘C6211 consumes 1.24W 75% high activity / 25% low activity 1.24W / 80 channels = 15.5mW/channel 15.5 mJ/sec/channel * 1/8000 = 1.8 uJ / sample
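The energy arithmetic on this slide can be reproduced with a short helper; the function name `energy_per_sample_j` is hypothetical. Evaluated with the slide's own inputs (1.24 W, 80 channels, 8000 samples/sec) the result is about 1.94 uJ/sample, while the quoted 1.8 uJ/sample corresponds to roughly 86 channels, which may be where the slide's figure comes from.

```c
#include <assert.h>

/* Energy per sample in joules: power split evenly across channels,
   then divided by the per-channel sample rate. */
double energy_per_sample_j(double power_w, int channels, double sample_rate_hz)
{
    assert(channels > 0 && sample_rate_hz > 0.0);
    return power_w / (double)channels / sample_rate_hz;
}
```

Calling `energy_per_sample_j(1.24, 80, 8000.0)` and `energy_per_sample_j(1.24, 86, 8000.0)` shows how sensitive the per-sample figure is to the assumed channel count.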

21 Number of channels 150 x 10^6 cycles/sec x 0.02 sec/frame = 3.0 x 10^6 cycles/frame 3.0 x 10^6 cycles/frame / 30,182 cycles = 99 channels
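The channel estimate is a cycle budget divided by the per-frame cost, which the following sketch spells out in integer arithmetic. The function names are hypothetical; the inputs (150 MHz clock, 20 ms frames, 30,182 cycles per frame) are the figures from the slides.

```c
#include <assert.h>

/* Cycles available in one frame, given the clock in Hz and frame length in ms. */
long frame_cycle_budget(long clock_hz, long frame_ms)
{
    assert(clock_hz > 0 && frame_ms > 0);
    return clock_hz / 1000 * frame_ms;
}

/* Whole channels that fit in the budget at a given per-channel cost. */
long max_channels(long budget_cycles, long cycles_per_channel_frame)
{
    assert(cycles_per_channel_frame > 0);
    return budget_cycles / cycles_per_channel_frame;   /* rounds down */
}
```

With `frame_cycle_budget(150000000L, 20)` giving 3,000,000 cycles, `max_channels(3000000L, 30182L)` returns 99, matching the slide. This ignores memory stalls entirely, so it is an upper bound.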

22 Memory ‘C6211 Cache complicates estimates Performance is 85-99% of optimal for typical applications 30,182 cycles becomes 35,508 cycles/frame for 85% efficiency => now support only 86 channels
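A rough way to fold the quoted 85-99% cache efficiency into the channel estimate is to inflate the cycle count by the efficiency and divide the frame budget again. The sketch below does that with hypothetical function names and integer percentages. With the slide's inputs, the 85% case gives 35,508 cycles/frame as stated, but the straightforward division then yields about 84 channels, slightly below the 86 quoted on the slide.

```c
#include <assert.h>

/* Effective cycles per frame when the core achieves only
   efficiency_pct percent of its stall-free performance. */
long derated_cycles(long ideal_cycles, long efficiency_pct)
{
    assert(efficiency_pct > 0 && efficiency_pct <= 100);
    return ideal_cycles * 100 / efficiency_pct;   /* rounds down */
}

/* Whole channels that fit in the frame budget after derating. */
long channels_with_cache(long budget_cycles, long ideal_cycles,
                         long efficiency_pct)
{
    return budget_cycles / derated_cycles(ideal_cycles, efficiency_pct);
}
```

Evaluating `channels_with_cache(3000000L, 30182L, 85)` and the same call at 99 brackets the estimate between 84 and 98 channels under this simple model, before any off-chip transfer cost is counted.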

23 Memory Try to account for off-chip memory transfers ~220,000 cycles for 150ns fetches for 80 channels => support 75-80 channels Unable to verify/simulate because of unexpected tool expiration

24 Memory L2 usage ~16kB Code size thanks to VLIW 512 32-byte instruction clusters More suited for ‘C6201 & larger processors Remaining used by data for channels 480 bytes each (8.5% of remaining memory) L1 usage L1P: Can’t tell because of cache L1D: 2.2kB (~56%)

25 Tool comments Powerful, easy to use IDE… When it worked. Licensing problems for eval version Debugging support a bit odd puts/printf

26 C6x Conclusions Easily support 75-80 channels of coding 26 dB fixed-point SNR, 16-bit types VLIW = Large code size Cache on a low-end DSP! Good tools, but draconian copy protection

