RICE UNIVERSITY High performance, power-efficient DSPs based on the TI C64x Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner Rice University

Presentation transcript:

RICE UNIVERSITY High performance, power-efficient DSPs based on the TI C64x Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner Rice University

RICE UNIVERSITY 2 Recent (2003) Research Results
- Stream-based programmable processors meet real-time requirements for a set of base-station physical-layer algorithms +,*
- Mapped algorithms onto stream processors and studied trade-offs between packing, ALU utilization and memory operations
- Improved power efficiency in stream processors by adapting compute resources to workload variations and by varying voltage and clock frequency to meet real-time requirements *
- Design exploration between #ALUs and clock frequency to minimize the power consumption of the processor
+ S. Rajagopal, S. Rixner, J. R. Cavallaro, 'A programmable baseband processor design for software defined radios', 2002
* Paper draft sent previously; rest of the contributions in thesis

RICE UNIVERSITY 3 Recent (2003) Research Results
- Peak computation rate available: ~200 billion arithmetic operations at 1.2 GHz
- Estimated peak power (0.13 micron): W at 1.2 GHz
- Power:
  - W for 32 users, constraint 9 decoding, at 128 Kbps/user, at 1.2 GHz, 1.4 V
  - 300 mW for 4 users, constraint 7 decoding, at 128 Kbps/user, at 433 MHz, V

RICE UNIVERSITY 4 Motivation
- This research could be applied to DSP design!
- Designing
  - High performance DSPs
  - Power-efficient
  - Adapt computing resources with workload changes
- Such that
  - Gradual changes in C64x architecture
  - Gradual changes in compilers and tools

RICE UNIVERSITY 5 Levels of changes
- To allow changes in TI DSPs and tools gradually
- Changes classified into 3 levels
  - Level 1: simple, minimum changes (next silicon)
  - Level 2: intermediate, handover changes (1-2 years)
  - Level 3: actual proposed changes (2-3 years)
We want to go to Level 3, but in steps!

RICE UNIVERSITY 6 Level 1 changes: Power-efficiency

RICE UNIVERSITY 7 Level 1 changes: Power saving features
- (1) Use dynamic voltage and frequency scaling
  - When the workload changes, such as users, data rates, modulation, coding rates, …
  - Already in industry: Crusoe, XScale, …
- (2) Use voltage gating to turn off unused resources
  - When units are idle for a 'sufficiently' long time
  - Saves static and dynamic power dissipation
  - See example on next page

RICE UNIVERSITY 8 Turning off ALUs
[Figure: instruction schedules across adders and multipliers, default schedule vs. schedule after exploration, with a 'sleep' instruction]
2 multipliers turned off to save power. Turned off using voltage gating to eliminate static and dynamic power dissipation.

RICE UNIVERSITY 9 Level 1: Architecture tradeoffs
DVS:
- Advanced voltage regulation scheme
- Cannot use NMOS pass gates
- Cannot use tri-state buffers
- Use at a coarser time scale (once in a million cycles)
- cycles settling time
Voltage gating:
- Gating device design important
- Should be able to supply current to gated circuit
- Use at coarser time scale (once in cycles)
- 1-10 cycles settling time

RICE UNIVERSITY 10 Level 1: Tools/Programming impact
- Need a DSP/BIOS "TASK" running continuously that watches for workload changes and sets voltage/frequency using a look-up table in memory
- Compiler should be made 're-targetable'
  - Target a subset of ALUs and explore static performance with different adder-multiplier schedules
- Voltage gating using a 'sleep' instruction that the compiler generates for unused ALUs
  - ALUs should be idle for > 100 cycles for this to occur
  - Other resources can be gated off similarly to save static power dissipation
- Programmer is not aware of these changes
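The look-up-table idea and the idle threshold on this slide can be sketched in plain C. This is only an illustration under stated assumptions: the type and function names are hypothetical, no DSP/BIOS calls are shown, and the table has just two rows. The frequencies and the 1.4 V value are taken from the figures quoted on slide 3; the 800 mV entry is a placeholder, since that value is missing from the transcript.

```c
#include <stddef.h>

/* One row of a hypothetical look-up table: the largest workload a row
 * can serve and the operating point to request for it. */
typedef struct {
    int users;              /* maximum number of active users */
    int constraint_length;  /* maximum Viterbi constraint length */
    int freq_mhz;           /* clock frequency to request */
    int voltage_mv;         /* supply voltage to request */
} OperatingPoint;

/* Rows sorted from lightest to heaviest workload. The 800 mV value is a
 * placeholder assumption; the other numbers are quoted on slide 3. */
static const OperatingPoint kTable[] = {
    {  4, 7,  433,  800 },
    { 32, 9, 1200, 1400 },
};

/* Return the first row that covers the workload, or the worst-case row. */
const OperatingPoint *lookup_operating_point(int users, int constraint_length)
{
    size_t n = sizeof kTable / sizeof kTable[0];
    for (size_t i = 0; i < n; ++i) {
        if (users <= kTable[i].users &&
            constraint_length <= kTable[i].constraint_length) {
            return &kTable[i];
        }
    }
    return &kTable[n - 1];
}

/* Gate an ALU only if it is expected to stay idle for more than 100
 * cycles, the threshold stated on the slide. */
int should_gate_alu(unsigned idle_cycles)
{
    return idle_cycles > 100u;
}
```

A monitoring task would call `lookup_operating_point` whenever the number of users or the coding parameters change, then hand the result to whatever voltage and clock control the hardware provides.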

RICE UNIVERSITY 11 Level 2 changes: Performance

RICE UNIVERSITY 12 Solutions to increase DSP performance
- (1) Increasing clock frequency
  - C64x: 600 – 720 – ? MHz
  - Easiest solution but limited benefits
  - Not good for power, given cubic dependence with frequency
- (2) Increasing ALUs
  - Limited instruction level parallelism (ILP)
  - Register file area, ports explosion
  - Compiler issues in extracting more ILP
- (3) Multiprocessors (MIMD)
  - Usually 3rd party vendors (except C40-types)
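The cubic dependence mentioned under option (1) follows from the standard CMOS dynamic power model, assuming the supply voltage has to be raised roughly in proportion to the clock frequency. The symbols below (activity factor \alpha, switched capacitance C) are the usual textbook ones, not notation taken from the slides, and the 600 to 720 MHz step is used only as a worked example:

```latex
\[
P_{\text{dyn}} = \alpha\, C\, V^{2} f,
\qquad V \propto f \;\Rightarrow\; P_{\text{dyn}} \propto f^{3}
\]
\[
\frac{P_{\text{dyn}}(720\ \text{MHz})}{P_{\text{dyn}}(600\ \text{MHz})}
  = \left(\frac{720}{600}\right)^{3} = 1.2^{3} = 1.728
\]
```

Under that assumption, a 20% clock increase costs roughly 73% more dynamic power.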

RICE UNIVERSITY 13 DSP multiprocessors
[Figure labels: DSP, ASSP, Co-Proc's, Network Interface, Interconnection]
Source: Texas Instruments Wireless Infrastructure Solutions Guide, Pentek, Sundance, C80

RICE UNIVERSITY 14 Multiprocessing tradeoffs
Advantages:
- Performance, and tools don't have to change!!
Disadvantages:
- Load-balancing algorithms on multiple DSPs not straightforward +
  - Burden pushed on to the programmer
  - Not scalable with number of processors
  - Difficult to adapt with workload changes
- Traditional DSPs not built for multiprocessing * (except C40-types)
  - I/O impacts throughput, power and area
  - (E)DMA use minimizes the throughput problem
  - Power and area problems still remain
* R. Baines, The DSP bottleneck, IEEE Communications Magazine, May 1995, pp (outdated?)
+ S. Rajagopal, B. Jones and J. R. Cavallaro, Task partitioning wireless base-station algorithms on multiple DSPs and FPGAs, ICSPAT'2001

RICE UNIVERSITY 15 Options
- Chip multiprocessors with SIMD parallelism (Level 3)
  - SIMD parallelism can alleviate load balancing (shown in Level 3)
  - Scalable with processors
  - Automatic SIMD parallelism can be done by the compiler
  - Single chip will alleviate I/O bottlenecks
  - Tools will need changes
- To get to Level 3, intermediate (Level 2) level investigation
  - Level 2: do SPMD on DSP multiprocessor

RICE UNIVERSITY 16 Texas Instruments C64x DSP
[Figure: C64x Datapath]
Source: Texas Instruments C64x DSP Generation (sprt236a.pdf)

RICE UNIVERSITY 17 A possible, plausible solution
Exploit data parallelism (DP)*
- Available in many wireless algorithms
- This is what ASICs do!

    #define N 1024
    int i, a[N], b[N], sum[N];       // 32 bits
    short int c[N], d[N], diff[N];   // 16 bits, packed
    for (i = 0; i < N; ++i) {
        sum[i] = a[i] + b[i];
        diff[i] = c[i] - d[i];
    }

[Figure labels: ILP, DP, Subword]
* Data parallelism is defined as the parallelism available after subword packing and loop unrolling
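The "Subword" label on this slide refers to packing two 16-bit operands into one 32-bit word so that a single operation handles both. A minimal portable sketch of that idea for the `diff` loop is given below. The helper names (`pack2`, `sub2`, `packed_diff`) are made up for illustration and emulate packed arithmetic in plain C; they are not the compiler's or the hardware's own mechanism.

```c
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word (hi in the upper half). */
static uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Lane-wise 16-bit subtraction; each lane wraps independently of the other. */
static uint32_t sub2(uint32_t x, uint32_t y)
{
    uint32_t hi = ((x >> 16) - (y >> 16)) & 0xFFFFu;
    uint32_t lo = ((x & 0xFFFFu) - (y & 0xFFFFu)) & 0xFFFFu;
    return (hi << 16) | lo;
}

/* Extract the upper and lower 16-bit lanes as signed values. */
static int16_t lane_hi(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }
static int16_t lane_lo(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }

/* diff[i] = c[i] - d[i], two elements per packed operation. */
void packed_diff(const int16_t *c, const int16_t *d, int16_t *diff, int n)
{
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        uint32_t r = sub2(pack2(c[i], c[i + 1]), pack2(d[i], d[i + 1]));
        diff[i] = lane_hi(r);
        diff[i + 1] = lane_lo(r);
    }
    if (i < n) {                 /* odd-length tail handled as a scalar */
        diff[i] = (int16_t)(c[i] - d[i]);
    }
}
```

On hardware with packed 16-bit instructions, the pack and extract steps disappear because the data already sits packed in memory; only the lane-wise operation remains.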

RICE UNIVERSITY 18 SPMD multiprocessor DSP
[Figure: C64x Datapath]
Same program running on all DSPs

RICE UNIVERSITY 19 Level 2: Architecture tradeoffs
- C64x's
- Interconnection could be similar to the ones used by 3rd party vendors
  - FPGA-based C40 comm ports (Sundance) ~400 MBps
  - VIM modules (Pentek) ~300 MBps
  - Others developed by TI, BlueWave Systems

RICE UNIVERSITY 20 Level 2: Tools/Programming impact
- All DSPs run the same program
- Programmer thinks of only 1 DSP program
- Burden now on tools
  - Can use C8x compiler and tool support expertise
  - Integration of C8x and C6x compilers
  - Data parallelism used for SPMD
- DMA data movement can be left to the programmer at this stage to keep data fed to all the processors
- MPI (message passing) can be applied as an alternative
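To make "all DSPs run the same program" concrete, here is a minimal SPMD sketch in C: every processor executes the same function and uses its own index to pick a slice of the loop. The names `spmd_range` and `spmd_sum`, and the assumption that each DSP knows its `proc_id` and the total `num_procs`, are illustrative and not taken from the slides. Data movement (DMA or message passing) is left out.

```c
/* Half-open range [begin, end) of loop iterations owned by one processor. */
typedef struct {
    int begin;
    int end;
} Range;

/* Split n iterations as evenly as possible across num_procs processors;
 * the first (n % num_procs) processors get one extra iteration. */
Range spmd_range(int n, int num_procs, int proc_id)
{
    int base = n / num_procs;
    int extra = n % num_procs;
    Range r;
    r.begin = proc_id * base + (proc_id < extra ? proc_id : extra);
    r.end = r.begin + base + (proc_id < extra ? 1 : 0);
    return r;
}

/* Same code on every DSP; only proc_id differs between them. */
void spmd_sum(const int *a, const int *b, int *sum,
              int n, int num_procs, int proc_id)
{
    Range r = spmd_range(n, num_procs, proc_id);
    for (int i = r.begin; i < r.end; ++i) {
        sum[i] = a[i] + b[i];
    }
}
```

With 1024 iterations on 4 DSPs, processor 0 handles iterations 0 to 255 and processor 3 handles 768 to 1023.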

RICE UNIVERSITY 21 Level 3 changes: Performance and Power

RICE UNIVERSITY 22 A chip multiprocessor (CMP) DSP
[Figure: a C64x DSP core (1 cluster) with instruction decoder and internal memory (L2), labeled ILP and Subword, next to a C64x-based CMP DSP core with instruction decoder and internal memory (L2), labeled ILP, Subword and DP]
Adapt #clusters to DP. Identical clusters, same operations. Power down unused ALUs and clusters.

RICE UNIVERSITY 23 A 4 cluster CMP using TI C64x
[Figure: C64x Datapath]
Significant savings possible in area and power. Increasing benefits with larger #clusters (8, 16, 32 clusters).

RICE UNIVERSITY 24 Alternate view of the CMP DSP
[Figure labels: DMA controller; L2 internal memory (Bank 1, Bank 2, Bank C); prefetch buffers; clusters of C64x (C64x core 0, C64x core 1, C64x core C); inter-cluster communication network; instruction decoder]

RICE UNIVERSITY 25 Adapting #clusters to Data Parallelism
[Figure: adaptive multiplexer network feeding clusters (C) in four configurations: no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, all clusters off]
Turned off using voltage gating to eliminate static and dynamic power dissipation.

RICE UNIVERSITY 26 Level 3: Architecture tradeoffs
- Single processor -> SPMD -> SIMD
- Single chip:
  - Max die size limited to 128 clusters with 8 functional units/cluster at 90 nm technology [estimate]
  - Number of memory banks = #clusters
- Instruction addition to turn off clusters when data parallelism is insufficient

RICE UNIVERSITY 27 Level 3: Tools/Programming impact
- Level 2 compiler provides support for data parallelism
  - Adapt #clusters to data parallelism for power savings
  - Check for loop count index after loop unrolling
  - If less than #clusters, provide instruction to turn off clusters
- Design of parallel algorithms and mapping important
- Programmer still writes regular C code
  - Transparent to the programmer
  - Burden on the compiler
  - Automatic DMA data movement to keep data feeding into the arithmetic units
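The compiler check described on this slide (compare the loop count left after unrolling with the number of clusters, and turn off the surplus) reduces to a small calculation. The sketch below is an assumption-laden illustration: the function names are hypothetical, and it assumes any number of clusters can be gated, whereas the hardware may only support specific configurations.

```c
/* Clusters worth keeping powered for a loop whose remaining data
 * parallelism (loop count after subword packing and unrolling) is dp. */
int active_clusters(int dp, int total_clusters)
{
    if (dp <= 0) {
        return 0;
    }
    return dp < total_clusters ? dp : total_clusters;
}

/* Clusters the compiler could turn off for that loop. */
int clusters_to_gate(int dp, int total_clusters)
{
    return total_clusters - active_clusters(dp, total_clusters);
}
```

On a 32-cluster processor, a kernel with a data parallelism of 16 would leave 16 clusters to gate, while a kernel with a data parallelism of 64 keeps all 32 busy.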

RICE UNIVERSITY 28 Level 3 potential verification using the Imagine stream processor simulator
Replacing the C64x DSP with a cluster containing 3 adders (+), 3 multipliers (X) and a distributed register file
Verification of potential benefits

RICE UNIVERSITY 29 Need for adapting to flexibility
- Base-stations are designed for worst case workload
- Base-stations rarely operate at worst case workload
- Adapting the resources to the workload can save power!

RICE UNIVERSITY 30 Example of flexibility needed in workloads
[Figure: operation count (in GOPs, billions of computations per second needed) vs. (users, constraint length) for (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9); series: 2G base-station (16 Kbps/user) and 3G base-station (128 Kbps/user)]
Workload variation from ~1 GOPs for 4 users, constraint 7 Viterbi to ~23 GOPs for 32 users, constraint 9 Viterbi
Note: GOPs refer only to arithmetic computations

RICE UNIVERSITY 31 Flexibility affects Data Parallelism*
Columns: Workload (U,K); Estimation; Detection; Decoding
Column functions: f(U,N) f(U,K,R)
(4,7): 32, 4, 16
(4,9): 32, 4, 64
(8,7): 32, 8, 16
(8,9): 32, 8, 64
(16,7): 32, 16
(16,9):
(32,7): 32, 16
(32,9): 32, 64
U - Users, K - constraint length, N - spreading gain, R - decoding rate
* Data parallelism is defined as the parallelism available after subword packing and loop unrolling

RICE UNIVERSITY 32 Cluster utilization variation with workload
[Figure: cluster utilization variation on a 32-cluster processor]
(32, 9) = 32 users, constraint length 9 Viterbi

RICE UNIVERSITY 33 Frequency variation with workload

RICE UNIVERSITY 34 Operation
- DVS when system changes significantly
  - Users, data rates, …
  - Coarse time scale (every few seconds)
- Turn off clusters when parallelism changes significantly
  - Parallelism can change within the same algorithm
  - E.g., spreading gain changes during matched filtering
  - Finer time scales (100's of microseconds)
- Turn off ALUs when algorithms change significantly
  - Estimation, detection, decoding
  - Finer time scales (100's of microseconds)

RICE UNIVERSITY 35 Power savings: Voltage Gating & Scaling
Power can change from W to 300 mW depending on workload changes

RICE UNIVERSITY 36 How to decide ALUs vs. clock frequency
- No independent variables
  - Clusters, ALUs, frequency, voltage
- Trade-offs exist
- How to find the right combination for lowest power!

RICE UNIVERSITY 37 Setting clusters, adders, multipliers
- If sufficient DP, linear decrease in frequency with clusters
  - Set clusters depending on DP and execution time estimate
- To find adders and multipliers,
  - Let compiler schedule algorithm workloads across different numbers of adders and multipliers and let it find execution time
  - Put all numbers in previous equation
  - Compare increase in capacitance due to added ALUs and clusters with benefits in execution time
  - Choose the solution that minimizes the power
Details available in Sridhar's thesis
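The selection procedure on this slide can be sketched as a small search over candidate configurations. The equation the slide refers to did not survive in this transcript, so the power model below is an assumption: power proportional to capacitance times frequency cubed, in line with the cubic dependence noted on slide 12. The `Candidate` type, the function names and the example numbers are all hypothetical.

```c
#include <stddef.h>

/* One candidate ALU mix together with the numbers the search needs. */
typedef struct {
    int adders;
    int multipliers;
    double cycles;       /* execution time in cycles from the compiler schedule */
    double capacitance;  /* relative switched capacitance of this configuration */
} Candidate;

/* Assumed model: run at the lowest frequency that meets the deadline,
 * with power growing as capacitance * frequency^3. */
double relative_power(const Candidate *c, double deadline_s)
{
    double f = c->cycles / deadline_s;
    return c->capacitance * f * f * f;
}

/* Index of the candidate with the lowest modeled power (requires n >= 1). */
size_t pick_lowest_power(const Candidate *cands, size_t n, double deadline_s)
{
    size_t best = 0;
    for (size_t i = 1; i < n; ++i) {
        if (relative_power(&cands[i], deadline_s) <
            relative_power(&cands[best], deadline_s)) {
            best = i;
        }
    }
    return best;
}
```

In this model, adding ALUs pays off only while the drop in cycle count outweighs the added capacitance, which is the comparison the slide describes.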

RICE UNIVERSITY 38 Conclusions
- We propose a step-by-step methodology to design high performance, power-efficient DSPs based on the TI C64x architecture
- Initial results show benefits in power/performance greater than an order of magnitude over a conventional C64x
- We tailor the design to ensure maximum compatibility with TI's C6x architecture and tools
- We are interested in exploring opportunities at TI for the design and actual fabrication of a chip and associated tool development
- We are interested in feedback
  - Limitations that we have not accounted for
  - Unreasonable assumptions that we have made
Recommended reading:
- S. Rixner et al., A register organization for media processing, HPCA 2000
- B. Khailany et al., Exploring the VLSI scalability of stream processors, HPCA 2003
- U. J. Kapasi et al., Programmable Stream Processors, IEEE Computer, August 2003