RICE UNIVERSITY Implementing the Viterbi algorithm on programmable processors Sridhar Rajagopal Elec 696


Motivation
- Viterbi decoding is one of the major bottlenecks in baseband processing [PHY]
- Different protocols need flexibility in the algorithm parameters (read: "programmable")
- No architecture has yet been developed to meet the real-time requirements of 3G systems
  - Mbps range for wideband CDMA
  - 100 Mbps range for wireless LAN

Today
- Background
- Advanced DSP architectures -- TI C6x [15]
- Viterbi algorithm basics [10]
- Viterbi on TI DSPs [10]
- A programmable processor specifically designed for Viterbi [15]

TI C6x architecture
- VLIW [Very Long Instruction Word] architecture
- Similar to a vector processor, but
  - multiple instructions -> multiple functional units (FUs)
  - FUs are not all the same
- 32-bit architecture
- 8 functional units
[Figure: 4-wide VLIW, with Inst 1 to Inst 4 issued to FU 1 to FU 4]

8 VelociTI principles
- Parallel fetch, decode and execute
- Pipelined enough to make ADD the critical path
- Instructions based on RISC
- Load-store architecture
- Orthogonal instruction set and register file
- Determinism
- Conditional instructions
- Instruction packing

2 * 4 = 8 Functional Units
- .M: Multiplication unit
  - 16 bit x 16 bit signed/# packed/#
- .L: Arithmetic Logic unit
  - Comparisons and logic operations
  - Saturation arithmetic and absolute value
- .S: Shifter unit
  - Bit manipulation (set, get, shift, rotate)
  - Branching, addition and packed addition
- .D: Data unit
  - Load/store to memory
  - Addition and pointer arithmetic

How powerful am I?
- 8 instructions per cycle
- Max:
  - 6 adds per cycle
  - 2 multiplies per cycle
  - 2 load/stores per cycle
  - 2 branches per cycle
- Code has to use instructions in these ratios to get full FU utilization

C6x DSP Core

C6x Datapath

C6x Resource Constraints
- Instructions using the same FU
  - 1 instruction per FU
- Cross paths
  - only 1 operand from the other register file to (.L, .S, .M)
- Loads and stores
  - 2 loads and stores from 2 different register files
- Reads and writes
  - max 4 reads from the same register
  - no 2 writes to the same register :)

Instruction Packing
- Fetch Packet
- Execute Packet
- Avoid NOPs in the instruction code
- Multi-cycle NOPs if absolutely necessary
- LSB ("p" bit) of each instruction is used for packing
[Figure: instructions A to H grouped into execute packets (A || B || C,D || E, F, G || H); 8 instructions instead of 32]
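The p-bit mechanism can be sketched in a few lines. This is an illustration only, not TI's decoder: the function name `split_execute_packets` and the p-bit pattern in the example are made up here, and the sketch assumes that a p-bit of 1 means the next instruction issues in the same cycle, while a p-bit of 0 closes the current execute packet.

```python
def split_execute_packets(fetch_packet):
    """Group (mnemonic, p_bit) pairs of one fetch packet into execute packets.

    Assumption: p_bit == 1 chains the NEXT instruction into the same cycle;
    p_bit == 0 ends the current execute packet.
    """
    packets, current = [], []
    for name, p_bit in fetch_packet:
        current.append(name)
        if p_bit == 0:
            packets.append(current)
            current = []
    if current:
        # Last p-bit was 1: the fetch packet boundary closes the packet here.
        packets.append(current)
    return packets


# Hypothetical fetch packet of 8 instructions with example p-bits.
fetch = [("A", 1), ("B", 1), ("C", 0), ("D", 1),
         ("E", 0), ("F", 0), ("G", 1), ("H", 0)]
packets = split_execute_packets(fetch)
# -> [['A', 'B', 'C'], ['D', 'E'], ['F'], ['G', 'H']]
```

In this example the 8 stored instructions issue as 4 execute packets. A fixed-width 8-slot VLIW word per cycle would need 4 x 8 = 32 slots padded with NOPs, which is one plausible reading of the slide's "8 instructions instead of 32".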

Conditional Instructions
- All instructions can be conditioned on the value in registers A1, A2, B0, B1, B2
- Avoids branch latencies
- If the condition is not met by the end of the first phase of execution, results are not written back to the register file
- Conditional loads/stores are squashed before the data phase

C6x Pipeline
- Fetch (if necessary): 4 phases
  - Address Generate
  - Address Send
  - Access Ready Wait
  - Fetch Packet Receive
- Decode: 2 phases
  - Instruction dispatch (if necessary)
  - Instruction decode
- Execute: 10 phases
  - Most instructions need 1 phase

Some interesting instructions
- Saturation
- Bit-counting -- image coding
- Integer comparison
- Bit manipulation
- Seed generation for reciprocal instructions

Other details
- 64 KB internal program and data
- DMA: peripherals to memory
- Intrinsics in code for better programming
  - similar to using "VIS" in UltraSPARC
- Software pipelining of loops
- PERFORMANCE:
  - 5-10X
  - higher clock -- higher pipeline (2-4X)
  - Additional ALUs

Additional features in C64x
- SIMD support
- Communication-specific instructions
  - interleaving, Galois field multiply
- Bit count and rotate hardware
- bit registers
- Lower resource constraints
- No more NOPs needed ever [no boundaries]

C64x DSP Core


Viterbi Decoding
[Figure: the encoder maps k input bits to n > k coded bits; the decoder maps n coded bits back to k bits. Rate k/n = 1/2 convolutional encoder]

Error Protection
- States = 2^(FFs) = 2^(Constraint Length - 1), where FFs is the number of flip-flops in the encoder
- Not every state-to-state transition is possible
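A small rate 1/2 encoder makes the state count and the restricted transitions concrete. The slides do not name a specific code, so the constraint length K = 3 and the generator polynomials (7, 5 in octal) below are assumptions chosen for illustration, as are the function names.

```python
K = 3                       # constraint length (assumed for this sketch)
G = (0b111, 0b101)          # generator polynomials 7 and 5 octal (assumed)
NUM_STATES = 1 << (K - 1)   # 2^(K-1) = 4 states, one per flip-flop pattern


def encode(bits):
    """Rate 1/2 convolutional encoder: 2 output bits per input bit."""
    state = 0               # contents of the K-1 flip-flops, newest bit on top
    out = []
    for b in bits:
        reg = (b << (K - 1)) | state            # input bit plus flip-flops
        for g in G:
            out.append(bin(reg & g).count("1") & 1)   # parity of tapped bits
        state = reg >> 1                        # shift, dropping the oldest bit
    return out


def successors(state):
    """The only two states reachable from `state` (input bit 0 or 1)."""
    return [((b << (K - 1)) | state) >> 1 for b in (0, 1)]


coded = encode([1, 0, 1, 1])
# -> [1, 1, 1, 0, 0, 0, 0, 1]   (symbol pairs 11 10 00 01)
```

Each state has exactly two successors, for example `successors(0)` gives `[0, 2]`, which is why the trellis is sparse rather than fully connected.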

Trellis for decoding

Trellis for an input sequence

Error detection
- Branch metric = "distance" between the received symbol pair and each possible symbol pair
- Path metric = accumulated error metric
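The two metrics can be illustrated for a single received symbol pair. This is a sketch, not code from the presentation: the function names are made up, and the soft-decision example assumes an antipodal mapping (bit 0 -> +1.0, bit 1 -> -1.0) that the slides do not specify.

```python
def hamming_metric(received, expected):
    """Hard-decision branch metric: XOR the symbol words and count the 1s."""
    return bin(received ^ expected).count("1")


def euclidean_metric(received, expected):
    """Soft-decision branch metric: squared distance between sample tuples."""
    return sum((r - e) ** 2 for r, e in zip(received, expected))


# Hard decision: compare received pair 10 against all four possible pairs.
candidates = [0b00, 0b01, 0b10, 0b11]
hard = [hamming_metric(0b10, c) for c in candidates]
# -> [1, 2, 0, 1]

# Soft decision: noisy samples for the pair 10 under the assumed mapping.
soft = euclidean_metric((-0.8, 0.9), (-1.0, 1.0))   # about 0.05

# Path metric: branch metrics accumulated along one path through the trellis.
path_metric = 0
for branch in (1, 0, 2):        # example branch metrics for three stages
    path_metric += branch
```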

Error-correction

Stages in Viterbi Decoding
- Calculate branch metrics for all states at every stage
- Update path metrics for all states at every stage
- At the end, trace back through the trellis to get the decoded bits

Computations
- Branch metrics:
  - Hamming distance: XOR and count 1's
  - Euclidean distance: squared distance
- Path metrics:
  - Add branch metrics to existing path metrics
  - Compare for minimum and select minimum
- Survivor traceback:
  - Linked list / pointer chasing
  - Memory intensive / sequential operations
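The three computations above fit together as in the following hard-decision decoder. It is a minimal sketch under stated assumptions, not the implementation discussed in the talk: it assumes a K = 3, rate 1/2 code with generators 7 and 5 (octal), an encoder that starts in state 0, and traceback only once at the end of the block. All names are hypothetical.

```python
K = 3                       # constraint length (assumed)
G = (0b111, 0b101)          # generators 7, 5 octal (assumed)
NUM_STATES = 1 << (K - 1)
INF = float("inf")


def branch_output(state, bit):
    """Expected symbol pair and next state when `bit` enters in `state`."""
    reg = (bit << (K - 1)) | state
    return tuple(bin(reg & g).count("1") & 1 for g in G), reg >> 1


def viterbi_decode(symbols):
    """Decode a flat list of hard bits, two per trellis stage."""
    metrics = [0] + [INF] * (NUM_STATES - 1)    # path metrics, start in state 0
    history = []                                # back-pointers, one list per stage
    for i in range(0, len(symbols), 2):
        rx = (symbols[i], symbols[i + 1])
        new_metrics = [INF] * NUM_STATES
        decisions = [None] * NUM_STATES
        for state in range(NUM_STATES):
            if metrics[state] == INF:           # state not reachable yet
                continue
            for bit in (0, 1):
                expected, nxt = branch_output(state, bit)
                # Branch metric: Hamming distance (XOR, then count the 1s).
                bm = (rx[0] ^ expected[0]) + (rx[1] ^ expected[1])
                cand = metrics[state] + bm      # add
                if cand < new_metrics[nxt]:     # compare
                    new_metrics[nxt] = cand     # select
                    decisions[nxt] = (state, bit)
        metrics = new_metrics
        history.append(decisions)
    # Survivor traceback: chase back-pointers from the best final state.
    state = min(range(NUM_STATES), key=lambda s: metrics[s])
    bits = []
    for decisions in reversed(history):
        state, bit = decisions[state]
        bits.append(bit)
    return bits[::-1]


# Message 1 0 1 1 plus two zero tail bits encodes to 11 10 00 01 01 11
# under the assumed code; the third received bit is flipped here.
received = [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
decoded = viterbi_decode(received)
# -> [1, 0, 1, 1, 0, 0]
```

The inner loop is the add-compare-select work that dominates the arithmetic, while the traceback loop is purely sequential pointer chasing through `history`, which matches the "memory intensive / sequential" remark on the slide.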


Viterbi support in different processors
- C54x
  - Special hardware accelerator
  - ACS unit with 2 ACC and split ALU
  - Viterbi butterfly (2 ACS) in 4 cycles
- C62x
  - nothing special
- C6416
  - Viterbi coprocessor
  - K = 5-9, Rate = 1/2, 1/3, 1/4

Viterbi Coprocessor in C6416
- SM, SD and HD memory not accessible to DSP


Need for VSP architecture
- Large amount of memory access
  - Traceback decoding
- Not efficient on a GPP
  - Program instructions on a GPP are of a higher order than the complexity of the algorithm

VSP architecture

Branch Metric Calculation

Path Metric Calculation

Traceback Unit

Traceback with survivor updates
[Figure labels: Start Filling the Trellis; Start Traceback; 5*Constraint Length; Symbol Decoded; Update Survivor Path for most recent symbol]

Survivor Path Updates

Circular updates

Software Programming
- Small but specialized instruction set
  - LOAD, ACS
- Shorter execution time
- All 3 subprocessors programmed independently
- 10 ns cycle (100 MHz) in 1990 to get 1.5 Mbps
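The last figure implies a cycle budget per decoded bit, worked out below. This is back-of-the-envelope arithmetic only: it assumes the 1.5 Mbps is decoded throughput at the 100 MHz clock, and that the same cycles-per-bit cost would carry over unchanged to a 100 Mbps target, which a real design would not be bound by.

```python
clock_hz = 100e6          # 10 ns cycle time, from the slide
throughput_bps = 1.5e6    # decoded bit rate, from the slide

# Cycles the processor spends per decoded bit: about 66.7.
cycles_per_bit = clock_hz / throughput_bps

# Clock needed for a 100 Mbps target at the SAME cycles per bit (assumption):
# about 6.7 GHz.
target_bps = 100e6
required_clock_hz = cycles_per_bit * target_bps
```

Under these assumptions the gap to 100 Mbps is a factor of roughly 67, which has to come from more parallelism per cycle rather than from clock rate alone.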

Conclusions
- The Viterbi algorithm is important to implement in a programmable communication receiver
- Approaches so far have been co-processor support for DSPs or specialized processors
- We are yet to design programmable processors that meet real-time requirements for 100 Mbps applications