
Design of a High-Speed Asynchronous Turbo Decoder Pankaj Golani, George Dimou, Mallika Prakash and Peter A. Beerel Asynchronous CAD/VLSI Group Ming Hsieh Electrical Engineering Department University of Southern California ASYNC 2007 – Berkeley, California, March 12th, 2007

Motivation and Goal
- Mainstream acceptance of asynchronous design requires:
  - leveraging the ASIC standard-cell library-based design flow
  - achieving benefits significant enough to overcome synchronous momentum
- Our research goal for async designs:
  - a high-speed standard-cell flow
  - applications where async designs yield significant improvements in throughput, throughput per area, and energy efficiency

Single Track Full Buffer (Ferretti '02)
- Follows a 2-phase protocol
- High-performance standard-cell circuit family
- Comparison to synchronous standard cells:
  - 4.5x better latency
  - 1+ GHz in 0.18µm, roughly 2.4x faster than synchronous
  - 2.8x more area
[Figure: buffer stage schematic showing the forward path, reset path, and 1-of-N channel]

Block Processing – Pipelining and Parallelism
- Analogy: K cases handled by M people (pipelines), each with latency l; let c be a person's cycle time
- The first M cases arrive at t = l; subsequent groups of M cases arrive every c time units
- Consider two scenarios:
  - baseline: cycle time C1, latency L1
  - improved: cycle time C2 = C1/2.4, latency L2 = L1/4.5
- Questions: How does cycle time affect throughput? How does latency affect throughput?

Block Processing – Combined Cycle Time and Latency Effect
- Large K: the throughput ratio approaches the cycle time ratio
- Small K: the throughput ratio approaches the latency ratio
[Plot: throughput vs. number of cases (K) for the baseline and improved designs]
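The trend on these two slides can be reproduced with a small model of the block-processing timing described above. The 2.4x cycle-time and 4.5x latency factors come from the slides; the baseline numbers and pipeline count below are illustrative placeholders.

```python
import math

def block_throughput(K, M, latency, cycle):
    """Throughput of M parallel pipelines processing a block of K items:
    the first M results arrive at t = latency, and each subsequent group
    of M results arrives every `cycle` time units."""
    groups = math.ceil(K / M)
    return K / (latency + (groups - 1) * cycle)

# Baseline vs. improved design (cycle time / 2.4, latency / 4.5, per the slide)
C1, L1 = 1.0, 20.0            # illustrative baseline cycle time and latency
C2, L2 = C1 / 2.4, L1 / 4.5
for K in (8, 64, 512, 4096):
    ratio = block_throughput(K, 8, L2, C2) / block_throughput(K, 8, L1, C1)
    print(f"K = {K:5d}  improved/baseline throughput ratio = {ratio:.2f}")
# Small K: the ratio tracks the 4.5x latency improvement;
# large K: it approaches the 2.4x cycle-time improvement.
```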

Talk Outline Turbo coding and decoding – an introduction Tree soft-input soft-output (SISO) decoder Synchronous turbo decoder Asynchronous turbo decoder Comparisons and conclusions

Turbo Coding – Introduction
- Error-correcting codes add redundancy: the input data is K bits, the output code word is N bits (N > K), and the code rate is r = K/N
- Types of codes: linear codes, convolutional codes (CC), turbo codes
[Figure: encoder block diagram, K input bits in, N coded bits out]

Turbo Encoding – Introduction
- Berrou, Glavieux and Thitimajshima (1993); performance close to the Shannon channel capacity
- Typically uses two convolutional codes and an interleaver
- The interleaver improves error correction: it increases the minimum distance of the code and creates a large block code
[Figure: turbo encoder block diagram with outer CC, interleaver, and inner CC]

Turbo Decoding
- Turbo decoder components:
  - two soft-in soft-out (SISO) decoders, one for the inner CC and one for the outer CC
    - soft input: a priori estimates of the input data
    - soft output: a posteriori estimates of the input data
    - the SISO is often based on the Min-Sum formulation
  - interleaver / de-interleaver: maps SISO outputs to SISO inputs using the same permutation as in the encoder
- The iterative nature of the algorithm leads to block processing: one SISO must finish before the next SISO starts
[Figure: turbo decoder block diagram with received-data memory, inner SISO, de-interleaver/interleaver, and outer SISO]
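A minimal sketch of the iteration schedule this slide describes, with the SISO internals abstracted away; all function names are illustrative placeholders, not the authors' implementation.

```python
def turbo_decode(channel_llrs, interleave, deinterleave,
                 siso_inner, siso_outer, iterations=6):
    """Iterative turbo decoding skeleton: the two SISOs run alternately and
    exchange soft information through the interleaver / de-interleaver, so
    one SISO must finish the whole block before the next one starts."""
    K = len(channel_llrs)
    apriori_inner = [0.0] * K                        # no prior information at the start
    decisions = [0] * K
    for _ in range(iterations):                      # e.g. 6 iterations, as in the baseline design
        inner_out = siso_inner(channel_llrs, apriori_inner)    # soft outputs of the inner SISO
        outer_out = siso_outer(deinterleave(inner_out))        # soft outputs of the outer SISO
        apriori_inner = interleave(outer_out)                  # a priori input for the next pass
        decisions = [1 if llr > 0 else 0 for llr in outer_out] # hard decisions so far
    return decisions
```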

The Decoding Problem
- Requires finding paths in a graph called a trellis:
  - node: state j of the encoder at time index k
  - edge: represents receiving a 0 or 1 in the node for state j at time k
  - path: represents a possible decoded sequence; the algorithm finds multiple paths
[Figure: example trellis for a 2-state encoder encoding K bits, time indices t = 0 … K, edges labeled "sent bit is 1" / "sent bit is 0", with one decoded sequence highlighted]

Min-Sum SISO Problem Formulation
- Branch and path metrics:
  - branch metric (BM): indicates the difference between expected and received values
  - path metric: sum of the associated branch metrics
- Min-Sum formulation: for each time index k, find
  - the minimum path metric over all paths for which bit k = 1
  - the minimum path metric over all paths for which bit k = 0
[Figure: example trellis from t = 0 to t = K with the minimum-metric paths for bit k = 1 and bit k = 0 annotated]
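Written out, the quantities on this slide take the following standard min-sum form, where PM(p) is the path metric of path p and b_k(p) is the value of bit k along that path (the notation is mine, consistent with the slide's description):

```latex
M_k(b) = \min_{p \,:\, b_k(p) = b} \mathrm{PM}(p),
\qquad
\mathrm{PM}(p) = \sum_{k'} \mathrm{BM}_{k'}(p),
\qquad b \in \{0, 1\}
```

The soft output for bit k is then typically taken as the difference M_k(0) - M_k(1), which plays the role of a log-likelihood ratio in the min-sum approximation.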

Talk Outline Turbo coding and decoding – an introduction Tree SISO low-latency turbo decoder architecture Synchronous turbo decoder Asynchronous turbo decoder Comparisons and conclusions

Conventional SISO – O(K) latency
- Calculation of the minimum path metric can be divided into two phases:
  - forward state metric for time k and state j: F_k(j) = min over predecessor states i of [ F_{k-1}(i) + BM_k(i → j) ]
  - backward state metric for time k and state j: B_k(j) = min over successor states i of [ B_{k+1}(i) + BM_{k+1}(j → i) ]
- The data-dependency loop prevents pipelining: the cycle time is limited to the latency of a 2-way add-compare-select (ACS)
- Latency is O(K)
[Figure: trellis segment around times t = k-1, k, k+1, edges labeled "received bit is 1" / "received bit is 0"]
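A minimal sketch of this forward/backward min-sum recursion over a small trellis, assuming a generic branch-metric table; this is illustrative Python, not the decoder's hardware.

```python
INF = float("inf")

def min_sum_siso(branch_metrics, num_states=2):
    """Forward/backward min-sum over a trellis.
    branch_metrics[k] maps (i, j, b) -> metric of the edge from state i at
    time k to state j at time k+1 whose data bit is b (0 or 1).
    Returns, for each k, the pair (best path metric with bit k = 0,
    best path metric with bit k = 1)."""
    K = len(branch_metrics)
    # Forward recursion: F[k+1][j] = min over incoming edges of F[k][i] + BM
    F = [[INF] * num_states for _ in range(K + 1)]
    F[0][0] = 0.0                                    # assume the encoder starts in state 0
    for k in range(K):
        for (i, j, b), bm in branch_metrics[k].items():
            F[k + 1][j] = min(F[k + 1][j], F[k][i] + bm)
    # Backward recursion: B[k][i] = min over outgoing edges of BM + B[k+1][j]
    B = [[INF] * num_states for _ in range(K + 1)]
    B[K] = [0.0] * num_states                        # assume any ending state is allowed
    for k in range(K - 1, -1, -1):
        for (i, j, b), bm in branch_metrics[k].items():
            B[k][i] = min(B[k][i], bm + B[k + 1][j])
    # Combine: best complete path through each edge, separated by the data bit
    soft = []
    for k in range(K):
        best = [INF, INF]
        for (i, j, b), bm in branch_metrics[k].items():
            best[b] = min(best[b], F[k][i] + bm + B[k + 1][j])
        soft.append((best[0], best[1]))
    return soft
```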

Tree SISO – low-latency architecture
- Tree SISO (Beerel/Chugg, JSAC '01)
- Calculates branch metrics for larger and larger segments of the trellis, analogous to building group-wise PG (propagate/generate) logic in tree adders
- The tree SISO can process the entire trellis in parallel
- No data-dependency loops, so finer pipelining is possible
- Latency is O(log K)
[Figure: tree-style combination of trellis segments over time indices t = 0 … 4]
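One way to picture the segment combination is as a min-plus product of per-segment state-to-state metric matrices, combined pairwise in a tree; this sketch only illustrates that idea and is not the paper's architecture.

```python
INF = float("inf")

def min_plus(A, B):
    """Min-plus 'product' of two segment metric matrices: C[i][j] is the best
    metric from state i at the start of the combined segment to state j at its
    end, minimizing over the state at the boundary between the two halves."""
    n = len(A)
    return [[min(A[i][m] + B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def combine_tree(segments):
    """Combine per-step metric matrices pairwise, tree style, giving
    O(log K) combining levels instead of the O(K) serial ACS recursion."""
    while len(segments) > 1:
        nxt = [min_plus(a, b) for a, b in zip(segments[0::2], segments[1::2])]
        if len(segments) % 2:            # an odd leftover passes through to the next level
            nxt.append(segments[-1])
        segments = nxt
    return segments[0]
```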

Remainder of Talk Outline Turbo Coding – an introduction Turbo Decoding Tree SISO low-latency turbo decoder architecture Synchronous turbo decoder Asynchronous turbo decoder Comparisons and conclusions

Synchronous Baseline Turbo Decoder
- IBM 0.18µm process, Artisan standard-cell library
- SCCC code with a rate of 1/2; 6 decoding iterations
- Gate-level pipelined to achieve high throughput
- Timing-driven place-and-route: peak frequency of 475 MHz, SISO area of 2.46 mm²
- To achieve high throughput, multiple SISO blocks are instantiated

Asynchronous Turbo Decoder
- Static Single-Track Full Buffer standard-cell library (Golani '06)
  - a total of (only) 14 cells in the IBM 0.18µm process
  - extensive SPICE simulations performed to optimize the trade-off between performance and robustness
- Chip design: standard ASIC place-and-route flow (congestion-based) plus an ECO optimization flow
- Chip-level simulation: performed on a critical sub-block (55K transistors); verified the timing constraints; measured latency and throughput using Synopsys Nanosim

Static Single Track Full Buffer (Ferretti '01)
- Statically driving the channel wire improves the noise margin
- Sender and receiver exchange 1-of-N data over an SST channel using the 1-of-N static single-track protocol
[Figure: SSTFB transistor-level schematic with keeper; channel wire annotated "drives high / holds low" on one side and "drives low / holds high" on the other]

Asynchronous Implementation Challenges - I
- Throughput degradation from unbalanced fork and join structures
  - the token on the short branch is stalled due to the imbalance, slowing down the whole fork/join
- Slack matching: an additional pipeline buffer on the short branch improves throughput
  - identify fork/join bottlenecks and resolve them by adding buffers
  - after P&R, long wires can create the same problem; this is solved by adding buffers on the long wires using the ECO flow
[Figure: fork/join structure before and after slack matching]

Asynchronous Implementation Challenges - II
- SSTFB implements only point-to-point communication
- Option 1: use dedicated fork cells
  - creates another pipeline stage, so slack-matching buffers are needed on the other paths
- Option 2: integrate the fork within the full adder
  - 45% less area than a separate full adder and fork
  - decreases the number of slack-matching buffers required
[Figure: adder array with a full adder plus separate fork vs. a full adder with integrated fork]

Asynchronous Implementation Challenges – III
- 60% of the design consists of slack-matching buffers, and most of these buffers occur in linear chains
- To save area and power, two new cells were created: SLACK2 and SLACK4
  - 17% area and 10% power improvement for SLACK2
  - 30% area and 19% power improvement for SLACK4
[Figure: a chain of single buffers replaced by SLACK2 and SLACK4 cells]

Remainder of Talk Outline Turbo Coding – an introduction Turbo Decoding Tree SISO low-latency turbo decoder architecture Synchronous turbo decoder Asynchronous turbo decoder Comparisons and conclusions

Comparisons
- Synchronous: peak frequency of 475 MHz, logic area of 2.46 mm²
- Asynchronous: peak frequency of 1.15 GHz, logic area of 6.92 mm²
- Design time: synchronous ~4 graduate-student months; asynchronous ~12 graduate-student months

Synch vs. Async
- M pipelined 8-bit tree SISOs, each with latency l; let c be the sync clock cycle time (475 MHz)
- The first M bits arrive at t = l; subsequent groups of M bits arrive every c time units
- Two implementations:
  - synch: cycle time C1 and latency L1
  - async: cycle time C2 = C1/2.4 and latency L2 = L1/4.5
- Desired comparisons: throughput vs. block size, energy vs. block size
[Figure: received-data memory and interleaver/de-interleaver feeding the SISOs with K-bit blocks]

Comparisons – Throughput / Area
- For small block sizes, asynchronous provides better throughput/area
- As the block size increases, the two implementations become comparable
- For block sizes of 512 bits, synchronous cannot achieve the async throughput
[Plot: throughput/area vs. block size for the async design and synchronous designs with several values of M]
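A rough model of the throughput-per-area trend on this slide, reusing the block-processing timing from earlier. Only the 2.4x cycle-time factor, the 4.5x latency factor, and the 2.46 / 6.92 mm² logic areas come from the slides; the absolute cycle time, latency, and block sizes below are illustrative placeholders.

```python
def throughput(K, latency, cycle):
    """One SISO producing one output per cycle after an initial latency."""
    return K / (latency + (K - 1) * cycle)

C1, L1 = 1.0, 50.0                       # illustrative sync cycle time and latency
C2, L2 = C1 / 2.4, L1 / 4.5              # async factors, per the slides
AREA_SYNC, AREA_ASYNC = 2.46, 6.92       # logic areas in mm^2, per the slides

for K in (16, 64, 256, 1024):
    sync_tpa = throughput(K, L1, C1) / AREA_SYNC
    async_tpa = throughput(K, L2, C2) / AREA_ASYNC
    print(f"K = {K:5d}  async/sync throughput-per-area = {async_tpa / sync_tpa:.2f}")
# Small K: the 4.5x latency advantage outweighs the 2.8x area cost;
# large K: the ratio settles near 2.4 / 2.8, i.e. the two become comparable.
```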

Comparisons – Energy/Block
- For equivalent throughputs and small block sizes, asynchronous is more energy efficient than synchronous
- The async advantages grow with a larger async library (e.g., with BUF1of4)

Conclusions
- Asynchronous turbo decoder vs. synchronous baseline: the static STFB design offers significant improvements for small block sizes
  - more than 2x throughput/area
  - higher peak throughput (~500 Mbps)
  - more energy efficient
  - well-suited for low-latency applications (e.g., voice)
- High-performance async is advantageous for applications that require
  - high performance (e.g., pipelining)
  - low latency
  - block processing for which parallelism has diminishing returns, since the synchronous design requires extensive parallelism to achieve equivalent throughput

Future Work
- Library design: a larger library with more than one size per cell; 1-of-4 encoding
- Async CAD: automated slack matching; static timing analysis

Questions?