Para-CORDIC: Parallel CORDIC Rotation Algorithm and Architecture

(IEEE T-CAS I, Vol. 51, No. 8, pp , Aug. 2004)
Tso-Bing Juang, Ph.D
VLSI Design LAB, Dept. CSE, NSYSU

My Research – Computer Arithmetic
Applications of arithmetic components:
  DSP (Digital Signal Processing)
  3-D graphics
  Computer communications, etc.
Topics of arithmetic [Ercegovac 2004]:
  Addition/Subtraction
  Multiplication/Division
  Floating-point operations
  CORDIC (COordinate Rotation DIgital Computer)

International Conference
My Publications ( ) Topics SCI Journal International Conference Domestic Conference CORDIC 3 Multiplier 2 4 1 DCT

Academic Honors
Best thesis award, Xerox Co. Ltd, 1995
Attended the Midwest Symposium on Circuits and Systems (MWSCAS), supported by NSC, 1999
First prize award of FPGA, National Intellectual Property Contest. FPGA, 2000
First prize award of Full Custom Design Contest, 2001
Attended the Asia-Pacific Conference on Circuits and Systems (APCCAS), supported by MOE, 2002
2005 Marquis, Who's Who in Science and Engineering, Edition
2006 Marquis, Who's Who in the World

Outline
1. Basic Concept of CORDIC
2. Bottleneck of CORDIC Rotation
3. Proposed Methods
4. Previous Methods
5. Comparisons
6. Applications
7. Conclusions

1. Basic Concept of CORDIC

What is CORDIC?
CORDIC (COordinate Rotation DIgital Computer)
Rotate vector (1, 0) by φ to get (cos φ, sin φ)
Can evaluate many arithmetic functions
Rotation realized by shift-add operations
Convergence method (iterative)
About n iterations for n-bit accuracy
Methods that we have seen so far:
  Table lookup: too much area
  Polynomial approx: too many multiply/adds

Conventional CORDIC Rotation
In each iteration, x and y perform one micro-rotation whose direction is set by the sign of z

CORDIC Functions

Pre-computation of $\tan(\alpha_i)$
Find $\alpha_i$ such that $\tan(\alpha_i) = 2^{-i}$ (or, $\alpha_i = \tan^{-1}(2^{-i})$)
Possible to write any angle $\phi = \pm\alpha_0 \pm \alpha_1 \pm \dots \pm \alpha_n$ as long as $-99.7^\circ \le \phi \le 99.7^\circ$ (which covers -90..90)
Let's say we want to rotate (x, y) by only 7.1 degrees. How do we rewrite the equations in the box on the previous slide?
What if we want to rotate by 10.7 degrees? (note: 10.7= ). What is the exact sequence of operations you perform?
What if 8.9 degrees ( )?
So, for example, to get 90 degrees, you have to add almost all the angles. For -90, subtract almost all of them
Do we have to use ALL the $\alpha_i$'s? (e.g., if you want to evaluate 71.6 degrees, adding the first two angles will do it) Important question. We will get back to that later
Why is it good news that we can cover -90..90?
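The angle table described above can be sketched in a few lines of Python. This is only an illustration: the function name `cordic_angle_table` and the choice of ten entries are assumptions, not taken from the slides. Summing the ten angles after rounding each to one decimal place gives 99.7 degrees, which appears consistent with the range quoted on the slide.

```python
import math

def cordic_angle_table(n):
    """Return [atan(2**-i) in degrees] for i = 0 .. n-1 (hypothetical helper)."""
    return [math.degrees(math.atan(2.0 ** -i)) for i in range(n)]

angles = cordic_angle_table(10)
for i, a in enumerate(angles):
    print(f"i={i}  alpha_i = {a:.1f} deg")

# Largest reachable rotation when every angle is added,
# using the one-decimal values as a printed table would show them.
rounded_sum = sum(round(a, 1) for a in angles)
print(f"sum of rounded angles = {rounded_sum:.1f} deg")

# Adding only the first two angles: 45.0 + 26.6 = 71.6 deg.
print(f"alpha_0 + alpha_1 = {angles[0] + angles[1]:.1f} deg")
```

The table only needs to be computed once, since the angles do not depend on the input.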

Conventional CORDIC Rotation
Algorithm: (z is the current angle) "At each step, try to make z approach zero"
Initialize $x_0 = K =$ , $y_0 = 0$, $z_0 = \phi$
For i = 0 to n
  $\sigma_i = 1$ when $z_i \ge 0$, else $-1$ [i.e., $\sigma_i = \mathrm{sign}(z_i)$]
  $x_{i+1} = x_i - \sigma_i 2^{-i} y_i$
  $y_{i+1} = y_i + \sigma_i 2^{-i} x_i$
  $z_{i+1} = z_i - \sigma_i \alpha_i$
End For
Result: $x_{n+1} = \cos(\phi)$, $y_{n+1} = \sin(\phi)$
Precision: n bits
This is the basic CORDIC iteration
Note: $x_i$, $y_i$ shown on the right are misleading. The real axis is rotated by -φ degrees
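A minimal floating-point sketch of the iteration above follows. It is not the hardware design: multiplications by `2.0 ** -i` stand in for the shifts, and the function name `cordic_rotate` and the default `n=32` are assumptions made for illustration. The constant K is computed here as the product of cos(α_i), which is the standard CORDIC scaling constant.

```python
import math

def cordic_rotate(phi, n=32):
    """Approximate (cos(phi), sin(phi)) for phi in radians with steps i = 0..n."""
    alphas = [math.atan(2.0 ** -i) for i in range(n + 1)]

    # Scaling constant: K = prod cos(alpha_i) = prod 1/sqrt(1 + 2**(-2i)).
    K = 1.0
    for i in range(n + 1):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y, z = K, 0.0, phi
    for i in range(n + 1):
        sigma = 1 if z >= 0 else -1          # direction from the sign of z
        # Both updates use the old x and y (tuple assignment).
        x, y = x - sigma * y * 2.0 ** -i, y + sigma * x * 2.0 ** -i
        z -= sigma * alphas[i]               # drive the residual angle to zero
    return x, y

c, s = cordic_rotate(math.radians(30.0))
print(c, s)
```

The sketch only converges for angles inside the range covered by the sum of the α_i, as noted on the previous slide.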

Example ($z_0 = \phi = 30^\circ =$ )
The angles come from the table in the previous slide
The figure shows only a few steps
Note that the sign does not always alternate between + and - (we have two consecutive +'s)
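The sign sequence for the 30-degree example can be traced with a short sketch, working in degrees. The helper name `rotation_directions` and the choice of eight steps are assumptions for illustration.

```python
import math

def rotation_directions(phi_deg, steps):
    """Return the list of rotation signs and the residual angle in degrees."""
    z = phi_deg
    signs = []
    for i in range(steps):
        sigma = 1 if z >= 0 else -1
        signs.append(sigma)
        z -= sigma * math.degrees(math.atan(2.0 ** -i))
    return signs, z

signs, residual = rotation_directions(30.0, 8)
print(signs)  # [1, -1, 1, -1, 1, 1, -1, 1]
```

Steps 4 and 5 are both positive, which is the pair of consecutive +'s mentioned above.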

CORDIC Hardware
What does each of the adders do?
What is the table lookup for?
Not shown: logic for determining the rotation direction $d_i$

Three Important Factors of CORDIC
Large additions/subtractions
Scaling factor (constant vs. non-constant)
Sequential execution

Research Topics about CORDIC
Redundant CORDIC architecture
Error analysis of CORDIC
Application of CORDIC architectures
CORDIC algorithm with non-constant scaling factors
Parallel CORDIC architecture

2. Bottleneck of CORDIC Rotation

Conventional CORDIC Rotation (Revisited)
Sequential determination of $\sigma_i$ based on $z_i$

Sequential CORDIC Rotation Architecture
The actual speed bottleneck lies in the sequential determination of the value of $\sigma_i$

3. Proposed Methods

How to parallelize?
Using each bit of input angle to determine $\sigma_i$
Remove the bottleneck (B: bit accuracy)
  In the first m-1 iterations: sequential
  In other iterations: parallel
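One way to see why only the early iterations resist parallelization: for small arguments, $\tan^{-1}(2^{-i}) \approx 2^{-i}$, so the bit weights of the input angle can serve directly as micro-rotation angles. The sketch below finds the first index where the difference drops below $2^{-B}$. Treating that index as the boundary m is an assumption made here for illustration, since the slide does not define m, and the function name is hypothetical.

```python
import math

def first_parallel_index(B):
    """Smallest i with 2**-i - atan(2**-i) < 2**-B (assumed boundary criterion)."""
    i = 0
    while 2.0 ** -i - math.atan(2.0 ** -i) >= 2.0 ** -B:
        i += 1
    return i

print(first_parallel_index(24))  # 8
```

Under this criterion, for B = 24 the iterations with index below 8 are the ones that need extra treatment.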

Our Proposed Techniques
MAR (Micro-rotation to Angle Recoding)
  Express each $2^{-i}$, i = 1 to m-1, as a combination of $\tan^{-1}$ terms
BBR (Binary to Bipolar Recoding)
  Obtain the polarity {-1, +1} of each binary {1, 0} weight of the input angle (hardware free)
For example, B=24
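The BBR idea can be illustrated with the identity $b \cdot 2^{-i} = 2^{-(i+1)} + (2b - 1) \cdot 2^{-(i+1)}$, which turns each bit b in {0, 1} into a polarity r = 2b - 1 in {-1, +1} plus a constant offset. The sketch below checks that identity exhaustively with exact fractions. It assumes this is the form of recoding intended; the slide does not spell it out, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def bbr(bits):
    """bits[k] is the weight of 2**-(k+1). Return (polarities, constant offset)."""
    polarities = [2 * b - 1 for b in bits]            # 1 -> +1, 0 -> -1
    offset = sum(Fraction(1, 2 ** (i + 1)) for i in range(1, len(bits) + 1))
    return polarities, offset

def binary_value(bits):
    return sum(b * Fraction(1, 2 ** i) for i, b in enumerate(bits, start=1))

def bipolar_value(polarities, offset):
    return offset + sum(r * Fraction(1, 2 ** (i + 1))
                        for i, r in enumerate(polarities, start=1))

# Check every 8-bit angle pattern.
for bits in product((0, 1), repeat=8):
    r, off = bbr(bits)
    assert binary_value(bits) == bipolar_value(r, off)
print("recoding preserves the angle for all 8-bit patterns")
```

Because each polarity is just the corresponding bit reinterpreted, no logic is needed to produce it, which matches the "hardware free" remark; the constant offset does not depend on the input.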

Example (B=24)
Three extra micro-rotation stages are required
Phase 1

Architecture of a 24-b CORDIC-based SIN/COS Generator

Algorithm of MAR

Our MAR Results

Para-CORDIC Architecture -1/2

Para-CORDIC Architecture -2/2
[Figure: Para-CORDIC architecture, with labels S(1), S(5), S(8), σ1, R(1), R(i)]

Carry-save Adder-Based Realization for Micro-Rotation Stages
A 4:2 compressor is exploited to produce the carry save form (a sum and a carry)
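The carry-save form can be sketched on non-negative Python integers. Here the 4:2 compressor is modeled as two cascaded 3:2 carry-save adders, which is one common realization and an assumption about the actual cell used; the function names are hypothetical.

```python
def csa(a, b, c):
    """3:2 carry-save adder: return (sum, carry) with a + b + c == sum + carry."""
    s = a ^ b ^ c                                   # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1      # majority bits, shifted left
    return s, carry

def compress_4_to_2(a, b, c, d):
    """Reduce four operands to a (sum, carry) pair using two 3:2 stages."""
    s1, c1 = csa(a, b, c)
    return csa(s1, c1, d)

s, c = compress_4_to_2(13, 7, 9, 4)
print(s, c, s + c)  # 29 4 33
```

No carry propagates across bit positions inside either stage, so the delay does not grow with the word length; a single carry-propagate addition is only needed when the final value is converted out of carry-save form.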

Evaluation of the Z Datapath
Delay is: Area is:

The delay of Z Datapath

Merged Rotations of the Second Half Iterations
Delay savings

4. Previous Methods

Comments on Previously Proposed CORDIC Rotation Methods - 1/4
[Wang 1997]: IEEE T-Computers
The first m-1 iterations are sequential
Area saving

Comments on Previously Proposed CORDIC Rotation Methods - 2/4
[Phatak 1998]: IEEE T-Computers
Double hardware to perform clockwise/counterclockwise rotations
Area cost is high (signed-digit realization of X/Y/Z iterations)

[Kwak 2000]: Proc. MWSCAS
Complicated logic circuits to generate the first m-1 rotation directions

[Kuhlmann 2002]: EURASIP
Using ROM to generate the first m-1 directions

Our Proposed Para-CORDIC
The delay and the area costs of Para-CORDIC are: and

5. Comparisons

Latency Comparisons

Area Comparisons

6. Applications

ROM-based Implementations for sine/cosine generation
When $x_1$ and $y_1$ are constant ($x_1 = K$, $y_1 = 0$, $x_{B+1} = \cos(\phi)$, $y_{B+1} = \sin(\phi)$)
Can reduce the extra micro-rotation stages

Optimal Number of ROM Entries

7. Conclusions

Summary
Parallel CORDIC rotation (Para-CORDIC)
  Better latency/area
  Improve the original sequential execution of CORDIC rotation
Complete proof of the proposed theorems
Submission information
  2003/7/11 submitted
  2004/4/21 fully accepted
  2004/ published

Future Work
Physical implementation of Para-CORDIC
  Dealing with negative numbers when performing carry-save addition
  Floating-point representation of data
Reduced micro-rotation stages in MAR
Parallel CORDIC Vectoring Methods
  Must deal with two concurrent variables

Low-Error Fixed-Width Carry-Free Multipliers Design
(To appear in IEEE T-CAS II, 2005)

Definition
An n × n fixed-width multiplier
  Has n most significant product bits
  Needs a small compensation circuit to generate error compensation value (ECV)
ECV
  Constant: fixed; simple implementation, large errors
  Adaptive: variable; complex implementation, lower errors
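The definition can be made concrete with a small sketch. Note that it uses a plain unsigned array multiplier rather than the Booth-encoded carry-free design of the talk, and a constant ECV passed in as a parameter; both are simplifying assumptions, and all function names are hypothetical.

```python
def exact_high(a, b, n):
    """Upper n bits of the full 2n-bit product (post-truncation reference)."""
    return (a * b) >> n

def fixed_width_high(a, b, n, ecv=0):
    """Keep only partial-product bits in columns >= n, then add a constant ECV."""
    kept = 0
    for i in range(n):
        for j in range(n):
            if i + j >= n and (a >> i) & 1 and (b >> j) & 1:
                kept += 1 << (i + j)
    return (kept >> n) + ecv

def mean_abs_error(n, ecv):
    """Average |exact - fixed-width| over all unsigned n-bit operand pairs."""
    total = 0
    for a in range(1 << n):
        for b in range(1 << n):
            total += abs(exact_high(a, b, n) - fixed_width_high(a, b, n, ecv))
    return total / (1 << (2 * n))

for ecv in (0, 1):
    print(f"n=4, constant ECV={ecv}: mean abs error = {mean_abs_error(4, ecv):.3f}")
```

Without compensation the fixed-width result never exceeds the exact one, so the error is one-sided. That is the bias a compensation value is meant to offset.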

An 8 × 8 Carry-Free Fixed-Width Multiplier using Modified Booth Encoding (MBE)
LPminor = others in truncated parts
Mpost = truncates the bit after multiplication

Direct Implementation – Mdirect (only considers LPmajor)
The ECV is for n-bit accuracy
RFA/RHA: Redundant Full/Half Adders

The Concept of Our Derivation of Compensation Circuits
Using the basic definition of MBE to obtain the probability that each partial product digit equals 1, -1, or 0
  Previous works: assume the same probability for each partial product
Using statistical analysis to derive the relationship between LPminor and LPmajor
  Previous works: only make use of LPmajor
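As background for the probability argument, the radix-4 modified Booth digit is $d = -2 b_{2k+1} + b_{2k} + b_{2k-1}$. The sketch below enumerates the eight bit triplets to show that the digits are not equally likely. It covers only the Booth digits themselves under the assumption of uniformly random bits, not the per-digit partial product probabilities derived in the talk, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

def booth_digit(b_hi, b_mid, b_lo):
    """Radix-4 modified Booth digit from three overlapping multiplier bits."""
    return -2 * b_hi + b_mid + b_lo

def booth_digits(y, n):
    """Booth digits (least significant first) of an n-bit two's complement y, n even."""
    bits = [(y >> k) & 1 for k in range(n)]
    digits, prev = [], 0
    for k in range(0, n, 2):
        digits.append(booth_digit(bits[k + 1], bits[k], prev))
        prev = bits[k + 1]
    return digits

counts = Counter(booth_digit(*t) for t in product((0, 1), repeat=3))
print(sorted(counts.items()))  # [(-2, 1), (-1, 2), (0, 2), (1, 2), (2, 1)]

# The digits reconstruct the signed value: y == sum(d * 4**k).
assert all(sum(d * 4 ** k for k, d in enumerate(booth_digits(y, 8))) == y
           for y in range(-128, 128))
```

Out of eight triplets, the digits ±2 each occur once while 0 and ±1 each occur twice, so treating all digits as equally likely would misestimate the expected value of the truncated part.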

Derivation Process

Derivation of Compensation Value and Circuit

Probability of the Partial Product Digits After MBE

The expected value can be derived by considering three conditions when (1)

(2)

(3)

Combining (1)(2)(3), Using similar methods, we have

Our Proposed Low-Error Carry-Free Fixed-Width Multipliers
Half of partial products are reduced in the compensation circuit, LPmajor only

Previously Proposed Fixed-Width Multipliers
All are binary representations
[Kidambi 1996]: the ECV is a pre-determined constant
[Jou 1999]: uses LPmajor to generate the ECV
[Van 2000]: program-based exhaustive search method to obtain the ECV
[Jou 2000]: MBE, similar to the direct implementation
[Cho 2004]: LPmajor and LPminor are required to calculate the ECV

Comparisons of Previous Methods

Absolute Average Error Analysis and Variance Analysis

Area ratios of three kinds of BSD fixed-width multipliers

Quality Analysis of Fixed-Width Multiplications in JPEG Image Compressions

Summary
Our proposed fixed-width multipliers
  Lower average errors and variances
  Low-cost compensation circuits
  Can be applied to high-speed DSP applications

Future Research Topics
Chip implementation of proposed CORDIC and fixed-width multipliers
Low-power RNS multiplier design
Automatic datapath synthesizer for embedded systems
Design and analysis of high-speed dividers using proposed multipliers

Thank you very much, I love Dept. of IECS at Feng Chia!
