Datapath Designs CK Cheng CSE Department UC, San Diego.

Prefix Adder – Well-known and Well-developed? Classic prefix networks: Sklansky, Kogge-Stone, Brent-Kung, Ladner-Fischer, Han-Carlson, Knowles, etc.
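As a reminder of what a prefix network computes, here is a minimal Python sketch of one of the classic structures named above, the Kogge-Stone network. The function name `kogge_stone_add` and the bit-list representation are illustrative choices, not taken from the slides; the sketch models logic levels only and ignores fanout, wiring, and timing.

```python
def kogge_stone_add(a, b, width=8):
    """Add two unsigned `width`-bit integers with a Kogge-Stone prefix network.

    Returns (sum including carry-out, number of prefix levels used).
    """
    # Bitwise generate and propagate signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    p_bit = p[:]  # original propagate bits, needed for the sum bits

    dist = 1
    levels = 0
    while dist < width:
        # Each level combines position i with position i - dist.
        g_next, p_next = g[:], p[:]
        for i in range(dist, width):
            g_next[i] = g[i] | (p[i] & g[i - dist])
            p_next[i] = p[i] & p[i - dist]
        g, p = g_next, p_next
        dist *= 2
        levels += 1

    # After the network, g[i] is the carry out of bits 0..i.
    carry_in = [0] + g[:-1]
    total = 0
    for i in range(width):
        total |= (p_bit[i] ^ carry_in[i]) << i
    return total | (g[width - 1] << width), levels
```

For 8-bit operands the loop runs with distances 1, 2 and 4, so the network has three prefix levels.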

Prefix Adder – New Aspects, New Method
Realistic design considerations: timing, power and area.
Integer Linear Programming for prefix adders:
– Logical effort timing model (gate cap. + wire cap.)
– Activity-statistic power model
– Non-uniform signal arrival/required times
[Figure labels: Logic Levels, Max Fanouts, Max Wire Tracks; Timing, Power, Area]

Prefix Adder – Optimum Prefix adders Uniform signal arrival/required times [Figure panels: Sklansky Adder; Kogge-Stone Adder; Fastest depth-4 optimal prefix adder; Fastest depth-3 optimal prefix adder]

Prefix Adder – Optimum Prefix adders Uniform signal arrival/required times

Prefix Adder – Optimum Prefix adders Non-uniform signal arrival/required times [Figure panels: Increasing Signal Arrival Times; Decreasing Signal Arrival Times; Convex Signal Arrival Times]

Division – Iteration effort
Pencil-and-paper method (A = Q × B + 2^(-n) × R, with R < B): 1 bit of partial quotient per iteration, n iterations.
Example: A = 0.1001, B = 0.1010; Q = A / B.
Q_i: partial quotient. R_i: partial remainder, with R_0 = A and R_i = R_(i-1) – B × Q_i.
Q_1 = 0.1, Q_2 = 0.01, Q_3 = 0.000, Q_4 = 0.0001, so Q = 0.1101.
[Figure: long-division layout showing the partial remainders R_0 = A through R_4]
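The one-bit-per-iteration recurrence can be sketched in a few lines of Python. This is an illustrative restoring-division sketch, not code from the slides: the function name `pencil_and_paper_divide` is hypothetical, exact rationals from `fractions.Fraction` stand in for fixed-point registers, and the operands are assumed to satisfy 0 <= A < B.

```python
from fractions import Fraction


def pencil_and_paper_divide(a, b, n):
    """Return (Q, R) such that A = Q*B + 2**-n * R with R < B.

    One quotient bit is decided per iteration; assumes 0 <= a < b.
    """
    q = Fraction(0)
    r = Fraction(a)  # R_0 = A
    for i in range(1, n + 1):
        weight = Fraction(1, 2 ** i)  # candidate partial quotient Q_i
        if r >= b * weight:
            # Bit i of the quotient is 1: subtract B * Q_i from the remainder.
            q += weight
            r -= b * weight
        # Otherwise Q_i = 0 and the remainder is unchanged.
    return q, r * 2 ** n  # rescale so the result matches A = Q*B + 2**-n * R
```

For instance, `pencil_and_paper_divide(Fraction(3, 8), Fraction(3, 4), 4)` yields a quotient of 1/2 and a zero remainder.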

Division – Memory effort
A lookup table is the simplest way to obtain multiple partial quotient bits in each iteration.
SRT method: a lookup table stores m-bit partial quotients selected by m bits of the partial remainder and m bits of the divisor.
Table size: 2^(2m) × m bits.
The SRT method is limited by the memory wall.
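To see how quickly the table-size formula grows, a small Python helper can evaluate it for a few values of m. The name `srt_table_bits` is made up for this illustration.

```python
def srt_table_bits(m):
    """Bits in a table indexed by m remainder bits and m divisor bits,
    where each entry holds an m-bit partial quotient."""
    entries = 2 ** (2 * m)
    return entries * m
```

With m = 4 the table needs 1,024 bits, with m = 8 it needs 524,288 bits, and with m = 12 it already needs 201,326,592 bits.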

Division – Arithmetic effort The partial quotient is calculated by arithmetic functions: prescaling, Taylor expansion, or series expansion.

Division – Solution space
Modern FPGAs contain plenty of memory and built-in multipliers, which enable high-performance dividers.
[Figure: solution space with axes Iteration Effort, Memory Effort and Arithmetic Effort; labels: Memory Wall, Pencil-and-paper, SRT, Prescaling, Taylor Expansion, Series Expansion, Low area, Low latency, Our target]

Division – PST algorithm
– Series expansion is powerful, but it needs a good starting point.
– Prescaling provides a scaled divisor close to 1.
– 0-order Taylor expansion iterates to reach the final quotient.

Division – PST algorithm
Prescaling (B^(m) denotes the leading m bits of B):
E_0 = Table(B^(m)) ≈ 1/B
A_1 = A × E_0; B_1 = B × E_0
E_1 = (2 – B_1) ≈ INV(B_1^(2m))
Iteration (R_0 = A_1):
Q_i = R_(i-1) × E_1
R_i = R_(i-1) – Q_i × B_1
Q = Q + Q_i
Example (m = 4):
A = 0.1011,0110
B = 0.1100,1011
B^(m) = 0.1100 → E_0 = 1.0011
A_1 = A × E_0 = 0.1101,1000,0010
B_1 = B × E_0 = 0.1111,0001,0001
E_1 = INV(B_1^(2m)) = 1.0000,1110
Q_1 = A_1 × E_1 = 0.1110,0011
R_1 = A_1 – Q_1 × B_1 = 0.0000,0010,0101,1110,1101
Q_2 = R_1 × E_1 = 0.1001,1111
R_2 = R_1 – Q_2 × B_1 = 0.0000,0001,1111,1011,0001
(Q_2 and R_2 are written scaled up by 2^6; Q_2 enters the sum below with weight 2^(-6).)
Q = 0.1110,0011 + 0.0000,0010,0111,11 = 0.1110,0101,0111,11
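The recurrence can be traced in software with exact rationals. The sketch below is one reading of the slide, not the authors' implementation. It assumes that INV inverts the integer bit and the leading 2m fractional bits of B_1, that each partial quotient is truncated, and that E_0 is passed in directly instead of being read from a table. The names `truncate`, `pst_divide` and the `widths` parameter (fractional bits kept for each partial quotient) are hypothetical.

```python
from fractions import Fraction


def truncate(x, bits):
    """Drop everything below 2**-bits from a non-negative Fraction."""
    return Fraction(int(x * 2 ** bits), 2 ** bits)


def pst_divide(a, b, e0, m, widths):
    """Approximate a / b using prescaling followed by multiply-subtract iterations.

    e0 plays the role of the table lookup on the leading m bits of b.
    widths lists how many fractional bits each partial quotient keeps
    (an assumption chosen to mirror the worked example).
    """
    # Prescaling: bring the divisor close to 1.
    a1 = a * e0
    b1 = b * e0
    # Assumed meaning of INV: inverting the integer bit and the top 2m
    # fractional bits of B1 gives 2 - B1(2m) - 2**-(2m), close to 2 - B1.
    e1 = 2 - truncate(b1, 2 * m) - Fraction(1, 2 ** (2 * m))

    q = Fraction(0)
    r = a1  # R_0 = A_1
    for bits in widths:
        qi = truncate(r * e1, bits)  # Q_i = R_(i-1) * E_1
        r -= qi * b1                 # R_i = R_(i-1) - Q_i * B_1
        q += qi                      # Q = Q + Q_i
    return q, r


# Operands from the worked example (m = 4).
A = Fraction(0b10110110, 2 ** 8)   # 0.1011,0110
B = Fraction(0b11001011, 2 ** 8)   # 0.1100,1011
E0 = Fraction(0b10011, 2 ** 4)     # 1.0011
quotient, remainder = pst_divide(A, B, E0, m=4, widths=(8, 14))
```

With these inputs the quotient comes out as 14687/16384, which is 0.1110,0101,0111,11 in binary.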

Division – FPGA Implementation
The PST algorithm is suitable for high-performance division unit design in FPGAs. Results for 32-bit division with 5-cycle latency:
Design | Fmax (Period) | ALUTs, Memory Bits, DSP Blocks | Power Consumption (Dynamic+Static) | Throughput
IP Core (no DSP) | 50.16 MHz (19.935 ns) | 1203840 | 381 mW (52 mW + 329 mW) | 50.16 Mdiv/s
PST (DSP) | 72.8 MHz (13.737 ns) | 21376828 | 350 mW (23 mW + 327 mW) | 24.3 Mdiv/s
PST (no DSP) | 73.20 MHz (13.661 ns) | 14377680 | 378 mW (50 mW + 328 mW) | 24.4 Mdiv/s
PST-pipelined (DSP) | 74.15 MHz (13.486 ns) | 26176840 | 344 mW (17 mW + 327 mW) | 74.15 Mdiv/s
PSTp (no DSP) | 76.05 MHz (13.150 ns) | 19407680 | 359 mW (31 mW + 328 mW) | 76.05 Mdiv/s
