Datapath Designs. CK Cheng, CSE Department, UC San Diego
Prefix Adder – Well-known and Well-developed? Classic prefix networks: Sklansky, Kogge-Stone, Brent-Kung, Ladner-Fischer, Han-Carlson, Knowles, etc.
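As a concrete instance of one of the classic networks named above, the following is a minimal behavioral sketch of a Kogge-Stone prefix adder. The function name `kogge_stone_add` and the default 8-bit width are illustrative assumptions, not taken from the slides; the sketch models the carry network in software and says nothing about gate sizing or wiring.

```python
def kogge_stone_add(a: int, b: int, width: int = 8) -> int:
    """Add two unsigned integers using a Kogge-Stone parallel-prefix carry network."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    # Bitwise generate (g) and propagate (p) signals for every bit position.
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    G, P = g[:], p[:]
    dist = 1
    while dist < width:
        new_g, new_p = G[:], P[:]
        for i in range(dist, width):
            # Prefix operator: (G, P) o (G', P') = (G | (P & G'), P & P')
            new_g[i] = G[i] | (P[i] & G[i - dist])
            new_p[i] = P[i] & P[i - dist]
        G, P = new_g, new_p
        dist *= 2
    # G[i] is now the carry out of bit i; sum bit i uses the carry into bit i.
    total = 0
    for i in range(width):
        carry_in = G[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    total |= G[width - 1] << width  # keep the final carry-out as an extra bit
    return total
```

Each pass of the `while` loop corresponds to one logic level, so an 8-bit adder uses three levels of prefix operators.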
Prefix Adder – New Aspects, New Method
Realistic design considerations: timing, power, and area.
Integer linear programming (ILP) for prefix adders:
–Logical effort timing model (gate capacitance + wire capacitance)
–Activity-statistic power model
–Non-uniform signal arrival/required times
[Figure labels: Logic Levels, Max Fanouts, Max Wire Tracks; Timing, Power, Area]
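To make the logic-level and fanout measures concrete, the sketch below builds the Sklansky and Kogge-Stone networks as lists of prefix operators and counts levels and worst-case fanout. The representation (pairs `(i, j)` meaning "position i combines with position j") and the fanout definition (number of prefix-operator inputs in the next level driven by one position) are assumptions made for this illustration; wire tracks are not modeled, and this is not the ILP formulation from the slides.

```python
from collections import Counter

def sklansky_levels(n):
    """Per logic level, the (i, j) prefix operators of an n-bit Sklansky network."""
    levels = []
    k = 0
    while (1 << k) < n:
        # Positions with bit k set combine with the last position of the block below.
        levels.append([(i, ((i >> k) << k) - 1) for i in range(n) if (i >> k) & 1])
        k += 1
    return levels

def kogge_stone_levels(n):
    """Per logic level, the (i, j) prefix operators of an n-bit Kogge-Stone network."""
    levels = []
    dist = 1
    while dist < n:
        levels.append([(i, i - dist) for i in range(dist, n)])
        dist *= 2
    return levels

def max_fanout(levels):
    """Largest number of operator inputs driven by a single position in any level."""
    worst = 0
    for ops in levels:
        loads = Counter()
        for i, j in ops:
            loads[i] += 1  # upper input of the operator at position i
            loads[j] += 1  # lower input of the operator at position i
        worst = max(worst, max(loads.values()))
    return worst

def covers_all_prefixes(levels, n):
    """Check that every position i ends up spanning bits 0..i."""
    lo = list(range(n))  # lowest bit index covered by each position
    for ops in levels:
        new_lo = lo[:]
        for i, j in ops:
            if lo[i] != j + 1:  # operands must be adjacent spans
                return False
            new_lo[i] = lo[j]
        lo = new_lo
    return all(v == 0 for v in lo)

for name, build in (("Sklansky", sklansky_levels), ("Kogge-Stone", kogge_stone_levels)):
    net = build(16)
    print(name, "levels:", len(net), "max fanout:", max_fanout(net))
```

Under these definitions both 16-bit networks use four levels, while the fanout differs sharply, which is the kind of trade-off a timing, power, and area model has to weigh.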
Prefix Adder – Optimum Prefix Adders: uniform signal arrival/required times
[Figure panels: Sklansky adder; Kogge-Stone adder; fastest depth-4 optimal prefix adder; fastest depth-3 optimal prefix adder]
Prefix Adder – Optimum Prefix Adders: non-uniform signal arrival/required times
[Figure panels: increasing signal arrival times; decreasing signal arrival times; convex signal arrival times]
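Division – Iteration effort
Pencil-and-paper method ($A = Q \cdot B + 2^{-n} R$ and $R < B$): 1 bit of partial quotient per iteration, $n$ iterations.
A = , B = ; Q = A / B.
$Q = \sum_i Q_i$, where $Q_i$ is the partial quotient and $R_i$ is the partial remainder.
$R_{i+1} = R_i - B \, Q_i$, with $R_0 = A$.
$Q_1 = 0.1$, $Q_2 = 0.01$, $Q_3 =$ , $Q_4 =$
[Figure: partial remainders $R_0 = A$, $R_1$, $R_2$, $R_3$, $R_4$]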
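The one-bit-per-iteration scheme can be sketched as follows. This is an illustration under stated assumptions: operands are exact fractions with 0 <= A < B so that the quotient is a pure binary fraction, and the name `pencil_and_paper_divide` is hypothetical. Partial quotients are indexed from 1 here, so bit i has weight 2^-i.

```python
from fractions import Fraction

def pencil_and_paper_divide(a, b, n):
    """Return (Q, R) with a == Q*b + 2**-n * R and 0 <= R < b, one quotient bit per step."""
    a, b = Fraction(a), Fraction(b)
    if not (0 <= a < b):
        raise ValueError("this sketch assumes 0 <= a < b")
    q = Fraction(0)
    r = a  # initial partial remainder is the dividend
    for i in range(1, n + 1):
        weight = Fraction(1, 2 ** i)   # candidate partial quotient 2^-i
        if r >= b * weight:            # does the shifted divisor still fit?
            q += weight                # accept this quotient bit
            r -= b * weight            # update the partial remainder
    return q, r * 2 ** n               # rescale so that R < b

q, r = pencil_and_paper_divide(Fraction(5, 16), Fraction(11, 16), 8)
print(q, r)
```

After n iterations the quotient is the exact ratio truncated to n fractional bits, which is why latency grows linearly with precision in this corner of the design space.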
Division – Memory effort
A lookup table is the simplest way to obtain multiple partial-quotient bits in each iteration.
SRT method: a lookup table stores m-bit partial quotients selected by m bits of the partial remainder and m bits of the divisor.
Table size: $2^{2m} \times m$ bits.
The SRT method is limited by the memory wall.
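A quick calculation shows how fast such a table grows. The helper name `srt_table_bits` is hypothetical, and the sketch assumes the table is addressed by 2m bits and returns m bits per entry.

```python
def srt_table_bits(m: int) -> int:
    """Storage in bits for a table addressed by 2m bits that returns m bits per entry."""
    entries = 1 << (2 * m)
    return entries * m

for m in (4, 8, 12, 16):
    print(f"m = {m:2d}: {srt_table_bits(m):,} bits")
```

Retiring 8 quotient bits per iteration already takes 524,288 bits (64 KiB), and 16 bits per iteration would take 2^36 bits (8 GiB), which illustrates the memory wall.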
Division – Arithmetic effort
The partial quotient is calculated by arithmetic functions: prescaling, Taylor expansion, and series expansion.
Division – Solution space
Modern FPGAs contain plenty of memory and built-in multipliers, which enable high-performance dividers.
[Figure labels: Iteration Effort, Memory Effort, Arithmetic Effort; Memory Wall; Pencil-and-paper, SRT, Prescaling, Taylor Expansion, Series Expansion; Low area, Low latency, Our target]
Division – PST algorithm
Use the power of series expansion, which needs a good starting point. Prescaling provides a scaled divisor close to 1. A 0-order Taylor expansion then iterates to reach the final quotient.
Division – PST algorithm
$E_0 = \mathrm{Table}(B^{(m)}) \approx 1/B$
$A_1 = A \times E_0$; $B_1 = B \times E_0$
$E_1 = (2 - B_1) \approx \mathrm{INV}(B_1^{(2m)})$
$Q_i = R_{i-1} \times E_1$
$R_i = R_{i-1} - Q_i \times B_1$, with $R_0 = A_1$
$Q = Q + Q_i$
Example:
A = ,0110
B = ,1011
B^(m) =
E_0 =
E_1 = INV(B_1^(2m)) = ,1110
A_1 = A × E_0 = ,1000,0010
B_1 = B × E_0 = ,0001,0001
Q_1 = A_1 × E_1 = ,0011
R_1 = A_1 – Q_1 × B_1 = ,0010,0101,1110,1101
Q_2 = R_1 × E_1 = ,1111
R_2 = R_1 – Q_2 × B_1 = ,0001,1111,1011,0001
Q = , ,0010,0111,11 = ,0101,0111,11
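The steps above can be sketched in software as follows. This is a behavioral illustration under several assumptions that the slides do not state: the divisor is normalized to 1 <= B < 2, the lookup table is emulated by computing a truncated reciprocal of the top m fractional bits of B, the bitwise inversion INV is replaced by a truncated 2 − B_1, and each partial quotient is truncated to 2m fractional bits. The names `truncate` and `pst_divide` are hypothetical.

```python
from fractions import Fraction
from math import floor

def truncate(x, bits):
    """Keep only `bits` fractional bits of x (round toward minus infinity)."""
    scale = 1 << bits
    return Fraction(floor(x * scale), scale)

def pst_divide(a, b, m=8, iterations=2):
    """Approximate a / b by prescaling followed by series-expansion iterations."""
    a, b = Fraction(a), Fraction(b)
    if not (1 <= b < 2):
        raise ValueError("this sketch assumes a normalized divisor, 1 <= b < 2")
    # Prescaling: E0 ~ 1/B from the leading m bits of B (stand-in for the table).
    e0 = truncate(1 / truncate(b, m), m)
    a1, b1 = a * e0, b * e0            # scaled dividend and divisor, b1 close to 1
    # 0-order reciprocal estimate of b1 (stand-in for INV of its top 2m bits).
    e1 = truncate(2 - b1, 2 * m)
    q, r = Fraction(0), a1             # R_0 = A_1
    for _ in range(iterations):
        qi = truncate(r * e1, 2 * m)   # partial quotient Q_i = R_{i-1} * E_1
        r -= qi * b1                   # R_i = R_{i-1} - Q_i * B_1
        q += qi                        # accumulate Q = Q + Q_i
    return q, r

q, r = pst_divide(Fraction(3, 2), Fraction(11, 8))
print(float(q), float(Fraction(3, 2) / Fraction(11, 8)))
```

Because the scaled divisor is within roughly 2^-m of 1, each iteration shrinks the remainder by a factor on the order of 2^-2m until the truncation of the partial quotient dominates, so a small fixed number of iterations is enough in this model.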
Division – FPGA Implementation
The PST algorithm is suitable for high-performance division unit design in FPGAs.
Reported metrics: Fmax (period), ALUTs, memory bits, DSP blocks, power consumption (dynamic + static), and throughput.

Design                Fmax        Period      Power (dynamic + static)   Throughput
IP Core (no DSP)      50.16 MHz   19.935 ns   52 mW + 329 mW             50.16 Mdiv/s
PST (DSP)             72.8 MHz    13.737 ns   23 mW + 327 mW             24.3 Mdiv/s
PST (no DSP)          73.20 MHz   13.661 ns   50 mW + 328 mW             24.4 Mdiv/s
PST-pipelined (DSP)   74.15 MHz   13.486 ns   17 mW + 327 mW             74.15 Mdiv/s
PSTp (no DSP)         76.05 MHz   13.150 ns   31 mW + 328 mW             76.05 Mdiv/s

32-bit division with 5-cycle latency.