Optimizing Multipliers for the CPU: A ROM-Based Approach
Michael Moeng, Jason Wei
Electrical Engineering and Computer Science, University of California, Berkeley

Problem
- Many power-limited applications for the CPU:
  - Media/graphics
  - Portable applications
- Investigating the impact of different multiplier designs on CPU power and performance:
  - SimpleScalar used to model the CPU and run the benchmarks
  - SimpleScalar multiplier cycle times modified to model different multiplier architectures

Array Multipliers
- AND function multiplies individual bits
- Critical path lies in the carry chain
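The array structure can be sketched in software as follows. This is only an illustrative model, assuming unsigned operands and a hypothetical function name `array_multiply`; it mirrors the AND-gate partial products and row-by-row accumulation, not any particular circuit from the talk.

```python
def array_multiply(a: int, b: int, width: int = 8) -> int:
    """Unsigned multiply built from 1-bit AND partial products (software model)."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    for i in range(width):            # one row of the array per multiplier bit b_i
        b_i = (b >> i) & 1
        row = 0
        for j in range(width):        # each cell is an AND gate: a_j AND b_i
            a_j = (a >> j) & 1
            row |= (a_j & b_i) << j
        product += row << i           # row adder; in hardware the carries ripple here
    return product
```

In hardware the additions in the outer loop form the carry chain that the slide identifies as the critical path.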

Wallace Multipliers
- Critical path shortened
- Final adder still needed to combine partial products
- Power consumption approximately the same as the array multiplier

Modified Booth Representation
- 3 bits examined at a time; only even values of i are traversed
- Reduces the number of partial products by half
- However, overhead is required to generate the select signals and MUXes
- $Y_{-1} = 0$
- Examples: [0] [0] 2 -2
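A minimal sketch of the recoding, assuming the standard radix-4 digit rule $d_i = Y_{i-1} + Y_i - 2Y_{i+1}$ for even $i$ with $Y_{-1} = 0$, and a two's-complement multiplier of even width. The function names are hypothetical and chosen only for illustration.

```python
def booth_radix4_digits(y: int, width: int = 8) -> list[int]:
    """Recode a two's-complement multiplier into radix-4 digits in {-2,-1,0,1,2}."""
    assert width % 2 == 0, "sketch assumes an even bit width"
    bits = [(y >> i) & 1 for i in range(width)]   # two's-complement bits of y
    digits = []
    prev = 0                                      # Y_{-1} = 0
    for i in range(0, width, 2):                  # only even i are visited
        digits.append(prev + bits[i] - 2 * bits[i + 1])
        prev = bits[i + 1]
    return digits


def booth_multiply(a: int, y: int, width: int = 8) -> int:
    """Sum the recoded partial products: one per radix-4 digit."""
    return sum(d * a * 4 ** k for k, d in enumerate(booth_radix4_digits(y, width)))
```

For an 8-bit multiplier this yields four digits, so four partial products instead of eight; each digit selects 0, ±A, or ±2A, which is where the MUX overhead comes from.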

Read Only Memory
- Desirable because of low power requirements
- Drawbacks stem from read delay and size
- 240 MHz -> 4.2 ns delay
- Consumes 3.24 mW at 100 MHz (10 ns delay)

ROM-based multipliers
- ROM-based multipliers are attractive
- Issue of space: a 32-bit multiplier requires $2^{32} \cdot 2^{32} \cdot 64$ bits, which is unrealistic
- Techniques to reduce table sizes:
  - Karatsuba Algorithm: $A = A_{31-16}A_{15-0}$, $B = B_{31-16}B_{15-0}$
    $A \cdot B = (A_{31-16}B_{31-16} \ll 32) + (A_{15-0}B_{31-16} \ll 16) + (A_{31-16}B_{15-0} \ll 16) + A_{15-0}B_{15-0}$
  - Reduces table size to $2^{16} \cdot 2^{16} \cdot 32$ bits, but requires 4 lookups and 3 additions
  - Using multiple, parallel lookups still uses fewer bits than a regular table lookup
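The decomposition above can be sketched with a small stand-in table. To keep the table buildable, this assumes 8-bit operands split into 4-bit halves rather than the 32-bit/16-bit split on the slide; `ROM`, `rom_multiply`, and `table_bits` are hypothetical names. Note that the four-product form shown on the slide is the plain half-word split; Karatsuba in its usual form trades one of the four products for extra additions.

```python
HALF = 4                      # demo half-width (the slide uses 16)
MASK = (1 << HALF) - 1

# Stand-in for the ROM: indexed by two half-words, holds their 2*HALF-bit product.
ROM = [[x * y for y in range(1 << HALF)] for x in range(1 << HALF)]


def rom_multiply(a: int, b: int) -> int:
    """Multiply two 2*HALF-bit unsigned values using 4 lookups and 3 additions."""
    a_hi, a_lo = a >> HALF, a & MASK
    b_hi, b_lo = b >> HALF, b & MASK
    return ((ROM[a_hi][b_hi] << (2 * HALF))
            + (ROM[a_lo][b_hi] << HALF)
            + (ROM[a_hi][b_lo] << HALF)
            + ROM[a_lo][b_lo])


def table_bits(operand_bits: int) -> int:
    """Bits in a direct product table: 2^n * 2^n entries, each 2n bits wide."""
    return (1 << operand_bits) * (1 << operand_bits) * (2 * operand_bits)
```

With these definitions, `table_bits(32)` is $2^{70}$ bits and `table_bits(16)` is $2^{37}$ bits, which matches the sizes quoted on the slide.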

ROM-based multipliers cont.
- Vinnakota's approach: use tables of squares
  - Let $x = \lfloor (A + B)/2 \rfloor$ and $y = \lfloor (A - B)/2 \rfloor$
  - If $A_0 \oplus B_0 = 0$: $A \cdot B = x^2 - y^2$
  - If $A_0 \oplus B_0 = 1$: $A \cdot B = x^2 - y^2 + B$
  - Reduces table size to $2^{32} \cdot 64$ bits, further reducible with split tables (introduced later); requires 2 table lookups and 3 (or 4) additions
- Hybrid approach: use tables of squares to find the partial products for the Karatsuba algorithm
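A minimal sketch of the table-of-squares identity, assuming unsigned operands and a small demo width. The operand swap that keeps $A - B$ non-negative is an assumption of this sketch (so that $y$ can index the table directly), not something stated on the slide; the function names are hypothetical.

```python
def square_table(n_bits: int) -> list[int]:
    """Table of squares with 2^n entries (stand-in for the ROM)."""
    return [x * x for x in range(1 << n_bits)]


def multiply_via_squares(a: int, b: int, table: list[int]) -> int:
    """A*B = x^2 - y^2 (+ B when A and B differ in their lowest bit)."""
    if a < b:                 # assumption: order operands so a - b >= 0
        a, b = b, a
    x = (a + b) // 2          # addition 1
    y = (a - b) // 2          # addition 2
    product = table[x] - table[y]   # two lookups, addition 3
    if (a ^ b) & 1:           # A_0 xor B_0 = 1: floor dropped a half, add B back
        product += b          # optional addition 4
    return product
```

The odd case works because $x - y = B$ and $x + y = A - 1$ when $A + B$ is odd, so $x^2 - y^2 = AB - B$.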

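Proposed Implementation
[Block diagram: $A = A_1A_0$ and $B = B_1B_0$ are split into halves; the values $x_{11}, y_{11}, \dots$ are looked up in $2^{16} \times 32$-bit ROMs to give $x_{11}^2, y_{11}^2, \dots$, which are combined into the partial products $A_1 \cdot B_1$, $A_1 \cdot B_0$, ...]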

Results
- Most of the SPEC2000 benchmarks exhibited little or no performance loss (<0.5%) from extra multiplier cycles: art, bzip*, gcc, gzip*, ijpeg, li, mcf, mesa, parser*, vpr
  - Legend: [marker not shown]: significant; *: possibly significant
- Applications that did experience a drop in performance (extra cycles):
  - go.outorder (6.41%): Go-playing program
  - m88ksim (5.39%): chip simulator
  - perl (0.72%): Perl interpreter
  - vortex (2.33%): object-oriented database

Further Work
- Measurements:
  - Accurate power measurements
  - More specific benchmarks targeting multimedia
- Optimizations:
  - Tables: Vinnakota's split-table work. If $A$ and $B$ share their lower $k$ bits, then $A^2$ and $B^2$ share their lower $k+1$ bits. A $2^N \cdot N$ table can therefore be changed to a $2^N \cdot (N - [k+1])$ table and a $2^k \cdot (k+1)$ table, giving somewhat faster lookups and lower memory requirements.
  - Adders: adders can be optimized; the final 64-bit additions are more like 48-bit additions. Multiplication operations can be pipelined in up to 3 stages.
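The split-table idea can be sketched as below, assuming $k \ge 1$ and a small demo width. The names `build_split_tables` and `split_square` are hypothetical, and the entry widths here follow the full square value, so they are not meant to reproduce the slide's bit counts exactly.

```python
def build_split_tables(n_bits: int, k: int) -> tuple[list[int], list[int]]:
    """Split a table of squares into a wide 'high' table and a small 'low' table."""
    assert k >= 1, "the shared-low-bits property needs k >= 1"
    low_mask = (1 << (k + 1)) - 1
    # High table: one entry per input, with the low k+1 bits of the square dropped.
    high = [(x * x) >> (k + 1) for x in range(1 << n_bits)]
    # Low table: indexed only by the low k input bits, holds the low k+1 square bits.
    low = [(x * x) & low_mask for x in range(1 << k)]
    return high, low


def split_square(x: int, high: list[int], low: list[int], k: int) -> int:
    """Reassemble x^2 from the two tables."""
    return (high[x] << (k + 1)) | low[x & ((1 << k) - 1)]
```

The low table is valid because $A^2 - B^2 = (A - B)(A + B)$: if $A - B$ is a multiple of $2^k$ with $k \ge 1$, then $A + B$ is even, so the difference of squares is a multiple of $2^{k+1}$.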