CML RESIDUE NUMBER SYSTEM ENHANCEMENTS FOR PROGRAMMABLE PROCESSORS Arizona State University Rooju Chokshi 7th November, 2008 Compiler-Microarchitecture.

Presentation transcript:

CML RESIDUE NUMBER SYSTEM ENHANCEMENTS FOR PROGRAMMABLE PROCESSORS Arizona State University Rooju Chokshi 7th November, 2008 Compiler-Microarchitecture Lab Computer Science and Engineering 1

CML Power and Performance Demand: Perpetual demand for higher performance and power. Real-time computing environments require high-speed computation. Cellular phones: battery power is a limited resource. How do we reduce the power gap without performance loss? 2

CML Limitation of 2’s complement: The 2’s complement system limits parallelism. O(n) carry propagation chains in adders; carry prediction schemes consume area and power. Limited parallelism due to carry. Do better alternatives exist? 3

CML Residue Number System: Non-positional number system, characterized by pairwise relatively prime integers P = (P_1, P_2, …, P_k). A 2’s complement integer N transforms to the k-tuple (R_1, R_2, …, R_k), where R_i = N mod P_i. Convert back to 2’s complement by applying the Chinese Remainder Theorem. Perform operation OP in parallel on smaller bit-widths: for X → (x_1, x_2, …, x_k) and Y → (y_1, y_2, …, y_k), X OP Y = (x_1 OP y_1, …, x_k OP y_k). [Figure: X and Y split across channels P_1, P_2, P_3 to produce X OP Y] 4
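The conversion and channel-wise operation described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the moduli set (7, 8, 9) and the function names `to_rns`, `rns_op` and `from_rns` are hypothetical and do not come from the presentation, and only non-negative integers within the dynamic range are handled.

```python
from math import gcd, prod

def to_rns(n, moduli):
    """Forward conversion: the residue of n in each channel."""
    return tuple(n % p for p in moduli)

def rns_op(x, y, moduli, op):
    """Apply op channel by channel; channels never exchange carries."""
    return tuple(op(a, b) % p for a, b, p in zip(x, y, moduli))

def from_rns(residues, moduli):
    """Reverse conversion using the Chinese Remainder Theorem."""
    M = prod(moduli)
    total = 0
    for r, p in zip(residues, moduli):
        m_i = M // p
        # pow(m_i, -1, p) is the modular inverse of m_i modulo p (Python 3.8+)
        total += r * m_i * pow(m_i, -1, p)
    return total % M

# Hypothetical moduli set, pairwise coprime, dynamic range 7 * 8 * 9 = 504
moduli = (7, 8, 9)
assert all(gcd(a, b) == 1 for i, a in enumerate(moduli) for b in moduli[i + 1:])

x = to_rns(13, moduli)
y = to_rns(21, moduli)
product = from_rns(rns_op(x, y, moduli, lambda a, b: a * b), moduli)
```

Each channel works on at most 4 bits here, while the recovered product (273) needs 9 bits in a positional representation. The result is only correct while it stays below the product of the moduli.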

CML Residue Number System Pros and Cons. Advantages: splits an n-bit integer into multiple smaller independent components; computation on smaller bit-widths, in parallel; faster computation; lower power consumption. Limitations: fast arithmetic does not extend to division, general comparison, or bit-wise operations; conversion from 2’s complement to RNS and vice versa has high overhead. 5

CML Research Objectives: Utilize RNS to design faster, lower-power programmable processors. Design hardware that enables hiding overhead. Automate code mapping: formalize the code mapping problem and develop compiler techniques for code mapping. Focus on maximizing application performance. 6

CML Agenda: Towards alternative number systems; Introduction to RNS; Research Objectives; Previous RNS Research; RNS Processor Challenges; Proposed Microarchitecture; Compiler Technique; Experimental Results; Conclusions 7

CML Previous RNS Research: RNS is typically used in fixed-function DSP architectures (digital filters, DFT, DWT). Griffin and Taylor proposed programmable RNS RISC processors as a topic of future research. Chavez and Sousa developed an RNS-based RISC DSP; the focus is on reducing area and power, not on improving execution time. Ramirez et al. developed an RNS DSP microprocessor with a pure RNS ALU; the ISA does not include conversion operations, so conversions need to be added as separate stages. Overhead is not hidden effectively. 8


CML RNS Processor Challenges: Parallel operations are limited to (+, -, x), so 2’s complement units must also be kept. Conversion overheads: software-transparent operation requires conversions before and after every computation, which makes the conversion overhead high. The design should enable hiding these overheads. 10


CML Separate conversion and computation: Augment the ISA with explicit conversion instructions. Conversions can now be scheduled and optimized like any other instruction, which enables better hiding of conversion latencies. 12

CML Carry-save Operand Representation: The functional units are based on CSA trees, which produce sum and carry vectors S and C. A final modulo adder stage that combines S and C (S+2C) means larger delay, area and power. Instead, store both S and C for an RNS value, so the modulo adder is removed. Use the existing register file with double-precision load, store and mov instructions. [Figure: inputs X and Y feed a CSA tree producing S and C; a modulo adder computes Z = S+2C] 13
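A minimal sketch of the carry-save idea, assuming a plain 3:2 compressor on non-negative Python integers. The function names `csa`, `accumulate` and `resolve` are hypothetical, and channel-width truncation (which real hardware would apply) is not modelled; the point is only that a value can be carried around as the pair (S, C) with value S + 2C, and the carry-propagating modulo addition can be deferred.

```python
def csa(x, y, z):
    """3:2 carry-save compressor: x + y + z == s + 2 * c, with no carry chain."""
    s = x ^ y ^ z                       # bitwise sum without carries
    c = (x & y) | (x & z) | (y & z)     # carry generated at each bit position
    return s, c

def accumulate(values):
    """Add many operands while keeping the running total as an (S, C) pair."""
    s, c = 0, 0
    for v in values:
        # 2 * c is a fixed shift (pure wiring in hardware), not an addition
        s, c = csa(s, 2 * c, v)
    return s, c

def resolve(s, c, modulus):
    """The deferred modulo adder stage: combine S and C into one residue."""
    return (s + 2 * c) % modulus

s, c = accumulate([5, 6, 3, 9])
residue = resolve(s, c, 7)   # 7 is a hypothetical channel modulus
```

Only `resolve` needs carry propagation, so a chain of dependent additions pays for it once rather than after every operation.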

CML Selection of Moduli Set: The moduli set affects channel delays. … operates on the same number of bits in every channel. The power-of-two channel is much faster than the others. Propagation delays should be as close as possible. What about …, k > n? 14

CML Synthesis Results – 0.18 µ 15

CML Pipeline Model [Figure: pipeline with stages IF, ID, EX, COM, WB; functional units Multiplier, Adder, RNS Multiplier, RNS Adder, FC, RC; register files Integer Reg File and 33-bit RNS Reg File/GP Floating Point Reg File] 16


CML Compiler Technique - Aims: Analyze data dependency graphs of applications for RNS profitability. Identify potential subgraphs; a profit model is needed. Map profitable subgraphs to RNS instructions; cycle time is the metric for profit. No previous compiler technique exists for RNS. 18

CML Definitions
RNS Eligible Node: a node whose operation is one of (+, -, x).
RNS Eligible Subgraph (RES): a subgraph G_RES(V_RES, E_RES) such that V_RES consists only of RNS Eligible Nodes.
Maximal RNS Eligible Subgraph (MRES): a RES G_MRES(V_MRES, E_MRES) of DFG G(V, E) is maximal if, for all v in V_MRES, there is no edge (u, v) or (v, u) in E such that u is an RNS eligible node outside V_MRES.
[Figure: example DFG with nodes labeled L, +, *, / and >] 19

CML Problem Definition: The aim is to map as many operations as possible to RNS, provided doing so is profitable. Given a set of dataflow graphs of program basic blocks: find all Maximal RNS Eligible Subgraphs, estimate profitability, and map profitable MRESs to RNS. 20

CML Finding MRESs: Start with an unvisited RNS eligible node as the seed node. Expand to include adjacent RNS eligible nodes until no more can be included (BFS). [Figure: example DFG with nodes labeled L, +, *, / and >] 21
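The seed-and-expand search can be sketched as a breadth-first traversal over the dataflow graph. The graph encoding (a dict of node operators plus an edge list), the node names and the function name `find_mres` are assumptions made for illustration; the presentation does not show its actual data structures.

```python
from collections import deque

RNS_OPS = {"+", "-", "*"}   # RNS eligible operations

def find_mres(ops, edges):
    """ops maps node -> operator; edges is a list of (u, v) dataflow edges.
    Returns each maximal connected set of RNS eligible nodes."""
    adjacent = {node: set() for node in ops}
    for u, v in edges:
        adjacent[u].add(v)   # edge direction is ignored when expanding
        adjacent[v].add(u)
    visited, result = set(), []
    for seed in ops:
        if seed in visited or ops[seed] not in RNS_OPS:
            continue
        component, queue = {seed}, deque([seed])
        visited.add(seed)
        while queue:
            node = queue.popleft()
            for nb in adjacent[node]:
                if nb not in visited and ops[nb] in RNS_OPS:
                    visited.add(nb)
                    component.add(nb)
                    queue.append(nb)
        result.append(component)
    return result

# Hypothetical DFG: two loads feed a multiply, then an add, a divide, an add
ops = {"l1": "L", "l2": "L", "m1": "*", "a1": "+", "d1": "/", "a2": "+"}
edges = [("l1", "m1"), ("l2", "m1"), ("m1", "a1"), ("a1", "d1"), ("d1", "a2")]
subgraphs = find_mres(ops, edges)
```

The divide node is not RNS eligible, so it splits the eligible nodes into two separate subgraphs.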

CML Evaluating profit of MRES: A pair of forward conversions is an overhead of 1 cycle (dataflow, s.t. …). A reverse conversion is an overhead of 2 cycles (dataflow, s.t. …). Every 3-operand addition (x+y+z) is a profit of 1 cycle; pair addition nodes before profit analysis. Every multiplication is a profit of 1 cycle. Apply the profit model to every MRES found earlier. 22
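The cycle counts on this slide can be turned into a small cost function. This is a sketch, not the authors' implementation: the function name and parameters are hypothetical, an unpaired forward conversion is assumed to cost a full cycle, and the dataflow conditions that were attached to the conversion costs on the original slide are not modelled.

```python
from math import ceil

def mres_profit(forward_convs, reverse_convs, paired_adds, mults):
    """Estimated cycles saved by mapping one MRES to RNS (negative = loss)."""
    # One cycle per pair of forward conversions; rounding up for an odd
    # count is an assumption, not something stated in the slides.
    overhead = ceil(forward_convs / 2) + 2 * reverse_convs
    # One cycle gained per 3-operand addition and per multiplication.
    gain = paired_adds + mults
    return gain - overhead

def is_profitable(forward_convs, reverse_convs, paired_adds, mults):
    return mres_profit(forward_convs, reverse_convs, paired_adds, mults) > 0
```

For example, an MRES with four forward conversions, one reverse conversion, two paired additions and three multiplications gives a gain of 5 cycles against an overhead of 4, so under these assumptions it would be mapped.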

CML Forward Conversions In Loops: Move the FC if the register is not written in the loop, or is written only in the same MRES as the FC. [Figure: Basic Algorithm vs. With FC Improvement] 23

CML Improving Addition Pairing: Given an addition expression with n additions, what DFG structure enables the best pairing? An expression with n additions can have … pairs at best. Some DFG structures do not enable the best pairing; linear structures do. 24

CML Improving Addition Pairing: Take an addition tree and linearize it. Apply the transformation repeatedly; each application linearizes a sub-tree, and eventually the entire tree is linearized. 25
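One way to picture linearization and pairing is to flatten an addition tree into an operand list and then consume two operands per step of a linear chain. The tree encoding, the `add3`/`add2` labels and the function names are hypothetical, and the sketch flattens the whole tree at once rather than applying the slide's sub-tree transformation repeatedly. It also relies on addition being associative and commutative, so it does not handle subtraction.

```python
def flatten(tree):
    """tree is a leaf name (str) or a tuple ('+', left, right)."""
    if isinstance(tree, str):
        return [tree]
    _, left, right = tree
    return flatten(left) + flatten(right)

def pair_linear(operands):
    """Emit a linear chain of additions, using 3-operand adds where possible."""
    acc, rest = operands[0], operands[1:]
    ops = []
    for i in range(0, len(rest), 2):
        chunk = rest[i:i + 2]
        kind = "add3" if len(chunk) == 2 else "add2"
        ops.append((kind, acc, *chunk))
        acc = f"t{len(ops)}"          # temporary holding the running sum
    return ops

# Balanced tree (a + b) + (c + d): three additions, four operands
tree = ("+", ("+", "a", "b"), ("+", "c", "d"))
schedule = pair_linear(flatten(tree))
```

In this sketch the balanced tree becomes one 3-operand addition followed by one ordinary addition, and a chain with four additions (five operands) becomes two 3-operand additions.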


CML Experimental Setup: Simulation model: Simplesim-ARM, augmented with RNS units according to synthesis numbers. Measure cycle-time and functional unit power. Benchmarks: FIR, Gaussian smoothing, 2D-DCT, MatMul, some Livermore Loops. Toolchain: GCC, binutils-2.14, arm-linux. [Figure: compiler flow with stages RTL Generation, Flow Analysis, RNS Optimization, Flow Analysis, Scheduling, Register Alloc, Assembly] 27

CML Experimental Results Simulation of manually optimized binaries 28

CML Experimental Results Simulation of compiled binaries & comparison with manually optimized code 29

CML Experimental Results Power vs Performance across multiple resource configurations 30


CML Future Directions: More aggressive ISA optimizations. Moving conversions out of the processor pipeline? Extend the technique from operating at the basic block level to the super-block or hyper-block level. Code annotation for improved compiler analysis? 32

CML Publications  Residue Number Enhancements For Programmable Processors – to be submitted to Design Automation Conference (DAC)  Residue Number Enhancement For Programmable Processors – to be submitted to IEEE Transactions on Computer Aided Design (T-CAD) 33

CML Conclusions: Proposed an RNS-based extension for RISC processors: computation separated from conversion, carry-save operand representation, balanced moduli. This enables hiding overheads. Developed the first compiler techniques for automated analysis and code mapping to RNS units: the basic technique finds and maps profitable MRESs, with improvements for conversions in loops and addition pairing. 20.7% improvement in performance. 51.6% improvement in functional unit power. Thank You! 34

CML Extra Slides 35

CML Design of Hardware Units: Property of periodicity of residues: the bit at position (i + nj) is equivalent to the bit at position i. Align bits according to this rule when reducing bits in the CSA tree. 36
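The periodicity property can be checked numerically. The sketch below assumes a modulus of the form 2^n - 1, for which 2^(i+nj) is congruent to 2^i; the slide does not state which moduli the property is applied to, and the function name is hypothetical. Folding a wide value into n-bit chunks and adding them is the software analogue of aligning bit i + nj with bit i in the CSA tree.

```python
def fold_mod_2n_minus_1(value, n):
    """Reduce a non-negative value modulo 2**n - 1 by summing n-bit chunks."""
    m = (1 << n) - 1
    while value > m:
        # The high chunk has weight 2**n, which is congruent to 1 modulo m,
        # so it can be added directly onto the low chunk.
        value = (value & m) + (value >> n)
    return 0 if value == m else value   # m itself is congruent to 0

n = 5                    # hypothetical channel width
m = (1 << n) - 1         # modulus 31
periodic = all(
    pow(2, i + n * j, m) == pow(2, i, m)
    for i in range(n) for j in range(4)
)
```

With n = 5, the value 62 folds to 30 + 1 = 31, which is then reported as 0, matching 62 mod 31.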

CML Design of Hardware Units: Reverse converter based on the New Chinese Remainder Theorem by Wang et al. Designed for … 37