Synthesis of Custom Processors based on Extensible Platforms
Fei Sun (+), Srivaths Ravi (++), Anand Raghunathan (++) and Niraj K. Jha (+)
(+) Dept. of Electrical Engineering, Princeton University
(++) NEC Laboratories America, Inc.
Outline
- SoC design constraints
- Background
  - Previous work in ASIP design
  - Xtensa platform
  - Manual custom instruction generation procedure
- Automatic custom instruction generation flow
- Experimental results
- Conclusions
SoC Design Constraints
- Time to market
- Cost
- Performance
- Power
- Cost-performance trade-off
- Flexibility
- ……
Comparison of Different Approaches
[Table: ASIC, ASIP and GPP rated on Time to market, Cost, Performance, Power, Cost-performance and Flexibility; per-cell ratings not preserved]
Legend: Very good; + Good; -- Very bad
Flexibility vs. Energy Efficiency
[Figure: energy efficiency (in MIPS/mW and MOPS/mW, with one range labelled 1-10 MIPS/mW) plotted against flexibility; points labelled ASIC, ASIP (Xtensa), Domain Specific Processor (DSP), General Embedded Processor and AMD-K6E]
Previous Work in ASIP Design
- ASIP architectures and overall design methodologies: [Huang, 1994], [Adams, 1996], [Fisher, 1999], [Kucukcakar, 1999]
- Application-specific instruction set selection: [Choi, 1999], [Gschwind, 1999], [Arnold, 1999]
- Low power ASIP design: [Kalambur, 1997], [Dougherty, 1999], [Ishihara, 2000], [Sami, 2001]
- Commercial offerings: Xtensa, ARCtangent, Jazz, SP-5flex, Carmel
Xtensa Architecture
[Figure: block diagram of the Xtensa processor. Blocks shown: Processor Controls; TRACE Port; JTAG Tap Control; On Chip Debug; Align and Decode; Coprocessor Register File; Coprocessor Execution Units; Window Register File; ALU & Address Generation; MAC 16; Designer Defined Instruction Execution Unit; Instruction Memory or Cache & Tags; Branch Logic & Instruction Fetch; Data Memory or Cache & Tags; Processor Interface; Write Buffer; Timers 1 to n; Special Function Register Access; Data Address Watch 0 to n; Instruction Address Watch 0 to n; Exception Support; Interrupt Control; Memory Protection Unit. Signals shown: Instruction, Data, Instruction Address, Data Address.]
Legend: Base ISA Feature; Configurable Function; Optional Function; Configurable & Optional Function; Extensible
Source: Xtensa Architecture
Xtensa Processor Design Flow
[Figure: design flow diagram. Inputs: Processor Configuration Inputs; Designer-Defined Instruction Descriptions; Configuration File. Generator outputs: Configured GNU C/C++ Compiler; Configured GNU Assembler/Disassembler; Configured Instruction Set Simulator/Emulator; Configured Processor HDL. Hardware path: Area, Power and Timing Estimation; Logic Synthesis (Synopsys or Ambit); Block Place/Route (Avant! or Cadence); Timing Verification; Hardware Profile; Optimized Hardware. Software path: Application Source Code and Sample Application Data; Application Specific Compile, Assemble, Link; Application Simulation with ISS and/or Emulator; Software Debugging/Profiling; Optimized Software.]
Legend: Generator Output; Internal Database; Design data; Use of Generated Data
Source:
Manual Custom Instruction Generation Procedure
Steps:
- Identify potential new instructions
- Describe custom instructions
- Insert custom instructions
- Verify functional correctness
Designer effort required:
- Profile, read source code
- Understand source code
- Rewrite source code
Result: slow and error-prone
Contributions of Our Work
- Automatic custom instruction selection: from an application program to an extensible processor with custom instructions
- Features
  - Efficient design space search
  - Uses accurate information from the instruction set simulator and synthesis
  - Bridges the gap between automatically synthesized and manually designed architectures
Automatic Custom Instruction Generation Flow
Example Illustration of Template Generation
Key Observations for Pruning
- The higher the weight of a template, the higher its potential for improvement (Amdahl's law)
- The scope for optimization is determined by computation: the number of cycles needed to execute the template
- The scope for optimization is also limited by the number of read/write ports: additional cycles are needed for the extra reads/writes of input/output variables
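The first observation can be made concrete with Amdahl's law itself. The sketch below is not taken from the slides; the function names `amdahl_speedup` and `amdahl_limit` are illustrative. It computes the overall speedup when a template with a given weight (fraction of original execution time) is accelerated by some local factor, and the bound reached as that factor grows without limit.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction `weight` of the original
 * execution time is accelerated by a factor `local_speedup`.
 * Names are illustrative, not from the slides. */
double amdahl_speedup(double weight, double local_speedup)
{
    assert(weight >= 0.0 && weight <= 1.0);
    assert(local_speedup > 0.0);
    return 1.0 / ((1.0 - weight) + weight / local_speedup);
}

/* Upper bound on the overall speedup as local_speedup grows without limit. */
double amdahl_limit(double weight)
{
    assert(weight >= 0.0 && weight < 1.0);
    return 1.0 / (1.0 - weight);
}
```

For example, a template with weight 0.1 can never yield more than about 1.11X overall, however fast the custom instruction is, which is why low-weight templates are natural candidates for pruning.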
Pruning Algorithm
Ranking criterion: a priority computed from
- OriginalTime: fraction of the total execution time of the original program spent in the template (weight)
- In, Out: number of inputs and outputs of the template, respectively
- α, β: number of inputs/outputs encoded in the instruction
- γ: number of cycles needed for executing the template
Higher priority means greater potential for speed-up
Template Generation with Pruning
[Figure: ranked pool of seed templates; highest priority; threshold: 0.1; template set]
No. of Templates vs. Threshold Ratio
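The pruning step might be sketched as follows. This is a minimal illustration under a loudly stated assumption: the slides give a threshold of 0.1 and call it a ratio, and here it is read as "drop every template whose priority is below ratio times the highest priority in the pool". That reading, the `template_entry` record and the `prune_templates` function are hypothetical, and the priorities are assumed to be non-negative values already computed by the ranking criterion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one seed template; field names are illustrative. */
struct template_entry {
    int id;
    double priority;   /* value of the ranking criterion, computed elsewhere */
};

/* Keep the templates whose priority is at least ratio * (highest priority).
 * ASSUMPTION: this interpretation of the threshold ratio is not confirmed
 * by the slides. Priorities are assumed to be non-negative.
 * Survivors are compacted to the front of `pool` in their original order;
 * the return value is the number of survivors. */
size_t prune_templates(struct template_entry *pool, size_t n, double ratio)
{
    assert(ratio >= 0.0 && ratio <= 1.0);
    if (n == 0)
        return 0;

    /* Find the highest priority in the pool. */
    double best = pool[0].priority;
    for (size_t i = 1; i < n; i++)
        if (pool[i].priority > best)
            best = pool[i].priority;

    /* Compact the surviving templates in place. */
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (pool[i].priority >= ratio * best)
            pool[kept++] = pool[i];
    return kept;
}
```

Under this reading, raising the ratio shrinks the surviving pool, which matches the general shape one would expect from a plot of the number of templates against the threshold ratio.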
Automatic Custom Instruction Generation Flow
Automatic Custom Instruction Generation Flow (Contd.)
Custom Instruction Insertion
- Custom instructions must be inserted at appropriate places without affecting the program's functional correctness
- If a custom instruction needs extra inputs (outputs), appropriate variables must be selected to write to (read from) user-defined registers
Example Illustration of Custom Instruction Insertion
Example Illustration of Custom Instruction Insertion (Contd.)
(a) Original code:
    ....
    offset = t + 1;
    for (i=0; i<100; i++) {
        j = ....
        result = offset + i * j;
    }
    ....
(b) After custom instruction insertion:
    ....
    offset = t + 1;
    WUR(offset,0);
    for (i=0; i<100; i++) {
        j = ....
        result = CustomInstr(i,j);
    }
    ....
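The equivalence of versions (a) and (b) can be checked in plain software. The sketch below is only an emulation: `WUR` and `CustomInstr` are ordinary C stand-ins for the hardware operations named on the slide, the four-entry `user_reg` array is an assumed model of the user-defined registers, and `next_j` is a hypothetical placeholder for the elided computation of `j`.

```c
#include <assert.h>

/* Assumed software model of the user-defined registers. */
static int user_reg[4];

/* Stand-in for WUR(value, index): write a user-defined register. */
static void WUR(int value, int index)
{
    assert(index >= 0 && index < 4);
    user_reg[index] = value;
}

/* Stand-in for the custom instruction: the third operand (offset) is
 * read from user-defined register 0 instead of a general register. */
static int CustomInstr(int i, int j)
{
    return user_reg[0] + i * j;
}

/* Hypothetical placeholder for the elided computation of j. */
static int next_j(int i)
{
    return 3 * i + 1;
}

/* Version (a): the original loop. */
int original_loop(int t)
{
    int result = 0;
    int offset = t + 1;
    for (int i = 0; i < 100; i++) {
        int j = next_j(i);
        result = offset + i * j;
    }
    return result;
}

/* Version (b): offset is written once to a user-defined register,
 * then the custom instruction replaces the add/multiply pair. */
int rewritten_loop(int t)
{
    int result = 0;
    int offset = t + 1;
    WUR(offset, 0);
    for (int i = 0; i < 100; i++) {
        int j = next_j(i);
        result = CustomInstr(i, j);
    }
    return result;
}
```

The point of the emulation is the placement of `WUR`: it has to execute after `offset` is computed and before the first `CustomInstr`, otherwise the custom instruction reads a stale register value and the rewritten program is no longer functionally equivalent.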
Custom Instruction Combination Selection --- Problem Statement
Given a set of non-overlapping custom instructions, each available in several versions, select one version of each instruction such that performance is maximized while the area stays under a given threshold
Custom Instruction Combination Selection --- Flow Chart
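The problem statement has the shape of a multiple-choice selection under a budget. The sketch below is an exhaustive search written for illustration only; it is not the flow used by the authors. It assumes that performance can be modelled as a sum of non-negative per-instruction cycle savings and that areas simply add up; the `version` and `instr` types and the `select_versions` function are hypothetical.

```c
#include <assert.h>

#define MAX_INSTR    8   /* illustrative limits, not from the slides */
#define MAX_VERSIONS 4

/* One implementation variant of a custom instruction (assumed model). */
struct version {
    int area;          /* area cost of this version */
    int cycles_saved;  /* non-negative cycle savings if it is used */
};

struct instr {
    int n_versions;
    struct version v[MAX_VERSIONS];
};

/* Choose exactly one version per instruction so that the total cycles
 * saved is maximal while the total area does not exceed area_budget.
 * On success, choice[i] holds the version index chosen for instruction i
 * and the best total is returned; -1 means no feasible combination. */
int select_versions(const struct instr *set, int n, int area_budget,
                    int *choice)
{
    assert(n >= 0 && n <= MAX_INSTR);
    if (n == 0)
        return 0;

    int best = -1;
    int tail[MAX_INSTR];
    for (int k = 0; k < set[0].n_versions; k++) {
        int area = set[0].v[k].area;
        if (area > area_budget)
            continue;   /* this version alone breaks the budget */

        /* Solve the remaining instructions with the leftover area. */
        int rest = select_versions(set + 1, n - 1, area_budget - area, tail);
        if (rest < 0)
            continue;

        int total = set[0].v[k].cycles_saved + rest;
        if (total > best) {
            best = total;
            choice[0] = k;
            for (int i = 0; i < n - 1; i++)
                choice[i + 1] = tail[i];
        }
    }
    return best;
}
```

Exhaustive search is exponential in the number of instructions, so it only serves to pin down the objective and the constraint; a practical flow would need a cheaper search, and per-version area and cycle figures would come from synthesis and simulation rather than from fixed constants.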
Experimental Methodology
[Figure: tool flow from the C program to the measured metrics. Blocks shown: C Program; Automatic Custom Instruction Generation (Aristotle, Xtensa TIE Compiler, Synopsys Design Compiler, Xtensa GNU Profiler); TIE; Modified C program; Tensilica Processor Generator; Custom Processor (HDL Description); Synopsys Design Compiler with NEC CB11; Cross Compiler; ISS; Sente Wattwatcher. Measured outputs: Area, Clock Period, Execution Cycles, Power.]
Experimental Results (Contd.)
Average:
- Performance improvement: 3.4X
- Energy reduction: 3.2X
- Energy*delay reduction: 12.6X
- Area increase: 1.8%
Conclusions
- Automatic custom instruction synthesis for ASIPs
  - Template generation/selection
  - Custom instruction insertion
  - Custom instruction combination selection
- Experimental results
  - 3.4X average performance improvement
  - 12.6X average energy*delay reduction