1
Synthesis of Custom Processors based on Extensible Platforms
Fei Sun+, Srivaths Ravi++, Anand Raghunathan++ and Niraj K. Jha+
+: Dept. of Electrical Engineering, Princeton University
++: NEC Laboratories America, Inc.
2
Outline
- SoC design constraints
- Background
  - Previous work in ASIP design
  - Xtensa platform
- Manual custom instruction generation procedure
- Automatic custom instruction generation flow
- Experimental results
- Conclusions
3
SoC Design Constraints
- Time to market
- Cost
- Performance
- Power
- Cost-performance trade-off
- Flexibility
- ……
4
Comparison of Different Approaches
                    ASIC   ASIP   GPP
Time to market      --     +      ++
Cost                ++     +      --
Performance         ++     +      --
Power               ++     +      --
Cost-performance    ++     +      --
Flexibility         --     +      ++
Legend: ++ very good, + good, -- very bad
5
Flexibility vs. Energy Efficiency
[Figure: design styles plotted by flexibility against energy efficiency]
Axes: Flexibility, Energy Efficiency
Design styles shown: General Embedded Processor, Domain Specific Processor (DSP), Domain Specific Processor (AMD-K6E), ASIP (Xtensa), ASIC
Energy efficiency ranges shown: 0.1-1 MIPS/mW, 1-10 MIPS/mW, 50-100 MIPS/mW, 500-1000 MOPS/mW
6
Previous Work in ASIP Design
- ASIP architectures and overall design methodologies: [Huang, 1994], [Adams, 1996], [Fisher, 1999], [Kucukcakar, 1999]
- Application-specific instruction set selection: [Choi, 1999], [Gschwind, 1999], [Arnold, 1999]
- Low power ASIP design: [Kalambur, 1997], [Dougherty, 1999], [Ishihara, 2000], [Sami, 2001]
- Commercial offerings: Xtensa, ARCtangent, Jazz, SP-5flex, Carmel
7
Xtensa Architecture
[Figure: block diagram of the Xtensa processor. Source: www.tensilica.com]
Legend: Base ISA Feature, Configurable Function, Optional Function, Configurable & Optional Function, Extensible
Blocks: Processor Controls; TRACE Port; JTAG Tap Control; On Chip Debug; Align and Decode; Coprocessor Register File; Coprocessor Execution Units; Window Register File; ALU & Address Generation; MAC 16; Designer Defined Instruction Execution Unit; Instruction Memory or Cache & Tags; Branch Logic & Instruction Fetch; Data Memory or Cache & Tags; Processor Interface; Write Buffer; Timers 1 to n; Special Function Register Access; Data Address Watch 0 to n; Instruction Address Watch 0 to n; Exception Support; Interrupt Control; Memory Protection Unit
Signal labels: Instruction, Data, Instruction Address, Data Address
8
Xtensa Processor Design Flow
[Figure: flow diagram. Source: www.tensilica.com]
Legend: Generator Output, Internal Database, Design data, Use of Generated Data
Blocks, in the order they appear:
- Processor Configuration Inputs
- Designer-Defined Instruction Descriptions
- Configuration File
- Configured GNU C/C++ Compiler
- Configured GNU Assembler/Disassembler
- Configured Instruction Set Simulator/Emulator
- Configured Processor HDL
- Area, Power and Timing Estimation
- Logic Synthesis (Synopsys or Ambit)
- Block Place/Route (Avant! or Cadence)
- Timing Verification
- Hardware Profile
- Application Specific Compile, Assemble, Link
- Application Simulation with ISS and/or Emulator
- Software Debugging/Profiling
- Application Source Code
- Sample Application Data
- Optimized Software
- Optimized Hardware
9
Manual Custom Instruction Generation Procedure
- Identify potential new instructions
- Describe custom instructions
- Insert custom instructions
- Verify functional correctness
Manual effort involved: profile and read source code; understand source code; rewrite source code
Slow and error-prone
10
Contributions of Our Work
- Automatic custom instruction selection
  - From an application program to an extensible processor with custom instructions
- Features
  - Efficient design space search
  - Use of accurate information from the instruction set simulator and synthesis
  - Bridges the gap between automatically synthesized and manually designed architectures
11
Automatic Custom Instruction Generation Flow
13
Example Illustration of Template Generation
18
Key Observations for Pruning
- The higher the weight of a template, the higher its potential for improvement (Amdahl's law)
- Scope for optimization is determined by computation: the number of cycles needed to execute the template
- Scope for optimization is also determined by the read/write port limitation: additional cycles are needed for extra reading/writing of input/output variables
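The first observation can be made concrete with Amdahl's law. The sketch below is a generic illustration, not the ranking criterion from the slides: `amdahl_speedup` and the parameter `local_speedup` (how much faster the template itself runs as a custom instruction) are names assumed here.

```c
#include <assert.h>

/* Overall program speedup by Amdahl's law.
 * weight:        fraction of the original execution time spent in the
 *                template, between 0 and 1.
 * local_speedup: assumed speedup of the template alone (hypothetical
 *                parameter, must be positive). */
double amdahl_speedup(double weight, double local_speedup)
{
    assert(weight >= 0.0 && weight <= 1.0);
    assert(local_speedup > 0.0);
    /* Time outside the template is unchanged; time inside shrinks. */
    return 1.0 / ((1.0 - weight) + weight / local_speedup);
}
```

For a template with weight 0.10, the overall speedup cannot exceed 1 / 0.9, roughly 1.11, however fast the custom instruction is, which is why low-weight templates are natural candidates for pruning.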
19
Pruning Algorithm
Ranking criterion: a priority computed from the following quantities
- OriginalTime: fraction of the total execution time of the original program spent in the template (weight)
- In, Out: number of inputs and outputs of the template, respectively
- α, β: number of inputs and outputs, respectively, encoded in the instruction
- γ: number of cycles needed to execute the template
Higher priority means greater potential for speedup
20
Template Generation with Pruning
[Figure: animation over four slides showing a ranked pool of seed templates and the resulting template set]
Threshold: 0.1
Priorities in the ranked pool of seed templates: 12.73, 10.51, 7.92, 5.36, 4.05, 2.13
Priorities of newly generated templates: 1.18, 16.35
Highest priority: 12.73 at the start, 16.35 in the final frame
Template set in the final frame: 16.35, 12.73, 10.51, 7.92, 5.36, 4.05, 2.13 (1.18 no longer appears)
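One reading of these slides is that the threshold is a ratio relative to the current highest priority. With a threshold of 0.1 and a highest priority of 12.73, the cutoff is 1.273, which would explain why 1.18 is dropped while 16.35 is kept and becomes the new highest priority. The sketch below encodes that reading only. The rule itself, the `TemplatePool` type, and the function names are assumptions, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TEMPLATES 64   /* arbitrary capacity for this sketch */

/* Hypothetical pool that records only the priority of each template. */
typedef struct {
    double priority[MAX_TEMPLATES];
    size_t count;
    double highest;        /* highest priority admitted so far */
} TemplatePool;

void pool_init(TemplatePool *pool)
{
    pool->count = 0;
    pool->highest = 0.0;
}

/* Offers a candidate template to the pool.
 * Returns 1 if it is kept, 0 if it is pruned.
 * Assumed rule: prune when priority < threshold_ratio * highest. */
int pool_offer(TemplatePool *pool, double priority, double threshold_ratio)
{
    if (priority < threshold_ratio * pool->highest) {
        return 0;
    }
    assert(pool->count < MAX_TEMPLATES);
    pool->priority[pool->count++] = priority;
    if (priority > pool->highest) {
        pool->highest = priority;
    }
    return 1;
}
```

Feeding the six seed priorities and then 1.18 and 16.35 through `pool_offer` with a ratio of 0.1 reproduces the final frame: seven templates kept and a highest priority of 16.35. Whether templates already in the pool are re-checked when the highest priority rises is not visible in the slides, so this sketch does not do it.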
24
No. of Templates vs. Threshold Ratio
25
Automatic Custom Instruction Generation Flow
26
Automatic Custom Instruction Generation Flow (Contd.)
28
Custom Instruction Insertion
- Care must be taken to insert custom instructions at the appropriate places without affecting the program's functional correctness
- If custom instructions need extra inputs (outputs), care must be taken to select appropriate variables to write to (read from) user-defined registers
29
Example Illustration of Custom Instruction Insertion
30
Example Illustration of Custom Instruction Insertion (Contd.)
(a)
    ....
    offset = t + 1;
    for (i=0; i<100; i++) {
        j = ....
        result = offset + i * j;
    }
    ....
(b)
    ....
    offset = t + 1;
    WUR(offset,0);
    for (i=0; i<100; i++) {
        j = ....
        result = CustomInstr(i,j);
    }
    ....
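The transformation from (a) to (b) can be checked on a host machine by emulating the custom instruction in plain C. The sketch below is only an illustration: `user_reg0` stands in for the user-defined register, `WUR` and `CustomInstr` are software stand-ins for the hardware operations, and `j = 2 * i + 1` is a made-up replacement for the computation the slide elides.

```c
#include <assert.h>

/* Stand-in for user-defined register 0. On the real processor this
 * state lives in hardware, not in a C variable. */
static int user_reg0;

/* Emulates WUR(value, index) for register index 0 only. */
static void WUR(int value, int index)
{
    assert(index == 0);
    user_reg0 = value;
}

/* Emulates the custom instruction: user register + i * j. */
static int CustomInstr(int i, int j)
{
    return user_reg0 + i * j;
}

/* Version (a): original loop. Returns the sum of all results so that
 * every iteration contributes to the comparison. */
long run_original(int t)
{
    long sum = 0;
    int offset = t + 1;
    for (int i = 0; i < 100; i++) {
        int j = 2 * i + 1;            /* hypothetical stand-in for "j = ...." */
        int result = offset + i * j;
        sum += result;
    }
    return sum;
}

/* Version (b): offset is written to the user register once, before the
 * loop, and the loop body uses the custom instruction. */
long run_with_custom_instr(int t)
{
    long sum = 0;
    int offset = t + 1;
    WUR(offset, 0);
    for (int i = 0; i < 100; i++) {
        int j = 2 * i + 1;            /* same stand-in as in version (a) */
        int result = CustomInstr(i, j);
        sum += result;
    }
    return sum;
}
```

Comparing `run_original(t)` with `run_with_custom_instr(t)` for a few values of `t` gives a quick functional check. It also shows why the register write must come before the loop: the custom instruction only takes `i` and `j` as operands, so `offset` has to be in the user register already.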
31
Automatic Custom Instruction Generation Flow
32
Custom Instruction Combination Selection: Problem Statement
Given a set of non-overlapping custom instructions, each available in several versions, select one version of each instruction such that performance is maximized while the total area stays under a given threshold
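Stated this way, the problem resembles a multiple-choice knapsack. The sketch below solves a simplified version with dynamic programming over integer area units. It is a generic textbook technique, not the selection procedure from the slides, and it assumes that per-instruction gains simply add up. The `Version` and `Instruction` types, the `gain` metric, and `MAX_AREA` are all hypothetical.

```c
#include <assert.h>

#define MAX_AREA 1024   /* hypothetical upper bound on the area budget */

typedef struct {
    int area;   /* area cost of this version, arbitrary integer units */
    int gain;   /* cycles saved by this version (assumed additive) */
} Version;

typedef struct {
    const Version *versions;
    int num_versions;
} Instruction;

/* Chooses exactly one version per instruction so that the total area
 * is at most budget. Returns the best total gain, or -1 if no
 * combination fits. */
int best_total_gain(const Instruction *instrs, int num_instrs, int budget)
{
    int best[MAX_AREA + 1];   /* best[a]: max gain with area <= a, -1 if none */
    int next[MAX_AREA + 1];

    assert(budget >= 0 && budget <= MAX_AREA);
    for (int a = 0; a <= budget; a++) {
        best[a] = 0;          /* no instructions chosen yet: gain 0 */
    }
    for (int k = 0; k < num_instrs; k++) {
        for (int a = 0; a <= budget; a++) {
            next[a] = -1;
            for (int v = 0; v < instrs[k].num_versions; v++) {
                const Version *ver = &instrs[k].versions[v];
                assert(ver->area >= 0 && ver->gain >= 0);
                if (ver->area <= a && best[a - ver->area] >= 0) {
                    int g = best[a - ver->area] + ver->gain;
                    if (g > next[a]) {
                        next[a] = g;
                    }
                }
            }
        }
        for (int a = 0; a <= budget; a++) {
            best[a] = next[a];
        }
    }
    return best[budget];
}
```

Leaving an instruction out can be modeled by giving it a version with zero area and zero gain. In practice area and performance would come from synthesis and simulation rather than fixed integers, so this is only a way to see the shape of the search space.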
33
Custom Instruction Combination Selection: Flow Chart
34
Automatic Custom Instruction Generation Flow
35
Experimental Methodology
[Figure: experimental flow diagram]
Blocks, in the order they appear:
- C Program
- Automatic Custom Instruction Generation
- Aristotle
- Xtensa TIE Compiler
- Synopsys Design Compiler
- Xtensa GNU Profiler
- Custom Processor (HDL Description)
- NEC CB11
- TIE
- Tensilica Processor Generator
- Synopsys Design Compiler
- Modified C program
- Cross Compiler
- ISS
- Sente Wattwatcher
Reported metrics: Area, Clock Period, Execution Cycles, Power
36
Experimental Results (Contd.)
Average
- Performance improvement: 3.4X
- Energy reduction: 3.2X
- Energy*delay reduction: 12.6X
- Area increase: 1.8%
37
Conclusions
- Automatic custom instruction synthesis for ASIPs
  - Template generation/selection
  - Custom instruction insertion
  - Custom instruction combination selection
- Experimental results
  - 3.4X average performance improvement
  - 12.6X average energy*delay reduction