

[Figure: SPREE flow. Blocks: SPREE RTL Generator, Verilog, SPREE Benchmarks, RTL Simulator, RTL CAD Flow. Results: 1. Correctness, 2. Cycle count, 3. Area, 4. Frequency, 5. Power.]

1. The Purpose of SPREE

Processors implemented on a programmable fabric are referred to as soft processors. Soft processors are already widely deployed (Altera's Nios, Xilinx's MicroBlaze), so their architectures have become important. Our goal is to investigate the architecture of soft processors and develop an FPGA-specific understanding of processor architecture. To do so, we have developed SPREE (Soft Processor Rapid Exploration Environment), which, through RTL generation, can produce accurate area, clock frequency, power, and cycle count measurements from a textual description of a processor.

3. Architecture Description

The input to SPREE is a textual description in the form of C++ code. The description provides:

a) List of Components
Components are implemented in Verilog by the user and imported into the SPREE Component Library. This allows FPGA-specific implementations of the components, which ensures efficient synthesis. When building a processor, the user specifies which components from the library are desired.

b) The Datapath Wiring
The processor is defined primarily by its datapath. Users control the flow of data through the components and insert pipe registers where desired. The number of stages (pipelined) or the number of cycles per instruction (unpipelined) is inferred from the cycle latency of the described datapath.

c) Hazard Detection and Forwarding (Pipelined)
These are implemented by the user as part of the datapath. Currently, this must be done manually to ensure proper execution.
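The stage inference described under Datapath Wiring can be pictured as a longest-path computation over the component graph. The sketch below is only an illustration under stated assumptions: the names `Node`, `Datapath`, `latencyFrom`, `inferStages`, and `exampleDatapath`, the `pipe_reg_*` components, and all latency values are hypothetical and are not SPREE's actual classes or data. It also assumes an acyclic graph, so feedback paths such as register write-back are left out, and it covers only the pipelined case, where the stage count is one more than the number of register boundaries on the longest path.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical datapath model for illustration; these are NOT SPREE's classes.
struct Node {
    int latency;                      // cycles added by this component (0 = combinational)
    std::vector<std::string> fanout;  // names of the components this one drives
};

using Datapath = std::map<std::string, Node>;

// Largest accumulated latency on any path starting at `name`, memoized.
// Assumption: the graph is acyclic (no write-back or forwarding loops).
int latencyFrom(const Datapath &dp, const std::string &name,
                std::map<std::string, int> &memo) {
    auto hit = memo.find(name);
    if (hit != memo.end()) return hit->second;
    const Node &node = dp.at(name);
    int downstream = 0;
    for (const std::string &next : node.fanout)
        downstream = std::max(downstream, latencyFrom(dp, next, memo));
    return memo[name] = node.latency + downstream;
}

// Pipeline stages = register boundaries on the longest path from `source`, plus one.
int inferStages(const Datapath &dp, const std::string &source) {
    std::map<std::string, int> memo;
    return 1 + latencyFrom(dp, source, memo);
}

// Small example: ifetch -> pipe_reg_1 -> reg_file -> {addersub, shifter} -> pipe_reg_2.
// Latencies are made up; only the pipe registers add a cycle here.
Datapath exampleDatapath() {
    Datapath dp;
    dp["ifetch"]     = {0, {"pipe_reg_1"}};
    dp["pipe_reg_1"] = {1, {"reg_file"}};
    dp["reg_file"]   = {0, {"addersub", "shifter"}};
    dp["addersub"]   = {0, {"pipe_reg_2"}};
    dp["shifter"]    = {0, {"pipe_reg_2"}};
    dp["pipe_reg_2"] = {1, {}};
    return dp;
}
```

In this example two pipe registers lie on the longest path, so `inferStages(exampleDatapath(), "ifetch")` evaluates to 3.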
Example architecture description (excerpt):

    /****************** Component List *******************/
    RTLComponent *addersub = new RTLComponent("addersub");
    RTLComponent *logic_unit = new RTLComponent("logic_unit");
    RTLComponent *shifter = new RTLComponent("shifter", "barrel");
    RTLComponent *mul = new RTLComponent("mul");
    RTLComponent *reg_file = new RTLComponent("reg_file");
    RTLComponent *ifetch = new RTLComponent("ifetch");
    RTLComponent *branchresolve = new RTLComponent("branchresolve");

    /*************** Datapath Wiring ****************/
    // RS Fanout
    addConnection(reg_file, "a_readdata", addersub, "opA");
    addConnection(reg_file, "a_readdata", logic_unit, "opA");
    addConnection(reg_file, "a_readdata", shifter, "sa");
    addConnection(reg_file, "a_readdata", mul, "opA");
    addConnection(reg_file, "a_readdata", branchresolve, "rs");
    ...

    /*********************** Hazard detection *******************/
    HazardDetector *rs_haz = new HazardDetector(rs_reg, "q", dst1, "q");
    HazardDetector *rt_haz = new HazardDetector(rt_reg, "q", dst1, "q");
    stallOnHazard(rs_haz, 1);
    stallOnHazard(rt_haz, 1);

Peter Yiannacouras, Jonathan Rose, Gregory Steffan

2. SPREE Overview

The entire SPREE system consists of everything needed to extract measurements from the input processor architecture description. This includes the RTL Generator, benchmarks, RTL Simulator, RTL CAD system, and accompanying scripts for using each. Not shown are a compiler infrastructure (a GCC cross-compiler) for benchmarking and an instruction set simulator for verification. The core of SPREE is the automatic RTL Generator, which produces synthesizable Verilog HDL code from the input processor architecture description. Using the generator, one can quickly transform an architectural idea into a real implementation. Because the implementation is intended to stay on an FPGA, all measurements can be made directly from the RTL description. Synthesis of the HDL can produce accurate area, clock frequency, and power measurements (see below).
In addition, RTL simulation can be used to extract cycle count as well as cycle-by-cycle behaviour. Thus, one can quickly and fully understand the costs and benefits of any architectural modification. With this ability, one can perform focussed studies on many architectural ideas, including: different component implementations, resource sharing, pipeline depth, pipeline organization, forwarding/bypass logic, HW/SW codesign evaluation, ISA changes, and application-specific customizations.

4. SPREE RTL Generator

The SPREE RTL Generator performs a number of operations needed to turn the architecture description input into synthesizable Verilog. These are listed below:

i. Datapath vs ISA verification
The datapath is checked to make sure it has all the functionality required to execute the ISA (currently fixed to be a subset of MIPS I), and that the wiring between components is consistent with the flow of data implied by each instruction.

ii. Removal of unused connections/components
After verification, any connections or components that are not used by any instruction are removed. This allows for ISA subsetting, where one can reduce the processor by restricting what fraction of the ISA is used.

iii. Datapath Analysis (timing)
The datapath is examined and the generator determines the stages of each component. Further checks are made for structural hazards and other illegal configurations.

iv. Control Generation
Either pipelined or unpipelined control is generated, which ensures that components are enabled only when their data is ready and allows components to be stalled.

5. SPREE RTL CAD Flow

SPREE uses Altera's Quartus II 4.2 CAD software and targets a Stratix device. Accuracy is ensured through seed sweeping and management of optimization settings. In addition, SPREE's benchmarks are used to profile switching frequencies for power analysis.

6. SPREE Benchmarks

SPREE includes a suite of 20 benchmarks which can be executed immediately on any processor generated by the SPREE RTL Generator. The benchmarks include Dhrystone 2.1, eight benchmarks from the MiBench suite, and 11 others from various academic sources. The benchmarks are mostly of an embedded nature and are stripped of any file I/O and command-line arguments. In addition, the benchmarks are reduced in size to prevent excessively long simulations.

Results

Though SPREE is still in its infancy, we have used it to generate several different processor architectures. We have measured the area and the mean wall-clock time of each architecture over our benchmarks and plotted the results.

[Figure: area versus mean wall-clock time for the SPREE-generated processors and two versions of Altera's Nios II processor. Plot annotation: automatically generated design 27% smaller and 6% slower than NiosIIs.]

The plot also includes two versions of Altera's Nios II processor. Even these few designs span much of the space between the Nios II designs. Moreover, the circled point shows that SPREE is capable of competing with industrial designs.

As a design space exploration tool, SPREE has many uses. To illustrate one such use, we have used SPREE to perform a focussed study on the shifter implementation. We experimented with a serial shifter, a barrel shifter, and a shifter implemented in the dedicated multipliers of an FPGA. We found that tradeoffs exist among the three, but there is a clear win if one dedicated multiplier is shared between shifting and multiplication.

Expanded Legend

Series: SPREE Unpipelined Processors, SPREE Pipelined Processors, Industry.

Other than the number of stages, the only difference lies in the shifter implementation. The following shifters were used:

Barrelshifter: Single-cycle shifter implemented in programmable fabric.

Serialshifter: One-cycle-per-bit shifter implemented in programmable fabric. The _with_mem_align version shares the same shifter for memory-aligning operations.
Shifter_in_Multiplier: Single-cycle shifter implemented in dedicated multipliers. The +stall suffix indicates a 2-cycle shift which stalls the pipeline in order to fit in a single stage.
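The Shifter_in_Multiplier option rests on the identity that shifting left by s bits equals multiplying by 2^s. A minimal sketch of that arithmetic for 32-bit operands is given here. The function names are hypothetical, and the code models only the arithmetic; it does not describe SPREE's Verilog component, its timing, or the multiplier widths available on a particular FPGA.

```cpp
#include <cstdint>

// Left shift by s (0..31): multiply by 2^s and keep the low 32 bits of the product.
std::uint32_t shiftLeftViaMul(std::uint32_t x, unsigned s) {
    std::uint64_t product = static_cast<std::uint64_t>(x) * (std::uint64_t{1} << s);
    return static_cast<std::uint32_t>(product);
}

// Logical right shift by s (0..31): multiply by 2^(32 - s) and keep the
// high 32 bits of the 64-bit product.
std::uint32_t shiftRightLogicalViaMul(std::uint32_t x, unsigned s) {
    std::uint64_t product = static_cast<std::uint64_t>(x) * (std::uint64_t{1} << (32 - s));
    return static_cast<std::uint32_t>(product >> 32);
}
```

Both directions use the same multiply and differ only in the power of two chosen and in which half of the product is kept, which is one way to see why a single multiplier can serve both shifting and multiplication.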