1 Kadlec MAPLD05/P149 Floating Point Controller as PicoBlaze Network on Single Spartan 3 FPGA Jiri Kadlec 1, Roger Gook 2 1 Institute of Information Theory.

Slides:



Advertisements
Similar presentations
What are FPGA Power Management HDL Coding Techniques Xilinx Training.
Advertisements

© 2003 Xilinx, Inc. All Rights Reserved Course Wrap Up DSP Design Flow.
Philips Research ICS 252 class, February 3, The Trimedia CPU64 VLIW Media Processor Kees Vissers Philips Research Visiting Industrial Fellow
Lecture 16 PicoBlaze Instruction Set & Assembler Directives
An Integrated Reduction Technique for a Double Precision Accumulator Krishna Nagar, Yan Zhang, Jason Bakos Dept. of Computer Science and Engineering University.
Spartan-3 FPGA HDL Coding Techniques
Sumitha Ajith Saicharan Bandarupalli Mahesh Borgaonkar.
EXTERNAL COMMUNICATIONS DESIGNING AN EXTERNAL 3 BYTE INTERFACE Mark Neil - Microprocessor Course 1 External Memory & I/O.
Jared Casper, Ronny Krashinsky, Christopher Batten, Krste Asanović MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA A Parameterizable.
COMP3221 lec31-mem-bus-I.1 Saeid Nooshabadi COMP 3221 Microprocessors and Embedded Systems Lectures 31: Memory and Bus Organisation - I
Term Project Overview Yong Wang. Introduction Goal –familiarize with the design and implementation of a simple pipelined RISC processor What to do –Build.
Aug. 24, 2007ELEC 5200/6200 Project1 Computer Design Project ELEC 5200/6200-Computer Architecture and Design Fall 2007 Vishwani D. Agrawal James J.Danaher.
Behavioral Design Outline –Design Specification –Behavioral Design –Behavioral Specification –Hardware Description Languages –Behavioral Simulation –Behavioral.
The Spartan 3e FPGA. CS/EE 3710 The Spartan 3e FPGA  What’s inside the chip? How does it implement random logic? What other features can you use?  What.
Configurable System-on-Chip: Xilinx EDK
FPGA BASED IMAGE PROCESSING Texas A&M University / Prairie View A&M University Over the past few decades, the improvements from machine language to objected.
Performance Analysis of Processor Midterm Presentation Performed by : Winter 2005 Alexei Iolin Alexander Faingersh Instructor: Evgeny.
Final Presentation Neural Network Implementation On FPGA Supervisor: Chen Koren Maria Nemets Maxim Zavodchik
Characterization Presentation Neural Network Implementation On FPGA Supervisor: Chen Koren Maria Nemets Maxim Zavodchik
Reconfigurable Computing in the Undergraduate Curriculum Jason D. Bakos Dept. of Computer Science and Engineering University of South Carolina.
ECE Department: University of Massachusetts, Amherst Lab 1: Introduction to NIOS II Hardware Development.
Controllers-system for APS – CubeSat nano-satellite Instructor: Daniel Alkalay Students: Moshe Emmer & Meir Harar Technion – Israel Institute of Technology.
Implementation of DSP Algorithm on SoC. Mid-Semester Presentation Student : Einat Tevel Supervisor : Isaschar Walter Accompaning engineer : Emilia Burlak.
Technion – Israel Institute of Technology Department of Electrical Engineering High Speed Digital Systems Lab Mid-Term Presentation Performed by: Roni.
Lecture 7 Lecture 7: Hardware/Software Systems on the XUP Board ECE 412: Microcomputer Laboratory.
Delevopment Tools Beyond HDL
DSP Development System
Ross Brennan On the Introduction of Reconfigurable Hardware into Computer Architecture Education Ross Brennan
Technology Date 10/17/00, Page 1 Technology s PROFIBUS Technology Chips - Modules - Development Kits.
DOP - A CPU CORE FOR TEACHING BASICS OF COMPUTER ARCHITECTURE Miloš Bečvář, Alois Pluháček and Jiří Daněček Department of Computer Science and Engineering.
Student : Andrey Kuyel Supervised by Mony Orbach Spring 2011 Final Presentation High speed digital systems laboratory High-Throughput FFT Technion - Israel.
Matrix Multiplication on FPGA Final presentation One semester – winter 2014/15 By : Dana Abergel and Alex Fonariov Supervisor : Mony Orbach High Speed.
1 WORLD CLASS – through people, technology and dedication High level modem development for Radio Link INF3430/4431 H2013.
Implementation of MAC Assisted CORDIC engine on FPGA EE382N-4 Abhik Bhattacharya Mrinal Deo Raghunandan K R Samir Dutt.
Research on Reconfigurable Computing Using Impulse C Carmen Li Shen Mentor: Dr. Russell Duren February 1, 2008.
HW/SW PARTITIONING OF FLOATING POINT SOFTWARE APPLICATIONS TO FIXED - POINTED COPROCESSOR CIRCUITS - Nalini Kumar Gaurav Chitroda Komal Kasat.
FPGA (Field Programmable Gate Array): CLBs, Slices, and LUTs Each configurable logic block (CLB) in Spartan-6 FPGAs consists of two slices, arranged side-by-side.
VHDL Project Specification Naser Mohammadzadeh. Schedule  due date: Tir 18 th 2.
Panzer Division bit Central Processing Unit.
1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.
Chapter 2 Summary Classification of architectures Features that are relatively independent of instruction sets “Different” Processors –DSP and media processors.
ECE 448 – FPGA and ASIC Design with VHDL Lecture 12 PicoBlaze Overview.
NIOS II Ethernet Communication Final Presentation
Introduction to FPGA Created & Presented By Ali Masoudi For Advanced Digital Communication Lab (ADC-Lab) At Isfahan University Of technology (IUT) Department.
Infrastructure design & implementation of MIPS processors for students lab based on Bluespec HDL Students: Danny Hofshi, Shai Shachrur Supervisor: Mony.
Timing and Constraints “The software is the lens through which the user views the FPGA.” -Bill Carter.
A Monte Carlo Simulation Accelerator using FPGA Devices Final Year project : LHW0304 Ng Kin Fung && Ng Kwok Tung Supervisor : Professor LEONG, Heng Wai.
Tools - LogiBLOX - Chapter 5 slide 1 FPGA Tools Course The LogiBLOX GUI and the Core Generator LogiBLOX L BX.
© 2003 Xilinx, Inc. All Rights Reserved Course Wrap Up DSP Design Flow.
Part A Final Dor Obstbaum Kami Elbaz Advisor: Moshe Porian August 2012 FPGA S ETTING U SING F LASH.
November 29, 2011 Final Presentation. Team Members Troy Huguet Computer Engineer Post-Route Testing Parker Jacobs Computer Engineer Post-Route Testing.
Introduction to Field Programmable Gate Arrays Lecture 1/3 CERN Accelerator School on Digital Signal Processing Sigtuna, Sweden, 31 May – 9 June 2007 Javier.
CDA 4253 FPGA System Design The PicoBlaze Microcontroller
Content Project Goals. Workflow Background. System configuration. Working environment. System simulation. System synthesis. Benchmark. Multicore.
DDRIII BASED GENERAL PURPOSE FIFO ON VIRTEX-6 FPGA ML605 BOARD PART B PRESENTATION STUDENTS: OLEG KORENEV EUGENE REZNIK SUPERVISOR: ROLF HILGENDORF 1 Semester:
1 Kadlec MAPLD05/P148 Reconfigurable Floating Point Co-Processor for Atmel FPSLIC Jiri Kadlec Institute of Information Theory and Automation (UTIA), Academy.
DEPARTMENT OF ELECTRONICS ENGINEERING V-SEMESTER MICROPROCESSOR & MICROCONTROLLER 1 CHAPTER NO microcontroller & programming.
Lab 1: Using NIOS II processor for code execution on FPGA
Lecture 15 PicoBlaze Overview
FPGAs in AWS and First Use Cases, Kees Vissers
FPGA Implementation of Multicore AES 128/192/256
Lecture 14 PicoBlaze Overview
Lecture 16 PicoBlaze Overview
ECE 448: Lab 6 Using PicoBlaze Fast Sorting Class Exercise 2.
The performance requirements for DSP applications continue to grow and the traditional solutions do not adequately address this new challenge Paradigm.
COMS 361 Computer Organization
Modified from notes by Saeid Nooshabadi
Portable SystemC-on-a-Chip
ADSP 21065L.
CSC Muon Sorter Status Tests Plans M.Matveev August 21, 2003.
Presentation transcript:

1 Kadlec MAPLD05/P149 Floating Point Controller as PicoBlaze Network on Single Spartan 3 FPGA Jiri Kadlec 1, Roger Gook 2 1 Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Prague, CZ Tel : Celoxica Ltd. Abingdon Oxford, UK Tel:

2 Kadlec MAPLD05/P149 Presentation outline PicoBlaze KCPSM3 processor from Ken Chapman Xilinx One PicoBlaze Master and four Workers connected by DP BRAMs Demo1: Four Bouncing Ball on a VGA – 5 PicoBlaze on XC3S200 Scalable pipelined Floating point Bit-exact high level simulation in Simulink Demo2: 400 M Flop (18-bit FP) parallel vector products Power/area: Virtex2, Spartan3, Spartan3L, Spartan3E Conclusions

3 Kadlec MAPLD05/P149 PicoBlaze KCPSM3 processor Author: Ken Chapman, Xilinx; VHDL Core with assembler: free from Main parameters: l 8 bit CPU, 1 BRAM 1024x18 for program, only 96 slices (5% of xc3s200) l 16 registers, Scratch pad memory 64 byte, 8bit I/O bus, 8bit port address l all instructions take constantly 2 clock cycles, 1 level of interrupt l KCPSM3 includes Assembler, RS232 macros and uart_clock demo. l Optimized for Virtex E, Virtex 2, and Spartan 3 Our design is reusing parts of Ken’s uart_clock demo We add inter-processor connect, VGA support and Floating point HW We add hazard free access to DP BRAM from Master and Worker PicoBlaze

4 Kadlec MAPLD05/P149 Master includes in addition: l bps UART l 1 microsecond timer l Interrupt-based 1u sec time base Worker: 4 input ports, 8 (max 256) output ports

5 Kadlec MAPLD05/P149 Workers provide asynchronous interrupt input and interrupt_ack output Master encapsulates serial bps UART with basic Ken’s SW support Each processor includes 1 BRAM with local program Network is using dual-ported BRAM blocks 2048 x 8 Master and Worker macros and connectivity:

6 Kadlec MAPLD05/P149 Demo 1: Four Bouncing Balls, VGA, 5 PicoBlaze net Board IO: SW led 7seg 7seg 7seg 7seg PC Master Worker 1 Worker 2 Worker 3 Worker 4 time base 1u sec Ball 1 Ball 2 Ball 3 Ball 4 RS bps Address 11 bit (8 banks, each 256 bytes) Data busses 8 bits 4 Dual ported Block RAMs 2048 Bytes each Address 11bit Data 8 bit 4 PicoBlaze workers 1 PicoBlaze Master 4 PicoBlaze Workers VGA HW Monitor Support 4-balls Ball sped is sent from PC to BRAMs Workers read it to Drive the Balls

7 Kadlec MAPLD05/P149 Demo1: 4 Bouncing Balls Spartan3 xc3s200-4-ft256 Slices 962 out of % BRAMS 9 out of 12 75% System clock 50 MHz Up to 5x25MIPs 125 MIPs Interrupt latency 4 clk Master DP SRAM 4 x 2048 Byte Worker DP SRAM 2048 Byte HW avoids write conflicts (No hazard in case of parallel WR to same address). Worker 1,2,3 VGA support Master Worker 4 Power estimate: Dynamic 9 mW Quiescent 61 mW Total 70 mW

8 Kadlec MAPLD05/P149 Generic Short Latency Floating Point Macros Based on Celoxica DK 1.1 Handel-C Floating Point Library Precisions m : 18m11, 24m17, 32m23, 36m27 l ADD/SUB, MUL 2 stage pipelined (retimed) l FIXPT2F, F2FIXPT 4 stage pipelined (retimed) l DIV, SQRT Sequential. No of cycles = mantissa width bit Pipelined Floating Point Macros Based on Celoxica DK 3.1 Handel-C Pipelined Floating Point Library 32 bit Precisions m : 32m23 l ADD/SUB 10 stage pipelined MUL 7 stage pipelined l FIXPT2F 12 stage pipelined F2FIXPT 14 stage pipelined l DIV 28 stage pipelined SQRT 27 stage pipelined Used in Final FP Vector Product Demo 2

9 Kadlec MAPLD05/P149 Modeling & RTL Example for 32bit FP ADD l Simple FP ADD and SUB (bit exact) l ADD/SUB one block (bit exact) l ADD/SUB one block (bit and cycle exact) Source code in Handel C VHDL C++ Simulink S-functions RTL level DK 3.1 Simulato r

10 Kadlec MAPLD05/P149 Area Used in Final FP Vector Product Design

11 Kadlec MAPLD05/P149 Speed Used in Final FP Vector Product Design

12 Kadlec MAPLD05/P149 Spartan3 xc3s1000-4, L, E Virtex2 xc2v MHz clock 125 MIPs 4x 100 M FLOP FP Mantissa 11 bit, FP Exponent 6 bit, FP Sign 1 bit DSP program: Wait for signal; Z[0]=X’ [0:255]*Y[0:255]; interrupt worker; Demo 2: 400 M Flop (18-bit FP) vector product DP BRAM connecting processors: 2048 x 8bit (8 banks) DP BRAM connecting DSP: 1024 x 18bit (4 banks) 1x PicoBlaze Master 1 us time-base RS bps 4x PicoBlaze Workers 4x 3 DP BRAMs X, Y, Z 4x dedicated HW: 18 bit FP MACs at 50MHz From Simulink and DK4 test b. Vector Product FP HW 18m11 Vector Product FP HW 18m11 Vector Product FP HW 18m11 Vector Product FP HW 18m11

13 Kadlec MAPLD05/P149 Design and verification strategy for FP DSP modules Step 1: Bit exact model in Simulink. Verification with Double. Create test data. Step 2: Simulation of identical HW (hand coded in Handel-C) in DK4 Software simulator (I/O functions automate connection to Matlab) Step 3: Compilation from DK4 to HW kit to verify on real HW. Kit specific versions of same I/O functions automate connection of the HW kit to Matlab without the need to modify code (parallel port in the case of RC200E). Step 4: Isolate debugged DSP design ( BRAM -> do DSP -> BRAM) as modules. Step 5: Attach these verified DSP modules with PicoBlaze. The top level can be Handel-C, VHDL or Verilog. (Handel-C has been used) Step 6: Verify the DSP module first on one PicoBlaze worker with mem dump support from the Master. Use test data from Step 1 Step 7: Extend your DSP design to multiple workers, large data sets and real time constrains. Concentrate on SW to manage combination of DSP blocks.

14 Kadlec MAPLD05/P149 Step1-3: Bit exact model in Simulink and debugging.

15 Kadlec MAPLD05/P149 Step 4-5: Top level used in all design experiments to get comparable power estimation results. Falutest_hcc contains all 5 processors and all FP ALUs (Handel-C top). This is detail of X, Y, Z BRAMs and one FP 18bit MAC operating bit-exact as Simulink.

16 Kadlec MAPLD05/P149 Step 6: Integrate and test with PicoBlaze on HW X[0:7] Y[0:7] Z[0] This is Mem Dump managed by PicoBlaze net on rc200e hw. It prints test vector data and result of vector prod. identical with Simulink. Simulink test bench generates data which can be used by DK4 simulator, HW board for verrification on the HW kit (RC200E with XC2V in our case). Finally to target PicoBlaze network, data are generated in format compatible with 18-bit wide BRAMS X,Y and Z: Finally, OK on HW :-)

17 Kadlec MAPLD05/P149 Step 7: Real vector product 400mflop Virtex2 xc2v fg456 Slice Flip Flops290528% 4 input LUTs % Occupied Slices329264% BRAMS21 52% MULT18x18s 4 10% Clock 50 MHz ISE: 53,3 MHz Power (Xpower setting has been verified by measurement of case temperature): Vccint Dynamic 666 mW Quiescent 18 mW Vccoux Dynamic 0 mW Quiescent 330 mW Vcco Dynamic 3 mW Quiescent 3 mW Total 1020 mW

18 Kadlec MAPLD05/P149 Step 7: Real vector product 400mflop Spartan3 xc3s1000(L)-4-fg456 Slice Flip Flops263717% 4 input LUTs % Occupied Slices309740% BRAMS21 87% MULT18x18s 4 16% Clock 50 MHz ISE: 50,6 MHz Power estimate (X_power) S3 S3L Vccint Dynamic 92,8 mW 91 mW Quiescent 78 mW 36 mW Vccoux Dynamic 0 mW 0 mW Quiescent 62 mW 62 mW Vcco Dynamic 1 mW 1 mW Quiescent 0 mW 0 mW Total 235 mW 191 mW

19 Kadlec MAPLD05/P149 Step 7: Real vector product 400mflop Spartan3E xc3s1200E-4-fg400 Slice Flip Flops282916% 4 input LUTs % Occupied Slices313636% BRAMS21 75% MULT18x18s 4 14% Clock 50 MHz ISE: 50,1 MHz Power estimate is not available yet in X power tool. The complete 4x 100 M FLOP Vector product with 5 PicoBlaze processors has been implemented and tested on RC200E board from Celoxica with the Virtex 2 XC2V part, running at 50MHz. Spartan 3 designs have been all compiled but not tested on real HW.

20 Kadlec MAPLD05/P149 Lessons learned and thanks This work has been partially supported by the Ministry of Education of the Czech republic projects: 1M and 1ET PicoBlaze Architecture ++ It is compatible with our design strategy for DSP modules: Simulink model -> DK4 debug -> HW debug -> Reuse in PicoBlaze net. PicoBlaze is small and simple, hence manageable. 5 PicoBlaze Architecture -- Currently implemented conversion of data formats (8bit - 18bit) is slow. Spartan 3(L) power reduction ++ Spartan3(L) is 5x reducing power consumption comparing to Virtex2. Spartan3E is most likely choice for our designs based on PicoBlaze net.