The Project: Asymmetric FPGA-loaded hardware accelerators for FPGA-enhanced CPU systems with Linux

Presentation transcript:

The Project: Asymmetric FPGA-loaded hardware accelerators for FPGA-enhanced CPU systems with Linux
Performed by: Avi Werner, William Backshi
Instructor: Evgeny Fiksman
Duration: 1 year (2 semesters)
Mid-project presentation, 30/03/2009

RMI Processor

RMI – SW Programming Model

RMI Processor - RMIOS

Agenda
- Project description
- Design considerations and schematics
- System diagram and functionality
- Preparing the demo
- Planned future progress

Project definition
- An FPGA-based system.
- An asymmetric multiprocessor system, with a master CPU and several slave accelerators (modified soft-core CPUs with RAM) running the same or different opcodes.
- The master CPU runs a single-processor Linux OS, with the accelerators' functionality exposed to applications in the OS through a driver API.

The Platform
- Platform: ML310 with PPC405.
- Accelerators: based on uBlaze soft-core microprocessors.
- Controllers: an IRQ controller for each core.
"Accelerator" refers to microprocessor + IRQ generator + RAM.

Project Progress
- Theoretical research
  - Found and read articles on HW accelerators, both by faculty staff and external (CELL – IBM, etc.).
  - Met with most of the MATRICS group, checking their interest in our platform and possible demands.
  - Met with Systems Dept. members at IBM (Muli Ben-Yehuda) for a concept review.
  - The system architecture has undergone significant changes.
- Practical achievements – attempt to load Linux on the ML310
  - Compiled a kernel for the PPC405 with ML310 support (no PCI support).
  - Booted the ML310 from CF with the Xilinx pre-loaded Linux.
  - Introduced additional hardware into the FPGA, tested liveness.
- Practical achievements – creating the HW system platform
  - Moved to Xilinx 10.1 to get a single system bus (PLB v4.6) with multi-port memory.
  - Created a template for the accelerator core (IRQ generator and microprocessor).
  - Designed the interconnect topology.
  - Connected the devices at the HW level, tested system liveness and independence.

HW Design considerations
- Scalability – the design is CPU-independent.
- The accelerator works with interrupts – no polling (improved performance).
- The OS does not work with interrupts – generic HW compatibility and scalability (it polls the IRQ generators).
- Separate register space – main memory is not used for flags / device data / etc. (see the register sketch below).
- Single-cycle transaction for checking / setting the accelerator status.
- The Data Mover stub init includes the chunk size – no character recognition needed.
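
To make the register-space point concrete, the sketch below shows how one accelerator's IRQ-generator registers might look from the C side. Everything here is an illustrative assumption – the names, field order, and bit positions are ours, not the EDK design's.

    /* Hypothetical register map of one accelerator's IRQ generator, as
     * seen over the PLB bus. Layout and bit assignments are assumed. */
    #include <stdint.h>

    #define ACC_STATUS_RUN      (1u << 0) /* set by the host to commit a job     */
    #define ACC_STATUS_BUSY     (1u << 1) /* set by the Data Mover while working */
    #define ACC_STATUS_COMPLETE (1u << 2) /* set by the Data Mover when done     */

    typedef struct {
        volatile uint32_t status;    /* run / busy / complete flags           */
        volatile uint32_t code_base; /* start address of the code in main DDR */
        volatile uint32_t code_len;  /* chunk size: the stub learns the length
                                        up front, so no character recognition */
    } acc_irqgen_regs_t;

Because these registers live in their own address space, checking or setting the status is a single bus transaction rather than a trip through main memory.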

Accelerator Schematics (figure): CPU (uBlaze) with dual-port data & instruction RAM behind MEM controllers on separate instruction and data buses, an IRQ generator with general-purpose registers, and slave/master connections to the PLB v4.6 bus (IRQ line).

HW Design Schematics (figure): PPC with MMU and DDR MEM on the PLB v4.6 bus, together with several accelerators, each with its own data & instruction MEM.

Current System layer (figure): the accelerated software platform on the FPGA – a PPC405 (running the memory-test demo) with MMU and DDR MEM, and an accelerator (LED accelerator demo) with instruction & data MEM running the software stub (data mover & executer), driven by manual execution. Manual execution: we can't load any executable into the DDR without JTAG, since we don't have an OS yet; we have to load it manually, then set up and execute the stub manually.

Complete System layer (figure): the same platform with Linux (Debian) and the driver on the PPC405, a virtual communication layer (SW) between the driver and the accelerator, and the software stub (data mover & executer) in the accelerator's instruction & data MEM.

System Functionality
- HW is loaded on the FPGA, the demo application (in the future – the Linux kernel) runs on the central PPC core, and the accelerators are preloaded with the client software stub.
- The SW driver is loaded into memory (in the kernel – using the insmod command).
- Accelerator-aware SW is executed (in the kernel – it communicates with the driver API).
- To commit a job to a specific accelerator, the SW initializes the registers of the accelerator's IRQ controller and sets the "run" flag in the status register.
- The client stub runs in an idle loop until the accelerator's IRQ controller issues an interrupt, initiated by the driver code running on the PPC core (see the sketch after this list).
- The stub reads the IRQ controller registers that initialize the Data Mover (in the 1st stage – with the start address and length of the code).
- The Data Mover sets a flag in the IRQ generator status register that signals a working accelerator core.
- The Data Mover initiates transactions with the main memory until the whole code segment has been brought in, then passes control to the 1st byte of the code segment.
- The target code includes an "rtid" instruction to return control to the Data Mover after execution: when the code finishes, the inserted "rtid" passes control back to the Data Mover stub.
- The Data Mover changes the IRQ generator's status register to "complete" and returns to the idle loop (the stub can also support returning resulting data structures to the main memory).
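
As a sketch only, the accelerator-side half of this flow could look like the loop below, reusing the hypothetical acc_irqgen_regs_t layout from the earlier sketch. The real stub is woken by the IRQ generator's interrupt; the flag test here stands in for that, and data_mover_copy stands in for the real bus transactions.

    /* Sketch of the client stub / Data Mover loop on the uBlaze. The
     * load address 0x1000 matches the program starting address used
     * when compiling target code (see the next slide). */
    typedef void (*entry_fn)(void);

    static void data_mover_copy(uint32_t *dst, const uint32_t *src,
                                uint32_t len)
    {
        /* Word-wise copy standing in for the transactions the real
         * Data Mover issues against main memory. */
        for (uint32_t i = 0; i < len / 4; i++)
            dst[i] = src[i];
    }

    static void stub_main(volatile acc_irqgen_regs_t *regs)
    {
        for (;;) {
            /* Idle until a job is committed (in HW: until the IRQ
             * generator interrupts us). */
            while (!(regs->status & ACC_STATUS_RUN))
                ;

            regs->status |= ACC_STATUS_BUSY;  /* signal a working core */

            /* Bring the whole code segment into local dual-port RAM... */
            data_mover_copy((uint32_t *)0x1000,
                            (const uint32_t *)regs->code_base,
                            regs->code_len);

            /* ...then pass control to its first byte; the return trailer
             * in the target code brings control back here. */
            ((entry_fn)0x1000)();

            regs->status = ACC_STATUS_COMPLETE; /* report, back to idle */
        }
    }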

Preparing Accelerator SW
- Compile the accelerator target code with an execution-only segment (there is no data segment – data is inserted inline). The target code should be compiled with program starting address = 0x1000, set via compiler options, using the default linker script (a complete example is sketched below).
- Insert at the end a call to a "return" function whose address is taken from 0xFFC:
      asm("andi r1, r1, 0x0;\
           lwi r15, r1, 0xFFC;\
           rtsd r15,0;");
- Open the Xilinx EDK Shell and run the following to convert the ELF to binary code:
      mb-objcopy -O binary --remove-section=.stab --remove-section=.stabstr executable.elf target.bin
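
Putting the recipe together, a complete target could look like the sketch below. Only the 0x1000 start address, the inline-data constraint, and the 0xFFC return trailer come from this slide; the LED register address and the value written are made-up placeholders.

    /* Hypothetical accelerator target: execution-only segment, data
     * inline, compiled with program starting address 0x1000 via compiler
     * options and the default linker script. The GPIO address is a
     * placeholder, not a real ML310 mapping. */
    int main(void)
    {
        volatile unsigned int *led = (volatile unsigned int *)0x81400000;
        *led = 0xAA;  /* some visible work: toggle LEDs */

        /* Return trailer: fetch the stub's return address from 0xFFC
         * and return through it. */
        asm("andi r1, r1, 0x0;\
             lwi r15, r1, 0xFFC;\
             rtsd r15,0;");
        return 0;
    }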

Preparing the system
- Download the bitstream to the FPGA (PPC code and uBlaze stub).
- Launch XMD on the PPC core.
- Download the target accelerator BIN to DRAM as data:
      dow -data target.bin 0xSTART_ADDR
- Set the IRQ Generator parameters:
  1. Base address – 0xSTART_ADDR + 0x…
  2. Length of the BIN in DRAM.
  3. Run bit.
  4. Set the run bit again, if you liked it.

Planned future progress
- Load Linux on the platform.
- Update the stub to allow data passing.
- Finish writing the driver API for Linux.
- Write additional demo application for uBlaze.
- Write demo application for PPC (Linux).

Backup slides (hidden)

The process
- Studying the environment
  - Build fully functional cross-compilation toolchain for PPC
  - Implementation of one of the board+CPU+OS demos on FPGA
  - Introduce additional hardware into FPGA, test liveness
- Multi-core
  - Study existing buses; build FSB for the Accelerator core
  - Compile and test simple SW function for the Accelerator core
  - Insert CPU cache for Accelerator core
  - Insert test simple SW function to Accelerator cache, test functionality
  - Design Accelerators interface, memory and controllers
  - Design SW stub for the Accelerators to work in passive mode
  - Add several Accelerators, test existing functionality
  - Write OS support driver that allows sending code to Accelerators for execution
  - Write a test bench
- FPGA dynamic loading
  - Test FPGA dynamic loading with Xilinx software
  - Test dynamic loading of a simple Accelerator
  - Write OS support driver for FPGA-on-the-fly functions
  - Test loading custom precompiled VHDL code ("hardware DLL") on-the-fly

Operational Description
- HW
  - 1 PPC core and several uBlaze cores are connected to the PLB v4.6 bus.
  - Main memory is DDR.
  - Each uBlaze has a shared 2-port memory for instructions and data, with separate access via the instruction and data ports.
- SW
  - The kernel runs on the central PPC core (single OS).
  - The kernel is aware of the reserved driver-allocated memory range, but unaware of the accelerators and their CPU functionality.
  - SW capable of using the accelerators must be aware of the acceleration driver and have code segments that are supported by the accelerators and can be loaded to them.
  - Each accelerator runs a SW loop in its own memory: a small stub that is effectively the client operation controller and provides the Data Mover functionality.
  - The driver is independent of the structure of the implemented accelerators (assumption: a compiler exists for the implemented accelerators). A sketch of the driver-side commit follows.
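
Under the same register-layout assumptions as the earlier sketches, the driver-side job commit then reduces to programming the IRQ generator and polling for completion:

    /* Sketch of the host-side job commit, matching the operational
     * description. The real driver API is still being written; this
     * only illustrates the register protocol. */
    static void acc_commit_job(volatile acc_irqgen_regs_t *regs,
                               uint32_t code_base, uint32_t code_len)
    {
        regs->code_base = code_base;      /* code location in the reserved,
                                             driver-allocated DDR range */
        regs->code_len  = code_len;
        regs->status    = ACC_STATUS_RUN; /* IRQ generator interrupts the
                                             target uBlaze */

        /* The OS side deliberately polls instead of taking interrupts
         * (see the HW design considerations). */
        while (!(regs->status & ACC_STATUS_COMPLETE))
            ;
    }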

Progress
- Studying the environment
  - Build fully functional cross-compilation toolchain for PPC
  - Implementation of one of the board+CPU+OS demos on FPGA
  - Introduce additional hardware into FPGA, test liveness
Duration: up to 1 month
Status: complete (OS demos using the Xilinx-delivered OS)

Progress
- Multi-core
  - Study existing buses; build FSB for the Accelerator core
  - Compile and test simple SW function for the Accelerator core
  - Insert CPU cache for Accelerator core
  - Insert test simple SW function to Accelerator cache, test functionality
  - Design Accelerators interface, memory and controllers
  - Design SW stub for the Accelerators to work in passive mode
  - Add several Accelerators, test existing functionality
  - Write OS support driver that allows sending code to Accelerators for execution
  - Write a test bench
Duration: up to 6 months

Progress
- Multi-core
  - Study existing buses; build FSB for the Accelerator core – since uBlaze was selected as the accelerator core, the existing buses were to be utilized: OPB for connecting all the microprocessors and the common memory (each accelerator core has its own memory).
  - Compile and test simple SW function for the Accelerator core – every application for uBlaze compiles and runs in EDK.
  - Insert CPU cache for Accelerator core – no need for a CPU cache; the accelerator cores have separate memory blocks.
  - Insert test simple SW function to Accelerator cache, test functionality – see above.
  - Design Accelerators interface, memory and controllers – done.
  - Design SW stub for the Accelerators to work in passive mode – in progress.
  - Add several Accelerators, test existing functionality – PPC + 3 uBlaze cores are running separately, no OS (environmental issues).
  - Write OS support driver that allows sending code to Accelerators for execution – almost completed.
  - Write a test bench – useless until the environment problems are solved.

Progress
- FPGA dynamic loading
  - Test FPGA dynamic loading with Xilinx software
  - Test dynamic loading of a simple Accelerator
  - Write OS support driver for FPGA-on-the-fly functions
  - Test loading custom precompiled VHDL code ("hardware DLL") on-the-fly
Duration: up to 3 months (given success of the previous parts' schedule)

Progress
- Compiling the Linux kernel
  - Compilation from scratch for GNU-supported architectures: downloaded the GCC and binutils sources (versions had to be specified) to get a cross-compiler; downloaded crosstool and ran it (an automated Perl cross-compiler generator); Buildroot also supports building a cross-compilation toolchain. Compiling the kernels required the ML310 platform's xparams.h, and device names had to be edited to attach them to the kernel namespace.
  - The kernel must be an ELF file.
  - The kernel must be compiled to run from memory, not using IDE/SYSACE/network.
- Generating file systems
  - Buildroot can generate a kernel and file systems, but it does not support PowerPC, even though it claims that functionality.
  - Installation from a PowerPC CD on a host computer that emulates a PowerPC core (using QEMU): there is no support for Linux kernel v2.6 for PowerPC.
  - Debian bootstrap cross-installation – a method for generating packages for a Debian installation without actually installing or running them; however, QEMU is still needed to create the file system.
  - LFS (the Linux From Scratch manual) – build the file system from scratch, compiling every application by hand. Huge workload.
  - We are using Linux v2.6; the reasons follow.
- Practical results
  - The Xilinx tools allow easy generation of a PowerPC platform with Xilinx-provided devices.
  - The Xilinx-provided Linux v2.4 runs on the platform.
  - Our compiled Linux v2.6 runs on the platform but hangs on loading the (non-existent) file system, although we've generated file systems in several ways.
  - Newer boards have Xilinx support for Linux v2.6 – working with newer boards saves headaches!

Environment problems
- The MontaVista 2.4 kernel: we're not using it.
  - There is no working compiler for MontaVista in the lab.
  - The EDK-provided MontaVista environment isn't up to date, the update is costly, and it can't compile the kernel as it is.
  - There is no reason even to try to compile: we don't have the PCI controller core from Xilinx, which costs money.
  - The MontaVista precompiled kernel won't boot from memory; it explicitly depends on IDE or network, and both are behind the PCI controller.
  - There is no reason to work with the XUP board to overcome the lack of a PCI controller, because the XUP has no IDE, and the network core on the XUP – again – costs money.
  - Kernel 2.4 doesn't compile as smoothly as kernel 2.6.
- Boot:
  - There is a boot with CF support within kernel 2.6; however, the driver doesn't detect a file system.
  - We've managed to boot the kernel with the OS fully in memory, but the booted kernel doesn't detect the file system in memory.

Failures
- We assume there is a problem with either the kernel or all of our generated file systems.
- In order to "divide and conquer", we turned to a fully simulated environment that runs on any computer – QEMU. Within it we tried to run PowerPC; it failed (QEMU does not practically support PowerPC – the boot loader hangs while loading, since it isn't finished yet).
- To work around the issue of loading PowerPC in QEMU, we are running an ARM-based system instead. The kernel and file system have both been adapted to ARM. Status: in the meantime, also failing; work is in progress.