2
The Project: Asymmetric FPGA-loaded hardware accelerators for FPGA-enhanced CPU systems with Linux
Performed by: Avi Werner, William Backshi
Instructor: Evgeny Fiksman
Duration: 1 year (2 semesters)
Mid-project presentation, 30/03/2009
3
RMI Processor
4
RMI – SW Programming Model
5
RMI Processor - RMIOS
6
Agenda
- Project description
- Design considerations and schematics
- System diagram and functionality
- Preparing the demo
- Planned future progress
7
Project definition
- An FPGA-based system.
- Asymmetric multiprocessor system, with a Master CPU and several slave Accelerators (modified soft-core CPUs with RAM) with the same or different OpCode.
- The Master CPU runs a single-processor Linux OS, with the Accelerators' functionality exposed to OS applications through a driver API.
8
The Platform
- Platform: ML310 with PPC405.
- Accelerators: based on uBlaze soft-core microprocessors.
- Controllers: an IRQ controller for each core.
- "Accelerator" refers to microprocessor + IRQ generator + RAM.
9
Project Progress
Theoretical research:
- Found and read articles on HW accelerators, both by faculty staff and external (e.g. IBM's CELL).
- Met with most of the MATRICS group, checking their interest in our platform and possible demands.
- Met with Systems Dept. members at IBM (Muli Ben-Yehuda) for a concept review. The system architecture has undergone significant changes as a result.
Practical achievements – attempt to load Linux on ML310:
- Compiled a kernel for PPC-405 with ML310 support (no PCI support).
- Booted ML310 from CF with the Xilinx pre-loaded Linux.
- Introduced additional hardware into the FPGA, tested liveness.
Practical achievements – creating the HW system platform:
- Moved to Xilinx 10.1 to get a single system bus (PLB v.4.6) with multi-port memory.
- Created a template for the Accelerator core (IRQ generator and microprocessor).
- Designed the interconnect topology.
- Connected the devices at the HW level, tested system liveness and independence.
10
HW Design considerations
- Scalability – the design is CPU-independent.
- The Accelerator works with interrupts – no polling (improved performance).
- The OS does not work with interrupts – generic HW compatibility and scalability (it polls the IRQ generators instead).
- Separate register space – main memory is not used for flags / device data / etc. (see the register-map sketch after this list).
- Single-cycle transaction for checking / setting accelerator status.
- The Data Mover stub init includes the chunk size – no character recognition needed.
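The slides do not give the IRQ generator's actual register map, so the following is only a minimal sketch of what the separate register space described above could look like, assuming three memory-mapped 32-bit registers; the names, offsets, bit positions and the IRQGEN base address are all illustrative assumptions, not the real core's interface.

    /* Hypothetical register map for one accelerator's IRQ generator.
     * Offsets, names and bit positions are assumptions for illustration. */
    #include <stdint.h>

    #define IRQGEN_BASEADDR  0x80000000u       /* assumed base address of the core */

    typedef volatile struct {
        uint32_t base_addr;  /* DRAM address of the target code chunk        */
        uint32_t length;     /* chunk size in bytes (no terminator scanning) */
        uint32_t status;     /* control/status bits, see below               */
    } irqgen_regs_t;

    /* Assumed status-register bits */
    #define IRQGEN_STATUS_RUN       (1u << 0)  /* set by the host to start a job   */
    #define IRQGEN_STATUS_BUSY      (1u << 1)  /* set by the Data Mover stub       */
    #define IRQGEN_STATUS_COMPLETE  (1u << 2)  /* set by the stub when job is done */

    /* Host-side status check: a single register read, matching the
     * "single-cycle transaction" design goal above. */
    static inline int accelerator_is_done(irqgen_regs_t *regs)
    {
        return (regs->status & IRQGEN_STATUS_COMPLETE) != 0;
    }

Keeping flags in a dedicated register block rather than in main memory means the host can check or set accelerator status with one bus transaction and no cache-coherency concerns.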
11
Accelerator Data & Instr. Dual port RAM CPU (uBlaze) IRQ Generator General Purpose Registers Slave Master PLB v.4.6 IRQ Accelerator Schematics MEM Controller MEM Controller Instruction bus Data bus
12
HW Design Schematics
[Block diagram: PPC with MMU and DDR memory, plus several Accelerators (each with its own Data & Instruction memory), all connected over the PLB v.4.6 bus.]
13
Current System layer
Accelerated Software platform (FPGA): PPC 405 with MMU and DDR memory; Accelerator with Instruction & Data memory running the Software Stub (Data Mover & executer); demo applications: memory test demo and LED accelerator demo.
Manual execution: we can't load any executable into the DDR without JTAG, since we don't have an OS yet. Thus we have to load it manually, and set up and execute the stub manually.
14
Complete System layer
Accelerated Software platform (FPGA): PPC 405 with MMU and DDR memory running Linux (Debian), the driver and a virtual communication layer (SW); Accelerator with Instruction & Data memory running the Software Stub (Data Mover & executer).
15
System Functionality
1. HW is loaded on the FPGA; the demo application (in the future – the Linux kernel) runs on the central PPC core; the accelerators are preloaded with the client software stub.
2. The SW driver is loaded into memory (in the kernel case – using the insmod command).
3. Accelerator-aware SW is executed (in the kernel case – it communicates with the driver API).
4. To commit a job to a specific accelerator, the SW initializes the registers of that accelerator's IRQ controller and sets the "run" flag in the status register.
5. The client stub runs in an idle loop until the accelerator's IRQ controller issues an interrupt, initiated by the driver code running on the PPC core.
6. The stub reads the IRQ controller registers that initialize the Data Mover (in the 1st stage – the start address and length of the code).
7. The Data Mover sets a flag in the IRQ generator status register, signaling a working accelerator core.
8. The Data Mover issues transactions to main memory until the whole code segment has been brought in, then passes control to the 1st byte of the code segment.
9. The target code ends with an "rtid" instruction that returns control to the Data Mover: when execution finishes, the inserted return instruction passes control back to the Data Mover stub.
10. The Data Mover changes the IRQ generator status register to "complete" and returns to the idle loop (the stub can also support returning result data structures to main memory).
A sketch of the stub side of this handshake is given below.
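The stub's source is not shown on the slides; the following is a minimal sketch of what the client stub / Data Mover loop on the uBlaze could look like, reusing the hypothetical IRQGEN register layout assumed earlier. All names and addresses are illustrative assumptions, and for simplicity the sketch busy-waits and returns via a plain function call instead of the interrupt and 0xFFC return-address mechanism described on the next slide.

    /* Hypothetical uBlaze-side stub: idle loop + Data Mover, illustration only. */
    #include <stdint.h>
    #include <string.h>

    #define IRQGEN_STATUS_RUN       (1u << 0)
    #define IRQGEN_STATUS_BUSY      (1u << 1)
    #define IRQGEN_STATUS_COMPLETE  (1u << 2)

    typedef volatile struct {
        uint32_t base_addr;  /* DRAM address of the code chunk, written by the driver */
        uint32_t length;     /* chunk size in bytes, written by the driver            */
        uint32_t status;     /* RUN / BUSY / COMPLETE flags                           */
    } irqgen_regs_t;

    static irqgen_regs_t *const regs = (irqgen_regs_t *)0x80000000u; /* assumed address */

    #define CODE_LOAD_ADDR  0x1000u  /* target code is linked to start at 0x1000 */

    void stub_main(void)
    {
        for (;;) {
            /* Idle until the driver commits a job by setting the RUN flag
             * (in the real system this wait is interrupt-driven). */
            while (!(regs->status & IRQGEN_STATUS_RUN))
                ;

            regs->status |= IRQGEN_STATUS_BUSY;

            /* Data Mover: copy the code segment from main memory into local RAM. */
            memcpy((void *)(uintptr_t)CODE_LOAD_ADDR,
                   (const void *)(uintptr_t)regs->base_addr,
                   regs->length);

            /* Jump to the first byte of the loaded code; it returns to the stub
             * via the epilogue inserted at compile time (see the next slide). */
            ((void (*)(void))(uintptr_t)CODE_LOAD_ADDR)();

            /* Report completion and go back to idle. */
            regs->status &= ~(IRQGEN_STATUS_RUN | IRQGEN_STATUS_BUSY);
            regs->status |= IRQGEN_STATUS_COMPLETE;
        }
    }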
16
Preparing Accelerator SW
Compile the accelerator target code with an execution-only segment (there is no data segment – data is inserted inline).
The target code should be compiled with Program starting address = 0x1000, set via Compiler options, using the Default linker script.
Insert at the end a call to a "return" function whose address is taken from 0xFFC:
    asm("andi r1, r1, 0x0;\
         lwi r15, r1, 0xFFC;\
         rtsd r15,0;");
Open a Xilinx EDK Shell and run the following to convert the ELF to binary code:
    mb-objcopy -O binary --remove-section=.stab --remove-section=.stabstr executable.elf target.bin
A sketch of a full target source file follows below.
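As a concrete illustration, a target source file might look roughly like the sketch below: a small compute routine with its data inlined, followed by the return trampoline from this slide. Only the asm epilogue and the 0x1000 / 0xFFC conventions come from the slide; the function name, the LED register address and the overall structure are illustrative assumptions.

    /* Hypothetical accelerator target program (compiled with start address 0x1000).
     * Everything except the closing asm epilogue is an illustrative assumption. */

    static void blink_leds(void)                 /* assumed demo payload */
    {
        volatile unsigned int *leds =
            (volatile unsigned int *)0x81400000u; /* assumed GPIO/LED address */
        unsigned int pattern = 0x5u;              /* data is inlined, no .data segment */

        for (int i = 0; i < 8; i++) {
            *leds = pattern;
            pattern = ~pattern;
        }
    }

    int main(void)
    {
        blink_leds();

        /* Return to the stub: load the return address stored at 0xFFC
         * and branch to it (epilogue taken from the slide above). */
        asm("andi r1, r1, 0x0;\
             lwi r15, r1, 0xFFC;\
             rtsd r15,0;");
        return 0;
    }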
17
Preparing the system
- Download the bitstream to the FPGA (PPC code and uBlaze stub).
- Launch XMD on the PPC core.
- Download the target accelerator BIN to DRAM as data:
    dow -data target.bin 0xSTART_ADDR
- Set the IRQ Generator parameters:
  1. Base address – 0xSTART_ADDR + 0x1000.
  2. Length of the BIN in DRAM.
  3. Run bit.
  4. Set the run bit again, if you liked it.
18
Planned future progress
- Load Linux on the platform.
- Update the stub to allow data passing.
- Finish writing the driver API for Linux.
- Write an additional demo application for uBlaze.
- Write a demo application for PPC (Linux).
19
Backup slides (hidden)
20
The process
Studying the environment:
- Build a fully functional cross-compilation toolchain for PPC.
- Implement one of the board+CPU+OS demos on the FPGA.
- Introduce additional hardware into the FPGA, test liveness.
Multi-core:
- Study existing buses; build the FSB for the Accelerator core.
- Compile and test a simple SW function for the Accelerator core.
- Insert a CPU cache for the Accelerator core.
- Insert the test SW function into the Accelerator cache, test functionality.
- Design the Accelerators' interface, memory and controllers.
- Design the SW stub for the Accelerators to work in passive mode.
- Add several Accelerators, test existing functionality.
- Write an OS support driver that allows sending code to Accelerators for execution.
- Write a test bench.
FPGA dynamic loading:
- Test FPGA dynamic loading with Xilinx software.
- Test dynamic loading of a simple Accelerator.
- Write an OS support driver for FPGA-on-the-fly functions.
- Test loading custom precompiled VHDL code ("hardware DLL") on the fly.
21
Operational Description
HW:
- 1 PPC core and several uBlaze cores are connected to the PLB v.4.6 bus. Main memory is DDR.
- Each uBlaze has a shared dual-port memory for instructions and data, with separate access for the instruction and data ports.
SW:
- The kernel runs on the central PPC core (single OS). The kernel is aware of a reserved, driver-allocated memory range, but unaware of the accelerators and their CPU functionality.
- SW that uses accelerators must be aware of the acceleration driver and contain code segments that are supported by the accelerators and can be loaded to them.
- Each accelerator runs a SW loop in its own memory: a small stub that acts as the client operation controller and provides the Data Mover functionality.
- The driver is independent of the structure of the implemented accelerators (assumption: a compiler exists for the implemented accelerators).
A sketch of the driver-side job-commit sequence follows below.
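The driver is still in progress (see the progress slides), so the following is only a minimal sketch of what the PPC-side job commit and polling described above could look like, reusing the hypothetical IRQGEN register layout assumed earlier; accel_commit_job, accel_wait and all other names are illustrative assumptions, not the project's driver API.

    /* Hypothetical PPC/driver-side view of committing a job, illustration only. */
    #include <stdint.h>

    #define IRQGEN_STATUS_RUN       (1u << 0)
    #define IRQGEN_STATUS_COMPLETE  (1u << 2)

    typedef volatile struct {
        uint32_t base_addr;  /* physical DRAM address of the code segment */
        uint32_t length;     /* code segment length in bytes              */
        uint32_t status;     /* RUN / BUSY / COMPLETE flags               */
    } irqgen_regs_t;

    /* Commit a job: point the accelerator's IRQ controller at the code that was
     * copied into the reserved, driver-allocated memory range, then set RUN. */
    static void accel_commit_job(irqgen_regs_t *regs, uint32_t code_phys, uint32_t len)
    {
        regs->base_addr = code_phys;
        regs->length    = len;
        regs->status   |= IRQGEN_STATUS_RUN;  /* triggers the accelerator's interrupt */
    }

    /* The OS side does not use interrupts; it simply polls the status register. */
    static void accel_wait(irqgen_regs_t *regs)
    {
        while (!(regs->status & IRQGEN_STATUS_COMPLETE))
            ;   /* a real driver would sleep or reschedule here */
    }

Because the driver only touches the generic IRQ-generator registers, it stays independent of what CPU is actually implemented inside each accelerator, matching the scalability goal stated above.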
22
Progress
Studying the environment:
- Build a fully functional cross-compilation toolchain for PPC.
- Implement one of the board+CPU+OS demos on the FPGA.
- Introduce additional hardware into the FPGA, test liveness.
Duration: up to 1 month.
Status: Complete (OS demos using the Xilinx-delivered OS).
23
Progress
Multi-core:
- Study existing buses; build the FSB for the Accelerator core.
- Compile and test a simple SW function for the Accelerator core.
- Insert a CPU cache for the Accelerator core.
- Insert the test SW function into the Accelerator cache, test functionality.
- Design the Accelerators' interface, memory and controllers.
- Design the SW stub for the Accelerators to work in passive mode.
- Add several Accelerators, test existing functionality.
- Write an OS support driver that allows sending code to Accelerators for execution.
- Write a test bench.
Duration: up to 6 months.
24
Progress
Multi-core:
- Study existing buses; build the FSB for the Accelerator core – since uBlaze was selected as the accelerator core, the existing buses were utilized: OPB for connecting all the microprocessors and the common memory (each accelerator core has its own memory).
- Compile and test a simple SW function for the Accelerator core – every application for uBlaze compiles and runs in EDK.
- Insert a CPU cache for the Accelerator core – no need for a CPU cache; the accelerator cores have separate memory blocks.
- Insert the test SW function into the Accelerator cache, test functionality – see above.
- Design the Accelerators' interface, memory and controllers – done.
- Design the SW stub for the Accelerators to work in passive mode – in progress.
- Add several Accelerators, test existing functionality – PPC + 3 uBlaze cores are running separately, no OS yet (environmental issues).
- Write an OS support driver that allows sending code to Accelerators for execution – almost completed.
- Write a test bench – not useful until the environment problems are solved.
25
Progress
FPGA dynamic loading:
- Test FPGA dynamic loading with Xilinx software.
- Test dynamic loading of a simple Accelerator.
- Write an OS support driver for FPGA-on-the-fly functions.
- Test loading custom precompiled VHDL code ("hardware DLL") on the fly.
Duration: up to 3 months (given success of the previous parts' schedule).
26
Progress
Compiling the Linux kernel
Compilation from scratch for GNU-supported architectures:
- Downloaded GCC and binutils sources (specific versions had to be used) to build a cross-compiler.
- Downloaded crosstool and ran it (an automated Perl cross-compiler generator).
- Buildroot also supports building a cross-compilation toolchain.
- Kernel compilation had to use the ML310 platform's xparams.h, and device names had to be edited to attach them to the kernel namespace.
- The kernel must be an ELF file.
- The kernel must be compiled to run from memory, not using IDE/SYSACE/Network.
Generating file systems:
- Buildroot can generate a kernel and filesystems, but does not support PowerPC even though it claims to.
- Installation from a PowerPC CD on a host computer that emulates a PowerPC core (using QEMU): no support for Linux kernel v2.6 for PowerPC.
- Debian bootstrap cross-installation – a method for generating packages for a Debian installation without actually installing or running them; however, QEMU is still needed to create the file system.
- LFS (Linux From Scratch manual) – build the file system from scratch, compiling every application by hand. Huge workload.
- We are using Linux v2.6; reasons follow.
Practical results:
- Xilinx tools allow easy generation of a PowerPC platform with Xilinx-provided devices.
- The Xilinx-provided Linux v2.4 runs on the platform.
- Our compiled Linux v2.6 runs on the platform but hangs on the (non-existent) file system load, although we've generated file systems in several ways.
- Newer boards have Xilinx support for Linux v2.6 – working with newer boards saves headache!
27
Environment problems
Montavista 2.4 kernel – we're not using it:
- There is no working compiler for Montavista in the lab.
- The EDK-provided Montavista environment isn't up to date, the update is costly, and as delivered it cannot compile the kernel.
- There is no reason to even try to compile: we don't have the PCI controller core from Xilinx, which costs money.
- The Montavista precompiled kernel won't boot from memory; it explicitly depends on IDE or Network, and both sit behind the PCI controller.
- There is no reason to switch to the XUP board to work around the missing PCI controller: the XUP has no IDE, and its Network core – again – costs money.
- Kernel 2.4 doesn't compile as smoothly as kernel 2.6.
Boot:
- Kernel 2.6 boots with CF support, but the driver doesn't detect a file system.
- We've managed to boot the kernel with the OS fully in memory, but the booted kernel doesn't detect the file system in memory.
28
Failures
We assume there's a problem with either the kernel or all of our generated file systems. In order to "divide and conquer", we turned to a fully simulated environment that runs on any computer – QEMU. Within it we tried to run PowerPC, and it failed (QEMU does not yet practically support PowerPC – the boot loader hangs while loading, since it isn't finished). To work around the PowerPC limitation in QEMU, we're running an ARM-based system instead; both the kernel and the file system have been adapted to ARM. Status: for the time being this is also failing; work is in progress.