Definition
Asymmetric FPGA-loaded hardware accelerators for FPGA-enhanced CPU systems with Linux
Performed by: Avi Werner, William Backshi
Instructor: Evgeny Fiksman
Cooperated with: IBM; Dror Livne (head of SW Dpt, start-up)
Studying the environment
- Build fully functional cross-compilation toolchain for PPC
- Implementation of one of the board+CPU+OS demos on FPGA
- Introduce additional hardware into FPGA, test liveness

Multi-core
- Study existing buses; build FSB for the Accelerator core
- Compile and test simple SW function for the Accelerator core
- Insert CPU cache for Accelerator core
- Insert test simple SW function to Accelerator cache, test functionality
- Design Accelerators interface, memory and controllers
- Design SW stub for the Accelerators to work in passive mode
- Add several Accelerators, test existing functionality
- Write OS support driver that allows sending code to Accelerators for execution
- Write a test bench

FPGA dynamic loading
- Test FPGA dynamic loading with Xilinx software
- Test dynamic loading of a simple Accelerator
- Write OS support driver for FPGA-on-the-fly functions
- Test loading custom precompiled VHDL code (“hardware DLL”) on-the-fly
[Figure: block diagram showing PPC, MMU, DDR MEM, the OPB bus, and the Accelerators, each with its own Data & Instr MEM]
[Figure: Accelerated Software platform. Labels: FPGA, PPC 405, Linux (Debian), Driver, MMU, DDR MEM, Virtual communication Layer (SW), Accelerator, Instr MEM & Data MEM, Software Stub (Data mover & executer)]
Platform: ML310 with PPC405.
Accelerators: uBlaze (MicroBlaze) soft-core microprocessors.
Controllers: no need for controllers; operation is SW-controlled.
HW
- One PPC core and several uBlaze cores are connected to the OPB bus.
- Main memory is DDR, behind the OPB-to-PLB bridge.
- Each uBlaze has its own dual-port memory holding both instructions and data, accessed through separate instruction and data ports.

SW
- The kernel runs on the central PPC core (single OS).
- The kernel is aware of the reserved, driver-allocated memory range, but unaware of the accelerators and their CPU functionality.
- SW that can use the accelerators must be aware of the acceleration driver and must contain code segments that the accelerators support and that can be loaded onto them.
- Each accelerator runs a SW loop in its own memory: a small stub that acts as the client operation controller and provides the Data Mover functionality.
- The driver is independent of the structure of the implemented accelerators (assumption: a compiler exists for the implemented accelerators).
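One way to picture how a reserved address range could be divided among several accelerators is a small address-decoding model. This is only a sketch in Python, not the project's implementation: the base address, the window size, and the names `TRIGGER_BASE`, `TRIGGER_STRIDE` and `decode_trigger` are hypothetical. Only the count of three accelerators comes from the slides (PPC + 3 uBlaze cores).

```python
# Hypothetical address map: one fixed-size window per accelerator.
TRIGGER_BASE = 0x8000_0000    # assumed start of the reserved range
TRIGGER_STRIDE = 0x100        # assumed window size per accelerator
NUM_ACCELERATORS = 3          # the slides mention PPC + 3 uBlaze cores


def decode_trigger(addr):
    """Return the index of the accelerator whose window contains addr,
    or None if addr lies outside the reserved range."""
    offset = addr - TRIGGER_BASE
    if offset < 0 or offset >= TRIGGER_STRIDE * NUM_ACCELERATORS:
        return None
    return offset // TRIGGER_STRIDE
```

With such a layout the driver needs to know only the base and the stride, which fits the stated goal of keeping the driver independent of the accelerators' internal structure.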
Functionality
- HW is loaded on the FPGA, the Linux kernel runs on the central PPC core, and the accelerators are preloaded with the client software stub.
- The SW driver is loaded into memory (using the insmod command).
- Accelerator-aware SW is executed and communicates with the driver.
- The client stub runs in an idle loop until it recognizes a transaction to a specific memory range on the OPB (the trigger transaction), initiated by driver code running on the PPC core.
- The stub reads the several transactions that follow (a preset number, or a linked list terminated by zero data); these initialize the Data Mover.
- The Data Mover sets a flag at a preset address in main memory to signal that the accelerator core is busy.
- The Data Mover issues transactions to main memory until all the data of the 1st chunk has been brought in, appends a “ret” instruction so that control returns to the Data Mover after execution, and passes control to the 1st byte of the 1st chunk.
- The code finishes execution and the inserted “ret” passes control back to the Data Mover.
- The Data Mover issues transactions to main memory until all the result data has been written back, clears the busy flag, and returns to the idle loop.
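The trigger, busy-flag, copy-in, execute, copy-out sequence described on this slide can be sketched as a small behavioral model. It is a simplification under stated assumptions, not the actual stub: main memory is a Python list of words, the descriptor is a hypothetical `(src, length, dst)` triple, the busy flag lives at a hypothetical address `BUSY_FLAG_ADDR`, and the loaded code chunk is represented by a Python callable (returning from it plays the role of the appended “ret”).

```python
BUSY_FLAG_ADDR = 0   # hypothetical preset address of the busy flag


class AcceleratorStub:
    """Behavioral model of the client stub (idle loop + Data Mover)."""

    def __init__(self, main_memory):
        self.main = main_memory   # shared main memory, modeled as a list of words
        self.local = []           # the accelerator's private memory

    def on_trigger(self, descriptor, kernel):
        """Handle one trigger transaction.

        descriptor: hypothetical (src, length, dst) words sent by the driver.
        kernel: callable standing in for the code chunk loaded by the driver.
        Returns the number of result words written back.
        """
        src, length, dst = descriptor
        self.main[BUSY_FLAG_ADDR] = 1               # signal a busy accelerator
        self.local = self.main[src:src + length]    # bring the chunk into local memory
        result = kernel(self.local)                 # run the chunk; its return models "ret"
        self.main[dst:dst + len(result)] = result   # put the results back into main memory
        self.main[BUSY_FLAG_ADDR] = 0               # clear the flag, back to the idle loop
        return len(result)
```

In this model a driver would place input words in memory, call `on_trigger` with a descriptor, and poll `BUSY_FLAG_ADDR` until it reads zero before collecting the results.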
Studying the environment
- Build fully functional cross-compilation toolchain for PPC
- Implementation of one of the board+CPU+OS demos on FPGA
- Introduce additional hardware into FPGA, test liveness
Duration: up to 1 month
Status: Complete (OS demos using Xilinx-delivered OS)
Multi-core
- Study existing buses; build FSB for the Accelerator core
- Compile and test simple SW function for the Accelerator core
- Insert CPU cache for Accelerator core
- Insert test simple SW function to Accelerator cache, test functionality
- Design Accelerators interface, memory and controllers
- Design SW stub for the Accelerators to work in passive mode
- Add several Accelerators, test existing functionality
- Write OS support driver that allows sending code to Accelerators for execution
- Write a test bench
Duration: up to 6 months
Multi-core (status)
- Study existing buses; build FSB for the Accelerator core: since uBlaze was selected as the accelerator core, the existing buses are used. OPB connects all the microprocessors and the common memory (each accelerator core also has its own memory).
- Compile and test simple SW function for the Accelerator core: every application for uBlaze compiles and runs in EDK.
- Insert CPU cache for Accelerator core: no need for a CPU cache, since the accelerator cores have separate memory blocks.
- Insert test simple SW function to Accelerator cache, test functionality: see above.
- Design Accelerators interface, memory and controllers: done.
- Design SW stub for the Accelerators to work in passive mode: in progress.
- Add several Accelerators, test existing functionality: PPC + 3 uBlaze cores are running separately, without an OS, because of environment issues.
- Write OS support driver that allows sending code to Accelerators for execution: almost completed.
- Write a test bench: useless until the environment problems are solved.
FPGA dynamic loading
- Test FPGA dynamic loading with Xilinx software
- Test dynamic loading of a simple Accelerator
- Write OS support driver for FPGA-on-the-fly functions
- Test loading custom precompiled VHDL code (“hardware DLL”) on-the-fly
Duration: up to 3 months (given success of previous parts’ schedule)
Theoretical research
- Found and read articles on HW accelerators, both by faculty staff and external (CELL by IBM, etc.).
- Met with most of the MATRICS group to check their interest in our platform and their possible demands.
- Met with Systems Dpt. members at IBM (Muli Ben-Yehuda) for a concept review.
- The system architecture has undergone significant changes.

Practical achievements
- Cross-compiler toolchain (GCC 3.6). Stack overflow protection on PPC-64 (reported in the GCC Bugzilla) took significant work to avoid.
- Compiled kernel for PPC-405 with ML310 support (no PCI support).
- Booted ML310 from CF with Xilinx pre-loaded Linux.
- Introduced additional hardware into FPGA, tested liveness.

Remaining
- Boot ML310 with our kernel.
Compiling the Linux kernel
Compilation from scratch for GNU-supported architectures:
- Downloaded GCC source and binutils (versions had to be specified) to get a cross-compiler.
- Downloaded cross-tool and ran it (automated Perl cross-compiler generator).
- Buildroot supports a cross-compilation toolchain.
Compiling the kernels required the ML310 platform’s xparams.h, and device names had to be edited to attach them to the kernel namespace.
- Kernel must be an ELF file.
- Kernel must be compiled to run from memory, not using IDE/SYSACE/Network.

Generating file systems
- Buildroot can generate the kernel and file systems. It does not support PowerPC, even though it claims to.
- Installation from a PowerPC CD on a host computer that emulates a PowerPC core (using QEMU): no support for Linux kernel v2.6 for PowerPC.
- Debian bootstrap cross-installation: a method for generating packages for a Debian installation without actually installing or running them. QEMU is still needed to create the file system.
- LFS (Linux From Scratch manual): build the file system from scratch, compiling every application by hand. Huge workload.
We are using Linux v2.6; the reasons follow.

Practical results
- Xilinx tools allow easy generation of a PowerPC platform with Xilinx-provided devices.
- Xilinx-provided Linux v2.4 runs on the platform.
- Our compiled Linux v2.6 runs on the platform but hangs when loading the file system, which it does not find, although we have generated file systems in several ways.
- Newer boards have Xilinx support for Linux v2.6, so working with a newer board avoids these problems.
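The requirement that the kernel be an ELF file can be checked before loading it onto the board. The following is a minimal sketch in Python, assuming a 32-bit big-endian PowerPC target such as the PPC405. The function name `is_ppc32_elf` is hypothetical; the field offsets and the `EM_PPC` value follow the ELF specification.

```python
import struct

EM_PPC = 20  # e_machine value for 32-bit PowerPC in the ELF specification


def is_ppc32_elf(header):
    """Return True if header (the first 20+ bytes of a file) looks like
    a 32-bit big-endian PowerPC ELF header."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return False
    ei_class, ei_data = header[4], header[5]
    if ei_class != 1 or ei_data != 2:      # ELFCLASS32, ELFDATA2MSB
        return False
    # e_machine is a 16-bit field at offset 18, after e_ident and e_type
    (e_machine,) = struct.unpack_from(">H", header, 18)
    return e_machine == EM_PPC
```

To use it, read the first 20 bytes of the compiled kernel image in binary mode and pass them to the function; a False result would point to a wrong output format or a wrong cross-compiler target.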
Montavista 2.4 kernel: why we are not using it
- There is no working compiler for Montavista in the lab. The EDK-provided Montavista environment is out of date, the update is costly, and it cannot compile the kernel as it is.
- There is no reason to even try to compile: we do not have the PCI controller core from Xilinx, which costs money.
- The Montavista precompiled kernel will not boot from memory. It explicitly depends on IDE or Network, and both are behind the PCI controller.
- There is no reason to work with XUP to overcome the lack of a PCI controller: XUP has no IDE, and the Network core in XUP again costs money.
- Kernel 2.4 does not compile as smoothly as kernel 2.6.

Boot
- Kernel 2.6 boots with CF support, but the driver does not detect a file system.
- We have managed to boot the kernel with the OS fully in memory, but the booted kernel does not detect the file system in memory.
We assume that the problem lies in either the kernel or all of our generated file systems. To “divide and conquer”, we turned to QEMU, a fully simulated environment that runs on any computer. We first tried to run PowerPC in it, and it failed: QEMU’s PowerPC support is not finished yet, and the boot loader hangs while loading. To work around this, we are running an ARM-based system in QEMU instead. The kernel and the file system have both been adapted to ARM. Status: so far this is also failing. Work is in progress.