Asymmetric FPGA-loaded hardware accelerators for FPGA-enhanced CPU systems with Linux

Definition
Asymmetric FPGA-loaded hardware accelerators for FPGA-enhanced CPU systems with Linux
Performed by: Avi Werner, William Backshi
Instructor: Evgeny Fiksman
In cooperation with: IBM; Dror Livne (head of SW department, start-up)

 Studying the environment
    Build a fully functional cross-compilation toolchain for PPC
    Implement one of the board + CPU + OS demos on the FPGA
    Introduce additional hardware into the FPGA, test liveness
 Multi-core
    Study existing buses; build an FSB for the Accelerator core
    Compile and test a simple SW function for the Accelerator core
    Insert a CPU cache for the Accelerator core
    Insert the test SW function into the Accelerator cache, test functionality
    Design the Accelerators' interface, memory and controllers
    Design a SW stub for the Accelerators to work in passive mode
    Add several Accelerators, test existing functionality
    Write an OS support driver that allows sending code to the Accelerators for execution
    Write a test bench
 FPGA dynamic loading
    Test FPGA dynamic loading with Xilinx software
    Test dynamic loading of a simple Accelerator
    Write an OS support driver for FPGA-on-the-fly functions
    Test loading custom precompiled VHDL code ("hardware DLL") on the fly

Block diagram: the PPC core and the Accelerator cores are connected over the OPB bus; DDR main memory is accessed through the MMU; each Accelerator has its own Data & Instruction memory.

Accelerated software platform (block diagram): on the FPGA, a PPC 405 runs Linux (Debian) with the driver and a virtual communication layer (SW); DDR memory sits behind the MMU; each Accelerator has its own Instruction & Data memory and runs the software stub (data mover & executer).

 Platform: ML310 with PPC405
 Accelerators: uBlaze (MicroBlaze) soft-core microprocessors
 Controllers: not needed; operation is SW-controlled

 HW
    1 PPC core and several uBlaze cores are connected to the OPB bus.
    Main memory is DDR, behind an OPB-to-PLB bridge.
    Each uBlaze has a shared dual-port memory for instructions and data, with separate instruction and data ports.
 SW
    The kernel runs on the central PPC core (single OS).
    The kernel is aware of the reserved, driver-allocated memory range, but unaware of the accelerators and their CPU functionality.
    SW that uses the accelerators must be aware of the acceleration driver and must contain code segments that are supported by the accelerators and can be loaded to them.
    Each accelerator runs a SW loop in its own memory: a small stub that acts as the client-side operation controller and provides Data Mover functionality.
    The driver is independent of the structure of the implemented accelerators (assumption: a compiler exists for the implemented accelerators). A minimal sketch of the kind of shared command descriptor such a driver and stub could agree on is shown below.
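To make the split of responsibilities concrete, here is a minimal C sketch of a shared command descriptor the driver and the accelerator stubs could agree on. The header name, addresses, offsets and field names are illustrative assumptions, not taken from the actual design.

```c
/*
 * accel_shared.h (hypothetical): layout of one command slot in the
 * driver-reserved memory range. All values here are placeholders.
 */
#ifndef ACCEL_SHARED_H
#define ACCEL_SHARED_H

#include <stdint.h>

#define ACCEL_CMD_BASE    0x40000000u   /* assumed base of the reserved OPB range */
#define ACCEL_CMD_STRIDE  0x100u        /* one command slot per accelerator core  */

#define ACCEL_FLAG_IDLE   0u
#define ACCEL_FLAG_BUSY   1u            /* set by the Data Mover while it owns a job */

/* One job descriptor: where the code/data chunk lives and where results go. */
struct accel_cmd {
    volatile uint32_t trigger;    /* written last by the driver; wakes the stub */
    volatile uint32_t src_addr;   /* main-memory address of the 1st chunk       */
    volatile uint32_t src_len;    /* chunk length in bytes                      */
    volatile uint32_t dst_addr;   /* main-memory address for the result data    */
    volatile uint32_t dst_len;    /* result length in bytes                     */
    volatile uint32_t busy_flag;  /* ACCEL_FLAG_BUSY while the stub is executing */
};

#endif /* ACCEL_SHARED_H */
```

Because the driver only reads and writes such a descriptor, it can stay independent of how the accelerator itself is implemented, as required above.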

 Functionality
    The HW is loaded on the FPGA, the Linux kernel runs on the central PPC core, and the accelerators are preloaded with the client software stub.
    The SW driver is loaded into memory (using the insmod command).
    Accelerator-aware SW is executed and communicates with the driver.
    The client stub runs in an idle loop until it recognizes a transaction to a specific memory range on the OPB (the trigger transaction), initiated by driver code running on the PPC core.
    The stub reads several subsequent transactions (a preset amount, or a linked list terminated by zero data) that initialize the Data Mover.
    The Data Mover sets a specific flag (at a preset address) in main memory, signalling that the accelerator core is busy.
    The Data Mover issues transactions to main memory until all data of the 1st chunk has been brought in, appends a "ret" instruction so that control returns to the Data Mover after execution, and passes control to the 1st byte of the 1st chunk.
    When the code finishes execution, the inserted "ret" passes control back to the Data Mover.
    The Data Mover issues transactions to main memory until all result data has been written back, clears the busy flag, and returns to the idle loop.
A hedged sketch of this stub loop follows below.
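The sequence above can be summarized as a C sketch of the client stub's main loop (one instance per accelerator core). The command-slot address, the descriptor layout and the calling convention are assumptions for illustration; the real stub appends a "ret" instruction to the copied chunk rather than relying on a C function call, and moves data through bus transactions rather than memcpy.

```c
/* Hypothetical client stub loop: idle wait, Data Mover in, execute, Data Mover out. */
#include <stdint.h>
#include <string.h>

#define CMD_ADDR   0x40000000u                      /* assumed per-core command slot   */
#define LOCAL_BUF  ((uint8_t *)(uintptr_t)0xA000u)  /* assumed local instr/data memory */

struct accel_cmd {
    volatile uint32_t trigger, src_addr, src_len, dst_addr, dst_len, busy_flag;
};

int main(void)
{
    struct accel_cmd *cmd = (struct accel_cmd *)(uintptr_t)CMD_ADDR;

    for (;;) {
        /* 1. Idle loop: wait for the trigger transaction from the driver.        */
        while (cmd->trigger == 0)
            ;

        /* 2. Set the busy flag at the preset address so the PPC side can poll.   */
        cmd->busy_flag = 1;

        /* 3. Data Mover: bring the 1st chunk from main memory into local memory. */
        memcpy(LOCAL_BUF, (const void *)(uintptr_t)cmd->src_addr, cmd->src_len);

        /* 4. Pass control to the 1st byte of the chunk; the appended return      */
        /*    instruction brings control back here after execution.               */
        ((void (*)(void))(uintptr_t)LOCAL_BUF)();

        /* 5. Data Mover: put the result data back, clear the flags, go idle.     */
        memcpy((void *)(uintptr_t)cmd->dst_addr, LOCAL_BUF, cmd->dst_len);
        cmd->busy_flag = 0;
        cmd->trigger   = 0;
    }
}
```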

 Studying the environment
    Build a fully functional cross-compilation toolchain for PPC
    Implement one of the board + CPU + OS demos on the FPGA
    Introduce additional hardware into the FPGA, test liveness
Duration: up to 1 month
Status: Complete (OS demos using the Xilinx-delivered OS)

 Multi-core
    Study existing buses; build an FSB for the Accelerator core
    Compile and test a simple SW function for the Accelerator core
    Insert a CPU cache for the Accelerator core
    Insert the test SW function into the Accelerator cache, test functionality
    Design the Accelerators' interface, memory and controllers
    Design a SW stub for the Accelerators to work in passive mode
    Add several Accelerators, test existing functionality
    Write an OS support driver that allows sending code to the Accelerators for execution
    Write a test bench
Duration: up to 6 months

 Multi-core
    Study existing buses; build an FSB for the Accelerator core
     Since uBlaze was selected as the accelerator core, the existing buses are used: OPB connects all the microprocessors and the common memory (each accelerator core has its own memory).
    Compile and test a simple SW function for the Accelerator core
     Every application for uBlaze compiles and runs in EDK.
    Insert a CPU cache for the Accelerator core
     No need for a CPU cache: the accelerator cores have separate memory blocks.
    Insert the test SW function into the Accelerator cache, test functionality
     See above.
    Design the Accelerators' interface, memory and controllers
     Done.
    Design a SW stub for the Accelerators to work in passive mode
     In progress.
    Add several Accelerators, test existing functionality
     PPC + 3 uBlaze cores run separately, no OS yet (environmental issues).
    Write an OS support driver that allows sending code to the Accelerators for execution
     Almost completed (a hedged sketch follows below).
    Write a test bench
     Not useful until the environment problems are solved.
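For the "almost completed" OS support driver, here is a hedged sketch of what its code-upload path could look like on a 2.6 kernel: a character device whose write() copies the code chunk into the driver-reserved DDR region and then fills the command slot, writing the trigger word last. The device name, major number, physical addresses and descriptor offsets are assumptions, not the actual driver.

```c
#include <linux/module.h>
#include <linux/fs.h>
#include <asm/io.h>
#include <asm/uaccess.h>

#define ACCEL_MAJOR      240             /* assumed experimental major number */
#define ACCEL_CODE_PHYS  0x0ff00000u     /* assumed driver-reserved DDR range */
#define ACCEL_CMD_PHYS   0x40000000u     /* assumed OPB command slot          */

static void __iomem *code_area;
static void __iomem *cmd_slot;
static char bounce[4096];

static ssize_t accel_write(struct file *f, const char __user *buf,
                           size_t len, loff_t *off)
{
    if (len > sizeof(bounce))
        return -EINVAL;
    /* Copy the user-supplied code chunk into the reserved memory range. */
    if (copy_from_user(bounce, buf, len))
        return -EFAULT;
    memcpy_toio(code_area, bounce, len);

    /* Fill the command descriptor; the trigger word goes last so the stub
     * only wakes up once the rest of the descriptor is valid.            */
    writel(ACCEL_CODE_PHYS, cmd_slot + 4);   /* src_addr */
    writel(len,             cmd_slot + 8);   /* src_len  */
    writel(1,               cmd_slot + 0);   /* trigger  */
    return len;
}

static const struct file_operations accel_fops = {
    .owner = THIS_MODULE,
    .write = accel_write,
};

static int __init accel_init(void)
{
    code_area = ioremap(ACCEL_CODE_PHYS, sizeof(bounce));
    cmd_slot  = ioremap(ACCEL_CMD_PHYS, 0x100);
    if (!code_area || !cmd_slot) {
        if (code_area)
            iounmap(code_area);
        if (cmd_slot)
            iounmap(cmd_slot);
        return -ENOMEM;
    }
    return register_chrdev(ACCEL_MAJOR, "accel", &accel_fops);
}

static void __exit accel_exit(void)
{
    unregister_chrdev(ACCEL_MAJOR, "accel");
    iounmap(code_area);
    iounmap(cmd_slot);
}

module_init(accel_init);
module_exit(accel_exit);
MODULE_LICENSE("GPL");
```

User space would then open a (hypothetical) /dev/accel node and write() a compiled code chunk to dispatch it; polling the busy flag, or an ioctl for completion, is left out of the sketch.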

 FPGA dynamic loading
    Test FPGA dynamic loading with Xilinx software
    Test dynamic loading of a simple Accelerator
    Write an OS support driver for FPGA-on-the-fly functions
    Test loading custom precompiled VHDL code ("hardware DLL") on the fly
Duration: up to 3 months (assuming the previous parts stay on schedule)

 Theoretical research
    Found and read articles on HW accelerators, both by faculty staff and external (e.g. IBM's CELL)
    Met with most of the MATRICS group to check their interest in our platform and their possible requirements
    Met with Systems Dept. members at IBM (Muli Ben-Yehuda) for a concept review
    The system architecture has undergone significant changes
 Practical achievements
    Cross-compiler toolchain (GCC 3.6): the stack-overflow protection on PPC-64 (reported in GCC Bugzilla) took significant effort to work around
    Compiled a kernel for PPC-405 with ML310 support (no PCI support)
    Booted the ML310 from CF with the Xilinx pre-loaded Linux
    Introduced additional hardware into the FPGA, tested liveness
 Remaining
    Boot the ML310 with our kernel

 Compiling the Linux kernel
    Compilation from scratch for GNU-supported architectures: download the GCC source and binutils (specific versions had to be chosen) to obtain a cross-compiler; alternatively, download crosstool (an automated Perl cross-compiler generator) and run it; Buildroot also supports building a cross-compilation toolchain.
    Kernel compilation had to use the ML310 platform's xparams.h, and device names had to be edited to attach the devices to the kernel namespace (see the sketch after this list).
    The kernel must be an ELF file.
    The kernel must be compiled to run from memory, without using IDE/SysACE/network.
 Generating file systems
    Buildroot can generate a kernel and file systems, but it does not support PowerPC, even though it claims that functionality.
    Installation from a PowerPC CD on a host computer that emulates a PowerPC core (using QEMU): there is no support for Linux kernel v2.6 for PowerPC.
    Debian bootstrap cross-installation: a method for generating packages for a Debian installation without actually installing or running them; however, QEMU is still needed to create the file system.
    LFS (Linux From Scratch manual): build the file system from scratch, compiling every application by hand. Huge workload.
    We are using Linux v2.6; the reasons follow.
 Practical results
    Xilinx tools allow easy generation of a PowerPC platform with Xilinx-provided devices.
    The Xilinx-provided Linux v2.4 runs on the platform.
    Our compiled Linux v2.6 runs on the platform but hangs on loading the (non-existent) file system, although we have generated file systems in several ways.
    Newer boards have Xilinx support for Linux v2.6; working with newer boards saves headaches!
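As an illustration of the xparams.h editing mentioned above, the sketch below shows the kind of glue a 2.6-era board port uses: a base address taken from the EDK-generated parameters header is wrapped into a platform device so a driver can bind to it by name. The macro, device and function names here are placeholders, not the actual ML310 sources.

```c
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/ioport.h>
#include <linux/platform_device.h>

/* Would normally come from the EDK-generated parameters header (xparams.h). */
#define XPAR_EXAMPLE_DEVICE_BASEADDR  0x40600000
#define XPAR_EXAMPLE_DEVICE_HIGHADDR  0x4060ffff

static struct resource example_resources[] = {
    {
        .start = XPAR_EXAMPLE_DEVICE_BASEADDR,
        .end   = XPAR_EXAMPLE_DEVICE_HIGHADDR,
        .flags = IORESOURCE_MEM,
    },
};

static struct platform_device example_device = {
    .name          = "example-device",   /* must match the driver's name string */
    .id            = 0,
    .resource      = example_resources,
    .num_resources = ARRAY_SIZE(example_resources),
};

static int __init ml310_example_devices_init(void)
{
    /* Attach the device to the kernel namespace so its driver can bind. */
    return platform_device_register(&example_device);
}
arch_initcall(ml310_example_devices_init);
```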

 MontaVista 2.4 kernel: we are not using it
    There is no working compiler for MontaVista in the lab.
    The EDK-provided MontaVista environment is not up to date, the update is costly, and as delivered it cannot compile the kernel.
    There is no reason even to try compiling: we do not have the PCI controller core from Xilinx, which costs money.
    The MontaVista precompiled kernel will not boot from memory; it explicitly depends on IDE or network, and both sit behind the PCI controller.
    There is no reason to move to the XUP board to work around the missing PCI controller: the XUP has no IDE, and its network core, again, costs money.
    Kernel 2.4 does not compile as smoothly as kernel 2.6.
 Boot
    Kernel 2.6 boots with CF support, but the driver does not detect a file system.
    We have managed to boot the kernel with the OS fully in memory, but the booted kernel does not detect the file system in memory.

 We assume there is a problem with either the kernel or all of our generated file systems.
 In order to "divide and conquer", we turned to a fully simulated environment that runs on any computer: QEMU. Within it we tried to run a PowerPC system, and it failed (QEMU does not yet practically support PowerPC: its boot loader hangs while loading, since it is unfinished).
 To work around the issue of loading PowerPC in QEMU, we are running an ARM-based system instead (there is no usable PowerPC support in the QEMU version used for simulation). The kernel and file system have both been adapted to ARM. Status: so far this is also failing; work is in progress.