Lecture 16: Accelerator Design in the XUP Board. ECE 412: Microcomputer Laboratory, Fall 2006.


Objectives
Understand accelerator design considerations in a practical FPGA environment
Gain knowledge of the XUP platform details required for efficient accelerator design

Four Fundamental Models of Accelerator Design
Base: no OS service (in simple embedded systems)
OS service
Accelerator as a user-space mmapped I/O device
Virtualized device with OS scheduling support
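The user-space mmapped I/O model can be illustrated with a minimal sketch, assuming a POSIX system. The register layout (`REG_CTRL`, `REG_DATA`) and the helper names are hypothetical, and an ordinary file stands in for the device node so the sketch runs without hardware; a real device node would not need the `ftruncate` call.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical register layout; a real accelerator defines its own. */
enum { REG_CTRL = 0, REG_DATA = 1, REG_COUNT = 2 };
#define WINDOW_BYTES (REG_COUNT * sizeof(uint32_t))

/* Map a register window into user space. `path` is a stand-in file here
 * (assumption); with real hardware it would be the accelerator's device node.
 * Returns NULL on failure. */
static volatile uint32_t *map_window(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    /* Only needed for the stand-in file: give it the size of the window. */
    if (ftruncate(fd, (off_t)WINDOW_BYTES) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, WINDOW_BYTES, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}

/* Release the mapping created by map_window(). */
static void unmap_window(volatile uint32_t *regs)
{
    munmap((void *)regs, WINDOW_BYTES);
}
```

Once mapped, the application reads and writes `regs[REG_DATA]` and `regs[REG_CTRL]` with plain loads and stores, with no system call per access; the cost is paid once in `open` and `mmap`.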

Hybrid Hardware/Software Execution Model
[Figure: block diagram with labels CPU, FPGA accelerators, memory, devices, Linux OS, Linker/Loader, Application, DLL, OS modules, Compiler analysis/transformations, Synthesis, Soft object, Hard object, User level function or device driver: Source code, Resource manager, Human designed hardware; phases Compile Time, User Runtime, Kernel Runtime]
Hardware Accelerator as a DLL
–Seamless integration of hardware accelerators into the Linux software stack for use by mainstream applications
–The DLL approach enables transparent interchange of software and hardware components
Application level execution model
–Compiler deep analysis and transformations generate CPU code, hardware library stubs and synthesized components
–FPGA bitmaps as hardware counterpart to existing software modules
–Same dynamic linking library interfaces and stubs apply to both software and hardware implementation
OS resource management
–Services (API) for allocation, partial reconfiguration, saving and restoring the status, and monitoring
–Multiprogramming scheduler can pre-fetch hardware accelerators in time for next use
–Control the access to the new hardware to ensure trust under private or shared use

MP3 Decoder: Madplay Lib. Dithering as DLL
Madplay shared library dithering function as software and FPGA DLL
–Software profiling shows Audio_linear_dither() takes 97% of application time
–DL (dynamic linker) can switch the call to the hardware or software implementation
Used by ~100 video and audio applications
[Figure: Application, Sound driver, AC'97, OS, FPGA, and Stub, with a Software Dithering DLL and a Hardware Dithering DLL. Dithering stages: Quantization, Clipping, Dithering, Random generator, Biasing, Noise Shaping. Hardware Dithering: 6 cycles. Flow: Decode MP3 Block, Read Sample, DL, Write Sample]
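The idea of binding one call site to either implementation can be sketched with a function pointer. This is only a stand-in for what the dynamic linker does, not the Madplay code: the names `dither_sw`, `dither_hw_stub`, and `bind_dither` are hypothetical, and the body performs only 16-bit clipping rather than the full stage chain.

```c
#include <stdint.h>

/* Common interface shared by the software and hardware implementations. */
typedef int32_t (*dither_fn)(int32_t sample);

/* Software path: clip to the 16-bit range (placeholder for the full chain). */
static int32_t dither_sw(int32_t sample)
{
    if (sample > INT16_MAX) return INT16_MAX;
    if (sample < INT16_MIN) return INT16_MIN;
    return sample;
}

/* Hardware path stub: a real stub would pass the sample to the FPGA.
 * Here it reuses the software computation so the sketch runs anywhere. */
static int32_t dither_hw_stub(int32_t sample)
{
    return dither_sw(sample);
}

/* Stand-in for the dynamic linker's choice of implementation. */
static dither_fn bind_dither(int fpga_available)
{
    return fpga_available ? dither_hw_stub : dither_sw;
}
```

The caller stores the result of `bind_dither` once and then calls through it for every sample, so the application code is the same whichever implementation is bound.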

CPU-Accelerator Interconnect Options
PLB (Processor Local Bus)
–Wide transfer – 64 bits
–Access to DRAM channel
–1/3 CPU frequency
–Big penalty if bus is busy during first attempt to access bus
OCM (On-chip Memory) interconnect
–Narrower – 32 bits
–No direct access to DRAM channel
–CPU clock frequency
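The width and clock figures above give a rough upper bound on transfer rate for each option. The sketch below assumes one transfer per bus clock and ignores arbitration and the busy-bus penalty; the CPU clock is a parameter, not a value taken from the slides.

```c
#include <stdint.h>

/* Peak bytes per second = (bus width in bytes) * (bus clock in Hz).
 * Assumes one transfer per bus cycle, which is an idealization. */
static uint64_t peak_bytes_per_sec(unsigned width_bits, uint64_t bus_hz)
{
    return (uint64_t)(width_bits / 8) * bus_hz;
}

/* PLB: 64 bits wide at one third of the CPU clock. */
static uint64_t plb_peak(uint64_t cpu_hz)
{
    return peak_bytes_per_sec(64, cpu_hz / 3);
}

/* OCM: 32 bits wide at the CPU clock. */
static uint64_t ocm_peak(uint64_t cpu_hz)
{
    return peak_bytes_per_sec(32, cpu_hz);
}
```

For an assumed 300 MHz CPU clock this yields 800 MB/s for the PLB and 1200 MB/s for the OCM interconnect, so the narrower OCM path has the higher peak rate but no direct access to the DRAM channel.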

Motion Estimation Design & Experience
Significant overhead in mmap, open calls
–This arrangement can only support accelerators that will be invoked many times
–Notice dramatic reduction in computation time
–Notice large overhead in data marshalling and white
Full Search gives 10% better compression
–Diamond Search is sequential, not suitable for acceleration
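The point that the one-time `open` and `mmap` cost only pays off over many invocations can be made concrete with a break-even calculation. This is a simplified model with hypothetical parameters: a fixed setup cost, a constant per-call time for each implementation, and all times in the same unit.

```c
#include <stdint.h>

/* Smallest number of invocations n for which
 *     setup + n * hw_per_call < n * sw_per_call,
 * or 0 if the accelerator never wins under this model. */
static uint64_t break_even_calls(uint64_t setup,
                                 uint64_t hw_per_call,
                                 uint64_t sw_per_call)
{
    if (hw_per_call >= sw_per_call)
        return 0;
    uint64_t gain = sw_per_call - hw_per_call; /* time saved per call */
    return setup / gain + 1;                   /* first n with n * gain > setup */
}
```

With a setup cost of 1000 units, 10 units per hardware call, and 60 units per software call, the accelerator comes out ahead from the 21st invocation onward. Per-call data marshalling would be counted in `hw_per_call`.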

JPEG: An Example
[Figure: JPEG pipeline from Original Image to Compressed Image: RGB to YUV, Downsample (Y, U, V), 2D Discrete Cosine Transform (DCT), Quantization (QUANT), Run-Length Encoding (RLE), Huffman Coding (HC). Region labels: Parallel Execution on Independent Blocks; Inherently Sequential Region; Implemented as Reconfigurable Logic; Accelerator Candidate]
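Quantization is a convenient stage for seeing why block-level work parallelizes: each output coefficient depends on one input coefficient and one table entry. The sketch below is a generic version of that step, not the lab's implementation; the function names are hypothetical and the quantization table is supplied by the caller.

```c
#include <stdint.h>

#define BLOCK_COEFFS 64 /* one 8x8 block */

/* Divide with rounding to nearest, halves away from zero. q must be > 0. */
static int16_t quant_one(int16_t coeff, uint16_t q)
{
    int32_t c = coeff;
    int32_t divisor = q;
    int32_t half = divisor / 2;
    return (int16_t)(c >= 0 ? (c + half) / divisor
                            : -((-c + half) / divisor));
}

/* Quantize one block. No iteration depends on another, so the loop could be
 * unrolled into parallel hardware or split across blocks. */
static void quantize_block(const int16_t in[BLOCK_COEFFS],
                           const uint16_t table[BLOCK_COEFFS],
                           int16_t out[BLOCK_COEFFS])
{
    for (int i = 0; i < BLOCK_COEFFS; i++)
        out[i] = quant_one(in[i], table[i]);
}
```

Run-length and Huffman coding, by contrast, carry state from one symbol to the next, which is what makes that part of the pipeline sequential.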

JPEG Accelerator Design & Experience
Based on Model (d)
–System call overhead for each invocation
–Better protection
DCT and Quant are accelerated
–Data flows directly from DCT to Quant
Data copy to user DMA buffer dominates cost

Execution Flow of DCT System Call
[Figure: timeline with columns Application, Operating System, Hardware, and a Time axis]
Application:
    open(/dev/accel); /* only once */
    …
    /* construct macroblocks */
    macroblock = …
    syscall(&macroblock, num_blocks)
    …
    /* macroblock now has transformed data */
    …
Operating System steps shown: Data Copy; Flush Cache Range; Setup DMA Transfer; Poll DMA Controller; Setup DMA Transfer; Invalidate Cache Range; Data Copy; Enable Accelerator Access for Application
Hardware labels: PPC, Memory, PLB, DMA Controller, DCR, Accelerator (Executing)
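One reading of the operating-system side of this flow is sketched below as a mock that only records the order of steps. The ordering is an interpretation of the diagram, not something the slide states outright, and every name here (`dct_syscall`, `record`, the `step` values) is hypothetical; no real kernel or DMA API is used.

```c
#include <stddef.h>

/* Steps named on the slide; the order below is an assumed reading. */
enum step {
    COPY_IN, FLUSH_CACHE, DMA_TO_ACCEL, POLL_DMA,
    DMA_FROM_ACCEL, INVALIDATE_CACHE, COPY_OUT
};

#define MAX_STEPS 16

struct trace {
    enum step steps[MAX_STEPS];
    size_t count;
};

/* Append one step to the trace, dropping it if the trace is full. */
static void record(struct trace *t, enum step s)
{
    if (t->count < MAX_STEPS)
        t->steps[t->count++] = s;
}

/* Mock handler: each record() marks where a real driver would do the work. */
static void dct_syscall(struct trace *t)
{
    record(t, COPY_IN);          /* user macroblocks -> DMA buffer */
    record(t, FLUSH_CACHE);      /* push cached data out to memory */
    record(t, DMA_TO_ACCEL);     /* set up transfer: memory -> accelerator */
    record(t, POLL_DMA);         /* wait on the DMA controller */
    record(t, DMA_FROM_ACCEL);   /* set up transfer: accelerator -> memory */
    record(t, INVALIDATE_CACHE); /* discard stale cache lines */
    record(t, COPY_OUT);         /* DMA buffer -> user macroblocks */
}
```

Two of the seven steps are plain data copies between user memory and the DMA buffer, and they are repeated on every call.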

Software Versus Hardware Acceleration
Overhead is a major issue!

Device Driver Access Cost