Chapter 7 Hardware Accelerators. Prof. Chung-Ta King, Department of Computer Science, National Tsing Hua University. (Slides are taken from the textbook slides.)


Overview
- CPUs and accelerators
- Accelerated system design
  - performance analysis
  - scheduling and allocation
- Design example: video accelerator

Accelerated systems
- Use an additional computational unit dedicated to certain functions: hardwired logic, or an extra CPU.
- Hardware/software co-design: joint design of the hardware and software architectures.

Accelerated system architecture
(Block diagram: the CPU, accelerator, memory, and I/O devices share the system bus; the CPU sends requests and data to the accelerator, which returns result data.)

Accelerator vs. co-processor
- A co-processor connects to the internals of the CPU and executes instructions dispatched by the CPU.
- An accelerator appears as a device on the bus:
  - it is controlled through registers, just like an I/O device;
  - the CPU and accelerator may also communicate via shared memory, using synchronization mechanisms;
  - it is designed to perform a specific function.

Accelerator implementations
- Application-specific integrated circuit (ASIC).
- Field-programmable gate array (FPGA).
- Standard component, e.g. a graphics processor.

System design tasks
- Similar to designing a heterogeneous multiprocessor architecture: a processing element (PE) may be a CPU, an accelerator, etc.
- Program the system.

Why accelerators?
- Better cost/performance: custom logic may be able to perform the operation faster than a CPU of equivalent cost.
- CPU cost is a non-linear function of performance (cost-vs-performance curve), so it is often better to split the application across multiple, cheaper PEs.

Why accelerators? (cont'd)
- Better real-time performance: put time-critical functions on less-loaded processing elements.
- Remember the RMS utilization bound: extra CPU cycles must be reserved to meet deadlines, so a single CPU must be even faster (and costlier) than the raw deadline suggests.
(Cost-vs-performance curve marking the performance needed for the deadline and for the deadline with RMS overhead.)

Why accelerators? (cont'd)
- Good for processing I/O in real time.
- May consume less energy.
- May be better at streaming data.
- Even the largest single CPU may not be able to do all the work.

Overview
- CPUs and accelerators
- Accelerated system design
  - performance analysis
  - scheduling and allocation
- Design example: video accelerator

Accelerated system design
- First, determine that the system really needs to be accelerated: how much faster is the accelerator on the core function? How much data transfer overhead is incurred?
- Design the accelerator itself.
- Design the CPU interface to the accelerator.

Performance analysis
- The critical parameter is speedup: how much faster is the system with the accelerator?
- Must take into account:
  - accelerator execution time;
  - data transfer time;
  - synchronization with the master CPU.

Accelerator execution time
- Total accelerator execution time: t_accel = t_in + t_x + t_out, where t_in is the data-input time, t_x the accelerated computation time, and t_out the data-output time.
- t_in and t_out must reflect the time for bus transactions.

Data input/output times
- Bus transactions include:
  - flushing register/cache values to main memory;
  - the time required for the CPU to set up the transaction;
  - the overhead of data transfers by bus packets, handshaking, etc.

Accelerator speedup
- Assume the accelerated loop is executed n times.
- Compare the accelerated system to the non-accelerated system:
  S = n (t_CPU - t_accel) = n [t_CPU - (t_in + t_x + t_out)]
  where t_CPU is the execution time of the loop body on the CPU alone.
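As a small worked illustration of this formula (all of the timing values below are invented for the example, not taken from the slides):

/* Hypothetical illustration of the speedup formula S = n * (t_CPU - t_accel).
   All times are made-up example values, in microseconds. */
#include <stdio.h>

int main(void) {
    double t_cpu   = 20.0;             /* one loop body on the CPU alone     */
    double t_in    = 2.0;              /* data input (bus) time              */
    double t_x     = 4.0;              /* accelerated computation time       */
    double t_out   = 1.0;              /* data output (bus) time             */
    double t_accel = t_in + t_x + t_out;
    int    n       = 1000;             /* number of loop executions          */
    double s       = n * (t_cpu - t_accel);   /* total time saved            */

    printf("t_accel = %.1f us, S = %.1f us saved over %d executions\n",
           t_accel, s, n);
    return 0;
}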

Single- vs. multi-threaded
- One critical factor is the available parallelism:
  - single-threaded/blocking: the CPU waits for the accelerator;
  - multithreaded/non-blocking: the CPU continues to execute along with the accelerator.
- To multithread, the CPU must have useful work to do, and the software must also support multithreading.

Two modes of operation
(Timelines of the CPU and accelerator running processes P1-P4 and accelerator task A1: in the single-threaded mode the CPU sits idle while A1 runs; in the multi-threaded mode the CPU keeps executing other processes in parallel with A1.)

Execution time analysis
- Single-threaded: count the execution time of all component processes.
- Multi-threaded: find the longest path through the execution.
- Sources of parallelism:
  - Overlap I/O and accelerator computation: perform operations in batches, reading in the second batch of data while computing on the first (see the sketch below).
  - Find other work to do on the CPU: operations may be rescheduled to move work after the accelerator is started.
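A minimal sketch of the batching idea, assuming hypothetical helpers read_batch(), start_accelerator() and wait_accelerator() (none of these names come from the slides):

/* Double buffering: read the next batch while the accelerator works on the
   current one.  read_batch(), start_accelerator() and wait_accelerator()
   are hypothetical stand-ins for the real I/O and device-driver calls. */
#define BATCHES    8
#define BATCH_SIZE 1024

extern void read_batch(int *dst);          /* fill a buffer with input data   */
extern void start_accelerator(int *src);   /* non-blocking start              */
extern void wait_accelerator(void);        /* block until the result is ready */

static int buf[2][BATCH_SIZE];

void process_all(void) {
    int cur = 0;
    read_batch(buf[cur]);                  /* prime the first buffer          */
    for (int b = 0; b < BATCHES; b++) {
        start_accelerator(buf[cur]);
        if (b + 1 < BATCHES)
            read_batch(buf[1 - cur]);      /* overlap I/O with computation    */
        wait_accelerator();                /* synchronize before reusing buf  */
        cur = 1 - cur;                     /* swap buffers                    */
    }
}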

Overview
- CPUs and accelerators
- Accelerated system design
  - performance analysis
  - scheduling and allocation
- Design example: video accelerator

Accelerator/CPU interface
- The accelerator's registers provide control registers for the CPU.
- Data registers can be used for small data objects.
- The accelerator may include special-purpose read/write logic, which is especially valuable for large data transfers.
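A minimal sketch of how the CPU might drive such registers through memory-mapped I/O; the base address, register layout and bit definitions below are invented for illustration:

/* Hypothetical memory-mapped register interface for an accelerator.
   The base address, offsets and bit masks are made-up example values. */
#include <stdint.h>

#define ACCEL_BASE   0x40000000u
#define ACCEL_CTRL   (*(volatile uint32_t *)(ACCEL_BASE + 0x00)) /* control  */
#define ACCEL_STATUS (*(volatile uint32_t *)(ACCEL_BASE + 0x04)) /* status   */
#define ACCEL_SRC    (*(volatile uint32_t *)(ACCEL_BASE + 0x08)) /* src addr */
#define ACCEL_DST    (*(volatile uint32_t *)(ACCEL_BASE + 0x0C)) /* dst addr */

#define CTRL_START   0x1u
#define STATUS_DONE  0x1u

void run_accelerator(uint32_t src, uint32_t dst) {
    ACCEL_SRC  = src;                    /* where the input data lives        */
    ACCEL_DST  = dst;                    /* where the result should go        */
    ACCEL_CTRL = CTRL_START;             /* kick off the computation          */
    while ((ACCEL_STATUS & STATUS_DONE) == 0)
        ;                                /* busy-wait; an interrupt could be
                                            used instead for non-blocking use */
}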

Caching problems
- Main memory provides the primary data transfer mechanism to the accelerator.
- Programs must ensure that cached copies do not get out of step with the main-memory data the accelerator sees. A bad sequence:
  - CPU reads location S (S is now cached);
  - accelerator writes location S in main memory;
  - CPU writes location S: BAD, the cache and main memory now disagree.
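A sketch of the usual software remedy, assuming hypothetical cache-maintenance routines (the real names depend on the CPU and operating system):

/* Hypothetical cache-maintenance calls around an accelerator run.
   cache_flush_range() writes dirty lines back to main memory;
   cache_invalidate_range() discards cached copies so the next CPU read
   fetches the accelerator's results.  All names are illustrative only. */
extern void cache_flush_range(void *addr, unsigned len);
extern void cache_invalidate_range(void *addr, unsigned len);
extern void start_accelerator_on(void *in, void *out);
extern void wait_accelerator(void);

void run_with_coherent_buffers(void *in, void *out, unsigned len) {
    cache_flush_range(in, len);       /* make CPU-written input visible      */
    start_accelerator_on(in, out);
    wait_accelerator();
    cache_invalidate_range(out, len); /* drop stale cached copies of output  */
}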

Synchronization
- As with the cache, unsynchronized writes to shared memory can leave the CPU reading inconsistent data:
  - CPU reads S;
  - accelerator writes S;
  - CPU reads S again and may see either the old or the new value, depending on timing.
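One minimal synchronization sketch: a shared flag that the accelerator sets when its result is valid and the CPU polls before reading (the addresses are invented; a real design might use a semaphore or an interrupt instead):

/* Simple handshake through shared memory.  The flag and buffer addresses
   are made-up example values. */
#include <stdint.h>

#define SHARED_FLAG (*(volatile uint32_t *)0x20001000u) /* 1 = result ready */
#define SHARED_DATA ((volatile uint32_t *)0x20001004u)  /* result buffer    */

uint32_t read_result(void) {
    while (SHARED_FLAG == 0)
        ;                        /* wait until the accelerator sets the flag */
    uint32_t value = SHARED_DATA[0];
    SHARED_FLAG = 0;             /* acknowledge so the buffer can be reused  */
    return value;
}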

Partitioning
- Divide the functional specification into units and map the units onto PEs; units may become processes.
- Determine the proper level of parallelism: for example, treat f3(f1(), f2()) as a single unit, or split f1(), f2() and f3() into separate units so that f1() and f2() can run in parallel (see the sketch below).
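A sketch of what the finer-grained partition might look like if f1() were mapped onto the accelerator while the CPU computes f2(); the accelerator wrappers are hypothetical:

/* Finer-grained partition of f3(f1(x), f2(y)): f1 runs on the accelerator
   while the CPU computes f2, then f3 combines the results.
   start_f1_on_accelerator() and wait_f1() are hypothetical wrappers. */
extern void start_f1_on_accelerator(int x);   /* non-blocking start  */
extern int  wait_f1(void);                    /* collect f1's result */
extern int  f2(int y);
extern int  f3(int a, int b);

int compute(int x, int y) {
    start_f1_on_accelerator(x);   /* f1 and f2 now execute in parallel */
    int b = f2(y);
    int a = wait_f1();
    return f3(a, b);
}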

Partitioning methodology
- Divide the CDFG into pieces, shuffling functions between the pieces.
- Hierarchically decompose the CDFG to identify possible partitions.

Partitioning example
(CDFG with conditionals cond 1 and cond 2 selecting among Block 1, Block 2 and Block 3, partitioned into processes P1-P5; figure not reproduced.)

Scheduling and allocation
- Must:
  - schedule operations in time;
  - allocate computations to processing elements.
- Scheduling and allocation interact, but separating them helps; one approach is to allocate first, then schedule.

Example: scheduling and allocation
(Task graph: P1 and P2 feed P3 over edges with communication costs d1 and d2; hardware platform: two processing elements M1 and M2 connected by a network.)

Example process execution times
(Table of execution times for P1, P2 and P3 on M1 and M2; not reproduced in the transcript.)

Example communication model
- Assume communication within a PE is free.
- The cost of communication from P1 to P3 is d1 = 2; the cost of P2-to-P3 communication is d2 = 4.

First design
- Allocate P1, P2 -> M1; P3 -> M2.
(Gantt chart of M1, M2 and the network, with the d2 transfer crossing the network before P3 starts.) Total time = 19.

Second design
- Allocate P1 -> M1; P2, P3 -> M2.
(Gantt chart of M1, M2 and the network.) Total time = 18.

System integration and debugging
- Try to debug the CPU/accelerator interface separately from the accelerator core.
- Build scaffolding to test the accelerator.
- Hardware/software co-simulation can be useful.

Overview
- CPUs and accelerators
- Accelerated system design
  - performance analysis
  - scheduling and allocation
- Design example: video accelerator

Concept
- Build an accelerator for block motion estimation, one step in video compression.
- The estimator performs a two-dimensional correlation between a block of one frame (frame 1) and a region of the next frame (frame 2).

Block motion estimation
- MPEG divides each frame into 16 x 16 macroblocks for motion estimation.
- Search for the best match within a search range.
- Measure similarity with the sum of absolute differences (SAD):
  SAD = Σ_{i,j} | M(i,j) - S(i - o_x, j - o_y) |
  where M is the macroblock, S the search area and (o_x, o_y) the candidate offset (see the helper sketched below).
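A small C helper that evaluates the SAD for one candidate offset, written to match the formula above; the array names and the XCENTER/YCENTER constants follow the full-search code later in this chapter, and the 33 x 33 search-area size matches the 1,089-pixel figure quoted later:

/* SAD between macroblock mb and the search area at offset (ox, oy). */
#define MBSIZE     16
#define SEARCHSIZE 8
#define SAREA      (MBSIZE + 2 * SEARCHSIZE + 1)   /* 33 x 33 search area */
#define XCENTER    SEARCHSIZE
#define YCENTER    SEARCHSIZE

static int iabs(int x) { return x < 0 ? -x : x; }

int sad(int mb[MBSIZE][MBSIZE], int search[SAREA][SAREA], int ox, int oy) {
    int result = 0;
    for (int i = 0; i < MBSIZE; i++)
        for (int j = 0; j < MBSIZE; j++)
            result += iabs(mb[i][j] - search[i - ox + XCENTER][j - oy + YCENTER]);
    return result;
}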

Best match
- The best match produces the motion vector for the macroblock. (Figure not reproduced.)

Full search algorithm

/* Exhaustively evaluate every candidate offset in the search range and
   keep the offset with the smallest SAD. */
bestx = 0; besty = 0;
bestsad = MAXSAD;
for (ox = -SEARCHSIZE; ox < SEARCHSIZE; ox++) {
    for (oy = -SEARCHSIZE; oy < SEARCHSIZE; oy++) {
        int result = 0;
        for (i = 0; i < MBSIZE; i++) {
            for (j = 0; j < MBSIZE; j++) {
                result += iabs(mb[i][j] -
                               search[i - ox + XCENTER][j - oy + YCENTER]);
            }
        }
        if (result <= bestsad) {   /* remember the best (lowest) SAD so far */
            bestsad = result;
            bestx = ox;
            besty = oy;
        }
    }
}

Computational requirements
- Let MBSIZE = 16 and SEARCHSIZE = 8.
- The search covers ±8 offsets in each dimension, i.e. 17 x 17 candidate positions.
- Must perform n_ops = (16 x 16) x (17 x 17) = 73,984 absolute-difference operations per macroblock.
- CIF format has 352 x 288 pixels -> 22 x 18 = 396 macroblocks per frame.
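A quick sanity check of these numbers; the 30 frames/s rate used here is an assumption for illustration, not a figure from the slides:

/* Rough operation-count estimate for full-search motion estimation.
   The 30 frames/s rate is an assumed value for illustration. */
#include <stdio.h>

int main(void) {
    long ops_per_mb    = 16L * 16 * 17 * 17;       /* 73,984 ops/macroblock */
    long mbs_per_frame = (352 / 16) * (288 / 16);  /* 22 x 18 = 396         */
    long fps           = 30;                       /* assumed frame rate    */
    long ops_per_frame = ops_per_mb * mbs_per_frame;

    printf("ops/frame = %ld, ops/second = %ld\n",
           ops_per_frame, ops_per_frame * fps);    /* ~29.3 M and ~879 M    */
    return 0;
}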

Accelerator requirements
(Table of performance requirements; not reproduced in the transcript.)

Accelerator data types, basic classes
- Motion-vector: x, y : pos
- Macroblock: pixels[] : pixelval
- Search-area: pixels[] : pixelval
- PC: memory[]
- Motion-estimator: compute-mv()

Sequence diagram
(The PC calls compute-mv() on the Motion-estimator, supplying the search area and macroblocks from memory[].)

Architectural considerations
- Requires a large amount of memory: a macroblock has 256 pixels; the search area has 1,089 pixels.
- May need external memory (especially if buffering multiple macroblocks/search areas).

Motion estimator organization
(Block diagram: an address generator streams the search area and macroblock over a network to processing elements PE 0 through PE 15; a comparator collects their outputs and produces the motion vector.)

Pixel schedules
(Schedule of macroblock pixels M(i,j) and search-area pixels S(i,j), e.g. M(0,0) and S(0,2), through the PEs; figure not reproduced.)

System testing
- Testing requires a large amount of data.
- Use simple patterns with obvious answers for initial tests (see the sketch below).
- Extract sample data from JPEG pictures for more realistic tests.
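One possible "simple pattern with an obvious answer", sketched as an assumption about what such a test could look like: plant the macroblock in an otherwise empty search area at a known offset and check that the estimator reports that offset. find_motion_vector() is a hypothetical wrapper around the full-search code above.

/* Plant the macroblock at a known offset and check the reported vector.
   find_motion_vector() is a hypothetical wrapper around the full-search
   code shown earlier; everything else is self-contained test scaffolding. */
#include <assert.h>
#include <string.h>

#define MBSIZE     16
#define SEARCHSIZE 8
#define SAREA      (MBSIZE + 2 * SEARCHSIZE + 1)

extern void find_motion_vector(int mb[MBSIZE][MBSIZE],
                               int search[SAREA][SAREA],
                               int *bestx, int *besty);

void test_known_offset(void) {
    static int mb[MBSIZE][MBSIZE];
    static int search[SAREA][SAREA];
    int ox = 3, oy = -2, bestx, besty;   /* planted, known-good offset */

    memset(search, 0, sizeof(search));   /* empty background           */
    for (int i = 0; i < MBSIZE; i++)
        for (int j = 0; j < MBSIZE; j++) {
            mb[i][j] = i * MBSIZE + j;   /* simple ramp, all values distinct */
            search[i - ox + SEARCHSIZE][j - oy + SEARCHSIZE] = mb[i][j];
        }

    find_motion_vector(mb, search, &bestx, &besty);
    assert(bestx == ox && besty == oy);  /* the obvious expected answer */
}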