© 2008 Wayne Wolf. Overheads for Computers as Components, 2nd ed.
Multiprocessors

Presentation transcript:

Multiprocessors
- Why multiprocessors?
- CPUs and accelerators.
- Multiprocessor performance analysis.

Why multiprocessors?
- Better cost/performance.
  - Match each CPU to its tasks or use custom logic (smaller, cheaper).
  - CPU cost is a non-linear function of performance.
- [Figure: cost as a function of performance]

Why multiprocessors? (cont'd)
- Better real-time performance.
  - Put time-critical functions on less-loaded processing elements.
  - Remember RMS utilization: extra CPU cycles must be reserved to meet deadlines.
- [Figure: cost vs. performance, marking the deadline and the deadline with RMS overhead]
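
The RMS remark can be made concrete with a small calculation. The sketch below assumes the standard Liu and Layland utilization bound for rate-monotonic scheduling, n(2^(1/n) - 1), which the slide itself does not state; the function names are illustrative only.

```python
def rms_utilization_bound(n):
    """Least upper bound on CPU utilization for n periodic tasks under RMS.

    Assumes the Liu-Layland bound n * (2**(1/n) - 1); not taken from the slide.
    """
    if n < 1:
        raise ValueError("need at least one task")
    return n * (2 ** (1 / n) - 1)


def reserved_fraction(n):
    """Fraction of CPU cycles that must stay idle to guarantee deadlines."""
    return 1 - rms_utilization_bound(n)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, round(rms_utilization_bound(n), 4), round(reserved_fraction(n), 4))
```

As the number of tasks on one CPU grows, the guaranteed utilization drops toward ln 2 (about 69%), which is one way to read the "deadline with RMS overhead" marker: a single CPU has to be faster than the raw workload suggests.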

Why multiprocessors? (cont'd)
- Using specialized processors or custom logic saves power.
- Desktop uniprocessors are not power-efficient enough for battery-powered applications.
- [Figure from [Aus04], © 2004 IEEE Computer Society]

Why multiprocessors? (cont'd)
- Good for processing I/O in real time.
- May consume less energy.
- May be better at streaming data.
- Even the largest single CPU may not be able to do all the work.

Accelerated systems
- Use an additional computational unit dedicated to some functions:
  - Hardwired logic.
  - Extra CPU.
- Hardware/software co-design: joint design of hardware and software architectures.

Accelerated system architecture
- [Figure: block diagram of CPU, accelerator, memory, and I/O, with arrows labeled request, data, result, and data]

Accelerator vs. co-processor
- A co-processor executes instructions.
  - Instructions are dispatched by the CPU.
- An accelerator appears as a device on the bus.
  - The accelerator is controlled by registers.

Accelerator implementations
- Application-specific integrated circuit (ASIC).
- Field-programmable gate array (FPGA).
- Standard component.
  - Example: graphics processor.

System design tasks
- Design a heterogeneous multiprocessor architecture.
  - Processing element (PE): CPU, accelerator, etc.
- Program the system.

Accelerated system design
- First, determine that the system really needs to be accelerated.
  - How much faster is the accelerator on the core function?
  - How much data transfer overhead?
- Design the accelerator itself.
- Design the CPU interface to the accelerator.

Accelerated system platforms
- Several off-the-shelf boards are available for acceleration in PCs:
  - FPGA-based core;
  - PC bus interface.

Accelerator/CPU interface
- The accelerator provides control registers for the CPU.
- Data registers can be used for small data objects.
- The accelerator may include special-purpose read/write logic.
  - Especially valuable for large data transfers.
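
A register-based interface of this kind can be illustrated with a toy software model. Everything in the sketch below is hypothetical: the register names, the START and DONE bit values, the fixed latency, and the squaring operation are invented for illustration and do not describe any real device.

```python
class ToyAccelerator:
    """Toy model of an accelerator seen by the CPU as a set of registers."""

    CTRL_START = 0x1   # hypothetical control bit: begin computation
    STATUS_DONE = 0x1  # hypothetical status bit: result is ready

    def __init__(self, latency=3):
        if latency < 1:
            raise ValueError("latency must be at least one cycle")
        self.data_in = 0    # data register written by the CPU
        self.data_out = 0   # data register read by the CPU
        self.status = 0
        self._latency = latency
        self._remaining = 0

    def write_control(self, value):
        """CPU writes the control register; START kicks off the computation."""
        if value & self.CTRL_START:
            self.status = 0
            self._remaining = self._latency

    def read_status(self):
        """CPU reads the status register; each read advances the toy device one cycle."""
        if self._remaining > 0:
            self._remaining -= 1
            if self._remaining == 0:
                self.data_out = self.data_in * self.data_in
                self.status = self.STATUS_DONE
        return self.status


def run_on_accelerator(acc, value):
    """Blocking use of the accelerator: write data, start, poll until done."""
    acc.data_in = value
    acc.write_control(ToyAccelerator.CTRL_START)
    polls = 0
    while not (acc.read_status() & ToyAccelerator.STATUS_DONE):
        polls += 1
    return acc.data_out, polls


if __name__ == "__main__":
    print(run_on_accelerator(ToyAccelerator(latency=3), 7))
```

The data registers carry only a single small value here; for large data objects the slide's point is that the accelerator would read and write main memory itself instead.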

System integration and debugging
- Try to debug the CPU/accelerator interface separately from the accelerator core.
- Build scaffolding to test the accelerator.
- Hardware/software co-simulation can be useful.

Caching problems
- Main memory provides the primary data transfer mechanism to the accelerator.
- Programs must ensure that caching does not invalidate main memory data.
  - CPU reads location S.
  - Accelerator writes location S.
  - CPU writes location S. (BAD)

Synchronization
- As with caching, writes to shared main memory may cause invalidation:
  - CPU reads S.
  - Accelerator writes S.
  - CPU reads S.
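
The two access sequences on the caching and synchronization slides can be replayed against a toy cache model. The sketch assumes a write-back CPU cache with no hardware coherence and an accelerator that writes main memory directly; the class, the invalidate step, and the values are illustrative assumptions, not part of the slides.

```python
class WriteBackCache:
    """Toy write-back cache with no coherence support (an assumption)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}     # address -> cached value
        self.dirty = set()  # addresses modified in the cache only

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)

    def flush(self):
        """Write dirty lines back to main memory."""
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()

    def invalidate(self, addr):
        """Drop a cached line so the next read goes to main memory."""
        self.lines.pop(addr, None)
        self.dirty.discard(addr)


def stale_read():
    """CPU reads S, accelerator writes S, CPU reads S: the CPU sees the old value."""
    memory = {"S": 1}
    cache = WriteBackCache(memory)
    cache.read("S")
    memory["S"] = 2            # accelerator writes main memory directly
    return cache.read("S")


def lost_update():
    """CPU reads S, accelerator writes S, CPU writes S: the accelerator's value is lost."""
    memory = {"S": 1}
    cache = WriteBackCache(memory)
    old = cache.read("S")
    memory["S"] = 2            # accelerator result
    cache.write("S", old + 10)  # CPU update computed from the stale value
    cache.flush()
    return memory["S"]


def safe_read():
    """Same as stale_read, but software invalidates S before re-reading."""
    memory = {"S": 1}
    cache = WriteBackCache(memory)
    cache.read("S")
    memory["S"] = 2
    cache.invalidate("S")
    return cache.read("S")
```

In this model `stale_read()` returns the old value, `lost_update()` overwrites the accelerator's result, and `safe_read()` shows one possible software remedy (explicit invalidation before reading shared data).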

Multiprocessor performance analysis
- Effects of parallelism (and lack of it):
  - Processes.
  - CPU and bus.
  - Multiple processors.

Accelerator speedup
- Critical parameter is speedup: how much faster is the system with the accelerator?
- Must take into account:
  - Accelerator execution time.
  - Data transfer time.
  - Synchronization with the master CPU.

Accelerator execution time
- Total accelerator execution time:
  $t_{accel} = t_{in} + t_x + t_{out}$
  - $t_{in}$: data input time
  - $t_x$: accelerated computation time
  - $t_{out}$: data output time

Accelerator speedup
- Assume the loop is executed n times.
- Compare the accelerated system to the non-accelerated system:
  $S = n(t_{CPU} - t_{accel}) = n[t_{CPU} - (t_{in} + t_x + t_{out})]$
  - $t_{CPU}$: execution time on the CPU
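
A short numeric sketch of the two formulas, with made-up timing values. Note that S as defined on the slide is a time difference (total time saved over n iterations), not a ratio; the function names are illustrative.

```python
def accelerator_time(t_in, t_x, t_out):
    """t_accel = t_in + t_x + t_out (input, accelerated computation, output)."""
    return t_in + t_x + t_out


def speedup(n, t_cpu, t_in, t_x, t_out):
    """S = n * (t_CPU - t_accel): total time saved over n loop iterations."""
    return n * (t_cpu - accelerator_time(t_in, t_x, t_out))


if __name__ == "__main__":
    # Hypothetical numbers: the core function takes 50 time units on the CPU,
    # 10 on the accelerator, plus 5 units each way for data transfer.
    print(speedup(n=100, t_cpu=50, t_in=5, t_x=10, t_out=5))
    # With heavier transfer overhead the accelerator makes things worse (S < 0).
    print(speedup(n=100, t_cpu=50, t_in=25, t_x=10, t_out=25))
```

A negative S is the case the design slide warns about: the accelerator is faster on the core function, but data transfer overhead cancels the gain.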

Single- vs. multi-threaded
- One critical factor is available parallelism:
  - single-threaded/blocking: CPU waits for accelerator;
  - multithreaded/non-blocking: CPU continues to execute along with accelerator.
- To multithread, CPU must have useful work to do.
  - But software must also support multithreading.

Total execution time
- [Figure: execution of processes P1, P2, P3, P4 and accelerator process A1 in the single-threaded and multi-threaded cases]

Execution time analysis
- Single-threaded:
  - Count execution time of all component processes.
- Multi-threaded:
  - Find longest path through execution.
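
The two analyses can be sketched as a sum and a longest-path computation over a dependency graph. The graph shape and the execution times below are assumptions made for illustration (P1 starts both P2 and A1, P3 waits for both, P4 follows P3); the original figure's edges are not recoverable from the transcript. The sketch uses `graphlib` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter


def single_threaded_time(times):
    """Blocking case: every process runs in sequence, so times simply add up."""
    return sum(times.values())


def multi_threaded_time(times, preds):
    """Non-blocking case: length of the longest path through the dependency graph.

    preds maps each process to the processes it must wait for.
    """
    finish = {}
    for node in TopologicalSorter(preds).static_order():
        start = max((finish[p] for p in preds.get(node, ())), default=0)
        finish[node] = start + times[node]
    return max(finish.values())


# Hypothetical execution times and dependencies.
TIMES = {"P1": 2, "P2": 3, "A1": 5, "P3": 1, "P4": 2}
PREDS = {"P1": [], "P2": ["P1"], "A1": ["P1"], "P3": ["P2", "A1"], "P4": ["P3"]}

if __name__ == "__main__":
    print(single_threaded_time(TIMES))        # sum of all processes
    print(multi_threaded_time(TIMES, PREDS))  # P1 -> A1 -> P3 -> P4
```

With these numbers the CPU work in P2 is fully hidden behind the accelerator process A1, so only the longer of the two branches contributes to the total.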

Sources of parallelism
- Overlap I/O and accelerator computation.
  - Perform operations in batches; read in the second batch of data while computing on the first batch.
- Find other work to do on the CPU.
  - May reschedule operations to move work after accelerator initiation.
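
The benefit of batching can be estimated with a small timing model. The sketch assumes two buffers (so the read of batch i+1 overlaps the computation on batch i), one read channel, one compute unit, and fixed per-batch times; all of these are simplifying assumptions for illustration.

```python
def sequential_time(k, t_read, t_compute):
    """No overlap: each batch is read, then computed, before the next read starts."""
    return k * (t_read + t_compute)


def overlapped_time(k, t_read, t_compute):
    """Double buffering: read batch i+1 while computing on batch i."""
    if k == 0:
        return 0
    read_done = [0] * k
    comp_done = [0] * k
    for i in range(k):
        prev_read = read_done[i - 1] if i > 0 else 0
        # With two buffers, batch i reuses the buffer freed by batch i-2.
        buffer_free = comp_done[i - 2] if i >= 2 else 0
        read_done[i] = max(prev_read, buffer_free) + t_read
        prev_comp = comp_done[i - 1] if i > 0 else 0
        comp_done[i] = max(read_done[i], prev_comp) + t_compute
    return comp_done[-1]


if __name__ == "__main__":
    print(sequential_time(4, 2, 3), overlapped_time(4, 2, 3))
```

Once the pipeline is full, the slower of the two stages sets the pace, so overlap helps most when read and compute times are similar.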

Data input/output times
- Bus transactions include:
  - flushing register/cache values to main memory;
  - time required for CPU to set up transaction;
  - overhead of data transfers by bus packets, handshaking, etc.
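
One simple way to combine these components is an additive model. The model and every parameter name below are assumptions for illustration; a real bus would need its own timing figures from the data sheet or from measurement.

```python
import math


def transfer_time(n_words, words_per_packet, t_flush, t_setup, t_packet, t_handshake):
    """Hypothetical additive estimate of t_in or t_out for one transfer.

    t_flush:     time to flush register/cache values to main memory
    t_setup:     time for the CPU to set up the transaction
    t_packet:    time to move one bus packet
    t_handshake: per-packet handshaking overhead
    """
    packets = math.ceil(n_words / words_per_packet)
    return t_flush + t_setup + packets * (t_packet + t_handshake)


if __name__ == "__main__":
    # 100 words in 16-word packets needs 7 packets.
    print(transfer_time(100, 16, t_flush=10, t_setup=5, t_packet=4, t_handshake=1))
```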

Scheduling and allocation
- Must:
  - schedule operations in time;
  - allocate computations to processing elements.
- Scheduling and allocation interact, but separating them helps.
  - Alternatively allocate, then schedule.

Example: scheduling and allocation
- [Figure: task graph with processes P1, P2, P3 and data d1, d2; hardware platform with processing elements M1 and M2]

First design
- Allocate P1, P2 -> M1; P3 -> M2.
- [Figure: schedule over time for M1 (P1, P2), M2 (P3), and the communications P1C and P2C]

Second design
- Allocate P1 -> M1; P2, P3 -> M2.
- [Figure: schedule over time for M1 (P1), M2 (P2, P3), and the communication P1C]
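
The two designs can be compared with a small scheduling sketch. It assumes that P3 needs results from both P1 and P2, that a result crossing between processing elements occupies a single shared bus for a fixed time, and that all execution and communication times are the made-up values shown; none of these numbers come from the slides.

```python
def schedule_length(times, alloc, order, edges, t_comm):
    """Completion time of a task graph under a given allocation.

    times: process -> execution time
    alloc: process -> processing element
    order: processes listed so that producers come before consumers
    edges: (producer, consumer) pairs
    t_comm: bus time for one inter-PE transfer (transfers are serialized)
    """
    pe_free = {}
    bus_free = 0
    finish = {}
    for proc in order:
        ready = 0
        for src, dst in edges:
            if dst != proc:
                continue
            arrival = finish[src]
            if alloc[src] != alloc[proc]:
                # Data must cross the bus; wait for the bus if it is busy.
                start = max(arrival, bus_free)
                bus_free = start + t_comm
                arrival = bus_free
            ready = max(ready, arrival)
        start = max(ready, pe_free.get(alloc[proc], 0))
        finish[proc] = start + times[proc]
        pe_free[alloc[proc]] = finish[proc]
    return max(finish.values())


# Hypothetical values for illustration.
TIMES = {"P1": 3, "P2": 3, "P3": 3}
EDGES = [("P1", "P3"), ("P2", "P3")]
ORDER = ["P1", "P2", "P3"]
FIRST = {"P1": "M1", "P2": "M1", "P3": "M2"}
SECOND = {"P1": "M1", "P2": "M2", "P3": "M2"}

if __name__ == "__main__":
    print(schedule_length(TIMES, FIRST, ORDER, EDGES, t_comm=2))
    print(schedule_length(TIMES, SECOND, ORDER, EDGES, t_comm=2))
```

Under these assumptions the second design finishes earlier for two reasons: P1 and P2 run in parallel, and only one result has to cross the bus.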

Example: adjusting messages to reduce delay
- Task graph: [Figure: processes P1, P2, P3 with messages d1 and d2]
- Network: [Figure: processing elements M1, M2, M3]
- [Table: allocation and execution time of each process]
- Transmission time = 4

Initial schedule
- [Figure: schedule over time on M1, M2, M3, and the network, showing P1, P2, the messages d1 and d2, and P3]
- Time = 15

New design
- Modify P3:
  - reads one packet of d1, one packet of d2;
  - computes partial result;
  - continues to next packet.

New schedule
- [Figure: schedule over time on M1, M2, M3, and the network; packets of d1 and d2 alternate on the network while P3 runs in pieces]
- Time = 12
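
The stated totals of 15 and 12 can be reproduced with a small model. The execution times (3, 3, and 4 for P1, P2, and P3), the one-process-per-PE allocation, and the split into four packets are assumptions chosen to be consistent with those totals; the allocation and execution-time table itself is not preserved in the transcript. Only the transmission time of 4 per message comes from the slide.

```python
def initial_schedule(t_p1, t_p2, t_p3, t_msg):
    """P3 starts only after both whole messages have crossed the shared network."""
    d1_done = t_p1 + t_msg                   # d1 sent once P1 finishes
    d2_done = max(t_p2, d1_done) + t_msg     # d2 waits for P2 and for the network
    return d2_done + t_p3


def packetized_schedule(t_p1, t_p2, t_p3, t_msg, n_packets):
    """P3 consumes one packet of d1 and one of d2, computes a partial result, repeats."""
    t_pkt = t_msg / n_packets
    t_part = t_p3 / n_packets
    net_free = 0
    p3_free = 0
    for _ in range(n_packets):
        d1_done = max(net_free, t_p1) + t_pkt
        d2_done = max(d1_done, t_p2) + t_pkt
        net_free = d2_done
        # P3 works on this pair of packets while the next pair is in transit.
        p3_free = max(p3_free, d2_done) + t_part
    return p3_free


if __name__ == "__main__":
    print(initial_schedule(3, 3, 4, 4))
    print(packetized_schedule(3, 3, 4, 4, n_packets=4))
```

The gain comes entirely from overlapping P3's computation with network transfers: the network is still busy for the same total time, but only the last partial computation remains after the final packet arrives.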

Buffering and performance
- Buffering may sequentialize operations.
  - Next process must wait for data to enter buffer before it can continue.
- Buffer policy (queue, RAM) affects available parallelism.

Buffers and latency
- Three processes separated by buffers:
- [Figure: buffer B1, process A, buffer B2, process B, buffer B3, process C in a chain]

Buffers and latency schedules
- First schedule: A[0] A[1] … B[0] B[1] … C[0] C[1] …
  - Must wait for all of A before getting any B.
- Second schedule: A[0] B[0] C[0] A[1] B[1] C[1] …
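
The latency difference between the two schedules can be shown by generating them and finding when the first output of C appears. The sketch assumes every operation takes one time step, which is a simplification for illustration.

```python
def batch_schedule(n):
    """All of A, then all of B, then all of C."""
    return [f"{p}[{i}]" for p in "ABC" for i in range(n)]


def interleaved_schedule(n):
    """A, B, C on item 0, then A, B, C on item 1, and so on."""
    return [f"{p}[{i}]" for i in range(n) for p in "ABC"]


def first_output_step(schedule):
    """1-based step at which C produces its first result."""
    return schedule.index("C[0]") + 1


if __name__ == "__main__":
    n = 3
    print(batch_schedule(n), first_output_step(batch_schedule(n)))
    print(interleaved_schedule(n), first_output_step(interleaved_schedule(n)))
```

Both schedules perform the same 3n operations in total, but in the batch order the first result appears only after 2n + 1 steps, while the interleaved order delivers it after 3 steps regardless of n.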