PradeepKumar S K, Asst. Professor, Dept. of ECE, KIT, Tiptur. E-mail: pradeepsk13@gmail.com, pradeepsk@kitece.com
How to improve the performance of a microprocessor system?
- Choose a faster version of your microprocessor, or
- Add additional computational units that perform special functions:
  - Standard component (graphics processor)
  - Coprocessor (floating-point processor)
  - Additional microprocessor
  - Hardware accelerator
Hardware Accelerators
If the overall performance of a uniprocessor system is too slow, additional hardware can be used to speed up the system. This hardware is called a hardware accelerator. The hardware accelerator is a component that works together with the processor and executes key functions much faster than the processor can.
Amdahl's Law
Amdahl's law, also known as Amdahl's argument, is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup from using multiple processors. The law is named after computer architect Gene Amdahl and was presented at the AFIPS Spring Joint Computer Conference in 1967.
The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. For example, suppose a program needs 20 hours using a single processor core, a particular portion that takes one hour to execute cannot be parallelized, and the remaining 19 hours (95%) of execution time can be parallelized. Then, regardless of how many processors are devoted to the parallelized execution of this program, the minimum execution time cannot be less than that critical one hour. Hence the speedup is limited to at most 20×.
Amdahl's law is a model for the relationship between the expected speedup of a parallelized implementation of an algorithm relative to the serial algorithm, under the assumption that the problem size remains the same when parallelized. For example, if for a given problem size a parallelized implementation can run 12% of the algorithm's operations arbitrarily quickly (while the remaining 88% of the operations are not parallelizable), Amdahl's law states that the maximum speedup of the parallelized version is 1/(1 − 0.12) ≈ 1.136 times as fast as the non-parallelized implementation.
More technically, the law concerns the speedup achievable from an improvement to a computation that affects a proportion P of that computation, where the improvement has a speedup of S. (For example, if 30% of the computation may be the subject of a speedup, P will be 0.3; if the improvement makes the affected portion twice as fast, S will be 2.) Amdahl's law states that the overall speedup of applying the improvement will be:

Speedup = 1 / ((1 − P) + P/S)
If F is the fraction of a calculation that is sequential, and (1 − F) is the fraction that can be parallelized, then the maximum speedup that can be achieved by using P processors is 1/(F + (1 − F)/P).
Examples (90% of the calculation can be parallelized, i.e. F = 0.1):
- On 5 processors the maximum speedup is 1/(0.1 + 0.9/5), or roughly 3.6 (the program can theoretically run 3.6 times faster on five processors than on one).
- On 10 processors the maximum speedup is 1/(0.1 + 0.9/10), or roughly 5.3 (investing twice as much hardware speeds the calculation up by only about 50%).
- On 20 processors the maximum speedup is 1/(0.1 + 0.9/20), or roughly 6.9 (doubling the hardware again speeds up the calculation by only 30%).
An Accelerator is not a Co-Processor
- A co-processor is connected to the CPU and executes special instructions. Instructions are dispatched by the CPU.
- An accelerator appears as a device on the bus.
Design of a Hardware Accelerator
- Which functions shall be implemented in hardware, and which in software?
- Hardware/software co-design: joint design of the hardware and software architectures.
- The hardware accelerator can be implemented as an
  - Application-specific integrated circuit (ASIC), or
  - Field-programmable gate array (FPGA).
Hardware/Software Co-Design
Hardware/software co-design covers the following problems:
- Co-specification: the creation of specifications that describe both the hardware and the software of a system.
- Co-synthesis: the automatic or semi-automatic design of hardware and software to meet a specification.
- Co-simulation: the simultaneous simulation of hardware and software elements, possibly at different levels of abstraction.
Co-Synthesis
Four tasks are included in co-synthesis:
- Partitioning: the functionality of the system is divided into smaller, interacting computation units.
- Allocation: the decision which computational resources are used to implement the functionality of the system.
- Scheduling: if several system functions have to share the same resource, the usage of the resource must be scheduled in time.
- Mapping: the selection of a particular allocated computational resource for each computation unit.
All these tasks depend on each other!
Partitioning
- During partitioning, the functionality of the system is divided into several parts (corresponding to the allocated/available components).
- Many possible partitions exist.
- Analysis is done by evaluating the costs of the different partitions.
Estimation
To get a good partitioning, good performance figures are needed for each function on the different components:
- execution time
- communication time
Accuracy and fidelity:
- The accuracy of an estimate is a measure of how close the estimate is to the actual value on the real implementation.
- The fidelity of an estimation method is defined as the percentage of correctly predicted comparisons between design implementations.
Hardware/Software Co-Design
Strategies:
- Start with an "all-software" configuration; while the constraints are not satisfied, move the software function that gives the best improvement to hardware (implemented in COSYMA [Ernst, Henkel, Brenner 1993]).
- Start with an "all-hardware" configuration; while the constraints are satisfied, move the most costly hardware component to software (implemented in Vulcan [Gupta, DeMicheli 1995]).
System design tasks:
- Design a heterogeneous multiprocessor architecture. Processing element (PE): CPU, accelerator, etc.
- Divide the tasks among the processing elements.
- Verify that the functionality of the system is correct and that the system meets its performance constraints.
Why accelerators? (cont'd)
- Good for processing I/O in real time.
- May consume less energy.
- May be better at streaming data.
- Even the largest single CPU may not be able to do all the work.
Accelerated system design:
- First, determine that the system really needs to be accelerated.
- Which core function(s) shall be accelerated? (Partitioning)
- How much faster is the accelerator on the core function?
- How much is the data transfer overhead?
Design tasks:
- Performance analysis; scheduling and allocation.
- Design the accelerator itself.
- Design the CPU interface to the accelerator.
Sources of Parallelism
- Overlap I/O and accelerator computation: perform operations in batches, reading in the second batch of data while computing on the first batch.
- Find other work to do on the CPU: operations may be rescheduled to move CPU work after the accelerator has been started.
Data Input/Output Times
Bus transactions include:
- flushing register/cache values to main memory;
- the time required for the CPU to set up the transaction;
- the overhead of data transfers by bus packets, handshaking, etc.
Accelerator/CPU Interface
- Accelerator registers provide control registers for the CPU.
- Data registers can be used for small data objects.
- The accelerator may include special-purpose read/write logic, which is especially valuable for large data transfers.
Caching problems:
- Main memory provides the primary data transfer mechanism to the accelerator.
- Programs must ensure that caching does not invalidate main memory data (assuming a cache in the CPU).
Scheduling and Allocation
Must:
- schedule operations in time;
- allocate computations to processing elements.
Scheduling and allocation interact, but separating them helps: schedule first and then allocate, or alternatively allocate first and then schedule.
System Integration and Debugging
- Try to debug the CPU/accelerator interface separately from the accelerator core.
- Build equipment to test the accelerator.
- Hardware/software co-simulation can be useful.
Summary
- The use of a hardware accelerator can lead to a more efficient solution, in particular when the parallelism in the functionality can be exploited.
- Hardware/software co-design techniques can be used for the design of an accelerator.
- You have to be aware of cache coherence problems if the processor or accelerator uses a cache.