Published by Karson Hoyle. Modified over 9 years ago.
1
Computer Architecture
Dr. Esam Al_Qaralleh, Princess Sumaya University for Technology
2
Performance & cost
3
Performance Evolution
1970s: Mainframes dominated; performance improved 25-30%/yr, mostly due to improved architecture plus some technology aids.
1980s: VLSI and the microprocessor became the foundation; technology improved at 35%/yr.
4
Performance Evolution (Cont.)
1980s (Cont.): Compiler focus brought on the great CISC vs. RISC debate. With the exception of Intel, RISC won the argument.
RISC performance improved by 50%/year initially.
Of course, RISC is not as simple anymore, and the compiler is a key part of the game: it does not matter how fast your computer is if the compiler wastes most of that speed because it cannot generate efficient code.
With the exploitation of instruction-level parallelism (pipelining + superscalar) and the use of caches, performance is further enhanced.
CISC: Complex Instruction Set Computing
RISC: Reduced Instruction Set Computing (jokingly, "Relegate Important Stuff to the Compiler")
5
Growth in Performance (Figure 1.1)
[Figure: growth in performance over time, with regions annotated "Mainly due to advanced architecture ideas" and "Technology driven".]
6
Optimizing the Design
Usually the functional requirements are set by the company/marketplace.
Which design is optimal depends on the choice of metric:
Cost minimized: simple design
Performance maximized: complex design or better technology
Time to market minimized: also favors simplicity
Oh, and you only get one shot.
Requires heaps of simulation, and you must quantify everything.
Inherent requirements for deep infrastructure and support.
Plus you must predict the trends…
7
Cost, Price, and Their Trends
8
Cost
Clearly a marketplace issue: profit as a function of volume.
Let's focus on hardware costs. Factors impacting cost:
Learning curve: manufacturing costs decrease over time.
Yield: the percentage of manufactured devices that survive the testing procedure.
Volume is also a key factor in determining cost.
Commodities are products that are sold by multiple vendors in large volumes and are essentially identical (e.g., laptops).
9
Learning Curve at Work
10
Integrated Circuits Costs Die Cost goes roughly with die area
11
Cost of an Integrated Circuit
Die yield is the fraction or percentage of good dies on a wafer.
α is a parameter that corresponds roughly to the number of masking levels, a measure of manufacturing complexity, critical to die yield (α = 4.0 is a good estimate).
12
Example: Finding the number of dies
Find the number of dies per 30-cm wafer for a die that is 0.7 cm on a side.
Ans: The die area is 0.49 cm². Thus
$\text{Dies per wafer} = \dfrac{\pi \times (30/2)^2}{0.49} - \dfrac{\pi \times 30}{(2 \times 0.49)^{0.5}} = 1347$
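The calculation in this example can be checked with a few lines of Python. This is a minimal sketch: the function name `dies_per_wafer` and its parameter names are illustrative, not from the slides, and the result is rounded down on the assumption that only whole dies count.

```python
import math

def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> int:
    """Estimate whole dies per wafer: wafer area over die area,
    minus a correction for partial dies along the wafer edge."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return math.floor(wafer_area / die_area_cm2 - edge_loss)

# The slide's case: 30-cm wafer, die 0.7 cm on a side (0.49 cm^2).
print(dies_per_wafer(30, 0.49))
```

For the slide's inputs this reproduces the 1347 dies quoted above.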
13
Example: Finding the die yield
Find the die yield for dies that are 1 cm on a side and 0.7 cm on a side, assuming a defect density of 0.6 per cm².
Ans: The die areas are 1 cm² and 0.49 cm². For the larger die, the yield is
$\text{Die yield} = \left(1 + \dfrac{0.6 \times 1}{4}\right)^{-4} = 0.57$
For the smaller die, it is
$\text{Die yield} = \left(1 + \dfrac{0.6 \times 0.49}{4}\right)^{-4} = 0.75$
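A short sketch of the same yield calculation, useful for trying other die sizes. The function name and the `wafer_yield` parameter are assumptions added for illustration; the example above implicitly takes wafer yield as 100%, which is the default here.

```python
def die_yield(defects_per_cm2: float, die_area_cm2: float,
              alpha: float = 4.0, wafer_yield: float = 1.0) -> float:
    """Die yield model used in the example: (1 + D*A/alpha) ** -alpha.
    wafer_yield is an assumed extra factor, left at 1.0 to match the slide."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

for side_cm in (1.0, 0.7):
    area = side_cm ** 2
    print(f"die {side_cm} cm on a side: yield = {die_yield(0.6, area):.2f}")
```

Shrinking the die from 1 cm² to 0.49 cm² raises the yield from about 0.57 to about 0.75, on top of the larger number of dies per wafer.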
14
Computer Designers and Chip Costs The computer designer affects die size, and hence cost, both by what functions are included on or excluded from the die and by the number of I/O pins
15
Measuring and Reporting Performance
16
Definitions of Time
Time can be defined in different ways, depending on what we are measuring:
Response time: total time to complete a task, including time spent executing on the CPU, accessing disk and memory, waiting for I/O and other processes, and operating system overhead.
CPU execution time: total time a CPU spends computing on a given task (excludes time for I/O or running other programs). This is also referred to as simply CPU time.
User CPU time: total time the CPU spends in the program itself.
System CPU time: total time the operating system spends executing tasks for the program.
For example, a program may have a system CPU time of 22 sec, a user CPU time of 90 sec, a CPU execution time of 112 sec, and a response time of 162 sec.
17
Performance
Time to do the task (execution time): execution time, response time, latency
Tasks per day, hour, week, sec, ns... (performance): performance, throughput, bandwidth
Response time: the time between the start and the completion of a task. Thus, to maximize performance, we need to minimize execution time.
If X is n times faster than Y, then $\dfrac{\text{ExTime}_Y}{\text{ExTime}_X} = \dfrac{\text{Performance}_X}{\text{Performance}_Y} = n$
Throughput: the total amount of work done in a given time. Important to data center managers.
Decreasing response time almost always improves throughput.
18
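Calculating CPU Performance
We want to distinguish elapsed time from the time spent on our task.
CPU execution time (CPU time): the time the CPU spends working on a task. It does not include time waiting for I/O or running other programs.
$\text{CPU time} = \text{CPU clock cycles for a program} \times \text{clock cycle time} = \dfrac{\text{CPU clock cycles for a program}}{\text{clock rate}}$
Performance can therefore be improved by reducing either the length of the clock cycle or the number of clock cycles required for a program.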
19
Calculating CPU Performance (Cont.)
We tend to count instructions executed (IC).
Note that looking at the object code is just a start: what we care about is the dynamic count, e.g. don't forget loops, recursion, branches, etc.
CPI (clock cycles per instruction) is a figure of merit.
20
Calculating CPU Performance (Cont.)
3 focus factors: cycle time, CPI, IC.
Sadly, they are interdependent, and making one better often makes another worse (but with small or predictable impacts).
Cycle time depends on HW technology and organization.
CPI depends on organization (pipeline, caching...) and ISA.
IC depends on ISA and compiler technology.
Often CPIs are easier to deal with on a per-instruction basis.
21
Clock Cycles per Instruction
Not all instructions take the same amount of time to execute. One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction.
Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute. A way to compare two different implementations of the same ISA.
$\text{CPU clock cycles for a program} = \text{Instructions for a program} \times \text{Average clock cycles per instruction}$

| Instruction class | A | B | C |
|---|---|---|---|
| CPI for this instruction class | 1 | 2 | 3 |
22
Effective CPI
The overall effective CPI is computed by taking the different types of instructions and their individual cycle counts and averaging:
$\text{Overall effective CPI} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{IC}_i)$
where $\text{IC}_i$ is the count (percentage) of instructions of class i executed, $\text{CPI}_i$ is the (average) number of clock cycles per instruction for that instruction class, and n is the number of instruction classes.
The overall effective CPI varies by instruction mix, a measure of the dynamic frequency of instructions across one or many programs.
23
A Simple Example

| Op | Freq | CPI_i | Freq x CPI_i |
|---|---|---|---|
| ALU | 50% | 1 | |
| Load | 20% | 5 | |
| Store | 10% | 3 | |
| Branch | 20% | 2 | |
| Σ | | | |

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
How does this compare with using branch prediction to shave a cycle off the branch time?
What if two ALU instructions could be executed at once?
24
A Simple Example (answers)

| Op | Freq | CPI_i | Freq x CPI_i | Better data cache | Branch prediction | Two ALU ops at once |
|---|---|---|---|---|---|---|
| ALU | 50% | 1 | 0.5 | 0.5 | 0.5 | 0.25 |
| Load | 20% | 5 | 1.0 | 0.4 | 1.0 | 1.0 |
| Store | 10% | 3 | 0.3 | 0.3 | 0.3 | 0.3 |
| Branch | 20% | 2 | 0.4 | 0.4 | 0.2 | 0.4 |
| Σ | | | 2.2 | 1.6 | 2.0 | 1.95 |

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? CPU time new = 1.6 x IC x CC, so 2.2/1.6 means 37.5% faster.
How does this compare with using branch prediction to shave a cycle off the branch time? CPU time new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster.
What if two ALU instructions could be executed at once? CPU time new = 1.95 x IC x CC, so 2.2/1.95 means 12.8% faster.
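The three what-if questions can be replayed in code. The sketch below assumes, as the slide does, that instruction count and cycle time stay fixed so CPU time scales with effective CPI; the helper names (`effective_cpi`, `with_cpi`, `speedup`) are illustrative.

```python
# Instruction mix from the slide: op -> (frequency, CPI)
BASE_MIX = {
    "ALU": (0.5, 1.0),
    "Load": (0.2, 5.0),
    "Store": (0.1, 3.0),
    "Branch": (0.2, 2.0),
}

def effective_cpi(mix):
    """Sum of frequency x CPI over all instruction classes."""
    return sum(freq * cpi for freq, cpi in mix.values())

def with_cpi(mix, op, new_cpi):
    """Return a copy of the mix with one class's CPI replaced."""
    changed = dict(mix)
    changed[op] = (mix[op][0], new_cpi)
    return changed

def speedup(old_mix, new_mix):
    """IC and cycle time are assumed unchanged, so time scales with CPI."""
    return effective_cpi(old_mix) / effective_cpi(new_mix)

alternatives = {
    "better data cache (load CPI 2)": with_cpi(BASE_MIX, "Load", 2.0),
    "branch prediction (branch CPI 1)": with_cpi(BASE_MIX, "Branch", 1.0),
    "two ALU ops at once (ALU CPI 0.5)": with_cpi(BASE_MIX, "ALU", 0.5),
}
for name, mix in alternatives.items():
    print(f"{name}: CPI {effective_cpi(mix):.2f}, "
          f"speedup {speedup(BASE_MIX, mix):.3f}")
```

The load improvement wins because loads contribute the largest share (1.0 of 2.2) of the baseline CPI.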
25
Example of Computing CPU time
If a computer has a clock rate of 50 MHz, how long does it take to execute a program with 1,000 instructions, if the CPI for the program is 3.5?
Using the equation CPU time = Instruction count x CPI / clock rate gives
$\text{CPU time} = \dfrac{1000 \times 3.5}{50 \times 10^6} = 7 \times 10^{-5}\ \text{s} = 70\ \mu\text{s}$
If a computer's clock rate increases from 200 MHz to 250 MHz and the other factors remain the same, how many times faster will the computer be?
$\dfrac{\text{CPU time}_{old}}{\text{CPU time}_{new}} = \dfrac{\text{clock rate}_{new}}{\text{clock rate}_{old}} = \dfrac{250\ \text{MHz}}{200\ \text{MHz}} = 1.25$
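Both calculations follow from the same equation, so a single helper covers them. This is a sketch; the function name `cpu_time` and its parameter names are illustrative.

```python
def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time in seconds = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

# First question: 1,000 instructions, CPI 3.5, 50 MHz clock.
t = cpu_time(1_000, 3.5, 50e6)
print(f"CPU time: {t * 1e6:.1f} microseconds")

# Second question: same program, clock raised from 200 MHz to 250 MHz.
ratio = cpu_time(1_000, 3.5, 200e6) / cpu_time(1_000, 3.5, 250e6)
print(f"speedup from the faster clock: {ratio:.2f}")
```

Because IC and CPI cancel in the ratio, the second result depends only on the two clock rates.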
26
Comparing and Summarizing Performance
The guiding principle in reporting performance measurements is reproducibility: list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.)).
How do we summarize the performance of a benchmark set with a single number?
The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM):
$\text{AM} = \dfrac{1}{n} \sum_{i=1}^{n} \text{Time}_i$
where $\text{Time}_i$ is the execution time for the i-th program of a total of n programs in the workload.
A smaller mean indicates a smaller average execution time and thus improved performance.
27
Choosing Programs to Evaluate Performance
Real applications: clearly the right choice.
Porting and eliminating system-dependent activities is required.
User burden: you have to know which of your programs you really care about.
Modified (or scripted) applications: enhance portability or focus on particular aspects of system performance.
Kernels: small, key pieces of real programs.
Best used to isolate the performance of individual features, to explain the reasons for differences in performance of real programs, e.g. testing memory/ALU/branch instructions.
Not real programs, however: no user really uses them.
28
Choosing Programs to Evaluate Performance (Cont.)
Toy benchmarks: quicksort, puzzle. Beginning programming assignments.
Synthetic benchmarks: try to match the average frequency of operations and operands of a large set of programs.
No user really runs them; they are not even pieces of real programs.
They typically reside in cache and don't test memory performance.
At the very least you must understand what the benchmark code is in order to understand what it might be measuring.
Companies thrive or bust on benchmark performance. Hence they optimize for the benchmark.
BEWARE ALWAYS!!
29
Benchmark Suites
SPEC (Standard Performance Evaluation Corporation), http://www.spec.org
Desktop benchmarks
  CPU-intensive: SPEC CPU2000
  Graphics-intensive: SPECviewperf
Server benchmarks
  CPU throughput-oriented: SPECrate
  I/O activity: SPECSFS (NFS), SPECWeb
  Transaction processing: TPC (Transaction Processing Performance Council)
Embedded benchmarks
  EEMBC (EDN Embedded Microprocessor Benchmark Consortium)
30
SPEC Benchmarks (www.spec.org)

| Integer benchmark | Description | FP benchmark | Description |
|---|---|---|---|
| gzip | Compression | wupwise | Quantum chromodynamics |
| vpr | FPGA place & route | swim | Shallow water model |
| gcc | GNU C compiler | mgrid | Multigrid solver in 3D fields |
| mcf | Combinatorial optimization | applu | Parabolic/elliptic PDE |
| crafty | Chess program | mesa | 3D graphics library |
| parser | Word processing program | galgel | Computational fluid dynamics |
| eon | Computer visualization | art | Image recognition (NN) |
| perlbmk | Perl application | equake | Seismic wave propagation simulation |
| gap | Group theory interpreter | facerec | Facial image recognition |
| vortex | Object-oriented database | ammp | Computational chemistry |
| bzip2 | Compression | lucas | Primality testing |
| twolf | Circuit place & route | fma3d | Crash simulation FEM |
| | | sixtrack | Nuclear physics accel |
| | | apsi | Pollutant distribution |
31
Other Performance Metrics Power consumption – especially in the embedded market where battery life is important (and passive cooling) For power-limited applications, the most important metric is energy efficiency
32
Evaluating ISAs
Design-time metrics: Can it be implemented, in how long, at what cost? Can it be programmed? Ease of compilation?
Static metrics: How many bytes does the program occupy in memory?
Dynamic metrics: How many instructions are executed? How many bytes does the processor fetch to execute the program? How many clocks are required per instruction?
Best metric: time to execute the program!
$\text{Execution time} = \text{Inst. Count} \times \text{CPI} \times \text{Cycle Time}$, which depends on the instruction set, the processor organization, and compilation techniques.
33
Other Problems
Let's assume we can get the test jig specified properly.
See the following example. Which is better? By how much? Are the programs equally important?
34
Some Aggregate Job Mix Options
Arithmetic mean: provides a simple average. Does not account for weight; all programs are treated as equal.
Weighted arithmetic mean: the weight is the frequency (%) of use. Better, but beware the dominant program time. The result depends on the reference machine.
35
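Weighted Arithmetic Mean

| | A | B | C | W(1) | W(2) | W(3) |
|---|---|---|---|---|---|---|
| Program P1 (secs) | 1.00 | 10.00 | 20.00 | 0.50 | 0.909 | 0.999 |
| Program P2 (secs) | 1000.00 | 100.00 | 20.00 | 0.50 | 0.091 | 0.001 |
| Arithmetic mean: W(1) | 500.50 | 55.00 | 20.00 | | | |
| Arithmetic mean: W(2) | 91.82 | 18.18 | 20.00 | | | |
| Arithmetic mean: W(3) | 2.00 | 10.09 | 20.00 | | | |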
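The weighted means in this table can be regenerated with a short script. One assumption is needed: the slide only lists rounded weights, so the sketch takes W(2) and W(3) to be inversely proportional to the program times on machines B and A respectively, which reproduces the tabulated values. The helper names are illustrative.

```python
# Execution times in seconds for (P1, P2) on each machine, from the table.
TIMES = {"A": (1.0, 1000.0), "B": (10.0, 100.0), "C": (20.0, 20.0)}

def inverse_time_weights(reference_times):
    """Weights inversely proportional to each program's time on a
    reference machine, normalized to sum to 1 (assumed weighting scheme)."""
    inverses = [1.0 / t for t in reference_times]
    total = sum(inverses)
    return [x / total for x in inverses]

def weighted_mean(times, weights):
    """Weighted arithmetic mean of execution times."""
    return sum(w * t for w, t in zip(weights, times))

WEIGHTS = {
    "W(1)": [0.5, 0.5],                         # equal weights
    "W(2)": inverse_time_weights(TIMES["B"]),   # about 0.909 / 0.091
    "W(3)": inverse_time_weights(TIMES["A"]),   # about 0.999 / 0.001
}

for label, weights in WEIGHTS.items():
    row = "  ".join(f"{m}={weighted_mean(TIMES[m], weights):.2f}" for m in "ABC")
    print(label, row)
```

Note how the ranking of A against B flips between W(1) and W(3): the choice of weights decides the winner.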
36
Normalized Time Metrics
Geometric mean
Has the nice property that ratio of the means = mean of the ratios.
Consistent no matter which machine is the reference.
Better than arithmetic means, but geometric means don't form accurate prediction models: they don't predict execution time.
Still have to remain cautious.
37
Normalized Time Metrics Arithmetic mean should not be used to average normalized execution time
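A small experiment shows why. The sketch below reuses the P1/P2 execution times for machines A, B, and C from the weighted arithmetic mean slide, normalizes them to two different reference machines, and compares the arithmetic and geometric means of the normalized times. Function names are illustrative.

```python
import math

# Execution times in seconds for (P1, P2) on each machine.
TIMES = {"A": (1.0, 1000.0), "B": (10.0, 100.0), "C": (20.0, 20.0)}

def normalized(machine, reference):
    """Per-program times of `machine` divided by those of `reference`."""
    return [t / r for t, r in zip(TIMES[machine], TIMES[reference])]

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    return math.prod(values) ** (1.0 / len(values))

for ref in ("A", "B"):
    print(f"normalized to {ref}:")
    for machine in ("A", "B", "C"):
        norm = normalized(machine, ref)
        print(f"  {machine}: AM={arithmetic_mean(norm):.2f} "
              f"GM={geometric_mean(norm):.2f}")
```

Normalized to A, the arithmetic mean makes B look about 5 times slower than A; normalized to B, it makes A look about 5 times slower than B. The geometric mean rates A and B as equal and C as faster under either reference.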
38
Quantitative Principles of Computer Design
39
Make the Common Case Fast
Need to validate that a case is common or uncommon.
Often common cases are simpler than uncommon cases, e.g. exceptions like overflow, interrupts,...
Truly simple is usually both cheap and fast: best of both worlds.
The trick is to quantify the advantage of a proposed enhancement.
40
Amdahl's Law
Defines the speedup gained from a particular feature. Depends on 2 factors:
Fraction of original computation time that can take advantage of the enhancement, e.g. the commonality of the feature
Level of improvement gained by the feature
Amdahl's law is a quantification of the diminishing-returns principle.
41
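Amdahl's Law (Cont.)
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then
$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[(1 - F) + \dfrac{F}{S}\right]$, so $\text{Speedup} = \dfrac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \dfrac{1}{(1 - F) + F/S}$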
42
Simple Example
Important application: FPSQRT accounts for 20%, FP instructions account for 50%, other instructions 30%.
Designers say it costs the same to speed up: FPSQRT by 40x, FP by 2x, other by 8x.
Which one should you invest in?
Straightforward: plug in the numbers and compare. BUT what's your guess??
Amdahl's Law says nothing about cost.
43
And the Winner Is…?
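Plugging the three options into Amdahl's law settles the question. This sketch assumes the 20% / 50% / 30% figures are fractions of execution time, which is what Amdahl's law requires; the `amdahl_speedup` helper is illustrative.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of the time is accelerated."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# option -> (assumed fraction of execution time, speedup of that part)
OPTIONS = {
    "FPSQRT": (0.20, 40.0),
    "FP": (0.50, 2.0),
    "Other": (0.30, 8.0),
}

results = {name: amdahl_speedup(f, s) for name, (f, s) in OPTIONS.items()}
for name, value in results.items():
    print(f"{name}: overall speedup {value:.3f}")
print("winner:", max(results, key=results.get))
```

Under that assumption the 40x FPSQRT improvement finishes last (about 1.24), while speeding up the other instructions by 8x edges out the FP option (about 1.36 versus 1.33).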
44
Example of Amdahl's Law
Floating point instructions are improved to run twice as fast, but only 10% of the time was spent on these instructions originally. How much faster is the new machine?
$\text{Speedup} = \dfrac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \dfrac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$
$\text{Speedup} = \dfrac{1}{(1 - 0.1) + 0.1/2} = 1.053$
The new machine is 1.053 times as fast, or 5.3% faster.
How much faster would the new machine be if floating point instructions became 100 times faster?
$\text{Speedup} = \dfrac{1}{(1 - 0.1) + 0.1/100} = 1.110$
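The diminishing return is easy to see by sweeping the enhancement factor. A minimal sketch, with an illustrative function name, using the 10% fraction from this example:

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

FRACTION_FP = 0.10  # share of original execution time spent in FP instructions

for factor in (2, 10, 100, 1_000_000):
    print(f"FP {factor}x faster -> overall {amdahl_speedup(FRACTION_FP, factor):.3f}")

# Upper bound as the enhancement factor grows without limit: 1 / (1 - F).
print(f"limit: {1.0 / (1.0 - FRACTION_FP):.3f}")
```

However fast the floating point unit becomes, the overall speedup cannot exceed 1/(1 - 0.1), about 1.111, because 90% of the time is untouched.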
45
Estimating Performance Improvements
Assume a processor currently requires 10 seconds to execute a program and processor performance improves by 50 percent per year.
By what factor does processor performance improve in 5 years? $(1 + 0.5)^5 = 7.59$
How long will it take a processor to execute the program after 5 years? $\text{ExTime}_{new} = 10/7.59 = 1.32$ seconds
46
Performance Example
Computers M1 and M2 are two implementations of the same instruction set. M1 has a clock rate of 50 MHz and M2 has a clock rate of 75 MHz. M1 has a CPI of 2.8 and M2 has a CPI of 3.2 for a given program.
How many times faster is M2 than M1 for this program?
$\dfrac{\text{ExTime}_{M1}}{\text{ExTime}_{M2}} = \dfrac{\text{IC}_{M1} \times \text{CPI}_{M1} / \text{Clock Rate}_{M1}}{\text{IC}_{M2} \times \text{CPI}_{M2} / \text{Clock Rate}_{M2}} = \dfrac{2.8/50}{3.2/75} = 1.31$
What would the clock rate of M1 have to be for them to have the same execution time?
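Both questions can be worked through in a few lines. The sketch relies on the two machines sharing an instruction set, so the instruction count is the same and cancels; the variable and function names are illustrative, and the answer to the second question is derived here rather than taken from the slide.

```python
def time_per_instruction(cpi: float, clock_rate_mhz: float) -> float:
    """Average microseconds per instruction: CPI / clock rate in MHz."""
    return cpi / clock_rate_mhz

M1 = {"cpi": 2.8, "clock_mhz": 50.0}
M2 = {"cpi": 3.2, "clock_mhz": 75.0}

t_m1 = time_per_instruction(M1["cpi"], M1["clock_mhz"])
t_m2 = time_per_instruction(M2["cpi"], M2["clock_mhz"])

# Same instruction count on both machines, so IC cancels in the ratio.
speedup_m2_over_m1 = t_m1 / t_m2
print(f"M2 is {speedup_m2_over_m1:.2f} times faster than M1")

# Equal execution time requires CPI_M1 / f = CPI_M2 / 75 MHz, solved for f.
required_m1_clock_mhz = M1["cpi"] * M2["clock_mhz"] / M2["cpi"]
print(f"M1 would need a clock rate of {required_m1_clock_mhz:.3f} MHz")
```

M1's clock would have to rise by the same 1.3125 factor as the speedup, to 65.625 MHz, since its CPI is unchanged.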
47
Simple Example
Suppose we have made the following measurements:
Frequency of FP operations (other than FPSQR) = 25%
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20
Two design alternatives:
Reduce the CPI of FPSQR to 2
Reduce the average CPI of all FP operations to 2
48
And The Winner is…
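One way to work out the answer is to compare the effective CPI of each alternative. The sketch below makes two assumptions that the slide does not spell out: the 25% FP share with average CPI 4.0 is treated as covering all FP operations (so other instructions are 75%), and instruction count and clock rate are unchanged. Variable names are illustrative.

```python
FREQ_FP = 0.25        # assumed share of all FP operations
CPI_FP = 4.0
FREQ_OTHER = 0.75     # assumed remainder of the instruction mix
CPI_OTHER = 1.33
FREQ_FPSQR = 0.02
CPI_FPSQR = 20.0
NEW_CPI = 2.0         # target CPI in both design alternatives

# Baseline effective CPI.
cpi_original = FREQ_FP * CPI_FP + FREQ_OTHER * CPI_OTHER

# Alternative 1: only FPSQR improves, saving (20 - 2) cycles on 2% of instructions.
cpi_new_fpsqr = cpi_original - FREQ_FPSQR * (CPI_FPSQR - NEW_CPI)

# Alternative 2: every FP operation averages 2 cycles.
cpi_new_fp = FREQ_FP * NEW_CPI + FREQ_OTHER * CPI_OTHER

speedup_fpsqr = cpi_original / cpi_new_fpsqr
speedup_fp = cpi_original / cpi_new_fp

print(f"original CPI {cpi_original:.2f}")
print(f"faster FPSQR: CPI {cpi_new_fpsqr:.2f}, speedup {speedup_fpsqr:.2f}")
print(f"faster FP:    CPI {cpi_new_fp:.2f}, speedup {speedup_fp:.2f}")
```

Under these assumptions, improving all FP operations gives a speedup of about 1.33 versus about 1.22 for FPSQR alone, so the broader FP enhancement is the better choice.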