Topic IV (Cont'd): Performance Measurement
Introduction to Computer Systems Engineering (CPEG 323)
Relative MIPS
Relative MIPS = (Time_reference / Time_unrated) * MIPS_reference
where
Time_reference = execution time of a program on the reference machine
Time_unrated = execution time of the same program on the machine to be rated
MIPS_reference = agreed-upon MIPS rating of the reference machine
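To make the formula concrete, here is a minimal C sketch of the computation. The times are made-up illustrative values; the reference rating follows the classic convention of rating the VAX-11/780 at 1 MIPS.

```c
/* Sketch: a relative MIPS rating from the slide's formula,
 * Relative MIPS = (Time_reference / Time_unrated) * MIPS_reference.
 * All inputs below are illustrative, not measured values. */
#include <stdio.h>

int main(void) {
    double time_reference = 120.0; /* program's run time on the reference machine (s) */
    double time_unrated   = 30.0;  /* same program on the machine being rated (s) */
    double mips_reference = 1.0;   /* agreed-upon rating, e.g. VAX-11/780 = 1 MIPS */

    double relative_mips = (time_reference / time_unrated) * mips_reference;
    printf("Relative MIPS = %.1f\n", relative_mips);
    return 0;
}
```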
Relative MIPS (Cont'd)
Relative MIPS tracks execution time only for the given program and input. Even when the program and input are identified, it becomes harder over time to find the reference machine.
Relative MIPS (Cont'd)
The question also arises whether the older machine should be run with the newest release of the compiler and operating system, or whether the software should be fixed so that the reference machine does not get faster over time.
Relative MIPS (Cont'd)
In summary, the advantage of relative MIPS is questionable:
- Which program and input?
- Which reference machine to use? (its timing changes)
- Which compiler and/or OS to use? (the timing changes)
- Which benchmark to use?
Patterson: the advantage is small
Peak Rates vs. Sustained Rates
The peak rate is the rate which can be attained if every resource involved in the measurement can be used at its maximum rate. For example, a 1 GHz processor can do 1 floating-point addition and 1 floating-point multiplication each clock cycle. Therefore, we can say this processor has a peak rate of 2 GFLOPS. OK for the theory: can we get this in practice?
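As a hedged sketch of the "can we get this in practice?" question, the toy loop below measures a sustained rate for additions only and compares it with the assumed 2 GFLOPS peak from the slide. The loop bound and the peak figure are illustrative, and a real measurement would need a more careful timing harness.

```c
/* Sketch: sustained vs. peak FLOP rate.  The loop does only additions,
 * so the (hypothetical) multiplier sits idle and peak is unreachable. */
#include <stdio.h>
#include <time.h>

#define N 100000000UL          /* number of additions (illustrative) */
#define PEAK_GFLOPS 2.0        /* assumed peak rate from the example */

int main(void) {
    volatile double sum = 0.0; /* volatile keeps the loop from being optimized away */
    clock_t start = clock();
    for (unsigned long i = 0; i < N; i++)
        sum += 1.0;            /* additions only, no multiplications */
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    double gflops = N / secs / 1e9;
    printf("sustained: %.2f GFLOPS (%.0f%% of %.1f GFLOPS peak)\n",
           gflops, 100.0 * gflops / PEAK_GFLOPS, PEAK_GFLOPS);
    return 0;
}
```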
Limitations on Peak Rates
- Other resources may not be able to keep up
- Your program may not be able to use the resources in the manner needed to get the peak performance. Example: your program only does floating-point +, not ×
- The peak rate may not be physically sustainable
Peak Rate Example
The i860 (1991, Intel) was advertised as having an 80 MFLOPS peak rate (1 FP add and 1 FP multiply per cycle of a 40 MHz clock). However, when compiling and running various linear algebra programs, experimenters found the actual rate ranged from 15.0 MFLOPS (19% of peak) down to 3.2 MFLOPS (4% of peak)! What's wrong?
Benchmarks
- Real programs: C compilers, TeX, Spice
- Kernels: Livermore Loops, LINPACK; best for isolating the performance of individual features of the machine
- Toy benchmarks (10 ~ 100 lines): Sieve of Eratosthenes, Puzzle, Quicksort, N-Queens (a toy-benchmark sketch follows below)
- Synthetic benchmarks: try to match the average frequency of operations in a large set of programs
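To show just how small a toy benchmark is, here is a minimal C version of the Sieve of Eratosthenes; the limit is arbitrary.

```c
/* Toy benchmark sketch: Sieve of Eratosthenes in a few dozen lines. */
#include <stdio.h>
#include <string.h>

#define LIMIT 1000000

static char composite[LIMIT + 1];

int main(void) {
    int count = 0;
    memset(composite, 0, sizeof composite);
    for (long i = 2; i <= LIMIT; i++) {
        if (composite[i]) continue;
        count++;                          /* i is prime */
        for (long j = i * i; j <= LIMIT; j += i)
            composite[j] = 1;             /* mark multiples of i */
    }
    printf("%d primes up to %d\n", count, LIMIT);
    return 0;
}
```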
More Benchmarks
- Dhrystone [Weicker84]
- Whetstone [Curnow & Wichmann76]: university computer center jobs, 12 loops
- SPEC benchmarks: SPEC89, SPEC92, SPEC95, SPEC2000
More Benchmarks
- MediaBench
- CommBench
Small Benchmarks and Kernels
- Early benchmarks used "toy problems" (quicksort, Towers of Hanoi)
- Other benchmarks took small fragments of code from inside application loops. One early example was the Livermore Loops (21 code fragments)
Small Benchmarks: Pluses and Minuses
+ Drawn from real applications, so they seem realistic
+ Easy to understand and analyze
+ Highly portable (can even be converted to other languages)
+ Emphasize the "make the common case fast" principle
- Still too much like MIPS if your application is not like theirs
- Not representative if your application is complex and has many different parts
Synthetic Benchmarks
A synthetic benchmark attempts to exercise the hardware in a manner which mimics real-world applications, but in a small piece of code. Examples: Whetstone, Dhrystone. Each repeatedly executes a loop which performs a varied mix of instructions and uses the memory in various ways; the figure of merit is how many "Whetstones" or "Dhrystones" per second your computer can do (see the sketch below).
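The following is a minimal sketch in the Whetstone/Dhrystone spirit: a loop mixing integer arithmetic, floating point, memory accesses, and occasional string handling, reporting iterations per second. The instruction mix is invented for illustration, not the published one from either benchmark.

```c
/* Sketch of a synthetic benchmark: a loop with a varied instruction mix. */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define ITERATIONS 10000000L

int main(void) {
    int a = 1, b = 2;
    double x = 1.0;
    char buf[16] = "";
    int arr[64] = {0};

    clock_t start = clock();
    for (long i = 0; i < ITERATIONS; i++) {
        a = a + b - (int)(i % 7);                /* integer arithmetic */
        x = x * 1.000001 + 0.5;                  /* floating point */
        arr[i & 63] = a;                         /* array/memory access */
        if ((i & 0xFFFFF) == 0)
            snprintf(buf, sizeof buf, "%d", a);  /* occasional string work */
    }
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("%.0f iterations/sec (a=%d, x=%g, buf=%s, arr[0]=%d)\n",
           ITERATIONS / secs, a, x, buf, arr[0]);
    return 0;
}
```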
Synthetic Benchmarks: Pluses and Minuses
+ Seem to be more realistic than kernels
+ Still easy to understand and analyze
+ Still highly portable
- Reliance on a single benchmark skews perceptions
- Nobody can agree on a single benchmark
- Easy to abuse: designers focus on improving that benchmark instead of real applications
Application Benchmarks
OK, we admit it; you can't capture real-world complexity in a few dozen (or hundred) lines of C code! So use some real programs instead. If you're going to buy a machine, you're best off trying the apps you will use. But you may not always be able to do this.
Using Real Applications for Benchmarks
LINPACK (Linear Algebra Package) is used to rank the world's 500 fastest computers (the TOP500 list). Tom's Hardware and some other hardware reviewers run a game (such as Quake) and measure the FPS (frames per second).
Application Benchmarks: Pluses
+ Closer to applications, so more realistic
+ Better at exposing weaknesses and performance bottlenecks in systems
Application Benchmarks: Minuses
- Harder to compare different machines unless you use a common standard (e.g., ANSI C)
- Difficult to determine why a particular program runs fast or slow, due to complexity
- Whose benchmark? (You can always find one benchmark that makes your product look best)
- Takes too long to simulate (a big issue for researchers)
Benchmark Suites - Objectives
Run a bunch of programs and combine the results:
- Get everyone to agree to use the same benchmarks
- Lay down common ground rules
- Develop a method for reporting and disseminating results
- Put caveats everywhere and try to educate the people using the results
- May be targeted toward application domains (e.g., web servers, transaction processing, multimedia, HPC)
The SPEC Benchmarks
- SPEC (System Performance Evaluation Cooperative) was formed to write "standard" benchmark suites with industry acceptance
- Main releases: SPEC89, SPEC92, SPEC95, SPEC2000
- Divided into integer and FP-intensive apps, e.g., SPECfp95, CINT2000, CFP2000, etc.
- Recent domain-specific suites (e.g., SPEC HPC2002, SPECweb99)
- On ECE/CIS machines, type "sysinfo hostname" to see scores
SPEC RATIO: Measuring Latency
Results for each individual benchmark of the SPEC benchmark suites are expressed as the ratio of a fixed "SPEC reference time" to the wall-clock time to execute one single copy of the benchmark. The reference time was chosen as the execution time on a Sun Ultra 5/10 with a 300 MHz processor. (From P&H, 3rd ed., p. 259)
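A minimal sketch of the arithmetic, with made-up times; since each SPEC ratio is reference time divided by measured time, bigger is better, and SPEC summarizes the per-benchmark ratios with a geometric mean.

```c
/* Sketch: SPEC ratios and their geometric mean (compile with -lm).
 * The {reference, measured} time pairs below are hypothetical. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double ref[]  = {1400.0, 1800.0, 1100.0};  /* reference machine times (s) */
    double meas[] = { 700.0,  450.0,  550.0};  /* measured times (s) */
    int n = sizeof ref / sizeof ref[0];

    double log_sum = 0.0;
    for (int i = 0; i < n; i++) {
        double ratio = ref[i] / meas[i];       /* bigger is better */
        log_sum += log(ratio);
        printf("benchmark %d: SPEC ratio = %.2f\n", i + 1, ratio);
    }
    printf("geometric mean = %.2f\n", exp(log_sum / n));
    return 0;
}
```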
SPEC RATE: Measuring Throughput
Several copies of a given SPEC benchmark are executed. The method is particularly suitable for multiprocessor systems. The results, called the SPEC rate, express how many jobs of a particular type (characterised by the individual benchmark) can be executed in a given time. (The SPEC reference time happens to be a week, and the execution times are normalized with respect to a VAX 11/780.)
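A hedged sketch of the idea on made-up numbers: run several copies in parallel and ask how many jobs of that type would complete in the one-week reference window. The exact normalization SPEC applied (to the VAX 11/780) varies across suite versions and is omitted here.

```c
/* Sketch of the SPEC rate (throughput) framing: jobs completed per
 * reference week.  All numbers are illustrative, and the VAX 11/780
 * normalization is deliberately left out. */
#include <stdio.h>

int main(void) {
    int    copies      = 8;       /* copies run in parallel, e.g. one per CPU */
    double elapsed_sec = 3600.0;  /* wall-clock time until all copies finish */
    double week_sec    = 7.0 * 24.0 * 3600.0;

    /* jobs of this type completed per reference week, assuming steady state */
    double jobs_per_week = copies * week_sec / elapsed_sec;
    printf("throughput = %.0f jobs/week\n", jobs_per_week);
    return 0;
}
```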
SPEC Ground Rules and Reproducibility
- Everyone uses the same code; modifying the code is not allowed
- Everyone also uses the same data inputs!
- You must describe the configuration, including the compiler
- If you report numbers, the system must be commercially available
- Everything is compiled the same way, with "standard" optimizations; a separate score is allowed for program-specific tuning
The SPEC CINT2000 and CFP2000 ratings for the Intel Pentium III and Pentium IV processors at different clock speeds. Note: this chart is for the "base case"; more detailed results are available from SPEC.
SPEC Examples (Integer Benchmarks)
[Chart: SPECint ratings (1-10) versus clock rate (50-250 MHz) for the Pentium and Pentium Pro. From Patterson and Hennessy, p. 73. Copyright 1998 Morgan Kaufmann Publishers, Inc. All rights reserved.]
SPEC Examples (FP Benchmarks)
[Chart: SPECfp ratings (1-10) versus clock rate (50-250 MHz) for the Pentium and Pentium Pro. From Patterson and Hennessy, p. 74. Copyright 1998 Morgan Kaufmann Publishers, Inc. All rights reserved.]
Tuning for SPEC
[Chart: SPEC performance ratio (100-800) for gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, and tomcatv, compiled with the standard compiler versus an enhanced compiler. From Patterson and Hennessy, p. 68. Copyright 1998 Morgan Kaufmann Publishers, Inc. All rights reserved.]
Summary of Performance Measurement
Latency: How long does it take to get a particular task done?
Throughput: How many tasks can you perform in a unit of time?
Performance = 1 / Execution time
Execution time (wall-clock time) = user time + system time + other time
Summary of Performance Measurement (Cont'd)
Clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
CPI = cycles per instruction (smaller is better)
IPC = instructions per cycle (bigger is better)
CPU time = (Instruction count * CPI) / Clock rate
Weighted CPI: CPU time = (Σ_{i=1}^{n} CPI_i * I_i) / Clock rate, where I_i is the count of instructions of class i
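A short sketch of both formulas on made-up numbers: the simple model uses one average CPI, while the weighted model sums cycles over per-class instruction counts.

```c
/* Sketch: CPU time from instruction count, CPI, and clock rate, plus
 * the weighted-CPI version.  All inputs are illustrative. */
#include <stdio.h>

int main(void) {
    double clock_rate = 2.0e9;                   /* 2 GHz */

    /* simple model: one average CPI */
    double instructions = 5.0e9;
    double cpi = 1.5;
    printf("CPU time = %.2f s\n", instructions * cpi / clock_rate);

    /* weighted model: CPU time = (sum of CPI_i * I_i) / clock rate */
    double class_cpi[]   = {1.0, 2.0, 4.0};      /* e.g. ALU, load/store, branch */
    double class_count[] = {3.0e9, 1.5e9, 0.5e9};
    double cycles = 0.0;
    for (int i = 0; i < 3; i++)
        cycles += class_cpi[i] * class_count[i];
    printf("weighted CPU time = %.2f s\n", cycles / clock_rate);
    return 0;
}
```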
Summary of Performance Measurement (Cont'd)
MIPS (Millions of Instructions Per Second)
MOPS (Millions of Operations Per Second)
MFLOPS (Millions of Floating-point Operations Per Second)
MIPS = Instruction count / (Execution time * 10^6) = Clock rate / (CPI * 10^6)
Benchmarks: SPEC ratio and SPEC rate
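The two MIPS formulas above are equivalent, since Execution time = Instruction count * CPI / Clock rate. A minimal sketch on made-up numbers shows they agree:

```c
/* Sketch: evaluating both MIPS formulas on illustrative inputs. */
#include <stdio.h>

int main(void) {
    double instructions = 5.0e9;
    double cpi = 1.5;
    double clock_rate = 2.0e9;                           /* 2 GHz */
    double exec_time = instructions * cpi / clock_rate;  /* seconds */

    double mips_from_time = instructions / (exec_time * 1e6);
    double mips_from_cpi  = clock_rate / (cpi * 1e6);
    printf("MIPS = %.1f (from time) = %.1f (from CPI)\n",
           mips_from_time, mips_from_cpi);
    return 0;
}
```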