Performance of multiprocessing systems: Benchmarks and performance counters Miodrag Bolic ELG7187 Topics in Computers: Multiprocessor Systems on Chip.


1 Performance of multiprocessing systems: Benchmarks and performance counters Miodrag Bolic ELG7187 Topics in Computers: Multiprocessor Systems on Chip

2 Outline Benchmarks Measurements and monitoring Performance counters

3 Types of benchmarks [1] Synthetic benchmarks – small artificial programs containing a mixture of statements selected to be representative of a large class of real applications. Kernel benchmarks – small but relevant parts of real applications that typically account for a large portion of the execution time of those applications. Real application benchmarks

4 Benchmarks: challenges Challenges in developing benchmarks: – Testing a whole system: CPU, cache, main memory, compilers – Selecting a suitable set of applications – Making benchmarks portable (ANSI C: How big is a long? How big is a pointer? Does this platform implement calloc? Is it little endian or big endian?) Fixed workload benchmarks – how fast the workload was completed: – EEMBC MPEG-x benchmark – time to process the entire video Throughput benchmarks – how many workload units were completed per unit time: – EEMBC MPEG-x benchmark – number of frames processed in a fixed amount of time The base metrics – the same compiler flags must be used in the same order for all benchmarks. The peak metrics – different compiler options may be used for each benchmark.
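
The two measurement styles on this slide can be sketched as follows. This is a minimal illustration only: `process_frame` is a hypothetical stand-in for decoding one video frame, and the frame count and time budget are assumed values, not EEMBC parameters.

```python
import time

def process_frame(frame):
    """Hypothetical workload unit standing in for decoding one frame."""
    return sum(i * i for i in range(2000)) + frame

def fixed_workload(num_frames):
    """Fixed-workload benchmark: time needed to finish the whole workload."""
    start = time.perf_counter()
    for frame in range(num_frames):
        process_frame(frame)
    return time.perf_counter() - start

def throughput(time_budget_s):
    """Throughput benchmark: workload units completed within a fixed time."""
    start = time.perf_counter()
    frames = 0
    while time.perf_counter() - start < time_budget_s:
        process_frame(frames)
        frames += 1
    return frames

elapsed = fixed_workload(num_frames=50)     # seconds for 50 frames
frames_done = throughput(time_budget_s=0.05)  # frames finished in 50 ms
print(f"fixed workload: {elapsed:.4f} s, throughput: {frames_done} frames")
```

The same workload unit is used in both functions; only the quantity held fixed (amount of work or amount of time) changes.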

5 Available Benchmarks [2] SPEC CPU (general purpose), MediaBench (media), BioPerf (bioinformatics), PARSEC (multi-threaded workloads on multicore processors), DaCapo (Java workloads), STAMP (transactional memory)

6 SPEC (1) Each of the programs is executed three times on the computer system U to be tested. For each program A_i, a representative execution time T_U(A_i) in seconds is determined by taking the median of the three measured execution times. (2) For each program, the execution time T_U(A_i) determined in step (1) is normalized with respect to the reference computer R by dividing the execution time T_R(A_i) on R by the execution time T_U(A_i) on U. This yields an execution factor F_U(A_i) = T_R(A_i) / T_U(A_i). – R is a Sun Ultra Enterprise 2 with a 296 MHz UltraSPARC II processor. (3) SPECint2006 is computed as the geometric mean of the execution factors of the 12 SPEC integer programs. Geometric mean: – the comparison between two machines is independent of the choice of the reference computer; – it does not provide information about the actual execution time of the programs.
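
The scoring procedure above can be sketched in a few lines. The program names and timings below are made-up illustration values (three programs instead of the 12 SPEC integer programs), and `execution_factor` and `spec_score` are hypothetical helper names.

```python
from statistics import geometric_mean, median

def execution_factor(ref_time, run_times):
    """F_U(A_i) = T_R(A_i) / T_U(A_i), with T_U the median of the runs."""
    t_u = median(run_times)
    return ref_time / t_u

def spec_score(ref_times, measured_runs):
    """Geometric mean of the execution factors over all programs."""
    factors = [
        execution_factor(ref_times[name], measured_runs[name])
        for name in ref_times
    ]
    return geometric_mean(factors)

# Assumed reference times T_R(A_i) in seconds on machine R.
REF = {"prog_a": 100.0, "prog_b": 400.0, "prog_c": 900.0}
# Assumed three measured runs per program on machine U.
RUNS = {
    "prog_a": [52.0, 50.0, 49.0],
    "prog_b": [100.0, 101.0, 98.0],
    "prog_c": [310.0, 300.0, 295.0],
}

score = spec_score(REF, RUNS)
print(f"score = {score:.3f}")
```

With these numbers the medians are 50, 100, and 300 seconds, so the execution factors are 2, 4, and 3, and the score is the cube root of 24.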

7 Measurement [3] Measurement is based on direct measurements of the system under study using a software and/or hardware monitor. A monitor performs three tasks: – data acquisition, – data analysis, – result output. An event is a change in the system state. – Examples are a process context switch, the beginning of a seek on a disk, and the arrival of a packet. A trace is a log of events – it includes the time of the event, the type of event, etc.

8 Activating a monitor [3] Tracing – event-driven monitor: when an event occurs, the monitor is activated to capture data about the state of the system. This gives a complete trace of the executing program. Sampling – the monitor is activated by clock interrupts.
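
The difference between the two activation modes can be illustrated on a simulated event stream. This is a sketch under stated assumptions: the events, timestamps, and the 10 ms sampling period are invented, and the sampler is modeled as recording only the most recent event type at each clock tick.

```python
# Hypothetical event stream: (timestamp in ms, event type).
EVENTS = [
    (1, "context_switch"),
    (4, "disk_seek_begin"),
    (5, "packet_arrival"),
    (12, "context_switch"),
    (13, "packet_arrival"),
    (27, "context_switch"),
]

def trace(events):
    """Event-driven monitor: activated on every event, so the log is complete."""
    return list(events)

def sample(events, period, end_time):
    """Sampling monitor: activated by a clock tick every `period` ms.

    At each tick it records the most recent event type seen so far,
    so events between ticks can be missed.
    """
    samples = []
    idx = 0
    last = None
    for tick in range(period, end_time + 1, period):
        while idx < len(events) and events[idx][0] <= tick:
            last = events[idx][1]
            idx += 1
        samples.append((tick, last))
    return samples

full_trace = trace(EVENTS)
sampled = sample(EVENTS, period=10, end_time=30)
print(len(full_trace), "traced events;", len(sampled), "samples:", sampled)
```

The trace holds all six events, while the three samples never observe the disk seek. That is the usual trade-off: sampling costs less per unit time but gives a statistical rather than complete picture.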

9 Monitoring parallel software [3] Instrumentation perturbations Measuring degree of parallelism Detecting phases in execution profiles

10 Performance counters Time-based profiles – where your software spends its time. Hardware performance measurements – what the processor is doing and how effectively the processor is being utilized. Hardware measurements also pinpoint particular reasons why the CPU is stalling rather than accomplishing useful work. http://perfsuite.ncsa.uiuc.edu/publications/LJ135/t1.html

11 Advantages [4] The application and operating system remain largely unmodified, apart from the addition of drivers in the operating system to enable access to the hardware performance counters. Not using a simulation of the application, operating system, or processor ensures the accuracy of the collected event counts. Performance-monitoring hardware collects data on the fly as the application executes, allowing full-speed data collection and avoiding the slowness of simulation-based approaches. This approach can collect data for both the application and the operating system.

12 Performance monitoring [4] Performance events can be grouped into: – program characterization, – memory accesses, – pipeline stalls, – branch prediction, – resource utilization. Performance-monitoring hardware has two components: – performance event detectors – event counters.
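
Raw event counts are usually combined into ratios before they are interpreted. The sketch below assumes a handful of hypothetical event names and invented counts; real event names and availability depend on the processor.

```python
# Hypothetical raw counts read from event counters after a run.
counts = {
    "cycles": 2_000_000,
    "instructions": 1_500_000,        # program characterization
    "l1d_accesses": 400_000,          # memory accesses
    "l1d_misses": 20_000,
    "branches": 300_000,              # branch prediction
    "branch_mispredictions": 15_000,
}

def ratio(numerator, denominator):
    """Safe ratio that returns 0.0 when the denominator event never fired."""
    return numerator / denominator if denominator else 0.0

ipc = ratio(counts["instructions"], counts["cycles"])
l1d_miss_rate = ratio(counts["l1d_misses"], counts["l1d_accesses"])
mispredict_rate = ratio(counts["branch_mispredictions"], counts["branches"])

print(f"IPC = {ipc:.2f}")
print(f"L1D miss rate = {l1d_miss_rate:.1%}")
print(f"branch misprediction rate = {mispredict_rate:.1%}")
```

Because the hardware has only a few counters, collecting all of these events may take several runs with different event selections.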

13 MIPS R10000 [5] Counting can be enabled in User, Supervisor, Kernel, and/or Exception level mode; any combination of count enable bits may be asserted. Other control fields: event select, IP[7] interrupt enable.

14 MIPS R10000 [5]

15 Intel’s solution Hardware performance counters are defined outside the "architectural" register set, and they are not saved and restored on process context switches. The measurements are therefore attached to the processor, not to a process or thread. It is possible to separate user code from system code according to the privilege level. The Intel Pentium-series processors include a 64-bit cycle counter and two 40-bit event counters, with a list of events and additional semantics that depend on the particular processor. The AMD Athlon processor has a 64-bit cycle counter and four 48-bit event counters.
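
One practical consequence of narrow event counters is wraparound: a 40-bit counter returns to zero after 2^40 events. A minimal sketch of computing the number of events between two reads, assuming the counter wrapped at most once between them (`counter_delta` is a hypothetical helper, not part of any vendor API):

```python
COUNTER_BITS = 40  # width of the Pentium-series event counters on this slide

def counter_delta(start, end, bits=COUNTER_BITS):
    """Events between two reads of a `bits`-wide counter.

    Modular subtraction gives the right answer even if the counter
    wrapped once between the reads; multiple wraps cannot be detected.
    """
    return (end - start) & ((1 << bits) - 1)

max_value = (1 << COUNTER_BITS) - 1
print(counter_delta(100, 250))            # no wrap
print(counter_delta(max_value - 5, 10))   # counter wrapped past zero
```

Reading the counter often enough that it cannot wrap more than once between reads keeps the result exact. The same function covers the 48-bit Athlon counters by passing `bits=48`.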

16 Using performance counters [4] Scheduling – A single per-core metric (such as IPC or cache miss rate) is not sufficient to categorize application behavior. – Different thread types often have highly varying characteristics. – Threads behave differently depending on which thread was scheduled beforehand. Tuning memory access. Communication patterns.

17 Problem with perf. Counters [6]

18 Advanced performance counters [6]

19 Software [4] The Performance Application Programming Interface (PAPI) tool – provides a common interface to performance-monitoring hardware for many different processors, including Alpha, Athlon, Cray, Itanium, MIPS, Pentium, PowerPC, and UltraSparc, – can initiate, reset, and read counters. Intel’s VTune Performance Analyzer – supports all Intel Pentium and Itanium processors, – provides additional performance analysis tools such as call graph profiling and processor-specific tuning advice.

20 Other approaches for collecting processor performance data [4] Software monitoring – Modify the code to collect data – Requires access to the source code and the ability to rebuild the application. Simulators
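
Software monitoring by modifying the code can be as simple as wrapping the functions of interest. The sketch below is one possible way to do it, not the method from [4]; `instrumented`, `stats`, and `dot` are hypothetical names.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-function call counts and accumulated wall-clock time.
stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})

def instrumented(func):
    """Wrap `func` so every call is counted and timed."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            entry = stats[func.__name__]
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start
    return wrapper

@instrumented
def dot(a, b):
    """Example function under measurement."""
    return sum(x * y for x, y in zip(a, b))

result = dot([1, 2, 3], [4, 5, 6])
print(result, dict(stats))
```

The timing calls themselves take time, so the instrumented program no longer behaves exactly like the original. This is the instrumentation perturbation mentioned earlier in the lecture.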

21 References 1. Thomas Rauber, Gudula Rünger, Parallel Programming: For Multicore and Cluster Systems, Springer, 2010 (Chapter 4). 2. Lieven Eeckhout, Computer Architecture Performance Evaluation Methods, Synthesis Lectures on Computer Architecture, June 2010. 3. Lei Hu and Ian Gorton, Performance Evaluation for Parallel Systems: A Survey, University of NSW, Australia, UNSW-CSE-TR-9707, October 1997. 4. B. Sprunt, The Basics of Performance Monitoring Hardware, IEEE Micro, July-August 2002, pp. 64-71. 5. MIPS Technologies, MIPS R10000 Microprocessor User’s Manual, Ver. 2.0, 1996. http://techpubs.sgi.com/library/manuals/2000/007-2490-001/pdf/007-2490-001.pdf 6. V. Salapura et al., “Next Generation Performance Counters: Towards Monitoring over Thousand Concurrent Events,” IBM Research Report RC24351, 2007.

22 Additional material covered in the lecture 1. Geometric mean computation [1]
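
As a worked form of the geometric mean computation, using the execution factors from the SPEC slide: the symbols G_U (score of machine U), V (a second machine under test), and n (number of programs) are notation introduced here for illustration.

```latex
\[
G_U = \Bigl(\prod_{i=1}^{n} F_U(A_i)\Bigr)^{1/n}
    = \Bigl(\prod_{i=1}^{n} \frac{T_R(A_i)}{T_U(A_i)}\Bigr)^{1/n}
\]
\[
\frac{G_U}{G_V}
  = \Bigl(\prod_{i=1}^{n} \frac{T_R(A_i)/T_U(A_i)}{T_R(A_i)/T_V(A_i)}\Bigr)^{1/n}
  = \Bigl(\prod_{i=1}^{n} \frac{T_V(A_i)}{T_U(A_i)}\Bigr)^{1/n}
\]
```

The reference times T_R(A_i) cancel in the ratio, which is why the comparison between two machines does not depend on the choice of the reference computer. The ratio also shows the limitation: only relative times survive, so the actual execution times cannot be recovered from the scores.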

