Performance of multiprocessing systems: Benchmarks and performance counters Miodrag Bolic ELG7187 Topics in Computers: Multiprocessor Systems on Chip.



Outline
– Benchmarks
– Measurements and monitoring
– Performance counters

Types of benchmarks [1]
Synthetic benchmarks: small artificial programs containing a mixture of statements selected to be representative of a large class of real applications.
Kernel benchmarks: small but relevant parts of real applications, which typically account for a large portion of the execution time of those applications.
Real application benchmarks

Benchmarks: challenges
Challenges in developing benchmarks:
– Testing a whole system: CPU, cache, main memory, compilers
– Selecting a suitable set of applications
– Making benchmarks portable (ANSI C: How big is a long? How big is a pointer? Does this platform implement calloc? Is it little endian or big endian?)
Fixed workload benchmarks measure how fast the workload was completed.
– EEMBC MPEG-x benchmark: time to process the entire video
Throughput benchmarks measure how many workload units were completed per unit time.
– EEMBC MPEG-x benchmark: number of frames processed in a fixed amount of time
The base metrics: the same compiler flags must be used in the same order for all benchmarks.
The peak metrics: different compiler options may be used on each benchmark.
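The difference between a fixed workload benchmark and a throughput benchmark can be sketched as follows. This is only an illustration: `work_unit` is a hypothetical stand-in for one unit of work (for example, decoding one frame), and the function names are not taken from EEMBC or any other suite.

```python
import time


def work_unit(n=10_000):
    """Hypothetical workload unit (stands in for, e.g., decoding one frame)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fixed_workload_benchmark(units):
    """Fixed workload: run a set number of units and report elapsed seconds."""
    start = time.perf_counter()
    for _ in range(units):
        work_unit()
    return time.perf_counter() - start


def throughput_benchmark(duration_s):
    """Throughput: run for a fixed time and report units completed per second."""
    start = time.perf_counter()
    completed = 0
    while time.perf_counter() - start < duration_s:
        work_unit()
        completed += 1
    elapsed = time.perf_counter() - start
    return completed / elapsed


if __name__ == "__main__":
    print(f"fixed workload: {fixed_workload_benchmark(100):.4f} s for 100 units")
    print(f"throughput: {throughput_benchmark(0.2):.1f} units/s")
```

The first function fixes the amount of work and measures time; the second fixes the time and measures the amount of work.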

Available benchmarks [2]
– SPEC CPU (general purpose)
– MediaBench (media)
– BioPerf (bioinformatics)
– PARSEC (multi-threaded workloads on multicore processors)
– DaCapo (to evaluate Java workloads)
– STAMP (to evaluate transactional memory)

SPEC
(1) Each of the programs is executed three times on the computer system U to be tested. For each program A_i, an execution time T_U(A_i) in seconds is determined by taking the median of the three measured execution times.
(2) For each program, the execution time T_U(A_i) determined in step (1) is normalized with respect to the reference computer R by dividing the execution time T_R(A_i) on R by the execution time T_U(A_i) on U. This yields an execution factor F_U(A_i) = T_R(A_i) / T_U(A_i).
– R is a Sun Ultra Enterprise 2 with a 296 MHz UltraSPARC II processor.
(3) SPECint2006 is computed as the geometric mean of the execution factors of the 12 SPEC integer programs.
Geometric mean:
– The comparison between two machines is independent of the choice of the reference computer.
– It does not provide information about the actual execution time of the programs.
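The scoring procedure above can be sketched in a few lines. The program names and all timings below are made-up illustration data, not SPEC results, and only three programs are used instead of twelve.

```python
import math
import statistics


def execution_factor(ref_time, runs):
    """F_U(A_i) = T_R(A_i) / T_U(A_i), with T_U the median of the measured runs."""
    return ref_time / statistics.median(runs)


def geometric_mean(values):
    """n-th root of the product, computed via logarithms to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical data: reference times and three measured runs per program (seconds).
reference = {"prog_a": 9000.0, "prog_b": 8000.0, "prog_c": 12000.0}
measured = {
    "prog_a": [440.0, 450.0, 460.0],
    "prog_b": [1010.0, 1000.0, 990.0],
    "prog_c": [610.0, 600.0, 590.0],
}

factors = [execution_factor(reference[p], measured[p]) for p in reference]
score = geometric_mean(factors)
print(f"execution factors: {factors}")
print(f"geometric mean score: {score:.2f}")
```

Because the score is a mean of ratios, it says how much faster U is than R on these programs, but not how long any program actually ran.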

Measurement [3]
Measurement is based on direct measurement of the system under study using a software and/or hardware monitor.
A monitor performs three tasks:
– data acquisition,
– data analysis,
– result output.
An event is a change in the system state.
– Examples are process context switching, the beginning of a seek on a disk, and the arrival of a packet.
A trace is a log of events.
– It includes the time of the event, the type of event, etc.

Activating a monitor [3]
Tracing (event-driven monitor): when an event occurs, the monitor is activated to capture data about the state of the system. This gives a complete trace of the executing program.
Sampling: the monitor is activated by clock interrupts.
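The two activation styles can be contrasted on a toy model. The sketch below assumes a hypothetical timeline of which task occupies the CPU at each clock tick; the tracing monitor logs every state change, while the sampling monitor looks at the state only every few ticks.

```python
from collections import Counter

# Hypothetical system state over 12 clock ticks: which task occupies the CPU.
timeline = ["A", "A", "A", "B", "B", "A", "A", "C", "C", "C", "C", "A"]


def trace_monitor(states):
    """Event-driven: record (time, event type, detail) at every state change."""
    trace = []
    previous = None
    for t, state in enumerate(states):
        if state != previous:
            trace.append((t, "context_switch", state))
            previous = state
    return trace


def sampling_monitor(states, period):
    """Clock-driven: record the state at every `period`-th tick only."""
    return Counter(states[t] for t in range(0, len(states), period))


trace = trace_monitor(timeline)
samples = sampling_monitor(timeline, period=3)
print(trace)    # every switch, with its timestamp
print(samples)  # how often each task was seen at a sample point
```

The trace captures every switch in this toy timeline, whereas the samples only estimate how time is distributed and can miss short-lived states between sample points.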

Monitoring parallel software [3]
– Instrumentation perturbations
– Measuring degree of parallelism
– Detecting phases in execution profiles

Performance counters
Time-based profiles show where your software spends its time.
Hardware performance measurements show what the processor is doing and how effectively the processor is being utilized.
Hardware measurements also pinpoint particular reasons why the CPU is stalling rather than accomplishing useful work.

Advantages [4]
– The application and operating system remain largely unmodified, apart from the addition of drivers in the operating system to enable access to the hardware performance counters.
– Measuring real execution, rather than a simulation of the application, operating system, or processor, ensures the accuracy of the collected event counts.
– Performance-monitoring hardware collects data on the fly as the application executes, allowing full-speed data collection and avoiding the slowness of simulation-based approaches.
– This approach can collect data for both the application and the operating system.

Performance monitoring [4]
Performance events can be grouped into:
– program characterization,
– memory accesses,
– pipeline stalls,
– branch prediction,
– resource utilization.
Performance-monitoring hardware has two components:
– performance event detectors,
– event counters.

MIPS R10000 [5]
Counter control fields (labels recovered from the slide figure):
– Count enable bits for User, Supervisor, Kernel, and/or Exception level mode. Any combination of count enable bits may be asserted.
– Event select
– IP[7] interrupt enable

MIPS R10000 [5]

Intel’s solution
– Hardware performance counters are defined outside the "architectural" register set, and they are not saved and restored on process context switches.
– The measurements are therefore attached to the processor, and not to a process or thread.
– It is possible to separate user code from system code according to the privilege level.
– The Intel Pentium-series processors include a 64-bit cycle counter and two 40-bit event counters, with a list of events and additional semantics that depend on the particular processor.
– The AMD Athlon processor has a 64-bit cycle counter and four 48-bit event counters.
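One practical consequence of a 40-bit event counter is that it can wrap around between two reads. The sketch below shows the usual modular-difference approach, under the assumptions that software sees the raw counter value, that the counter wraps silently to zero, and that at most one wraparound occurs between reads. The raw values are invented for illustration, and real processors may instead signal overflow with an interrupt.

```python
COUNTER_BITS = 40                      # width of the event counters mentioned above
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def counter_delta(before, after, mask=COUNTER_MASK):
    """Events between two raw reads, assuming at most one wraparound."""
    return (after - before) & mask


# Hypothetical raw reads taken around a region of interest.
start = COUNTER_MASK - 99    # counter is 100 events away from wrapping to 0
end = 400                    # counter wrapped and kept counting
print(counter_delta(start, end))
```

Because the counters are not saved and restored on context switches, such a delta covers everything that ran on the processor between the two reads, not only the thread of interest.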

Using performance counters [4]
Scheduling
– A single per-core metric (such as IPC or cache miss rate) is not sufficient to categorize application behavior.
– Different thread types often have highly varying characteristics.
– Threads behave differently based on what thread was scheduled beforehand.
Tuning memory access
Communication pattern
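The point about single metrics can be illustrated by deriving IPC and cache miss rate from raw counts. All counts and thread names below are invented for illustration and do not come from any real processor; the event names in the dictionaries are likewise hypothetical.

```python
def derived_metrics(counts):
    """Turn raw event counts into the ratios commonly used for scheduling."""
    return {
        "ipc": counts["instructions"] / counts["cycles"],
        "miss_rate": counts["cache_misses"] / counts["cache_accesses"],
    }


# Hypothetical per-thread event counts collected over the same interval.
threads = {
    "t0": {"instructions": 8_000_000, "cycles": 10_000_000,
           "cache_misses": 1_000, "cache_accesses": 1_000_000},
    "t1": {"instructions": 8_000_000, "cycles": 10_000_000,
           "cache_misses": 200_000, "cache_accesses": 1_000_000},
}

metrics = {name: derived_metrics(c) for name, c in threads.items()}
for name, m in metrics.items():
    print(f"{name}: IPC={m['ipc']:.2f}, miss rate={m['miss_rate']:.3f}")
```

In this made-up data both threads show the same IPC, yet their cache miss rates differ by a factor of 200, so a scheduler looking at IPC alone would treat them as identical.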

Problem with perf. Counters [6]

Advanced performance counters [6]

Software [4]
The Performance Application Programming Interface (PAPI) tool
– provides a common interface to performance-monitoring hardware for many different processors, including Alpha, Athlon, Cray, Itanium, MIPS, Pentium, PowerPC, and UltraSparc,
– can initiate, reset, and read counters.
Intel’s VTune Performance Analyzer
– supports all Intel Pentium and Itanium processors,
– provides additional performance analysis tools such as call graph profiling and processor-specific tuning advice.

Other approaches for collecting processor performance data [4]
Software monitoring
– Modify code to collect data.
– Requires the source code to be available and the ability to rebuild the application.
Simulators

References
1. Thomas Rauber, Gudula Rünger, Parallel Programming: For Multicore and Cluster Systems, Springer, 2010 (Chapter 4).
2. Lieven Eeckhout, Computer Architecture Performance Evaluation Methods, Synthesis Lectures on Computer Architecture, June
3. Lei Hu and Ian Gorton, Performance Evaluation for Parallel Systems: A Survey, University of NSW, Australia, UNSW-CSE-TR-9707, October
4. B. Sprunt, The Basics of Performance Monitoring Hardware, IEEE Micro, July-August, pages 64-71,
5. MIPS Technologies, MIPS R10000 Microprocessor User’s Manual, Ver 2.0, http://techpubs.sgi.com/library/manuals/2000/ /pdf/ pdf
6. V. Salapura et al., “Next Generation Performance Counters: Towards Monitoring over thousand concurrent events,” IBM Research Report, RC24351, 2007

Additional material covered in the lecture
1. Geometric mean computation [1]
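As a brief worked note on the geometric mean, written with the execution factors F_U(A_i) = T_R(A_i) / T_U(A_i) defined on the SPEC slide: the symbols G_U and G_V and the second machine V are introduced here for illustration only.

```latex
% Geometric mean of the execution factors of machine U over n programs
G_U = \left( \prod_{i=1}^{n} F_U(A_i) \right)^{1/n}
    = \left( \prod_{i=1}^{n} \frac{T_R(A_i)}{T_U(A_i)} \right)^{1/n}

% Comparing two machines U and V: the reference times T_R(A_i) cancel
\frac{G_U}{G_V}
    = \left( \prod_{i=1}^{n} \frac{T_R(A_i)/T_U(A_i)}{T_R(A_i)/T_V(A_i)} \right)^{1/n}
    = \left( \prod_{i=1}^{n} \frac{T_V(A_i)}{T_U(A_i)} \right)^{1/n}
```

The final expression contains no T_R term, which is why the comparison between two machines does not depend on the choice of the reference computer.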