Performance Analysis of Multiprocessor Architectures


Performance Analysis of Multiprocessor Architectures
CEG 4131 Computer Architecture III
Miodrag Bolic

Plan for today
- Speedup
- Efficiency
- Scalability
- Parallelism profile in programs
- Benchmarks

Terminology
What is this? T = Ic * CPI * Tclk
- T is the CPU execution time, Ic is the number of instructions in the program, CPI is the average number of clock cycles per instruction, and Tclk is the clock period.
- For a given instruction set we can compute an average CPI.
What is this? MIPSrate = Ic / (T * 10^6)
- The MIPS rate is based on the instruction mix, which makes it a very poor basis for performance comparisons because of instruction set differences.
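As a sanity check, the two formulas above can be sketched in a few lines of code. The instruction count, CPI, and clock period below are hypothetical example values, not figures from the slide.

```python
# Sketch of the CPU performance equation T = Ic * CPI * Tclk and the
# MIPS rate MIPSrate = Ic / (T * 10^6). All input numbers are hypothetical.

def execution_time(ic, cpi, t_clk):
    """T = Ic * CPI * Tclk, in seconds."""
    return ic * cpi * t_clk

def mips_rate(ic, t):
    """MIPSrate = Ic / (T * 10^6)."""
    return ic / (t * 1e6)

ic = 2_000_000   # instructions in the program (hypothetical)
cpi = 2.0        # average cycles per instruction (hypothetical)
t_clk = 1e-9     # 1 ns clock period, i.e. a 1 GHz clock

t = execution_time(ic, cpi, t_clk)  # about 0.004 s
print(t, mips_rate(ic, t))          # about 500 MIPS: 1 GHz / 2 CPI
```

Note that a higher MIPS rate does not imply a faster machine if the two machines need different instruction counts for the same program, which is exactly the slide's caveat.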

Speedup
Speedup is the ratio of the execution time of the best possible serial algorithm on a single processor, T(1), to the parallel execution time of the chosen algorithm on an n-processor parallel system, T(n):
S(n) = T(1)/T(n)
Speedup measures the absolute merit of a parallel algorithm with respect to the "optimal" sequential version.

Amdahl's Law [2]
β = the probability that the system operates in pure sequential mode
1 - β = the probability that the system operates in fully parallel mode using n processors
The system is either used in pure sequential mode or in parallel mode using n processors, so:
T(n) = β*T(1) + (1 - β)*T(1)/n
S = T(1)/T(n) = n / (β*n + (1 - β))
As n -> infinity, S -> 1/β: the best speedup one can expect is bounded by 1/β.
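The bound 1/β is easy to see numerically. A minimal sketch of the speedup formula, assuming a hypothetical sequential fraction of 10%:

```python
# Amdahl's law as stated on the slide: S(n) = n / (beta*n + (1 - beta)),
# where beta is the purely sequential fraction. beta = 0.1 is hypothetical.

def amdahl_speedup(beta, n):
    """Speedup on n processors when a fraction beta of the work is sequential."""
    return n / (beta * n + (1.0 - beta))

beta = 0.1
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(beta, n), 2))
# As n grows, S approaches but never reaches 1/beta = 10.
```

Even with 90% of the work perfectly parallelizable, a thousand processors yield less than a 10x speedup, which is the point of the slide.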

Efficiency
The system efficiency for an n-processor system:
E(n) = S(n)/n = T(1)/(n*T(n))
Efficiency is a measure of the speedup achieved per processor.

Communication overhead [1]
tc is the communication overhead. Including it in the parallel execution time gives:
Speedup: S = n / (β*n + (1 - β) + n*tc/T(1))
Efficiency: E = S/n = 1 / (β*n + (1 - β) + n*tc/T(1))
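A short sketch showing how the overhead term changes the picture; all the numbers (β, tc, T(1)) are hypothetical illustration values.

```python
# Amdahl's model extended with a communication overhead tc, per the slide:
# S = n / (beta*n + (1 - beta) + n*tc/T1), and E = S/n.

def speedup_with_overhead(beta, n, tc, t1):
    return n / (beta * n + (1.0 - beta) + n * tc / t1)

def efficiency_with_overhead(beta, n, tc, t1):
    return speedup_with_overhead(beta, n, tc, t1) / n

# With tc = 0 this reduces to plain Amdahl's law.
print(speedup_with_overhead(0.1, 10, 0.0, 100.0))
# A communication cost of 1 unit against T(1) = 100 lowers the speedup
# (the denominator gains n*tc/T1 = 0.1), and efficiency drops with it.
print(speedup_with_overhead(0.1, 10, 1.0, 100.0))
print(efficiency_with_overhead(0.1, 10, 1.0, 100.0))
```

Because the overhead term n*tc/T(1) grows with n, communication eventually dominates: adding processors can reduce speedup, not just efficiency.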

Parallelism Profile in Programs [2]
Degree of parallelism: for each time period, the number of processors used to execute a program is defined as the degree of parallelism (DOP).
The plot of the DOP as a function of time is called the parallelism profile of a given program.
Fluctuation of the profile during an observation period depends on the algorithmic structure, program optimization, resource utilization, and run-time conditions of the computer system.
The DOP defines the extent to which software parallelism matches hardware parallelism; the execution of a program on a parallel computer may use a different number of processors at different times.
Question: How would the parallelism profile look under Amdahl's law?

Average Parallelism [2]
The average parallelism A is computed by:
A = (sum for i = 1 to m of i*ti) / (sum for i = 1 to m of ti)
where m is the maximum parallelism in the profile and ti is the total amount of time that DOP = i (assuming n >> m processors, so the DOP is not limited by the hardware).
Available parallelism:
- Computationally intensive applications: up to 3500 operations in a clock cycle.
- If the computation is less numeric, the available parallelism is smaller; instruction-level parallelism is only 2 to 5.

Example [2]
The parallelism profile of an example divide-and-conquer algorithm increases from 1 to its peak value m = 8 and then decreases to 0 during the observation period (t1, t2).
A = (1*5 + 2*3 + 3*4 + 4*6 + 5*2 + 6*2 + 8*3) / (5 + 3 + 4 + 6 + 2 + 2 + 3) = 93/25 = 3.72
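The average-parallelism calculation above can be sketched directly from the (DOP, time) pairs of the profile:

```python
# Average parallelism A = sum(i * t_i) / sum(t_i), using the
# divide-and-conquer profile from the example: DOP i -> total time t_i.

profile = {1: 5, 2: 3, 3: 4, 4: 6, 5: 2, 6: 2, 8: 3}

def average_parallelism(profile):
    total_work = sum(i * t for i, t in profile.items())  # area under the profile
    total_time = sum(profile.values())                   # length of the period
    return total_work / total_time

print(average_parallelism(profile))  # 93 / 25 = 3.72
```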

Scalability of Parallel Algorithms [1]
Scalability analysis determines whether parallel processing of a given problem can offer the desired improvement in performance.
A parallel system is scalable if its efficiency can be kept fixed as the number of processors is increased, assuming that the problem size is also increased.
Example: adding m numbers using n processors, where each communication and each computation takes one unit of time.
Steps:
1. Each processor adds its m/n numbers.
2. The processors combine their partial sums.

Scalability Example [1]
Efficiency for different values of m and n:

          n=2     n=4     n=8     n=16    n=32
m=64      0.94    0.80    0.57    0.33    0.167
m=128     0.97    0.888   0.73    0.50    0.285
m=256     0.985   0.941   0.84    0.67    0.444
m=512     0.99    0.97    0.91    0.80    0.62
m=1024    0.995   0.985   0.955   0.889   0.76

Reading along a row, efficiency falls as processors are added; reading down a column, it recovers as the problem size grows. Keeping efficiency fixed therefore requires scaling m with n.
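These table values can be reproduced under one modeling assumption (not spelled out on the slide): the n partial sums are combined in log2(n) steps, each costing one communication plus one addition, so T(n) = m/n + 2*log2(n).

```python
# Efficiency of adding m numbers on n processors, assuming unit-time
# operations and a log2(n)-step combining tree (2 units per step):
#   T(1) = m,  T(n) = m/n + 2*log2(n),  E = T(1) / (n * T(n)).

from math import log2

def efficiency(m, n):
    t_parallel = m / n + 2 * log2(n)
    return m / (n * t_parallel)

for m in (64, 128, 256, 512, 1024):
    print(m, [round(efficiency(m, n), 3) for n in (2, 4, 8, 16, 32)])
```

For example, efficiency(64, 2) = 64 / (2 * (32 + 2)) = 0.94, matching the first table entry.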

Benchmarks [4]
A benchmark is "a standard of measurement or evaluation" (Webster's II Dictionary). Running the same computer benchmark on multiple computers allows a comparison to be made.
A computer benchmark is typically a computer program that:
- performs a strictly defined set of operations (a workload), and
- returns some form of result (a metric) describing how the tested computer performed.

Benchmarks
Challenges in developing benchmarks:
- Testing a whole system: CPU, cache, main memory, compilers.
- Selecting a suitable set of applications.
- Making the benchmarks portable (ANSI C: How big is a long? How big is a pointer? Does this platform implement calloc? Is it little-endian or big-endian?).
Fixed-workload benchmarks measure how fast the workload was completed; e.g., the EEMBC MPEG-x benchmark measures the time to process an entire video.
Throughput benchmarks measure how many workload units were completed per unit of time; e.g., the EEMBC MPEG-x benchmark counts the number of frames processed in a fixed amount of time.
Some benchmarks: Dhrystone, SPEC, EEMBC.
Notes: The process for these newly released media benchmarks began nearly three years ago. At the earliest stages of development, selecting a representative set of applications to sufficiently test the processors and systems under consideration provided the first challenge. The next challenge was making these benchmarks portable enough to run on the wide variety of processors and configurations that constitute the targeted digital consumer applications. Many popular benchmarks perform a fixed workload; throughput benchmarks, on the other hand, have no concept of finishing a fixed amount of work, and developers use them to measure the rate at which work gets done. Portability: the DENBench suite runs natively, directly on the processor hardware and without an operating system. Although this is a deviation from most real-world implementations, it supports portability because it eliminates operating-system dependencies and application-programming-interface issues. Even if your application never does any I/O, it is not just the speed of the CPU that dictates performance: cache, main memory, and compilers also play a role, and different software applications have differing performance requirements. And whom do you trust to provide this information?
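The portability questions listed above can be probed programmatically. In a C benchmark the answers would come from sizeof(long), sizeof(void*), and a byte-order test; the same platform properties can be inspected from Python, as sketched here:

```python
# Probe the platform properties behind the slide's ANSI C portability
# questions: size of a C long, size of a pointer, and the byte order.

import struct
import sys

print("sizeof(long): ", struct.calcsize("l"))  # native C long, in bytes
print("sizeof(void*):", struct.calcsize("P"))  # native pointer size, in bytes
print("byte order:   ", sys.byteorder)         # 'little' or 'big'
```

A long is 8 bytes on most 64-bit Unix systems but 4 bytes on 64-bit Windows, which is exactly why "portable" benchmark code cannot assume either answer.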

The Dhrystone Benchmark
- A CPU-intensive benchmark consisting of a mix of about 100 high-level-language statements and data types found in systems-programming applications where floating-point operations are not used.
- The Dhrystone statements are balanced with respect to statement type, data type, and locality of reference, with no operating-system calls and no use of library functions or subroutines.
- Results are reported in Dhrystone MIPS (sometimes just called DMIPS).
- The program fits in a cache memory, so it cannot be used for testing caches.

EEMBC [3]
The Embedded Microprocessor Benchmark Consortium (www.eembc.org).
- Benchmark suites cover telecommunications, networking, digital media, Java, automotive/industrial, consumer, and office-equipment products.
- Out-of-the-box portable code cannot take advantage of a multiprocessing or multithreading system's resources.
- Optimized implementations take advantage of hardware accelerators, coprocessors, or special instructions.
- Future: EEMBC will support multiprocessing embedded applications.
Examples:
- Automotive -> basic integer and floating point
- Digital entertainment -> RGB to YIQ conversion

SPEC [4]
The Standard Performance Evaluation Corporation (www.spec.org).
SPEC CPU2000 focuses on compute-intensive performance and emphasizes the performance of:
- the computer's processor,
- the memory architecture,
- the compilers.
CINT2000: integer programs. CFP2000: floating-point programs.

SPEC Features
- Benchmark programs are developed from actual end-user applications (e.g., gcc), as opposed to synthetic benchmarks.
- Multiple vendors use the suite and support it.
- SPEC CPU2000 is highly portable.
- The base metrics: the same compiler flags must be used in the same order for all benchmarks.
- The peak metrics: different compiler options may be used on each benchmark.

References
[1] Advanced Computer Architecture and Parallel Processing, Hesham El-Rewini and Mostafa Abd-El-Barr, John Wiley and Sons, 2005.
[2] Advanced Computer Architecture: Parallelism, Scalability, Programmability, K. Hwang, McGraw-Hill, 1993.
[3] The Embedded Microprocessor Benchmark Consortium, www.eembc.org
[4] The Standard Performance Evaluation Corporation, www.spec.org