CS3350B Computer Architecture Winter 2015 Performance Metrics I Marc Moreno Maza www.csd.uwo.ca/Courses/CS3350b.

Slides:



Advertisements
Similar presentations
COMPUTER ARCHITECTURE & OPERATIONS I Instructor: Yaohang Li.
Advertisements

Computer Abstractions and Technology
ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )
810:142 Lecture 2: Performance Fall 2006 Chapter 4: Performance Adapted from Mary Jane Irwin at Penn State University for Computer Organization and Design,
ECE-C355 Computer Structures Winter 2008 Chapter 04: Understanding Performance Slides are adapted from Professor Mary Jane Irwin (
CSE431 L04 Understanding Performance.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 04: Understanding Performance Mary Jane Irwin (
Read Section 1.4, Section 1.7 (pp )
Chapter 1 CSF 2009 Computer Performance. Defining Performance Which airplane has the best performance? Chapter 1 — Computer Abstractions and Technology.
CSCE 212 Chapter 4: Assessing and Understanding Performance Instructor: Jason D. Bakos.
CS/ECE 3330 Computer Architecture Chapter 1 Performance / Power.
EET 4250: Chapter 1 Performance Measurement, Instruction Count & CPI Acknowledgements: Some slides and lecture notes for this course adapted from Prof.
Lecture 3: Computer Performance
CS3350B Computer Architecture Winter 2015 Introduction Marc Moreno Maza
Introduction to Computer Architecture SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING SUMMER 2015 RAMYAR SAEEDI.
CMSC 611: Advanced Computer Architecture Performance Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Chapter 1 Section 1.4 Dr. Iyad F. Jafar Evaluating Performance.
Morgan Kaufmann Publishers Computer Abstractions and Technology
CSE 340 Computer Architecture Summer 2014 Understanding Performance
Chapter 1 Computer Abstractions and Technology Part II.
C OMPUTER O RGANIZATION AND D ESIGN The Hardware/Software Interface 5 th Edition Chapter 1 Computer Abstractions and Technology.
EET 4250: Chapter 1 Computer Abstractions and Technology Acknowledgements: Some slides and lecture notes for this course adapted from Prof. Mary Jane Irwin.
Chapter 1 - The Computer Revolution Chapter 1 — Computer Abstractions and Technology — 1  Progress in computer technology  Underpinned by Moore’s Law.
Lecture 1: Performance EEN 312: Processors: Hardware, Software, and Interfacing Department of Electrical and Computer Engineering Spring 2013, Dr. Rozier.
Sogang University Advanced Computing System Chap 1. Computer Architecture Hyuk-Jun Lee, PhD Dept. of Computer Science and Engineering Sogang University.
C OMPUTER O RGANIZATION AND D ESIGN The Hardware/Software Interface 5 th Edition Chapter 1 Computer Abstractions and Technology Sections 1.5 – 1.11.
10/19/2015Erkay Savas1 Performance Computer Architecture – CS401 Erkay Savas Sabanci University.
Chapter 1 — Computer Abstractions and Technology — 1 Understanding Performance Algorithm Determines number of operations executed Programming language,
Chapter 1 Computer Abstractions and Technology. Chapter 1 — Computer Abstractions and Technology — 2 The Computer Revolution Progress in computer technology.
Performance.
Chapter 1 Performance & Technology Trends Read Sections 1.5, 1.6, and 1.8.
Chapter 1 Computer Abstractions and Technology. Chapter 1 — Computer Abstractions and Technology — 2 The Computer Revolution Progress in computer technology.
Performance Lecture notes from MKP, H. H. Lee and S. Yalamanchili.
Chapter 1 Technology Trends and Performance. Chapter 1 — Computer Abstractions and Technology — 2 Technology Trends Electronics technology continues to.
Morgan Kaufmann Publishers
Chapter 1 Computer Abstractions and Technology. Chapter 1 — Computer Abstractions and Technology — 2 The Computer Revolution Progress in computer technology.
CS35101 – Computer Architecture Week 9: Understanding Performance Paul Durand ( ) [Adapted from M Irwin (
Computer Organization Instruction Set Architecture (ISA) Instruction Set Architecture (ISA), or simply Architecture, of a computer is the.
CPS3340 COMPUTER ARCHITECTURE Fall Semester, /03/2013 Lecture 3: Computer Performance Instructor: Ashraf Yaseen DEPARTMENT OF MATH & COMPUTER SCIENCE.
C OMPUTER O RGANIZATION AND D ESIGN The Hardware/Software Interface 5 th Edition Chapter 1 Computer Abstractions and Technology.
Chapter 1 — Computer Abstractions and Technology — 1 Uniprocessor Performance Constrained by power, instruction-level parallelism, memory latency.
Performance Computer Organization II 1 Computer Science Dept Va Tech January 2009 © McQuain & Ribbens Defining Performance Which airplane has.
Chapter 1 Computer Abstractions and Technology. Chapter 1 — Computer Abstractions and Technology — 2 The Computer Revolution Progress in computer technology.
COMPUTER ARCHITECTURE & OPERATIONS I Instructor: Yaohang Li.
Performance COE 301 / ICS 233 Computer Organization Prof. Muhamed Mudawar College of Computer Sciences and Engineering King Fahd University of Petroleum.
Chapter 1 Performance & Technology Trends. Outline What is computer architecture? Performance What is performance: latency (response time), throughput.
C OMPUTER O RGANIZATION AND D ESIGN The Hardware/Software Interface ARM Edition Chapter 1 Computer Abstractions and Technology.
CSE 340 Computer Architecture Summer 2016 Understanding Performance.
Lecture 3. Performance Prof. Taeweon Suh Computer Science & Engineering Korea University COSE222, COMP212, CYDF210 Computer Architecture.
C OMPUTER O RGANIZATION AND D ESIGN The Hardware/Software Interface 5 th Edition Chapter 1 Computer Abstractions and Technology.
Computer Architecture & Operations I
Morgan Kaufmann Publishers Technology Trends and Performance
Measuring Performance II and Logic Design
Computer Architecture & Operations I
Computer Architecture & Operations I
CS161 – Design and Architecture of Computer Systems
Performance Lecture notes from MKP, H. H. Lee and S. Yalamanchili.
Assessing and Understanding Performance
Morgan Kaufmann Publishers Computer Abstractions and Technology
Computer Architecture & Operations I
Uniprocessor Performance
Morgan Kaufmann Publishers
Morgan Kaufmann Publishers Computer Abstractions and Technology
COSC 3406: Computer Organization
CSCE 212 Chapter 4: Assessing and Understanding Performance
Morgan Kaufmann Publishers Computer Performance
Morgan Kaufmann Publishers Computer Abstractions and Technology
Morgan Kaufmann Publishers Computer Abstractions and Technology
Morgan Kaufmann Publishers Computer Abstractions and Technology
Performance Lecture notes from MKP, H. H. Lee and S. Yalamanchili.
CS161 – Design and Architecture of Computer Systems
Presentation transcript:

CS3350B Computer Architecture Winter 2015 Performance Metrics I Marc Moreno Maza

Components of a Computer CPU Computer Control Datapath MemoryDevices Input Output

Levels of Program Code  High-level language l Level of abstraction closer to problem domain l Provides for productivity and portability  Assembly language l Textual representation of instructions  Hardware representation l Binary digits (bits) l Encoded instructions and data 3

Old School Machine Structures (Layers of Abstraction) 4 I/O systemProcessor Compiler Operating System (Mac OSX) Application (ex: browser) Digital Design Circuit Design Instruction Set Architecture Datapath & Control Transistors Memory Hardware Software Assembler

New-School Machine Structures  Parallel Requests Assigned to computer e.g., Search “Katz”  Parallel Threads Assigned to core e.g., Lookup, Ads  Parallel Instructions >1 one time e.g., 5 pipelined instructions  Parallel Data >1 data one time e.g., Add of 4 pairs of words  Hardware descriptions All gates working in parallel at same time Smart Phone Warehouse Scale Computer Software Hardware Harness Parallelism & Achieve High Performance Logic Gates Core … Memory (Cache) Input/Output Computer Main Memory Core Instruction Unit(s) Functional Unit(s) A 3 +B 3 A 2 +B 2 A 1 +B 1 A 0 +B 0 5

 Eight Great Ideas in Pursuing Performance l Design for Moore’s Law l Use abstraction to simplify design l Make the common case fast l Performance via parallelism l Performance via pipelining l Performance via prediction l Hierarchy of memories l Dependability via redundancy 6

Abstractions  Abstraction helps us deal with complexity l Hide lower-level detail  Instruction set architecture (ISA) l The hardware/software interface  Application binary interface l The ISA plus system software interface  Implementation l The details underlying and interface 7

Understanding Performance  Algorithm l Determines number of operations executed  Programming language, compiler, architecture l Determine number of machine instructions executed per operation  Processor and memory system l Determine how fast instructions are executed  I/O system (including OS) l Determines how fast I/O operations are executed 8

Performance Metrics  Purchasing perspective l given a collection of machines, which has the -best performance ? -least cost ? -best cost/performance?  Design perspective l faced with design options, which has the -best performance improvement ? -least cost ? -best cost/performance?  Both require l basis for comparison l metric for evaluation  Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors 9

CPU Performance  Normally interested in reducing l Response time (aka execution time) – the time between the start and the completion of a task -Important to individual users l Thus, to maximize performance, need to minimize execution time performance X = 1 / execution_time X If X is n times faster than Y, then performance X execution_time Y = = n performance Y execution_time X  And increasing l Throughput – the total amount of work done in a given time -Important to data center managers l Decreasing response time almost always improves throughput 10

Performance Factors  Want to distinguish elapsed time and the time spent on our task  CPU execution time (CPU time) – time the CPU spends working on a task l Does not include time waiting for I/O or running other programs CPU execution time # CPU clock cycles for a program for a program = x clock cycle time CPU execution time # CPU clock cycles for a program for a program clock rate =  Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program or 11

CPU Clocking  Operation of digital hardware governed by a constant- rate clock  Clock period (cycle): duration of a clock cycle l e.g., 250ps = 0.25ns = 250×10 –12 s  Clock frequency (rate): cycles per second l e.g., 3.0GHz = 3000MHz = 3.0×10 9 Hz  CR = 1 / CC Clock (cycles) Data transfer and computation Update state Clock period 12

Clock Cycles per Instruction  Not all instructions take the same amount of time to execute l One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction CPI for this instruction class ABC CPI123  Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute l A way to compare two different implementations of the same ISA # CPU clock cycles # Instructions Average clock cycles for a program for a program per instruction = x 13

Effective CPI  Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging Overall effective CPI =  (CPI i x IC i ) i = 1 n l Where IC i is the count (percentage) of the number of instructions of class i executed l CPI i is the (average) number of clock cycles per instruction for that instruction class l n is the number of instruction classes  The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs 14

THE Performance Equation  Our basic performance equation is then CPU time = Instruction_count x CPI x clock_cycle Instruction_count x CPI clock_rate CPU time = or  These equations separate the three key factors that affect performance l Can measure the CPU execution time by running the program l The clock rate is usually given l Can measure overall instruction count by using profilers/ simulators without knowing all of the implementation details l CPI varies by instruction type and ISA implementation for which we must know the implementation details 15

Determinates of CPU Performance Instruction_ count CPIclock_cycle Algorithm Programming language Compiler ISA Processor organization Technology CPU time = Instruction_count x CPI x clock_cycle 16

Determinates of CPU Performance Instruction_ count CPIclock_cycle Algorithm Programming language Compiler ISA Processor organization Technology CPU time = Instruction_count x CPI x clock_cycle X XX XX XX X X X X X 17

A Simple Example  How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?  How does this compare with using branch prediction to shave a cycle off the branch time?  What if two ALU instructions could be executed at once? OpFreqCPI i Freq x CPI i ALU50%1 Load20%5 Store10%3 Branch20%2  = 18

A Simple Example  How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?  How does this compare with using branch prediction to shave a cycle off the branch time?  What if two ALU instructions could be executed at once? OpFreqCPI i Freq x CPI i ALU50%1 Load20%5 Store10%3 Branch20%2  = CPU time new = 1.6 x IC x CC so 2.2/1.6 means 37.5% faster CPU time new = 2.0 x IC x CC so 2.2/2.0 means 10% faster CPU time new = 1.95 x IC x CC so 2.2/1.95 means 12.8% faster 19

20 Performance Summary  Performance depends on l Algorithm: affects IC, possibly CPI l Programming language: affects IC, CPI l Compiler: affects IC, CPI l Instruction set architecture: affects IC, CPI, T c

21 Power Trends  In complementary metal–oxide–semiconductor (CMOS) integrated circuit technology ×1000 5V → 1V ×30

22 Reducing Power  Suppose a new CPU has l 85% of capacitive load of old CPU l 15% voltage and 15% frequency reduction The power wall We can’t reduce voltage further We can’t remove more heat How else can we improve performance?

23 Uniprocessor Performance Constrained by power, instruction-level parallelism, memory latency

24 Multiprocessors  Multicore microprocessors l More than one processor per chip  Requires explicitly parallel programming l Compare with instruction level parallelism -Hardware executes multiple instructions at once -Hidden from the programmer l Hard to do -Programming for performance -Load balancing -Optimizing communication and synchronization

25 SPEC CPU Benchmark  Programs used to measure performance l Supposedly typical of actual workload  Standard Performance Evaluation Corp (SPEC) l Develops benchmarks for CPU, I/O, Web, …  SPEC CPU2006 l Elapsed time to execute a selection of programs -Negligible I/O, so focuses on CPU performance l Normalize relative to reference machine l Summarize as geometric mean of performance ratios -CINT2006 (integer) and CFP2006 (floating-point)

26 CINT2006 for Intel Core i7 920

Profiling Tools  Many profiling tools l gprof (static instrumentation) l cachegrind, Dtrace (dynamic instrumentation) l perf (performance counters)  perf in linux-tools, based on event sampling l Keep a list of where “interesting events” (cycle, cache miss, etc) happen l CPU Feature: Counters for hundreds of events - Performance: Cache misses, branch misses, instructions per cycle, … l Intel® 64 and IA-32 Architectures Software Developer's Manual: Appendix A lists all counters l perf user guide:

Exercise 1 void copymatrix1(int n, int (*src)[n], int (*dst)[n]) { int i,j; for (i = 0; i < n; i++) for (j = 0; j < n; j++) dst[i][j] = src[i][j]; } void copymatrix2(int n, int (*src)[n], int (*dst)[n]) { int i,j; for (j = 0; j < n; j++) for (i = 0; i < n; i++) dst[i][j] = src[i][j]; }  copymatrix1 vs copymatrix2 l What do they do? l What is the difference? l Which one performs better? Why?  perf stat –e cycles –e cache-misses./copymatrix1 perf stat –e cycles –e cache-misses./copymatrix2 l What’s the output like? l How to interpret it? l Which program performs better? 28

Exercise 2 void lower1 (char* s) { int i; for (i = 0; i < strlen(s); i++) if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= 'A'-'a'; } void lower2 (char* s) { int i; int n = strlen(s); for (i = 0; i < n; i++) if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= 'A'-'a‘; }  lower1 vs lower2 l What do they do? l What is the difference? l Which one performs better? Why?  perf stat –e cycles –e cache-misses./lower1 perf stat –e cycles –e cache-misses./lower2 l What does the output look like? l How to interpret it? l Which program performs better? 29