Nicholas Weaver & Vladimir Stojanovic

Presentation transcript:

CS 61C: Great Ideas in Computer Architecture
Performance: Iron Law, Amdahl's Law
Instructors: Nicholas Weaver & Vladimir Stojanovic
http://inst.eecs.berkeley.edu/~cs61c/

New-School Machine Structures (It's a bit more complicated!)
Harness parallelism & achieve high performance. How do we know?
Parallel Requests: assigned to computer, e.g., search "Katz"
Parallel Threads: assigned to core, e.g., lookup, ads
Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words
Hardware descriptions: all gates @ one time
Programming Languages
[Figure: software/hardware stack running from warehouse scale computer and smart phone, through the computer (cores, memory/cache, input/output) and the core (cache, instruction units, functional units computing A0+B0 through A3+B3), down to logic gates]

What is Performance?
Latency (or response time or execution time): time to complete one task
Bandwidth (or throughput): tasks completed per unit time
If you have sufficient independent tasks, you can always throw more money at the problem: throughput/$ is often a more important metric than throughput alone

Cloud Performance: Why Application Latency Matters
Key figure of merit: application responsiveness
The longer the delay, the fewer the user clicks, the lower the user happiness, and the lower the revenue per user

Defining CPU Performance
What does it mean to say X is faster than Y? Ferrari vs. school bus?
2013 Ferrari 599 GTB: 2 passengers, quarter mile in 10 secs
2013 Type D school bus: 50 passengers, quarter mile in 20 secs
Response Time (Latency): e.g., time to travel ¼ mile
Throughput (Bandwidth): e.g., passenger-miles in 1 hour

Defining Relative CPU Performance
$\text{Performance}_X = 1 / \text{Program Execution Time}_X$
$\text{Performance}_X > \text{Performance}_Y \Rightarrow 1/\text{Execution Time}_X > 1/\text{Execution Time}_Y \Rightarrow \text{Execution Time}_Y > \text{Execution Time}_X$
Computer X is N times faster than Computer Y: $\text{Performance}_X / \text{Performance}_Y = N$, or $\text{Execution Time}_Y / \text{Execution Time}_X = N$
Bus to Ferrari performance:
Program: transfer 1000 passengers for 1 mile
Bus: 3,200 sec, Ferrari: 40,000 sec
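As a rough check of the bus and Ferrari numbers, the sketch below recomputes both execution times. It assumes every load costs a full round trip (the vehicle drives back empty for the next group), which is the assumption that reproduces the slide's 3,200 s and 40,000 s. The function name `transfer_time` is illustrative and not from the lecture.

```python
import math

def transfer_time(passengers, miles, capacity, secs_per_quarter_mile):
    """Seconds to move `passengers` over `miles`, assuming each load costs a round trip."""
    trips = math.ceil(passengers / capacity)
    one_way = miles * 4 * secs_per_quarter_mile  # 4 quarter miles per mile
    return trips * 2 * one_way                   # assumption: empty return leg is counted

bus = transfer_time(1000, 1, capacity=50, secs_per_quarter_mile=20)
ferrari = transfer_time(1000, 1, capacity=2, secs_per_quarter_mile=10)

# "X is N times faster than Y" means N = Execution Time_Y / Execution Time_X
n = ferrari / bus
print(f"bus = {bus} s, ferrari = {ferrari} s, bus is {n}x faster on this program")
```

On this throughput-style program the bus wins by a wide margin, even though the Ferrari has the lower latency for a single quarter mile.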

Measuring CPU Performance
Computers use a clock to determine when events take place within hardware
Clock cycles: discrete time intervals, aka clocks, cycles, clock periods, clock ticks
Clock rate or clock frequency: clock cycles per second (inverse of clock cycle time)
3 GHz clock rate => clock cycle time = $1/(3 \times 10^9)$ seconds ≈ 333 picoseconds (ps)

CPU Performance Factors
To distinguish between processor time and I/O, CPU time is the time spent in the processor
CPU Time/Program = Clock Cycles/Program x Clock Cycle Time
or
CPU Time/Program = Clock Cycles/Program ÷ Clock Rate

Iron Law of Performance by Emer and Clark
A program executes instructions
CPU Time/Program = Clock Cycles/Program x Clock Cycle Time
= Instructions/Program x Average Clock Cycles/Instruction x Clock Cycle Time
1st term is called Instruction Count
2nd term is abbreviated CPI, for average Clock Cycles Per Instruction
3rd term is 1 / Clock rate

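Restating Performance Equation
$$\text{Time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$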
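The three-term product can be written as a one-line function. This is a minimal sketch; the function name `cpu_time` and the example program (1 billion instructions, CPI of 1.5) are hypothetical values chosen for illustration, while the 3 GHz clock matches the earlier clock-rate example.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """Iron Law: seconds/program = instructions x cycles/instruction x seconds/cycle."""
    clock_cycle_time = 1.0 / clock_rate_hz  # seconds per clock cycle
    return instruction_count * cpi * clock_cycle_time

# Hypothetical program: 1 billion instructions, average CPI of 1.5, 3 GHz clock.
t = cpu_time(1e9, 1.5, 3e9)
print(f"cycle time = {1 / 3e9 * 1e12:.0f} ps, CPU time = {t:.3f} s")
```

Halving any one of the three terms halves the CPU time, which is why the equation is useful for reasoning about design trade-offs.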

What Affects Each Component? A) Instruction Count, B) CPI, C) Clock Rate
Affects what? (Click the letter of the component not affected.)
Algorithm
Programming Language
Compiler
Instruction Set Architecture
Notes:
Algorithm: instruction count, possibly CPI. The algorithm determines the number of source program instructions executed and hence the number of processor instructions executed. The algorithm may also affect the CPI by favoring slower or faster instructions. For example, if the algorithm uses more floating-point operations, it will tend to have a higher CPI.
Programming language: instruction count, CPI. The programming language certainly affects the instruction count, since statements in the language are translated to processor instructions, which determine instruction count. The language may also affect the CPI because of its features; for example, a language with heavy support for data abstraction (e.g., Java) will require indirect calls, which use higher-CPI instructions.
Compiler: instruction count, CPI. The efficiency of the compiler affects both the instruction count and the average cycles per instruction, since the compiler determines the translation of source language instructions into computer instructions. The compiler's role can be very complex and can affect the CPI in complex ways.
Instruction set architecture: instruction count, clock rate, CPI. The instruction set architecture affects all three aspects of CPU performance, since it affects the instructions needed for a function, the cost in cycles of each instruction, and the overall clock rate of the processor.

What Affects Each Component? Instruction Count, CPI, Clock Rate
Component                      Affects what?
Algorithm                      Instruction Count, CPI
Programming Language           Instruction Count, CPI
Compiler                       Instruction Count, CPI
Instruction Set Architecture   Instruction Count, Clock Rate, CPI

Clickers
Which computer has the highest performance for a given program?
Computer   Clock frequency   Clock cycles per instruction   #instructions per program
A          1 GHz             2                              1000
B          2 GHz             5                              800
C          500 MHz           1.25                           400
D          5 GHz             10                             2000
A: 1 ns * 2 * 1000 = 2 us
B: 0.5 ns * 5 * 800 = 2 us
C: 2 ns * 1.25 * 400 = 1 us
D: 0.2 ns * 10 * 2000 = 4 us
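The clicker arithmetic can be reproduced with a few lines of Python. This is only a sketch using the four machines from the table; the dictionary layout and variable names are illustrative.

```python
# name: (clock frequency in Hz, clock cycles per instruction, instructions per program)
machines = {
    "A": (1e9, 2, 1000),
    "B": (2e9, 5, 800),
    "C": (500e6, 1.25, 400),
    "D": (5e9, 10, 2000),
}

# CPU time = instructions x CPI / clock rate
times = {name: n_instr * cpi / freq for name, (freq, cpi, n_instr) in machines.items()}

for name, seconds in times.items():
    print(f"{name}: {seconds * 1e6:.1f} us")

fastest = min(times, key=times.get)  # lowest execution time = highest performance
print("highest performance:", fastest)
```

Note that D has the highest clock frequency but the longest execution time, so clock rate alone does not determine performance.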

Workload and Benchmark
Workload: set of programs run on a computer
Either the actual collection of applications run, or a mix made from real programs to approximate it
Specifies programs, inputs, and relative frequencies
Benchmark: program selected for use in comparing computer performance
Benchmarks form a workload
Usually standardized so that many can use them

SPEC (System Performance Evaluation Cooperative)
Computer vendor cooperative for benchmarks, started in 1989
SPECCPU2006: 12 integer programs, 17 floating-point programs
Results are often turned into a single number where bigger is faster
SPECratio: reference execution time on an old reference computer divided by execution time on the new computer, giving an effective speed-up

SPECINT2006 on AMD Barcelona
Description                         Instruction count (B)   CPI     Clock cycle time (ps)   Execution time (s)   Reference time (s)   SPECratio
Interpreted string processing       2,118                   0.75    400                     637                  9,770                15.3
Block-sorting compression           2,389                   0.85    400                     817                  9,650                11.8
GNU C compiler                      1,050                   1.72    400                     724                  8,050                11.1
Combinatorial optimization          336                     10.0    400                     1,345                9,120                6.8
Go game                             1,658                   1.09    400                     721                  10,490               14.6
Search gene sequence                2,783                   0.80    400                     890                  9,330                10.5
Chess game                          2,176                   0.96    400                     837                  12,100               14.5
Quantum computer simulation         1,623                   1.61    400                     1,047                20,720               19.8
Video compression                   3,102                   0.80    400                     993                  22,130               22.3
Discrete event simulation library   587                     2.94    400                     690                  6,250                9.1
Games/path finding                  1,082                   1.79    400                     773                  7,020                9.1
XML parsing                         1,058                   2.70    400                     1,143                6,900                6.0
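The table columns are tied together by the Iron Law: execution time should be close to instruction count x CPI x clock cycle time. The sketch below checks this for the first four rows. It assumes the single 400 ps clock cycle time listed in the table applies to every benchmark, since all were run on the same processor; the helper name `predicted_seconds` is illustrative.

```python
CYCLE_TIME_PS = 400  # assumption: the one 400 ps entry applies to every row

# (description, instruction count in billions, CPI, reported execution time in s)
rows = [
    ("Interpreted string processing", 2118, 0.75, 637),
    ("Block-sorting compression", 2389, 0.85, 817),
    ("GNU C compiler", 1050, 1.72, 724),
    ("Combinatorial optimization", 336, 10.0, 1345),
]

def predicted_seconds(instr_billions, cpi, cycle_ps=CYCLE_TIME_PS):
    """Iron Law prediction: instructions x CPI x seconds per cycle."""
    return instr_billions * 1e9 * cpi * cycle_ps * 1e-12

errors = {}
for name, ic, cpi, reported in rows:
    predicted = predicted_seconds(ic, cpi)
    errors[name] = abs(predicted - reported) / reported
    print(f"{name}: predicted {predicted:.0f} s, reported {reported} s")
```

The small differences between predicted and reported times come from the rounding of CPI in the table.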

Summarizing Performance …
System   Rate (Task 1)   Rate (Task 2)
A        10              20
B        20              10
Clickers: Which system is faster?
A: System A
B: System B
C: Same performance
D: Unanswerable question!

… Depends Who's Selling
Average throughput:
System   Rate (Task 1)   Rate (Task 2)   Average
A        10              20              15
B        20              10              15
Throughput relative to B:
System   Rate (Task 1)   Rate (Task 2)   Average
A        0.50            2.00            1.25
B        1.00            1.00            1.00
Throughput relative to A:
System   Rate (Task 1)   Rate (Task 2)   Average
A        1.00            1.00            1.00
B        2.00            0.50            1.25
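To see why the answer depends on who is selling, the sketch below normalizes the rates to each system in turn and averages them two ways. System B's raw rates (20 and 10) are inferred here from the normalized tables, and the function name `summarize` is illustrative. `statistics.geometric_mean` requires Python 3.8 or later.

```python
from statistics import mean, geometric_mean

# Rates from the slide; B's raw rates (20, 10) are inferred from the normalized tables.
rates = {"A": (10, 20), "B": (20, 10)}

def summarize(rates, ref):
    """Normalize every system's rates to `ref`, then average them two ways."""
    out = {}
    for system, values in rates.items():
        norm = [v / r for v, r in zip(values, rates[ref])]
        out[system] = (mean(norm), geometric_mean(norm))
    return out

for ref in ("A", "B"):
    for system, (arith, geo) in summarize(rates, ref).items():
        print(f"relative to {ref}: {system} arithmetic={arith:.2f} geometric={geo:.2f}")
```

With the arithmetic mean, whichever system is not the reference looks 1.25x better. The geometric mean of the normalized rates is 1.00 for both systems under either reference.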

Summarizing SPEC Performance
Varies from 6x to 22x faster than the reference computer
Geometric mean of ratios: N-th root of the product of N ratios
Geometric mean gives the same relative answer no matter what computer is used as reference
Geometric mean for Barcelona is 11.7
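As a sanity check on the 11.7 figure, the sketch below recomputes each SPECratio from the execution and reference times in the SPECINT2006 table and takes their geometric mean. Variable names are illustrative; `statistics.geometric_mean` requires Python 3.8 or later.

```python
from statistics import geometric_mean

# Execution and reference times (seconds) from the SPECINT2006 table, in row order.
execution = [637, 817, 724, 1345, 721, 890, 837, 1047, 993, 690, 773, 1143]
reference = [9770, 9650, 8050, 9120, 10490, 9330, 12100, 20720, 22130, 6250, 7020, 6900]

# SPECratio = reference time / execution time on the machine under test
ratios = [ref / run for ref, run in zip(reference, execution)]

gm = geometric_mean(ratios)  # N-th root of the product of the N ratios
print(f"min ratio = {min(ratios):.1f}, max ratio = {max(ratios):.1f}, geometric mean = {gm:.1f}")
```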

Administrivia Midterm 2 in the evening next Monday Project 2.1 grades in

Big Idea: Amdahl's (Heartbreaking) Law
Speedup due to enhancement E is
$$\text{Speedup w/ E} = \frac{\text{Exec time w/o E}}{\text{Exec time w/ E}}$$
Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected
$$\text{Exec time w/ E} = \text{Exec time w/o E} \times \left[ (1-F) + \frac{F}{S} \right]$$
$$\text{Speedup w/ E} = \frac{1}{(1-F) + F/S}$$

Big Idea: Amdahl's Law
$$\text{Speedup} = \frac{1}{(1-F) + \frac{F}{S}}$$
Here $(1-F)$ is the non-speed-up part and $F/S$ is the speed-up part
Example: the execution time of half of the program can be accelerated by a factor of 2. What is the program speed-up overall?
$$\frac{1}{0.5 + \frac{0.5}{2}} = \frac{1}{0.5 + 0.25} = 1.33$$

Example #1: Amdahl's Law
Speedup w/ E = 1 / [ (1-F) + F/S ]
Consider an enhancement which runs 20 times faster but which is only usable 25% of the time
Speedup w/ E = 1/(.75 + .25/20) = 1.31
What if it's usable only 15% of the time?
Speedup w/ E = 1/(.85 + .15/20) = 1.17
Amdahl's Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less
Speedup w/ E = 1/(.001 + .999/100) = 90.99
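The three worked examples can be checked with a small helper. This is a minimal sketch; the function name `amdahl_speedup` and its argument checks are illustrative choices, not part of the lecture.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the run time is accelerated by a factor s."""
    if not 0.0 <= f <= 1.0 or s <= 0:
        raise ValueError("need 0 <= f <= 1 and s > 0")
    return 1.0 / ((1.0 - f) + f / s)

usable_25 = amdahl_speedup(0.25, 20)   # enhancement usable 25% of the time
usable_15 = amdahl_speedup(0.15, 20)   # enhancement usable 15% of the time
hundred = amdahl_speedup(0.999, 100)   # 0.1% scalar code on 100 processors

print(f"{usable_25:.2f} {usable_15:.2f} {hundred:.2f}")
```

Even a 20x enhancement yields only a modest overall gain when it applies to a quarter of the run time or less.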

If the portion of the program that can be parallelized is small, then the speedup is limited The non-parallel portion limits the performance

Strong and Weak Scaling
Getting good speedup on a parallel processor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem
Strong scaling: speedup is achieved on a parallel processor without increasing the size of the problem
Weak scaling: speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
Load balancing is another important factor: every processor should do the same amount of work
Just one unit with twice the load of the others cuts speedup almost in half
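The load-balancing claim can be illustrated with a simple model. The sketch assumes the parallel run time is set by the most heavily loaded processor and ignores all other overheads; the processor count of 100 and the function name `speedup_with_loads` are hypothetical.

```python
def speedup_with_loads(loads):
    """Speedup over one processor doing all the work.

    Assumption: the parallel run finishes when the most-loaded processor finishes.
    """
    return sum(loads) / max(loads)

p = 100                                 # hypothetical processor count
balanced = [1.0] * p                    # every processor does the same work
one_overloaded = [1.0] * (p - 1) + [2.0]  # one unit has twice the load of the others

print(speedup_with_loads(balanced), speedup_with_loads(one_overloaded))
```

Under this model a single overloaded unit drops the speedup from 100 to about 50, because the other 99 processors sit idle while it finishes.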

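Clickers/Peer Instruction
Suppose a program spends 80% of its time in a square root routine. By how much must you speed up square root to make the program run 5 times faster?
A: 5
B: 16
C: 20
D: 100
E: None of the above
Speedup w/ E = 1 / [ (1-F) + F/S ]
Answer: E (None of the above). With F = 0.8, the overall speedup approaches 1/(1-F) = 5 only as S goes to infinity, so square root would need to be infinitely fast to reach 5x
Overall speedup for each option: S = 5 gives 2.8, S = 10 gives 3.6, S = 20 gives 4.2, S = 100 gives 4.8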
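One way to check this answer is to solve Amdahl's Law for S. The sketch below does so with exact fractions, so that the boundary case (target equal to 1/(1-F)) is not blurred by floating-point rounding. The function name `required_speedup` is illustrative.

```python
from fractions import Fraction
import math

def required_speedup(f, target):
    """Solve target = 1 / ((1 - f) + f / s) for s; math.inf if no finite s works."""
    f, target = Fraction(f), Fraction(target)
    denom = 1 / target - (1 - f)   # this must equal f / s
    if denom <= 0:
        return math.inf            # target is at or beyond the 1 / (1 - f) limit
    return f / denom

f = Fraction(8, 10)                # 80% of the time is spent in square root
need_5x = required_speedup(f, 5)   # 5x overall is exactly the limit
need_4x = required_speedup(f, 4)   # a 4x overall target is reachable

# Overall speedup for each of the listed options
options = {s: float(1 / ((1 - f) + f / s)) for s in (5, 10, 20, 100)}
print(need_5x, need_4x, options)
```

A 4x overall speedup needs square root to be 16 times faster, while 5x is only reached in the limit of an infinitely fast routine.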

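And In Conclusion, …
Time (seconds/program) is the measure of performance
$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$
Floating-point representations hold approximations of real numbers in a finite number of bits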