Computer Architecture
Dr. Esam Al_Qaralleh
Princess Sumaya University for Technology

Performance & Cost

Performance Evolution
- 1970s: mainframes dominated; performance improved 25-30%/yr, mostly due to improved architecture plus some technology aids.
- 1980s: VLSI and the microprocessor became the foundation; technology improved at about 35%/yr.

Performance Evolution (Cont.)
- 1980s (cont.): the compiler focus brought on the great CISC vs. RISC debate.
  - With the exception of Intel, RISC won the argument.
  - RISC performance initially improved by about 50%/year.
  - Of course, RISC is no longer as simple, and the compiler is a key part of the game: it does not matter how fast your computer is if the compiler wastes most of that speed by generating inefficient code.
- With the exploitation of instruction-level parallelism (pipelining + superscalar execution) and the use of caches, performance is enhanced further.
- CISC: Complex Instruction Set Computing
- RISC: Reduced Instruction Set Computing (jokingly, "Relegate Important Stuff to the Compiler")

Growth in Performance (Figure 1.1): the growth is mainly due to advanced architectural ideas and is technology driven.

Optimizing the Design
- Usually the functional requirements are set by the company/marketplace.
- Which design is optimal depends on the choice of metric:
  - Cost minimized -> simple design
  - Performance maximized -> complex design or better technology
  - Time to market minimized -> also favors simplicity
- Oh, and you only get one shot:
  - Requires heaps of simulation, and you must quantify everything.
  - Inherent requirement for deep infrastructure and support.
  - Plus you must predict the trends...

Cost, Price, and Their Trends

Cost
- Clearly a marketplace issue: profit is a function of volume.
- Let's focus on hardware costs. Factors impacting cost:
  - Learning curve: manufacturing costs decrease over time.
  - Yield: the percentage of manufactured devices that survives the testing procedure.
  - Volume is also a key factor in determining cost.
  - Commodities are products that are sold by multiple vendors in large volumes and are essentially identical (e.g., laptops).

Learning Curve at Work

Integrated Circuit Costs
- Die cost grows roughly with die area.

Cost of an Integrated Circuit
- Cost of die = Cost of wafer / (Dies per wafer x Die yield)
- Die yield is the fraction (or percentage) of good dies on a wafer:
  Die yield = Wafer yield x {1 + (Defects per unit area x Die area) / alpha}^(-alpha)
- alpha is a parameter that corresponds roughly to the number of masking levels, a measure of manufacturing complexity that is critical to die yield (alpha = 4.0 is a good estimate).

Example: Finding the number of dies
- Find the number of dies per 30-cm wafer for a die that is 0.7 cm on a side.
- Answer: The total die area is 0.49 cm^2. Thus
  Dies per wafer = pi x (30/2)^2 / 0.49 - pi x 30 / (2 x 0.49)^0.5 = 706.9/0.49 - 94.2/0.99 = 1347

Example: Finding the die yield
- Find the die yield for dies that are 1 cm on a side and 0.7 cm on a side, assuming a defect density of 0.6 per cm^2.
- Answer: The total die areas are 1 cm^2 and 0.49 cm^2. For the larger die, the yield is
  Die yield = {1 + (0.6 x 1)/4}^-4 = 0.57
  For the smaller die, it is
  Die yield = {1 + (0.6 x 0.49)/4}^-4 = 0.75
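
The die-count and die-yield formulas above are easy to check numerically. Below is a minimal Python sketch (not part of the original slides; function and variable names are illustrative) that reproduces both worked examples.

```python
import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Approximate number of whole dies on a round wafer."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss

def die_yield(defects_per_cm2, die_area_cm2, alpha=4.0):
    """Fraction of good dies under the yield model above."""
    return (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

print(round(dies_per_wafer(30, 0.49)))   # ~1347 dies per 30-cm wafer
print(round(die_yield(0.6, 1.0), 2))     # 0.57 for the 1 cm x 1 cm die
print(round(die_yield(0.6, 0.49), 2))    # 0.75 for the 0.7 cm x 0.7 cm die
```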

Computer Designers and Chip Costs
- The computer designer affects die size, and hence cost, both by what functions are included on or excluded from the die and by the number of I/O pins.

Measuring and Reporting Performance

Definitions of Time
- Time can be defined in different ways, depending on what we are measuring:
  - Response time: total time to complete a task, including time spent executing on the CPU, accessing disk and memory, waiting for I/O and other processes, and operating system overhead.
  - CPU execution time: total time the CPU spends computing on a given task (excludes time for I/O or running other programs); also referred to simply as CPU time.
  - User CPU time: total time the CPU spends in the program itself.
  - System CPU time: total time the operating system spends executing tasks on behalf of the program.
- For example, a program may have a system CPU time of 22 sec, a user CPU time of 90 sec, a CPU execution time of 112 sec, and a response time of 162 sec.

Performance
- Execution time (response time, latency): the time between the start and the completion of a task, i.e., the time to do the task.
- Throughput (bandwidth): the total amount of work done in a given time (tasks per second, hour, day, ...); important to data center managers.
- Performance is the reciprocal of execution time, so to maximize performance we need to minimize execution time.
- If X is n times faster than Y, then
  Performance_X / Performance_Y = Execution time_Y / Execution time_X = n
- Decreasing response time almost always improves throughput.

Calculating CPU Performance
- We want to distinguish elapsed time from the time spent on our task.
- CPU execution time (CPU time): the time the CPU spends working on a task; it does not include time waiting for I/O or running other programs.
- We can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program:
  CPU time = CPU clock cycles x Clock cycle time = CPU clock cycles / Clock rate

Calculating CPU Performance (Cont.)
- We tend to count instructions executed: the instruction count (IC).
- Looking at the object code is just a start; what we care about is the dynamic count (don't forget loops, recursion, branches, etc.).
- CPI (clock cycles per instruction) is a figure of merit.

Calculating CPU Performance (Cont.)
- Three focus factors: cycle time, CPI, and IC.
- Sadly, they are interdependent, and making one better often makes another worse (though usually with small or predictable impacts):
  - Cycle time depends on HW technology and organization.
  - CPI depends on organization (pipelining, caching, ...) and the ISA.
  - IC depends on the ISA and compiler technology.
- CPIs are often easier to deal with on a per-instruction-class basis.

Clock Cycles per Instruction
- Not all instructions take the same amount of time to execute.
- One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction.
- Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute; a way to compare two different implementations of the same ISA.
  # CPU clock cycles for a program = # Instructions for a program x Average clock cycles per instruction
- Example instruction classes and their CPIs: class A = 1, class B = 2, class C = 3.

Effective CPI
- The overall effective CPI is computed by looking at the different types of instructions, their individual cycle counts, and their frequencies, and averaging:
  Overall effective CPI = sum over i = 1..n of (CPI_i x IC_i)
  where IC_i is the count (percentage) of instructions of class i executed, CPI_i is the (average) number of clock cycles per instruction for that instruction class, and n is the number of instruction classes.
- The overall effective CPI varies by instruction mix: a measure of the dynamic frequency of instructions across one or many programs.
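
As a concrete illustration, here is a small Python sketch (not from the slides; the instruction mix, instruction count, and clock rate are hypothetical) that computes the overall effective CPI and the resulting CPU time.

```python
# Hypothetical instruction mix: class -> (fraction of instruction count, CPI).
mix = {"ALU": (0.40, 1), "Load/Store": (0.30, 4), "Branch": (0.20, 2), "Other": (0.10, 3)}

effective_cpi = sum(freq * cpi for freq, cpi in mix.values())   # weighted average CPI

instruction_count = 1_000_000
clock_rate_hz = 500e6
cpu_time = instruction_count * effective_cpi / clock_rate_hz    # seconds

print(f"Effective CPI = {effective_cpi:.2f}, CPU time = {cpu_time * 1e3:.2f} ms")
```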

A Simple Example
Op       Freq   CPI_i   Freq x CPI_i
ALU      50%    1       0.5
Load     20%    5       1.0
Store    10%    3       0.3
Branch   20%    2       0.4
                Total   2.2
- How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
- How does this compare with using branch prediction to shave a cycle off the branch time?
- What if two ALU instructions could be executed at once?

A Simple Example (Answers)
- Better data cache (load CPI 5 -> 2): CPU time_new = 1.6 x IC x CC, so 2.2/1.6 means 37.5% faster.
- Branch prediction (branch CPI 2 -> 1): CPU time_new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster.
- Two ALU instructions at once (ALU CPI 1 -> 0.5): CPU time_new = 1.95 x IC x CC, so 2.2/1.95 means 12.8% faster.
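
A short Python sketch (not from the slides) that recomputes the three what-if scenarios from the instruction mix above; since IC and clock cycle time are unchanged, the speedup is just the ratio of effective CPIs.

```python
def effective_cpi(mix):
    """mix: dict of instruction class -> (frequency, CPI)."""
    return sum(freq * cpi for freq, cpi in mix.values())

base = {"ALU": (0.50, 1.0), "Load": (0.20, 5.0), "Store": (0.10, 3.0), "Branch": (0.20, 2.0)}
scenarios = {
    "better data cache (load CPI 5 -> 2)":   {**base, "Load": (0.20, 2.0)},
    "branch prediction (branch CPI 2 -> 1)": {**base, "Branch": (0.20, 1.0)},
    "dual-issue ALU (ALU CPI 1 -> 0.5)":     {**base, "ALU": (0.50, 0.5)},
}

cpi_old = effective_cpi(base)                       # 2.2
for name, mix in scenarios.items():
    cpi_new = effective_cpi(mix)
    print(f"{name}: new CPI {cpi_new:.2f}, {cpi_old / cpi_new - 1:.1%} faster")
```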

Example of Computing CPU Time
- If a computer has a clock rate of 50 MHz, how long does it take to execute a program with 1,000 instructions, if the CPI for the program is 3.5?
  Using the equation CPU time = Instruction count x CPI / Clock rate:
  CPU time = 1000 x 3.5 / (50 x 10^6) = 70 microseconds
- If a computer's clock rate increases from 200 MHz to 250 MHz and the other factors remain the same, how many times faster will the computer be?
  CPU time_old / CPU time_new = Clock rate_new / Clock rate_old = 250 MHz / 200 MHz = 1.25
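
The same two calculations in a few lines of Python (a sketch, not from the slides):

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds from the basic performance equation."""
    return instruction_count * cpi / clock_rate_hz

print(cpu_time(1000, 3.5, 50e6))    # 7e-05 s, i.e. 70 microseconds

# Speedup from raising the clock rate, with IC and CPI unchanged:
print(cpu_time(1000, 3.5, 200e6) / cpu_time(1000, 3.5, 250e6))   # 1.25
```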

Comparing and Summarizing Performance
- The guiding principle in reporting performance measurements is reproducibility: list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, and the specific computer configuration: clock rate, cache sizes and speeds, memory size and speed, etc.).
- How do we summarize the performance of a benchmark set with a single number?
- The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM):
  AM = (1/n) x sum over i = 1..n of Time_i
  where Time_i is the execution time for the i-th program of a total of n programs in the workload.
- A smaller mean indicates a smaller average execution time and thus improved performance.

Choosing Programs to Evaluate Performance
- Real applications: clearly the right choice.
  - Porting them and eliminating system-dependent activities takes effort.
  - User burden: you need to know which of your programs you really care about.
- Modified (or scripted) applications: enhance portability or focus on particular aspects of system performance.
- Kernels: small, key pieces of real programs.
  - Best used to isolate the performance of individual features and to explain the reasons for differences in the performance of real programs (e.g., testing memory, ALU, or branch instructions).
  - Not real programs, however; no user really runs them.

Choosing Programs to Evaluate Performance (Cont.)
- Toy benchmarks (e.g., quicksort, puzzle): typical beginning programming assignments.
- Synthetic benchmarks:
  - Try to match the average frequency of operations and operands of a large set of programs.
  - No user really runs them; they are not even pieces of real programs.
  - They typically fit in the cache and so don't test memory performance.
- At the very least you must understand what the benchmark code is in order to understand what it might be measuring.
- Companies thrive or bust on benchmark performance, hence they optimize for the benchmark. BEWARE ALWAYS!!

Benchmark Suites
- SPEC (Standard Performance Evaluation Corporation)
- Desktop benchmarks
  - CPU-intensive: SPEC CPU2000
  - Graphics-intensive: SPECviewperf
- Server benchmarks
  - CPU throughput-oriented: SPECrate
  - I/O activity: SPECSFS (NFS), SPECWeb
  - Transaction processing: TPC (Transaction Processing Council)
- Embedded benchmarks
  - EEMBC (EDN Embedded Microprocessor Benchmark Consortium)

SPEC Benchmarks

Integer benchmarks:
- gzip: compression
- vpr: FPGA place & route
- gcc: GNU C compiler
- mcf: combinatorial optimization
- crafty: chess program
- parser: word processing program
- eon: computer visualization
- perlbmk: perl application
- gap: group theory interpreter
- vortex: object-oriented database
- bzip2: compression
- twolf: circuit place & route

FP benchmarks:
- wupwise: quantum chromodynamics
- swim: shallow water model
- mgrid: multigrid solver in 3D fields
- applu: parabolic/elliptic PDE
- mesa: 3D graphics library
- galgel: computational fluid dynamics
- art: image recognition (neural network)
- equake: seismic wave propagation simulation
- facerec: facial image recognition
- ammp: computational chemistry
- lucas: primality testing
- fma3d: crash simulation (FEM)
- sixtrack: nuclear physics accelerator
- apsi: pollutant distribution

Other Performance Metrics
- Power consumption, especially in the embedded market where battery life is important (and cooling is often passive).
- For power-limited applications, the most important metric is energy efficiency.

Evaluating ISAs
- Design-time metrics:
  - Can it be implemented? In how long, and at what cost?
  - Can it be programmed? Ease of compilation?
- Static metrics:
  - How many bytes does the program occupy in memory?
- Dynamic metrics:
  - How many instructions are executed? How many bytes does the processor fetch to execute the program?
  - How many clocks are required per instruction?
- Best metric: time to execute the program!
  Time = Instruction Count x CPI x Cycle Time
  which depends on the instruction set, the processor organization, and compilation techniques.

Other Problems
- Let's assume we can get the test jig specified properly; see the following example.
- Which machine is better? By how much? Are the programs equally important?

Some Aggregate Job Mix Options
- Arithmetic mean: provides a simple average, but does not account for weight; all programs are treated as equal.
- Weighted arithmetic mean:
  - The weight is the frequency (%) of use of each program.
  - Better, but beware of a dominant program time, and the weights depend on the reference machine.

Weighted Arithmetic Mean
(Table, values omitted: execution times in seconds of programs P1 and P2 on machines A, B, and C, three weightings W(1), W(2), W(3), and the arithmetic mean under each weighting.)
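
To make the idea concrete, here is a small Python sketch (the execution times and weights are hypothetical, not the values from the original table) showing how the ranking of two machines can flip under different weightings.

```python
def weighted_am(times, weights):
    """Weighted arithmetic mean of execution times (weights sum to 1)."""
    return sum(w * t for w, t in zip(weights, times))

# Hypothetical execution times (seconds) for programs P1 and P2.
machine_A = [1.0, 500.0]
machine_B = [10.0, 50.0]

for name, weights in {"W(1)": (0.5, 0.5), "W(2)": (0.99, 0.01)}.items():
    print(name,
          "A:", weighted_am(machine_A, weights),
          "B:", weighted_am(machine_B, weights))
# With W(1), machine B looks better (30 vs 250.5); with W(2), machine A wins
# (5.99 vs 10.4) -- the choice of weights decides the comparison.
```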

Normalized Time Metrics
- Geometric mean of normalized execution times:
  GM = (product over i = 1..n of execution time ratio_i)^(1/n)
- Has the nice property that the ratio of the means equals the mean of the ratios, so it is consistent no matter which machine is used as the reference.
- Better than arithmetic means of normalized times, but geometric means do not form accurate prediction models: they do not predict execution time.
- So we still have to remain cautious.

Normalized Time Metrics (Cont.)
- The arithmetic mean should not be used to average normalized execution times.
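
The consistency property is easy to demonstrate. Below is a Python sketch (hypothetical times, not from the slides): the ratio of geometric means of normalized times is the same whichever machine is used as the reference, while the ratio of arithmetic means of normalized times is not.

```python
import math

def geometric_mean(values):
    return math.prod(values) ** (1 / len(values))

# Hypothetical execution times (seconds) of two programs on machines A and B.
times_A = [2.0, 50.0]
times_B = [4.0, 10.0]

for name, ref in [("A as reference", times_A), ("B as reference", times_B)]:
    norm_A = [a / r for a, r in zip(times_A, ref)]
    norm_B = [b / r for b, r in zip(times_B, ref)]
    gm_ratio = geometric_mean(norm_B) / geometric_mean(norm_A)
    am_ratio = (sum(norm_B) / len(norm_B)) / (sum(norm_A) / len(norm_A))
    print(name, "GM ratio B/A:", round(gm_ratio, 3), "AM ratio B/A:", round(am_ratio, 3))
# The GM ratio is 0.632 with either reference; the AM ratio changes (1.1 vs 0.364).
```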

Quantitative Principles of Computer Design

Make the Common Case Fast
- Need to validate that a case really is common (or uncommon).
- Common cases are often simpler than uncommon cases (e.g., the normal path vs. exceptions such as overflow and interrupts).
- Truly simple is usually both cheap and fast: the best of both worlds.
- The trick is to quantify the advantage of a proposed enhancement.

Amdahl's Law
- Defines the speedup gained from a particular enhancement. It depends on two factors:
  - The fraction of the original computation time that can take advantage of the enhancement, i.e., how common the enhanced case is.
  - The level of improvement gained by the enhancement.
- Amdahl's Law is a quantification of the principle of diminishing returns.

Amdahl's Law (Cont.)
- Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
  ExTime_new = ExTime_old x ((1 - F) + F/S)
  Speedup = ExTime_old / ExTime_new = 1 / ((1 - F) + F/S)
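
A minimal Python sketch of the law (not part of the slides), handy for the examples that follow:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of execution time is sped up by a factor."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(round(amdahl_speedup(0.1, 2), 3))     # 1.053: FP is 10% of time, made 2x faster
print(round(amdahl_speedup(0.1, 100), 3))   # ~1.11: even 100x faster FP helps little
```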

Simple Example
- An important application spends its time as follows: FPSQRT 20%, other FP instructions 50%, everything else 30%.
- The designers say it would cost the same to speed up FPSQRT by 40x, FP by 2x, or Other by 8x.
- In which option should you invest? Plug in the numbers and compare, but what's your guess?
- Note that Amdahl's Law says nothing about cost.

And the Winner Is…?
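
Plugging the three options into Amdahl's Law (a Python sketch, not from the slides; it assumes the 20%/50%/30% figures are fractions of execution time and that the 50% excludes FPSQRT):

```python
def amdahl_speedup(fraction, factor):
    return 1.0 / ((1.0 - fraction) + fraction / factor)

options = {"FPSQRT 40x": (0.20, 40), "FP 2x": (0.50, 2), "Other 8x": (0.30, 8)}
for name, (fraction, factor) in options.items():
    print(f"{name}: overall speedup {amdahl_speedup(fraction, factor):.3f}")
# FPSQRT 40x -> 1.242, FP 2x -> 1.333, Other 8x -> 1.356 (the largest overall gain)
```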

Example of Amdahl's Law
- Floating point instructions are improved to run twice as fast, but only 10% of the execution time was spent on these instructions originally. How much faster is the new machine?
  Speedup = ExTime_old / ExTime_new = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
  Speedup = 1 / ((1 - 0.1) + 0.1/2) = 1.053
  The new machine is 1.053 times as fast, i.e., 5.3% faster.
- How much faster would the new machine be if floating point instructions became 100 times faster?
  Speedup = 1 / ((1 - 0.1) + 0.1/100) = 1.109

Estimating Performance Improvements
- Assume a processor currently requires 10 seconds to execute a program and processor performance improves by 50 percent per year.
- By what factor does processor performance improve in 5 years?
  (1 + 0.5)^5 = 7.59
- How long will it take the processor to execute the program after 5 years?
  ExTime_new = 10 / 7.59 = 1.32 seconds

Performance Example
- Computers M1 and M2 are two implementations of the same instruction set.
- M1 has a clock rate of 50 MHz and M2 has a clock rate of 75 MHz.
- M1 has a CPI of 2.8 and M2 has a CPI of 3.2 for a given program.
- How many times faster is M2 than M1 for this program?
  ExTime_M1 / ExTime_M2 = (IC x CPI_M1 / Clock rate_M1) / (IC x CPI_M2 / Clock rate_M2) = (2.8/50) / (3.2/75) = 1.31
- What would the clock rate of M1 have to be for the two machines to have the same execution time?
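
A Python sketch (not from the slides) for both questions; the second answer follows by setting the two execution times equal:

```python
cpi_m1, clock_m1 = 2.8, 50e6   # M1
cpi_m2, clock_m2 = 3.2, 75e6   # M2

# Execution time per instruction is CPI / clock rate (the instruction count cancels).
speedup_m2_over_m1 = (cpi_m1 / clock_m1) / (cpi_m2 / clock_m2)
print(round(speedup_m2_over_m1, 2))          # 1.31

# Same execution time: CPI_M1 / clock_M1' = CPI_M2 / clock_M2
required_clock_m1 = cpi_m1 * clock_m2 / cpi_m2
print(required_clock_m1 / 1e6, "MHz")        # 65.625 MHz
```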

Simple Example
- Suppose we have made the following measurements:
  - Frequency of FP operations (other than FPSQR) = 25%
  - Average CPI of FP operations = 4.0
  - Average CPI of other instructions = 1.33
  - Frequency of FPSQR = 2%
  - CPI of FPSQR = 20
- Two design alternatives:
  - Reduce the CPI of FPSQR to 2.
  - Reduce the average CPI of all FP operations to 2.

And The Winner is…
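
Comparing the two alternatives with the effective-CPI equation (a Python sketch, not from the slides; it assumes the remaining 73% of instructions form the "other" class and that "all FP operations" means the 27% made up of FP and FPSQR instructions):

```python
def effective_cpi(mix):
    """mix: list of (frequency, CPI) pairs."""
    return sum(freq * cpi for freq, cpi in mix)

base = [(0.25, 4.0), (0.02, 20.0), (0.73, 1.33)]   # FP, FPSQR, other
alt1 = [(0.25, 4.0), (0.02, 2.0),  (0.73, 1.33)]   # FPSQR CPI 20 -> 2
alt2 = [(0.25, 2.0), (0.02, 2.0),  (0.73, 1.33)]   # all FP ops CPI -> 2

cpi_base = effective_cpi(base)
for name, mix in [("reduce FPSQR CPI to 2", alt1), ("reduce all FP CPI to 2", alt2)]:
    print(f"{name}: speedup {cpi_base / effective_cpi(mix):.2f}")
# ~1.18 for the FPSQR fix vs. ~1.57 for the FP fix: improving all FP operations wins.
```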
