1
COSC121: Computer Systems. Basic Architecture and Performance
Jeremy Bolton, PhD Assistant Teaching Professor Constructed using materials: - Patt and Patel Introduction to Computing Systems (2nd) - Patterson and Hennessy Computer Organization and Design (4th) **A special thanks to Eric Roberts and Mary Jane Irwin
2
Notes
Project 3 assigned; due soon.
Read PH.1
3
Outline
Review: From Electron to Programming Languages
Basic Computer Architecture: how can we improve our computing model (with respect to efficiency)?
Performance measures: speed, cost, power consumption, memory
Hardware is fast, software is slow: CISC, RISC
Parallelism: ILP, DLP, MLP
Going to memory is slow: caching schemes
Interrupt-based communication: polling can be slow
More and more parallelism: multi-threading, multi-core, distributed computing
… and more
4
This week … our journey takes us …
COSC 121: Computer Systems spans the system stack:
Software: Application (Browser), Operating System (Win, Linux; see COSC 255: Operating Systems), Compiler, Assembler, Drivers
Instruction Set Architecture
Hardware: Processor, Memory, I/O system, Datapath & Control, Digital Design, Circuit Design, transistors (see COSC 120: Computer Hardware)
5
Assessing the performance of a Computer
Consider: computers are defined by, and limited by, their hardware and ISAs. There are many different computing models.
6
Classes of Computers
Personal computers: designed to deliver good performance to a single user at low cost, usually executing 3rd-party software, usually incorporating a graphics display, a keyboard, and a mouse
Servers: used to run larger programs for multiple, simultaneous users; typically accessed only via a network; place a greater emphasis on dependability and (often) security
Supercomputers: a high-performance, high-cost class of servers with hundreds to thousands of processors, terabytes of memory and petabytes of storage, used for high-end scientific and engineering applications
Embedded computers (processors): a computer inside another device used for running one predetermined application
7
How do we define “better”: Performance Metrics
Purchasing perspective: given a collection of machines, which has the best performance? the least cost? the best performance/cost?
Design perspective: faced with design options, which has the best performance improvement?
Both require a basis for comparison and a metric for evaluation
Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors
Other possible metrics: smallest/lightest, longest battery life, most reliable/durable (e.g., in space)
8
Throughput versus Response Time
Response time (execution time): the time between the start and the completion of a task; important to individual users
Throughput (bandwidth): the total amount of work done in a given time; important to data center managers
We will need different performance metrics, as well as different sets of benchmark applications, for embedded and desktop computers (more focused on response time) versus servers (more focused on throughput)
9
Defining (Speed) Performance
To maximize performance, we need to minimize execution time:
performanceX = 1 / execution_timeX
If X is n times faster than Y, then
performanceX / performanceY = execution_timeY / execution_timeX = n
Increasing performance requires decreasing execution time
Decreasing response time almost always improves throughput
10
Relative Performance Example
If computer A runs a program in 10 seconds and computer B runs the same program in 15 seconds, how much faster is A than B? We know that A is n times faster than B if
performanceA / performanceB = execution_timeB / execution_timeA = n
The performance ratio is 15 / 10 = 1.5
So A is 1.5 times faster than B
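As a sanity check, the ratio can be computed directly (an illustrative Python sketch; variable names are ours):

```python
# Relative performance: performance = 1 / execution_time,
# so "A is n times faster than B" means exec_time_b / exec_time_a = n.
exec_time_a = 10  # seconds on computer A
exec_time_b = 15  # seconds on computer B

n = exec_time_b / exec_time_a
print(n)  # 1.5, i.e. A is 1.5 times faster than B
```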
11
Review: Machine Clock Rate
Clock rate (clock cycles per second, in MHz or GHz) is the inverse of the clock cycle time (clock period): CC = 1 / CR
10 nsec clock cycle => 100 MHz clock rate
5 nsec clock cycle => 200 MHz clock rate
2 nsec clock cycle => 500 MHz clock rate
1 nsec (10^-9) clock cycle => 1 GHz (10^9) clock rate
500 psec clock cycle => 2 GHz clock rate
250 psec clock cycle => 4 GHz clock rate
200 psec clock cycle => 5 GHz clock rate
A clock cycle is the basic unit of time to execute one operation/pipeline stage/etc.
12
Performance Factors
CPU execution time (CPU time): the time the CPU spends working on a task; does not include time waiting for I/O or running other programs
CPU execution time for a program = # CPU clock cycles for a program × clock cycle time
or
CPU execution time for a program = # CPU clock cycles for a program / clock rate
Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program
But beware: many techniques that decrease the number of clock cycles also increase the clock cycle time
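Both forms of the formula give the same answer, which a short sketch can confirm (illustrative Python; the 20 × 10^9 cycles at 2 GHz echo the example on the next slide):

```python
def cpu_time_from_cycles(clock_cycles, clock_cycle_time):
    """CPU time = # clock cycles x clock cycle time."""
    return clock_cycles * clock_cycle_time

def cpu_time_from_rate(clock_cycles, clock_rate):
    """Equivalently, CPU time = # clock cycles / clock rate."""
    return clock_cycles / clock_rate

cycles = 20e9  # 20 x 10^9 cycles
print(cpu_time_from_cycles(cycles, 0.5e-9))  # ~10 seconds (2 GHz -> 0.5 ns cycle)
print(cpu_time_from_rate(cycles, 2e9))       # 10.0 seconds
```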
13
Improving Performance Example
A program runs on computer A with a 2 GHz clock in 10 seconds. What clock rate must computer B run at to run this program in 6 seconds? Assume (unfortunately) that, to accomplish this, computer B will require 1.2 times as many clock cycles as computer A to run the program.
First, we need to know the number of clock cycles on A:
CPU clock cyclesA = CPU timeA × clock rateA = 10 sec × 2 × 10^9 cycles/sec = 20 × 10^9 cycles
Given that B will require 1.2 times as many cycles and must execute in 6 seconds, we can solve for its clock rate:
clock rateB = 1.2 × 20 × 10^9 cycles / 6 seconds = 4 GHz
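The same steps can be scripted (a sketch in Python; variable names are ours):

```python
# Computer A: 2 GHz clock, runs the program in 10 s
cycles_a = 10 * 2e9            # 20 x 10^9 cycles
# Computer B needs 1.2x as many cycles but must finish in 6 s
cycles_b = 1.2 * cycles_a      # 24 x 10^9 cycles
clock_rate_b = cycles_b / 6    # cycles per second
print(clock_rate_b / 1e9)      # ~4.0, i.e. B needs a 4 GHz clock
```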
14
THE Performance Equation
Our basic performance equation is then
CPU time = Instruction_count × CPI × clock_cycle
or
CPU time = Instruction_count × CPI / clock_rate
These equations separate the three key factors that affect performance:
Can measure the CPU execution time by running the program
The clock rate is usually given
Can measure the overall instruction count by using profilers/simulators, without knowing all of the implementation details
CPI varies by instruction type and ISA implementation, for which we must know the implementation details
Note that instruction count is dynamic: it's not the number of lines in the code, but THE NUMBER OF INSTRUCTIONS EXECUTED
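A minimal sketch of the performance equation (Python; the instruction count and CPI values below are made-up examples, not from the slides):

```python
def cpu_time(instruction_count, cpi, clock_rate):
    """CPU time = IC x CPI / clock_rate (= IC x CPI x clock cycle time)."""
    return instruction_count * cpi / clock_rate

# Hypothetical: 10^9 instructions, CPI of 2.0, 2 GHz clock -> 1 second
print(cpu_time(1e9, 2.0, 2e9))  # 1.0
```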
15
Using the Performance Equation: Example
Computers A and B implement the same ISA. Computer A has a clock cycle time of 250 ps and an effective CPI of 2.0 for some program; computer B has a clock cycle time of 500 ps and an effective CPI of 1.2 for the same program. Which computer is faster, and by how much? Each computer executes the same number of instructions, I, so
CPU timeA = I × 2.0 × 250 ps = 500 × I ps
CPU timeB = I × 1.2 × 500 ps = 600 × I ps
Clearly, A is faster, by the ratio of execution times:
performanceA / performanceB = execution_timeB / execution_timeA = 600 × I ps / 500 × I ps = 1.2
16
Determinants of CPU Performance
CPU time = Instruction_count × CPI × clock_cycle

                      Instruction_count   CPI   clock_cycle
Algorithm                    X             X
Programming language         X             X
Compiler                     X             X
ISA                          X             X        X
17
Clock Cycles per Instruction
Not all instructions take the same amount of time to execute. One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction:
# CPU clock cycles for a program = # Instructions for a program × Average clock cycles per instruction
Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute. A way to compare two different implementations of the same ISA.
CPI for this instruction class:
Instruction class   A   B   C
CPI                 1   2   3
18
Effective (Average) CPI
Computing the overall effective CPI is done by looking at the different types of instructions, their individual cycle counts, and averaging:
Overall effective CPI = Σ (CPIi × ICi), summed over i = 1 to n
where
ICi is the count (percentage) of the number of instructions of class i executed
CPIi is the (average) number of clock cycles per instruction for that instruction class
n is the number of instruction classes
The overall effective CPI varies by instruction mix: a measure of the dynamic frequency of instructions across one or many programs
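The weighted sum can be written directly (a Python sketch; the mix values come from the example table on the next slide):

```python
def effective_cpi(mix):
    """Overall effective CPI = sum over classes of CPI_i x IC_i,
    where IC_i is each class's fraction of instructions executed."""
    return sum(cpi * frac for cpi, frac in mix)

# (CPI_i, fraction): ALU, Load, Store, Branch
mix = [(1, 0.50), (5, 0.20), (3, 0.10), (2, 0.20)]
print(effective_cpi(mix))  # ~2.2
```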
19
A Simple Example

Op       Freq   CPIi   Freq × CPIi   load = 2   branch − 1   2× ALU
ALU      50%    1      .5            .5         .5           .25
Load     20%    5      1.0           .4         1.0          1.0
Store    10%    3      .3            .3         .3           .3
Branch   20%    2      .4            .4         .2           .4
                Σ =    2.2           1.6        2.0          1.95

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
CPU time new = 1.6 × IC × CC, so 2.2/1.6 means 37.5% faster
How does this compare with using branch prediction to shave a cycle off the branch time?
CPU time new = 2.0 × IC × CC, so 2.2/2.0 means 10% faster
What if two ALU instructions could be executed at once?
CPU time new = 1.95 × IC × CC, so 2.2/1.95 means 12.8% faster
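Each what-if question above reduces to recomputing the weighted CPI; a Python sketch using the table's frequencies:

```python
# Base instruction mix: frequency of each class
freq = {"ALU": 0.50, "Load": 0.20, "Store": 0.10, "Branch": 0.20}

def effective_cpi(cpi):
    """Weighted average CPI for a given per-class CPI assignment."""
    return sum(freq[op] * cpi[op] for op in freq)

base        = effective_cpi({"ALU": 1,   "Load": 5, "Store": 3, "Branch": 2})  # ~2.2
faster_load = effective_cpi({"ALU": 1,   "Load": 2, "Store": 3, "Branch": 2})  # ~1.6
branch_pred = effective_cpi({"ALU": 1,   "Load": 5, "Store": 3, "Branch": 1})  # ~2.0
dual_alu    = effective_cpi({"ALU": 0.5, "Load": 5, "Store": 3, "Branch": 2})  # ~1.95

# Speedup = old CPI / new CPI (IC and clock cycle are unchanged)
print(base / faster_load)  # ~1.375, i.e. 37.5% faster
```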
20
Workloads and Benchmarks
Benchmarks: a set of programs that form a "workload" specifically chosen to measure performance. SPEC (Standard Performance Evaluation Corporation) creates standard sets of benchmarks, starting with SPEC89. The latest is SPEC CPU2006, which consists of 12 integer benchmarks (CINT2006) and 17 floating-point benchmarks (CFP2006). There are also benchmark collections for power workloads (SPECpower_ssj2008), for mail workloads (SPECmail2008), for multimedia workloads (MediaBench), …
21
… Other Performance Metrics
Power consumption: especially important in the embedded market, where battery life matters. For power-limited applications, the most important metric is energy efficiency.
(From Version 3 of the book; probably needs to be replaced or dropped, since it is based on SPEC2000)
22
Morgan Kaufmann Publishers
September 11, 2018
Power Trends
In CMOS IC technology, dynamic power = capacitive load × voltage² × frequency. Over this period, clock rates grew by about ×1000 and supply voltage dropped from 5V → 1V, yet power grew by only about ×30.
Chapter 1 — Computer Abstractions and Technology
23
The Power Issue: Relatively speaking
Fred Pollack, Intel Corp., Micro32 conference keynote
24
Reducing Power
Suppose a new CPU has 85% of the capacitive load of the old CPU, with a 15% voltage reduction and a 15% frequency reduction. Then
Pnew / Pold = (0.85 × C) × (0.85 × V)² × (0.85 × F) / (C × V² × F) = 0.85⁴ ≈ 0.52
The power wall:
We can't reduce voltage further
We can't remove more heat
How else can we improve performance?
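The claimed saving follows from the CMOS dynamic-power relation P ∝ C × V² × f; a quick sketch (Python):

```python
# Dynamic power in CMOS: P = capacitive load x voltage^2 x frequency
def power(cap, volt, freq):
    return cap * volt**2 * freq

# New CPU: 85% capacitive load, 15% voltage and 15% frequency reduction
ratio = power(0.85, 0.85, 0.85) / power(1.0, 1.0, 1.0)
print(ratio)  # ~0.52, i.e. roughly half the power of the old CPU
```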
25
Parallelism for improved performance
Given physical constraints Clock Speeds (Power) Transistor size (Moore’s Law is coming to an end … ?) most current improvements have come from parallelism Parallelism in both hardware and software
26
The University of Adelaide, School of Computer Science
11 September 2018
Parallelism
Classes of parallelism in applications:
Data-Level Parallelism (DLP)
Task-Level Parallelism (TLP)
Classes of architectural parallelism:
Instruction-Level Parallelism (ILP)
Vector architectures / Graphics Processor Units (GPUs)
Thread-Level Parallelism
Request-Level Parallelism
Chapter 2 — Instructions: Language of the Computer
27
Instruction Level Parallelism
You have already seen parallelism in action! Pipelining allows multiple instructions to be executed simultaneously. Let's briefly investigate another ISA: MIPS. (Recall the instruction-cycle stages F, D, EA, OP, EX, S: Fetch, Decode, Evaluate Address, Fetch Operands, Execute, Store.)
28
MIPS Pipeline
Five stages, one step per stage:
IF: Instruction fetch from memory
ID: Instruction decode & register read
EX: Execute operation or calculate address
MEM: Access memory operand
WB: Write result back to register
[Datapath diagram: IM, Reg, ALU, DM]
Chapter 4 — The Processor
29
Pipelining Analogy
Pipelined laundry: overlapping execution. Parallelism improves performance.
Four loads: Speedup = 8/3.5 = 2.3
Non-stop: Speedup = 2n/0.5n ≈ 4 = number of stages
30
Why Pipeline? For Performance!
Time (clock cycles)
[Pipeline diagram: Inst 0 through Inst 4, each flowing through IM, Reg, ALU, DM, Reg in successive cycles]
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
Time to fill the pipeline
31
Can Pipelining Get Us Into Trouble?
Yes: Pipeline Hazards
Structural hazards: attempt to use the same resource by two different instructions at the same time
Data hazards: attempt to use data before it is ready; an instruction's source operand(s) are produced by a prior instruction still in the pipeline (note that data hazards can come from R-type instructions or lw instructions)
Control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated (branch and jump instructions, exceptions)
Can usually resolve hazards by waiting: pipeline control must detect the hazard and take action to resolve it
Exceptions can't be resolved by waiting!
32
A Single Memory Would Be a Structural Hazard
Time (clock cycles)
[Pipeline diagram: lw followed by Inst 1 through Inst 4; in one cycle, lw reads data from memory while a later instruction is being fetched from the same memory]
Fix with separate instruction and data memories (I$ and D$)
33
How About Register File Access?
Time (clock cycles)
[Pipeline diagram: add $1,… writes the register file in the same cycle a later add $2,$1,… reads it; separate clock edges control pipeline-state-register loading and register writing]
Fix the register file access hazard by defining register reads to occur in the second half of the cycle and register writes in the first half
34
Register Usage Can Cause Data Hazards
Dependencies backward in time cause hazards
[Pipeline diagram: add $1,… followed by sub $4,$1,$5, and $6,$1,$7, or $8,$1,$9, and xor $4,$1,$5, each reading $1 before add has written it back]
Read before write data hazard
36
Loads Can Cause Data Hazards
Dependencies backward in time cause hazards
[Pipeline diagram: lw $1,4($2) followed by sub $4,$1,$5, and $6,$1,$7, or $8,$1,$9, and xor $4,$1,$5]
Note that lw is just another example of register usage (beyond ALU ops)
Load-use data hazard
37
Branch Instructions Cause Control Hazards
Dependencies backward in time cause hazards
[Pipeline diagram: beq followed by lw, Inst 3, and Inst 4; the instructions after the branch are fetched before the branch outcome is known]
38
Even with all the hazards …
Pipelining is worth the overhead! We will discuss the different hazards in more detail …
39
Pipeline Performance
Assume time for stages is:
100 ps for register read or write
200 ps for other stages
Compare the pipelined datapath with the single-cycle datapath:

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200 ps       100 ps         200 ps  200 ps         100 ps          800 ps
sw        200 ps       100 ps         200 ps  200 ps                         700 ps
R-format  200 ps       100 ps         200 ps                 100 ps          600 ps
beq       200 ps       100 ps         200 ps                                 500 ps
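The per-instruction totals can be recomputed from the stage latencies (a Python sketch; the mapping lists which stages each instruction class actually uses):

```python
# Per-stage latencies (ps)
stage_ps = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages each instruction class uses
uses = {
    "lw":       ["IF", "ID", "EX", "MEM", "WB"],
    "sw":       ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq":      ["IF", "ID", "EX"],
}

totals = {instr: sum(stage_ps[s] for s in stages) for instr, stages in uses.items()}
print(totals)  # {'lw': 800, 'sw': 700, 'R-format': 600, 'beq': 500}
```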
40
Pipeline Performance
[Instruction timing diagrams: single-cycle datapath (Tc = 800 ps) vs. pipelined datapath (Tc = 200 ps)]
41
Pipeline Speedup
If all stages are balanced (i.e., all take the same time):
Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
If not balanced, speedup is less!
The speedup is due to increased throughput; latency (time for each instruction) does not decrease
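The balanced-stage formula in code (an illustrative Python sketch):

```python
def pipelined_time_between_instructions(nonpipelined_time_ps, n_stages):
    """Ideal (balanced-stage) pipelining: time between instruction
    completions drops by the number of stages."""
    return nonpipelined_time_ps / n_stages

# 800 ps single-cycle datapath, 5 balanced stages -> 160 ps ideally;
# in practice the slowest stage (200 ps here) sets the clock
print(pipelined_time_between_instructions(800, 5))  # 160.0
```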
42
Pipelining and ISA Design
MIPS ISA designed for pipelining:
All instructions are 32 bits: easier to fetch and decode in one cycle (cf. x86: 1- to 17-byte instructions)
Few and regular instruction formats: can decode and read registers in one step
Load/store addressing: can calculate the address in the 3rd stage, access memory in the 4th stage
Alignment of memory operands: memory access takes only one cycle
43
Current Trends in Architecture
Benefits reaped from instruction-level parallelism (ILP) have largely been maxed out. New models for performance:
Data-level parallelism (DLP)
Thread-level parallelism (TLP)
These require explicit restructuring of the application
44
Summary All modern day processors use pipelining
Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
Potential speedup: a CPI of 1 and a fast CC
Pipeline rate is limited by the slowest pipeline stage; unbalanced pipe stages make for inefficiencies
The time to "fill" the pipeline and the time to "drain" it can impact speedup for deep pipelines and short code runs
Must detect and resolve hazards; stalling negatively affects CPI (makes CPI greater than the ideal of 1)
45
Future Topics (time permitting)
Parallel Computing Distributed and Cloud Computing GPUs and Vectored Architectures
46
Appendix
47
CPU Time
CPU Time = CPU Clock Cycles × Clock Cycle Time = CPU Clock Cycles / Clock Rate
Performance improved by:
Reducing the number of clock cycles
Increasing the clock rate
48
CPU Time Example
Computer A: 2 GHz clock, 10 s CPU time
Designing Computer B: aim for 6 s CPU time; can use a faster clock, but that causes 1.2 × the clock cycles
How fast must Computer B's clock be?
Clock CyclesA = 10 s × 2 GHz = 20 × 10^9
Clock RateB = 1.2 × 20 × 10^9 cycles / 6 s = 4 GHz
49
Instruction Count and CPI
Instruction count for a program: determined by the program, the ISA and the compiler
Average cycles per instruction (CPI): determined by the CPU hardware
If different instructions have different CPIs, the average CPI is affected by the instruction mix
50
CPI Example
Computer A: Cycle Time = 250 ps, CPI = 2.0
Computer B: Cycle Time = 500 ps, CPI = 1.2
Same ISA. Which is faster, and by how much?
CPU TimeA = I × 2.0 × 250 ps = 500 × I ps
CPU TimeB = I × 1.2 × 500 ps = 600 × I ps
A is faster, by 600 × I ps / 500 × I ps = 1.2×
51
CPI in More Detail
If different instruction classes take different numbers of cycles:
Clock Cycles = Σi (CPIi × Instruction Counti)
Weighted average CPI = Clock Cycles / Instruction Count = Σi (CPIi × (Instruction Counti / Instruction Count)), where the parenthesized term is the relative frequency of class i
52
CPI Example
Alternative compiled code sequences using instructions in classes A, B, C:

Class              A   B   C
CPI for class      1   2   3
IC in sequence 1   2   1   2
IC in sequence 2   4   1   1

Sequence 1: IC = 5; Clock Cycles = 2×1 + 1×2 + 2×3 = 10; Avg. CPI = 10/5 = 2.0
Sequence 2: IC = 6; Clock Cycles = 4×1 + 1×2 + 1×3 = 9; Avg. CPI = 9/6 = 1.5
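The two sequences can be compared with a few lines of Python (a sketch; class names follow the table):

```python
cpi_for_class = {"A": 1, "B": 2, "C": 3}

def cycles_and_cpi(instruction_counts):
    """Total clock cycles and average CPI for a code sequence,
    given per-class instruction counts."""
    cycles = sum(cpi_for_class[c] * n for c, n in instruction_counts.items())
    ic = sum(instruction_counts.values())
    return cycles, cycles / ic

print(cycles_and_cpi({"A": 2, "B": 1, "C": 2}))  # (10, 2.0)
print(cycles_and_cpi({"A": 4, "B": 1, "C": 1}))  # (9, 1.5)
```

Note that sequence 2 executes one more instruction yet takes fewer cycles, which is why instruction count alone is not a reliable performance measure.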
53
Performance Summary
The BIG Picture: performance depends on
Algorithm: affects IC, possibly CPI
Programming language: affects IC, CPI
Compiler: affects IC, CPI
Instruction set architecture: affects IC, CPI, Tc
54
Flynn's Taxonomy
Single instruction stream, single data stream (SISD)
Single instruction stream, multiple data streams (SIMD): vector architectures, multimedia extensions, graphics processor units
Multiple instruction streams, single data stream (MISD): no commercial implementation
Multiple instruction streams, multiple data streams (MIMD): tightly-coupled MIMD, loosely-coupled MIMD