Lecture 4: Performance (conclusion) & Instruction Set Architecture
Michael B. Greenwald Computer Architecture CIS 501 Spring 1999
Philosophy
Open-ended problems, messy solutions.
Positive: more like the real world (but still contrived!); you get good at approximating and back-of-the-envelope calculation.
Negative: how do you know you have the right answer?
How to Summarize Performance
Track total execution time of all (weighted) benchmarks: convert to execution time, sum, convert back to the measure, and average.
Arithmetic mean (weighted arithmetic mean) of times: (1/n) Σi Ti, or Σi Wi·Ti.
Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS): n / Σi (1/Ri), or 1 / Σi (Wi/Ri).
Compare to a reference architecture (SPEC92 vs. VAX-11/780): normalized execution time is handy for scaling performance (e.g., "X times faster than a SPARCstation 10").
But do not take the arithmetic mean of normalized execution times; use the geometric mean: (Πi norm_timei)^(1/n).
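As a concrete illustration, here is a minimal sketch in Python (with made-up benchmark numbers, not from the lecture) of the three kinds of summary:

```python
import math

times   = [2.0, 10.0, 4.0]    # seconds per benchmark (made up)
weights = [0.5, 0.25, 0.25]   # workload weights, summing to 1
rates   = [100.0, 20.0, 50.0] # e.g., MFLOPS per benchmark (made up)
ref     = [4.0, 8.0, 4.0]     # times on a reference machine (made up)

arith   = sum(times) / len(times)                       # (1/n) * sum(Ti)
w_arith = sum(w * t for w, t in zip(weights, times))    # sum(Wi * Ti)
harm    = len(rates) / sum(1.0 / r for r in rates)      # n / sum(1/Ri)
w_harm  = 1.0 / sum(w / r for w, r in zip(weights, rates))  # 1 / sum(Wi/Ri)

norm = [t / r for t, r in zip(times, ref)]              # normalized times
geom = math.prod(norm) ** (1.0 / len(norm))             # (prod norm_i)^(1/n)

print(arith, w_arith, harm, w_harm, geom)
```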
Normalized Execution Times and Geometric Means
The geometric mean is independent of the normalization: the numerators are the same regardless of the reference machine, and the denominator cancels out in any machine-to-machine comparison.
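A quick numeric check of that claim, using hypothetical times on two machines and two different reference machines:

```python
import math

def gmean(xs):
    return math.prod(xs) ** (1.0 / len(xs))

x = [2.0, 6.0]   # machine X's times on two benchmarks (made up)
y = [4.0, 9.0]   # machine Y's times (made up)

for ref in ([1.0, 2.0], [8.0, 0.5]):   # two different reference machines
    gx = gmean([t / r for t, r in zip(x, ref)])
    gy = gmean([t / r for t, r in zip(y, ref)])
    print(gx / gy)   # same value both times: gmean(x) / gmean(y)
```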
Normalized Execution Times and Geometric Means
The geometric mean does not track execution time: halving a 1 µs program has the same effect as halving a 10 hour program, since each benchmark contributes only its ratio, and both changes multiply the mean by the same factor.
SPEC First Round
One program spent 99% of its time in a single line of code; a new front-end compiler could improve its result dramatically.
Impact of Means on SPECmark89 for IBM 550
[Table: per-program results before and after the compiler change for gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, and tomcatv, summarized three ways: ratio to the VAX, time, and weighted time. The per-program numbers did not survive extraction; the summary ratios (after/before) were 1.33 for the geometric mean, 1.16 for the arithmetic mean, and 1.09 for the weighted arithmetic mean.]
Marketing Metrics
MIPS = instruction count / (execution time × 10^6) = clock rate / (CPI × 10^6).
Problems: machines with different instruction sets? Programs with different instruction mixes? MIPS depends on the dynamic frequency of instructions, and is largely uncorrelated with performance.
MFLOPS = FP operations / (execution time × 10^6). Machine dependent, and often not where the time is spent.
Normalized MFLOPS weights the operations: add, sub, compare, and mult count as 1; divide and sqrt as 4; exp, sin, and the like as 8.
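A small sketch of both metrics in Python; the 1/4/8 operation weights are the classic normalized-MFLOPS convention noted above, and the program counts in the usage lines are made up:

```python
def mips(instruction_count, seconds):
    return instruction_count / (seconds * 1e6)

def mips_from_clock(clock_hz, cpi):
    return clock_hz / (cpi * 1e6)

# Classic normalized-MFLOPS weights: add/sub/compare/mult = 1,
# divide/sqrt = 4, exp/sin (and similar) = 8.
FLOP_WEIGHTS = {"add": 1, "sub": 1, "compare": 1, "mult": 1,
                "divide": 4, "sqrt": 4, "exp": 8, "sin": 8}

def normalized_mflops(op_counts, seconds):
    ops = sum(FLOP_WEIGHTS[op] * n for op, n in op_counts.items())
    return ops / (seconds * 1e6)

print(mips(200e6, 2.0))                  # 100 MIPS
print(mips_from_clock(400e6, 4.0))       # also 100 MIPS
print(normalized_mflops({"mult": 50_000_000, "divide": 1_000_000}, 2.0))  # 27.0
```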
Normalized Performance: the "k-MIPS machine"
MIPS is normalized, on a per-program basis, to a VAX-11/780. This has the same problems as SPECmarks, and it too must be summarized by a geometric mean (although performance is 1/time, the inverse of a geometric mean is the geometric mean of the inverses).
Performance Evaluation
“For better or worse, benchmarks shape a field.”
Good products are created when you have good benchmarks and good ways to summarize performance.
Given that sales are, in part, a function of performance relative to the competition, vendors invest in improving the product as reported by the performance summary.
If the benchmarks or the summary are inadequate, vendors must choose between improving the product for real programs and improving it to get more sales; sales almost always win!
For computer systems, the key performance metric is total execution time.
Summary, #1: Designing to Last through Trends

        Capacity        Speed
Logic   2x in 3 years   2x in 3 years
DRAM    4x in 3 years   2x in 10 years
Disk    4x in 3 years   2x in 10 years

Six years to graduate ⇒ 16X CPU speed, DRAM/disk size.

Time to run the task: execution time, response time, latency.
Tasks per day, hour, week, sec, ns, …: throughput, bandwidth.
“X is n times faster than Y” means n = ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y).
Summary, #2: Amdahl’s Law and the CPI Law

Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)

Execution time is the REAL measure of computer performance!
Good products are created when you have good benchmarks and good ways to summarize performance.
Die cost grows roughly with (die area)^4.
Can the PC industry support engineering/research investment?
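Both laws as small Python functions; the numbers in the usage lines are made up:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of execution time is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

def cpu_time(instructions, cpi, cycle_time_s):
    """CPU time = instructions x cycles/instruction x seconds/cycle."""
    return instructions * cpi * cycle_time_s

print(amdahl_speedup(0.4, 10.0))      # ~1.56: speeding up 40% of the time by 10x
print(cpu_time(1e9, 1.5, 1 / 500e6))  # 3.0 s for 10^9 instructions at 500 MHz, CPI 1.5
```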
Instruction Set Architecture (Introduction and/or Review)
Computer Architecture Is …
… the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation. — Amdahl, Blaauw, and Brooks, 1964
Architecture, Organization, and Hardware
Instruction set architecture: the programmer-visible interface between software and hardware.
Organization: high-level aspects of a computer’s design, such as the memory system, bus structure, and internal CPU design, that implement the ISA.
Hardware: detailed logic design, packaging, etc.; can be thought of as the implementation of the organization.
Interface Design
A good interface:
Lasts through many implementations (portability, compatibility).
Is used in many different ways (generality).
Provides convenient functionality to higher levels.
Permits an efficient implementation at lower levels.
[Figure: one interface serving many uses across successive implementations (imp 1, imp 2, imp 3) over time]
Instruction Set Architectures
The interface exposed to the programmer. Assembly language programming is declining; the main clients are now compiler back-ends. Even so, simplicity and regularity are still useful:
For compiler writers: restrict choices and make tradeoffs clear.
For implementability: a simple interface leads to a simple implementation, and fewer restrictions make it amenable to different types of implementation.
Generality and flexibility are still useful too: the ISA must last several generations and be useful to a range of different clients.
Evolution of Instruction Sets
Single Accumulator (EDSAC, 1950)
Accumulator + Index Registers (Manchester Mark I, IBM 700 series, 1953)
Separation of Programming Model from Implementation:
  High-Level Language Based (B5000, 1963)
  Concept of a Family (IBM 360, 1964)
General Purpose Register Machines:
  Complex Instruction Sets (VAX, Intel 432, 1977–80)
  Load/Store Architecture (CDC 6600, Cray-1, 1963–76)
RISC (MIPS, SPARC, 88000, IBM RS/6000, …, 1987)
Evolution of Instruction Sets
Major advances in computer architecture are typically associated with landmark instruction set designs (e.g., stack vs. GPR, System/360). Design decisions must take into account: technology, machine organization, programming languages, compiler technology, and operating systems. And design decisions, in turn, influence these.
Design Space of ISA

Five primary dimensions:
Operand storage: where, besides memory?
Number of (explicit) operands: 0, 1, 2, or 3.
Effective address: how is a memory location specified?
Type and size of operands: byte, int, float, vector, …; how is it specified?
Operations: add, sub, mul, …; how are they specified?

Other aspects:
Successor: how is the next instruction specified?
Conditions: how are they determined?
Encodings: fixed or variable? How wide?
Parallelism.
ISA Metrics

Aesthetics:
Orthogonality: no special registers, few special cases, all operand modes available with any data type or instruction type.
Completeness: support for a wide range of operations and target applications.
Regularity: no overloading of the meanings of instruction fields.
Streamlined: resource needs are easily determined.

Ease of compilation (programming?).
Ease of implementation.
Scalability.
Basic ISA Classes

Accumulator:
  1 address:    add A      acc ← acc + mem[A]
  1+x address:  addx A     acc ← acc + mem[A + x]
Stack:
  0 address:    add        tos ← tos + next
General Purpose Register:
  2 address:    add A B    EA(A) ← EA(A) + EA(B)
  3 address:    add A B C  EA(A) ← EA(B) + EA(C)

Conventionally these are viewed as three distinct architectures; actually they form a continuum over N registers. Accumulator and stack machines just place different kinds of restrictions on register usage and load/store patterns.
Basic ISA Classes: code for C = A + B

Stack:                   Push A; Push B; Add; Pop C
Accumulator:             Load A; Add B; Store C
GPR, register-memory:    Load R1,A; Add R1,B; Store C,R1
GPR, register-register:  Load R1,A; Load R2,B; Add R3,R1,R2; Store C,R3
Stack Machines
Instruction set: +, -, *, /, …; push A; pop A.
Example: a*b - (a + c*b):

push a     stack: a
push b     stack: a, b
*          stack: a*b
push a     stack: a*b, a
push c     stack: a*b, a, c
push b     stack: a*b, a, c, b
*          stack: a*b, a, c*b
+          stack: a*b, a+c*b
-          stack: a*b - (a+c*b)
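The same program, run by a toy stack-machine interpreter in Python (a sketch with hypothetical opcode names, not any real ISA):

```python
def run(program, env):
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(env[args[0]])
        else:  # binary op: pop two operands, push the result
            b, a = stack.pop(), stack.pop()
            stack.append({"add": a + b, "sub": a - b,
                          "mult": a * b, "div": a / b}[op])
    return stack.pop()

program = [("push", "a"), ("push", "b"), ("mult",),
           ("push", "a"), ("push", "c"), ("push", "b"), ("mult",),
           ("add",), ("sub",)]

print(run(program, {"a": 2, "b": 3, "c": 4}))   # 2*3 - (2 + 4*3) = -8
```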
Continuum of ISA Classes
All three classes can have an arbitrary number of registers.
Accumulator: each register is special, implicit in the opcode.
Stack: the registers act as a top-of-stack cache.
GPR: registers have no special meaning, so you can keep adding more.
What values are stored in each register?
Accumulator: forced by the instruction.
Stack: mostly the order of evaluation.
GPR: almost no restrictions.
The Case Against Special Purpose Registers
Performance derives from the existence of several fast registers, not from the way they are organized.
Data does not always “surface” when needed (constants, repeated operands, common subexpressions), so TOP and SWAP instructions are required.
Code density is about equal to that of GPR instruction sets: registers have short addresses, and you keep things in registers and reuse them.
Special-purpose registers make it slightly simpler to write a poor compiler, but not an optimizing compiler.