Fall 2011 Prof. Hyesoon Kim. Instructor: Hyesoon Kim (KACB 2344) Homepage –http://www.cc.gatech.edu/~hyesoon/fall11 –T-square.

Slides:



Advertisements
Similar presentations
Slides Prepared from the CI-Tutor Courses at NCSA By S. Masoud Sadjadi School of Computing and Information Sciences Florida.
Advertisements

© 2006 Edward F. Gehringer ECE 463/521 Lecture Notes, Spring 2006 Lecture 1 An Overview of High-Performance Computer Architecture ECE 463/521 Spring 2006.
Advanced Computer Architectures Laboratory on DLX Pipelining Vittorio Zaccaria.
Computer Architecture Lecture 2 Abhinav Agarwal Veeramani V.
CS1104: Computer Organisation School of Computing National University of Singapore.
CS2100 Computer Organisation Performance (AY2014/2015) Semester 2.
Performance Evaluation of Architectures Vittorio Zaccaria.
ECE 4100/6100 Advanced Computer Architecture Lecture 3 Performance Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute.
CS 6290 Evaluation & Metrics. Performance Two common measures –Latency (how long to do X) Also called response time and execution time –Throughput (how.
Chapter 1 CSF 2009 Computer Performance. Defining Performance Which airplane has the best performance? Chapter 1 — Computer Abstractions and Technology.
CSCE 212 Chapter 4: Assessing and Understanding Performance Instructor: Jason D. Bakos.
CIS629 Fall Lecture Performance Overview Execution time is the best measure of performance: simple, intuitive, straightforward. Two important.
Computer ArchitectureFall 2007 © September 17, 2007 Karem Sakallah CS-447– Computer Architecture.
©UCB CS 162 Computer Architecture Lecture 1 Instructor: L.N. Bhuyan
1 Introduction Background: CS 3810 or equivalent, based on Hennessy and Patterson’s Computer Organization and Design Text for CS/EE 6810: Hennessy and.
Appendix A Pipelining: Basic and Intermediate Concepts
1 Introduction Background: CS 3810 or equivalent, based on Hennessy and Patterson’s Computer Organization and Design Text for CS/EE 6810: Hennessy and.
1 Lecture 10: FP, Performance Metrics Today’s topics:  IEEE 754 representations  FP arithmetic  Evaluating a system Reminder: assignment 4 due in a.
CIS429/529 Winter 07 - Performance - 1 Performance Overview Execution time is the best measure of performance: simple, intuitive, straightforward. Two.
1 Chapter 4. 2 Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding underlying organizational motivation.
Lecture 2: Technology Trends and Performance Evaluation Performance definition, benchmark, summarizing performance, Amdahl’s law, and CPI.
Performance & Benchmarking. What Matters? Which airplane has best performance:
1 Computer Performance: Metrics, Measurement, & Evaluation.
Computer Performance Computer Engineering Department.
Lecture 2b: Performance Metrics. Performance Metrics Measurable characteristics of a computer system: Count of an event Duration of a time interval Size.
Memory/Storage Architecture Lab Computer Architecture Performance.
Recap Technology trends Cost/performance Measuring and Reporting Performance What does it mean to say “computer X is faster than computer Y”? E.g. Machine.
C OMPUTER O RGANIZATION AND D ESIGN The Hardware/Software Interface 5 th Edition Chapter 1 Computer Abstractions and Technology Sections 1.5 – 1.11.
CDA 3101 Fall 2013 Introduction to Computer Organization Computer Performance 28 August 2013.
1 Seoul National University Performance. 2 Performance Example Seoul National University Sonata Boeing 727 Speed 100 km/h 1000km/h Seoul to Pusan 10 hours.
CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.
Performance Lecture notes from MKP, H. H. Lee and S. Yalamanchili.
CEN 316 Computer Organization and Design Assessing and Understanding Performance Mansour AL Zuair.
Computer Architecture CPSC 350
CS252/Patterson Lec 1.1 1/17/01 CMPUT429/CMPE382 Winter 2001 Topic2: Technology Trend and Cost/Performance (Adapted from David A. Patterson’s CS252 lecture.
Pipelining and Parallelism Mark Staveley
1  1998 Morgan Kaufmann Publishers How to measure, report, and summarize performance (suorituskyky, tehokkuus)? What factors determine the performance.
CSIE30300 Computer Architecture Unit 04: Basic MIPS Pipelining Hsin-Chou Chi [Adapted from material by and
Performance Performance
TEST 1 – Tuesday March 3 Lectures 1 - 8, Ch 1,2 HW Due Feb 24 –1.4.1 p.60 –1.4.4 p.60 –1.4.6 p.60 –1.5.2 p –1.5.4 p.61 –1.5.5 p.61.
1 Lecture 2: Performance, MIPS ISA Today’s topics:  Performance equations  MIPS instructions Reminder: canvas and class webpage:
September 10 Performance Read 3.1 through 3.4 for Wednesday Only 3 classes before 1 st Exam!
Performance Analysis Topics Measuring performance of systems Reasoning about performance Amdahl’s law Systems I.
Performance 9 ways to fool the public Old Chapter 4 New Chapter 1.4.
CS203 – Advanced Computer Architecture Performance Evaluation.
VU-Advanced Computer Architecture Lecture 1-Introduction 1 Advanced Computer Architecture CS 704 Advanced Computer Architecture Lecture 1.
Performance 9 ways to fool the public #1 – Reporting Results.
June 20, 2001Systems Architecture II1 Systems Architecture II (CS ) Lecture 1: Performance Evaluation and Benchmarking * Jeremy R. Johnson Wed.
Computer Organization CS345 David Monismith Based upon notes by Dr. Bill Siever and from the Patterson and Hennessy Text.
Measuring Performance II and Logic Design
CS203 – Advanced Computer Architecture
COSC6385 Advanced Computer Architecture
CSCI206 - Computer Organization & Programming
Lecture 2: Performance Evaluation
CS161 – Design and Architecture of Computer Systems
Lecture 3: MIPS Instruction Set
September 2 Performance Read 3.1 through 3.4 for Tuesday
ECE 4100/6100 Advanced Computer Architecture Lecture 1 Performance
CS 286 Computer Architecture & Organization
Performance of Single-cycle Design
Morgan Kaufmann Publishers
CSCE 212 Chapter 4: Assessing and Understanding Performance
CSCI206 - Computer Organization & Programming
T Computer Architecture, Autumn 2005
Serial versus Pipelined Execution
Performance of computer systems
Performance of computer systems
Lecture 3: MIPS Instruction Set
Performance of computer systems
Performance Lecture notes from MKP, H. H. Lee and S. Yalamanchili.
Presentation transcript:

Fall 2011 Prof. Hyesoon Kim

Instructor: Hyesoon Kim (KACB 2344) Homepage – –T-square ( Office hours: 3:00-4:30 Tu/Th TA: TBA Group mailing list: mailing list: cs googlegroups.com Textbook: No required text book –Recommended book: Computer Architecture: AQA, 4 th Edition by Hennessy and Patterson –Jean-Loup Baer, Microprocessor Architecture: From Simple Pipelines to Chip Multiprocessors, 1st edition. Papers

A4 A5

Problem Algorithm ISA u-architecture Circuits Electrons ISA: Interface between s/w & h/w

This course requires heavy programming Don’t take too many program heavy courses! It is 3-credit course but you feel a 4- 5 credit course The most ECElike course in CS

can be fun or can be hard or look so easy…

Select target platforms –Identify important applications –Identify design specifications (area, power budget etc.) Design space explorations Develop new mechanisms Evaluate ideas using –High-level simulations –Detailed-level simulations Design is mostly fixed  hardware description languages VLSI Fabrications Testing

Simple performance model Detailed performance model VHDL performance model Circuit/layout design Benchmarks Performance evaluation VerificationFAB

Pipeline depth? # of cores? Cache sizes?, cache configurations? Memory configurations. Coherent, non-coherent? In-order/ out of order How many threads to support? Power requirements? Performance enhancement mechanisms –Instruction fetch: branch predictor, speculative execution –Data fetch : cache, prefetching –Execution : data forwarding

Two common measures –Latency (how long to do X) Also called response time and execution time –Throughput (how often can it do X) Example of car assembly line –Takes 6 hours to make a car (latency is 6 hours per car) –A car leaves every 5 minutes (throughput is 12 cars per hour) –Overlap results in Throughput > 1/Latency

Benchmarks –Real applications and application suites E.g., SPEC CPU2000, SPEC2006, TPC-C, TPC-H, EEMBC, MediaBench, PARSEC, SYSmark –Kernels “Representative” parts of real applications Easier and quicker to set up and run Often not really representative of the entire app –Toy programs, synthetic benchmarks, etc. Not very useful for reporting Sometimes used to test/stress specific functions/features

“Representative” applications keeps growing with time!

Test, train and ref Test: simple checkup Train: profile input, feedback compilation Ref: real measurement. Design to run long enough to use for real system –-> Simulation? Reduced input set Statistical simulation Sampling

Measure transaction-processing throughput Benchmarks for different scenarios –TPC-C: warehouses and sales transactions –TPC-H: ad-hoc decision support –TPC-W: web-based business transactions Difficult to set up and run on a simulator –Requires full OS support, a working DBMS –Long simulations to get stable results

SPLASH: Scientific computing kernels –Who used parallel computers? PARSEC: More desktop oriented benchmarks NPB: NASA parallel computing benchmarks GPGPU benchmark suites –Rodinia, Parboil, SHOC Not many

GFLOPS, TFLOPS MIPS (Million instructions per second)

Machine A with ISA “A”: 10 MIPS Machine B ISA “B”: 5 MIPS which one is faster? Alpha ISA LEA A LD R1 mem[A] Add R1, R1 #1 ST mem[A] R1 X86 ISA INC mem[A] Case 1 Case 2 Add, ADD, NOP ADD, ADD NOP, NOP ADD, NOP

Hardware Technology, Organization Organization, ISA ISA, Compiler Technology A.K.A. The “iron law” of performance

For each kind of instruction How many instructions of this kind are there in the program How many cycles it takes to execute an instruction of this kind

Instruction Type FrequencyCPI Integer40%1.0 Branch20%4.0 Load20%2.0 Store10%3.0 Total Insts = 50B, Clock speed = 2 GHz = (0.4* * * *3.0) * 50 *10^9*1/(2*10^9)

“X is n times faster than Y” “Throughput of X is n times that of Y”

“X is n times faster than Y on A” But what about different applications (or even parts of the same application) –X is 10 times faster than Y on A, and 1.5 times on B, but Y is 2 times faster than X on C, and 3 times on D, and… So does X have better performance than Y? Which would you buy?

Arithmetic mean –Average execution time –Gives more weight to longer-running programs Weighted arithmetic mean –More important programs can be emphasized –But what do we use as weights? –Different weight will make different machines look better

Machine AMachine B Program 15 sec4 sec Program 23 sec6 sec What is the speedup of A compared to B on Program 1? What is the speedup of A compared to B on Program 2? What is the average speedup? What is the speedup of A compared to B on Sum(Program1, Program2) ? 4/5 6/3 (4+6)/(5+3) = (4/5+6/3)/2 = 0.8

Speedup of arithmetic means != arithmetic mean of speedup Use geometric mean: Neat property of the geometric mean: Consistent whatever the reference machine Do not use the arithmetic mean for normalized execution times

Often when making comparisons in comp- arch studies: –Program (or set of) is the same for two CPUs –The clock speed is the same for two CPUs So we can just directly compare CPI’s and often we use IPC’s

Average CPI =(CPI 1 + CPI 2 + … + CPI n )/n A.M. of IPC =(IPC 1 + IPC 2 + … + IPC n )/n Must use Harmonic Mean to remain  to runtime Not Equal to A.M. of CPI!!!

A program is compiled with different compiler options. Can we use IPC to compare performance? A program is run with different cache size machine. Can we use IPC to compare performance?

H.M.(x 1,x 2,x 3,…,x n ) = n … + 1 x 1 x 2 x 3 x n What in the world is this? –Average of inverse relationships

“Average” IPC = 1 A.M.(CPI) = 1 CPI 1 + CPI 2 + CPI 3 + … + CPI n n n n n = n CPI 1 + CPI 2 + CPI 3 + … + CPI n = n … + 1 = H.M.(IPC) IPC 1 IPC 2 IPC 3 IPC n

One solution: use Gmean or show average without mcf and with mcf

Use Sum(base)-Sum(new)/Sum(base) = % AVERAGE(delta) = 9.75%

FEID EX MEM WB add r1, r2, r3 add mul add sub r4, r1, r3subaddsubadd sub mul r5, r2, r3 mulsub add Add: 2 cycles add sub mul LLLLL FE_stage

FEIDEXMEMWB br 0x800 br target 0x800 add r1, r2,r3 0x804 target sub r2,r3,r4 0x900 br 0x804 br 0x804 0x900 PC (latch) add sub 0x904 1 cycle addsub FE_stage

Example: MIPS R4000 IF ID MEM WB integer unit FP/int Multiply FP adder FP/int divider ex m1m2m3m4m5m6m7 a1a2a3a4 Div (lat = 25, Init inv=25)