EECS 470 Lecture 1 Computer Architecture Fall 2015 Slides developed in part by Profs. Brehob, Austin, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson,

Slides:



Advertisements
Similar presentations
Adding the Jump Instruction
Advertisements

ISA Issues; Performance Considerations. Testing / System Verilog: ECE385.
1 ITCS 3181 Logic and Computer Systems B. Wilkinson Slides9.ppt Modification date: March 30, 2015 Processor Design.
CS1104: Computer Organisation School of Computing National University of Singapore.
Instruction-Level Parallelism (ILP)
TU/e Processor Design 5Z032 1 Processor Design 5Z032 The role of Performance Henk Corporaal Eindhoven University of Technology 2009.
1 RISC Pipeline Han Wang CS3410, Spring 2010 Computer Science Cornell University See: P&H Chapter 4.6.
Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University RISC Pipeline See: P&H Chapter 4.6.
1 Lecture 17: Basic Pipelining Today’s topics:  5-stage pipeline  Hazards and instruction scheduling Mid-term exam stats:  Highest: 90, Mean: 58.
Mary Jane Irwin ( ) [Adapted from Computer Organization and Design,
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Sep 5, 2005 Lecture 2.
EECS 470 Pipeline Hazards Lecture 4 Coverage: Appendix A.
Computer Architecture Lecture 3 Coverage: Appendix A
1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.
©UCB CS 162 Computer Architecture Lecture 3: Pipelining Contd. Instructor: L.N. Bhuyan
1 COMP541 Sequencing – III (Sequencing a Computer) Montek Singh April 9, 2007.
Lec 8: Pipelining Kavita Bala CS 3410, Fall 2008 Computer Science Cornell University.
Chapter 4 Assessing and Understanding Performance
Lec 9: Pipelining Kavita Bala CS 3410, Fall 2008 Computer Science Cornell University.
Appendix A Pipelining: Basic and Intermediate Concepts
1 Lecture 10: FP, Performance Metrics Today’s topics:  IEEE 754 representations  FP arithmetic  Evaluating a system Reminder: assignment 4 due in a.
1 Chapter 4. 2 Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding underlying organizational motivation.
Lecture 24: CPU Design Today’s topic –Multi-Cycle ALU –Introduction to Pipelining 1.
CSE378 Pipelining1 Pipelining Basic concept of assembly line –Split a job A into n sequential subjobs (A 1,A 2,…,A n ) with each A i taking approximately.
EECS 470 Further review: Pipeline Hazards and More Lecture 2 – Winter 2014 Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
1 Instant replay  The semester was split into roughly four parts. —The 1st quarter covered instruction set architectures—the connection between software.
Instruction Sets and Pipelining Cover basics of instruction set types and fundamental ideas of pipelining Later in the course we will go into more depth.
1 Computer Performance: Metrics, Measurement, & Evaluation.
1 CS/EE 362 Hardware Fundamentals Lecture 9 (Chapter 2: Hennessy and Patterson) Winter Quarter 1998 Chris Myers.
MIPS Pipeline Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University See P&H Chapter 4.6.
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.
1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.
Sample Code (Simple) Run the following code on a pipelined datapath: add1 2 3 ; reg 3 = reg 1 + reg 2 nand ; reg 6 = reg 4 & reg 5 lw ; reg.
Electrical and Computer Engineering University of Cyprus LAB3: IMPROVING MIPS PERFORMANCE WITH PIPELINING.
1 Pipelining Part I CS What is Pipelining? Like an Automobile Assembly Line for Instructions –Each step does a little job of processing the instruction.
EECE 476: Computer Architecture Slide Set #5: Implementing Pipelining Tor Aamodt Slide background: Die photo of the MIPS R2000 (first commercial MIPS microprocessor)
CDA 5155 Computer Architecture Week 1.5. Start with the materials: Conductors and Insulators Conductor: a material that permits electrical current to.
Performance Performance
1 Lecture 2: Performance, MIPS ISA Today’s topics:  Performance equations  MIPS instructions Reminder: canvas and class webpage:
September 10 Performance Read 3.1 through 3.4 for Wednesday Only 3 classes before 1 st Exam!
Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University MIPS Pipeline See P&H Chapter 4.6.
CDA 5155 Week 3 Branch Prediction Superscalar Execution.
ECE/CS 552: Performance and Cost © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and.
1 Lecture 3: Pipelining Basics Today: chapter 1 wrap-up, basic pipelining implementation (Sections C.1 - C.4) Reminders:  Sign up for the class mailing.
Pipelining: Implementation CPSC 252 Computer Organization Ellen Walker, Hiram College.
CS161 – Design and Architecture of Computer Systems
Computer Organization
Variable Word Width Computation for Low Power
Morgan Kaufmann Publishers
15-740/ Computer Architecture Lecture 7: Pipelining
Morgan Kaufmann Publishers The Processor
Chapter 4 The Processor Part 2
Lecture 5: Pipelining Basics
Serial versus Pipelined Execution
Pipelining in more detail
Pipelining Basic concept of assembly line
The Processor Lecture 3.6: Control Hazards
The Processor Lecture 3.4: Pipelining Datapath and Control
Control unit extension for data hazards
Instruction Execution Cycle
Designing a Pipelined CPU
Pipelining Basic concept of assembly line
Control unit extension for data hazards
Pipelining Basic concept of assembly line
Introduction to Computer Organization and Architecture
Control unit extension for data hazards
Guest Lecturer: Justin Hsia
MIPS Pipeline Hakim Weatherspoon CS 3410, Spring 2013 Computer Science
Pipelined datapath and control
Presentation transcript:

EECS 470 Lecture 1 Computer Architecture Fall 2015 Slides developed in part by Profs. Brehob, Austin, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch

What Is Computer Architecture? “The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e., the conceptual structure and functional behavior as distinct from the organization of the dataflow and controls, the logic design, and the physical implementation.” Gene Amdahl, IBM Journal of R&D, April 1964

Architecture as used here… We use a wider definition –The stuff seen by the programmer Instructions, registers, programming model, etc. –Micro-architecture How the architecture is implemented.

Teaching Staff Instructor Dr. Mark Brehob, brehob –O–Office hours: Monday 10:00am-noon – 4632 Beyster (470 and 473) Tuesday 1:30pm-3:00pm – 4632 Beyster (470 has priority) Thursday 3:45pm-5:00pm Beyster (470 and 473) Friday 10:30am-noon – 2334 EECS (473 has priority) –I–I’ll also be around after class for minutes most days. GSIs/IA Akshay Moorthy, moorthya Steve Zekany, szekany Office hours: –Monday: 4: (Steve) in 1695 BBB –Tuesday: 10:30am - 12 (Steve) in 1695 BBB –Wednesday 4: (Steve) in 1695 BBB –Thursday: 4: (Akshay) in 1620 BBB –Friday: 2:30 - 3:30 (Steve and Akshay), both in 1620 BBB –Saturday: 4 - 5:30 (Akshay) in 1620 BBB –Sunday: 4: (Akshay) in 1620 BBB These times and locations may change slightly during the first week of class, especially if the rooms are busy or reserved for other classes. Intro

Lecture Today Class intro (30 minutes) –Class goals –Your grade –Work expected Start review –Fundamental concepts –Pipelines –Hazards and Dependencies (time allowing) –Performance measurement (as reference only) Outline

Class goals Provide a high-level understanding of many of the relevant issues in modern computer architecture –Dynamic Out-of-order processing –Static (complier based) out-of-order processing –Memory hierarchy issues and improvements –Multi-processor issues –Power and reliability issues Provide a low-level understanding of the most important parts of modern computer architecture. –Details about all the functions a given component will need to perform –How difficult certain tasks, such as caching, really are in hardware Class goals

Communication Website: Piazza –You should be able to get on via ctools. We won’t be using ctools otherwise for the most part. Lab –Attend your assigned section—very full (may need to double up some?) –You need to go!

Course outline – near term Class goals

Grading Grade weights are: –Midterm – 24% –Final Exam – 24% –Homework/Quiz – 9% 5 homeworks, 1 quiz, all equal. Drop the lowest grade. –Verilog Programming assignments – 7% 3 assignments worth 2%, 2%, and 3% –In lab assignments – 1% –Project – 35% Open-ended (group) processor design in Verilog Grades can vary quite a bit, so is a distinguisher. Your grade

Hours (and hours) of work Attend class & discussion –You really need to go to lab! –5 hours/week Read book & handouts – 2-3 hours/week Do homework –5 assignments at ~4 hours each/15 weeks – 1-2 hours/week 3 Programming assignments –6, 6, 20 hours each = 32 hours/15 weeks –~2 hours/week Verilog Project –~200 hours/15 weeks = 13 hours/week General studying – ~ 1 hour/week Total: –~25 hours/week, some students claim more... Work expected

“Key” concepts

Amdahl’s Law Speedup= time without enhancement / time with enhancement Suppose an enhancement speeds up a fraction f of a task by a factor of S time new = time orig ·( (1-f) + f/S ) S overall = 1 / ( (1-f) + f/S ) 1 time orig f(1 - f) time orig (1 - f) time new f/S f(1 - f) time orig (1 - f) time new f/S

Parallelism: Work and Critical Path Parallelism - the amount of independent sub-tasks available Work=T 1 - time to complete a computation on a sequential system Critical Path=T  - time to complete the same computation on an infinitely- parallel system Average Parallelism P avg = T 1 / T  For a p wide system T p  max{ T 1 /p, T  } P avg >>p  T p  T 1 /p x = a + b; y = b * 2 z =(x-y) * (x+y)

Locality Principle One’s recent past is a good indication of near future. –Temporal Locality: If you looked something up, it is very likely that you will look it up again soon –Spatial Locality: If you looked something up, it is very likely you will look up something nearby next Locality == Patterns == Predictability Converse: Anti-locality : If you haven’t done something for a very long time, it is very likely you won’t do it in the near future either

Memoization Dual of temporal locality but for computation If something is expensive to compute, you might want to remember the answer for a while, just in case you will need the same answer again Why does memoization work?? Examples –Trace caches

Amortization overhead cost : one-time cost to set something up per-unit cost : cost for per unit of operation total cost = overhead + per-unit cost x N It is often okay to have a high overhead cost if the cost can be distributed over a large number of units  lower the average cost average cost = total cost / N = ( overhead / N ) + per-unit cost

Trends in computer architecture

A Paradigm Shift In Computing Era of High Performance Computing Era of Energy-Efficient Computing c Limits on heat extraction Limits on energy-efficiency of operations Stagnates performance growth IEEE Computer—April 2001 T. Mudge

The Memory Wall Today: 1 mem access  500 arithmetic ops Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 4 th ed. Processor Memory

Reliability Wall Transient faults –E.g, high-energy particle strikes Manufacturing faults –E.g., broken connections Wearout faults –E.g., Electromigration Device variability (not all transistors created equal) burn-in transients viainterconnect

Review of basic pipelining

Review of basic pipelining 5 stage “RISC” load-store architecture –About as simple as things get 1.Instruction fetch: get instruction from memory/cache 2.Instruction decode: translate opcode into control signals and read regs 3.Execute: perform ALU operation 4.Memory: Access memory if load/store 5.Writeback/retire: update register file Pipeline review

Pipelined implementation Break the execution of the instruction into cycles (5 in this case). Design a separate datapath stage for the execution performed during each cycle. Build pipeline registers to communicate between the stages.

Stage 1: Fetch Design a datapath that can fetch an instruction from memory every cycle. –Use PC to index memory to read instruction –Increment the PC (assume no branches for now) Write everything needed to complete execution to the pipeline register (IF/ID) –The next stage will read this pipeline register. –Note that pipeline register must be edge triggered

Instruction bits IF / ID Pipeline register PC Instruction Memory/ Cache en 1 + MUXMUX Rest of pipelined datapath PC + 1

Stage 2: Decode Design a datapath that reads the IF/ID pipeline register, decodes instruction and reads register file (specified by regA and regB of instruction bits). –Decode can be easy, just pass on the opcode and let later stages figure out their own control signals for the instruction. Write everything needed to complete execution to the pipeline register (ID/EX) –Pass on the offset field and both destination register specifiers (or simply pass on the whole instruction!). –Including PC+1 even though decode didn’t use it.

Destreg Data ID / EX Pipeline register Contents Of regA Contents Of regB Register File regA regB en Rest of pipelined datapath Instruction bits IF / ID Pipeline register PC + 1 Control Signals Stage 1: Fetch datapath

Stage 3: Execute Design a datapath that performs the proper ALU operation for the instruction specified and the values present in the ID/EX pipeline register. –The inputs are the contents of regA and either the contents of RegB or the offset field on the instruction. –Also, calculate PC+1+offset in case this is a branch. Write everything needed to complete execution to the pipeline register (EX/Mem) –ALU result, contents of regB and PC+1+offset –Instruction bits for opcode and destReg specifiers

ID / EX Pipeline register Contents Of regA Contents Of regB Rest of pipelined datapath Alu Result EX/Mem Pipeline register PC + 1 Control Signals Stage 2: Decode datapath Control Signals PC+1 +offset + contents of regB ALUALU MUXMUX

Stage 4: Memory Operation Design a datapath that performs the proper memory operation for the instruction specified and the values present in the EX/Mem pipeline register. –ALU result contains address for ld and st instructions. –Opcode bits control memory R/W and enable signals. Write everything needed to complete execution to the pipeline register (Mem/WB) –ALU result and MemData –Instruction bits for opcode and destReg specifiers

Alu Result Mem/WB Pipeline register Rest of pipelined datapath Alu Result EX/Mem Pipeline register Stage 3: Execute datapath Control Signals PC+1 +offset contents of regB This goes back to the MUX before the PC in stage 1. Memory Read Data Data Memory en R/W Control Signals MUX control for PC input

Stage 5: Write back Design a datapath that conpletes the execution of this instruction, writing to the register file if required. –Write MemData to destReg for ld instruction –Write ALU result to destReg for add or nand instructions. –Opcode bits also control register write enable signal.

Alu Result Mem/WB Pipeline register Stage 4: Memory datapath Control Signals Memory Read Data MUXMUX This goes back to data input of register file This goes back to the destination register specifier MUXMUX bits 0-2 bits register write enable

What can go wrong? Data hazards: since register reads occur in stage 2 and register writes occur in stage 5 it is possible to read the wrong value if is about to be written. Control hazards: A branch instruction may change the PC, but not until stage 4. What do we fetch before that? Exceptions: How do you handle exceptions in a pipelined processor with 5 instructions in flight?

PC Inst mem Register file MUXMUX ALUALU MUXMUX 1 Data memory ++ MUXMUX IF/ ID ID/ EX EX/ Mem Mem/ WB MUXMUX Bits 0-2 Bits op dest offset valB valA PC+1 target ALU result op dest valB op dest ALU result mdata eq? instruction 0 R2 R3 R4 R5 R1 R6 R0 R7 regA regB Bits data dest Fetch DecodeExecute Memory WB Basic Pipelining

PC Inst mem Register file MUXMUX ALUALU MUXMUX 1 Data memory ++ MUXMUX IF/ ID ID/ EX EX/ Mem Mem/ WB MUXMUX op dest offset valB valA PC+1 target ALU result op dest valB op dest ALU result mdata eq? instruction 0 R2 R3 R4 R5 R1 R6 R0 R7 regA regB data dest Fetch DecodeExecute Memory WB Basic Pipelining

PC Inst mem Register file MUXMUX ALUALU MUXMUX 1 Data memory ++ MUXMUX IF/ ID ID/ EX EX/ Mem Mem/ WB MUXMUX op offset valB valA PC+1 target ALU result op valB op ALU result mdata eq? instruction 0 R2 R3 R4 R5 R1 R6 R0 R7 regA regB fwd data Fetch DecodeExecute Memory WB Basic Pipelining

Lecture 1 Slide 38 EECS 470 Measuring performance This will not be covered in class but you need to read it.

Lecture 1 Slide 39 EECS 470 Metrics of Performance Time (latency) r elapsed time vs. processor time Rate (bandwidth or throughput) r performance = rate = work per time Distinction is sometimes blurred r consider batched vs. interactive processing r consider overlapped vs. non-overlapped processing r may require conflicting optimizations What is the most popular metric of performance?

Lecture 1 Slide 40 EECS 470 The “Iron Law” of Processor Performance

Lecture 1 Slide 41 EECS 470 MIPS - Millions of Instructions per Second MIPS = When comparing two machines (A, B) with the same instruction set, MIPS is a fair comparison ( sometimes… ) But, MIPS can be a “ meaningless indicator of performance… ” instruction sets are not equivalent different programs use a different instruction mix instruction count is not a reliable indicator of work r some optimizations add instructions r instructions have varying work MIPS # of instructions benchmark total run time 1,000,000 X

Lecture 1 Slide 42 EECS 470 MIPS (Cont.) Example: r Machine A has a special instruction for performing square root m It takes 100 cycles to execute r Machine B doesn’t have the special instruction m must perform square root in software using simple instructions m e.g, Add, Mult, Shift each take 1 cycle to execute r Machine A: 1/100 MIPS = 0.01 MIPS r Machine B: 1 MIPS

Lecture 1 Slide 43 EECS 470 MFLOPS MFLOPS = (FP ops/program) x (program/time) x Popular in scientific computing r There was a time when FP ops were much much slower than regular instructions (i.e., off-chip, sequential execution) Not great for “predicting” performance because it r ignores other instructions (e.g., load/store) r not all FP ops are created equally r depends on how FP-intensive program is Beware of “peak” MFLOPS!

Lecture 1 Slide 44 EECS 470 Comparing Performance Often, we want to compare the performance of different machines or different programs. Why? r help architects understand which is “better” r give marketing a “silver bullet” for the press release r help customers understand why they should buy

Lecture 1 Slide 45 EECS 470 Performance vs. Execution Time Often, we use the phrase “ X is faster than Y ” r Means the response time or execution time is lower on X than on Y r Mathematically, “X is N times faster than Y” means Execution Time Y = N Execution Time X Execution Time Y = N = 1/Performance Y = Performance X Execution Time X 1/Performance X Performance Y Performance and Execution time are reciprocals r Increasing performance decreases execution time

Lecture 1 Slide 46 EECS 470 By Definition Machine A is n times faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = n Machine A is x% faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = 1 + x/100 E.g., A 10s, B 15s 15/10 = 1.5  A is 1.5 times faster than B 15/10 = /100  A is 50% faster than B Remember to compare “performance”

Lecture 1 Slide 47 EECS 470 Let’s Try a Simpler Example Two machines timed on two benchmarks Machine AMachine B Program 1 2 seconds 4 seconds Program 2 12 seconds8 seconds r How much faster is Machine A than Machine B? r Attempt 1: ratio of run times, normalized to Machine A times program1: 4/2 program2 : 8/12 r Machine A ran 2 times faster on program 1, 2/3 times faster on program 2 r On average, Machine A is (2 + 2/3) /2 = 4/3 times faster than Machine B It turns this “averaging” stuff can fool us; watch…

Lecture 1 Slide 48 EECS 470 Example (con’t) Two machines timed on two benchmarks Machine AMachine B Program 1 2 seconds 4 seconds Program 2 12 seconds8 seconds r How much faster is Machine A than B? r Attempt 2: ratio of run times, normalized to Machine B times program 1: 2/4 program 2 : 12/8 r Machine A ran program 1 in 1/2 the time and program 2 in 3/2 the time r On average, (1/2 + 3/2) / 2 = 1 r Put another way, Machine A is 1.0 times faster than Machine B

Lecture 1 Slide 49 EECS 470 Example (con’t) Two machines timed on two benchmarks Machine AMachine B Program 1 2 seconds 4 seconds Program 2 12 seconds8 seconds r How much faster is Machine A than B? r Attempt 3: ratio of run times, aggregate (total sum) times, norm. to A m Machine A took 14 seconds for both programs m Machine B took 12 seconds for both programs m Therefore, Machine A takes 14/12 of the time of Machine B m Put another way, Machine A is 6/7 faster than Machine B

Lecture 1 Slide 50 EECS 470 Which is Right? Question: r How can we get three different answers ? Solution r Because, while they are all reasonable calculations… …each answers a different question We need to be more precise in understanding and posing these performance & metric questions

Lecture 1 Slide 51 EECS 470 Arithmetic and Harmonic Mean Average of the execution time that tracks total execution time is the arithmetic mean If performance is expressed as a rate, then the average that tracks total execution time is the harmonic mean This is the defn for “average” you are most familiar with This is a different defn for “average” you are prob. less familiar with

Lecture 1 Slide 52 EECS 470 Quiz E.g., 30 mph for first 10 miles, 90 mph for next 10 miles What is the average speed? Average speed = ( )/2 = 60 mph (Wrong!) The correct answer: Average speed = total distance / total time = 20 / (10/ /90) = 45 mph For rates use Harmonic Mean!

Lecture 1 Slide 53 EECS 470 Problems with Arithmetic Mean Applications do not have the same probability of being run Longer programs weigh more heavily in the average For example, two machines timed on two benchmarks Machine AMachine B Program 1 2 seconds (20%)4 seconds (20%) Program 212 seconds (80%)8 seconds (80%) m If we do arithmetic mean, Program 2 “counts more” than Program 1 q an improvement in Program 2 changes the average more than a proportional improvement in Program 1 m But perhaps Program 2 is 4 times more likely to run than Program 1

Lecture 1 Slide 54 EECS 470 Weighted Execution Time Often, one runs some programs more often than others. Therefore, we should weight the more frequently used programs’ execution time Weighted Harmonic Mean

Lecture 1 Slide 55 EECS 470 Using a Weighted Sum (or weighted average) Machine AMachine B Program 1 2 seconds (20%)4 seconds (20%) Program 212 seconds (80%)8 seconds (80%) Total10 seconds7.2 seconds m Allows us to determine relative performance 10/7.2 = > Machine B is 1.38 times faster than Machine A

Lecture 1 Slide 56 EECS 470 Summary Performance is important to measure r For architects comparing different deep mechanisms r For developers of software trying to optimize code, applications r For users, trying to decide which machine to use, or to buy Performance metrics are subtle r Easy to mess up the “machine A is XXX times faster than machine B” numerical performance comparison r You need to know exactly what you are measuring: time, rate, throughput, CPI, cycles, etc r You need to know how combining these to give aggregate numbers does different kinds of “distortions” to the individual numbers r No metric is perfect, so lots of emphasis on standard benchmarks today