Lecture 7: Pipelining
Notes
Homework #2 due today; Homework #3 delivered this evening
Pre-Lab #2: due next Tuesday (3 paper readings, short summaries)
Lab #1: handed out tonight; due next Tuesday; done in groups of 4; C code to design an interconnection network
Where we are going:
Week #4: Pipelining (Chapter 6) (Oct. 16, 18)
Week #5: Cache (Chapter 7) (Oct. 23, 25)
Week #6: Midterm (Oct. 30 (review) or Nov. 1)
4 weeks to do the project
Pipelining Improve performance by increasing instruction throughput
Ideal speedup is number of stages in the pipeline. Do we achieve this? Note: timing assumptions changed for this example
Multicycle Approach Break up the instructions into steps, each step takes a cycle balance the amount of work to be done restrict each cycle to use only one major functional unit At the end of a cycle store values for use in later cycles (easiest thing to do) introduce additional “internal” registers
Idea behind multicycle approach
We define each instruction from the ISA perspective (do this!)
Break it down into steps following our rule that data flows through at most one major functional unit (i.e., balance work across steps)
Introduce new registers as needed (e.g., A, B, ALUOut, MDR, etc.)
Finally, try to pack as much work into each step as possible (avoid unnecessary cycles) while also trying to share steps where possible (minimizes control, helps to simplify the solution)
Result: our book's multicycle implementation!
Five Execution Steps
1. Instruction Fetch
2. Instruction Decode and Register Fetch
3. Execution, Memory Address Computation, or Branch Completion
4. Memory Access or R-type Instruction Completion
5. Write-back
INSTRUCTIONS TAKE FROM 3 TO 5 CYCLES!
Step 1: Instruction Fetch
Use PC to get the instruction and put it in the Instruction Register. Increment the PC by 4 and put the result back in the PC. Can be described succinctly using RTL ("Register-Transfer Language"):
IR <= Memory[PC];
PC <= PC + 4;
Can we figure out the values of the control signals? What is the advantage of updating the PC now?
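To make the RTL concrete, here is a minimal C sketch of this step as it might appear in a software model of the multicycle datapath (the struct and field names are illustrative, not from the book):

    #include <stdint.h>

    /* Illustrative multicycle machine state (hypothetical names). */
    struct cpu {
        uint32_t pc;         /* program counter */
        uint32_t ir;         /* instruction register */
        uint32_t mem[1024];  /* toy word-indexed memory */
    };

    /* Step 1: IR <= Memory[PC]; PC <= PC + 4; */
    static void fetch(struct cpu *s) {
        s->ir = s->mem[s->pc / 4];  /* byte address to word index */
        s->pc += 4;                 /* update the PC now: the ALU is otherwise
                                       idle, and the old PC is no longer needed */
    }

That timing also answers the slide's question: updating the PC during fetch keeps the ALU busy in a cycle where nothing else needs it.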
Step 2: Instruction Decode and Register Fetch
Read registers rs and rt in case we need them. Compute the branch address in case the instruction is a branch. RTL:
A <= Reg[IR[25:21]];
B <= Reg[IR[20:16]];
ALUOut <= PC + (sign-extend(IR[15:0]) << 2);
We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic).
Step 3 (instruction dependent)
ALU is performing one of three functions, based on instruction type:
Memory reference: ALUOut <= A + sign-extend(IR[15:0]);
R-type: ALUOut <= A op B;
Branch: if (A == B) PC <= ALUOut;
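A sketch of step 3 in the same C model, including the 16-bit sign extension that the RTL's sign-extend denotes (the instruction-class names are illustrative; the class itself would be decoded in step 2):

    #include <stdint.h>

    /* sign-extend(IR[15:0]): reinterpret the low 16 bits as signed. */
    static int32_t sign_extend16(uint32_t ir) {
        return (int32_t)(int16_t)(ir & 0xFFFF);
    }

    enum iclass { MEM_REF, R_TYPE, BRANCH };

    static void step3(enum iclass c, uint32_t a, uint32_t b, uint32_t ir,
                      uint32_t *alu_out, uint32_t *pc) {
        switch (c) {
        case MEM_REF: *alu_out = a + (uint32_t)sign_extend16(ir); break;
        case R_TYPE:  *alu_out = a + b; break;  /* "+" stands in for "op" */
        case BRANCH:  if (a == b) *pc = *alu_out; break;  /* ALUOut holds the
                                                             target from step 2 */
        }
    }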
Step 4 (R-type or memory-access)
Loads and stores access memory:
MDR <= Memory[ALUOut]; or Memory[ALUOut] <= B;
R-type instructions finish:
Reg[IR[15:11]] <= ALUOut;
The write actually takes place at the end of the cycle, on the clock edge.
Write-back step Reg[IR[20:16]] <= MDR;
Which instruction needs this?
Simple Questions
How many cycles will it take to execute this code?
lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, Label    # assume branch not taken
add $t5, $t2, $t3
sw $t5, 8($t3)
Label: ...
What is going on during the 8th cycle of execution?
In what cycle does the actual addition of $t2 and $t3 take place?
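One worked answer, assuming the step counts above (lw takes all 5 steps; sw and R-type take 4; beq takes 3): 5 + 5 + 3 + 4 + 4 = 21 cycles. The second lw occupies cycles 6 through 10, so the 8th cycle is its step 3, the memory address computation. The add occupies cycles 14 through 17, so the actual addition of $t2 and $t3 happens in cycle 16, its execution step.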
Review: finite state machines
A set of states, a next-state function (determined by the current state and the input), and an output function (determined by the current state and possibly the input).
We'll use a Moore machine (output based only on the current state).
Implementing the Control
Value of control signals is dependent upon: what instruction is being executed, and which step is being performed
Use the information we've accumulated to specify a finite state machine: specify the finite state machine graphically, or use microprogramming
Implementation can be derived from the specification
Graphical Specification of FSM
Note: a signal is a don't-care if not mentioned, asserted if listed by name only, and otherwise set to the exact value shown.
How many state bits will we need?
Finite State Machine for Control
Implementation:
PLA Implementation
If I picked a horizontal or vertical line, could you explain it?
ROM Implementation ROM = "Read Only Memory"
values of memory locations are fixed ahead of time
A ROM can be used to implement a truth table: if the address is m bits, we can address 2^m entries in the ROM, and our outputs are the n bits of data that the address points to (2^m is the "height", and n is the "width")
ROM Implementation
How many inputs are there? 6 bits for opcode + 4 bits for state = 10 address lines (i.e., 2^10 = 1024 different addresses)
How many outputs are there? 16 datapath-control outputs + 4 state bits = 20 outputs
ROM is 2^10 × 20 = 20K bits (and a rather unusual size)
Rather wasteful, since for lots of the entries the outputs are the same; i.e., the opcode is often ignored
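A sketch of how such a control ROM would be addressed, with the opcode and state bits concatenated into the 10-bit address (the table contents and names are illustrative):

    #include <stdint.h>

    #define ROM_ENTRIES (1u << 10)  /* 6 opcode bits + 4 state bits */

    /* Each entry packs 16 datapath-control bits plus 4 next-state bits. */
    static uint32_t control_rom[ROM_ENTRIES];  /* filled in at design time */

    static uint32_t control_lookup(uint32_t opcode, uint32_t state) {
        uint32_t addr = ((opcode & 0x3Fu) << 4) | (state & 0xFu);
        return control_rom[addr];  /* low 20 bits are the outputs */
    }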
Another Implementation Style
Complex instructions: the "next state" is often current state + 1
Details
Microprogramming What are the “microinstructions” ?
Microprogramming A specification methodology
appropriate if there are hundreds of opcodes, modes, cycles, etc.
signals specified symbolically using microinstructions
Will two implementations of the same architecture have the same microcode?
What would a microassembler do?
Microinstruction format
Maximally vs. Minimally Encoded
No encoding: 1 bit for each datapath operation; faster, but requires more memory (logic); used for the Vax 780: an astonishing 400K of memory!
Lots of encoding: send the microinstructions through logic to get the control signals; uses less memory, but slower
Historical context of CISC: too much logic to put on a single chip with everything else; use a ROM (or even RAM) to hold the microcode; it's easy to add new instructions
Microcode: Trade-offs
Distinction between specification and implementation is sometimes blurred
Specification advantages: easy to design and write; design the architecture and microcode in parallel
Implementation (off-chip ROM) advantages: easy to change since values are in memory; can emulate other architectures; can make use of internal registers
Implementation disadvantages: SLOWER, now that control is implemented on the same chip as the processor, ROM is no longer faster than RAM, and there is no need to go back and make changes
Historical Perspective
In the '60s and '70s microprogramming was very important for implementing machines; this led to more sophisticated ISAs and the VAX
In the '80s RISC processors based on pipelining became popular; pipelining the microinstructions is also possible!
Implementations of IA-32 architecture processors since the 486 use:
"hardwired control" for simpler instructions (few cycles, FSM control implemented using a PLA or random logic)
"microcoded control" for more complex instructions (large numbers of cycles, central control store)
The IA-64 architecture uses a RISC-style ISA and can be implemented without a large central control store
Pentium 4
Pipelining is important (the last IA-32 design without it shipped in 1985)
Pipelining is used for the simple instructions favored by compilers
"Simply put, a high performance implementation needs to ensure that the simple instructions execute quickly, and that the burden of the complexities of the instruction set penalize the complex, less frequently used, instructions"
Pentium 4
Somewhere in all that control we must handle complex instructions
Processor executes simple microinstructions, 70 bits wide (hardwired)
120 control lines for the integer datapath (400 for floating point)
If an instruction requires more than 4 microinstructions to implement, control comes from the microcode ROM (8000 microinstructions)
It's complicated!
Chapter 5 Summary
If we understand the instructions, we can build a simple processor!
If instructions take different amounts of time, multi-cycle is better
Datapath implemented using: combinational logic for arithmetic; state-holding elements to remember bits
Control implemented using: combinational logic for the single-cycle implementation; a finite state machine for the multi-cycle implementation
DONE
Chapter Six -- Pipelining
Pipelining Improve performance by increasing instruction throughput
Ideal speedup is number of stages in the pipeline. Do we achieve this? Note: timing assumptions changed for this example
Pipelining
What makes it easy?
all instructions are the same length
just a few instruction formats
memory operands appear only in loads and stores
What makes it hard?
structural hazards: suppose we had only one memory
control hazards: need to worry about branch instructions
data hazards: an instruction depends on a previous instruction
We'll build a simple pipeline and look at these issues
We'll talk about modern processors and what really makes it hard: exception handling, and trying to improve performance with out-of-order execution, etc.
Basic Idea What do we need to add to actually split the datapath into stages?
Pipelined Datapath Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?
Corrected Datapath
Graphically Representing Pipelines
Can help with answering questions like: how many cycles does it take to execute this code? what is the ALU doing during cycle 4? use this representation to help understand datapaths
Pipeline Control
Pipeline control
We have 5 stages. What needs to be controlled in each stage?
Instruction Fetch and PC Increment
Instruction Decode / Register Fetch
Execution
Memory Stage
Write Back
How would control be handled in an automobile plant? A fancy control center telling everyone what to do? Should we use a finite state machine?
Pipeline Control Pass control signals along just like the data
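The idea can be pictured as a C struct for the ID/EX pipeline register: control bits for later stages simply ride along with the data and are peeled off one stage at a time. The grouping below follows the book's signal names; the struct layout itself is illustrative:

    /* Control bits grouped by the stage that consumes them. */
    struct ex_ctrl  { unsigned reg_dst:1, alu_op:2, alu_src:1; };
    struct mem_ctrl { unsigned branch:1, mem_read:1, mem_write:1; };
    struct wb_ctrl  { unsigned reg_write:1, mem_to_reg:1; };

    struct id_ex_reg {           /* pipeline register between ID and EX */
        struct ex_ctrl  ex;      /* used in EX, then dropped */
        struct mem_ctrl mem;     /* forwarded to the EX/MEM register */
        struct wb_ctrl  wb;      /* forwarded again, consumed in WB */
        /* ...data fields: A, B, sign-extended immediate, register numbers... */
    };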
Datapath with Control
Dependencies Problem with starting next instruction before first is finished dependencies that “go backward in time” are data hazards
Software Solution Have compiler guarantee no hazards
Where do we insert the "nops"? (one answer follows the code)
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Problem: this really slows us down!
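A hedged answer: $2 is written by the sub and read by the next two instructions, so with a register file that writes in the first half of a cycle and reads in the second, two nops between the sub and the and are enough; the add and sw are then far enough away to read the updated $2.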
Forwarding Use temporary results, don’t wait for them to be written
register file forwarding to handle read/write to the same register
ALU forwarding
what if this $2 was $13?
Forwarding The main idea (some details not shown)
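The book's EX- and MEM-hazard conditions can be sketched in C (the register and field names mirror the pipeline registers; ForwardA is shown, and ForwardB is symmetric with rt):

    /* Forward into ALU input A from EX/MEM (the newest result wins). */
    int forward_a_from_ex_mem(int exmem_reg_write, int exmem_rd, int idex_rs) {
        return exmem_reg_write && exmem_rd != 0 && exmem_rd == idex_rs;
    }

    /* Forward from MEM/WB only if EX/MEM is not already forwarding. */
    int forward_a_from_mem_wb(int memwb_reg_write, int memwb_rd,
                              int exmem_reg_write, int exmem_rd, int idex_rs) {
        return memwb_reg_write && memwb_rd != 0 && memwb_rd == idex_rs &&
               !(exmem_reg_write && exmem_rd != 0 && exmem_rd == idex_rs);
    }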
Can't always forward
Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register. Thus, we need a hazard detection unit to "stall" the instruction that follows the load.
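The load-use check itself is one line; this C sketch follows the book's condition (names mirror the pipeline registers):

    /* Stall if the instruction in EX is a load whose destination (rt)
       matches a source register of the instruction being decoded. */
    int must_stall(int idex_mem_read, int idex_rt, int ifid_rs, int ifid_rt) {
        return idex_mem_read && (idex_rt == ifid_rs || idex_rt == ifid_rt);
    }

On a stall, the PC and the IF/ID register are held, and the control signals entering ID/EX are zeroed so a bubble flows down the pipeline.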
Stalling We can stall the pipeline by keeping an instruction in the same stage
Hazard Detection Unit Stall by letting an instruction that won’t write anything go forward
Branch Hazards When we decide to branch, other instructions are in the pipeline! We are predicting “branch not taken” need to add hardware for flushing instructions if we are wrong
Flushing Instructions
Note: we’ve also moved branch decision to ID stage
Branches
If the branch is taken, we have a penalty of one cycle
For our simple design, this is reasonable
With deeper pipelines, the penalty increases and static branch prediction drastically hurts performance
Solution: dynamic branch prediction, e.g., a 2-bit prediction scheme (sketched below)
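A minimal sketch of one 2-bit saturating counter; real predictors keep a table of these, indexed by branch address (this is illustrative, not from the slides):

    /* States 0,1 predict not taken; 2,3 predict taken.
       Two consecutive mispredictions are needed to flip the prediction. */
    typedef unsigned char counter2;   /* holds 0..3 */

    int predict_taken(counter2 c) { return c >= 2; }

    counter2 train(counter2 c, int taken) {
        if (taken  && c < 3) c++;     /* strengthen "taken" */
        if (!taken && c > 0) c--;     /* strengthen "not taken" */
        return c;
    }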
Branch Prediction Sophisticated Techniques:
A "branch target buffer" to help us look up the destination
Correlating predictors that base prediction on global behavior and recently executed branches (e.g., the prediction for a specific branch instruction is based on what happened in previous branches)
Tournament predictors that use different types of prediction strategies and keep track of which one is performing best
A "branch delay slot" which the compiler tries to fill with a useful instruction (make the one-cycle delay part of the ISA)
Branch prediction is especially important because it enables other, more advanced pipelining techniques to be effective!
Modern processors predict correctly 95% of the time!
Improving Performance
Try and avoid stalls! E.g., reorder these instructions (one answer follows the list):
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
Dynamic Pipeline Scheduling
Hardware chooses which instructions to execute next
Will execute instructions out of order (e.g., doesn't wait for a dependency to be resolved, but rather keeps going!)
Speculates on branches and keeps the pipeline full (may need to roll back if a prediction is incorrect)
Trying to exploit instruction-level parallelism
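One reordering that works: lw $t0; lw $t2; sw $t0, 4($t1); sw $t2, 0($t1). The sw that needs $t2 no longer immediately follows the lw that produces it, so the load-use stall disappears, and since both loads still precede both stores the result is unchanged.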
Advanced Pipelining Increase the depth of the pipeline
Start more than one instruction each cycle (multiple issue)
Loop unrolling to expose more ILP (better scheduling)
"Superscalar" processors, e.g., DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
All modern processors are superscalar and issue multiple instructions, usually with some limitations (e.g., different "pipes")
VLIW: very long instruction word, static multiple issue (relies more on compiler technology)
This class has given you the background you need to learn more!
Chapter 6 Summary Pipelining does not improve latency, but does improve throughput
Last Slide
Chapter Seven
Memories: Review SRAM: value is stored on a pair of inverting gates
very fast, but takes up more space than DRAM (4 to 6 transistors)
DRAM: value is stored as a charge on a capacitor (must be refreshed)
very small, but slower than SRAM (by a factor of 5 to 10)
Exploiting Memory Hierarchy
Users want large and fast memories!
SRAM access times are 0.5–5 ns, at a cost of $4000 to $10,000 per GB
DRAM access times are 50–70 ns, at a cost of $100 to $200 per GB
Disk access times are 5 to 20 million ns, at a cost of $0.50 to $2 per GB
(2004 figures)
Try and give it to them anyway: build a memory hierarchy
Locality A principle that makes having a memory hierarchy a good idea
If an item is referenced:
temporal locality: it will tend to be referenced again soon
spatial locality: nearby items will tend to be referenced soon
Why does code have locality? (see the sketch below)
Our initial focus: two levels (upper, lower)
block: minimum unit of data
hit: data requested is in the upper level
miss: data requested is not in the upper level
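A plain C loop shows both kinds of locality at once (an illustrative sketch):

    /* Why code and data have locality: */
    int sum_array(const int *a, int n) {
        int sum = 0;                  /* temporal: sum is reused every iteration */
        for (int i = 0; i < n; i++)   /* the loop body's instructions are fetched
                                         over and over: temporal locality in code */
            sum += a[i];              /* spatial: a[0], a[1], ... are adjacent */
        return sum;
    }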
Cache Two issues: How do we know if a data item is in the cache?
If it is, how do we find it?
Our first example: block size is one word of data, "direct mapped"
For each item of data at the lower level, there is exactly one location in the cache where it might be
i.e., lots of items at the lower level share locations in the upper level
Direct Mapped Cache Mapping: address is modulo the number of blocks in the cache
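In C, the mapping for one-word blocks is just a divide and a modulo (the sizes here are illustrative):

    #include <stdint.h>

    #define NUM_BLOCKS 1024u   /* a power of two, as in real designs */

    /* "Address modulo the number of blocks" picks the cache index... */
    static uint32_t cache_index(uint32_t addr) {
        return (addr / 4) % NUM_BLOCKS;   /* word address, one-word blocks */
    }

    /* ...and the remaining upper bits form the tag that tells apart the
       many memory blocks sharing that index. */
    static uint32_t cache_tag(uint32_t addr) {
        return (addr / 4) / NUM_BLOCKS;
    }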
Direct Mapped Cache For MIPS:
What kind of locality are we taking advantage of?
Direct Mapped Cache Taking advantage of spatial locality:
Hits vs. Misses
Read hits: this is what we want!
Read misses: stall the CPU, fetch the block from memory, deliver it to the cache, restart
Write hits: can replace data in cache and memory (write-through), or write the data only into the cache and write it back to memory later (write-back)
Write misses: read the entire block into the cache, then write the word
Hardware Issues
Make reading multiple words easier by using banks of memory.
It can get a lot more complicated...
Performance Increasing the block size tends to decrease miss rate:
Use split caches because there is more spatial locality in code:
Performance
Simplified model:
execution time = (execution cycles + stall cycles) × cycle time
stall cycles = # of instructions × miss ratio × miss penalty
Two ways of improving performance: decreasing the miss ratio, or decreasing the miss penalty
What happens if we increase block size?
Decreasing miss ratio with associativity
Compared to direct mapped, give a series of references that:
results in a lower miss ratio using a 2-way set associative cache
results in a higher miss ratio using a 2-way set associative cache
assuming we use the "least recently used" replacement strategy
An implementation
Performance
Decreasing miss penalty with multilevel caches
Add a second-level cache:
often the primary cache is on the same chip as the processor
use SRAMs to add another cache above primary memory (DRAM)
the miss penalty goes down if the data is in the 2nd-level cache
Example: CPI of 1.0 on a 5 GHz machine with a 5% miss rate and 100 ns DRAM access; adding a 2nd-level cache with a 5 ns access time decreases the miss rate to 0.5% (worked through below)
Using multilevel caches:
try and optimize the hit time on the 1st-level cache
try and optimize the miss rate on the 2nd-level cache
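Working the example through (arithmetic only, using the numbers above): at 5 GHz a cycle is 0.2 ns, so the 100 ns DRAM access costs 500 cycles and the 5 ns second-level cache costs 25. Without the L2: CPI = 1.0 + 5% × 500 = 26. With it: CPI = 1.0 + 5% × 25 + 0.5% × 500 = 4.75, a speedup of about 5.5.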
Cache Complexities
Not always easy to understand the implications of caches:
theoretical behavior of radix sort vs. quicksort
observed behavior of radix sort vs. quicksort
Cache Complexities Here is why:
Memory system performance is often the critical factor
multilevel caches and pipelined processors make it harder to predict outcomes
Compiler optimizations to increase locality sometimes hurt ILP
Difficult to predict the best algorithm: need experimental data
Virtual Memory
Main memory can act as a cache for the secondary storage (disk)
Advantages: illusion of having more physical memory; program relocation; protection
Pages: virtual memory blocks
Page faults: the data is not in memory, so retrieve it from disk
huge miss penalty, thus pages should be fairly large (e.g., 4 KB)
reducing page faults is important (LRU is worth the price)
can handle the faults in software instead of hardware
using write-through is too expensive, so we use write-back
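With 4 KB pages, the address split is a 12-bit shift and mask; a C sketch (constants illustrative):

    #include <stdint.h>

    #define PAGE_BITS 12u   /* 4 KB = 2^12 bytes per page */

    static uint32_t vpn(uint32_t vaddr) { return vaddr >> PAGE_BITS; }
    static uint32_t off(uint32_t vaddr) { return vaddr & ((1u << PAGE_BITS) - 1); }

    /* Translation replaces the virtual page number with a physical one,
       found in the page table, and keeps the page offset:
       paddr = (ppn << PAGE_BITS) | off(vaddr); */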
Page Tables
Page Tables
Making Address Translation Fast
A cache for address translations: the translation lookaside buffer (TLB)
Typical values: 16–512 entries, miss rate of 0.01%–1%, miss penalty of 10–100 cycles
TLBs and caches
TLBs and Caches
Modern Systems
Modern Systems Things are getting complicated!
Some Issues Processor speeds continue to increase very fast — much faster than either DRAM or disk access times Design challenge: dealing with this growing disparity Prefetching? 3rd level caches and more? Memory design?
Chapters 8 & 9 (partial coverage)
Interfacing Processors and Peripherals
I/O Design affected by many factors (expandability, resilience)
Performance depends on: access latency, throughput, the connection between devices and the system, the memory hierarchy, and the operating system
A variety of different users (e.g., banks, supercomputers, engineers)
I/O: Important but neglected
"The difficulties in assessing and designing I/O systems have often relegated I/O to second class status"
"courses in every aspect of computing, from programming to computer architecture often ignore I/O or give it scanty coverage"
"textbooks leave the subject to near the end, making it easier for students and instructors to skip it!"
GUILTY! We won't be looking at I/O in much detail; be sure to read Chapter 8 in its entirety; you should probably take a networking class!
I/O Devices Very diverse devices — behavior (i.e., input vs. output) — partner (who is at the other end?) — data rate
I/O Example: Disk Drives
To access data:
seek: position the head over the proper track (3 to 14 ms avg.)
rotational latency: wait for the desired sector (half a rotation on average: 0.5 / RPM)
transfer: grab the data (one or more sectors), at 30 to 80 MB/sec
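As a check on the rotational term: at 10,000 RPM a full rotation takes 60 / 10,000 = 6 ms, so waiting half a rotation on average costs 3 ms, the same order as the seek; for small transfers, the mechanical delays dominate the transfer time.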
I/O Example: Buses Shared communication link (one or more wires)
Difficult design:
may be a bottleneck
length of the bus
number of devices
tradeoffs (buffers for higher bandwidth increase latency)
support for many different devices
cost
Types of buses:
processor-memory (short, high speed, custom design)
backplane (high speed, often standardized, e.g., PCI)
I/O (lengthy, different devices, e.g., USB, Firewire)
Synchronous vs. asynchronous:
synchronous: use a clock and a synchronous protocol; fast and small, but every device must operate at the same rate, and clock skew requires the bus to be short
asynchronous: don't use a clock; instead use handshaking
I/O Bus Standards Today we have two dominant bus standards:
Other important issues
Bus Arbitration: — daisy chain arbitration (not very fair) — centralized arbitration (requires an arbiter), e.g., PCI — collision detection, e.g., Ethernet Operating system: — polling — interrupts — direct memory access (DMA) Performance Analysis techniques: — queuing theory — simulation — analysis, i.e., find the weakest link (see “I/O System Design”) Many new developments
Pentium 4 I/O Options
Fallacies and Pitfalls
Fallacy: the rated mean time to failure of disks is 1,200,000 hours, so disks practically never fail. Fallacy: magnetic disk storage is on its last legs, will be replaced. Fallacy: A 100 MB/sec bus can transfer 100 MB/sec. Pitfall: Moving functions from the CPU to the I/O processor, expecting to improve performance without analysis.
Multiprocessors
Idea: create powerful computers by connecting many smaller ones
good news: works for timesharing (better than a supercomputer)
bad news: it's really hard to write good concurrent programs
many commercial failures
Questions How do parallel processors share data? — single address space (SMP vs. NUMA) — message passing How do parallel processors coordinate? — synchronization (locks, semaphores) — built into send / receive primitives — operating system protocols How are they implemented? — connected by a single bus — connected by a network
Supercomputers Plot of top 500 supercomputer sites over a decade:
Using multiple processors an old idea
Some SIMD designs:
Costs for the Illiac IV escalated from $8 million in 1966 to $32 million in 1972, despite completion of only ¼ of the machine. It took three more years before it was operational!
"For better or worse, computer architects are not easily discouraged"
Lots of interesting designs and ideas, lots of failures, few successes
Topologies
Clusters Constructed from whole computers
Independent, scalable networks
Strengths: many applications amenable to loosely coupled machines; exploit local area networks; cost effective and easy to expand; highly available due to the separation of memories
Weaknesses: administration costs not necessarily lower; connected using the I/O bus
In theory, we should be able to do better
Google
Serves an average of 1000 queries per second
Google uses 6,000 processors and 12,000 disks
Two sites in Silicon Valley, two in Virginia
Each site connected to the internet using OC48 (2488 Mbit/sec)
Reliability: on an average day, 20 machines need to be rebooted (software errors); 2% of the machines are replaced each year
In some sense, simple ideas well executed: better (and cheaper) than other approaches involving increased complexity
Concluding Remarks Evolution vs. Revolution “More often the expense of innovation comes from being too disruptive to computer users” “Acceptance of hardware ideas requires acceptance by software people; therefore hardware people should learn about software. And if software people want good machines, they must learn more about hardware to be able to communicate with and thereby influence hardware engineers.”