Natawut Nupairoj, Assembly Language: Pipelining Processor


Instruction Cycle

pc = 0;
do {
    ir := memory[pc++];    { Fetch the instruction. }
    decode(ir);            { Decode the instruction. }
    fetch(operands);       { Fetch the operands. }
    execute;               { Execute the instruction. }
    store(results);        { Store the results. }
} while (ir != HALT);
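The fetch/decode/execute loop above can be sketched as a tiny interpreter (a minimal sketch in Python; the tuple-based instruction encoding and the ADD/HALT opcodes are invented for illustration, not part of any real ISA):

```python
def run(memory, regs):
    """Execute the instruction cycle from the slide over a toy instruction set."""
    pc = 0
    while True:
        ir = memory[pc]              # Fetch the instruction.
        pc += 1
        op = ir[0]                   # Decode the instruction.
        if op == "HALT":
            break
        if op == "ADD":              # Fetch operands, execute, store results.
            _, dst, a, b = ir
            regs[dst] = regs[a] + regs[b]
    return regs

regs = run([("ADD", "r2", "r0", "r1"), ("HALT",)],
           {"r0": 2, "r1": 3, "r2": 0})
print(regs["r2"])  # 5
```

Each comment maps one line back to a step of the slide's pseudocode.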

Pipelining

Pipelining improves execution speed by dividing the instruction cycle into "stages". Each stage executes independently and concurrently. Pipelining is natural! (From David Patterson's lecture notes.)


Pipelining Lessons

Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload. Multiple tasks operate simultaneously using different resources. Potential speedup = number of pipe stages.

Pipelining in a Modern Processor

The instruction cycle is divided into five stages: Fetch, Decode, Operand Fetch, Execute, Store.

Pipelined Execution

Time →
Inst 1:  F D O E S
Inst 2:    F D O E S
Inst 3:      F D O E S
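The overlap shown above can be generated mechanically (a minimal sketch; it assumes an ideal pipeline where one instruction enters per cycle, so n instructions through k stages take k + n - 1 cycles):

```python
STAGES = ["F", "D", "O", "E", "S"]  # the five stages from the previous slide

def schedule(n_instructions):
    """Return (total cycles, one timeline row per instruction) for an ideal pipe."""
    total = len(STAGES) + n_instructions - 1   # k + (n - 1) cycles
    rows = []
    for i in range(n_instructions):
        # Instruction i starts one cycle after instruction i-1.
        rows.append(" ".join([" "] * i + STAGES))
    return total, rows

total, rows = schedule(3)
print(total)     # 7 cycles for 3 instructions in a 5-stage pipe
for row in rows:
    print(row)
```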

Performance of a Pipeline

What do we gain? Suppose we execute 1000 instructions on non-pipelined and pipelined CPUs, with a clock speed of 500 MHz (1 clock = 2 ns).

Non-pipelined CPU:
–total time = 2 ns/cycle x 5 cycles/instr x 1000 instr = 10,000 ns = 10 us

Perfect pipelined CPU:
–total time = 2 ns/cycle x (1 cycle/instr x 1000 instr + 4 cycles to drain) = 2,008 ns, about 2 us
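The arithmetic above can be checked directly (a minimal sketch; the 4 "drain" cycles are the extra cycles the last instruction needs to leave the 5-stage pipe):

```python
CYCLE_NS = 2   # 500 MHz clock => 2 ns per cycle
STAGES = 5
N = 1000       # instructions executed

non_pipelined_ns = CYCLE_NS * STAGES * N            # every instr takes 5 cycles
pipelined_ns = CYCLE_NS * (N + (STAGES - 1))        # 1 cycle/instr + drain
speedup = non_pipelined_ns / pipelined_ns

print(non_pipelined_ns, pipelined_ns, round(speedup, 2))  # 10000 2008 4.98
```

Note the speedup approaches the stage count (5) as the instruction count grows, matching the "potential speedup = number of pipe stages" lesson.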

Nothing is perfect!

Pipelining has a problem with branches: the CPU doesn't know what to fetch next until the branch has been decoded and executed.

Time →
Inst 1:        F D O E S
Inst 2: JMP X    F D O E S
Inst X:                  F D O E S

The branch target address is not available until JMP X executes, so the fetch of Inst X must wait.

Stalled Pipe

When pipelining does not flow smoothly, we say the pipe is "stalled". Besides branches, what else can stall it?
–Subroutine calls
–Memory accesses
–Multi-cycle execution
Can we do better? Yes, but we will discuss that later.

Branching in Sparc

Sparc uses a 5-stage pipeline. Recall: the pipe is stalled by a branch!

Time →
Inst 1:        F D O E S
Inst 2: JMP X    F D O E S
Inst X:                  F D O E S

The branch target address is not available until JMP X executes.

Branching in Sparc

However, Sparc does not stall. Instead, it executes the instruction next to the branch (or call) instruction BEFORE it actually branches. This instruction position is called the "delay slot".

Delay Slot

Time →
Inst 1:        F D O E S
Inst 2: JMP X    F D O E S
Inst 3:            F D O E S   <- delay slot
Inst X:                  F D O E S

The branch target address is not available until JMP X executes; the delay-slot instruction (Inst 3) fills the gap.

Filling Delay Slots with NOP

        .global main
main:   save    %sp, -64, %sp
        mov     9, %l0
        sub     %l0, 2, %o0
        add     %l0, 14, %o1    ! Instruction before branch
        call    .mul
        nop                     ! Delay slot => wasted
        add     %l0, 8, %o1     ! Instruction before branch
        call    .div
        nop                     ! Delay slot => wasted
        mov     %o0, %l1
        mov     1, %g1
        ta      0

Filling Delay Slots

        .global main
main:   save    %sp, -64, %sp
        mov     9, %l0
        sub     %l0, 2, %o0
        call    .mul
        add     %l0, 14, %o1    ! Delay slot filled
        call    .div
        add     %l0, 8, %o1     ! Delay slot filled
        mov     %o0, %l1
        mov     1, %g1
        ta      0

Optimizing Our Second Program

Can we fill the delay slot?

        ...
        mov     %o0, %l1        ! Store it in y
        add     %l0, 1, %l0     ! x++
        cmp     %l0, 11         ! x < 11 ?
        bl      loop
        nop                     ! Delay slot => wasted
        ...

Not with cmp, and not with add (cmp depends on add). But mov can fill it: no other instruction after it (and before bl) depends on its result.

Optimizing Our Second Program

        ...
        mov     %o0, %l1        ! Store it in y
        add     %l0, 1, %l0     ! x++
        cmp     %l0, 11         ! x < 11 ?
        bl      loop
        mov     %o0, %l1        ! Store it in y (delay slot)
        ...

The key is to fill the delay slot with an instruction whose result no other instruction depends on!

Filling Delay Slots: Summary

After a branch or call, there is one delay slot. Always fill the delay slot to improve performance. When filling it, don't change the results the program computes: no other instruction (before the branch, or the branch itself) may depend on the instruction placed in the delay slot. You can always fall back to filling the slot with "nop".
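The legality rule above can be approximated mechanically: an instruction may move down into the delay slot only if none of the instructions it moves past (including the branch) read or clobber the registers it touches. Below is a simplified sketch; the reads/writes sets are written out by hand for the slides' example rather than parsed from real SPARC text, and "%icc" is used as a stand-in name for the condition codes cmp sets:

```python
def can_fill_delay_slot(candidate, moved_past):
    """candidate and each entry of moved_past are dicts of 'reads'/'writes'
    register sets. The candidate may move past them into the delay slot only
    if none of them depend on its result or overwrite its inputs."""
    for instr in moved_past:
        if candidate["writes"] & (instr["reads"] | instr["writes"]):
            return False   # someone needs (or redefines) what the candidate writes
        if candidate["reads"] & instr["writes"]:
            return False   # the candidate's input would change
    return True

mov_i = {"reads": {"%o0"}, "writes": {"%l1"}}   # mov %o0, %l1
add_i = {"reads": {"%l0"}, "writes": {"%l0"}}   # add %l0, 1, %l0
cmp_i = {"reads": {"%l0"}, "writes": {"%icc"}}  # cmp %l0, 11
bl_i  = {"reads": {"%icc"}, "writes": set()}    # bl loop

print(can_fill_delay_slot(mov_i, [add_i, cmp_i, bl_i]))  # True: mov can fill it
print(can_fill_delay_slot(add_i, [cmp_i, bl_i]))         # False: cmp depends on add
print(can_fill_delay_slot(cmp_i, [bl_i]))                # False: bl needs cmp's flags
```

This reproduces the slides' conclusion: mov can fill the slot, while add and cmp cannot.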

Do…While Delay Slot

How do we fill the delay slot?
–With an independent instruction (see our second program).
–With the target instruction and an annulled branch, when no independent instruction can be found.

Filling with the Target Instruction

        sub     %l0, 1, %o0     ! (x-1) to %o0, executed once
loop:   call    .mul
        sub     %l0, 7, %o1     ! (x-7) to %o1, delay slot
        call    .div
        sub     %l0, 11, %o1    ! (x-11) to %o1, delay slot
        mov     %o0, %l1        ! Store it in y
        add     %l0, 1, %l0     ! x++
        cmp     %l0, 11         ! x < 11 ?
        bl,a    loop
        sub     %l0, 1, %o0     ! (x-1) to %o0 (delay slot)

Annulled Branch

An annulled branch executes the instruction in the delay slot if and only if the branch is taken. The program is one instruction longer and wastes one cycle when the loop exits, but there is no need to find an independent instruction. Annulling can be used with any type of branch.
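The annul rule can be stated as a one-line predicate (a minimal sketch of the behavior described on this slide for conditional branches such as `bl,a`):

```python
def delay_slot_executes(branch_taken, annulled):
    """Without the annul bit the delay slot always executes.
    With it (e.g. bl,a), the slot executes only when the branch is taken."""
    return (not annulled) or branch_taken

print(delay_slot_executes(branch_taken=True, annulled=True))    # True: loop continues
print(delay_slot_executes(branch_taken=False, annulled=True))   # False: squashed on exit
print(delay_slot_executes(branch_taken=False, annulled=False))  # True: plain bl
```

The False case is exactly the "wasted cycle when the loop exits" mentioned above.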

While Loop Optimization

Reduce the number of instructions executed inside the loop:
–First, jump to the comparison at the end of the loop.
–Then fill the delay slots.

While Loop Optimization Example

        ba      test            ! Initial jump
        nop                     ! Delay slot
loop:   add     %l0, %l1, %l0   ! a = a + b
        add     %l2, 1, %l2     ! c++
test:   cmp     %l0, 17         ! Check condition
        ble     loop            ! Repeat if true
        nop                     ! Delay slot

While Loop Optimization Example

        ba      test            ! Initial jump
        cmp     %l0, 17         ! Check condition (delay slot)
loop:   add     %l2, 1, %l2     ! c++
        cmp     %l0, 17         ! Check condition
test:   ble,a   loop            ! Repeat if true
        add     %l0, %l1, %l0   ! very tricky! (delay slot)

Performance improvement:
–Direct translation: 7 instructions per loop iteration
–With the initial jump: 5 per iteration
–With delay slots filled as well: 4 per iteration
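The per-iteration figures above can be tallied for any trip count (a minimal sketch; it simply restates the slide's counts of 7, 5, and 4 instructions per iteration, and the trip count of 10 is a hypothetical example):

```python
def total_instructions(per_iteration, iterations):
    """Instructions executed inside a loop whose body costs per_iteration."""
    return per_iteration * iterations

ITERS = 10  # hypothetical trip count
for name, per in [("direct translation", 7),
                  ("initial jump to test", 5),
                  ("delay slots filled", 4)]:
    print(name, total_instructions(per, ITERS))
```

For any non-trivial trip count, the fully optimized loop executes a bit over half the instructions of the direct translation.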