Computer System Architecture Dept. of Info. Of Computer Chap. 9 Pipeline and Vector Processing 9-1 Chap. 9 Pipeline and Vector Processing n 9-1 Parallel.

Slides:

Advertisements

Similar presentations

CPU Structure and Function

Advertisements

Lecture 4: CPU Performance

PIPELINING AND VECTOR PROCESSING

PIPELINE AND VECTOR PROCESSING

The University of Adelaide, School of Computer Science

Chapter 3 Pipelining. 3.1 Pipeline Model n Terminology –task –subtask –stage –staging register n Total processing time for each task. –T pl =, where t.

Computer Organization and Architecture

Pipeline and Vector Processing (Chapter2 and Appendix A)

Vector Processing. Vector Processors Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations.

Parallell Processing Systems1 Chapter 4 Vector Processors.

Chapter 12 CPU Structure and Function. CPU Sequence Fetch instructions Interpret instructions Fetch data Process data Write data.

Computer Organization and Architecture

Computer Organization and Architecture

Chapter 12 Pipelining Strategies Performance Hazards.

Pipelining Fetch instruction Decode instruction Calculate operands (i.e. EAs) Fetch operands Execute instructions Write result Overlap these operations.

Lec 17 Nov 2 Chapter 4 – CPU design data path design control logic design single-cycle CPU performance limitations of single cycle CPU multi-cycle CPU.

Chapter 12 CPU Structure and Function. Example Register Organizations.

7/2/ _23 1 Pipelining ECE-445 Computer Organization Dr. Ron Hayne Electrical and Computer Engineering.

Group 5 Alain J. Percial Paula A. Ortiz Francis X. Ruiz.

Prince Sultan College For Woman

CH12 CPU Structure and Function

Advanced Computer Architectures

5-Stage Pipelining Fetch Instruction (FI) Fetch Operand (FO) Decode Instruction (DI) Write Operand (WO) Execution Instruction (EI) S3S3 S4S4 S1S1 S2S2.

Basic Microcomputer Design. Inside the CPU Registers – storage locations Control Unit (CU) – coordinates the sequencing of steps involved in executing.

Edited By Miss Sarwat Iqbal (FUUAST) Last updated:21/1/13

Pipeline And Vector Processing. Parallel Processing The purpose of parallel processing is to speed up the computer processing capability and increase.

Computer Systems Organization CS 1428 Foundations of Computer Science.

9.2 Pipelining Suppose we want to perform the combined multiply and add operations with a stream of numbers: A i * B i + C i for i =1,2,3,…,7.

Presented by: Sergio Ospina Qing Gao. Contents ♦ 12.1 Processor Organization ♦ 12.2 Register Organization ♦ 12.3 Instruction Cycle ♦ 12.4 Instruction.

PIPELINING AND VECTOR PROCESSING

PIPELINING AND VECTOR PROCESSING

Computer System Architecture © Korea Univ. of Tech. & Edu. Dept. of Info. & Comm. Chap. 7 Microprogrammed Control 7-1 Chap. 7 Microprogrammed Control (Control.

Lecture 8: Processors, Introduction EEN 312: Processors: Hardware, Software, and Interfacing Department of Electrical and Computer Engineering Spring 2014,

1 Pipelining and Vector Processing Computer Organization Computer Architectures Lab PIPELINING AND VECTOR PROCESSING Parallel Processing Pipelining Arithmetic.

Principles of Linear Pipelining

Principles of Linear Pipelining. In pipelining, we divide a task into set of subtasks. The precedence relation of a set of subtasks {T 1, T 2,…, T k }

Chapter One Introduction to Pipelined Processors

Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.

Dr. Bernard Chen Ph.D. University of Central Arkansas Spring 2010

EKT303/4 Superscalar vs Super-pipelined.

Introduction  The speed of execution of program is influenced by many factors. i) One way is to build faster circuit technology to build the processor.

3/12/2013Computer Engg, IIT(BHU)1 CONCEPTS-1. Pipelining Pipelining is used to increase the speed of processing It uses temporal parallelism In pipelining,

Chapter One Introduction to Pipelined Processors

CPU Design and Pipelining – Page 1CSCI 4717 – Computer Architecture CSCI 4717/5717 Computer Architecture Topic: CPU Operations and Pipelining Reading:

Vector computers.

Pipelining. A process of execution of instructions may be decomposed into several suboperations Each of suboperations may be executed by a dedicated segment.

DICCD Class-08. Parallel processing A parallel processing system is able to perform concurrent data processing to achieve faster execution time The system.

UNIT-V PIPELINING & VECTOR PROCESSING.

William Stallings Computer Organization and Architecture 8th Edition

Pipelining and Vector Processing

Array Processor.

Overview Parallel Processing Pipelining

Chap. 9 Pipeline and Vector Processing

Computer Architecture

COMPUTER ARCHITECTURES FOR PARALLEL ROCESSING

Extra Reading Data-Instruction Stream : Flynn

Chapter 11 Processor Structure and function

Compiler analyzes the instructions before and after the branch and rearranges the program sequence by inserting useful instructions in the delay steps.

COMPUTER ORGANIZATION AND ARCHITECTURE

Presentation transcript:

Computer System Architecture Dept. of Info. Of Computer Chap. 9 Pipeline and Vector Processing 9-1 Chap. 9 Pipeline and Vector Processing n 9-1 Parallel Processing  Simultaneous data processing tasks for the purpose of increasing the computational speed  Perform concurrent data processing to achieve faster execution time  Multiple Functional Unit : Fig. 9-1 l Separate the execution unit into eight functional units operating in parallel  Computer Architectural Classification l Data-Instruction Stream : Flynn l Serial versus Parallel Processing : Feng l Parallelism and Pipelining : Händler  Flynn’s Classification l 1) SISD (Single Instruction - Single Data stream) »for practical purpose: only one processor is useful »Example systems : Amdahl 470V/6, IBM 360/91 Parallel Processing Example =

Computer System Architecture Dept. of Info. Of Computer Chap. 9 Pipeline and Vector Processing 9-2 l 2) SIMD (Single Instruction - Multiple Data stream) »vector or array operations 에 적합한 형태 n one vector operation includes many operations on a data stream »Example systems : CRAY -1, ILLIAC-IV l 3) MISD (Multiple Instruction - Single Data stream) »Data Stream 에 Bottle neck 으로 인해 실제로 사용되지 않음

Computer System Architecture Dept. of Info. Of Computer Chap. 9 Pipeline and Vector Processing 9-3 l 4) MIMD (Multiple Instruction - Multiple Data stream) » 대부분의 Multiprocessor System 에서 사용됨  Main topics in this Chapter l Pipeline processing : Sec. 9-2 »Arithmetic pipeline : Sec. 9-3 »Instruction pipeline : Sec. 9-4 l Vector processing :adder/multiplier pipeline 이용, Sec. 9-6 l Array processing : 별도의 array processor 이용, Sec. 9-7 »Attached array processor : Fig »SIMD array processor : Fig Large vector, Matrices, 그리고 Array Data 계산

Computer System Architecture Dept. of Info. Of Computer Chap. 9 Pipeline and Vector Processing 9-4 n 9-2 Pipelining  Pipelining 의 원리 l Decomposing a sequential process into suboperations l Each subprocess is executed in a special dedicated segment concurrently  Pipelining 의 예제 : Fig. 9-2 l Multiply and add operation : ( for i = 1, 2, …, 7 ) l 3 개의 Suboperation Segment 로 분리 »1) : Input Ai and Bi »2) : Multiply and input Ci »3) : Add Ci l Content of registers in pipeline example : Tab. 9-1  General considerations l 4 segment pipeline : Fig. 9-3 »S : Combinational circuit for Suboperation »R : Register(intermediate results between the segments) l Space-time diagram : Fig. 9-4 »Show segment utilization as a function of time l Task : T1, T2, T3,…, T6 »Total operation performed going through all the segment Segment versus clock-cycle

Computer System Architecture Dept. of Info. Of Computer Chap. 9 Pipeline and Vector Processing 9-5  Speedup S : Nonpipeline / Pipeline l S = n t n / ( k + n - 1 ) t p = 6 6 t n / ( ) t p = 36 t n / 9 t n = 4 »n : task number ( 6 ) »t n : time to complete each task in nonpipeline ( 6 cycle times = 6 t p ) »t p : clock cycle time ( 1 clock cycle ) »k : segment number ( 4 ) l If n   이면, S = t n / t p l 한 개의 task 를 처리하는 시간이 같을 때 즉, nonpipeline ( t n ) = pipeline ( k t p ) 이라고 가정하면, S = t n / t p = k t p / t p = k 따라서 이론적으로 k 배 (segment 개수 ) 만큼 처리 속도가 향상된다.  Pipeline 에는 Arithmetic Pipeline( Sec. 9-3 ) 과 Instruction Pipeline( Sec. 9-4 ) 이 있다 n Sec. 9-3 Arithmetic Pipeline  Floating-point Adder Pipeline Example : Fig. 9-6 l Add / Subtract two normalized floating-point binary number »X = A x 2 a = x 10 3 »Y = B x 2 b = x 10 2 Pipeline 에서의 처리 시간 = 9 clock cycles k + n - 1  n

Computer System Architecture Dept. of Info. Of Computer Chap. 9 Pipeline and Vector Processing 9-6 l 4 segments suboperations »1) Compare exponents by subtraction : = 1 n X = x 10 3 n Y = x 10 2 »2) Align mantissas n X = x 10 3 n Y = x 10 3 »3) Add mantissas n Z = x 10 3 »4) Normalize result n Z = x 10 4

Computer System Architecture Dept. of Info. Of Computer Chap. 9 Pipeline and Vector Processing 9-7 n 9-4 Instruction Pipeline  Instruction Cycle 1) Fetch the instruction from memory 2) Decode the instruction 3) Calculate the effective address 4) Fetch the operands from memory 5) Execute the instruction 6) Store the result in the proper place  Example : Four-segment Instruction Pipeline l Four-segment CPU pipeline : Fig. 9-7 »1) FI : Instruction Fetch »2) DA : Decode Instruction & calculate EA »3) FO : Operand Fetch »4) EX : Execution l Timing of Instruction Pipeline : Fig. 9-8 »Instruction 3 에서 Branch 명령 실행 Branch No Branch

Computer System Architecture Dept. of Info. Of Computer Chap. 9 Pipeline and Vector Processing 9-8  Pipeline Conflicts : 3 major difficulties l 1) Resource conflicts »memory access by two segments at the same time l 2) Data dependency »when an instruction depend on the result of a previous instruction, but this result is not yet available l 3) Branch difficulties »branch and other instruction (interrupt, ret,..) that change the value of PC  Data Dependency 해결 방법 l Hardware 적인 방법 »Hardware Interlock n previous instruction 의 결과가 나올 때 까지 Hardware 적인 Delay 를 강제 삽입 »Operand Forwarding n previous instruction 의 결과를 곧바로 ALU 로 전달 ( 정상적인 경우, register 를 경유함 ) l Software 적인 방법 »Delayed Load n previous instruction 의 결과가 나올 때 까지 No-operation instruction 을 삽입  Handling of Branch Instructions l Prefetch target instruction »Conditional branch 에서 branch target instruction ( 조건 맞음 ) 과 다음 instruction ( 조건 안 맞음 ) 을 모두 fetch

Computer System Architecture Dept. of Info. Of Computer Chap. 9 Pipeline and Vector Processing 9-9 l Branch Target Buffer : BTB »1) Associative memory 를 이용하여 branch target address 이후에 몇 개에 instruction 을 미리 BTB 에 저장한다. »2) 만약 branch instruction 이면 우선 BTB 를 검사하여 BTB 에 있으면 곧바로 가져온다 (Cache 개념 도입 ) l Loop Buffer »1) small very high speed register file (RAM) 을 이용하여 프로그램에서 loop 를 detect 한다. »2) 만약 loop 가 발견되면 loop 프로그램 전체를 Loop Buffer 에 load 하여 실행하면 외부 메모리를 access 하지 않는다. l Branch Prediction »Branch 를 predict 하는 additional hardware logic 사용  Delayed Branch 해결 방법 l Fig. 9-8 에서와 같이 branch instruction 이 pipeline operation 을 지연시키는 경우 l 예제 : Fig. 9-10, p. 318, Sec. 9-5 »1) No-operation instruction 삽입 »2) Instruction Rearranging : Compiler 지원

Computer System Architecture Dept. of Info. Of Computer Chap. 9 Pipeline and Vector Processing 9-10 n 9-5 RISC Pipeline  RISC CPU 의 특징 l Instruction Pipeline 을 이용함 l Single-cycle instruction execution l Compiler support  Example : Three-segment Instruction Pipeline l 3 Suboperations Instruction Cycle »1) I : Instruction fetch »2) A : Instruction decoded and ALU operation »3) E : Transfer the output of ALU to a register, memory, or PC l Delayed Load : Fig. 9-9(a) »3 번째 Instruction( ADD R1 + R3 ) 에서 Conflict 발생 n 4 번째 clock cycle 에서 2 번째 Instruction (LOAD R2) 실행과 동시에 3 번째 instruction 에서 R2 를 연산 »Delayed Load 해결 방법 : Fig. 9-9(b) n No-operation 삽입 l Delayed Branch : Sec. 9-4 에서 이미 설명 Conflict 발생

Computer System Architecture Dept. of Info. Of Computer Chap. 9 Pipeline and Vector Processing 9-11 n 9-6 Vector Processing  Science and Engineering Applications l Long-range weather forecasting, Petroleum explorations, Seismic data analysis, Medical diagnosis, Aerodynamics and space flight simulations, Artificial intelligence and expert systems, Mapping the human genome, Image processing  Vector Operations l Arithmetic operations on large arrays of numbers l Conventional scalar processor »Machine language l Vector processor »Single vector instruction Initialize I = 0 20 Read A(I) Read B(I) Store C(I) = A(I) + B(I) Increment I = I + 1 If I  100 go to 20 Continue »Fortran language DO 20 I = 1, C(I) = A(I) + B(I) C(1:100) = A(1:100) + B(1:100)

Computer System Architecture Dept. of Info. Of Computer Chap. 9 Pipeline and Vector Processing 9-12  Vector Instruction Format : Fig ADD A B C 100  Matrix Multiplication l 3 x 3 matrices multiplication : n 2 = 9 inner product » : 이와 같은 inner product 가 9 개 l Cumulative multiply-add operation : n 3 = 27 multiply-add » : 이와 같은 multiply-add 가 3 개 따라서 9 X 3 multiply-add = 27    C 11 의 초기값 = 0

Computer System Architecture Dept. of Info. Of Computer Chap. 9 Pipeline and Vector Processing 9-13  Pipeline for calculating an inner product : Fig l Floating point multiplier pipeline : 4 segment l Floating point adder pipeline : 4 segment l 예제 ) »after 1st clock input »after 8th clock input »Four section summation »after 4th clock input A1B1A1B1 A 4 B 4 A 3 B 3 A 2 B 2 A 1 B 1 »after 9th, 10th, 11th,... A 8 B 8 A 7 B 7 A 6 B 6 A 5 B 5 A 4 B 4 A 3 B 3 A 2 B 2 A 1 B 1,,,  

Computer System Architecture Dept. of Info. Of Computer Chap. 9 Pipeline and Vector Processing 9-14  Memory Interleaving : Fig l Simultaneous access to memory from two or more source using one memory bus system l AR 의 하위 2 bit 를 사용하여 4 개중 1 개의 memory module 선택 l 예제 ) Even / Odd Address Memory Access  Supercomputer l Supercomputer = Vector Instruction + Pipelined floating-point arithmetic l Performance Evaluation Index »MIPS : Million Instruction Per Second »FLOPS : Floating-point Operation Per Second n megaflops : 10 6, gigaflops : 10 9 l Cray supercomputer : Cray Research »Clay-1 : 80 megaflops, 4 million 64 bit words memory »Clay-2 : 12 times more powerful than the clay-1 l VP supercomputer : Fujitsu »VP-200 : 300 megaflops, 32 million memory, 83 vector instruction, 195 scalar instruction »VP-2600 : 5 gigaflops

Computer System Architecture Dept. of Info. Of Computer Chap. 9 Pipeline and Vector Processing 9-15 n 9-7 Array Processors  Performs computations on large arrays of data  Array Processing l Attached array processor : Fig »Auxiliary processor attached to a general purpose computer l SIMD array processor : Fig »Computer with multiple processing units operating in parallel n Vector 계산 C = A + B 에서 c i = a i + b i 를 각각의 PE i 에서 동시에 실행 Vector processing : Adder/Multiplier pipeline 이용 Array processing : 별도의 array processor 이용