Prince Sultan College For Woman, Riyadh Philanthropic Society For Science
Dept. of Computer & Information Sciences
CS 251 Introduction to Computer Organization & Assembly Language
Lecture 4 (Computer System Organization): Processors - Parallelism
Outline
From the textbook: Chapter 2 (Sections 2.1.3, 2.1.4, 2.1.5, 2.1.6)
- CISC vs. RISC
- Design principles for modern computers
- Instruction-level parallelism
- Processor-level parallelism
RISC vs. CISC
The control unit determines the instruction set of the computer. Instruction sets fall into two main categories:
- RISC: Reduced Instruction Set Computer
- CISC: Complex Instruction Set Computer
RISC vs. CISC (Cont.)
RISC computers are characterized by:
- A small number of simple instructions, each executing in one cycle of the data path.
- All instructions executed directly by hardware.
- Instructions roughly 10 times faster than comparable CISC instructions.
RISC machines had performance advantages:
- All instructions were supported directly by hardware.
- The chip could be designed cleanly, without backward-compatibility constraints.
RISC vs. CISC (Cont.)
CISC computers are characterized by:
- A large number of complex instructions.
- Instructions that require interpretation: a complex instruction is interpreted into many simpler machine operations, which are then executed by the hardware.
- Instructions roughly 10 times slower than comparable RISC instructions.
- Chips designed with backward compatibility in mind.
Both RISC and CISC computers had their fan clubs, and neither was able to drive the other from the market.
RISC vs. CISC (Compromise)
Intel combined the RISC and CISC approaches (starting with the Intel 486):
- It has a RISC core that executes the simplest (and most common) instructions in a single cycle.
- The more complex instructions are executed in the usual CISC way.
The net result:
- Common instructions are fast; less common instructions are slow.
- It is not as fast as a pure RISC design, but it gives competitive overall performance while still allowing old software to run unmodified.
Design Principles for Modern Computers
Modern computer design is based on a set of design principles, sometimes called the RISC design principles. They can be summarized in 5 major points:
- All instructions are directly executed by hardware.
- Maximize the rate at which instructions are issued.
- Instructions should be easy to decode.
- Only loads and stores should access memory.
- Provide plenty of registers.
Design Principles for Modern Computers
All instructions are directly executed by hardware
- This eliminates a level of interpretation and provides high speed for most instructions.
- For computers that implement CISC instructions: a complex instruction can be split into smaller parts that are executed as a sequence of microinstructions. This extra step slows the machine, but only for less frequently used instructions, which is acceptable.
Design Principles for Modern Computers
Maximize the rate at which instructions are issued
- MIPS = millions of instructions per second. MIPS speed is related to the number of instructions issued per second, no matter how long an instruction actually takes to complete.
- This principle suggests that parallelism can play a major role in improving performance.
- Although instructions are always encountered in program order, they are not always issued in program order, and they need not finish in program order.
- If instruction 1 sets a register and instruction 2 uses that register, great care must be taken to ensure that instruction 2 does not read the register until it contains the correct value. Getting this right requires a lot of bookkeeping, but it has the potential for performance gains by executing multiple instructions at once.
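The bookkeeping mentioned above can be sketched in a few lines. This is my own illustration (not from the lecture), using a hypothetical `can_issue` helper: an instruction may issue only when no earlier, still-unfinished instruction is going to write a register it reads (a read-after-write dependency).

```python
# Sketch of issue-time dependency bookkeeping (illustrative only).
# An instruction is a dict: the register it writes and the registers it reads.

def can_issue(instr, pending_writes):
    """instr may issue only if none of the registers it reads is still
    waiting to be written by an earlier, in-flight instruction."""
    return not (set(instr["reads"]) & pending_writes)

# Instruction 2 uses R1, which instruction 1 (still executing) will set:
i1 = {"writes": "R1", "reads": ["R2", "R3"]}
i2 = {"writes": "R4", "reads": ["R1"]}

pending = {i1["writes"]}        # instruction 1 has issued but not finished
print(can_issue(i2, pending))   # False: i2 must wait for R1
pending.discard("R1")           # instruction 1 writes R1 back
print(can_issue(i2, pending))   # True: i2 may now issue
```

Real CPUs do this in hardware (scoreboards, reservation stations), but the logical check is the same.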
Design Principles for Modern Computers
Instructions should be easy to decode
- Make instructions regular and of fixed length, with a small number of fields.
- The fewer different instruction formats, the better.
Design Principles for Modern Computers
Only loads and stores should access memory
- Operands for most instructions come from - and return to - registers.
- Access to memory can take a long time. Thus, only LOAD and STORE instructions should reference memory.
Design Principles for Modern Computers
Provide plenty of registers
- Accessing memory is relatively slow, so many registers need to be provided (at least 32).
- Once a word is fetched, it can be kept in a register until it is no longer needed.
Parallelism
There are two types of parallelism:
- Instruction-level parallelism: raises the number of instructions per second the computer issues, rather than improving the execution speed of any particular instruction.
- Processor-level parallelism: multiple processors (CPUs) working together on the same problem.
Instruction-Level Parallelism
Pipelining:
- The biggest bottleneck in the instruction cycle is fetching instructions from memory.
- Instructions can be fetched ahead of time and held in a prefetch buffer, so they are already on hand when it is time to execute them.
- Prefetching thus divides instruction execution into two parts: fetching and actual execution.
- A pipeline carries this idea further: execution is split into several stages, each with its own dedicated piece of hardware, and all of them can work in parallel.
Instruction-Level Parallelism
A five-stage pipeline:
- S1, instruction fetch unit: fetches the instruction from memory and places it in a buffer until it is needed.
- S2, instruction decode unit: decodes the instruction, determining its type and what operands it needs.
- S3, operand fetch unit: locates and fetches the operands, either from registers or from memory.
- S4, instruction execution unit: actually does the work of carrying out the instruction, typically by running the operands through the data path.
- S5, write-back unit: writes the result back to the proper register.
Instruction-Level Parallelism
Five-stage pipeline: which instruction each stage is working on at each clock cycle (instruction 1 enters S1 at cycle 1):

Cycle: 1 2 3 4 5 6 7 8 9
S1:    1 2 3 4 5 6 7 8 9
S2:    . 1 2 3 4 5 6 7 8
S3:    . . 1 2 3 4 5 6 7
S4:    . . . 1 2 3 4 5 6
S5:    . . . . 1 2 3 4 5
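The table above follows a simple rule: at clock cycle t, stage s holds instruction t - s + 1. A minimal sketch (assuming an idealized pipeline with no stalls; the function name is my own) reproduces it:

```python
# Idealized pipeline schedule: stage s works on instruction t - s + 1 at cycle t.
def pipeline_schedule(n_instructions, n_stages=5, n_cycles=9):
    rows = []
    for stage in range(1, n_stages + 1):
        row = []
        for cycle in range(1, n_cycles + 1):
            instr = cycle - stage + 1
            # A stage is idle before its first instruction arrives.
            row.append(instr if 1 <= instr <= n_instructions else None)
        rows.append(row)
    return rows

for stage, row in enumerate(pipeline_schedule(9), start=1):
    print(f"S{stage}:", " ".join(str(i) if i else "." for i in row))
```

Once the pipeline is full (from cycle 5 on), all five stages are busy and one instruction completes every cycle.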
Instruction-Level Parallelism
Dual pipeline:
[Figure: a single instruction fetch unit (S1) feeding two parallel pipelines, each with its own decode (S2), operand fetch (S3), execution (S4), and write-back (S5) stages.]
Instruction-Level Parallelism
Dual pipeline:
- A single instruction fetch unit fetches pairs of instructions together and puts each one into its own pipeline, complete with its own ALU, for parallel operation.
- To be able to run in parallel, the two instructions must not conflict over resource usage (e.g., registers), and neither must depend on the result of the other.
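The pairing rule above can be written down directly. This is a sketch of my own (not the textbook's hardware), with a hypothetical `can_pair` helper: two instructions may issue together only if neither reads the other's destination register and they do not write the same register.

```python
# Illustrative pairing check for a dual pipeline.
# Each instruction names its destination register and source registers.

def can_pair(a, b):
    conflict = (
        a["dst"] in b["src"]      # b depends on a's result
        or b["dst"] in a["src"]   # a depends on b's result
        or a["dst"] == b["dst"]   # both write the same register
    )
    return not conflict

add1 = {"op": "ADD", "dst": "R1", "src": ["R2", "R3"]}
add2 = {"op": "ADD", "dst": "R4", "src": ["R5", "R6"]}  # independent of add1
add3 = {"op": "ADD", "dst": "R7", "src": ["R1", "R8"]}  # needs add1's R1

print(can_pair(add1, add2))   # True: may issue together
print(can_pair(add1, add3))   # False: add3 must wait for add1
```

When the check fails, the second instruction is held back one cycle, exactly the "must not depend on the result of the other" condition on the slide.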
Instruction-Level Parallelism
Superscalar architecture: a single pipeline with multiple functional units.
[Figure: stages S1 (instruction fetch), S2 (decode), and S3 (operand fetch) feed an S4 stage containing several functional units - ALU, LOAD, STORE, and floating point - followed by a single write-back stage S5.]
Processor-Level Parallelism
- Instruction-level parallelism helps performance, but only by a factor of 5 to 10.
- Processor-level parallelism can gain a factor of 50, 100, or even more.
There are three main forms of processor-level parallelism:
- Array computers: array processors and vector processors
- Multiprocessors
- Multicomputers
Processor-Level Parallelism
Array computers
- Many problems in the physical sciences & engineering involve arrays.
- Often the same calculations are performed on many different sets of data at the same time.
- The regularity & structure of these programs makes them especially easy targets for speedup through parallel execution.
Two kinds of machines have been used to execute large scientific programs quickly:
- Array processors
- Vector processors
Processor-Level Parallelism
Array computers - Array processors
An array processor consists of a large number of identical processors, all controlled by a single control unit, that perform the same sequence of instructions on different sets of data (in parallel).
Processor-Level Parallelism
Array computers - Array processors
[Figure: an 8 x 8 processor/memory grid. Each cell contains a processor (ALU + registers) with its own local memory; a single control unit broadcasts instructions to all of the processors.]
Processor-Level Parallelism
Array computers - Array processors
Example: the vector addition C = A + B
- The control unit stores the ith components a_i and b_i of A and B in local memory m_i.
- The control unit broadcasts the add instruction c_i = a_i + b_i to all processors.
- The additions take place simultaneously, since there is an adder for each element of the vector.
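The C = A + B example above can be mimicked in a toy sketch (my illustration, not the lecture's): each simulated processor holds its a_i and b_i in "local memory", and one broadcast add makes every processor produce c_i in the same step.

```python
# Toy array-processor model for C = A + B.
A = [1, 2, 3, 4]
B = [10, 20, 30, 40]

# Step 1: the control unit distributes a_i and b_i to processor i's local memory.
local = [{"a": a, "b": b} for a, b in zip(A, B)]

# Step 2: the control unit broadcasts a single "add" instruction; conceptually
# every processor executes c_i = a_i + b_i at the same moment (the loop here
# only simulates that simultaneity).
C = [m["a"] + m["b"] for m in local]

print(C)   # [11, 22, 33, 44]
```

One broadcast instruction, n simultaneous additions: that is the whole appeal of the design for regular array computations.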
Processor-Level Parallelism
Array computers - Vector processors
- A vector processor appears to the programmer very much like an array processor.
- However, all of the addition operations are performed in a single, heavily pipelined adder (whereas an array processor has an adder for each element of the vector).
- It introduces the concept of a vector register: a set of conventional registers that can be loaded from memory in a single instruction (which actually loads them from memory serially).
- A vector addition instruction is then performed on vector registers by the pipelined adder.
Processor-Level Parallelism
Array computers
- Both array processors and vector processors work on arrays of data. Both execute single instructions that, for example, add the elements of two vectors pairwise.
- The difference is in the way they perform the addition.
In comparison with vector processors, array processors:
- can perform some data operations more efficiently
- require more hardware
- are more difficult to program
Processor-Level Parallelism
Array computers
- Array processors are still being made, but they occupy an ever-decreasing niche market, since they only work well on problems requiring the same computation to be performed on many data sets simultaneously.
- A vector unit can be added to a conventional processor. The parts of the program that can be vectorized are then executed quickly by the vector unit, while the rest of the program runs on the conventional processor.
Processor-Level Parallelism
Multiprocessors
- A multiprocessor is made up of a collection of CPUs sharing a common memory.
- There are various schemes for connecting the CPUs to the memory. The simplest is a single bus, with multiple CPUs and one shared memory all plugged into it.
[Figure: several CPUs and one shared memory connected to a single bus.]
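The defining property of a multiprocessor is that every CPU sees the same memory. A minimal sketch of that idea (my illustration, using Python threads as stand-in "CPUs"): all threads read and write one common structure, and a lock serializes access, roughly playing the role of bus arbitration.

```python
# Shared-memory model: several "CPUs" (threads) updating one common memory.
import threading

shared_memory = {"counter": 0}
lock = threading.Lock()

def cpu(n_increments):
    for _ in range(n_increments):
        with lock:                           # one access at a time, like a bus
            shared_memory["counter"] += 1

threads = [threading.Thread(target=cpu, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared_memory["counter"])   # 4000: every CPU saw the same memory
```

Communication is implicit - a value written by one CPU is simply visible to the others - which is exactly why (as the last slide notes) multiprocessors are easier to program than multicomputers.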
Processor-Level Parallelism
Multiprocessors
- In the single-bus scheme, the bus quickly becomes a bottleneck.
- One solution is to give each CPU a local memory of its own, in which it can cache information, reducing traffic on the shared bus.
[Figure: CPUs, each with a local memory, connected by a single bus to the shared memory.]
Processor-Level Parallelism
Multiprocessors
- Multiprocessors with a small number of processors (up to about 64) are relatively easy to build; large ones are difficult to construct.
- The difficulty lies in connecting all the processors to the memory.
Processor-Level Parallelism
Multicomputers
- A multicomputer is similar to a multiprocessor in that it is made up of a collection of CPUs, but it differs in that there is no shared memory.
- The individual CPUs communicate by sending each other messages, something like e-mail, but much faster.
- Multicomputers with nearly 10,000 CPUs have been built and put into operation.
Processor-Level Parallelism
Comparing multiprocessors & multicomputers:
- Multiprocessors are easier to program.
- Multicomputers are easier to build.
- There is much research on designing hybrid systems that combine the good properties of each.