TK2123 COMPUTER ORGANISATION & ARCHITECTURE
Lecture 4: Computer Performance
Dr Masri Ayob
Room: E-3-31 Phone:
Prepared by: Dr Masri Ayob - TK2123
Contents
  This lecture will discuss:
  - Speeding up computer operation.
  - Improvements in Chip Organisation and Architecture.
  - Multilevel Machines.
Speeding up computer operation
  - Pipelining
  - On-board cache
  - On-board L1 & L2 cache
  - Branch prediction
  - Data flow analysis
  - Speculative execution
Performance Balance
  - Processor speed increased.
  - Memory capacity increased.
  - Memory speed lags behind processor speed.
Figure: Logic and Memory Performance Gap
Solutions
  - Increase number of bits retrieved at one time
    - Make DRAM “wider” rather than “deeper”
  - Change DRAM interface
    - Cache
  - Reduce frequency of memory access
    - More complex cache and cache on chip
  - Increase interconnection bandwidth
    - High speed buses
    - Hierarchy of buses
I/O Devices
  - Peripherals with intensive I/O demands
  - Large data throughput demands
  - Processors can handle this
  - The problem is moving the data between processor and peripherals
  - Solutions:
    - Caching
    - Buffering
    - Higher-speed interconnection buses
    - More elaborate bus structures
    - Multiple-processor configurations
Figure: Typical I/O Device Data Rates
Key is Balance
  - Processor components
  - Main memory
  - I/O devices
  - Interconnection structures
Improvements in Chip Organization and Architecture
  - Increase hardware speed of processor
    - Fundamentally due to shrinking logic gate size
      - More gates, packed more tightly, increasing clock rate
      - Propagation time for signals reduced
  - Increase size and speed of caches
    - Dedicating part of processor chip
      - Cache access times drop significantly
  - Change processor organization and architecture
    - Increase effective speed of execution
    - Parallelism
Problems with Clock Speed and Logic Density
  - Power
    - Power density increases with density of logic and clock speed.
    - Dissipating the resulting heat becomes difficult.
  - RC delay
    - Speed at which electrons flow is limited by the resistance and capacitance of the metal wires connecting the gates.
    - Delay increases as RC product increases.
    - Wire interconnects thinner, increasing resistance.
    - Wires closer together, increasing capacitance.
  - Memory latency
    - Memory speeds lag processor speeds.
  - Solution:
    - More emphasis on organisational and architectural approaches
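The RC delay points can be illustrated with a small sketch. It assumes a first-order rectangular-wire model that is not part of the lecture: resistance R = rho * length / (width * thickness), and a parallel-plate estimate of the capacitance between neighbouring wires, C = eps * length * thickness / spacing. The function names, material constants and wire dimensions are illustrative assumptions only.

```python
# Illustrative first-order model of interconnect RC delay (an assumption,
# not taken from the lecture): R = rho*L/(w*t), C = eps*L*t/s, delay ~ R*C.

COPPER_RESISTIVITY = 1.68e-8          # ohm*m, approximate value for copper
OXIDE_PERMITTIVITY = 3.9 * 8.854e-12  # F/m, approximate value for SiO2


def wire_resistance(length, width, thickness, rho=COPPER_RESISTIVITY):
    """Resistance of a rectangular wire; a thinner wire has a higher R."""
    return rho * length / (width * thickness)


def coupling_capacitance(length, thickness, spacing, eps=OXIDE_PERMITTIVITY):
    """Parallel-plate estimate between neighbours; closer wires have higher C."""
    return eps * length * thickness / spacing


def rc_delay(length, width, thickness, spacing):
    """RC product in seconds, used here as a proxy for signal delay."""
    return (wire_resistance(length, width, thickness)
            * coupling_capacitance(length, thickness, spacing))


# Hypothetical 1 mm wire: halve the width and the spacing, keep the rest.
old = rc_delay(1e-3, 200e-9, 200e-9, 200e-9)
new = rc_delay(1e-3, 100e-9, 200e-9, 100e-9)
print(f"delay grows by a factor of about {new / old:.1f}")
```

In this model, halving the width doubles R and halving the spacing doubles C, so the RC product roughly quadruples even though the wire length is unchanged.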
Figure: Intel Microprocessor Performance
Increased Cache Capacity
  - Typically two or three levels of cache between processor and main memory.
  - Chip density increased
    - More cache memory on chip
    - Faster cache access
  - Pentium chip devoted about 10% of chip area to cache.
  - Pentium 4 devotes about 50%.
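Why extra cache levels help can be sketched with the standard average-access-time calculation: each level costs its hit time plus its miss rate times the cost of the level below. The function name and all timing figures are hypothetical and are not measurements of any Pentium chip.

```python
def average_access_time(cache_levels, memory_time):
    """Average access time for a cache hierarchy.

    cache_levels: list of (hit_time, miss_rate) tuples, ordered from the
    level nearest the processor (L1) outward.
    memory_time: main-memory access time, in the same unit as hit_time.
    """
    time_below = memory_time
    # Work from the last cache level back towards the processor.
    for hit_time, miss_rate in reversed(cache_levels):
        time_below = hit_time + miss_rate * time_below
    return time_below


# Hypothetical figures in nanoseconds, chosen only for illustration.
no_cache = average_access_time([], 100.0)
l1_only = average_access_time([(1.0, 0.05)], 100.0)
l1_and_l2 = average_access_time([(1.0, 0.05), (10.0, 0.2)], 100.0)
print(no_cache, l1_only, l1_and_l2)
```

With these assumed numbers the average cost falls from 100 ns with no cache to about 6 ns with one level and about 2.5 ns with two levels.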
More Complex Execution Logic
  - Enable parallel execution of instructions
  - Pipeline works like assembly line
    - Different stages of execution of different instructions at same time along pipeline
  - Superscalar allows multiple pipelines within single processor
    - Instructions that do not depend on one another can be executed in parallel
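The assembly-line idea can be sketched as a cycle count. This is a minimal model under idealised assumptions that the lecture does not state: every stage takes one clock cycle, there are no stalls or hazards, and in the superscalar case all instructions are independent. The function names and the `issue_width` parameter are illustrative.

```python
import math


def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction passes through all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages, issue_width=1):
    """Ideal pipeline: fill it once, then retire issue_width instructions per cycle.

    issue_width > 1 models a superscalar processor with several pipelines,
    assuming the instructions do not depend on one another.
    """
    if n_instructions == 0:
        return 0
    groups = math.ceil(n_instructions / issue_width)
    return n_stages + (groups - 1)


n, k = 100, 5  # hypothetical workload: 100 instructions, 5-stage pipeline
print("unpipelined:", unpipelined_cycles(n, k))
print("pipelined:  ", pipelined_cycles(n, k))
print("superscalar:", pipelined_cycles(n, k, issue_width=2))
```

For long instruction streams the ideal speedup approaches the number of stages. Real pipelines fall short of this because of branches and data dependencies, which is where branch prediction and data flow analysis come in.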
Diminishing Returns
  - Internal organisation of processors complex
    - Can get a great deal of parallelism
    - Further significant increases likely to be relatively modest.
  - Benefits from cache are reaching limit.
  - Increasing clock rate runs into power dissipation problem.
    - Some fundamental physical limits are being reached.
New Approach: Multiple Cores
  - Multiple processors on single chip
    - Large shared cache
  - Within a processor, increase in performance proportional to square root of increase in complexity
  - If software can use multiple processors, doubling number of processors almost doubles performance
  - So, use two simpler processors on the chip rather than one more complex processor
  - With two processors, larger caches are justified
    - Power consumption of memory logic less than processing logic
  - Example: IBM POWER4
    - Two cores based on PowerPC
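The argument for two simpler cores can be checked with a little arithmetic. The sketch below takes the square-root relationship from the slide and models "almost doubles" with a hypothetical `parallel_efficiency` of 0.9. That value and the function names are assumptions for illustration, not figures from the lecture.

```python
import math


def single_core_performance(complexity):
    """Relative performance of one core: proportional to sqrt(complexity)."""
    return math.sqrt(complexity)


def multi_core_performance(n_cores, complexity_per_core=1.0,
                           parallel_efficiency=0.9):
    """Relative performance of n identical cores running parallel software.

    parallel_efficiency is a hypothetical factor below 1.0 standing in for
    the overhead that keeps doubling the cores from exactly doubling speed.
    """
    per_core = single_core_performance(complexity_per_core)
    return n_cores * parallel_efficiency * per_core


# Same budget spent two ways: one core of 2x complexity, or two simple cores.
one_complex = single_core_performance(2.0)
two_simple = multi_core_performance(2)
print(f"one core, 2x complexity: {one_complex:.2f}x")
print(f"two simple cores:        {two_simple:.2f}x")
```

Under these assumptions the single complex core gains about 1.41x while the two simple cores reach about 1.80x, provided the software can actually use both cores.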
Figure: POWER4 Chip Organization
Pentium Evolution (1)
  - 8080
    - First general purpose microprocessor
    - 8 bit data path
    - Used in first personal computer (Altair)
  - 8086
    - Much more powerful
    - 16 bit
    - Instruction cache, prefetch few instructions
    - 8088 (8 bit external bus) used in first IBM PC
  - 80286
    - 16 Mbyte memory addressable
    - Up from 1 Mbyte
  - 80386
    - 32 bit
    - Support for multitasking
Pentium Evolution (2)
  - 80486
    - Sophisticated powerful cache and instruction pipelining
    - Built in maths co-processor
  - Pentium
    - Superscalar
    - Multiple instructions executed in parallel
  - Pentium Pro
    - Increased superscalar organization
    - Aggressive register renaming
    - Branch prediction
    - Data flow analysis
    - Speculative execution
Pentium Evolution (3)
  - Pentium II
    - MMX technology
    - Graphics, video & audio processing
  - Pentium III
    - Additional floating point instructions for 3D graphics
  - Pentium 4
    - Note Arabic rather than Roman numerals
    - Further floating point and multimedia enhancements
  - Itanium
    - 64 bit
  - Itanium 2
    - Hardware enhancements to increase speed
Intel Computer Family (3)
  Figure: Moore’s law for (Intel) CPU chips.
Intel Computer Family (1)
  The Intel CPU family. Clock speeds are measured in MHz (megahertz), where 1 MHz is 1 million cycles/sec.
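The MHz definition translates directly into the length of one clock cycle. The short sketch below shows the conversion. The function name and the sample clock speeds are illustrative and are not taken from the table on the slide.

```python
def cycle_time_ns(clock_mhz):
    """Duration of one clock cycle in nanoseconds for a clock given in MHz."""
    cycles_per_second = clock_mhz * 1_000_000  # 1 MHz = 1 million cycles/sec
    return 1e9 / cycles_per_second             # 1 second = 1e9 nanoseconds


for mhz in (1, 100, 3000):  # hypothetical clock speeds
    print(f"{mhz} MHz -> {cycle_time_ns(mhz):.3f} ns per cycle")
```

A faster clock therefore means a shorter cycle: going from 1 MHz to 100 MHz shrinks the cycle from 1000 ns to 10 ns.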
PowerPC
  - 1975, 801 minicomputer project (IBM): RISC
  - Berkeley RISC I processor
  - 1986, IBM commercial RISC workstation product, RT PC.
    - Not commercial success
    - Many rivals with comparable or better performance
  - 1990, IBM RISC System/6000
    - RISC-like superscalar machine
    - POWER architecture
  - IBM alliance with Motorola (68000 microprocessors) and Apple (used 68000 in Macintosh)
  - Result is PowerPC architecture
    - Derived from the POWER architecture
    - Superscalar RISC
    - Apple Macintosh
    - Embedded chip applications
PowerPC Family (1)
  - 601:
    - Quickly to market. 32-bit machine
  - 603:
    - Low-end desktop and portable
    - 32-bit
    - Comparable performance with 601
    - Lower cost and more efficient implementation
  - 604:
    - Desktop and low-end servers
    - 32-bit machine
    - Much more advanced superscalar design
    - Greater performance
  - 620:
    - High-end servers
    - 64-bit architecture
PowerPC Family (2)
  - 740/750:
    - Also known as G3
    - Two levels of cache on chip
  - G4:
    - Increases parallelism and internal speed
  - G5:
    - Improvements in parallelism and internal speed
    - 64-bit organization
Internet Resources
  - Search for the Intel Museum
  - Charles Babbage Institute
  - PowerPC
  - Intel Developer Home
Languages, Levels, Virtual Machines
  Figure: A multilevel machine
Figure: Contemporary Multilevel Machines
Evolution of Multilevel Machines
  - Invention of microprogramming
  - Invention of operating system
  - Migration of functionality to microcode
  - Elimination of microprogramming
The Computer Spectrum
  The current spectrum of computers available.
Metric Units
  The principal metric prefixes.
Thank you
  Q & A