Computer Performance Issues: Pipelines, Parallelism, Processes and Threads

Review - The data path of a Von Neumann machine.

Review: The Fetch-Execute Cycle
1. Fetch the next instruction from memory into the instruction register
2. Change the program counter to point to the following instruction
3. Decode the type of instruction just fetched
4. If the instruction uses a word in memory, determine where it is; fetch the word, if needed, into a CPU register
5. Execute the instruction
6. Go to step 1 to begin executing the next instruction
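To make the cycle concrete, here is a minimal Python sketch of this loop for a toy accumulator machine; the instruction set (LOAD/ADD/STORE/HALT) and the memory layout are invented for illustration, not any real ISA:

```python
# Minimal fetch-decode-execute loop for a toy accumulator machine.
# The instruction set and addresses are invented for illustration.

memory = {
    0: ("LOAD", 100),   # acc = mem[100]
    1: ("ADD", 101),    # acc += mem[101]
    2: ("STORE", 102),  # mem[102] = acc
    3: ("HALT", None),
    100: 7, 101: 35, 102: 0,
}

pc, acc = 0, 0
while True:
    instr = memory[pc]          # 1. fetch instruction at PC
    pc += 1                     # 2. advance program counter
    op, addr = instr            # 3. decode opcode and operand
    if op == "LOAD":            # 4./5. fetch operand and execute
        acc = memory[addr]
    elif op == "ADD":
        acc += memory[addr]
    elif op == "STORE":
        memory[addr] = acc
    elif op == "HALT":
        break                   # 6. otherwise loop back to step 1

print(memory[102])  # 42
```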

General design principles for performance
- Have plenty of registers
- Execute instructions directly in hardware, not by software interpretation
- Make instructions easy to decode: regular, fixed length, with a small number of fields
- Access to memory takes a long time, so only Loads and Stores should reference memory
- Maximise the rate at which instructions are issued (started): instructions are always encountered in program order, but they might not be issued, or finish, in program order

Pipelining
Instruction fetch is a major bottleneck in instruction execution; early designers added a prefetch buffer, so that instructions could be fetched from memory in advance of execution. Pipelining carries this idea further: instruction execution is divided into several stages, each handled by a dedicated piece of hardware, so that different stages of successive instructions can proceed at the same time.

Instruction Fetch-Execute Cycle
In the model above, 'fetch' is performed on one clock cycle, 'decode' on the 2nd clock cycle, 'execute' on the 3rd, and 'store' of the result on the 4th (no operand memory fetch).

With Pipelining
Cycle 1: Fetch Instr 1
Cycle 2: Decode Instr 1; Fetch Instr 2
Cycle 3: Exec Instr 1; Decode Instr 2; Fetch Instr 3
Cycle 4: Store Instr 1; Exec Instr 2; Decode Instr 3; Fetch Instr 4
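The overlap above can be reproduced with a short Python sketch that prints which instruction occupies which stage on each cycle (a timing illustration only; the stage names follow the four-stage model above):

```python
# Prints the stage occupied by each instruction in a 4-stage pipeline
# (Fetch, Decode, Exec, Store), matching the cycle table above.

STAGES = ["Fetch", "Decode", "Exec", "Store"]

def pipeline_schedule(n_instructions):
    depth = len(STAGES)
    for cycle in range(1, n_instructions + depth):
        active = []
        for i in range(n_instructions):
            stage = cycle - 1 - i          # stage index instr i occupies
            if 0 <= stage < depth:
                active.append(f"{STAGES[stage]} Instr {i + 1}")
        print(f"Cycle {cycle}: " + "; ".join(active))

pipeline_schedule(4)
# Cycle 3 prints: Exec Instr 1; Decode Instr 2; Fetch Instr 3
```

Each instruction still takes four cycles, but once the pipeline is full, one instruction completes on every cycle.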

Instruction-Level Parallelism A five-stage pipeline

Instruction-Level Parallelism
The state of each stage as a function of time; nine clock cycles are illustrated. The Intel 486 had a single pipeline.

Superscalar Architectures
A processor that issues multiple instructions in a single clock cycle is called superscalar.

Superscalar Architectures (1)
Dual five-stage pipelines with a common instruction fetch unit.
- The fetch unit brings pairs of instructions to the CPU
- The two instructions must not conflict over resources (registers) and must not depend on each other
- Conflicts are detected and eliminated using extra hardware; if a conflict arises, only the first instruction is executed, and the second is paired with the next incoming instruction (see the pairing sketch below)
- This was the basis for the original Pentium, which was twice as fast as the 486
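A rough Python sketch of such a pairing check: each instruction is modeled simply as the sets of registers it reads and writes, a deliberate simplification of the Pentium's real pairing rules:

```python
# Decide whether two consecutive instructions can issue together.
# An instruction is modeled as (reads, writes) register sets -- a
# simplified stand-in for real superscalar pairing hardware.

def can_pair(first, second):
    reads1, writes1 = first
    reads2, writes2 = second
    raw = writes1 & reads2          # read-after-write dependency
    waw = writes1 & writes2         # both write the same register
    war = reads1 & writes2          # write-after-read conflict
    return not (raw or waw or war)

add = ({"r1", "r2"}, {"r3"})        # r3 = r1 + r2
sub = ({"r3", "r4"}, {"r5"})        # r5 = r3 - r4  (needs r3!)
mov = ({"r6"}, {"r7"})              # r7 = r6

print(can_pair(add, sub))  # False: sub reads r3 before add writes it
print(can_pair(add, mov))  # True: no shared registers, issue together
```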

Superscalar Architectures (2)
A superscalar processor with five functional units.
- High-end CPUs (Pentium II onwards) have one pipeline and several functional units
- Most functional units in stage S4 take much longer than one clock cycle
- S4 can therefore contain multiple functional units, keeping the issue rate high

Parallel Processing
Instruction-level parallelism, using pipelining and superscalar techniques, speeds execution up by a factor of perhaps 5 to 10. For gains of 50x and more, multiple CPUs are needed.
An array processor is a large number of identical processors under a single control unit, performing the same operations in parallel on different sets of data; it is suitable for large problems in engineering and physics. The idea is used in MMX (Multimedia eXtension) and SSE (Streaming SIMD Extensions) to speed up graphics in later Pentiums.
An array computer is also known as SIMD: Single Instruction-stream, Multiple Data-stream. The ILLIAC-IV (1972) had an array of processors, each with its own memory.
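The SIMD idea is easy to demonstrate with NumPy, whose vectorized operations apply one operation across many data elements at once (the library dispatches to SSE-style hardware where available; this is a conceptual illustration, not ILLIAC-IV code):

```python
# One "instruction" (a vectorized add) applied to many data elements
# at once -- the SIMD idea behind array processors, MMX, and SSE.
import numpy as np

a = np.arange(8, dtype=np.float32)        # [0, 1, ..., 7]
b = np.full(8, 10.0, dtype=np.float32)    # eight copies of 10.0

c = a + b   # a single operation, applied elementwise to all lanes
print(c)    # [10. 11. 12. 13. 14. 15. 16. 17.]
```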

Processor-Level Parallelism (1) An array of processors of the ILLIAC IV (1972) type.

Parallel Processing – Multiprocessors
Many full-blown CPUs accessing a common memory can lead to conflicts. In addition, many processors trying to access memory over the same bus contend for it, which can cause problems (see the sketch below).
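A software analogue of this conflict, sketched in Python: several threads updating one shared location must be serialized (here with a lock), just as CPUs contending for the same memory word must be:

```python
# Several workers updating one shared location: without serialization
# their read-modify-write sequences can interleave and lose updates,
# the software analogue of memory-access conflicts.
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:              # serialize access to the shared word
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 -- guaranteed only because of the lock
```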

Processor-Level Parallelism (2)
(a) A single-bus multiprocessor (a good example application: searching areas of a photograph for cancer cells). (b) A multicomputer with local memories.

Parallelism now
Large numbers of PCs connected by a high-speed network, called COWs (Clusters of Workstations) or server farms, can achieve a high degree of parallel processing. For example, a service such as Google takes incoming requests and 'sprays' them among its servers to be processed in parallel, as sketched below.
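A toy sketch of such 'spraying', assuming a simple round-robin policy (real front ends use far more sophisticated load balancing, and the server names here are invented):

```python
# A toy dispatcher that "sprays" incoming requests across a pool of
# servers round-robin; server names are invented for illustration.
import itertools

servers = ["node-1", "node-2", "node-3", "node-4"]
next_server = itertools.cycle(servers)

for request_id in range(6):
    print(f"request {request_id} -> {next(next_server)}")
# request 0 -> node-1, request 1 -> node-2, ... wrapping around
```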

Process and Thread
A process is a running program together with its state information: its own memory space, register values, program counter, stack pointer, PSW, and I/O status. A process can be running, waiting to run, or blocked. When a process is suspended, its state data must be saved while another process is invoked.

Processes are typically independent: they carry state information, have separate address spaces, and interact only through system-provided inter-process communication mechanisms.
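A small Python sketch of address-space separation, using the standard multiprocessing module: the child's write is invisible to the parent, because each process has its own copy of the data:

```python
# A child process gets a *copy* of the parent's data: changes it makes
# are invisible to the parent, because address spaces are separate.
import multiprocessing

value = [0]

def child():
    value[0] = 99   # modifies the child's own copy only

if __name__ == "__main__":
    p = multiprocessing.Process(target=child)
    p.start()
    p.join()
    print(value[0])  # still 0 in the parent
```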

Thread
A thread is a mini-process; it runs within the address space of its process.
- Run Excel – a process
- Run a word processor – a process
- Handle keyboard input – a high-priority thread
- Display text on screen – a high-priority thread
- Spell-checker in the word processor – a low-priority thread
The threads are created by the process and use its address space.
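By contrast with the process example above, threads share their process's address space, so a write by one thread is immediately visible to the others; a minimal sketch with Python's threading module:

```python
# Threads of one process share its address space: a write by one
# thread is immediately visible to the others, with no copying.
import threading

shared = {"text": ""}

def typist():
    shared["text"] = "hello"   # e.g. a keyboard-input thread

t = threading.Thread(target=typist)
t.start()
t.join()
print(shared["text"])  # "hello" -- seen directly by the main thread
```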

Go faster?
The clock speed of current computers may be nearing its limit, due to heat problems, so speed must be improved through parallelism at different levels. Level 1 is the on-chip level:
- Pipelining and superscalar issue: multiple instructions can be issued and executed in parallel by different functional units
- Multithreading: the CPU switches among multiple threads on an instruction-by-instruction basis, creating a virtual multiprocessor
- Multiprocessing: two or four cores on the same chip

Level 2 Parallelism – Coprocessors
Extra processing power provided by plug-in boards or dedicated processors:
- Sound and graphics (floating-point arithmetic)
- Network protocol processing
- I/O channels (I/O carried out independently of the CPU), as in the IBM 360 range

Level 3 Parallelism – Multiprocessors and Multicomputers
A multiprocessor is a parallel computer system with many CPUs, one memory space, and one operating system. A multicomputer is a parallel system consisting of many computers, each with its own CPU, memory, and OS, all connected by an interconnection network. Multicomputers are very cheap compared with multiprocessors, but multiprocessors are much easier to program (see the message-passing sketch below). Examples of multicomputers include the IBM BlueGene/L and the Google cluster.
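Multicomputer-style communication can be sketched with explicit message passing: in the toy Python example below, two processes share no memory and exchange data only over a channel (a multiprocessing Pipe standing in for the interconnection network):

```python
# Multicomputer-style communication: no shared memory, so the two
# processes exchange data explicitly as messages over a channel.
import multiprocessing

def worker(conn):
    task = conn.recv()          # receive work as a message
    conn.send(sum(task))        # send the result back as a message
    conn.close()

if __name__ == "__main__":
    parent, child = multiprocessing.Pipe()
    p = multiprocessing.Process(target=worker, args=(child,))
    p.start()
    parent.send([1, 2, 3, 4])
    print(parent.recv())        # 10
    p.join()
```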

Massively Parallel Processors (MPP) – IBM BlueGene/L
- Used for very large calculations, very large numbers of transactions per second, and data warehousing (managing immense databases)
- Thousands of standard CPUs – PowerPC 440
- Enormous I/O capability
- High fault tolerance
- 71 teraflops

Multiprocessors (a) A multiprocessor with 16 CPUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different CPU.

Multicomputers (a) A multicomputer with 16 CPUs, each with its own private memory. (b) The previous bit-map image, split up among the 16 memories.

Google (2) A typical Google cluster, with up to 5120 PCs.

Heterogeneous Multiprocessors on a Chip – DVD player
The logical structure of a simple DVD player: a heterogeneous multiprocessor with multiple cores for different functions.