Parallel Computers1 RISC and Parallel Computers Prof. Sin-Min Lee Department of Computer Science
Parallel Computers2 The Basis for RISC Use of simple instructions. A key realization of the early RISC designers was that a sequence of simple instructions produces the same results as a sequence of complex instructions, but can be implemented with a simpler (and faster) hardware design. Reduced Instruction Set Computers (RISC machines) were the result.
Parallel Computers3 Instruction Pipeline Similar to a manufacturing assembly line: 1. Fetch an instruction. 2. Decode the instruction. 3. Execute the instruction. 4. Store results. Each stage works on a different instruction simultaneously (after the initial latency), so the pipeline completes one instruction per clock cycle.
Parallel Computers4 Pipeline Stages Some processors use 3, 4, or 5 stages
Parallel Computers6 RISC characteristics Simple instruction set: in a RISC machine, the instruction set contains simple, basic instructions, from which more complex operations can be composed.
Parallel Computers7 RISC characteristics Same-length instructions: each instruction is the same length, so that it may be fetched in a single operation. One-machine-cycle instructions: most instructions complete in one machine cycle, which allows the processor to handle several instructions at the same time. This pipelining is a key technique used to speed up RISC machines.
Parallel Computers8 Instruction Pipelines The idea is to prepare the next instruction while the current instruction is still executing. A three-stage RISC pipeline is: 1. Fetch the instruction. 2. Decode and select registers. 3. Execute the instruction.
Clock:    1   2   3   4   5   6   7
Stage 1:  i1  i2  i3  i4  i5  i6  i7
Stage 2:  -   i1  i2  i3  i4  i5  i6
Stage 3:  -   -   i1  i2  i3  i4  i5
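The stage-by-clock table on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming an ideal pipeline with no stalls and a finite stream of instructions; the function name `pipeline_schedule` is made up for illustration.

```python
def pipeline_schedule(num_instructions, num_stages=3):
    """Return one row per stage; row[c] is the instruction in that stage at clock c + 1."""
    total_clocks = num_instructions + num_stages - 1  # fill time plus one clock per instruction
    rows = []
    for stage in range(num_stages):
        row = []
        for clock in range(total_clocks):
            instr = clock - stage  # 0-based index of the instruction in this stage
            row.append(f"i{instr + 1}" if 0 <= instr < num_instructions else "-")
        rows.append(row)
    return rows


# Five instructions through a three-stage pipeline take 5 + 3 - 1 = 7 clocks.
for stage, row in enumerate(pipeline_schedule(5), start=1):
    print(f"stage {stage}: " + " ".join(f"{cell:>2}" for cell in row))
```

Without pipelining the same five instructions would need 5 × 3 = 15 clocks, which is where the "one instruction per clock cycle" figure comes from once the pipeline is full.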
Parallel Computers9 RISC vs. CISC RISC processors have fewer and simpler instructions, so they are less complex and easier to design, and they allow higher clock speeds than CISC processors. However, when a high-level language is compiled, a RISC CPU needs more instructions than a CISC CPU for the same program. CISC processors are more complex, but that does not necessarily increase their cost. CISC processors are backward compatible.
Parallel Computers10 Demand for Computational Speed Continual demand for greater computational speed from a computer system than is currently possible Areas requiring great computational speed include numerical modeling and simulation of scientific and engineering problems. Computations must be completed within a “reasonable” time period.
Parallel Computers11 Grand Challenge Problems A grand challenge problem is one that cannot be solved in a reasonable amount of time with today’s computers. Obviously, an execution time of 10 years is always unreasonable. Examples: modeling large DNA structures, global weather forecasting, modeling the motion of astronomical bodies.
Parallel Computers12 Weather Forecasting Atmosphere modeled by dividing it into 3-dimensional cells. Calculations of each cell repeated many times to model passage of time.
Parallel Computers13 Global Weather Forecasting Example Suppose the whole global atmosphere is divided into cells of size 1 mile × 1 mile × 1 mile to a height of 10 miles (10 cells high): about $5 \times 10^8$ cells. Suppose each calculation requires 200 floating point operations. In one time step, $5 \times 10^8 \times 200 = 10^{11}$ floating point operations are necessary. To forecast the weather over 7 days using 1-minute intervals (about $10^4$ time steps, so about $10^{15}$ operations), a computer operating at 1 Gflops ($10^9$ floating point operations/s) takes $10^6$ seconds, or over 10 days. To perform the calculation in 5 minutes requires a computer operating at 3.4 Tflops ($3.4 \times 10^{12}$ floating point operations/s).
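The arithmetic on this slide can be checked directly. The sketch below uses only the figures stated on the slide; the variable names are chosen here for illustration.

```python
cells = 5e8                  # about 5 x 10^8 cells (from the slide)
flops_per_cell = 200         # floating point operations per cell per time step
ops_per_step = cells * flops_per_cell            # 1e11 operations per time step

steps = 7 * 24 * 60          # 7 days at 1-minute intervals = 10080 time steps
total_ops = ops_per_step * steps                 # about 1e15 operations

seconds_at_1_gflops = total_ops / 1e9            # about 1e6 seconds
days_at_1_gflops = seconds_at_1_gflops / 86400   # between 11 and 12 days
required_flops_for_5_min = total_ops / (5 * 60)  # about 3.4e12 flops

print(f"{ops_per_step:.3g} operations per time step")
print(f"{days_at_1_gflops:.1f} days at 1 Gflops")
print(f"{required_flops_for_5_min / 1e12:.2f} Tflops needed for a 5-minute forecast")
```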
Parallel Computers14 Modeling Motion of Astronomical Bodies Each body is attracted to each other body by gravitational forces. The movement of each body is predicted by calculating the total force on each body. With N bodies, there are N - 1 forces to calculate for each body, or approximately $N^2$ calculations. ($N \log_2 N$ for an efficient approximate algorithm.) After determining the new positions of the bodies, the calculations are repeated.
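To make the $N^2$ count concrete, here is a minimal sketch of the direct-sum force calculation for one time step. It is an illustration only: the function name `total_forces` and the three-body test data are assumptions, and it does not implement the efficient $N \log_2 N$ approximation mentioned on the slide.

```python
import math

G = 6.674e-11  # gravitational constant, N m^2 / kg^2


def total_forces(masses, positions):
    """Direct-sum gravity: for each body, add the force from the other N - 1 bodies."""
    n = len(masses)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    pair_evaluations = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            delta = [positions[j][k] - positions[i][k] for k in range(3)]
            dist = math.sqrt(sum(d * d for d in delta))
            magnitude = G * masses[i] * masses[j] / dist**2
            for k in range(3):
                forces[i][k] += magnitude * delta[k] / dist  # pull of body j on body i
            pair_evaluations += 1
    return forces, pair_evaluations


# Three unit masses: 3 * (3 - 1) = 6 force evaluations for one time step.
forces, count = total_forces(
    [1.0, 1.0, 1.0],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
)
print(count)
```

The double loop is what a parallel version would split across processors, for example by giving each processor a block of values of `i`.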
Parallel Computers15 A galaxy might have, say, $10^{11}$ stars. Even if each calculation is done in 1 ms (an extremely optimistic figure), it takes $10^9$ years for one iteration using the $N^2$ algorithm and almost a year for one iteration using an efficient $N \log_2 N$ approximate algorithm.
Parallel Computers16 Astrophysical N-body simulation by Scott Linssen (undergraduate UNC-Charlotte student).
Parallel Computers18 Parallel Computing Using more than one computer, or a computer with more than one processor, to solve a problem. Motives: usually faster computation. The very simple idea is that n computers operating simultaneously can achieve the result n times faster; in practice it will not be n times faster, for various reasons. Other motives include fault tolerance, a larger amount of available memory, ...
Parallel Computers20 Background Parallel computers (computers with more than one processor) and their programming (parallel programming) have been around for more than 40 years.
Parallel Computers21 Gill writes in 1958: “... There is therefore nothing new in the idea of parallel programming, but its application to computers. The author cannot believe that there will be any insuperable difficulty in extending it to computers. It is not to be expected that the necessary programming techniques will be worked out overnight. Much experimenting remains to be done. After all, the techniques that are commonly used in programming today were only won at the cost of considerable toil several years ago. In fact the advent of parallel programming may do something to revive the pioneering spirit in programming which seems at the present to be degenerating into a rather dull and routine occupation...” Gill, S. (1958), “Parallel Programming,” The Computer Journal, vol. 1, April, pp
Parallel Computers22 Trends Moore’s Law: the number of transistors per square inch in an integrated circuit doubles every 18 months. Every decade, computer performance increases by 2 orders of magnitude.
Parallel Computers23 Goal of Parallel Computing Solve bigger problems faster Often bigger is more important than faster P-fold speedups not as important! Challenge of Parallel Computing Coordinate, control, and monitor the computation
Parallel Computers24 Speedup Factor $S(p) = t_s / t_p$, where $t_s$ is the execution time on a single processor and $t_p$ is the execution time on a multiprocessor with p processors. S(p) gives the increase in speed from using the multiprocessor. For $t_s$, use the best sequential algorithm on a single-processor system. The underlying algorithm for the parallel implementation might be (and usually is) different.
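Parallel Computers25 Speedup factor can also be cast in terms of computational steps: $S(p) = \frac{\text{number of computational steps using one processor}}{\text{number of parallel computational steps with } p \text{ processors}}$. Can also extend time complexity to parallel computations.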
Parallel Computers26 Maximum Speedup Maximum speedup is usually p with p processors (linear speedup). It is possible to get superlinear speedup (greater than p), but there is usually a specific reason, such as extra memory in the multiprocessor system or a nondeterministic algorithm.
Parallel Computers27 Maximum Speedup Amdahl’s law
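The formula on this slide did not survive, so the sketch below uses the standard statement of Amdahl's law: if a fraction f of the computation must run serially, then $S(p) = p / (1 + (p - 1) f)$, which approaches $1/f$ as p grows. The symbol f and the function names are assumptions introduced here, not taken from the slides.

```python
def speedup(t_s, t_p):
    """Speedup factor S(p) = t_s / t_p from measured execution times."""
    return t_s / t_p


def amdahl_speedup(p, f):
    """Amdahl's law: speedup on p processors when a fraction f of the work is serial."""
    return p / (1 + (p - 1) * f)


# With 5% serial work the speedup can never exceed 1 / 0.05 = 20,
# no matter how many processors are used.
for p in (4, 16, 256, 4096):
    print(p, round(amdahl_speedup(p, 0.05), 2))
```

With f = 0 the formula reduces to the linear speedup p of the previous slide.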
Parallel Computers28 [Figure: Multi-Processor Computing System. Layers shown: applications, programming paradigms, threads interface, microkernel operating system, hardware. Computing elements: processors (P), processes, and threads.]
Parallel Computers29 [Figure: Two Eras of Computing. The sequential era and the parallel era each progress through architectures, system software/compilers, applications, and P.S.Es, across R & D, commercialization, and commodity phases.]
Parallel Computers30 History of Parallel Processing Parallel processing (PP) can be traced to a tablet dated around 100 BC. The tablet has 3 calculating positions. We can infer that the multiple positions were used for reliability and/or speed.
Parallel Computers31 Why Parallel Processing? Computation requirements are ever increasing: visualization, distributed databases, simulations, scientific prediction (earthquakes), etc. Sequential architectures are reaching physical limitations (speed of light, thermodynamics).
Parallel Computers32 [Figure: Computational power improvement (C.P.I.) versus number of processors, for multiprocessor and uniprocessor systems.]
Parallel Computers36 Why Parallel Processing? The technology of parallel processing is mature and can be exploited commercially; there is significant R & D work on the development of tools and environments. Significant development in networking technology is paving the way for heterogeneous computing.
Parallel Computers37 Why Parallel Processing? Hardware improvements like pipelining, superscalar execution, etc., are not scalable and require sophisticated compiler technology. Vector processing works well for certain kinds of problems.
Parallel Computers38 A parallel program has and needs: multiple “processes” active simultaneously solving a given problem, generally on multiple processors; and communication and synchronization among its processes (this forms the core of parallel programming efforts).
Parallel Computers39 Parallelism in Uniprocessor Systems A computer achieves parallelism when it performs two or more unrelated tasks simultaneously
Parallel Computers40 Uniprocessor Systems A uniprocessor system may incorporate parallelism using: an instruction pipeline, a fixed or reconfigurable arithmetic pipeline, I/O processors, vector arithmetic units, and multiport memory.
Parallel Computers41 Uniprocessor Systems Instruction pipeline: overlaps the fetching, decoding, and execution of instructions, which allows the CPU to execute one instruction per clock cycle.
Parallel Computers42 Uniprocessor Systems Reconfigurable Arithmetic Pipeline: Better suited for general-purpose computing. Each stage has a multiplexer at its input. The control unit of the CPU sets which data each multiplexer selects, thereby configuring the pipeline. Problem: although arithmetic pipelines can perform many iterations of the same operation in parallel, they cannot perform different operations simultaneously.
Parallel Computers43 Uniprocessor Systems Vectored Arithmetic Unit: Provides a solution to the reconfigurable arithmetic pipeline problem. Purpose: to perform different arithmetic operations in parallel.
Parallel Computers44 Uniprocessor Systems Vectored Arithmetic Unit (cont.): Contains multiple functional units; some perform addition, others subtraction, etc. Input and output switches are needed to route the proper data to their proper destinations. The switches are set by the control unit.
Parallel Computers45 Uniprocessor Systems Vectored Arithmetic Unit (cont.): How do we get all that data to the vector arithmetic unit? By transferring several data values simultaneously, using multiple buses or very wide data buses.
Parallel Computers46 Uniprocessor Systems Improve performance by allowing multiple, simultaneous memory accesses. This requires multiple address, data, and control buses (one set for each simultaneous memory access), and the memory chip has to be able to handle multiple transfers simultaneously.
Parallel Computers47 Uniprocessor Systems Multiport Memory: Has two sets of address, data, and control pins to allow simultaneous data transfers to occur. The CPU and DMA controller can transfer data concurrently. A system with more than one CPU could handle simultaneous requests from two different processors.
Parallel Computers48 Uniprocessor Systems Multiport Memory (cont.): Can: handle two requests to read data from the same location at the same time. Cannot: process two simultaneous requests to write data to the same memory location, or simultaneous requests to read from and write to the same memory location.
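The can/cannot rules for a two-port memory fit in one small predicate. This is a sketch of the rule as stated on the slide, not a model of any particular memory chip; the request format `(operation, address)` and the function name are assumptions made for illustration.

```python
def can_service_together(req_a, req_b):
    """Decide whether a two-port memory can handle both requests in the same cycle.

    Each request is a tuple (operation, address) with operation "read" or "write".
    """
    op_a, addr_a = req_a
    op_b, addr_b = req_b
    if addr_a != addr_b:
        return True  # different locations: the two ports work independently
    # Same location: only two simultaneous reads are allowed.
    return op_a == "read" and op_b == "read"


print(can_service_together(("read", 0x10), ("read", 0x10)))    # two reads, same location
print(can_service_together(("write", 0x10), ("write", 0x10)))  # two writes, same location
print(can_service_together(("read", 0x10), ("write", 0x10)))   # read and write, same location
```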
Parallel Computers49 Multiprocessors [Figure: two CPUs, memory, and a device controller with an I/O port, all attached to a shared bus.]
Parallel Computers50 Multiprocessors Systems designed to have 2 to 8 CPUs. The CPUs all share the other parts of the computer: memory, disk, system bus, etc. CPUs communicate via memory and the system bus.
Parallel Computers51 MultiProcessors Each CPU shares memory, disks, etc. Cheaper than clusters, but not as good performance as clusters. Often used for small servers and high-end workstations.
Parallel Computers52 MultiProcessors The OS automatically shares work among the available CPUs. On a workstation, one CPU can be running an engineering design program while another CPU is doing complex graphics formatting.
Parallel Computers53 Applications of Parallel Computers Traditionally: government labs, numerically intensive applications, research institutions. Recent growth in industrial applications: 236 of the top 500. Financial analysis, drug design and analysis, oil exploration, aerospace and automotive.
Parallel Computers54 Multiprocessor Systems Flynn’s Classification Single instruction multiple data (SIMD): Executes a single instruction on multiple data values simultaneously using many processors. Since only one instruction is processed at any given time, it is not necessary for each processor to fetch and decode the instruction. This task is handled by a single control unit that sends the control signals to each processor. Example: array processor. [Figure: main memory, control unit, processors with local memory, communications network.]
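The SIMD idea can be imitated in a few lines: one instruction is decoded once, then applied to every processor's local operands. The sketch below is a sequential simulation for illustration only; real SIMD hardware performs the per-element work at the same time. The function name and the three-instruction set are assumptions.

```python
import operator

# The "control unit" decodes an instruction name into a single operation.
INSTRUCTIONS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def simd_execute(instruction, local_a, local_b):
    """Apply one instruction to the operand pair held by each processing element."""
    op = INSTRUCTIONS[instruction]  # fetched and decoded once, not once per processor
    return [op(a, b) for a, b in zip(local_a, local_b)]


# Four processing elements, each holding one element of two vectors.
print(simd_execute("add", [1, 2, 3, 4], [10, 20, 30, 40]))
print(simd_execute("mul", [1, 2, 3, 4], [10, 20, 30, 40]))
```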
Parallel Computers55 Why Multiprocessors? 1. Microprocessors are the fastest CPUs; collecting several is much easier than redesigning one. 2. Complexity of current microprocessors: do we have enough ideas to sustain 1.5X/yr? Can we deliver such complexity on schedule? 3. Slow (but steady) improvement in parallel software (scientific apps, databases, OS). 4. Emergence of embedded and server markets driving microprocessors in addition to desktops: embedded systems have functional parallelism and a producer/consumer model; the server figure of merit is tasks per hour rather than latency.
Parallel Computers56 Parallel Processing Intro Long-term goal of the field: scale the number of processors to the size of the budget and the desired performance. Machines today: Sun Enterprise (8/00): MHz UltraSPARC® II CPUs, 64 GB SDRAM memory, GB disk, tape; $4,720,800 total. Cost breakdown: 64 CPUs 15%, 64 GB DRAM 11%, disks 55%, cabinet 16% ($10,800 per processor, or ~0.2% per processor). Minimal E10K (1 CPU, 1 GB DRAM, 0 disks, tape): ~$286,700; $10,800 (4%) per CPU, plus $39,600 per board of 4 CPUs (~8%/CPU). Machines today: Dell Workstation 220 (2/01): 866 MHz Intel Pentium® III (in Minitower), GB RDRAM memory, 1 10 GB disk, 12X CD, 17” monitor, nVIDIA GeForce 2 GTS 32 MB DDR graphics card, 1 yr service; $1,600; for an extra processor, add $350 (~20%).
Parallel Computers57 Major MIMD Styles 1. Centralized shared memory ("Uniform Memory Access" time or "Shared Memory Processor"). 2. Decentralized memory (a memory module with each CPU): gets more memory bandwidth and lower memory latency. Drawback: longer communication latency. Drawback: the software model is more complex.
Parallel Computers58 Organization of Multiprocessor Systems Three different ways to organize/classify systems: Flynn’s classification, system topologies, and MIMD system architectures.
Parallel Computers59 Multiprocessor Systems Flynn’s Classification Flynn’s Classification: Based on the flow of instructions and data processing. A computer is classified by whether it processes a single instruction at a time or multiple instructions simultaneously, and by whether it operates on one or multiple data sets.
Parallel Computers60 Multiprocessor Systems Flynn’s Classification Four Categories of Flynn’s Classification:
SISD: Single instruction single data
SIMD: Single instruction multiple data
MISD: Multiple instruction single data **
MIMD: Multiple instruction multiple data
** The MISD classification is not practical to implement. In fact, no significant MISD computers have ever been built. It is included only for completeness.
Parallel Computers61 Multiprocessor Systems Flynn’s Classification Single instruction single data (SISD): Consists of a single CPU executing individual instructions on individual data values
Parallel Computers62 Multiprocessor Systems Flynn’s Classification Multiple instruction multiple data (MIMD): Executes different instructions simultaneously. Each processor must include its own control unit. The processors can be assigned to parts of the same task or to completely separate tasks. Example: multiprocessors, multicomputers.
Parallel Computers63 Popular Flynn Categories
SISD (Single Instruction Single Data): uniprocessors.
MISD (Multiple Instruction Single Data): ???; multiple processors on a single data stream.
SIMD (Single Instruction Multiple Data): examples are Illiac-IV and CM-2. Simple programming model; low overhead; flexibility; all custom integrated circuits. (Phrase reused by Intel marketing for media instructions ~ vector.)
MIMD (Multiple Instruction Multiple Data): examples are Sun Enterprise 5000, Cray T3D, and SGI Origin. Flexible; uses off-the-shelf micros.
MIMD is the current winner: the major design emphasis is on MIMD machines with <= 128 processors.
Parallel Computers64 Multiprocessor Systems System Topologies: The topology of a multiprocessor system refers to the pattern of connections between its processors. It is quantified by standard metrics:
Diameter: the maximum distance between two processors in the computer system.
Bandwidth: the capacity of a communications link multiplied by the number of such links in the system (best case).
Bisectional bandwidth: the total bandwidth of the links connecting the two halves of the system when the processors are split so that the number of links between the two halves is minimized (worst case).
Parallel Computers65 Multiprocessor Systems System Topologies Six Categories of System Topologies: shared bus, ring, tree, mesh, hypercube, completely connected.
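The diameter metric can be computed for any topology by running a breadth-first search from every processor. This is a minimal sketch, assuming distance is measured in link hops and the network is connected; the helper names `diameter`, `ring`, and `completely_connected` are chosen here for illustration.

```python
from collections import deque


def diameter(adjacency):
    """Longest shortest-path distance, in hops, between any two processors."""

    def farthest_from(start):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        return max(dist.values())

    return max(farthest_from(node) for node in adjacency)


def ring(n):
    """Each processor is linked to its two neighbors on the ring."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def completely_connected(n):
    """Each processor is linked to all n - 1 others."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


print(diameter(ring(8)))                  # 4: halfway around the ring
print(diameter(completely_connected(8)))  # 1: every pair is directly linked
```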
Parallel Computers66 Multiprocessor Systems System Topologies Shared bus: The simplest topology. Processors communicate with each other exclusively via this bus, which can handle only one data transmission at a time. It can be easily expanded by connecting additional processors to the shared bus, along with the necessary bus arbitration circuitry. [Figure: processors (P), each with local memory (M), attached to a shared bus with a global memory.]
Parallel Computers67 Multiprocessor Systems System Topologies Ring: Uses direct dedicated connections between processors, which allows all communication links to be active simultaneously. A piece of data may have to travel through several processors to reach its final destination. All processors must have two communication links.
Parallel Computers68 Multiprocessor Systems System Topologies Tree topology: Uses direct connections between processors. Each processor has three connections. Its primary advantage is its relatively low diameter. Example: DADO Computer.
Parallel Computers69 Multiprocessor Systems System Topologies Mesh topology: Every processor connects to the processors above, below, left, and right. Left-to-right and top-to-bottom wraparound connections may or may not be present.
Parallel Computers70 Multiprocessor Systems System Topologies Hypercube: A multidimensional mesh. Has n processors, each with $\log_2 n$ connections.
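A common way to picture the hypercube is to label the n = 2^d processors with d-bit numbers and link two processors when their labels differ in exactly one bit. Under that labeling (an assumption made here; the slide does not give one), each processor has d = log2 n neighbors, and the hop count between two processors is the number of differing bits. The function names are illustrative.

```python
def hypercube_neighbors(node, dimensions):
    """Neighbors of `node` in a hypercube with 2**dimensions processors."""
    return [node ^ (1 << bit) for bit in range(dimensions)]  # flip one label bit at a time


def hops(a, b):
    """Minimum number of links between a and b: the number of differing label bits."""
    return bin(a ^ b).count("1")


dimensions = 3  # 2**3 = 8 processors, 3 connections each
print(hypercube_neighbors(5, dimensions))                      # [4, 7, 1]
print(max(hops(0, other) for other in range(2**dimensions)))   # 3
```

The second result shows that the farthest processor from processor 0 is 3 hops away, which is log2 of 8.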
Parallel Computers71 Multiprocessor Systems System Topologies Completely Connected: Every processor has n-1 connections, one to each of the other processors. The complexity of the processors increases as the system grows. Offers maximum communication capabilities.
Parallel Computers72 Architecture Details [Figure: the world’s simplest computer is a processor (P) with memory (M); a standard computer adds cache (C) and disk (D); an MPP connects many such computers through a network.]
Parallel Computers73 A Supercomputer at $5.2 million Virginia Tech’s 1,100-node Mac G5 supercomputer
Parallel Computers74 The Virginia Polytechnic Institute and State University has built a supercomputer composed of a cluster of 1,100 dual-processor Macintosh G5 computers. Based on preliminary benchmarks, Big Mac is capable of 8.1 teraflops. The Mac supercomputer is still being fine-tuned, and the full extent of its computing power will not be known until November. But the 8.1 teraflops figure would make the Big Mac the world's fourth fastest supercomputer.
Parallel Computers75 Big Mac's cost relative to similar machines is as noteworthy as its performance. The Apple supercomputer was constructed for just over US$5 million, and the cluster was assembled in about four weeks. In contrast, the world's leading supercomputers cost well over $100 million to build and require several years to construct. The Earth Simulator, which clocked in at 38.5 teraflops in 2002, reportedly cost up to $250 million.
Parallel Computers76 Srinidhi Varadarajan, Ph.D. Dr. Srinidhi Varadarajan is an Assistant Professor of Computer Science at Virginia Tech. He was honored with the NSF Career Award in 2002 for "Weaving a Code Tapestry: A Compiler Directed Framework for Scalable Network Emulation." He has focused his research on building a distributed network emulation system that can scale to emulate hundreds of thousands of virtual nodes. October Time: 7:30pm - 9:00pm Location: Santa Clara Ballroom
Parallel Computers77 Parallel Computers Two common types: cluster and multiprocessor.
Parallel Computers78 Cluster Computers
Parallel Computers79 Clusters on the Rise Using clusters of small machines to build a supercomputer is not a new concept. Another of the world's top machines, housed at the Lawrence Livermore National Laboratory, was constructed from 2,304 Xeon processors. The machine was built by Utah-based Linux Networx. Clustering technology has meant that traditional big-iron leaders like Cray (Nasdaq: CRAY) and IBM have new competition from makers of smaller machines. Dell (Nasdaq: DELL), among other companies, has sold high-powered computing clusters to research institutions.
Parallel Computers80 Cluster Computers Each computer in a cluster is a complete computer by itself: CPU, memory, disk, etc. The computers communicate with each other via some interconnection bus.
Parallel Computers81 Cluster Computers Typically used where one computer does not have enough capacity to do the expected work, for example large servers. Cheaper than building one GIANT computer.
Parallel Computers82 Although not new, supercomputing clustering technology is still impressive. It works by farming out chunks of data to individual machines, and it works better for some types of computing problems than others. For example, a cluster would not be ideal for competing against IBM's Deep Blue supercomputer in a chess match; in that case, all the data must be available to one processor at the same moment, much as the human brain handles tasks. However, a cluster would be ideal for the processing of seismic data for oil exploration, because that computing job can be divided into many smaller tasks.
Parallel Computers83 Cluster Computers Need to break up work among the computers in the cluster. Example: Microsoft.com search engine. 6 computers run SQL Server, and each has a copy of the MS Knowledge Base. Search requests come to one computer, which sends each request to one of the 6 and attempts to keep all 6 busy.
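One simple way for a front-end computer to keep all 6 servers busy is to hand each new request to the server with the fewest requests in progress. The slide does not say which policy the Microsoft.com system used, so the least-loaded rule and the `Dispatcher` class below are hypothetical, shown only to illustrate how work can be broken up across a cluster.

```python
class Dispatcher:
    """Front end that forwards each request to the least-loaded back-end server."""

    def __init__(self, num_servers=6):
        self.active = [0] * num_servers  # requests currently in progress on each server

    def assign(self):
        """Pick the server with the fewest active requests (lowest index wins ties)."""
        server = min(range(len(self.active)), key=self.active.__getitem__)
        self.active[server] += 1
        return server

    def finish(self, server):
        """Record that a server has completed one request."""
        self.active[server] -= 1


dispatcher = Dispatcher()
first_six = [dispatcher.assign() for _ in range(6)]
print(first_six)            # each of the 6 servers receives one request
dispatcher.finish(3)        # server 3 finishes first
print(dispatcher.assign())  # so the next request goes to server 3
```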
Parallel Computers84 The Virginia Tech Mac supercomputer should be fully functional and in use by January. It will be used for research into nanoscale electronics, quantum chemistry, computational chemistry, aerodynamics, molecular statics, computational acoustics and the molecular modeling of proteins.