Published by Eustacia Harris. Modified over 8 years ago.
1
Parallel Computers
Prof. Sin-Min Lee
Department of Computer Science
2
Uniprocessor Systems
Improve performance:
- Allowing multiple, simultaneous memory access
  - Requires multiple address, data, and control buses (one set for each simultaneous memory access)
  - The memory chip has to be able to handle multiple transfers simultaneously
3
Uniprocessor Systems
Multiport Memory:
- Has two sets of address, data, and control pins to allow simultaneous data transfers to occur
- CPU and DMA controller can transfer data concurrently
- A system with more than one CPU could handle simultaneous requests from two different processors
4
Uniprocessor Systems
Multiport Memory (cont.):
Can:
- Handle two requests to read data from the same location at the same time
Cannot:
- Process two simultaneous requests to write data to the same memory location
- Process simultaneous requests to read from and write to the same memory location
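The can/cannot rules for multiport memory can be sketched as a small check. This is only an illustration: the type names, the function name, and the memory size `MEMORY_WORDS` are assumptions made for the example, not part of any real memory controller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical size of the dual-port memory, in words. */
#define MEMORY_WORDS 1024

/* Kind of access a port requests in the current cycle. */
typedef enum { ACCESS_READ, ACCESS_WRITE } access_kind;

/* One request arriving on one of the two ports. */
typedef struct {
    access_kind kind;
    size_t address;
} port_request;

/* Returns true if both requests can be served in the same cycle:
 * different locations never conflict, and the same location is
 * allowed only when both ports are reading. */
bool can_serve_together(port_request a, port_request b)
{
    assert(a.address < MEMORY_WORDS && b.address < MEMORY_WORDS);
    if (a.address != b.address)
        return true;
    return a.kind == ACCESS_READ && b.kind == ACCESS_READ;
}
```

Two reads of the same word pass the check; a write paired with any other access to the same word does not, so a real controller would have to delay one of the two requests.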
5
Multiprocessors
[Diagram labels: I/O Port, Device Controller, CPU, Bus, Memory, CPU]
7
Multiprocessors
- Systems designed to have 2 to 8 CPUs
- The CPUs all share the other parts of the computer
  - Memory
  - Disk
  - System Bus
  - etc.
- CPUs communicate via Memory and the System Bus
8
Multiprocessors
- Each CPU shares memory, disks, etc.
- Cheaper than clusters
- Not as good performance as clusters
- Often used for
  - Small Servers
  - High-end Workstations
9
Multiprocessors
- OS automatically shares work among available CPUs
- On a workstation…
  - One CPU can be running an engineering design program
  - Another CPU can be doing complex graphics formatting
10
Applications of Parallel Computers
- Traditionally: government labs, numerically intensive applications
- Research Institutions
- Recent Growth in Industrial Applications
  - 236 of the top 500
  - Financial analysis, drug design and analysis, oil exploration, aerospace and automotive
11
Multiprocessor Systems
Flynn’s Classification
Single instruction multiple data (SIMD):
- Executes a single instruction on multiple data values simultaneously using many processors
- Since only one instruction is processed at any given time, it is not necessary for each processor to fetch and decode the instruction
- This task is handled by a single control unit that sends the control signals to each processor
- Example: Array processor
[Diagram labels: Main Memory, Control Unit, Processor, Memory, Communications Network]
12
Why Multiprocessors?
1. Microprocessors as the fastest CPUs
  - Collecting several much easier than redesigning 1
2. Complexity of current microprocessors
  - Do we have enough ideas to sustain 1.5X/yr?
  - Can we deliver such complexity on schedule?
3. Slow (but steady) improvement in parallel software (scientific apps, databases, OS)
4. Emergence of embedded and server markets driving microprocessors in addition to desktops
  - Embedded functional parallelism, producer/consumer model
  - Server figure of merit is tasks per hour vs. latency
13
Parallel Processing Intro
- Long term goal of the field: scale number of processors to size of budget, desired performance
- Machines today: Sun Enterprise 10000 (8/00)
  - 64 × 400 MHz UltraSPARC® II CPUs, 64 GB SDRAM memory, 868 × 18 GB disks, tape
  - $4,720,800 total
  - 64 CPUs 15%, 64 GB DRAM 11%, disks 55%, cabinet 16% ($10,800 per processor or ~0.2% per processor)
  - Minimal E10K: 1 CPU, 1 GB DRAM, 0 disks, tape ~$286,700
  - $10,800 (4%) per CPU, plus $39,600 board/4 CPUs (~8%/CPU)
- Machines today: Dell Workstation 220 (2/01)
  - 866 MHz Intel Pentium® III (in Minitower)
  - 0.125 GB RDRAM memory, 1 × 10 GB disk, 12X CD, 17” monitor, nVIDIA GeForce 2 GTS, 32 MB DDR graphics card, 1 yr service
  - $1,600; for extra processor, add $350 (~20%)
14
Major MIMD Styles
1. Centralized shared memory ("Uniform Memory Access" time or "Shared Memory Processor")
2. Decentralized memory (memory module with CPU)
  - Get more memory bandwidth, lower memory latency
  - Drawback: Longer communication latency
  - Drawback: Software model more complex
16
Organization of Multiprocessor Systems
Three different ways to organize/classify systems:
- Flynn’s Classification
- System Topologies
- MIMD System Architectures
17
Multiprocessor Systems
Flynn’s Classification:
- Based on the flow of instructions and data processing
- A computer is classified by:
  - whether it processes a single instruction at a time or multiple instructions simultaneously
  - whether it operates on one or multiple data sets
18
Multiprocessor Systems
Flynn’s Classification
Four Categories of Flynn’s Classification:
- SISD: Single instruction single data
- SIMD: Single instruction multiple data
- MISD: Multiple instruction single data **
- MIMD: Multiple instruction multiple data
** The MISD classification is not practical to implement. In fact, no significant MISD computers have ever been built. It is included only for completeness.
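The four categories follow mechanically from two counts, so the classification can be sketched as a small function. The enum and function names here are assumptions chosen for the illustration.

```c
#include <assert.h>

/* The four categories of Flynn's classification. */
typedef enum { SISD, SIMD, MISD, MIMD } flynn_class;

/* Classify a machine by how many instruction streams and how many
 * data streams it processes at the same time. */
flynn_class flynn_classify(int instruction_streams, int data_streams)
{
    assert(instruction_streams >= 1 && data_streams >= 1);
    if (instruction_streams == 1)
        return data_streams == 1 ? SISD : SIMD;
    return data_streams == 1 ? MISD : MIMD;
}
```

For example, a uniprocessor corresponds to `flynn_classify(1, 1)`, and an array processor with one control unit driving many processing elements corresponds to `flynn_classify(1, 64)`.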
19
From the beginning of time, computer scientists have been challenging computers with larger and larger problems. Eventually, computer processors were combined in parallel to work on the same task. This is parallel processing.
Types Of Parallel Processing
- SISD: Single Instruction stream, Single Data stream
- MISD: Multiple Instruction stream, Single Data stream
- SIMD: Single Instruction stream, Multiple Data stream
- MIMD: Multiple Instruction stream, Multiple Data stream
20
SISD
One piece of data is sent to one processor.
Ex: To multiply one hundred numbers by the number three, each number would be sent to the processor and multiplied in turn until all one hundred results were calculated.
[Diagram labels: Data, Multiply, CPU]
21
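MISD
Several processors each apply their own instruction stream to a single data stream.
Ex (as given on the slide; because it splits the data among processors that all run the same search, it is closer to data parallelism than to strict MISD): A database is broken up into sections of records and sent to several different processors, each of which searches its section for a specific key.
[Diagram labels: Data, Search, CPU]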
22
SIMD
Multiple processors execute the same instruction on separate data.
Ex: A SIMD machine with 100 processors could multiply 100 numbers, each by the number three, at the same time.
[Diagram labels: Multiply, CPU, Data]
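The difference between the SISD and SIMD multiplication examples comes down to how many steps are needed. A minimal sketch, assuming every processor handles one number per step and ignoring the cost of distributing the data (the function name is made up for the illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Number of lock-step rounds needed to apply one operation to `items`
 * values when `processors` processing elements each handle one value
 * per round. This is the ceiling of items / processors. */
size_t parallel_steps(size_t items, size_t processors)
{
    assert(processors > 0);
    return (items + processors - 1) / processors;
}
```

With one processor, 100 numbers take 100 steps; with 100 processors they take a single step. With 8 processors the work takes 13 rounds, and some processors sit idle in the last one.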
23
MIMD
Multiple processors execute different instructions on separate data. This is the most complex form of parallel processing. It is used on complex simulations like modeling the growth of cities.
[Diagram labels: Multiply, CPU, Data, Search, Add, Subtract]
24
The Granddaddy of Parallel Processing MIMD
25
MIMD computers usually have a different program running on every processor. This makes for a very complex programming environment. What processor? Doing which task? At what time? What’s doing what when?
26
Memory latency: the time between issuing a memory fetch and receiving the response. Simply put, if execution proceeds before the memory request responds, unexpected results will occur. What values are being used? Not the ones requested!
27
A similar problem can occur with instruction executions themselves.
Synchronization: the need to enforce the ordering of instruction executions according to their data dependencies. Instruction b must occur before instruction a.
28
Despite potential problems, MIMD can prove larger than life.
MIMD Successes
IBM Deep Blue: Computer beats professional chess player. Some may not consider this to be a fair example, because Deep Blue was built to beat Kasparov alone. It “knew” his play style, so it could counter his projected moves. Still, Deep Blue’s win marked a major victory for computing.
29
IBM’s latest, a supercomputer that models nuclear explosions. IBM Poughkeepsie built the world’s fastest supercomputer for the U.S. Department of Energy. Its job was to model nuclear explosions.
30
MIMD: it is the most complex, fastest, and most flexible parallel paradigm. It has beaten a world-class chess player at his own game. It models things that few people understand. It is parallel processing at its finest.
31
Multiprocessor Systems
Flynn’s Classification
Single instruction single data (SISD):
- Consists of a single CPU executing individual instructions on individual data values
32
Multiprocessor Systems
Flynn’s Classification
Multiple instruction multiple data (MIMD):
- Executes different instructions simultaneously
- Each processor must include its own control unit
- The processors can be assigned to parts of the same task or to completely separate tasks
- Example: Multiprocessors, multicomputers
33
Popular Flynn Categories
- SISD (Single Instruction Single Data)
  - Uniprocessors
- MISD (Multiple Instruction Single Data)
  - ???; multiple processors on a single data stream
- SIMD (Single Instruction Multiple Data)
  - Examples: Illiac-IV, CM-2
    - Simple programming model
    - Low overhead
    - Flexibility
    - All custom integrated circuits
  - (Phrase reused by Intel marketing for media instructions ~ vector)
- MIMD (Multiple Instruction Multiple Data)
  - Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
    - Flexible
    - Use off-the-shelf micros
- MIMD current winner: Concentrate on major design emphasis <= 128 processor MIMD machines
34
Multiprocessor Systems
System Topologies:
- The topology of a multiprocessor system refers to the pattern of connections between its processors
- Quantified by standard metrics:
  - Diameter: The maximum distance between two processors in the computer system
  - Bandwidth: The capacity of a communications link multiplied by the number of such links in the system (best case)
  - Bisectional Bandwidth: The total bandwidth of the links connecting the two halves of the processor split so that the number of links between the two halves is minimized (worst case)
35
Multiprocessor Systems
System Topologies
Six Categories of System Topologies:
- Shared bus
- Ring
- Tree
- Mesh
- Hypercube
- Completely Connected
37
Multiprocessor Systems
System Topologies
Shared bus:
- The simplest topology
- Processors communicate with each other exclusively via this bus
- Can handle only one data transmission at a time
- Can be easily expanded by connecting additional processors to the shared bus, along with the necessary bus arbitration circuitry
[Diagram labels: Shared Bus, Global Memory, and three processor (P) / memory (M) pairs]
41
Multiprocessor Systems
System Topologies
Ring:
- Uses direct dedicated connections between processors
- Allows all communication links to be active simultaneously
- A piece of data may have to travel through several processors to reach its final destination
- All processors must have two communication links
[Diagram: six processors (P) connected in a ring]
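How far a piece of data travels in a ring can be sketched as follows. The sketch assumes processors are numbered 0 to n-1 around the ring and that each link carries traffic in both directions; neither detail is stated on the slide, and the function names are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Fewest links a message crosses between processors a and b in a
 * bidirectional ring of n processors numbered 0..n-1. */
size_t ring_hops(size_t a, size_t b, size_t n)
{
    assert(n > 0 && a < n && b < n);
    size_t one_way = a > b ? a - b : b - a;  /* going one way around */
    size_t other_way = n - one_way;          /* going the other way  */
    return one_way < other_way ? one_way : other_way;
}

/* Diameter of the ring: the largest hop count between any two
 * processors, reached by processors on opposite sides. */
size_t ring_diameter(size_t n)
{
    assert(n > 0);
    return n / 2;
}
```

In a six-processor ring, neighbours are one hop apart and opposite processors are three hops apart, so the diameter grows linearly with the number of processors.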
42
Multiprocessor Systems
System Topologies
Tree topology:
- Uses direct connections between processors
- Each processor has up to three connections (one to its parent and two to its children)
- Its primary advantage is its relatively low diameter
- Example: DADO Computer
[Diagram: seven processors (P) arranged as a three-level binary tree]
46
Multiprocessor Systems
System Topologies
Mesh topology:
- Every processor connects to the processors above, below, left, and right
- Left to right and top to bottom wraparound connections may or may not be present
[Diagram: nine processors (P) in a 3 × 3 grid]
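The effect of the optional wraparound connections shows up in how many links each processor has. A minimal sketch, assuming a grid of at least 3 × 3 so that wraparound links never duplicate an existing neighbour (the function name and parameters are assumptions for the illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of direct connections of the processor at (row, col) in a
 * rows x cols mesh. With wraparound every processor has four
 * neighbours; without it, edge and corner processors have fewer. */
unsigned mesh_degree(size_t rows, size_t cols, size_t row, size_t col,
                     bool wraparound)
{
    assert(rows >= 3 && cols >= 3 && row < rows && col < cols);
    if (wraparound)
        return 4;
    unsigned degree = 0;
    if (row > 0) degree++;         /* neighbour above        */
    if (row + 1 < rows) degree++;  /* neighbour below        */
    if (col > 0) degree++;         /* neighbour to the left  */
    if (col + 1 < cols) degree++;  /* neighbour to the right */
    return degree;
}
```

In the 3 × 3 case without wraparound, only the centre processor has all four connections; a corner processor has two and an edge processor has three.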
49
Multiprocessor Systems
System Topologies
Hypercube:
- Multidimensional mesh
- Has n processors, each with log2 n connections
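One common way to wire a hypercube, assumed here since the slide does not give a numbering scheme, is to number the n = 2^d processors in binary and connect two processors exactly when their numbers differ in one bit. A sketch with made-up function names:

```c
#include <assert.h>

/* Neighbour of processor `id` along dimension `bit` in a hypercube
 * with `dimensions` dimensions (2^dimensions processors). */
unsigned hypercube_neighbor(unsigned id, unsigned dimensions, unsigned bit)
{
    assert(dimensions > 0 && dimensions < 32);
    assert(bit < dimensions && id < (1u << dimensions));
    return id ^ (1u << bit);  /* flip exactly one address bit */
}

/* Fewest links between a and b: the number of bits in which their
 * binary numbers differ. */
unsigned hypercube_hops(unsigned a, unsigned b)
{
    unsigned differing = a ^ b;
    unsigned hops = 0;
    while (differing != 0) {
        hops += differing & 1u;
        differing >>= 1;
    }
    return hops;
}
```

Under this numbering each processor has one neighbour per dimension, which gives the log2 n connections per processor, and the diameter is also log2 n, because at most that many bits can differ.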
52
Multiprocessor Systems
System Topologies
Completely Connected:
- Every processor has n-1 connections, one to each of the other processors
- The complexity of the processors increases as the system grows
- Offers maximum communication capabilities
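The growth in complexity for a completely connected system can be made concrete by counting links. A small sketch with assumed function names:

```c
#include <assert.h>
#include <stddef.h>

/* Connections each processor needs in a completely connected system
 * of n processors: one to every other processor. */
size_t complete_links_per_processor(size_t n)
{
    assert(n >= 1);
    return n - 1;
}

/* Total number of links: each of the n processors has n-1 links, and
 * every link is shared by two processors. */
size_t complete_total_links(size_t n)
{
    assert(n >= 1);
    return n * (n - 1) / 2;
}
```

Eight processors already need 28 links in total, and each added processor requires one more connection on every existing processor.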
53
Architecture Details
Computers, MPPs
- P M: World’s simplest computer (processor/memory)
- P M C D: Standard computer (add cache, disk)
- [Diagram: three P M C D computers connected by a Network]
54
A Supercomputer at $5.2 million
Virginia Tech’s 1,100-node Mac G5 supercomputer
55
The Virginia Polytechnic Institute and State University has built a supercomputer composed of a cluster of 1,100 dual-processor Macintosh G5 computers. Based on preliminary benchmarks, Big Mac is capable of 8.1 teraflops. The Mac supercomputer is still being fine-tuned, and the full extent of its computing power will not be known until November. But the 8.1 teraflops figure would make the Big Mac the world's fourth fastest supercomputer.
56
Big Mac's cost relative to similar machines is as noteworthy as its performance. The Apple supercomputer was constructed for just over US$5 million, and the cluster was assembled in about four weeks. In contrast, the world's leading supercomputers cost well over $100 million to build and require several years to construct. The Earth Simulator, which clocked in at 38.5 teraflops in 2002, reportedly cost up to $250 million.
57
Srinidhi Varadarajan, Ph.D.
Dr. Srinidhi Varadarajan is an Assistant Professor of Computer Science at Virginia Tech. He was honored with the NSF Career Award in 2002 for "Weaving a Code Tapestry: A Compiler Directed Framework for Scalable Network Emulation." He has focused his research on building a distributed network emulation system that can scale to emulate hundreds of thousands of virtual nodes.
Date: October 28, 2003
Time: 7:30pm - 9:00pm
Location: Santa Clara Ballroom
58
Parallel Computers
- Two common types
  - Cluster
  - Multi-Processor
59
Cluster Computers
60
Clusters on the Rise
Using clusters of small machines to build a supercomputer is not a new concept. Another of the world's top machines, housed at the Lawrence Livermore National Laboratory, was constructed from 2,304 Xeon processors. The machine was built by Utah-based Linux Networx. Clustering technology has meant that traditional big-iron leaders like Cray (Nasdaq: CRAY) and IBM have new competition from makers of smaller machines. Dell (Nasdaq: DELL), among other companies, has sold high-powered computing clusters to research institutions.
61
Cluster Computers
- Each computer in a cluster is a complete computer by itself
  - CPU
  - Memory
  - Disk
  - etc.
- Computers communicate with each other via some interconnection bus
62
Cluster Computers
- Typically used where one computer does not have enough capacity to do the expected work
  - Large Servers
- Cheaper than building one GIANT computer
63
Although not new, supercomputing clustering technology is still impressive. It works by farming out chunks of data to individual machines, and it works better for some types of computing problems than others. For example, a cluster would not be ideal to compete against IBM's Deep Blue supercomputer in a chess match; in this case, all the data must be available to one processor at the same moment, and the machine operates in much the same way as the human brain handles tasks. However, a cluster would be ideal for the processing of seismic data for oil exploration, because that computing job can be divided into many smaller tasks.
64
Cluster Computers
- Need to break up work among the computers in the cluster
- Example: Microsoft.com Search Engine
  - 6 computers running SQL Server
  - Each has a copy of the MS Knowledge Base
  - Search requests come to one computer
    - Sends request to one of the 6
    - Attempts to keep all 6 busy
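The slide does not say how the front-end computer chooses among the six servers. One plausible policy for keeping all six busy is to send each request to the server with the fewest requests in progress. The sketch below assumes that policy; all names in it are made up for the illustration and do not describe the actual Microsoft.com setup.

```c
#include <assert.h>
#include <stddef.h>

#define SERVER_COUNT 6  /* the six SQL Server machines in the example */

/* Front-end state: how many requests each server is working on. */
typedef struct {
    unsigned active[SERVER_COUNT];
} dispatcher;

/* Pick the least-loaded server (lowest index wins ties), record the
 * new request against it, and return its index. */
size_t dispatch(dispatcher *d)
{
    assert(d != NULL);
    size_t best = 0;
    for (size_t i = 1; i < SERVER_COUNT; i++) {
        if (d->active[i] < d->active[best])
            best = i;
    }
    d->active[best]++;
    return best;
}

/* Record that `server` has finished one of its requests. */
void finish_request(dispatcher *d, size_t server)
{
    assert(d != NULL && server < SERVER_COUNT && d->active[server] > 0);
    d->active[server]--;
}
```

Starting from an idle cluster, the first six requests go to servers 0 through 5 in turn. After that, whichever server finishes a request first becomes the preferred target for the next one.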
65
The Virginia Tech Mac supercomputer should be fully functional and in use by January 2004. It will be used for research into nanoscale electronics, quantum chemistry, computational chemistry, aerodynamics, molecular statics, computational acoustics and the molecular modeling of proteins.
66
Specialized Processors
- Vector Processors
- Massively Parallel Computers
67
Vector Processors
for (i = 0; i < n; i++) { array1[i] = array2[i] + array3[i]; }
This is an array (vector) operation
68
Vector Processors
Special instructions to operate on vectors (arrays)
- Vector instruction specifies
  - Starting addresses of all 3 arrays
  - Loop count
- Saves for-loop overhead
- Can more efficiently access memory
- Also known as SIMD Computers
  - Single Instruction Multiple Data
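What a vector add instruction carries can be modelled in software as a record holding the three starting addresses and the loop count. This is only a sketch: the struct and function names are assumptions, and the C loop runs sequentially, whereas vector hardware would process the elements without per-iteration loop overhead.

```c
#include <assert.h>
#include <stddef.h>

/* Software model of one vector add instruction: the starting
 * addresses of all three arrays plus the loop count. */
typedef struct {
    int *destination;    /* start of array1 */
    const int *source_a; /* start of array2 */
    const int *source_b; /* start of array3 */
    size_t count;        /* number of elements to process */
} vector_add_instruction;

/* Carry out the whole array operation described by one instruction. */
void execute_vector_add(const vector_add_instruction *instr)
{
    assert(instr != NULL);
    assert(instr->destination && instr->source_a && instr->source_b);
    for (size_t i = 0; i < instr->count; i++)
        instr->destination[i] = instr->source_a[i] + instr->source_b[i];
}
```

A program would fill in one such record and issue it once, instead of fetching and decoding the add, increment, compare, and branch instructions on every pass through the loop.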
69
Vector Processors
- Until the 1990s, the world’s fastest supercomputers were implemented as vector processors
- Now, vector processors are typically special peripheral devices that can be installed on a “regular” computer
70
Massively Parallel Computers
- IBM ASCI Purple
  - Cluster of 196 computers
  - Each computer has
    - 64 CPUs
    - 256 Gigabytes of RAM
    - 10,000 GB of Disk
71
Massively Parallel Computer
- How will ASCI Purple be used?
  - Simulation of molecular dynamics
  - Research into repairing damaged DNA
  - Analysis of seismic waves
  - Earthquake research
  - Simulation of star evolution
  - Simulation of Weapons of Mass Destruction