1
Parallel Processing
Group Members: PJ Kulick, Jon Robb, Brian Tobin
2
Topics
Theory of parallel computers
Supercomputers
Distributed computing
3
What is parallelism?
Definition: parallelism is the process of performing tasks concurrently.
Real-life examples: a pack of wolves hunting its prey; an orchestra performance, where each musician plays their part and together they make beautiful music.
4
Flynn’s Hardware Taxonomy
Processor Organizations
- Single instruction, single data (SISD) stream: uniprocessor
- Single instruction, multiple data (SIMD) stream: vector processor, array processor
- Multiple instruction, single data (MISD) stream
- Multiple instruction, multiple data (MIMD) stream: shared memory (symmetric multiprocessor (SMP), nonuniform memory access (NUMA)) or distributed memory (clusters)
The taxonomy comes from Flynn's IEEE Transactions on Computers paper, "Some Computer Organizations and Their Effectiveness."
SISD: a single processor executes a single instruction stream on data stored in a single memory; this is the ordinary uniprocessor.
SIMD: a single machine instruction controls the simultaneous execution of a number of processing elements; each element has its own data memory, and each instruction is executed on a different set of data by the different processors. Vector and array processors fall in this class.
MISD: data is transmitted to a set of processors, each of which executes a different instruction sequence; this organization has never been implemented.
MIMD: a set of processors simultaneously executes different instruction sequences on different data sets; SMPs, clusters, and NUMA systems fall in this class.
MIMDs can be further divided by the way their processors communicate. In shared-memory machines the processors communicate with each other through the shared memory; in distributed-memory machines the computers communicate via fixed paths or a network. An SMP interconnects multiple similar processors in one computer by a bus; its main problem is cache coherence. A NUMA machine is a shared-memory multiprocessor in which the access time from a processor to a memory word varies with the location of the word. A cluster is a group of interconnected computers working together, giving the illusion of one machine. A minimal sketch of the shared-memory style of communication follows below.
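To make the shared-memory idea concrete, here is a minimal Python sketch of several workers communicating through one shared variable; the lock plays the coordination role that cache-coherence and synchronization hardware play in an SMP. The worker function and iteration counts are illustrative only.

```python
# Minimal sketch of shared-memory communication between MIMD-style workers.
import threading

counter = 0                      # the shared memory
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:               # mutual exclusion on the shared word
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                   # 40000 with the lock; unpredictable without it
```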
5
Taxonomy of parallel computing paradigms
Parallel computer paradigms: synchronous (vector/array, SIMD, systolic) and asynchronous (MIMD).
Class question: what is a paradigm? It is simply a model of the world that is used to formulate a computer solution to some problem. Paradigms are useful in parallel computer architecture as well as in parallel programming because they control the complexity of the details.
Synchronous vs. asynchronous: coordination is required in parallel programs when a certain task depends on others. Synchronous, or lockstep, coordination is implemented in hardware by enabling all operations at once, in a way that removes the dependency of one operation on another. Asynchronous coordination relies on coordination mechanisms called locks to coordinate processors.
Vector/array: this paradigm is another name for pipelining, since numerical problems involving matrices require breaking the problem down into small stages. Pipelining in a parallel computer is similar to pipelining in a processor, differing mainly in scale.
SIMD (single instruction, multiple data): all the processors do the same thing at the same time or else remain idle. SIMD uses two phases over and over to manage the data: phase 1 partitions and distributes the data (data partitioning), and phase 2 processes the data in parallel (data-parallel processing); a sketch of this pattern appears below. This might seem trivial at first, but the paradigm is extremely useful for problems with lots of data to be updated on a wholesale basis, and it is especially powerful in vector and matrix calculations.
Systolic: invented in the 1980s by H. T. Kung at Carnegie Mellon University, a systolic parallel computer is a multiprocessor that distributes and pulses data from memory through an array of processors before returning it to memory. This paradigm incorporates features of both the SIMD and vector/array paradigms, and such a machine can reach very high speeds by avoiding input/output bottlenecks: the data is churned among the processors as much as possible before returning to memory.
MIMD: the most general form of parallelism is asynchronous parallelism, since the processors operate without regard to any global synchronization. As stated earlier, an MIMD organization means that the processors apply different instructions to different data at the same time, so each processor can work on its data freely without relying on other processors for information. However, mechanisms called locks are needed when, say, two processors need access to the same spot in memory; mutual exclusion allows only one processor to access the data at a point in time, and read and write locks are used when the MIMD machine incorporates a shared memory. MIMD is most useful for large-grained problems because of the overhead in passing data and control from task to task, where grain size means the number of serial instructions executed by one processor.
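A minimal sketch of the SIMD two-phase pattern, assuming a toy problem (squaring a list of numbers): phase 1 partitions the data, phase 2 applies the same operation to every partition in parallel. The function name and chunk count are illustrative, not part of any particular machine.

```python
# Two-phase SIMD-style pattern: data partitioning, then data-parallel processing.
from multiprocessing import Pool

def square_chunk(chunk):
    return [x * x for x in chunk]                 # same "instruction", different data

if __name__ == "__main__":
    data = list(range(1_000))
    chunks = [data[i::4] for i in range(4)]       # phase 1: data partitioning
    with Pool(processes=4) as pool:
        results = pool.map(square_chunk, chunks)  # phase 2: data-parallel processing
    print(sum(len(r) for r in results))           # all 1000 elements processed
```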
6
Interconnection Networks (IN)
IN topology
- Shared memory: vector, MIMD
- Distributed memory
  - Static: 1-dimensional, 2-dimensional, hypercube
  - Dynamic: single-stage, multi-stage, cross-bar
The processors themselves in a parallel system can range from simple 1-bit processors to some of the advanced architectures seen in the previous case studies. While speed and capacity can vary, the most significant difference between these machines is how they are connected to each other. In shared memory, every processor must be connected to the main memory. In distributed-memory networks, each processor can be fully connected to every other processor, or the processors can be only partially connected, in which case data must sometimes 'hop' through other processors to reach its destination. Distributed memory splits into two classes, static and dynamic, which differ in when a connection between processors is made: static INs always have a connection, whether direct (fully connected) or through another processor, while dynamic INs make the connection on the fly.
7
Distributed Memory – Static Networks
Linear array (1-d) and 2-dimensional networks: ring, star, tree, mesh.
Here are some examples of static networks. In all of these models, some connections may be made by channeling through other processors, a process called hopping; a small sketch of hop counts follows below. In the star network a central processor performs the connections between the other processors, so every connection involves a hop. Star, linear-array, and ring networks are all special cases of a tree network, which gives only one path between any pair of processors; tree networks can therefore bottleneck at the higher levels of the tree. One way to reduce the overhead of hopping is to fully connect the network, as shown on the next slide.
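A small sketch of what hopping costs: the number of links a message must cross between the two most distant processors (the diameter) in a couple of these topologies. The 8-processor adjacency lists are illustrative, not taken from the slides' figures.

```python
# Worst-case hop count (diameter) for small static topologies, via BFS.
from collections import deque

def diameter(adj):
    """Longest shortest-path, in hops, over all pairs of nodes."""
    worst = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst

n = 8
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
star = {0: list(range(1, n))}
star.update({i: [0] for i in range(1, n)})

print("ring diameter:", diameter(ring))   # 4 hops for 8 processors
print("star diameter:", diameter(star))   # 2 hops: leaf -> hub -> leaf
```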
8
Distributed Memory – Static Networks (cont’d)
Fully connected network: as you can see, this configuration is somewhat complex and would raise the cost of the connections significantly. One advantage of the design, however, is that any one processor can send information to all the other processors in a shorter amount of time than in the other networks.
9
Hypercube. Another method for static networks is to increase the dimensionality of the configuration. In general, an n-dimensional hypercube contains 2^n = N processors connected by links along each axis, so there are n = log2 N links at each processor. A fully connected network of 16 processors (seen earlier) needs 120 links, whereas a 16-processor hypercube needs only 32. At a larger scale, roughly 524,000 links would be needed to fully connect 1,024 processors, versus 5,120 for a hypercube. A short worked check of these counts follows below.
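A worked check of the link counts quoted above, assuming undirected links (each link shared by its two endpoints): a fully connected network needs N(N-1)/2 links, while an n-dimensional hypercube with N = 2^n processors needs N*n/2.

```python
# Link counts for fully connected vs. hypercube networks.
from math import log2

def fully_connected_links(n_procs):
    return n_procs * (n_procs - 1) // 2

def hypercube_links(n_procs):
    return n_procs * int(log2(n_procs)) // 2     # each of N nodes has log2(N) links

for n in (16, 1024):
    print(n, fully_connected_links(n), hypercube_links(n))
# 16   -> 120 fully connected vs. 32 hypercube
# 1024 -> 523776 fully connected vs. 5120 hypercube
```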
10
Dynamic configurations
Distributed memory, dynamic configurations: single-stage, multi-stage, and cross-bar.
In the single-stage dynamic interconnection network, each of the inputs on the left side is connected to some, but not all, of the outputs on the right. The rectangles in the diagram represent binary switches, which direct an input message to one of two outputs on the right. The single-stage dynamic IN is limited, but combining enough of them creates a multi-stage IN. The multi-stage IN shown connects 8 inputs to 8 outputs, and it is dynamic because the connections are made as needed. For example, to connect input 101 (5) to output 001 (1), the bits of the destination address, 001, are used one per stage: each switch along the path routes the message to its upper output if its destination bit is 0 and to its lower output if it is 1, so any input can be routed to any output (a routing sketch follows below). The third type of dynamic IN shown is the cross-bar. This network also allows any input to be routed to any output, but only one switch needs to be set to make a connection: to connect input 001 to output 001, only the switch at the intersection of the two lines is set. The cross-bar IN uses more switches than the multi-stage IN, but its big advantage is speed: a cross-bar connection can be made in one clock cycle, whereas the multi-stage IN requires log2 N clock cycles, where N is the number of inputs (and outputs). The multi-stage and cross-bar INs are used in an unusual way in commercial processors: the memory is distributed, but the IN gives the appearance of shared memory. A second way to categorize dynamic parallel processors is by how tightly they are coupled to memory. A symmetric multiprocessor is tightly coupled because its cross-bar lets all processors access all memories in the same amount of time; a non-uniform memory access system is loosely coupled because processors reach memory through a multi-stage IN, so access times are not uniform.
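The routing rule can be sketched in a few lines, assuming an omega-style multi-stage network (perfect-shuffle wiring between stages of 2-by-2 switches); the slide's diagram may be wired differently, so treat this as an illustration of destination-tag routing rather than the exact network shown.

```python
# Destination-tag routing through an assumed 8x8 omega-style multi-stage IN.
N = 8                          # number of inputs and outputs
STAGES = 3                     # log2(N) switching stages

def shuffle(line):
    """Perfect-shuffle wiring: rotate the 3-bit line number left by one."""
    return ((line << 1) | (line >> (STAGES - 1))) & (N - 1)

def route(src, dst):
    """Return the line numbers a message visits on its way from src to dst."""
    line, path = src, [src]
    for stage in range(STAGES):
        line = shuffle(line)                      # wiring into the next stage
        bit = (dst >> (STAGES - 1 - stage)) & 1   # destination bit, MSB first
        line = (line & ~1) | bit                  # switch: upper output if 0, lower if 1
        path.append(line)
    return path

print(route(0b101, 0b001))     # input 5 to output 1, as in the example -> [5, 2, 4, 1]
```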
11
Deep Blue First computer to defeat a world chess champion
32-node IBM Power Parallel SP2; 6-move look-ahead capability.
As most of us know, Deep Blue was the first computer to defeat a world chess champion. It defeated grandmaster Garry Kasparov in the first game of a six-game match in 1996, although Kasparov went on to win that match 4 to 2; in the 1997 rematch, Deep Blue won 3.5 to 2.5. Deep Blue consists of a 32-node IBM Power Parallel SP2, and each node of the SP2 also carries a single microchannel card containing eight VLSI chess processors, for a total of 256 processors working concurrently. This hardware, together with about half a million lines of code, allows Deep Blue to look ahead 6 moves, which involves examining 9 billion positions.
12
SP2 Architecture
As stated in IBM's research journals online, "The IBM SP2 is a general-purpose scalable parallel system based on a distributed memory message passing architecture." In practical terms, the SP2 is available in configurations ranging from 2 to 128 nodes, and each node is built around a POWER2-technology RISC System/6000 processor.
13
SP2 Architecture
As mentioned, the SP2 is based on a distributed-memory message-passing architecture. IBM considered two choices when deciding on the memory structure. The first, shown in the bottom left-hand corner of the slide, is the distributed shared-memory machine: all of the physical memory is directly addressable from any processor, and there is a global real address space such that any node can load data from, or store data to, any part of it. These systems usually have an operating system image on each processor, but the images are not independent, so address and data coherence must be handled in hardware or software. Handled in hardware, the hardware becomes complex and expensive and limits performance and scalability; handled in software, programming complexity becomes a problem that hinders performance. This brings us to IBM's second choice, the distributed message-passing machine shown in the bottom right-hand corner: a processor can perform a load or store only on its own local memory, which eliminates the problem of address and data coherence across nodes. IBM therefore chose this type of system for the SP2 architecture. A minimal sketch of the message-passing model follows below.
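Here is a minimal sketch of the message-passing model, with plain Python processes standing in for SP2 nodes and a pipe standing in for the switch; the node functions and data are made up for illustration.

```python
# Each "node" keeps its own local memory and communicates only by messages.
from multiprocessing import Process, Pipe

def node_a(conn):
    local = [1, 2, 3, 4]          # local memory, invisible to node B
    conn.send(sum(local))         # communicate by message, not by load/store
    conn.close()

def node_b(conn):
    partial = conn.recv()         # receive node A's partial result
    print("total =", partial + sum([5, 6, 7, 8]))

if __name__ == "__main__":
    a_end, b_end = Pipe()
    pa = Process(target=node_a, args=(a_end,))
    pb = Process(target=node_b, args=(b_end,))
    pa.start(); pb.start()
    pa.join(); pb.join()
```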
14
SP2 Architecture This is the general structure for an SP2 processor node. Each node in an SP2 architecture can be one of 3 types: thin node (denoted as TN), thin node 2 (denoted as TN2), and wide node (denoted as WN). All 3 node types contain 2 fixed-point units, 2 floating-point units, and an instruction and branch control unit. The processor has a peak performance of 267 million floating-point operations per second (MFLOPS) at a 66.7 MHz clock speed.
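The quoted peak rate is consistent with the clock rate if each of the two floating-point units retires a fused multiply-add (two floating-point operations) every cycle; that per-cycle assumption is mine, not stated on the slide.

```python
# Back-of-the-envelope check of the quoted peak rate.
clock_hz = 66.7e6
flops_per_cycle = 2 * 2            # assumed: 2 FPUs x 2 ops per fused multiply-add
print(clock_hz * flops_per_cycle)  # ~2.668e8, i.e. about 267 MFLOPS
```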
15
Supercomputers in the "Real World"
RISC System technology: running a high-volume scalable WWW server, forecasting the weather, designing cars. Compaq AlphaServer technology: Human Genome Project.
All of this just to play chess? Not exactly: the RISC System/6000 technology used for Deep Blue is also being used for various real-world problems, such as running a high-volume scalable web server, forecasting the weather, and designing cars. Compaq has also recently joined with Celera Genomics to tackle the Human Genome Project, building a cluster-type system based on Compaq's AlphaServer technology that is about half the size of a football field, with an estimated value of 50 million dollars.
16
Sun Microsystems' MAJC Chip. MAJC stands for Microprocessor Architecture for Java Computing and is pronounced "magic."
17
MAJC Implements parallel processing on one chip
Can operate standalone or with up to several hundred others in parallel. The first version contains two separate processors; as time goes on, many more will be included on one chip.
18
Features Four function units per processor
Each function unit contains local registers, while global registers can be accessed by all FUs. The chip operates as SIMD: multiple function units allow multiple instructions to be executed simultaneously, and each function unit can act as a RISC/DSP processor in its own right. SIMD (single instruction, multiple data) instructions allow the same instruction to be performed in each FU on different data simultaneously (a small software analogue appears below); the chip can also act as MSIMD (multiple single instruction, multiple data).
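As a rough software analogue of the SIMD behavior described above, a vectorized NumPy operation applies one "instruction" to many data elements at once; this is only an illustration, not how MAJC code is actually written.

```python
# One operation applied to many data elements at once (SIMD-style).
import numpy as np

a = np.arange(8)          # data spread across the "function units"
b = np.arange(8, 16)
print(a + b)              # the same add performed on every element pair
```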
19
Architecture. The header bits determine which FUs are used for each instruction packet. The instruction packet is variable length, allowing anywhere from one FU to four FUs to be used.
20
Instruction Word
21
SGI Onyx 3000
22
Onyx 3000 Series Developed for visualization and supercomputing
Modular design allows for ease of scalability: a snap-together approach, growth in multiple dimensions, NUMAflex architecture, and a design that lets different generations work together.
Growth in multiple dimensions means the system can grow in several dimensions at once, such as processing performance, memory, and I/O, while maintaining consistent price/performance across all configurations and allowing each dimension to expand independently of the others (for example, increasing I/O without increasing processors). These systems implement a physically distributed but logically shared memory structure that allows servers to scale up to 256 processors; this distributed/shared memory model is usually referred to as a nonuniform memory access scheme, or NUMA. The NUMAflex design strategy encompasses multiple generations of SGI NUMA-based computer families, all characterized by exceptional modularity, scalability, resilience, high performance, and strong price/performance. NUMAflex systems combine high-speed interconnects and modular "bricks" to create a wide variety of configurations, varying not only in size but in the balance of CPU and memory, I/O, graphics pipes, and storage ("independent resource scalability"). NUMAflex families often share bricks, which can evolve at their own natural rates.
23
Road Map. Currently they are using the R12000 RISC processor, at 400 MHz and 800 MFLOPS, and the R14000, at 500 MHz and 1.0 GFLOPS.
24
Available configurations
25
Applications of Onyx 3000
High-speed processing
Real-time graphics to video
High-definition editing
Integral support for virtual reality, real-time six-degrees-of-freedom (6DOF) interaction, and sensory immersion
26
Real-World Example: The Cave (Iowa State University)
Recreation of the Forbidden City
John Deere factory
Molecular structuring
27
References
http://www.sun.com
http://www.sgi.com
http://www.ibm.com
Stallings, William. Computer Organization and Architecture, 5th Edition. Upper Saddle River, New Jersey: Prentice Hall, 2000.
Lewis, Ted G. Introduction to Parallel Computing. Englewood Cliffs, New Jersey: Prentice Hall, 1992.
Kumar, Vipin. Introduction to Parallel Computing. Redwood City, California: The Benjamin/Cummings Publishing Company, 1994.
Moldovan, Dan I. Parallel Processing: From Applications to Systems. San Mateo, California: Morgan Kaufmann, 1993.