
1 Arquitectura de Sistemas Paralelos e Distribuídos (Architecture of Parallel and Distributed Systems). Paulo Marques, Dep. Eng. Informática – Universidade de Coimbra, pmarques@dei.uc.pt, Aug/2007. 2. Machine Architectures

2 von Neumann Architecture. Based on the fetch-decode-execute cycle: the computer executes a single sequence of instructions that act on data, and both program and data are stored in the same memory. [Diagram: a flow of instructions operating on data items in memory]
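To make the cycle concrete, here is a minimal sketch of a toy von Neumann machine in C (the instruction set and memory layout are illustrative assumptions of mine, not from the slides): program and data share one memory array, and a single loop fetches, decodes, and executes.

```c
/* Toy von Neumann machine: program and data in the same memory,
 * driven by a fetch-decode-execute loop. Hypothetical opcodes. */
#include <stdio.h>

enum { HALT, LOAD, ADD, STORE };            /* toy instruction set */

int main(void) {
    /* instructions (opcode, operand pairs) and data side by side */
    int mem[16] = { LOAD, 10, ADD, 11, STORE, 12, HALT, 0,
                    0, 0, /* data: */ 3, 4, 0, 0, 0, 0 };
    int pc = 0, acc = 0;                    /* program counter, accumulator */

    for (;;) {
        int op  = mem[pc++];                /* fetch */
        int arg = mem[pc++];
        switch (op) {                       /* decode + execute */
        case LOAD:  acc = mem[arg];  break;
        case ADD:   acc += mem[arg]; break;
        case STORE: mem[arg] = acc;  break;
        case HALT:  printf("mem[12] = %d\n", mem[12]); return 0;
        }
    }
}
```

Running it prints `mem[12] = 7`: the single instruction stream loads 3, adds 4, and stores the result back into the very memory the program lives in.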

3 Flynn's Taxonomy. Classifies computers according to the number of execution (instruction) streams and the number of data streams, giving four classes:
- SISD: Single-Instruction, Single-Data
- SIMD: Single-Instruction, Multiple-Data
- MISD: Multiple-Instruction, Single-Data
- MIMD: Multiple-Instruction, Multiple-Data

4 Single Instruction, Single Data (SISD). A serial (non-parallel) computer. Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle. Single data: only one data stream is being used as input during any one clock cycle. Examples: most PCs, single-CPU workstations, …

5 Single Instruction, Multiple Data (SIMD). A type of parallel computer. Single instruction: all processing units execute the same instruction at any given clock cycle. Multiple data: each processing unit can operate on a different data element. Best suited for specialized problems characterized by a high degree of regularity, such as image processing. Examples: Connection Machine CM-2, Cray J90, Pentium MMX instructions.
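In the same spirit as the MMX instructions mentioned above, a minimal SIMD sketch in C using x86 SSE intrinsics (my example, not from the slides): one vector instruction operates on four data elements at once.

```c
/* SIMD with SSE intrinsics: a single _mm_add_ps instruction
 * adds four floats in parallel. */
#include <stdio.h>
#include <xmmintrin.h>   /* SSE */

int main(void) {
    float a[4] = {  1.0f,  2.0f,  3.0f,  4.0f };
    float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
    float c[4];

    __m128 va = _mm_loadu_ps(a);       /* load 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);    /* one instruction, 4 data elements */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%.1f ", c[i]);         /* 11.0 22.0 33.0 44.0 */
    printf("\n");
    return 0;
}
```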

6 The Connection Machine 2 (SIMD). The massively parallel Connection Machine 2 was a supercomputer produced by Thinking Machines Corporation, containing 32,768 (or more) 1-bit processors that work in parallel.

7 Multiple Instruction, Single Data (MISD). Few actual examples of this class of parallel computer have ever existed. Some conceivable examples might be:
- multiple frequency filters operating on a single signal stream
- multiple cryptography algorithms attempting to crack a single coded message
- the data-flow architecture

8 Multiple Instruction, Multiple Data (MIMD). Currently the most common type of parallel computer. Multiple instruction: every processor may be executing a different instruction stream. Multiple data: every processor may be working with a different data stream. Execution can be synchronous or asynchronous, deterministic or non-deterministic. Examples: most current supercomputers, computer clusters, multi-processor SMP machines (including some types of PCs).
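As a minimal MIMD sketch in C (my example, not from the slides), two POSIX threads execute different instruction streams on different data at the same time; compile with `-pthread`.

```c
/* MIMD in miniature: each thread runs its own code on its own data. */
#include <pthread.h>
#include <stdio.h>

static void *sum_task(void *arg) {          /* instruction stream 1 */
    int *v = arg, s = 0;
    for (int i = 0; i < 4; i++) s += v[i];
    printf("sum = %d\n", s);
    return NULL;
}

static void *max_task(void *arg) {          /* instruction stream 2 */
    int *v = arg, m = v[0];
    for (int i = 1; i < 4; i++) if (v[i] > m) m = v[i];
    printf("max = %d\n", m);
    return NULL;
}

int main(void) {
    int data1[4] = { 1, 2, 3, 4 };          /* data stream 1 */
    int data2[4] = { 7, 5, 9, 2 };          /* data stream 2 */
    pthread_t t1, t2;

    pthread_create(&t1, NULL, sum_task, data1);
    pthread_create(&t2, NULL, max_task, data2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

The two outputs can appear in either order, illustrating the asynchronous, non-deterministic execution the slide mentions.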

9 Earth Simulator Center – Yokohama, NEC SX (MIMD). The Earth Simulator is a project to develop a 40 TFLOPS system for climate modeling; it performs at 35.86 TFLOPS. The ES is based on:
- 5,120 500 MHz NEC CPUs (640 8-way nodes)
- 8 GFLOPS per CPU (41 TFLOPS total)
- 2 GB RAM per CPU (10 TB total)
- shared memory inside each node
- a 640 × 640 crossbar switch between the nodes
- 16 GB/s inter-node bandwidth

10 What about Memory? The interface between CPUs and memory in parallel machines is of crucial importance. The bottleneck on the bus between memory and CPU is known as the von Neumann bottleneck. It limits how fast a machine can operate, since performance depends on the ratio between computation and communication.
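As an illustrative back-of-the-envelope calculation (the numbers are assumptions of mine, not from the slides): a CPU sustaining 8 GFLOPS on a streaming loop that reads two 8-byte operands per floating-point operation would need about 8 × 16 = 128 GB/s of memory bandwidth. If the bus delivers only 16 GB/s, that loop is capped at roughly 1 GFLOPS, an eighth of peak, no matter how fast the CPU itself is.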

11 Communication in Parallel Machines. Programs act on data, so one question is quite important: how do processors access each other's data? Two models exist: the shared memory model and the message passing model. [Diagram: several CPUs attached to a single memory (shared memory) vs. CPU–memory pairs connected by a network (message passing)]

12 Shared Memory. Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as a global address space. Multiple processors can operate independently but share the same memory resources. Changes in a memory location made by one processor are visible to all other processors. Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.
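A minimal shared-memory sketch in C with OpenMP (my example, not from the slides): every thread reads and writes the same array through one global address space, and the runtime distributes loop iterations across the processors. Compile with `-fopenmp`.

```c
/* Shared memory model: threads share the array a and the sum. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    #pragma omp parallel for              /* all threads share a[] */
    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];                      /* writes by one thread are
                                             visible to the others */

    printf("sum = %f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```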

13 Shared Memory (2). [Diagram: UMA (Uniform Memory Access), a single 4-processor machine in which all CPUs reach one memory through a fast memory interconnect, vs. NUMA (Non-Uniform Memory Access), a 3-processor machine in which each CPU has its own memory module but can also reach the others']

14 Uniform Memory Access (UMA). Most commonly represented today by Symmetric Multiprocessor (SMP) machines: identical processors with equal access and access times to memory. Sometimes called CC-UMA (Cache Coherent UMA): cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level. Very hard to scale.

15 Non-Uniform Memory Access (NUMA). Often made by physically linking two or more SMPs, so that one SMP can directly access the memory of another. Not all processors have equal access time to all memories. Sometimes called DSM (Distributed Shared Memory).
Advantages:
- user-friendly programming perspective to memory
- data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs
- more scalable than SMPs
Disadvantages:
- lack of scalability between memory and CPUs
- the programmer is responsible for the synchronization constructs that ensure "correct" access to global memory
- expensive: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever-increasing numbers of processors

16 UMA and NUMA. The new Mac Pro features 2 Intel Core 2 Duo processors that share a common central memory (up to 16 GB). SGI Origin 3900: 16 R14000A processors per brick, each brick with 32 GB of RAM and 12.8 GB/s of aggregated memory bandwidth (scales up to 512 processors and 1 TB of memory).

17 Distributed Memory (DM). Processors have their own local memory; memory addresses in one processor do not map to another processor (there is no global address space). Because each processor has its own local memory, cache coherency does not apply. Requires a communication network to connect inter-processor memory. When a processor needs data held by another processor, it is usually the task of the programmer to explicitly define how and when the data are communicated; synchronization between tasks is likewise the programmer's responsibility. Very scalable and cost effective (off-the-shelf processors and networking), but slower than UMA and NUMA machines.
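A minimal message-passing sketch in C with MPI (my example, not from the slides): each process owns a private memory, so the only way to move data is an explicit send and a matching receive. Run with something like `mpirun -np 2 ./a.out`.

```c
/* Message passing: the value exists only in process 0 until it is
 * explicitly sent over the network to process 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                        /* lives in process 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```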

18 Distributed Memory. [Diagram: three computers, each with its own CPU and memory, connected by a network interconnect] TITAN@DEI, a PC cluster interconnected by Fast Ethernet.

19 Hybrid Architectures. Today, most systems are hybrids combining shared and distributed memory: each node has several processors that share a central memory, and a fast switch interconnects the nodes. In some cases the interconnect allows memory to be mapped among nodes; in most cases it provides a message passing interface. [Diagram: nodes of several CPUs sharing a memory, joined by a fast network interconnect]
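A minimal hybrid sketch in C (my example, not from the slides): MPI handles message passing between nodes while OpenMP exploits the shared memory inside each node. Compile with an MPI wrapper plus `-fopenmp`.

```c
/* Hybrid model: one MPI process per node, OpenMP threads within it. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, provided;

    /* ask for thread support since OpenMP threads run inside the process */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel                   /* shared memory inside the node */
    {
        printf("node rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }
    /* communication between nodes would use MPI_Send/MPI_Recv here */

    MPI_Finalize();
    return 0;
}
```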

20 ASCI White at the Lawrence Livermore National Laboratory.
- Each node is an IBM POWER3 375 MHz NH-2 16-way SMP (i.e. 16 processors per node)
- Each node has 16 GB of memory
- A total of 512 nodes, interconnected by a 2 GB/s node-to-node network
- The 512 nodes feature a total of 8,192 processors and 8,192 GB of memory
- It currently operates at 13.8 TFLOPS

21 Summary
- CC-UMA. Examples: SMPs, Sun Vexx, SGI Challenge, IBM Power3. Programming: MPI, Threads, OpenMP, Shmem. Scalability: <10 processors. Drawbacks: limited memory bandwidth; hard to scale. Software availability: great.
- CC-NUMA. Examples: SGI Origin, HP Exemplar, IBM Power4. Programming: MPI, Threads, OpenMP, Shmem. Scalability: <1000 processors. Drawbacks: new architecture; point-to-point communication. Software availability: great.
- Distributed/Hybrid. Examples: Cray T3E, IBM SP2. Programming: MPI. Scalability: ~1000 processors. Drawbacks: costly system administration; programming is hard to develop and maintain. Software availability: limited.

