-1- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Parallel Computer Architectures 2 nd week References Flynn’s Taxonomy Classification of Parallel Computers Based on Architectures Trend in Parallel Computer Architectures
-2- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM REFERENCES Scalable Parallel Computing – Kai Hwang and Zhiwei Xu, McGraw-Hill Chapter.1 Parallel Computing: Theory and Practice – Michael J. Quinn, McGraw-Hill Chapter 3 Parallel Processing Course – Yang-Suk Kee School of EECS, Seoul National University
-3- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM FLYNN’S TAXONOMY A way to classify parallel computers (Flynn,1972) Based on notions of instruction and data streams – SISD (Single Instruction stream over a Single Data stream ) – SIMD (Single Instruction stream over Multiple Data streams ) – MISD (Multiple Instruction streams over a Single Data stream) – MIMD (Multiple Instruction streams over Multiple Data stream) Popularity – MIMD > SIMD > MISD
-4- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM SISD (Single Instruction Stream Over A Single Data Stream ) CUPUMU IS DS I/O IS : Instruction StreamDS : Data Stream CU : Control UnitPU : Processing Unit MU : Memory Unit SISD – Conventional sequential machines
-5- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM SIMD (Single Instruction Stream Over Multiple Data Streams ) SIMD – Vector computers – Special purpose computations CU PE 1 LM 1 PE n LM n DS IS Program loaded from host Data sets loaded from host SIMD architecture with distributed memory PE : Processing ElementLM : Local Memory
-6- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM MISD – Processor arrays, systolic arrays – Special purpose computations Is it different from pipeline computer? MISD (Multiple Instruction Streams Over A Single Data Streams) Memory (Program, Data) PU 1 PU 2 PU n CU 1 CU 2 CU n DS IS DS I/O MISD architecture (the systolic array)
-7- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM MIMD (Multiple Instruction Streams Over Multiple Data Stream) MIMD – General purpose parallel computers CU 1 PU 1 Shared Memory IS DS I/O CU n PU n ISDS I/O IS MIMD architecture with shared memory
-8- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM CLASSIFICATION BASED ON ARCHITECTURES Pipelined Computers Dataflow Architectures Data Parallel Systems Multiprocessors Multicomputers
-9- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Instructions are divided into a number of steps (segments, stages) At the same time, several instructions can be loaded in the machine and be executed in different steps PIPELINED COMPUTERS Instruction iIFIDEXWB IFIDEXWB IFIDEXWB IFIDEXWB IFIDEXWB Instruction i+1 Instruction i+2 Instruction i+3 Instruction i+4 Instruction # Cycles
-10- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM DATAFLOW ARCHITECTURES Data-driven model – A program is represented as a directed acyclic graph in which a node represents an instruction and an edge represents the data dependency relationship between the connected nodes – Firing rule: A node can be scheduled for execution if and only if its input data become valid for consumption Dataflow languages – Id, SISAL, Silage,... – Single assignment, applicative(functional) language, no side effect – Explicit parallelism
-11- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM DATAFLOW GRAPH z = (a + b) * c + * a b c z The dataflow representation of an arithmetic expression
-12- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM DATA-FLOW COMPUTER Execution of instructions is driven by data availability Basic components – Data are directly held inside instructions – Data availability check unit – Token matching unit – Chain reaction of asynchronous instruction executions What is the difference between this and normal (control flow) computers?
-13- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM DATAFLOW COMPUTERS Advantages – Very high potential for parallelism – High throughput – Free from side-effect Disadvantages – Time lost waiting for unneeded arguments – High control overhead – Difficult in manipulating data structures
-14- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Dataflow Machine (Manchester Dataflow Computer) From host To host First actual hardware implementation I/O Switch Matching Unit Overflow Unit Token Queue Instruction Store func 1 func k Network Token Match
-15- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM DATAFLOW REPRESENTATION input d,e,f c 0 = 0 for i from 1 to 4 do begin a i := d i / e i b i := a i * f i c i := b i + c i-1 end output a, b, c / d1d1 e1e1 * + f1f1 / d2d2 e2e2 * + f2f2 / d3d3 e3e3 * + f3f3 / d4d4 e4e4 * + f4f4 a1a1 b1b1 a2a2 b2b2 a3a3 b3b3 a4a4 b4b4 c0c0 c4c4 c1c1 c2c2 c3c3
-16- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM EXECUTION ON CONTROL FLOW MACHINES a1b1c1a2b2c2a4b4c4 Sequential execution on a uniprocessor in 24 cycles Assume all the external inputs are available before entering do loop + : 1 cycle, * : 2 cycles, / : 3 cycles, How long will it take to execute this program on a dataflow computer with 4 processors?
-17- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM EXECUTION ON A DATAFLOW MACHINE c1c2c3c4a1 a2 a3 a4 b1 b2 b4 b3 Data-driven execution on a 4-processor dataflow computer in 9 cycles Can we further reduce the execution time of this program ?
-18- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM PROBLEMS OF DATAFLOW COMPUTERS Excessive copying of large data structures in dataflow operations – I-structure : a tagged memory unit for overlapped usage by the producer and consumer Retreat from pure dataflow approach (shared memory) Handling complex data structures Chain reaction control is difficult to implement – Complexity of matching store and memory units Expose too much parallelism (?)
-19- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM DATA PARALLEL SYSTEMS Programming model – Operations performed in parallel on each element of data structure – Logically single thread of control, performs sequential or parallel steps – Conceptually, a processor associated with each data element
-20- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM DATA PARALLEL SYSTEMS (cont’d) SIMD Architectural model – Array of many simple, cheap processors with little memory each Processors don’t sequence through instructions – Attached to a control processor that issues instructions – Specialized and general communication, cheap global synchronization
-21- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM VECTOR PROCESSOR Instruction set includes operation on vectors as well as scalars 2 types of vector computers – Processor arrays – Pipelined vector processors
-22- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM PROCESSOR ARRAY A sequential computer connected with a set of identical processing elements simultaneouls doing the same operation on different data. Eg CM-200 Data path Processing element Data memory Interconnection network Processing element Data memory Processing element Data memory I/0 … … Processor array Instruction path Program and Data Memory CPU I/O processor Front-end computer I/0
-23- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM PIPELINED VECTOR PROCESSOR Stream vector from memory to the CPU Use pipelined arithmetic units to manipulate data Eg: Cray-1, Cyber-205
-24- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM MULTIPROCESSOR Consists of many fully programmable processors each capable of executing its own program Shared Address Space Architecture Classified into 2 types – Uniform Memory Access (UMA) Multiprocessors – Non-Uniform Memory Access (NUMA) Multiprocessors
-25- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM SHARED ADDRESS SPACE ARCHITECTURE Virtual address spaces for a collection of processes communicating via shared addresses Machine physical address space Shared portion of address space Private portion of address space Pn private Common physical addresses P 2 private P 1 private P 0 private Store Load
-26- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Uses a central switching mechanism to reach a centralized shared memory All processors have equal access time to global memory Tightly coupled system Problem: cache consistency UMA MULTIPROCESSOR P1P1 Switching mechanism I/O P2P2 … PnPn Memory banks C1C1 C2C2 CnCn PiPi CiCi Processor i Cache i
-27- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM UMA MULTIPROCESSOR (cont’d) Crossbar switching mechanism Mem Cache P I/OCache P I/O
-28- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM UMA MULTIPROCESSOR (cont’d) Shared- Bus switching mechanism Mem Cache P I/OCache P I/O
-29- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM UMA MULTIPROCESSOR (cont’d) Packet-switched network Network Mem Cache P P P Mem
-30- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM NUMA MULTIPROCESSOR Distributed shared memory combined by local memory of all processors Memory access time depends on whether it is local to the processor Caching shared (particularly nonlocal) data? Network Cache P MemCache P Mem Cache P MemCache P Mem Distributed Memory
-31- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM CURRENT TYPES OF MULTIPROCESSORS PVP (Parallel Vector Processor) – A small number of proprietary vector processors connected by a high-bandwidth crossbar switch SMP (Symmetric Multiprocessor) – A small number of COTS microprocessors connected by a high-speed bus or crossbar switch DSM (Distributed Shared Memory) – Similar to SMP – The memory is physically distributed among nodes.
-32- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM PVP (Parallel Vector Processor) VP Crossbar Switch VP SM VP : Vector ProcessorSM : Shared Memory
-33- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM SMP (Symmetric Multi-Processor) P/C Bus or Crossbar Switch P/C SM P/C : Microprocessor and Cache
-34- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM DSM (Distributed Shared Memory) Custom-Designed Network MB P/C LM NIC DIR P/C LM NIC DIR DIR : Cache Directory
-35- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM MULTICOMPUTERS Consists of many processors with their own memory No shared memory Processors interact via message passing loosely coupled system Message-Passing Interconnection Network M P M P M P P M P M P M PM PM … P P … M M … …
-36- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM CURRENT TYPES OF MULTICOMPUTERS MPP (Massively Parallel Processing) – Total number of processors > 1000 Cluster – Each node in system has less than 16 processors. Constellation – Each node in system has more than 16 processors
-37- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM MPP (Massively Parallel Processing) P/C LM NIC MB P/C LM NIC MB Custom-Designed Network MB : Memory BusNIC : Network Interface Circuitry
-38- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Cluster Commodity Network (Ethernet, ATM, Myrinet, VIA) MB P/C M NIC P/C M Bridge LD NIC IOB LD : Local DiskIOB : I/O Bus
-39- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM CONSTELLATION P/C SM NIC LD Hub Custom or Commodity Network >= 16 IOC P/C SM NIC LD Hub >= 16 IOC IOC : I/O Controller
-40- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Trend in Parallel Computer Architectures Years Number of HPCs MPPsConstellationsClustersSMPs
-41- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM TOWARD ARCHITECTURAL CONVERGENCE Evolution and role of software have blurred boundary – Send/recv supported on SAS machine via buffers – Can construct global address space on MP Hardware organization converging too. – Tighter NI integration for MP (low latency, higher bandwidth) – At lower level, hardware SAS passes hardware messages. – Nodes connected by general network and communication assists. Cluster of workstations/SMPs – Emergence of fast SAN (System Area Networks)
-42- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM Converged Architecture of Current Supercomputers Interconnection Network Memory PPPP PPPP PPPP PPPP Multiprocessors