Scheduling of Tiled Nested Loops onto a Cluster with a Fixed Number of SMP Nodes
Maria Athanasaki, Evangelos Koukis, Nectarios Koziris
National Technical University of Athens
School of Electrical and Computer Engineering
Computing Systems Laboratory
National Technical University of Athens Computing Systems Laboratory PDP 2004 Scheduling of Tiled Nested Loops onto a Cluster with a Fixed Number of SMP Nodes
Previous work
  M. Athanasaki, A. Sotiropoulos, G. Tsoukalas, N. Koziris, "Pipelined Scheduling of Tiled Nested Loops onto Clusters of SMPs using Memory Mapped Network Interfaces", SuperComputing Conference on High Performance Networking and Computing (SC2002), Baltimore, Maryland, November 16-22, 2002.
  G. Goumas, A. Sotiropoulos and N. Koziris, "Minimizing Completion Time for Loop Tiling with Computation and Communication Overlapping", Proceedings of the 2001 International Parallel and Distributed Processing Symposium (IPDPS2001), IEEE Press, San Francisco, California, April 2001.
Overview
  Tiling for parallelization
  Non-overlapping vs. overlapping execution scheme
  Grouping
  Application on a cluster of SMPs with a fixed number of nodes
  Experimental and simulation results
Nested For-Loops
  for (i_1 = l_1; i_1 <= u_1; i_1++)
    for (i_2 = l_2; i_2 <= u_2; i_2++)
      ...
        for (i_n = l_n; i_n <= u_n; i_n++) {
          Loop Body
        }
Dependence Vectors
  for (i1 = 0; i1 <= 7; i1++)
    for (i2 = 0; i2 <= 7; i2++)
      A[i1][i2] = A[i1-1][i2] + A[i1][i2-1];
  Each iteration reads the values produced at (i1-1, i2) and (i1, i2-1), so the dependence vectors are (1,0) and (0,1).
  [Figure: iteration space with axes i1 and i2 and the dependence vectors drawn between iterations]
Tiling
  [Figure: iteration space (axes i1, i2) partitioned into tiles; the tiles are assigned to Processor 0 and Processor 1]
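Tiling the two-dimensional example can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes 1-based loop bounds (so that every read stays inside the array), a boundary of ones, and a hypothetical tile side TILE. The names run_untiled and run_tiled are made up for this sketch.

```c
#include <assert.h>

#define N 8      /* iterations per dimension (assumed, matches the 8x8 example) */
#define TILE 4   /* tile side (hypothetical value) */

/* Set row 0 and column 0 to 1 so that every read has a defined value
   (assumed boundary condition, not given on the slides). */
static void init_boundary(long a[N + 1][N + 1]) {
    for (int i = 0; i <= N; i++) {
        a[i][0] = 1;
        a[0][i] = 1;
    }
}

/* Original (untiled) loop nest. */
void run_untiled(long a[N + 1][N + 1]) {
    init_boundary(a);
    for (int i1 = 1; i1 <= N; i1++)
        for (int i2 = 1; i2 <= N; i2++)
            a[i1][i2] = a[i1 - 1][i2] + a[i1][i2 - 1];
}

/* Tiled loop nest: the two outer loops enumerate tiles, the two inner
   loops enumerate the iterations inside one tile. */
void run_tiled(long a[N + 1][N + 1]) {
    assert(N % TILE == 0);  /* the sketch handles full tiles only */
    init_boundary(a);
    for (int t1 = 1; t1 <= N; t1 += TILE)
        for (int t2 = 1; t2 <= N; t2 += TILE)
            for (int i1 = t1; i1 < t1 + TILE; i1++)
                for (int i2 = t2; i2 < t2 + TILE; i2++)
                    a[i1][i2] = a[i1 - 1][i2] + a[i1][i2 - 1];
}
```

Because both dependence vectors, (1,0) and (0,1), are non-negative, visiting the tiles in lexicographic order never reads a value before it is written, so both functions fill the array with the same values.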
Non-Overlapping Scheme
  [Figure: tiled iteration space (axes i1, i2) executed by Processor 0, Processor 1 and Processor 2]
Non-Overlapping vs. Overlapping Scheme
  [Figure: the two schemes side by side, each shown for processors P0, P1, P2 and P3]
Overlapping Scheme
  [Figure: tiled iteration space (axes i1, i2) executed by Processor 0, Processor 1 and Processor 2]
Generalization to SMPs - "Grouping"
  [Figure: four SMP nodes (SMP0, SMP1, SMP2, SMP3), each with CPU0 and CPU1]
Example: Grouping + Non-Overlapping Communication Scheme
  Scheduling vector Π = (1,0)
  [Figure: tile space and group space, mapped onto SMP node0 and SMP node1]
Example: Grouping + Overlapping Communication Scheme
  Scheduling vector Π = (1,1)
  [Figure: tile space and group space, mapped onto SMP node0 and SMP node1]
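A scheduling vector Π defines a linear schedule: the group with coordinates g = (g1, g2) runs at time step t = Π · g. The sketch below evaluates that product for the two vectors on these slides. It assumes a two-dimensional group space with zero-based coordinates; the function names are hypothetical.

```c
#include <assert.h>

/* Time step of group (g1, g2) under the linear schedule pi = (pi1, pi2). */
int time_step(int pi1, int pi2, int g1, int g2) {
    assert(pi1 >= 0 && pi2 >= 0);
    assert(g1 >= 0 && g2 >= 0);
    return pi1 * g1 + pi2 * g2;
}

/* Number of time steps for a group space of n1 x n2 groups: with a
   non-negative pi, the last group (n1-1, n2-1) runs in the final step. */
int schedule_length(int pi1, int pi2, int n1, int n2) {
    assert(n1 > 0 && n2 > 0);
    return time_step(pi1, pi2, n1 - 1, n2 - 1) + 1;
}
```

With Π = (1,0) all groups that share the same g1 receive the same time step, while with Π = (1,1) the groups on one anti-diagonal (equal g1 + g2) do. For example, time_step(1, 0, 3, 2) is 3 and time_step(1, 1, 3, 2) is 5.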
Scheduling onto a Fixed Number of SMPs
  Dynamic scheduling by the operating system
    Run-time overhead from generating many processes
    Context switching slows down execution
  Static scheduling at compile time
Scheduling onto a Fixed Number of SMPs
  Cyclic assignment schedule
  Mirror assignment schedule
  Cluster assignment schedule
  Retiling
Cyclic Assignment
  [Figure: cyclic assignment on 2 SMP nodes with 2 CPUs each; the sequence SMP0 (CPU0, CPU1), SMP1 (CPU0, CPU1) repeats, and one such repetition is labeled "chunk"]
Cyclic Assignment - Non-Overlapping Communication
  [Figure: cyclic assignment on 2 SMP nodes with 2 CPUs each; execution on CPU0 and CPU1 of SMP0 and SMP1 along the time axis t]
Cyclic Assignment - Overlapping Communication
  [Figure: cyclic assignment on 2 SMP nodes with 2 CPUs each; execution on CPU0 and CPU1 of SMP0 and SMP1 along the time axis t]
Cyclic Assignment - Communication
  [Figure: cyclic assignment on 2 SMP nodes with 2 CPUs each; chunks of SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]
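The cyclic mapping can be sketched as a round-robin function. This is an illustration under stated assumptions, not the authors' implementation: tiles are identified by a zero-based column index along the mapping dimension, a chunk is one full pass over all CPUs, and the type Placement and function cyclic_assign are hypothetical names.

```c
#include <assert.h>

typedef struct {
    int smp;    /* SMP node index */
    int cpu;    /* CPU index inside that node */
    int chunk;  /* which round of the cyclic distribution */
} Placement;

/* Round-robin placement: column c goes to processor c mod P, where
   P = nodes * cpus_per_node, and consecutive processors fill one node
   before moving on to the next. */
Placement cyclic_assign(int column, int nodes, int cpus_per_node) {
    assert(column >= 0 && nodes > 0 && cpus_per_node > 0);
    int total = nodes * cpus_per_node;
    int p = column % total;
    Placement r = { p / cpus_per_node, p % cpus_per_node, column / total };
    return r;
}
```

On 2 SMP nodes with 2 CPUs each, columns 0 to 3 form chunk 0 and land on SMP0/CPU0, SMP0/CPU1, SMP1/CPU0, SMP1/CPU1; column 4 starts chunk 1 on SMP0/CPU0 again.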
Mirror Assignment
  [Figure: mirror assignment on 2 SMP nodes with 2 CPUs each; the node order SMP0, SMP1 is followed by SMP1, SMP0, each node with CPU0 and CPU1; one pass is labeled "chunk"]
Mirror Assignment - Non-Overlapping Communication
  [Figure: mirror assignment on 2 SMP nodes with 2 CPUs each; execution on CPU0 and CPU1 of SMP0 and SMP1 along the time axis t]
Mirror Assignment - Overlapping Communication
  [Figure: mirror assignment on 2 SMP nodes with 2 CPUs each; execution on CPU0 and CPU1 of SMP0 and SMP1 along the time axis t]
Mirror Assignment - Communication
  [Figure: mirror assignment on 2 SMP nodes with 2 CPUs each; node order SMP0, SMP1, SMP1, SMP0, each node with CPU0 and CPU1]
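One possible reading of the mirror mapping is sketched below. The figure itself is not recoverable, so the details are assumptions: the order of the SMP nodes is reversed in every other chunk (SMP0, SMP1, then SMP1, SMP0), while the CPU order inside a node is kept. Tiles are identified by a zero-based column index; Placement and mirror_assign are hypothetical names.

```c
#include <assert.h>

typedef struct {
    int smp;    /* SMP node index */
    int cpu;    /* CPU index inside that node */
    int chunk;  /* which pass over all CPUs */
} Placement;

/* Mirror placement (assumed variant): even chunks visit the nodes in
   ascending order, odd chunks in descending order. */
Placement mirror_assign(int column, int nodes, int cpus_per_node) {
    assert(column >= 0 && nodes > 0 && cpus_per_node > 0);
    int total = nodes * cpus_per_node;
    int chunk = column / total;
    int p = column % total;
    int smp = p / cpus_per_node;
    if (chunk % 2 == 1)
        smp = nodes - 1 - smp;  /* reverse the node order in odd chunks */
    Placement r = { smp, p % cpus_per_node, chunk };
    return r;
}
```

Under this reading, the last node of one chunk also executes the first tiles of the next chunk, so neighbouring tiles across a chunk boundary stay on the same SMP node.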
Cluster Assignment
  [Figure: cluster assignment on 2 SMP nodes with 2 CPUs each; tiles are clustered into a larger "TILE", shown as tiles and groups on SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]
Cluster Assignment - Non-Overlapping Communication
  [Figure: cluster assignment on 2 SMP nodes with 2 CPUs each; execution on CPU0 and CPU1 of SMP0 and SMP1 along the time axis t]
Cluster Assignment - Overlapping Communication
  [Figure: cluster assignment on 2 SMP nodes with 2 CPUs each; execution on CPU0 and CPU1 of SMP0 and SMP1 along the time axis t]
Cluster Assignment - Communication
  [Figure: cluster assignment on 2 SMP nodes with 2 CPUs each; tiles and groups on SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]
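Cluster assignment can be sketched as a block distribution in which neighbouring tiles go to the same CPU. The sketch assumes that tiles are identified by a zero-based column index and that the columns are split into contiguous blocks of equal size (rounded up); Placement and cluster_assign are hypothetical names.

```c
#include <assert.h>

typedef struct {
    int smp;  /* SMP node index */
    int cpu;  /* CPU index inside that node */
} Placement;

/* Block placement: `columns` tile columns are cut into P contiguous
   blocks, P = nodes * cpus_per_node, and block b goes to processor b. */
Placement cluster_assign(int column, int columns, int nodes, int cpus_per_node) {
    assert(nodes > 0 && cpus_per_node > 0);
    assert(column >= 0 && column < columns);
    int total = nodes * cpus_per_node;
    int block = (columns + total - 1) / total;  /* ceil(columns / total) */
    int p = column / block;
    Placement r = { p / cpus_per_node, p % cpus_per_node };
    return r;
}
```

With 8 tile columns on 2 SMP nodes with 2 CPUs each, columns 0 and 1 go to SMP0/CPU0, columns 2 and 3 to SMP0/CPU1, and so on. Only the tiles on a block boundary have a neighbour on another CPU.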
Retiling
  New tiles replace the old tiles, retaining the computation volume of a tile
  [Figure: retiling on 2 SMP nodes with 2 CPUs each; old tiles and new tiles on SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]
Retiling - Non-Overlapping Communication
  [Figure: retiling on 2 SMP nodes with 2 CPUs each; execution on CPU0 and CPU1 of SMP0 and SMP1 along the time axis t]
Retiling - Overlapping Communication
  [Figure: retiling on 2 SMP nodes with 2 CPUs each; execution on CPU0 and CPU1 of SMP0 and SMP1 along the time axis t]
Retiling - Communication
  [Figure: retiling on 2 SMP nodes with 2 CPUs each; SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]
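The constraint "retaining computation volume of a tile" can be illustrated with a small calculation. The sketch below is an assumption about how the new shape could be derived, not the authors' procedure: tiles are two-dimensional, the tile is widened along the mapping dimension until there is one tile column per processor, and its height is reduced by the same factor. Tile and retile are hypothetical names.

```c
#include <assert.h>

typedef struct {
    int width;   /* extent along the mapping dimension (assumed) */
    int height;  /* extent along the other dimension (assumed) */
} Tile;

/* Merge k = columns / processors old tile columns into one new column
   and shrink the height by k, so width * height stays the same. */
Tile retile(Tile old, int columns, int processors) {
    assert(old.width > 0 && old.height > 0);
    assert(columns > 0 && processors > 0 && columns % processors == 0);
    int k = columns / processors;
    assert(old.height % k == 0);  /* the sketch handles exact division only */
    Tile t = { old.width * k, old.height / k };
    return t;
}
```

For example, with 8 old tile columns of shape 4 x 8 and 4 processors, k is 2 and the new tile is 8 x 4, which still contains 32 iterations.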
Experimental Platform
  Linux SMP (symmetric multiprocessor) cluster
    2 nodes
    1 GB RAM
    2 Pentium III 1266 MHz
  Myrinet high-performance interconnect
  GM low-level message-passing system
The Myrinet Interconnect
  User-level networking
    Based on the GM message-passing interface
    All message exchange uses DMA, directly to/from pinned userspace buffers
    Communication is offloaded to the NIC
  Programmable NIC
    LANai RISC MHz
    2-8 MB SRAM
  2+2 Gbps full-duplex fiber links
GM Architecture
  Consists of three main parts
    User library
    Kernel driver
    Firmware on the NIC
  OS-bypass design
    Regions of NIC memory are mapped into the virtual memory of a process
  [Figure: Application and GM library (user level), GM kernel module (kernel level), GM firmware (NIC)]
Sending and Receiving Messages over Myrinet/GM
  [Figure: sending side and receiving side, each with an application, host and NIC. Sending side: send queue, buffer and event queue. Receiving side: receive queue, buffer and event queue. Each NIC: LANai, Send DMA, Recv DMA and Host DMA]
Initial Code
  for (i = 1; i <= X; i++)
    for (j = 1; j <= Y; j++)
      for (k = 1; k <= Z; k++) {
        A[i][j][k] = func(A[i-1][j][k], A[i][j-1][k], A[i][j][k-1]);
      }
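A compilable version of this loop nest might look as follows. The slides specify neither func, the extents X, Y, Z, nor the boundary values, so all three are placeholders here: func is taken to be a plain sum, the extents are small constants, and every element with a zero index is set to 1.

```c
#include <assert.h>

enum { X = 4, Y = 4, Z = 4 };  /* iteration-space extents (hypothetical values) */

long A[X + 1][Y + 1][Z + 1];

/* Placeholder for the unspecified func(): here simply a sum. */
static long func(long a, long b, long c) {
    return a + b + c;
}

void run_initial_code(void) {
    assert(X > 0 && Y > 0 && Z > 0);

    /* Assumed boundary condition: every element with a zero index is 1. */
    for (int i = 0; i <= X; i++)
        for (int j = 0; j <= Y; j++)
            for (int k = 0; k <= Z; k++)
                if (i == 0 || j == 0 || k == 0)
                    A[i][j][k] = 1;

    /* The loop nest from the slide, unchanged. */
    for (int i = 1; i <= X; i++)
        for (int j = 1; j <= Y; j++)
            for (int k = 1; k <= Z; k++)
                A[i][j][k] = func(A[i - 1][j][k], A[i][j - 1][k], A[i][j][k - 1]);
}
```

Each element depends on its predecessors along i, j and k, so the dependence vectors are (1,0,0), (0,1,0) and (0,0,1).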
Experimental Results
  [Figure: two plots of speedup / number of processors versus height of the iteration space, one for the non-overlapping and one for the overlapping execution scheme; curves: cyclic, mirror, cluster, retile]
Simulation Results
  [Figure: two plots of speedup / number of processors versus height of the iteration space, one for the overlapping and one for the non-overlapping execution scheme; curves: cyclic, mirror, cluster, retile]
Simulation Results
  [Figure: two plots of speedup / number of processors versus height of the iteration space, one for the non-overlapping and one for the overlapping execution scheme; curves: cyclic, mirror, cluster, retile]
Advantages - Disadvantages
  cyclic
    + fast pipeline filling
    - communication
  mirror
    + better communication than cyclic
    - idle time steps
    - worse communication than cluster and retile
  cluster
    + communication: 1) small volume of data to be transferred, 2) data combined into fewer messages
    - slow pipeline filling
  retile
    + fast pipeline filling
    + communication: small volume of data to be transferred
    - reorganizes tiles, which annuls the optimal tile shape for cache hits
The End
Cyclic Assignment - Overlapping Communication
  Equivalent schedulings
    Scheduling on a fixed number of processors: empty pipeline, waiting for the necessary data to become available
    Scheduling on an unlimited number of processors
  [Figure: the two schedulings drawn over processors P and time t, for SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]