ISPDC 2007, Hagenberg, Austria, 5-8 July 2007 1 On Grid-based Matrix Partitioning for Networks of Heterogeneous Processors Alexey Lastovetsky School of.

ISPDC 2007, Hagenberg, Austria, 5-8 July 2007 1 On Grid-based Matrix Partitioning for Networks of Heterogeneous Processors Alexey Lastovetsky School of Computer Science and Informatics University College Dublin Alexey.Lastovetsky@ucd.ie

ISPDC 2007, Hagenberg, Austria, 5-8 July 20072 Heterogeneous parallel computing u Heterogeneity of processors –The processors run at different speeds –Even distribution of computations do not balance processors’ load »The performance is determined by the slowest processor –Data must be distributed unevenly »So that each processor will perform the volume of computation proportional to its speed

ISPDC 2007, Hagenberg, Austria, 5-8 July 20073 Constant performance models of heterogeneous processors u The simplest performance model of heterogeneous processors –p, the number of the processors, –S={s 1, s 2,..., s p }, the speeds of the processors (positive constants). u The speed –Absolute: the number of computational units performed by the processor per one time unit –Relative: –Some use the execution time:

ISPDC 2007, Hagenberg, Austria, 5-8 July 20074 Data distribution problems with constant models of heterogeneous processors u Typical design of heterogeneous parallel algorithms –Problem of distribution of computations in proportion to the speed of processors »Problem of partitioning of some mathematical objects u Sets, matrices, graphs, geometric figures, etc.

ISPDC 2007, Hagenberg, Austria, 5-8 July 20075 Partitioning matrices with constant models of heterogeneous processors u Matrices – Most widely used math. objects in scientific computing »Studied partitioning problems mainly deal with matrices »Matrix partitioning in one dimension over a 1D arrangement of processors u Often reduced to partitioning sets or well-ordered sets »Design of algorithms often results in matrix partitioning problems not imposing the restriction of partitioning in one dimension u E.g., in parallel linear algebra for heterogeneous platforms u We will use matrix multiplication –A simple but very important linear algebra kernel

ISPDC 2007, Hagenberg, Austria, 5-8 July 20076 Partitioning matrices with constant models of heterogeneous processors (ctd) u A heterogeneous matrix multiplication algorithm – A modification of some homogeneous one »Most often, of the 2D block cyclic ScaLAPACK algorithm

ISPDC 2007, Hagenberg, Austria, 5-8 July 20077 Partitioning matrices with constant models of heterogeneous processors (ctd) u 2D block cyclic ScaLAPACK MM algorithm (ctd)

ISPDC 2007, Hagenberg, Austria, 5-8 July 20078 Partitioning matrices with constant models of heterogeneous processors (ctd) u 2D block cyclic ScaLAPACK MM algorithm (ctd) –The matrices are identically partitioned into rectangular generalized blocks of the size (p×r)×(q×r) »Each generalized block forms a 2D p×q grid of r×r blocks »There is 1-to-1 mapping between this grid of blocks and the p×q processor grid –At each step of the algorithm »Each processor not owing the pivot row and column receives horizontally (n/p)×r elements of matrix A and vertically (n/q)×r elements of matrix B »=> in total,, i.e., ~ the half-perimeter of the rectangle area allocated to the processor

ISPDC 2007, Hagenberg, Austria, 5-8 July 20079 Partitioning matrices with constant models of heterogeneous processors (ctd) u General design of heterogeneous modifications –Matrices A, B, and C are identically partitioned into equal rectangular generalized blocks –The generalized blocks are identically partitioned into rectangles so that »There is one-to-one mapping between the rectangles and the processors »The area of each rectangle is (approximately) proportional to the speed of the processor which has the rectangle –Then, the algorithm follows the steps of its homogeneous prototype

ISPDC 2007, Hagenberg, Austria, 5-8 July 200710 Partitioning matrices with constant models of heterogeneous processors (ctd)

ISPDC 2007, Hagenberg, Austria, 5-8 July 200711 Partitioning matrices with constant models of heterogeneous processors (ctd) u Why to partition the GBs in proportion to the speed –At each step, updating one r×r block of matrix C needs the same amount of computation for all the blocks –=> the load will be perfectly balanced if the number of blocks updated by each processor is proportional to its speed – The number = n i ×N GB –n i = the area of the GB partition allocated to i-th processor (measured in r×r blocks) –=> if the area of each GB partition ~ to the speed of the owing processor, their load will be perfectly balanced

ISPDC 2007, Hagenberg, Austria, 5-8 July 200712 Partitioning matrices with constant models of heterogeneous processors (ctd) u A generalized block from partitioning POV –An integer-valued rectangular –If we need an asymptotically optimal solution, the problem can be reduced to a geometrical problem of optimal partitioning of a real-valued rectangle »the asymptotically optimal integer-valued solution can be obtained by rounding off an optimal real-valued solution of the geometrical partitioning problem

ISPDC 2007, Hagenberg, Austria, 5-8 July 200713 Geometrical partitioning problem u The general geometrical partitioning problem –Given a set of p processors P 1, P 2,..., P p, the relative speed of each of which is characterized by a positive constant, s i, ( ), –Partition a unit square into p rectangles so that »There is one-to-one mapping between the rectangles and the processors »The area of the rectangle allocated to processor P i is equal to s i »The partitioning minimizes, where w i is the width and h i is the height of the rectangle allocated to processor P i

ISPDC 2007, Hagenberg, Austria, 5-8 July 200714 Geometrical partitioning problem (ctd) u Motivation behind the formulation –Proportionality of the areas to the speeds »Balancing the load of the processors –Minimization of the sum of half-perimeters »Multiple partitionings can balance the load »Minimizes the total volume of communications u At each step of MM, each receiving processor receives data ~ the half-perimeter of its rectangle u => In total, the communicated data ~

ISPDC 2007, Hagenberg, Austria, 5-8 July 200715 Geometrical partitioning problem (ctd) u Motivation behind the formulation (ctd) –Option: minimizing the maximal half-perimeter »Parallel communications –The use of a unit square instead of a rectangle »No loss of generality » the optimal solution for an arbitrary rectangle is obtained by straightforward scaling of that for the unit square u Proposition. The general geometrical partitioning problem is NP-complete.

ISPDC 2007, Hagenberg, Austria, 5-8 July 200716 Restricted geometrical partitioning problems u Restricted problems having polynomial solutions –Column-based –Grid-based u Column-based partitioning –Rectangles make up columns –Has an optimal solution of complexity O(p 3 )

ISPDC 2007, Hagenberg, Austria, 5-8 July 200717 Column-based partitioning problem

ISPDC 2007, Hagenberg, Austria, 5-8 July 200718 Column-based partitioning problem (ctd) u A more restricted form of the column-based partitioning problem –The processors are already arranged into a set of columns u Algorithm 1: Optimal partitioning a unit square between p heterogeneous processors arranged into c columns, each of which is made of r j processors, j=1,…,c : –Let the relative speed of the i-th processor from the j-th column, P ij, be s ij. –Then, we first partition the unit square into c vertical rectangular slices such that the width the j-th slice

ISPDC 2007, Hagenberg, Austria, 5-8 July 200719 Column-based partitioning problem (ctd) u Algorithm 1: (ctd) : –Second, each vertical slice is partitioned independently into rectangles in proportion with the speed of the processors in the corresponding processor column. u Algorithm 1 is of linear complexity.

ISPDC 2007, Hagenberg, Austria, 5-8 July 200720 Grid-based partitioning problem u Grid-based partitioning problem –The heterogeneous processors form a two-dimensional grid - There exist p and q such that any vertical line crossing the unit square will pass through exactly p rectangles and any horizontal line crossing the square will pass through exactly q rectangles

ISPDC 2007, Hagenberg, Austria, 5-8 July 200721 Grid-based partitioning problem (ctd) u Proposition. Let a grid-based partitioning of the unit square between p heterogeneous processors form c columns, each of which consists of r processors, p=r×c. Then, the sum of half-perimeters of the rectangles of the partitioning will be equal to (r+c). –The shape r×c of the processor grid formed by any optimal grid-based partitioning will minimize (r+c). –The sum of half-perimeters of the rectangles of the optimal grid-based partitioning does not depend on the mapping of the processors onto the nodes of the grid.

ISPDC 2007, Hagenberg, Austria, 5-8 July 200722 Grid-based partitioning problem (ctd) u Algorithm 2: Optimal grid-based partitioning a unit square between p heterogeneous processors: –Step 1: Find the optimal shape r×c of the processor grid such that p=r×c and (r+c) is minimal. –Step 2: Map the processors onto the nodes of the grid. –Step 3: Apply Algorithm 3 of the optimal partitioning of the unit square to this r×c arrangement of the p heterogeneous processors. u The correctness of Algorithm 2 is obvious. u Algorithm 2 returns a column-based partitioning.

ISPDC 2007, Hagenberg, Austria, 5-8 July 200723 Grid-based partitioning problem (ctd) u The optimal grid-based partitioning can be seen as a restricted form of column-based partitioning.

ISPDC 2007, Hagenberg, Austria, 5-8 July 200724 Grid-based partitioning problem (ctd) u Algorithm 3: Finding r and c such that p=r×c and (r+c) is minimal: ; while(r>1) if((p mod r)==0)) goto stop; else r--; stop: c = p / r;

ISPDC 2007, Hagenberg, Austria, 5-8 July 200725 Grid-based partitioning problem (ctd) u Proposition. Algorithm 3 is correct. u Proposition. The complexity of Algorithm 2 can be bounded by O(p 3/2 ).

ISPDC 2007, Hagenberg, Austria, 5-8 July 200726 Experimental results Specifications of sixteen Linux computers on which the matrix multiplication is executed ComputerCpu / Main memory / Cache (GHz/Mbytes/Kbytes) Absolute speed (MFlops) hcl013.60 / 256 / 20482171 hcl023.00 / 256 / 20482099 hcl033.40 / 1024 / 10241761 hcl043.40 / 1024 / 10241787 hcl053.40 / 256 / 10241735 hcl063.40 / 256 / 10241653 hcl073.40 / 256 / 10241879 hcl083.40 / 256 / 10241635 hcl091.00 / 1024 / 10243004 hcl101.00 / 1024 / 10242194 hcl113.00 / 512 / 10244580 hcl123.40 / 512 / 10241762 hcl133.40 / 1024 / 10244934 hcl142.80 / 1024 / 10244096 hcl153.60 / 1024 / 162697 hcl163.60 / 1024 / 20484840

ISPDC 2007, Hagenberg, Austria, 5-8 July 200727 Experimental results (ctd) Matrix size (n) Processor grid (r×c)Total execution time (sec)Communication time (sec) 5120 1×16348180 5120 2×821161 5120 4×417923 6144 1×16537258 6144 2×833595 6144 4×427630 7168 1×16770352 7168 2×8514150 7168 4×442051 8192 1×161264464 8192 2×8709181 8192 4×458250 9216 1×161444590 9216 2×8969233 9216 4×482893 10240 1×161916727 10240 2×81292297 10240 4×41100110

ISPDC 2007, Hagenberg, Austria, 5-8 July 200728 Application to Cartesian partitioning u Cartesian partitioning: –A column-based partitioning, the rectangles of which make up rows.

ISPDC 2007, Hagenberg, Austria, 5-8 July 200729 Application to Cartesian partitioning (ctd) u Cartesian partitioning –Plays important role in design of heterogeneous parallel algorithms (e.g., in scalable algorithms) u The Cartesian partitioning problem –Very difficult »Their may be no Cartesian partitionings perfectly balancing the load of processors

ISPDC 2007, Hagenberg, Austria, 5-8 July 200730 Application to Cartesian partitioning (ctd) u Cartesian partitioning problem in general form –Given p processors, the speed of each of which is characterized by a given positive constant, –Find a Cartesian partitioning of a unit square such that »There is 1-to-1 mapping between the rectangles and the processors »The partitioning minimizes

ISPDC 2007, Hagenberg, Austria, 5-8 July 200731 Application to Cartesian partitioning (ctd) u The Cartesian partitioning problem –Not even studied in the general form. »If shape r×c is given, it proved NP-complete. »Unclear if there exists a polynomial algorithm when both the shape and the processors’ mapping are given »There exists an optimal Cartesian partitioning with processors arranged in a non-increasing order of speed

ISPDC 2007, Hagenberg, Austria, 5-8 July 200732 Application to Cartesian partitioning (ctd) u Approximate solutions of the Cartesian partitioning problem are based on the observation –Let the speed matrix {s ij } of the given r×c processor arrangement be rank-one –Then there exists a Cartesian partitioning perfectly balancing the load of the processors

ISPDC 2007, Hagenberg, Austria, 5-8 July 200733 Application to Cartesian partitioning (ctd) u Algorithm 4: Finding an approximate solution of the simplified Cartesian problem (when only the shape r×c is given): –Step 1: Arrange the processors in a non-increasing order of speed –Step 2: For this arrangement, let and be the parameters of the partitioning –Step 3: Calculate the areas h i ×w j of the rectangles of this partitioning

ISPDC 2007, Hagenberg, Austria, 5-8 July 200734 Application to Cartesian partitioning (ctd) u Algorithm 5: Finding an approximate solution of the simplified Cartesian problem when only the shape r×c is given (ctd): –Step 4: Re-arrange the processors so that –Step 5: If Step 4 does not change the arrangement of the processors then return the current partitioning and stop the procedure else go to Step 2

ISPDC 2007, Hagenberg, Austria, 5-8 July 200735 Application to Cartesian partitioning (ctd) u Proposition. Let a Cartesian partitioning of the unit square between p heterogeneous processors form c columns, each of which consists of r processors, p=r×c. Then, the sum of half- perimeters of the rectangles of the partitioning will be (r+c). –Proof is a trivial exercise –Minimization of the communication cost does not depend on the speeds of the processors but only on their number –=> minimization of communication cost and minimization of computation cost are two independent problems –Any Cartesian partitioning minimizing (r+c) will optimize communication cost

ISPDC 2007, Hagenberg, Austria, 5-8 July 200736 Application to Cartesian partitioning (ctd) u Now we can extend Algorithm 5 –By adding the 0-th step, finding the optimal r×c –The modified algorithm returns an approximate solution of the extended Cartesian problem »Aimed at minimization of both computation and communication cost u The modified Algorithm 5 will return an optimal solution if the speed matrix for the arrangement is a rank-one matrix

ISPDC 2007, Hagenberg, Austria, 5-8 July 2007 1 On Grid-based Matrix Partitioning for Networks of Heterogeneous Processors Alexey Lastovetsky School of.

Similar presentations

Presentation on theme: "ISPDC 2007, Hagenberg, Austria, 5-8 July 2007 1 On Grid-based Matrix Partitioning for Networks of Heterogeneous Processors Alexey Lastovetsky School of."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

ISPDC 2007, Hagenberg, Austria, 5-8 July 2007 1 On Grid-based Matrix Partitioning for Networks of Heterogeneous Processors Alexey Lastovetsky School of.

Similar presentations

Presentation on theme: "ISPDC 2007, Hagenberg, Austria, 5-8 July 2007 1 On Grid-based Matrix Partitioning for Networks of Heterogeneous Processors Alexey Lastovetsky School of."— Presentation transcript:

Similar presentations

About project

Feedback