Advances in the Optimization of Parallel Routines (III)


1 Advances in the Optimization of Parallel Routines (III)
Domingo Giménez
Departamento de Informática y Sistemas, Universidad de Murcia, Spain
dis.um.es/~domingo
04 April 2019, Universidad Politécnica de Valencia

2 Outline
A little history
Modelling Linear Algebra Routines
Installation routines
Autotuning routines
Modifications to libraries' hierarchy
Polylibraries
Algorithmic schemes
Heterogeneous systems
Hybrid programming
Peer-to-peer computing

3 Collaborations and self-references
Algorithmic schemes:
+ J. P. Martínez: Automatic Optimization in Parallel Dynamic Programming Schemes. 2004

4 Algorithmic schemes
The aim is to study ALGORITHMIC SCHEMES, not individual routines. The study could be useful to:
Design libraries to solve problems in different fields: Divide and Conquer, Dynamic Programming, Branch and Bound (La Laguna)
Develop SKELETONS which could be used in parallel programming languages: Skil, Skipper, CARAML, P3L, …

5 Dynamic Programming
There are different parallel Dynamic Programming schemes. The simple scheme of the "coins problem" is used: given a quantity C and n coin types of values v = (v1, v2, …, vn), with a quantity q = (q1, q2, …, qn) of each type, minimize the number of coins used to give C. The granularity of the computation has been varied in order to study the scheme, not the problem.

6 Dynamic Programming
Sequential scheme:
  for i = 1 to number_of_decisions
    for j = 1 to problem_size
      obtain the optimum solution with i decisions and problem size j
    endfor
  endfor
Each step completes one row of an i × j table with a recurrence formula (shown as an image in the original slide).
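As an illustration of the table-filling scheme for the coins problem, a minimal sketch in Python; the function name `min_coins` and the compressed one-dimensional table are choices of this sketch, not the authors' code:

```python
def min_coins(C, v, q):
    """Minimum number of coins summing to C, with q[i] coins of value v[i].
    Bounded-coin dynamic programming over the table the slides describe."""
    INF = float("inf")
    best = [0] + [INF] * C          # best[c] = fewest coins that give quantity c
    for value, count in zip(v, q):  # one "decision" stage per coin type
        for _ in range(count):      # allow one more coin of this value per pass
            for c in range(C, value - 1, -1):   # descending: no reuse in a pass
                if best[c - value] + 1 < best[c]:
                    best[c] = best[c - value] + 1
    return best[C] if best[C] < INF else -1
```

For example, with C = 11 and coin values (1, 2, 5), the optimum uses three coins (5 + 5 + 1).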

7 Dynamic Programming
Parallel scheme:
  for i = 1 to number_of_decisions
    in parallel: for j = 1 to problem_size
      obtain the optimum solution with i decisions and problem size j
    endInParallel
  endfor
(Figure: the columns of the table computed in parallel by processes P0 … Pk.)

8 Dynamic Programming
Message-passing scheme:
  in each processor Pj:
    for i = 1 to number_of_decisions
      communication step
      obtain the optimum solution with i decisions and the problem sizes assigned to Pj
    endfor
  endInEachProcessor
(Figure: the table columns distributed among processors P0 … Pk.)

9 Dynamic Programming
Theoretical model (the cost formulas were images in the original slide): sequential cost, computational parallel cost (for large qi), and communication cost per step.
The only algorithmic parameter (AP) is p. The system parameters (SPs) are tc, ts and tw.
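The slide's cost formulas were images and are lost in the transcript. A generic form consistent with the text (arithmetic work shared among p processes, plus one communication step per decision) might look like this sketch, where the exact functional form is an assumption:

```python
def estimated_time(n, C, p, tc, ts, tw):
    """Hypothetical execution-time model for the parallel DP scheme:
    n decisions over a table of size C, shared among p processes."""
    arithmetic = tc * n * C / p        # computational parallel cost (qi large)
    communication = n * (ts + tw * C)  # start-up plus word-sending per step
    return arithmetic + communication
```

Such a model makes explicit why the best p depends on the granularity: the arithmetic term shrinks with p while the communication term does not.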

10 Dynamic Programming
How to estimate the arithmetic SPs: solving a small problem.
How to estimate the communication SPs:
Using a ping-pong (CP1)
Solving a small problem varying the number of processors (CP2)
Solving problems of selected sizes on systems of selected sizes (CP3)
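A ping-pong (CP1) yields timings for several message sizes, from which ts and tw can be recovered with a least-squares fit of t(n) = ts + tw·n. The helper below is a sketch of that fit, not the authors' installation code:

```python
def fit_comm_params(sizes, times):
    """Least-squares fit of t(n) = ts + tw * n from ping-pong timings."""
    m = len(sizes)
    sx = sum(sizes)
    sy = sum(times)
    sxx = sum(x * x for x in sizes)
    sxy = sum(x * y for x, y in zip(sizes, times))
    tw = (m * sxy - sx * sy) / (m * sxx - sx * sx)  # slope: word-sending time
    ts = (sy - tw * sx) / m                         # intercept: start-up time
    return ts, tw
```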

11 Dynamic Programming
Experimental results. Systems:
SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster) + Ethernet
PenFE: seven Pentium III + FastEthernet
Varying:
The problem size: C = 10000, 50000, …, with a large value of qi
The granularity of the computation (the cost of a computational step)

12 Dynamic Programming
Experimental results:
CP1: ping-pong (point-to-point communication). Does not reflect the characteristics of the system.
CP2: executions with the smallest problem (C = 10000), varying the number of processors. Reflects the characteristics of the system, but the time also changes with C. Larger installation time (6 and 9 seconds).
CP3: executions with selected problem sizes (C = 10000, …) and system sizes (p = 2, 4, 6), with linear interpolation for other sizes. Larger installation time (76 and 35 seconds).

13 Dynamic Programming
Parameter selection for C = 10000 and 50000, granularities 10, 50 and 100, on SUNEt and PenFE. (Table comparing the selections of LT, CP1, CP2 and CP3; the cell values are garbled in the transcript.)

14 Dynamic Programming
Quotient between the execution time with the parameter selected by each of the selection methods and the lowest execution time, in SUNEt. (Chart.)

15 Dynamic Programming
Quotient between the execution time with the parameter selected by each of the selection methods and the lowest execution time, in PenFE. (Chart.)

16 Dynamic Programming
Three types of users are considered:
GU (greedy user): uses all the available processors.
CU (conservative user): uses half of the available processors.
EU (expert user): uses a different number of processors depending on the granularity: 1 for low granularity, half of the available processors for middle granularity, all the processors for high granularity.
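The three user policies can be written down directly; the function name and the granularity labels are illustrative:

```python
def processes_for_user(user, available, granularity):
    """Process counts chosen by the three modelled users (GU, CU, EU)."""
    if user == "GU":                 # greedy: every available processor
        return available
    if user == "CU":                 # conservative: half of them
        return available // 2
    # EU: expert choice driven by the granularity of the computation
    if granularity == "low":
        return 1
    return available // 2 if granularity == "middle" else available
```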

17 Dynamic Programming
Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in SUNEt. (Chart.)

18 Dynamic Programming
Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in PenFE. (Chart.)

19 Outline
A little history
Modelling Linear Algebra Routines
Installation routines
Autotuning routines
Modifications to libraries' hierarchy
Polylibraries
Algorithmic schemes
Heterogeneous systems
Hybrid programming
Peer-to-peer computing

20 Collaborations and self-references
Heterogeneous systems:
+ G. Carrillo: Installation routines for linear algebra libraries on LANs. 2000
+ J. Cuenca + J. Dongarra + J. González + K. Roche: Automatic Optimization of Parallel Linear Algebra Routines in Systems with Variable Load. 2003
+ J. Cuenca + J. P. Martínez: Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems. 2004

21 Heterogeneous algorithms
New algorithms with an unbalanced distribution of the data are necessary:
Different SPs for different processors
The APs include a vector of selected processors and a vector of block sizes
(Figure: Gauss elimination with block sizes b0, b1, b2.)
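A minimal sketch of the unbalanced distribution, assuming block sizes proportional to the relative processor speeds (the slides do not spell out the exact rule, so this is an assumption):

```python
def block_sizes(n, speeds):
    """Split n rows among processors proportionally to their relative speeds
    (the b0, b1, b2 of the slide); rounding leftovers are handed out round-robin."""
    total = sum(speeds)
    sizes = [int(n * s / total) for s in speeds]
    for i in range(n - sum(sizes)):   # distribute the rounding remainder
        sizes[i % len(sizes)] += 1
    return sizes
```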

22 Heterogeneous algorithms
Parameter selection:
RI-THE: obtains p and b from the formula (homogeneous distribution)
RI-HOM: obtains p and b through a reduced number of executions (homogeneous distribution)
RI-HET: obtains p and b through a reduced number of executions, with a different block size for each processor (heterogeneous distribution)

23 Heterogeneous algorithms
Quotient with respect to the lowest experimental execution time. (Charts for RI-THE, RI-HOM and RI-HET, n = 500 to 3000.) Systems:
Heterogeneous system: two SUN Ultra 1 (one manages the file system) + one SUN Ultra 5
Homogeneous system: five SUN Ultra 1
Hybrid system: five SUN Ultra 1 + one SUN Ultra 5

24 Parameter selection at running time
(Installation diagram: the LAR is modelled to obtain the MODEL, the SP-estimators are implemented, the static SPs are estimated from the basic libraries into a Static-SP file, and an installation file is produced.)

25 Parameter selection at running time
(Diagram build: adds a call to the NWS, which provides current system information.)

26 Parameter selection at running time
The NWS is called and it reports:
the fraction of available CPU (fCPU)
the current word-sending time (tw_current) for specific n and AP values (n0, AP0).
Then the fraction of available network is calculated.

27 Parameter selection at running time
(Table: CPU availability and current word-sending time on nodes 1–8 in five load situations; several tw values were lost in the transcript.)
Situation A: nodes 1–8 at 100% CPU
Situation B: nodes 1–4 at 80%, nodes 5–8 at 100%
Situation C: nodes 1–4 at 60%, nodes 5–8 at 100%
Situation D: nodes 1–4 at 60%, nodes 5–6 at 100%, nodes 7–8 at 80%; tw-current 0.7 sec and 0.8 sec
Situation E: nodes 1–4 at 60%, nodes 5–6 at 100%, nodes 7–8 at 50%; tw-current 0.7 sec and 4.0 sec

28 Parameter selection at running time
(Diagram as in slide 25.)

29 Parameter selection at running time
(Diagram build: adds the dynamic adjustment of the SPs, producing the current SPs.)

30 Parameter selection at running time
The values of the SPs are tuned according to the current situation.
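The adjustment formulas were images in the slide; a plausible reconstruction (an assumption, not the authors' exact rule) scales each static SP by the measured availability:

```python
def tune_sps(static_tc, static_ts, static_tw, f_cpu, f_net):
    """Hypothetical dynamic adjustment of the system parameters:
    arithmetic cost grows as the fraction of available CPU (f_cpu) drops,
    and communication costs grow as the fraction of available network
    (f_net = tw_static(n0, AP0) / tw_current, from NWS data) drops."""
    return static_tc / f_cpu, static_ts / f_net, static_tw / f_net
```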

31 Parameter selection at running time
(Diagram as in slide 29.)

32 Parameter selection at running time
(Diagram build: adds the selection of the optimum AP, producing Optimum-AP.)

33 Parameter selection at running time
(Diagram, plus a table of the selected block size and number of nodes to use, p = r × c, for each platform load situation A–E and each n; the cell values are garbled in the transcript.)

34 Parameter selection at running time
(Diagram as in slide 32.)

35 Parameter selection at running time
(Diagram build: adds the execution of the LAR with the selected parameters.)

36 Parameter selection at running time
(Charts: overhead of the static model versus the dynamic model for n = 1024, 2048 and 3072 under platform load situations A–E.)

37 Work distribution
There are different possibilities in heterogeneous systems:
Heterogeneous algorithms (Gauss elimination).
Homogeneous algorithms with the assignment of: one process to each processor (LU factorization), or a variable number of processes to each processor, depending on the relative speed.
The general assignment problem is NP-hard, so heuristic approximations are used.

38 Work distribution
Dynamic Programming (the coins problem scheme): homogeneous algorithm + heterogeneous distribution. (Figure: the table columns distributed among processes P0 … Pk, with several processes per processor.)

39 Work distribution
The model: t(n, C, v, q, tc(n,C,v,q,p,b,d), ts(n,C,v,q,p,b,d), tw(n,C,v,q,p,b,d))
Problem size: n, the number of types of coins; C, the value to give; v, the array of coin values; q, the quantity of coins of each type.
Algorithmic parameters: p, the number of processes; b, the block size (here n/p); d, the processes-to-processors assignment.
System parameters: tc, the cost of basic arithmetic operations; ts, the start-up time; tw, the word-sending time.

40 Work distribution
Theoretical model: the same as in the homogeneous case, because the same homogeneous algorithm is used. Sequential cost, computational parallel cost (for large qi), and communication cost (the formulas were images in the original slide).
There is a new AP: the assignment d. The SPs are now one-dimensional (tc) or two-dimensional (ts, tw) tables.

41 Work distribution
Assignment tree (P types of processors and p processes): each level assigns one more process to one of the processor types, in non-decreasing order of type. Some limit on the height of the tree (the number of processes) is necessary.

42 Work distribution
Assignment tree with P = 2 and p = 3: 10 nodes. (The general node-count formula was an image in the original slide.)
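The general node-count formula was an image in the slide; counting the root plus all multisets of up to p process-to-type assignments over P types reproduces the quoted figure of 10 nodes for P = 2 and p = 3. This reconstruction is an assumption:

```python
from math import comb

def tree_nodes(P, p):
    """Nodes of the assignment tree: the root plus one node per multiset
    of k process-to-type assignments, for every k = 1 .. p."""
    return 1 + sum(comb(P + k - 1, k) for k in range(1, p + 1))
```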

43 Work distribution
Assignment tree for SUNEt, with P = 2 types of processors (five SUN Ultra 1 + one SUN Ultra 5). When more processes than available processors are assigned to a type of processor, the costs of the operations (the SPs) change. (Figure: the tree of U5/U1 assignments, with one process per processor, down to p processes.)

44 Work distribution
Assignment tree for TORC, where P = 4 types of processors were used:
one 1.7 GHz Pentium 4 (only one process can be assigned): type 1
one 1.2 GHz AMD Athlon: type 2
one 600 MHz single Pentium III: type 3
eight 550 MHz dual Pentium III: type 4
(Figure: 4 processors are not in the tree; when two consecutive processes are assigned to the same node, the values of the SPs change.)

45 Work distribution
Use Branch and Bound or Backtracking (with node elimination) to search the tree. Use the theoretical execution model to estimate the cost at each node, taking the highest SP values among the processor types considered and multiplying them by the number of processes assigned to the most loaded processor of that type.

46 Work distribution
Use Branch and Bound or Backtracking (with node elimination) to search the tree. Use the theoretical execution model to obtain a lower bound for each node. For example, with an array of processor types (1,1,1,2,2,2,3,3,3,4,4,4) with relative speeds si, and an assignment array a = (2,2,3), the array of possible assignments is pa = (0,0,0,1,1,0,1,1,1,1,1,1), and the maximum achievable speed bounds the cost: the minimum arithmetic cost is obtained from this speed, and the lowest communication costs from those between the processors in the assignment array.
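A sketch of such a pruning bound (the slide's formula was an image, so the names and the exact form here are hypothetical): the arithmetic time at a node can never beat the total work divided by the best aggregate speed still reachable from that node:

```python
def arithmetic_lower_bound(work, assigned_speeds, free_speeds, remaining):
    """Optimistic bound used to prune the Branch and Bound search:
    assume the remaining processes land on the fastest free processors,
    which bounds the achievable aggregate speed from above."""
    best_free = sorted(free_speeds, reverse=True)[:remaining]
    max_speed = sum(assigned_speeds) + sum(best_free)
    return work / max_speed
```

If this bound already exceeds the best complete assignment found so far, the subtree can be discarded.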

47 Work distribution
Theoretical model: sequential cost, computational parallel cost (for large qi), and communication cost, taking maximum values per step (the formulas were images in the original slide).
The APs are p and the assignment array d. The SPs are the one-dimensional array tc and the two-dimensional arrays ts and tw.

48 Work distribution
How to estimate the arithmetic SPs: solving a small problem on each type of processor.
How to estimate the communication SPs:
Using a ping-pong between each pair of processors, and between processes in the same processor (CP1). Does not reflect the characteristics of the system.
Solving a small problem varying the number of processors, with linear interpolation (CP2). Larger installation time.

49 Work distribution
Three types of users are considered:
GU (greedy user): uses all the available processors, with one process per processor.
CU (conservative user): uses half of the available processors (the fastest), with one process per processor.
EU (user expert in the problem, the system and heterogeneous computing): uses a different number of processes and processors depending on the granularity: one process on the fastest processor for low granularity; half of the available processors, suitably chosen, for middle granularity; as many processes as processors, suitably chosen, for large granularity.

50 Work distribution
Quotient between the execution time with the parameters selected by each of the selection methods and the modelled users and the lowest execution time, in SUNEt. (Chart.)

51 Work distribution
Parameter selection in TORC with CP2, for C = 50000, 100000 and 500000 and granularities 10, 50 and 100. (Table comparing LT and CP2; the cell values are partially garbled in the transcript, with assignments such as (1,2), (1,2,4,4) and (1,2,3,4).)

52 Work distribution
Parameter selection in TORC (without the 1.7 GHz Pentium 4) with CP2:
one 1.2 GHz AMD Athlon: type 1
one 600 MHz single Pentium III: type 2
eight 550 MHz dual Pentium III: type 3
(Table comparing LT and CP2 for C = 50000, 100000 and 500000 and granularities 10, 50 and 100; the cell values are partially garbled, with assignments such as (1,1,2), (1,1,2,3,3,3,3,3,3), (1,1,2,3,3,3,3,3,3,3,3), (1,1,3,3), (1,1,3) and (1,1,2,3).)

53 Work distribution
Quotient between the execution time with the parameters selected by each of the selection methods and the modelled users and the lowest execution time, in TORC. (Chart.)

54 Work distribution
Quotient between the execution time with the parameters selected by each of the selection methods and the modelled users and the lowest execution time, in TORC (without the 1.7 GHz Pentium 4). (Chart.)

55 Outline
A little history
Modelling Linear Algebra Routines
Installation routines
Autotuning routines
Modifications to libraries' hierarchy
Polylibraries
Algorithmic schemes
Heterogeneous systems
Hybrid programming
Peer-to-peer computing

56 Hybrid programming
OpenMP:
Fine-grain parallelism
Efficient in SMPs
Sequential and parallel codes are similar
Tools for development and parallelisation
Allows run-time scheduling
Memory allocation can reduce performance
MPI:
Coarse-grain parallelism
More portable
Parallel code very different from the sequential code
Development and debugging more complex
Static assignment of processes
Local memories, which facilitates efficient use

57 Hybrid programming: Advantages of hybrid programming
To improve scalability
When too many tasks produce load imbalance
For applications with both fine- and coarse-grain parallelism
Reduction of the code development time
When the number of MPI processes is fixed
In case of a mixture of functional and data parallelism

58 Hybrid programming: Hybrid programming in the literature
Most of the papers are about particular applications. Some papers present hybrid models. No theoretical models of the execution time are available.

59 Hybrid programming
Systems:
Networks of dual Pentiums
HPC160 (four processors per node)
IBM SP Blue Horizon (144 nodes, 8 processors each)
Earth Simulator (640 × 8 vector processors)

60 Hybrid programming
(Figure.)

61 Hybrid programming
Models:
MPI+OpenMP: OpenMP used for loop parallelisation
OpenMP+MPI: unsafe threads
MPI and OpenMP processes in the SPMD model: reduces the cost of communications

62 Hybrid programming
(Figure.)

63 Hybrid programming
      program main
      include 'mpif.h'
      double precision mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, ierr
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
      call MPI_BCAST(n, 1, MPI_INTEGER, 0,
     &               MPI_COMM_WORLD, ierr)
      h = 1.0d0 / n
      sum = 0.0d0
!$OMP PARALLEL DO REDUCTION (+:sum) PRIVATE (x)
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   enddo
!$OMP END PARALLEL DO
      mypi = h * sum
      call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION,
     &                MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      call MPI_FINALIZE(ierr)
      stop
      end

64 Hybrid programming
It is not clear whether hybrid programming lowers the execution time. (Chart from Lanucara and Rovida: conjugate gradient.)

65 Hybrid programming
It is not clear whether hybrid programming lowers the execution time. (Chart from Djomehri and Jin: CFD solver.)

66 Hybrid programming
It is not clear whether hybrid programming lowers the execution time. (Chart from Viet, Yoshinaga, Abderazek and Sowa: linear system.)

67 Hybrid programming
Matrix-matrix multiplication: decide which is preferable, MPI SPMD or MPI+OpenMP. MPI+OpenMP needs less memory and fewer communications, but may have worse memory use. (Figure: the block distribution over nodes N0, N1, N2, with processes p0 and p1 per node.)

68 Hybrid programming
More algorithmic parameters appear in the theoretical time model:
8 processors: pure MPI grids p = r×s: 1×8, 2×4, 4×2, 8×1; hybrid grids p = r×s: 1×4, 2×2, 4×1 and q = u×v: 1×2, 2×1, a total of 6 configurations.
16 processors: pure MPI grids p = r×s: 1×16, 2×8, 4×4, 8×2, 16×1; hybrid grids q = u×v: 1×4, 2×2, 4×1, a total of 9 configurations.
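The configuration counts follow from enumerating the r × s factorizations of the node grid and the u × v factorizations of the thread grid; a sketch (function names are illustrative):

```python
def grids(p):
    """All r x s factorizations of p (candidate process or thread grids)."""
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]

def hybrid_configs(processors, threads_per_node):
    """MPI-grid x OpenMP-grid combinations when the processors are used as
    (processors // threads_per_node) nodes of threads_per_node threads each."""
    nodes = processors // threads_per_node
    return [(pg, qg) for pg in grids(nodes) for qg in grids(threads_per_node)]
```

With 8 processors as 4 nodes of 2 threads this gives 3 × 2 = 6 configurations, and with 16 processors as 4 nodes of 4 threads, 3 × 3 = 9, matching the counts on the slide.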

69 Hybrid programming
And more system parameters:
The cost of communications is different inside and outside a node (similar to the heterogeneous case with more than one process per processor).
The cost of arithmetic operations can vary with the number of threads in the node.
Consequently, the algorithms must be recoded and new models of the execution time must be obtained.

70 Hybrid programming
… and the formulas change. (Figure: processes P0–P6 on nodes 1–6, with synchronizations inside the nodes and communications between them.) For some systems 6×1 nodes with 1×6 threads could be better, and for others 1×6 nodes with 6×1 threads.

71 Hybrid programming
Open problem: is it possible to automatically generate MPI+OpenMP programs from MPI programs? Maybe for the SPMD model, or at least for some types of programs, such as matrix problems on meshes of processors. And is it possible to obtain the execution time of the MPI+OpenMP program from that of the MPI program plus some description of how the time model was obtained?

72 Outline
A little history
Modelling Linear Algebra Routines
Installation routines
Autotuning routines
Modifications to libraries' hierarchy
Polylibraries
Algorithmic schemes
Heterogeneous systems
Hybrid programming
Peer-to-peer computing

73 Peer-to-peer computing
Distributed systems are inherently heterogeneous and dynamic, but there are other problems: a higher communication cost, and the need for special middleware. The typical paradigms are master/slave and client/server, where different types of processors (users) are considered.

74 Peer-to-peer computing
Peer-to-Peer Computing. Dejan S. Milojicic, Vana Kalogeraki, Rajan Lukose, Kiran Nagaraja, Jim Pruyne, Bruno Richard, Sami Rollins, Zhichen Xu. HP Laboratories Palo Alto, 2002.

75 Peer-to-peer computing
Peer to peer: all the processors (users) are at the same level (at least initially), and the community selects, in a democratic and continuous way, the topology of the global network. Would it be interesting to have a P2P system for computing? Is some system of this type available?

76 Peer-to-peer computing
Would it be interesting to have a P2P system for computing? I think it would be interesting to develop a system of this type, and to let the community decide, in a democratic and continuous way, whether it is worthwhile. Is some system of this type available? I think there is no pure P2P system dedicated to computation.

77 Peer-to-peer computing
… and other people seem to think the same:
Lichun Ji (2003): "… P2P networks seem to outperform other approaches largely due to the anonymity of the participants in the peer-network, low network costs and the inexpensive disk-space. Trying to apply P2P principles in the area of distributed computation was significantly less successful"
Arjav J. Chakravarti, Gerald Baumgartner, Mario Lauria (2004): "… current approaches to utilizing desktop resources require either centralized servers or extensive knowledge of the underlying system, limiting their scalability"

78 Peer-to-peer computing
There are a lot of tools for Grid computing:
Globus (of course); but does Globus provide computational P2P capacity, or is it a tool with which P2P computational systems can be developed?
NetSolve/GridSolve: uses a client/server structure.
PlanetLab (at present 387 nodes and 162 sites): each site has one principal researcher and one system administrator.

79 Peer-to-peer computing
For computation on P2P the shared resources are:
Information: books, papers, …, in the typical way.
Libraries: one peer takes a library from another peer. A description of the library and the system is necessary to know whether the library fulfils our requests.
Computation: one peer collaborates to solve a problem proposed by another peer. This is the central idea of computation on P2P.

80 Peer-to-peer computing
Two peers collaborate in the solution of a computational problem using the hierarchy of parallel linear algebra libraries. (Figure: the two peers' library hierarchies, built from PLAPACK, ScaLAPACK, PBLAS, BLACS, LAPACK (reference and machine), BLAS/ATLAS and MPI (reference and machine).)

81 Peer-to-peer computing
There are different global hierarchies and different libraries. (Figure: the two library hierarchies of the peers.)

82 Peer-to-peer computing
And the installation information varies, which makes the efficient use of the theoretical model more difficult than in the heterogeneous case. (Figure: the two hierarchies, with installation information attached to each library.)

83 Peer-to-peer computing
Trust problems appear:
Does the library solve the problems we need solved?
Is the library optimized for the system it claims to be optimized for?
Is the installation information correct?
Is the system stable?
There are trust algorithms for P2P systems; are they (or some modification of them) applicable to these trust problems?

84 Peer-to-peer computing
Each peer would have the possibility of establishing a policy of use:
The use of the resources could be payable
The percentage of CPU dedicated to computations for the community
The types of problems it is interested in
And the MAIN PROBLEM: is it interesting to develop a P2P system for the management and optimization of computational codes?

