Advances in the Optimization of Parallel Routines (III)
Domingo Giménez, Departamento de Informática y Sistemas, Universidad de Murcia, Spain (dis.um.es/~domingo). 04 April 2019, Universidad Politécnica de Valencia
Outline
A little history
Modelling Linear Algebra Routines
Installation routines
Autotuning routines
Modifications to libraries’ hierarchy
Polylibraries
Algorithmic schemes
Heterogeneous systems
Hybrid programming
Peer to peer computing
Collaborations and self-references
Algorithmic schemes + J. P. Martínez: Automatic Optimization in Parallel Dynamic Programming Schemes. 2004
Algorithmic schemes
The aim is to study ALGORITHMIC SCHEMES, not individual routines. The study could be useful to:
Design libraries to solve problems in different fields: Divide and Conquer, Dynamic Programming, Branch and Bound (La Laguna)
Develop SKELETONS which could be used in parallel programming languages: Skil, Skipper, CARAML, P3L, …
Dynamic Programming
There are different parallel dynamic programming schemes. The simple scheme of the “coins problem” is used: given a quantity C and n types of coins of values v = (v1, v2, …, vn), with quantities q = (q1, q2, …, qn) of each type, minimize the number of coins used to give C. The granularity of the computation has been varied in order to study the scheme, not the problem.
Dynamic Programming
Sequential scheme:
for i = 1 to number_of_decisions
  for j = 1 to problem_size
    obtain the optimum solution with i decisions and problem size j
  endfor
  complete the table with the formula
endfor
[Figure: the dynamic programming table, with rows i = 1 … n and columns j = 1 … N, filled row by row]
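The sequential scheme can be made concrete as runnable code. A minimal sketch in Python: the slides give only pseudocode and the table-update formula did not survive, so the bounded-coin recurrence used here is an assumption, and all names are illustrative.

```python
# Sequential dynamic programming scheme for the bounded "coins problem":
# row i of the table holds the minimum number of coins needed to give each
# quantity j using only the first i coin types, with at most q[i] coins of
# type i. The recurrence is an assumed instance of the scheme on the slide.
INF = float("inf")

def min_coins(C, v, q):
    n = len(v)
    # row 0: quantity 0 needs 0 coins, any other quantity is unreachable
    prev = [0] + [INF] * C
    for i in range(n):                    # one row per decision (coin type)
        cur = prev[:]                     # using 0 coins of type i
        for j in range(1, C + 1):         # one column per problem size
            for k in range(1, q[i] + 1):  # try k coins of type i
                if k * v[i] > j:
                    break
                if prev[j - k * v[i]] + k < cur[j]:
                    cur[j] = prev[j - k * v[i]] + k
        prev = cur
    return prev[C]

print(min_coins(11, [1, 2, 5], [11, 5, 2]))  # prints 3 (5 + 5 + 1)
```

The table is filled row by row, exactly as in the loop nest above, which is what the parallel schemes on the following slides distribute.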
Dynamic Programming
Parallel scheme:
for i = 1 to number_of_decisions
  In Parallel:
    for j = 1 to problem_size
      obtain the optimum solution with i decisions and problem size j
    endfor
  endInParallel
endfor
[Figure: the columns of each row distributed among processors P0 … PK]
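A sketch of the parallel scheme for one row: the columns j = 1 … C are computed concurrently, mirroring the "In Parallel: for j" loop. Python threads stand in for the parallel processes of the slide; this is illustrative only, not the original implementation.

```python
# One row of the dynamic programming table computed in parallel: the
# columns are split among worker threads. The column update is the same
# assumed bounded-coin recurrence as in the sequential sketch.
from concurrent.futures import ThreadPoolExecutor

def parallel_row(prev, v_i, q_i, workers=4):
    C = len(prev) - 1
    def column(j):
        best = prev[j]                    # 0 coins of type i
        for k in range(1, q_i + 1):       # k coins of value v_i
            if k * v_i > j:
                break
            best = min(best, prev[j - k * v_i] + k)
        return best
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [prev[0]] + list(pool.map(column, range(1, C + 1)))
```

Each column of row i depends only on row i-1, so the columns of a row are independent; that independence is what both the shared-memory and the message-passing schemes exploit.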
Dynamic Programming
Message-passing scheme:
In each processor Pj:
  for i = 1 to number_of_decisions
    communication step
    obtain the optimum solution with i decisions and the problem sizes Pj has assigned
  endfor
endInEachProcessor
[Figure: the N columns block-distributed among processors P0 … PK]
Dynamic Programming
Theoretical model: sequential cost, computational parallel cost (for large qi), and communication cost of one step of process Pp [formulas lost in the transcription]. The only AP is p; the SPs are tc, ts and tw.
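The cost expressions themselves did not survive. As an assumption only, consistent with the parameters just listed (one table row per decision, C subproblems per row, p processes), they plausibly have the shape:

```latex
% Assumed shapes, NOT the original formulas:
t_{seq} = n \, C \, t_c
\qquad
t_{comp} = \frac{n \, C}{p} \, t_c \quad (q_i \text{ large})
\qquad
t_{comm} = n \left( t_s + \frac{C}{p} \, t_w \right)
```

Whatever the exact form, the structure the later slides rely on is this: an arithmetic term scaled by tc and divided by p, plus a per-step communication term in ts and tw.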
Dynamic Programming
How to estimate the arithmetic SPs: solving a small problem.
How to estimate the communication SPs:
Using a ping-pong (CP1)
Solving a small problem varying the number of processors (CP2)
Solving problems of selected sizes on systems of selected sizes (CP3)
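CP1-style estimation amounts to fitting the linear communication model t(m) = ts + m · tw to measured ping-pong times. A sketch with illustrative names; the timings below are synthetic stand-ins for real measurements.

```python
# Estimating the communication SPs t_s (start-up time) and t_w
# (word-sending time) by least-squares fit of t(m) = t_s + m * t_w over
# message sizes m, as a ping-pong installation routine would do.
def fit_ts_tw(sizes, times):
    n = len(sizes)
    sx = sum(sizes); sy = sum(times)
    sxx = sum(m * m for m in sizes)
    sxy = sum(m * t for m, t in zip(sizes, times))
    tw = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope
    ts = (sy - tw * sx) / n                           # intercept
    return ts, tw

# synthetic timings generated with t_s = 1e-4 s and t_w = 2e-8 s/word
sizes = [1000, 2000, 4000, 8000]
times = [1e-4 + 2e-8 * m for m in sizes]
ts, tw = fit_ts_tw(sizes, times)
```

CP2 and CP3 replace the ping-pong by timed executions of the routine itself, so the fitted parameters absorb system effects a bare ping-pong misses.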
Dynamic Programming
Experimental results. Systems:
SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster) + Ethernet
PenFE: seven Pentium III + FastEthernet
Varying:
The problem size: C = 10000, 50000, …, with large values of qi
The granularity of the computation (the cost of a computational step)
Dynamic Programming
Experimental results:
CP1: ping-pong (point-to-point communication). Does not reflect the characteristics of the system.
CP2: executions with the smallest problem (C = 10000), varying the number of processors. Reflects the characteristics of the system, but the time also changes with C. Larger installation time (6 and 9 seconds).
CP3: executions with selected problem sizes (C = 10000, …) and system sizes (p = 2, 4, 6), with linear interpolation for other sizes. Larger installation time (76 and 35 seconds).
Dynamic Programming
Parameter selection. [Table: the parameters selected by LT, CP1, CP2 and CP3 for C = 10000 and 50000 and granularities 10, 50 and 100, in SUNEt and PenFE; most values were lost in the transcription.]
Dynamic Programming
Quotient between the execution time with the parameter selected by each of the selection methods and the lowest execution time, in SUNEt. [Chart]
Dynamic Programming
Quotient between the execution time with the parameter selected by each of the selection methods and the lowest execution time, in PenFE. [Chart]
Dynamic Programming
Three types of users are considered:
GU (greedy user): uses all the available processors.
CU (conservative user): uses half of the available processors.
EU (expert user): uses a different number of processors depending on the granularity: 1 for low granularity, half of the available processors for middle granularity, all the processors for high granularity.
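The three modelled users can be written as a small selection rule. The granularity labels are taken from the slide; everything else (names, the discrete labels instead of numeric thresholds) is illustrative.

```python
# Number of processors chosen by each modelled user, given the number of
# available processors and the granularity of the computation.
def processors_to_use(user, available, granularity):
    if user == "GU":                      # greedy: everything
        return available
    if user == "CU":                      # conservative: half
        return available // 2
    if user == "EU":                      # expert: depends on granularity
        if granularity == "low":
            return 1
        if granularity == "middle":
            return available // 2
        return available                  # high granularity
    raise ValueError(user)
```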
Dynamic Programming
Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in SUNEt. [Chart]
Dynamic Programming
Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in PenFE. [Chart]
Collaborations and self-references
Heterogeneous systems
+ G. Carrillo: Installation routines for linear algebra libraries on LANs. 2000
+ J. Cuenca + J. Dongarra + J. González + K. Roche: Automatic Optimization of Parallel Linear Algebra Routines in Systems with Variable Load. 2003
+ J. Cuenca + J. P. Martínez: Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems. 2004
Heterogeneous algorithms
New algorithms with unbalanced distribution of data are necessary: different SPs for different processors, and the APs include a vector of selected processors and a vector of block sizes. [Figure: Gauss elimination with block sizes b0, b1, b2]
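A minimal sketch of an unbalanced distribution: block sizes proportional to the relative speeds of the processors. The slides do not give the method, so this rounding scheme is an assumption, with illustrative names.

```python
# Unbalanced data distribution for a heterogeneous system: block sizes
# proportional to relative processor speeds, adjusted so they sum to the
# problem size n.
def block_sizes(n, speeds):
    total = sum(speeds)
    sizes = [n * s // total for s in speeds]   # integer shares, rounded down
    for i in range(n - sum(sizes)):            # hand out the remainder
        sizes[i % len(sizes)] += 1
    return sizes
```

The resulting vector of block sizes is exactly the kind of AP vector the slide describes, alongside the vector of selected processors.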
Heterogeneous algorithms
Parameter selection:
RI-THE: obtains p and b from the formula (homogeneous distribution)
RI-HOM: obtains p and b through a reduced number of executions (homogeneous distribution)
RI-HET: obtains p and a block size for each processor through a reduced number of executions (heterogeneous distribution)
Heterogeneous algorithms
Quotient with respect to the lowest experimental execution time, for RI-THEO, RI-HOMO and RI-HETE, with sizes 500 to 3000. [Charts]
Heterogeneous system: two SUN Ultra 1 (one manages the file system) + one SUN Ultra 5
Homogeneous system: five SUN Ultra 1
Hybrid system: five SUN Ultra 1 + one SUN Ultra 5
Parameter selection at running time
[Diagram: at installation time the LAR is modelled, the SP-estimators are implemented, and the static SPs are estimated into a Static-SP file, using the basic libraries and the installation file.]
Parameter selection at running time
At run time the NWS is called and it reports: the fraction of available CPU (fCPU), and the current word-sending time (tw_current) for specific n and AP values (n0, AP0). Then the fraction of available network is calculated.
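A sketch of the dynamic adjustment of the SPs with the NWS report. The scaling rules (dividing the static values by the available fractions) are an assumption consistent with the text, and all names are illustrative.

```python
# Dynamic adjustment of the system parameters with the NWS report: the
# fraction of available network is the ratio of the installation-time t_w
# to the currently reported t_w, and the static SPs are scaled by the
# available fractions.
def adjust_sps(tc_static, tw_static, f_cpu, tw_current):
    f_net = tw_static / tw_current    # fraction of available network
    tc_now = tc_static / f_cpu        # arithmetic cost grows as CPU share drops
    tw_now = tw_static / f_net        # equals the reported tw_current
    return tc_now, tw_now, f_net
```

With the adjusted SPs in hand, the same installation-time model can be re-evaluated at run time to re-select the APs.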
Parameter selection at running time
[Table: CPU availability and tw-current on nodes 1-8 under five load situations; most timing values were lost in the transcription.]
Situation A: 100% CPU available on all nodes
Situation B: 80% CPU on nodes 1-4, 100% on nodes 5-8
Situation C: 60% CPU on nodes 1-4, 100% on nodes 5-8
Situation D: 60% CPU on nodes 1-4, 100% on nodes 5-6, 80% on nodes 7-8 (tw-current up to 0.8 sec)
Situation E: 60% CPU on nodes 1-4, 100% on nodes 5-6, 50% on nodes 7-8 (tw-current up to 4.0 sec)
Parameter selection at running time
The values of the SPs are dynamically adjusted (Current-SP) according to the current situation. [Formulas lost in the transcription.]
Parameter selection at running time
With the adjusted SPs, the optimum AP values are selected: the block size and the number of nodes to use (p = r × c), for each problem size n and each situation of the platform load (A-E). [Table values lost in the transcription.]
Parameter selection at running time
Finally, the LAR is executed with the selected APs. [Diagram: the complete process, from modelling the LAR at installation time to the execution of the LAR with dynamically adjusted SPs and selected APs.]
Parameter selection at running time
[Charts: deviation from the lowest execution time of the static and the dynamic model, for n = 1024, 2048 and 3072, under load situations A-E.]
Work distribution
There are different possibilities in heterogeneous systems:
Heterogeneous algorithms (Gauss elimination).
Homogeneous algorithms with assignation of: one process to each processor (LU factorization), or a variable number of processes to each processor, depending on the relative speed.
The general assignation problem is NP-hard, so heuristic approximations are used.
Work distribution
Dynamic Programming (the coins problem scheme): homogeneous algorithm + heterogeneous distribution. [Figure: the table columns distributed in blocks of different sizes among processes P0 … PK, assigned to processors of different speeds.]
Work distribution
The model: t(n, C, v, q, tc(n, C, v, q, p, b, d), ts(n, C, v, q, p, b, d), tw(n, C, v, q, p, b, d))
Problem size: n (number of types of coins), C (value to give), v (array of values of the coins), q (quantity of coins of each type)
Algorithmic parameters: p (number of processes), b (block size, here n/p), d (processes-to-processors assignment)
System parameters: tc (cost of basic arithmetic operations), ts (start-up time), tw (word-sending time)
Work distribution
Theoretical model: the same as in the homogeneous case, because the same homogeneous algorithm is used. Sequential cost, computational parallel cost (large qi) and communication cost [formulas lost in the transcription]. There is a new AP, d, and the SPs are now unidimensional (tc) or bidimensional (ts, tw) tables.
Work distribution
Assignment tree (P types of processors and p processes): each node adds one process, choosing a processor type from 1 to P, with non-decreasing types along a branch. Some limit on the height of the tree (the number of processes) is necessary.
Work distribution
Assignment tree for P = 2 types of processors and p = 3 processes: 10 nodes. [The general node-count formula was lost in the transcription.]
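The tree nodes are the non-decreasing sequences of processor types; enumerating them reproduces the count given for P = 2 and p = 3. A sketch with illustrative function names.

```python
# Enumerating the assignment tree: non-root nodes are the non-decreasing
# sequences of processor types, of length 1..p over types 1..P. For P = 2
# and p = 3 this gives 9 assignments, i.e. 10 tree nodes counting the root.
from itertools import combinations_with_replacement

def assignment_nodes(P, p):
    nodes = []
    for length in range(1, p + 1):
        nodes.extend(combinations_with_replacement(range(1, P + 1), length))
    return nodes

print(len(assignment_nodes(2, 3)))   # prints 9 (10 nodes with the root)
```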
Work distribution
Assignment tree for SUNEt, with P = 2 types of processors (five SUN Ultra 1 + one SUN Ultra 5): one process is assigned to each processor, and when more processes than available processors are assigned to a type of processor, the costs of the operations (the SPs) change. [Figure: branches assigning processes to U5 and U1.]
Work distribution
Assignment tree for TORC, with P = 4 types of processors used:
one 1.7 GHz Pentium 4 (only one process can be assigned): type 1
one 1.2 GHz AMD Athlon: type 2
one 600 MHz single Pentium III: type 3
eight 550 MHz dual Pentium III: type 4
Four processors are not in the tree; when two consecutive processes are assigned to a same node, the values of the SPs change.
Work distribution
Use Branch and Bound or Backtracking (with node elimination) to search the tree. The theoretical execution model estimates the cost at each node, using the highest SP values among the types of processors considered, multiplied by the number of processes assigned to the most loaded processor of that type.
Work distribution
Use Branch and Bound or Backtracking (with node elimination) to search the tree, using the theoretical execution model to obtain a lower bound for each node. For example, with an array of processor types (1,1,1,2,2,2,3,3,3,4,4,4) with relative speeds si, and an array of assignations a = (2,2,3), the array of possible assignations is pa = (0,0,0,1,1,0,1,1,1,1,1,1). The maximum achievable speed [formula lost in the transcription] gives the minimum arithmetic cost, and the lowest communication costs are obtained from those between the processors in the array of assignations.
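The search with elimination can be sketched over a much simplified cost model (arithmetic cost only, even work split, slowest process dominating). The use of the maximum achievable combined speed as a bound follows the slide; everything else, including all names, is an assumption.

```python
# Backtracking over non-decreasing type assignments with elimination: the
# search stops early when even the maximum achievable combined speed (all
# processors of all types busy) cannot improve on the best time found.
def best_assignment(work, type_speeds, type_counts, max_procs):
    P = len(type_speeds)
    max_speed = sum(s * c for s, c in zip(type_speeds, type_counts))
    best = {"time": float("inf"), "assign": None}

    def cost(assign):
        # even split of the work; the slowest process dominates
        share = work / len(assign)
        return max(share / type_speeds[t] for t in assign)

    def search(assign, first_type):
        if assign:
            t = cost(assign)
            if t < best["time"]:
                best["time"], best["assign"] = t, tuple(assign)
        # elimination: no assignment can run faster than work / max_speed
        if work / max_speed >= best["time"]:
            return
        if len(assign) == max_procs:
            return
        for t in range(first_type, P):       # non-decreasing types
            if assign.count(t) < type_counts[t]:
                assign.append(t)
                search(assign, t)
                assign.pop()

    search([], 0)
    return best["assign"], best["time"]
```

With one fast processor (speed 2) and two slow ones (speed 1), and work 12, the search picks all three processes even though the slow processors dominate each step, because the per-process share shrinks faster than the slowdown.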
Work distribution
Theoretical model: sequential cost, computational parallel cost (large qi) and communication cost of one step, taking maximum values [formulas lost in the transcription]. The APs are p and the assignation array d. The SPs are the unidimensional array tc and the bidimensional arrays ts and tw.
Work distribution
How to estimate the arithmetic SPs: solving a small problem on each type of processor.
How to estimate the communication SPs:
Using a ping-pong between each pair of processors, and between processes in the same processor (CP1). Does not reflect the characteristics of the system.
Solving a small problem varying the number of processors, with linear interpolation (CP2). Larger installation time.
Work distribution
Three types of users are considered:
GU (greedy user): uses all the available processors, with one process per processor.
CU (conservative user): uses half of the available processors (the fastest), with one process per processor.
EU (user expert in the problem, the system and heterogeneous computing): uses a different number of processes and processors depending on the granularity: 1 process on the fastest processor for low granularity; as many processes as half of the available processors, on the appropriate processors, for middle granularity; as many processes as processors, on the appropriate processors, for large granularity.
Work distribution
Quotient between the execution time with the parameters selected by each of the selection methods and modelled users and the lowest execution time, in SUNEt. [Chart]
Work distribution
Parameter selection in TORC with CP2. [Table: the assignments selected by LT and CP2 for C = 50000, 100000 and 500000 and granularities 10, 50 and 100; surviving entries include (1,2), (1,2,4,4) and (1,2,3,4), the rest were lost in the transcription.]
Work distribution
Parameter selection in TORC (without the 1.7 GHz Pentium 4) with CP2:
one 1.2 GHz AMD Athlon: type 1
one 600 MHz single Pentium III: type 2
eight 550 MHz dual Pentium III: type 3
[Table: the assignments selected by LT and CP2 for C = 50000, 100000 and 500000 and granularities 10, 50 and 100; surviving entries include (1,1,2), (1,1,2,3,3,3,3,3,3), (1,1,2,3,3,3,3,3,3,3,3), (1,1,3,3), (1,1,3) and (1,1,2,3).]
Work distribution
Quotient between the execution time with the parameters selected by each of the selection methods and modelled users and the lowest execution time, in TORC. [Chart]
Work distribution
Quotient between the execution time with the parameters selected by each of the selection methods and modelled users and the lowest execution time, in TORC (without the 1.7 GHz Pentium 4). [Chart]
Hybrid programming
OpenMP: fine-grain parallelism; efficient in SMPs; sequential and parallel codes are similar; tools for development and parallelisation; allows run-time scheduling; memory allocation can reduce performance.
MPI: coarse-grain parallelism; more portable; parallel code very different from the sequential code; development and debugging more complex; static assignment of processes; local memories, which facilitates their efficient use.
Hybrid programming
Advantages of hybrid programming:
To improve scalability
When too many tasks produce load imbalance
Applications with both fine- and coarse-grain parallelism
Reduction of the code development time
When the number of MPI processes is fixed
In case of a mixture of functional and data parallelism
Hybrid programming
Hybrid programming in the literature: most of the papers are about particular applications; some papers present hybrid models; no theoretical models of the execution time are available.
Hybrid programming
Systems:
Networks of dual Pentiums
HPC160 (four processors per node)
IBM SP Blue Horizon (144 nodes of 8 processors each)
Earth Simulator (640 × 8 vector processors)
…
Hybrid programming
Models:
MPI+OpenMP: OpenMP used for loop parallelisation
OpenMP+MPI: unsafe threads
MPI and OpenMP processes in SPMD model: reduces the cost of communications
Hybrid programming
Example: hybrid MPI+OpenMP computation of π, with the iterations distributed among the MPI processes and the loop of each process shared among its OpenMP threads:

      program main
      include 'mpif.h'
      double precision mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, ierr
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
      call MPI_BCAST(n,1,MPI_INTEGER,0,
     &MPI_COMM_WORLD,ierr)
      h = 1.0d0/n
      sum = 0.0d0
!$OMP PARALLEL DO REDUCTION (+:sum) PRIVATE (x)
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   enddo
!$OMP END PARALLEL DO
      mypi = h * sum
      call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,
     &MPI_SUM,0,MPI_COMM_WORLD,ierr)
      call MPI_FINALIZE(ierr)
      stop
      end
Hybrid programming
It is not clear whether hybrid programming lowers the execution time. Lanucara, Rovida: Conjugate Gradient. [Chart]
Hybrid programming
It is not clear whether hybrid programming lowers the execution time. Djomehri, Jin: CFD solver. [Chart]
Hybrid programming
It is not clear whether hybrid programming lowers the execution time. Viet, Yoshinaga, Abderazek, Sowa: linear system. [Chart]
Hybrid programming
Matrix-matrix multiplication: decide which is preferable, MPI SPMD or MPI+OpenMP. MPI+OpenMP needs less memory and fewer communications, but may make worse use of the memory. [Figure: the blocks of the matrices distributed among nodes N0, N1, N2 and processes p0, p1.]
Hybrid programming
In the theoretical time model more algorithmic parameters appear.
8 processors: p = r×s: 1×8, 2×4, 4×2, 8×1; or p = r×s: 1×4, 2×2, 4×1 with q = u×v: 1×2, 2×1; 6 configurations in total.
16 processors: p = r×s: 1×16, 2×8, 4×4, 8×2, 16×1; q = u×v: 1×4, 2×2, 4×1; 9 configurations in total.
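Enumerating the r × s factorizations of the process grid and the u × v factorizations of the thread grid gives the configurations to evaluate. A small illustrative helper with assumed names.

```python
# All r x s grid shapes for a given number p: the divisor pairs of p.
def grids(p):
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]

# e.g. 8 processors arranged as 4 dual nodes: a process grid per node
# count combined with a thread grid per node.
process_grids = grids(4)   # [(1, 4), (2, 2), (4, 1)]
thread_grids = grids(2)    # [(1, 2), (2, 1)]
configs = [(pg, tg) for pg in process_grids for tg in thread_grids]
print(len(configs))        # prints 6
```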
Hybrid programming
And more system parameters: the cost of communications is different inside and outside a node (similar to the heterogeneous case with more than one process per processor), and the cost of the arithmetic operations can vary with the number of threads in the node. Consequently, the algorithms must be recoded and new models of the execution time must be obtained.
Hybrid programming
… and the formulas change: communications between nodes, synchronizations between the threads inside each node. For some systems 6×1 nodes and 1×6 threads could be better, and for others 1×6 nodes and 6×1 threads. [Figure: processes P0 … P6 on nodes 1-6, with communications between nodes and synchronizations inside them.]
Hybrid programming
Open problem: is it possible to automatically generate MPI+OpenMP programs from MPI programs? Maybe for the SPMD model, or at least for some types of programs, such as matricial problems on meshes of processors. And is it possible to obtain the execution time of the MPI+OpenMP program from that of the MPI program, together with some description of how the time model was obtained?
Peer to peer computing
Distributed systems are inherently heterogeneous and dynamic, but there are other problems: a higher communication cost, and special middleware is necessary. The typical paradigms are master/slave and client/server, where different types of processors (users) are considered.
Peer to peer computing
Peer-to-Peer Computing. Dejan S. Milojicic, Vana Kalogeraki, Rajan Lukose, Kiran Nagaraja, Jim Pruyne, Bruno Richard, Sami Rollins, Zhichen Xu. HP Laboratories Palo Alto, 2002.
Peer to peer computing
In a peer-to-peer system all the processors (users) are at the same level (at least initially), and the community selects, in a democratic and continuous way, the topology of the global network. Would it be interesting to have a P2P system for computing? Is some system of this type available?
Peer to peer computing
Would it be interesting to have a P2P system for computing? I think it would be interesting to develop a system of this type, and to let the community decide, in a democratic and continuous way, whether it is worthwhile. Is some system of this type available? I think there is no pure P2P system dedicated to computation.
Peer to peer computing
… and other people seem to think the same:
Lichun Ji (2003): “… P2P networks seem to outperform other approaches largely due to the anonymity of the participants in the peer-network, low network costs and the inexpensive disk-space. Trying to apply P2P principles in the area of distributed computation was significantly less successful”
Arjav J. Chakravarti, Gerald Baumgartner, Mario Lauria (2004): “… current approaches to utilizing desktop resources require either centralized servers or extensive knowledge of the underlying system, limiting their scalability”
Peer to peer computing
There are a lot of tools for grid computing:
Globus (of course); but does Globus provide computational P2P capacity, or is it a tool with which P2P computational systems can be developed?
NetSolve/GridSolve: uses a client/server structure.
PlanetLab (at present 387 nodes and 162 sites): at each site, one principal researcher and one system administrator.
Peer to peer computing
For computation on P2P the shared resources are:
Information: books, papers, …, in the typical way.
Libraries: one peer takes a library from another peer. A description of the library and of the system is necessary to know whether the library fulfils our requests.
Computation: one peer collaborates to solve a problem proposed by another peer. This is the central idea of computation on P2P.
Peer to peer computing
Two peers collaborate in the solution of a computational problem using the hierarchy of parallel linear algebra libraries. [Figure: the two peers' hierarchies of libraries: PLAPACK, ScaLAPACK, reference LAPACK, ATLAS, PBLAS, BLACS, machine MPI, machine LAPACK, reference BLAS, reference MPI.]
Peer to peer computing
There are different global hierarchies and different libraries in each peer. [Figure: the two peers' library hierarchies.]
Peer to peer computing
And the installation information varies, which makes the efficient use of the theoretical model more difficult than in the heterogeneous case. [Figure: each library in each peer's hierarchy carries its own installation information.]
Peer to peer computing
Trust problems appear:
Does the library solve the problems we require to be solved?
Is the library optimized for the system it claims to be optimized for?
Is the installation information correct?
Is the system stable?
There are trust algorithms for P2P systems; are they (or some modification of them) applicable to these trust problems?
Peer to peer computing
Each peer would have the possibility of establishing a policy of use:
The use of the resources could be payable
The percentage of CPU dedicated to computations for the community
The types of problems it is interested in
And the MAIN PROBLEM: is it interesting to develop a P2P system for the management and optimization of computational codes?