Download presentation
Presentation is loading. Please wait.
Published byNelson Pierce Modified over 9 years ago
1
A Data Communication Reliability and Trustability Study for Cluster Computing Speaker: Eduardo Colmenares Midwestern State University Wichita Falls, TX
2
Introduction HPC –Relevant to a variety of sciences, not only CS –More and more researchers considering new technologies capable of considerable computational power Distributed Shared Memory Systems (Clusters) –Combine computational capabilities of multiple nodes via a high speed communication network. Clusters, if public –Multiple users »Need policies to schedule jobs and manage resources –We acknowledge to potential characteristics of interest Performance Consistent time accuracy 2
3
Ping-Pong Test Purpose: –To measure the end-to-end delay time associated with sending a message back and forth between processes in a cluster of workstations (or any other parallel system) –Two variants 3
4
Both Ping-Pong Tests Message Size = 8 bytes 4 Ping-Pong-APing-Pong-B
5
Testing Environments Community-Cluster This cluster has 12TB of public shared Lustre storage and three groups of public and private nodes, all connected by SDR Infiniband and Gigabit Ethernet. The quad-core nodes have Infinihost III Lx (PCI-e) cards, and the older nodes have Infinihost (PCI-X) cards. Public quad-core (512 cpu, 4.77 TF). 64 nodes with dual quad-core Intel 5345 processors (2.33 GHz) and 12GB of memory each. Designated compute-1-x, 2-x. Public single-core (128 cpu, 0.82 TF). 64 nodes with dual single-core Intel "Irwindale" processors (3.2 GHz) and 4 GB of memory each. Designated compute-3-x, 4-x, 5-x. Public AMD dual-core (8 cpu,.04 TF). 1 node with quad dual-core AMD 8218 processors (2.60 GHz) and 64GB of memory. Designated compute-8-1. Multi-user and 3 queues {2WKpar, 48Hquadpar, 2WKpar} 5 My-Cluster Each one of the nodes in this cluster has the following characteristics: one Intel(R) Pentium(R) 4 CPU at 1.70GHz, one 3Com PCI 3c905C Tornado network card. All the nodes in this cluster are interconnected via a 3Com® Super Stack® 3 Switch 3300 12-Port. Table 1 summarizes the major hardware differences between nodes. 1 user NODEMEMORY (MB) HARD DISK 1511.46 40020 MB-T340016A, ATA 21023.4 40020 MB-T340016A, ATA 31023.4 20547 MB-MAXTOR 6L020J1, ATA 4511.46 40020 MB-T340016A, ATA 51023.4 20547 MB-MAXTOR 6L020J1, ATA 61023.4 20547 MB-MAXTOR 6L020J1, ATA 71023.4 40020 MB-WDC WD400BB-75DEA0, ATA 81023.4 40027 MB-MAXTOR 6L040J2, ATA 91023.4 40027 MB-MAXTOR 6L040J2, ATA
6
Experimental results Ping-Pong-A Community-Cluster (queue: 2WKpar) 6 Community-Cluster - 2WKpar - 9P – 1 st Sample Included Proc-Pair Average (usecs) Stdv (usecs) Min (usecs) Max (usecs) P0 - P11218.2912091.288120922 P0 - P2751.87424.667874256 P0 - P3598.895895.264858962 P0 - P4496.84885.778748866 P0 - P5599.545911.764759126 P0 - P6600.085905.447859064 P0 - P7597.55893.182758940 P0 - P8599.125908.675759095 Community-Cluster - 2WKpar - 9P – 1 st Sample Removed Proc-Pair Average (usecs) Stdv (usecs) Min (usecs) Max (usecs) P0 - P19.161.861024825 P0 - P29.331.8681825 P0 - P39.361.548235821 P0 - P48.221.644754718 P0 - P58.361.36617716 P0 - P69.532.149101825 P0 - P78.181.312138717 P0 - P88.251.592991720
7
Experimental results Ping-Pong-A Community-Cluster (queue: 2WKpar) 7
8
Experimental results Ping-Pong-A Community-Cluster (queue: 48Hpar) 8 Community-Cluster – 48Hpar - 9P – 1 st Sample Removed Proc-Pair Average (usecs) Stdv (usecs) Min (usecs) Max (usecs) P0 - P1 12.808082.9612351027 P0 - P2 13.868692.4480171222 P0 - P3 13.494952.8007791126 P0 - P4 12.313132.6482431024 P0 - P5 13.888892.7549171127 P0 - P6 13.484852.7081271126 P0 - P7 13.525252.7491871125 P0 - P8 13.505052.5887261124
9
Experimental results Ping-Pong-A Community-Cluster (queue: 48Hpar) 9
10
Experimental results Ping-Pong-A Community-Cluster (queue: 48Hquadpar) 10 Community-Cluster - 48Hquadpar-9P-1 st Sample Removed Proc-Pair Average (usecs) Stdv (usecs) Min (usecs) Max (usecs) P0 - P110.383842.141462819 P0 - P210.252531.745802821 P0 - P310.545451.540426916 P0 - P49.3737371.7237718 P0 - P516.6666753.267338526 P0 - P610.141411.51866818 P0 - P79.2828281.450126815 P0 - P89.3636361.548235816
11
Experimental results Ping-Pong-A Community-Cluster (queue: 48Hquadpar) 11
12
Experimental results Ping-Pong-A My-Cluster 12 My-Cluster - 9P - 1 st Sample Removed Proc-Pair Average (usecs) Stdv (usecs) Min (usecs) Max (usecs) P0 - P1 134.111119.91507127273 P0 - P2 144.20249.94302126477 P0 - P3 140.9091170.26791141676 P0 - P4 138.0202100.3862115821 P0 - P5 162.7172185.86681181495 P0 - P6 145.1515163.17161171575 P0 - P7 154.8485152.93471171086 P0 - P8 138.0101123.8241161277 My-Cluster - 9P - 1 st SAMPLE INCLUDED Proc-Pair Average (usecs) Stdv (usecs) Min (usecs) Max (usecs) P0 - P1203.41693.27211277064 P0 - P2211.21671.91971266845 P0 - P3200.24617.02021146074 P0 - P4196.65594.74441156001 P0 - P5222.55626.25411186146 P0 - P6203.47605.35991175977 P0 - P7213.4604.96341176010 P0 - P8200.64638.30081166401
13
Experimental results Ping-Pong-A My-Cluster 13 Community-Cluster - 2WKpar - 9P – 1 st Sample Removed Proc-Pair Average (usecs) Stdv (usecs) Min (usecs) Max (usecs) P0 - P19.161.861024825 P0 - P29.331.8681825 P0 - P39.361.548235821 P0 - P48.221.644754718 P0 - P58.361.36617716 P0 - P69.532.149101825 P0 - P78.181.312138717 P0 - P88.251.592991720
14
Experimental results Ping-Pong-A Let’s compare 14 Community-Cluster - 2WKpar - 9P – 1 st Sample Removed Proc-Pair Average (usecs) Stdv (usecs) Min (usecs) Max (usecs) P0 - P19.161.861024825 P0 - P29.331.8681825 P0 - P39.361.548235821 P0 - P48.221.644754718 P0 - P58.361.36617716 P0 - P69.532.149101825 P0 - P78.181.312138717 P0 - P88.251.592991720
15
Experimental results Ping-Pong-B Community-Cluster (queue: 2WKpar) 15 Community-Cluster - 2WKpar - 9P – 1 st Sample Removed Proc-Pair Average (usecs) Stdv (usecs) Min (usecs) Max (usecs) P0 - P1 9.5858591.628885314818 P0 - P2 9.2929291.56656732820 P0 - P3 11.545451.9601787051028 P0 - P4 10.252531.547502493922 P0 - P5 13.04045.396369134940 P0 - P6 9.2828281.229239055818 P0 - P7 25.4040489.001595869888 P0 - P8 10.262630.932254877913 Community-Cluster- 2WKpar - 9P – 1 st SAMPLE INCLUDED Proc-Pair Average (usecs) Stdv (usecs) Min (usecs) Max (usecs) P0 - P1565.795562.042855630 P0 - P2508.854995.571849965 P0 - P3511.284997.3461049985 P0 - P4509.174989.175949902 P0 - P5513.835007.899950092 P0 - P6509.034997.472849984 P0 - P7524.874995.444949972 P0 - P8510.244999.774950008
16
Experimental results Ping-Pong-B Community-Cluster (queue: 2WKpar) 16
17
Experimental results Ping-Pong-B Community-Cluster (queue: 48Hpar) 17 Community-Cluster – 48Hpar - 9P – 1 st Sample Removed Proc-Pair Average (usecs) Stdv (usecs) Min (usecs) Max (usecs) P0 – P1 12.191922.4145678941021 P0 – P2 12.96972.500834741122 P0 - P3 11.898992.3840239931020 P0 - P4 12.939392.3768362691122 P0 - P5 12.949492.4800170231122 P0 - P6 14.777782.5933807251324 P0 - P7 12.898992.2968254471121 P0 - P8 13.272733.247662211135
18
Experimental results Ping-Pong-B Community-Cluster (queue: 48Hpar) 18
19
Experimental results Ping-Pong-B Community-Cluster (queue: 48Hquadpar) 19 Community-Cluster – 48Hquadpar - 9P – 1 st Sample Removed Proc-Pair Average (usecs) Stdv (usecs) Min (usecs) Max (usecs) P0 - P1 9.3434341.691283875819 P0 - P2 9.4646461.547302665821 P0 - P3 10.909092.321553443927 P0 - P4 10.363641.438923531922 P0 - P5 10.616161.838960835922 P0 - P6 10.474751.053114277915 P0 - P7 9.3535350.872635362813 P0 - P8 9.6767682.668135032828
20
Experimental results Ping-Pong-B Community-Cluster (queue: 48Hquadpar) 20
21
Experimental results Ping-Pong-B My-Cluster 21 My-Cluster - 9P – 1 st Sample Removed Proc-Pair Average (usecs) Stdv (usecs) Min (usecs) Max (usecs) P0 - P1 155.565754.73240312133583 P0 - P2 199.4545410.59139211384221 P0 - P3 214.9293788.53132991227978 P0 - P4 261.96971216.54782112212236 P0 - P5 174.8283272.27593451242767 P0 - P6 161.3636233.99810111222449 P0 - P7 202.464664.69146762121358 P0 - P8 129.474719.3086337119274
22
Experimental results Ping-Pong-B My-Cluster 22
23
Experimental results Ping-Pong-B Let’s compare 23
24
Was all this worth it? 24
25
Conclusions All three queues of the Community-Cluster are prone to pulses of some nature that can negatively impact high performance applications where consecutive steady and accurate time readings are needed to statistically validate a phenomena under study. The one-user cluster, referred to as My-Cluster, presents a much better behavior in terms of disturbances during communication between processes; the communication between processes is clearly more stable and steady for a considerable number of consecutive samples; as a consequence, it constitutes a more appropriate choice for time sensitive analysis. The third and most important conclusion is derived from both ping pong tests, and it is associated with the fact that only the first time that a pair of processes establish a communication, a one-time high fee to pay in terms of time will exist. Consequently, it is possible to think that some sort of communication initialization takes place during this first message, and that such situation does not occur for all subsequent communications. This finding is supported by both ping pong tests, ping-pong-A and ping-pong-B, and occurs in both clusters even though they have different operating systems, hardware architecture and communication capabilities. 25
26
26 Thank You Questions ?
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.