Common Practices for Managing Small HPC Clusters
Supercomputing 12
roger.bielefeld@cwru.edu
david@uwm.edu
Small HPC BoF @ Supercomputing 2012
Survey instrument: tinyurl.com/smallHPC
Fourth annual BoF: https://sites.google.com/site/smallhpc/
Discussing 2011 and 2012 survey results…
GPUs
How many GPU cores does this cluster have?

Answer             2011   2012
None                61%    41%
1 - 999             33%    45%
1,000 - 4,999        6%     5%
5,000 - 9,999        0%     0%
10,000 - 24,999      0%     5%
25,000 or more       0%     5%
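A site answering this question for its own cluster can script a rough tally. The sketch below is illustrative rather than part of the survey: it counts GPU devices on a node with nvidia-smi and multiplies by an assumed per-card core count (the CORES_PER_DEVICE value is hypothetical and should be replaced with the figure for the installed cards).

```python
# Minimal sketch: tally GPU devices on a node and estimate GPU-core counts.
# Assumes NVIDIA GPUs with nvidia-smi on the PATH; CORES_PER_DEVICE is an
# illustrative figure to be replaced with the value for the actual cards.
import subprocess

CORES_PER_DEVICE = 512  # hypothetical per-card core count; adjust for your hardware

def gpu_devices_on_node():
    """Return the number of GPUs nvidia-smi reports on this node (0 if none or unavailable)."""
    try:
        out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return 0
    return sum(1 for line in out.stdout.splitlines() if line.startswith("GPU "))

if __name__ == "__main__":
    devices = gpu_devices_on_node()
    print(f"GPUs on this node: {devices}")
    print(f"Estimated GPU cores on this node: {devices * CORES_PER_DEVICE}")
```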
Hardware Configuration
How many CPU cores are in this HPC cluster?

2011              %      2012              %
< 200            21%     < 200            14%
200 - 1,000      16%     200 - 999        14%
1,001 - 2,000    26%     1,000 - 4,999    36%
2,001 - 4,000    26%     5,000 - 9,999    23%
> 4,000          11%     10,000 +         14%
How much memory is there per physical server, i.e., per compute node?

2011            %      2012              %
0 - 8 GB       16%     0 - 8 GB          9%
9 - 16 GB      32%     9 - 16 GB        23%
17 - 24 GB     11%     17 - 24 GB       41%
25 - 32 GB     21%     25 - 32 GB       14%
33 - 64 GB     58%     33 - 64 GB       32%
> 64 GB        37%     65 - 128 GB      23%
Unsure          5%     129 - 256 GB     18%
                       > 256 GB         23%
                       Unsure            0%
How much memory is there per core?

Answer          2012
0 - 0.5 GB       0%
0.6 - 1 GB       5%
1.1 - 2 GB      32%
2.1 - 4 GB      50%
4.1 - 6 GB       9%
6.1 - 8 GB      23%
8.1 - 16 GB     23%
> 16 GB         18%
Unsure           0%

Long tail toward the high side.
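The metric behind this question is simple arithmetic: memory per core is the node's memory divided by its core count. A minimal sketch, using hypothetical node configurations rather than survey data:

```python
# Minimal sketch of the memory-per-core arithmetic behind this question:
# GB per core = GB per node / cores per node. The node configurations below
# are hypothetical examples, not survey data.
node_configs = [
    {"name": "example A", "mem_gb": 24,  "cores": 12},   # 2.0 GB/core
    {"name": "example B", "mem_gb": 64,  "cores": 16},   # 4.0 GB/core
    {"name": "example C", "mem_gb": 256, "cores": 16},   # 16.0 GB/core (large-memory node)
]

for cfg in node_configs:
    per_core = cfg["mem_gb"] / cfg["cores"]
    print(f'{cfg["name"]}: {cfg["mem_gb"]} GB / {cfg["cores"]} cores = {per_core:.1f} GB per core')
```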
What low-latency network is used for MPI communication among CPU cores?

Answer                         2011   2012
InfiniBand                      74%    77%
10 Gigabit Ethernet (10 GbE)    11%    14%
1 Gigabit Ethernet (1 GbE)      11%     9%
Unsure                           5%     0%

Little change.
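For sites wanting to check what their fabric actually delivers, a small ping-pong test is the usual approach. A minimal sketch, assuming mpi4py is installed and the MPI library is built against the cluster's low-latency interconnect:

```python
# Minimal sketch of a two-rank MPI ping-pong latency check, assuming mpi4py is
# installed and the underlying MPI library uses the cluster's low-latency fabric
# (e.g., InfiniBand). Run with: mpirun -np 2 python pingpong.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
iterations = 1000
msg = bytearray(8)  # small message so timing is latency-dominated

comm.Barrier()
start = MPI.Wtime()
for _ in range(iterations):
    if rank == 0:
        comm.Send(msg, dest=1, tag=0)
        comm.Recv(msg, source=1, tag=0)
    elif rank == 1:
        comm.Recv(msg, source=0, tag=0)
        comm.Send(msg, dest=0, tag=0)
elapsed = MPI.Wtime() - start

if rank == 0:
    # Each iteration is one round trip; one-way latency is half the round trip.
    print(f"Approximate one-way latency: {elapsed / iterations / 2 * 1e6:.1f} microseconds")
```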
Would you choose this form of low-latency networking again?

        2011   2012
Yes      94%    86%
No        6%    14%

Reasons given by the 14% who said No:
- InfiniBand next time
- 1 Gigabit Ethernet: it is actually a 2 Gb network, and the vendor is no longer viable for low-latency bandwidth
- 10 Gigabit Ethernet: the vendor is no longer active in this area
- Will move to 10 Gb in the near future
Scheduler Questions
What is the scheduler?

Answer                          2011   2012
PBS                              6%     9%
TORQUE                          33%    59%
SGE (Oracle/Sun Grid Engine)    28%    23%
LSF                             33%     5%
Lava                             0%     0%
Condor                           6%     0%
Maui/Moab                       39%    55%
SLURM                            0%     9%
Unsure                          ---     0%
Other                            0%     5%
Would you choose this scheduler again?

        2011   2012
Yes      94%    79%
No        6%    21%

Comments from those who said No:
- Documentation has been shaky since Oracle's acquisition of Sun; the spinoff, the various versions, and invalid documentation links made it quite challenging.
- Queue-based preemption caused problems (Oracle/SGE).
- The volume of jobs being submitted is causing load issues on the open-source scheduler; upgrading (Maui/Moab).
- SGE is no longer supported; looking to use UGE.
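Since TORQUE with Maui/Moab was the most common combination reported, here is a minimal sketch of scripting a job submission against that stack; the resource requests, queue name, and application name are illustrative assumptions rather than anything from the survey:

```python
# Minimal sketch that writes and submits a TORQUE/PBS batch script, reflecting the
# TORQUE + Maui/Moab combination most respondents reported. Resource values, the
# queue name "batch", and the script/application names are illustrative assumptions.
import subprocess
from pathlib import Path

job_script = """#!/bin/bash
#PBS -N example_job
#PBS -q batch
#PBS -l nodes=1:ppn=4
#PBS -l mem=8gb
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR
mpirun -np 4 ./my_mpi_application
"""

script_path = Path("example_job.pbs")
script_path.write_text(job_script)

# qsub prints the new job ID on success.
result = subprocess.run(["qsub", str(script_path)], capture_output=True, text=True)
print(result.stdout.strip() or result.stderr.strip())
```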
Do you allow serial jobs on this cluster?

        2011   2012
Yes      95%    95%
No        5%     5%

No change.
What is the maximum amount of memory allowed per serial job?

2011                    %      2012                    %
No maximum enforced    16%     No maximum enforced      9%
< 2 GB                 32%     2 GB or less            23%
3 - 7 GB               11%     3 - 8 GB                41%
8 - 15 GB              21%     9 - 16 GB               14%
16 - 24 GB             58%     17 - 24 GB              32%
> 24 GB                37%     More than 24 GB         23%

Decreases at the high end.
What is the maximum amount of memory allowed per multi-core (MP or MPI) job?

2011                    %      2012                    %
No maximum enforced    74%     No maximum enforced     59%
< 8 GB                  0%     8 GB or less             0%
8 - 15 GB               5%     9 - 16 GB                0%
16 - 31 GB              0%     17 - 32 GB               9%
32 - 48 GB             11%     33 - 48 GB               0%
> 48 GB                11%     More than 48 GB         32%

More maximums are enforced, but the maximums are high.
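Where a maximum is enforced, one way to do it on a TORQUE-based system is a submit filter (the SUBMITFILTER entry in torque.cfg), which receives each job script on stdin and can reject it with a nonzero exit. A minimal sketch, assuming a hypothetical 48 GB cap and only the simple mem=<N>gb form of the directive:

```python
#!/usr/bin/env python
# Minimal sketch of one way a per-job memory maximum could be enforced with TORQUE:
# a submit filter (configured via the SUBMITFILTER entry in torque.cfg) that reads
# the job script on stdin, echoes it to stdout, and exits nonzero to reject the job.
# The 48 GB cap and the simple "mem=<N>gb" parsing are illustrative assumptions;
# real directives can also use mb/tb units, pmem, or vmem, which this sketch ignores.
import re
import sys

MAX_MEM_GB = 48  # hypothetical site-wide cap

script = sys.stdin.read()
sys.stdout.write(script)  # pass the (unmodified) script through to qsub

match = re.search(r"#PBS\s+-l\s+.*\bmem=(\d+)gb", script, re.IGNORECASE)
if match and int(match.group(1)) > MAX_MEM_GB:
    sys.stderr.write(f"Rejected: mem={match.group(1)}gb exceeds the {MAX_MEM_GB} GB limit\n")
    sys.exit(1)
```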
Storage
Where do users' home directories reside?

Answer                                           2011   2012
Local disk                                        0%     0%
NFS                                              47%    55%
Parallel file system (e.g., Lustre or Panasas)   47%    41%
Unsure                                            0%     0%
Other                                             5%     5%

Little change.
Would you configure users' home directories this way again?

        2011    2012
Yes      88%    100%
No       12%      0%

People are satisfied with what they are doing.
What type of high-performance storage / scratch space?

Answer                                           2011   2012
Local disk                                       26%    27%
NFS                                              16%    27%
Parallel file system (e.g., Lustre or Panasas)  100%   100%
Unsure                                            0%     0%
Other (GPFS)                                      0%     9%

Little change.
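Where local disk serves as the high-performance scratch space, jobs typically stage data in and out of a node-local directory. A minimal sketch, assuming a /scratch file system and TORQUE-style environment variables; paths and file names are illustrative:

```python
# Minimal sketch of staging data through node-local scratch inside a job, a common
# pattern when local disk is used for high-performance scratch. The /scratch path,
# environment variable names, and file names are assumptions for illustration.
import os
import shutil
from pathlib import Path

job_id = os.environ.get("PBS_JOBID", "interactive")
scratch = Path("/scratch") / os.environ.get("USER", "unknown") / job_id
scratch.mkdir(parents=True, exist_ok=True)

work_dir = Path.cwd()
shutil.copy2(work_dir / "input.dat", scratch / "input.dat")   # stage in

# ... run the computation against scratch/input.dat, writing scratch/output.dat ...

output = scratch / "output.dat"
if output.exists():
    shutil.copy2(output, work_dir / "output.dat")             # stage out
shutil.rmtree(scratch, ignore_errors=True)                    # clean up local scratch
```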
Would you configure high-performance / scratch space this way again?

        2011   2012
Yes     100%    95%
No        0%     5%

Essentially no change.
Do you have an online, medium-performance data storage service?

          2011   2012
Yes        28%    55%
No         72%    45%
Unsure      0%     0%
Which of the following storage environments on this cluster do you back up?

Answer                                2011   2012
Home directories                       53%    68%
High-performance / scratch space        5%     9%
Medium-performance, online storage     11%    45%
None                                   47%    27%
Unsure                                  0%     0%
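For the sites that back up home directories, a simple rsync mirror is a common minimal arrangement. A sketch, with the source path, destination path, and backup host all being assumptions for illustration:

```python
# Minimal sketch of a home-directory backup with rsync, the kind of simple
# arrangement a small site might use for the "home directories" answer above.
# The source path, destination path, and backup host are assumptions.
import subprocess

src = "/home/"                      # trailing slash: copy the contents of /home
dest = "backuphost:/backups/home/"  # mirror target (not versioned snapshots)

# -a preserves permissions/times/ownership; --delete mirrors removals on the destination.
result = subprocess.run(["rsync", "-a", "--delete", src, dest])
print("backup ok" if result.returncode == 0 else f"rsync exited with {result.returncode}")
```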
Current Directions
If you were buying new compute nodes today, how many cores per node?

2011        %      2012        %
4            0%    4            0%
8            5%    8            9%
12          21%    12           5%
16          32%    16          50%
> 16        16%    24           0%
Unsure      26%    32          14%
                   Unsure      14%
                   Other        9%
If you were buying new compute nodes today, how much memory per node?

2011           %      2012                %
0 - 8 GB        0%    0 - 8 GB             9%
9 - 16 GB       0%    9 - 16 GB            9%
17 - 24 GB      0%    17 - 24 GB           0%
25 - 48 GB     41%    25 - 48 GB          23%
> 48 GB        35%    49 - 64 GB           9%
Unsure         24%    More than 64 GB     41%
                      Unsure               9%
Staffing
How many different individuals, excluding students, are involved in operation, support, and development?

Answer                      2012
1 individual                  5%
2 - 3 individuals            41%
4 - 5 individuals            32%
6 - 8 individuals            18%
9 - 10 individuals            5%
11 - 15 individuals           0%
More than 15 individuals      0%
Approximately how many FTE, incl. students, operate the cluster to maintain the status quo (excluding user support)?

Answer              2012
< 1 FTE             27%
1.1 - 2 FTE         41%
2.1 - 4 FTE         23%
4.1 - 6 FTE          9%
6.1 - 8 FTE          0%
More than 8 FTE      0%
Approximately how many FTE, incl. students, support users of the cluster?

Answer              2012
< 1 FTE             32%
1.1 - 2 FTE         27%
2.1 - 4 FTE         27%
4.1 - 6 FTE          9%
6.1 - 8 FTE          5%
More than 8 FTE      0%

Generally fewer FTE than for operations support.
Approximately how many FTE, incl. students, are involved in hardware/software development efforts related to the cluster?

Answer              2012
< 1 FTE             48%
1.1 - 2 FTE         33%
2.1 - 4 FTE         14%
4.1 - 6 FTE          5%
6.1 - 8 FTE          0%
More than 8 FTE      0%

Even fewer FTE than for user support.
Inward-facing staff versus outward-facing staff?

Answer                            2012
There is a clear separation.        9%
There is some separation.          41%
There is almost no separation.     50%

Most staff do both operational and end-user support.
Small HPC BoFs: Contact Information
Survey: tinyurl.com/smallHPC
Website: https://sites.google.com/site/smallhpc/
Email list: see link at the above website
Roger Bielefeld: roger.bielefeld@cwru.edu
David Stack: david@uwm.edu